bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets
@ 2020-02-18 17:10 Jakub Sitnicki
  2020-02-18 17:10 ` [PATCH bpf-next v7 01/11] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
                   ` (11 more replies)
  0 siblings, 12 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

This patch set turns SOCK{MAP,HASH} into generic collections for TCP
sockets, both listening and established. Adding support for listening
sockets enables us to use these BPF map types with reuseport BPF programs.

Why? SOCKMAP and SOCKHASH, in comparison to REUSEPORT_SOCKARRAY, allow the
socket to be in more than one map at the same time.
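
To make the difference concrete, here is a minimal user-space sketch
(assumptions: sk_fd is a TCP socket fd, libbpf is used, error handling
omitted):

  __u32 key = 0, value = sk_fd;
  int map1, map2;

  map1 = bpf_create_map(BPF_MAP_TYPE_SOCKMAP, sizeof(key),
                        sizeof(value), 1, 0);
  map2 = bpf_create_map(BPF_MAP_TYPE_SOCKHASH, sizeof(key),
                        sizeof(value), 1, 0);

  /* The same socket can sit in both maps at once, which
   * REUSEPORT_SOCKARRAY does not allow.
   */
  bpf_map_update_elem(map1, &key, &value, BPF_ANY);
  bpf_map_update_elem(map2, &key, &value, BPF_ANY);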

Having a BPF map type that can hold listening sockets and gracefully
co-exist with reuseport BPF is important if, in the future, we want BPF
programs that run at socket lookup time [0]. The cover letter for v1 of
this series tells the full story of how we got here [1].

Although SOCK{MAP,HASH} are not a drop-in replacement for SOCKARRAY just
yet, because UDP support is lacking, this is a step in that direction. We're
working with Lorenz on extending SOCK{MAP,HASH} to hold UDP sockets, and
expect to post an RFC series for sockmap + UDP in the near future.

I've dropped Acks from all patches that have been touched since v6.

The audit for missing READ_ONCE annotations for accesses to sk_prot is
ongoing. Thus far I've found one location specific to TCP listening sockets
that needed annotating. This got fixed in this iteration. I wonder if the
sparse checker could be put to work to identify places where we access
sk_prot while not holding sk_lock...
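
(One conceivable approach, sketched here purely as an idea and not part of
this series: annotate a helper with sparse's context-tracking attributes,

  /* sk_prot_assign is a made-up name for illustration */
  static void sk_prot_assign(struct sock *sk, struct proto *ops)
	__must_hold(&sk->sk_lock.slock)
  {
	sk->sk_prot = ops;
  }

and let sparse flag a context imbalance wherever the helper is reached
without the lock held.)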

The patch series depends on another one, posted earlier [2], that has been
split out of it.

Thanks,
jkbs

[0] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
[1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/
[2] https://lore.kernel.org/bpf/20200217121530.754315-1-jakub@cloudflare.com/

v6 -> v7:

- Extended the series to cover SOCKHASH. (patches 4-8, 10-11) (John)

- Rebased onto recent bpf-next. Resolved conflicts in recent fixes to
  sk_state checks on sockmap/sockhash update path. (patch 4)

- Added missing READ_ONCE annotation in sock_copy. (patch 1)

- Split out patches that simplify sk_psock_restore_proto [2].

v5 -> v6:

- Added a fix-up for patch 1 which I forgot to commit in v5. Sigh.

v4 -> v5:

- Rebase onto recent bpf-next to resolve conflicts. (Daniel)

v3 -> v4:

- Make tcp_bpf_clone parameter names consistent across function declaration
  and definition. (Martin)

- Use sock_map_redirect_okay helper everywhere we need to take a different
  action for listening sockets. (Lorenz)

- Expand comment explaining the need for a callback from reuseport to
  sockarray code in reuseport_detach_sock. (Martin)

- Mention the possibility of using a u64 counter for reuseport IDs in the
  future in the description for patch 10. (Martin)

v2 -> v3:

- Generate reuseport ID when group is created. Please see patch 10
  description for details. (Martin)

- Fix the build when CONFIG_NET_SOCK_MSG is not selected by either
  CONFIG_BPF_STREAM_PARSER or CONFIG_TLS. (kbuild bot & John)

- Allow updating sockmap from BPF on BPF_SOCK_OPS_TCP_LISTEN_CB callback. An
  oversight in previous iterations. Users may want to populate the sockmap with
  listening sockets from BPF as well.

- Removed RCU read lock assertion in sock_map_lookup_sys. (Martin)

- Get rid of a warning when child socket was cloned with parent's psock
  state. (John)

- Check for tcp_bpf_unhash rather than tcp_bpf_recvmsg when deciding if
  sk_proto needs restoring on clone. Check for recvmsg in the context of
  listening socket cloning was confusing. (Martin)

- Consolidate sock_map_sk_is_suitable with sock_map_update_okay. This led
  to adding dedicated predicates for sockhash. Update self-tests
  accordingly. (John)

- Annotate unlikely branch in bpf_{sk,msg}_redirect_map when socket isn't
  in a map, or isn't a valid redirect target. (John)

- Document paired READ/WRITE_ONCE annotations and cover shared access in
  more detail in patch 2 description. (John)

- Correct a couple of log messages in sockmap_listen self-tests so the
  message reflects the actual failure.

- Rework reuseport tests from sockmap_listen suite so that ENOENT error
  from bpf_sk_select_reuseport handler does not happen on happy path.

v1 -> v2:

- af_ops->syn_recv_sock callback is no longer overridden and burdened with
  restoring sk_prot and clearing sk_user_data in the child socket. As the
  child socket is already hashed when syn_recv_sock returns, it is too late
  to put it in the right state. Instead patches 3 & 4 address restoring
  sk_prot and clearing sk_user_data before we hash the child socket.
  (Pointed out by Martin Lau)

- Annotate shared access to sk->sk_prot with READ_ONCE/WRITE_ONCE macros as
  we write to it from sk_msg while the socket might be getting cloned on
  another CPU. (Suggested by John Fastabend)

- Convert tests for SOCKMAP holding listening sockets to return-on-error
  style, and hook them up to test_progs. Also use BPF skeleton for setup.
  Add new tests to cover the race scenario discovered during v1 review.

RFC -> v1:

- Switch from overriding proto->accept to af_ops->syn_recv_sock, which
  happens earlier. Clearing the psock state after accept() does not work
  for child sockets that become orphaned (never got accepted). v4-mapped
  sockets need special care.

- Return the socket cookie on SOCKMAP lookup from syscall to be on par with
  REUSEPORT_SOCKARRAY. Requires SOCKMAP to take u64 on lookup/update from
  syscall.

- Make bpf_sk_redirect_map (ingress) and bpf_msg_redirect_map (egress)
  SOCKMAP helpers fail when target socket is a listening one.

- Make bpf_sk_select_reuseport helper fail when target is a TCP established
  socket.

- Teach libbpf to recognize SK_REUSEPORT program type from section name.

- Add a dedicated set of tests for SOCKMAP holding listening sockets,
  covering map operations, overridden socket callbacks, and BPF helpers.


Jakub Sitnicki (11):
  net, sk_msg: Annotate lockless access to sk_prot on clone
  net, sk_msg: Clear sk_user_data pointer on clone if tagged
  tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  bpf, sockmap: Allow inserting listening TCP sockets into sockmap
  bpf, sockmap: Don't set up upcalls and progs for listening sockets
  bpf, sockmap: Return socket cookie on lookup from syscall
  bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH
  bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH
  net: Generate reuseport group ID on group creation
  selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH
  selftests/bpf: Tests for sockmap/sockhash holding listening sockets

 include/linux/skmsg.h                         |    3 +-
 include/net/sock.h                            |   37 +-
 include/net/sock_reuseport.h                  |    2 -
 include/net/tcp.h                             |    7 +
 kernel/bpf/reuseport_array.c                  |    5 -
 kernel/bpf/verifier.c                         |   10 +-
 net/core/filter.c                             |   27 +-
 net/core/skmsg.c                              |    2 +-
 net/core/sock.c                               |   14 +-
 net/core/sock_map.c                           |  167 +-
 net/core/sock_reuseport.c                     |   50 +-
 net/ipv4/tcp_bpf.c                            |   18 +-
 net/ipv4/tcp_minisocks.c                      |    2 +
 net/ipv4/tcp_ulp.c                            |    3 +-
 net/tls/tls_main.c                            |    3 +-
 .../bpf/prog_tests/select_reuseport.c         |   63 +-
 .../selftests/bpf/prog_tests/sockmap_listen.c | 1496 +++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |   98 ++
 tools/testing/selftests/bpf/test_maps.c       |    6 +-
 19 files changed, 1910 insertions(+), 103 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen.c

-- 
2.24.1



* [PATCH bpf-next v7 01/11] net, sk_msg: Annotate lockless access to sk_prot on clone
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-18 17:10 ` [PATCH bpf-next v7 02/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

sk_msg and ULP frameworks override the protocol callbacks pointer in
sk->sk_prot, while TCP accesses it locklessly when cloning the listening
socket, that is, with neither sk_lock nor sk_callback_lock held.

Once we enable use of listening sockets with sockmap (and hence sk_msg),
there will be shared access to sk->sk_prot if the socket is getting cloned
while being inserted into or deleted from the sockmap on another CPU:

Read side:

tcp_v4_rcv
  sk = __inet_lookup_skb(...)
  tcp_check_req(sk)
    inet_csk(sk)->icsk_af_ops->syn_recv_sock
      tcp_v4_syn_recv_sock
        tcp_create_openreq_child
          inet_csk_clone_lock
            sk_clone_lock
              READ_ONCE(sk->sk_prot)

Write side:

sock_map_ops->map_update_elem
  sock_map_update_elem
    sock_map_update_common
      sock_map_link_no_progs
        tcp_bpf_init
          tcp_bpf_update_sk_prot
            sk_psock_update_proto
              WRITE_ONCE(sk->sk_prot, ops)

sock_map_ops->map_delete_elem
  sock_map_delete_elem
    __sock_map_delete
      sock_map_unref
        sk_psock_put
          sk_psock_drop
            sk_psock_restore_proto
              tcp_update_ulp
                WRITE_ONCE(sk->sk_prot, proto)

Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
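
The pattern, distilled (a simplified sketch; the real changes are in the
diff below):

  /* writer, e.g. sockmap update/delete path */
  WRITE_ONCE(sk->sk_prot, ops);

  /* lockless reader in sk_clone_lock() */
  struct proto *prot = READ_ONCE(sk->sk_prot);

  newsk = sk_prot_alloc(prot, priority, sk->sk_family);

Reading the pointer once into a local also makes sk_prot_alloc() and the
sk_prot_creator assignment use the same proto, even if a writer races with
the clone.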

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/skmsg.h | 3 ++-
 net/core/sock.c       | 8 +++++---
 net/ipv4/tcp_bpf.c    | 4 +++-
 net/ipv4/tcp_ulp.c    | 3 ++-
 net/tls/tls_main.c    | 3 ++-
 5 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index d90ef61712a1..112765bd146d 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -352,7 +352,8 @@ static inline void sk_psock_update_proto(struct sock *sk,
 	psock->saved_write_space = sk->sk_write_space;
 
 	psock->sk_proto = sk->sk_prot;
-	sk->sk_prot = ops;
+	/* Pairs with lockless read in sk_clone_lock() */
+	WRITE_ONCE(sk->sk_prot, ops);
 }
 
 static inline void sk_psock_restore_proto(struct sock *sk,
diff --git a/net/core/sock.c b/net/core/sock.c
index a4c8fac781ff..bf1173b93eda 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1572,13 +1572,14 @@ static inline void sock_lock_init(struct sock *sk)
  */
 static void sock_copy(struct sock *nsk, const struct sock *osk)
 {
+	const struct proto *prot = READ_ONCE(osk->sk_prot);
 #ifdef CONFIG_SECURITY_NETWORK
 	void *sptr = nsk->sk_security;
 #endif
 	memcpy(nsk, osk, offsetof(struct sock, sk_dontcopy_begin));
 
 	memcpy(&nsk->sk_dontcopy_end, &osk->sk_dontcopy_end,
-	       osk->sk_prot->obj_size - offsetof(struct sock, sk_dontcopy_end));
+	       prot->obj_size - offsetof(struct sock, sk_dontcopy_end));
 
 #ifdef CONFIG_SECURITY_NETWORK
 	nsk->sk_security = sptr;
@@ -1792,16 +1793,17 @@ static void sk_init_common(struct sock *sk)
  */
 struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 {
+	struct proto *prot = READ_ONCE(sk->sk_prot);
 	struct sock *newsk;
 	bool is_charged = true;
 
-	newsk = sk_prot_alloc(sk->sk_prot, priority, sk->sk_family);
+	newsk = sk_prot_alloc(prot, priority, sk->sk_family);
 	if (newsk != NULL) {
 		struct sk_filter *filter;
 
 		sock_copy(newsk, sk);
 
-		newsk->sk_prot_creator = sk->sk_prot;
+		newsk->sk_prot_creator = prot;
 
 		/* SANITY */
 		if (likely(newsk->sk_net_refcnt))
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 8a01428f80c1..dd183b050642 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -645,8 +645,10 @@ static void tcp_bpf_reinit_sk_prot(struct sock *sk, struct sk_psock *psock)
 	/* Reinit occurs when program types change e.g. TCP_BPF_TX is removed
 	 * or added requiring sk_prot hook updates. We keep original saved
 	 * hooks in this case.
+	 *
+	 * Pairs with lockless read in sk_clone_lock().
 	 */
-	sk->sk_prot = &tcp_bpf_prots[family][config];
+	WRITE_ONCE(sk->sk_prot, &tcp_bpf_prots[family][config]);
 }
 
 static int tcp_bpf_assert_proto_ops(struct proto *ops)
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 38d3ad141161..6c43fa189195 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -106,7 +106,8 @@ void tcp_update_ulp(struct sock *sk, struct proto *proto,
 
 	if (!icsk->icsk_ulp_ops) {
 		sk->sk_write_space = write_space;
-		sk->sk_prot = proto;
+		/* Pairs with lockless read in sk_clone_lock() */
+		WRITE_ONCE(sk->sk_prot, proto);
 		return;
 	}
 
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 94774c0e5ff3..82225bcc1117 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -742,7 +742,8 @@ static void tls_update(struct sock *sk, struct proto *p,
 		ctx->sk_write_space = write_space;
 		ctx->sk_proto = p;
 	} else {
-		sk->sk_prot = p;
+		/* Pairs with lockless read in sk_clone_lock(). */
+		WRITE_ONCE(sk->sk_prot, p);
 		sk->sk_write_space = write_space;
 	}
 }
-- 
2.24.1



* [PATCH bpf-next v7 02/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
  2020-02-18 17:10 ` [PATCH bpf-next v7 01/11] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-18 17:10 ` [PATCH bpf-next v7 03/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at a reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, we don't hold any locks at that time.

This approach relies on the assumption that sk_user_data holds a pointer to
an object aligned to at least 2 bytes. A manual audit of existing users of
the rcu_dereference_sk_user_data helper confirms this assumption.

Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
whenever the flag bit is already set in a pointer being assigned to
sk_user_data, to catch any future misuse.
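
The tagging scheme, in essence (a simplified sketch; the actual macros are
in the diff below):

  #define SK_USER_DATA_NOCOPY  1UL
  #define SK_USER_DATA_PTRMASK ~(SK_USER_DATA_NOCOPY)

  /* store: set the low bit to mark the pointer as no-copy */
  sk->sk_user_data = (void *)((uintptr_t)psock | SK_USER_DATA_NOCOPY);

  /* load: mask the flag bit off before dereferencing */
  psock = (void *)((uintptr_t)sk->sk_user_data & SK_USER_DATA_PTRMASK);

Because a pointer to an object with 2-byte (or greater) alignment always
has bit 0 clear, that bit is free to carry the flag.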

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/sock.h | 37 +++++++++++++++++++++++++++++++++++--
 net/core/skmsg.c   |  2 +-
 net/core/sock.c    |  6 ++++++
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 02162b0378f7..9f37fdfd15d4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -502,10 +502,43 @@ enum sk_pacing {
 	SK_PACING_FQ		= 2,
 };
 
+/* Pointer stored in sk_user_data might not be suitable for copying
+ * when cloning the socket. For instance, it can point to a reference
+ * counted object. sk_user_data bottom bit is set if pointer must not
+ * be copied.
+ */
+#define SK_USER_DATA_NOCOPY	1UL
+#define SK_USER_DATA_PTRMASK	~(SK_USER_DATA_NOCOPY)
+
+/**
+ * sk_user_data_is_nocopy - Test if sk_user_data pointer must not be copied
+ * @sk: socket
+ */
+static inline bool sk_user_data_is_nocopy(const struct sock *sk)
+{
+	return ((uintptr_t)sk->sk_user_data & SK_USER_DATA_NOCOPY);
+}
+
 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
 
-#define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
-#define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
+#define rcu_dereference_sk_user_data(sk)				\
+({									\
+	void *__tmp = rcu_dereference(__sk_user_data((sk)));		\
+	(void *)((uintptr_t)__tmp & SK_USER_DATA_PTRMASK);		\
+})
+#define rcu_assign_sk_user_data(sk, ptr)				\
+({									\
+	uintptr_t __tmp = (uintptr_t)(ptr);				\
+	WARN_ON_ONCE(__tmp & ~SK_USER_DATA_PTRMASK);			\
+	rcu_assign_pointer(__sk_user_data((sk)), __tmp);		\
+})
+#define rcu_assign_sk_user_data_nocopy(sk, ptr)				\
+({									\
+	uintptr_t __tmp = (uintptr_t)(ptr);				\
+	WARN_ON_ONCE(__tmp & ~SK_USER_DATA_PTRMASK);			\
+	rcu_assign_pointer(__sk_user_data((sk)),			\
+			   __tmp | SK_USER_DATA_NOCOPY);		\
+})
 
 /*
  * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index ded2d5227678..eeb28cb85664 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -512,7 +512,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
 	refcount_set(&psock->refcnt, 1);
 
-	rcu_assign_sk_user_data(sk, psock);
+	rcu_assign_sk_user_data_nocopy(sk, psock);
 	sock_hold(sk);
 
 	return psock;
diff --git a/net/core/sock.c b/net/core/sock.c
index bf1173b93eda..e4af4dbc1c9e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1865,6 +1865,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 			goto out;
 		}
 
+		/* Clear sk_user_data if parent had the pointer tagged
+		 * as not suitable for copying when cloning.
+		 */
+		if (sk_user_data_is_nocopy(newsk))
+			RCU_INIT_POINTER(newsk->sk_user_data, NULL);
+
 		newsk->sk_err	   = 0;
 		newsk->sk_err_soft = 0;
 		newsk->sk_priority = 0;
-- 
2.24.1



* [PATCH bpf-next v7 03/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
  2020-02-18 17:10 ` [PATCH bpf-next v7 01/11] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
  2020-02-18 17:10 ` [PATCH bpf-next v7 02/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-21  3:28   ` John Fastabend
  2020-02-18 17:10 ` [PATCH bpf-next v7 04/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Prepare for cloning listening sockets that have their protocol callbacks
overridden by sk_msg. Child sockets must not inherit parent callbacks that
access state stored in sk_user_data owned by the parent.

Restore the child socket's protocol callbacks before it gets hashed and
before any of the callbacks can be invoked.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/tcp.h        |  7 +++++++
 net/ipv4/tcp_bpf.c       | 14 ++++++++++++++
 net/ipv4/tcp_minisocks.c |  2 ++
 3 files changed, 23 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a5ea27df3c2b..07f947cc80e6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2203,6 +2203,13 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		    int nonblock, int flags, int *addr_len);
 int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		      struct msghdr *msg, int len, int flags);
+#ifdef CONFIG_NET_SOCK_MSG
+void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
+#else
+static inline void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
+{
+}
+#endif
 
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index dd183b050642..7d6e1b75d4d4 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -693,3 +693,17 @@ int tcp_bpf_init(struct sock *sk)
 	rcu_read_unlock();
 	return 0;
 }
+
+/* If a child got cloned from a listening socket that had tcp_bpf
+ * protocol callbacks installed, we need to restore the callbacks to
+ * the default ones because the child does not inherit the psock state
+ * that tcp_bpf callbacks expect.
+ */
+void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
+{
+	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
+	struct proto *prot = newsk->sk_prot;
+
+	if (prot == &tcp_bpf_prots[family][TCP_BPF_BASE])
+		newsk->sk_prot = sk->sk_prot_creator;
+}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index ad3b56d9fa71..c8274371c3d0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -548,6 +548,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->fastopen_req = NULL;
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
 
+	tcp_bpf_clone(sk, newsk);
+
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
 
 	return newsk;
-- 
2.24.1



* [PATCH bpf-next v7 04/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (2 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 03/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-21  3:33   ` John Fastabend
  2020-02-18 17:10 ` [PATCH bpf-next v7 05/11] bpf, sockmap: Don't set up upcalls and progs for listening sockets Jakub Sitnicki
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

In order for sockmap/sockhash types to become generic collections for
storing TCP sockets we need to loosen the checks during map update, while
tightening the checks in redirect helpers.

Currently sock{map,hash} require the TCP socket to be in the established
state, which prevents inserting listening sockets.

Change the update pre-checks so that the socket can also be in the
listening state.

Since it doesn't make sense to redirect with sock{map,hash} to listening
sockets, add appropriate socket state checks to BPF redirect helpers too.
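
For instance, after this change the following user-space sequence is
expected to succeed, where it previously failed with EOPNOTSUPP (a sketch;
assumes sock_map_fd is a SOCKMAP created earlier and addr is filled in;
error handling omitted):

  int s = socket(AF_INET, SOCK_STREAM, 0);
  __u32 key = 0, value = s;

  bind(s, (struct sockaddr *)&addr, sizeof(addr));
  listen(s, SOMAXCONN);
  /* TCP_LISTEN now passes sock_map_sk_state_allowed() */
  bpf_map_update_elem(sock_map_fd, &key, &value, BPF_ANY);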

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c                     | 59 ++++++++++++++++++-------
 tools/testing/selftests/bpf/test_maps.c |  6 +--
 2 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3a7a96ab088a..dd92a3556d73 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -391,7 +391,8 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 static bool sock_map_op_okay(const struct bpf_sock_ops_kern *ops)
 {
 	return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
-	       ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+	       ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||
+	       ops->op == BPF_SOCK_OPS_TCP_LISTEN_CB;
 }
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
@@ -400,6 +401,16 @@ static bool sock_map_sk_is_suitable(const struct sock *sk)
 	       sk->sk_protocol == IPPROTO_TCP;
 }
 
+static bool sock_map_sk_state_allowed(const struct sock *sk)
+{
+	return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
+}
+
+static bool sock_map_redirect_allowed(const struct sock *sk)
+{
+	return sk->sk_state != TCP_LISTEN;
+}
+
 static int sock_map_update_elem(struct bpf_map *map, void *key,
 				void *value, u64 flags)
 {
@@ -423,7 +434,7 @@ static int sock_map_update_elem(struct bpf_map *map, void *key,
 	}
 
 	sock_map_sk_acquire(sk);
-	if (sk->sk_state != TCP_ESTABLISHED)
+	if (!sock_map_sk_state_allowed(sk))
 		ret = -EOPNOTSUPP;
 	else
 		ret = sock_map_update_common(map, idx, sk, flags);
@@ -460,13 +471,17 @@ BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+	struct sock *sk;
 
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
-	tcb->bpf.flags = flags;
-	tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key);
-	if (!tcb->bpf.sk_redir)
+
+	sk = __sock_map_lookup_elem(map, key);
+	if (unlikely(!sk || !sock_map_redirect_allowed(sk)))
 		return SK_DROP;
+
+	tcb->bpf.flags = flags;
+	tcb->bpf.sk_redir = sk;
 	return SK_PASS;
 }
 
@@ -483,12 +498,17 @@ const struct bpf_func_proto bpf_sk_redirect_map_proto = {
 BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
+	struct sock *sk;
+
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
-	msg->flags = flags;
-	msg->sk_redir = __sock_map_lookup_elem(map, key);
-	if (!msg->sk_redir)
+
+	sk = __sock_map_lookup_elem(map, key);
+	if (unlikely(!sk || !sock_map_redirect_allowed(sk)))
 		return SK_DROP;
+
+	msg->flags = flags;
+	msg->sk_redir = sk;
 	return SK_PASS;
 }
 
@@ -748,7 +768,7 @@ static int sock_hash_update_elem(struct bpf_map *map, void *key,
 	}
 
 	sock_map_sk_acquire(sk);
-	if (sk->sk_state != TCP_ESTABLISHED)
+	if (!sock_map_sk_state_allowed(sk))
 		ret = -EOPNOTSUPP;
 	else
 		ret = sock_hash_update_common(map, key, sk, flags);
@@ -916,13 +936,17 @@ BPF_CALL_4(bpf_sk_redirect_hash, struct sk_buff *, skb,
 	   struct bpf_map *, map, void *, key, u64, flags)
 {
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+	struct sock *sk;
 
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
-	tcb->bpf.flags = flags;
-	tcb->bpf.sk_redir = __sock_hash_lookup_elem(map, key);
-	if (!tcb->bpf.sk_redir)
+
+	sk = __sock_hash_lookup_elem(map, key);
+	if (unlikely(!sk || !sock_map_redirect_allowed(sk)))
 		return SK_DROP;
+
+	tcb->bpf.flags = flags;
+	tcb->bpf.sk_redir = sk;
 	return SK_PASS;
 }
 
@@ -939,12 +963,17 @@ const struct bpf_func_proto bpf_sk_redirect_hash_proto = {
 BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg *, msg,
 	   struct bpf_map *, map, void *, key, u64, flags)
 {
+	struct sock *sk;
+
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
-	msg->flags = flags;
-	msg->sk_redir = __sock_hash_lookup_elem(map, key);
-	if (!msg->sk_redir)
+
+	sk = __sock_hash_lookup_elem(map, key);
+	if (unlikely(!sk || !sock_map_redirect_allowed(sk)))
 		return SK_DROP;
+
+	msg->flags = flags;
+	msg->sk_redir = sk;
 	return SK_PASS;
 }
 
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 02eae1e864c2..c6766b2cff85 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -756,11 +756,7 @@ static void test_sockmap(unsigned int tasks, void *data)
 	/* Test update without programs */
 	for (i = 0; i < 6; i++) {
 		err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
-		if (i < 2 && !err) {
-			printf("Allowed update sockmap '%i:%i' not in ESTABLISHED\n",
-			       i, sfd[i]);
-			goto out_sockmap;
-		} else if (i >= 2 && err) {
+		if (err) {
 			printf("Failed noprog update sockmap '%i:%i'\n",
 			       i, sfd[i]);
 			goto out_sockmap;
-- 
2.24.1



* [PATCH bpf-next v7 05/11] bpf, sockmap: Don't set up upcalls and progs for listening sockets
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (3 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 04/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-21  3:42   ` John Fastabend
  2020-02-18 17:10 ` [PATCH bpf-next v7 06/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Now that sockmap/sockhash can hold listening sockets, when setting up the
psock we will (1) grab references to verdict/parser progs, and (2) override
the socket upcalls sk_data_ready and sk_write_space.

However, since we cannot redirect to listening sockets, we don't need to
link the socket to the BPF progs. More importantly, we don't want the
listening socket to have overridden upcalls because they would get
inherited by child sockets cloned from it.

Introduce a separate initialization path for listening sockets that does
not change the upcalls and ignores the BPF progs.
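
To see why inherited upcalls would be a problem, recall that sk_clone_lock()
copies the parent socket wholesale, function pointers included. A rough
sketch of the hazard, with a hypothetical callback name:

  /* listener had its upcall overridden when psock was set up */
  listener->sk_data_ready = psock_data_ready;

  child = sk_clone_lock(listener, prio);
  /* child->sk_data_ready == psock_data_ready as well, but the
   * child has no psock; the first incoming data would invoke a
   * callback whose state belongs to the parent.
   */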

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 52 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 7 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index dd92a3556d73..a5103112a344 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -228,6 +228,30 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 	return ret;
 }
 
+static int sock_map_link_no_progs(struct bpf_map *map, struct sock *sk)
+{
+	struct sk_psock *psock;
+	int ret;
+
+	psock = sk_psock_get_checked(sk);
+	if (IS_ERR(psock))
+		return PTR_ERR(psock);
+
+	if (psock) {
+		tcp_bpf_reinit(sk);
+		return 0;
+	}
+
+	psock = sk_psock_init(sk, map->numa_node);
+	if (!psock)
+		return -ENOMEM;
+
+	ret = tcp_bpf_init(sk);
+	if (ret < 0)
+		sk_psock_put(sk, psock);
+	return ret;
+}
+
 static void sock_map_free(struct bpf_map *map)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
@@ -334,6 +358,11 @@ static int sock_map_get_next_key(struct bpf_map *map, void *key, void *next)
 	return 0;
 }
 
+static bool sock_map_redirect_allowed(const struct sock *sk)
+{
+	return sk->sk_state != TCP_LISTEN;
+}
+
 static int sock_map_update_common(struct bpf_map *map, u32 idx,
 				  struct sock *sk, u64 flags)
 {
@@ -356,7 +385,14 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 	if (!link)
 		return -ENOMEM;
 
-	ret = sock_map_link(map, &stab->progs, sk);
+	/* Only sockets we can redirect into/from in BPF need to hold
+	 * refs to parser/verdict progs and have their sk_data_ready
+	 * and sk_write_space callbacks overridden.
+	 */
+	if (sock_map_redirect_allowed(sk))
+		ret = sock_map_link(map, &stab->progs, sk);
+	else
+		ret = sock_map_link_no_progs(map, sk);
 	if (ret < 0)
 		goto out_free;
 
@@ -406,11 +442,6 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 	return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
 }
 
-static bool sock_map_redirect_allowed(const struct sock *sk)
-{
-	return sk->sk_state != TCP_LISTEN;
-}
-
 static int sock_map_update_elem(struct bpf_map *map, void *key,
 				void *value, u64 flags)
 {
@@ -700,7 +731,14 @@ static int sock_hash_update_common(struct bpf_map *map, void *key,
 	if (!link)
 		return -ENOMEM;
 
-	ret = sock_map_link(map, &htab->progs, sk);
+	/* Only sockets we can redirect into/from in BPF need to hold
+	 * refs to parser/verdict progs and have their sk_data_ready
+	 * and sk_write_space callbacks overridden.
+	 */
+	if (sock_map_redirect_allowed(sk))
+		ret = sock_map_link(map, &htab->progs, sk);
+	else
+		ret = sock_map_link_no_progs(map, sk);
 	if (ret < 0)
 		goto out_free;
 
-- 
2.24.1



* [PATCH bpf-next v7 06/11] bpf, sockmap: Return socket cookie on lookup from syscall
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (4 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 05/11] bpf, sockmap: Don't set up upcalls and progs for listening sockets Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-21  3:45   ` John Fastabend
  2020-02-18 17:10 ` [PATCH bpf-next v7 07/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH Jakub Sitnicki
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Tooling that populates the SOCK{MAP,HASH} maps with sockets from user-space
needs a way to inspect their contents. Returning the struct sock * that the
map holds to user-space is neither safe nor useful. An approach established
by REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
instead.

Since socket cookies are u64 values, SOCK{MAP,HASH} need to support such a
value size for lookup to be possible. This requires special handling on
update, though. Attempts to do a lookup on a map holding u32 values will be
met with an ENOSPC error.
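
From the user-space side, the lookup might look like this (a sketch;
sock_map_fd stands for the map fd, and the map is assumed to have been
created with value_size == sizeof(__u64)):

  __u32 key = 0;
  __u64 cookie;

  /* returns the same cookie that getsockopt(SO_COOKIE) would */
  if (!bpf_map_lookup_elem(sock_map_fd, &key, &cookie))
          printf("socket cookie: %llu\n", cookie);

A lookup on a map created with 4-byte values fails with ENOSPC instead,
signalling that the value buffer is too small to hold a cookie.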

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 57 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index a5103112a344..f48c934d5da0 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -10,6 +10,7 @@
 #include <linux/skmsg.h>
 #include <linux/list.h>
 #include <linux/jhash.h>
+#include <linux/sock_diag.h>
 
 struct bpf_stab {
 	struct bpf_map map;
@@ -31,7 +32,8 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-EPERM);
 	if (attr->max_entries == 0 ||
 	    attr->key_size    != 4 ||
-	    attr->value_size  != 4 ||
+	    (attr->value_size != sizeof(u32) &&
+	     attr->value_size != sizeof(u64)) ||
 	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
@@ -302,6 +304,21 @@ static void *sock_map_lookup(struct bpf_map *map, void *key)
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
+{
+	struct sock *sk;
+
+	if (map->value_size != sizeof(u64))
+		return ERR_PTR(-ENOSPC);
+
+	sk = __sock_map_lookup_elem(map, *(u32 *)key);
+	if (!sk)
+		return ERR_PTR(-ENOENT);
+
+	sock_gen_cookie(sk);
+	return &sk->sk_cookie;
+}
+
 static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 			     struct sock **psk)
 {
@@ -445,11 +462,18 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 static int sock_map_update_elem(struct bpf_map *map, void *key,
 				void *value, u64 flags)
 {
-	u32 ufd = *(u32 *)value;
 	u32 idx = *(u32 *)key;
 	struct socket *sock;
 	struct sock *sk;
 	int ret;
+	u64 ufd;
+
+	if (map->value_size == sizeof(u64))
+		ufd = *(u64 *)value;
+	else
+		ufd = *(u32 *)value;
+	if (ufd > S32_MAX)
+		return -EINVAL;
 
 	sock = sockfd_lookup(ufd, &ret);
 	if (!sock)
@@ -557,6 +581,7 @@ const struct bpf_map_ops sock_map_ops = {
 	.map_alloc		= sock_map_alloc,
 	.map_free		= sock_map_free,
 	.map_get_next_key	= sock_map_get_next_key,
+	.map_lookup_elem_sys_only = sock_map_lookup_sys,
 	.map_update_elem	= sock_map_update_elem,
 	.map_delete_elem	= sock_map_delete_elem,
 	.map_lookup_elem	= sock_map_lookup,
@@ -787,10 +812,17 @@ static int sock_hash_update_common(struct bpf_map *map, void *key,
 static int sock_hash_update_elem(struct bpf_map *map, void *key,
 				 void *value, u64 flags)
 {
-	u32 ufd = *(u32 *)value;
 	struct socket *sock;
 	struct sock *sk;
 	int ret;
+	u64 ufd;
+
+	if (map->value_size == sizeof(u64))
+		ufd = *(u64 *)value;
+	else
+		ufd = *(u32 *)value;
+	if (ufd > S32_MAX)
+		return -EINVAL;
 
 	sock = sockfd_lookup(ufd, &ret);
 	if (!sock)
@@ -866,7 +898,8 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 		return ERR_PTR(-EPERM);
 	if (attr->max_entries == 0 ||
 	    attr->key_size    == 0 ||
-	    attr->value_size  != 4 ||
+	    (attr->value_size != sizeof(u32) &&
+	     attr->value_size != sizeof(u64)) ||
 	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 	if (attr->key_size > MAX_BPF_STACK)
@@ -943,6 +976,21 @@ static void sock_hash_free(struct bpf_map *map)
 	kfree(htab);
 }
 
+static void *sock_hash_lookup_sys(struct bpf_map *map, void *key)
+{
+	struct sock *sk;
+
+	if (map->value_size != sizeof(u64))
+		return ERR_PTR(-ENOSPC);
+
+	sk = __sock_hash_lookup_elem(map, key);
+	if (!sk)
+		return ERR_PTR(-ENOENT);
+
+	sock_gen_cookie(sk);
+	return &sk->sk_cookie;
+}
+
 static void sock_hash_release_progs(struct bpf_map *map)
 {
 	psock_progs_drop(&container_of(map, struct bpf_htab, map)->progs);
@@ -1032,6 +1080,7 @@ const struct bpf_map_ops sock_hash_ops = {
 	.map_update_elem	= sock_hash_update_elem,
 	.map_delete_elem	= sock_hash_delete_elem,
 	.map_lookup_elem	= sock_map_lookup,
+	.map_lookup_elem_sys_only = sock_hash_lookup_sys,
 	.map_release_uref	= sock_hash_release_progs,
 	.map_check_btf		= map_check_no_btf,
 };
-- 
2.24.1



* [PATCH bpf-next v7 07/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (5 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 06/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-21  3:46   ` John Fastabend
  2020-02-18 17:10 ` [PATCH bpf-next v7 08/11] bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH Jakub Sitnicki
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Don't require kernel code, like BPF helpers, that needs access to
SOCK{MAP,HASH} map contents to live in net/core/sock_map.c. Expose the
lookup operation to all of kernel-land.

Lookup from BPF context is not whitelisted yet, while syscalls have a
dedicated lookup handler.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/sock_map.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index f48c934d5da0..2e0f465295c3 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -301,7 +301,7 @@ static struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
 
 static void *sock_map_lookup(struct bpf_map *map, void *key)
 {
-	return ERR_PTR(-EOPNOTSUPP);
+	return __sock_map_lookup_elem(map, *(u32 *)key);
 }
 
 static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
@@ -991,6 +991,11 @@ static void *sock_hash_lookup_sys(struct bpf_map *map, void *key)
 	return &sk->sk_cookie;
 }
 
+static void *sock_hash_lookup(struct bpf_map *map, void *key)
+{
+	return __sock_hash_lookup_elem(map, key);
+}
+
 static void sock_hash_release_progs(struct bpf_map *map)
 {
 	psock_progs_drop(&container_of(map, struct bpf_htab, map)->progs);
@@ -1079,7 +1084,7 @@ const struct bpf_map_ops sock_hash_ops = {
 	.map_get_next_key	= sock_hash_get_next_key,
 	.map_update_elem	= sock_hash_update_elem,
 	.map_delete_elem	= sock_hash_delete_elem,
-	.map_lookup_elem	= sock_map_lookup,
+	.map_lookup_elem	= sock_hash_lookup,
 	.map_lookup_elem_sys_only = sock_hash_lookup_sys,
 	.map_release_uref	= sock_hash_release_progs,
 	.map_check_btf		= map_check_no_btf,
-- 
2.24.1



* [PATCH bpf-next v7 08/11] bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (6 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 07/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-18 17:10 ` [PATCH bpf-next v7 09/11] net: Generate reuseport group ID on group creation Jakub Sitnicki
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

SOCKMAP & SOCKHASH now support storing references to listening
sockets. Nothing keeps us from using these map types as a collection of
sockets to select from in BPF reuseport programs. Whitelist the map types
with the bpf_sk_select_reuseport helper.

The restriction that the socket has to be a member of a reuseport group
still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb
set are not a valid target and we signal it with -EINVAL.

The main benefit from this change is that, in contrast to
REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose the restriction that a
listening socket can be in just one BPF map at a time.
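
A minimal reuseport program using a SOCKHASH could then look as follows (a
sketch; the map and program names are made up, and the map definition
follows the bpf_map_def style used by the selftests of this era):

  struct bpf_map_def SEC("maps") sock_hash = {
          .type        = BPF_MAP_TYPE_SOCKHASH,
          .key_size    = sizeof(__u32),
          .value_size  = sizeof(__u64),
          .max_entries = 1,
  };

  SEC("sk_reuseport")
  int select_sock(struct sk_reuseport_md *reuse_md)
  {
          __u32 key = 0;
          int err;

          /* -EINVAL if the socket under the key is not a member
           * of the packet's reuseport group
           */
          err = bpf_sk_select_reuseport(reuse_md, &sock_hash, &key, 0);
          return err ? SK_DROP : SK_PASS;
  }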

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 kernel/bpf/verifier.c | 10 +++++++---
 net/core/filter.c     | 15 ++++++++++-----
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1cc945daa9c8..6d15dfbd4b88 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3693,14 +3693,16 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (func_id != BPF_FUNC_sk_redirect_map &&
 		    func_id != BPF_FUNC_sock_map_update &&
 		    func_id != BPF_FUNC_map_delete_elem &&
-		    func_id != BPF_FUNC_msg_redirect_map)
+		    func_id != BPF_FUNC_msg_redirect_map &&
+		    func_id != BPF_FUNC_sk_select_reuseport)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_SOCKHASH:
 		if (func_id != BPF_FUNC_sk_redirect_hash &&
 		    func_id != BPF_FUNC_sock_hash_update &&
 		    func_id != BPF_FUNC_map_delete_elem &&
-		    func_id != BPF_FUNC_msg_redirect_hash)
+		    func_id != BPF_FUNC_msg_redirect_hash &&
+		    func_id != BPF_FUNC_sk_select_reuseport)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
@@ -3774,7 +3776,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 			goto error;
 		break;
 	case BPF_FUNC_sk_select_reuseport:
-		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
+		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
+		    map->map_type != BPF_MAP_TYPE_SOCKMAP &&
+		    map->map_type != BPF_MAP_TYPE_SOCKHASH)
 			goto error;
 		break;
 	case BPF_FUNC_map_peek_elem:
diff --git a/net/core/filter.c b/net/core/filter.c
index c180871e606d..77d2f471b3bb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8620,6 +8620,7 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
 	   struct bpf_map *, map, void *, key, u32, flags)
 {
+	bool is_sockarray = map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
 	struct sock_reuseport *reuse;
 	struct sock *selected_sk;
 
@@ -8628,12 +8629,16 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
 		return -ENOENT;
 
 	reuse = rcu_dereference(selected_sk->sk_reuseport_cb);
-	if (!reuse)
-		/* selected_sk is unhashed (e.g. by close()) after the
-		 * above map_lookup_elem().  Treat selected_sk has already
-		 * been removed from the map.
+	if (!reuse) {
+		/* reuseport_array has only sk with non NULL sk_reuseport_cb.
+		 * The only (!reuse) case here is - the sk has already been
+		 * unhashed (e.g. by close()), so treat it as -ENOENT.
+		 *
+		 * Other maps (e.g. sock_map) do not provide this guarantee and
+		 * the sk may never be in the reuseport group to begin with.
 		 */
-		return -ENOENT;
+		return is_sockarray ? -ENOENT : -EINVAL;
+	}
 
 	if (unlikely(reuse->reuseport_id != reuse_kern->reuseport_id)) {
 		struct sock *sk;
-- 
2.24.1



* [PATCH bpf-next v7 09/11] net: Generate reuseport group ID on group creation
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (7 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 08/11] bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-18 17:10 ` [PATCH bpf-next v7 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH Jakub Sitnicki
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Commit 736b46027eb4 ("net: Add ID (if needed) to sock_reuseport and expose
reuseport_lock") has introduced lazy generation of reuseport group IDs that
survive group resize.

By comparing the identifier, we check that a BPF reuseport program is not
trying to select a socket from a BPF map that belongs to a different
reuseport group than the one the packet is destined for.

Because SOCKARRAY used to be the only BPF map type that could be used with
reuseport BPF, it was possible to delay the generation of the reuseport
group ID until a socket from the group was inserted into a BPF map for the
first time.

Now that SOCK{MAP,HASH} can be used with reuseport BPF, we have two options:
either generate the reuseport ID on map update, like SOCKARRAY does, or
allocate an ID from the start, when the reuseport group gets created.

This patch takes the latter approach to keep sockmap free of calls into
reuseport code. This streamlines access to reuseport_id, as its lifetime now
matches the longevity of the reuseport object.

The cost of this simplification, however, is that we allocate reuseport IDs
for all SO_REUSEPORT users, even those that don't use SOCKARRAY in their
setups. With the way identifiers are currently generated, we can have at
most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
get close to the limit, we can switch to a u64 counter, like sk_cookie.

Another change is that we now always call into SOCKARRAY logic to unlink
the socket from the map when unhashing or closing the socket. Previously we
did it only when at least one socket from the group was in a BPF map.

It is worth noting that this doesn't conflict with sockmap tear-down in
case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
group. sockmap tear-down happens first:

  prot->unhash
  `- tcp_bpf_unhash
     |- tcp_bpf_remove
     |  `- while (sk_psock_link_pop(psock))
     |     `- sk_psock_unlink
     |        `- sock_map_delete_from_link
     |           `- __sock_map_delete
     |              `- sock_map_unref
     |                 `- sk_psock_put
     |                    `- sk_psock_drop
     |                       `- rcu_assign_sk_user_data(sk, NULL)
     `- inet_unhash
        `- reuseport_detach_sock
           `- bpf_sk_reuseport_detach
              `- WRITE_ONCE(sk->sk_user_data, NULL)

Suggested-by: Martin Lau <kafai@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/sock_reuseport.h |  2 --
 kernel/bpf/reuseport_array.c |  5 ----
 net/core/filter.c            | 12 +--------
 net/core/sock_reuseport.c    | 50 +++++++++++++++---------------------
 4 files changed, 22 insertions(+), 47 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 43f4a818d88f..3ecaa15d1850 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -55,6 +55,4 @@ static inline bool reuseport_has_conns(struct sock *sk, bool set)
 	return ret;
 }
 
-int reuseport_get_id(struct sock_reuseport *reuse);
-
 #endif  /* _SOCK_REUSEPORT_H */
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 50c083ba978c..01badd3eda7a 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -305,11 +305,6 @@ int bpf_fd_reuseport_array_update_elem(struct bpf_map *map, void *key,
 	if (err)
 		goto put_file_unlock;
 
-	/* Ensure reuse->reuseport_id is set */
-	err = reuseport_get_id(reuse);
-	if (err < 0)
-		goto put_file_unlock;
-
 	WRITE_ONCE(nsk->sk_user_data, &array->ptrs[index]);
 	rcu_assign_pointer(array->ptrs[index], nsk);
 	free_osk = osk;
diff --git a/net/core/filter.c b/net/core/filter.c
index 77d2f471b3bb..925b23de218b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8641,18 +8641,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern,
 	}
 
 	if (unlikely(reuse->reuseport_id != reuse_kern->reuseport_id)) {
-		struct sock *sk;
-
-		if (unlikely(!reuse_kern->reuseport_id))
-			/* There is a small race between adding the
-			 * sk to the map and setting the
-			 * reuse_kern->reuseport_id.
-			 * Treat it as the sk has not been added to
-			 * the bpf map yet.
-			 */
-			return -ENOENT;
+		struct sock *sk = reuse_kern->sk;
 
-		sk = reuse_kern->sk;
 		if (sk->sk_protocol != selected_sk->sk_protocol)
 			return -EPROTOTYPE;
 		else if (sk->sk_family != selected_sk->sk_family)
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 91e9f2223c39..adcb3aea576d 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -16,27 +16,8 @@
 
 DEFINE_SPINLOCK(reuseport_lock);
 
-#define REUSEPORT_MIN_ID 1
 static DEFINE_IDA(reuseport_ida);
 
-int reuseport_get_id(struct sock_reuseport *reuse)
-{
-	int id;
-
-	if (reuse->reuseport_id)
-		return reuse->reuseport_id;
-
-	id = ida_simple_get(&reuseport_ida, REUSEPORT_MIN_ID, 0,
-			    /* Called under reuseport_lock */
-			    GFP_ATOMIC);
-	if (id < 0)
-		return id;
-
-	reuse->reuseport_id = id;
-
-	return reuse->reuseport_id;
-}
-
 static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
 {
 	unsigned int size = sizeof(struct sock_reuseport) +
@@ -55,6 +36,7 @@ static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
 int reuseport_alloc(struct sock *sk, bool bind_inany)
 {
 	struct sock_reuseport *reuse;
+	int id, ret = 0;
 
 	/* bh lock used since this function call may precede hlist lock in
 	 * soft irq of receive path or setsockopt from process context
@@ -78,10 +60,18 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
 
 	reuse = __reuseport_alloc(INIT_SOCKS);
 	if (!reuse) {
-		spin_unlock_bh(&reuseport_lock);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}
 
+	id = ida_alloc(&reuseport_ida, GFP_ATOMIC);
+	if (id < 0) {
+		kfree(reuse);
+		ret = id;
+		goto out;
+	}
+
+	reuse->reuseport_id = id;
 	reuse->socks[0] = sk;
 	reuse->num_socks = 1;
 	reuse->bind_inany = bind_inany;
@@ -90,7 +80,7 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
 out:
 	spin_unlock_bh(&reuseport_lock);
 
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL(reuseport_alloc);
 
@@ -134,8 +124,7 @@ static void reuseport_free_rcu(struct rcu_head *head)
 
 	reuse = container_of(head, struct sock_reuseport, rcu);
 	sk_reuseport_prog_free(rcu_dereference_protected(reuse->prog, 1));
-	if (reuse->reuseport_id)
-		ida_simple_remove(&reuseport_ida, reuse->reuseport_id);
+	ida_free(&reuseport_ida, reuse->reuseport_id);
 	kfree(reuse);
 }
 
@@ -199,12 +188,15 @@ void reuseport_detach_sock(struct sock *sk)
 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 					  lockdep_is_held(&reuseport_lock));
 
-	/* At least one of the sk in this reuseport group is added to
-	 * a bpf map.  Notify the bpf side.  The bpf map logic will
-	 * remove the sk if it is indeed added to a bpf map.
+	/* Notify the bpf side. The sk may be added to a sockarray
+	 * map. If so, sockarray logic will remove it from the map.
+	 *
+	 * Other bpf map types that work with reuseport, like sockmap,
+	 * don't need an explicit callback from here. They override sk
+	 * unhash/close ops to remove the sk from the map before we
+	 * get to this point.
 	 */
-	if (reuse->reuseport_id)
-		bpf_sk_reuseport_detach(sk);
+	bpf_sk_reuseport_detach(sk);
 
 	rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
 
-- 
2.24.1



* [PATCH bpf-next v7 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (8 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 09/11] net: Generate reuseport group ID on group creation Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-21  3:52   ` John Fastabend
  2020-02-18 17:10 ` [PATCH bpf-next v7 11/11] selftests/bpf: Tests for sockmap/sockhash holding listening sockets Jakub Sitnicki
  2020-02-21 21:41 ` [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store " Daniel Borkmann
  11 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
is not hard-coded in the test setup routine.

This, together with careful state cleaning after the tests, lets us run the
test cases for REUSEPORT_ARRAY, SOCKMAP, and SOCKHASH to have test coverage
for all supported map types. The last two support only TCP sockets at the
moment.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 .../bpf/prog_tests/select_reuseport.c         | 63 ++++++++++++++++---
 1 file changed, 53 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/select_reuseport.c b/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
index 098bcae5f827..9ed0ab06fd92 100644
--- a/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
+++ b/tools/testing/selftests/bpf/prog_tests/select_reuseport.c
@@ -36,6 +36,7 @@ static int result_map, tmp_index_ovr_map, linum_map, data_check_map;
 static __u32 expected_results[NR_RESULTS];
 static int sk_fds[REUSEPORT_ARRAY_SIZE];
 static int reuseport_array = -1, outer_map = -1;
+static enum bpf_map_type inner_map_type;
 static int select_by_skb_data_prog;
 static int saved_tcp_syncookie = -1;
 static struct bpf_object *obj;
@@ -63,13 +64,15 @@ static union sa46 {
 	}								\
 })
 
-static int create_maps(void)
+static int create_maps(enum bpf_map_type inner_type)
 {
 	struct bpf_create_map_attr attr = {};
 
+	inner_map_type = inner_type;
+
 	/* Creating reuseport_array */
 	attr.name = "reuseport_array";
-	attr.map_type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY;
+	attr.map_type = inner_type;
 	attr.key_size = sizeof(__u32);
 	attr.value_size = sizeof(__u32);
 	attr.max_entries = REUSEPORT_ARRAY_SIZE;
@@ -726,12 +729,36 @@ static void cleanup_per_test(bool no_inner_map)
 
 static void cleanup(void)
 {
-	if (outer_map != -1)
+	if (outer_map != -1) {
 		close(outer_map);
-	if (reuseport_array != -1)
+		outer_map = -1;
+	}
+
+	if (reuseport_array != -1) {
 		close(reuseport_array);
-	if (obj)
+		reuseport_array = -1;
+	}
+
+	if (obj) {
 		bpf_object__close(obj);
+		obj = NULL;
+	}
+
+	memset(expected_results, 0, sizeof(expected_results));
+}
+
+static const char *maptype_str(enum bpf_map_type type)
+{
+	switch (type) {
+	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
+		return "reuseport_sockarray";
+	case BPF_MAP_TYPE_SOCKMAP:
+		return "sockmap";
+	case BPF_MAP_TYPE_SOCKHASH:
+		return "sockhash";
+	default:
+		return "unknown";
+	}
 }
 
 static const char *family_str(sa_family_t family)
@@ -779,13 +806,21 @@ static void test_config(int sotype, sa_family_t family, bool inany)
 	const struct test *t;
 
 	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
-		snprintf(s, sizeof(s), "%s/%s %s %s",
+		snprintf(s, sizeof(s), "%s %s/%s %s %s",
+			 maptype_str(inner_map_type),
 			 family_str(family), sotype_str(sotype),
 			 inany ? "INANY" : "LOOPBACK", t->name);
 
 		if (!test__start_subtest(s))
 			continue;
 
+		if (sotype == SOCK_DGRAM &&
+		    inner_map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+			/* SOCKMAP/SOCKHASH don't support UDP yet */
+			test__skip();
+			continue;
+		}
+
 		setup_per_test(sotype, family, inany, t->no_inner_map);
 		t->fn(sotype, family);
 		cleanup_per_test(t->no_inner_map);
@@ -814,13 +849,20 @@ static void test_all(void)
 		test_config(c->sotype, c->family, c->inany);
 }
 
-void test_select_reuseport(void)
+void test_map_type(enum bpf_map_type mt)
 {
-	if (create_maps())
+	if (create_maps(mt))
 		goto out;
 	if (prepare_bpf_obj())
 		goto out;
 
+	test_all();
+out:
+	cleanup();
+}
+
+void test_select_reuseport(void)
+{
 	saved_tcp_fo = read_int_sysctl(TCP_FO_SYSCTL);
 	saved_tcp_syncookie = read_int_sysctl(TCP_SYNCOOKIE_SYSCTL);
 	if (saved_tcp_fo < 0 || saved_tcp_syncookie < 0)
@@ -831,8 +873,9 @@ void test_select_reuseport(void)
 	if (disable_syncookie())
 		goto out;
 
-	test_all();
+	test_map_type(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
+	test_map_type(BPF_MAP_TYPE_SOCKMAP);
+	test_map_type(BPF_MAP_TYPE_SOCKHASH);
 out:
-	cleanup();
 	restore_sysctls();
 }
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH bpf-next v7 11/11] selftests/bpf: Tests for sockmap/sockhash holding listening sockets
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (9 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH Jakub Sitnicki
@ 2020-02-18 17:10 ` Jakub Sitnicki
  2020-02-21  3:56   ` John Fastabend
  2020-02-21 21:41 ` [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store " Daniel Borkmann
  11 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-18 17:10 UTC (permalink / raw)
  To: bpf; +Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Now that SOCKMAP and SOCKHASH map types can store listening sockets,
the user-space and BPF APIs are open to a new set of potential pitfalls.

Exercise the map operations, with extra attention to code paths susceptible
to races between map ops and socket cloning, and BPF helpers that work with
SOCKMAP/SOCKHASH to gain confidence that all works as expected.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 .../selftests/bpf/prog_tests/sockmap_listen.c | 1496 +++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |   98 ++
 2 files changed, 1594 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
new file mode 100644
index 000000000000..b1b2acea0638
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -0,0 +1,1496 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Cloudflare
+/*
+ * Test suite for SOCKMAP/SOCKHASH holding listening sockets.
+ * Covers:
+ *  1. BPF map operations - bpf_map_{update,lookup,delete}_elem
+ *  2. BPF redirect helpers - bpf_{sk,msg}_redirect_map
+ *  3. BPF reuseport helper - bpf_sk_select_reuseport
+ */
+
+#include <linux/compiler.h>
+#include <errno.h>
+#include <error.h>
+#include <limits.h>
+#include <netinet/in.h>
+#include <pthread.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "bpf_util.h"
+#include "test_progs.h"
+#include "test_sockmap_listen.skel.h"
+
+#define MAX_STRERR_LEN 256
+#define MAX_TEST_NAME 80
+
+#define _FAIL(errnum, fmt...)                                                  \
+	({                                                                     \
+		error_at_line(0, (errnum), __func__, __LINE__, fmt);           \
+		CHECK_FAIL(true);                                              \
+	})
+#define FAIL(fmt...) _FAIL(0, fmt)
+#define FAIL_ERRNO(fmt...) _FAIL(errno, fmt)
+#define FAIL_LIBBPF(err, msg)                                                  \
+	({                                                                     \
+		char __buf[MAX_STRERR_LEN];                                    \
+		libbpf_strerror((err), __buf, sizeof(__buf));                  \
+		FAIL("%s: %s", (msg), __buf);                                  \
+	})
+
+/* Wrappers that fail the test on error and report it. */
+
+#define xaccept(fd, addr, len)                                                 \
+	({                                                                     \
+		int __ret = accept((fd), (addr), (len));                       \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("accept");                                  \
+		__ret;                                                         \
+	})
+
+#define xbind(fd, addr, len)                                                   \
+	({                                                                     \
+		int __ret = bind((fd), (addr), (len));                         \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("bind");                                    \
+		__ret;                                                         \
+	})
+
+#define xclose(fd)                                                             \
+	({                                                                     \
+		int __ret = close((fd));                                       \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("close");                                   \
+		__ret;                                                         \
+	})
+
+#define xconnect(fd, addr, len)                                                \
+	({                                                                     \
+		int __ret = connect((fd), (addr), (len));                      \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("connect");                                 \
+		__ret;                                                         \
+	})
+
+#define xgetsockname(fd, addr, len)                                            \
+	({                                                                     \
+		int __ret = getsockname((fd), (addr), (len));                  \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("getsockname");                             \
+		__ret;                                                         \
+	})
+
+#define xgetsockopt(fd, level, name, val, len)                                 \
+	({                                                                     \
+		int __ret = getsockopt((fd), (level), (name), (val), (len));   \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("getsockopt(" #name ")");                   \
+		__ret;                                                         \
+	})
+
+#define xlisten(fd, backlog)                                                   \
+	({                                                                     \
+		int __ret = listen((fd), (backlog));                           \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("listen");                                  \
+		__ret;                                                         \
+	})
+
+#define xsetsockopt(fd, level, name, val, len)                                 \
+	({                                                                     \
+		int __ret = setsockopt((fd), (level), (name), (val), (len));   \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("setsockopt(" #name ")");                   \
+		__ret;                                                         \
+	})
+
+#define xsocket(family, sotype, flags)                                         \
+	({                                                                     \
+		int __ret = socket(family, sotype, flags);                     \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("socket");                                  \
+		__ret;                                                         \
+	})
+
+#define xbpf_map_delete_elem(fd, key)                                          \
+	({                                                                     \
+		int __ret = bpf_map_delete_elem((fd), (key));                  \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("map_delete");                              \
+		__ret;                                                         \
+	})
+
+#define xbpf_map_lookup_elem(fd, key, val)                                     \
+	({                                                                     \
+		int __ret = bpf_map_lookup_elem((fd), (key), (val));           \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("map_lookup");                              \
+		__ret;                                                         \
+	})
+
+#define xbpf_map_update_elem(fd, key, val, flags)                              \
+	({                                                                     \
+		int __ret = bpf_map_update_elem((fd), (key), (val), (flags));  \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("map_update");                              \
+		__ret;                                                         \
+	})
+
+#define xbpf_prog_attach(prog, target, type, flags)                            \
+	({                                                                     \
+		int __ret =                                                    \
+			bpf_prog_attach((prog), (target), (type), (flags));    \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("prog_attach(" #type ")");                  \
+		__ret;                                                         \
+	})
+
+#define xbpf_prog_detach2(prog, target, type)                                  \
+	({                                                                     \
+		int __ret = bpf_prog_detach2((prog), (target), (type));        \
+		if (__ret == -1)                                               \
+			FAIL_ERRNO("prog_detach2(" #type ")");                 \
+		__ret;                                                         \
+	})
+
+#define xpthread_create(thread, attr, func, arg)                               \
+	({                                                                     \
+		int __ret = pthread_create((thread), (attr), (func), (arg));   \
+		errno = __ret;                                                 \
+		if (__ret)                                                     \
+			FAIL_ERRNO("pthread_create");                          \
+		__ret;                                                         \
+	})
+
+#define xpthread_join(thread, retval)                                          \
+	({                                                                     \
+		int __ret = pthread_join((thread), (retval));                  \
+		errno = __ret;                                                 \
+		if (__ret)                                                     \
+			FAIL_ERRNO("pthread_join");                            \
+		__ret;                                                         \
+	})
+
+static void init_addr_loopback4(struct sockaddr_storage *ss, socklen_t *len)
+{
+	struct sockaddr_in *addr4 = memset(ss, 0, sizeof(*ss));
+
+	addr4->sin_family = AF_INET;
+	addr4->sin_port = 0;
+	addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+	*len = sizeof(*addr4);
+}
+
+static void init_addr_loopback6(struct sockaddr_storage *ss, socklen_t *len)
+{
+	struct sockaddr_in6 *addr6 = memset(ss, 0, sizeof(*ss));
+
+	addr6->sin6_family = AF_INET6;
+	addr6->sin6_port = 0;
+	addr6->sin6_addr = in6addr_loopback;
+	*len = sizeof(*addr6);
+}
+
+static void init_addr_loopback(int family, struct sockaddr_storage *ss,
+			       socklen_t *len)
+{
+	switch (family) {
+	case AF_INET:
+		init_addr_loopback4(ss, len);
+		return;
+	case AF_INET6:
+		init_addr_loopback6(ss, len);
+		return;
+	default:
+		FAIL("unsupported address family %d", family);
+	}
+}
+
+static inline struct sockaddr *sockaddr(struct sockaddr_storage *ss)
+{
+	return (struct sockaddr *)ss;
+}
+
+static int enable_reuseport(int s, int progfd)
+{
+	int err, one = 1;
+
+	err = xsetsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
+	if (err)
+		return -1;
+	err = xsetsockopt(s, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF, &progfd,
+			  sizeof(progfd));
+	if (err)
+		return -1;
+
+	return 0;
+}
+
+static int listen_loopback_reuseport(int family, int sotype, int progfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len;
+	int err, s;
+
+	init_addr_loopback(family, &addr, &len);
+
+	s = xsocket(family, sotype, 0);
+	if (s == -1)
+		return -1;
+
+	if (progfd >= 0)
+		enable_reuseport(s, progfd);
+
+	err = xbind(s, sockaddr(&addr), len);
+	if (err)
+		goto close;
+
+	err = xlisten(s, SOMAXCONN);
+	if (err)
+		goto close;
+
+	return s;
+close:
+	xclose(s);
+	return -1;
+}
+
+static int listen_loopback(int family, int sotype)
+{
+	return listen_loopback_reuseport(family, sotype, -1);
+}
+
+static void test_insert_invalid(int family, int sotype, int mapfd)
+{
+	u32 key = 0;
+	u64 value;
+	int err;
+
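+	/* -1 is out of range for an fd (must fit in s32), hence EINVAL */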
+	value = -1;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EINVAL)
+		FAIL_ERRNO("map_update: expected EINVAL");
+
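+	/* INT_MAX is in range but not an open fd, hence EBADF */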
+	value = INT_MAX;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EBADF)
+		FAIL_ERRNO("map_update: expected EBADF");
+}
+
+static void test_insert_opened(int family, int sotype, int mapfd)
+{
+	u32 key = 0;
+	u64 value;
+	int err, s;
+
+	s = xsocket(family, sotype, 0);
+	if (s == -1)
+		return;
+
+	errno = 0;
+	value = s;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EOPNOTSUPP)
+		FAIL_ERRNO("map_update: expected EOPNOTSUPP");
+
+	xclose(s);
+}
+
+static void test_insert_bound(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len;
+	u32 key = 0;
+	u64 value;
+	int err, s;
+
+	init_addr_loopback(family, &addr, &len);
+
+	s = xsocket(family, sotype, 0);
+	if (s == -1)
+		return;
+
+	err = xbind(s, sockaddr(&addr), len);
+	if (err)
+		goto close;
+
+	errno = 0;
+	value = s;
+	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	if (!err || errno != EOPNOTSUPP)
+		FAIL_ERRNO("map_update: expected EOPNOTSUPP");
+close:
+	xclose(s);
+}
+
+static void test_insert_listening(int family, int sotype, int mapfd)
+{
+	u64 value;
+	u32 key;
+	int s;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xclose(s);
+}
+
+static void test_delete_after_insert(int family, int sotype, int mapfd)
+{
+	u64 value;
+	u32 key;
+	int s;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+	xclose(s);
+}
+
+static void test_delete_after_close(int family, int sotype, int mapfd)
+{
+	int err, s;
+	u64 value;
+	u32 key;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	xclose(s);
+
+	errno = 0;
+	err = bpf_map_delete_elem(mapfd, &key);
+	if (!err || (errno != EINVAL && errno != ENOENT))
+		/* SOCKMAP and SOCKHASH return different error codes */
+		FAIL_ERRNO("map_delete: expected EINVAL/EINVAL");
+}
+
+static void test_lookup_after_insert(int family, int sotype, int mapfd)
+{
+	u64 cookie, value;
+	socklen_t len;
+	u32 key;
+	int s;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	len = sizeof(cookie);
+	xgetsockopt(s, SOL_SOCKET, SO_COOKIE, &cookie, &len);
+
+	xbpf_map_lookup_elem(mapfd, &key, &value);
+
+	if (value != cookie) {
+		FAIL("map_lookup: have %#llx, want %#llx",
+		     (unsigned long long)value, (unsigned long long)cookie);
+	}
+
+	xclose(s);
+}
+
+static void test_lookup_after_delete(int family, int sotype, int mapfd)
+{
+	int err, s;
+	u64 value;
+	u32 key;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+
+	errno = 0;
+	err = bpf_map_lookup_elem(mapfd, &key, &value);
+	if (!err || errno != ENOENT)
+		FAIL_ERRNO("map_lookup: expected ENOENT");
+
+	xclose(s);
+}
+
+static void test_lookup_32_bit_value(int family, int sotype, int mapfd)
+{
+	u32 key, value32;
+	int err, s;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	mapfd = bpf_create_map(BPF_MAP_TYPE_SOCKMAP, sizeof(key),
+			       sizeof(value32), 1, 0);
+	if (mapfd < 0) {
+		FAIL_ERRNO("map_create");
+		goto close;
+	}
+
+	key = 0;
+	value32 = s;
+	xbpf_map_update_elem(mapfd, &key, &value32, BPF_NOEXIST);
+
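+	/* The 8-byte socket cookie can't fit in the 4-byte value */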
+	errno = 0;
+	err = bpf_map_lookup_elem(mapfd, &key, &value32);
+	if (!err || errno != ENOSPC)
+		FAIL_ERRNO("map_lookup: expected ENOSPC");
+
+	xclose(mapfd);
+close:
+	xclose(s);
+}
+
+static void test_update_listening(int family, int sotype, int mapfd)
+{
+	int s1, s2;
+	u64 value;
+	u32 key;
+
+	s1 = listen_loopback(family, sotype);
+	if (s1 < 0)
+		return;
+
+	s2 = listen_loopback(family, sotype);
+	if (s2 < 0)
+		goto close_s1;
+
+	key = 0;
+	value = s1;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	value = s2;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_EXIST);
+	xclose(s2);
+close_s1:
+	xclose(s1);
+}
+
+/* Exercise the code path where we destroy child sockets that never
+ * got accept()'ed, aka orphans, when the parent socket gets closed.
+ */
+static void test_destroy_orphan_child(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len;
+	int err, s, c;
+	u64 value;
+	u32 key;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+
+	c = xsocket(family, sotype, 0);
+	if (c == -1)
+		goto close_srv;
+
+	xconnect(c, sockaddr(&addr), len);
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+/* Perform a passive open after removing the listening socket from the
+ * SOCKMAP to ensure that its callbacks get restored properly.
+ */
+static void test_clone_after_delete(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len;
+	int err, s, c;
+	u64 value;
+	u32 key;
+
+	s = listen_loopback(family, sotype);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	key = 0;
+	value = s;
+	xbpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
+	xbpf_map_delete_elem(mapfd, &key);
+
+	c = xsocket(family, sotype, 0);
+	if (c < 0)
+		goto close_srv;
+
+	xconnect(c, sockaddr(&addr), len);
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+/* Check that a child socket created while the parent was in a
+ * SOCKMAP, but accept()'ed only after the parent was removed from the
+ * SOCKMAP, is cloned without the parent's psock state or callbacks.
+ */
+static void test_accept_after_delete(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	const u32 zero = 0;
+	int err, s, c, p;
+	socklen_t len;
+	u64 value;
+
+	s = listen_loopback(family, sotype);
+	if (s == -1)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	value = s;
+	err = xbpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv;
+
+	c = xsocket(family, sotype, 0);
+	if (c == -1)
+		goto close_srv;
+
+	/* Create child while parent is in sockmap */
+	err = xconnect(c, sockaddr(&addr), len);
+	if (err)
+		goto close_cli;
+
+	/* Remove parent from sockmap */
+	err = xbpf_map_delete_elem(mapfd, &zero);
+	if (err)
+		goto close_cli;
+
+	p = xaccept(s, NULL, NULL);
+	if (p == -1)
+		goto close_cli;
+
+	/* Check that child sk_user_data is not set */
+	value = p;
+	xbpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+
+	xclose(p);
+close_cli:
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+/* Check that a child socket created and accepted while the parent
+ * was in a SOCKMAP is cloned without the parent's psock state or
+ * callbacks.
+ */
+static void test_accept_before_delete(int family, int sotype, int mapfd)
+{
+	struct sockaddr_storage addr;
+	const u32 zero = 0, one = 1;
+	int err, s, c, p;
+	socklen_t len;
+	u64 value;
+
+	s = listen_loopback(family, sotype);
+	if (s == -1)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	value = s;
+	err = xbpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv;
+
+	c = xsocket(family, sotype, 0);
+	if (c == -1)
+		goto close_srv;
+
+	/* Create & accept child while parent is in sockmap */
+	err = xconnect(c, sockaddr(&addr), len);
+	if (err)
+		goto close_cli;
+
+	p = xaccept(s, NULL, NULL);
+	if (p == -1)
+		goto close_cli;
+
+	/* Check that child sk_user_data is not set */
+	value = p;
+	xbpf_map_update_elem(mapfd, &one, &value, BPF_NOEXIST);
+
+	xclose(p);
+close_cli:
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+struct connect_accept_ctx {
+	int sockfd;
+	unsigned int done;
+	unsigned int nr_iter;
+};
+
+static bool is_thread_done(struct connect_accept_ctx *ctx)
+{
+	return READ_ONCE(ctx->done);
+}
+
+static void *connect_accept_thread(void *arg)
+{
+	struct connect_accept_ctx *ctx = arg;
+	struct sockaddr_storage addr;
+	int family, socktype;
+	socklen_t len;
+	int err, i, s;
+
+	s = ctx->sockfd;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto done;
+
+	len = sizeof(family);
+	err = xgetsockopt(s, SOL_SOCKET, SO_DOMAIN, &family, &len);
+	if (err)
+		goto done;
+
+	len = sizeof(socktype);
+	err = xgetsockopt(s, SOL_SOCKET, SO_TYPE, &socktype, &len);
+	if (err)
+		goto done;
+
+	for (i = 0; i < ctx->nr_iter; i++) {
+		int c, p;
+
+		c = xsocket(family, socktype, 0);
+		if (c < 0)
+			break;
+
+		err = xconnect(c, sockaddr(&addr), sizeof(addr));
+		if (err) {
+			xclose(c);
+			break;
+		}
+
+		p = xaccept(s, NULL, NULL);
+		if (p < 0) {
+			xclose(c);
+			break;
+		}
+
+		xclose(p);
+		xclose(c);
+	}
+done:
+	WRITE_ONCE(ctx->done, 1);
+	return NULL;
+}
+
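+/* Race inserts/deletes of the listening socket against a thread that
+ * keeps creating and reaping child sockets (connect, accept, close).
+ */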
+static void test_syn_recv_insert_delete(int family, int sotype, int mapfd)
+{
+	struct connect_accept_ctx ctx = { 0 };
+	struct sockaddr_storage addr;
+	socklen_t len;
+	u32 zero = 0;
+	pthread_t t;
+	int err, s;
+	u64 value;
+
+	s = listen_loopback(family, sotype | SOCK_NONBLOCK);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close;
+
+	ctx.sockfd = s;
+	ctx.nr_iter = 1000;
+
+	err = xpthread_create(&t, NULL, connect_accept_thread, &ctx);
+	if (err)
+		goto close;
+
+	value = s;
+	while (!is_thread_done(&ctx)) {
+		err = xbpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+		if (err)
+			break;
+
+		err = xbpf_map_delete_elem(mapfd, &zero);
+		if (err)
+			break;
+	}
+
+	xpthread_join(t, NULL);
+close:
+	xclose(s);
+}
+
+static void *listen_thread(void *arg)
+{
+	struct sockaddr unspec = { AF_UNSPEC };
+	struct connect_accept_ctx *ctx = arg;
+	int err, i, s;
+
+	s = ctx->sockfd;
+
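+	/* listen() moves the socket to TCP_LISTEN; connect(AF_UNSPEC)
+	 * disconnects it back to TCP_CLOSE so the next listen() works.
+	 */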
+	for (i = 0; i < ctx->nr_iter; i++) {
+		err = xlisten(s, 1);
+		if (err)
+			break;
+		err = xconnect(s, &unspec, sizeof(unspec));
+		if (err)
+			break;
+	}
+
+	WRITE_ONCE(ctx->done, 1);
+	return NULL;
+}
+
+static void test_race_insert_listen(int family, int socktype, int mapfd)
+{
+	struct connect_accept_ctx ctx = { 0 };
+	const u32 zero = 0;
+	const int one = 1;
+	pthread_t t;
+	int err, s;
+	u64 value;
+
+	s = xsocket(family, socktype, 0);
+	if (s < 0)
+		return;
+
+	err = xsetsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
+	if (err)
+		goto close;
+
+	ctx.sockfd = s;
+	ctx.nr_iter = 10000;
+
+	err = pthread_create(&t, NULL, listen_thread, &ctx);
+	if (err)
+		goto close;
+
+	value = s;
+	while (!is_thread_done(&ctx)) {
+		err = bpf_map_update_elem(mapfd, &zero, &value, BPF_NOEXIST);
+		/* Expecting EOPNOTSUPP before listen() */
+		if (err && errno != EOPNOTSUPP) {
+			FAIL_ERRNO("map_update");
+			break;
+		}
+
+		err = bpf_map_delete_elem(mapfd, &zero);
+		/* Expecting no entry after unhash on connect(AF_UNSPEC) */
+		if (err && errno != EINVAL && errno != ENOENT) {
+			FAIL_ERRNO("map_delete");
+			break;
+		}
+	}
+
+	xpthread_join(t, NULL);
+close:
+	xclose(s);
+}
+
+static void zero_verdict_count(int mapfd)
+{
+	unsigned int zero = 0;
+	int key;
+
+	key = SK_DROP;
+	xbpf_map_update_elem(mapfd, &key, &zero, BPF_ANY);
+	key = SK_PASS;
+	xbpf_map_update_elem(mapfd, &key, &zero, BPF_ANY);
+}
+
+enum redir_mode {
+	REDIR_INGRESS,
+	REDIR_EGRESS,
+};
+
+static const char *redir_mode_str(enum redir_mode mode)
+{
+	switch (mode) {
+	case REDIR_INGRESS:
+		return "ingress";
+	case REDIR_EGRESS:
+		return "egress";
+	default:
+		return "unknown";
+	}
+}
+
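+/* Establish two connections through the listener, insert both accepted
+ * sockets into the map, then send one byte and expect the verdict
+ * program to redirect it to the other connection.
+ */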
+static void redir_to_connected(int family, int sotype, int sock_mapfd,
+			       int verd_mapfd, enum redir_mode mode)
+{
+	const char *log_prefix = redir_mode_str(mode);
+	struct sockaddr_storage addr;
+	int s, c0, c1, p0, p1;
+	unsigned int pass;
+	socklen_t len;
+	int err, n;
+	u64 value;
+	u32 key;
+	char b;
+
+	zero_verdict_count(verd_mapfd);
+
+	s = listen_loopback(family, sotype | SOCK_NONBLOCK);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	c0 = xsocket(family, sotype, 0);
+	if (c0 < 0)
+		goto close_srv;
+	err = xconnect(c0, sockaddr(&addr), len);
+	if (err)
+		goto close_cli0;
+
+	p0 = xaccept(s, NULL, NULL);
+	if (p0 < 0)
+		goto close_cli0;
+
+	c1 = xsocket(family, sotype, 0);
+	if (c1 < 0)
+		goto close_peer0;
+	err = xconnect(c1, sockaddr(&addr), len);
+	if (err)
+		goto close_cli1;
+
+	p1 = xaccept(s, NULL, NULL);
+	if (p1 < 0)
+		goto close_cli1;
+
+	key = 0;
+	value = p0;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer1;
+
+	key = 1;
+	value = p1;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer1;
+
+	n = write(mode == REDIR_INGRESS ? c1 : p1, "a", 1);
+	if (n < 0)
+		FAIL_ERRNO("%s: write", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete write", log_prefix);
+	if (n < 1)
+		goto close_peer1;
+
+	key = SK_PASS;
+	err = xbpf_map_lookup_elem(verd_mapfd, &key, &pass);
+	if (err)
+		goto close_peer1;
+	if (pass != 1)
+		FAIL("%s: want pass count 1, have %d", log_prefix, pass);
+
+	n = read(c0, &b, 1);
+	if (n < 0)
+		FAIL_ERRNO("%s: read", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete read", log_prefix);
+
+close_peer1:
+	xclose(p1);
+close_cli1:
+	xclose(c1);
+close_peer0:
+	xclose(p0);
+close_cli0:
+	xclose(c0);
+close_srv:
+	xclose(s);
+}
+
+static void test_skb_redir_to_connected(struct test_sockmap_listen *skel,
+					struct bpf_map *inner_map, int family,
+					int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+	int parser = bpf_program__fd(skel->progs.prog_skb_parser);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(inner_map);
+	int err;
+
+	err = xbpf_prog_attach(parser, sock_map, BPF_SK_SKB_STREAM_PARSER, 0);
+	if (err)
+		return;
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (err)
+		goto detach;
+
+	redir_to_connected(family, sotype, sock_map, verdict_map,
+			   REDIR_INGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT);
+detach:
+	xbpf_prog_detach2(parser, sock_map, BPF_SK_SKB_STREAM_PARSER);
+}
+
+static void test_msg_redir_to_connected(struct test_sockmap_listen *skel,
+					struct bpf_map *inner_map, int family,
+					int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_msg_verdict);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(inner_map);
+	int err;
+
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_MSG_VERDICT, 0);
+	if (err)
+		return;
+
+	redir_to_connected(family, sotype, sock_map, verdict_map, REDIR_EGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_MSG_VERDICT);
+}
+
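+/* Redirecting to a listening socket is not allowed; expect the verdict
+ * program to drop the payload instead.
+ */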
+static void redir_to_listening(int family, int sotype, int sock_mapfd,
+			       int verd_mapfd, enum redir_mode mode)
+{
+	const char *log_prefix = redir_mode_str(mode);
+	struct sockaddr_storage addr;
+	int s, c, p, err, n;
+	unsigned int drop;
+	socklen_t len;
+	u64 value;
+	u32 key;
+
+	zero_verdict_count(verd_mapfd);
+
+	s = listen_loopback(family, sotype | SOCK_NONBLOCK);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	c = xsocket(family, sotype, 0);
+	if (c < 0)
+		goto close_srv;
+	err = xconnect(c, sockaddr(&addr), len);
+	if (err)
+		goto close_cli;
+
+	p = xaccept(s, NULL, NULL);
+	if (p < 0)
+		goto close_cli;
+
+	key = 0;
+	value = s;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer;
+
+	key = 1;
+	value = p;
+	err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_peer;
+
+	n = write(mode == REDIR_INGRESS ? c : p, "a", 1);
+	if (n < 0 && errno != EACCES)
+		FAIL_ERRNO("%s: write", log_prefix);
+	if (n == 0)
+		FAIL("%s: incomplete write", log_prefix);
+	if (n < 1)
+		goto close_peer;
+
+	key = SK_DROP;
+	err = xbpf_map_lookup_elem(verd_mapfd, &key, &drop);
+	if (err)
+		goto close_peer;
+	if (drop != 1)
+		FAIL("%s: want drop count 1, have %d", log_prefix, drop);
+
+close_peer:
+	xclose(p);
+close_cli:
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+static void test_skb_redir_to_listening(struct test_sockmap_listen *skel,
+					struct bpf_map *inner_map, int family,
+					int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+	int parser = bpf_program__fd(skel->progs.prog_skb_parser);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(inner_map);
+	int err;
+
+	err = xbpf_prog_attach(parser, sock_map, BPF_SK_SKB_STREAM_PARSER, 0);
+	if (err)
+		return;
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (err)
+		goto detach;
+
+	redir_to_listening(family, sotype, sock_map, verdict_map,
+			   REDIR_INGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT);
+detach:
+	xbpf_prog_detach2(parser, sock_map, BPF_SK_SKB_STREAM_PARSER);
+}
+
+static void test_msg_redir_to_listening(struct test_sockmap_listen *skel,
+					struct bpf_map *inner_map, int family,
+					int sotype)
+{
+	int verdict = bpf_program__fd(skel->progs.prog_msg_verdict);
+	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	int sock_map = bpf_map__fd(inner_map);
+	int err;
+
+	err = xbpf_prog_attach(verdict, sock_map, BPF_SK_MSG_VERDICT, 0);
+	if (err)
+		return;
+
+	redir_to_listening(family, sotype, sock_map, verdict_map, REDIR_EGRESS);
+
+	xbpf_prog_detach2(verdict, sock_map, BPF_SK_MSG_VERDICT);
+}
+
+static void test_reuseport_select_listening(int family, int sotype,
+					    int sock_map, int verd_map,
+					    int reuseport_prog)
+{
+	struct sockaddr_storage addr;
+	unsigned int pass;
+	int s, c, p, err;
+	socklen_t len;
+	u64 value;
+	u32 key;
+
+	zero_verdict_count(verd_map);
+
+	s = listen_loopback_reuseport(family, sotype, reuseport_prog);
+	if (s < 0)
+		return;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	key = 0;
+	value = s;
+	err = xbpf_map_update_elem(sock_map, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv;
+
+	c = xsocket(family, sotype, 0);
+	if (c < 0)
+		goto close_srv;
+	err = xconnect(c, sockaddr(&addr), len);
+	if (err)
+		goto close_cli;
+
+	p = xaccept(s, NULL, NULL);
+	if (p < 0)
+		goto close_cli;
+
+	key = SK_PASS;
+	err = xbpf_map_lookup_elem(verd_map, &key, &pass);
+	if (err)
+		goto close_peer;
+	if (pass != 1)
+		FAIL("want pass count 1, have %d", pass);
+
+close_peer:
+	xclose(p);
+close_cli:
+	xclose(c);
+close_srv:
+	xclose(s);
+}
+
+static void test_reuseport_select_connected(int family, int sotype,
+					    int sock_map, int verd_map,
+					    int reuseport_prog)
+{
+	struct sockaddr_storage addr;
+	int s, c0, c1, p0, err;
+	unsigned int drop;
+	socklen_t len;
+	u64 value;
+	u32 key;
+
+	zero_verdict_count(verd_map);
+
+	s = listen_loopback_reuseport(family, sotype, reuseport_prog);
+	if (s < 0)
+		return;
+
+	/* Populate sock_map[0] to avoid ENOENT on first connection */
+	key = 0;
+	value = s;
+	err = xbpf_map_update_elem(sock_map, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv;
+
+	len = sizeof(addr);
+	err = xgetsockname(s, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv;
+
+	c0 = xsocket(family, sotype, 0);
+	if (c0 < 0)
+		goto close_srv;
+
+	err = xconnect(c0, sockaddr(&addr), len);
+	if (err)
+		goto close_cli0;
+
+	p0 = xaccept(s, NULL, NULL);
+	if (p0 < 0)
+		goto close_cli0;
+
+	/* Update sock_map[0] to redirect to a connected socket */
+	key = 0;
+	value = p0;
+	err = xbpf_map_update_elem(sock_map, &key, &value, BPF_EXIST);
+	if (err)
+		goto close_peer0;
+
+	c1 = xsocket(family, sotype, 0);
+	if (c1 < 0)
+		goto close_peer0;
+
+	errno = 0;
+	err = connect(c1, sockaddr(&addr), len);
+	if (!err || errno != ECONNREFUSED)
+		FAIL_ERRNO("connect: expected ECONNREFUSED");
+
+	key = SK_DROP;
+	err = xbpf_map_lookup_elem(verd_map, &key, &drop);
+	if (err)
+		goto close_cli1;
+	if (drop != 1)
+		FAIL("want drop count 1, have %d", drop);
+
+close_cli1:
+	xclose(c1);
+close_peer0:
+	xclose(p0);
+close_cli0:
+	xclose(c0);
+close_srv:
+	xclose(s);
+}
+
+/* Check that redirecting across reuseport groups is not allowed. */
+static void test_reuseport_mixed_groups(int family, int sotype, int sock_map,
+					int verd_map, int reuseport_prog)
+{
+	struct sockaddr_storage addr;
+	int s1, s2, c, err;
+	unsigned int drop;
+	socklen_t len;
+	u64 value;
+	u32 key;
+
+	zero_verdict_count(verd_map);
+
+	/* Create two listeners, each in its own reuseport group */
+	s1 = listen_loopback_reuseport(family, sotype, reuseport_prog);
+	if (s1 < 0)
+		return;
+
+	s2 = listen_loopback_reuseport(family, sotype, reuseport_prog);
+	if (s2 < 0)
+		goto close_srv1;
+
+	key = 0;
+	value = s1;
+	err = xbpf_map_update_elem(sock_map, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv2;
+
+	key = 1;
+	value = s2;
+	err = xbpf_map_update_elem(sock_map, &key, &value, BPF_NOEXIST);
+	if (err)
+		goto close_srv2;
+
+	/* Connect to s2, reuseport BPF selects s1 via sock_map[0] */
+	len = sizeof(addr);
+	err = xgetsockname(s2, sockaddr(&addr), &len);
+	if (err)
+		goto close_srv2;
+
+	c = xsocket(family, sotype, 0);
+	if (c < 0)
+		goto close_srv2;
+
+	err = connect(c, sockaddr(&addr), len);
+	if (err && errno != ECONNREFUSED) {
+		FAIL_ERRNO("connect: expected ECONNREFUSED");
+		goto close_cli;
+	}
+
+	/* Expect drop, can't redirect outside of reuseport group */
+	key = SK_DROP;
+	err = xbpf_map_lookup_elem(verd_map, &key, &drop);
+	if (err)
+		goto close_cli;
+	if (drop != 1)
+		FAIL("want drop count 1, have %d", drop);
+
+close_cli:
+	xclose(c);
+close_srv2:
+	xclose(s2);
+close_srv1:
+	xclose(s1);
+}
+
+#define TEST(fn)                                                               \
+	{                                                                      \
+		fn, #fn                                                        \
+	}
+
+static void test_ops_cleanup(const struct bpf_map *map)
+{
+	const struct bpf_map_def *def;
+	int err, mapfd;
+	u32 key;
+
+	def = bpf_map__def(map);
+	mapfd = bpf_map__fd(map);
+
+	for (key = 0; key < def->max_entries; key++) {
+		err = bpf_map_delete_elem(mapfd, &key);
+		if (err && errno != EINVAL && errno != ENOENT)
+			FAIL_ERRNO("map_delete: expected EINVAL/ENOENT");
+	}
+}
+
+static const char *family_str(sa_family_t family)
+{
+	switch (family) {
+	case AF_INET:
+		return "IPv4";
+	case AF_INET6:
+		return "IPv6";
+	default:
+		return "unknown";
+	}
+}
+
+static const char *map_type_str(const struct bpf_map *map)
+{
+	const struct bpf_map_def *def;
+
+	def = bpf_map__def(map);
+	if (IS_ERR(def))
+		return "invalid";
+
+	switch (def->type) {
+	case BPF_MAP_TYPE_SOCKMAP:
+		return "sockmap";
+	case BPF_MAP_TYPE_SOCKHASH:
+		return "sockhash";
+	default:
+		return "unknown";
+	}
+}
+
+static void test_ops(struct test_sockmap_listen *skel, struct bpf_map *map,
+		     int family, int sotype)
+{
+	const struct op_test {
+		void (*fn)(int family, int sotype, int mapfd);
+		const char *name;
+	} tests[] = {
+		/* insert */
+		TEST(test_insert_invalid),
+		TEST(test_insert_opened),
+		TEST(test_insert_bound),
+		TEST(test_insert_listening),
+		/* delete */
+		TEST(test_delete_after_insert),
+		TEST(test_delete_after_close),
+		/* lookup */
+		TEST(test_lookup_after_insert),
+		TEST(test_lookup_after_delete),
+		TEST(test_lookup_32_bit_value),
+		/* update */
+		TEST(test_update_listening),
+		/* races with insert/delete */
+		TEST(test_destroy_orphan_child),
+		TEST(test_syn_recv_insert_delete),
+		TEST(test_race_insert_listen),
+		/* child clone */
+		TEST(test_clone_after_delete),
+		TEST(test_accept_after_delete),
+		TEST(test_accept_before_delete),
+	};
+	const char *family_name, *map_name;
+	const struct op_test *t;
+	char s[MAX_TEST_NAME];
+	int map_fd;
+
+	family_name = family_str(family);
+	map_name = map_type_str(map);
+	map_fd = bpf_map__fd(map);
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		snprintf(s, sizeof(s), "%s %s %s", map_name, family_name,
+			 t->name);
+
+		if (!test__start_subtest(s))
+			continue;
+
+		t->fn(family, sotype, map_fd);
+		test_ops_cleanup(map);
+	}
+}
+
+static void test_redir(struct test_sockmap_listen *skel, struct bpf_map *map,
+		       int family, int sotype)
+{
+	const struct redir_test {
+		void (*fn)(struct test_sockmap_listen *skel,
+			   struct bpf_map *map, int family, int sotype);
+		const char *name;
+	} tests[] = {
+		TEST(test_skb_redir_to_connected),
+		TEST(test_skb_redir_to_listening),
+		TEST(test_msg_redir_to_connected),
+		TEST(test_msg_redir_to_listening),
+	};
+	const char *family_name, *map_name;
+	const struct redir_test *t;
+	char s[MAX_TEST_NAME];
+
+	family_name = family_str(family);
+	map_name = map_type_str(map);
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		snprintf(s, sizeof(s), "%s %s %s", map_name, family_name,
+			 t->name);
+		if (!test__start_subtest(s))
+			continue;
+
+		t->fn(skel, map, family, sotype);
+	}
+}
+
+static void test_reuseport(struct test_sockmap_listen *skel,
+			   struct bpf_map *map, int family, int sotype)
+{
+	const struct reuseport_test {
+		void (*fn)(int family, int sotype, int socket_map,
+			   int verdict_map, int reuseport_prog);
+		const char *name;
+	} tests[] = {
+		TEST(test_reuseport_select_listening),
+		TEST(test_reuseport_select_connected),
+		TEST(test_reuseport_mixed_groups),
+	};
+	int socket_map, verdict_map, reuseport_prog;
+	const char *family_name, *map_name;
+	const struct reuseport_test *t;
+	char s[MAX_TEST_NAME];
+
+	family_name = family_str(family);
+	map_name = map_type_str(map);
+
+	socket_map = bpf_map__fd(map);
+	verdict_map = bpf_map__fd(skel->maps.verdict_map);
+	reuseport_prog = bpf_program__fd(skel->progs.prog_reuseport);
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		snprintf(s, sizeof(s), "%s %s %s", map_name, family_name,
+			 t->name);
+
+		if (!test__start_subtest(s))
+			continue;
+
+		t->fn(family, sotype, socket_map, verdict_map, reuseport_prog);
+	}
+}
+
+static void run_tests(struct test_sockmap_listen *skel, struct bpf_map *map,
+		      int family)
+{
+	test_ops(skel, map, family, SOCK_STREAM);
+	test_redir(skel, map, family, SOCK_STREAM);
+	test_reuseport(skel, map, family, SOCK_STREAM);
+}
+
+void test_sockmap_listen(void)
+{
+	struct test_sockmap_listen *skel;
+
+	skel = test_sockmap_listen__open_and_load();
+	if (!skel) {
+		FAIL("skeleton open/load failed");
+		return;
+	}
+
+	skel->bss->test_sockmap = true;
+	run_tests(skel, skel->maps.sock_map, AF_INET);
+	run_tests(skel, skel->maps.sock_map, AF_INET6);
+
+	skel->bss->test_sockmap = false;
+	run_tests(skel, skel->maps.sock_hash, AF_INET);
+	run_tests(skel, skel->maps.sock_hash, AF_INET6);
+
+	test_sockmap_listen__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
new file mode 100644
index 000000000000..a3a366c57ce1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Cloudflare
+
+#include <errno.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 2);
+	__type(key, __u32);
+	__type(value, __u64);
+} sock_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKHASH);
+	__uint(max_entries, 2);
+	__type(key, __u32);
+	__type(value, __u64);
+} sock_hash SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 2);
+	__type(key, int);
+	__type(value, unsigned int);
+} verdict_map SEC(".maps");
+
+static volatile bool test_sockmap; /* toggled by user-space */
+
+SEC("sk_skb/stream_parser")
+int prog_skb_parser(struct __sk_buff *skb)
+{
+	return skb->len;
+}
+
+SEC("sk_skb/stream_verdict")
+int prog_skb_verdict(struct __sk_buff *skb)
+{
+	unsigned int *count;
+	__u32 zero = 0;
+	int verdict;
+
+	if (test_sockmap)
+		verdict = bpf_sk_redirect_map(skb, &sock_map, zero, 0);
+	else
+		verdict = bpf_sk_redirect_hash(skb, &sock_hash, &zero, 0);
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+SEC("sk_msg")
+int prog_msg_verdict(struct sk_msg_md *msg)
+{
+	unsigned int *count;
+	__u32 zero = 0;
+	int verdict;
+
+	if (test_sockmap)
+		verdict = bpf_msg_redirect_map(msg, &sock_map, zero, 0);
+	else
+		verdict = bpf_msg_redirect_hash(msg, &sock_hash, &zero, 0);
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+SEC("sk_reuseport")
+int prog_reuseport(struct sk_reuseport_md *reuse)
+{
+	unsigned int *count;
+	int err, verdict;
+	__u32 zero = 0;
+
+	if (test_sockmap)
+		err = bpf_sk_select_reuseport(reuse, &sock_map, &zero, 0);
+	else
+		err = bpf_sk_select_reuseport(reuse, &sock_hash, &zero, 0);
+	verdict = err ? SK_DROP : SK_PASS;
+
+	count = bpf_map_lookup_elem(&verdict_map, &verdict);
+	if (count)
+		(*count)++;
+
+	return verdict;
+}
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* RE: [PATCH bpf-next v7 03/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy
  2020-02-18 17:10 ` [PATCH bpf-next v7 03/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
@ 2020-02-21  3:28   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2020-02-21  3:28 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau,
	eric.dumazet

Jakub Sitnicki wrote:
> Prepare for cloning listening sockets that have their protocol callbacks
> overridden by sk_msg. Child sockets must not inherit parent callbacks that
> access state stored in sk_user_data owned by the parent.
> 
> Restore the child socket's protocol callbacks before it gets hashed and any
> of the callbacks can get invoked.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Looks reasonable to me. CC'ing Eric as well.
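
For anyone reading along, the clone-time restore boils down to a check
along these lines (my sketch of the idea; the tcp_bpf_prots indexing
below is an assumption, the patch may structure it differently):

void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
{
	struct proto *prot = newsk->sk_prot;

	/* The child copied the parent's sk_prot in sock_copy(). If
	 * those ops were overridden by sk_msg, put back the proto
	 * saved when the override happened. (IPv4 table only for
	 * brevity; an IPv6 table would need the same check.)
	 */
	if (prot == &tcp_bpf_prots[TCP_BPF_IPV4][TCP_BPF_BASE])
		newsk->sk_prot = sk->sk_prot_creator;
}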

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH bpf-next v7 04/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap
  2020-02-18 17:10 ` [PATCH bpf-next v7 04/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
@ 2020-02-21  3:33   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2020-02-21  3:33 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Jakub Sitnicki wrote:
> In order for sockmap/sockhash types to become generic collections for
> storing TCP sockets we need to loosen the checks during map update, while
> tightening the checks in redirect helpers.
> 
> Currently sock{map,hash} require the TCP socket to be in established state,
> which prevents inserting listening sockets.
> 
> Change the update pre-checks so the socket can also be in listening state.
> 
> Since it doesn't make sense to redirect with sock{map,hash} to listening
> sockets, add appropriate socket state checks to BPF redirect helpers too.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
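
The relaxed pre-checks boil down to two predicates, roughly (my sketch;
the helper names here are assumptions):

/* Map update: established or listening TCP sockets are accepted */
static bool sock_map_sk_state_allowed(const struct sock *sk)
{
	return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
}

/* BPF redirect helpers: only established sockets are valid targets */
static bool sock_map_redirect_allowed(const struct sock *sk)
{
	return sk->sk_state == TCP_ESTABLISHED;
}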

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH bpf-next v7 05/11] bpf, sockmap: Don't set up upcalls and progs for listening sockets
  2020-02-18 17:10 ` [PATCH bpf-next v7 05/11] bpf, sockmap: Don't set up upcalls and progs for listening sockets Jakub Sitnicki
@ 2020-02-21  3:42   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2020-02-21  3:42 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Jakub Sitnicki wrote:
> Now that sockmap/sockhash can hold listening sockets, when setting up the
> psock we will (1) grab references to verdict/parser progs, and (2) override
> socket upcalls sk_data_ready and sk_write_space.
> 
> However, since we cannot redirect to listening sockets, we don't need to
> link the socket to the BPF progs. More importantly, we don't want the
> listening socket to have overridden upcalls because they would get
> inherited by child sockets cloned from it.
> 
> Introduce a separate initialization path for listening sockets that does
> not change the upcalls and ignores the BPF progs.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/core/sock_map.c | 52 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 45 insertions(+), 7 deletions(-)
> 

Interesting, so after this with listen and established socks in
the same map some will inherit the programs attached to the map and
some will not... I think this is OK; when socks are added we know their
state, so we can reason about it. Anyway, the same can happen by
attaching programs after socks are added.

It would probably be more confusing to reject listen socks when progs
are attached so seems like the right design choice to me.
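
The split, as I read it, amounts to something like this (sketch;
sock_map_link_sk is a made-up name for illustration):

static int sock_map_link_sk(struct bpf_map *map, struct sock *sk)
{
	/* Listening sockets keep their original sk_data_ready and
	 * sk_write_space upcalls and take no prog references, so
	 * children cloned from them start from a clean slate.
	 */
	if (sk->sk_state == TCP_LISTEN)
		return sock_map_link_no_progs(map, sk);

	return sock_map_link(map, &map->progs, sk);
}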

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH bpf-next v7 06/11] bpf, sockmap: Return socket cookie on lookup from syscall
  2020-02-18 17:10 ` [PATCH bpf-next v7 06/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
@ 2020-02-21  3:45   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2020-02-21  3:45 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Jakub Sitnicki wrote:
> Tooling that populates the SOCK{MAP,HASH} with sockets from user-space
> needs a way to inspect its contents. Returning the struct sock * that the
> map holds to user-space is neither safe nor useful. An approach established
> by REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
> instead.
> 
> Since socket cookies are u64 values, SOCK{MAP,HASH} need to support such a
> value size for lookup to be possible. This requires special handling on
> update, though. Attempts to do a lookup on a map holding u32 values will be
> met with ENOSPC error.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
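
In user-space, inspecting an entry then looks roughly like this
(sketch using plain libbpf calls):

__u32 key = 0;
__u64 cookie;	/* lookup needs an 8-byte value buffer */

/* The kernel returns the socket cookie, never the sock pointer */
if (!bpf_map_lookup_elem(mapfd, &key, &cookie))
	printf("sock_map[0] -> socket cookie %llu\n",
	       (unsigned long long)cookie);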

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH bpf-next v7 07/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH
  2020-02-18 17:10 ` [PATCH bpf-next v7 07/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH Jakub Sitnicki
@ 2020-02-21  3:46   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2020-02-21  3:46 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Jakub Sitnicki wrote:
> Don't require the kernel code, like BPF helpers, that needs access to
> SOCK{MAP,HASH} map contents to live in net/core/sock_map.c. Expose the
> lookup operation to all kernel-land.
> 
> Lookup from BPF context is not whitelisted yet, while syscalls have a
> dedicated lookup handler.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/core/sock_map.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index f48c934d5da0..2e0f465295c3 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -301,7 +301,7 @@ static struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
>  
>  static void *sock_map_lookup(struct bpf_map *map, void *key)
>  {
> -	return ERR_PTR(-EOPNOTSUPP);
> +	return __sock_map_lookup_elem(map, *(u32 *)key);
>  }
>  
>  static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
> @@ -991,6 +991,11 @@ static void *sock_hash_lookup_sys(struct bpf_map *map, void *key)
>  	return &sk->sk_cookie;
>  }
>  
> +static void *sock_hash_lookup(struct bpf_map *map, void *key)
> +{
> +	return __sock_hash_lookup_elem(map, key);
> +}
> +
>  static void sock_hash_release_progs(struct bpf_map *map)
>  {
>  	psock_progs_drop(&container_of(map, struct bpf_htab, map)->progs);
> @@ -1079,7 +1084,7 @@ const struct bpf_map_ops sock_hash_ops = {
>  	.map_get_next_key	= sock_hash_get_next_key,
>  	.map_update_elem	= sock_hash_update_elem,
>  	.map_delete_elem	= sock_hash_delete_elem,
> -	.map_lookup_elem	= sock_map_lookup,
> +	.map_lookup_elem	= sock_hash_lookup,
>  	.map_lookup_elem_sys_only = sock_hash_lookup_sys,
>  	.map_release_uref	= sock_hash_release_progs,
>  	.map_check_btf		= map_check_no_btf,
> -- 
> 2.24.1
> 

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH bpf-next v7 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH
  2020-02-18 17:10 ` [PATCH bpf-next v7 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH Jakub Sitnicki
@ 2020-02-21  3:52   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2020-02-21  3:52 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Jakub Sitnicki wrote:
> Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
> is not hard-coded in the test setup routine.
> 
> This, together with careful state cleaning after the tests, lets us run the
> test cases for REUSEPORT_ARRAY, SOCKMAP, and SOCKHASH, giving us test
> coverage for all supported map types. The last two support only TCP
> sockets at the moment.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH bpf-next v7 11/11] selftests/bpf: Tests for sockmap/sockhash holding listening sockets
  2020-02-18 17:10 ` [PATCH bpf-next v7 11/11] selftests/bpf: Tests for sockmap/sockhash holding listening sockets Jakub Sitnicki
@ 2020-02-21  3:56   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2020-02-21  3:56 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

Jakub Sitnicki wrote:
> Now that SOCKMAP and SOCKHASH map types can store listening sockets,
> the user-space and BPF APIs are open to a new set of potential pitfalls.
> 
> Exercise the map operations, with extra attention to code paths susceptible
> to races between map ops and socket cloning, and BPF helpers that work with
> SOCKMAP/SOCKHASH to gain confidence that all works as expected.
> 
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  .../selftests/bpf/prog_tests/sockmap_listen.c | 1496 +++++++++++++++++
>  .../selftests/bpf/progs/test_sockmap_listen.c |   98 ++
>  2 files changed, 1594 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen.c

Reminds me I need to clean up the sock{map|hash} tests as well.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets
  2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
                   ` (10 preceding siblings ...)
  2020-02-18 17:10 ` [PATCH bpf-next v7 11/11] selftests/bpf: Tests for sockmap/sockhash holding listening sockets Jakub Sitnicki
@ 2020-02-21 21:41 ` Daniel Borkmann
  2020-02-22  0:47   ` Alexei Starovoitov
  11 siblings, 1 reply; 24+ messages in thread
From: Daniel Borkmann @ 2020-02-21 21:41 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, John Fastabend, Lorenz Bauer, Martin Lau

On 2/18/20 6:10 PM, Jakub Sitnicki wrote:
> This patch set turns SOCK{MAP,HASH} into generic collections for TCP
> sockets, both listening and established. Adding support for listening
> sockets enables us to use these BPF map types with reuseport BPF programs.
> 
> Why? SOCKMAP and SOCKHASH, in comparison to REUSEPORT_SOCKARRAY, allow the
> socket to be in more than one map at the same time.
> 
> Having a BPF map type that can hold listening sockets, and gracefully
> co-exist with reuseport BPF is important if, in the future, we want
> BPF programs that run at socket lookup time [0]. Cover letter for v1 of
> this series tells the full story of how we got here [1].
> 
> Although SOCK{MAP,HASH} are not a drop-in replacement for SOCKARRAY just
> yet, because UDP support is lacking, it's a step in this direction. We're
> working with Lorenz on extending SOCK{MAP,HASH} to hold UDP sockets, and
> expect to post RFC series for sockmap + UDP in the near future.
> 
> I've dropped Acks from all patches that have been touched since v6.
> 
> The audit for missing READ_ONCE annotations for access to sk_prot is
> ongoing. Thus far I've found one location specific to TCP listening sockets
> that needed annotating. This got fixed in this iteration. I wonder if
> sparse checker could be put to work to identify places where we have
> sk_prot access while not holding sk_lock...
> 
> The patch series depends on another one, posted earlier [2], that has been
> split out of it.
> 
> Thanks,
> jkbs
> 
> [0] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
> [1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/
> [2] https://lore.kernel.org/bpf/20200217121530.754315-1-jakub@cloudflare.com/
> 
> v6 -> v7:
> 
> - Extended the series to cover SOCKHASH. (patches 4-8, 10-11) (John)
> 
> - Rebased onto recent bpf-next. Resolved conflicts in recent fixes to
>    sk_state checks on sockmap/sockhash update path. (patch 4)
> 
> - Added missing READ_ONCE annotation in sock_copy. (patch 1)
> 
> - Split out patches that simplify sk_psock_restore_proto [2].

Applied, thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets
  2020-02-21 21:41 ` [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store " Daniel Borkmann
@ 2020-02-22  0:47   ` Alexei Starovoitov
  2020-02-22 13:49     ` Jakub Sitnicki
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2020-02-22  0:47 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jakub Sitnicki, bpf, Network Development, kernel-team,
	John Fastabend, Lorenz Bauer, Martin Lau

On Fri, Feb 21, 2020 at 1:41 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 2/18/20 6:10 PM, Jakub Sitnicki wrote:
> > [...]
>
> Applied, thanks!

Jakub,

what is going on here?
# test_progs -n 40
#40 select_reuseport:OK
Summary: 1/126 PASSED, 30 SKIPPED, 0 FAILED

Does it mean nothing was actually tested?
I really don't like to see 30 skipped tests.
Is it my environment?
If so, please make them hard failures.
I will fix whatever I need to fix in my setup.

* Re: [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets
  2020-02-22  0:47   ` Alexei Starovoitov
@ 2020-02-22 13:49     ` Jakub Sitnicki
  2020-02-23 21:43       ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-22 13:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, bpf, Network Development, kernel-team,
	John Fastabend, Lorenz Bauer, Martin Lau

Hi Alexei,

On Sat, Feb 22, 2020 at 12:47 AM GMT, Alexei Starovoitov wrote:
> On Fri, Feb 21, 2020 at 1:41 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 2/18/20 6:10 PM, Jakub Sitnicki wrote:
>> > [...]
>>
>> Applied, thanks!
>
> Jakub,
>
> what is going on here?
> # test_progs -n 40
> #40 select_reuseport:OK
> Summary: 1/126 PASSED, 30 SKIPPED, 0 FAILED
>
> Does it mean nothing was actually tested?
> I really don't like to see 30 skipped tests.
> Is it my environment?
> If so, please make them hard failures.
> I will fix whatever I need to fix in my setup.

The UDP tests for sock{map,hash} are marked as skipped because UDP
support is not implemented yet. Sorry for the confusion.
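
(For reference, roughly how the skip manifests today -- a simplified
sketch, not the actual sockmap_listen.c code; the
udp_sockmap_supported() probe is made up:)

#include <stdbool.h>
#include "test_progs.h"

/* Stand-in feature probe -- pretend UDP support is absent for now. */
static bool udp_sockmap_supported(void)
{
	return false;
}

/* Hypothetical UDP subtest: bail out with test__skip() while kernel
 * support is missing. Each test__skip() call is counted in the
 * SKIPPED column of the test_progs summary. */
static void run_udp_subtest(void)
{
	if (!udp_sockmap_supported()) {
		test__skip();
		return;
	}
	/* ... exercise sockmap/sockhash with UDP sockets ... */
}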

Having read the recent thread about BPF selftests [0], I now realize
that this is not the best idea. It sends the wrong signal to the
developer.

I propose to exclude the UDP tests with sock{map,hash} by not
registering them with test__start_subtest at all. Failing them would
suggest a regression, and skipping them would point to a problem with
the test environment; neither is the case.
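
Roughly along these lines (again just a sketch; the subtest names and
the run_tcp_tests() helper are made up):

/* Hypothetical top-level test in prog_tests/sockmap_listen.c,
 * assuming the usual test__start_subtest() pattern and that
 * test_progs.h is included as usual. */
void test_sockmap_listen(void)
{
	/* TCP is supported, so these subtests get registered and run. */
	if (test__start_subtest("sockmap/tcp"))
		run_tcp_tests(BPF_MAP_TYPE_SOCKMAP);
	if (test__start_subtest("sockhash/tcp"))
		run_tcp_tests(BPF_MAP_TYPE_SOCKHASH);

	/* UDP subtests are not registered at all: they neither fail
	 * (no regression to report) nor show up as skipped (nothing
	 * wrong with the environment) until kernel support lands. */
}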

I'll follow up with a patch for this, if that sounds good to you.

[0] https://lore.kernel.org/bpf/20200220191845.u62nhohgzngbrpib@ast-mbp/T/#t

* Re: [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets
  2020-02-22 13:49     ` Jakub Sitnicki
@ 2020-02-23 21:43       ` Alexei Starovoitov
  2020-02-24 13:59         ` Jakub Sitnicki
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2020-02-23 21:43 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Daniel Borkmann, bpf, Network Development, kernel-team,
	John Fastabend, Lorenz Bauer, Martin Lau

On Sat, Feb 22, 2020 at 01:49:52PM +0000, Jakub Sitnicki wrote:
> Hi Alexei,
> 
> On Sat, Feb 22, 2020 at 12:47 AM GMT, Alexei Starovoitov wrote:
> > On Fri, Feb 21, 2020 at 1:41 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>
> >> On 2/18/20 6:10 PM, Jakub Sitnicki wrote:
> >> > [...]
> >>
> >> Applied, thanks!
> >
> > Jakub,
> >
> > what is going on here?
> > # test_progs -n 40
> > #40 select_reuseport:OK
> > Summary: 1/126 PASSED, 30 SKIPPED, 0 FAILED
> >
> > Does it mean nothing was actually tested?
> > I really don't like to see 30 skipped tests.
> > Is it my environment?
> > If so, please make them hard failures.
> > I will fix whatever I need to fix in my setup.
> 
> The UDP tests for sock{map,hash} are marked as skipped because UDP
> support is not implemented yet. Sorry for the confusion.
>
> Having read the recent thread about BPF selftests [0], I now realize
> that this is not the best idea. It sends the wrong signal to the
> developer.
>
> I propose to exclude the UDP tests with sock{map,hash} by not
> registering them with test__start_subtest at all. Failing them would
> suggest a regression, and skipping them would point to a problem with
> the test environment; neither is the case.

So the tests are ready, but kernel support is missing?
Please don't run those tests then since they're guaranteed to fail atm.

* Re: [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets
  2020-02-23 21:43       ` Alexei Starovoitov
@ 2020-02-24 13:59         ` Jakub Sitnicki
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Sitnicki @ 2020-02-24 13:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, bpf, Network Development, kernel-team,
	John Fastabend, Lorenz Bauer, Martin Lau

On Sun, Feb 23, 2020 at 10:43 PM CET, Alexei Starovoitov wrote:
> On Sat, Feb 22, 2020 at 01:49:52PM +0000, Jakub Sitnicki wrote:
>> Hi Alexei,
>> 
>> On Sat, Feb 22, 2020 at 12:47 AM GMT, Alexei Starovoitov wrote:
>> > On Fri, Feb 21, 2020 at 1:41 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >>
>> >> On 2/18/20 6:10 PM, Jakub Sitnicki wrote:
>> >> > [...]
>> >>
>> >> Applied, thanks!
>> >
>> > Jakub,
>> >
>> > what is going on here?
>> > # test_progs -n 40
>> > #40 select_reuseport:OK
>> > Summary: 1/126 PASSED, 30 SKIPPED, 0 FAILED
>> >
>> > Does it mean nothing was actually tested?
>> > I really don't like to see 30 skipped tests.
>> > Is it my environment?
>> > If so, please make them hard failures.
>> > I will fix whatever I need to fix in my setup.
>> 
>> The UDP tests for sock{map,hash} are marked as skipped because UDP
>> support is not implemented yet. Sorry for the confusion.
>>
>> Having read the recent thread about BPF selftests [0], I now realize
>> that this is not the best idea. It sends the wrong signal to the
>> developer.
>>
>> I propose to exclude the UDP tests with sock{map,hash} by not
>> registering them with test__start_subtest at all. Failing them would
>> suggest a regression, and skipping them would point to a problem with
>> the test environment; neither is the case.
>
> So the tests are ready, but kernel support is missing?

Yes, correct.

> Please don't run those tests then since they're guaranteed to fail atm.

Just posted [0] to rectify this situation.

[0] https://lore.kernel.org/bpf/20200224135327.121542-1-jakub@cloudflare.com/T/#t

Thread overview: 24+ messages
2020-02-18 17:10 [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store listening sockets Jakub Sitnicki
2020-02-18 17:10 ` [PATCH bpf-next v7 01/11] net, sk_msg: Annotate lockless access to sk_prot on clone Jakub Sitnicki
2020-02-18 17:10 ` [PATCH bpf-next v7 02/11] net, sk_msg: Clear sk_user_data pointer on clone if tagged Jakub Sitnicki
2020-02-18 17:10 ` [PATCH bpf-next v7 03/11] tcp_bpf: Don't let child socket inherit parent protocol ops on copy Jakub Sitnicki
2020-02-21  3:28   ` John Fastabend
2020-02-18 17:10 ` [PATCH bpf-next v7 04/11] bpf, sockmap: Allow inserting listening TCP sockets into sockmap Jakub Sitnicki
2020-02-21  3:33   ` John Fastabend
2020-02-18 17:10 ` [PATCH bpf-next v7 05/11] bpf, sockmap: Don't set up upcalls and progs for listening sockets Jakub Sitnicki
2020-02-21  3:42   ` John Fastabend
2020-02-18 17:10 ` [PATCH bpf-next v7 06/11] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2020-02-21  3:45   ` John Fastabend
2020-02-18 17:10 ` [PATCH bpf-next v7 07/11] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH Jakub Sitnicki
2020-02-21  3:46   ` John Fastabend
2020-02-18 17:10 ` [PATCH bpf-next v7 08/11] bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH Jakub Sitnicki
2020-02-18 17:10 ` [PATCH bpf-next v7 09/11] net: Generate reuseport group ID on group creation Jakub Sitnicki
2020-02-18 17:10 ` [PATCH bpf-next v7 10/11] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH Jakub Sitnicki
2020-02-21  3:52   ` John Fastabend
2020-02-18 17:10 ` [PATCH bpf-next v7 11/11] selftests/bpf: Tests for sockmap/sockhash holding listening sockets Jakub Sitnicki
2020-02-21  3:56   ` John Fastabend
2020-02-21 21:41 ` [PATCH bpf-next v7 00/11] Extend SOCKMAP/SOCKHASH to store " Daniel Borkmann
2020-02-22  0:47   ` Alexei Starovoitov
2020-02-22 13:49     ` Jakub Sitnicki
2020-02-23 21:43       ` Alexei Starovoitov
2020-02-24 13:59         ` Jakub Sitnicki
