linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
@ 2023-05-25  8:19 Lorenz Bauer
  2023-05-25  8:19 ` [PATCH bpf-next 2/2] selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper Lorenz Bauer
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Lorenz Bauer @ 2023-05-25  8:19 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David Ahern,
	Willem de Bruijn, Joe Stringer
  Cc: Lorenz Bauer, Joe Stringer, Martin KaFai Lau, netdev, linux-kernel, bpf

Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy, which
configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
TPROXY from our setup.

The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
helpers don't execute SK_REUSEPORT programs. Instead, one of the
reuseport sockets is selected by hash. This could cause dispatch to the
"wrong" socket:

    sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
    bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed

Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.

Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.

Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe6 ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Cc: Joe Stringer <joe@cilium.io>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
---
 include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
 include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
 include/net/sock.h             |  7 +++++--
 include/uapi/linux/bpf.h       |  3 ---
 net/core/filter.c              |  2 --
 net/ipv4/inet_hashtables.c     | 15 +++++++-------
 net/ipv4/udp.c                 | 23 +++++++++++++++++++---
 net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
 net/ipv6/udp.c                 | 23 +++++++++++++++++++---
 tools/include/uapi/linux/bpf.h |  3 ---
 10 files changed, 119 insertions(+), 39 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 56f1286583d3..3ba4dc2703da 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
 					const u16 hnum, const int dif,
 					const int sdif);
 
+struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
+				    struct sk_buff *skb, int doff,
+				    const struct in6_addr *saddr,
+				    __be16 sport,
+				    const struct in6_addr *daddr,
+				    unsigned short hnum);
+
 struct sock *inet6_lookup_listener(struct net *net,
 				   struct inet_hashinfo *hashinfo,
 				   struct sk_buff *skb, int doff,
@@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 					      int iif, int sdif,
 					      bool *refcounted)
 {
-	struct sock *sk = skb_steal_sock(skb, refcounted);
-
+	bool prefetched;
+	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
+	struct net *net = dev_net(skb_dst(skb)->dev);
+	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+
+	if (prefetched) {
+		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
+							       &ip6h->saddr, sport,
+							       &ip6h->daddr, ntohs(dport));
+		if (reuse_sk) {
+			if (reuse_sk != sk) {
+				if (*refcounted) {
+					sock_put(sk);
+					*refcounted = false;
+				}
+				if (IS_ERR(reuse_sk))
+					return NULL;
+			}
+			return reuse_sk;
+		}
+	}
 	if (sk)
 		return sk;
 
-	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
-			      doff, &ipv6_hdr(skb)->saddr, sport,
-			      &ipv6_hdr(skb)->daddr, ntohs(dport),
+	return __inet6_lookup(net, hashinfo, skb,
+			      doff, &ip6h->saddr, sport,
+			      &ip6h->daddr, ntohs(dport),
 			      iif, sdif, refcounted);
 }
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 99bd823e97f6..c2af195ca71f 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -379,6 +379,11 @@ struct sock *__inet_lookup_established(struct net *net,
 				       const __be32 daddr, const u16 hnum,
 				       const int dif, const int sdif);
 
+struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
+				   struct sk_buff *skb, int doff,
+				   __be32 saddr, __be16 sport,
+				   __be32 daddr, unsigned short hnum);
+
 static inline struct sock *
 	inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
 				const __be32 saddr, const __be16 sport,
@@ -436,13 +441,31 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
 					     const int sdif,
 					     bool *refcounted)
 {
-	struct sock *sk = skb_steal_sock(skb, refcounted);
+	bool prefetched;
+	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
+	struct net *net = dev_net(skb_dst(skb)->dev);
 	const struct iphdr *iph = ip_hdr(skb);
 
+	if (prefetched) {
+		struct sock *reuse_sk = inet_lookup_reuseport(net, sk, skb, doff,
+							      iph->saddr, sport,
+							      iph->daddr, ntohs(dport));
+		if (reuse_sk) {
+			if (reuse_sk != sk) {
+				if (*refcounted) {
+					sock_put(sk);
+					*refcounted = false;
+				}
+				if (IS_ERR(reuse_sk))
+					return NULL;
+			}
+			return reuse_sk;
+		}
+	}
 	if (sk)
 		return sk;
 
-	return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
+	return __inet_lookup(net, hashinfo, skb,
 			     doff, iph->saddr, sport,
 			     iph->daddr, dport, inet_iif(skb), sdif,
 			     refcounted);
diff --git a/include/net/sock.h b/include/net/sock.h
index 656ea89f60ff..5645570c2a64 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2806,20 +2806,23 @@ sk_is_refcounted(struct sock *sk)
  * skb_steal_sock - steal a socket from an sk_buff
  * @skb: sk_buff to steal the socket from
  * @refcounted: is set to true if the socket is reference-counted
+ * @prefetched: is set to true if the socket was assigned from bpf
  */
 static inline struct sock *
-skb_steal_sock(struct sk_buff *skb, bool *refcounted)
+skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
 {
 	if (skb->sk) {
 		struct sock *sk = skb->sk;
 
 		*refcounted = true;
-		if (skb_sk_is_prefetched(skb))
+		*prefetched = skb_sk_is_prefetched(skb);
+		if (*prefetched)
 			*refcounted = sk_is_refcounted(sk);
 		skb->destructor = NULL;
 		skb->sk = NULL;
 		return sk;
 	}
+	*prefetched = false;
 	*refcounted = false;
 	return NULL;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1bb11a6ee667..2af606a525db 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4144,9 +4144,6 @@ union bpf_attr {
  *		**-EOPNOTSUPP** if the operation is not supported, for example
  *		a call from outside of TC ingress.
  *
- *		**-ESOCKTNOSUPPORT** if the socket type is not supported
- *		(reuseport).
- *
  * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
  *	Description
  *		Helper is overloaded depending on BPF program type. This
diff --git a/net/core/filter.c b/net/core/filter.c
index 968139f4a1ac..5f451260849b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7265,8 +7265,6 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
 		return -EOPNOTSUPP;
 	if (unlikely(dev_net(skb->dev) != sock_net(sk)))
 		return -ENETUNREACH;
-	if (unlikely(sk_fullsock(sk) && sk->sk_reuseport))
-		return -ESOCKTNOSUPPORT;
 	if (sk_is_refcounted(sk) &&
 	    unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
 		return -ENOENT;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index e7391bf310a7..920131e4a65d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
-static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
-					    struct sk_buff *skb, int doff,
-					    __be32 saddr, __be16 sport,
-					    __be32 daddr, unsigned short hnum)
+struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
+				   struct sk_buff *skb, int doff,
+				   __be32 saddr, __be16 sport,
+				   __be32 daddr, unsigned short hnum)
 {
 	struct sock *reuse_sk = NULL;
 	u32 phash;
@@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
 	}
 	return reuse_sk;
 }
+EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
 
 /*
  * Here are some nice properties to exploit here. The BSD API
@@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
 	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
 		score = compute_score(sk, net, hnum, daddr, dif, sdif);
 		if (score > hiscore) {
-			result = lookup_reuseport(net, sk, skb, doff,
-						  saddr, sport, daddr, hnum);
+			result = inet_lookup_reuseport(net, sk, skb, doff,
+						       saddr, sport, daddr, hnum);
 			if (result)
 				return result;
 
@@ -399,7 +400,7 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
 	if (no_reuseport || IS_ERR_OR_NULL(sk))
 		return sk;
 
-	reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
+	reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
 	if (reuse_sk)
 		sk = reuse_sk;
 	return sk;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6893fb867529..c67253386a38 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2426,7 +2426,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct rtable *rt = skb_rtable(skb);
 	__be32 saddr, daddr;
 	struct net *net = dev_net(skb->dev);
-	bool refcounted;
+	bool refcounted, prefetched;
 	int drop_reason;
 
 	drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -2455,11 +2455,28 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (udp4_csum_init(skb, uh, proto))
 		goto csum_error;
 
-	sk = skb_steal_sock(skb, &refcounted);
+	sk = skb_steal_sock(skb, &refcounted, &prefetched);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
+		if (prefetched) {
+			struct sock *reuse_sk = lookup_reuseport(net, sk, skb,
+								 saddr, uh->source,
+								 daddr, ntohs(uh->dest));
+			if (reuse_sk) {
+				if (reuse_sk != sk) {
+					if (refcounted) {
+						sock_put(sk);
+						refcounted = false;
+					}
+					if (IS_ERR(reuse_sk))
+						goto no_sk;
+				}
+				sk = reuse_sk;
+			}
+		}
+
 		if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
 			udp_sk_rx_dst_set(sk, dst);
 
@@ -2476,7 +2493,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
 	if (sk)
 		return udp_unicast_rcv_skb(sk, skb, uh);
-
+no_sk:
 	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
 		goto drop;
 	nf_reset_ct(skb);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index b64b49012655..b7c56867314e 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -111,12 +111,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
-static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
-					    struct sk_buff *skb, int doff,
-					    const struct in6_addr *saddr,
-					    __be16 sport,
-					    const struct in6_addr *daddr,
-					    unsigned short hnum)
+struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
+				    struct sk_buff *skb, int doff,
+				    const struct in6_addr *saddr,
+				    __be16 sport,
+				    const struct in6_addr *daddr,
+				    unsigned short hnum)
 {
 	struct sock *reuse_sk = NULL;
 	u32 phash;
@@ -127,6 +127,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
 	}
 	return reuse_sk;
 }
+EXPORT_SYMBOL_GPL(inet6_lookup_reuseport);
 
 /* called with rcu_read_lock() */
 static struct sock *inet6_lhash2_lookup(struct net *net,
@@ -143,8 +144,8 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
 	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
 		score = compute_score(sk, net, hnum, daddr, dif, sdif);
 		if (score > hiscore) {
-			result = lookup_reuseport(net, sk, skb, doff,
-						  saddr, sport, daddr, hnum);
+			result = inet6_lookup_reuseport(net, sk, skb, doff,
+							saddr, sport, daddr, hnum);
 			if (result)
 				return result;
 
@@ -175,7 +176,7 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net,
 	if (no_reuseport || IS_ERR_OR_NULL(sk))
 		return sk;
 
-	reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
+	reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
 	if (reuse_sk)
 		sk = reuse_sk;
 	return sk;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e5a337e6b970..3fede8ec95c4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -949,7 +949,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct net *net = dev_net(skb->dev);
 	struct udphdr *uh;
 	struct sock *sk;
-	bool refcounted;
+	bool refcounted, prefetched;
 	u32 ulen = 0;
 
 	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -986,11 +986,28 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		goto csum_error;
 
 	/* Check if the socket is already available, e.g. due to early demux */
-	sk = skb_steal_sock(skb, &refcounted);
+	sk = skb_steal_sock(skb, &refcounted, &prefetched);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
+		if (prefetched) {
+			struct sock *reuse_sk = lookup_reuseport(net, sk, skb,
+								 saddr, uh->source,
+								 daddr, ntohs(uh->dest));
+			if (reuse_sk) {
+				if (reuse_sk != sk) {
+					if (refcounted) {
+						sock_put(sk);
+						refcounted = false;
+					}
+					if (IS_ERR(reuse_sk))
+						goto no_sk;
+				}
+				sk = reuse_sk;
+			}
+		}
+
 		if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
 			udp6_sk_rx_dst_set(sk, dst);
 
@@ -1020,7 +1037,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			goto report_csum_error;
 		return udp6_unicast_rcv_skb(sk, skb, uh);
 	}
-
+no_sk:
 	reason = SKB_DROP_REASON_NO_SOCKET;
 
 	if (!uh->check)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1bb11a6ee667..2af606a525db 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4144,9 +4144,6 @@ union bpf_attr {
  *		**-EOPNOTSUPP** if the operation is not supported, for example
  *		a call from outside of TC ingress.
  *
- *		**-ESOCKTNOSUPPORT** if the socket type is not supported
- *		(reuseport).
- *
  * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
  *	Description
  *		Helper is overloaded depending on BPF program type. This
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH bpf-next 2/2] selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
@ 2023-05-25  8:19 ` Lorenz Bauer
  2023-05-26  0:06   ` Martin KaFai Lau
  2023-05-25 13:24 ` [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Eric Dumazet
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Lorenz Bauer @ 2023-05-25  8:19 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan
  Cc: Lorenz Bauer, Joe Stringer, linux-kernel, bpf, linux-kselftest

From: Daniel Borkmann <daniel@iogearbox.net>

We use two programs to check that the new reuseport logic is executed
appropriately.

The first is a TC clsact program which bpf_sk_assigns
the skb to a UDP or TCP socket created by user space. Since the test
communicates via lo we see both directions of packets in the eBPF.
Traffic ingressing to the reuseport socket is identified by looking
at the destination port. For TCP, we additionally need to make sure
that we only assign the initial SYN packets towards our listening
socket. The network stack then creates a request socket which
transitions to ESTABLISHED after the 3WHS.

The second is a reuseport program which shares the fact that
it has been executed with user space. This tells us that the delayed
lookup mechanism is working.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Cc: Joe Stringer <joe@cilium.io>
---
 tools/testing/selftests/bpf/network_helpers.c |   3 +
 .../selftests/bpf/prog_tests/assign_reuse.c   | 280 ++++++++++++++++++
 .../selftests/bpf/progs/test_assign_reuse.c   | 142 +++++++++
 3 files changed, 425 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/assign_reuse.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_assign_reuse.c

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index a105c0cd008a..8a33bcea97de 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -423,6 +423,9 @@ struct nstoken *open_netns(const char *name)
 
 void close_netns(struct nstoken *token)
 {
+	if (!token)
+		return;
+
 	ASSERT_OK(setns(token->orig_netns_fd, CLONE_NEWNET), "setns");
 	close(token->orig_netns_fd);
 	free(token);
diff --git a/tools/testing/selftests/bpf/prog_tests/assign_reuse.c b/tools/testing/selftests/bpf/prog_tests/assign_reuse.c
new file mode 100644
index 000000000000..2cb9bb591e71
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/assign_reuse.c
@@ -0,0 +1,280 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <uapi/linux/if_link.h>
+#include <test_progs.h>
+
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+
+#include "network_helpers.h"
+#include "test_assign_reuse.skel.h"
+
+#define NS_TEST "assign_reuse"
+#define LOOPBACK 1
+#define PORT 4443
+
+static int attach_reuseport(int sock_fd, int prog_fd)
+{
+	return setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+			  &prog_fd, sizeof(prog_fd));
+}
+
+static socklen_t inetaddr_len(const struct sockaddr_storage *addr)
+{
+	return addr->ss_family == AF_INET  ? sizeof(struct sockaddr_in) :
+	       addr->ss_family == AF_INET6 ? sizeof(struct sockaddr_in6) :
+					     0;
+}
+
+static bool is_ipv6(const char *ip)
+{
+	return !!strchr(ip, ':');
+}
+
+static int make_socket(int sotype, const char *ip, int port,
+		       struct sockaddr_storage *addr)
+{
+	struct timeval timeo = { .tv_usec = 500000 /* 500 ms */ };
+	int family = is_ipv6(ip) ? AF_INET6 : AF_INET;
+	int ret, fd;
+
+	ret = make_sockaddr(family, ip, port, addr, NULL);
+	if (!ASSERT_OK(ret, "make_sockaddr"))
+		return -1;
+	fd = socket(addr->ss_family, sotype, 0);
+	if (!ASSERT_GE(fd, 0, "socket"))
+		return -1;
+	ret = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo, sizeof(timeo));
+	if (!ASSERT_OK(ret, "sndtimeo")) {
+		close(fd);
+		return -1;
+	}
+	ret = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo, sizeof(timeo));
+	if (!ASSERT_OK(ret, "rcvtimeo")) {
+		close(fd);
+		return -1;
+	}
+	return fd;
+}
+
+static int create_server(int sotype, const char *ip, int port, int prog_fd)
+{
+	struct sockaddr_storage addr = {};
+	const int one = 1;
+	int ret, fd;
+
+	fd = make_socket(sotype, ip, port, &addr);
+	if (fd < 0)
+		return -1;
+	if (sotype == SOCK_STREAM) {
+		ret = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one,
+				 sizeof(one));
+		if (!ASSERT_OK(ret, "reuseaddr"))
+			goto cleanup;
+	}
+	if (prog_fd >= 0) {
+		ret = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one,
+				 sizeof(one));
+		if (!ASSERT_OK(ret, "reuseport"))
+			goto cleanup;
+		ret = attach_reuseport(fd, prog_fd);
+		if (!ASSERT_OK(ret, "attach_reuseport"))
+			goto cleanup;
+	}
+	ret = bind(fd, (void *)&addr, inetaddr_len(&addr));
+	if (!ASSERT_OK(ret, "bind"))
+		goto cleanup;
+	if (sotype == SOCK_STREAM) {
+		ret = listen(fd, SOMAXCONN);
+		if (!ASSERT_OK(ret, "listen"))
+			goto cleanup;
+	}
+	return fd;
+cleanup:
+	close(fd);
+	return -1;
+}
+
+static int create_client(int sotype, const char *ip, int port)
+{
+	struct sockaddr_storage addr = {};
+	int ret, fd;
+
+	fd = make_socket(sotype, ip, port, &addr);
+	if (fd < 0)
+		return -1;
+	ret = connect(fd, (void *)&addr, inetaddr_len(&addr));
+	if (ret)
+		goto cleanup;
+	return fd;
+cleanup:
+	close(fd);
+	return -1;
+}
+
+static __u64 cookie(int fd)
+{
+	__u64 cookie = 0;
+	socklen_t cookie_len = sizeof(cookie);
+	int ret;
+
+	ret = getsockopt(fd, SOL_SOCKET, SO_COOKIE, &cookie, &cookie_len);
+	ASSERT_OK(ret, "cookie");
+	ASSERT_GT(cookie, 0, "cookie_invalid");
+
+	return cookie;
+}
+
+static int echo_test_udp(int fd_sv, const char *ip, int port)
+{
+	struct sockaddr_storage addr = {};
+	socklen_t len = sizeof(addr);
+	char buff[1] = {};
+	int fd_cl = -1, ret;
+
+	fd_cl = create_client(SOCK_DGRAM, ip, port);
+	ASSERT_GT(fd_cl, 0, "create_client");
+	ASSERT_EQ(getsockname(fd_cl, (void *)&addr, &len), 0, "getsockname");
+
+	ASSERT_EQ(send(fd_cl, buff, sizeof(buff), 0), 1, "send_client");
+
+	ret = recv(fd_sv, buff, sizeof(buff), 0);
+	if (ret < 0)
+		return errno;
+
+	ASSERT_EQ(ret, 1, "recv_server");
+	ASSERT_EQ(sendto(fd_sv, buff, sizeof(buff), 0, (void *)&addr, len), 1, "send_server");
+	ASSERT_EQ(recv(fd_cl, buff, sizeof(buff), 0), 1, "recv_client");
+	close(fd_cl);
+	return 0;
+}
+
+static int echo_test_tcp(int fd_sv, const char *ip, int port)
+{
+	char buff[1] = {};
+	int fd_cl = -1, fd_sv_cl = -1;
+
+	fd_cl = create_client(SOCK_STREAM, ip, port);
+	if (fd_cl < 0)
+		return errno;
+
+	fd_sv_cl = accept(fd_sv, NULL, NULL);
+	ASSERT_GE(fd_sv_cl, 0, "accept_fd");
+
+	ASSERT_EQ(send(fd_cl, buff, sizeof(buff), 0), 1, "send_client");
+	ASSERT_EQ(recv(fd_sv_cl, buff, sizeof(buff), 0), 1, "recv_server");
+	ASSERT_EQ(send(fd_sv_cl, buff, sizeof(buff), 0), 1, "send_server");
+	ASSERT_EQ(recv(fd_cl, buff, sizeof(buff), 0), 1, "recv_client");
+	close(fd_sv_cl);
+	close(fd_cl);
+	return 0;
+}
+
+void run_assign_reuse(int sotype, const char *ip, int port)
+{
+	DECLARE_LIBBPF_OPTS(bpf_tc_hook, tc_hook, .ifindex = LOOPBACK,
+			    .attach_point = BPF_TC_INGRESS, );
+	DECLARE_LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1, );
+	bool hook_created = false, tc_attached = false;
+	int ret, fd_tc, fd_accept, fd_drop, fd_sv = -1, fd_map;
+	__u64 fd_val;
+	struct test_assign_reuse *skel;
+	const int zero = 0;
+
+	skel = test_assign_reuse__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	skel->rodata->dest_port = port;
+
+	ret = test_assign_reuse__load(skel);
+	if (!ASSERT_OK(ret, "skel_load"))
+		goto cleanup;
+
+	ASSERT_EQ(skel->bss->sk_cookie_seen, 0, "cookie_init");
+
+	fd_tc = bpf_program__fd(skel->progs.tc_main);
+	fd_accept = bpf_program__fd(skel->progs.reuse_accept);
+	fd_drop = bpf_program__fd(skel->progs.reuse_drop);
+	fd_map = bpf_map__fd(skel->maps.sk_map);
+
+	fd_sv = create_server(sotype, ip, port, fd_drop);
+	if (!ASSERT_GE(fd_sv, 0, "create_server"))
+		goto cleanup;
+
+	fd_val = fd_sv;
+	ret = bpf_map_update_elem(fd_map, &zero, &fd_val, BPF_NOEXIST);
+	if (!ASSERT_OK(ret, "bpf_sk_map"))
+		goto cleanup;
+
+	ret = bpf_tc_hook_create(&tc_hook);
+	if (ret == 0)
+		hook_created = true;
+	ret = ret == -EEXIST ? 0 : ret;
+	if (!ASSERT_OK(ret, "bpf_tc_hook_create"))
+		goto cleanup;
+
+	tc_opts.prog_fd = fd_tc;
+	ret = bpf_tc_attach(&tc_hook, &tc_opts);
+	if (!ASSERT_OK(ret, "bpf_tc_attach"))
+		goto cleanup;
+	tc_attached = true;
+
+	if (sotype == SOCK_STREAM)
+		ASSERT_EQ(echo_test_tcp(fd_sv, ip, port), ECONNREFUSED, "drop_tcp");
+	else
+		ASSERT_EQ(echo_test_udp(fd_sv, ip, port), EAGAIN, "drop_udp");
+	ASSERT_EQ(skel->bss->reuseport_executed, 1, "program executed once");
+
+	skel->bss->sk_cookie_seen = 0;
+	skel->bss->reuseport_executed = 0;
+	ASSERT_OK(attach_reuseport(fd_sv, fd_accept), "attach_reuseport(accept)");
+
+	if (sotype == SOCK_STREAM)
+		ASSERT_EQ(echo_test_tcp(fd_sv, ip, port), 0, "echo_tcp");
+	else
+		ASSERT_EQ(echo_test_udp(fd_sv, ip, port), 0, "echo_udp");
+
+	ASSERT_EQ(skel->bss->sk_cookie_seen, cookie(fd_sv),
+		  "cookie_mismatch");
+	ASSERT_EQ(skel->bss->reuseport_executed, 1, "program executed once");
+cleanup:
+	if (tc_attached) {
+		tc_opts.flags = tc_opts.prog_fd = tc_opts.prog_id = 0;
+		ret = bpf_tc_detach(&tc_hook, &tc_opts);
+		ASSERT_OK(ret, "bpf_tc_detach");
+	}
+	if (hook_created) {
+		tc_hook.attach_point = BPF_TC_INGRESS | BPF_TC_EGRESS;
+		bpf_tc_hook_destroy(&tc_hook);
+	}
+	test_assign_reuse__destroy(skel);
+	close(fd_sv);
+}
+
+void serial_test_assign_reuse(void)
+{
+	struct nstoken *tok = NULL;
+
+	SYS(out, "ip netns add %s", NS_TEST);
+	SYS(cleanup, "ip -net %s link set dev lo up", NS_TEST);
+
+	tok = open_netns(NS_TEST);
+	if (!ASSERT_OK_PTR(tok, "netns token"))
+		return;
+
+	if (test__start_subtest("tcpv4"))
+		run_assign_reuse(SOCK_STREAM, "127.0.0.1", PORT);
+	if (test__start_subtest("tcpv6"))
+		run_assign_reuse(SOCK_STREAM, "::1", PORT);
+	if (test__start_subtest("udpv4"))
+		run_assign_reuse(SOCK_DGRAM, "127.0.0.1", PORT);
+	if (test__start_subtest("udpv6"))
+		run_assign_reuse(SOCK_DGRAM, "::1", PORT);
+
+cleanup:
+	close_netns(tok);
+	SYS_NOFAIL("ip netns delete %s", NS_TEST);
+out:
+	return;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_assign_reuse.c b/tools/testing/selftests/bpf/progs/test_assign_reuse.c
new file mode 100644
index 000000000000..c83e870b2612
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_assign_reuse.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <bpf/bpf_endian.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/pkt_cls.h>
+
+char LICENSE[] SEC("license") = "GPL";
+
+__u64 sk_cookie_seen;
+__u64 reuseport_executed;
+union {
+	struct tcphdr tcp;
+	struct udphdr udp;
+} headers;
+
+const volatile __u16 dest_port;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} sk_map SEC(".maps");
+
+SEC("sk_reuseport")
+int reuse_accept(struct sk_reuseport_md *ctx)
+{
+	reuseport_executed++;
+
+	if (ctx->ip_protocol == IPPROTO_TCP) {
+		if (ctx->data + sizeof(headers.tcp) > ctx->data_end)
+			return SK_DROP;
+
+		if (__builtin_memcmp(&headers.tcp, ctx->data, sizeof(headers.tcp)) != 0)
+			return SK_DROP;
+	} else if (ctx->ip_protocol == IPPROTO_UDP) {
+		if (ctx->data + sizeof(headers.udp) > ctx->data_end)
+			return SK_DROP;
+
+		if (__builtin_memcmp(&headers.udp, ctx->data, sizeof(headers.udp)) != 0)
+			return SK_DROP;
+	} else {
+		return SK_DROP;
+	}
+
+	sk_cookie_seen = bpf_get_socket_cookie(ctx->sk);
+	return SK_PASS;
+}
+
+SEC("sk_reuseport")
+int reuse_drop(struct sk_reuseport_md *ctx)
+{
+	reuseport_executed++;
+	sk_cookie_seen = 0;
+	return SK_DROP;
+}
+
+static inline int
+assign_sk(struct __sk_buff *skb)
+{
+	int zero = 0, ret = 0;
+	struct bpf_sock *sk;
+
+	sk = bpf_map_lookup_elem(&sk_map, &zero);
+	if (!sk)
+		return TC_ACT_SHOT;
+	ret = bpf_sk_assign(skb, sk, 0);
+	bpf_sk_release(sk);
+	return ret ? TC_ACT_SHOT : TC_ACT_OK;
+}
+
+static inline bool
+maybe_assign_tcp(struct __sk_buff *skb, struct tcphdr *th)
+{
+	if (th + 1 > (void *)(long)(skb->data_end))
+		return TC_ACT_SHOT;
+
+	if (!th->syn || th->ack || th->dest != bpf_htons(dest_port))
+		return TC_ACT_OK;
+
+	__builtin_memcpy(&headers.tcp, th, sizeof(headers.tcp));
+	return assign_sk(skb);
+}
+
+static inline bool
+maybe_assign_udp(struct __sk_buff *skb, struct udphdr *uh)
+{
+	if (uh + 1 > (void *)(long)(skb->data_end))
+		return TC_ACT_SHOT;
+
+	if (uh->dest != bpf_htons(dest_port))
+		return TC_ACT_OK;
+
+	__builtin_memcpy(&headers.udp, uh, sizeof(headers.udp));
+	return assign_sk(skb);
+}
+
+SEC("tc")
+int tc_main(struct __sk_buff *skb)
+{
+	void *data_end = (void *)(long)skb->data_end;
+	void *data = (void *)(long)skb->data;
+	struct ethhdr *eth;
+
+	eth = (struct ethhdr *)(data);
+	if (eth + 1 > data_end)
+		return TC_ACT_SHOT;
+
+	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+		struct iphdr *iph = (struct iphdr *)(data + sizeof(*eth));
+
+		if (iph + 1 > data_end)
+			return TC_ACT_SHOT;
+
+		if (iph->protocol == IPPROTO_TCP)
+			return maybe_assign_tcp(skb, (struct tcphdr *)(iph + 1));
+		else if (iph->protocol == IPPROTO_UDP)
+			return maybe_assign_udp(skb, (struct udphdr *)(iph + 1));
+		else
+			return TC_ACT_SHOT;
+	} else {
+		struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + sizeof(*eth));
+
+		if (ip6h + 1 > data_end)
+			return TC_ACT_SHOT;
+
+		if (ip6h->nexthdr == IPPROTO_TCP)
+			return maybe_assign_tcp(skb, (struct tcphdr *)(ip6h + 1));
+		else if (ip6h->nexthdr == IPPROTO_UDP)
+			return maybe_assign_udp(skb, (struct udphdr *)(ip6h + 1));
+		else
+			return TC_ACT_SHOT;
+	}
+}
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
  2023-05-25  8:19 ` [PATCH bpf-next 2/2] selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper Lorenz Bauer
@ 2023-05-25 13:24 ` Eric Dumazet
  2023-05-25 19:51   ` Daniel Borkmann
  2023-05-25 17:41 ` Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2023-05-25 13:24 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David Ahern, Willem de Bruijn, Joe Stringer,
	Joe Stringer, Martin KaFai Lau, netdev, linux-kernel, bpf

On Thu, May 25, 2023 at 10:19 AM Lorenz Bauer <lmb@isovalent.com> wrote:
>
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy, which
> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
> TPROXY from our setup.
>
> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
> helpers don't execute SK_REUSEPORT programs. Instead, one of the
> reuseport sockets is selected by hash. This could cause dispatch to the
> "wrong" socket:
>
>     sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>     bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
>
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack. This creates some
> trickiness with regards to refcounting as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straight forward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
>
> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe6 ("bpf: Add socket assign support")
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
> Cc: Joe Stringer <joe@cilium.io>
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
> ---
>  include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>  include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>  include/net/sock.h             |  7 +++++--
>  include/uapi/linux/bpf.h       |  3 ---
>  net/core/filter.c              |  2 --
>  net/ipv4/inet_hashtables.c     | 15 +++++++-------
>  net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>  net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>  net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h |  3 ---
>  10 files changed, 119 insertions(+), 39 deletions(-)


> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7391bf310a7..920131e4a65d 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
>         return score;
>  }
>
> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> -                                           struct sk_buff *skb, int doff,
> -                                           __be32 saddr, __be16 sport,
> -                                           __be32 daddr, unsigned short hnum)
> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> +                                  struct sk_buff *skb, int doff,
> +                                  __be32 saddr, __be16 sport,
> +                                  __be32 daddr, unsigned short hnum)
>  {
>         struct sock *reuse_sk = NULL;
>         u32 phash;
> @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>         }
>         return reuse_sk;
>  }
> +EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
>
>  /*
>   * Here are some nice properties to exploit here. The BSD API
> @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>         sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>                 score = compute_score(sk, net, hnum, daddr, dif, sdif);
>                 if (score > hiscore) {
> -                       result = lookup_reuseport(net, sk, skb, doff,
> -                                                 saddr, sport, daddr, hnum);
> +                       result = inet_lookup_reuseport(net, sk, skb, doff,
> +                                                      saddr, sport, daddr, hnum);
>                         if (result)
>                                 return result;
>

Please split in a series.

First a patch renaming lookup_reuseport() to inet_lookup_reuseport()
and inet6_lookup_reuseport()
(cleanup, no change in behavior)

This would ease review and future bug hunting quite a bit.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
  2023-05-25  8:19 ` [PATCH bpf-next 2/2] selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper Lorenz Bauer
  2023-05-25 13:24 ` [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Eric Dumazet
@ 2023-05-25 17:41 ` Kuniyuki Iwashima
  2023-05-25 20:01   ` Daniel Borkmann
  2023-05-25 23:42 ` Martin KaFai Lau
  2023-05-26  5:56 ` Joe Stringer
  4 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-05-25 17:41 UTC (permalink / raw)
  To: lmb
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo, joe,
	joe, john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	martin.lau, netdev, pabeni, sdf, song, willemdebruijn.kernel,
	yhs, kuniyu

From: Lorenz Bauer <lmb@isovalent.com>
Date: Thu, 25 May 2023 09:19:22 +0100
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy, which
> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
> TPROXY from our setup.
> 
> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
> helpers don't execute SK_REUSEPORT programs. Instead, one of the
> reuseport sockets is selected by hash. This could cause dispatch to the
> "wrong" socket:
> 
>     sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>     bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
> 
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
> 
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack. This creates some
> trickiness with regards to refcounting as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straight forward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
> 
> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe6 ("bpf: Add socket assign support")

Please use 12 chars of hash.

---8<---
$ cat ~/.gitconfig
[core]
	abbrev = 12
[pretty]
	fixes = Fixes: %h (\"%s\")

$ git show 8e368dc --pretty=fixes | head -n 1
Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
---8<---


> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
> Cc: Joe Stringer <joe@cilium.io>
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
> ---
>  include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>  include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>  include/net/sock.h             |  7 +++++--
>  include/uapi/linux/bpf.h       |  3 ---
>  net/core/filter.c              |  2 --
>  net/ipv4/inet_hashtables.c     | 15 +++++++-------
>  net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>  net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>  net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h |  3 ---
>  10 files changed, 119 insertions(+), 39 deletions(-)
> 
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 56f1286583d3..3ba4dc2703da 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
>  					const u16 hnum, const int dif,
>  					const int sdif);
>  
> +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> +				    struct sk_buff *skb, int doff,
> +				    const struct in6_addr *saddr,
> +				    __be16 sport,
> +				    const struct in6_addr *daddr,
> +				    unsigned short hnum);
> +
>  struct sock *inet6_lookup_listener(struct net *net,
>  				   struct inet_hashinfo *hashinfo,
>  				   struct sk_buff *skb, int doff,
> @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
>  					      int iif, int sdif,
>  					      bool *refcounted)
>  {
> -	struct sock *sk = skb_steal_sock(skb, refcounted);
> -
> +	bool prefetched;
> +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> +	struct net *net = dev_net(skb_dst(skb)->dev);
> +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);

nit: Reverse Xmas Tree order.  Same for other chunks.


> +
> +	if (prefetched) {
> +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> +							       &ip6h->saddr, sport,
> +							       &ip6h->daddr, ntohs(dport));
> +		if (reuse_sk) {
> +			if (reuse_sk != sk) {
> +				if (*refcounted) {
> +					sock_put(sk);
> +					*refcounted = false;
> +				}
> +				if (IS_ERR(reuse_sk))
> +					return NULL;
> +			}
> +			return reuse_sk;
> +		}

Maybe we can add a hepler to avoid this duplication ?


> +	}
>  	if (sk)
>  		return sk;
>  
> -	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
> -			      doff, &ipv6_hdr(skb)->saddr, sport,
> -			      &ipv6_hdr(skb)->daddr, ntohs(dport),
> +	return __inet6_lookup(net, hashinfo, skb,
> +			      doff, &ip6h->saddr, sport,
> +			      &ip6h->daddr, ntohs(dport),
>  			      iif, sdif, refcounted);
>  }
>  
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 99bd823e97f6..c2af195ca71f 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -379,6 +379,11 @@ struct sock *__inet_lookup_established(struct net *net,
>  				       const __be32 daddr, const u16 hnum,
>  				       const int dif, const int sdif);
>  
> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> +				   struct sk_buff *skb, int doff,
> +				   __be32 saddr, __be16 sport,
> +				   __be32 daddr, unsigned short hnum);
> +
>  static inline struct sock *
>  	inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
>  				const __be32 saddr, const __be16 sport,
> @@ -436,13 +441,31 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
>  					     const int sdif,
>  					     bool *refcounted)
>  {
> -	struct sock *sk = skb_steal_sock(skb, refcounted);
> +	bool prefetched;
> +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> +	struct net *net = dev_net(skb_dst(skb)->dev);
>  	const struct iphdr *iph = ip_hdr(skb);
>  
> +	if (prefetched) {
> +		struct sock *reuse_sk = inet_lookup_reuseport(net, sk, skb, doff,
> +							      iph->saddr, sport,
> +							      iph->daddr, ntohs(dport));
> +		if (reuse_sk) {
> +			if (reuse_sk != sk) {
> +				if (*refcounted) {
> +					sock_put(sk);
> +					*refcounted = false;
> +				}
> +				if (IS_ERR(reuse_sk))
> +					return NULL;
> +			}
> +			return reuse_sk;
> +		}
> +	}
>  	if (sk)
>  		return sk;
>  
> -	return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
> +	return __inet_lookup(net, hashinfo, skb,
>  			     doff, iph->saddr, sport,
>  			     iph->daddr, dport, inet_iif(skb), sdif,
>  			     refcounted);
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 656ea89f60ff..5645570c2a64 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2806,20 +2806,23 @@ sk_is_refcounted(struct sock *sk)
>   * skb_steal_sock - steal a socket from an sk_buff
>   * @skb: sk_buff to steal the socket from
>   * @refcounted: is set to true if the socket is reference-counted
> + * @prefetched: is set to true if the socket was assigned from bpf
>   */
>  static inline struct sock *
> -skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> +skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
>  {
>  	if (skb->sk) {
>  		struct sock *sk = skb->sk;
>  
>  		*refcounted = true;
> -		if (skb_sk_is_prefetched(skb))
> +		*prefetched = skb_sk_is_prefetched(skb);
> +		if (*prefetched)
>  			*refcounted = sk_is_refcounted(sk);
>  		skb->destructor = NULL;
>  		skb->sk = NULL;
>  		return sk;
>  	}
> +	*prefetched = false;
>  	*refcounted = false;
>  	return NULL;
>  }
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1bb11a6ee667..2af606a525db 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4144,9 +4144,6 @@ union bpf_attr {
>   *		**-EOPNOTSUPP** if the operation is not supported, for example
>   *		a call from outside of TC ingress.
>   *
> - *		**-ESOCKTNOSUPPORT** if the socket type is not supported
> - *		(reuseport).
> - *
>   * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
>   *	Description
>   *		Helper is overloaded depending on BPF program type. This
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 968139f4a1ac..5f451260849b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7265,8 +7265,6 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
>  		return -EOPNOTSUPP;
>  	if (unlikely(dev_net(skb->dev) != sock_net(sk)))
>  		return -ENETUNREACH;
> -	if (unlikely(sk_fullsock(sk) && sk->sk_reuseport))
> -		return -ESOCKTNOSUPPORT;
>  	if (sk_is_refcounted(sk) &&
>  	    unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
>  		return -ENOENT;
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7391bf310a7..920131e4a65d 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
>  	return score;
>  }
>  
> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> -					    struct sk_buff *skb, int doff,
> -					    __be32 saddr, __be16 sport,
> -					    __be32 daddr, unsigned short hnum)
> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> +				   struct sk_buff *skb, int doff,
> +				   __be32 saddr, __be16 sport,
> +				   __be32 daddr, unsigned short hnum)
>  {
>  	struct sock *reuse_sk = NULL;
>  	u32 phash;
> @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>  	}
>  	return reuse_sk;
>  }
> +EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
>  
>  /*
>   * Here are some nice properties to exploit here. The BSD API
> @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>  	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>  		score = compute_score(sk, net, hnum, daddr, dif, sdif);
>  		if (score > hiscore) {
> -			result = lookup_reuseport(net, sk, skb, doff,
> -						  saddr, sport, daddr, hnum);
> +			result = inet_lookup_reuseport(net, sk, skb, doff,
> +						       saddr, sport, daddr, hnum);
>  			if (result)
>  				return result;
>  
> @@ -399,7 +400,7 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
>  	if (no_reuseport || IS_ERR_OR_NULL(sk))
>  		return sk;
>  
> -	reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
> +	reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
>  	if (reuse_sk)
>  		sk = reuse_sk;
>  	return sk;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 6893fb867529..c67253386a38 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2426,7 +2426,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  	struct rtable *rt = skb_rtable(skb);
>  	__be32 saddr, daddr;
>  	struct net *net = dev_net(skb->dev);
> -	bool refcounted;
> +	bool refcounted, prefetched;
>  	int drop_reason;
>  
>  	drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> @@ -2455,11 +2455,28 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  	if (udp4_csum_init(skb, uh, proto))
>  		goto csum_error;
>  
> -	sk = skb_steal_sock(skb, &refcounted);
> +	sk = skb_steal_sock(skb, &refcounted, &prefetched);
>  	if (sk) {
>  		struct dst_entry *dst = skb_dst(skb);
>  		int ret;
>  
> +		if (prefetched) {
> +			struct sock *reuse_sk = lookup_reuseport(net, sk, skb,
> +								 saddr, uh->source,
> +								 daddr, ntohs(uh->dest));
> +			if (reuse_sk) {
> +				if (reuse_sk != sk) {
> +					if (refcounted) {
> +						sock_put(sk);
> +						refcounted = false;
> +					}
> +					if (IS_ERR(reuse_sk))
> +						goto no_sk;
> +				}
> +				sk = reuse_sk;
> +			}
> +		}
> +
>  		if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
>  			udp_sk_rx_dst_set(sk, dst);
>  
> @@ -2476,7 +2493,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
>  	if (sk)
>  		return udp_unicast_rcv_skb(sk, skb, uh);
> -
> +no_sk:
>  	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
>  		goto drop;
>  	nf_reset_ct(skb);
> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> index b64b49012655..b7c56867314e 100644
> --- a/net/ipv6/inet6_hashtables.c
> +++ b/net/ipv6/inet6_hashtables.c
> @@ -111,12 +111,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
>  	return score;
>  }
>  
> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> -					    struct sk_buff *skb, int doff,
> -					    const struct in6_addr *saddr,
> -					    __be16 sport,
> -					    const struct in6_addr *daddr,
> -					    unsigned short hnum)
> +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> +				    struct sk_buff *skb, int doff,
> +				    const struct in6_addr *saddr,
> +				    __be16 sport,
> +				    const struct in6_addr *daddr,
> +				    unsigned short hnum)
>  {
>  	struct sock *reuse_sk = NULL;
>  	u32 phash;
> @@ -127,6 +127,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>  	}
>  	return reuse_sk;
>  }
> +EXPORT_SYMBOL_GPL(inet6_lookup_reuseport);
>  
>  /* called with rcu_read_lock() */
>  static struct sock *inet6_lhash2_lookup(struct net *net,
> @@ -143,8 +144,8 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
>  	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>  		score = compute_score(sk, net, hnum, daddr, dif, sdif);
>  		if (score > hiscore) {
> -			result = lookup_reuseport(net, sk, skb, doff,
> -						  saddr, sport, daddr, hnum);
> +			result = inet6_lookup_reuseport(net, sk, skb, doff,
> +							saddr, sport, daddr, hnum);
>  			if (result)
>  				return result;
>  
> @@ -175,7 +176,7 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net,
>  	if (no_reuseport || IS_ERR_OR_NULL(sk))
>  		return sk;
>  
> -	reuse_sk = lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
> +	reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
>  	if (reuse_sk)
>  		sk = reuse_sk;
>  	return sk;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index e5a337e6b970..3fede8ec95c4 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -949,7 +949,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  	struct net *net = dev_net(skb->dev);
>  	struct udphdr *uh;
>  	struct sock *sk;
> -	bool refcounted;
> +	bool refcounted, prefetched;
>  	u32 ulen = 0;
>  
>  	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
> @@ -986,11 +986,28 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  		goto csum_error;
>  
>  	/* Check if the socket is already available, e.g. due to early demux */
> -	sk = skb_steal_sock(skb, &refcounted);
> +	sk = skb_steal_sock(skb, &refcounted, &prefetched);
>  	if (sk) {
>  		struct dst_entry *dst = skb_dst(skb);
>  		int ret;
>  
> +		if (prefetched) {
> +			struct sock *reuse_sk = lookup_reuseport(net, sk, skb,
> +								 saddr, uh->source,
> +								 daddr, ntohs(uh->dest));
> +			if (reuse_sk) {
> +				if (reuse_sk != sk) {
> +					if (refcounted) {
> +						sock_put(sk);
> +						refcounted = false;
> +					}
> +					if (IS_ERR(reuse_sk))
> +						goto no_sk;
> +				}
> +				sk = reuse_sk;
> +			}
> +		}
> +
>  		if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
>  			udp6_sk_rx_dst_set(sk, dst);
>  
> @@ -1020,7 +1037,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  			goto report_csum_error;
>  		return udp6_unicast_rcv_skb(sk, skb, uh);
>  	}
> -
> +no_sk:
>  	reason = SKB_DROP_REASON_NO_SOCKET;
>  
>  	if (!uh->check)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 1bb11a6ee667..2af606a525db 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -4144,9 +4144,6 @@ union bpf_attr {
>   *		**-EOPNOTSUPP** if the operation is not supported, for example
>   *		a call from outside of TC ingress.
>   *
> - *		**-ESOCKTNOSUPPORT** if the socket type is not supported
> - *		(reuseport).
> - *
>   * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
>   *	Description
>   *		Helper is overloaded depending on BPF program type. This
> -- 
> 2.40.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25 13:24 ` [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Eric Dumazet
@ 2023-05-25 19:51   ` Daniel Borkmann
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Borkmann @ 2023-05-25 19:51 UTC (permalink / raw)
  To: Eric Dumazet, Lorenz Bauer
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, Willem de Bruijn, Joe Stringer, Joe Stringer,
	Martin KaFai Lau, netdev, linux-kernel, bpf

On 5/25/23 3:24 PM, Eric Dumazet wrote:
> On Thu, May 25, 2023 at 10:19 AM Lorenz Bauer <lmb@isovalent.com> wrote:
>>
>> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
>> sockets. This means we can't use the helper to steer traffic to Envoy, which
>> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
>> TPROXY from our setup.
>>
>> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
>> helpers don't execute SK_REUSEPORT programs. Instead, one of the
>> reuseport sockets is selected by hash. This could cause dispatch to the
>> "wrong" socket:
>>
>>      sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>>      bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>>
>> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
>> helpers unfortunately. In the tc context, L2 headers are at the start
>> of the skb, while SK_REUSEPORT expects L3 headers instead.
>>
>> Instead, we execute the SK_REUSEPORT program when the assigned socket
>> is pulled out of the skb, further up the stack. This creates some
>> trickiness with regards to refcounting as bpf_sk_assign will put both
>> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
>> freed. We can infer that the sk_assigned socket is RCU freed if the
>> reuseport lookup succeeds, but convincing yourself of this fact isn't
>> straight forward. Therefore we defensively check refcounting on the
>> sk_assign sock even though it's probably not required in practice.
>>
>> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
>> Fixes: cf7fbe6 ("bpf: Add socket assign support")
>> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
>> Cc: Joe Stringer <joe@cilium.io>
>> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
>> ---
>>   include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>>   include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>>   include/net/sock.h             |  7 +++++--
>>   include/uapi/linux/bpf.h       |  3 ---
>>   net/core/filter.c              |  2 --
>>   net/ipv4/inet_hashtables.c     | 15 +++++++-------
>>   net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>>   net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>>   net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>>   tools/include/uapi/linux/bpf.h |  3 ---
>>   10 files changed, 119 insertions(+), 39 deletions(-)
> 
> 
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index e7391bf310a7..920131e4a65d 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
>>          return score;
>>   }
>>
>> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>> -                                           struct sk_buff *skb, int doff,
>> -                                           __be32 saddr, __be16 sport,
>> -                                           __be32 daddr, unsigned short hnum)
>> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
>> +                                  struct sk_buff *skb, int doff,
>> +                                  __be32 saddr, __be16 sport,
>> +                                  __be32 daddr, unsigned short hnum)
>>   {
>>          struct sock *reuse_sk = NULL;
>>          u32 phash;
>> @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>>          }
>>          return reuse_sk;
>>   }
>> +EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
>>
>>   /*
>>    * Here are some nice properties to exploit here. The BSD API
>> @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>>          sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>>                  score = compute_score(sk, net, hnum, daddr, dif, sdif);
>>                  if (score > hiscore) {
>> -                       result = lookup_reuseport(net, sk, skb, doff,
>> -                                                 saddr, sport, daddr, hnum);
>> +                       result = inet_lookup_reuseport(net, sk, skb, doff,
>> +                                                      saddr, sport, daddr, hnum);
>>                          if (result)
>>                                  return result;
>>
> 
> Please split in a series.
> 
> First a patch renaming lookup_reuseport() to inet_lookup_reuseport()
> and inet6_lookup_reuseport()
> (cleanup, no change in behavior)
> 
> This would ease review and future bug hunting quite a bit.

Makes sense and should reduce the churn on the actual change.

I think Lorenz is planning to flush out a v2 next week with this split.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25 17:41 ` Kuniyuki Iwashima
@ 2023-05-25 20:01   ` Daniel Borkmann
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Borkmann @ 2023-05-25 20:01 UTC (permalink / raw)
  To: Kuniyuki Iwashima, lmb
  Cc: andrii, ast, bpf, davem, dsahern, edumazet, haoluo, joe, joe,
	john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	martin.lau, netdev, pabeni, sdf, song, willemdebruijn.kernel,
	yhs

On 5/25/23 7:41 PM, Kuniyuki Iwashima wrote:
> From: Lorenz Bauer <lmb@isovalent.com>
> Date: Thu, 25 May 2023 09:19:22 +0100
>> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
>> sockets. This means we can't use the helper to steer traffic to Envoy, which
>> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
>> TPROXY from our setup.
>>
>> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
>> helpers don't execute SK_REUSEPORT programs. Instead, one of the
>> reuseport sockets is selected by hash. This could cause dispatch to the
>> "wrong" socket:
>>
>>      sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>>      bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>>
>> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
>> helpers unfortunately. In the tc context, L2 headers are at the start
>> of the skb, while SK_REUSEPORT expects L3 headers instead.
>>
>> Instead, we execute the SK_REUSEPORT program when the assigned socket
>> is pulled out of the skb, further up the stack. This creates some
>> trickiness with regards to refcounting as bpf_sk_assign will put both
>> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
>> freed. We can infer that the sk_assigned socket is RCU freed if the
>> reuseport lookup succeeds, but convincing yourself of this fact isn't
>> straight forward. Therefore we defensively check refcounting on the
>> sk_assign sock even though it's probably not required in practice.
>>
>> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
>> Fixes: cf7fbe6 ("bpf: Add socket assign support")
> 
> Please use 12 chars of hash.
> 
> $ cat ~/.gitconfig
> [core]
> 	abbrev = 12
> [pretty]
> 	fixes = Fixes: %h (\"%s\")
> 
> $ git show 8e368dc --pretty=fixes | head -n 1
> Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")

Yeap, not quite sure what happened here but the 12 chars is clear. Will
be fixed up in v2, too, ofc.

>> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
>> Cc: Joe Stringer <joe@cilium.io>
>> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
>> ---
>>   include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>>   include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>>   include/net/sock.h             |  7 +++++--
>>   include/uapi/linux/bpf.h       |  3 ---
>>   net/core/filter.c              |  2 --
>>   net/ipv4/inet_hashtables.c     | 15 +++++++-------
>>   net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>>   net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>>   net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>>   tools/include/uapi/linux/bpf.h |  3 ---
>>   10 files changed, 119 insertions(+), 39 deletions(-)
>>
[...]
>> @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
>>   					      int iif, int sdif,
>>   					      bool *refcounted)
>>   {
>> -	struct sock *sk = skb_steal_sock(skb, refcounted);
>> -
>> +	bool prefetched;
>> +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
>> +	struct net *net = dev_net(skb_dst(skb)->dev);
>> +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> 
> nit: Reverse Xmas Tree order.  Same for other chunks.

It is, the prefetched bool is simply used one line below. I don't think
this is much different than most other code from style pov..

>> +
>> +	if (prefetched) {
>> +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
>> +							       &ip6h->saddr, sport,
>> +							       &ip6h->daddr, ntohs(dport));
>> +		if (reuse_sk) {
>> +			if (reuse_sk != sk) {
>> +				if (*refcounted) {
>> +					sock_put(sk);
>> +					*refcounted = false;
>> +				}
>> +				if (IS_ERR(reuse_sk))
>> +					return NULL;
>> +			}
>> +			return reuse_sk;
>> +		}
> 
> Maybe we can add a hepler to avoid this duplication ?

We'll check if it can be made a bit nicer and integrate this into the v2.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
                   ` (2 preceding siblings ...)
  2023-05-25 17:41 ` Kuniyuki Iwashima
@ 2023-05-25 23:42 ` Martin KaFai Lau
  2023-05-26  1:43   ` Kuniyuki Iwashima
  2023-05-26  5:56 ` Joe Stringer
  4 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2023-05-25 23:42 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Joe Stringer, Martin KaFai Lau, netdev, linux-kernel, bpf,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David Ahern, Willem de Bruijn, Joe Stringer

On 5/25/23 1:19 AM, Lorenz Bauer wrote:
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 56f1286583d3..3ba4dc2703da 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
>   					const u16 hnum, const int dif,
>   					const int sdif);
>   
> +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> +				    struct sk_buff *skb, int doff,
> +				    const struct in6_addr *saddr,
> +				    __be16 sport,
> +				    const struct in6_addr *daddr,
> +				    unsigned short hnum);
> +
>   struct sock *inet6_lookup_listener(struct net *net,
>   				   struct inet_hashinfo *hashinfo,
>   				   struct sk_buff *skb, int doff,
> @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
>   					      int iif, int sdif,
>   					      bool *refcounted)
>   {
> -	struct sock *sk = skb_steal_sock(skb, refcounted);
> -
> +	bool prefetched;
> +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> +	struct net *net = dev_net(skb_dst(skb)->dev);
> +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> +
> +	if (prefetched) {
> +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,

If sk is TCP_ESTABLISHED, I suspect sk->sk_reuseport is 1 (from sk_clone)?

If it is, it should still work other than an extra inet6_ehashfn. Does it worth 
an extra sk->sk_state check or it is overkill?


> +							       &ip6h->saddr, sport,
> +							       &ip6h->daddr, ntohs(dport));


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 2/2] selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
  2023-05-25  8:19 ` [PATCH bpf-next 2/2] selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper Lorenz Bauer
@ 2023-05-26  0:06   ` Martin KaFai Lau
  0 siblings, 0 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2023-05-26  0:06 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Joe Stringer, linux-kernel, bpf, linux-kselftest,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, Shuah Khan

On 5/25/23 1:19 AM, Lorenz Bauer wrote:
> diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> index a105c0cd008a..8a33bcea97de 100644
> --- a/tools/testing/selftests/bpf/network_helpers.c
> +++ b/tools/testing/selftests/bpf/network_helpers.c
> @@ -423,6 +423,9 @@ struct nstoken *open_netns(const char *name)
>   
>   void close_netns(struct nstoken *token)
>   {
> +	if (!token)
> +		return;

+1. :)

> +
>   	ASSERT_OK(setns(token->orig_netns_fd, CLONE_NEWNET), "setns");
>   	close(token->orig_netns_fd);
>   	free(token);
> diff --git a/tools/testing/selftests/bpf/prog_tests/assign_reuse.c b/tools/testing/selftests/bpf/prog_tests/assign_reuse.c
> new file mode 100644
> index 000000000000..2cb9bb591e71
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/assign_reuse.c
> @@ -0,0 +1,280 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +#include <uapi/linux/if_link.h>
> +#include <test_progs.h>
> +
> +#include <netinet/tcp.h>
> +#include <netinet/udp.h>
> +
> +#include "network_helpers.h"
> +#include "test_assign_reuse.skel.h"
> +
> +#define NS_TEST "assign_reuse"
> +#define LOOPBACK 1
> +#define PORT 4443
> +
> +static int attach_reuseport(int sock_fd, int prog_fd)
> +{
> +	return setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
> +			  &prog_fd, sizeof(prog_fd));
> +}
> +
> +static socklen_t inetaddr_len(const struct sockaddr_storage *addr)
> +{
> +	return addr->ss_family == AF_INET  ? sizeof(struct sockaddr_in) :
> +	       addr->ss_family == AF_INET6 ? sizeof(struct sockaddr_in6) :
> +					     0;
> +}
> +
> +static bool is_ipv6(const char *ip)
> +{
> +	return !!strchr(ip, ':');
> +}
> +
> +static int make_socket(int sotype, const char *ip, int port,
> +		       struct sockaddr_storage *addr)
> +{
> +	struct timeval timeo = { .tv_usec = 500000 /* 500 ms */ };
> +	int family = is_ipv6(ip) ? AF_INET6 : AF_INET;
> +	int ret, fd;
> +
> +	ret = make_sockaddr(family, ip, port, addr, NULL);
> +	if (!ASSERT_OK(ret, "make_sockaddr"))
> +		return -1;
> +	fd = socket(addr->ss_family, sotype, 0);
> +	if (!ASSERT_GE(fd, 0, "socket"))
> +		return -1;
> +	ret = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo, sizeof(timeo));
> +	if (!ASSERT_OK(ret, "sndtimeo")) {
> +		close(fd);
> +		return -1;
> +	}
> +	ret = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo, sizeof(timeo));
> +	if (!ASSERT_OK(ret, "rcvtimeo")) {
> +		close(fd);
> +		return -1;
> +	}
> +	return fd;
> +}
> +
> +static int create_server(int sotype, const char *ip, int port, int prog_fd)
> +{
> +	struct sockaddr_storage addr = {};
> +	const int one = 1;
> +	int ret, fd;
> +
> +	fd = make_socket(sotype, ip, port, &addr);
> +	if (fd < 0)
> +		return -1;
> +	if (sotype == SOCK_STREAM) {
> +		ret = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one,
> +				 sizeof(one));
> +		if (!ASSERT_OK(ret, "reuseaddr"))
> +			goto cleanup;
> +	}
> +	if (prog_fd >= 0) {
> +		ret = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one,
> +				 sizeof(one));
> +		if (!ASSERT_OK(ret, "reuseport"))
> +			goto cleanup;
> +		ret = attach_reuseport(fd, prog_fd);
> +		if (!ASSERT_OK(ret, "attach_reuseport"))
> +			goto cleanup;
> +	}
> +	ret = bind(fd, (void *)&addr, inetaddr_len(&addr));
> +	if (!ASSERT_OK(ret, "bind"))
> +		goto cleanup;
> +	if (sotype == SOCK_STREAM) {
> +		ret = listen(fd, SOMAXCONN);
> +		if (!ASSERT_OK(ret, "listen"))
> +			goto cleanup;
> +	}
> +	return fd;
> +cleanup:
> +	close(fd);
> +	return -1;
> +}
> +
> +static int create_client(int sotype, const char *ip, int port)
> +{
> +	struct sockaddr_storage addr = {};
> +	int ret, fd;
> +
> +	fd = make_socket(sotype, ip, port, &addr);
> +	if (fd < 0)
> +		return -1;
> +	ret = connect(fd, (void *)&addr, inetaddr_len(&addr));
> +	if (ret)
> +		goto cleanup;
> +	return fd;
> +cleanup:
> +	close(fd);
> +	return -1;
> +}

I believe a lot of the above can be done by start_reuseport_server() and 
connect_to_fd() from network_helpers.c.

[ ... ]

> +void serial_test_assign_reuse(void)

Remove "serial_" .


> +{
> +	struct nstoken *tok = NULL;
> +
> +	SYS(out, "ip netns add %s", NS_TEST);
> +	SYS(cleanup, "ip -net %s link set dev lo up", NS_TEST);
> +
> +	tok = open_netns(NS_TEST);
> +	if (!ASSERT_OK_PTR(tok, "netns token"))
> +		return;
> +
> +	if (test__start_subtest("tcpv4"))
> +		run_assign_reuse(SOCK_STREAM, "127.0.0.1", PORT);
> +	if (test__start_subtest("tcpv6"))
> +		run_assign_reuse(SOCK_STREAM, "::1", PORT);
> +	if (test__start_subtest("udpv4"))
> +		run_assign_reuse(SOCK_DGRAM, "127.0.0.1", PORT);
> +	if (test__start_subtest("udpv6"))
> +		run_assign_reuse(SOCK_DGRAM, "::1", PORT);
> +
> +cleanup:
> +	close_netns(tok);
> +	SYS_NOFAIL("ip netns delete %s", NS_TEST);
> +out:
> +	return;
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_assign_reuse.c b/tools/testing/selftests/bpf/progs/test_assign_reuse.c
> new file mode 100644
> index 000000000000..c83e870b2612
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_assign_reuse.c
> @@ -0,0 +1,142 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +#include <stdbool.h>
> +#include <linux/bpf.h>
> +#include <linux/if_ether.h>
> +#include <linux/in.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/tcp.h>
> +#include <linux/udp.h>
> +#include <bpf/bpf_endian.h>
> +#include <bpf/bpf_helpers.h>
> +#include <linux/pkt_cls.h>
> +
> +char LICENSE[] SEC("license") = "GPL";
> +
> +__u64 sk_cookie_seen;
> +__u64 reuseport_executed;
> +union {
> +	struct tcphdr tcp;
> +	struct udphdr udp;
> +} headers;
> +
> +const volatile __u16 dest_port;
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_SOCKMAP);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u64);
> +} sk_map SEC(".maps");
> +
> +SEC("sk_reuseport")
> +int reuse_accept(struct sk_reuseport_md *ctx)
> +{
> +	reuseport_executed++;
> +
> +	if (ctx->ip_protocol == IPPROTO_TCP) {
> +		if (ctx->data + sizeof(headers.tcp) > ctx->data_end)
> +			return SK_DROP;
> +
> +		if (__builtin_memcmp(&headers.tcp, ctx->data, sizeof(headers.tcp)) != 0)
> +			return SK_DROP;
> +	} else if (ctx->ip_protocol == IPPROTO_UDP) {
> +		if (ctx->data + sizeof(headers.udp) > ctx->data_end)
> +			return SK_DROP;
> +
> +		if (__builtin_memcmp(&headers.udp, ctx->data, sizeof(headers.udp)) != 0)
> +			return SK_DROP;
> +	} else {
> +		return SK_DROP;
> +	}
> +
> +	sk_cookie_seen = bpf_get_socket_cookie(ctx->sk);
> +	return SK_PASS;
> +}
> +
> +SEC("sk_reuseport")
> +int reuse_drop(struct sk_reuseport_md *ctx)
> +{
> +	reuseport_executed++;
> +	sk_cookie_seen = 0;
> +	return SK_DROP;
> +}
> +
> +static inline int

nit. inline is no longer a must. Same for a few functions below.

> +assign_sk(struct __sk_buff *skb)
> +{
> +	int zero = 0, ret = 0;
> +	struct bpf_sock *sk;
> +
> +	sk = bpf_map_lookup_elem(&sk_map, &zero);
> +	if (!sk)
> +		return TC_ACT_SHOT;
> +	ret = bpf_sk_assign(skb, sk, 0);
> +	bpf_sk_release(sk);
> +	return ret ? TC_ACT_SHOT : TC_ACT_OK;
> +}
> +


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25 23:42 ` Martin KaFai Lau
@ 2023-05-26  1:43   ` Kuniyuki Iwashima
  2023-05-26  2:49     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-05-26  1:43 UTC (permalink / raw)
  To: martin.lau
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo, joe,
	joe, john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	lmb, netdev, pabeni, sdf, song, willemdebruijn.kernel, yhs,
	kuniyu

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Thu, 25 May 2023 16:42:46 -0700
> On 5/25/23 1:19 AM, Lorenz Bauer wrote:
> > diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> > index 56f1286583d3..3ba4dc2703da 100644
> > --- a/include/net/inet6_hashtables.h
> > +++ b/include/net/inet6_hashtables.h
> > @@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
> >   					const u16 hnum, const int dif,
> >   					const int sdif);
> >   
> > +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> > +				    struct sk_buff *skb, int doff,
> > +				    const struct in6_addr *saddr,
> > +				    __be16 sport,
> > +				    const struct in6_addr *daddr,
> > +				    unsigned short hnum);
> > +
> >   struct sock *inet6_lookup_listener(struct net *net,
> >   				   struct inet_hashinfo *hashinfo,
> >   				   struct sk_buff *skb, int doff,
> > @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
> >   					      int iif, int sdif,
> >   					      bool *refcounted)
> >   {
> > -	struct sock *sk = skb_steal_sock(skb, refcounted);
> > -
> > +	bool prefetched;
> > +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> > +	struct net *net = dev_net(skb_dst(skb)->dev);
> > +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> > +
> > +	if (prefetched) {
> > +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> 
> If sk is TCP_ESTABLISHED, I suspect sk->sk_reuseport is 1 (from sk_clone)?

Exactly, it will cause null-ptr-deref in reuseport_select_sock().
We may want to use rcu_access_pointer(sk->sk_reuseport_cb) in
each lookup_reuseport() instead of adding sk_state check ?


> 
> If it is, it should still work other than an extra inet6_ehashfn. Does it worth 
> an extra sk->sk_state check or it is overkill?
> 
> 
> > +							       &ip6h->saddr, sport,
> > +							       &ip6h->daddr, ntohs(dport));

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-26  1:43   ` Kuniyuki Iwashima
@ 2023-05-26  2:49     ` Kuniyuki Iwashima
  2023-05-26 21:08       ` Martin KaFai Lau
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2023-05-26  2:49 UTC (permalink / raw)
  To: kuniyu
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo, joe,
	joe, john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	lmb, martin.lau, netdev, pabeni, sdf, song,
	willemdebruijn.kernel, yhs

From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Thu, 25 May 2023 18:43:17 -0700
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Thu, 25 May 2023 16:42:46 -0700
> > On 5/25/23 1:19 AM, Lorenz Bauer wrote:
> > > diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> > > index 56f1286583d3..3ba4dc2703da 100644
> > > --- a/include/net/inet6_hashtables.h
> > > +++ b/include/net/inet6_hashtables.h
> > > @@ -48,6 +48,13 @@ struct sock *__inet6_lookup_established(struct net *net,
> > >   					const u16 hnum, const int dif,
> > >   					const int sdif);
> > >   
> > > +struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> > > +				    struct sk_buff *skb, int doff,
> > > +				    const struct in6_addr *saddr,
> > > +				    __be16 sport,
> > > +				    const struct in6_addr *daddr,
> > > +				    unsigned short hnum);
> > > +
> > >   struct sock *inet6_lookup_listener(struct net *net,
> > >   				   struct inet_hashinfo *hashinfo,
> > >   				   struct sk_buff *skb, int doff,
> > > @@ -85,14 +92,33 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
> > >   					      int iif, int sdif,
> > >   					      bool *refcounted)
> > >   {
> > > -	struct sock *sk = skb_steal_sock(skb, refcounted);
> > > -
> > > +	bool prefetched;
> > > +	struct sock *sk = skb_steal_sock(skb, refcounted, &prefetched);
> > > +	struct net *net = dev_net(skb_dst(skb)->dev);
> > > +	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> > > +
> > > +	if (prefetched) {
> > > +		struct sock *reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> > 
> > If sk is TCP_ESTABLISHED, I suspect sk->sk_reuseport is 1 (from sk_clone)?
> 
> Exactly, it will cause null-ptr-deref in reuseport_select_sock().

Sorry, this doesn't occur.  reuseport_select_sock() has null check.


> We may want to use rcu_access_pointer(sk->sk_reuseport_cb) in
> each lookup_reuseport() instead of adding sk_state check ?

And if someone has a weird program that creates multiple listeners and
disable SO_REUSEPORT for a listener that hits first in lhash2, checking
sk_reuseport_cb might not work ?  I hope no one does such though, checking
sk_reuseport and sk_state could be better.

> 
> 
> > 
> > If it is, it should still work other than an extra inet6_ehashfn. Does it worth 
> > an extra sk->sk_state check or it is overkill?
> > 
> > 
> > > +							       &ip6h->saddr, sport,
> > > +							       &ip6h->daddr, ntohs(dport));

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
                   ` (3 preceding siblings ...)
  2023-05-25 23:42 ` Martin KaFai Lau
@ 2023-05-26  5:56 ` Joe Stringer
  4 siblings, 0 replies; 12+ messages in thread
From: Joe Stringer @ 2023-05-26  5:56 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David Ahern,
	Willem de Bruijn, Joe Stringer, Joe Stringer, Martin KaFai Lau,
	netdev, linux-kernel, bpf

On Thu, May 25, 2023 at 1:19 AM Lorenz Bauer <lmb@isovalent.com> wrote:
>
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy, which
> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
> TPROXY from our setup.
>
> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
> helpers don't execute SK_REUSEPORT programs. Instead, one of the
> reuseport sockets is selected by hash. This could cause dispatch to the
> "wrong" socket:
>
>     sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>     bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
>
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack. This creates some
> trickiness with regards to refcounting as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straight forward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
>
> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe6 ("bpf: Add socket assign support")
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
> Cc: Joe Stringer <joe@cilium.io>
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/

Nice approach to fix this issue, wish I'd thought of it :)

I pulled this and tested out in a little-vm-helper environment with
kind and Cilium's examples/kubernetes/connectivity-check proxy suite,
as well as cilium-cli's connectivity tests and the L7 features seem to
be working as expected with SO_REUSEPORT.

Tested-by: Joe Stringer <joe@cilium.io>

I also glanced through the commit, and the various protocols seem to
be handled consistently at the very least, though I agree it'd be
simpler for review and bisecting if broken down into more incremental
changes.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
  2023-05-26  2:49     ` Kuniyuki Iwashima
@ 2023-05-26 21:08       ` Martin KaFai Lau
  0 siblings, 0 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2023-05-26 21:08 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo, joe,
	joe, john.fastabend, jolsa, kafai, kpsingh, kuba, linux-kernel,
	lmb, netdev, pabeni, sdf, song, willemdebruijn.kernel, yhs

On 5/25/23 7:49 PM, Kuniyuki Iwashima wrote:
>> We may want to use rcu_access_pointer(sk->sk_reuseport_cb) in
>> each lookup_reuseport() instead of adding sk_state check ?

> And if someone has a weird program that creates multiple listeners and
> disable SO_REUSEPORT for a listener that hits first in lhash2, checking
> sk_reuseport_cb might not work ?  

I am also not sure what the correct expectation is if the application disable 
SO_REUSEPORT in a sk after bind(). This is not something new or specific from 
the current patch though.

> I hope no one does such though, checking sk_reuseport 
> and sk_state could be better.

My previous comment on checking sk_state is to avoid the unnecessary 
inet_ehashfn() for the TCP_ESTABLISHED sk. Checking sk_reuseport_cb could also 
work since it should be NULL for established sk. However, not sure if it could 
be another cacheline access. The udp one is checking the sk_state also: 
"sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED".

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-05-26 21:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-25  8:19 [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Lorenz Bauer
2023-05-25  8:19 ` [PATCH bpf-next 2/2] selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper Lorenz Bauer
2023-05-26  0:06   ` Martin KaFai Lau
2023-05-25 13:24 ` [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign Eric Dumazet
2023-05-25 19:51   ` Daniel Borkmann
2023-05-25 17:41 ` Kuniyuki Iwashima
2023-05-25 20:01   ` Daniel Borkmann
2023-05-25 23:42 ` Martin KaFai Lau
2023-05-26  1:43   ` Kuniyuki Iwashima
2023-05-26  2:49     ` Kuniyuki Iwashima
2023-05-26 21:08       ` Martin KaFai Lau
2023-05-26  5:56 ` Joe Stringer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).