Netdev Archive on lore.kernel.org
 help / color / Atom feed
* [PATCHv2 bpf-next 0/5] Add bpf_sk_assign eBPF helper
@ 2020-03-25  5:57 Joe Stringer
  2020-03-25  5:57 ` [PATCHv2 bpf-next 1/5] bpf: Add socket assign support Joe Stringer
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-25  5:57 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb, kafai

Introduce a new helper that allows assigning a previously-found socket
to the skb as the packet is received towards the stack, to cause the
stack to guide the packet towards that socket subject to local routing
configuration.

This series the successor to previous discussions on-list[0], in-person
at LPC2019[1], and the previous version[2] to support TProxy use cases
more directly from eBPF programs attached at TC ingress, to simplify and
streamline Linux stack configuration in scale environments with Cilium.

Normally in ip{,6}_rcv_core(), the skb will be orphaned, dropping any
existing socket reference associated with the skb. Existing tproxy
implementations in netfilter get around this restriction by running the
tproxy logic after ip_rcv_core() in the PREROUTING table. However, this
is not an option for TC-based logic (including eBPF programs attached at
TC ingress).

This series introduces the BPF helper bpf_sk_assign() to associate the
socket with the skb on the ingress path as the packet is passed up the
stack. The initial patch in the series simply takes a reference on the
socket to ensure safety, but later patches relax this for listen
sockets.

To ensure delivery to the relevant socket, we still consult the routing
table, for full examples of how to configure see the tests in patch #5;
the simplest form of the route would look like this:

  $ ip route add local default dev lo

This series is laid out as follows:
* Patch 1 extends the eBPF API to add sk_assign() and defines a new
  socket free function to allow the later paths to understand when the
  socket associated with the skb should be kept through receive.
* Patches 2-4 optimizate the receive path to prefetch the socket
  destination and avoid taking a reference on listener sockets during
  receive.
* Patches 5 extends the selftests with examples of the new
  functionality and validation of correct behaviour.

Changes since v1:
* Replace the metadata_dst approach with using the skb->destructor to
  determine whether the socket has been prefetched. This is much
  simpler.
* Avoid taking a reference on listener sockets during receive
* Restrict assigning sockets across namespaces
* Restrict assigning SO_REUSEPORT sockets
* Fix cookie usage for socket dst check
* Rebase the tests against test_progs infrastructure
* Tidy up commit messages

[0] https://www.mail-archive.com/netdev@vger.kernel.org/msg303645.html
[1] https://linuxplumbersconf.org/event/4/contributions/464/
[2] https://lore.kernel.org/netdev/20200312233648.1767-1-joe@wand.net.nz/T/#m4028efa9381856049ae5633986ec562d6b94a146


Joe Stringer (4):
  bpf: Add socket assign support
  bpf: Prefetch established socket destinations
  net: Track socket refcounts in skb_steal_sock()
  bpf: Don't refcount LISTEN sockets in sk_assign()

Lorenz Bauer (1):
  selftests: bpf: add test for sk_assign

 include/net/inet6_hashtables.h                |   3 +-
 include/net/inet_hashtables.h                 |   3 +-
 include/net/sock.h                            |  42 ++-
 include/uapi/linux/bpf.h                      |  25 +-
 net/core/filter.c                             |  50 +++-
 net/core/sock.c                               |  10 +
 net/ipv4/ip_input.c                           |   3 +-
 net/ipv4/udp.c                                |   6 +-
 net/ipv6/ip6_input.c                          |   3 +-
 net/ipv6/udp.c                                |   9 +-
 net/sched/act_bpf.c                           |   2 +
 tools/include/uapi/linux/bpf.h                |  25 +-
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
 .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
 15 files changed, 529 insertions(+), 25 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2 bpf-next 1/5] bpf: Add socket assign support
  2020-03-25  5:57 [PATCHv2 bpf-next 0/5] Add bpf_sk_assign eBPF helper Joe Stringer
@ 2020-03-25  5:57 ` Joe Stringer
  2020-03-26  6:23   ` Martin KaFai Lau
  2020-03-26 10:24   ` Lorenz Bauer
  2020-03-25  5:57 ` [PATCHv2 bpf-next 2/5] bpf: Prefetch established socket destinations Joe Stringer
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-25  5:57 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb, kafai

Add support for TPROXY via a new bpf helper, bpf_sk_assign().

This helper requires the BPF program to discover the socket via a call
to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
helper takes its own reference to the socket in addition to any existing
reference that may or may not currently be obtained for the duration of
BPF processing. For the destination socket to receive the traffic, the
traffic must be routed towards that socket via local route. The
simplest example route is below, but in practice you may want to route
traffic more narrowly (eg by CIDR):

  $ ip route add local default dev lo

This patch avoids trying to introduce an extra bit into the skb->sk, as
that would require more invasive changes to all code interacting with
the socket to ensure that the bit is handled correctly, such as all
error-handling cases along the path from the helper in BPF through to
the orphan path in the input. Instead, we opt to use the destructor
variable to switch on the prefetch of the socket.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
v2: Use skb->destructor to determine socket prefetch usage instead of
      introducing a new metadata_dst
    Restrict socket assign to same netns as TC device
    Restrict assigning reuseport sockets
    Adjust commit wording
v1: Initial version
---
 include/net/sock.h             |  7 +++++++
 include/uapi/linux/bpf.h       | 25 ++++++++++++++++++++++++-
 net/core/filter.c              | 31 +++++++++++++++++++++++++++++++
 net/core/sock.c                |  9 +++++++++
 net/ipv4/ip_input.c            |  3 ++-
 net/ipv6/ip6_input.c           |  3 ++-
 net/sched/act_bpf.c            |  2 ++
 tools/include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
 8 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b5cca7bae69b..2613d21a667a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1657,6 +1657,7 @@ struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
 void skb_orphan_partial(struct sk_buff *skb);
 void sock_rfree(struct sk_buff *skb);
 void sock_efree(struct sk_buff *skb);
+void sock_pfree(struct sk_buff *skb);
 #ifdef CONFIG_INET
 void sock_edemux(struct sk_buff *skb);
 #else
@@ -2526,6 +2527,12 @@ void sock_net_set(struct sock *sk, struct net *net)
 	write_pnet(&sk->sk_net, net);
 }
 
+static inline bool
+skb_sk_is_prefetched(struct sk_buff *skb)
+{
+	return skb->destructor == sock_pfree;
+}
+
 static inline struct sock *skb_steal_sock(struct sk_buff *skb)
 {
 	if (skb->sk) {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5d01c5c7e598..0c6f151deebe 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2950,6 +2950,28 @@ union bpf_attr {
  *		restricted to raw_tracepoint bpf programs.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Assign the *sk* to the *skb*. When combined with appropriate
+ *		routing configuration to receive the packet towards the socket,
+ *		will cause *skb* to be delivered to the specified socket.
+ *		Subsequent redirection of *skb* via  **bpf_redirect**\ (),
+ *		**bpf_clone_redirect**\ () or other methods outside of BPF may
+ *		interfere with successful delivery to the socket.
+ *
+ *		This operation is only valid from TC ingress path.
+ *
+ *		The *flags* argument must be zero.
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EINVAL**		Unsupported flags specified.
+ *		* **-ENETUNREACH**	Socket is unreachable (wrong netns).
+ *		* **-ENOENT**		Socket is unavailable for assignment.
+ *		* **-EOPNOTSUPP**	Unsupported operation, for example a
+ *					call from outside of TC ingress.
+ *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3073,7 +3095,8 @@ union bpf_attr {
 	FN(jiffies64),			\
 	FN(read_branch_records),	\
 	FN(get_ns_current_pid_tgid),	\
-	FN(xdp_output),
+	FN(xdp_output),			\
+	FN(sk_assign),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 96350a743539..f7f9b6631f75 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5860,6 +5860,35 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
+{
+	if (flags != 0)
+		return -EINVAL;
+	if (!skb_at_tc_ingress(skb))
+		return -EOPNOTSUPP;
+	if (unlikely(sk->sk_reuseport))
+		return -ESOCKTNOSUPPORT;
+	if (unlikely(dev_net(skb->dev) != sock_net(sk)))
+		return -ENETUNREACH;
+	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
+		return -ENOENT;
+
+	skb_orphan(skb);
+	skb->sk = sk;
+	skb->destructor = sock_pfree;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sk_assign_proto = {
+	.func		= bpf_sk_assign,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_PTR_TO_SOCK_COMMON,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -6153,6 +6182,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_skb_ecn_set_ce_proto;
 	case BPF_FUNC_tcp_gen_syncookie:
 		return &bpf_tcp_gen_syncookie_proto;
+	case BPF_FUNC_sk_assign:
+		return &bpf_sk_assign_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
diff --git a/net/core/sock.c b/net/core/sock.c
index 0fc8937a7ff4..cfaf60267360 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2071,6 +2071,15 @@ void sock_efree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_efree);
 
+/* Buffer destructor for prefetch/receive path where reference count may
+ * not be held, e.g. for listen sockets.
+ */
+void sock_pfree(struct sk_buff *skb)
+{
+	sock_edemux(skb);
+}
+EXPORT_SYMBOL(sock_pfree);
+
 kuid_t sock_i_uid(struct sock *sk)
 {
 	kuid_t uid;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index aa438c6758a7..b0c244af1e4d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -509,7 +509,8 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 	IPCB(skb)->iif = skb->skb_iif;
 
 	/* Must drop socket now because of tproxy. */
-	skb_orphan(skb);
+	if (!skb_sk_is_prefetched(skb))
+		skb_orphan(skb);
 
 	return skb;
 
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 7b089d0ac8cd..e96304d8a4a7 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -285,7 +285,8 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
 	rcu_read_unlock();
 
 	/* Must drop socket now because of tproxy. */
-	skb_orphan(skb);
+	if (!skb_sk_is_prefetched(skb))
+		skb_orphan(skb);
 
 	return skb;
 err:
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 46f47e58b3be..6c7ed8fcc909 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -53,6 +53,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
 		bpf_compute_data_pointers(skb);
 		filter_res = BPF_PROG_RUN(filter, skb);
 	}
+	if (filter_res != TC_ACT_OK)
+		skb_orphan(skb);
 	rcu_read_unlock();
 
 	/* A BPF program may overwrite the default action opcode.
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 5d01c5c7e598..0c6f151deebe 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2950,6 +2950,28 @@ union bpf_attr {
  *		restricted to raw_tracepoint bpf programs.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Assign the *sk* to the *skb*. When combined with appropriate
+ *		routing configuration to receive the packet towards the socket,
+ *		will cause *skb* to be delivered to the specified socket.
+ *		Subsequent redirection of *skb* via  **bpf_redirect**\ (),
+ *		**bpf_clone_redirect**\ () or other methods outside of BPF may
+ *		interfere with successful delivery to the socket.
+ *
+ *		This operation is only valid from TC ingress path.
+ *
+ *		The *flags* argument must be zero.
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EINVAL**		Unsupported flags specified.
+ *		* **-ENETUNREACH**	Socket is unreachable (wrong netns).
+ *		* **-ENOENT**		Socket is unavailable for assignment.
+ *		* **-EOPNOTSUPP**	Unsupported operation, for example a
+ *					call from outside of TC ingress.
+ *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3073,7 +3095,8 @@ union bpf_attr {
 	FN(jiffies64),			\
 	FN(read_branch_records),	\
 	FN(get_ns_current_pid_tgid),	\
-	FN(xdp_output),
+	FN(xdp_output),			\
+	FN(sk_assign),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.20.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2 bpf-next 2/5] bpf: Prefetch established socket destinations
  2020-03-25  5:57 [PATCHv2 bpf-next 0/5] Add bpf_sk_assign eBPF helper Joe Stringer
  2020-03-25  5:57 ` [PATCHv2 bpf-next 1/5] bpf: Add socket assign support Joe Stringer
@ 2020-03-25  5:57 ` Joe Stringer
  2020-03-26 21:11   ` Alexei Starovoitov
  2020-03-25  5:57 ` [PATCHv2 bpf-next 3/5] net: Track socket refcounts in skb_steal_sock() Joe Stringer
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Joe Stringer @ 2020-03-25  5:57 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb, kafai

Enhance the sk_assign logic to temporarily store the socket
receive destination, to save the route lookup later on. The dst
reference is kept alive by the caller's socket reference.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
v2: Provide cookie to dst_check() for IPv6 case
v1: Initial version
---
 net/core/filter.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index f7f9b6631f75..0fada7fe9b75 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5876,6 +5876,21 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
 	skb_orphan(skb);
 	skb->sk = sk;
 	skb->destructor = sock_pfree;
+	if (sk_fullsock(sk)) {
+		struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
+		u32 cookie = 0;
+
+#if IS_ENABLED(CONFIG_IPV6)
+		if (sk->sk_family == AF_INET6)
+			cookie = inet6_sk(sk)->rx_dst_cookie;
+#endif
+		if (dst)
+			dst = dst_check(dst, cookie);
+		if (dst) {
+			skb_dst_drop(skb);
+			skb_dst_set_noref(skb, dst);
+		}
+	}
 
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2 bpf-next 3/5] net: Track socket refcounts in skb_steal_sock()
  2020-03-25  5:57 [PATCHv2 bpf-next 0/5] Add bpf_sk_assign eBPF helper Joe Stringer
  2020-03-25  5:57 ` [PATCHv2 bpf-next 1/5] bpf: Add socket assign support Joe Stringer
  2020-03-25  5:57 ` [PATCHv2 bpf-next 2/5] bpf: Prefetch established socket destinations Joe Stringer
@ 2020-03-25  5:57 ` Joe Stringer
  2020-03-25  5:57 ` [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign() Joe Stringer
  2020-03-25  5:57 ` [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign Joe Stringer
  4 siblings, 0 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-25  5:57 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb, kafai

Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
the determination of whether the socket is reference counted in the case
where it is prefetched by earlier logic such as early_demux or
dst_sk_prefetch.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
v2: Initial version
---
 include/net/inet6_hashtables.h |  3 +--
 include/net/inet_hashtables.h  |  3 +--
 include/net/sock.h             | 10 +++++++++-
 net/ipv4/udp.c                 |  6 ++++--
 net/ipv6/udp.c                 |  9 ++++++---
 5 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index fe96bf247aac..81b965953036 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -85,9 +85,8 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 					      int iif, int sdif,
 					      bool *refcounted)
 {
-	struct sock *sk = skb_steal_sock(skb);
+	struct sock *sk = skb_steal_sock(skb, refcounted);
 
-	*refcounted = true;
 	if (sk)
 		return sk;
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index d0019d3395cf..ad64ba6a057f 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -379,10 +379,9 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
 					     const int sdif,
 					     bool *refcounted)
 {
-	struct sock *sk = skb_steal_sock(skb);
+	struct sock *sk = skb_steal_sock(skb, refcounted);
 	const struct iphdr *iph = ip_hdr(skb);
 
-	*refcounted = true;
 	if (sk)
 		return sk;
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 2613d21a667a..1ca2e808cb8e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2533,15 +2533,23 @@ skb_sk_is_prefetched(struct sk_buff *skb)
 	return skb->destructor == sock_pfree;
 }
 
-static inline struct sock *skb_steal_sock(struct sk_buff *skb)
+/**
+ * skb_steal_sock
+ * @skb to steal the socket from
+ * @refcounted is set to true if the socket is reference-counted
+ */
+static inline struct sock *
+skb_steal_sock(struct sk_buff *skb, bool *refcounted)
 {
 	if (skb->sk) {
 		struct sock *sk = skb->sk;
 
+		*refcounted = true;
 		skb->destructor = NULL;
 		skb->sk = NULL;
 		return sk;
 	}
+	*refcounted = false;
 	return NULL;
 }
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2633fc231593..b4035021bbd3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2288,6 +2288,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct rtable *rt = skb_rtable(skb);
 	__be32 saddr, daddr;
 	struct net *net = dev_net(skb->dev);
+	bool refcounted;
 
 	/*
 	 *  Validate the packet.
@@ -2313,7 +2314,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (udp4_csum_init(skb, uh, proto))
 		goto csum_error;
 
-	sk = skb_steal_sock(skb);
+	sk = skb_steal_sock(skb, &refcounted);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
@@ -2322,7 +2323,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			udp_sk_rx_dst_set(sk, dst);
 
 		ret = udp_unicast_rcv_skb(sk, skb, uh);
-		sock_put(sk);
+		if (refcounted)
+			sock_put(sk);
 		return ret;
 	}
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 5dc439a391fe..7d4151747340 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -843,6 +843,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct net *net = dev_net(skb->dev);
 	struct udphdr *uh;
 	struct sock *sk;
+	bool refcounted;
 	u32 ulen = 0;
 
 	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -879,7 +880,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		goto csum_error;
 
 	/* Check if the socket is already available, e.g. due to early demux */
-	sk = skb_steal_sock(skb);
+	sk = skb_steal_sock(skb, &refcounted);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
@@ -888,12 +889,14 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			udp6_sk_rx_dst_set(sk, dst);
 
 		if (!uh->check && !udp_sk(sk)->no_check6_rx) {
-			sock_put(sk);
+			if (refcounted)
+				sock_put(sk);
 			goto report_csum_error;
 		}
 
 		ret = udp6_unicast_rcv_skb(sk, skb, uh);
-		sock_put(sk);
+		if (refcounted)
+			sock_put(sk);
 		return ret;
 	}
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign()
  2020-03-25  5:57 [PATCHv2 bpf-next 0/5] Add bpf_sk_assign eBPF helper Joe Stringer
                   ` (2 preceding siblings ...)
  2020-03-25  5:57 ` [PATCHv2 bpf-next 3/5] net: Track socket refcounts in skb_steal_sock() Joe Stringer
@ 2020-03-25  5:57 ` Joe Stringer
  2020-03-25 10:29   ` Lorenz Bauer
  2020-03-25  5:57 ` [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign Joe Stringer
  4 siblings, 1 reply; 41+ messages in thread
From: Joe Stringer @ 2020-03-25  5:57 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb, kafai

Avoid taking a reference on listen sockets by checking the socket type
in the sk_assign and in the corresponding skb_steal_sock() code in the
the transport layer, and by ensuring that the prefetch free (sock_pfree)
function uses the same logic to check whether the socket is refcounted.

Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
v2: Initial version
---
 include/net/sock.h | 25 +++++++++++++++++--------
 net/core/filter.c  |  6 +++---
 net/core/sock.c    |  3 ++-
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1ca2e808cb8e..3ec1865f173e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2533,6 +2533,21 @@ skb_sk_is_prefetched(struct sk_buff *skb)
 	return skb->destructor == sock_pfree;
 }
 
+/* This helper checks if a socket is a full socket,
+ * ie _not_ a timewait or request socket.
+ */
+static inline bool sk_fullsock(const struct sock *sk)
+{
+	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
+}
+
+static inline bool
+sk_is_refcounted(struct sock *sk)
+{
+	/* Only full sockets have sk->sk_flags. */
+	return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
+}
+
 /**
  * skb_steal_sock
  * @skb to steal the socket from
@@ -2545,6 +2560,8 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
 		struct sock *sk = skb->sk;
 
 		*refcounted = true;
+		if (skb_sk_is_prefetched(skb))
+			*refcounted = sk_is_refcounted(sk);
 		skb->destructor = NULL;
 		skb->sk = NULL;
 		return sk;
@@ -2553,14 +2570,6 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
 	return NULL;
 }
 
-/* This helper checks if a socket is a full socket,
- * ie _not_ a timewait or request socket.
- */
-static inline bool sk_fullsock(const struct sock *sk)
-{
-	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
-}
-
 /* Checks if this SKB belongs to an HW offloaded socket
  * and whether any SW fallbacks are required based on dev.
  * Check decrypted mark in case skb_orphan() cleared socket.
diff --git a/net/core/filter.c b/net/core/filter.c
index 0fada7fe9b75..997b8606167e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5343,8 +5343,7 @@ static const struct bpf_func_proto bpf_sk_lookup_udp_proto = {
 
 BPF_CALL_1(bpf_sk_release, struct sock *, sk)
 {
-	/* Only full sockets have sk->sk_flags. */
-	if (!sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE))
+	if (sk_is_refcounted(sk))
 		sock_gen_put(sk);
 	return 0;
 }
@@ -5870,7 +5869,8 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
 		return -ESOCKTNOSUPPORT;
 	if (unlikely(dev_net(skb->dev) != sock_net(sk)))
 		return -ENETUNREACH;
-	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
+	if (sk_is_refcounted(sk) &&
+	    unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
 		return -ENOENT;
 
 	skb_orphan(skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index cfaf60267360..a2ab79446f59 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2076,7 +2076,8 @@ EXPORT_SYMBOL(sock_efree);
  */
 void sock_pfree(struct sk_buff *skb)
 {
-	sock_edemux(skb);
+	if (sk_is_refcounted(skb->sk))
+		sock_edemux(skb);
 }
 EXPORT_SYMBOL(sock_pfree);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25  5:57 [PATCHv2 bpf-next 0/5] Add bpf_sk_assign eBPF helper Joe Stringer
                   ` (3 preceding siblings ...)
  2020-03-25  5:57 ` [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign() Joe Stringer
@ 2020-03-25  5:57 ` Joe Stringer
  2020-03-25 10:35   ` Lorenz Bauer
                     ` (3 more replies)
  4 siblings, 4 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-25  5:57 UTC (permalink / raw)
  To: bpf; +Cc: Lorenz Bauer, netdev, daniel, ast, eric.dumazet, kafai

From: Lorenz Bauer <lmb@cloudflare.com>

Attach a tc direct-action classifier to lo in a fresh network
namespace, and rewrite all connection attempts to localhost:4321
to localhost:1234 (for port tests) and connections to unreachable
IPv4/IPv6 IPs to the local socket (for address tests).

Keep in mind that both client to server and server to client traffic
passes the classifier.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Co-authored-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
v2: Rebase onto test_progs infrastructure
v1: Initial commit
---
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
 .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
 3 files changed, 372 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 7729892e0b04..4f7f83d059ca 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -76,7 +76,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
-	test_lirc_mode2_user xdping test_cpp runqslower
+	test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign
 
 TEST_CUSTOM_PROGS = urandom_read
 
diff --git a/tools/testing/selftests/bpf/prog_tests/sk_assign.c b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
new file mode 100644
index 000000000000..1f0afcc20c48
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
@@ -0,0 +1,244 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+// Copyright (c) 2019 Cloudflare
+// Copyright (c) 2020 Isovalent, Inc.
+/*
+ * Test that the socket assign program is able to redirect traffic towards a
+ * socket, regardless of whether the port or address destination of the traffic
+ * matches the port.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "test_progs.h"
+
+#define TEST_DPORT 4321
+#define TEST_DADDR (0xC0A80203)
+#define NS_SELF "/proc/self/ns/net"
+
+static __u32 duration;
+
+static bool configure_stack(int self_net)
+{
+	/* Move to a new networking namespace */
+	if (CHECK_FAIL(unshare(CLONE_NEWNET)))
+		return false;
+
+	/* Configure necessary links, routes */
+	if (CHECK_FAIL(system("ip link set dev lo up")))
+		return false;
+	if (CHECK_FAIL(system("ip route add local default dev lo")))
+		return false;
+	if (CHECK_FAIL(system("ip -6 route add local default dev lo")))
+		return false;
+
+	/* Load qdisc, BPF program */
+	if (CHECK_FAIL(system("tc qdisc add dev lo clsact")))
+		return false;
+	if (CHECK_FAIL(system("tc filter add dev lo ingress bpf direct-action "
+		     "object-file ./test_sk_assign.o section sk_assign_test")))
+		return false;
+
+	return true;
+}
+
+static int start_server(const struct sockaddr *addr, socklen_t len)
+{
+	int fd;
+
+	fd = socket(addr->sa_family, SOCK_STREAM, 0);
+	if (CHECK_FAIL(fd == -1))
+		goto out;
+	if (CHECK_FAIL(bind(fd, addr, len) == -1))
+		goto close_out;
+	if (CHECK_FAIL(listen(fd, 128) == -1))
+		goto close_out;
+
+	goto out;
+
+close_out:
+	close(fd);
+	fd = -1;
+out:
+	return fd;
+}
+
+static void handle_timeout(int signum)
+{
+	if (signum == SIGALRM)
+		fprintf(stderr, "Timed out while connecting to server\n");
+	kill(0, SIGKILL);
+}
+
+static struct sigaction timeout_action = {
+	.sa_handler = handle_timeout,
+};
+
+static int connect_to_server(const struct sockaddr *addr, socklen_t len)
+{
+	int fd = -1;
+
+	fd = socket(addr->sa_family, SOCK_STREAM, 0);
+	if (CHECK_FAIL(fd == -1))
+		goto out;
+	if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
+		goto out;
+	alarm(3);
+	if (CHECK_FAIL(connect(fd, addr, len) == -1))
+		goto close_out;
+
+	goto out;
+
+close_out:
+	close(fd);
+	fd = -1;
+out:
+	return fd;
+}
+
+static in_port_t get_port(int fd)
+{
+	struct sockaddr_storage name;
+	socklen_t len;
+	in_port_t port = 0;
+
+	len = sizeof(name);
+	if (CHECK_FAIL(getsockname(fd, (struct sockaddr *)&name, &len)))
+		return port;
+
+	switch (name.ss_family) {
+	case AF_INET:
+		port = ((struct sockaddr_in *)&name)->sin_port;
+		break;
+	case AF_INET6:
+		port = ((struct sockaddr_in6 *)&name)->sin6_port;
+		break;
+	default:
+		CHECK(1, "Invalid address family", "%d\n", name.ss_family);
+	}
+	return port;
+}
+
+static int run_test(int server_fd, const struct sockaddr *addr, socklen_t len)
+{
+	int client = -1, srv_client = -1;
+	char buf[] = "testing";
+	in_port_t port;
+	int ret = 1;
+
+	client = connect_to_server(addr, len);
+	if (client == -1) {
+		perror("Cannot connect to server");
+		goto out;
+	}
+
+	srv_client = accept(server_fd, NULL, NULL);
+	if (CHECK_FAIL(srv_client == -1)) {
+		perror("Can't accept connection");
+		goto out;
+	}
+	if (CHECK_FAIL(write(client, buf, sizeof(buf)) != sizeof(buf))) {
+		perror("Can't write on client");
+		goto out;
+	}
+	if (CHECK_FAIL(read(srv_client, buf, sizeof(buf)) != sizeof(buf))) {
+		perror("Can't read on server");
+		goto out;
+	}
+
+	port = get_port(srv_client);
+	if (CHECK_FAIL(!port))
+		goto out;
+	if (CHECK(port != htons(TEST_DPORT), "Expected", "port %u but got %u",
+		  TEST_DPORT, ntohs(port)))
+		goto out;
+
+	ret = 0;
+out:
+	close(client);
+	close(srv_client);
+	return ret;
+}
+
+static int do_sk_assign(void)
+{
+	struct sockaddr_in addr4;
+	struct sockaddr_in6 addr6;
+	int server = -1;
+	int server_v6 = -1;
+	int err = 1;
+
+	memset(&addr4, 0, sizeof(addr4));
+	addr4.sin_family = AF_INET;
+	addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+	addr4.sin_port = htons(1234);
+
+	memset(&addr6, 0, sizeof(addr6));
+	addr6.sin6_family = AF_INET6;
+	addr6.sin6_addr = in6addr_loopback;
+	addr6.sin6_port = htons(1234);
+
+	server = start_server((const struct sockaddr *)&addr4, sizeof(addr4));
+	if (server == -1)
+		goto out;
+
+	server_v6 = start_server((const struct sockaddr *)&addr6,
+				 sizeof(addr6));
+	if (server_v6 == -1)
+		goto out;
+
+	/* Connect to unbound ports */
+	addr4.sin_port = htons(TEST_DPORT);
+	addr6.sin6_port = htons(TEST_DPORT);
+
+	test__start_subtest("ipv4 port redir");
+	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
+		goto out;
+
+	test__start_subtest("ipv6 port redir");
+	if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
+		goto out;
+
+	/* Connect to unbound addresses */
+	addr4.sin_addr.s_addr = htonl(TEST_DADDR);
+	addr6.sin6_addr.s6_addr32[3] = htonl(TEST_DADDR);
+
+	test__start_subtest("ipv4 addr redir");
+	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
+		goto out;
+
+	test__start_subtest("ipv6 addr redir");
+	if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
+		goto out;
+
+	err = 0;
+out:
+	close(server);
+	close(server_v6);
+	return err;
+}
+
+void test_sk_assign(void)
+{
+	int self_net;
+
+	self_net = open(NS_SELF, O_RDONLY);
+	if (CHECK_FAIL(self_net < 0)) {
+		perror("Unable to open "NS_SELF);
+		return;
+	}
+
+	if (!configure_stack(self_net)) {
+		perror("configure_stack");
+		goto cleanup;
+	}
+
+	do_sk_assign();
+
+cleanup:
+	close(self_net);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_sk_assign.c b/tools/testing/selftests/bpf/progs/test_sk_assign.c
new file mode 100644
index 000000000000..7de30ad3f594
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sk_assign.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Cloudflare Ltd.
+
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/pkt_cls.h>
+#include <linux/tcp.h>
+#include <sys/socket.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
+
+/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
+static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
+					void *data_end, __u16 eth_proto,
+					bool *ipv4)
+{
+	struct bpf_sock_tuple *result;
+	__u8 proto = 0;
+	__u64 ihl_len;
+
+	if (eth_proto == bpf_htons(ETH_P_IP)) {
+		struct iphdr *iph = (struct iphdr *)(data + nh_off);
+
+		if (iph + 1 > data_end)
+			return NULL;
+		if (iph->ihl != 5)
+			/* Options are not supported */
+			return NULL;
+		ihl_len = iph->ihl * 4;
+		proto = iph->protocol;
+		*ipv4 = true;
+		result = (struct bpf_sock_tuple *)&iph->saddr;
+	} else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
+		struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
+
+		if (ip6h + 1 > data_end)
+			return NULL;
+		ihl_len = sizeof(*ip6h);
+		proto = ip6h->nexthdr;
+		*ipv4 = false;
+		result = (struct bpf_sock_tuple *)&ip6h->saddr;
+	} else {
+		return NULL;
+	}
+
+	if (result + 1 > data_end || proto != IPPROTO_TCP)
+		return NULL;
+
+	return result;
+}
+
+SEC("sk_assign_test")
+int bpf_sk_assign_test(struct __sk_buff *skb)
+{
+	void *data_end = (void *)(long)skb->data_end;
+	void *data = (void *)(long)skb->data;
+	struct ethhdr *eth = (struct ethhdr *)(data);
+	struct bpf_sock_tuple *tuple, ln = {0};
+	struct bpf_sock *sk;
+	int tuple_len;
+	bool ipv4;
+	int ret;
+
+	if (eth + 1 > data_end)
+		return TC_ACT_SHOT;
+
+	tuple = get_tuple(data, sizeof(*eth), data_end, eth->h_proto, &ipv4);
+	if (!tuple)
+		return TC_ACT_SHOT;
+
+	tuple_len = ipv4 ? sizeof(tuple->ipv4) : sizeof(tuple->ipv6);
+	sk = bpf_skc_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
+	if (sk) {
+		if (sk->state != BPF_TCP_LISTEN)
+			goto assign;
+
+		bpf_sk_release(sk);
+	}
+
+	if (ipv4) {
+		if (tuple->ipv4.dport != bpf_htons(4321))
+			return TC_ACT_OK;
+
+		ln.ipv4.daddr = bpf_htonl(0x7f000001);
+		ln.ipv4.dport = bpf_htons(1234);
+
+		sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
+					BPF_F_CURRENT_NETNS, 0);
+	} else {
+		if (tuple->ipv6.dport != bpf_htons(4321))
+			return TC_ACT_OK;
+
+		/* Upper parts of daddr are already zero. */
+		ln.ipv6.daddr[3] = bpf_htonl(0x1);
+		ln.ipv6.dport = bpf_htons(1234);
+
+		sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
+					BPF_F_CURRENT_NETNS, 0);
+	}
+
+	/* We can't do a single skc_lookup_tcp here, because then the compiler
+	 * will likely spill tuple_len to the stack. This makes it lose all
+	 * bounds information in the verifier, which then rejects the call as
+	 * unsafe.
+	 */
+	if (!sk)
+		return TC_ACT_SHOT;
+
+	if (sk->state != BPF_TCP_LISTEN) {
+		bpf_sk_release(sk);
+		return TC_ACT_SHOT;
+	}
+
+assign:
+	ret = bpf_sk_assign(skb, sk, 0);
+	bpf_sk_release(sk);
+	return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
+}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign()
  2020-03-25  5:57 ` [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign() Joe Stringer
@ 2020-03-25 10:29   ` Lorenz Bauer
  2020-03-25 20:46     ` Joe Stringer
  0 siblings, 1 reply; 41+ messages in thread
From: Lorenz Bauer @ 2020-03-25 10:29 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Martin Lau

On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
>
> Avoid taking a reference on listen sockets by checking the socket type
> in the sk_assign and in the corresponding skb_steal_sock() code in the
> the transport layer, and by ensuring that the prefetch free (sock_pfree)
> function uses the same logic to check whether the socket is refcounted.
>
> Suggested-by: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> v2: Initial version
> ---
>  include/net/sock.h | 25 +++++++++++++++++--------
>  net/core/filter.c  |  6 +++---
>  net/core/sock.c    |  3 ++-
>  3 files changed, 22 insertions(+), 12 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 1ca2e808cb8e..3ec1865f173e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2533,6 +2533,21 @@ skb_sk_is_prefetched(struct sk_buff *skb)
>         return skb->destructor == sock_pfree;
>  }
>
> +/* This helper checks if a socket is a full socket,
> + * ie _not_ a timewait or request socket.
> + */
> +static inline bool sk_fullsock(const struct sock *sk)
> +{
> +       return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> +}
> +
> +static inline bool
> +sk_is_refcounted(struct sock *sk)
> +{
> +       /* Only full sockets have sk->sk_flags. */
> +       return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
> +}
> +
>  /**
>   * skb_steal_sock
>   * @skb to steal the socket from
> @@ -2545,6 +2560,8 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
>                 struct sock *sk = skb->sk;
>
>                 *refcounted = true;
> +               if (skb_sk_is_prefetched(skb))
> +                       *refcounted = sk_is_refcounted(sk);
>                 skb->destructor = NULL;
>                 skb->sk = NULL;
>                 return sk;
> @@ -2553,14 +2570,6 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
>         return NULL;
>  }
>
> -/* This helper checks if a socket is a full socket,
> - * ie _not_ a timewait or request socket.
> - */
> -static inline bool sk_fullsock(const struct sock *sk)
> -{
> -       return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> -}
> -
>  /* Checks if this SKB belongs to an HW offloaded socket
>   * and whether any SW fallbacks are required based on dev.
>   * Check decrypted mark in case skb_orphan() cleared socket.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 0fada7fe9b75..997b8606167e 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5343,8 +5343,7 @@ static const struct bpf_func_proto bpf_sk_lookup_udp_proto = {
>
>  BPF_CALL_1(bpf_sk_release, struct sock *, sk)
>  {
> -       /* Only full sockets have sk->sk_flags. */
> -       if (!sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE))
> +       if (sk_is_refcounted(sk))
>                 sock_gen_put(sk);
>         return 0;
>  }
> @@ -5870,7 +5869,8 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
>                 return -ESOCKTNOSUPPORT;
>         if (unlikely(dev_net(skb->dev) != sock_net(sk)))
>                 return -ENETUNREACH;
> -       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> +       if (sk_is_refcounted(sk) &&
> +           unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
>                 return -ENOENT;
>
>         skb_orphan(skb);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index cfaf60267360..a2ab79446f59 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2076,7 +2076,8 @@ EXPORT_SYMBOL(sock_efree);
>   */
>  void sock_pfree(struct sk_buff *skb)
>  {
> -       sock_edemux(skb);
> +       if (sk_is_refcounted(skb->sk))
> +               sock_edemux(skb);

sock_edemux calls sock_gen_put, which is also called by
bpf_sk_release. Is it worth teaching sock_gen_put about
sk_fullsock, and dropping the other helpers? I was considering this
when fixing up sk_release, but then forgot
about it.

>  }
>  EXPORT_SYMBOL(sock_pfree);
>
> --
> 2.20.1
>


-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25  5:57 ` [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign Joe Stringer
@ 2020-03-25 10:35   ` Lorenz Bauer
  2020-03-25 20:55     ` Joe Stringer
  2020-03-25 18:17   ` Yonghong Song
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Lorenz Bauer @ 2020-03-25 10:35 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Martin Lau

On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
>
> From: Lorenz Bauer <lmb@cloudflare.com>
>
> Attach a tc direct-action classifier to lo in a fresh network
> namespace, and rewrite all connection attempts to localhost:4321
> to localhost:1234 (for port tests) and connections to unreachable
> IPv4/IPv6 IPs to the local socket (for address tests).

Can you extend this to cover UDP as well?

>
> Keep in mind that both client to server and server to client traffic
> passes the classifier.
>
> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> Co-authored-by: Joe Stringer <joe@wand.net.nz>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> v2: Rebase onto test_progs infrastructure
> v1: Initial commit
> ---
>  tools/testing/selftests/bpf/Makefile          |   2 +-
>  .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
>  .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
>  3 files changed, 372 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 7729892e0b04..4f7f83d059ca 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -76,7 +76,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
>  # Compile but not part of 'make run_tests'
>  TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
>         flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
> -       test_lirc_mode2_user xdping test_cpp runqslower
> +       test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign
>
>  TEST_CUSTOM_PROGS = urandom_read
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_assign.c b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> new file mode 100644
> index 000000000000..1f0afcc20c48
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> @@ -0,0 +1,244 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2018 Facebook
> +// Copyright (c) 2019 Cloudflare
> +// Copyright (c) 2020 Isovalent, Inc.
> +/*
> + * Test that the socket assign program is able to redirect traffic towards a
> + * socket, regardless of whether the port or address destination of the traffic
> + * matches the port.
> + */
> +
> +#define _GNU_SOURCE
> +#include <fcntl.h>
> +#include <signal.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +
> +#include "test_progs.h"
> +
> +#define TEST_DPORT 4321
> +#define TEST_DADDR (0xC0A80203)
> +#define NS_SELF "/proc/self/ns/net"
> +
> +static __u32 duration;
> +
> +static bool configure_stack(int self_net)
> +{
> +       /* Move to a new networking namespace */
> +       if (CHECK_FAIL(unshare(CLONE_NEWNET)))
> +               return false;
> +
> +       /* Configure necessary links, routes */
> +       if (CHECK_FAIL(system("ip link set dev lo up")))
> +               return false;
> +       if (CHECK_FAIL(system("ip route add local default dev lo")))
> +               return false;
> +       if (CHECK_FAIL(system("ip -6 route add local default dev lo")))
> +               return false;
> +
> +       /* Load qdisc, BPF program */
> +       if (CHECK_FAIL(system("tc qdisc add dev lo clsact")))
> +               return false;
> +       if (CHECK_FAIL(system("tc filter add dev lo ingress bpf direct-action "
> +                    "object-file ./test_sk_assign.o section sk_assign_test")))
> +               return false;
> +
> +       return true;
> +}
> +
> +static int start_server(const struct sockaddr *addr, socklen_t len)
> +{
> +       int fd;
> +
> +       fd = socket(addr->sa_family, SOCK_STREAM, 0);
> +       if (CHECK_FAIL(fd == -1))
> +               goto out;
> +       if (CHECK_FAIL(bind(fd, addr, len) == -1))
> +               goto close_out;
> +       if (CHECK_FAIL(listen(fd, 128) == -1))
> +               goto close_out;
> +
> +       goto out;
> +
> +close_out:
> +       close(fd);
> +       fd = -1;
> +out:
> +       return fd;
> +}
> +
> +static void handle_timeout(int signum)
> +{
> +       if (signum == SIGALRM)
> +               fprintf(stderr, "Timed out while connecting to server\n");
> +       kill(0, SIGKILL);
> +}
> +
> +static struct sigaction timeout_action = {
> +       .sa_handler = handle_timeout,
> +};
> +
> +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> +{
> +       int fd = -1;
> +
> +       fd = socket(addr->sa_family, SOCK_STREAM, 0);
> +       if (CHECK_FAIL(fd == -1))
> +               goto out;
> +       if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> +               goto out;
> +       alarm(3);
> +       if (CHECK_FAIL(connect(fd, addr, len) == -1))
> +               goto close_out;
> +
> +       goto out;
> +
> +close_out:
> +       close(fd);
> +       fd = -1;
> +out:
> +       return fd;
> +}
> +
> +static in_port_t get_port(int fd)
> +{
> +       struct sockaddr_storage name;
> +       socklen_t len;
> +       in_port_t port = 0;
> +
> +       len = sizeof(name);
> +       if (CHECK_FAIL(getsockname(fd, (struct sockaddr *)&name, &len)))
> +               return port;
> +
> +       switch (name.ss_family) {
> +       case AF_INET:
> +               port = ((struct sockaddr_in *)&name)->sin_port;
> +               break;
> +       case AF_INET6:
> +               port = ((struct sockaddr_in6 *)&name)->sin6_port;
> +               break;
> +       default:
> +               CHECK(1, "Invalid address family", "%d\n", name.ss_family);
> +       }
> +       return port;
> +}
> +
> +static int run_test(int server_fd, const struct sockaddr *addr, socklen_t len)
> +{
> +       int client = -1, srv_client = -1;
> +       char buf[] = "testing";
> +       in_port_t port;
> +       int ret = 1;
> +
> +       client = connect_to_server(addr, len);
> +       if (client == -1) {
> +               perror("Cannot connect to server");
> +               goto out;
> +       }
> +
> +       srv_client = accept(server_fd, NULL, NULL);
> +       if (CHECK_FAIL(srv_client == -1)) {
> +               perror("Can't accept connection");
> +               goto out;
> +       }
> +       if (CHECK_FAIL(write(client, buf, sizeof(buf)) != sizeof(buf))) {
> +               perror("Can't write on client");
> +               goto out;
> +       }
> +       if (CHECK_FAIL(read(srv_client, buf, sizeof(buf)) != sizeof(buf))) {
> +               perror("Can't read on server");
> +               goto out;
> +       }
> +
> +       port = get_port(srv_client);
> +       if (CHECK_FAIL(!port))
> +               goto out;
> +       if (CHECK(port != htons(TEST_DPORT), "Expected", "port %u but got %u",
> +                 TEST_DPORT, ntohs(port)))
> +               goto out;
> +
> +       ret = 0;
> +out:
> +       close(client);
> +       close(srv_client);
> +       return ret;
> +}
> +
> +static int do_sk_assign(void)
> +{
> +       struct sockaddr_in addr4;
> +       struct sockaddr_in6 addr6;
> +       int server = -1;
> +       int server_v6 = -1;
> +       int err = 1;
> +
> +       memset(&addr4, 0, sizeof(addr4));
> +       addr4.sin_family = AF_INET;
> +       addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> +       addr4.sin_port = htons(1234);
> +
> +       memset(&addr6, 0, sizeof(addr6));
> +       addr6.sin6_family = AF_INET6;
> +       addr6.sin6_addr = in6addr_loopback;
> +       addr6.sin6_port = htons(1234);
> +
> +       server = start_server((const struct sockaddr *)&addr4, sizeof(addr4));
> +       if (server == -1)
> +               goto out;
> +
> +       server_v6 = start_server((const struct sockaddr *)&addr6,
> +                                sizeof(addr6));
> +       if (server_v6 == -1)
> +               goto out;
> +
> +       /* Connect to unbound ports */
> +       addr4.sin_port = htons(TEST_DPORT);
> +       addr6.sin6_port = htons(TEST_DPORT);
> +
> +       test__start_subtest("ipv4 port redir");
> +       if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> +               goto out;
> +
> +       test__start_subtest("ipv6 port redir");
> +       if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> +               goto out;
> +
> +       /* Connect to unbound addresses */
> +       addr4.sin_addr.s_addr = htonl(TEST_DADDR);
> +       addr6.sin6_addr.s6_addr32[3] = htonl(TEST_DADDR);
> +
> +       test__start_subtest("ipv4 addr redir");
> +       if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> +               goto out;
> +
> +       test__start_subtest("ipv6 addr redir");
> +       if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> +               goto out;
> +
> +       err = 0;
> +out:
> +       close(server);
> +       close(server_v6);
> +       return err;
> +}
> +
> +void test_sk_assign(void)
> +{
> +       int self_net;
> +
> +       self_net = open(NS_SELF, O_RDONLY);
> +       if (CHECK_FAIL(self_net < 0)) {
> +               perror("Unable to open "NS_SELF);
> +               return;
> +       }
> +
> +       if (!configure_stack(self_net)) {
> +               perror("configure_stack");
> +               goto cleanup;
> +       }
> +
> +       do_sk_assign();
> +
> +cleanup:
> +       close(self_net);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_sk_assign.c b/tools/testing/selftests/bpf/progs/test_sk_assign.c
> new file mode 100644
> index 000000000000..7de30ad3f594
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_sk_assign.c
> @@ -0,0 +1,127 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2019 Cloudflare Ltd.
> +
> +#include <stddef.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <linux/bpf.h>
> +#include <linux/if_ether.h>
> +#include <linux/in.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/pkt_cls.h>
> +#include <linux/tcp.h>
> +#include <sys/socket.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +int _version SEC("version") = 1;
> +char _license[] SEC("license") = "GPL";
> +
> +/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
> +static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
> +                                       void *data_end, __u16 eth_proto,
> +                                       bool *ipv4)
> +{
> +       struct bpf_sock_tuple *result;
> +       __u8 proto = 0;
> +       __u64 ihl_len;
> +
> +       if (eth_proto == bpf_htons(ETH_P_IP)) {
> +               struct iphdr *iph = (struct iphdr *)(data + nh_off);
> +
> +               if (iph + 1 > data_end)
> +                       return NULL;
> +               if (iph->ihl != 5)
> +                       /* Options are not supported */
> +                       return NULL;
> +               ihl_len = iph->ihl * 4;
> +               proto = iph->protocol;
> +               *ipv4 = true;
> +               result = (struct bpf_sock_tuple *)&iph->saddr;
> +       } else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
> +               struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
> +
> +               if (ip6h + 1 > data_end)
> +                       return NULL;
> +               ihl_len = sizeof(*ip6h);
> +               proto = ip6h->nexthdr;
> +               *ipv4 = false;
> +               result = (struct bpf_sock_tuple *)&ip6h->saddr;
> +       } else {
> +               return NULL;
> +       }
> +
> +       if (result + 1 > data_end || proto != IPPROTO_TCP)
> +               return NULL;
> +
> +       return result;
> +}
> +
> +SEC("sk_assign_test")
> +int bpf_sk_assign_test(struct __sk_buff *skb)
> +{
> +       void *data_end = (void *)(long)skb->data_end;
> +       void *data = (void *)(long)skb->data;
> +       struct ethhdr *eth = (struct ethhdr *)(data);
> +       struct bpf_sock_tuple *tuple, ln = {0};
> +       struct bpf_sock *sk;
> +       int tuple_len;
> +       bool ipv4;
> +       int ret;
> +
> +       if (eth + 1 > data_end)
> +               return TC_ACT_SHOT;
> +
> +       tuple = get_tuple(data, sizeof(*eth), data_end, eth->h_proto, &ipv4);
> +       if (!tuple)
> +               return TC_ACT_SHOT;
> +
> +       tuple_len = ipv4 ? sizeof(tuple->ipv4) : sizeof(tuple->ipv6);
> +       sk = bpf_skc_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
> +       if (sk) {
> +               if (sk->state != BPF_TCP_LISTEN)
> +                       goto assign;
> +
> +               bpf_sk_release(sk);
> +       }
> +
> +       if (ipv4) {
> +               if (tuple->ipv4.dport != bpf_htons(4321))
> +                       return TC_ACT_OK;
> +
> +               ln.ipv4.daddr = bpf_htonl(0x7f000001);
> +               ln.ipv4.dport = bpf_htons(1234);
> +
> +               sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
> +                                       BPF_F_CURRENT_NETNS, 0);
> +       } else {
> +               if (tuple->ipv6.dport != bpf_htons(4321))
> +                       return TC_ACT_OK;
> +
> +               /* Upper parts of daddr are already zero. */
> +               ln.ipv6.daddr[3] = bpf_htonl(0x1);
> +               ln.ipv6.dport = bpf_htons(1234);
> +
> +               sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
> +                                       BPF_F_CURRENT_NETNS, 0);
> +       }
> +
> +       /* We can't do a single skc_lookup_tcp here, because then the compiler
> +        * will likely spill tuple_len to the stack. This makes it lose all
> +        * bounds information in the verifier, which then rejects the call as
> +        * unsafe.
> +        */
> +       if (!sk)
> +               return TC_ACT_SHOT;
> +
> +       if (sk->state != BPF_TCP_LISTEN) {
> +               bpf_sk_release(sk);
> +               return TC_ACT_SHOT;
> +       }
> +
> +assign:
> +       ret = bpf_sk_assign(skb, sk, 0);
> +       bpf_sk_release(sk);
> +       return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
> +}
> --
> 2.20.1
>


-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25  5:57 ` [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign Joe Stringer
  2020-03-25 10:35   ` Lorenz Bauer
@ 2020-03-25 18:17   ` Yonghong Song
  2020-03-25 21:20     ` Joe Stringer
  2020-03-26 10:13     ` Lorenz Bauer
  2020-03-26  2:04   ` Andrii Nakryiko
  2020-03-26  2:16   ` Andrii Nakryiko
  3 siblings, 2 replies; 41+ messages in thread
From: Yonghong Song @ 2020-03-25 18:17 UTC (permalink / raw)
  To: Joe Stringer, bpf; +Cc: Lorenz Bauer, netdev, daniel, ast, eric.dumazet, kafai



On 3/24/20 10:57 PM, Joe Stringer wrote:
> From: Lorenz Bauer <lmb@cloudflare.com>
> 
> Attach a tc direct-action classifier to lo in a fresh network
> namespace, and rewrite all connection attempts to localhost:4321
> to localhost:1234 (for port tests) and connections to unreachable
> IPv4/IPv6 IPs to the local socket (for address tests).
> 
> Keep in mind that both client to server and server to client traffic
> passes the classifier.
> 
> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> Co-authored-by: Joe Stringer <joe@wand.net.nz>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> v2: Rebase onto test_progs infrastructure
> v1: Initial commit
> ---
>   tools/testing/selftests/bpf/Makefile          |   2 +-
>   .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
>   .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
>   3 files changed, 372 insertions(+), 1 deletion(-)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
>   create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 7729892e0b04..4f7f83d059ca 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -76,7 +76,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
>   # Compile but not part of 'make run_tests'
>   TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
>   	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
> -	test_lirc_mode2_user xdping test_cpp runqslower
> +	test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign

No test_sk_assign any more as the test is integrated into test_progs, right?

>   
>   TEST_CUSTOM_PROGS = urandom_read
>   
> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_assign.c b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> new file mode 100644
> index 000000000000..1f0afcc20c48
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> @@ -0,0 +1,244 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2018 Facebook
> +// Copyright (c) 2019 Cloudflare
> +// Copyright (c) 2020 Isovalent, Inc.
> +/*
> + * Test that the socket assign program is able to redirect traffic towards a
> + * socket, regardless of whether the port or address destination of the traffic
> + * matches the port.
> + */
> +
> +#define _GNU_SOURCE
> +#include <fcntl.h>
> +#include <signal.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +
> +#include "test_progs.h"
> +
> +#define TEST_DPORT 4321
> +#define TEST_DADDR (0xC0A80203)
> +#define NS_SELF "/proc/self/ns/net"
> +
> +static __u32 duration;
> +
> +static bool configure_stack(int self_net)

self_net parameter is not used.

> +{
> +	/* Move to a new networking namespace */
> +	if (CHECK_FAIL(unshare(CLONE_NEWNET)))
> +		return false;

You can use CHECK to encode better error messages. Thhis is what
most test_progs tests are using.

> +	/* Configure necessary links, routes */
> +	if (CHECK_FAIL(system("ip link set dev lo up")))
> +		return false;
> +	if (CHECK_FAIL(system("ip route add local default dev lo")))
> +		return false;
> +	if (CHECK_FAIL(system("ip -6 route add local default dev lo")))
> +		return false;
> +
> +	/* Load qdisc, BPF program */
> +	if (CHECK_FAIL(system("tc qdisc add dev lo clsact")))
> +		return false;
> +	if (CHECK_FAIL(system("tc filter add dev lo ingress bpf direct-action "
> +		     "object-file ./test_sk_assign.o section sk_assign_test")))
> +		return false;
> +
> +	return true;
> +}
> +
> +static int start_server(const struct sockaddr *addr, socklen_t len)
> +{
> +	int fd;
> +
> +	fd = socket(addr->sa_family, SOCK_STREAM, 0);
> +	if (CHECK_FAIL(fd == -1))
> +		goto out;
> +	if (CHECK_FAIL(bind(fd, addr, len) == -1))
> +		goto close_out;
> +	if (CHECK_FAIL(listen(fd, 128) == -1))
> +		goto close_out;
> +
> +	goto out;
> +
> +close_out:
> +	close(fd);
> +	fd = -1;
> +out:
> +	return fd;
> +}
> +
> +static void handle_timeout(int signum)
> +{
> +	if (signum == SIGALRM)
> +		fprintf(stderr, "Timed out while connecting to server\n");
> +	kill(0, SIGKILL);
> +}
> +
> +static struct sigaction timeout_action = {
> +	.sa_handler = handle_timeout,
> +};
> +
> +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> +{
> +	int fd = -1;
> +
> +	fd = socket(addr->sa_family, SOCK_STREAM, 0);
> +	if (CHECK_FAIL(fd == -1))
> +		goto out;
> +	if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> +		goto out;

should this goto close_out?

> +	alarm(3);
> +	if (CHECK_FAIL(connect(fd, addr, len) == -1))
> +		goto close_out;
> +
> +	goto out;
> +
> +close_out:
> +	close(fd);
> +	fd = -1;
> +out:
> +	return fd;
> +}
> +
> +static in_port_t get_port(int fd)
> +{
> +	struct sockaddr_storage name;
> +	socklen_t len;
> +	in_port_t port = 0;
> +
> +	len = sizeof(name);
> +	if (CHECK_FAIL(getsockname(fd, (struct sockaddr *)&name, &len)))
> +		return port;
> +
> +	switch (name.ss_family) {
> +	case AF_INET:
> +		port = ((struct sockaddr_in *)&name)->sin_port;
> +		break;
> +	case AF_INET6:
> +		port = ((struct sockaddr_in6 *)&name)->sin6_port;
> +		break;
> +	default:
> +		CHECK(1, "Invalid address family", "%d\n", name.ss_family);
> +	}
> +	return port;
> +}
> +
> +static int run_test(int server_fd, const struct sockaddr *addr, socklen_t len)
> +{
> +	int client = -1, srv_client = -1;
> +	char buf[] = "testing";
> +	in_port_t port;
> +	int ret = 1;
> +
> +	client = connect_to_server(addr, len);
> +	if (client == -1) {
> +		perror("Cannot connect to server");
> +		goto out;
> +	}
> +
> +	srv_client = accept(server_fd, NULL, NULL);
> +	if (CHECK_FAIL(srv_client == -1)) {
> +		perror("Can't accept connection");
> +		goto out;
> +	}
> +	if (CHECK_FAIL(write(client, buf, sizeof(buf)) != sizeof(buf))) {
> +		perror("Can't write on client");
> +		goto out;
> +	}
> +	if (CHECK_FAIL(read(srv_client, buf, sizeof(buf)) != sizeof(buf))) {
> +		perror("Can't read on server");
> +		goto out;
> +	}
> +
> +	port = get_port(srv_client);
> +	if (CHECK_FAIL(!port))
> +		goto out;
> +	if (CHECK(port != htons(TEST_DPORT), "Expected", "port %u but got %u",
> +		  TEST_DPORT, ntohs(port)))
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	close(client);
> +	close(srv_client);
> +	return ret;
> +}
> +
> +static int do_sk_assign(void)
> +{
> +	struct sockaddr_in addr4;
> +	struct sockaddr_in6 addr6;
> +	int server = -1;
> +	int server_v6 = -1;
> +	int err = 1;
> +
> +	memset(&addr4, 0, sizeof(addr4));
> +	addr4.sin_family = AF_INET;
> +	addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> +	addr4.sin_port = htons(1234);
> +
> +	memset(&addr6, 0, sizeof(addr6));
> +	addr6.sin6_family = AF_INET6;
> +	addr6.sin6_addr = in6addr_loopback;
> +	addr6.sin6_port = htons(1234);
> +
> +	server = start_server((const struct sockaddr *)&addr4, sizeof(addr4));
> +	if (server == -1)
> +		goto out;
> +
> +	server_v6 = start_server((const struct sockaddr *)&addr6,
> +				 sizeof(addr6));
> +	if (server_v6 == -1)
> +		goto out;
> +
> +	/* Connect to unbound ports */
> +	addr4.sin_port = htons(TEST_DPORT);
> +	addr6.sin6_port = htons(TEST_DPORT);
> +
> +	test__start_subtest("ipv4 port redir");
> +	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> +		goto out;
> +
> +	test__start_subtest("ipv6 port redir");
> +	if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> +		goto out;
> +
> +	/* Connect to unbound addresses */
> +	addr4.sin_addr.s_addr = htonl(TEST_DADDR);
> +	addr6.sin6_addr.s6_addr32[3] = htonl(TEST_DADDR);
> +
> +	test__start_subtest("ipv4 addr redir");
> +	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> +		goto out;
> +
> +	test__start_subtest("ipv6 addr redir");
> +	if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> +		goto out;
> +
> +	err = 0;
> +out:
> +	close(server);
> +	close(server_v6);
> +	return err;
> +}
> +
> +void test_sk_assign(void)
> +{
> +	int self_net;
> +
> +	self_net = open(NS_SELF, O_RDONLY);
> +	if (CHECK_FAIL(self_net < 0)) {
> +		perror("Unable to open "NS_SELF);
> +		return;
> +	}
> +
> +	if (!configure_stack(self_net)) {
> +		perror("configure_stack");
> +		goto cleanup;
> +	}
> +
> +	do_sk_assign();
> +
> +cleanup:
> +	close(self_net);

Did we exit the newly unshared net namespace and restored the previous 
namespace?

> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_sk_assign.c b/tools/testing/selftests/bpf/progs/test_sk_assign.c
> new file mode 100644
> index 000000000000..7de30ad3f594
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_sk_assign.c
> @@ -0,0 +1,127 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2019 Cloudflare Ltd.
> +
> +#include <stddef.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <linux/bpf.h>
> +#include <linux/if_ether.h>
> +#include <linux/in.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/pkt_cls.h>
> +#include <linux/tcp.h>
> +#include <sys/socket.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +int _version SEC("version") = 1;
> +char _license[] SEC("license") = "GPL";
> +
> +/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
> +static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
> +					void *data_end, __u16 eth_proto,
> +					bool *ipv4)
> +{
> +	struct bpf_sock_tuple *result;
> +	__u8 proto = 0;
> +	__u64 ihl_len;
> +
> +	if (eth_proto == bpf_htons(ETH_P_IP)) {
> +		struct iphdr *iph = (struct iphdr *)(data + nh_off);
> +
> +		if (iph + 1 > data_end)
> +			return NULL;
> +		if (iph->ihl != 5)
> +			/* Options are not supported */
> +			return NULL;
> +		ihl_len = iph->ihl * 4;
> +		proto = iph->protocol;
> +		*ipv4 = true;
> +		result = (struct bpf_sock_tuple *)&iph->saddr;
> +	} else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
> +		struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
> +
> +		if (ip6h + 1 > data_end)
> +			return NULL;
> +		ihl_len = sizeof(*ip6h);
> +		proto = ip6h->nexthdr;
> +		*ipv4 = false;
> +		result = (struct bpf_sock_tuple *)&ip6h->saddr;
> +	} else {
> +		return NULL;
> +	}
> +
> +	if (result + 1 > data_end || proto != IPPROTO_TCP)
> +		return NULL;
> +
> +	return result;
> +}
> +
> +SEC("sk_assign_test")
> +int bpf_sk_assign_test(struct __sk_buff *skb)
> +{
> +	void *data_end = (void *)(long)skb->data_end;
> +	void *data = (void *)(long)skb->data;
> +	struct ethhdr *eth = (struct ethhdr *)(data);
> +	struct bpf_sock_tuple *tuple, ln = {0};
> +	struct bpf_sock *sk;
> +	int tuple_len;
> +	bool ipv4;
> +	int ret;
> +
> +	if (eth + 1 > data_end)
> +		return TC_ACT_SHOT;
> +
> +	tuple = get_tuple(data, sizeof(*eth), data_end, eth->h_proto, &ipv4);
> +	if (!tuple)
> +		return TC_ACT_SHOT;
> +
> +	tuple_len = ipv4 ? sizeof(tuple->ipv4) : sizeof(tuple->ipv6);
> +	sk = bpf_skc_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);

You can get rid of tuple_len with
	if (ipv4)
		sk = bpf_skc_lookup_tcp(..., sizeof(tuple->ipv4), ...);
	else
		sk = bpf_skc_lookup_tcp(..., sizeof(tuple->ipv6), ...);

and later on you can do common bpf_skc_lookup_tcp.
But it may not be worthwhile to do it, as you have two separate calls
in the above instead.

> +	if (sk) {
> +		if (sk->state != BPF_TCP_LISTEN)
> +			goto assign;
> +
> +		bpf_sk_release(sk);
> +	}
> +
> +	if (ipv4) {
> +		if (tuple->ipv4.dport != bpf_htons(4321))
> +			return TC_ACT_OK;
> +
> +		ln.ipv4.daddr = bpf_htonl(0x7f000001);
> +		ln.ipv4.dport = bpf_htons(1234);
> +
> +		sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
> +					BPF_F_CURRENT_NETNS, 0);
> +	} else {
> +		if (tuple->ipv6.dport != bpf_htons(4321))
> +			return TC_ACT_OK;
> +
> +		/* Upper parts of daddr are already zero. */
> +		ln.ipv6.daddr[3] = bpf_htonl(0x1);
> +		ln.ipv6.dport = bpf_htons(1234);
> +
> +		sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
> +					BPF_F_CURRENT_NETNS, 0);
> +	}
> +
> +	/* We can't do a single skc_lookup_tcp here, because then the compiler
> +	 * will likely spill tuple_len to the stack. This makes it lose all
> +	 * bounds information in the verifier, which then rejects the call as
> +	 * unsafe.
> +	 */

This is a known issue. For scalars, only constant is restored properly
in verifier at this moment. I did some hacking before to enable any
scalars. The fear is this will make pruning performs worse. More
study is needed here.

> +	if (!sk)
> +		return TC_ACT_SHOT;
> +
> +	if (sk->state != BPF_TCP_LISTEN) {
> +		bpf_sk_release(sk);
> +		return TC_ACT_SHOT;
> +	}
> +
> +assign:
> +	ret = bpf_sk_assign(skb, sk, 0);
> +	bpf_sk_release(sk);
> +	return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
> +}
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign()
  2020-03-25 10:29   ` Lorenz Bauer
@ 2020-03-25 20:46     ` Joe Stringer
  2020-03-26 10:20       ` Lorenz Bauer
  0 siblings, 1 reply; 41+ messages in thread
From: Joe Stringer @ 2020-03-25 20:46 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Wed, Mar 25, 2020 at 3:29 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
> >
> > Avoid taking a reference on listen sockets by checking the socket type
> > in the sk_assign and in the corresponding skb_steal_sock() code in the
> > the transport layer, and by ensuring that the prefetch free (sock_pfree)
> > function uses the same logic to check whether the socket is refcounted.
> >
> > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---
> > v2: Initial version
> > ---
> >  include/net/sock.h | 25 +++++++++++++++++--------
> >  net/core/filter.c  |  6 +++---
> >  net/core/sock.c    |  3 ++-
> >  3 files changed, 22 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 1ca2e808cb8e..3ec1865f173e 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -2533,6 +2533,21 @@ skb_sk_is_prefetched(struct sk_buff *skb)
> >         return skb->destructor == sock_pfree;
> >  }
> >
> > +/* This helper checks if a socket is a full socket,
> > + * ie _not_ a timewait or request socket.
> > + */
> > +static inline bool sk_fullsock(const struct sock *sk)
> > +{
> > +       return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> > +}
> > +
> > +static inline bool
> > +sk_is_refcounted(struct sock *sk)
> > +{
> > +       /* Only full sockets have sk->sk_flags. */
> > +       return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
> > +}
> > +
> >  /**
> >   * skb_steal_sock
> >   * @skb to steal the socket from
> > @@ -2545,6 +2560,8 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> >                 struct sock *sk = skb->sk;
> >
> >                 *refcounted = true;
> > +               if (skb_sk_is_prefetched(skb))
> > +                       *refcounted = sk_is_refcounted(sk);
> >                 skb->destructor = NULL;
> >                 skb->sk = NULL;
> >                 return sk;
> > @@ -2553,14 +2570,6 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> >         return NULL;
> >  }
> >
> > -/* This helper checks if a socket is a full socket,
> > - * ie _not_ a timewait or request socket.
> > - */
> > -static inline bool sk_fullsock(const struct sock *sk)
> > -{
> > -       return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> > -}
> > -
> >  /* Checks if this SKB belongs to an HW offloaded socket
> >   * and whether any SW fallbacks are required based on dev.
> >   * Check decrypted mark in case skb_orphan() cleared socket.
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 0fada7fe9b75..997b8606167e 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5343,8 +5343,7 @@ static const struct bpf_func_proto bpf_sk_lookup_udp_proto = {
> >
> >  BPF_CALL_1(bpf_sk_release, struct sock *, sk)
> >  {
> > -       /* Only full sockets have sk->sk_flags. */
> > -       if (!sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE))
> > +       if (sk_is_refcounted(sk))
> >                 sock_gen_put(sk);
> >         return 0;
> >  }
> > @@ -5870,7 +5869,8 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> >                 return -ESOCKTNOSUPPORT;
> >         if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> >                 return -ENETUNREACH;
> > -       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > +       if (sk_is_refcounted(sk) &&
> > +           unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> >                 return -ENOENT;
> >
> >         skb_orphan(skb);
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index cfaf60267360..a2ab79446f59 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -2076,7 +2076,8 @@ EXPORT_SYMBOL(sock_efree);
> >   */
> >  void sock_pfree(struct sk_buff *skb)
> >  {
> > -       sock_edemux(skb);
> > +       if (sk_is_refcounted(skb->sk))
> > +               sock_edemux(skb);
>
> sock_edemux calls sock_gen_put, which is also called by
> bpf_sk_release. Is it worth teaching sock_gen_put about
> sk_fullsock, and dropping the other helpers? I was considering this
> when fixing up sk_release, but then forgot
> about it.

I like the idea, but I'm concerned about breaking things outside the
focus of this new helper if the skb_sk_is_prefetched() function from
patch 1 is allowed to return true for sockets other than the ones
assigned from the bpf_sk_assign() helper. At a glance there's users of
sock_efree (which sock_edemux can be defined to) like netem_enqueue()
which may inadvertently trigger unexpected paths here. I think it's
more explicit so more obviously correct if the destructor pointer used
in this series is unique compared to other paths, even if the
underlying code is the same.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25 10:35   ` Lorenz Bauer
@ 2020-03-25 20:55     ` Joe Stringer
  2020-03-26  6:25       ` Martin KaFai Lau
  0 siblings, 1 reply; 41+ messages in thread
From: Joe Stringer @ 2020-03-25 20:55 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Wed, Mar 25, 2020 at 3:35 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
> >
> > From: Lorenz Bauer <lmb@cloudflare.com>
> >
> > Attach a tc direct-action classifier to lo in a fresh network
> > namespace, and rewrite all connection attempts to localhost:4321
> > to localhost:1234 (for port tests) and connections to unreachable
> > IPv4/IPv6 IPs to the local socket (for address tests).
>
> Can you extend this to cover UDP as well?

I'm working on a follow-up series for UDP, we need this too.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25 18:17   ` Yonghong Song
@ 2020-03-25 21:20     ` Joe Stringer
  2020-03-25 22:00       ` Yonghong Song
  2020-03-26 10:13     ` Lorenz Bauer
  1 sibling, 1 reply; 41+ messages in thread
From: Joe Stringer @ 2020-03-25 21:20 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Joe Stringer, bpf, Lorenz Bauer, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin KaFai Lau

On Wed, Mar 25, 2020 at 11:18 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 3/24/20 10:57 PM, Joe Stringer wrote:
> > From: Lorenz Bauer <lmb@cloudflare.com>
> >
> > Attach a tc direct-action classifier to lo in a fresh network
> > namespace, and rewrite all connection attempts to localhost:4321
> > to localhost:1234 (for port tests) and connections to unreachable
> > IPv4/IPv6 IPs to the local socket (for address tests).
> >
> > Keep in mind that both client to server and server to client traffic
> > passes the classifier.
> >
> > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > Co-authored-by: Joe Stringer <joe@wand.net.nz>
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---
> > v2: Rebase onto test_progs infrastructure
> > v1: Initial commit
> > ---
> >   tools/testing/selftests/bpf/Makefile          |   2 +-
> >   .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
> >   .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
> >   3 files changed, 372 insertions(+), 1 deletion(-)
> >   create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
> >
> > diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> > index 7729892e0b04..4f7f83d059ca 100644
> > --- a/tools/testing/selftests/bpf/Makefile
> > +++ b/tools/testing/selftests/bpf/Makefile
> > @@ -76,7 +76,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
> >   # Compile but not part of 'make run_tests'
> >   TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
> >       flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
> > -     test_lirc_mode2_user xdping test_cpp runqslower
> > +     test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign
>
> No test_sk_assign any more as the test is integrated into test_progs, right?

I'll fix it up.

> > +static __u32 duration;
> > +
> > +static bool configure_stack(int self_net)
>
> self_net parameter is not used.

Hrm, why didn't the compiler tell me this..? Will fix.

> > +{
> > +     /* Move to a new networking namespace */
> > +     if (CHECK_FAIL(unshare(CLONE_NEWNET)))
> > +             return false;
>
> You can use CHECK to encode better error messages. Thhis is what
> most test_progs tests are using.

I was going back and forth on this when I was writing this bit.
CHECK_FAIL() already prints the line that fails, so when debugging
it's pretty clear what call went wrong if you dig into the code.
Combine with perror() and you actually get a readable string of the
error, whereas the common form for CHECK() seems to be just printing
the error code which the developer then has to do symbol lookup to
interpret..

    if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))

Example output with CHECK_FAIL / perror approach:

    # ./test_progs -t assign
    ...
    Timed out while connecting to server
    connect_to_server:FAIL:90
    Cannot connect to server: Interrupted system call
    #46/1 ipv4 port redir:FAIL
    #46 sk_assign:FAIL
    Summary: 0/0 PASSED, 0 SKIPPED, 2 FAILED

Diff to make this happen is just connect to a port that the BPF
program doesn't redirect:

$ git diff
diff --git a/tools/testing/selftests/bpf/prog_tests/sk_assign.c
b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
index 1f0afcc20c48..ba661145518a 100644
--- a/tools/testing/selftests/bpf/prog_tests/sk_assign.c
+++ b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
@@ -192,7 +191,7 @@ static int do_sk_assign(void)
                goto out;

        /* Connect to unbound ports */
-       addr4.sin_port = htons(TEST_DPORT);
+       addr4.sin_port = htons(666);
        addr6.sin6_port = htons(TEST_DPORT);

        test__start_subtest("ipv4 port redir");

(I had to drop the kill() call as well, but that's part of the next
revision of this series..)

> > +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> > +{
> > +     int fd = -1;
> > +
> > +     fd = socket(addr->sa_family, SOCK_STREAM, 0);
> > +     if (CHECK_FAIL(fd == -1))
> > +             goto out;
> > +     if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> > +             goto out;
>
> should this goto close_out?

Will fix.

> > +void test_sk_assign(void)
> > +{
> > +     int self_net;
> > +
> > +     self_net = open(NS_SELF, O_RDONLY);
> > +     if (CHECK_FAIL(self_net < 0)) {
> > +             perror("Unable to open "NS_SELF);
> > +             return;
> > +     }
> > +
> > +     if (!configure_stack(self_net)) {
> > +             perror("configure_stack");
> > +             goto cleanup;
> > +     }
> > +
> > +     do_sk_assign();
> > +
> > +cleanup:
> > +     close(self_net);
>
> Did we exit the newly unshared net namespace and restored the previous
> namespace?

Ah I've mainly just been running this test so it didn't affect me but
I realise now I dropped the hunks that were intended to do this
cleanup. Will fix.

> > +     /* We can't do a single skc_lookup_tcp here, because then the compiler
> > +      * will likely spill tuple_len to the stack. This makes it lose all
> > +      * bounds information in the verifier, which then rejects the call as
> > +      * unsafe.
> > +      */
>
> This is a known issue. For scalars, only constant is restored properly
> in verifier at this moment. I did some hacking before to enable any
> scalars. The fear is this will make pruning performs worse. More
> study is needed here.

Thanks for the background. Do you want me to refer to any specific
release version or date or commit for this comment or it's fine to
leave as-is?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25 21:20     ` Joe Stringer
@ 2020-03-25 22:00       ` Yonghong Song
  2020-03-25 23:07         ` Joe Stringer
  0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2020-03-25 22:00 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Lorenz Bauer, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Martin KaFai Lau



On 3/25/20 2:20 PM, Joe Stringer wrote:
> On Wed, Mar 25, 2020 at 11:18 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 3/24/20 10:57 PM, Joe Stringer wrote:
>>> From: Lorenz Bauer <lmb@cloudflare.com>
>>>
>>> Attach a tc direct-action classifier to lo in a fresh network
>>> namespace, and rewrite all connection attempts to localhost:4321
>>> to localhost:1234 (for port tests) and connections to unreachable
>>> IPv4/IPv6 IPs to the local socket (for address tests).
>>>
>>> Keep in mind that both client to server and server to client traffic
>>> passes the classifier.
>>>
>>> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
>>> Co-authored-by: Joe Stringer <joe@wand.net.nz>
>>> Signed-off-by: Joe Stringer <joe@wand.net.nz>
>>> ---
>>> v2: Rebase onto test_progs infrastructure
>>> v1: Initial commit
>>> ---
>>>    tools/testing/selftests/bpf/Makefile          |   2 +-
>>>    .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
>>>    .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
>>>    3 files changed, 372 insertions(+), 1 deletion(-)
>>>    create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
>>>    create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
>>>
>>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
>>> index 7729892e0b04..4f7f83d059ca 100644
>>> --- a/tools/testing/selftests/bpf/Makefile
>>> +++ b/tools/testing/selftests/bpf/Makefile
>>> @@ -76,7 +76,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
>>>    # Compile but not part of 'make run_tests'
>>>    TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
>>>        flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
>>> -     test_lirc_mode2_user xdping test_cpp runqslower
>>> +     test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign
>>
>> No test_sk_assign any more as the test is integrated into test_progs, right?
> 
> I'll fix it up.
> 
>>> +static __u32 duration;
>>> +
>>> +static bool configure_stack(int self_net)
>>
>> self_net parameter is not used.
> 
> Hrm, why didn't the compiler tell me this..? Will fix.
> 
>>> +{
>>> +     /* Move to a new networking namespace */
>>> +     if (CHECK_FAIL(unshare(CLONE_NEWNET)))
>>> +             return false;
>>
>> You can use CHECK to encode better error messages. Thhis is what
>> most test_progs tests are using.
> 
> I was going back and forth on this when I was writing this bit.
> CHECK_FAIL() already prints the line that fails, so when debugging
> it's pretty clear what call went wrong if you dig into the code.
> Combine with perror() and you actually get a readable string of the
> error, whereas the common form for CHECK() seems to be just printing
> the error code which the developer then has to do symbol lookup to
> interpret..
> 
>      if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> 
> Example output with CHECK_FAIL / perror approach:
> 
>      # ./test_progs -t assign
>      ...
>      Timed out while connecting to server
>      connect_to_server:FAIL:90
>      Cannot connect to server: Interrupted system call
>      #46/1 ipv4 port redir:FAIL
>      #46 sk_assign:FAIL
>      Summary: 0/0 PASSED, 0 SKIPPED, 2 FAILED

I won't insist since CHECK_FAIL should roughly provide enough
information for failure. CHECK might be more useful if you want
to provide more context, esp. if the same routine is called
in multiple places and you can have a marker to differentiate
which call site caused the problem.

But again, just a suggestion. CHECK_FAIL is okay to me.

> 
> Diff to make this happen is just connect to a port that the BPF
> program doesn't redirect:
> 
> $ git diff
> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> index 1f0afcc20c48..ba661145518a 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> @@ -192,7 +191,7 @@ static int do_sk_assign(void)
>                  goto out;
> 
>          /* Connect to unbound ports */
> -       addr4.sin_port = htons(TEST_DPORT);
> +       addr4.sin_port = htons(666);
>          addr6.sin6_port = htons(TEST_DPORT);
> 
>          test__start_subtest("ipv4 port redir");
> 
> (I had to drop the kill() call as well, but that's part of the next
> revision of this series..)
> 
>>> +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
>>> +{
>>> +     int fd = -1;
>>> +
>>> +     fd = socket(addr->sa_family, SOCK_STREAM, 0);
>>> +     if (CHECK_FAIL(fd == -1))
>>> +             goto out;
>>> +     if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
>>> +             goto out;
>>
>> should this goto close_out?
> 
> Will fix.
> 
>>> +void test_sk_assign(void)
>>> +{
>>> +     int self_net;
>>> +
>>> +     self_net = open(NS_SELF, O_RDONLY);
>>> +     if (CHECK_FAIL(self_net < 0)) {
>>> +             perror("Unable to open "NS_SELF);
>>> +             return;
>>> +     }
>>> +
>>> +     if (!configure_stack(self_net)) {
>>> +             perror("configure_stack");
>>> +             goto cleanup;
>>> +     }
>>> +
>>> +     do_sk_assign();
>>> +
>>> +cleanup:
>>> +     close(self_net);
>>
>> Did we exit the newly unshared net namespace and restored the previous
>> namespace?
> 
> Ah I've mainly just been running this test so it didn't affect me but
> I realise now I dropped the hunks that were intended to do this
> cleanup. Will fix.
> 
>>> +     /* We can't do a single skc_lookup_tcp here, because then the compiler
>>> +      * will likely spill tuple_len to the stack. This makes it lose all
>>> +      * bounds information in the verifier, which then rejects the call as
>>> +      * unsafe.
>>> +      */
>>
>> This is a known issue. For scalars, only constant is restored properly
>> in verifier at this moment. I did some hacking before to enable any
>> scalars. The fear is this will make pruning performs worse. More
>> study is needed here.
> 
> Thanks for the background. Do you want me to refer to any specific
> release version or date or commit for this comment or it's fine to
> leave as-is?

Maybe add a "workaround:" marker in the comments so later we can search
and find these examples if we have compiler/verifier improvements.

-bash-4.4$ egrep -ri workaround
test_get_stack_rawtp.c: * This is an acceptable workaround since there 
is one entry here.
test_seg6_loop.c:       // workaround: define induction variable "i" as 
"long" instead
test_sysctl_loop1.c:    /* a workaround to prevent compiler from generating
-bash-4.4$

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25 22:00       ` Yonghong Song
@ 2020-03-25 23:07         ` Joe Stringer
  0 siblings, 0 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-25 23:07 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Joe Stringer, bpf, Lorenz Bauer, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin KaFai Lau

On Wed, Mar 25, 2020 at 3:01 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 3/25/20 2:20 PM, Joe Stringer wrote:
> > On Wed, Mar 25, 2020 at 11:18 AM Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 3/24/20 10:57 PM, Joe Stringer wrote:
> >>> From: Lorenz Bauer <lmb@cloudflare.com>
> >>>
> >>> Attach a tc direct-action classifier to lo in a fresh network
> >>> namespace, and rewrite all connection attempts to localhost:4321
> >>> to localhost:1234 (for port tests) and connections to unreachable
> >>> IPv4/IPv6 IPs to the local socket (for address tests).
> >>>
> >>> Keep in mind that both client to server and server to client traffic
> >>> passes the classifier.
> >>>
> >>> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> >>> Co-authored-by: Joe Stringer <joe@wand.net.nz>
> >>> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> >>> ---
> >>> v2: Rebase onto test_progs infrastructure
> >>> v1: Initial commit
> >>> ---
> >>>    tools/testing/selftests/bpf/Makefile          |   2 +-
> >>>    .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
> >>>    .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
> >>>    3 files changed, 372 insertions(+), 1 deletion(-)
> >>>    create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
> >>>    create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
> >>>
> >>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> >>> index 7729892e0b04..4f7f83d059ca 100644
> >>> --- a/tools/testing/selftests/bpf/Makefile
> >>> +++ b/tools/testing/selftests/bpf/Makefile
> >>> @@ -76,7 +76,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
> >>>    # Compile but not part of 'make run_tests'
> >>>    TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
> >>>        flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
> >>> -     test_lirc_mode2_user xdping test_cpp runqslower
> >>> +     test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign
> >>
> >> No test_sk_assign any more as the test is integrated into test_progs, right?
> >
> > I'll fix it up.
> >
> >>> +static __u32 duration;
> >>> +
> >>> +static bool configure_stack(int self_net)
> >>
> >> self_net parameter is not used.
> >
> > Hrm, why didn't the compiler tell me this..? Will fix.
> >
> >>> +{
> >>> +     /* Move to a new networking namespace */
> >>> +     if (CHECK_FAIL(unshare(CLONE_NEWNET)))
> >>> +             return false;
> >>
> >> You can use CHECK to encode better error messages. Thhis is what
> >> most test_progs tests are using.
> >
> > I was going back and forth on this when I was writing this bit.
> > CHECK_FAIL() already prints the line that fails, so when debugging
> > it's pretty clear what call went wrong if you dig into the code.
> > Combine with perror() and you actually get a readable string of the
> > error, whereas the common form for CHECK() seems to be just printing
> > the error code which the developer then has to do symbol lookup to
> > interpret..
> >
> >      if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> >
> > Example output with CHECK_FAIL / perror approach:
> >
> >      # ./test_progs -t assign
> >      ...
> >      Timed out while connecting to server
> >      connect_to_server:FAIL:90
> >      Cannot connect to server: Interrupted system call
> >      #46/1 ipv4 port redir:FAIL
> >      #46 sk_assign:FAIL
> >      Summary: 0/0 PASSED, 0 SKIPPED, 2 FAILED
>
> I won't insist since CHECK_FAIL should roughly provide enough
> information for failure. CHECK might be more useful if you want
> to provide more context, esp. if the same routine is called
> in multiple places and you can have a marker to differentiate
> which call site caused the problem.

Good point, maybe for extra context the subtests can use CHECK() in
addition to the CHECK_FAIL.

> But again, just a suggestion. CHECK_FAIL is okay to me.

<snip>

> >>> +     /* We can't do a single skc_lookup_tcp here, because then the compiler
> >>> +      * will likely spill tuple_len to the stack. This makes it lose all
> >>> +      * bounds information in the verifier, which then rejects the call as
> >>> +      * unsafe.
> >>> +      */
> >>
> >> This is a known issue. For scalars, only constant is restored properly
> >> in verifier at this moment. I did some hacking before to enable any
> >> scalars. The fear is this will make pruning performs worse. More
> >> study is needed here.
> >
> > Thanks for the background. Do you want me to refer to any specific
> > release version or date or commit for this comment or it's fine to
> > leave as-is?
>
> Maybe add a "workaround:" marker in the comments so later we can search
> and find these examples if we have compiler/verifier improvements.
>
> -bash-4.4$ egrep -ri workaround
> test_get_stack_rawtp.c: * This is an acceptable workaround since there
> is one entry here.
> test_seg6_loop.c:       // workaround: define induction variable "i" as
> "long" instead
> test_sysctl_loop1.c:    /* a workaround to prevent compiler from generating
> -bash-4.4$

SGTM, Will roll that in thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25  5:57 ` [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign Joe Stringer
  2020-03-25 10:35   ` Lorenz Bauer
  2020-03-25 18:17   ` Yonghong Song
@ 2020-03-26  2:04   ` Andrii Nakryiko
  2020-03-26  2:16   ` Andrii Nakryiko
  3 siblings, 0 replies; 41+ messages in thread
From: Andrii Nakryiko @ 2020-03-26  2:04 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Lorenz Bauer, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Tue, Mar 24, 2020 at 10:58 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> From: Lorenz Bauer <lmb@cloudflare.com>
>
> Attach a tc direct-action classifier to lo in a fresh network
> namespace, and rewrite all connection attempts to localhost:4321
> to localhost:1234 (for port tests) and connections to unreachable
> IPv4/IPv6 IPs to the local socket (for address tests).
>
> Keep in mind that both client to server and server to client traffic
> passes the classifier.
>
> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> Co-authored-by: Joe Stringer <joe@wand.net.nz>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---

Can you please check that you test fails (instead of getting stuck)
when there is something wrong with network. We went through this
exercise with tcp_rtt and sockmap_listen, where a bunch of stuff was
blocking. This works fine when everything works, but horribly breaks
when something is not working. Given this is part of test_progs, let's
please make sure we don't deadlock anywhere.

> v2: Rebase onto test_progs infrastructure
> v1: Initial commit
> ---
>  tools/testing/selftests/bpf/Makefile          |   2 +-
>  .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
>  .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
>  3 files changed, 372 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
>

[...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25  5:57 ` [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign Joe Stringer
                     ` (2 preceding siblings ...)
  2020-03-26  2:04   ` Andrii Nakryiko
@ 2020-03-26  2:16   ` Andrii Nakryiko
  2020-03-26  5:28     ` Joe Stringer
  3 siblings, 1 reply; 41+ messages in thread
From: Andrii Nakryiko @ 2020-03-26  2:16 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Lorenz Bauer, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Tue, Mar 24, 2020 at 10:58 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> From: Lorenz Bauer <lmb@cloudflare.com>
>
> Attach a tc direct-action classifier to lo in a fresh network
> namespace, and rewrite all connection attempts to localhost:4321
> to localhost:1234 (for port tests) and connections to unreachable
> IPv4/IPv6 IPs to the local socket (for address tests).
>
> Keep in mind that both client to server and server to client traffic
> passes the classifier.
>
> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> Co-authored-by: Joe Stringer <joe@wand.net.nz>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> v2: Rebase onto test_progs infrastructure
> v1: Initial commit
> ---
>  tools/testing/selftests/bpf/Makefile          |   2 +-
>  .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
>  .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
>  3 files changed, 372 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 7729892e0b04..4f7f83d059ca 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -76,7 +76,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
>  # Compile but not part of 'make run_tests'
>  TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
>         flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
> -       test_lirc_mode2_user xdping test_cpp runqslower
> +       test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign
>
>  TEST_CUSTOM_PROGS = urandom_read
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_assign.c b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> new file mode 100644
> index 000000000000..1f0afcc20c48
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> @@ -0,0 +1,244 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2018 Facebook
> +// Copyright (c) 2019 Cloudflare
> +// Copyright (c) 2020 Isovalent, Inc.
> +/*
> + * Test that the socket assign program is able to redirect traffic towards a
> + * socket, regardless of whether the port or address destination of the traffic
> + * matches the port.
> + */
> +
> +#define _GNU_SOURCE
> +#include <fcntl.h>
> +#include <signal.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +
> +#include "test_progs.h"
> +
> +#define TEST_DPORT 4321
> +#define TEST_DADDR (0xC0A80203)
> +#define NS_SELF "/proc/self/ns/net"
> +
> +static __u32 duration;
> +
> +static bool configure_stack(int self_net)
> +{
> +       /* Move to a new networking namespace */
> +       if (CHECK_FAIL(unshare(CLONE_NEWNET)))
> +               return false;
> +
> +       /* Configure necessary links, routes */
> +       if (CHECK_FAIL(system("ip link set dev lo up")))
> +               return false;
> +       if (CHECK_FAIL(system("ip route add local default dev lo")))
> +               return false;
> +       if (CHECK_FAIL(system("ip -6 route add local default dev lo")))
> +               return false;
> +
> +       /* Load qdisc, BPF program */
> +       if (CHECK_FAIL(system("tc qdisc add dev lo clsact")))
> +               return false;
> +       if (CHECK_FAIL(system("tc filter add dev lo ingress bpf direct-action "
> +                    "object-file ./test_sk_assign.o section sk_assign_test")))
> +               return false;
> +
> +       return true;
> +}
> +
> +static int start_server(const struct sockaddr *addr, socklen_t len)
> +{
> +       int fd;
> +
> +       fd = socket(addr->sa_family, SOCK_STREAM, 0);
> +       if (CHECK_FAIL(fd == -1))
> +               goto out;
> +       if (CHECK_FAIL(bind(fd, addr, len) == -1))
> +               goto close_out;
> +       if (CHECK_FAIL(listen(fd, 128) == -1))
> +               goto close_out;
> +
> +       goto out;
> +
> +close_out:
> +       close(fd);
> +       fd = -1;
> +out:
> +       return fd;
> +}
> +
> +static void handle_timeout(int signum)
> +{
> +       if (signum == SIGALRM)
> +               fprintf(stderr, "Timed out while connecting to server\n");
> +       kill(0, SIGKILL);
> +}
> +
> +static struct sigaction timeout_action = {
> +       .sa_handler = handle_timeout,
> +};
> +
> +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> +{
> +       int fd = -1;
> +
> +       fd = socket(addr->sa_family, SOCK_STREAM, 0);
> +       if (CHECK_FAIL(fd == -1))
> +               goto out;
> +       if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> +               goto out;

no-no-no, we are not doing this. It's part of prog_tests and shouldn't
install its own signal handlers and sending asynchronous signals to
itself. Please find another way to have a timeout.

> +       alarm(3);
> +       if (CHECK_FAIL(connect(fd, addr, len) == -1))
> +               goto close_out;
> +
> +       goto out;
> +
> +close_out:
> +       close(fd);
> +       fd = -1;
> +out:
> +       return fd;
> +}
> +
> +static in_port_t get_port(int fd)
> +{
> +       struct sockaddr_storage name;
> +       socklen_t len;
> +       in_port_t port = 0;
> +
> +       len = sizeof(name);
> +       if (CHECK_FAIL(getsockname(fd, (struct sockaddr *)&name, &len)))
> +               return port;
> +
> +       switch (name.ss_family) {
> +       case AF_INET:
> +               port = ((struct sockaddr_in *)&name)->sin_port;
> +               break;
> +       case AF_INET6:
> +               port = ((struct sockaddr_in6 *)&name)->sin6_port;
> +               break;
> +       default:
> +               CHECK(1, "Invalid address family", "%d\n", name.ss_family);
> +       }
> +       return port;
> +}
> +
> +static int run_test(int server_fd, const struct sockaddr *addr, socklen_t len)
> +{
> +       int client = -1, srv_client = -1;
> +       char buf[] = "testing";
> +       in_port_t port;
> +       int ret = 1;
> +
> +       client = connect_to_server(addr, len);
> +       if (client == -1) {
> +               perror("Cannot connect to server");
> +               goto out;
> +       }
> +
> +       srv_client = accept(server_fd, NULL, NULL);
> +       if (CHECK_FAIL(srv_client == -1)) {
> +               perror("Can't accept connection");
> +               goto out;
> +       }
> +       if (CHECK_FAIL(write(client, buf, sizeof(buf)) != sizeof(buf))) {
> +               perror("Can't write on client");
> +               goto out;
> +       }
> +       if (CHECK_FAIL(read(srv_client, buf, sizeof(buf)) != sizeof(buf))) {
> +               perror("Can't read on server");
> +               goto out;
> +       }
> +
> +       port = get_port(srv_client);
> +       if (CHECK_FAIL(!port))
> +               goto out;
> +       if (CHECK(port != htons(TEST_DPORT), "Expected", "port %u but got %u",
> +                 TEST_DPORT, ntohs(port)))
> +               goto out;
> +
> +       ret = 0;
> +out:
> +       close(client);
> +       close(srv_client);
> +       return ret;
> +}
> +
> +static int do_sk_assign(void)
> +{
> +       struct sockaddr_in addr4;
> +       struct sockaddr_in6 addr6;
> +       int server = -1;
> +       int server_v6 = -1;
> +       int err = 1;
> +
> +       memset(&addr4, 0, sizeof(addr4));
> +       addr4.sin_family = AF_INET;
> +       addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> +       addr4.sin_port = htons(1234);
> +
> +       memset(&addr6, 0, sizeof(addr6));
> +       addr6.sin6_family = AF_INET6;
> +       addr6.sin6_addr = in6addr_loopback;
> +       addr6.sin6_port = htons(1234);
> +
> +       server = start_server((const struct sockaddr *)&addr4, sizeof(addr4));
> +       if (server == -1)
> +               goto out;
> +
> +       server_v6 = start_server((const struct sockaddr *)&addr6,
> +                                sizeof(addr6));
> +       if (server_v6 == -1)
> +               goto out;
> +
> +       /* Connect to unbound ports */
> +       addr4.sin_port = htons(TEST_DPORT);
> +       addr6.sin6_port = htons(TEST_DPORT);
> +
> +       test__start_subtest("ipv4 port redir");
> +       if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> +               goto out;
> +
> +       test__start_subtest("ipv6 port redir");
> +       if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> +               goto out;
> +
> +       /* Connect to unbound addresses */
> +       addr4.sin_addr.s_addr = htonl(TEST_DADDR);
> +       addr6.sin6_addr.s6_addr32[3] = htonl(TEST_DADDR);
> +
> +       test__start_subtest("ipv4 addr redir");
> +       if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> +               goto out;
> +
> +       test__start_subtest("ipv6 addr redir");

test__start_subtest() returns false if subtest is supposed to be
skipped. If you ignore that, then test_progs -t and -n filtering won't
work properly.

> +       if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> +               goto out;
> +
> +       err = 0;
> +out:
> +       close(server);
> +       close(server_v6);
> +       return err;
> +}
> +
> +void test_sk_assign(void)
> +{
> +       int self_net;
> +
> +       self_net = open(NS_SELF, O_RDONLY);

I'm not familiar with what this does. Can you please explain briefly
what are the side effects of this? Asking because of shared test_progs
environment worries, of course.

> +       if (CHECK_FAIL(self_net < 0)) {
> +               perror("Unable to open "NS_SELF);
> +               return;
> +       }
> +
> +       if (!configure_stack(self_net)) {
> +               perror("configure_stack");
> +               goto cleanup;
> +       }
> +
> +       do_sk_assign();
> +
> +cleanup:
> +       close(self_net);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_sk_assign.c b/tools/testing/selftests/bpf/progs/test_sk_assign.c
> new file mode 100644
> index 000000000000..7de30ad3f594
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_sk_assign.c
> @@ -0,0 +1,127 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2019 Cloudflare Ltd.
> +
> +#include <stddef.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <linux/bpf.h>
> +#include <linux/if_ether.h>
> +#include <linux/in.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/pkt_cls.h>
> +#include <linux/tcp.h>
> +#include <sys/socket.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +int _version SEC("version") = 1;
> +char _license[] SEC("license") = "GPL";
> +
> +/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
> +static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
> +                                       void *data_end, __u16 eth_proto,
> +                                       bool *ipv4)
> +{
> +       struct bpf_sock_tuple *result;
> +       __u8 proto = 0;
> +       __u64 ihl_len;
> +
> +       if (eth_proto == bpf_htons(ETH_P_IP)) {
> +               struct iphdr *iph = (struct iphdr *)(data + nh_off);
> +
> +               if (iph + 1 > data_end)
> +                       return NULL;
> +               if (iph->ihl != 5)
> +                       /* Options are not supported */
> +                       return NULL;
> +               ihl_len = iph->ihl * 4;
> +               proto = iph->protocol;
> +               *ipv4 = true;
> +               result = (struct bpf_sock_tuple *)&iph->saddr;
> +       } else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
> +               struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
> +
> +               if (ip6h + 1 > data_end)
> +                       return NULL;
> +               ihl_len = sizeof(*ip6h);
> +               proto = ip6h->nexthdr;
> +               *ipv4 = false;
> +               result = (struct bpf_sock_tuple *)&ip6h->saddr;
> +       } else {
> +               return NULL;
> +       }
> +
> +       if (result + 1 > data_end || proto != IPPROTO_TCP)
> +               return NULL;
> +
> +       return result;
> +}
> +
> +SEC("sk_assign_test")

Please use "canonical" section name that libbpf recognizes. This
woulde be "action/" or "classifier/", right?

> +int bpf_sk_assign_test(struct __sk_buff *skb)
> +{
> +       void *data_end = (void *)(long)skb->data_end;
> +       void *data = (void *)(long)skb->data;
> +       struct ethhdr *eth = (struct ethhdr *)(data);
> +       struct bpf_sock_tuple *tuple, ln = {0};
> +       struct bpf_sock *sk;
> +       int tuple_len;
> +       bool ipv4;
> +       int ret;
> +
> +       if (eth + 1 > data_end)
> +               return TC_ACT_SHOT;
> +
> +       tuple = get_tuple(data, sizeof(*eth), data_end, eth->h_proto, &ipv4);
> +       if (!tuple)
> +               return TC_ACT_SHOT;
> +
> +       tuple_len = ipv4 ? sizeof(tuple->ipv4) : sizeof(tuple->ipv6);
> +       sk = bpf_skc_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
> +       if (sk) {
> +               if (sk->state != BPF_TCP_LISTEN)
> +                       goto assign;
> +
> +               bpf_sk_release(sk);
> +       }
> +
> +       if (ipv4) {
> +               if (tuple->ipv4.dport != bpf_htons(4321))
> +                       return TC_ACT_OK;
> +
> +               ln.ipv4.daddr = bpf_htonl(0x7f000001);
> +               ln.ipv4.dport = bpf_htons(1234);
> +
> +               sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
> +                                       BPF_F_CURRENT_NETNS, 0);
> +       } else {
> +               if (tuple->ipv6.dport != bpf_htons(4321))
> +                       return TC_ACT_OK;
> +
> +               /* Upper parts of daddr are already zero. */
> +               ln.ipv6.daddr[3] = bpf_htonl(0x1);
> +               ln.ipv6.dport = bpf_htons(1234);
> +
> +               sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
> +                                       BPF_F_CURRENT_NETNS, 0);
> +       }
> +
> +       /* We can't do a single skc_lookup_tcp here, because then the compiler
> +        * will likely spill tuple_len to the stack. This makes it lose all
> +        * bounds information in the verifier, which then rejects the call as
> +        * unsafe.
> +        */
> +       if (!sk)
> +               return TC_ACT_SHOT;
> +
> +       if (sk->state != BPF_TCP_LISTEN) {
> +               bpf_sk_release(sk);
> +               return TC_ACT_SHOT;
> +       }
> +
> +assign:
> +       ret = bpf_sk_assign(skb, sk, 0);
> +       bpf_sk_release(sk);
> +       return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
> +}
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26  2:16   ` Andrii Nakryiko
@ 2020-03-26  5:28     ` Joe Stringer
  2020-03-26  6:31       ` Martin KaFai Lau
  2020-03-26 19:36       ` Andrii Nakryiko
  0 siblings, 2 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-26  5:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joe Stringer, bpf, Lorenz Bauer, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Wed, Mar 25, 2020 at 7:16 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Mar 24, 2020 at 10:58 PM Joe Stringer <joe@wand.net.nz> wrote:
> >
> > From: Lorenz Bauer <lmb@cloudflare.com>
> >
> > Attach a tc direct-action classifier to lo in a fresh network
> > namespace, and rewrite all connection attempts to localhost:4321
> > to localhost:1234 (for port tests) and connections to unreachable
> > IPv4/IPv6 IPs to the local socket (for address tests).
> >
> > Keep in mind that both client to server and server to client traffic
> > passes the classifier.
> >
> > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > Co-authored-by: Joe Stringer <joe@wand.net.nz>
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---

<snip>

> > +static void handle_timeout(int signum)
> > +{
> > +       if (signum == SIGALRM)
> > +               fprintf(stderr, "Timed out while connecting to server\n");
> > +       kill(0, SIGKILL);
> > +}
> > +
> > +static struct sigaction timeout_action = {
> > +       .sa_handler = handle_timeout,
> > +};
> > +
> > +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> > +{
> > +       int fd = -1;
> > +
> > +       fd = socket(addr->sa_family, SOCK_STREAM, 0);
> > +       if (CHECK_FAIL(fd == -1))
> > +               goto out;
> > +       if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> > +               goto out;
>
> no-no-no, we are not doing this. It's part of prog_tests and shouldn't
> install its own signal handlers and sending asynchronous signals to
> itself. Please find another way to have a timeout.

I realise it didn't clean up after itself. How about signal(SIGALRM,
SIG_DFL); just like other existing tests do?

> > +       test__start_subtest("ipv4 addr redir");
> > +       if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> > +               goto out;
> > +
> > +       test__start_subtest("ipv6 addr redir");
>
> test__start_subtest() returns false if subtest is supposed to be
> skipped. If you ignore that, then test_progs -t and -n filtering won't
> work properly.

Will fix.

> > +       if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> > +               goto out;
> > +
> > +       err = 0;
> > +out:
> > +       close(server);
> > +       close(server_v6);
> > +       return err;
> > +}
> > +
> > +void test_sk_assign(void)
> > +{
> > +       int self_net;
> > +
> > +       self_net = open(NS_SELF, O_RDONLY);
>
> I'm not familiar with what this does. Can you please explain briefly
> what are the side effects of this? Asking because of shared test_progs
> environment worries, of course.

This one is opening an fd to the current program's netns path on the
filesystem. The intention was to use it to switch back to the current
netns after the unshare() call elsewhere which switches to a new
netns. As per the other feedback the bit where it switches back to
this netns was dropped along the way so I'll fix it up in the next
revision.

> > +SEC("sk_assign_test")
>
> Please use "canonical" section name that libbpf recognizes. This
> woulde be "action/" or "classifier/", right?

Will fix.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 1/5] bpf: Add socket assign support
  2020-03-25  5:57 ` [PATCHv2 bpf-next 1/5] bpf: Add socket assign support Joe Stringer
@ 2020-03-26  6:23   ` Martin KaFai Lau
  2020-03-26  6:31     ` Joe Stringer
  2020-03-26 10:24   ` Lorenz Bauer
  1 sibling, 1 reply; 41+ messages in thread
From: Martin KaFai Lau @ 2020-03-26  6:23 UTC (permalink / raw)
  To: Joe Stringer; +Cc: bpf, netdev, daniel, ast, eric.dumazet, lmb

On Tue, Mar 24, 2020 at 10:57:41PM -0700, Joe Stringer wrote:
> Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> 
> This helper requires the BPF program to discover the socket via a call
> to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> helper takes its own reference to the socket in addition to any existing
> reference that may or may not currently be obtained for the duration of
> BPF processing. For the destination socket to receive the traffic, the
> traffic must be routed towards that socket via local route. The
> simplest example route is below, but in practice you may want to route
> traffic more narrowly (eg by CIDR):
> 
>   $ ip route add local default dev lo
> 
> This patch avoids trying to introduce an extra bit into the skb->sk, as
> that would require more invasive changes to all code interacting with
> the socket to ensure that the bit is handled correctly, such as all
> error-handling cases along the path from the helper in BPF through to
> the orphan path in the input. Instead, we opt to use the destructor
> variable to switch on the prefetch of the socket.
> 
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> v2: Use skb->destructor to determine socket prefetch usage instead of
>       introducing a new metadata_dst
>     Restrict socket assign to same netns as TC device
>     Restrict assigning reuseport sockets
>     Adjust commit wording
> v1: Initial version
> ---
>  include/net/sock.h             |  7 +++++++
>  include/uapi/linux/bpf.h       | 25 ++++++++++++++++++++++++-
>  net/core/filter.c              | 31 +++++++++++++++++++++++++++++++
>  net/core/sock.c                |  9 +++++++++
>  net/ipv4/ip_input.c            |  3 ++-
>  net/ipv6/ip6_input.c           |  3 ++-
>  net/sched/act_bpf.c            |  2 ++
>  tools/include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
>  8 files changed, 101 insertions(+), 4 deletions(-)
> 

[ ... ]

> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index 46f47e58b3be..6c7ed8fcc909 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -53,6 +53,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
>  		bpf_compute_data_pointers(skb);
>  		filter_res = BPF_PROG_RUN(filter, skb);
>  	}
> +	if (filter_res != TC_ACT_OK)
Should skb_sk_is_prefetched() be checked also?

> +		skb_orphan(skb);
>  	rcu_read_unlock();
>  
>  	/* A BPF program may overwrite the default action opcode.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25 20:55     ` Joe Stringer
@ 2020-03-26  6:25       ` Martin KaFai Lau
  2020-03-26  6:38         ` Joe Stringer
  0 siblings, 1 reply; 41+ messages in thread
From: Martin KaFai Lau @ 2020-03-26  6:25 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Lorenz Bauer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet

On Wed, Mar 25, 2020 at 01:55:59PM -0700, Joe Stringer wrote:
> On Wed, Mar 25, 2020 at 3:35 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
> > >
> > > From: Lorenz Bauer <lmb@cloudflare.com>
> > >
> > > Attach a tc direct-action classifier to lo in a fresh network
> > > namespace, and rewrite all connection attempts to localhost:4321
> > > to localhost:1234 (for port tests) and connections to unreachable
> > > IPv4/IPv6 IPs to the local socket (for address tests).
> >
> > Can you extend this to cover UDP as well?
> 
> I'm working on a follow-up series for UDP, we need this too.
Other than selftests, what are the changes for UDP in patch 1 - 4?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26  5:28     ` Joe Stringer
@ 2020-03-26  6:31       ` Martin KaFai Lau
  2020-03-26 19:36       ` Andrii Nakryiko
  1 sibling, 0 replies; 41+ messages in thread
From: Martin KaFai Lau @ 2020-03-26  6:31 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Andrii Nakryiko, bpf, Lorenz Bauer, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet

On Wed, Mar 25, 2020 at 10:28:11PM -0700, Joe Stringer wrote:
> On Wed, Mar 25, 2020 at 7:16 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Mar 24, 2020 at 10:58 PM Joe Stringer <joe@wand.net.nz> wrote:
> > >
> > > From: Lorenz Bauer <lmb@cloudflare.com>
> > >
> > > Attach a tc direct-action classifier to lo in a fresh network
> > > namespace, and rewrite all connection attempts to localhost:4321
> > > to localhost:1234 (for port tests) and connections to unreachable
> > > IPv4/IPv6 IPs to the local socket (for address tests).
> > >
> > > Keep in mind that both client to server and server to client traffic
> > > passes the classifier.
> > >
> > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > > Co-authored-by: Joe Stringer <joe@wand.net.nz>
> > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > ---
> 
> <snip>
> 
> > > +static void handle_timeout(int signum)
> > > +{
> > > +       if (signum == SIGALRM)
> > > +               fprintf(stderr, "Timed out while connecting to server\n");
> > > +       kill(0, SIGKILL);
> > > +}
> > > +
> > > +static struct sigaction timeout_action = {
> > > +       .sa_handler = handle_timeout,
> > > +};
> > > +
> > > +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> > > +{
> > > +       int fd = -1;
> > > +
> > > +       fd = socket(addr->sa_family, SOCK_STREAM, 0);
> > > +       if (CHECK_FAIL(fd == -1))
> > > +               goto out;
> > > +       if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> > > +               goto out;
> >
> > no-no-no, we are not doing this. It's part of prog_tests and shouldn't
> > install its own signal handlers and sending asynchronous signals to
> > itself. Please find another way to have a timeout.
> 
> I realise it didn't clean up after itself. How about signal(SIGALRM,
> SIG_DFL); just like other existing tests do?
or setsockopt SO_SNDTIMEO?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 1/5] bpf: Add socket assign support
  2020-03-26  6:23   ` Martin KaFai Lau
@ 2020-03-26  6:31     ` Joe Stringer
  0 siblings, 0 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-26  6:31 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Lorenz Bauer

On Wed, Mar 25, 2020 at 11:23 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Mar 24, 2020 at 10:57:41PM -0700, Joe Stringer wrote:
> > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> >
> > This helper requires the BPF program to discover the socket via a call
> > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > helper takes its own reference to the socket in addition to any existing
> > reference that may or may not currently be obtained for the duration of
> > BPF processing. For the destination socket to receive the traffic, the
> > traffic must be routed towards that socket via local route. The
> > simplest example route is below, but in practice you may want to route
> > traffic more narrowly (eg by CIDR):
> >
> >   $ ip route add local default dev lo
> >
> > This patch avoids trying to introduce an extra bit into the skb->sk, as
> > that would require more invasive changes to all code interacting with
> > the socket to ensure that the bit is handled correctly, such as all
> > error-handling cases along the path from the helper in BPF through to
> > the orphan path in the input. Instead, we opt to use the destructor
> > variable to switch on the prefetch of the socket.
> >
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---
> > v2: Use skb->destructor to determine socket prefetch usage instead of
> >       introducing a new metadata_dst
> >     Restrict socket assign to same netns as TC device
> >     Restrict assigning reuseport sockets
> >     Adjust commit wording
> > v1: Initial version
> > ---
> >  include/net/sock.h             |  7 +++++++
> >  include/uapi/linux/bpf.h       | 25 ++++++++++++++++++++++++-
> >  net/core/filter.c              | 31 +++++++++++++++++++++++++++++++
> >  net/core/sock.c                |  9 +++++++++
> >  net/ipv4/ip_input.c            |  3 ++-
> >  net/ipv6/ip6_input.c           |  3 ++-
> >  net/sched/act_bpf.c            |  2 ++
> >  tools/include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
> >  8 files changed, 101 insertions(+), 4 deletions(-)
> >
>
> [ ... ]
>
> > diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> > index 46f47e58b3be..6c7ed8fcc909 100644
> > --- a/net/sched/act_bpf.c
> > +++ b/net/sched/act_bpf.c
> > @@ -53,6 +53,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
> >               bpf_compute_data_pointers(skb);
> >               filter_res = BPF_PROG_RUN(filter, skb);
> >       }
> > +     if (filter_res != TC_ACT_OK)
> Should skb_sk_is_prefetched() be checked also?

Yes, thanks, will fix.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26  6:25       ` Martin KaFai Lau
@ 2020-03-26  6:38         ` Joe Stringer
  2020-03-26 23:39           ` Joe Stringer
  0 siblings, 1 reply; 41+ messages in thread
From: Joe Stringer @ 2020-03-26  6:38 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, Lorenz Bauer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet

On Wed, Mar 25, 2020 at 11:25 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Mar 25, 2020 at 01:55:59PM -0700, Joe Stringer wrote:
> > On Wed, Mar 25, 2020 at 3:35 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > >
> > > On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
> > > >
> > > > From: Lorenz Bauer <lmb@cloudflare.com>
> > > >
> > > > Attach a tc direct-action classifier to lo in a fresh network
> > > > namespace, and rewrite all connection attempts to localhost:4321
> > > > to localhost:1234 (for port tests) and connections to unreachable
> > > > IPv4/IPv6 IPs to the local socket (for address tests).
> > >
> > > Can you extend this to cover UDP as well?
> >
> > I'm working on a follow-up series for UDP, we need this too.
> Other than selftests, what are the changes for UDP in patch 1 - 4?

Nothing in those patches, I have refactoring of all of the socket
helpers, skc_lookup_udp() and adding flags to the socket lookup
functions to support only looking for a certain type of sockets -
established or listen. This helps to avoid multiple lookups in these
cases where you really just want to look up established sockets with
the packet tuple first then look up the listener socket with the
unrelated/tproxy tuple. For UDP it makes it easier to find the correct
socket and in general (including TCP) helps to avoid up to two socket
hashtable lookups for this use case. This part is because the current
helpers all look up the established socket first then the listener
socket, so for the first packet that hits these we perform both of
these lookups for the packet tuple (which finds nothing), then look up
an established socket for the target tuple (which finds nothing) then
finally a listen socket for the target tuple. It's about another 300+
/ 250- changes overall, of which a large chunk is one patch that
refactors the code into macros. I haven't narrowed down for sure
whether the lookup flags patch is required for UDP cases yet.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-25 18:17   ` Yonghong Song
  2020-03-25 21:20     ` Joe Stringer
@ 2020-03-26 10:13     ` Lorenz Bauer
  2020-03-26 21:07       ` call for bpf progs. " Alexei Starovoitov
  1 sibling, 1 reply; 41+ messages in thread
From: Lorenz Bauer @ 2020-03-26 10:13 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Wed, 25 Mar 2020 at 18:17, Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 3/24/20 10:57 PM, Joe Stringer wrote:
> > From: Lorenz Bauer <lmb@cloudflare.com>
> >
> > Attach a tc direct-action classifier to lo in a fresh network
> > namespace, and rewrite all connection attempts to localhost:4321
> > to localhost:1234 (for port tests) and connections to unreachable
> > IPv4/IPv6 IPs to the local socket (for address tests).
> >
> > Keep in mind that both client to server and server to client traffic
> > passes the classifier.
> >
> > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > Co-authored-by: Joe Stringer <joe@wand.net.nz>
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---
> > v2: Rebase onto test_progs infrastructure
> > v1: Initial commit
> > ---
> >   tools/testing/selftests/bpf/Makefile          |   2 +-
> >   .../selftests/bpf/prog_tests/sk_assign.c      | 244 ++++++++++++++++++
> >   .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++
> >   3 files changed, 372 insertions(+), 1 deletion(-)
> >   create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_assign.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
> >
> > diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> > index 7729892e0b04..4f7f83d059ca 100644
> > --- a/tools/testing/selftests/bpf/Makefile
> > +++ b/tools/testing/selftests/bpf/Makefile
> > @@ -76,7 +76,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
> >   # Compile but not part of 'make run_tests'
> >   TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
> >       flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
> > -     test_lirc_mode2_user xdping test_cpp runqslower
> > +     test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign
>
> No test_sk_assign any more as the test is integrated into test_progs, right?
>
> >
> >   TEST_CUSTOM_PROGS = urandom_read
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/sk_assign.c b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> > new file mode 100644
> > index 000000000000..1f0afcc20c48
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/sk_assign.c
> > @@ -0,0 +1,244 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright (c) 2018 Facebook
> > +// Copyright (c) 2019 Cloudflare
> > +// Copyright (c) 2020 Isovalent, Inc.
> > +/*
> > + * Test that the socket assign program is able to redirect traffic towards a
> > + * socket, regardless of whether the port or address destination of the traffic
> > + * matches the port.
> > + */
> > +
> > +#define _GNU_SOURCE
> > +#include <fcntl.h>
> > +#include <signal.h>
> > +#include <stdlib.h>
> > +#include <unistd.h>
> > +
> > +#include "test_progs.h"
> > +
> > +#define TEST_DPORT 4321
> > +#define TEST_DADDR (0xC0A80203)
> > +#define NS_SELF "/proc/self/ns/net"
> > +
> > +static __u32 duration;
> > +
> > +static bool configure_stack(int self_net)
>
> self_net parameter is not used.
>
> > +{
> > +     /* Move to a new networking namespace */
> > +     if (CHECK_FAIL(unshare(CLONE_NEWNET)))
> > +             return false;
>
> You can use CHECK to encode better error messages. Thhis is what
> most test_progs tests are using.
>
> > +     /* Configure necessary links, routes */
> > +     if (CHECK_FAIL(system("ip link set dev lo up")))
> > +             return false;
> > +     if (CHECK_FAIL(system("ip route add local default dev lo")))
> > +             return false;
> > +     if (CHECK_FAIL(system("ip -6 route add local default dev lo")))
> > +             return false;
> > +
> > +     /* Load qdisc, BPF program */
> > +     if (CHECK_FAIL(system("tc qdisc add dev lo clsact")))
> > +             return false;
> > +     if (CHECK_FAIL(system("tc filter add dev lo ingress bpf direct-action "
> > +                  "object-file ./test_sk_assign.o section sk_assign_test")))
> > +             return false;
> > +
> > +     return true;
> > +}
> > +
> > +static int start_server(const struct sockaddr *addr, socklen_t len)
> > +{
> > +     int fd;
> > +
> > +     fd = socket(addr->sa_family, SOCK_STREAM, 0);
> > +     if (CHECK_FAIL(fd == -1))
> > +             goto out;
> > +     if (CHECK_FAIL(bind(fd, addr, len) == -1))
> > +             goto close_out;
> > +     if (CHECK_FAIL(listen(fd, 128) == -1))
> > +             goto close_out;
> > +
> > +     goto out;
> > +
> > +close_out:
> > +     close(fd);
> > +     fd = -1;
> > +out:
> > +     return fd;
> > +}
> > +
> > +static void handle_timeout(int signum)
> > +{
> > +     if (signum == SIGALRM)
> > +             fprintf(stderr, "Timed out while connecting to server\n");
> > +     kill(0, SIGKILL);
> > +}
> > +
> > +static struct sigaction timeout_action = {
> > +     .sa_handler = handle_timeout,
> > +};
> > +
> > +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> > +{
> > +     int fd = -1;
> > +
> > +     fd = socket(addr->sa_family, SOCK_STREAM, 0);
> > +     if (CHECK_FAIL(fd == -1))
> > +             goto out;
> > +     if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> > +             goto out;
>
> should this goto close_out?
>
> > +     alarm(3);
> > +     if (CHECK_FAIL(connect(fd, addr, len) == -1))
> > +             goto close_out;
> > +
> > +     goto out;
> > +
> > +close_out:
> > +     close(fd);
> > +     fd = -1;
> > +out:
> > +     return fd;
> > +}
> > +
> > +static in_port_t get_port(int fd)
> > +{
> > +     struct sockaddr_storage name;
> > +     socklen_t len;
> > +     in_port_t port = 0;
> > +
> > +     len = sizeof(name);
> > +     if (CHECK_FAIL(getsockname(fd, (struct sockaddr *)&name, &len)))
> > +             return port;
> > +
> > +     switch (name.ss_family) {
> > +     case AF_INET:
> > +             port = ((struct sockaddr_in *)&name)->sin_port;
> > +             break;
> > +     case AF_INET6:
> > +             port = ((struct sockaddr_in6 *)&name)->sin6_port;
> > +             break;
> > +     default:
> > +             CHECK(1, "Invalid address family", "%d\n", name.ss_family);
> > +     }
> > +     return port;
> > +}
> > +
> > +static int run_test(int server_fd, const struct sockaddr *addr, socklen_t len)
> > +{
> > +     int client = -1, srv_client = -1;
> > +     char buf[] = "testing";
> > +     in_port_t port;
> > +     int ret = 1;
> > +
> > +     client = connect_to_server(addr, len);
> > +     if (client == -1) {
> > +             perror("Cannot connect to server");
> > +             goto out;
> > +     }
> > +
> > +     srv_client = accept(server_fd, NULL, NULL);
> > +     if (CHECK_FAIL(srv_client == -1)) {
> > +             perror("Can't accept connection");
> > +             goto out;
> > +     }
> > +     if (CHECK_FAIL(write(client, buf, sizeof(buf)) != sizeof(buf))) {
> > +             perror("Can't write on client");
> > +             goto out;
> > +     }
> > +     if (CHECK_FAIL(read(srv_client, buf, sizeof(buf)) != sizeof(buf))) {
> > +             perror("Can't read on server");
> > +             goto out;
> > +     }
> > +
> > +     port = get_port(srv_client);
> > +     if (CHECK_FAIL(!port))
> > +             goto out;
> > +     if (CHECK(port != htons(TEST_DPORT), "Expected", "port %u but got %u",
> > +               TEST_DPORT, ntohs(port)))
> > +             goto out;
> > +
> > +     ret = 0;
> > +out:
> > +     close(client);
> > +     close(srv_client);
> > +     return ret;
> > +}
> > +
> > +static int do_sk_assign(void)
> > +{
> > +     struct sockaddr_in addr4;
> > +     struct sockaddr_in6 addr6;
> > +     int server = -1;
> > +     int server_v6 = -1;
> > +     int err = 1;
> > +
> > +     memset(&addr4, 0, sizeof(addr4));
> > +     addr4.sin_family = AF_INET;
> > +     addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> > +     addr4.sin_port = htons(1234);
> > +
> > +     memset(&addr6, 0, sizeof(addr6));
> > +     addr6.sin6_family = AF_INET6;
> > +     addr6.sin6_addr = in6addr_loopback;
> > +     addr6.sin6_port = htons(1234);
> > +
> > +     server = start_server((const struct sockaddr *)&addr4, sizeof(addr4));
> > +     if (server == -1)
> > +             goto out;
> > +
> > +     server_v6 = start_server((const struct sockaddr *)&addr6,
> > +                              sizeof(addr6));
> > +     if (server_v6 == -1)
> > +             goto out;
> > +
> > +     /* Connect to unbound ports */
> > +     addr4.sin_port = htons(TEST_DPORT);
> > +     addr6.sin6_port = htons(TEST_DPORT);
> > +
> > +     test__start_subtest("ipv4 port redir");
> > +     if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> > +             goto out;
> > +
> > +     test__start_subtest("ipv6 port redir");
> > +     if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> > +             goto out;
> > +
> > +     /* Connect to unbound addresses */
> > +     addr4.sin_addr.s_addr = htonl(TEST_DADDR);
> > +     addr6.sin6_addr.s6_addr32[3] = htonl(TEST_DADDR);
> > +
> > +     test__start_subtest("ipv4 addr redir");
> > +     if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> > +             goto out;
> > +
> > +     test__start_subtest("ipv6 addr redir");
> > +     if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> > +             goto out;
> > +
> > +     err = 0;
> > +out:
> > +     close(server);
> > +     close(server_v6);
> > +     return err;
> > +}
> > +
> > +void test_sk_assign(void)
> > +{
> > +     int self_net;
> > +
> > +     self_net = open(NS_SELF, O_RDONLY);
> > +     if (CHECK_FAIL(self_net < 0)) {
> > +             perror("Unable to open "NS_SELF);
> > +             return;
> > +     }
> > +
> > +     if (!configure_stack(self_net)) {
> > +             perror("configure_stack");
> > +             goto cleanup;
> > +     }
> > +
> > +     do_sk_assign();
> > +
> > +cleanup:
> > +     close(self_net);
>
> Did we exit the newly unshared net namespace and restored the previous
> namespace?
>
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/test_sk_assign.c b/tools/testing/selftests/bpf/progs/test_sk_assign.c
> > new file mode 100644
> > index 000000000000..7de30ad3f594
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_sk_assign.c
> > @@ -0,0 +1,127 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright (c) 2019 Cloudflare Ltd.
> > +
> > +#include <stddef.h>
> > +#include <stdbool.h>
> > +#include <string.h>
> > +#include <linux/bpf.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/in.h>
> > +#include <linux/ip.h>
> > +#include <linux/ipv6.h>
> > +#include <linux/pkt_cls.h>
> > +#include <linux/tcp.h>
> > +#include <sys/socket.h>
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_endian.h>
> > +
> > +int _version SEC("version") = 1;
> > +char _license[] SEC("license") = "GPL";
> > +
> > +/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
> > +static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
> > +                                     void *data_end, __u16 eth_proto,
> > +                                     bool *ipv4)
> > +{
> > +     struct bpf_sock_tuple *result;
> > +     __u8 proto = 0;
> > +     __u64 ihl_len;
> > +
> > +     if (eth_proto == bpf_htons(ETH_P_IP)) {
> > +             struct iphdr *iph = (struct iphdr *)(data + nh_off);
> > +
> > +             if (iph + 1 > data_end)
> > +                     return NULL;
> > +             if (iph->ihl != 5)
> > +                     /* Options are not supported */
> > +                     return NULL;
> > +             ihl_len = iph->ihl * 4;
> > +             proto = iph->protocol;
> > +             *ipv4 = true;
> > +             result = (struct bpf_sock_tuple *)&iph->saddr;
> > +     } else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
> > +             struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
> > +
> > +             if (ip6h + 1 > data_end)
> > +                     return NULL;
> > +             ihl_len = sizeof(*ip6h);
> > +             proto = ip6h->nexthdr;
> > +             *ipv4 = false;
> > +             result = (struct bpf_sock_tuple *)&ip6h->saddr;
> > +     } else {
> > +             return NULL;
> > +     }
> > +
> > +     if (result + 1 > data_end || proto != IPPROTO_TCP)
> > +             return NULL;
> > +
> > +     return result;
> > +}
> > +
> > +SEC("sk_assign_test")
> > +int bpf_sk_assign_test(struct __sk_buff *skb)
> > +{
> > +     void *data_end = (void *)(long)skb->data_end;
> > +     void *data = (void *)(long)skb->data;
> > +     struct ethhdr *eth = (struct ethhdr *)(data);
> > +     struct bpf_sock_tuple *tuple, ln = {0};
> > +     struct bpf_sock *sk;
> > +     int tuple_len;
> > +     bool ipv4;
> > +     int ret;
> > +
> > +     if (eth + 1 > data_end)
> > +             return TC_ACT_SHOT;
> > +
> > +     tuple = get_tuple(data, sizeof(*eth), data_end, eth->h_proto, &ipv4);
> > +     if (!tuple)
> > +             return TC_ACT_SHOT;
> > +
> > +     tuple_len = ipv4 ? sizeof(tuple->ipv4) : sizeof(tuple->ipv6);
> > +     sk = bpf_skc_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
>
> You can get rid of tuple_len with
>         if (ipv4)
>                 sk = bpf_skc_lookup_tcp(..., sizeof(tuple->ipv4), ...);
>         else
>                 sk = bpf_skc_lookup_tcp(..., sizeof(tuple->ipv6), ...);
>
> and later on you can do common bpf_skc_lookup_tcp.
> But it may not be worthwhile to do it, as you have two separate calls
> in the above instead.
>
> > +     if (sk) {
> > +             if (sk->state != BPF_TCP_LISTEN)
> > +                     goto assign;
> > +
> > +             bpf_sk_release(sk);
> > +     }
> > +
> > +     if (ipv4) {
> > +             if (tuple->ipv4.dport != bpf_htons(4321))
> > +                     return TC_ACT_OK;
> > +
> > +             ln.ipv4.daddr = bpf_htonl(0x7f000001);
> > +             ln.ipv4.dport = bpf_htons(1234);
> > +
> > +             sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
> > +                                     BPF_F_CURRENT_NETNS, 0);
> > +     } else {
> > +             if (tuple->ipv6.dport != bpf_htons(4321))
> > +                     return TC_ACT_OK;
> > +
> > +             /* Upper parts of daddr are already zero. */
> > +             ln.ipv6.daddr[3] = bpf_htonl(0x1);
> > +             ln.ipv6.dport = bpf_htons(1234);
> > +
> > +             sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
> > +                                     BPF_F_CURRENT_NETNS, 0);
> > +     }
> > +
> > +     /* We can't do a single skc_lookup_tcp here, because then the compiler
> > +      * will likely spill tuple_len to the stack. This makes it lose all
> > +      * bounds information in the verifier, which then rejects the call as
> > +      * unsafe.
> > +      */
>
> This is a known issue. For scalars, only constant is restored properly
> in verifier at this moment. I did some hacking before to enable any
> scalars. The fear is this will make pruning performs worse. More
> study is needed here.

Of topic, but: this is actually one of the most challenging issues for
us when writing
BPF. It forces us to have very deep call graphs to hopefully avoid clang
spilling the constants. Please let me know if I can help in any way.

>
> > +     if (!sk)
> > +             return TC_ACT_SHOT;
> > +
> > +     if (sk->state != BPF_TCP_LISTEN) {
> > +             bpf_sk_release(sk);
> > +             return TC_ACT_SHOT;
> > +     }
> > +
> > +assign:
> > +     ret = bpf_sk_assign(skb, sk, 0);
> > +     bpf_sk_release(sk);
> > +     return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
> > +}
> >



-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign()
  2020-03-25 20:46     ` Joe Stringer
@ 2020-03-26 10:20       ` Lorenz Bauer
  2020-03-26 21:37         ` Joe Stringer
  0 siblings, 1 reply; 41+ messages in thread
From: Lorenz Bauer @ 2020-03-26 10:20 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Martin Lau

On Wed, 25 Mar 2020 at 20:47, Joe Stringer <joe@wand.net.nz> wrote:
>
> On Wed, Mar 25, 2020 at 3:29 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
> > >
> > > Avoid taking a reference on listen sockets by checking the socket type
> > > in the sk_assign and in the corresponding skb_steal_sock() code in the
> > > the transport layer, and by ensuring that the prefetch free (sock_pfree)
> > > function uses the same logic to check whether the socket is refcounted.
> > >
> > > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > ---
> > > v2: Initial version
> > > ---
> > >  include/net/sock.h | 25 +++++++++++++++++--------
> > >  net/core/filter.c  |  6 +++---
> > >  net/core/sock.c    |  3 ++-
> > >  3 files changed, 22 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 1ca2e808cb8e..3ec1865f173e 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -2533,6 +2533,21 @@ skb_sk_is_prefetched(struct sk_buff *skb)
> > >         return skb->destructor == sock_pfree;
> > >  }
> > >
> > > +/* This helper checks if a socket is a full socket,
> > > + * ie _not_ a timewait or request socket.
> > > + */
> > > +static inline bool sk_fullsock(const struct sock *sk)
> > > +{
> > > +       return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> > > +}
> > > +
> > > +static inline bool
> > > +sk_is_refcounted(struct sock *sk)
> > > +{
> > > +       /* Only full sockets have sk->sk_flags. */
> > > +       return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
> > > +}
> > > +
> > >  /**
> > >   * skb_steal_sock
> > >   * @skb to steal the socket from
> > > @@ -2545,6 +2560,8 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> > >                 struct sock *sk = skb->sk;
> > >
> > >                 *refcounted = true;
> > > +               if (skb_sk_is_prefetched(skb))
> > > +                       *refcounted = sk_is_refcounted(sk);
> > >                 skb->destructor = NULL;
> > >                 skb->sk = NULL;
> > >                 return sk;
> > > @@ -2553,14 +2570,6 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> > >         return NULL;
> > >  }
> > >
> > > -/* This helper checks if a socket is a full socket,
> > > - * ie _not_ a timewait or request socket.
> > > - */
> > > -static inline bool sk_fullsock(const struct sock *sk)
> > > -{
> > > -       return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> > > -}
> > > -
> > >  /* Checks if this SKB belongs to an HW offloaded socket
> > >   * and whether any SW fallbacks are required based on dev.
> > >   * Check decrypted mark in case skb_orphan() cleared socket.
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 0fada7fe9b75..997b8606167e 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -5343,8 +5343,7 @@ static const struct bpf_func_proto bpf_sk_lookup_udp_proto = {
> > >
> > >  BPF_CALL_1(bpf_sk_release, struct sock *, sk)
> > >  {
> > > -       /* Only full sockets have sk->sk_flags. */
> > > -       if (!sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE))
> > > +       if (sk_is_refcounted(sk))
> > >                 sock_gen_put(sk);
> > >         return 0;
> > >  }
> > > @@ -5870,7 +5869,8 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > >                 return -ESOCKTNOSUPPORT;
> > >         if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> > >                 return -ENETUNREACH;
> > > -       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > +       if (sk_is_refcounted(sk) &&
> > > +           unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > >                 return -ENOENT;
> > >
> > >         skb_orphan(skb);
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index cfaf60267360..a2ab79446f59 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -2076,7 +2076,8 @@ EXPORT_SYMBOL(sock_efree);
> > >   */
> > >  void sock_pfree(struct sk_buff *skb)
> > >  {
> > > -       sock_edemux(skb);
> > > +       if (sk_is_refcounted(skb->sk))
> > > +               sock_edemux(skb);
> >
> > sock_edemux calls sock_gen_put, which is also called by
> > bpf_sk_release. Is it worth teaching sock_gen_put about
> > sk_fullsock, and dropping the other helpers? I was considering this
> > when fixing up sk_release, but then forgot
> > about it.
>
> I like the idea, but I'm concerned about breaking things outside the
> focus of this new helper if the skb_sk_is_prefetched() function from
> patch 1 is allowed to return true for sockets other than the ones
> assigned from the bpf_sk_assign() helper. At a glance there's users of
> sock_efree (which sock_edemux can be defined to) like netem_enqueue()
> which may inadvertently trigger unexpected paths here. I think it's
> more explicit so more obviously correct if the destructor pointer used
> in this series is unique compared to other paths, even if the
> underlying code is the same.

Sorry, I didn't mean to get rid of sock_pfree, I was referring to
sk_fullsock and
sk_is_refcounted. My point was that it's weird that sock_gen_put isn't
actually generic because it doesn't properly handle SOCK_RCU_FREE.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 1/5] bpf: Add socket assign support
  2020-03-25  5:57 ` [PATCHv2 bpf-next 1/5] bpf: Add socket assign support Joe Stringer
  2020-03-26  6:23   ` Martin KaFai Lau
@ 2020-03-26 10:24   ` Lorenz Bauer
  2020-03-26 22:52     ` Joe Stringer
  1 sibling, 1 reply; 41+ messages in thread
From: Lorenz Bauer @ 2020-03-26 10:24 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Martin Lau

On Wed, 25 Mar 2020 at 05:57, Joe Stringer <joe@wand.net.nz> wrote:
>
> Add support for TPROXY via a new bpf helper, bpf_sk_assign().
>
> This helper requires the BPF program to discover the socket via a call
> to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> helper takes its own reference to the socket in addition to any existing
> reference that may or may not currently be obtained for the duration of
> BPF processing. For the destination socket to receive the traffic, the
> traffic must be routed towards that socket via local route. The
> simplest example route is below, but in practice you may want to route
> traffic more narrowly (eg by CIDR):
>
>   $ ip route add local default dev lo
>
> This patch avoids trying to introduce an extra bit into the skb->sk, as
> that would require more invasive changes to all code interacting with
> the socket to ensure that the bit is handled correctly, such as all
> error-handling cases along the path from the helper in BPF through to
> the orphan path in the input. Instead, we opt to use the destructor
> variable to switch on the prefetch of the socket.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> v2: Use skb->destructor to determine socket prefetch usage instead of
>       introducing a new metadata_dst
>     Restrict socket assign to same netns as TC device
>     Restrict assigning reuseport sockets
>     Adjust commit wording
> v1: Initial version
> ---
>  include/net/sock.h             |  7 +++++++
>  include/uapi/linux/bpf.h       | 25 ++++++++++++++++++++++++-
>  net/core/filter.c              | 31 +++++++++++++++++++++++++++++++
>  net/core/sock.c                |  9 +++++++++
>  net/ipv4/ip_input.c            |  3 ++-
>  net/ipv6/ip6_input.c           |  3 ++-
>  net/sched/act_bpf.c            |  2 ++
>  tools/include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
>  8 files changed, 101 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index b5cca7bae69b..2613d21a667a 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1657,6 +1657,7 @@ struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
>  void skb_orphan_partial(struct sk_buff *skb);
>  void sock_rfree(struct sk_buff *skb);
>  void sock_efree(struct sk_buff *skb);
> +void sock_pfree(struct sk_buff *skb);
>  #ifdef CONFIG_INET
>  void sock_edemux(struct sk_buff *skb);
>  #else
> @@ -2526,6 +2527,12 @@ void sock_net_set(struct sock *sk, struct net *net)
>         write_pnet(&sk->sk_net, net);
>  }
>
> +static inline bool
> +skb_sk_is_prefetched(struct sk_buff *skb)
> +{
> +       return skb->destructor == sock_pfree;
> +}
> +
>  static inline struct sock *skb_steal_sock(struct sk_buff *skb)
>  {
>         if (skb->sk) {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 5d01c5c7e598..0c6f151deebe 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2950,6 +2950,28 @@ union bpf_attr {
>   *             restricted to raw_tracepoint bpf programs.
>   *     Return
>   *             0 on success, or a negative error in case of failure.
> + *
> + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> + *     Description
> + *             Assign the *sk* to the *skb*. When combined with appropriate
> + *             routing configuration to receive the packet towards the socket,
> + *             will cause *skb* to be delivered to the specified socket.
> + *             Subsequent redirection of *skb* via  **bpf_redirect**\ (),
> + *             **bpf_clone_redirect**\ () or other methods outside of BPF may
> + *             interfere with successful delivery to the socket.
> + *
> + *             This operation is only valid from TC ingress path.
> + *
> + *             The *flags* argument must be zero.
> + *     Return
> + *             0 on success, or a negative errno in case of failure.
> + *
> + *             * **-EINVAL**           Unsupported flags specified.
> + *             * **-ENETUNREACH**      Socket is unreachable (wrong netns).
> + *             * **-ENOENT**           Socket is unavailable for assignment.
> + *             * **-EOPNOTSUPP**       Unsupported operation, for example a
> + *                                     call from outside of TC ingress.
> + *             * **-ESOCKTNOSUPPORT**  Socket type not supported (reuseport).
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -3073,7 +3095,8 @@ union bpf_attr {
>         FN(jiffies64),                  \
>         FN(read_branch_records),        \
>         FN(get_ns_current_pid_tgid),    \
> -       FN(xdp_output),
> +       FN(xdp_output),                 \
> +       FN(sk_assign),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 96350a743539..f7f9b6631f75 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5860,6 +5860,35 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
>         .arg5_type      = ARG_CONST_SIZE,
>  };
>
> +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> +{
> +       if (flags != 0)
> +               return -EINVAL;
> +       if (!skb_at_tc_ingress(skb))
> +               return -EOPNOTSUPP;
> +       if (unlikely(sk->sk_reuseport))
> +               return -ESOCKTNOSUPPORT;
> +       if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> +               return -ENETUNREACH;
> +       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> +               return -ENOENT;
> +
> +       skb_orphan(skb);
> +       skb->sk = sk;
> +       skb->destructor = sock_pfree;
> +
> +       return 0;
> +}

Follow up to my email re UDP tests: it seems like the helper doesn't check
that the sk is TCP, hence I assumed that you want to add support for
both in the same series.

Also, is it possible to check that the sk protocol matches skb protocol?

> +
> +static const struct bpf_func_proto bpf_sk_assign_proto = {
> +       .func           = bpf_sk_assign,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_PTR_TO_SOCK_COMMON,
> +       .arg3_type      = ARG_ANYTHING,
> +};
> +
>  #endif /* CONFIG_INET */
>
>  bool bpf_helper_changes_pkt_data(void *func)
> @@ -6153,6 +6182,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                 return &bpf_skb_ecn_set_ce_proto;
>         case BPF_FUNC_tcp_gen_syncookie:
>                 return &bpf_tcp_gen_syncookie_proto;
> +       case BPF_FUNC_sk_assign:
> +               return &bpf_sk_assign_proto;
>  #endif
>         default:
>                 return bpf_base_func_proto(func_id);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 0fc8937a7ff4..cfaf60267360 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2071,6 +2071,15 @@ void sock_efree(struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(sock_efree);
>
> +/* Buffer destructor for prefetch/receive path where reference count may
> + * not be held, e.g. for listen sockets.
> + */
> +void sock_pfree(struct sk_buff *skb)
> +{
> +       sock_edemux(skb);
> +}
> +EXPORT_SYMBOL(sock_pfree);
> +
>  kuid_t sock_i_uid(struct sock *sk)
>  {
>         kuid_t uid;
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index aa438c6758a7..b0c244af1e4d 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -509,7 +509,8 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
>         IPCB(skb)->iif = skb->skb_iif;
>
>         /* Must drop socket now because of tproxy. */
> -       skb_orphan(skb);
> +       if (!skb_sk_is_prefetched(skb))
> +               skb_orphan(skb);
>
>         return skb;
>
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 7b089d0ac8cd..e96304d8a4a7 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -285,7 +285,8 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
>         rcu_read_unlock();
>
>         /* Must drop socket now because of tproxy. */
> -       skb_orphan(skb);
> +       if (!skb_sk_is_prefetched(skb))
> +               skb_orphan(skb);
>
>         return skb;
>  err:
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index 46f47e58b3be..6c7ed8fcc909 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -53,6 +53,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
>                 bpf_compute_data_pointers(skb);
>                 filter_res = BPF_PROG_RUN(filter, skb);
>         }
> +       if (filter_res != TC_ACT_OK)
> +               skb_orphan(skb);
>         rcu_read_unlock();
>
>         /* A BPF program may overwrite the default action opcode.
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 5d01c5c7e598..0c6f151deebe 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2950,6 +2950,28 @@ union bpf_attr {
>   *             restricted to raw_tracepoint bpf programs.
>   *     Return
>   *             0 on success, or a negative error in case of failure.
> + *
> + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> + *     Description
> + *             Assign the *sk* to the *skb*. When combined with appropriate
> + *             routing configuration to receive the packet towards the socket,
> + *             will cause *skb* to be delivered to the specified socket.
> + *             Subsequent redirection of *skb* via  **bpf_redirect**\ (),
> + *             **bpf_clone_redirect**\ () or other methods outside of BPF may
> + *             interfere with successful delivery to the socket.
> + *
> + *             This operation is only valid from TC ingress path.
> + *
> + *             The *flags* argument must be zero.
> + *     Return
> + *             0 on success, or a negative errno in case of failure.
> + *
> + *             * **-EINVAL**           Unsupported flags specified.
> + *             * **-ENETUNREACH**      Socket is unreachable (wrong netns).
> + *             * **-ENOENT**           Socket is unavailable for assignment.
> + *             * **-EOPNOTSUPP**       Unsupported operation, for example a
> + *                                     call from outside of TC ingress.
> + *             * **-ESOCKTNOSUPPORT**  Socket type not supported (reuseport).
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -3073,7 +3095,8 @@ union bpf_attr {
>         FN(jiffies64),                  \
>         FN(read_branch_records),        \
>         FN(get_ns_current_pid_tgid),    \
> -       FN(xdp_output),
> +       FN(xdp_output),                 \
> +       FN(sk_assign),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> --
> 2.20.1
>


-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26  5:28     ` Joe Stringer
  2020-03-26  6:31       ` Martin KaFai Lau
@ 2020-03-26 19:36       ` Andrii Nakryiko
  2020-03-26 21:38         ` Joe Stringer
  1 sibling, 1 reply; 41+ messages in thread
From: Andrii Nakryiko @ 2020-03-26 19:36 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Lorenz Bauer, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Wed, Mar 25, 2020 at 10:28 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> On Wed, Mar 25, 2020 at 7:16 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Mar 24, 2020 at 10:58 PM Joe Stringer <joe@wand.net.nz> wrote:
> > >
> > > From: Lorenz Bauer <lmb@cloudflare.com>
> > >
> > > Attach a tc direct-action classifier to lo in a fresh network
> > > namespace, and rewrite all connection attempts to localhost:4321
> > > to localhost:1234 (for port tests) and connections to unreachable
> > > IPv4/IPv6 IPs to the local socket (for address tests).
> > >
> > > Keep in mind that both client to server and server to client traffic
> > > passes the classifier.
> > >
> > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > > Co-authored-by: Joe Stringer <joe@wand.net.nz>
> > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > ---
>
> <snip>
>
> > > +static void handle_timeout(int signum)
> > > +{
> > > +       if (signum == SIGALRM)
> > > +               fprintf(stderr, "Timed out while connecting to server\n");
> > > +       kill(0, SIGKILL);
> > > +}
> > > +
> > > +static struct sigaction timeout_action = {
> > > +       .sa_handler = handle_timeout,
> > > +};
> > > +
> > > +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> > > +{
> > > +       int fd = -1;
> > > +
> > > +       fd = socket(addr->sa_family, SOCK_STREAM, 0);
> > > +       if (CHECK_FAIL(fd == -1))
> > > +               goto out;
> > > +       if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> > > +               goto out;
> >
> > no-no-no, we are not doing this. It's part of prog_tests and shouldn't
> > install its own signal handlers and sending asynchronous signals to
> > itself. Please find another way to have a timeout.
>
> I realise it didn't clean up after itself. How about signal(SIGALRM,
> SIG_DFL); just like other existing tests do?

You have alarm(3), which will deliver SIGALARM later, when other test
is running. Once you clean up this custom signal handler it will cause
test_progs to coredump. So still no, please find another way to do
timeouts.


>
> > > +       test__start_subtest("ipv4 addr redir");
> > > +       if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
> > > +               goto out;
> > > +
> > > +       test__start_subtest("ipv6 addr redir");
> >
> > test__start_subtest() returns false if subtest is supposed to be
> > skipped. If you ignore that, then test_progs -t and -n filtering won't
> > work properly.
>
> Will fix.
>
> > > +       if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
> > > +               goto out;
> > > +
> > > +       err = 0;
> > > +out:
> > > +       close(server);
> > > +       close(server_v6);
> > > +       return err;
> > > +}
> > > +
> > > +void test_sk_assign(void)
> > > +{
> > > +       int self_net;
> > > +
> > > +       self_net = open(NS_SELF, O_RDONLY);
> >
> > I'm not familiar with what this does. Can you please explain briefly
> > what are the side effects of this? Asking because of shared test_progs
> > environment worries, of course.
>
> This one is opening an fd to the current program's netns path on the
> filesystem. The intention was to use it to switch back to the current
> netns after the unshare() call elsewhere which switches to a new
> netns. As per the other feedback the bit where it switches back to
> this netns was dropped along the way so I'll fix it up in the next
> revision.
>
> > > +SEC("sk_assign_test")
> >
> > Please use "canonical" section name that libbpf recognizes. This
> > woulde be "action/" or "classifier/", right?
>
> Will fix.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* call for bpf progs. Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26 10:13     ` Lorenz Bauer
@ 2020-03-26 21:07       ` " Alexei Starovoitov
  2020-03-26 23:14         ` Yonghong Song
                           ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2020-03-26 21:07 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Yonghong Song, Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Thu, Mar 26, 2020 at 10:13:31AM +0000, Lorenz Bauer wrote:
> > > +
> > > +     if (ipv4) {
> > > +             if (tuple->ipv4.dport != bpf_htons(4321))
> > > +                     return TC_ACT_OK;
> > > +
> > > +             ln.ipv4.daddr = bpf_htonl(0x7f000001);
> > > +             ln.ipv4.dport = bpf_htons(1234);
> > > +
> > > +             sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
> > > +                                     BPF_F_CURRENT_NETNS, 0);
> > > +     } else {
> > > +             if (tuple->ipv6.dport != bpf_htons(4321))
> > > +                     return TC_ACT_OK;
> > > +
> > > +             /* Upper parts of daddr are already zero. */
> > > +             ln.ipv6.daddr[3] = bpf_htonl(0x1);
> > > +             ln.ipv6.dport = bpf_htons(1234);
> > > +
> > > +             sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
> > > +                                     BPF_F_CURRENT_NETNS, 0);
> > > +     }
> > > +
> > > +     /* We can't do a single skc_lookup_tcp here, because then the compiler
> > > +      * will likely spill tuple_len to the stack. This makes it lose all
> > > +      * bounds information in the verifier, which then rejects the call as
> > > +      * unsafe.
> > > +      */
> >
> > This is a known issue. For scalars, only constant is restored properly
> > in verifier at this moment. I did some hacking before to enable any
> > scalars. The fear is this will make pruning performs worse. More
> > study is needed here.
> 
> Of topic, but: this is actually one of the most challenging issues for
> us when writing
> BPF. It forces us to have very deep call graphs to hopefully avoid clang
> spilling the constants. Please let me know if I can help in any way.

Thanks for bringing this up.
Yonghong, please correct me if I'm wrong.
I think you've experimented with tracking spilled constants. The first issue
came with spilling of 4 byte constant. The verifier tracks 8 byte slots and
lots of places assume that slot granularity. It's not clear yet how to refactor
the verifier. Ideas, help are greatly appreciated.
The second concern was pruning, but iirc the experiments were inconclusive.
selftests/bpf only has old fb progs. Hence, I think, the step zero is for
everyone to contribute their bpf programs written in C. If we have both
cilium and cloudflare progs as selftests it will help a lot to guide such long
lasting verifier decisions.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 2/5] bpf: Prefetch established socket destinations
  2020-03-25  5:57 ` [PATCHv2 bpf-next 2/5] bpf: Prefetch established socket destinations Joe Stringer
@ 2020-03-26 21:11   ` Alexei Starovoitov
  2020-03-26 21:45     ` Joe Stringer
  0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2020-03-26 21:11 UTC (permalink / raw)
  To: Joe Stringer; +Cc: bpf, netdev, daniel, ast, eric.dumazet, lmb, kafai

On Tue, Mar 24, 2020 at 10:57:42PM -0700, Joe Stringer wrote:
> Enhance the sk_assign logic to temporarily store the socket
> receive destination, to save the route lookup later on. The dst
> reference is kept alive by the caller's socket reference.
> 
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> v2: Provide cookie to dst_check() for IPv6 case
> v1: Initial version
> ---
>  net/core/filter.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index f7f9b6631f75..0fada7fe9b75 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5876,6 +5876,21 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
>  	skb_orphan(skb);
>  	skb->sk = sk;
>  	skb->destructor = sock_pfree;
> +	if (sk_fullsock(sk)) {
> +		struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
> +		u32 cookie = 0;
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +		if (sk->sk_family == AF_INET6)
> +			cookie = inet6_sk(sk)->rx_dst_cookie;
> +#endif
> +		if (dst)
> +			dst = dst_check(dst, cookie);
> +		if (dst) {
> +			skb_dst_drop(skb);
> +			skb_dst_set_noref(skb, dst);
> +		}

I think the rest of the feedback for the patches can be addressed quickly and
overall the set is imo ready to land within this cycle. My only concern is
above dst_set().
Since it's an optimization may be drop this patch? we can land
the rest and this one can be introduced in the next cycle?
I'm happy to be convinced otherwise, but would like a better explanation
why it's safe to do so in this context.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign()
  2020-03-26 10:20       ` Lorenz Bauer
@ 2020-03-26 21:37         ` Joe Stringer
  0 siblings, 0 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-26 21:37 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Thu, Mar 26, 2020 at 3:21 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Wed, 25 Mar 2020 at 20:47, Joe Stringer <joe@wand.net.nz> wrote:
> >
> > On Wed, Mar 25, 2020 at 3:29 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > >
> > > On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
> > > >
> > > > Avoid taking a reference on listen sockets by checking the socket type
> > > > in the sk_assign and in the corresponding skb_steal_sock() code in the
> > > > the transport layer, and by ensuring that the prefetch free (sock_pfree)
> > > > function uses the same logic to check whether the socket is refcounted.
> > > >
> > > > Suggested-by: Martin KaFai Lau <kafai@fb.com>
> > > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > > ---
> > > > v2: Initial version
> > > > ---
> > > >  include/net/sock.h | 25 +++++++++++++++++--------
> > > >  net/core/filter.c  |  6 +++---
> > > >  net/core/sock.c    |  3 ++-
> > > >  3 files changed, 22 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > index 1ca2e808cb8e..3ec1865f173e 100644
> > > > --- a/include/net/sock.h
> > > > +++ b/include/net/sock.h
> > > > @@ -2533,6 +2533,21 @@ skb_sk_is_prefetched(struct sk_buff *skb)
> > > >         return skb->destructor == sock_pfree;
> > > >  }
> > > >
> > > > +/* This helper checks if a socket is a full socket,
> > > > + * ie _not_ a timewait or request socket.
> > > > + */
> > > > +static inline bool sk_fullsock(const struct sock *sk)
> > > > +{
> > > > +       return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> > > > +}
> > > > +
> > > > +static inline bool
> > > > +sk_is_refcounted(struct sock *sk)
> > > > +{
> > > > +       /* Only full sockets have sk->sk_flags. */
> > > > +       return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
> > > > +}
> > > > +
> > > >  /**
> > > >   * skb_steal_sock
> > > >   * @skb to steal the socket from
> > > > @@ -2545,6 +2560,8 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> > > >                 struct sock *sk = skb->sk;
> > > >
> > > >                 *refcounted = true;
> > > > +               if (skb_sk_is_prefetched(skb))
> > > > +                       *refcounted = sk_is_refcounted(sk);
> > > >                 skb->destructor = NULL;
> > > >                 skb->sk = NULL;
> > > >                 return sk;
> > > > @@ -2553,14 +2570,6 @@ skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> > > >         return NULL;
> > > >  }
> > > >
> > > > -/* This helper checks if a socket is a full socket,
> > > > - * ie _not_ a timewait or request socket.
> > > > - */
> > > > -static inline bool sk_fullsock(const struct sock *sk)
> > > > -{
> > > > -       return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> > > > -}
> > > > -
> > > >  /* Checks if this SKB belongs to an HW offloaded socket
> > > >   * and whether any SW fallbacks are required based on dev.
> > > >   * Check decrypted mark in case skb_orphan() cleared socket.
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index 0fada7fe9b75..997b8606167e 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -5343,8 +5343,7 @@ static const struct bpf_func_proto bpf_sk_lookup_udp_proto = {
> > > >
> > > >  BPF_CALL_1(bpf_sk_release, struct sock *, sk)
> > > >  {
> > > > -       /* Only full sockets have sk->sk_flags. */
> > > > -       if (!sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE))
> > > > +       if (sk_is_refcounted(sk))
> > > >                 sock_gen_put(sk);
> > > >         return 0;
> > > >  }
> > > > @@ -5870,7 +5869,8 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > > >                 return -ESOCKTNOSUPPORT;
> > > >         if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> > > >                 return -ENETUNREACH;
> > > > -       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > > +       if (sk_is_refcounted(sk) &&
> > > > +           unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > >                 return -ENOENT;
> > > >
> > > >         skb_orphan(skb);
> > > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > > index cfaf60267360..a2ab79446f59 100644
> > > > --- a/net/core/sock.c
> > > > +++ b/net/core/sock.c
> > > > @@ -2076,7 +2076,8 @@ EXPORT_SYMBOL(sock_efree);
> > > >   */
> > > >  void sock_pfree(struct sk_buff *skb)
> > > >  {
> > > > -       sock_edemux(skb);
> > > > +       if (sk_is_refcounted(skb->sk))
> > > > +               sock_edemux(skb);
> > >
> > > sock_edemux calls sock_gen_put, which is also called by
> > > bpf_sk_release. Is it worth teaching sock_gen_put about
> > > sk_fullsock, and dropping the other helpers? I was considering this
> > > when fixing up sk_release, but then forgot
> > > about it.
> >
> > I like the idea, but I'm concerned about breaking things outside the
> > focus of this new helper if the skb_sk_is_prefetched() function from
> > patch 1 is allowed to return true for sockets other than the ones
> > assigned from the bpf_sk_assign() helper. At a glance there's users of
> > sock_efree (which sock_edemux can be defined to) like netem_enqueue()
> > which may inadvertently trigger unexpected paths here. I think it's
> > more explicit so more obviously correct if the destructor pointer used
> > in this series is unique compared to other paths, even if the
> > underlying code is the same.
>
> Sorry, I didn't mean to get rid of sock_pfree, I was referring to
> sk_fullsock and
> sk_is_refcounted. My point was that it's weird that sock_gen_put isn't
> actually generic because it doesn't properly handle SOCK_RCU_FREE.

I briefly looked into this, and I'm still not confident that all
existing usage of sock_gen_put() is handling SOCK_RCU_FREE in a way
that would be consistent with pushing this check into sock_gen_put().
I think it's something worth digging into, but I'd prefer to dig into
it separately from this series.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26 19:36       ` Andrii Nakryiko
@ 2020-03-26 21:38         ` Joe Stringer
  0 siblings, 0 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-26 21:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joe Stringer, bpf, Lorenz Bauer, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Thu, Mar 26, 2020 at 12:37 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Mar 25, 2020 at 10:28 PM Joe Stringer <joe@wand.net.nz> wrote:
> >
> > On Wed, Mar 25, 2020 at 7:16 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Tue, Mar 24, 2020 at 10:58 PM Joe Stringer <joe@wand.net.nz> wrote:
> > > >
> > > > From: Lorenz Bauer <lmb@cloudflare.com>
> > > >
> > > > Attach a tc direct-action classifier to lo in a fresh network
> > > > namespace, and rewrite all connection attempts to localhost:4321
> > > > to localhost:1234 (for port tests) and connections to unreachable
> > > > IPv4/IPv6 IPs to the local socket (for address tests).
> > > >
> > > > Keep in mind that both client to server and server to client traffic
> > > > passes the classifier.
> > > >
> > > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > > > Co-authored-by: Joe Stringer <joe@wand.net.nz>
> > > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > > ---
> >
> > <snip>
> >
> > > > +static void handle_timeout(int signum)
> > > > +{
> > > > +       if (signum == SIGALRM)
> > > > +               fprintf(stderr, "Timed out while connecting to server\n");
> > > > +       kill(0, SIGKILL);
> > > > +}
> > > > +
> > > > +static struct sigaction timeout_action = {
> > > > +       .sa_handler = handle_timeout,
> > > > +};
> > > > +
> > > > +static int connect_to_server(const struct sockaddr *addr, socklen_t len)
> > > > +{
> > > > +       int fd = -1;
> > > > +
> > > > +       fd = socket(addr->sa_family, SOCK_STREAM, 0);
> > > > +       if (CHECK_FAIL(fd == -1))
> > > > +               goto out;
> > > > +       if (CHECK_FAIL(sigaction(SIGALRM, &timeout_action, NULL)))
> > > > +               goto out;
> > >
> > > no-no-no, we are not doing this. It's part of prog_tests and shouldn't
> > > install its own signal handlers and sending asynchronous signals to
> > > itself. Please find another way to have a timeout.
> >
> > I realise it didn't clean up after itself. How about signal(SIGALRM,
> > SIG_DFL); just like other existing tests do?
>
> You have alarm(3), which will deliver SIGALARM later, when other test
> is running. Once you clean up this custom signal handler it will cause
> test_progs to coredump. So still no, please find another way to do
> timeouts.

FWIW I hit this issue and alarm(0) afterwards fixed it up.

That said the SO_SNDTIMEO / SO_RCVTIMEO alternative should work fine
for this use case instead so I'll switch to that, it's marginally
cleaner.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 2/5] bpf: Prefetch established socket destinations
  2020-03-26 21:11   ` Alexei Starovoitov
@ 2020-03-26 21:45     ` Joe Stringer
  0 siblings, 0 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-26 21:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Lorenz Bauer, Martin KaFai Lau

On Thu, Mar 26, 2020 at 2:12 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 24, 2020 at 10:57:42PM -0700, Joe Stringer wrote:
> > Enhance the sk_assign logic to temporarily store the socket
> > receive destination, to save the route lookup later on. The dst
> > reference is kept alive by the caller's socket reference.
> >
> > Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---
> > v2: Provide cookie to dst_check() for IPv6 case
> > v1: Initial version
> > ---
> >  net/core/filter.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index f7f9b6631f75..0fada7fe9b75 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5876,6 +5876,21 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> >       skb_orphan(skb);
> >       skb->sk = sk;
> >       skb->destructor = sock_pfree;
> > +     if (sk_fullsock(sk)) {
> > +             struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
> > +             u32 cookie = 0;
> > +
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +             if (sk->sk_family == AF_INET6)
> > +                     cookie = inet6_sk(sk)->rx_dst_cookie;
> > +#endif
> > +             if (dst)
> > +                     dst = dst_check(dst, cookie);
> > +             if (dst) {
> > +                     skb_dst_drop(skb);
> > +                     skb_dst_set_noref(skb, dst);
> > +             }
>
> I think the rest of the feedback for the patches can be addressed quickly and
> overall the set is imo ready to land within this cycle. My only concern is
> above dst_set().
> Since it's an optimization may be drop this patch? we can land
> the rest and this one can be introduced in the next cycle?
> I'm happy to be convinced otherwise, but would like a better explanation
> why it's safe to do so in this context.

[resend for lists; somehow gmail introduced some http gunk]

FWIW I found an issue with this implementation over the last day so
your concern is well-warranted. I'd be fine with dropping the
optimization for now and sending it out with other optimizations next
cycle.

Will respin ASAP.

Cheers,
Joe

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 1/5] bpf: Add socket assign support
  2020-03-26 10:24   ` Lorenz Bauer
@ 2020-03-26 22:52     ` Joe Stringer
  2020-03-27  2:40       ` Joe Stringer
  0 siblings, 1 reply; 41+ messages in thread
From: Joe Stringer @ 2020-03-26 22:52 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Thu, Mar 26, 2020 at 3:25 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Wed, 25 Mar 2020 at 05:57, Joe Stringer <joe@wand.net.nz> wrote:
> >
> > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> >
> > This helper requires the BPF program to discover the socket via a call
> > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > helper takes its own reference to the socket in addition to any existing
> > reference that may or may not currently be obtained for the duration of
> > BPF processing. For the destination socket to receive the traffic, the
> > traffic must be routed towards that socket via local route. The
> > simplest example route is below, but in practice you may want to route
> > traffic more narrowly (eg by CIDR):
> >
> >   $ ip route add local default dev lo
> >
> > This patch avoids trying to introduce an extra bit into the skb->sk, as
> > that would require more invasive changes to all code interacting with
> > the socket to ensure that the bit is handled correctly, such as all
> > error-handling cases along the path from the helper in BPF through to
> > the orphan path in the input. Instead, we opt to use the destructor
> > variable to switch on the prefetch of the socket.
> >
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---
> > v2: Use skb->destructor to determine socket prefetch usage instead of
> >       introducing a new metadata_dst
> >     Restrict socket assign to same netns as TC device
> >     Restrict assigning reuseport sockets
> >     Adjust commit wording
> > v1: Initial version
> > ---
> >  include/net/sock.h             |  7 +++++++
> >  include/uapi/linux/bpf.h       | 25 ++++++++++++++++++++++++-
> >  net/core/filter.c              | 31 +++++++++++++++++++++++++++++++
> >  net/core/sock.c                |  9 +++++++++
> >  net/ipv4/ip_input.c            |  3 ++-
> >  net/ipv6/ip6_input.c           |  3 ++-
> >  net/sched/act_bpf.c            |  2 ++
> >  tools/include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
> >  8 files changed, 101 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index b5cca7bae69b..2613d21a667a 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1657,6 +1657,7 @@ struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
> >  void skb_orphan_partial(struct sk_buff *skb);
> >  void sock_rfree(struct sk_buff *skb);
> >  void sock_efree(struct sk_buff *skb);
> > +void sock_pfree(struct sk_buff *skb);
> >  #ifdef CONFIG_INET
> >  void sock_edemux(struct sk_buff *skb);
> >  #else
> > @@ -2526,6 +2527,12 @@ void sock_net_set(struct sock *sk, struct net *net)
> >         write_pnet(&sk->sk_net, net);
> >  }
> >
> > +static inline bool
> > +skb_sk_is_prefetched(struct sk_buff *skb)
> > +{
> > +       return skb->destructor == sock_pfree;
> > +}
> > +
> >  static inline struct sock *skb_steal_sock(struct sk_buff *skb)
> >  {
> >         if (skb->sk) {
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 5d01c5c7e598..0c6f151deebe 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2950,6 +2950,28 @@ union bpf_attr {
> >   *             restricted to raw_tracepoint bpf programs.
> >   *     Return
> >   *             0 on success, or a negative error in case of failure.
> > + *
> > + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> > + *     Description
> > + *             Assign the *sk* to the *skb*. When combined with appropriate
> > + *             routing configuration to receive the packet towards the socket,
> > + *             will cause *skb* to be delivered to the specified socket.
> > + *             Subsequent redirection of *skb* via  **bpf_redirect**\ (),
> > + *             **bpf_clone_redirect**\ () or other methods outside of BPF may
> > + *             interfere with successful delivery to the socket.
> > + *
> > + *             This operation is only valid from TC ingress path.
> > + *
> > + *             The *flags* argument must be zero.
> > + *     Return
> > + *             0 on success, or a negative errno in case of failure.
> > + *
> > + *             * **-EINVAL**           Unsupported flags specified.
> > + *             * **-ENETUNREACH**      Socket is unreachable (wrong netns).
> > + *             * **-ENOENT**           Socket is unavailable for assignment.
> > + *             * **-EOPNOTSUPP**       Unsupported operation, for example a
> > + *                                     call from outside of TC ingress.
> > + *             * **-ESOCKTNOSUPPORT**  Socket type not supported (reuseport).
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -3073,7 +3095,8 @@ union bpf_attr {
> >         FN(jiffies64),                  \
> >         FN(read_branch_records),        \
> >         FN(get_ns_current_pid_tgid),    \
> > -       FN(xdp_output),
> > +       FN(xdp_output),                 \
> > +       FN(sk_assign),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 96350a743539..f7f9b6631f75 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5860,6 +5860,35 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> >         .arg5_type      = ARG_CONST_SIZE,
> >  };
> >
> > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > +{
> > +       if (flags != 0)
> > +               return -EINVAL;
> > +       if (!skb_at_tc_ingress(skb))
> > +               return -EOPNOTSUPP;
> > +       if (unlikely(sk->sk_reuseport))
> > +               return -ESOCKTNOSUPPORT;
> > +       if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> > +               return -ENETUNREACH;
> > +       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > +               return -ENOENT;
> > +
> > +       skb_orphan(skb);
> > +       skb->sk = sk;
> > +       skb->destructor = sock_pfree;
> > +
> > +       return 0;
> > +}
>
> Follow up to my email re UDP tests: it seems like the helper doesn't check
> that the sk is TCP, hence I assumed that you want to add support for
> both in the same series.
>
> Also, is it possible to check that the sk protocol matches skb protocol?

Correction, I was able to get UDP working without the other series (at
least in Cilium, still working through the tests) so the extra check
should be unnecessary.

I'll look into options for avoiding crossing packets/sockets of the wrong types.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: call for bpf progs. Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26 21:07       ` call for bpf progs. " Alexei Starovoitov
@ 2020-03-26 23:14         ` Yonghong Song
  2020-03-27 10:02         ` Lorenz Bauer
  2020-03-27 19:06         ` Joe Stringer
  2 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2020-03-26 23:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Lorenz Bauer
  Cc: Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau



On 3/26/20 2:07 PM, Alexei Starovoitov wrote:
> On Thu, Mar 26, 2020 at 10:13:31AM +0000, Lorenz Bauer wrote:
>>>> +
>>>> +     if (ipv4) {
>>>> +             if (tuple->ipv4.dport != bpf_htons(4321))
>>>> +                     return TC_ACT_OK;
>>>> +
>>>> +             ln.ipv4.daddr = bpf_htonl(0x7f000001);
>>>> +             ln.ipv4.dport = bpf_htons(1234);
>>>> +
>>>> +             sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
>>>> +                                     BPF_F_CURRENT_NETNS, 0);
>>>> +     } else {
>>>> +             if (tuple->ipv6.dport != bpf_htons(4321))
>>>> +                     return TC_ACT_OK;
>>>> +
>>>> +             /* Upper parts of daddr are already zero. */
>>>> +             ln.ipv6.daddr[3] = bpf_htonl(0x1);
>>>> +             ln.ipv6.dport = bpf_htons(1234);
>>>> +
>>>> +             sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
>>>> +                                     BPF_F_CURRENT_NETNS, 0);
>>>> +     }
>>>> +
>>>> +     /* We can't do a single skc_lookup_tcp here, because then the compiler
>>>> +      * will likely spill tuple_len to the stack. This makes it lose all
>>>> +      * bounds information in the verifier, which then rejects the call as
>>>> +      * unsafe.
>>>> +      */
>>>
>>> This is a known issue. For scalars, only constant is restored properly
>>> in verifier at this moment. I did some hacking before to enable any
>>> scalars. The fear is this will make pruning performs worse. More
>>> study is needed here.
>>
>> Of topic, but: this is actually one of the most challenging issues for
>> us when writing
>> BPF. It forces us to have very deep call graphs to hopefully avoid clang
>> spilling the constants. Please let me know if I can help in any way.
> 
> Thanks for bringing this up.
> Yonghong, please correct me if I'm wrong.

Yes. The summary below is correct. For reference, the below bcc issue
documents some of my investigation:
   https://github.com/iovisor/bcc/issues/2463

> I think you've experimented with tracking spilled constants. The first issue
> came with spilling of 4 byte constant. The verifier tracks 8 byte slots and
> lots of places assume that slot granularity. It's not clear yet how to refactor
> the verifier. Ideas, help are greatly appreciated.

I cannot remember exactly what I did then. Probably remember the spilled 
size too. Since the hack is never peer reviewed, maybe my approach has bugs.

> The second concern was pruning, but iirc the experiments were inconclusive.
> selftests/bpf only has old fb progs. Hence, I think, the step zero is for
> everyone to contribute their bpf programs written in C. If we have both
> cilium and cloudflare progs as selftests it will help a lot to guide such long
> lasting verifier decisions.

Yes, this is inconclusive and I did not do any active investigation here
since just enhancing the non-const spill won't resolve the above issue.
But totally agree that if we had an implementation, we should measure
its impact on verifier speed.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26  6:38         ` Joe Stringer
@ 2020-03-26 23:39           ` Joe Stringer
  0 siblings, 0 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-26 23:39 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Martin KaFai Lau, Lorenz Bauer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet

On Wed, Mar 25, 2020 at 11:38 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> On Wed, Mar 25, 2020 at 11:25 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Mar 25, 2020 at 01:55:59PM -0700, Joe Stringer wrote:
> > > On Wed, Mar 25, 2020 at 3:35 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> > > >
> > > > On Wed, 25 Mar 2020 at 05:58, Joe Stringer <joe@wand.net.nz> wrote:
> > > > >
> > > > > From: Lorenz Bauer <lmb@cloudflare.com>
> > > > >
> > > > > Attach a tc direct-action classifier to lo in a fresh network
> > > > > namespace, and rewrite all connection attempts to localhost:4321
> > > > > to localhost:1234 (for port tests) and connections to unreachable
> > > > > IPv4/IPv6 IPs to the local socket (for address tests).
> > > >
> > > > Can you extend this to cover UDP as well?
> > >
> > > I'm working on a follow-up series for UDP, we need this too.
> > Other than selftests, what are the changes for UDP in patch 1 - 4?
>
> Nothing in those patches, I have refactoring of all of the socket
> helpers, skc_lookup_udp() and adding flags to the socket lookup
> functions to support only looking for a certain type of sockets -
> established or listen. This helps to avoid multiple lookups in these
> cases where you really just want to look up established sockets with
> the packet tuple first then look up the listener socket with the
> unrelated/tproxy tuple. For UDP it makes it easier to find the correct
> socket and in general (including TCP) helps to avoid up to two socket
> hashtable lookups for this use case. This part is because the current
> helpers all look up the established socket first then the listener
> socket, so for the first packet that hits these we perform both of
> these lookups for the packet tuple (which finds nothing), then look up
> an established socket for the target tuple (which finds nothing) then
> finally a listen socket for the target tuple. It's about another 300+
> / 250- changes overall, of which a large chunk is one patch that
> refactors the code into macros. I haven't narrowed down for sure
> whether the lookup flags patch is required for UDP cases yet.

FWIW I did some more testing and it was not apparent that
skc_lookup_udp is at all necessary, I was able to roll in UDP support
in the next revision of this series with no special extra patches.

I'll keep working on those other optimizations in the background though.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCHv2 bpf-next 1/5] bpf: Add socket assign support
  2020-03-26 22:52     ` Joe Stringer
@ 2020-03-27  2:40       ` Joe Stringer
  0 siblings, 0 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-27  2:40 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Lorenz Bauer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Thu, Mar 26, 2020 at 3:52 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> On Thu, Mar 26, 2020 at 3:25 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > On Wed, 25 Mar 2020 at 05:57, Joe Stringer <joe@wand.net.nz> wrote:
> > >
> > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > >
> > > This helper requires the BPF program to discover the socket via a call
> > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > helper takes its own reference to the socket in addition to any existing
> > > reference that may or may not currently be obtained for the duration of
> > > BPF processing. For the destination socket to receive the traffic, the
> > > traffic must be routed towards that socket via local route. The
> > > simplest example route is below, but in practice you may want to route
> > > traffic more narrowly (eg by CIDR):
> > >
> > >   $ ip route add local default dev lo
> > >
> > > This patch avoids trying to introduce an extra bit into the skb->sk, as
> > > that would require more invasive changes to all code interacting with
> > > the socket to ensure that the bit is handled correctly, such as all
> > > error-handling cases along the path from the helper in BPF through to
> > > the orphan path in the input. Instead, we opt to use the destructor
> > > variable to switch on the prefetch of the socket.
> > >
> > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > ---
> > > v2: Use skb->destructor to determine socket prefetch usage instead of
> > >       introducing a new metadata_dst
> > >     Restrict socket assign to same netns as TC device
> > >     Restrict assigning reuseport sockets
> > >     Adjust commit wording
> > > v1: Initial version
> > > ---
> > >  include/net/sock.h             |  7 +++++++
> > >  include/uapi/linux/bpf.h       | 25 ++++++++++++++++++++++++-
> > >  net/core/filter.c              | 31 +++++++++++++++++++++++++++++++
> > >  net/core/sock.c                |  9 +++++++++
> > >  net/ipv4/ip_input.c            |  3 ++-
> > >  net/ipv6/ip6_input.c           |  3 ++-
> > >  net/sched/act_bpf.c            |  2 ++
> > >  tools/include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
> > >  8 files changed, 101 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index b5cca7bae69b..2613d21a667a 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -1657,6 +1657,7 @@ struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
> > >  void skb_orphan_partial(struct sk_buff *skb);
> > >  void sock_rfree(struct sk_buff *skb);
> > >  void sock_efree(struct sk_buff *skb);
> > > +void sock_pfree(struct sk_buff *skb);
> > >  #ifdef CONFIG_INET
> > >  void sock_edemux(struct sk_buff *skb);
> > >  #else
> > > @@ -2526,6 +2527,12 @@ void sock_net_set(struct sock *sk, struct net *net)
> > >         write_pnet(&sk->sk_net, net);
> > >  }
> > >
> > > +static inline bool
> > > +skb_sk_is_prefetched(struct sk_buff *skb)
> > > +{
> > > +       return skb->destructor == sock_pfree;
> > > +}
> > > +
> > >  static inline struct sock *skb_steal_sock(struct sk_buff *skb)
> > >  {
> > >         if (skb->sk) {
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 5d01c5c7e598..0c6f151deebe 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -2950,6 +2950,28 @@ union bpf_attr {
> > >   *             restricted to raw_tracepoint bpf programs.
> > >   *     Return
> > >   *             0 on success, or a negative error in case of failure.
> > > + *
> > > + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> > > + *     Description
> > > + *             Assign the *sk* to the *skb*. When combined with appropriate
> > > + *             routing configuration to receive the packet towards the socket,
> > > + *             will cause *skb* to be delivered to the specified socket.
> > > + *             Subsequent redirection of *skb* via  **bpf_redirect**\ (),
> > > + *             **bpf_clone_redirect**\ () or other methods outside of BPF may
> > > + *             interfere with successful delivery to the socket.
> > > + *
> > > + *             This operation is only valid from TC ingress path.
> > > + *
> > > + *             The *flags* argument must be zero.
> > > + *     Return
> > > + *             0 on success, or a negative errno in case of failure.
> > > + *
> > > + *             * **-EINVAL**           Unsupported flags specified.
> > > + *             * **-ENETUNREACH**      Socket is unreachable (wrong netns).
> > > + *             * **-ENOENT**           Socket is unavailable for assignment.
> > > + *             * **-EOPNOTSUPP**       Unsupported operation, for example a
> > > + *                                     call from outside of TC ingress.
> > > + *             * **-ESOCKTNOSUPPORT**  Socket type not supported (reuseport).
> > >   */
> > >  #define __BPF_FUNC_MAPPER(FN)          \
> > >         FN(unspec),                     \
> > > @@ -3073,7 +3095,8 @@ union bpf_attr {
> > >         FN(jiffies64),                  \
> > >         FN(read_branch_records),        \
> > >         FN(get_ns_current_pid_tgid),    \
> > > -       FN(xdp_output),
> > > +       FN(xdp_output),                 \
> > > +       FN(sk_assign),
> > >
> > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > >   * function eBPF program intends to call
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 96350a743539..f7f9b6631f75 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -5860,6 +5860,35 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> > >         .arg5_type      = ARG_CONST_SIZE,
> > >  };
> > >
> > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > > +{
> > > +       if (flags != 0)
> > > +               return -EINVAL;
> > > +       if (!skb_at_tc_ingress(skb))
> > > +               return -EOPNOTSUPP;
> > > +       if (unlikely(sk->sk_reuseport))
> > > +               return -ESOCKTNOSUPPORT;
> > > +       if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> > > +               return -ENETUNREACH;
> > > +       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > +               return -ENOENT;
> > > +
> > > +       skb_orphan(skb);
> > > +       skb->sk = sk;
> > > +       skb->destructor = sock_pfree;
> > > +
> > > +       return 0;
> > > +}
> >
> > Follow up to my email re UDP tests: it seems like the helper doesn't check
> > that the sk is TCP, hence I assumed that you want to add support for
> > both in the same series.
> >
> > Also, is it possible to check that the sk protocol matches skb protocol?
>
> Correction, I was able to get UDP working without the other series (at
> least in Cilium, still working through the tests) so the extra check
> should be unnecessary.
>
> I'll look into options for avoiding crossing packets/sockets of the wrong types.

I looked into these options and found:
* sk->protocol is the L4 proto; skb->protocol is the L3 proto so it
doesn't make sense to compare them
* skb doesn't seem to have any L4 indicator unless you want to start
diving into the packet, which seems not ideal
* I tried using sk->family and ensuring, for example, only AF_INET
sockets can pair with skb->protocol == ETH_P_IP, but I found that in
my integration test I'm using dual-stack sockets so the check was
failing there
* I tried making up some nonsense tests where I cross different types
of sockets on send/recv side and the worst behaviour I could trigger
was timeouts

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: call for bpf progs. Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26 21:07       ` call for bpf progs. " Alexei Starovoitov
  2020-03-26 23:14         ` Yonghong Song
@ 2020-03-27 10:02         ` Lorenz Bauer
  2020-03-27 16:08           ` Alexei Starovoitov
  2020-03-27 19:06         ` Joe Stringer
  2 siblings, 1 reply; 41+ messages in thread
From: Lorenz Bauer @ 2020-03-27 10:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Thu, 26 Mar 2020 at 21:07, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Mar 26, 2020 at 10:13:31AM +0000, Lorenz Bauer wrote:
> > > > +
> > > > +     if (ipv4) {
> > > > +             if (tuple->ipv4.dport != bpf_htons(4321))
> > > > +                     return TC_ACT_OK;
> > > > +
> > > > +             ln.ipv4.daddr = bpf_htonl(0x7f000001);
> > > > +             ln.ipv4.dport = bpf_htons(1234);
> > > > +
> > > > +             sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
> > > > +                                     BPF_F_CURRENT_NETNS, 0);
> > > > +     } else {
> > > > +             if (tuple->ipv6.dport != bpf_htons(4321))
> > > > +                     return TC_ACT_OK;
> > > > +
> > > > +             /* Upper parts of daddr are already zero. */
> > > > +             ln.ipv6.daddr[3] = bpf_htonl(0x1);
> > > > +             ln.ipv6.dport = bpf_htons(1234);
> > > > +
> > > > +             sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
> > > > +                                     BPF_F_CURRENT_NETNS, 0);
> > > > +     }
> > > > +
> > > > +     /* We can't do a single skc_lookup_tcp here, because then the compiler
> > > > +      * will likely spill tuple_len to the stack. This makes it lose all
> > > > +      * bounds information in the verifier, which then rejects the call as
> > > > +      * unsafe.
> > > > +      */
> > >
> > > This is a known issue. For scalars, only constant is restored properly
> > > in verifier at this moment. I did some hacking before to enable any
> > > scalars. The fear is this will make pruning performs worse. More
> > > study is needed here.
> >
> > Of topic, but: this is actually one of the most challenging issues for
> > us when writing
> > BPF. It forces us to have very deep call graphs to hopefully avoid clang
> > spilling the constants. Please let me know if I can help in any way.
>
> Thanks for bringing this up.
> Yonghong, please correct me if I'm wrong.
> I think you've experimented with tracking spilled constants. The first issue
> came with spilling of 4 byte constant. The verifier tracks 8 byte slots and
> lots of places assume that slot granularity. It's not clear yet how to refactor
> the verifier. Ideas, help are greatly appreciated.
> The second concern was pruning, but iirc the experiments were inconclusive.
> selftests/bpf only has old fb progs. Hence, I think, the step zero is for
> everyone to contribute their bpf programs written in C. If we have both
> cilium and cloudflare progs as selftests it will help a lot to guide such long
> lasting verifier decisions.

Ok, I'll try to get something sorted out. We have a TC classifier that
would be suitable,
and I've been meaning to get it open sourced. Does the integration into the
test suite have to involve running packets through it, or is compile
and load enough?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: call for bpf progs. Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-27 10:02         ` Lorenz Bauer
@ 2020-03-27 16:08           ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2020-03-27 16:08 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Yonghong Song, Joe Stringer, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Martin Lau

On Fri, Mar 27, 2020 at 3:03 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > Thanks for bringing this up.
> > Yonghong, please correct me if I'm wrong.
> > I think you've experimented with tracking spilled constants. The first issue
> > came with spilling of 4 byte constant. The verifier tracks 8 byte slots and
> > lots of places assume that slot granularity. It's not clear yet how to refactor
> > the verifier. Ideas, help are greatly appreciated.
> > The second concern was pruning, but iirc the experiments were inconclusive.
> > selftests/bpf only has old fb progs. Hence, I think, the step zero is for
> > everyone to contribute their bpf programs written in C. If we have both
> > cilium and cloudflare progs as selftests it will help a lot to guide such long
> > lasting verifier decisions.
>
> Ok, I'll try to get something sorted out. We have a TC classifier that
> would be suitable,
> and I've been meaning to get it open sourced. Does the integration into the
> test suite have to involve running packets through it, or is compile
> and load enough?

It would be great if you can add it as part of test_progs and run it
with one or two packets via prog_test_run like all the tests do.
Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: call for bpf progs. Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-26 21:07       ` call for bpf progs. " Alexei Starovoitov
  2020-03-26 23:14         ` Yonghong Song
  2020-03-27 10:02         ` Lorenz Bauer
@ 2020-03-27 19:06         ` Joe Stringer
  2020-03-27 20:16           ` Daniel Borkmann
  2020-03-28  0:17           ` Andrii Nakryiko
  2 siblings, 2 replies; 41+ messages in thread
From: Joe Stringer @ 2020-03-27 19:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Lorenz Bauer, Yonghong Song, Joe Stringer, bpf, Networking,
	Daniel Borkmann, Alexei Starovoitov, Eric Dumazet, Martin Lau,
	john fastabend

On Thu, Mar 26, 2020 at 2:07 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> The second concern was pruning, but iirc the experiments were inconclusive.
> selftests/bpf only has old fb progs. Hence, I think, the step zero is for
> everyone to contribute their bpf programs written in C. If we have both
> cilium and cloudflare progs as selftests it will help a lot to guide such long
> lasting verifier decisions.

How would you like to handle program changes over time for this?

In Cilium community we periodically rebuild bpf-next VM images for
testing, and run every pull request against those images[0]. We also
test against specific older kernels, currently 4.9 and 4.19. This
allows us to get some sense for the impact of upstream changes while
developing Cilium features, but unfortunately doesn't allow everyone
using kernel selftests to get that feedback at least from the kernel
tree. We also have a verifier complexity test script where we compile
with the maximum number of features (to ideally generate the most
complex programs possible) then attempt to load all of the various
programs, and output the complexity count that the kernel reports[1,2]
which we can track over time.

However Cilium BPF programs are actively developing and even if we
merge these programs into the kernel tree, they will get out-of-date
quickly. Up until recently everything was verifying fine compiling
with LLVM7 and loading into bpf-next. Over the past month we started
noticing new issues not with the existing implementation, but in *new*
BPF features. As we increased complexity, our CI started failing
against bpf-next[3] while they loaded fine on older kernels. We ended
up mitigating by upgrading to LLVM-10. Long story short, there's
several moving parts; changing BPF program implementations, changing
the compiler toolchain, changing the kernel verifier. So my question
is basically, where's the line of responsibility for what the kernel
selftests are responsible for vs integration tests? How do we maintain
those over time as the BPF programs and compiler changes?

Do we just parachute the ~11K LoC of Cilium datapath into the kernel
tree once per cycle? Or should Cilium autobuild a verifier-test docker
image that kernel testing scripts can pull & run? Or would it be
helpful to have a separate GitHub project similar to libbpf that pulls
out kernel selftests, Cilium progs, fb progs, cloudflare progs, etc
automatically and centralizes a generic suite of BPF verifier
integration tests? Some other option?

[0] https://github.com/cilium/packer-ci-build
[1] https://github.com/cilium/cilium/blob/master/test/bpf/check-complexity.sh
[2] https://github.com/cilium/cilium/blob/master/test/bpf/verifier-test.sh
[3] https://github.com/cilium/cilium/issues/10517

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: call for bpf progs. Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-27 19:06         ` Joe Stringer
@ 2020-03-27 20:16           ` Daniel Borkmann
  2020-03-27 22:24             ` Alexei Starovoitov
  2020-03-28  0:17           ` Andrii Nakryiko
  1 sibling, 1 reply; 41+ messages in thread
From: Daniel Borkmann @ 2020-03-27 20:16 UTC (permalink / raw)
  To: Joe Stringer, Alexei Starovoitov
  Cc: Lorenz Bauer, Yonghong Song, bpf, Networking, Alexei Starovoitov,
	Eric Dumazet, Martin Lau, john fastabend

On 3/27/20 8:06 PM, Joe Stringer wrote:
> On Thu, Mar 26, 2020 at 2:07 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> The second concern was pruning, but iirc the experiments were inconclusive.
>> selftests/bpf only has old fb progs. Hence, I think, the step zero is for
>> everyone to contribute their bpf programs written in C. If we have both
>> cilium and cloudflare progs as selftests it will help a lot to guide such long
>> lasting verifier decisions.
> 
> How would you like to handle program changes over time for this?
> 
> In Cilium community we periodically rebuild bpf-next VM images for
> testing, and run every pull request against those images[0]. We also
> test against specific older kernels, currently 4.9 and 4.19. This
> allows us to get some sense for the impact of upstream changes while
> developing Cilium features, but unfortunately doesn't allow everyone
> using kernel selftests to get that feedback at least from the kernel
> tree. We also have a verifier complexity test script where we compile
> with the maximum number of features (to ideally generate the most
> complex programs possible) then attempt to load all of the various
> programs, and output the complexity count that the kernel reports[1,2]
> which we can track over time.
> 
> However Cilium BPF programs are actively developing and even if we
> merge these programs into the kernel tree, they will get out-of-date
> quickly. Up until recently everything was verifying fine compiling
> with LLVM7 and loading into bpf-next. Over the past month we started
> noticing new issues not with the existing implementation, but in *new*
> BPF features. As we increased complexity, our CI started failing
> against bpf-next[3] while they loaded fine on older kernels. We ended
> up mitigating by upgrading to LLVM-10. Long story short, there's
> several moving parts; changing BPF program implementations, changing
> the compiler toolchain, changing the kernel verifier. So my question
> is basically, where's the line of responsibility for what the kernel
> selftests are responsible for vs integration tests? How do we maintain
> those over time as the BPF programs and compiler changes?

I wonder whether it would make sense to create test cases which are based
on our cilium/cilium:latest docker image. Those would be run as part of
BPF kernel selftests as well as part of our own CI for every PR. I think
it could be some basic connectivity test, service mapping, etc with a
bunch of application containers.

One nice thing that just came to mind is that it would actually allow for
easy testing of latest clang/llvm git by rerunning the test script and
remapping them as a volume, e.g.:

   docker run -it -v /root/clang+llvm-7.0.1-x86_64-linux-gnu-ubuntu-18.04/bin/clang:/bin/clang \
                  -v /root/clang+llvm-7.0.1-x86_64-linux-gnu-ubuntu-18.04/bin/llc:/bin/llc \
          cilium/cilium:latest /bin/sh

Perhaps that would be more useful and always up to date than a copy of the
code base that would get stale next day? In the end in this context kernel
changes and/or llvm changes might be of interest to check whether anything
potentially blows up, so having a self-contained packaging might be useful.
Thoughts?

> Do we just parachute the ~11K LoC of Cilium datapath into the kernel
> tree once per cycle? Or should Cilium autobuild a verifier-test docker
> image that kernel testing scripts can pull & run? Or would it be
> helpful to have a separate GitHub project similar to libbpf that pulls
> out kernel selftests, Cilium progs, fb progs, cloudflare progs, etc
> automatically and centralizes a generic suite of BPF verifier
> integration tests? Some other option?
> 
> [0] https://github.com/cilium/packer-ci-build
> [1] https://github.com/cilium/cilium/blob/master/test/bpf/check-complexity.sh
> [2] https://github.com/cilium/cilium/blob/master/test/bpf/verifier-test.sh
> [3] https://github.com/cilium/cilium/issues/10517
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: call for bpf progs. Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-27 20:16           ` Daniel Borkmann
@ 2020-03-27 22:24             ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2020-03-27 22:24 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Joe Stringer, Lorenz Bauer, Yonghong Song, bpf, Networking,
	Alexei Starovoitov, Eric Dumazet, Martin Lau, john fastabend

On Fri, Mar 27, 2020 at 09:16:52PM +0100, Daniel Borkmann wrote:
> 
> Perhaps that would be more useful and always up to date than a copy of the
> code base that would get stale next day? In the end in this context kernel
> changes and/or llvm changes might be of interest to check whether anything
> potentially blows up, so having a self-contained packaging might be useful.
> Thoughts?

Right now we have zero cilium progs in selftest :)
so any number of progs is better than nothing.

> > Do we just parachute the ~11K LoC of Cilium datapath into the kernel
> > tree once per cycle? 

I don't think 11k progs updated every kernel release will catch
much more than basic reduced set.
selftests/bpf is not a substitute for cilium CI.
I would prefer .c tests to be developed once and frozen.
More tests can be added every 6 month or so.
I think copy-paste is ok-ish too, but would be much better
to think through about aspects of the code first.
It worked well for fb xdp/tc progs. I took some of them,
refactored them to selftest/Makefile and they were kept as-is for
the last 3 years. While real progs kept being actively changed.
For example: progs/test_l4lb.c is about 1/10th of the real deal here:
https://github.com/facebookincubator/katran/tree/master/katran/lib/bpf
In terms of lines code, number of includes and so on.
While hacking katran into selftest I tried to capture the _style_ of C code.
Like:
 bool get_packet_dst(struct real_definition **real,
                     struct packet_description *pckt,
                     struct vip_meta *vip_info,
                     bool is_ipv6)
I wouldn't have written the prototype this way.
Passing double pointer as a return value as a first argument is not my style :)
The style of nested 'if', etc. Those are the things that make clang
generate specific code patterns that I was trying to preserve in selftests.

Example 2: progs/strobemeta.h is about 1/4 of real thing.
Yet frozen it was super useful for the work on bounded loops.

Example 3: progs/test_get_stack_rawtp.c is one specific code pattern
that used in our profiling prog.
This one was immensely helpful to track down jmp32/alu32 bugs.
It's the one that we're still fixing (see John's jmp32 work).

and so on.
selftests/bpf/progs don't need to be full copy-paste. Ideally the capture
the essence of the progs in the short form.

clang bpf backend and verifier smartness keep changing. So having
frozen selftests is necessary to see the trend and capture bugs
in the backend and in the verifier.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: call for bpf progs. Re: [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign
  2020-03-27 19:06         ` Joe Stringer
  2020-03-27 20:16           ` Daniel Borkmann
@ 2020-03-28  0:17           ` Andrii Nakryiko
  1 sibling, 0 replies; 41+ messages in thread
From: Andrii Nakryiko @ 2020-03-28  0:17 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Alexei Starovoitov, Lorenz Bauer, Yonghong Song, bpf, Networking,
	Daniel Borkmann, Alexei Starovoitov, Eric Dumazet, Martin Lau,
	john fastabend

On Fri, Mar 27, 2020 at 12:07 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> On Thu, Mar 26, 2020 at 2:07 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > The second concern was pruning, but iirc the experiments were inconclusive.
> > selftests/bpf only has old fb progs. Hence, I think, the step zero is for
> > everyone to contribute their bpf programs written in C. If we have both
> > cilium and cloudflare progs as selftests it will help a lot to guide such long
> > lasting verifier decisions.
>
> How would you like to handle program changes over time for this?
>
> In Cilium community we periodically rebuild bpf-next VM images for
> testing, and run every pull request against those images[0]. We also
> test against specific older kernels, currently 4.9 and 4.19. This
> allows us to get some sense for the impact of upstream changes while
> developing Cilium features, but unfortunately doesn't allow everyone
> using kernel selftests to get that feedback at least from the kernel
> tree. We also have a verifier complexity test script where we compile
> with the maximum number of features (to ideally generate the most
> complex programs possible) then attempt to load all of the various
> programs, and output the complexity count that the kernel reports[1,2]
> which we can track over time.
>
> However Cilium BPF programs are actively developing and even if we
> merge these programs into the kernel tree, they will get out-of-date
> quickly. Up until recently everything was verifying fine compiling
> with LLVM7 and loading into bpf-next. Over the past month we started
> noticing new issues not with the existing implementation, but in *new*
> BPF features. As we increased complexity, our CI started failing
> against bpf-next[3] while they loaded fine on older kernels. We ended
> up mitigating by upgrading to LLVM-10. Long story short, there's
> several moving parts; changing BPF program implementations, changing
> the compiler toolchain, changing the kernel verifier. So my question
> is basically, where's the line of responsibility for what the kernel
> selftests are responsible for vs integration tests? How do we maintain
> those over time as the BPF programs and compiler changes?

Just wanted to point out that libbpf's Github CI has multi-kernel
testing, so we'll be able to capture regressions on old kernels that
are caused by libbpf and/or nightly clang (we are currently pulling
clang-11 from nightly packages). We are also testing against latest
kernel as well, so if they break, we'll need to fix them. Which is why
I'd like those programs to be manageable in size and complexity and a
simple part of test_progs, not in some Docker container :)

>
> Do we just parachute the ~11K LoC of Cilium datapath into the kernel
> tree once per cycle? Or should Cilium autobuild a verifier-test docker
> image that kernel testing scripts can pull & run? Or would it be
> helpful to have a separate GitHub project similar to libbpf that pulls
> out kernel selftests, Cilium progs, fb progs, cloudflare progs, etc
> automatically and centralizes a generic suite of BPF verifier
> integration tests? Some other option?
>
> [0] https://github.com/cilium/packer-ci-build
> [1] https://github.com/cilium/cilium/blob/master/test/bpf/check-complexity.sh
> [2] https://github.com/cilium/cilium/blob/master/test/bpf/verifier-test.sh
> [3] https://github.com/cilium/cilium/issues/10517

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, back to index

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-25  5:57 [PATCHv2 bpf-next 0/5] Add bpf_sk_assign eBPF helper Joe Stringer
2020-03-25  5:57 ` [PATCHv2 bpf-next 1/5] bpf: Add socket assign support Joe Stringer
2020-03-26  6:23   ` Martin KaFai Lau
2020-03-26  6:31     ` Joe Stringer
2020-03-26 10:24   ` Lorenz Bauer
2020-03-26 22:52     ` Joe Stringer
2020-03-27  2:40       ` Joe Stringer
2020-03-25  5:57 ` [PATCHv2 bpf-next 2/5] bpf: Prefetch established socket destinations Joe Stringer
2020-03-26 21:11   ` Alexei Starovoitov
2020-03-26 21:45     ` Joe Stringer
2020-03-25  5:57 ` [PATCHv2 bpf-next 3/5] net: Track socket refcounts in skb_steal_sock() Joe Stringer
2020-03-25  5:57 ` [PATCHv2 bpf-next 4/5] bpf: Don't refcount LISTEN sockets in sk_assign() Joe Stringer
2020-03-25 10:29   ` Lorenz Bauer
2020-03-25 20:46     ` Joe Stringer
2020-03-26 10:20       ` Lorenz Bauer
2020-03-26 21:37         ` Joe Stringer
2020-03-25  5:57 ` [PATCHv2 bpf-next 5/5] selftests: bpf: add test for sk_assign Joe Stringer
2020-03-25 10:35   ` Lorenz Bauer
2020-03-25 20:55     ` Joe Stringer
2020-03-26  6:25       ` Martin KaFai Lau
2020-03-26  6:38         ` Joe Stringer
2020-03-26 23:39           ` Joe Stringer
2020-03-25 18:17   ` Yonghong Song
2020-03-25 21:20     ` Joe Stringer
2020-03-25 22:00       ` Yonghong Song
2020-03-25 23:07         ` Joe Stringer
2020-03-26 10:13     ` Lorenz Bauer
2020-03-26 21:07       ` call for bpf progs. " Alexei Starovoitov
2020-03-26 23:14         ` Yonghong Song
2020-03-27 10:02         ` Lorenz Bauer
2020-03-27 16:08           ` Alexei Starovoitov
2020-03-27 19:06         ` Joe Stringer
2020-03-27 20:16           ` Daniel Borkmann
2020-03-27 22:24             ` Alexei Starovoitov
2020-03-28  0:17           ` Andrii Nakryiko
2020-03-26  2:04   ` Andrii Nakryiko
2020-03-26  2:16   ` Andrii Nakryiko
2020-03-26  5:28     ` Joe Stringer
2020-03-26  6:31       ` Martin KaFai Lau
2020-03-26 19:36       ` Andrii Nakryiko
2020-03-26 21:38         ` Joe Stringer

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org
	public-inbox-index netdev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git