netdev.vger.kernel.org archive mirror
* [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper
@ 2020-03-12 23:36 Joe Stringer
  2020-03-12 23:36 ` [PATCH bpf-next 1/7] dst: Move skb_dst_drop to skbuff.c Joe Stringer
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-12 23:36 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb

Introduce a new helper that assigns a previously looked-up socket to the
skb as the packet is passed up towards the stack, causing the stack to
deliver the packet to that socket, subject to local routing
configuration.

This series is a spiritual successor to previous discussions on-list[0]
and in-person at LPC2019[1] about supporting TProxy use cases more
directly from eBPF programs attached at TC ingress, simplifying and
streamlining Linux stack configuration in at-scale environments using
Cilium.

Normally in ip{,6}_rcv_core(), the skb will be orphaned, dropping any
existing socket reference associated with the skb. Existing tproxy
implementations in netfilter get around this restriction by running the
tproxy logic after ip_rcv_core(), from the PREROUTING chain. However,
this is not an option for TC-based logic (including eBPF programs
attached at TC ingress).

This series proposes to introduce a new metadata destination,
dst_sk_prefetch, which communicates from earlier paths in the stack that
the socket has been prefetched and ip{,6}_rcv_core() should respect this
socket selection and retain the reference on the skb.

My initial implementation of dst_sk_prefetch held no metadata and was
simply a unique pointer that could be used to make this determination in
the IP receive core. However, during testing it became apparent that
this minimal implementation was not enough to allow socket redirection
for traffic between two local processes.
Specifically, if the destination was retained as the dst_sk_prefetch
metadata destination, or if the destination was dropped from the skb,
then during ip{,6}_rcv_finish_core() the destination would be considered
invalid and the packet would be subjected to routing. In this case,
loopback traffic from 127.0.0.1 to 127.0.0.1 would be considered martian
(both martian source and martian destination), because that layer
assumes that any such loopback traffic already has a valid loopback
destination configured on the skb, in which case the routing check would
have been skipped.

To resolve this issue, I extended dst_sk_prefetch to act as a wrapper
for any existing destination (such as the loopback destination),
stashing the existing destination in a per-CPU variable for the duration
of processing between the TC ingress hook and the IP receive core.
Since the existing destination may be reference-counted, close attention
must be paid to any paths that may cause the packet to be queued for
processing on another CPU, to ensure that the reference is not lost. To
this end, the TC logic checks whether the eBPF program's return code
indicates an intention to send the packet anywhere other than up the
stack; the error paths in skb cleanup handle the dst_sk_prefetch case;
and finally, after the skb_orphan() check in the IP receive core, the
original destination reference (if any) is restored to the skb.

The eBPF API extension itself, bpf_sk_assign(), is pretty
straightforward: it takes an skb and a socket, and associates them
together. The helper takes its own reference to the socket to ensure
that it remains accessible beyond the release of the RCU read lock, and
the socket is associated with the skb; the subsequent release of that
reference is handled by existing skb cleanup functions. Additionally,
the helper associates the new dst_sk_prefetch destination with the skb
to communicate the socket prefetch intention to the ingress path.
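
For orientation, the call pattern from a TC ingress program ends up
looking roughly like this (a sketch only: packet parsing is elided, the
section name is arbitrary, and the fragment must be built with clang for
the BPF target against the libbpf helper headers; the selftest in patch
5 is the complete, working version):

```
SEC("tc")
int sk_assign_sketch(struct __sk_buff *skb)
{
	struct bpf_sock_tuple ln = {};
	struct bpf_sock *sk;
	int ret;

	/* Parsing the packet into 'ln' is elided here; see
	 * progs/test_sk_assign.c in patch 5 for the full version. */
	ln.ipv4.daddr = bpf_htonl(0x7f000001);	/* 127.0.0.1 */
	ln.ipv4.dport = bpf_htons(1234);

	sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
				BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return TC_ACT_SHOT;

	/* bpf_sk_assign() takes its own socket reference, so the
	 * lookup reference can be released unconditionally. */
	ret = bpf_sk_assign(skb, sk, 0);
	bpf_sk_release(sk);
	return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
}
```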

Finally, tests (courtesy of Lorenz Bauer) are added to validate the
functionality. In addition to testing with the selftests in the tree,
I have validated the runtime behaviour of the new helper by extending
Cilium to make use of the functionality in lieu of existing tproxy
logic.

This series is laid out as follows:
* Patches 1-2 prepare the dst_sk_prefetch for use by sk_assign().
* Patch 3 extends the eBPF API for sk_assign and uses dst_sk_prefetch to
  store the socket reference and retain it through ip receive.
* Patch 4 is a minor optimization to prefetch the socket destination for
  established sockets.
* Patches 5-7 add and extend the selftests with examples of the new
  functionality and validation of correct behaviour.

[0] https://www.mail-archive.com/netdev@vger.kernel.org/msg303645.html
[1] https://linuxplumbersconf.org/event/4/contributions/464/

Joe Stringer (6):
  dst: Move skb_dst_drop to skbuff.c
  dst: Add socket prefetch metadata destinations
  bpf: Add socket assign support
  dst: Prefetch established socket destinations
  selftests: bpf: Extend sk_assign for address proxy
  selftests: bpf: Improve debuggability of sk_assign

Lorenz Bauer (1):
  selftests: bpf: add test for sk_assign

 include/linux/skbuff.h                        |   1 +
 include/net/dst.h                             |  14 --
 include/net/dst_metadata.h                    |  31 +++
 include/uapi/linux/bpf.h                      |  23 +-
 net/core/dst.c                                |  44 ++++
 net/core/filter.c                             |  28 +++
 net/core/skbuff.c                             |  18 ++
 net/ipv4/ip_input.c                           |   5 +-
 net/ipv6/ip6_input.c                          |   5 +-
 net/sched/act_bpf.c                           |   3 +
 tools/include/uapi/linux/bpf.h                |  18 +-
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/progs/test_sk_assign.c      | 127 ++++++++++
 tools/testing/selftests/bpf/test_sk_assign.c  | 231 ++++++++++++++++++
 tools/testing/selftests/bpf/test_sk_assign.sh |  22 ++
 16 files changed, 555 insertions(+), 19 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
 create mode 100644 tools/testing/selftests/bpf/test_sk_assign.c
 create mode 100755 tools/testing/selftests/bpf/test_sk_assign.sh

-- 
2.20.1



* [PATCH bpf-next 1/7] dst: Move skb_dst_drop to skbuff.c
  2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
@ 2020-03-12 23:36 ` Joe Stringer
  2020-03-12 23:36 ` [PATCH bpf-next 2/7] dst: Add socket prefetch metadata destinations Joe Stringer
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-12 23:36 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb

Prepare for extending this function to handle dst_sk_prefetch by moving
it away from the generic dst header and into the skbuff code.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 include/linux/skbuff.h |  1 +
 include/net/dst.h      | 14 --------------
 net/core/skbuff.c      | 15 +++++++++++++++
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 21749b2cdc9b..860cee22c49b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1047,6 +1047,7 @@ static inline bool skb_unref(struct sk_buff *skb)
 	return true;
 }
 
+void skb_dst_drop(struct sk_buff *skb);
 void skb_release_head_state(struct sk_buff *skb);
 void kfree_skb(struct sk_buff *skb);
 void kfree_skb_list(struct sk_buff *segs);
diff --git a/include/net/dst.h b/include/net/dst.h
index 3448cf865ede..b6a2ecab53ce 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -259,20 +259,6 @@ static inline void refdst_drop(unsigned long refdst)
 		dst_release((struct dst_entry *)(refdst & SKB_DST_PTRMASK));
 }
 
-/**
- * skb_dst_drop - drops skb dst
- * @skb: buffer
- *
- * Drops dst reference count if a reference was taken.
- */
-static inline void skb_dst_drop(struct sk_buff *skb)
-{
-	if (skb->_skb_refdst) {
-		refdst_drop(skb->_skb_refdst);
-		skb->_skb_refdst = 0UL;
-	}
-}
-
 static inline void __skb_dst_copy(struct sk_buff *nskb, unsigned long refdst)
 {
 	nskb->_skb_refdst = refdst;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e1101a4f90a6..6b2798450fd4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1034,6 +1034,21 @@ struct sk_buff *alloc_skb_for_msg(struct sk_buff *first)
 }
 EXPORT_SYMBOL_GPL(alloc_skb_for_msg);
 
+/**
+ * skb_dst_drop - drops skb dst
+ * @skb: buffer
+ *
+ * Drops dst reference count if a reference was taken.
+ */
+void skb_dst_drop(struct sk_buff *skb)
+{
+	if (skb->_skb_refdst) {
+		refdst_drop(skb->_skb_refdst);
+		skb->_skb_refdst = 0UL;
+	}
+}
+EXPORT_SYMBOL_GPL(skb_dst_drop);
+
 /**
  *	skb_morph	-	morph one skb into another
  *	@dst: the skb to receive the contents
-- 
2.20.1



* [PATCH bpf-next 2/7] dst: Add socket prefetch metadata destinations
  2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
  2020-03-12 23:36 ` [PATCH bpf-next 1/7] dst: Move skb_dst_drop to skbuff.c Joe Stringer
@ 2020-03-12 23:36 ` Joe Stringer
  2020-03-12 23:36 ` [PATCH bpf-next 3/7] bpf: Add socket assign support Joe Stringer
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-12 23:36 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb

Metadata destinations were introduced in commit f38a9eb1f77b
("dst: Metadata destinations") to "carry per packet metadata
between forwarding and processing elements via the skb->dst pointer".

The aim of this new METADATA_SK_PREFETCH destination type is to allow
early forwarding elements to store a socket destination for the duration
of receive processing into the IP stack, which can later be identified
to avoid orphaning the skb and losing the prefetched socket in
ip_rcv_core().

The original destination is stored temporarily in a per-CPU buffer so
that, if applications connect from a loopback address to a loopback
address, the stack can restore that destination and avoid martian
packet drops.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 include/net/dst_metadata.h | 31 +++++++++++++++++++++++++++++++
 net/core/dst.c             | 30 ++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 56cb3c38569a..31574c553a07 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -9,6 +9,7 @@
 enum metadata_type {
 	METADATA_IP_TUNNEL,
 	METADATA_HW_PORT_MUX,
+	METADATA_SK_PREFETCH,
 };
 
 struct hw_port_info {
@@ -80,6 +81,8 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
 		return memcmp(&a->u.tun_info, &b->u.tun_info,
 			      sizeof(a->u.tun_info) +
 					 a->u.tun_info.options_len);
+	case METADATA_SK_PREFETCH:
+		return 0;
 	default:
 		return 1;
 	}
@@ -214,4 +217,32 @@ static inline struct metadata_dst *ipv6_tun_rx_dst(struct sk_buff *skb,
 				  0, ip6_flowlabel(ip6h), flags, tunnel_id,
 				  md_size);
 }
+
+extern const struct metadata_dst dst_sk_prefetch;
+
+static inline bool dst_is_sk_prefetch(const struct dst_entry *dst)
+{
+	return dst == &dst_sk_prefetch.dst;
+}
+
+static inline bool skb_dst_is_sk_prefetch(const struct sk_buff *skb)
+{
+	return dst_is_sk_prefetch(skb_dst(skb));
+}
+
+void dst_sk_prefetch_store(struct sk_buff *skb);
+void dst_sk_prefetch_fetch(struct sk_buff *skb);
+
+/**
+ * dst_sk_prefetch_reset - reset prefetched socket dst
+ * @skb: buffer
+ *
+ * Reverts the dst back to the originally stored dst if present.
+ */
+static inline void dst_sk_prefetch_reset(struct sk_buff *skb)
+{
+	if (unlikely(skb_dst_is_sk_prefetch(skb)))
+		dst_sk_prefetch_fetch(skb);
+}
+
 #endif /* __NET_DST_METADATA_H */
diff --git a/net/core/dst.c b/net/core/dst.c
index 193af526e908..cf1a1d5b6b0a 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -330,3 +330,33 @@ void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst)
 	free_percpu(md_dst);
 }
 EXPORT_SYMBOL_GPL(metadata_dst_free_percpu);
+
+const struct metadata_dst dst_sk_prefetch = {
+	.dst = {
+		.ops = &md_dst_ops,
+		.input = dst_md_discard,
+		.output = dst_md_discard_out,
+		.flags = DST_NOCOUNT | DST_METADATA,
+		.obsolete = DST_OBSOLETE_NONE,
+		.__refcnt = ATOMIC_INIT(1),
+	},
+	.type = METADATA_SK_PREFETCH,
+};
+EXPORT_SYMBOL(dst_sk_prefetch);
+
+DEFINE_PER_CPU(unsigned long, dst_sk_prefetch_dst);
+
+void dst_sk_prefetch_store(struct sk_buff *skb)
+{
+	unsigned long refdst;
+
+	refdst = skb->_skb_refdst;
+	__this_cpu_write(dst_sk_prefetch_dst, refdst);
+	skb_dst_set_noref(skb, (struct dst_entry *)&dst_sk_prefetch.dst);
+}
+
+void dst_sk_prefetch_fetch(struct sk_buff *skb)
+{
+	skb->_skb_refdst = __this_cpu_read(dst_sk_prefetch_dst);
+}
+EXPORT_SYMBOL(dst_sk_prefetch_fetch);
-- 
2.20.1



* [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
  2020-03-12 23:36 ` [PATCH bpf-next 1/7] dst: Move skb_dst_drop to skbuff.c Joe Stringer
  2020-03-12 23:36 ` [PATCH bpf-next 2/7] dst: Add socket prefetch metadata destinations Joe Stringer
@ 2020-03-12 23:36 ` Joe Stringer
  2020-03-16 10:08   ` Jakub Sitnicki
  2020-03-16 22:57   ` Martin KaFai Lau
  2020-03-12 23:36 ` [PATCH bpf-next 4/7] dst: Prefetch established socket destinations Joe Stringer
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-12 23:36 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb

Add support for TPROXY via a new bpf helper, bpf_sk_assign().

This helper requires the BPF program to discover the socket via a call
to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
helper takes its own reference to the socket in addition to any existing
reference that may be held for the duration of BPF processing. For the
destination socket to receive the traffic, the traffic must be routed
towards that socket via a local route, the socket must have the
transparent option enabled out-of-band, and the socket must not be
closing. If all of these conditions hold, the socket is assigned to the
skb, allowing delivery to the socket.

The recently introduced dst_sk_prefetch is used to communicate from the
TC layer to the IP receive layer that the socket should be retained
across the receive. The dst_sk_prefetch destination wraps any existing
destination (if available) and stores it temporarily in a per-CPU
variable.

To ensure that no dst references held by the skb prior to sk_assign()
are lost, they are stored in the per-cpu variable associated with
dst_sk_prefetch. When the BPF program invocation from the TC action
completes, we check the return code against TC_ACT_OK and if any other
return code is used, we restore the dst to avoid unintentionally leaking
the reference held in the per-CPU variable. If the packet is cloned or
dropped before reaching ip{,6}_rcv_core(), the original dst will also be
restored from the per-cpu variable to avoid the leak; if the packet makes
its way to the receive function for the protocol, then the destination
(if any) will be restored to the packet at that point.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 include/uapi/linux/bpf.h       | 23 ++++++++++++++++++++++-
 net/core/filter.c              | 28 ++++++++++++++++++++++++++++
 net/core/skbuff.c              |  3 +++
 net/ipv4/ip_input.c            |  5 ++++-
 net/ipv6/ip6_input.c           |  5 ++++-
 net/sched/act_bpf.c            |  3 +++
 tools/include/uapi/linux/bpf.h | 18 +++++++++++++++++-
 7 files changed, 81 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 40b2d9476268..35f282cc745e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2914,6 +2914,26 @@ union bpf_attr {
  *		of sizeof(struct perf_branch_entry).
  *
  *		**-ENOENT** if architecture does not support branch records.
+ *
+ * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Assign the *sk* to the *skb*. When combined with appropriate
+ *		routing configuration to receive the packet towards the socket,
+ *		will cause *skb* to be delivered to the specified socket.
+ *		Subsequent redirection of *skb* via  **bpf_redirect**\ (),
+ *		**bpf_clone_redirect**\ () or other methods outside of BPF may
+ *		interfere with successful delivery to the socket.
+ *
+ *		This operation is only valid from TC ingress path.
+ *
+ *		The *flags* argument must be zero.
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EINVAL**		Unsupported flags specified.
+ *		* **-EOPNOTSUPP**:	Unsupported operation, for example a
+ *					call from outside of TC ingress.
+ *		* **-ENOENT**		The socket cannot be assigned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3035,7 +3055,8 @@ union bpf_attr {
 	FN(tcp_send_ack),		\
 	FN(send_signal_thread),		\
 	FN(jiffies64),			\
-	FN(read_branch_records),
+	FN(read_branch_records),	\
+	FN(sk_assign),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index cd0a532db4e7..bae0874289d8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
+{
+	if (flags != 0)
+		return -EINVAL;
+	if (!skb_at_tc_ingress(skb))
+		return -EOPNOTSUPP;
+	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
+		return -ENOENT;
+
+	skb_orphan(skb);
+	skb->sk = sk;
+	skb->destructor = sock_edemux;
+	dst_sk_prefetch_store(skb);
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sk_assign_proto = {
+	.func		= bpf_sk_assign,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_PTR_TO_SOCK_COMMON,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -6139,6 +6165,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_skb_ecn_set_ce_proto;
 	case BPF_FUNC_tcp_gen_syncookie:
 		return &bpf_tcp_gen_syncookie_proto;
+	case BPF_FUNC_sk_assign:
+		return &bpf_sk_assign_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6b2798450fd4..80ee8f7b6a19 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -63,6 +63,7 @@
 
 #include <net/protocol.h>
 #include <net/dst.h>
+#include <net/dst_metadata.h>
 #include <net/sock.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
@@ -1042,6 +1043,7 @@ EXPORT_SYMBOL_GPL(alloc_skb_for_msg);
  */
 void skb_dst_drop(struct sk_buff *skb)
 {
+	dst_sk_prefetch_reset(skb);
 	if (skb->_skb_refdst) {
 		refdst_drop(skb->_skb_refdst);
 		skb->_skb_refdst = 0UL;
@@ -1466,6 +1468,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 		n->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
 
+	dst_sk_prefetch_reset(skb);
 	return __skb_clone(n, skb);
 }
 EXPORT_SYMBOL(skb_clone);
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index aa438c6758a7..9bd4858d20fc 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -509,7 +509,10 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 	IPCB(skb)->iif = skb->skb_iif;
 
 	/* Must drop socket now because of tproxy. */
-	skb_orphan(skb);
+	if (skb_dst_is_sk_prefetch(skb))
+		dst_sk_prefetch_fetch(skb);
+	else
+		skb_orphan(skb);
 
 	return skb;
 
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 7b089d0ac8cd..f7b42adca9d0 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
 	rcu_read_unlock();
 
 	/* Must drop socket now because of tproxy. */
-	skb_orphan(skb);
+	if (skb_dst_is_sk_prefetch(skb))
+		dst_sk_prefetch_fetch(skb);
+	else
+		skb_orphan(skb);
 
 	return skb;
 err:
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 46f47e58b3be..b4c557e6158d 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -11,6 +11,7 @@
 #include <linux/filter.h>
 #include <linux/bpf.h>
 
+#include <net/dst_metadata.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
@@ -53,6 +54,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
 		bpf_compute_data_pointers(skb);
 		filter_res = BPF_PROG_RUN(filter, skb);
 	}
+	if (filter_res != TC_ACT_OK)
+		dst_sk_prefetch_reset(skb);
 	rcu_read_unlock();
 
 	/* A BPF program may overwrite the default action opcode.
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 40b2d9476268..546e9e1368ff 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2914,6 +2914,21 @@ union bpf_attr {
  *		of sizeof(struct perf_branch_entry).
  *
  *		**-ENOENT** if architecture does not support branch records.
+ *
+ * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Assign the *sk* to the *skb*.
+ *
+ *		This operation is only valid from TC ingress path.
+ *
+ *		The *flags* argument must be zero.
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EINVAL**		Unsupported flags specified.
+ *		* **-EOPNOTSUPP**:	Unsupported operation, for example a
+ *					call from outside of TC ingress.
+ *		* **-ENOENT**		The socket cannot be assigned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3035,7 +3050,8 @@ union bpf_attr {
 	FN(tcp_send_ack),		\
 	FN(send_signal_thread),		\
 	FN(jiffies64),			\
-	FN(read_branch_records),
+	FN(read_branch_records),	\
+	FN(sk_assign),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.20.1



* [PATCH bpf-next 4/7] dst: Prefetch established socket destinations
  2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
                   ` (2 preceding siblings ...)
  2020-03-12 23:36 ` [PATCH bpf-next 3/7] bpf: Add socket assign support Joe Stringer
@ 2020-03-12 23:36 ` Joe Stringer
  2020-03-16 23:03   ` Martin KaFai Lau
  2020-03-12 23:36 ` [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign Joe Stringer
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Joe Stringer @ 2020-03-12 23:36 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb

Enhance the dst_sk_prefetch logic to temporarily store the socket's
receive destination, saving a route lookup later on. The dst reference
is kept alive by the caller's socket reference.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 include/net/dst_metadata.h |  2 +-
 net/core/dst.c             | 20 +++++++++++++++++---
 net/core/filter.c          |  2 +-
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 31574c553a07..4f16322b08d5 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -230,7 +230,7 @@ static inline bool skb_dst_is_sk_prefetch(const struct sk_buff *skb)
 	return dst_is_sk_prefetch(skb_dst(skb));
 }
 
-void dst_sk_prefetch_store(struct sk_buff *skb);
+void dst_sk_prefetch_store(struct sk_buff *skb, struct sock *sk);
 void dst_sk_prefetch_fetch(struct sk_buff *skb);
 
 /**
diff --git a/net/core/dst.c b/net/core/dst.c
index cf1a1d5b6b0a..5068d127d9c2 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -346,11 +346,25 @@ EXPORT_SYMBOL(dst_sk_prefetch);
 
 DEFINE_PER_CPU(unsigned long, dst_sk_prefetch_dst);
 
-void dst_sk_prefetch_store(struct sk_buff *skb)
+void dst_sk_prefetch_store(struct sk_buff *skb, struct sock *sk)
 {
-	unsigned long refdst;
+	unsigned long refdst = 0L;
+
+	WARN_ON(!rcu_read_lock_held() &&
+		!rcu_read_lock_bh_held());
+	if (sk_fullsock(sk)) {
+		struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
+
+		if (dst)
+			dst = dst_check(dst, 0);
+		if (dst)
+			refdst = (unsigned long)dst | SKB_DST_NOREF;
+	}
+	if (!refdst)
+		refdst = skb->_skb_refdst;
+	if (skb->_skb_refdst != refdst)
+		skb_dst_drop(skb);
 
-	refdst = skb->_skb_refdst;
 	__this_cpu_write(dst_sk_prefetch_dst, refdst);
 	skb_dst_set_noref(skb, (struct dst_entry *)&dst_sk_prefetch.dst);
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index bae0874289d8..db9b7b8b4a04 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5858,7 +5858,7 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
 	skb_orphan(skb);
 	skb->sk = sk;
 	skb->destructor = sock_edemux;
-	dst_sk_prefetch_store(skb);
+	dst_sk_prefetch_store(skb, sk);
 
 	return 0;
 }
-- 
2.20.1



* [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign
  2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
                   ` (3 preceding siblings ...)
  2020-03-12 23:36 ` [PATCH bpf-next 4/7] dst: Prefetch established socket destinations Joe Stringer
@ 2020-03-12 23:36 ` Joe Stringer
  2020-03-17  6:30   ` Martin KaFai Lau
  2020-03-12 23:36 ` [PATCH bpf-next 6/7] selftests: bpf: Extend sk_assign for address proxy Joe Stringer
  2020-03-12 23:36 ` [PATCH bpf-next 7/7] selftests: bpf: Improve debuggability of sk_assign Joe Stringer
  6 siblings, 1 reply; 30+ messages in thread
From: Joe Stringer @ 2020-03-12 23:36 UTC (permalink / raw)
  To: bpf; +Cc: Lorenz Bauer, netdev, daniel, ast, eric.dumazet

From: Lorenz Bauer <lmb@cloudflare.com>

Attach a tc direct-action classifier to lo in a fresh network
namespace, and rewrite all connection attempts to localhost:4321
to localhost:1234.

Keep in mind that both client-to-server and server-to-client traffic
passes through the classifier.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++++++
 tools/testing/selftests/bpf/test_sk_assign.c  | 176 ++++++++++++++++++
 tools/testing/selftests/bpf/test_sk_assign.sh |  19 ++
 5 files changed, 325 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c
 create mode 100644 tools/testing/selftests/bpf/test_sk_assign.c
 create mode 100755 tools/testing/selftests/bpf/test_sk_assign.sh

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index ec464859c6b6..e9c185899def 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -28,6 +28,7 @@ test_netcnt
 test_tcpnotify_user
 test_libbpf
 test_tcp_check_syncookie_user
+test_sk_assign
 test_sysctl
 test_hashmap
 test_btf_dump
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index ee4ad34adb4a..503fd9dc8cf6 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -58,6 +58,7 @@ TEST_PROGS := test_kmod.sh \
 	test_xdp_vlan_mode_generic.sh \
 	test_xdp_vlan_mode_native.sh \
 	test_lwt_ip_encap.sh \
+	test_sk_assign.sh \
 	test_tcp_check_syncookie.sh \
 	test_tc_tunnel.sh \
 	test_tc_edt.sh \
@@ -74,7 +75,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
-	test_lirc_mode2_user xdping test_cpp runqslower
+	test_lirc_mode2_user xdping test_cpp runqslower test_sk_assign
 
 TEST_CUSTOM_PROGS = urandom_read
 
diff --git a/tools/testing/selftests/bpf/progs/test_sk_assign.c b/tools/testing/selftests/bpf/progs/test_sk_assign.c
new file mode 100644
index 000000000000..7de30ad3f594
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sk_assign.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Cloudflare Ltd.
+
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/pkt_cls.h>
+#include <linux/tcp.h>
+#include <sys/socket.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
+
+/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
+static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
+					void *data_end, __u16 eth_proto,
+					bool *ipv4)
+{
+	struct bpf_sock_tuple *result;
+	__u8 proto = 0;
+	__u64 ihl_len;
+
+	if (eth_proto == bpf_htons(ETH_P_IP)) {
+		struct iphdr *iph = (struct iphdr *)(data + nh_off);
+
+		if (iph + 1 > data_end)
+			return NULL;
+		if (iph->ihl != 5)
+			/* Options are not supported */
+			return NULL;
+		ihl_len = iph->ihl * 4;
+		proto = iph->protocol;
+		*ipv4 = true;
+		result = (struct bpf_sock_tuple *)&iph->saddr;
+	} else if (eth_proto == bpf_htons(ETH_P_IPV6)) {
+		struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + nh_off);
+
+		if (ip6h + 1 > data_end)
+			return NULL;
+		ihl_len = sizeof(*ip6h);
+		proto = ip6h->nexthdr;
+		*ipv4 = false;
+		result = (struct bpf_sock_tuple *)&ip6h->saddr;
+	} else {
+		return NULL;
+	}
+
+	if (result + 1 > data_end || proto != IPPROTO_TCP)
+		return NULL;
+
+	return result;
+}
+
+SEC("sk_assign_test")
+int bpf_sk_assign_test(struct __sk_buff *skb)
+{
+	void *data_end = (void *)(long)skb->data_end;
+	void *data = (void *)(long)skb->data;
+	struct ethhdr *eth = (struct ethhdr *)(data);
+	struct bpf_sock_tuple *tuple, ln = {0};
+	struct bpf_sock *sk;
+	int tuple_len;
+	bool ipv4;
+	int ret;
+
+	if (eth + 1 > data_end)
+		return TC_ACT_SHOT;
+
+	tuple = get_tuple(data, sizeof(*eth), data_end, eth->h_proto, &ipv4);
+	if (!tuple)
+		return TC_ACT_SHOT;
+
+	tuple_len = ipv4 ? sizeof(tuple->ipv4) : sizeof(tuple->ipv6);
+	sk = bpf_skc_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
+	if (sk) {
+		if (sk->state != BPF_TCP_LISTEN)
+			goto assign;
+
+		bpf_sk_release(sk);
+	}
+
+	if (ipv4) {
+		if (tuple->ipv4.dport != bpf_htons(4321))
+			return TC_ACT_OK;
+
+		ln.ipv4.daddr = bpf_htonl(0x7f000001);
+		ln.ipv4.dport = bpf_htons(1234);
+
+		sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv4),
+					BPF_F_CURRENT_NETNS, 0);
+	} else {
+		if (tuple->ipv6.dport != bpf_htons(4321))
+			return TC_ACT_OK;
+
+		/* Upper parts of daddr are already zero. */
+		ln.ipv6.daddr[3] = bpf_htonl(0x1);
+		ln.ipv6.dport = bpf_htons(1234);
+
+		sk = bpf_skc_lookup_tcp(skb, &ln, sizeof(ln.ipv6),
+					BPF_F_CURRENT_NETNS, 0);
+	}
+
+	/* We can't do a single skc_lookup_tcp here, because then the compiler
+	 * will likely spill tuple_len to the stack. This makes it lose all
+	 * bounds information in the verifier, which then rejects the call as
+	 * unsafe.
+	 */
+	if (!sk)
+		return TC_ACT_SHOT;
+
+	if (sk->state != BPF_TCP_LISTEN) {
+		bpf_sk_release(sk);
+		return TC_ACT_SHOT;
+	}
+
+assign:
+	ret = bpf_sk_assign(skb, sk, 0);
+	bpf_sk_release(sk);
+	return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
+}
diff --git a/tools/testing/selftests/bpf/test_sk_assign.c b/tools/testing/selftests/bpf/test_sk_assign.c
new file mode 100644
index 000000000000..cba5f8b2b7fd
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sk_assign.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+// Copyright (c) 2019 Cloudflare
+
+#include <string.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <arpa/inet.h>
+#include <netinet/in.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "bpf_rlimit.h"
+#include "cgroup_helpers.h"
+
+static int start_server(const struct sockaddr *addr, socklen_t len)
+{
+	int fd;
+
+	fd = socket(addr->sa_family, SOCK_STREAM, 0);
+	if (fd == -1) {
+		log_err("Failed to create server socket");
+		goto out;
+	}
+
+	if (bind(fd, addr, len) == -1) {
+		log_err("Failed to bind server socket");
+		goto close_out;
+	}
+
+	if (listen(fd, 128) == -1) {
+		log_err("Failed to listen on server socket");
+		goto close_out;
+	}
+
+	goto out;
+
+close_out:
+	close(fd);
+	fd = -1;
+out:
+	return fd;
+}
+
+static int connect_to_server(const struct sockaddr *addr, socklen_t len)
+{
+	int fd = -1;
+
+	fd = socket(addr->sa_family, SOCK_STREAM, 0);
+	if (fd == -1) {
+		log_err("Failed to create client socket");
+		goto out;
+	}
+
+	if (connect(fd, addr, len) == -1) {
+		log_err("Fail to connect to server");
+		goto close_out;
+	}
+
+	goto out;
+
+close_out:
+	close(fd);
+	fd = -1;
+out:
+	return fd;
+}
+
+static int run_test(int server_fd, const struct sockaddr *addr, socklen_t len)
+{
+	int client = -1, srv_client = -1;
+	struct sockaddr_storage name;
+	char buf[] = "testing";
+	in_port_t port;
+	int ret = 1;
+
+	client = connect_to_server(addr, len);
+	if (client == -1)
+		goto out;
+
+	srv_client = accept(server_fd, NULL, NULL);
+	if (srv_client == -1) {
+		log_err("Can't accept connection");
+		goto out;
+	}
+
+	if (write(client, buf, sizeof(buf)) != sizeof(buf)) {
+		log_err("Can't write on client");
+		goto out;
+	}
+
+	if (read(srv_client, buf, sizeof(buf)) != sizeof(buf)) {
+		log_err("Can't read on server");
+		goto out;
+	}
+
+	len = sizeof(name);
+	if (getsockname(srv_client, (struct sockaddr *)&name, &len)) {
+		log_err("Can't getsockname");
+		goto out;
+	}
+
+	switch (name.ss_family) {
+	case AF_INET:
+		port = ((struct sockaddr_in *)&name)->sin_port;
+		break;
+
+	case AF_INET6:
+		port = ((struct sockaddr_in6 *)&name)->sin6_port;
+		break;
+
+	default:
+		log_err("Invalid address family");
+		goto out;
+	}
+
+	if (port != htons(4321)) {
+		log_err("Expected port 4321, got %u", ntohs(port));
+		goto out;
+	}
+
+	ret = 0;
+out:
+	close(client);
+	close(srv_client);
+	return ret;
+}
+
+int main(int argc, char **argv)
+{
+	struct sockaddr_in addr4;
+	struct sockaddr_in6 addr6;
+	int server = -1;
+	int server_v6 = -1;
+	int err = 1;
+
+	memset(&addr4, 0, sizeof(addr4));
+	addr4.sin_family = AF_INET;
+	addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+	addr4.sin_port = htons(1234);
+
+	memset(&addr6, 0, sizeof(addr6));
+	addr6.sin6_family = AF_INET6;
+	addr6.sin6_addr = in6addr_loopback;
+	addr6.sin6_port = htons(1234);
+
+	server = start_server((const struct sockaddr *)&addr4, sizeof(addr4));
+	if (server == -1)
+		goto out;
+
+	server_v6 = start_server((const struct sockaddr *)&addr6,
+				 sizeof(addr6));
+	if (server_v6 == -1)
+		goto out;
+
+	/* Connect to unbound ports */
+	addr4.sin_port = htons(4321);
+	addr6.sin6_port = htons(4321);
+
+	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
+		goto out;
+
+	if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
+		goto out;
+
+	printf("ok\n");
+	err = 0;
+out:
+	close(server);
+	close(server_v6);
+	return err;
+}
diff --git a/tools/testing/selftests/bpf/test_sk_assign.sh b/tools/testing/selftests/bpf/test_sk_assign.sh
new file mode 100755
index 000000000000..62eae9255491
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sk_assign.sh
@@ -0,0 +1,19 @@
+#!/bin/bash -e
+# SPDX-License-Identifier: GPL-2.0
+
+if [[ $EUID -ne 0 ]]; then
+        echo "This script must be run as root"
+        echo "FAIL"
+        exit 1
+fi
+
+# Run the script in a dedicated network namespace.
+if [[ -z $(ip netns identify $$) ]]; then
+        exec ../net/in_netns.sh "$0" "$@"
+fi
+
+tc qdisc add dev lo clsact
+tc filter add dev lo ingress bpf direct-action object-file ./test_sk_assign.o \
+	section "sk_assign_test"
+
+exec ./test_sk_assign
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH bpf-next 6/7] selftests: bpf: Extend sk_assign for address proxy
  2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
                   ` (4 preceding siblings ...)
  2020-03-12 23:36 ` [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign Joe Stringer
@ 2020-03-12 23:36 ` Joe Stringer
  2020-03-12 23:36 ` [PATCH bpf-next 7/7] selftests: bpf: Improve debuggability of sk_assign Joe Stringer
  6 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-12 23:36 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb

Extend the socket assign test program to validate that connections
to foreign addresses may also be proxied to a user agent via this
mechanism.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 tools/testing/selftests/bpf/test_sk_assign.c  | 13 +++++++++++++
 tools/testing/selftests/bpf/test_sk_assign.sh |  3 +++
 2 files changed, 16 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_sk_assign.c b/tools/testing/selftests/bpf/test_sk_assign.c
index cba5f8b2b7fd..4b7b9bbe7859 100644
--- a/tools/testing/selftests/bpf/test_sk_assign.c
+++ b/tools/testing/selftests/bpf/test_sk_assign.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2018 Facebook
 // Copyright (c) 2019 Cloudflare
+// Copyright (c) 2020 Isovalent. Inc.
 
 #include <string.h>
 #include <stdlib.h>
@@ -17,6 +18,8 @@
 #include "bpf_rlimit.h"
 #include "cgroup_helpers.h"
 
+#define TEST_DADDR (0xC0A80203)
+
 static int start_server(const struct sockaddr *addr, socklen_t len)
 {
 	int fd;
@@ -161,6 +164,16 @@ int main(int argc, char **argv)
 	addr4.sin_port = htons(4321);
 	addr6.sin6_port = htons(4321);
 
+	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
+		goto out;
+
+	if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
+		goto out;
+
+	/* Connect to unbound addresses */
+	addr4.sin_addr.s_addr = htonl(TEST_DADDR);
+	addr6.sin6_addr.s6_addr32[3] = htonl(TEST_DADDR);
+
 	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
 		goto out;
 
diff --git a/tools/testing/selftests/bpf/test_sk_assign.sh b/tools/testing/selftests/bpf/test_sk_assign.sh
index 62eae9255491..de1df4e438de 100755
--- a/tools/testing/selftests/bpf/test_sk_assign.sh
+++ b/tools/testing/selftests/bpf/test_sk_assign.sh
@@ -12,6 +12,9 @@ if [[ -z $(ip netns identify $$) ]]; then
         exec ../net/in_netns.sh "$0" "$@"
 fi
 
+ip route add local default dev lo
+ip -6 route add local default dev lo
+
 tc qdisc add dev lo clsact
 tc filter add dev lo ingress bpf direct-action object-file ./test_sk_assign.o \
 	section "sk_assign_test"
-- 
2.20.1



* [PATCH bpf-next 7/7] selftests: bpf: Improve debuggability of sk_assign
  2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
                   ` (5 preceding siblings ...)
  2020-03-12 23:36 ` [PATCH bpf-next 6/7] selftests: bpf: Extend sk_assign for address proxy Joe Stringer
@ 2020-03-12 23:36 ` Joe Stringer
  6 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-12 23:36 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, eric.dumazet, lmb

This test was a bit obtuse before; add a shorter timeout for when
connectivity doesn't work, and a '-d' debug flag for extra output.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
---
 tools/testing/selftests/bpf/test_sk_assign.c  | 42 +++++++++++++++++++
 tools/testing/selftests/bpf/test_sk_assign.sh |  2 +-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_sk_assign.c b/tools/testing/selftests/bpf/test_sk_assign.c
index 4b7b9bbe7859..51d3d01d5476 100644
--- a/tools/testing/selftests/bpf/test_sk_assign.c
+++ b/tools/testing/selftests/bpf/test_sk_assign.c
@@ -3,6 +3,8 @@
 // Copyright (c) 2019 Cloudflare
 // Copyright (c) 2020 Isovalent. Inc.
 
+#include <fcntl.h>
+#include <signal.h>
 #include <string.h>
 #include <stdlib.h>
 #include <unistd.h>
@@ -20,6 +22,14 @@
 
 #define TEST_DADDR (0xC0A80203)
 
+static bool debug;
+
+#define debugf(format, ...)				\
+do {							\
+	if (debug)					\
+		printf(format, ##__VA_ARGS__);		\
+} while (0)
+
 static int start_server(const struct sockaddr *addr, socklen_t len)
 {
 	int fd;
@@ -49,6 +59,17 @@ static int start_server(const struct sockaddr *addr, socklen_t len)
 	return fd;
 }
 
+static void handle_timeout(int signum)
+{
+	if (signum == SIGALRM)
+		log_err("Timed out while connecting to server");
+	kill(0, SIGKILL);
+}
+
+static struct sigaction timeout_action = {
+	.sa_handler = handle_timeout,
+};
+
 static int connect_to_server(const struct sockaddr *addr, socklen_t len)
 {
 	int fd = -1;
@@ -59,6 +80,12 @@ static int connect_to_server(const struct sockaddr *addr, socklen_t len)
 		goto out;
 	}
 
+	if (sigaction(SIGALRM, &timeout_action, NULL)) {
+		log_err("Failed to configure timeout signal");
+		goto out;
+	}
+
+	alarm(3);
 	if (connect(fd, addr, len) == -1) {
 		log_err("Fail to connect to server");
 		goto close_out;
@@ -141,6 +168,17 @@ int main(int argc, char **argv)
 	int server_v6 = -1;
 	int err = 1;
 
+	if (argc > 1) {
+		if (!memcmp(argv[1], "-h", 2)) {
+			printf("usage: %s.sh [FLAGS]\n", argv[0]);
+			printf("  -d\tEnable debug logs\n");
+			printf("  -h\tPrint help message\n");
+			exit(1);
+		}
+		if (!memcmp(argv[1], "-d", 2))
+			debug = true;
+	}
+
 	memset(&addr4, 0, sizeof(addr4));
 	addr4.sin_family = AF_INET;
 	addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
@@ -166,9 +204,11 @@ int main(int argc, char **argv)
 
 	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
 		goto out;
+	debugf("ipv4 port: ok\n");
 
 	if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
 		goto out;
+	debugf("ipv6 port: ok\n");
 
 	/* Connect to unbound addresses */
 	addr4.sin_addr.s_addr = htonl(TEST_DADDR);
@@ -176,9 +216,11 @@ int main(int argc, char **argv)
 
 	if (run_test(server, (const struct sockaddr *)&addr4, sizeof(addr4)))
 		goto out;
+	debugf("ipv4 addr: ok\n");
 
 	if (run_test(server_v6, (const struct sockaddr *)&addr6, sizeof(addr6)))
 		goto out;
+	debugf("ipv6 addr: ok\n");
 
 	printf("ok\n");
 	err = 0;
diff --git a/tools/testing/selftests/bpf/test_sk_assign.sh b/tools/testing/selftests/bpf/test_sk_assign.sh
index de1df4e438de..5a84ad18f85a 100755
--- a/tools/testing/selftests/bpf/test_sk_assign.sh
+++ b/tools/testing/selftests/bpf/test_sk_assign.sh
@@ -19,4 +19,4 @@ tc qdisc add dev lo clsact
 tc filter add dev lo ingress bpf direct-action object-file ./test_sk_assign.o \
 	section "sk_assign_test"
 
-exec ./test_sk_assign
+exec ./test_sk_assign "$@"
-- 
2.20.1



* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-12 23:36 ` [PATCH bpf-next 3/7] bpf: Add socket assign support Joe Stringer
@ 2020-03-16 10:08   ` Jakub Sitnicki
  2020-03-16 21:23     ` Joe Stringer
  2020-03-16 22:57   ` Martin KaFai Lau
  1 sibling, 1 reply; 30+ messages in thread
From: Jakub Sitnicki @ 2020-03-16 10:08 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, netdev, daniel, ast, eric.dumazet, lmb, Florian Westphal

[+CC Florian]

Hey Joe,

On Fri, Mar 13, 2020 at 12:36 AM CET, Joe Stringer wrote:
> Add support for TPROXY via a new bpf helper, bpf_sk_assign().
>
> This helper requires the BPF program to discover the socket via a call
> to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> helper takes its own reference to the socket in addition to any existing
> reference that may or may not currently be obtained for the duration of
> BPF processing. For the destination socket to receive the traffic, the
> traffic must be routed towards that socket via local route, the socket
> must have the transparent option enabled out-of-band, and the socket
> must not be closing. If all of these conditions hold, the socket will be
> assigned to the skb to allow delivery to the socket.

My impression from the last time we have been discussing TPROXY is that
the check for IP_TRANSPARENT on ingress doesn't serve any purpose [0].

The socket option only has effect on output, when there is a need to
source traffic from a non-local address.

Setting IP_TRANSPARENT requires CAP_NET_{RAW|ADMIN}, which grant a wider
range of capabilities than needed to build a transparent proxy app. This
is problematic if you want to lock down your application with seccomp.

It seems it should be enough to use a port number from a privileged
range, if you want to ensure that only the designated process can receive
the proxied traffic.

Or, alternatively, instead of using socket lookup + IP_TRANSPARENT
check, get the socket from sockmap and apply control to who can update
the BPF map.
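[Editorial sketch of the sockmap alternative, with heavy caveats: this
assumes map lookup on SOCKMAP/SOCKHASH is permitted from a TC classifier
and that the helper takes its own socket reference, neither of which is
established by this thread; all names here are hypothetical.]

```c
/* Hypothetical: the proxy process (or a privileged manager) inserts its
 * listening socket into this map; the TC program then pulls it out
 * instead of doing an address-based socket lookup. Access control
 * reduces to who may update the map.
 */
struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} proxy_socks SEC(".maps");

SEC("classifier")
int assign_from_map(struct __sk_buff *skb)
{
	__u32 key = 0;
	struct bpf_sock *sk;

	sk = bpf_map_lookup_elem(&proxy_socks, &key);
	if (sk)
		bpf_sk_assign(skb, sk, 0); /* helper takes its own ref */
	return TC_ACT_OK;
}
```

[Reference/lifetime semantics of sockmap lookups vary by kernel version, so
treat this strictly as a shape of the idea.]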

Thanks,
-jkbs

[0] https://lore.kernel.org/bpf/20190621125155.2sdw7pugepj3ityx@breakpoint.cc/

>
> The recently introduced dst_sk_prefetch is used to communicate from the
> TC layer to the IP receive layer that the socket should be retained
> across the receive. The dst_sk_prefetch destination wraps any existing
> destination (if available) and stores it temporarily in a per-cpu var.
>
> To ensure that no dst references held by the skb prior to sk_assign()
> are lost, they are stored in the per-cpu variable associated with
> dst_sk_prefetch. When the BPF program invocation from the TC action
> completes, we check the return code against TC_ACT_OK and if any other
> return code is used, we restore the dst to avoid unintentionally leaking
> the reference held in the per-CPU variable. If the packet is cloned or
> dropped before reaching ip{,6}_rcv_core(), the original dst will also be
> restored from the per-cpu variable to avoid the leak; if the packet makes
> its way to the receive function for the protocol, then the destination
> (if any) will be restored to the packet at that point.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
>  include/uapi/linux/bpf.h       | 23 ++++++++++++++++++++++-
>  net/core/filter.c              | 28 ++++++++++++++++++++++++++++
>  net/core/skbuff.c              |  3 +++
>  net/ipv4/ip_input.c            |  5 ++++-
>  net/ipv6/ip6_input.c           |  5 ++++-
>  net/sched/act_bpf.c            |  3 +++
>  tools/include/uapi/linux/bpf.h | 18 +++++++++++++++++-
>  7 files changed, 81 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 40b2d9476268..35f282cc745e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2914,6 +2914,26 @@ union bpf_attr {
>   *		of sizeof(struct perf_branch_entry).
>   *
>   *		**-ENOENT** if architecture does not support branch records.
> + *
> + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> + *	Description
> + *		Assign the *sk* to the *skb*. When combined with appropriate
> + *		routing configuration to receive the packet towards the socket,
> + *		will cause *skb* to be delivered to the specified socket.
> + *		Subsequent redirection of *skb* via  **bpf_redirect**\ (),
> + *		**bpf_clone_redirect**\ () or other methods outside of BPF may
> + *		interfere with successful delivery to the socket.
> + *
> + *		This operation is only valid from TC ingress path.
> + *
> + *		The *flags* argument must be zero.
> + *	Return
> + *		0 on success, or a negative errno in case of failure.
> + *
> + *		* **-EINVAL**		Unsupported flags specified.
> + *		* **-EOPNOTSUPP**:	Unsupported operation, for example a
> + *					call from outside of TC ingress.
> + *		* **-ENOENT**		The socket cannot be assigned.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3035,7 +3055,8 @@ union bpf_attr {
>  	FN(tcp_send_ack),		\
>  	FN(send_signal_thread),		\
>  	FN(jiffies64),			\
> -	FN(read_branch_records),
> +	FN(read_branch_records),	\
> +	FN(sk_assign),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/net/core/filter.c b/net/core/filter.c
> index cd0a532db4e7..bae0874289d8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
>  	.arg5_type	= ARG_CONST_SIZE,
>  };
>
> +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> +{
> +	if (flags != 0)
> +		return -EINVAL;
> +	if (!skb_at_tc_ingress(skb))
> +		return -EOPNOTSUPP;
> +	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> +		return -ENOENT;
> +
> +	skb_orphan(skb);
> +	skb->sk = sk;
> +	skb->destructor = sock_edemux;
> +	dst_sk_prefetch_store(skb);
> +
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_sk_assign_proto = {
> +	.func		= bpf_sk_assign,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type      = ARG_PTR_TO_CTX,
> +	.arg2_type      = ARG_PTR_TO_SOCK_COMMON,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
>  #endif /* CONFIG_INET */
>
>  bool bpf_helper_changes_pkt_data(void *func)
> @@ -6139,6 +6165,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_skb_ecn_set_ce_proto;
>  	case BPF_FUNC_tcp_gen_syncookie:
>  		return &bpf_tcp_gen_syncookie_proto;
> +	case BPF_FUNC_sk_assign:
> +		return &bpf_sk_assign_proto;
>  #endif
>  	default:
>  		return bpf_base_func_proto(func_id);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6b2798450fd4..80ee8f7b6a19 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -63,6 +63,7 @@
>
>  #include <net/protocol.h>
>  #include <net/dst.h>
> +#include <net/dst_metadata.h>
>  #include <net/sock.h>
>  #include <net/checksum.h>
>  #include <net/ip6_checksum.h>
> @@ -1042,6 +1043,7 @@ EXPORT_SYMBOL_GPL(alloc_skb_for_msg);
>   */
>  void skb_dst_drop(struct sk_buff *skb)
>  {
> +	dst_sk_prefetch_reset(skb);
>  	if (skb->_skb_refdst) {
>  		refdst_drop(skb->_skb_refdst);
>  		skb->_skb_refdst = 0UL;
> @@ -1466,6 +1468,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
>  		n->fclone = SKB_FCLONE_UNAVAILABLE;
>  	}
>
> +	dst_sk_prefetch_reset(skb);
>  	return __skb_clone(n, skb);
>  }
>  EXPORT_SYMBOL(skb_clone);
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index aa438c6758a7..9bd4858d20fc 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -509,7 +509,10 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
>  	IPCB(skb)->iif = skb->skb_iif;
>
>  	/* Must drop socket now because of tproxy. */
> -	skb_orphan(skb);
> +	if (skb_dst_is_sk_prefetch(skb))
> +		dst_sk_prefetch_fetch(skb);
> +	else
> +		skb_orphan(skb);
>
>  	return skb;
>
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 7b089d0ac8cd..f7b42adca9d0 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
>  	rcu_read_unlock();
>
>  	/* Must drop socket now because of tproxy. */
> -	skb_orphan(skb);
> +	if (skb_dst_is_sk_prefetch(skb))
> +		dst_sk_prefetch_fetch(skb);
> +	else
> +		skb_orphan(skb);
>
>  	return skb;
>  err:
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index 46f47e58b3be..b4c557e6158d 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -11,6 +11,7 @@
>  #include <linux/filter.h>
>  #include <linux/bpf.h>
>
> +#include <net/dst_metadata.h>
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
>  #include <net/pkt_cls.h>
> @@ -53,6 +54,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
>  		bpf_compute_data_pointers(skb);
>  		filter_res = BPF_PROG_RUN(filter, skb);
>  	}
> +	if (filter_res != TC_ACT_OK)
> +		dst_sk_prefetch_reset(skb);
>  	rcu_read_unlock();
>
>  	/* A BPF program may overwrite the default action opcode.
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 40b2d9476268..546e9e1368ff 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2914,6 +2914,21 @@ union bpf_attr {
>   *		of sizeof(struct perf_branch_entry).
>   *
>   *		**-ENOENT** if architecture does not support branch records.
> + *
> + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> + *	Description
> + *		Assign the *sk* to the *skb*.
> + *
> + *		This operation is only valid from TC ingress path.
> + *
> + *		The *flags* argument must be zero.
> + *	Return
> + *		0 on success, or a negative errno in case of failure.
> + *
> + *		* **-EINVAL**		Unsupported flags specified.
> + *		* **-EOPNOTSUPP**:	Unsupported operation, for example a
> + *					call from outside of TC ingress.
> + *		* **-ENOENT**		The socket cannot be assigned.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3035,7 +3050,8 @@ union bpf_attr {
>  	FN(tcp_send_ack),		\
>  	FN(send_signal_thread),		\
>  	FN(jiffies64),			\
> -	FN(read_branch_records),
> +	FN(read_branch_records),	\
> +	FN(sk_assign),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call


* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-16 10:08   ` Jakub Sitnicki
@ 2020-03-16 21:23     ` Joe Stringer
  0 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-16 21:23 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Lorenz Bauer, Florian Westphal

On Mon, Mar 16, 2020 at 3:08 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> [+CC Florian]
>
> Hey Joe,
>
> On Fri, Mar 13, 2020 at 12:36 AM CET, Joe Stringer wrote:
> > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> >
> > This helper requires the BPF program to discover the socket via a call
> > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > helper takes its own reference to the socket in addition to any existing
> > reference that may or may not currently be obtained for the duration of
> > BPF processing. For the destination socket to receive the traffic, the
> > traffic must be routed towards that socket via local route, the socket
> > must have the transparent option enabled out-of-band, and the socket
> > must not be closing. If all of these conditions hold, the socket will be
> > assigned to the skb to allow delivery to the socket.
>
> My impression from the last time we have been discussing TPROXY is that
> the check for IP_TRANSPARENT on ingress doesn't serve any purpose [0].
>
> The socket option only has effect on output, when there is a need to
> source traffic from a non-local address.
>
> Setting IP_TRANSPARENT requires CAP_NET_{RAW|ADMIN}, which grant a wider
> range of capabilities than needed to build a transparent proxy app. This
> is problematic if you want to lock down your application with seccomp.
>
> It seems it should be enough to use a port number from a privileged
> range, if you want to ensure that only the designated process can receive
> the proxied traffic.

Thanks for looking this over. You're right, I neglected to fix up the
commit message here from an earlier iteration that enforced this
constraint. I can fix this up in a v2.

> Or, alternatively, instead of using socket lookup + IP_TRANSPARENT
> check, get the socket from sockmap and apply control to who can update
> the BPF map.

There's no IP_TRANSPARENT check in this iteration of the series.

Cheers,
Joe


* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-12 23:36 ` [PATCH bpf-next 3/7] bpf: Add socket assign support Joe Stringer
  2020-03-16 10:08   ` Jakub Sitnicki
@ 2020-03-16 22:57   ` Martin KaFai Lau
  2020-03-17  3:06     ` Joe Stringer
  1 sibling, 1 reply; 30+ messages in thread
From: Martin KaFai Lau @ 2020-03-16 22:57 UTC (permalink / raw)
  To: Joe Stringer; +Cc: bpf, netdev, daniel, ast, eric.dumazet, lmb

On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> 
> This helper requires the BPF program to discover the socket via a call
> to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> helper takes its own reference to the socket in addition to any existing
> reference that may or may not currently be obtained for the duration of
> BPF processing. For the destination socket to receive the traffic, the
> traffic must be routed towards that socket via local route, the socket
I also missed where the local route check is in the patch.
Is it implied by the fact that a sk can be found via bpf_sk*_lookup_*()?
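[Editorial note: in this series the local-delivery requirement appears to be
satisfied by routing configured out-of-band rather than by the helper itself;
the selftest script in patch 6/7 steers everything to local delivery, roughly:]

```sh
# Deliver traffic for any destination locally so an assigned socket can
# receive packets addressed to foreign addresses (requires root, run
# inside the test's dedicated network namespace).
ip route add local default dev lo
ip -6 route add local default dev lo
```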

> must have the transparent option enabled out-of-band, and the socket
> must not be closing. If all of these conditions hold, the socket will be
> assigned to the skb to allow delivery to the socket.
> 
> The recently introduced dst_sk_prefetch is used to communicate from the
> TC layer to the IP receive layer that the socket should be retained
> across the receive. The dst_sk_prefetch destination wraps any existing
> destination (if available) and stores it temporarily in a per-cpu var.
> 
> To ensure that no dst references held by the skb prior to sk_assign()
> are lost, they are stored in the per-cpu variable associated with
> dst_sk_prefetch. When the BPF program invocation from the TC action
> completes, we check the return code against TC_ACT_OK and if any other
> return code is used, we restore the dst to avoid unintentionally leaking
> the reference held in the per-CPU variable. If the packet is cloned or
> dropped before reaching ip{,6}_rcv_core(), the original dst will also be
> restored from the per-cpu variable to avoid the leak; if the packet makes
> its way to the receive function for the protocol, then the destination
> (if any) will be restored to the packet at that point.
> 

[ ... ]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index cd0a532db4e7..bae0874289d8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
>  	.arg5_type	= ARG_CONST_SIZE,
>  };
>  
> +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> +{
> +	if (flags != 0)
> +		return -EINVAL;
> +	if (!skb_at_tc_ingress(skb))
> +		return -EOPNOTSUPP;
> +	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> +		return -ENOENT;
> +
> +	skb_orphan(skb);
> +	skb->sk = sk;
sk comes from bpf_sk*_lookup_*(), which does not consider the
bpf_prog installed with SO_ATTACH_REUSEPORT_EBPF. Until now that
has been fine, since the use-case was limited to sk inspection.

It now supports selecting a particular sk to receive traffic.
Any plan to support that?

> +	skb->destructor = sock_edemux;
> +	dst_sk_prefetch_store(skb);
> +
> +	return 0;
> +}
> +

[ ... ]

> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index aa438c6758a7..9bd4858d20fc 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -509,7 +509,10 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
>  	IPCB(skb)->iif = skb->skb_iif;
>  
>  	/* Must drop socket now because of tproxy. */
> -	skb_orphan(skb);
> +	if (skb_dst_is_sk_prefetch(skb))
> +		dst_sk_prefetch_fetch(skb);
> +	else
> +		skb_orphan(skb);
>  
>  	return skb;
>  
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 7b089d0ac8cd..f7b42adca9d0 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
>  	rcu_read_unlock();
>  
>  	/* Must drop socket now because of tproxy. */
> -	skb_orphan(skb);
> +	if (skb_dst_is_sk_prefetch(skb))
> +		dst_sk_prefetch_fetch(skb);
> +	else
> +		skb_orphan(skb);
If I understand it correctly, this new test skips the skb_orphan()
call for a locally routed skb. Other cases (forward?) still depend
on skb_orphan() being called here?

>  
>  	return skb;
>  err:
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index 46f47e58b3be..b4c557e6158d 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -11,6 +11,7 @@
>  #include <linux/filter.h>
>  #include <linux/bpf.h>
>  
> +#include <net/dst_metadata.h>
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
>  #include <net/pkt_cls.h>
> @@ -53,6 +54,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
>  		bpf_compute_data_pointers(skb);
>  		filter_res = BPF_PROG_RUN(filter, skb);
>  	}
> +	if (filter_res != TC_ACT_OK)
> +		dst_sk_prefetch_reset(skb);
>  	rcu_read_unlock();
>  
>  	/* A BPF program may overwrite the default action opcode.
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 40b2d9476268..546e9e1368ff 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2914,6 +2914,21 @@ union bpf_attr {
>   *		of sizeof(struct perf_branch_entry).
>   *
>   *		**-ENOENT** if architecture does not support branch records.
> + *
> + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> + *	Description
> + *		Assign the *sk* to the *skb*.
> + *
> + *		This operation is only valid from TC ingress path.
> + *
> + *		The *flags* argument must be zero.
> + *	Return
> + *		0 on success, or a negative errno in case of failure.
> + *
> + *		* **-EINVAL**		Unsupported flags specified.
> + *		* **-EOPNOTSUPP**:	Unsupported operation, for example a
> + *					call from outside of TC ingress.
> + *		* **-ENOENT**		The socket cannot be assigned.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3035,7 +3050,8 @@ union bpf_attr {
>  	FN(tcp_send_ack),		\
>  	FN(send_signal_thread),		\
>  	FN(jiffies64),			\
> -	FN(read_branch_records),
> +	FN(read_branch_records),	\
> +	FN(sk_assign),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> -- 
> 2.20.1
> 


* Re: [PATCH bpf-next 4/7] dst: Prefetch established socket destinations
  2020-03-12 23:36 ` [PATCH bpf-next 4/7] dst: Prefetch established socket destinations Joe Stringer
@ 2020-03-16 23:03   ` Martin KaFai Lau
  2020-03-17  3:17     ` Joe Stringer
  0 siblings, 1 reply; 30+ messages in thread
From: Martin KaFai Lau @ 2020-03-16 23:03 UTC (permalink / raw)
  To: Joe Stringer; +Cc: bpf, netdev, daniel, ast, eric.dumazet, lmb

On Thu, Mar 12, 2020 at 04:36:45PM -0700, Joe Stringer wrote:
> Enhance the dst_sk_prefetch logic to temporarily store the socket
> receive destination, to save the route lookup later on. The dst
> reference is kept alive by the caller's socket reference.
> 
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
>  include/net/dst_metadata.h |  2 +-
>  net/core/dst.c             | 20 +++++++++++++++++---
>  net/core/filter.c          |  2 +-
>  3 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
> index 31574c553a07..4f16322b08d5 100644
> --- a/include/net/dst_metadata.h
> +++ b/include/net/dst_metadata.h
> @@ -230,7 +230,7 @@ static inline bool skb_dst_is_sk_prefetch(const struct sk_buff *skb)
>  	return dst_is_sk_prefetch(skb_dst(skb));
>  }
>  
> -void dst_sk_prefetch_store(struct sk_buff *skb);
> +void dst_sk_prefetch_store(struct sk_buff *skb, struct sock *sk);
>  void dst_sk_prefetch_fetch(struct sk_buff *skb);
>  
>  /**
> diff --git a/net/core/dst.c b/net/core/dst.c
> index cf1a1d5b6b0a..5068d127d9c2 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -346,11 +346,25 @@ EXPORT_SYMBOL(dst_sk_prefetch);
>  
>  DEFINE_PER_CPU(unsigned long, dst_sk_prefetch_dst);
>  
> -void dst_sk_prefetch_store(struct sk_buff *skb)
> +void dst_sk_prefetch_store(struct sk_buff *skb, struct sock *sk)
>  {
> -	unsigned long refdst;
> +	unsigned long refdst = 0L;
> +
> +	WARN_ON(!rcu_read_lock_held() &&
> +		!rcu_read_lock_bh_held());
> +	if (sk_fullsock(sk)) {
> +		struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
> +
> +		if (dst)
> +			dst = dst_check(dst, 0);
v6 requires a cookie.  tcp_v6_early_demux() could be a good example.

> +		if (dst)
> +			refdst = (unsigned long)dst | SKB_DST_NOREF;
> +	}
> +	if (!refdst)
> +		refdst = skb->_skb_refdst;
> +	if (skb->_skb_refdst != refdst)
> +		skb_dst_drop(skb);
>  
> -	refdst = skb->_skb_refdst;
>  	__this_cpu_write(dst_sk_prefetch_dst, refdst);
>  	skb_dst_set_noref(skb, (struct dst_entry *)&dst_sk_prefetch.dst);
>  }

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-16 22:57   ` Martin KaFai Lau
@ 2020-03-17  3:06     ` Joe Stringer
  2020-03-17  6:26       ` Martin KaFai Lau
  2020-03-17 10:09       ` Lorenz Bauer
  0 siblings, 2 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-17  3:06 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Lorenz Bauer

On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> >
> > This helper requires the BPF program to discover the socket via a call
> > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > helper takes its own reference to the socket in addition to any existing
> > reference that may or may not currently be obtained for the duration of
> > BPF processing. For the destination socket to receive the traffic, the
> > traffic must be routed towards that socket via local route, the socket
> I also missed where is the local route check in the patch.
> Is it implied by a sk can be found in bpf_sk*_lookup_*()?

This is a requirement for traffic redirection, it's not enforced by
the patch. If the operator does not configure routing for the relevant
traffic to ensure that the traffic is delivered locally, then after
the eBPF program terminates, it will pass up through ip_rcv() and
friends and be subject to the whims of the routing table. (or
alternatively if the BPF program redirects somewhere else then this
reference will be dropped).

Maybe there's a path to simplifying this configuration path in future
to loosen this requirement, but for now I've kept the series as
minimal as possible on that front.

> [ ... ]
>
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index cd0a532db4e7..bae0874289d8 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> >       .arg5_type      = ARG_CONST_SIZE,
> >  };
> >
> > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > +{
> > +     if (flags != 0)
> > +             return -EINVAL;
> > +     if (!skb_at_tc_ingress(skb))
> > +             return -EOPNOTSUPP;
> > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > +             return -ENOENT;
> > +
> > +     skb_orphan(skb);
> > +     skb->sk = sk;
> sk is from the bpf_sk*_lookup_*() which does not consider
> the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
> However, the use-case is currently limited to sk inspection.
>
> It now supports selecting a particular sk to receive traffic.
> Any plan in supporting that?

I think this is a general bpf_sk*_lookup_*() question, previous
discussion[0] settled on avoiding that complexity before a use case
arises, for both TC and XDP versions of these helpers; I still don't
have a specific use case in mind for such functionality. If we were to
do it, I would presume that the socket lookup caller would need to
pass a dedicated flag (supported at TC and likely not at XDP) to
communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
and used to select the reuseport socket.

> > diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> > index 7b089d0ac8cd..f7b42adca9d0 100644
> > --- a/net/ipv6/ip6_input.c
> > +++ b/net/ipv6/ip6_input.c
> > @@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
> >       rcu_read_unlock();
> >
> >       /* Must drop socket now because of tproxy. */
> > -     skb_orphan(skb);
> > +     if (skb_dst_is_sk_prefetch(skb))
> > +             dst_sk_prefetch_fetch(skb);
> > +     else
> > +             skb_orphan(skb);
> If I understand it correctly, this new test is to skip
> the skb_orphan() call for locally routed skb.
> Others cases (forward?) still depend on skb_orphan() to be called here?

Roughly yes. 'locally routed skb' is a bit loose wording, though; at
this point the BPF program has only prefetched the socket to let the stack
know that it should deliver the skb to that socket, assuming that it
passes the upcoming routing check.

For more discussion on the other cases, there is the previous
thread[1] and in particular the child thread discussion with Florian,
Eric and Daniel.

[0] https://www.mail-archive.com/netdev@vger.kernel.org/msg253250.html
[1] https://www.spinics.net/lists/netdev/msg580058.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 4/7] dst: Prefetch established socket destinations
  2020-03-16 23:03   ` Martin KaFai Lau
@ 2020-03-17  3:17     ` Joe Stringer
  0 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-17  3:17 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Lorenz Bauer

On Mon, Mar 16, 2020 at 4:03 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Thu, Mar 12, 2020 at 04:36:45PM -0700, Joe Stringer wrote:
> > Enhance the dst_sk_prefetch logic to temporarily store the socket
> > receive destination, to save the route lookup later on. The dst
> > reference is kept alive by the caller's socket reference.
> >
> > Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---
> >  include/net/dst_metadata.h |  2 +-
> >  net/core/dst.c             | 20 +++++++++++++++++---
> >  net/core/filter.c          |  2 +-
> >  3 files changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
> > index 31574c553a07..4f16322b08d5 100644
> > --- a/include/net/dst_metadata.h
> > +++ b/include/net/dst_metadata.h
> > @@ -230,7 +230,7 @@ static inline bool skb_dst_is_sk_prefetch(const struct sk_buff *skb)
> >       return dst_is_sk_prefetch(skb_dst(skb));
> >  }
> >
> > -void dst_sk_prefetch_store(struct sk_buff *skb);
> > +void dst_sk_prefetch_store(struct sk_buff *skb, struct sock *sk);
> >  void dst_sk_prefetch_fetch(struct sk_buff *skb);
> >
> >  /**
> > diff --git a/net/core/dst.c b/net/core/dst.c
> > index cf1a1d5b6b0a..5068d127d9c2 100644
> > --- a/net/core/dst.c
> > +++ b/net/core/dst.c
> > @@ -346,11 +346,25 @@ EXPORT_SYMBOL(dst_sk_prefetch);
> >
> >  DEFINE_PER_CPU(unsigned long, dst_sk_prefetch_dst);
> >
> > -void dst_sk_prefetch_store(struct sk_buff *skb)
> > +void dst_sk_prefetch_store(struct sk_buff *skb, struct sock *sk)
> >  {
> > -     unsigned long refdst;
> > +     unsigned long refdst = 0L;
> > +
> > +     WARN_ON(!rcu_read_lock_held() &&
> > +             !rcu_read_lock_bh_held());
> > +     if (sk_fullsock(sk)) {
> > +             struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
> > +
> > +             if (dst)
> > +                     dst = dst_check(dst, 0);
> v6 requires a cookie.  tcp_v6_early_demux() could be a good example.

Nice catch. I plan to roll in the following incremental for v2:

diff --git a/net/core/dst.c b/net/core/dst.c
index 5068d127d9c2..b60f85227247 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -354,9 +354,14 @@ void dst_sk_prefetch_store(struct sk_buff *skb, struct sock *sk)
                !rcu_read_lock_bh_held());
        if (sk_fullsock(sk)) {
                struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
+               u32 cookie = 0;

+#if IS_ENABLED(CONFIG_IPV6)
+               if (sk->sk_family == AF_INET6)
+                       cookie = inet6_sk(sk)->rx_dst_cookie;
+#endif
                if (dst)
-                       dst = dst_check(dst, 0);
+                       dst = dst_check(dst, cookie);
                if (dst)
                        refdst = (unsigned long)dst | SKB_DST_NOREF;
        }

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-17  3:06     ` Joe Stringer
@ 2020-03-17  6:26       ` Martin KaFai Lau
  2020-03-18  0:46         ` Joe Stringer
  2020-03-17 10:09       ` Lorenz Bauer
  1 sibling, 1 reply; 30+ messages in thread
From: Martin KaFai Lau @ 2020-03-17  6:26 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, Eric Dumazet,
	Lorenz Bauer

On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
> On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > >
> > > This helper requires the BPF program to discover the socket via a call
> > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > helper takes its own reference to the socket in addition to any existing
> > > reference that may or may not currently be obtained for the duration of
> > > BPF processing. For the destination socket to receive the traffic, the
> > > traffic must be routed towards that socket via local route, the socket
> > I also missed where is the local route check in the patch.
> > Is it implied by a sk can be found in bpf_sk*_lookup_*()?
> 
> This is a requirement for traffic redirection, it's not enforced by
> the patch. If the operator does not configure routing for the relevant
> traffic to ensure that the traffic is delivered locally, then after
> the eBPF program terminates, it will pass up through ip_rcv() and
> friends and be subject to the whims of the routing table. (or
> alternatively if the BPF program redirects somewhere else then this
> reference will be dropped).
> 
> Maybe there's a path to simplifying this configuration path in future
> to loosen this requirement, but for now I've kept the series as
> minimal as possible on that front.
> 
> > [ ... ]
> >
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index cd0a532db4e7..bae0874289d8 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> > >       .arg5_type      = ARG_CONST_SIZE,
> > >  };
> > >
> > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > > +{
> > > +     if (flags != 0)
> > > +             return -EINVAL;
> > > +     if (!skb_at_tc_ingress(skb))
> > > +             return -EOPNOTSUPP;
> > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > +             return -ENOENT;
> > > +
> > > +     skb_orphan(skb);
> > > +     skb->sk = sk;
> > sk is from the bpf_sk*_lookup_*() which does not consider
> > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
> > However, the use-case is currently limited to sk inspection.
> >
> > It now supports selecting a particular sk to receive traffic.
> > Any plan in supporting that?
> 
> I think this is a general bpf_sk*_lookup_*() question, previous
> discussion[0] settled on avoiding that complexity before a use case
> arises, for both TC and XDP versions of these helpers; I still don't
> have a specific use case in mind for such functionality. If we were to
> do it, I would presume that the socket lookup caller would need to
> pass a dedicated flag (supported at TC and likely not at XDP) to
> communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> and used to select the reuseport socket.
It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
use case.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
will still be run later (e.g. from tcp_v4_rcv) to decide which sk
receives the skb.

If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
to make the final sk decision?

> 
> > > diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> > > index 7b089d0ac8cd..f7b42adca9d0 100644
> > > --- a/net/ipv6/ip6_input.c
> > > +++ b/net/ipv6/ip6_input.c
> > > @@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
> > >       rcu_read_unlock();
> > >
> > >       /* Must drop socket now because of tproxy. */
> > > -     skb_orphan(skb);
> > > +     if (skb_dst_is_sk_prefetch(skb))
> > > +             dst_sk_prefetch_fetch(skb);
> > > +     else
> > > +             skb_orphan(skb);
> > If I understand it correctly, this new test is to skip
> > the skb_orphan() call for locally routed skb.
> > Others cases (forward?) still depend on skb_orphan() to be called here?
> 
> Roughly yes. 'locally routed skb' is a bit loose wording though, at
> this point the BPF program only prefetched the socket to let the stack
> know that it should deliver the skb to that socket, assuming that it
> passes the upcoming routing check.
Which upcoming routing check?  I think it is the part I am missing.

In patch 4, let say the dst_check() returns NULL (may be due to a route
change).  Later in the upper stack, it does a route lookup
(ip_route_input_noref() or ip6_route_input()).  Could it return
a forward route? and I assume missing a skb_orphan() call
here will still be fine?

> 
> For more discussion on the other cases, there is the previous
> thread[1] and in particular the child thread discussion with Florian,
> Eric and Daniel.
> 
> [0] https://www.mail-archive.com/netdev@vger.kernel.org/msg253250.html
> [1] https://www.spinics.net/lists/netdev/msg580058.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign
  2020-03-12 23:36 ` [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign Joe Stringer
@ 2020-03-17  6:30   ` Martin KaFai Lau
  2020-03-17 20:56     ` Joe Stringer
  0 siblings, 1 reply; 30+ messages in thread
From: Martin KaFai Lau @ 2020-03-17  6:30 UTC (permalink / raw)
  To: Joe Stringer; +Cc: bpf, Lorenz Bauer, netdev, daniel, ast, eric.dumazet

On Thu, Mar 12, 2020 at 04:36:46PM -0700, Joe Stringer wrote:
> From: Lorenz Bauer <lmb@cloudflare.com>
> 
> Attach a tc direct-action classifier to lo in a fresh network
> namespace, and rewrite all connection attempts to localhost:4321
> to localhost:1234.
> 
> Keep in mind that both client to server and server to client traffic
> passes the classifier.
> 
> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
>  tools/testing/selftests/bpf/.gitignore        |   1 +
>  tools/testing/selftests/bpf/Makefile          |   3 +-
>  .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++++++
>  tools/testing/selftests/bpf/test_sk_assign.c  | 176 ++++++++++++++++++
Can this test be put under the test_progs.c framework?

>  tools/testing/selftests/bpf/test_sk_assign.sh |  19 ++

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-17  3:06     ` Joe Stringer
  2020-03-17  6:26       ` Martin KaFai Lau
@ 2020-03-17 10:09       ` Lorenz Bauer
  2020-03-18  1:10         ` Joe Stringer
  1 sibling, 1 reply; 30+ messages in thread
From: Lorenz Bauer @ 2020-03-17 10:09 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Martin KaFai Lau, bpf, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet

On Tue, 17 Mar 2020 at 03:06, Joe Stringer <joe@wand.net.nz> wrote:
>
> On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > >
> > > This helper requires the BPF program to discover the socket via a call
> > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > helper takes its own reference to the socket in addition to any existing
> > > reference that may or may not currently be obtained for the duration of
> > > BPF processing. For the destination socket to receive the traffic, the
> > > traffic must be routed towards that socket via local route, the socket
> > I also missed where is the local route check in the patch.
> > Is it implied by a sk can be found in bpf_sk*_lookup_*()?
>
> This is a requirement for traffic redirection, it's not enforced by
> the patch. If the operator does not configure routing for the relevant
> traffic to ensure that the traffic is delivered locally, then after
> the eBPF program terminates, it will pass up through ip_rcv() and
> friends and be subject to the whims of the routing table. (or
> alternatively if the BPF program redirects somewhere else then this
> reference will be dropped).

Can you elaborate on what "an appropriate routing configuration" would be?
I'm not well versed in how routing works, sorry.

Do you think being subject to the routing table is desirable, or is it an
implementation trade-off?

>
> I think this is a general bpf_sk*_lookup_*() question, previous
> discussion[0] settled on avoiding that complexity before a use case
> arises, for both TC and XDP versions of these helpers; I still don't
> have a specific use case in mind for such functionality. If we were to
> do it, I would presume that the socket lookup caller would need to
> pass a dedicated flag (supported at TC and likely not at XDP) to
> communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> and used to select the reuseport socket.

I was surprised that both TC and XDP don't run the reuseport program!
So far I assumed that TC did pass the skb. I understand that you don't want
to tackle this issue, but is it possible to reject reuseport sockets from
sk_assign in that case?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign
  2020-03-17  6:30   ` Martin KaFai Lau
@ 2020-03-17 20:56     ` Joe Stringer
  2020-03-18 17:27       ` Martin KaFai Lau
  0 siblings, 1 reply; 30+ messages in thread
From: Joe Stringer @ 2020-03-17 20:56 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, bpf, Lorenz Bauer, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet

On Tue, Mar 17, 2020 at 12:31 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Thu, Mar 12, 2020 at 04:36:46PM -0700, Joe Stringer wrote:
> > From: Lorenz Bauer <lmb@cloudflare.com>
> >
> > Attach a tc direct-action classifier to lo in a fresh network
> > namespace, and rewrite all connection attempts to localhost:4321
> > to localhost:1234.
> >
> > Keep in mind that both client to server and server to client traffic
> > passes the classifier.
> >
> > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > ---
> >  tools/testing/selftests/bpf/.gitignore        |   1 +
> >  tools/testing/selftests/bpf/Makefile          |   3 +-
> >  .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++++++
> >  tools/testing/selftests/bpf/test_sk_assign.c  | 176 ++++++++++++++++++
> Can this test be put under the test_progs.c framework?

I'm not sure, how does the test_progs.c framework handle the logic in
"tools/testing/selftests/bpf/test_sk_assign.sh"?

Specifically I'm looking for:
* Unique netns to avoid messing with host networking stack configuration
* Control over routes
* Attaching loaded bpf programs to ingress qdisc of a device

These are each trivial one-liners in the supplied shell script
(admittedly building on existing shell infrastructure in the tests dir
and iproute2 package). Seems like maybe the netns parts aren't so bad
looking at flow_dissector_reattach.c but anything involving netlink
configuration would either require pulling in a netlink library
dependency somewhere or shelling out to the existing binaries. At that
point I wonder if we're trying to achieve integration of this test
into some automated prog runner; is there a simpler way, like a place
where I can just add a one-liner to run the test_sk_assign.sh script?

> >  tools/testing/selftests/bpf/test_sk_assign.sh |  19 ++

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-17  6:26       ` Martin KaFai Lau
@ 2020-03-18  0:46         ` Joe Stringer
  2020-03-18 10:03           ` Jakub Sitnicki
  2020-03-18 18:48           ` Martin KaFai Lau
  0 siblings, 2 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-18  0:46 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Lorenz Bauer

On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
> > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > > >
> > > > This helper requires the BPF program to discover the socket via a call
> > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > > helper takes its own reference to the socket in addition to any existing
> > > > reference that may or may not currently be obtained for the duration of
> > > > BPF processing. For the destination socket to receive the traffic, the
> > > > traffic must be routed towards that socket via local route, the socket
> > > I also missed where is the local route check in the patch.
> > > Is it implied by a sk can be found in bpf_sk*_lookup_*()?
> >
> > This is a requirement for traffic redirection, it's not enforced by
> > the patch. If the operator does not configure routing for the relevant
> > traffic to ensure that the traffic is delivered locally, then after
> > the eBPF program terminates, it will pass up through ip_rcv() and
> > friends and be subject to the whims of the routing table. (or
> > alternatively if the BPF program redirects somewhere else then this
> > reference will be dropped).
> >
> > Maybe there's a path to simplifying this configuration path in future
> > to loosen this requirement, but for now I've kept the series as
> > minimal as possible on that front.
> >
> > > [ ... ]
> > >
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index cd0a532db4e7..bae0874289d8 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> > > >       .arg5_type      = ARG_CONST_SIZE,
> > > >  };
> > > >
> > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > > > +{
> > > > +     if (flags != 0)
> > > > +             return -EINVAL;
> > > > +     if (!skb_at_tc_ingress(skb))
> > > > +             return -EOPNOTSUPP;
> > > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > > +             return -ENOENT;
> > > > +
> > > > +     skb_orphan(skb);
> > > > +     skb->sk = sk;
> > > sk is from the bpf_sk*_lookup_*() which does not consider
> > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
> > > However, the use-case is currently limited to sk inspection.
> > >
> > > It now supports selecting a particular sk to receive traffic.
> > > Any plan in supporting that?
> >
> > I think this is a general bpf_sk*_lookup_*() question, previous
> > discussion[0] settled on avoiding that complexity before a use case
> > arises, for both TC and XDP versions of these helpers; I still don't
> > have a specific use case in mind for such functionality. If we were to
> > do it, I would presume that the socket lookup caller would need to
> > pass a dedicated flag (supported at TC and likely not at XDP) to
> > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> > and used to select the reuseport socket.
> It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
> use case.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
> will still be run later (e.g. from tcp_v4_rcv) to decide which sk
> receives the skb.
>
> If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
> will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
> to make the final sk decision?

I don't believe so, no:

ip_local_deliver()
-> ...
-> ip_protocol_deliver_rcu()
-> tcp_v4_rcv()
-> __inet_lookup_skb()
-> skb_steal_sock(skb)

But this will only affect you if you are running both the bpf@tc
program with sk_assign() and the reuseport BPF sock programs at the
same time. This is why I link it back to the bpf_sk*_lookup_*()
functions: If the socket lookup in the initial step respects reuseport
BPF prog logic and returns the socket using the same logic, then the
packet will be directed to the socket you expect. Just like how
non-BPF reuseport would work with this series today.

> >
> > > > diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> > > > index 7b089d0ac8cd..f7b42adca9d0 100644
> > > > --- a/net/ipv6/ip6_input.c
> > > > +++ b/net/ipv6/ip6_input.c
> > > > @@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
> > > >       rcu_read_unlock();
> > > >
> > > >       /* Must drop socket now because of tproxy. */
> > > > -     skb_orphan(skb);
> > > > +     if (skb_dst_is_sk_prefetch(skb))
> > > > +             dst_sk_prefetch_fetch(skb);
> > > > +     else
> > > > +             skb_orphan(skb);
> > > If I understand it correctly, this new test is to skip
> > > the skb_orphan() call for locally routed skb.
> > > Others cases (forward?) still depend on skb_orphan() to be called here?
> >
> > Roughly yes. 'locally routed skb' is a bit loose wording though, at
> > this point the BPF program only prefetched the socket to let the stack
> > know that it should deliver the skb to that socket, assuming that it
> > passes the upcoming routing check.
> Which upcoming routing check?  I think it is the part I am missing.
>
> In patch 4, let say the dst_check() returns NULL (may be due to a route
> change).  Later in the upper stack, it does a route lookup
> (ip_route_input_noref() or ip6_route_input()).  Could it return
> a forward route? and I assume missing a skb_orphan() call
> here will still be fine?

Yes, it could return a forward route; in that case:

ip_forward()
-> if (unlikely(skb->sk)) goto drop;

Note that you'd have to get a socket reference to get to this point in
the first place. I see two options:
* BPF program operator didn't set up the routes correctly for local
socket destination
* BPF program looks up socket in another netns and tries to assign it.

For the latter case I could introduce a netns validation check to
ensure it matches the netns of the device.

> >
> > For more discussion on the other cases, there is the previous
> > thread[1] and in particular the child thread discussion with Florian,
> > Eric and Daniel.
> >
> > [0] https://www.mail-archive.com/netdev@vger.kernel.org/msg253250.html
> > [1] https://www.spinics.net/lists/netdev/msg580058.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-17 10:09       ` Lorenz Bauer
@ 2020-03-18  1:10         ` Joe Stringer
  2020-03-18  2:03           ` Joe Stringer
  0 siblings, 1 reply; 30+ messages in thread
From: Joe Stringer @ 2020-03-18  1:10 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Joe Stringer, Martin KaFai Lau, bpf, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet

On Tue, Mar 17, 2020 at 3:10 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Tue, 17 Mar 2020 at 03:06, Joe Stringer <joe@wand.net.nz> wrote:
> >
> > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > > >
> > > > This helper requires the BPF program to discover the socket via a call
> > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > > helper takes its own reference to the socket in addition to any existing
> > > > reference that may or may not currently be obtained for the duration of
> > > > BPF processing. For the destination socket to receive the traffic, the
> > > > traffic must be routed towards that socket via local route, the socket
> > > I also missed where is the local route check in the patch.
> > > Is it implied by a sk can be found in bpf_sk*_lookup_*()?
> >
> > This is a requirement for traffic redirection, it's not enforced by
> > the patch. If the operator does not configure routing for the relevant
> > traffic to ensure that the traffic is delivered locally, then after
> > the eBPF program terminates, it will pass up through ip_rcv() and
> > friends and be subject to the whims of the routing table. (or
> > alternatively if the BPF program redirects somewhere else then this
> > reference will be dropped).
>
> Can you elaborate what "an appropriate routing configuration" would be?
> I'm not well versed with how routing works, sorry.

Maybe I should add this into the git commit message :-)

The simplest version of it is demonstrated in patch 6:
https://www.spinics.net/lists/netdev/msg637176.html

$ ip route add local default dev lo

Depending on your use case, you may want to be more specific on the
matches, e.g. using a specific CIDR rather than setting a default route.
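[Editorial note: a narrower variant of the route above might look like the following. The 10.1.0.0/16 CIDR is a placeholder, not from the series; the idea is to steer only the proxied range to the local stack instead of a catch-all local default.]

```shell
# Illustrative only: deliver just the proxied service range locally,
# rather than "ip route add local default dev lo".  10.1.0.0/16 is a
# made-up placeholder CIDR.
ip route add local 10.1.0.0/16 dev lo
```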

> Do you think being subject to the routing table is desirable, or is it an
> implementation trade-off?

I think it's an implementation trade-off.

> >
> > I think this is a general bpf_sk*_lookup_*() question, previous
> > discussion[0] settled on avoiding that complexity before a use case
> > arises, for both TC and XDP versions of these helpers; I still don't
> > have a specific use case in mind for such functionality. If we were to
> > do it, I would presume that the socket lookup caller would need to
> > pass a dedicated flag (supported at TC and likely not at XDP) to
> > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> > and used to select the reuseport socket.
>
> I was surprised that both TC and XDP don't run the reuseport program!

FWIW this is explicitly documented in the helper man pages for
sk_lookup_tcp() and friends:
http://man7.org/linux/man-pages/man7/bpf-helpers.7.html

> So far I assumed that TC did pass the skb. I understand that you don't want
> to tackle this issue, but is it possible to reject reuseport sockets from
> sk_assign in that case?

What if users don't attach a reuseport BPF program, but rely on the
standard hashing mechanism? Then we would be artificially limiting
that case.

What do you have in mind for the motivation of this, are you concerned
about feature probing or something else?

> --
> Lorenz Bauer  |  Systems Engineer
> 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
>
> www.cloudflare.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-18  1:10         ` Joe Stringer
@ 2020-03-18  2:03           ` Joe Stringer
  0 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-18  2:03 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Lorenz Bauer, Martin KaFai Lau, bpf, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet

On Tue, Mar 17, 2020 at 6:10 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> On Tue, Mar 17, 2020 at 3:10 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > On Tue, 17 Mar 2020 at 03:06, Joe Stringer <joe@wand.net.nz> wrote:
> > >
> > > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > > > >
> > > > > This helper requires the BPF program to discover the socket via a call
> > > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > > > helper takes its own reference to the socket in addition to any existing
> > > > > reference that may or may not currently be obtained for the duration of
> > > > > BPF processing. For the destination socket to receive the traffic, the
> > > > > traffic must be routed towards that socket via local route, the socket
> > > > I also missed where the local route check is in the patch.
> > > > Is it implied by a sk being found in bpf_sk*_lookup_*()?
> > >
> > > This is a requirement for traffic redirection, it's not enforced by
> > > the patch. If the operator does not configure routing for the relevant
> > > traffic to ensure that the traffic is delivered locally, then after
> > > the eBPF program terminates, it will pass up through ip_rcv() and
> > > friends and be subject to the whims of the routing table. (or
> > > alternatively if the BPF program redirects somewhere else then this
> > > reference will be dropped).
> >
> > Can you elaborate what "an appropriate routing configuration" would be?
> > I'm not well versed with how routing works, sorry.
>
> [...]
>
> > Do you think being subject to the routing table is desirable, or is it an
> > implementation trade-off?
>
> I think it's an implementation trade-off.

Perhaps it's worth expanding on this a bit more. There's always a
tradeoff between solving your specific problem and introducing
functionality that integrates with the rest of the stack. In some
sense, I would like a notion here of "shortcut this traffic directly
to the socket"; it would solve my problem, but it's quite specific, so
there's not much room for sharing the usage. It could still be very
useful for some use cases, but alternatives may support use cases you
hadn't thought of in the first place. Maybe there's a more incremental
path to achieving my goal through an implementation like this.

The current design of bpf_sk_assign() in this series defers to the
stack a bit more than alternatives may do (thinking eg a socket
redirect function "bpf_sk_redirect()"). It says "this is best-effort";
if you wanted to, you could still override this functionality with
iptables tproxy rules. You could choose to route the traffic
differently (although through the exploration with Martin above, for
now this will have fairly limited options unless we make additional
changes...). Glancing through the existing eBPF API, you could assign
the socket to the skb then subsequently use things like
bpf_get_socket_cookie() to fetch the cookie out. For all I know,
someone will come up with some nifty future idea that makes use of the
idea "we associate the socket with the skb" to solve a use case I
haven't thought of, and that could exist either within the bpf@tc hook
or after.
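
For illustration, the lookup-then-assign pattern discussed here might look roughly like the following TC program. This is a sketch only: the helper signature for bpf_sk_assign() is as proposed in this series, and the address, port, and section name are placeholder assumptions, not code from the patch set.

```c
/* Sketch: steer traffic to a local proxy socket via the proposed
 * bpf_sk_assign() helper. 127.0.0.1:1234 is a placeholder address.
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("classifier")
int steer_to_proxy(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {
		.ipv4.daddr = bpf_htonl(0x7f000001),	/* 127.0.0.1 */
		.ipv4.dport = bpf_htons(1234),		/* proxy port */
	};
	struct bpf_sock *sk;

	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return TC_ACT_OK;

	/* Best-effort: routing must still deliver the packet locally
	 * for the assigned socket to actually receive it.
	 */
	bpf_sk_assign(skb, sk, 0);

	/* The helper took its own reference; drop the lookup reference. */
	bpf_sk_release(sk);
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```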

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-18  0:46         ` Joe Stringer
@ 2020-03-18 10:03           ` Jakub Sitnicki
  2020-03-19  5:49             ` Joe Stringer
  2020-03-18 18:48           ` Martin KaFai Lau
  1 sibling, 1 reply; 30+ messages in thread
From: Jakub Sitnicki @ 2020-03-18 10:03 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Martin KaFai Lau, bpf, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Lorenz Bauer

On Wed, Mar 18, 2020 at 01:46 AM CET, Joe Stringer wrote:
> On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
>> > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> > >
>> > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
>> > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
>> > > >
>> > > > This helper requires the BPF program to discover the socket via a call
>> > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
>> > > > helper takes its own reference to the socket in addition to any existing
>> > > > reference that may or may not currently be obtained for the duration of
>> > > > BPF processing. For the destination socket to receive the traffic, the
>> > > > traffic must be routed towards that socket via local route, the socket
>> > > I also missed where the local route check is in the patch.
>> > > Is it implied by a sk being found in bpf_sk*_lookup_*()?
>> >
>> > This is a requirement for traffic redirection, it's not enforced by
>> > the patch. If the operator does not configure routing for the relevant
>> > traffic to ensure that the traffic is delivered locally, then after
>> > the eBPF program terminates, it will pass up through ip_rcv() and
>> > friends and be subject to the whims of the routing table. (or
>> > alternatively if the BPF program redirects somewhere else then this
>> > reference will be dropped).
>> >
>> > Maybe there's a path to simplifying this configuration path in future
>> > to loosen this requirement, but for now I've kept the series as
>> > minimal as possible on that front.
>> >
>> > > [ ... ]
>> > >
>> > > > diff --git a/net/core/filter.c b/net/core/filter.c
>> > > > index cd0a532db4e7..bae0874289d8 100644
>> > > > --- a/net/core/filter.c
>> > > > +++ b/net/core/filter.c
>> > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
>> > > >       .arg5_type      = ARG_CONST_SIZE,
>> > > >  };
>> > > >
>> > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
>> > > > +{
>> > > > +     if (flags != 0)
>> > > > +             return -EINVAL;
>> > > > +     if (!skb_at_tc_ingress(skb))
>> > > > +             return -EOPNOTSUPP;
>> > > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
>> > > > +             return -ENOENT;
>> > > > +
>> > > > +     skb_orphan(skb);
>> > > > +     skb->sk = sk;
>> > > sk is from the bpf_sk*_lookup_*() which does not consider
>> > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
>> > > However, the use-case is currently limited to sk inspection.
>> > >
>> > > It now supports selecting a particular sk to receive traffic.
>> > > Any plan in supporting that?
>> >
>> > I think this is a general bpf_sk*_lookup_*() question, previous
>> > discussion[0] settled on avoiding that complexity before a use case
>> > arises, for both TC and XDP versions of these helpers; I still don't
>> > have a specific use case in mind for such functionality. If we were to
>> > do it, I would presume that the socket lookup caller would need to
>> > pass a dedicated flag (supported at TC and likely not at XDP) to
>> > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
>> > and used to select the reuseport socket.
>> It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
>> use case.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
>> will still be run later (e.g. from tcp_v4_rcv) to decide which sk
>> receives the skb.
>>
>> If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
>> will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
>> to make the final sk decision?
>
> I don't believe so, no:
>
> ip_local_deliver()
> -> ...
> -> ip_protocol_deliver_rcu()
> -> tcp_v4_rcv()
> -> __inet_lookup_skb()
> -> skb_steal_sock(skb)
>
> But this will only affect you if you are running both the bpf@tc
> program with sk_assign() and the reuseport BPF sock programs at the
> same time. This is why I link it back to the bpf_sk*_lookup_*()
> functions: If the socket lookup in the initial step respects reuseport
> BPF prog logic and returns the socket using the same logic, then the
> packet will be directed to the socket you expect. Just like how
> non-BPF reuseport would work with this series today.

I'm a bit lost in the argumentation. The cover letter says that the goal is
to support TPROXY use cases from BPF TC. TPROXY, however, supports
reuseport load-balancing, which is essential to scaling out your
receiver [0].

I assume that in the Cilium use case, a single socket / single core is
sufficient to handle traffic steered with this new mechanism.

Also, socket lookup from XDP / BPF TC _without_ reuseport sounds
okay-ish because you're likely after information that a socket (group)
is attached to some local address / port.

However, when you go one step further and assign the socket to skb
without running reuseport logic, that is breaking socket load-balancing
for applications.

That is to say that I'm with Lorenz on this one. Sockets that belong to
reuseport group should not be a valid target for assignment until socket
lookup from BPF honors reuseport.

[0] https://www.slideshare.net/lfevents/boost-udp-transaction-performance

[...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign
  2020-03-17 20:56     ` Joe Stringer
@ 2020-03-18 17:27       ` Martin KaFai Lau
  2020-03-19  5:45         ` Joe Stringer
  0 siblings, 1 reply; 30+ messages in thread
From: Martin KaFai Lau @ 2020-03-18 17:27 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, Lorenz Bauer, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Andrii Nakryiko

On Tue, Mar 17, 2020 at 01:56:12PM -0700, Joe Stringer wrote:
> On Tue, Mar 17, 2020 at 12:31 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 04:36:46PM -0700, Joe Stringer wrote:
> > > From: Lorenz Bauer <lmb@cloudflare.com>
> > >
> > > Attach a tc direct-action classifier to lo in a fresh network
> > > namespace, and rewrite all connection attempts to localhost:4321
> > > to localhost:1234.
> > >
> > > Keep in mind that both client to server and server to client traffic
> > > passes the classifier.
> > >
> > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > ---
> > >  tools/testing/selftests/bpf/.gitignore        |   1 +
> > >  tools/testing/selftests/bpf/Makefile          |   3 +-
> > >  .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++++++
> > >  tools/testing/selftests/bpf/test_sk_assign.c  | 176 ++++++++++++++++++
> > Can this test be put under the test_progs.c framework?
> 
> I'm not sure, how does the test_progs.c framework handle the logic in
> "tools/testing/selftests/bpf/test_sk_assign.sh"?
> 
> Specifically I'm looking for:
> * Unique netns to avoid messing with host networking stack configuration
> * Control over routes
> * Attaching loaded bpf programs to ingress qdisc of a device
> 
> These are each trivial one-liners in the supplied shell script
> (admittedly building on existing shell infrastructure in the tests dir
> and iproute2 package). Seems like maybe the netns parts aren't so bad
> looking at flow_dissector_reattach.c but anything involving netlink
> configuration would either require pulling in a netlink library
> dependency somewhere or shelling out to the existing binaries. At that
> point I wonder if we're trying to achieve integration of this test
> into some automated prog runner, is there a simpler way like a place I
> can just add a one-liner to run the test_sk_assign.sh script?
I think running a system(cmd) in test_progs is fine, as long as it cleans
up everything when it is done.  There are some pieces of netlink
in tools/lib/bpf/netlink.c that may be reusable also.

Other than test_progs.c, I am not aware there is a script to run
all *.sh.  I usually only run test_progs.

Cc: Andrii who has fixed many selftest issues recently.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-18  0:46         ` Joe Stringer
  2020-03-18 10:03           ` Jakub Sitnicki
@ 2020-03-18 18:48           ` Martin KaFai Lau
  2020-03-19  6:24             ` Joe Stringer
  1 sibling, 1 reply; 30+ messages in thread
From: Martin KaFai Lau @ 2020-03-18 18:48 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, Eric Dumazet,
	Lorenz Bauer

On Tue, Mar 17, 2020 at 05:46:58PM -0700, Joe Stringer wrote:
> On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
> > > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > > > >
> > > > > This helper requires the BPF program to discover the socket via a call
> > > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > > > helper takes its own reference to the socket in addition to any existing
> > > > > reference that may or may not currently be obtained for the duration of
> > > > > BPF processing. For the destination socket to receive the traffic, the
> > > > > traffic must be routed towards that socket via local route, the socket
> > > > I also missed where the local route check is in the patch.
> > > > Is it implied by a sk being found in bpf_sk*_lookup_*()?
> > >
> > > This is a requirement for traffic redirection, it's not enforced by
> > > the patch. If the operator does not configure routing for the relevant
> > > traffic to ensure that the traffic is delivered locally, then after
> > > the eBPF program terminates, it will pass up through ip_rcv() and
> > > friends and be subject to the whims of the routing table. (or
> > > alternatively if the BPF program redirects somewhere else then this
> > > reference will be dropped).
> > >
> > > Maybe there's a path to simplifying this configuration path in future
> > > to loosen this requirement, but for now I've kept the series as
> > > minimal as possible on that front.
> > >
> > > > [ ... ]
> > > >
> > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > index cd0a532db4e7..bae0874289d8 100644
> > > > > --- a/net/core/filter.c
> > > > > +++ b/net/core/filter.c
> > > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> > > > >       .arg5_type      = ARG_CONST_SIZE,
> > > > >  };
> > > > >
> > > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > > > > +{
> > > > > +     if (flags != 0)
> > > > > +             return -EINVAL;
> > > > > +     if (!skb_at_tc_ingress(skb))
> > > > > +             return -EOPNOTSUPP;
> > > > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > > > +             return -ENOENT;
> > > > > +
> > > > > +     skb_orphan(skb);
> > > > > +     skb->sk = sk;
> > > > sk is from the bpf_sk*_lookup_*() which does not consider
> > > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
> > > > However, the use-case is currently limited to sk inspection.
> > > >
> > > > It now supports selecting a particular sk to receive traffic.
> > > > Any plan in supporting that?
> > >
> > > I think this is a general bpf_sk*_lookup_*() question, previous
> > > discussion[0] settled on avoiding that complexity before a use case
> > > arises, for both TC and XDP versions of these helpers; I still don't
> > > have a specific use case in mind for such functionality. If we were to
> > > do it, I would presume that the socket lookup caller would need to
> > > pass a dedicated flag (supported at TC and likely not at XDP) to
> > > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> > > and used to select the reuseport socket.
> > It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
> > use case.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
> > will still be run later (e.g. from tcp_v4_rcv) to decide which sk
> > receives the skb.
> >
> > If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
> > will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
> > to make the final sk decision?
> 
> I don't believe so, no:
> 
> ip_local_deliver()
> -> ...
> -> ip_protocol_deliver_rcu()
> -> tcp_v4_rcv()
> -> __inet_lookup_skb()
> -> skb_steal_sock(skb)
> 
> But this will only affect you if you are running both the bpf@tc
> program with sk_assign() and the reuseport BPF sock programs at the
> same time.
I don't think it is the right answer to ask the user to be careful and
only use either bpf_sk_assign()@tc or bpf_prog@so_reuseport.

> This is why I link it back to the bpf_sk*_lookup_*()
> functions: If the socket lookup in the initial step respects reuseport
> BPF prog logic and returns the socket using the same logic, then the
> packet will be directed to the socket you expect. Just like how
> non-BPF reuseport would work with this series today.
Changing bpf_sk*_lookup_*() is a way to solve it, but I don't know what it
may run into when running a bpf_prog recursively, i.e. running
bpf@so-reuseport inside bpf@tc. That may need a closer look.

> 
> > >
> > > > > diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> > > > > index 7b089d0ac8cd..f7b42adca9d0 100644
> > > > > --- a/net/ipv6/ip6_input.c
> > > > > +++ b/net/ipv6/ip6_input.c
> > > > > @@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
> > > > >       rcu_read_unlock();
> > > > >
> > > > >       /* Must drop socket now because of tproxy. */
> > > > > -     skb_orphan(skb);
> > > > > +     if (skb_dst_is_sk_prefetch(skb))
> > > > > +             dst_sk_prefetch_fetch(skb);
> > > > > +     else
> > > > > +             skb_orphan(skb);
> > > > If I understand it correctly, this new test is to skip
> > > > the skb_orphan() call for locally routed skb.
> > > > Others cases (forward?) still depend on skb_orphan() to be called here?
> > >
> > > Roughly yes. 'locally routed skb' is a bit loose wording though, at
> > > this point the BPF program only prefetched the socket to let the stack
> > > know that it should deliver the skb to that socket, assuming that it
> > > passes the upcoming routing check.
> > Which upcoming routing check?  I think it is the part I am missing.
> >
> > In patch 4, let's say the dst_check() returns NULL (may be due to a route
> > change).  Later in the upper stack, it does a route lookup
> > (ip_route_input_noref() or ip6_route_input()).  Could it return
> > a forward route? and I assume missing a skb_orphan() call
> > here will still be fine?
> 
> Yes it could return a forward route, in that case:
> 
> ip_forward()
> -> if (unlikely(skb->sk)) goto drop;
> 
> Note that you'd have to get a socket reference to get to this point in
That is another question that I have.  A TCP_LISTEN sk will suffer
from this extra refcnt, e.g. under SYN flood.  Can something smarter
be done in skb->destructor?

In general, it took me a while to wrap my head around how
skb->_skb_refdst is related to assigning a sk to skb->sk.
My understanding is it is a way to tell when not to call
skb_orphan() here.  Have you considered other options (e.g.
using a bit in skb->sk)?  It would be useful to explain
them in the commit message.

> the first place. I see two options:
> * BPF program operator didn't set up the routes correctly for local
> socket destination
> * BPF program looks up socket in another netns and tries to assign it.
> 
> For the latter case I could introduce a netns validation check to
> ensure it matches the netns of the device.
> 
> > >
> > > For more discussion on the other cases, there is the previous
> > > thread[1] and in particular the child thread discussion with Florian,
> > > Eric and Daniel.
> > >
> > > [0] https://www.mail-archive.com/netdev@vger.kernel.org/msg253250.html
> > > [1] https://www.spinics.net/lists/netdev/msg580058.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign
  2020-03-18 17:27       ` Martin KaFai Lau
@ 2020-03-19  5:45         ` Joe Stringer
  2020-03-19 17:36           ` Andrii Nakryiko
  0 siblings, 1 reply; 30+ messages in thread
From: Joe Stringer @ 2020-03-19  5:45 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, bpf, Lorenz Bauer, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Andrii Nakryiko

On Wed, Mar 18, 2020 at 10:28 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Mar 17, 2020 at 01:56:12PM -0700, Joe Stringer wrote:
> > On Tue, Mar 17, 2020 at 12:31 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Thu, Mar 12, 2020 at 04:36:46PM -0700, Joe Stringer wrote:
> > > > From: Lorenz Bauer <lmb@cloudflare.com>
> > > >
> > > > Attach a tc direct-action classifier to lo in a fresh network
> > > > namespace, and rewrite all connection attempts to localhost:4321
> > > > to localhost:1234.
> > > >
> > > > Keep in mind that both client to server and server to client traffic
> > > > passes the classifier.
> > > >
> > > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > > ---
> > > >  tools/testing/selftests/bpf/.gitignore        |   1 +
> > > >  tools/testing/selftests/bpf/Makefile          |   3 +-
> > > >  .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++++++
> > > >  tools/testing/selftests/bpf/test_sk_assign.c  | 176 ++++++++++++++++++
> > > Can this test be put under the test_progs.c framework?
> >
> > I'm not sure, how does the test_progs.c framework handle the logic in
> > "tools/testing/selftests/bpf/test_sk_assign.sh"?
> >
> > Specifically I'm looking for:
> > * Unique netns to avoid messing with host networking stack configuration
> > * Control over routes
> > * Attaching loaded bpf programs to ingress qdisc of a device
> >
> > These are each trivial one-liners in the supplied shell script
> > (admittedly building on existing shell infrastructure in the tests dir
> > and iproute2 package). Seems like maybe the netns parts aren't so bad
> > looking at flow_dissector_reattach.c but anything involving netlink
> > configuration would either require pulling in a netlink library
> > dependency somewhere or shelling out to the existing binaries. At that
> > point I wonder if we're trying to achieve integration of this test
> > into some automated prog runner, is there a simpler way like a place I
> > can just add a one-liner to run the test_sk_assign.sh script?
> I think running a system(cmd) in test_progs is fine, as long as it cleans
> up everything when it is done.  There are some pieces of netlink
> in tools/lib/bpf/netlink.c that may be reusable also.
>
> Other than test_progs.c, I am not aware there is a script to run
> all *.sh.  I usually only run test_progs.
>
> Cc: Andrii who has fixed many selftest issues recently.

OK, unless I get some other guidance I'll take a stab at this.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-18 10:03           ` Jakub Sitnicki
@ 2020-03-19  5:49             ` Joe Stringer
  0 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-19  5:49 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Joe Stringer, Martin KaFai Lau, bpf, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Lorenz Bauer

On Wed, Mar 18, 2020 at 3:03 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Wed, Mar 18, 2020 at 01:46 AM CET, Joe Stringer wrote:
> > On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >>
> >> On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
> >> > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >> > >
> >> > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> >> > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> >> > > >
> >> > > > This helper requires the BPF program to discover the socket via a call
> >> > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> >> > > > helper takes its own reference to the socket in addition to any existing
> >> > > > reference that may or may not currently be obtained for the duration of
> >> > > > BPF processing. For the destination socket to receive the traffic, the
> >> > > > traffic must be routed towards that socket via local route, the socket
> >> > > I also missed where the local route check is in the patch.
> >> > > Is it implied by a sk being found in bpf_sk*_lookup_*()?
> >> >
> >> > This is a requirement for traffic redirection, it's not enforced by
> >> > the patch. If the operator does not configure routing for the relevant
> >> > traffic to ensure that the traffic is delivered locally, then after
> >> > the eBPF program terminates, it will pass up through ip_rcv() and
> >> > friends and be subject to the whims of the routing table. (or
> >> > alternatively if the BPF program redirects somewhere else then this
> >> > reference will be dropped).
> >> >
> >> > Maybe there's a path to simplifying this configuration path in future
> >> > to loosen this requirement, but for now I've kept the series as
> >> > minimal as possible on that front.
> >> >
> >> > > [ ... ]
> >> > >
> >> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> >> > > > index cd0a532db4e7..bae0874289d8 100644
> >> > > > --- a/net/core/filter.c
> >> > > > +++ b/net/core/filter.c
> >> > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> >> > > >       .arg5_type      = ARG_CONST_SIZE,
> >> > > >  };
> >> > > >
> >> > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> >> > > > +{
> >> > > > +     if (flags != 0)
> >> > > > +             return -EINVAL;
> >> > > > +     if (!skb_at_tc_ingress(skb))
> >> > > > +             return -EOPNOTSUPP;
> >> > > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> >> > > > +             return -ENOENT;
> >> > > > +
> >> > > > +     skb_orphan(skb);
> >> > > > +     skb->sk = sk;
> >> > > sk is from the bpf_sk*_lookup_*() which does not consider
> >> > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
> >> > > However, the use-case is currently limited to sk inspection.
> >> > >
> >> > > It now supports selecting a particular sk to receive traffic.
> >> > > Any plan in supporting that?
> >> >
> >> > I think this is a general bpf_sk*_lookup_*() question, previous
> >> > discussion[0] settled on avoiding that complexity before a use case
> >> > arises, for both TC and XDP versions of these helpers; I still don't
> >> > have a specific use case in mind for such functionality. If we were to
> >> > do it, I would presume that the socket lookup caller would need to
> >> > pass a dedicated flag (supported at TC and likely not at XDP) to
> >> > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> >> > and used to select the reuseport socket.
> >> It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
> >> use case.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
> >> will still be run later (e.g. from tcp_v4_rcv) to decide which sk
> >> receives the skb.
> >>
> >> If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
> >> will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
> >> to make the final sk decision?
> >
> > I don't believe so, no:
> >
> > ip_local_deliver()
> > -> ...
> > -> ip_protocol_deliver_rcu()
> > -> tcp_v4_rcv()
> > -> __inet_lookup_skb()
> > -> skb_steal_sock(skb)
> >
> > But this will only affect you if you are running both the bpf@tc
> > program with sk_assign() and the reuseport BPF sock programs at the
> > same time. This is why I link it back to the bpf_sk*_lookup_*()
> > functions: If the socket lookup in the initial step respects reuseport
> > BPF prog logic and returns the socket using the same logic, then the
> > packet will be directed to the socket you expect. Just like how
> > non-BPF reuseport would work with this series today.
>
> I'm a bit lost in the argumentation. The cover letter says that the goal is
> to support TPROXY use cases from BPF TC. TPROXY, however, supports
> reuseport load-balancing, which is essential to scaling out your
> receiver [0].

Thanks for the link, that helps set the background.

> I assume that in the Cilium use case, a single socket / single core is
> sufficient to handle traffic steered with this new mechanism.
>
> Also, socket lookup from XDP / BPF TC _without_ reuseport sounds
> okay-ish because you're likely after information that a socket (group)
> is attached to some local address / port.
>
> However, when you go one step further and assign the socket to skb
> without running reuseport logic, that is breaking socket load-balancing
> for applications.
>
> That is to say that I'm with Lorenz on this one. Sockets that belong to
> reuseport group should not be a valid target for assignment until socket
> lookup from BPF honors reuseport.

I was considering the SO_REUSEPORT socket option separately from BPF
reuseport programs: from that perspective, if you weren't at the point
of loading BPF programs to help steer traffic to reuseport sockets,
then you could still make use of this helper. You're conversely
assuming that a BPF reuseport program will be configured by default,
so if we allow this at all then we'll break that use case.

I still disagree, but in the interests of unblocking the basic cases
here I can roll in this restriction; it's always easier to loosen
things up in future than to make them more restrictive later.
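
A sketch of what such a restriction could look like in the helper from patch 3. The sk_fullsock()/sk->sk_reuseport check below is an assumption about how the rejection might be expressed; the exact version that lands may differ:

```c
/* Hypothetical variant of bpf_sk_assign() from patch 3 that rejects
 * reuseport sockets until BPF socket lookup honors reuseport selection.
 */
BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
{
	if (flags != 0)
		return -EINVAL;
	if (!skb_at_tc_ingress(skb))
		return -EOPNOTSUPP;
	/* Reject sockets in a reuseport group: assigning one directly
	 * would bypass any SO_ATTACH_REUSEPORT_EBPF program's selection.
	 */
	if (sk_fullsock(sk) && sk->sk_reuseport)
		return -ESOCKTNOSUPPORT;
	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
		return -ENOENT;

	skb_orphan(skb);
	skb->sk = sk;
	/* ...remainder as in patch 3 (destructor, dst_sk_prefetch)... */

	return 0;
}
```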

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-18 18:48           ` Martin KaFai Lau
@ 2020-03-19  6:24             ` Joe Stringer
  2020-03-20  1:54               ` Martin KaFai Lau
  0 siblings, 1 reply; 30+ messages in thread
From: Joe Stringer @ 2020-03-19  6:24 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Lorenz Bauer

On Wed, Mar 18, 2020 at 11:49 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Mar 17, 2020 at 05:46:58PM -0700, Joe Stringer wrote:
> > On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
> > > > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > > > > >
> > > > > > This helper requires the BPF program to discover the socket via a call
> > > > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > > > > helper takes its own reference to the socket in addition to any existing
> > > > > > reference that may or may not currently be obtained for the duration of
> > > > > > BPF processing. For the destination socket to receive the traffic, the
> > > > > > traffic must be routed towards that socket via local route, the socket
> > > > > I also missed where is the local route check in the patch.
> > > > > Is it implied by a sk can be found in bpf_sk*_lookup_*()?
> > > >
> > > > This is a requirement for traffic redirection, it's not enforced by
> > > > the patch. If the operator does not configure routing for the relevant
> > > > traffic to ensure that the traffic is delivered locally, then after
> > > > the eBPF program terminates, it will pass up through ip_rcv() and
> > > > friends and be subject to the whims of the routing table. (or
> > > > alternatively if the BPF program redirects somewhere else then this
> > > > reference will be dropped).
> > > >
> > > > Maybe there's a path to simplifying this configuration path in future
> > > > to loosen this requirement, but for now I've kept the series as
> > > > minimal as possible on that front.
> > > >
> > > > > [ ... ]
> > > > >
> > > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > > index cd0a532db4e7..bae0874289d8 100644
> > > > > > --- a/net/core/filter.c
> > > > > > +++ b/net/core/filter.c
> > > > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> > > > > >       .arg5_type      = ARG_CONST_SIZE,
> > > > > >  };
> > > > > >
> > > > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > > > > > +{
> > > > > > +     if (flags != 0)
> > > > > > +             return -EINVAL;
> > > > > > +     if (!skb_at_tc_ingress(skb))
> > > > > > +             return -EOPNOTSUPP;
> > > > > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > > > > +             return -ENOENT;
> > > > > > +
> > > > > > +     skb_orphan(skb);
> > > > > > +     skb->sk = sk;
> > > > > sk is from the bpf_sk*_lookup_*() which does not consider
> > > > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
> > > > > However, the use-case is currently limited to sk inspection.
> > > > >
> > > > > It now supports selecting a particular sk to receive traffic.
> > > > > Any plan in supporting that?
> > > >
> > > > I think this is a general bpf_sk*_lookup_*() question, previous
> > > > discussion[0] settled on avoiding that complexity before a use case
> > > > arises, for both TC and XDP versions of these helpers; I still don't
> > > > have a specific use case in mind for such functionality. If we were to
> > > > do it, I would presume that the socket lookup caller would need to
> > > > pass a dedicated flag (supported at TC and likely not at XDP) to
> > > > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> > > > and used to select the reuseport socket.
> > > It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
> > > usecase.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
> > > will still be run later (e.g. from tcp_v4_rcv) to decide which sk to
> > > receive the skb.
> > >
> > > If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
> > > will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
> > > to make the final sk decision?
> >
> > I don't believe so, no:
> >
> > ip_local_deliver()
> > -> ...
> > -> ip_protocol_deliver_rcu()
> > -> tcp_v4_rcv()
> > -> __inet_lookup_skb()
> > -> skb_steal_sock(skb)
> >
> > But this will only affect you if you are running both the bpf@tc
> > program with sk_assign() and the reuseport BPF sock programs at the
> > same time.
> I don't think it is the right answer to ask the user to be careful and
> only use either bpf_sk_assign()@tc or bpf_prog@so_reuseport.

Applying a restriction on reuseport sockets until we sort this out per
my other email should resolve this concern.

> > This is why I link it back to the bpf_sk*_lookup_*()
> > functions: If the socket lookup in the initial step respects reuseport
> > BPF prog logic and returns the socket using the same logic, then the
> > packet will be directed to the socket you expect. Just like how
> > non-BPF reuseport would work with this series today.
> Changing bpf_sk*_lookup_*() is a way to solve it but I don't know what it
> may run into when recurring bpf_prog, i.e. running bpf@so-reuseport inside
> bpf@tc. That may need a closer look.

Right, that's my initial concern as well.

One alternative might be something like this: in the helper
implementation, store a bit somewhere to say "we need to resolve the
reuseport later"; then, when the TC BPF program returns, check this
bit and, if reuseport resolution is necessary, trigger the BPF program
for it and fix up the socket after the fact. A bit uglier though, and
I'm also not sure how socket refcounting would work there; maybe we
can avoid taking the refcount in the socket lookup and then fix it up
in the later execution.

> [...]
> It is another question that I have.  The TCP_LISTEN sk will suffer
> from this extra refcnt, e.g. SYNFLOOD.  Can something smarter
> be done in skb->destructor?

Can you elaborate a bit more on the idea you have here?

Looking at the BPF API, it seems like the writer of the program can
use bpf_tcp_gen_syncookie() / bpf_tcp_check_syncookie() to generate
and check syn cookies to mitigate this kind of attack. This at least
provides an option beyond what existing tproxy implementations
provide.

> In general, it took me a while to wrap my head around thinking
> how a skb->_skb_refdst is related to assigning a sk to skb->sk.
> My understanding is it is a way to tell when not to call
> skb_orphan() here.  Have you considered other options (e.g.
> using a bit in skb->sk)?   It will be useful to explain
> them in the commit message.

Good point, I did briefly explore that initially and it looked a lot
more invasive. With that approach, any time we do some kind of socket
handling (assign, release, steal, etc.) we have this extra bit to deal
with and must decide whether it needs special handling.
skb->_skb_refdst already has this ugliness (see skb_dst() and friends)
so on a practical note it seemed less invasive to me to reuse that
infrastructure.

Conceptually I was looking at this as a metadata destination similar
to the referred patches in one of the earlier commit messages. We
associate this special socket destination initially, to tell ip_rcv()
that we really do need to retain this socket and not just orphan
it/continue with the regular destination selection logic.

I can roll this explanation into the series header and/or commit
messages as well.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign
  2020-03-19  5:45         ` Joe Stringer
@ 2020-03-19 17:36           ` Andrii Nakryiko
  0 siblings, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2020-03-19 17:36 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Martin KaFai Lau, bpf, Lorenz Bauer, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Andrii Nakryiko

On Wed, Mar 18, 2020 at 10:46 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> On Wed, Mar 18, 2020 at 10:28 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Tue, Mar 17, 2020 at 01:56:12PM -0700, Joe Stringer wrote:
> > > On Tue, Mar 17, 2020 at 12:31 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Thu, Mar 12, 2020 at 04:36:46PM -0700, Joe Stringer wrote:
> > > > > From: Lorenz Bauer <lmb@cloudflare.com>
> > > > >
> > > > > Attach a tc direct-action classifier to lo in a fresh network
> > > > > namespace, and rewrite all connection attempts to localhost:4321
> > > > > to localhost:1234.
> > > > >
> > > > > Keep in mind that both client to server and server to client traffic
> > > > > passes the classifier.
> > > > >
> > > > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > > > > Signed-off-by: Joe Stringer <joe@wand.net.nz>
> > > > > ---
> > > > >  tools/testing/selftests/bpf/.gitignore        |   1 +
> > > > >  tools/testing/selftests/bpf/Makefile          |   3 +-
> > > > >  .../selftests/bpf/progs/test_sk_assign.c      | 127 +++++++++++++
> > > > >  tools/testing/selftests/bpf/test_sk_assign.c  | 176 ++++++++++++++++++
> > > > Can this test be put under the test_progs.c framework?
> > >
> > > I'm not sure, how does the test_progs.c framework handle the logic in
> > > "tools/testing/selftests/bpf/test_sk_assign.sh"?
> > >
> > > Specifically I'm looking for:
> > > * Unique netns to avoid messing with host networking stack configuration
> > > * Control over routes
> > > * Attaching loaded bpf programs to ingress qdisc of a device
> > >
> > > These are each trivial one-liners in the supplied shell script
> > > (admittedly building on existing shell infrastructure in the tests dir
> > > and iproute2 package). Seems like maybe the netns parts aren't so bad
> > > looking at flow_dissector_reattach.c but anything involving netlink
> > > configuration would either require pulling in a netlink library
dependency somewhere or shelling out to the existing binaries. At that
point, if the goal is to integrate this test into some automated prog
runner, is there a simpler way, like a place where I can just add a
one-liner to run the test_sk_assign.sh script?
> > I think running a system(cmd) in test_progs is fine, as long as it cleans
> > up everything when it is done.  There are some pieces of netlink
> > in tools/lib/bpf/netlink.c that may be reusable also.
> >
> > Other than test_progs.c, I am not aware there is a script to run
> > all *.sh.  I usually only run test_progs.
> >
> > Cc: Andrii who has fixed many selftest issues recently.
>
> OK, unless I get some other guidance I'll take a stab at this.

Having tests in test_progs makes sure they're executed constantly by
maintainers, automated testing and, hopefully, developers. Having
some .sh script gets much less coverage in that sense. So if at all
possible, please add new tests to test_progs.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-19  6:24             ` Joe Stringer
@ 2020-03-20  1:54               ` Martin KaFai Lau
  2020-03-20  4:28                 ` Joe Stringer
  0 siblings, 1 reply; 30+ messages in thread
From: Martin KaFai Lau @ 2020-03-20  1:54 UTC (permalink / raw)
  To: Joe Stringer
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, Eric Dumazet,
	Lorenz Bauer

On Wed, Mar 18, 2020 at 11:24:11PM -0700, Joe Stringer wrote:
> On Wed, Mar 18, 2020 at 11:49 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Tue, Mar 17, 2020 at 05:46:58PM -0700, Joe Stringer wrote:
> > > On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
> > > > > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > > > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > > > > > >
> > > > > > > This helper requires the BPF program to discover the socket via a call
> > > > > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > > > > > helper takes its own reference to the socket in addition to any existing
> > > > > > > reference that may or may not currently be obtained for the duration of
> > > > > > > BPF processing. For the destination socket to receive the traffic, the
> > > > > > > traffic must be routed towards that socket via local route, the socket
> > > > > > I also missed where is the local route check in the patch.
> > > > > > Is it implied by a sk can be found in bpf_sk*_lookup_*()?
> > > > >
> > > > > This is a requirement for traffic redirection, it's not enforced by
> > > > > the patch. If the operator does not configure routing for the relevant
> > > > > traffic to ensure that the traffic is delivered locally, then after
> > > > > the eBPF program terminates, it will pass up through ip_rcv() and
> > > > > friends and be subject to the whims of the routing table. (or
> > > > > alternatively if the BPF program redirects somewhere else then this
> > > > > reference will be dropped).
> > > > >
> > > > > Maybe there's a path to simplifying this configuration path in future
> > > > > to loosen this requirement, but for now I've kept the series as
> > > > > minimal as possible on that front.
> > > > >
> > > > > > [ ... ]
> > > > > >
> > > > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > > > index cd0a532db4e7..bae0874289d8 100644
> > > > > > > --- a/net/core/filter.c
> > > > > > > +++ b/net/core/filter.c
> > > > > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> > > > > > >       .arg5_type      = ARG_CONST_SIZE,
> > > > > > >  };
> > > > > > >
> > > > > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > > > > > > +{
> > > > > > > +     if (flags != 0)
> > > > > > > +             return -EINVAL;
> > > > > > > +     if (!skb_at_tc_ingress(skb))
> > > > > > > +             return -EOPNOTSUPP;
> > > > > > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > > > > > +             return -ENOENT;
> > > > > > > +
> > > > > > > +     skb_orphan(skb);
> > > > > > > +     skb->sk = sk;
> > > > > > sk is from the bpf_sk*_lookup_*() which does not consider
> > > > > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
> > > > > > However, the use-case is currently limited to sk inspection.
> > > > > >
> > > > > > It now supports selecting a particular sk to receive traffic.
> > > > > > Any plan in supporting that?
> > > > >
> > > > > I think this is a general bpf_sk*_lookup_*() question, previous
> > > > > discussion[0] settled on avoiding that complexity before a use case
> > > > > arises, for both TC and XDP versions of these helpers; I still don't
> > > > > have a specific use case in mind for such functionality. If we were to
> > > > > do it, I would presume that the socket lookup caller would need to
> > > > > pass a dedicated flag (supported at TC and likely not at XDP) to
> > > > > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> > > > > and used to select the reuseport socket.
> > > > It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
> > > > usecase.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
> > > > will still be run later (e.g. from tcp_v4_rcv) to decide which sk to
> > > > receive the skb.
> > > >
> > > > If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
> > > > will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
> > > > to make the final sk decision?
> > >
> > > I don't believe so, no:
> > >
> > > ip_local_deliver()
> > > -> ...
> > > -> ip_protocol_deliver_rcu()
> > > -> tcp_v4_rcv()
> > > -> __inet_lookup_skb()
> > > -> skb_steal_sock(skb)
> > >
> > > But this will only affect you if you are running both the bpf@tc
> > > program with sk_assign() and the reuseport BPF sock programs at the
> > > same time.
> > I don't think it is the right answer to ask the user to be careful and
> > only use either bpf_sk_assign()@tc or bpf_prog@so_reuseport.
> 
> Applying a restriction on reuseport sockets until we sort this out per
> my other email should resolve this concern.
> 
> > > This is why I link it back to the bpf_sk*_lookup_*()
> > > functions: If the socket lookup in the initial step respects reuseport
> > > BPF prog logic and returns the socket using the same logic, then the
> > > packet will be directed to the socket you expect. Just like how
> > > non-BPF reuseport would work with this series today.
> > Changing bpf_sk*_lookup_*() is a way to solve it but I don't know what it
> > may run into when recurring bpf_prog, i.e. running bpf@so-reuseport inside
> > bpf@tc. That may need a closer look.
> 
> Right, that's my initial concern as well.
> 
> One alternative might be something like: in the helper implementation,
> store some bit somewhere to say "we need to resolve the reuseport
> later" and then when the TC BPF program returns, check this bit and if
> reuseport is necessary, trigger the BPF program for it and fix up the
> socket after-the-fact.
skb_dst_is_sk_prefetch() could be that bit.  One major thing
is that bpf@so_reuseport currently runs at the transport layer
and expects skb->data to point at the udp/tcp header.  The ideal
place to run it is there.  However, the skb_dst_is_sk_prefetch() bit
is currently lost at ip[6]_rcv_core().

> A bit uglier though, also not sure how socket
> refcounting would work there; maybe we can avoid the refcount in the
> socket lookup and then fix it up in the later execution.
That should not be an issue if refcnt is not taken for
SOCK_RCU_FREE (e.g. TCP_LISTEN) in the first place.

> 
> > [...]
> > It is another question that I have.  The TCP_LISTEN sk will suffer
> > from this extra refcnt, e.g. SYNFLOOD.  Can something smarter
> > be done in skb->destructor?
> 
> Can you elaborate a bit more on the idea you have here?
I am wondering whether skb->destructor could do something like
bpf_sk_release(). This patch reuses the tcp sock_edemux destructor,
which currently only looks up the established sk.

> 
> Looking at the BPF API, it seems like the writer of the program can
> use bpf_tcp_gen_syncookie() / bpf_tcp_check_syncookie() to generate
> and check syn cookies to mitigate this kind of attack. This at least
> provides an option beyond what existing tproxy implementations
> provide.
When the SYN-ACK comes back, it will still be served by a TCP_LISTEN sk.
I know the refcnt hurts in a synflood test.  I don't know what the effect
may be on serving those valid SYN-ACKs, since there will be no need
to measure once SOCK_RCU_FREE handling is in ;)

UDP sockets are also SOCK_RCU_FREE.  I think only early_demux, which
seems to be for connected sockets only, takes a refcnt.
Btw, it may be a good idea to add a UDP test.

I am fine with pushing these into the optimize/support-later bucket.
It is still good to explore a little more so that we don't
regret it later.

> 
> > In general, it took me a while to wrap my head around thinking
> > how a skb->_skb_refdst is related to assigning a sk to skb->sk.
> > My understanding is it is a way to tell when not to call
> > skb_orphan() here.  Have you considered other options (e.g.
> > using a bit in skb->sk)?   It will be useful to explain
> > them in the commit message.
> 
> Good point, I did briefly explore that initially and it looked a lot
> more invasive. With that approach, any time we do some kind of socket
> handling (assign, release, steal, etc.) we have this extra bit we have to
> deal with and decide whether we need to specially handle it.
> skb->_skb_refdst already has this ugliness (see skb_dst() and friends)
> so on a practical note it seemed less invasive to me to reuse that
> infrastructure.
> 
> Conceptually I was looking at this as a metadata destination similar
> to the referred patches in one of the earlier commit messages. We
> associate this special socket destination initially, to tell ip_rcv()
> that we really do need to retain this socket and not just orphan
> it/continue with the regular destination selection logic.
> 
> I can roll this explanation into the series header and/or commit
> messages as well.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
  2020-03-20  1:54               ` Martin KaFai Lau
@ 2020-03-20  4:28                 ` Joe Stringer
  0 siblings, 0 replies; 30+ messages in thread
From: Joe Stringer @ 2020-03-20  4:28 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	Eric Dumazet, Lorenz Bauer

On Thu, Mar 19, 2020 at 6:55 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Mar 18, 2020 at 11:24:11PM -0700, Joe Stringer wrote:
> > On Wed, Mar 18, 2020 at 11:49 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Tue, Mar 17, 2020 at 05:46:58PM -0700, Joe Stringer wrote:
> > > > On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
> > > > > > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> > > > > > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
> > > > > > > >
> > > > > > > > This helper requires the BPF program to discover the socket via a call
> > > > > > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> > > > > > > > helper takes its own reference to the socket in addition to any existing
> > > > > > > > reference that may or may not currently be obtained for the duration of
> > > > > > > > BPF processing. For the destination socket to receive the traffic, the
> > > > > > > > traffic must be routed towards that socket via local route, the socket
> > > > > > > I also missed where is the local route check in the patch.
> > > > > > > Is it implied by a sk can be found in bpf_sk*_lookup_*()?
> > > > > >
> > > > > > This is a requirement for traffic redirection, it's not enforced by
> > > > > > the patch. If the operator does not configure routing for the relevant
> > > > > > traffic to ensure that the traffic is delivered locally, then after
> > > > > > the eBPF program terminates, it will pass up through ip_rcv() and
> > > > > > friends and be subject to the whims of the routing table. (or
> > > > > > alternatively if the BPF program redirects somewhere else then this
> > > > > > reference will be dropped).
> > > > > >
> > > > > > Maybe there's a path to simplifying this configuration path in future
> > > > > > to loosen this requirement, but for now I've kept the series as
> > > > > > minimal as possible on that front.
> > > > > >
> > > > > > > [ ... ]
> > > > > > >
> > > > > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > > > > index cd0a532db4e7..bae0874289d8 100644
> > > > > > > > --- a/net/core/filter.c
> > > > > > > > +++ b/net/core/filter.c
> > > > > > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
> > > > > > > >       .arg5_type      = ARG_CONST_SIZE,
> > > > > > > >  };
> > > > > > > >
> > > > > > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > > > > > > > +{
> > > > > > > > +     if (flags != 0)
> > > > > > > > +             return -EINVAL;
> > > > > > > > +     if (!skb_at_tc_ingress(skb))
> > > > > > > > +             return -EOPNOTSUPP;
> > > > > > > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > > > > > > > +             return -ENOENT;
> > > > > > > > +
> > > > > > > > +     skb_orphan(skb);
> > > > > > > > +     skb->sk = sk;
> > > > > > > sk is from the bpf_sk*_lookup_*() which does not consider
> > > > > > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
> > > > > > > However, the use-case is currently limited to sk inspection.
> > > > > > >
> > > > > > > It now supports selecting a particular sk to receive traffic.
> > > > > > > Any plan in supporting that?
> > > > > >
> > > > > > I think this is a general bpf_sk*_lookup_*() question, previous
> > > > > > discussion[0] settled on avoiding that complexity before a use case
> > > > > > arises, for both TC and XDP versions of these helpers; I still don't
> > > > > > have a specific use case in mind for such functionality. If we were to
> > > > > > do it, I would presume that the socket lookup caller would need to
> > > > > > pass a dedicated flag (supported at TC and likely not at XDP) to
> > > > > > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
> > > > > > and used to select the reuseport socket.
> > > > > It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
> > > > > usecase.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
> > > > > will still be run later (e.g. from tcp_v4_rcv) to decide which sk to
> > > > > receive the skb.
> > > > >
> > > > > If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
> > > > > will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
> > > > > to make the final sk decision?
> > > >
> > > > I don't believe so, no:
> > > >
> > > > ip_local_deliver()
> > > > -> ...
> > > > -> ip_protocol_deliver_rcu()
> > > > -> tcp_v4_rcv()
> > > > -> __inet_lookup_skb()
> > > > -> skb_steal_sock(skb)
> > > >
> > > > But this will only affect you if you are running both the bpf@tc
> > > > program with sk_assign() and the reuseport BPF sock programs at the
> > > > same time.
> > > I don't think it is the right answer to ask the user to be careful and
> > > only use either bpf_sk_assign()@tc or bpf_prog@so_reuseport.
> >
> > Applying a restriction on reuseport sockets until we sort this out per
> > my other email should resolve this concern.
> >
> > > > This is why I link it back to the bpf_sk*_lookup_*()
> > > > functions: If the socket lookup in the initial step respects reuseport
> > > > BPF prog logic and returns the socket using the same logic, then the
> > > > packet will be directed to the socket you expect. Just like how
> > > > non-BPF reuseport would work with this series today.
> > > Changing bpf_sk*_lookup_*() is a way to solve it but I don't know what it
> > > may run into when recurring bpf_prog, i.e. running bpf@so-reuseport inside
> > > bpf@tc. That may need a closer look.
> >
> > Right, that's my initial concern as well.
> >
> > One alternative might be something like: in the helper implementation,
> > store some bit somewhere to say "we need to resolve the reuseport
> > later" and then when the TC BPF program returns, check this bit and if
> > reuseport is necessary, trigger the BPF program for it and fix up the
> > socket after-the-fact.
> skb_dst_is_sk_prefetch() could be that bit.  One major thing
> is that bpf@so_reuseport currently runs at the transport layer
> and expects skb->data to point at the udp/tcp header.  The ideal
> place to run it is there.

My initial thought above was much simpler - just holding it long
enough to exit the current BPF program so we know that the tc@bpf
program terminates, then preparing & running the so_reuseport program.

> However, the skb_dst_is_sk_prefetch() bit
> is currently lost at ip[6]_rcv_core.

Yeah, I think this is tricky. Here are three paths I'm tracking right now:
* Loopback destination is already assigned to skb, so we wrap that
with the dst_sk_prefetch in sk_assign. We unwrap it again currently in
ip[6]_rcv_core, but if we retained it, we'd need to convince
ip[6]_rcv_finish_core() to respect the metadata dst, then convince
ip[6]_rcv_finish() to call the right receive destination function.
* In the connected case of patch #4 in this series, we wrap the
rx_dst. Needs similar treatment.
* In the regular initial packet case for non-loopback destination, we
don't actually have a destination yet so the routing check would need
to track this bit and ensure it's propagated through that routing
check and again back out to ip[6]_rcv_finish() and call the right
destination receive.

Even if we get through all of that and we get up to the transport
layer, one of the main points of this feature is to guide the packet
to a socket that may be associated with a different tuple, so AFAIK we
end up needing another path even up at the transport layer to check
this bit and jump straight to the reuseport_select_sock() call.
Looking at the __udp4_lib_rcv() path, I'm eyeing the conditional branch
after stealing the socket; in there we'd need more special-casing for
the new call to select the reuseport socket.

Following that rabbit hole, it seems less invasive either to get
exactly the right reuseport socket in the first place or to do
something closer to the simpler approach above.

> > A bit uglier though, also not sure how socket
> > refcounting would work there; maybe we can avoid the refcount in the
> > socket lookup and then fix it up in the later execution.
> That should not be an issue if refcnt is not taken for
> SOCK_RCU_FREE (e.g. TCP_LISTEN) in the first place.

This is likely the part that I was missing: I assumed that all bets
were off when we rcu_read_unlock(). But it sounds like you're saying
it's actually the RCU grace period that matters, and that won't happen
before we enqueue the skb at the socket?

> >
> > > [...]
> > > It is another question that I have.  The TCP_LISTEN sk will suffer
> > > from this extra refcnt, e.g. SYNFLOOD.  Can something smarter
> > > be done in skb->destructor?
> >
> > Can you elaborate a bit more on the idea you have here?
> I am wondering whether skb->destructor could do something like
> bpf_sk_release(). This patch reuses the tcp sock_edemux destructor,
> which currently only looks up the established sk.

I can try it out.

> >
> > Looking at the BPF API, it seems like the writer of the program can
> > use bpf_tcp_gen_syncookie() / bpf_tcp_check_syncookie() to generate
> > and check syn cookies to mitigate this kind of attack. This at least
> > provides an option beyond what existing tproxy implementations
> > provide.
> When the SYNACK comes back, it will still be served by a TCP_LISTEN sk.
> I know refcnt sucks on synflood test.  I don't know what the effect
> may be on serving those valid synack since there is no need
> to measure after SOCK_RCU_FREE is in ;)
>
> UDP is also in SOCK_RCU_FREE.  I think only early_demux, which
> seems to be for connected only, takes a refcnt.
> btw, it may be a good idea to add a udp test.

UDP was next on my list, in my local testing I needed a new
skc_lookup_udp() to find the right sockets. That patch series is more
straightforward than this one so I thought it'd be better to get the
feedback on this socket assign approach first then follow up with UDP
support.

> I am fine with pushing these into the optimize/support-later bucket.
> It is still good to explore a little more so that we don't
> regret it later.

I'll dig around a bit.

Thanks for the feedback,
Joe

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-03-20  4:29 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 1/7] dst: Move skb_dst_drop to skbuff.c Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 2/7] dst: Add socket prefetch metadata destinations Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 3/7] bpf: Add socket assign support Joe Stringer
2020-03-16 10:08   ` Jakub Sitnicki
2020-03-16 21:23     ` Joe Stringer
2020-03-16 22:57   ` Martin KaFai Lau
2020-03-17  3:06     ` Joe Stringer
2020-03-17  6:26       ` Martin KaFai Lau
2020-03-18  0:46         ` Joe Stringer
2020-03-18 10:03           ` Jakub Sitnicki
2020-03-19  5:49             ` Joe Stringer
2020-03-18 18:48           ` Martin KaFai Lau
2020-03-19  6:24             ` Joe Stringer
2020-03-20  1:54               ` Martin KaFai Lau
2020-03-20  4:28                 ` Joe Stringer
2020-03-17 10:09       ` Lorenz Bauer
2020-03-18  1:10         ` Joe Stringer
2020-03-18  2:03           ` Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 4/7] dst: Prefetch established socket destinations Joe Stringer
2020-03-16 23:03   ` Martin KaFai Lau
2020-03-17  3:17     ` Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign Joe Stringer
2020-03-17  6:30   ` Martin KaFai Lau
2020-03-17 20:56     ` Joe Stringer
2020-03-18 17:27       ` Martin KaFai Lau
2020-03-19  5:45         ` Joe Stringer
2020-03-19 17:36           ` Andrii Nakryiko
2020-03-12 23:36 ` [PATCH bpf-next 6/7] selftests: bpf: Extend sk_assign for address proxy Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 7/7] selftests: bpf: Improve debuggability of sk_assign Joe Stringer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).