netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next 0/3] bpf: add MPTCP subflow support
@ 2020-08-21 15:15 Nicolas Rybowski
  2020-08-21 15:15 ` [PATCH bpf-next 1/3] bpf: expose is_mptcp flag to bpf_tcp_sock Nicolas Rybowski
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Nicolas Rybowski @ 2020-08-21 15:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, John Fastabend, KP Singh,
	Martin KaFai Lau, Mat Martineau, Matthieu Baerts, Song Liu,
	Yonghong Song
  Cc: Nicolas Rybowski, bpf, linux-kernel, mptcp, netdev

Previously it was not possible to make a distinction between plain TCP
sockets and MPTCP subflow sockets on the BPF_PROG_TYPE_SOCK_OPS hook.

This patch series now enables a fine control of subflow sockets. In its
current state, it allows to put different sockopt on each subflow from a
same MPTCP connection (socket mark, TCP congestion algorithm, ...) using
BPF programs.

It should also be the basis of exposing MPTCP-specific fields through BPF.

Nicolas Rybowski (3):
  bpf: expose is_mptcp flag to bpf_tcp_sock
  mptcp: attach subflow socket to parent cgroup
  bpf: add 'bpf_mptcp_sock' structure and helper

 include/linux/bpf.h            | 33 ++++++++++++++++
 include/uapi/linux/bpf.h       | 14 +++++++
 kernel/bpf/verifier.c          | 30 ++++++++++++++
 net/core/filter.c              | 13 +++++-
 net/mptcp/Makefile             |  2 +
 net/mptcp/bpf.c                | 72 ++++++++++++++++++++++++++++++++++
 net/mptcp/subflow.c            | 27 +++++++++++++
 scripts/bpf_helpers_doc.py     |  2 +
 tools/include/uapi/linux/bpf.h | 14 +++++++
 9 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 net/mptcp/bpf.c

-- 
2.28.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH bpf-next 1/3] bpf: expose is_mptcp flag to bpf_tcp_sock
  2020-08-21 15:15 [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Nicolas Rybowski
@ 2020-08-21 15:15 ` Nicolas Rybowski
  2020-08-21 15:15 ` [PATCH bpf-next 2/3] mptcp: attach subflow socket to parent cgroup Nicolas Rybowski
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Nicolas Rybowski @ 2020-08-21 15:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	David S. Miller, Jakub Kicinski
  Cc: Nicolas Rybowski, Matthieu Baerts, Mat Martineau, netdev, bpf,
	linux-kernel

is_mptcp is a field from struct tcp_sock used to indicate that the
current tcp_sock is part of the MPTCP protocol.

In this protocol, a first socket (mptcp_sock) is created with
sk_protocol set to IPPROTO_MPTCP (=262) for control purpose but it
isn't directly on the wire. This is the role of the subflow (kernel)
sockets which are classical tcp_sock with sk_protocol set to
IPPROTO_TCP. The only way to differentiate such sockets from plain TCP
sockets is the is_mptcp field from tcp_sock.

Such an exposure in BPF is thus required to be able to differentiate
plain TCP sockets from MPTCP subflow sockets in BPF_PROG_TYPE_SOCK_OPS
programs.

The choice has been made to silently pass the case when CONFIG_MPTCP is
unset by defaulting is_mptcp to 0 in order to make BPF independent of
the MPTCP configuration. Another solution is to make the verifier fail
in 'bpf_tcp_sock_is_valid_ctx_access' but this will add an additional
'#ifdef CONFIG_MPTCP' in the BPF code and a same injected BPF program
will not run if MPTCP is not set.

An example use-case is provided in
https://github.com/multipath-tcp/mptcp_net-next/tree/scripts/bpf/examples

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
---
 include/uapi/linux/bpf.h       | 1 +
 net/core/filter.c              | 9 ++++++++-
 tools/include/uapi/linux/bpf.h | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0480f893facd..a9fccfdb3a62 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3865,6 +3865,7 @@ struct bpf_tcp_sock {
 	__u32 delivered;	/* Total data packets delivered incl. rexmits */
 	__u32 delivered_ce;	/* Like the above but only ECE marked packets */
 	__u32 icsk_retransmits;	/* Number of unrecovered [RTO] timeouts */
+	__u32 is_mptcp;		/* Is MPTCP subflow? */
 };
 
 struct bpf_sock_tuple {
diff --git a/net/core/filter.c b/net/core/filter.c
index b2df52086445..eb400aeea282 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5742,7 +5742,7 @@ bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 				  struct bpf_insn_access_aux *info)
 {
 	if (off < 0 || off >= offsetofend(struct bpf_tcp_sock,
-					  icsk_retransmits))
+					  is_mptcp))
 		return false;
 
 	if (off % size != 0)
@@ -5876,6 +5876,13 @@ u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct bpf_tcp_sock, icsk_retransmits):
 		BPF_INET_SOCK_GET_COMMON(icsk_retransmits);
 		break;
+	case offsetof(struct bpf_tcp_sock, is_mptcp):
+#ifdef CONFIG_MPTCP
+		BPF_TCP_SOCK_GET_COMMON(is_mptcp);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0480f893facd..a9fccfdb3a62 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3865,6 +3865,7 @@ struct bpf_tcp_sock {
 	__u32 delivered;	/* Total data packets delivered incl. rexmits */
 	__u32 delivered_ce;	/* Like the above but only ECE marked packets */
 	__u32 icsk_retransmits;	/* Number of unrecovered [RTO] timeouts */
+	__u32 is_mptcp;		/* Is MPTCP subflow? */
 };
 
 struct bpf_sock_tuple {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH bpf-next 2/3] mptcp: attach subflow socket to parent cgroup
  2020-08-21 15:15 [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Nicolas Rybowski
  2020-08-21 15:15 ` [PATCH bpf-next 1/3] bpf: expose is_mptcp flag to bpf_tcp_sock Nicolas Rybowski
@ 2020-08-21 15:15 ` Nicolas Rybowski
  2020-08-21 15:15 ` [PATCH bpf-next 3/3] bpf: add 'bpf_mptcp_sock' structure and helper Nicolas Rybowski
  2020-08-24 22:01 ` [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Alexei Starovoitov
  3 siblings, 0 replies; 8+ messages in thread
From: Nicolas Rybowski @ 2020-08-21 15:15 UTC (permalink / raw)
  To: Mat Martineau, Matthieu Baerts, David S. Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh
  Cc: Nicolas Rybowski, Paolo Abeni, netdev, mptcp, linux-kernel, bpf

It has been observed that the kernel sockets created for the subflows
(except the first one) are not in the same cgroup as their parents.
That's because the additional subflows are created by kernel workers.

This is a problem with eBPF programs attached to the parent's
cgroup won't be executed for the children. But also with any other features
of CGroup linked to a sk.

This patch fixes this behaviour.

As the subflow sockets are created by the kernel, we can't use
'mem_cgroup_sk_alloc' because of the current context being the one of the
kworker. This is why we have to do low level memcg manipulation, if
required.

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
---
 net/mptcp/subflow.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index e8cac2655c82..535a3f9f8cfc 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1130,6 +1130,30 @@ int __mptcp_subflow_connect(struct sock *sk, int ifindex,
 	return err;
 }
 
+static void mptcp_attach_cgroup(struct sock *parent, struct sock *child)
+{
+#ifdef CONFIG_SOCK_CGROUP_DATA
+	struct sock_cgroup_data *parent_skcd = &parent->sk_cgrp_data,
+				*child_skcd = &child->sk_cgrp_data;
+
+	/* only the additional subflows created by kworkers have to be modified */
+	if (cgroup_id(sock_cgroup_ptr(parent_skcd)) !=
+	    cgroup_id(sock_cgroup_ptr(child_skcd))) {
+#ifdef CONFIG_MEMCG
+		struct mem_cgroup *memcg = parent->sk_memcg;
+
+		mem_cgroup_sk_free(child);
+		if (memcg && css_tryget(&memcg->css))
+			child->sk_memcg = memcg;
+#endif /* CONFIG_MEMCG */
+
+		cgroup_sk_free(child_skcd);
+		*child_skcd = *parent_skcd;
+		cgroup_sk_clone(child_skcd);
+	}
+#endif /* CONFIG_SOCK_CGROUP_DATA */
+}
+
 int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
 {
 	struct mptcp_subflow_context *subflow;
@@ -1150,6 +1174,9 @@ int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
 
 	lock_sock(sf->sk);
 
+	/* the newly created socket has to be in the same cgroup as its parent */
+	mptcp_attach_cgroup(sk, sf->sk);
+
 	/* kernel sockets do not by default acquire net ref, but TCP timer
 	 * needs it.
 	 */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH bpf-next 3/3] bpf: add 'bpf_mptcp_sock' structure and helper
  2020-08-21 15:15 [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Nicolas Rybowski
  2020-08-21 15:15 ` [PATCH bpf-next 1/3] bpf: expose is_mptcp flag to bpf_tcp_sock Nicolas Rybowski
  2020-08-21 15:15 ` [PATCH bpf-next 2/3] mptcp: attach subflow socket to parent cgroup Nicolas Rybowski
@ 2020-08-21 15:15 ` Nicolas Rybowski
  2020-08-24 22:01 ` [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Alexei Starovoitov
  3 siblings, 0 replies; 8+ messages in thread
From: Nicolas Rybowski @ 2020-08-21 15:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, John Fastabend, KP Singh,
	David S. Miller, Jakub Kicinski, Mat Martineau, Matthieu Baerts
  Cc: Nicolas Rybowski, linux-kernel, netdev, bpf, mptcp

In order to precisely identify the parent MPTCP connection of a subflow,
it is required to access the mptcp_sock's token which uniquely identify a
MPTCP connection.

This patch adds a new structure 'bpf_mptcp_sock' exposing the 'token' field
of the 'mptcp_sock' extracted from a subflow's 'tcp_sock'. It also adds the
declaration of a new BPF helper of the same name to expose the newly
defined structure in the userspace BPF API.

This is the foundation to expose more MPTCP-specific fields through BPF.

Currently, it is limited to the field 'token' of the msk but it is
easily extensible.

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
---
 include/linux/bpf.h            | 33 ++++++++++++++++
 include/uapi/linux/bpf.h       | 13 ++++++
 kernel/bpf/verifier.c          | 30 ++++++++++++++
 net/core/filter.c              |  4 ++
 net/mptcp/Makefile             |  2 +
 net/mptcp/bpf.c                | 72 ++++++++++++++++++++++++++++++++++
 scripts/bpf_helpers_doc.py     |  2 +
 tools/include/uapi/linux/bpf.h | 13 ++++++
 8 files changed, 169 insertions(+)
 create mode 100644 net/mptcp/bpf.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a9b7185a6b37..b4d6a80a653c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -281,6 +281,7 @@ enum bpf_return_type {
 	RET_PTR_TO_SOCK_COMMON_OR_NULL,	/* returns a pointer to a sock_common or NULL */
 	RET_PTR_TO_ALLOC_MEM_OR_NULL,	/* returns a pointer to dynamically allocated memory or NULL */
 	RET_PTR_TO_BTF_ID_OR_NULL,	/* returns a pointer to a btf_id or NULL */
+	RET_PTR_TO_MPTCP_SOCK_OR_NULL,	/* returns a pointer to mptcp_sock or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -360,6 +361,8 @@ enum bpf_reg_type {
 	PTR_TO_RDONLY_BUF_OR_NULL, /* reg points to a readonly buffer or NULL */
 	PTR_TO_RDWR_BUF,	 /* reg points to a read/write buffer */
 	PTR_TO_RDWR_BUF_OR_NULL, /* reg points to a read/write buffer or NULL */
+	PTR_TO_MPTCP_SOCK,	 /* reg points to struct mptcp_sock */
+	PTR_TO_MPTCP_SOCK_OR_NULL, /* reg points to struct mptcp_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -1737,6 +1740,7 @@ extern const struct bpf_func_proto bpf_skc_to_tcp_sock_proto;
 extern const struct bpf_func_proto bpf_skc_to_tcp_timewait_sock_proto;
 extern const struct bpf_func_proto bpf_skc_to_tcp_request_sock_proto;
 extern const struct bpf_func_proto bpf_skc_to_udp6_sock_proto;
+extern const struct bpf_func_proto bpf_mptcp_sock_proto;
 
 const struct bpf_func_proto *bpf_tracing_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);
@@ -1793,6 +1797,35 @@ struct sk_reuseport_kern {
 	u32 reuseport_id;
 	bool bind_inany;
 };
+
+#ifdef CONFIG_MPTCP
+bool bpf_mptcp_sock_is_valid_access(int off, int size,
+				    enum bpf_access_type type,
+				    struct bpf_insn_access_aux *info);
+
+u32 bpf_mptcp_sock_convert_ctx_access(enum bpf_access_type type,
+				      const struct bpf_insn *si,
+				      struct bpf_insn *insn_buf,
+				      struct bpf_prog *prog,
+				      u32 *target_size);
+#else /* CONFIG_MPTCP */
+static inline bool bpf_mptcp_sock_is_valid_access(int off, int size,
+						  enum bpf_access_type type,
+						  struct bpf_insn_access_aux *info)
+{
+	return false;
+}
+
+static inline u32 bpf_mptcp_sock_convert_ctx_access(enum bpf_access_type type,
+						    const struct bpf_insn *si,
+						    struct bpf_insn *insn_buf,
+						    struct bpf_prog *prog,
+						    u32 *target_size)
+{
+	return 0;
+}
+#endif /* CONFIG_MPTCP */
+
 bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 				  struct bpf_insn_access_aux *info);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a9fccfdb3a62..58b6e075537d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3395,6 +3395,14 @@ union bpf_attr {
  *		A non-negative value equal to or less than *size* on success,
  *		or a negative error in case of failure.
  *
+ * struct bpf_mptcp_sock *bpf_mptcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		This helper gets a **struct bpf_mptcp_sock** pointer from a
+ *		**struct bpf_sock** pointer.
+ *	Return
+ *		A **struct bpf_mptcp_sock** pointer on success, or **NULL** in
+ *		case of failure.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3539,6 +3547,7 @@ union bpf_attr {
 	FN(skc_to_tcp_request_sock),	\
 	FN(skc_to_udp6_sock),		\
 	FN(get_task_stack),		\
+	FN(mptcp_sock),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -3868,6 +3877,10 @@ struct bpf_tcp_sock {
 	__u32 is_mptcp;		/* Is MPTCP subflow? */
 };
 
+struct bpf_mptcp_sock {
+	__u32 token;		/* msk token */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ef938f17b944..423bbd786eb8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -391,6 +391,7 @@ static bool type_is_sk_pointer(enum bpf_reg_type type)
 	return type == PTR_TO_SOCKET ||
 		type == PTR_TO_SOCK_COMMON ||
 		type == PTR_TO_TCP_SOCK ||
+		type == PTR_TO_MPTCP_SOCK ||
 		type == PTR_TO_XDP_SOCK;
 }
 
@@ -398,6 +399,7 @@ static bool reg_type_not_null(enum bpf_reg_type type)
 {
 	return type == PTR_TO_SOCKET ||
 		type == PTR_TO_TCP_SOCK ||
+		type == PTR_TO_MPTCP_SOCK ||
 		type == PTR_TO_MAP_VALUE ||
 		type == PTR_TO_SOCK_COMMON;
 }
@@ -408,6 +410,7 @@ static bool reg_type_may_be_null(enum bpf_reg_type type)
 	       type == PTR_TO_SOCKET_OR_NULL ||
 	       type == PTR_TO_SOCK_COMMON_OR_NULL ||
 	       type == PTR_TO_TCP_SOCK_OR_NULL ||
+	       type == PTR_TO_MPTCP_SOCK_OR_NULL ||
 	       type == PTR_TO_BTF_ID_OR_NULL ||
 	       type == PTR_TO_MEM_OR_NULL ||
 	       type == PTR_TO_RDONLY_BUF_OR_NULL ||
@@ -426,6 +429,8 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
 		type == PTR_TO_SOCKET_OR_NULL ||
 		type == PTR_TO_TCP_SOCK ||
 		type == PTR_TO_TCP_SOCK_OR_NULL ||
+		type == PTR_TO_MPTCP_SOCK ||
+		type == PTR_TO_MPTCP_SOCK_OR_NULL ||
 		type == PTR_TO_MEM ||
 		type == PTR_TO_MEM_OR_NULL;
 }
@@ -499,6 +504,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null",
 	[PTR_TO_TCP_SOCK]	= "tcp_sock",
 	[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
+	[PTR_TO_MPTCP_SOCK]	= "mptcp_sock",
+	[PTR_TO_MPTCP_SOCK_OR_NULL] = "mptcp_sock_or_null",
 	[PTR_TO_TP_BUFFER]	= "tp_buffer",
 	[PTR_TO_XDP_SOCK]	= "xdp_sock",
 	[PTR_TO_BTF_ID]		= "ptr_",
@@ -2176,6 +2183,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_SOCK_COMMON_OR_NULL:
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
+	case PTR_TO_MPTCP_SOCK:
+	case PTR_TO_MPTCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
 	case PTR_TO_BTF_ID:
 	case PTR_TO_BTF_ID_OR_NULL:
@@ -2777,6 +2786,9 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx,
 	case PTR_TO_TCP_SOCK:
 		valid = bpf_tcp_sock_is_valid_access(off, size, t, &info);
 		break;
+	case PTR_TO_MPTCP_SOCK:
+		valid = bpf_mptcp_sock_is_valid_access(off, size, t, &info);
+		break;
 	case PTR_TO_XDP_SOCK:
 		valid = bpf_xdp_sock_is_valid_access(off, size, t, &info);
 		break;
@@ -2935,6 +2947,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_TCP_SOCK:
 		pointer_desc = "tcp_sock ";
 		break;
+	case PTR_TO_MPTCP_SOCK:
+		pointer_desc = "mptcp_sock ";
+		break;
 	case PTR_TO_XDP_SOCK:
 		pointer_desc = "xdp_sock ";
 		break;
@@ -4899,6 +4914,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
 		regs[BPF_REG_0].id = ++env->id_gen;
+	} else if (fn->ret_type == RET_PTR_TO_MPTCP_SOCK_OR_NULL) {
+		mark_reg_known_zero(env, regs, BPF_REG_0);
+		regs[BPF_REG_0].type = PTR_TO_MPTCP_SOCK_OR_NULL;
+		regs[BPF_REG_0].id = ++env->id_gen;
 	} else if (fn->ret_type == RET_PTR_TO_ALLOC_MEM_OR_NULL) {
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_MEM_OR_NULL;
@@ -5226,6 +5245,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_SOCK_COMMON_OR_NULL:
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
+	case PTR_TO_MPTCP_SOCK:
+	case PTR_TO_MPTCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
@@ -6880,6 +6901,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			reg->type = PTR_TO_SOCK_COMMON;
 		} else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) {
 			reg->type = PTR_TO_TCP_SOCK;
+		} else if (reg->type == PTR_TO_MPTCP_SOCK_OR_NULL) {
+			reg->type = PTR_TO_MPTCP_SOCK;
 		} else if (reg->type == PTR_TO_BTF_ID_OR_NULL) {
 			reg->type = PTR_TO_BTF_ID;
 		} else if (reg->type == PTR_TO_MEM_OR_NULL) {
@@ -8242,6 +8265,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_SOCK_COMMON_OR_NULL:
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
+	case PTR_TO_MPTCP_SOCK:
+	case PTR_TO_MPTCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
@@ -8769,6 +8794,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_SOCK_COMMON_OR_NULL:
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
+	case PTR_TO_MPTCP_SOCK:
+	case PTR_TO_MPTCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
 	case PTR_TO_BTF_ID:
 	case PTR_TO_BTF_ID_OR_NULL:
@@ -9889,6 +9916,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		case PTR_TO_TCP_SOCK:
 			convert_ctx_access = bpf_tcp_sock_convert_ctx_access;
 			break;
+		case PTR_TO_MPTCP_SOCK:
+			convert_ctx_access = bpf_mptcp_sock_convert_ctx_access;
+			break;
 		case PTR_TO_XDP_SOCK:
 			convert_ctx_access = bpf_xdp_sock_convert_ctx_access;
 			break;
diff --git a/net/core/filter.c b/net/core/filter.c
index eb400aeea282..a103ccc2506d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6560,6 +6560,10 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_tcp_sock:
 		return &bpf_tcp_sock_proto;
 #endif /* CONFIG_INET */
+#ifdef CONFIG_MPTCP
+	case BPF_FUNC_mptcp_sock:
+		return &bpf_mptcp_sock_proto;
+#endif /* CONFIG_MPTCP */
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index a611968be4d7..aae2ede220ed 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -10,3 +10,5 @@ obj-$(CONFIG_INET_MPTCP_DIAG) += mptcp_diag.o
 mptcp_crypto_test-objs := crypto_test.o
 mptcp_token_test-objs := token_test.o
 obj-$(CONFIG_MPTCP_KUNIT_TESTS) += mptcp_crypto_test.o mptcp_token_test.o
+
+obj-$(CONFIG_BPF) += bpf.o
diff --git a/net/mptcp/bpf.c b/net/mptcp/bpf.c
new file mode 100644
index 000000000000..5332469fbb28
--- /dev/null
+++ b/net/mptcp/bpf.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2020, Tessares SA.
+ *
+ * Author: Nicolas Rybowski <nicolas.rybowski@tessares.net>
+ *
+ */
+
+#include <linux/bpf.h>
+
+#include "protocol.h"
+
+bool bpf_mptcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+				    struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= offsetofend(struct bpf_mptcp_sock, token))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	default:
+		return size == sizeof(__u32);
+	}
+}
+
+u32 bpf_mptcp_sock_convert_ctx_access(enum bpf_access_type type,
+				      const struct bpf_insn *si,
+				      struct bpf_insn *insn_buf,
+				      struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+#define BPF_MPTCP_SOCK_GET_COMMON(FIELD)							\
+	do {											\
+		BUILD_BUG_ON(sizeof_field(struct mptcp_sock, FIELD) >				\
+				sizeof_field(struct bpf_mptcp_sock, FIELD));			\
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct mptcp_sock, FIELD),		\
+							si->dst_reg, si->src_reg,		\
+							offsetof(struct mptcp_sock, FIELD));	\
+	} while (0)
+
+	if (insn > insn_buf)
+		return insn - insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_mptcp_sock, token):
+		BPF_MPTCP_SOCK_GET_COMMON(token);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+BPF_CALL_1(bpf_mptcp_sock, struct sock *, sk)
+{
+	if (sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP && sk_is_mptcp(sk)) {
+		struct mptcp_subflow_context *mptcp_sfc = mptcp_subflow_ctx(sk);
+
+		return (unsigned long)mptcp_sfc->conn;
+	}
+	return (unsigned long)NULL;
+}
+
+const struct bpf_func_proto bpf_mptcp_sock_proto = {
+	.func           = bpf_mptcp_sock,
+	.gpl_only       = false,
+	.ret_type       = RET_PTR_TO_MPTCP_SOCK_OR_NULL,
+	.arg1_type      = ARG_PTR_TO_SOCK_COMMON,
+};
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 5bfa448b4704..4db41a1c344a 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -428,6 +428,7 @@ class PrinterHelpers(Printer):
             'struct tcp_request_sock',
             'struct udp6_sock',
             'struct task_struct',
+            'struct bpf_mptcp_sock',
 
             'struct __sk_buff',
             'struct sk_msg_md',
@@ -472,6 +473,7 @@ class PrinterHelpers(Printer):
             'struct tcp_request_sock',
             'struct udp6_sock',
             'struct task_struct',
+            'struct bpf_mptcp_sock',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a9fccfdb3a62..58b6e075537d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3395,6 +3395,14 @@ union bpf_attr {
  *		A non-negative value equal to or less than *size* on success,
  *		or a negative error in case of failure.
  *
+ * struct bpf_mptcp_sock *bpf_mptcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		This helper gets a **struct bpf_mptcp_sock** pointer from a
+ *		**struct bpf_sock** pointer.
+ *	Return
+ *		A **struct bpf_mptcp_sock** pointer on success, or **NULL** in
+ *		case of failure.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3539,6 +3547,7 @@ union bpf_attr {
 	FN(skc_to_tcp_request_sock),	\
 	FN(skc_to_udp6_sock),		\
 	FN(get_task_stack),		\
+	FN(mptcp_sock),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -3868,6 +3877,10 @@ struct bpf_tcp_sock {
 	__u32 is_mptcp;		/* Is MPTCP subflow? */
 };
 
+struct bpf_mptcp_sock {
+	__u32 token;		/* msk token */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH bpf-next 0/3] bpf: add MPTCP subflow support
  2020-08-21 15:15 [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Nicolas Rybowski
                   ` (2 preceding siblings ...)
  2020-08-21 15:15 ` [PATCH bpf-next 3/3] bpf: add 'bpf_mptcp_sock' structure and helper Nicolas Rybowski
@ 2020-08-24 22:01 ` Alexei Starovoitov
  2020-08-25 18:55   ` Nicolas Rybowski
  3 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2020-08-24 22:01 UTC (permalink / raw)
  To: Nicolas Rybowski
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, John Fastabend, KP Singh,
	Martin KaFai Lau, Mat Martineau, Matthieu Baerts, Song Liu,
	Yonghong Song, bpf, linux-kernel, mptcp, netdev

On Fri, Aug 21, 2020 at 05:15:38PM +0200, Nicolas Rybowski wrote:
> Previously it was not possible to make a distinction between plain TCP
> sockets and MPTCP subflow sockets on the BPF_PROG_TYPE_SOCK_OPS hook.
> 
> This patch series now enables a fine control of subflow sockets. In its
> current state, it allows to put different sockopt on each subflow from a
> same MPTCP connection (socket mark, TCP congestion algorithm, ...) using
> BPF programs.
> 
> It should also be the basis of exposing MPTCP-specific fields through BPF.

Looks fine, but I'd like to see the full picture a bit better.
What's the point of just 'token' ? What can be done with it?
What are you thinking to add later?
Also selftest for new feature is mandatory.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH bpf-next 0/3] bpf: add MPTCP subflow support
  2020-08-24 22:01 ` [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Alexei Starovoitov
@ 2020-08-25 18:55   ` Nicolas Rybowski
  2020-08-26 19:12     ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Nicolas Rybowski @ 2020-08-25 18:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, John Fastabend, KP Singh,
	Martin KaFai Lau, Mat Martineau, Matthieu Baerts, Song Liu,
	Yonghong Song, bpf, linux-kernel, mptcp, netdev

Hi Alexei,

Thanks for the feedback!

On Tue, Aug 25, 2020 at 12:01 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 05:15:38PM +0200, Nicolas Rybowski wrote:
> > Previously it was not possible to make a distinction between plain TCP
> > sockets and MPTCP subflow sockets on the BPF_PROG_TYPE_SOCK_OPS hook.
> >
> > This patch series now enables a fine control of subflow sockets. In its
> > current state, it allows to put different sockopt on each subflow from a
> > same MPTCP connection (socket mark, TCP congestion algorithm, ...) using
> > BPF programs.
> >
> > It should also be the basis of exposing MPTCP-specific fields through BPF.
>
> Looks fine, but I'd like to see the full picture a bit better.
> What's the point of just 'token' ? What can be done with it?

The idea behind exposing only the token at the moment is that it is
the strict minimum required to identify all subflows linked to a
single MPTCP connection. Without that, each subflow is seen as a
"normal" TCP connection and it is not possible to find a link between
each other.
In other words, it allows the collection of all the subflows of a
MPTCP connection in a BPF map and then the application of per subflow
specific policies. More concrete examples of its usage are available
at [1].

We try to avoid exposing new fields without related use-cases, this is
why it is the only one currently. And this one is very important to
identify MPTCP connections and subflows.

> What are you thinking to add later?

The next steps would be the exposure of additional subflow context
data like the backup bit or some path manager fields to allow more
flexible / accurate BPF decisions.
We are also looking at implementing Packet Schedulers [2] and Path
Managers through BPF.
The ability of collecting all the paths available for a given MPTCP
connection - identified by its token - at the BPF level should help
for such decisions but more data will need to be exposed later to take
smart decisions or to analyse some situations.

I hope it makes the overall idea clearer.

> Also selftest for new feature is mandatory.

I will work on the selftests to add them in a v2. I was not sure a new
selftest was required when exposing a new field but now it is clear,
thanks!


[1] https://github.com/multipath-tcp/mptcp_net-next/tree/scripts/bpf/examples
[2] https://datatracker.ietf.org/doc/draft-bonaventure-iccrg-schedulers/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH bpf-next 0/3] bpf: add MPTCP subflow support
  2020-08-25 18:55   ` Nicolas Rybowski
@ 2020-08-26 19:12     ` Alexei Starovoitov
  2020-08-27 11:54       ` Nicolas Rybowski
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2020-08-26 19:12 UTC (permalink / raw)
  To: Nicolas Rybowski
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, John Fastabend, KP Singh,
	Martin KaFai Lau, Mat Martineau, Matthieu Baerts, Song Liu,
	Yonghong Song, bpf, LKML, mptcp, Network Development

On Tue, Aug 25, 2020 at 11:55 AM Nicolas Rybowski
<nicolas.rybowski@tessares.net> wrote:
>
> Hi Alexei,
>
> Thanks for the feedback!
>
> On Tue, Aug 25, 2020 at 12:01 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Aug 21, 2020 at 05:15:38PM +0200, Nicolas Rybowski wrote:
> > > Previously it was not possible to make a distinction between plain TCP
> > > sockets and MPTCP subflow sockets on the BPF_PROG_TYPE_SOCK_OPS hook.
> > >
> > > This patch series now enables a fine control of subflow sockets. In its
> > > current state, it allows to put different sockopt on each subflow from a
> > > same MPTCP connection (socket mark, TCP congestion algorithm, ...) using
> > > BPF programs.
> > >
> > > It should also be the basis of exposing MPTCP-specific fields through BPF.
> >
> > Looks fine, but I'd like to see the full picture a bit better.
> > What's the point of just 'token' ? What can be done with it?
>
> The idea behind exposing only the token at the moment is that it is
> the strict minimum required to identify all subflows linked to a
> single MPTCP connection. Without that, each subflow is seen as a
> "normal" TCP connection and it is not possible to find a link between
> each other.
> In other words, it allows the collection of all the subflows of a
> MPTCP connection in a BPF map and then the application of per subflow
> specific policies. More concrete examples of its usage are available
> at [1].
>
> We try to avoid exposing new fields without related use-cases, this is
> why it is the only one currently. And this one is very important to
> identify MPTCP connections and subflows.
>
> > What are you thinking to add later?
>
> The next steps would be the exposure of additional subflow context
> data like the backup bit or some path manager fields to allow more
> flexible / accurate BPF decisions.
> We are also looking at implementing Packet Schedulers [2] and Path
> Managers through BPF.
> The ability of collecting all the paths available for a given MPTCP
> connection - identified by its token - at the BPF level should help
> for such decisions but more data will need to be exposed later to take
> smart decisions or to analyse some situations.
>
> I hope it makes the overall idea clearer.
>
> > Also selftest for new feature is mandatory.
>
> I will work on the selftests to add them in a v2. I was not sure a new
> selftest was required when exposing a new field but now it is clear,
> thanks!
>
>
> [1] https://github.com/multipath-tcp/mptcp_net-next/tree/scripts/bpf/examples
> [2] https://datatracker.ietf.org/doc/draft-bonaventure-iccrg-schedulers/

Thanks! The links are certainly helpful.
Since long term you're considering implementing path manager in bpf
I suggest to take a look at bpf_struct_ops and bpf based tcp congestion control.
It would fit that use case better.
For now the approach proposed in this patch is probably good enough
for simple subflow marking. From the example it's not clear what the networking
stack is supposed to do with a different sk_mark.
Also considering using sk local storage instead of sk_mark. It's arbitrary size.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH bpf-next 0/3] bpf: add MPTCP subflow support
  2020-08-26 19:12     ` Alexei Starovoitov
@ 2020-08-27 11:54       ` Nicolas Rybowski
  0 siblings, 0 replies; 8+ messages in thread
From: Nicolas Rybowski @ 2020-08-27 11:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, John Fastabend, KP Singh,
	Martin KaFai Lau, Mat Martineau, Matthieu Baerts, Song Liu,
	Yonghong Song, bpf, LKML, mptcp, Network Development

Hi Alexei,

On Wed, Aug 26, 2020 at 9:13 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 25, 2020 at 11:55 AM Nicolas Rybowski
> <nicolas.rybowski@tessares.net> wrote:
> >
> > Hi Alexei,
> >
> > Thanks for the feedback!
> >
> > On Tue, Aug 25, 2020 at 12:01 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Aug 21, 2020 at 05:15:38PM +0200, Nicolas Rybowski wrote:
> > > > Previously it was not possible to make a distinction between plain TCP
> > > > sockets and MPTCP subflow sockets on the BPF_PROG_TYPE_SOCK_OPS hook.
> > > >
> > > > This patch series now enables a fine control of subflow sockets. In its
> > > > current state, it allows to put different sockopt on each subflow from a
> > > > same MPTCP connection (socket mark, TCP congestion algorithm, ...) using
> > > > BPF programs.
> > > >
> > > > It should also be the basis of exposing MPTCP-specific fields through BPF.
> > >
> > > Looks fine, but I'd like to see the full picture a bit better.
> > > What's the point of just 'token' ? What can be done with it?
> >
> > The idea behind exposing only the token at the moment is that it is
> > the strict minimum required to identify all subflows linked to a
> > single MPTCP connection. Without that, each subflow is seen as a
> > "normal" TCP connection and it is not possible to find a link between
> > each other.
> > In other words, it allows the collection of all the subflows of a
> > MPTCP connection in a BPF map and then the application of per subflow
> > specific policies. More concrete examples of its usage are available
> > at [1].
> >
> > We try to avoid exposing new fields without related use-cases, this is
> > why it is the only one currently. And this one is very important to
> > identify MPTCP connections and subflows.
> >
> > > What are you thinking to add later?
> >
> > The next steps would be the exposure of additional subflow context
> > data like the backup bit or some path manager fields to allow more
> > flexible / accurate BPF decisions.
> > We are also looking at implementing Packet Schedulers [2] and Path
> > Managers through BPF.
> > The ability of collecting all the paths available for a given MPTCP
> > connection - identified by its token - at the BPF level should help
> > for such decisions but more data will need to be exposed later to take
> > smart decisions or to analyse some situations.
> >
> > I hope it makes the overall idea clearer.
> >
> > > Also selftest for new feature is mandatory.
> >
> > I will work on the selftests to add them in a v2. I was not sure a new
> > selftest was required when exposing a new field but now it is clear,
> > thanks!
> >
> >
> > [1] https://github.com/multipath-tcp/mptcp_net-next/tree/scripts/bpf/examples
> > [2] https://datatracker.ietf.org/doc/draft-bonaventure-iccrg-schedulers/
>
> Thanks! The links are certainly helpful.
> Since long term you're considering implementing path manager in bpf
> I suggest to take a look at bpf_struct_ops and bpf based tcp congestion control.
> It would fit that use case better.

We will definitively take a look at that, thanks ! It is indeed the
direction we should take.

> For now the approach proposed in this patch is probably good enough
> for simple subflow marking. From the example it's not clear what the networking
> stack is supposed to do with a different sk_mark.
> Also considering using sk local storage instead of sk_mark. It's arbitrary size.

Originally, this use-case was asked by Android for some app specific behaviours.
But the example is provided here to mainly illustrate the possibility
to put different sockopt per subflow knowing their relations with
other subflows.
Indeed in this example, per se, the marking of the subflows has no
interest, it is for illustration purpose only. It was an easy solution
to have quick tests in the userspace through nftables.

Also, the implementation of all the signals allowing dynamic subflows
creation / removal by the path manager to comply with the RFC [1] is
still under heavy development on the MPTCP side, so we cannot provide
more realistic examples at the moment.

[1] https://tools.ietf.org/html/rfc8684#section-3.4

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-08-27 11:56 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-21 15:15 [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Nicolas Rybowski
2020-08-21 15:15 ` [PATCH bpf-next 1/3] bpf: expose is_mptcp flag to bpf_tcp_sock Nicolas Rybowski
2020-08-21 15:15 ` [PATCH bpf-next 2/3] mptcp: attach subflow socket to parent cgroup Nicolas Rybowski
2020-08-21 15:15 ` [PATCH bpf-next 3/3] bpf: add 'bpf_mptcp_sock' structure and helper Nicolas Rybowski
2020-08-24 22:01 ` [PATCH bpf-next 0/3] bpf: add MPTCP subflow support Alexei Starovoitov
2020-08-25 18:55   ` Nicolas Rybowski
2020-08-26 19:12     ` Alexei Starovoitov
2020-08-27 11:54       ` Nicolas Rybowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).