netdev.vger.kernel.org archive mirror
* [RFC PATCH net-next 00/33] Multipath TCP
@ 2019-06-17 22:57 Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 01/33] tcp: Add MPTCP option number Mat Martineau
                   ` (32 more replies)
  0 siblings, 33 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

The MPTCP upstreaming community has prepared a net-next RFC patch set
for review.

Clone/fetch:
https://github.com/multipath-tcp/mptcp_net-next.git (tag: netdev-rfc)

Browse:
https://github.com/multipath-tcp/mptcp_net-next/tree/netdev-rfc

With CONFIG_MPTCP=y, a socket created with IPPROTO_MPTCP will attempt to
create an MPTCP connection but remains compatible with regular
TCP. IPPROTO_TCP socket behavior is unchanged.
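
For readers who want to try this from userspace, here is a minimal
sketch of opening such a socket. The helper name open_stream_socket()
and the fall-back-to-TCP policy are illustrative assumptions, not part
of the patch set; the errno values checked are the ones a kernel
without MPTCP support is expected to return for an unknown protocol.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262	/* value proposed in patch 02/33 */
#endif

/* Hypothetical helper: request an MPTCP socket and fall back to plain
 * TCP if the running kernel does not know IPPROTO_MPTCP.
 * Returns a socket fd, or -1 with errno set.
 */
int open_stream_socket(int *is_mptcp)
{
	int fd;

	assert(is_mptcp != NULL);

	fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
	if (fd >= 0) {
		*is_mptcp = 1;
		return fd;
	}

	/* Assumption: these errors mean "protocol unknown to this kernel". */
	if (errno != EINVAL && errno != EPROTONOSUPPORT)
		return -1;

	*is_mptcp = 0;
	return socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
}
```

The returned fd is then used with connect(), send() and recv() exactly
as a TCP socket would be.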

This implementation makes use of ULP between the userspace-facing MPTCP
socket and the set of in-kernel TCP sockets it controls. ULP has been
extended for use with listening sockets. skb_ext is used to carry MPTCP
metadata.

The patch set includes a self-test to exercise MPTCP in various
connection and routing scenarios.

We have more work to do to reach the initial feature set for merging,
notably:

* Finish MP_JOIN work

* Couple receive windows across sibling subflow TCP sockets as required
  by RFC 6824

* IPv6

* Limit subflow ULP visibility to kernel space

Thank you for your review. You can find us at mptcp@lists.01.org and
https://is.gd/mptcp_upstream


Florian Westphal (6):
  mptcp: add mptcp_poll
  mptcp: add and use mptcp_subflow_hold
  mptcp: add basic kselftest program
  mptcp: selftests: switch to netns+veth based tests
  mptcp: accept: don't leak mptcp socket structure
  mptcp: switch sublist to mptcp socket lock protection

Mat Martineau (11):
  tcp: Add MPTCP option number
  tcp: Define IPPROTO_MPTCP
  mptcp: Add MPTCP socket stubs
  tcp, ulp: Add clone operation to tcp_ulp_ops
  mptcp: Add MPTCP to skb extensions
  tcp: Prevent coalesce/collapse when skb has MPTCP extensions
  tcp: Export low-level TCP functions
  mptcp: Write MPTCP DSS headers to outgoing data packets
  mptcp: Implement MPTCP receive path
  mptcp: selftests: Add capture option
  tcp: Check for filled TCP option space before SACK

Paolo Abeni (4):
  tcp: clean ext on tx recycle
  mptcp: use sk_page_frag() in sendmsg
  mptcp: sendmsg() do spool all the provided data
  mptcp: allow collapsing consecutive sendpages on the same substream

Peter Krystad (12):
  mptcp: Handle MPTCP TCP options
  mptcp: Associate MPTCP context with TCP socket
  tcp: Expose tcp struct and routine for MPTCP
  mptcp: Handle MP_CAPABLE options for outgoing connections
  mptcp: Create SUBFLOW socket for incoming connections
  mptcp: Add key generation and token tree
  mptcp: Add shutdown() socket operation
  mptcp: Add setsockopt()/getsockopt() socket operations
  mptcp: Make connection_list a real list of subflows
  mptcp: Add path manager interface
  mptcp: Add ADD_ADDR handling
  mptcp: Add handling of incoming MP_JOIN requests

 include/linux/skbuff.h                        |   11 +
 include/linux/tcp.h                           |   51 +
 include/net/mptcp.h                           |  158 +++
 include/net/sock.h                            |    1 +
 include/net/tcp.h                             |   20 +
 include/uapi/linux/in.h                       |    2 +
 net/Kconfig                                   |    1 +
 net/Makefile                                  |    1 +
 net/core/skbuff.c                             |    7 +
 net/ipv4/inet_connection_sock.c               |    2 +
 net/ipv4/tcp.c                                |    8 +-
 net/ipv4/tcp_input.c                          |   25 +-
 net/ipv4/tcp_ipv4.c                           |    4 +-
 net/ipv4/tcp_minisocks.c                      |    6 +
 net/ipv4/tcp_output.c                         |   62 +-
 net/ipv4/tcp_ulp.c                            |   12 +
 net/mptcp/Kconfig                             |   11 +
 net/mptcp/Makefile                            |    4 +
 net/mptcp/crypto.c                            |  206 ++++
 net/mptcp/options.c                           |  621 ++++++++++
 net/mptcp/pm.c                                |   66 ++
 net/mptcp/protocol.c                          | 1043 +++++++++++++++++
 net/mptcp/protocol.h                          |  229 ++++
 net/mptcp/subflow.c                           |  344 ++++++
 net/mptcp/token.c                             |  373 ++++++
 tools/include/uapi/linux/in.h                 |    2 +
 tools/testing/selftests/Makefile              |    1 +
 tools/testing/selftests/net/mptcp/.gitignore  |    1 +
 tools/testing/selftests/net/mptcp/Makefile    |   11 +
 tools/testing/selftests/net/mptcp/config      |    1 +
 .../selftests/net/mptcp/mptcp_connect.c       |  408 +++++++
 .../selftests/net/mptcp/mptcp_connect.sh      |  271 +++++
 32 files changed, 3955 insertions(+), 8 deletions(-)
 create mode 100644 include/net/mptcp.h
 create mode 100644 net/mptcp/Kconfig
 create mode 100644 net/mptcp/Makefile
 create mode 100644 net/mptcp/crypto.c
 create mode 100644 net/mptcp/options.c
 create mode 100644 net/mptcp/pm.c
 create mode 100644 net/mptcp/protocol.c
 create mode 100644 net/mptcp/protocol.h
 create mode 100644 net/mptcp/subflow.c
 create mode 100644 net/mptcp/token.c
 create mode 100644 tools/testing/selftests/net/mptcp/.gitignore
 create mode 100644 tools/testing/selftests/net/mptcp/Makefile
 create mode 100644 tools/testing/selftests/net/mptcp/config
 create mode 100644 tools/testing/selftests/net/mptcp/mptcp_connect.c
 create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect.sh

-- 
2.22.0



* [RFC PATCH net-next 01/33] tcp: Add MPTCP option number
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 02/33] tcp: Define IPPROTO_MPTCP Mat Martineau
                   ` (31 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

TCP option 30 is allocated for MPTCP by the IANA.
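
As an illustration of what the option number is used for, the sketch
below walks a TCP options area (kind/length encoded, with the usual
single-byte EOL and NOP kinds) and looks for kind 30. This is a
userspace sketch, not kernel code; find_mptcp_option() and the KIND_*
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KIND_EOL	0	/* end of option list, single byte */
#define KIND_NOP	1	/* padding, single byte */
#define TCPOPT_MPTCP	30	/* Multipath TCP (RFC6824) */

/* Return a pointer to the kind byte of the first MPTCP option in the
 * options area, or NULL if there is none or the area is malformed.
 */
const uint8_t *find_mptcp_option(const uint8_t *opts, size_t len)
{
	size_t i = 0;

	assert(opts != NULL);

	while (i < len) {
		uint8_t kind = opts[i];
		uint8_t olen;

		if (kind == KIND_EOL)
			return NULL;
		if (kind == KIND_NOP) {
			i++;
			continue;
		}

		/* Every other kind is followed by a length byte that
		 * covers the kind and length bytes themselves.
		 */
		if (i + 1 >= len)
			return NULL;
		olen = opts[i + 1];
		if (olen < 2 || i + olen > len)
			return NULL;

		if (kind == TCPOPT_MPTCP)
			return &opts[i];
		i += olen;
	}
	return NULL;
}
```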

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/tcp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 96e0e53ff440..a8ec09b7767e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -179,6 +179,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_SACK             5       /* SACK Block */
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
+#define TCPOPT_MPTCP		30	/* Multipath TCP (RFC6824) */
 #define TCPOPT_FASTOPEN		34	/* Fast open (RFC7413) */
 #define TCPOPT_EXP		254	/* Experimental */
 /* Magic number to be after the option value for sharing TCP
-- 
2.22.0



* [RFC PATCH net-next 02/33] tcp: Define IPPROTO_MPTCP
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 01/33] tcp: Add MPTCP option number Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 03/33] mptcp: Add MPTCP socket stubs Mat Martineau
                   ` (30 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

To open an MPTCP socket with socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP),
IPPROTO_MPTCP needs a value that differs from IPPROTO_TCP. The existing
IPPROTO numbers mostly map directly to IANA-specified protocol numbers.
MPTCP does not have a protocol number allocated because MPTCP packets
use the TCP protocol number. Use a private number that never appears on
the wire.
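
A small sketch of why 262 is safe for this purpose: the IP header
protocol field is 8 bits wide, so any value above 255 can only ever be
a socket() argument. The helper wire_protocol() is hypothetical and
only illustrates the mapping described above.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262	/* value added by this patch */
#endif

/* The value cannot be encoded in the 8-bit IP protocol field. */
static_assert(IPPROTO_MPTCP > UINT8_MAX,
	      "IPPROTO_MPTCP must not collide with on-wire protocol numbers");

/* Hypothetical helper: protocol number carried in the IP header for a
 * socket created with the given protocol argument. MPTCP subflows are
 * ordinary TCP segments, so they carry IPPROTO_TCP.
 */
uint8_t wire_protocol(int socket_protocol)
{
	assert(socket_protocol >= 0);

	if (socket_protocol == IPPROTO_MPTCP)
		return IPPROTO_TCP;
	return (uint8_t)socket_protocol;
}
```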

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/uapi/linux/in.h       | 2 ++
 tools/include/uapi/linux/in.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index e7ad9d350a28..44df6dc1ff1d 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -76,6 +76,8 @@ enum {
 #define IPPROTO_MPLS		IPPROTO_MPLS
   IPPROTO_RAW = 255,		/* Raw IP packets			*/
 #define IPPROTO_RAW		IPPROTO_RAW
+  IPPROTO_MPTCP = 262,		/* Multipath TCP connection 		*/
+#define IPPROTO_MPTCP		IPPROTO_MPTCP
   IPPROTO_MAX
 };
 #endif
diff --git a/tools/include/uapi/linux/in.h b/tools/include/uapi/linux/in.h
index e7ad9d350a28..44df6dc1ff1d 100644
--- a/tools/include/uapi/linux/in.h
+++ b/tools/include/uapi/linux/in.h
@@ -76,6 +76,8 @@ enum {
 #define IPPROTO_MPLS		IPPROTO_MPLS
   IPPROTO_RAW = 255,		/* Raw IP packets			*/
 #define IPPROTO_RAW		IPPROTO_RAW
+  IPPROTO_MPTCP = 262,		/* Multipath TCP connection 		*/
+#define IPPROTO_MPTCP		IPPROTO_MPTCP
   IPPROTO_MAX
 };
 #endif
-- 
2.22.0



* [RFC PATCH net-next 03/33] mptcp: Add MPTCP socket stubs
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 01/33] tcp: Add MPTCP option number Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 02/33] tcp: Define IPPROTO_MPTCP Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 04/33] mptcp: Handle MPTCP TCP options Mat Martineau
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Implements the infrastructure for MPTCP sockets.

MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
sockets are only managed by the MPTCP socket that owns them and are not
visible from userspace. This commit allows a userspace program to open
an MPTCP socket with:

  sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

The resulting socket is simply a wrapper around a single regular TCP
socket, without any of the MPTCP protocol implemented over the wire.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
---
 include/net/mptcp.h  |  22 ++++++++
 net/Kconfig          |   1 +
 net/Makefile         |   1 +
 net/ipv4/tcp.c       |   2 +
 net/mptcp/Kconfig    |  10 ++++
 net/mptcp/Makefile   |   4 ++
 net/mptcp/protocol.c | 118 +++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  22 ++++++++
 8 files changed, 180 insertions(+)
 create mode 100644 include/net/mptcp.h
 create mode 100644 net/mptcp/Kconfig
 create mode 100644 net/mptcp/Makefile
 create mode 100644 net/mptcp/protocol.c
 create mode 100644 net/mptcp/protocol.h

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
new file mode 100644
index 000000000000..0fe78fddc638
--- /dev/null
+++ b/include/net/mptcp.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#ifndef __NET_MPTCP_H
+#define __NET_MPTCP_H
+
+#ifdef CONFIG_MPTCP
+
+void mptcp_init(void);
+
+#else
+
+static inline void mptcp_init(void)
+{
+}
+
+#endif /* CONFIG_MPTCP */
+#endif /* __NET_MPTCP_H */
diff --git a/net/Kconfig b/net/Kconfig
index d122f53c6fa2..8d5d43017feb 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -93,6 +93,7 @@ if INET
 source "net/ipv4/Kconfig"
 source "net/ipv6/Kconfig"
 source "net/netlabel/Kconfig"
+source "net/mptcp/Kconfig"
 
 endif # if INET
 
diff --git a/net/Makefile b/net/Makefile
index 449fc0b221f8..306d2e8c12c0 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -87,3 +87,4 @@ endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
 obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
+obj-$(CONFIG_MPTCP)		+= mptcp/
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5542e3d778e6..866c985a0c04 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -271,6 +271,7 @@
 #include <net/icmp.h>
 #include <net/inet_common.h>
 #include <net/tcp.h>
+#include <net/mptcp.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/sock.h>
@@ -3978,4 +3979,5 @@ void __init tcp_init(void)
 	tcp_metrics_init();
 	BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
 	tcp_tasklet_init();
+	mptcp_init();
 }
diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
new file mode 100644
index 000000000000..d87dfdc210cc
--- /dev/null
+++ b/net/mptcp/Kconfig
@@ -0,0 +1,10 @@
+
+config MPTCP
+	bool "Multipath TCP"
+	depends on INET
+	help
+	  Multipath TCP (MPTCP) connections send and receive data over multiple
+	  subflows in order to utilize multiple network paths. Each subflow
+	  uses the TCP protocol, and TCP options carry header information for
+	  MPTCP.
+
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
new file mode 100644
index 000000000000..659129d1fcbf
--- /dev/null
+++ b/net/mptcp/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_MPTCP) += mptcp.o
+
+mptcp-y := protocol.o
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
new file mode 100644
index 000000000000..86db17af8c05
--- /dev/null
+++ b/net/mptcp/protocol.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <net/sock.h>
+#include <net/inet_common.h>
+#include <net/inet_hashtables.h>
+#include <net/protocol.h>
+#include <net/tcp.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *subflow = msk->subflow;
+
+	pr_debug("subflow=%p", subflow->sk);
+
+	return sock_sendmsg(subflow, msg);
+}
+
+static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+			 int nonblock, int flags, int *addr_len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *subflow = msk->subflow;
+
+	pr_debug("subflow=%p", subflow->sk);
+
+	return sock_recvmsg(subflow, msg, flags);
+}
+
+static int mptcp_init_sock(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *sf;
+	int err;
+
+	pr_debug("msk=%p", msk);
+
+	err = sock_create_kern(&init_net, PF_INET, SOCK_STREAM, IPPROTO_TCP,
+			       &sf);
+	if (!err) {
+		pr_debug("subflow=%p", sf->sk);
+		msk->subflow = sf;
+	}
+
+	return err;
+}
+
+static void mptcp_close(struct sock *sk, long timeout)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	inet_sk_state_store(sk, TCP_CLOSE);
+
+	if (msk->subflow) {
+		pr_debug("subflow=%p", msk->subflow->sk);
+		sock_release(msk->subflow);
+	}
+
+	sock_orphan(sk);
+	sock_put(sk);
+}
+
+static int mptcp_connect(struct sock *sk, struct sockaddr *saddr, int len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	int err;
+
+	saddr->sa_family = AF_INET;
+
+	pr_debug("msk=%p, subflow=%p", msk, msk->subflow->sk);
+
+	err = kernel_connect(msk->subflow, saddr, len, 0);
+
+	sk->sk_state = TCP_ESTABLISHED;
+
+	return err;
+}
+
+static struct proto mptcp_prot = {
+	.name		= "MPTCP",
+	.owner		= THIS_MODULE,
+	.init		= mptcp_init_sock,
+	.close		= mptcp_close,
+	.accept		= inet_csk_accept,
+	.connect	= mptcp_connect,
+	.shutdown	= tcp_shutdown,
+	.sendmsg	= mptcp_sendmsg,
+	.recvmsg	= mptcp_recvmsg,
+	.hash		= inet_hash,
+	.unhash		= inet_unhash,
+	.get_port	= inet_csk_get_port,
+	.obj_size	= sizeof(struct mptcp_sock),
+	.no_autobind	= 1,
+};
+
+static struct inet_protosw mptcp_protosw = {
+	.type		= SOCK_STREAM,
+	.protocol	= IPPROTO_MPTCP,
+	.prot		= &mptcp_prot,
+	.ops		= &inet_stream_ops,
+};
+
+void __init mptcp_init(void)
+{
+	if (proto_register(&mptcp_prot, 1) != 0)
+		panic("Failed to register MPTCP proto.\n");
+
+	inet_register_protosw(&mptcp_protosw);
+}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
new file mode 100644
index 000000000000..972204835421
--- /dev/null
+++ b/net/mptcp/protocol.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#ifndef __MPTCP_PROTOCOL_H
+#define __MPTCP_PROTOCOL_H
+
+/* MPTCP connection sock */
+struct mptcp_sock {
+	/* inet_connection_sock must be the first member */
+	struct	inet_connection_sock sk;
+	struct	socket *subflow;
+};
+
+static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
+{
+	return (struct mptcp_sock *)sk;
+}
+
+#endif /* __MPTCP_PROTOCOL_H */
-- 
2.22.0



* [RFC PATCH net-next 04/33] mptcp: Handle MPTCP TCP options
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (2 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 03/33] mptcp: Add MPTCP socket stubs Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 05/33] mptcp: Associate MPTCP context with TCP socket Mat Martineau
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Currently only MPTCP v0 is supported, so the v1 MP_CAPABLE option is
ignored.
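
The MP_CAPABLE handling can be sketched in userspace as follows. This
mirrors the byte layout and checks in mptcp_parse_option() from this
patch, but struct mp_capable, load_be64() and parse_mp_capable() are
illustrative names, and the payload pointer is assumed to point just
past the TCP option kind and length bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MPTCPOPT_MP_CAPABLE		0
#define TCPOLEN_MPTCP_MPC_SYN		12
#define TCPOLEN_MPTCP_MPC_SYNACK	20
#define MPTCP_VERSION_MASK		0x0F
#define MPTCP_CAP_EXTENSIBILITY		(1u << 6)
#define MPTCP_CAP_HMAC_SHA1		(1u << 0)
#define MPTCP_CAP_FLAG_MASK		0x3F

/* Hypothetical result structure for this sketch. */
struct mp_capable {
	uint8_t version;
	uint8_t flags;
	uint64_t sndr_key;
	uint64_t rcvr_key;
	bool has_rcvr_key;
};

/* Read a 64-bit big-endian value from an unaligned buffer. */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* opsize is the full option length, including the kind and length
 * bytes. Returns true only for a v0 MP_CAPABLE option with acceptable
 * flags; anything else, including v1, is ignored.
 */
bool parse_mp_capable(const uint8_t *payload, int opsize,
		      struct mp_capable *out)
{
	assert(payload != NULL && out != NULL);

	if (opsize != TCPOLEN_MPTCP_MPC_SYN &&
	    opsize != TCPOLEN_MPTCP_MPC_SYNACK)
		return false;
	if ((payload[0] >> 4) != MPTCPOPT_MP_CAPABLE)
		return false;

	out->version = payload[0] & MPTCP_VERSION_MASK;
	if (out->version != 0)
		return false;

	out->flags = payload[1];
	if ((out->flags & MPTCP_CAP_FLAG_MASK) != MPTCP_CAP_HMAC_SHA1 ||
	    (out->flags & MPTCP_CAP_EXTENSIBILITY))
		return false;

	out->sndr_key = load_be64(payload + 2);
	out->has_rcvr_key = (opsize == TCPOLEN_MPTCP_MPC_SYNACK);
	out->rcvr_key = out->has_rcvr_key ? load_be64(payload + 10) : 0;
	return true;
}
```

A 12-byte option carries only the sender key; the 20-byte form also
carries the receiver key.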

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/linux/tcp.h   |  15 +++++
 include/net/mptcp.h   |  20 ++++++
 net/ipv4/tcp_input.c  |   5 ++
 net/ipv4/tcp_output.c |  15 +++++
 net/mptcp/Makefile    |   2 +-
 net/mptcp/options.c   | 146 ++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.c  |   4 +-
 net/mptcp/protocol.h  |  22 +++++++
 8 files changed, 226 insertions(+), 3 deletions(-)
 create mode 100644 net/mptcp/options.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c23019a3b264..73c633f58233 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -100,6 +100,17 @@ struct tcp_options_received {
 	u8	num_sacks;	/* Number of SACK blocks		*/
 	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
+#if IS_ENABLED(CONFIG_MPTCP)
+	struct mptcp_options_received {
+		u8      mp_capable : 1,
+			mp_join : 1,
+			dss : 1,
+			version : 4;
+		u8      flags;
+		u64     sndr_key;
+		u64     rcvr_key;
+	} mptcp;
+#endif
 };
 
 static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
@@ -109,6 +120,10 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 #if IS_ENABLED(CONFIG_SMC)
 	rx_opt->smc_ok = 0;
 #endif
+#if IS_ENABLED(CONFIG_MPTCP)
+	rx_opt->mptcp.mp_capable = rx_opt->mptcp.mp_join = 0;
+	rx_opt->mptcp.dss = 0;
+#endif
 }
 
 /* This is the max number of SACKS that we'll generate and process. It's safe
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 0fe78fddc638..0d3e02c6c817 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,15 +8,35 @@
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
+/* MPTCP option bits */
+#define OPTION_MPTCP_MPC_SYN	BIT(0)
+#define OPTION_MPTCP_MPC_SYNACK	BIT(1)
+#define OPTION_MPTCP_MPC_ACK	BIT(2)
+
+struct mptcp_out_options {
+	u16 suboptions;
+	u64 sndr_key;
+	u64 rcvr_key;
+};
+
 #ifdef CONFIG_MPTCP
 
 void mptcp_init(void);
 
+void mptcp_parse_option(const unsigned char *ptr, int opsize,
+			struct tcp_options_received *opt_rx);
+void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts);
+
 #else
 
 static inline void mptcp_init(void)
 {
 }
 
+static inline void mptcp_parse_option(const unsigned char *ptr, int opsize,
+				      struct tcp_options_received *opt_rx)
+{
+}
+
 #endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9269bbfc05f9..117f0efbbad5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -79,6 +79,7 @@
 #include <trace/events/tcp.h>
 #include <linux/jump_label_ratelimit.h>
 #include <net/busy_poll.h>
+#include <net/mptcp.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -3857,6 +3858,10 @@ void tcp_parse_options(const struct net *net,
 				 */
 				break;
 #endif
+			case TCPOPT_MPTCP:
+				mptcp_parse_option(ptr, opsize, opt_rx);
+				break;
+
 			case TCPOPT_FASTOPEN:
 				tcp_parse_fastopen_option(
 					opsize - TCPOLEN_FASTOPEN_BASE,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d954ff9069e8..69c4f39efe8b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -38,6 +38,7 @@
 #define pr_fmt(fmt) "TCP: " fmt
 
 #include <net/tcp.h>
+#include <net/mptcp.h>
 
 #include <linux/compiler.h>
 #include <linux/gfp.h>
@@ -411,6 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_WSCALE		(1 << 3)
 #define OPTION_FAST_OPEN_COOKIE	(1 << 8)
 #define OPTION_SMC		(1 << 9)
+#define OPTION_MPTCP		(1 << 10)
 
 static void smc_options_write(__be32 *ptr, u16 *options)
 {
@@ -436,8 +438,19 @@ struct tcp_out_options {
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
+#if IS_ENABLED(CONFIG_MPTCP)
+	struct mptcp_out_options mptcp;
+#endif
 };
 
+static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
+{
+#if IS_ENABLED(CONFIG_MPTCP)
+	if (unlikely(OPTION_MPTCP & opts->options))
+		mptcp_write_options(ptr, &opts->mptcp);
+#endif
+}
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -546,6 +559,8 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 	}
 
 	smc_options_write(ptr, &options);
+
+	mptcp_options_write(ptr, opts);
 }
 
 static void smc_set_option(const struct tcp_sock *tp,
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index 659129d1fcbf..27a846263f08 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o
+mptcp-y := protocol.o options.o
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
new file mode 100644
index 000000000000..42626cd0a9f7
--- /dev/null
+++ b/net/mptcp/options.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#include <linux/kernel.h>
+#include <net/tcp.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+void mptcp_parse_option(const unsigned char *ptr, int opsize,
+			struct tcp_options_received *opt_rx)
+{
+	struct mptcp_options_received *mp_opt = &opt_rx->mptcp;
+	u8 subtype = *ptr >> 4;
+
+	switch (subtype) {
+	/* MPTCPOPT_MP_CAPABLE
+	 * 0: 4MSB=subtype, 4LSB=version
+	 * 1: Handshake flags
+	 * 2-9: Sender key
+	 * 10-17: Receiver key (optional)
+	 */
+	case MPTCPOPT_MP_CAPABLE:
+		if (opsize != TCPOLEN_MPTCP_MPC_SYN &&
+		    opsize != TCPOLEN_MPTCP_MPC_SYNACK)
+			break;
+
+		mp_opt->version = *ptr++ & MPTCP_VERSION_MASK;
+		if (mp_opt->version != 0)
+			break;
+
+		mp_opt->flags = *ptr++;
+		if (!((mp_opt->flags & MPTCP_CAP_FLAG_MASK) == MPTCP_CAP_HMAC_SHA1) ||
+		    (mp_opt->flags & MPTCP_CAP_EXTENSIBILITY))
+			break;
+
+		mp_opt->mp_capable = 1;
+		mp_opt->sndr_key = get_unaligned_be64(ptr);
+		ptr += 8;
+
+		if (opsize == TCPOLEN_MPTCP_MPC_SYNACK) {
+			mp_opt->rcvr_key = get_unaligned_be64(ptr);
+			ptr += 8;
+			pr_debug("MP_CAPABLE flags=%x, sndr=%llu, rcvr=%llu",
+				 mp_opt->flags, mp_opt->sndr_key,
+				 mp_opt->rcvr_key);
+		} else {
+			pr_debug("MP_CAPABLE flags=%x, sndr=%llu",
+				 mp_opt->flags, mp_opt->sndr_key);
+		}
+		break;
+
+	/* MPTCPOPT_MP_JOIN
+	 *
+	 * Initial SYN
+	 * 0: 4MSB=subtype, 000, 1LSB=Backup
+	 * 1: Address ID
+	 * 2-5: Receiver token
+	 * 6-9: Sender random number
+	 *
+	 * SYN/ACK response
+	 * 0: 4MSB=subtype, 000, 1LSB=Backup
+	 * 1: Address ID
+	 * 2-9: Sender truncated HMAC
+	 * 10-13: Sender random number
+	 *
+	 * Third ACK
+	 * 0: 4MSB=subtype, 0000
+	 * 1: 0 (Reserved)
+	 * 2-21: Sender HMAC
+	 */
+
+	/* MPTCPOPT_DSS
+	 * 0: 4MSB=subtype, 0000
+	 * 1: 3MSB=0, F=Data FIN, m=DSN length, M=has DSN/SSN/DLL/checksum,
+	 *    a=DACK length, A=has DACK
+	 * 0, 4, or 8 bytes of DACK (depending on A/a)
+	 * 0, 4, or 8 bytes of DSN (depending on M/m)
+	 * 0 or 4 bytes of SSN (depending on M)
+	 * 0 or 2 bytes of DLL (depending on M)
+	 * 0 or 2 bytes of checksum (depending on M)
+	 */
+	case MPTCPOPT_DSS:
+		pr_debug("DSS");
+		mp_opt->dss = 1;
+		break;
+
+	/* MPTCPOPT_ADD_ADDR
+	 * 0: 4MSB=subtype, 4LSB=IP version (4 or 6)
+	 * 1: Address ID
+	 * 4 or 16 bytes of address (depending on ip version)
+	 * 0 or 2 bytes of port (depending on length)
+	 */
+
+	/* MPTCPOPT_REMOVE_ADDR
+	 * 0: 4MSB=subtype, 0000
+	 * 1: Address ID
+	 * Additional bytes: More address IDs (depending on length)
+	 */
+
+	/* MPTCPOPT_MP_PRIO
+	 * 0: 4MSB=subtype, 000, 1LSB=Backup
+	 * 1: Address ID (optional, current addr implied if not present)
+	 */
+
+	/* MPTCPOPT_MP_FAIL
+	 * 0: 4MSB=subtype, 0000
+	 * 1: 0 (Reserved)
+	 * 2-9: DSN
+	 */
+
+	/* MPTCPOPT_MP_FASTCLOSE
+	 * 0: 4MSB=subtype, 0000
+	 * 1: 0 (Reserved)
+	 * 2-9: Receiver key
+	 */
+	default:
+		break;
+	}
+}
+
+void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
+{
+	if ((OPTION_MPTCP_MPC_SYN |
+	     OPTION_MPTCP_MPC_ACK) & opts->suboptions) {
+		u8 len;
+
+		if (OPTION_MPTCP_MPC_SYN & opts->suboptions)
+			len = TCPOLEN_MPTCP_MPC_SYN;
+		else
+			len = TCPOLEN_MPTCP_MPC_ACK;
+
+		*ptr++ = htonl((TCPOPT_MPTCP << 24) | (len << 16) |
+			       (MPTCPOPT_MP_CAPABLE << 12) |
+			       ((MPTCP_VERSION_MASK & 0) << 8) |
+			       MPTCP_CAP_HMAC_SHA1);
+		put_unaligned_be64(opts->sndr_key, ptr);
+		ptr += 2;
+		if (OPTION_MPTCP_MPC_ACK & opts->suboptions) {
+			put_unaligned_be64(opts->rcvr_key, ptr);
+			ptr += 2;
+		}
+	}
+}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 86db17af8c05..e57ee600df7f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -39,13 +39,13 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 static int mptcp_init_sock(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct net *net = sock_net(sk);
 	struct socket *sf;
 	int err;
 
 	pr_debug("msk=%p", msk);
 
-	err = sock_create_kern(&init_net, PF_INET, SOCK_STREAM, IPPROTO_TCP,
-			       &sf);
+	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP, &sf);
 	if (!err) {
 		pr_debug("subflow=%p", sf->sk);
 		msk->subflow = sf;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 972204835421..ac57e10ec4ca 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -7,6 +7,28 @@
 #ifndef __MPTCP_PROTOCOL_H
 #define __MPTCP_PROTOCOL_H
 
+/* MPTCP option subtypes */
+#define MPTCPOPT_MP_CAPABLE	0
+#define MPTCPOPT_MP_JOIN	1
+#define MPTCPOPT_DSS		2
+#define MPTCPOPT_ADD_ADDR	3
+#define MPTCPOPT_REMOVE_ADDR	4
+#define MPTCPOPT_MP_PRIO	5
+#define MPTCPOPT_MP_FAIL	6
+#define MPTCPOPT_MP_FASTCLOSE	7
+
+/* MPTCP suboption lengths */
+#define TCPOLEN_MPTCP_MPC_SYN		12
+#define TCPOLEN_MPTCP_MPC_SYNACK	20
+#define TCPOLEN_MPTCP_MPC_ACK		20
+
+/* MPTCP MP_CAPABLE flags */
+#define MPTCP_VERSION_MASK	(0x0F)
+#define MPTCP_CAP_CHECKSUM_REQD	BIT(7)
+#define MPTCP_CAP_EXTENSIBILITY	BIT(6)
+#define MPTCP_CAP_HMAC_SHA1	BIT(0)
+#define MPTCP_CAP_FLAG_MASK	(0x3F)
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
-- 
2.22.0



* [RFC PATCH net-next 05/33] mptcp: Associate MPTCP context with TCP socket
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (3 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 04/33] mptcp: Handle MPTCP TCP options Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 06/33] tcp: Expose tcp struct and routine for MPTCP Mat Martineau
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Use ULP to associate a subflow_context structure with each TCP
subflow socket.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
---
 include/linux/tcp.h  |  3 ++
 net/mptcp/Makefile   |  2 +-
 net/mptcp/protocol.c | 96 +++++++++++++++++++++++++++++++++++++-------
 net/mptcp/protocol.h | 24 +++++++++++
 net/mptcp/subflow.c  | 76 +++++++++++++++++++++++++++++++++++
 5 files changed, 185 insertions(+), 16 deletions(-)
 create mode 100644 net/mptcp/subflow.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 73c633f58233..b8c24bd8c862 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -397,6 +397,9 @@ struct tcp_sock {
 	u32	mtu_info; /* We received an ICMP_FRAG_NEEDED / ICMPV6_PKT_TOOBIG
 			   * while socket was owned by user.
 			   */
+#if IS_ENABLED(CONFIG_MPTCP)
+	bool	is_mptcp;
+#endif
 
 #ifdef CONFIG_TCP_MD5SIG
 /* TCP AF-Specific parts; only used by MD5 Signature support so far */
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index 27a846263f08..e1ee5aade8b0 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o options.o
+mptcp-y := protocol.o subflow.o options.o
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e57ee600df7f..ce2374ea7871 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -20,7 +20,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct socket *subflow = msk->subflow;
 
-	pr_debug("subflow=%p", subflow->sk);
+	pr_debug("subflow=%p", subflow_ctx(subflow->sk));
 
 	return sock_sendmsg(subflow, msg);
 }
@@ -31,7 +31,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct socket *subflow = msk->subflow;
 
-	pr_debug("subflow=%p", subflow->sk);
+	pr_debug("subflow=%p", subflow_ctx(subflow->sk));
 
 	return sock_recvmsg(subflow, msg, flags);
 }
@@ -39,19 +39,10 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 static int mptcp_init_sock(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct net *net = sock_net(sk);
-	struct socket *sf;
-	int err;
 
 	pr_debug("msk=%p", msk);
 
-	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP, &sf);
-	if (!err) {
-		pr_debug("subflow=%p", sf->sk);
-		msk->subflow = sf;
-	}
-
-	return err;
+	return 0;
 }
 
 static void mptcp_close(struct sock *sk, long timeout)
@@ -61,7 +52,7 @@ static void mptcp_close(struct sock *sk, long timeout)
 	inet_sk_state_store(sk, TCP_CLOSE);
 
 	if (msk->subflow) {
-		pr_debug("subflow=%p", msk->subflow->sk);
+		pr_debug("subflow=%p", subflow_ctx(msk->subflow->sk));
 		sock_release(msk->subflow);
 	}
 
@@ -76,7 +67,7 @@ static int mptcp_connect(struct sock *sk, struct sockaddr *saddr, int len)
 
 	saddr->sa_family = AF_INET;
 
-	pr_debug("msk=%p, subflow=%p", msk, msk->subflow->sk);
+	pr_debug("msk=%p, subflow=%p", msk, subflow_ctx(msk->subflow->sk));
 
 	err = kernel_connect(msk->subflow, saddr, len, 0);
 
@@ -102,15 +93,90 @@ static struct proto mptcp_prot = {
 	.no_autobind	= 1,
 };
 
+static int mptcp_subflow_create(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct net *net = sock_net(sk);
+	struct socket *sf;
+	int err;
+
+	pr_debug("msk=%p", msk);
+	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP, &sf);
+	if (!err) {
+		lock_sock(sf->sk);
+		err = tcp_set_ulp(sf->sk, "mptcp");
+		release_sock(sf->sk);
+		if (!err) {
+			struct subflow_context *subflow = subflow_ctx(sf->sk);
+
+			pr_debug("subflow=%p", subflow);
+			msk->subflow = sf;
+			subflow->conn = sk;
+			subflow->request_mptcp = 1; // @@ if MPTCP enabled
+			subflow->request_cksum = 1; // @@ if checksum enabled
+			subflow->version = 0;
+		}
+	}
+	return err;
+}
+
+static int mptcp_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	int err = -ENOTSUPP;
+
+	pr_debug("msk=%p", msk);
+
+	if (uaddr->sa_family != AF_INET) // @@ allow only IPv4 for now
+		return err;
+
+	if (!msk->subflow) {
+		err = mptcp_subflow_create(sock->sk);
+		if (err)
+			return err;
+	}
+	return inet_bind(msk->subflow, uaddr, addr_len);
+}
+
+static int mptcp_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+				int addr_len, int flags)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	int err = -ENOTSUPP;
+
+	pr_debug("msk=%p", msk);
+
+	if (uaddr->sa_family != AF_INET) // @@ allow only IPv4 for now
+		return err;
+
+	if (!msk->subflow) {
+		err = mptcp_subflow_create(sock->sk);
+		if (err)
+			return err;
+	}
+
+	return inet_stream_connect(msk->subflow, uaddr, addr_len, flags);
+}
+
+static struct proto_ops mptcp_stream_ops;
+
 static struct inet_protosw mptcp_protosw = {
 	.type		= SOCK_STREAM,
 	.protocol	= IPPROTO_MPTCP,
 	.prot		= &mptcp_prot,
-	.ops		= &inet_stream_ops,
+	.ops		= &mptcp_stream_ops,
+	.flags		= INET_PROTOSW_ICSK,
 };
 
 void __init mptcp_init(void)
 {
+	mptcp_prot.h.hashinfo = tcp_prot.h.hashinfo;
+	mptcp_stream_ops = inet_stream_ops;
+	mptcp_stream_ops.bind = mptcp_bind;
+	mptcp_stream_ops.connect = mptcp_stream_connect;
+
+	subflow_init();
+
 	if (proto_register(&mptcp_prot, 1) != 0)
 		panic("Failed to register MPTCP proto.\n");
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ac57e10ec4ca..b6adc2aa6222 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -41,4 +41,28 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 	return (struct mptcp_sock *)sk;
 }
 
+/* MPTCP subflow context */
+struct subflow_context {
+	u32	request_mptcp : 1,  /* send MP_CAPABLE */
+		request_cksum : 1,
+		version : 4;
+	struct  socket *tcp_sock;  /* underlying tcp_sock */
+	struct  sock *conn;        /* parent mptcp_sock */
+};
+
+static inline struct subflow_context *subflow_ctx(const struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	return (struct subflow_context *)icsk->icsk_ulp_data;
+}
+
+static inline struct socket *
+mptcp_subflow_tcp_socket(const struct subflow_context *subflow)
+{
+	return subflow->tcp_sock;
+}
+
+void subflow_init(void);
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
new file mode 100644
index 000000000000..8d13713ee159
--- /dev/null
+++ b/net/mptcp/subflow.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <net/sock.h>
+#include <net/inet_common.h>
+#include <net/inet_hashtables.h>
+#include <net/protocol.h>
+#include <net/tcp.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+static struct subflow_context *subflow_create_ctx(struct sock *sk,
+						  struct socket *sock)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct subflow_context *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	pr_debug("subflow=%p", ctx);
+
+	icsk->icsk_ulp_data = ctx;
+	/* might be NULL */
+	ctx->tcp_sock = sock;
+
+	return ctx;
+}
+
+static int subflow_ulp_init(struct sock *sk)
+{
+	struct tcp_sock *tsk = tcp_sk(sk);
+	struct subflow_context *ctx;
+	int err = 0;
+
+	ctx = subflow_create_ctx(sk, sk->sk_socket);
+	if (!ctx) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	pr_debug("subflow=%p", ctx);
+
+	tsk->is_mptcp = 1;
+out:
+	return err;
+}
+
+static void subflow_ulp_release(struct sock *sk)
+{
+	struct subflow_context *ctx = subflow_ctx(sk);
+
+	pr_debug("subflow=%p", ctx);
+
+	kfree(ctx);
+}
+
+static struct tcp_ulp_ops subflow_ulp_ops __read_mostly = {
+	.name		= "mptcp",
+	.owner		= THIS_MODULE,
+	.init		= subflow_ulp_init,
+	.release	= subflow_ulp_release,
+};
+
+void subflow_init(void)
+{
+	if (tcp_register_ulp(&subflow_ulp_ops) != 0)
+		panic("MPTCP: failed to register subflows to ULP\n");
+}
-- 
2.22.0



* [RFC PATCH net-next 06/33] tcp: Expose tcp struct and routine for MPTCP
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (4 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 05/33] mptcp: Associate MPTCP context with TCP socket Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 07/33] mptcp: Handle MP_CAPABLE options for outgoing connections Mat Martineau
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Make tcp_request_sock_ipv4_ops and tcp_v4_init_sock() non-static and
declare them in include/net/tcp.h.

They are needed for MPTCP subflow initialization.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/net/tcp.h   | 3 +++
 net/ipv4/tcp_ipv4.c | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8ec09b7767e..cdbc2a6d9d9a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -432,6 +432,7 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
 int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb);
 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int tcp_connect(struct sock *sk);
+int tcp_v4_init_sock(struct sock *sk);
 enum tcp_synack_type {
 	TCP_SYNACK_NORMAL,
 	TCP_SYNACK_FASTOPEN,
@@ -1959,6 +1960,8 @@ struct tcp_request_sock_ops {
 			   enum tcp_synack_type synack_type);
 };
 
+extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
+
 #ifdef CONFIG_SYN_COOKIES
 static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
 					 const struct sock *sk, struct sk_buff *skb,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 633e8244ed5b..ac61c6c9ec13 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1367,7 +1367,7 @@ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
 	.syn_ack_timeout =	tcp_syn_ack_timeout,
 };
 
-static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
 	.mss_clamp	=	TCP_MSS_DEFAULT,
 #ifdef CONFIG_TCP_MD5SIG
 	.req_md5_lookup	=	tcp_v4_md5_lookup,
@@ -2053,7 +2053,7 @@ static const struct tcp_sock_af_ops tcp_sock_ipv4_specific = {
 /* NOTE: A lot of things set to zero explicitly by call to
  *       sk_alloc() so need not be done here.
  */
-static int tcp_v4_init_sock(struct sock *sk)
+int tcp_v4_init_sock(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-- 
2.22.0



* [RFC PATCH net-next 07/33] mptcp: Handle MP_CAPABLE options for outgoing connections
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (5 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 06/33] tcp: Expose tcp struct and routine for MPTCP Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 08/33] mptcp: add mptcp_poll Mat Martineau
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing
SYN request for a subflow socket and to the final ACK of the
three-way handshake.
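
As a rough illustration of what the SYN hook has to emit, the sketch
below encodes an MP_CAPABLE option for a SYN in userspace C, following
the RFC 6824 (version 0) layout: kind 30, length 12, subtype and version
nibbles, a flags byte, and the 64-bit sender key in network byte order.
The names used here (mpc_syn_encode, MPC_*) are made up for this example
and are not the identifiers used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MPC_KIND        30   /* TCP option kind assigned to MPTCP */
#define MPC_SYN_LEN     12   /* MP_CAPABLE on a SYN, RFC 6824 */
#define MPC_SUBTYPE     0x0  /* MP_CAPABLE subtype */
#define MPC_FLAG_CKSUM  0x80 /* "A" bit: checksum required */
#define MPC_FLAG_SHA1   0x01 /* "H" bit: HMAC-SHA1 */

/* Hypothetical helper: write an MP_CAPABLE SYN option into buf.
 * Returns the number of bytes written, or 0 if it does not fit.
 */
static size_t mpc_syn_encode(uint8_t *buf, size_t room, uint8_t version,
			     int want_cksum, uint64_t sndr_key)
{
	int i;

	assert(buf);
	if (room < MPC_SYN_LEN || version > 0xf)
		return 0;

	buf[0] = MPC_KIND;
	buf[1] = MPC_SYN_LEN;
	buf[2] = (uint8_t)((MPC_SUBTYPE << 4) | version);
	buf[3] = (uint8_t)(MPC_FLAG_SHA1 | (want_cksum ? MPC_FLAG_CKSUM : 0));

	/* the key goes out in network byte order (big endian) */
	for (i = 0; i < 8; i++)
		buf[4 + i] = (uint8_t)(sndr_key >> (56 - 8 * i));

	return MPC_SYN_LEN;
}
```

A caller would subtract the returned size from the remaining TCP option
space, which is what the tcp_syn_options() hook does with the size
reported by mptcp_syn_options().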

Use the .sk_rx_dst_set() handler in the subflow proto to capture
when the responding SYN-ACK is received and notify the MPTCP
connection layer.
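
The hook described above relies on a common pattern: copy an existing
ops table, replace one callback, and guard the notification with a flag
so that it fires only once. A minimal userspace sketch of that pattern
follows. All names in it (af_ops, default_rx_dst_set,
subflow_rx_dst_set, conn_notify) are hypothetical stand-ins rather than
kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

struct conn {
	int established;
	int notifications;
};

struct subflow {
	struct conn *conn;		/* parent connection, may be NULL */
	unsigned int conn_finished : 1;	/* parent already notified */
	int dst_cached;
};

struct af_ops {
	void (*rx_dst_set)(struct subflow *sf);
	/* a real table carries many more callbacks */
};

static void default_rx_dst_set(struct subflow *sf)
{
	sf->dst_cached = 1;
}

static void conn_notify(struct conn *c)
{
	c->established = 1;
	c->notifications++;
}

/* Wrapper: do the original work, then notify the parent once. */
static void subflow_rx_dst_set(struct subflow *sf)
{
	assert(sf);
	default_rx_dst_set(sf);

	if (sf->conn && !sf->conn_finished) {
		conn_notify(sf->conn);
		sf->conn_finished = 1;
	}
}

static const struct af_ops default_ops = {
	.rx_dst_set = default_rx_dst_set,
};
static struct af_ops subflow_ops;

/* Copy the stock table, then override a single entry. */
static void subflow_ops_init(void)
{
	subflow_ops = default_ops;
	subflow_ops.rx_dst_set = subflow_rx_dst_set;
}
```

Copying the table at init time keeps every other callback in step with
the stock implementation, so only the overridden entry needs to be
maintained.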

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/net/mptcp.h   | 35 ++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c  |  3 +++
 net/ipv4/tcp_output.c | 29 +++++++++++++++++++++--
 net/mptcp/options.c   | 45 ++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.c  | 53 +++++++++++++++++++++++++++++++------------
 net/mptcp/protocol.h  | 16 +++++++++++--
 net/mptcp/subflow.c   | 25 ++++++++++++++++++--
 7 files changed, 185 insertions(+), 21 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 0d3e02c6c817..81255b0f57d7 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -14,17 +14,30 @@
 #define OPTION_MPTCP_MPC_ACK	BIT(2)
 
 struct mptcp_out_options {
+#if IS_ENABLED(CONFIG_MPTCP)
 	u16 suboptions;
 	u64 sndr_key;
 	u64 rcvr_key;
+#endif
 };
 
 #ifdef CONFIG_MPTCP
 
 void mptcp_init(void);
 
+static inline bool sk_is_mptcp(const struct sock *sk)
+{
+	return tcp_sk(sk)->is_mptcp;
+}
+
 void mptcp_parse_option(const unsigned char *ptr, int opsize,
 			struct tcp_options_received *opt_rx);
+bool mptcp_syn_options(struct sock *sk, unsigned int *size,
+		       struct mptcp_out_options *opts);
+void mptcp_rcv_synsent(struct sock *sk);
+bool mptcp_established_options(struct sock *sk, unsigned int *size,
+			       struct mptcp_out_options *opts);
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts);
 
 #else
@@ -33,10 +46,32 @@ static inline void mptcp_init(void)
 {
 }
 
+static inline bool sk_is_mptcp(const struct sock *sk)
+{
+	return false;
+}
+
 static inline void mptcp_parse_option(const unsigned char *ptr, int opsize,
 				      struct tcp_options_received *opt_rx)
 {
 }
 
+static inline bool mptcp_syn_options(struct sock *sk, unsigned int *size,
+				     struct mptcp_out_options *opts)
+{
+	return false;
+}
+
+static inline void mptcp_rcv_synsent(struct sock *sk)
+{
+}
+
+static inline bool mptcp_established_options(struct sock *sk,
+					     unsigned int *size,
+					     struct mptcp_out_options *opts)
+{
+	return false;
+}
+
 #endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 117f0efbbad5..4aa60fe0deca 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5901,6 +5901,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
 		tcp_initialize_rcv_mss(sk);
 
+		if (sk_is_mptcp(sk))
+			mptcp_rcv_synsent(sk);
+
 		/* Remember, tcp_poll() does not lock socket!
 		 * Change state from SYN-SENT only after copied_seq
 		 * is initialized. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 69c4f39efe8b..f46e58347d73 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -438,9 +438,7 @@ struct tcp_out_options {
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
-#if IS_ENABLED(CONFIG_MPTCP)
 	struct mptcp_out_options mptcp;
-#endif
 };
 
 static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
@@ -665,6 +663,15 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 
 	smc_set_option(tp, opts, &remaining);
 
+	if (sk_is_mptcp(sk)) {
+		unsigned int size;
+
+		if (mptcp_syn_options(sk, &size, &opts->mptcp)) {
+			opts->options |= OPTION_MPTCP;
+			remaining -= size;
+		}
+	}
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -763,6 +770,24 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 		size += TCPOLEN_TSTAMP_ALIGNED;
 	}
 
+	/* MPTCP options have precedence over SACK for the limited TCP
+	 * option space because a MPTCP connection would be forced to
+	 * fall back to regular TCP if a required multipath option is
+	 * missing. SACK still gets a chance to use whatever space is
+	 * left.
+	 */
+	if (sk_is_mptcp(sk)) {
+		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+		unsigned int opt_size;
+
+		if (mptcp_established_options(sk, &opt_size, &opts->mptcp)) {
+			if (remaining >= opt_size) {
+				opts->options |= OPTION_MPTCP;
+				size += opt_size;
+			}
+		}
+	}
+
 	eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
 	if (unlikely(eff_sacks)) {
 		const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 42626cd0a9f7..071e937d5c1f 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -121,6 +121,51 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	}
 }
 
+bool mptcp_syn_options(struct sock *sk, unsigned int *size,
+		       struct mptcp_out_options *opts)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	if (subflow->request_mptcp) {
+		pr_debug("local_key=%llu", subflow->local_key);
+		opts->suboptions = OPTION_MPTCP_MPC_SYN;
+		opts->sndr_key = subflow->local_key;
+		*size = TCPOLEN_MPTCP_MPC_SYN;
+		return true;
+	}
+	return false;
+}
+
+void mptcp_rcv_synsent(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	pr_debug("subflow=%p", subflow);
+	if (subflow->request_mptcp && tp->rx_opt.mptcp.mp_capable) {
+		subflow->mp_capable = 1;
+		subflow->remote_key = tp->rx_opt.mptcp.sndr_key;
+	}
+}
+
+bool mptcp_established_options(struct sock *sk, unsigned int *size,
+			       struct mptcp_out_options *opts)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	if (subflow->mp_capable && !subflow->fourth_ack) {
+		opts->suboptions = OPTION_MPTCP_MPC_ACK;
+		opts->sndr_key = subflow->local_key;
+		opts->rcvr_key = subflow->remote_key;
+		*size = TCPOLEN_MPTCP_MPC_ACK;
+		subflow->fourth_ack = 1;
+		pr_debug("subflow=%p, local_key=%llu, remote_key=%llu",
+			 subflow, subflow->local_key, subflow->remote_key);
+		return true;
+	}
+	return false;
+}
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 {
 	if ((OPTION_MPTCP_MPC_SYN |
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ce2374ea7871..56637e4474da 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -18,9 +18,15 @@
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *subflow = msk->subflow;
-
-	pr_debug("subflow=%p", subflow_ctx(subflow->sk));
+	struct socket *subflow;
+
+	if (msk->connection_list) {
+		subflow = msk->connection_list;
+		pr_debug("conn_list->subflow=%p", subflow_ctx(subflow->sk));
+	} else {
+		subflow = msk->subflow;
+		pr_debug("subflow=%p", subflow_ctx(subflow->sk));
+	}
 
 	return sock_sendmsg(subflow, msg);
 }
@@ -29,9 +35,15 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *subflow = msk->subflow;
-
-	pr_debug("subflow=%p", subflow_ctx(subflow->sk));
+	struct socket *subflow;
+
+	if (msk->connection_list) {
+		subflow = msk->connection_list;
+		pr_debug("conn_list->subflow=%p", subflow_ctx(subflow->sk));
+	} else {
+		subflow = msk->subflow;
+		pr_debug("subflow=%p", subflow_ctx(subflow->sk));
+	}
 
 	return sock_recvmsg(subflow, msg, flags);
 }
@@ -56,24 +68,36 @@ static void mptcp_close(struct sock *sk, long timeout)
 		sock_release(msk->subflow);
 	}
 
+	if (msk->connection_list) {
+		pr_debug("conn_list->subflow=%p", msk->connection_list->sk);
+		sock_release(msk->connection_list);
+	}
+
 	sock_orphan(sk);
 	sock_put(sk);
 }
 
-static int mptcp_connect(struct sock *sk, struct sockaddr *saddr, int len)
+static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	int err;
-
-	saddr->sa_family = AF_INET;
 
 	pr_debug("msk=%p, subflow=%p", msk, subflow_ctx(msk->subflow->sk));
 
-	err = kernel_connect(msk->subflow, saddr, len, 0);
+	return inet_csk_get_port(msk->subflow->sk, snum);
+}
 
-	sk->sk_state = TCP_ESTABLISHED;
+void mptcp_finish_connect(struct sock *sk, int mp_capable)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct subflow_context *subflow = subflow_ctx(msk->subflow->sk);
 
-	return err;
+	if (mp_capable) {
+		msk->remote_key = subflow->remote_key;
+		msk->local_key = subflow->local_key;
+		msk->connection_list = msk->subflow;
+		msk->subflow = NULL;
+	}
+	sk->sk_state = TCP_ESTABLISHED;
 }
 
 static struct proto mptcp_prot = {
@@ -82,13 +106,12 @@ static struct proto mptcp_prot = {
 	.init		= mptcp_init_sock,
 	.close		= mptcp_close,
 	.accept		= inet_csk_accept,
-	.connect	= mptcp_connect,
 	.shutdown	= tcp_shutdown,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
 	.hash		= inet_hash,
 	.unhash		= inet_unhash,
-	.get_port	= inet_csk_get_port,
+	.get_port	= mptcp_get_port,
 	.obj_size	= sizeof(struct mptcp_sock),
 	.no_autobind	= 1,
 };
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index b6adc2aa6222..9206e60ef6d3 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -33,7 +33,10 @@
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
 	struct	inet_connection_sock sk;
-	struct	socket *subflow;
+	u64	local_key;
+	u64	remote_key;
+	struct	socket *connection_list; /* @@ needs to be a list */
+	struct	socket *subflow; /* outgoing connect, listener or !mp_capable */
 };
 
 static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
@@ -43,9 +46,14 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 
 /* MPTCP subflow context */
 struct subflow_context {
+	u64	local_key;
+	u64	remote_key;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
-		version : 4;
+		mp_capable : 1,	    /* remote is MPTCP capable */
+		fourth_ack : 1,     /* send initial DSS */
+		version : 4,
+		conn_finished : 1;
 	struct  socket *tcp_sock;  /* underlying tcp_sock */
 	struct  sock *conn;        /* parent mptcp_sock */
 };
@@ -65,4 +73,8 @@ mptcp_subflow_tcp_socket(const struct subflow_context *subflow)
 
 void subflow_init(void);
 
+extern const struct inet_connection_sock_af_ops ipv4_specific;
+
+void mptcp_finish_connect(struct sock *sk, int mp_capable);
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 8d13713ee159..91df2c4be339 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -15,6 +15,22 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	inet_sk_rx_dst_set(sk, skb);
+
+	if (subflow->conn && !subflow->conn_finished) {
+		pr_debug("subflow=%p, remote_key=%llu", subflow_ctx(sk),
+			 subflow->remote_key);
+		mptcp_finish_connect(subflow->conn, subflow->mp_capable);
+		subflow->conn_finished = 1;
+	}
+}
+
+static struct inet_connection_sock_af_ops subflow_specific;
+
 static struct subflow_context *subflow_create_ctx(struct sock *sk,
 						  struct socket *sock)
 {
@@ -36,7 +52,8 @@ static struct subflow_context *subflow_create_ctx(struct sock *sk,
 
 static int subflow_ulp_init(struct sock *sk)
 {
-	struct tcp_sock *tsk = tcp_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct subflow_context *ctx;
 	int err = 0;
 
@@ -48,7 +65,8 @@ static int subflow_ulp_init(struct sock *sk)
 
 	pr_debug("subflow=%p", ctx);
 
-	tsk->is_mptcp = 1;
+	tp->is_mptcp = 1;
+	icsk->icsk_af_ops = &subflow_specific;
 out:
 	return err;
 }
@@ -71,6 +89,9 @@ static struct tcp_ulp_ops subflow_ulp_ops __read_mostly = {
 
 void subflow_init(void)
 {
+	subflow_specific = ipv4_specific;
+	subflow_specific.sk_rx_dst_set = subflow_finish_connect;
+
 	if (tcp_register_ulp(&subflow_ulp_ops) != 0)
 		panic("MPTCP: failed to register subflows to ULP\n");
 }
-- 
2.22.0



* [RFC PATCH net-next 08/33] mptcp: add mptcp_poll
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (6 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 07/33] mptcp: Handle MP_CAPABLE options for outgoing connections Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 09/33] tcp, ulp: Add clone operation to tcp_ulp_ops Mat Martineau
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

tcp_poll() can't be used directly on the MPTCP socket; it has to be
called on a subflow socket instead:

BUG: KASAN: slab-out-of-bounds in tcp_poll+0x17f/0x540
Read of size 4 at addr ffff88806ac5e50c by task mptcp_connect/2085
Call Trace:
 tcp_poll+0x17f/0x540
 sock_poll+0x152/0x180

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/mptcp/protocol.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 56637e4474da..3d9cd52e3e1e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -181,6 +181,19 @@ static int mptcp_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	return inet_stream_connect(msk->subflow, uaddr, addr_len, flags);
 }
 
+static __poll_t mptcp_poll(struct file *file, struct socket *sock,
+			   struct poll_table_struct *wait)
+{
+	const struct mptcp_sock *msk;
+	struct sock *sk = sock->sk;
+
+	msk = mptcp_sk(sk);
+	if (msk->subflow)
+		return tcp_poll(file, msk->subflow, wait);
+
+	return tcp_poll(file, msk->connection_list, wait);
+}
+
 static struct proto_ops mptcp_stream_ops;
 
 static struct inet_protosw mptcp_protosw = {
@@ -197,6 +210,7 @@ void __init mptcp_init(void)
 	mptcp_stream_ops = inet_stream_ops;
 	mptcp_stream_ops.bind = mptcp_bind;
 	mptcp_stream_ops.connect = mptcp_stream_connect;
+	mptcp_stream_ops.poll = mptcp_poll;
 
 	subflow_init();
 
-- 
2.22.0



* [RFC PATCH net-next 09/33] tcp, ulp: Add clone operation to tcp_ulp_ops
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (7 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 08/33] mptcp: add mptcp_poll Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 10/33] mptcp: Create SUBFLOW socket for incoming connections Mat Martineau
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

If ULP is used on a listening socket, icsk_ulp_ops and icsk_ulp_data are
copied when the listener is cloned. Sometimes the clone is immediately
deleted, which will invoke the release op on the clone and likely
corrupt the listening socket's icsk_ulp_data.

The clone operation is invoked immediately after the clone is copied and
gives the ULP type an opportunity to set up the clone socket and its
icsk_ulp_data.
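
The hazard and the fix can be modelled in a few lines of userspace C.
This is only a sketch under simplified assumptions: the struct and
function names (sock, ulp_ops, ctx_clone, sock_clone) are illustrative,
and the clone callback here takes a single argument instead of the
(req, newsk, priority) signature added by this patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sock;

struct ulp_ops {
	void (*release)(struct sock *sk);
	void (*clone)(struct sock *newsk);	/* optional */
};

struct sock {
	const struct ulp_ops *ulp_ops;
	void *ulp_data;
};

struct ctx {
	int is_listener;
};

static void ctx_release(struct sock *sk)
{
	free(sk->ulp_data);
	sk->ulp_data = NULL;
}

/* Give the clone private data instead of the listener's pointer. */
static void ctx_clone(struct sock *newsk)
{
	/* on allocation failure the clone simply has no context */
	newsk->ulp_data = calloc(1, sizeof(struct ctx));
}

static const struct ulp_ops ctx_ops = {
	.release = ctx_release,
	.clone   = ctx_clone,
};

/* Shallow copy first, then let the ULP fix up the copy. */
static void sock_clone(const struct sock *parent, struct sock *newsk)
{
	assert(parent && newsk);
	memcpy(newsk, parent, sizeof(*newsk));

	if (newsk->ulp_ops && newsk->ulp_ops->clone)
		newsk->ulp_ops->clone(newsk);
}
```

Without the clone callback, the shallow copy would leave both sockets
pointing at the same context, and releasing the clone would free memory
the listener still uses.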

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/tcp.h               |  5 +++++
 net/ipv4/inet_connection_sock.c |  2 ++
 net/ipv4/tcp_ulp.c              | 12 ++++++++++++
 3 files changed, 19 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cdbc2a6d9d9a..23995f8c11fa 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2108,6 +2108,9 @@ struct tcp_ulp_ops {
 	int (*init)(struct sock *sk);
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
+	/* clone ulp */
+	void (*clone)(const struct request_sock *req, struct sock *newsk,
+		      const gfp_t priority);
 
 	char		name[TCP_ULP_NAME_MAX];
 	struct module	*owner;
@@ -2117,6 +2120,8 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
+void tcp_clone_ulp(const struct request_sock *req,
+		   struct sock *newsk, const gfp_t priority);
 
 #define MODULE_ALIAS_TCP_ULP(name)				\
 	__MODULE_INFO(alias, alias_userspace, name);		\
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fb0b4b0994ec..386df6f227d8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -814,6 +814,8 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 		/* Deinitialize accept_queue to trap illegal accesses. */
 		memset(&newicsk->icsk_accept_queue, 0, sizeof(newicsk->icsk_accept_queue));
 
+		tcp_clone_ulp(req, newsk, priority);
+
 		security_inet_csk_clone(newsk, req);
 	}
 	return newsk;
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 3d8a1d835471..d6e8ba0035e8 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -114,6 +114,18 @@ void tcp_cleanup_ulp(struct sock *sk)
 	icsk->icsk_ulp_ops = NULL;
 }
 
+void tcp_clone_ulp(const struct request_sock *req, struct sock *newsk,
+		   const gfp_t priority)
+{
+	struct inet_connection_sock *icsk = inet_csk(newsk);
+
+	if (!icsk->icsk_ulp_ops)
+		return;
+
+	if (icsk->icsk_ulp_ops->clone)
+		icsk->icsk_ulp_ops->clone(req, newsk, priority);
+}
+
 static int __tcp_set_ulp(struct sock *sk, const struct tcp_ulp_ops *ulp_ops)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
-- 
2.22.0



* [RFC PATCH net-next 10/33] mptcp: Create SUBFLOW socket for incoming connections
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (8 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 09/33] tcp, ulp: Add clone operation to tcp_ulp_ops Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 11/33] mptcp: Add key generation and token tree Mat Martineau
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Add subflow_request_sock type that extends tcp_request_sock
and add an is_mptcp flag to tcp_request_sock to distinguish them.

Override the listen() and accept() methods of the MPTCP
socket proto_ops so they may act on the subflow socket.

Override the conn_request() and syn_recv_sock() handlers
in the inet_connection_sock to handle incoming MPTCP
SYNs and the ACK that answers the SYN-ACK response.

Add handling in tcp_output.c to add MP_CAPABLE to an outgoing
SYN-ACK response for a subflow_request_sock.
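
Handling an incoming MPTCP SYN starts with finding option kind 30 in
the TCP header. The sketch below walks a TCP option buffer in userspace
C in the same spirit as the mptcp_get_options() helper in this patch.
The function name find_mptcp_option and the OPT_* macros are
hypothetical, and the sketch adds a defensive check that at least two
bytes remain before it reads a length byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL    0
#define OPT_NOP    1
#define OPT_MPTCP  30

/* Hypothetical helper: return a pointer to the first MPTCP option
 * (starting at its kind byte), or NULL if there is none. On success
 * *optlen is set to the option's length field.
 */
static const uint8_t *find_mptcp_option(const uint8_t *opts, size_t length,
					 uint8_t *optlen)
{
	const uint8_t *ptr = opts;

	assert(opts && optlen);
	while (length > 0) {
		uint8_t kind = ptr[0];
		uint8_t size;

		if (kind == OPT_EOL)
			return NULL;
		if (kind == OPT_NOP) {	/* single-byte padding */
			ptr++;
			length--;
			continue;
		}
		if (length < 2)		/* no room for a length byte */
			return NULL;
		size = ptr[1];
		if (size < 2 || size > length)	/* malformed or truncated */
			return NULL;
		if (kind == OPT_MPTCP) {
			*optlen = size;
			return ptr;
		}
		ptr += size;
		length -= size;
	}
	return NULL;
}
```

The buffer passed in would be the bytes between the fixed TCP header and
the data offset, that is, (doff * 4) minus the 20-byte header.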

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
---
 include/linux/tcp.h   |   3 +
 include/net/mptcp.h   |  19 ++++++
 net/ipv4/tcp_input.c  |   3 +
 net/ipv4/tcp_output.c |  18 +++++
 net/mptcp/options.c   |  57 +++++++++++++++-
 net/mptcp/protocol.c  |  94 +++++++++++++++++++++++++-
 net/mptcp/protocol.h  |  20 ++++++
 net/mptcp/subflow.c   | 150 ++++++++++++++++++++++++++++++++++++++++--
 8 files changed, 357 insertions(+), 7 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b8c24bd8c862..fcbe8443aaad 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -139,6 +139,9 @@ struct tcp_request_sock {
 	const struct tcp_request_sock_ops *af_specific;
 	u64				snt_synack; /* first SYNACK sent time */
 	bool				tfo_listener;
+#if IS_ENABLED(CONFIG_MPTCP)
+	bool				is_mptcp;
+#endif
 	u32				txhash;
 	u32				rcv_isn;
 	u32				snt_isn;
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 81255b0f57d7..e7cae0f4404a 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -30,11 +30,18 @@ static inline bool sk_is_mptcp(const struct sock *sk)
 	return tcp_sk(sk)->is_mptcp;
 }
 
+static inline bool rsk_is_mptcp(const struct request_sock *req)
+{
+	return tcp_rsk(req)->is_mptcp;
+}
+
 void mptcp_parse_option(const unsigned char *ptr, int opsize,
 			struct tcp_options_received *opt_rx);
 bool mptcp_syn_options(struct sock *sk, unsigned int *size,
 		       struct mptcp_out_options *opts);
 void mptcp_rcv_synsent(struct sock *sk);
+bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
+			  struct mptcp_out_options *opts);
 bool mptcp_established_options(struct sock *sk, unsigned int *size,
 			       struct mptcp_out_options *opts);
 
@@ -51,6 +58,11 @@ static inline bool sk_is_mptcp(const struct sock *sk)
 	return false;
 }
 
+static inline bool rsk_is_mptcp(const struct request_sock *req)
+{
+	return false;
+}
+
 static inline void mptcp_parse_option(const unsigned char *ptr, int opsize,
 				      struct tcp_options_received *opt_rx)
 {
@@ -66,6 +78,13 @@ static inline void mptcp_rcv_synsent(struct sock *sk)
 {
 }
 
+static inline bool mptcp_synack_options(const struct request_sock *req,
+					unsigned int *size,
+					struct mptcp_out_options *opts)
+{
+	return false;
+}
+
 static inline bool mptcp_established_options(struct sock *sk,
 					     unsigned int *size,
 					     struct mptcp_out_options *opts)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4aa60fe0deca..240eb75c7b84 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6493,6 +6493,9 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 
 	tcp_rsk(req)->af_specific = af_ops;
 	tcp_rsk(req)->ts_off = 0;
+#if IS_ENABLED(CONFIG_MPTCP)
+	tcp_rsk(req)->is_mptcp = 0;
+#endif
 
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = af_ops->mss_clamp;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f46e58347d73..a41ba69760f1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -594,6 +594,22 @@ static void smc_set_option_cond(const struct tcp_sock *tp,
 #endif
 }
 
+static void mptcp_set_option_cond(const struct request_sock *req,
+				  struct tcp_out_options *opts,
+				  unsigned int *remaining)
+{
+	if (rsk_is_mptcp(req)) {
+		unsigned int size;
+
+		if (mptcp_synack_options(req, &size, &opts->mptcp)) {
+			if (*remaining >= size) {
+				opts->options |= OPTION_MPTCP;
+				*remaining -= size;
+			}
+		}
+	}
+}
+
 /* Compute TCP options for SYN packets. This is not the final
  * network wire format yet.
  */
@@ -733,6 +749,8 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 		}
 	}
 
+	mptcp_set_option_cond(req, opts, &remaining);
+
 	smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
 
 	return MAX_TCP_OPTION_SPACE - remaining;
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 071e937d5c1f..d8e77cd5664d 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -121,6 +121,39 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	}
 }
 
+void mptcp_get_options(const struct sk_buff *skb,
+		       struct tcp_options_received *opt_rx)
+{
+	const unsigned char *ptr;
+	const struct tcphdr *th = tcp_hdr(skb);
+	int length = (th->doff * 4) - sizeof(struct tcphdr);
+
+	ptr = (const unsigned char *)(th + 1);
+
+	while (length > 0) {
+		int opcode = *ptr++;
+		int opsize;
+
+		switch (opcode) {
+		case TCPOPT_EOL:
+			return;
+		case TCPOPT_NOP:	/* Ref: RFC 793 section 3.1 */
+			length--;
+			continue;
+		default:
+			opsize = *ptr++;
+			if (opsize < 2) /* "silly options" */
+				return;
+			if (opsize > length)
+				return;	/* don't parse partial options */
+			if (opcode == TCPOPT_MPTCP)
+				mptcp_parse_option(ptr, opsize, opt_rx);
+			ptr += opsize - 2;
+			length -= opsize;
+		}
+	}
+}
+
 bool mptcp_syn_options(struct sock *sk, unsigned int *size,
 		       struct mptcp_out_options *opts)
 {
@@ -166,14 +199,35 @@ bool mptcp_established_options(struct sock *sk, unsigned int *size,
 	return false;
 }
 
+bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
+			  struct mptcp_out_options *opts)
+{
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+
+	if (subflow_req->mp_capable) {
+		opts->suboptions = OPTION_MPTCP_MPC_SYNACK;
+		opts->sndr_key = subflow_req->local_key;
+		opts->rcvr_key = subflow_req->remote_key;
+		*size = TCPOLEN_MPTCP_MPC_SYNACK;
+		pr_debug("subflow_req=%p, local_key=%llu, remote_key=%llu",
+			 subflow_req, subflow_req->local_key,
+			 subflow_req->remote_key);
+		return true;
+	}
+	return false;
+}
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 {
 	if ((OPTION_MPTCP_MPC_SYN |
+	     OPTION_MPTCP_MPC_SYNACK |
 	     OPTION_MPTCP_MPC_ACK) & opts->suboptions) {
 		u8 len;
 
 		if (OPTION_MPTCP_MPC_SYN & opts->suboptions)
 			len = TCPOLEN_MPTCP_MPC_SYN;
+		else if (OPTION_MPTCP_MPC_SYNACK & opts->suboptions)
+			len = TCPOLEN_MPTCP_MPC_SYNACK;
 		else
 			len = TCPOLEN_MPTCP_MPC_ACK;
 
@@ -183,7 +237,8 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 			       MPTCP_CAP_HMAC_SHA1);
 		put_unaligned_be64(opts->sndr_key, ptr);
 		ptr += 2;
-		if (OPTION_MPTCP_MPC_ACK & opts->suboptions) {
+		if ((OPTION_MPTCP_MPC_SYNACK |
+		     OPTION_MPTCP_MPC_ACK) & opts->suboptions) {
 			put_unaligned_be64(opts->rcvr_key, ptr);
 			ptr += 2;
 		}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3d9cd52e3e1e..ea771f537ac0 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -69,7 +69,8 @@ static void mptcp_close(struct sock *sk, long timeout)
 	}
 
 	if (msk->connection_list) {
-		pr_debug("conn_list->subflow=%p", msk->connection_list->sk);
+		pr_debug("conn_list->subflow=%p",
+			 subflow_ctx(msk->connection_list->sk));
 		sock_release(msk->connection_list);
 	}
 
@@ -77,6 +78,47 @@ static void mptcp_close(struct sock *sk, long timeout)
 	sock_put(sk);
 }
 
+static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
+				 bool kern)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *listener = msk->subflow;
+	struct socket *new_sock;
+	struct socket *new_mptcp_sock;
+	struct subflow_context *subflow;
+
+	pr_debug("msk=%p, listener=%p", msk, subflow_ctx(listener->sk));
+	*err = kernel_accept(listener, &new_sock, flags);
+	if (*err < 0)
+		return NULL;
+
+	subflow = subflow_ctx(new_sock->sk);
+	pr_debug("msk=%p, new subflow=%p", msk, subflow);
+
+	*err = sock_create(PF_INET, SOCK_STREAM, IPPROTO_MPTCP,
+			   &new_mptcp_sock);
+	if (*err < 0) {
+		kernel_sock_shutdown(new_sock, SHUT_RDWR);
+		sock_release(new_sock);
+		return NULL;
+	}
+
+	msk = mptcp_sk(new_mptcp_sock->sk);
+	pr_debug("new msk=%p", msk);
+	subflow->conn = new_mptcp_sock->sk;
+	subflow->tcp_sock = new_sock;
+
+	if (subflow->mp_capable) {
+		msk->remote_key = subflow->remote_key;
+		msk->local_key = subflow->local_key;
+		msk->connection_list = new_sock;
+	} else {
+		msk->subflow = new_sock;
+	}
+
+	return new_mptcp_sock->sk;
+}
+
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -105,7 +147,7 @@ static struct proto mptcp_prot = {
 	.owner		= THIS_MODULE,
 	.init		= mptcp_init_sock,
 	.close		= mptcp_close,
-	.accept		= inet_csk_accept,
+	.accept		= mptcp_accept,
 	.shutdown	= tcp_shutdown,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
@@ -181,6 +223,51 @@ static int mptcp_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	return inet_stream_connect(msk->subflow, uaddr, addr_len, flags);
 }
 
+static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
+			 int peer)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	struct socket *subflow;
+	int err = -EPERM;
+
+	if (msk->connection_list)
+		subflow = msk->connection_list;
+	else
+		subflow = msk->subflow;
+
+	err = inet_getname(subflow, uaddr, peer);
+
+	return err;
+}
+
+static int mptcp_listen(struct socket *sock, int backlog)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	int err;
+
+	pr_debug("msk=%p", msk);
+
+	if (!msk->subflow) {
+		err = mptcp_subflow_create(sock->sk);
+		if (err)
+			return err;
+	}
+	return inet_listen(msk->subflow, backlog);
+}
+
+static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
+			       int flags, bool kern)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+
+	pr_debug("msk=%p", msk);
+
+	if (!msk->subflow)
+		return -EINVAL;
+
+	return inet_accept(sock, newsock, flags, kern);
+}
+
 static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 			   struct poll_table_struct *wait)
 {
@@ -211,6 +298,9 @@ void __init mptcp_init(void)
 	mptcp_stream_ops.bind = mptcp_bind;
 	mptcp_stream_ops.connect = mptcp_stream_connect;
 	mptcp_stream_ops.poll = mptcp_poll;
+	mptcp_stream_ops.accept = mptcp_stream_accept;
+	mptcp_stream_ops.getname = mptcp_getname;
+	mptcp_stream_ops.listen = mptcp_listen;
 
 	subflow_init();
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 9206e60ef6d3..34eb10c279f0 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -44,6 +44,23 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 	return (struct mptcp_sock *)sk;
 }
 
+struct subflow_request_sock {
+	struct	tcp_request_sock sk;
+	u8	mp_capable : 1,
+		mp_join : 1,
+		checksum : 1,
+		backup : 1,
+		version : 4;
+	u64	local_key;
+	u64	remote_key;
+};
+
+static inline
+struct subflow_request_sock *subflow_rsk(const struct request_sock *rsk)
+{
+	return (struct subflow_request_sock *)rsk;
+}
+
 /* MPTCP subflow context */
 struct subflow_context {
 	u64	local_key;
@@ -75,6 +92,9 @@ void subflow_init(void);
 
 extern const struct inet_connection_sock_af_ops ipv4_specific;
 
+void mptcp_get_options(const struct sk_buff *skb,
+		       struct tcp_options_received *opt_rx);
+
 void mptcp_finish_connect(struct sock *sk, int mp_capable);
 
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 91df2c4be339..fd2bf7621f0e 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -15,6 +15,37 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static void subflow_v4_init_req(struct request_sock *req,
+				const struct sock *sk_listener,
+				struct sk_buff *skb)
+{
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+	struct subflow_context *listener = subflow_ctx(sk_listener);
+	struct tcp_options_received rx_opt;
+
+	tcp_rsk(req)->is_mptcp = 1;
+	pr_debug("subflow_req=%p, listener=%p", subflow_req, listener);
+
+	tcp_request_sock_ipv4_ops.init_req(req, sk_listener, skb);
+
+	memset(&rx_opt.mptcp, 0, sizeof(rx_opt.mptcp));
+	mptcp_get_options(skb, &rx_opt);
+
+	if (rx_opt.mptcp.mp_capable && listener->request_mptcp) {
+		subflow_req->mp_capable = 1;
+		if (rx_opt.mptcp.version >= listener->version)
+			subflow_req->version = listener->version;
+		else
+			subflow_req->version = rx_opt.mptcp.version;
+		if ((rx_opt.mptcp.flags & MPTCP_CAP_CHECKSUM_REQD) ||
+		    listener->request_cksum)
+			subflow_req->checksum = 1;
+		subflow_req->remote_key = rx_opt.mptcp.sndr_key;
+	} else {
+		subflow_req->mp_capable = 0;
+	}
+}
+
 static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 {
 	struct subflow_context *subflow = subflow_ctx(sk);
@@ -29,21 +60,82 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
+static struct request_sock_ops subflow_request_sock_ops;
+static struct tcp_request_sock_ops subflow_request_sock_ipv4_ops;
+
+static int subflow_conn_request(struct sock *sk, struct sk_buff *skb)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	pr_debug("subflow=%p", subflow);
+
+	/* Never answer to SYNs sent to broadcast or multicast */
+	if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+		goto drop;
+
+	return tcp_conn_request(&subflow_request_sock_ops,
+				&subflow_request_sock_ipv4_ops,
+				sk, skb);
+drop:
+	tcp_listendrop(sk);
+	return 0;
+}
+
+static struct sock *subflow_syn_recv_sock(const struct sock *sk,
+					  struct sk_buff *skb,
+					  struct request_sock *req,
+					  struct dst_entry *dst,
+					  struct request_sock *req_unhash,
+					  bool *own_req)
+{
+	struct subflow_context *listener = subflow_ctx(sk);
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+	struct tcp_options_received opt_rx;
+	struct sock *child;
+
+	pr_debug("listener=%p, req=%p, conn=%p", listener, req, listener->conn);
+
+	if (subflow_req->mp_capable) {
+		opt_rx.mptcp.mp_capable = 0;
+		mptcp_get_options(skb, &opt_rx);
+		if (!opt_rx.mptcp.mp_capable ||
+		    subflow_req->local_key != opt_rx.mptcp.rcvr_key ||
+		    subflow_req->remote_key != opt_rx.mptcp.sndr_key)
+			return NULL;
+	}
+
+	child = tcp_v4_syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
+
+	if (child && *own_req) {
+		if (!subflow_ctx(child)) {
+			pr_debug("Closing child socket");
+			inet_sk_set_state(child, TCP_CLOSE);
+			sock_set_flag(child, SOCK_DEAD);
+			inet_csk_destroy_sock(child);
+			child = NULL;
+		}
+	}
+
+	return child;
+}
+
 static struct inet_connection_sock_af_ops subflow_specific;
 
 static struct subflow_context *subflow_create_ctx(struct sock *sk,
-						  struct socket *sock)
+						  struct socket *sock,
+						  gfp_t priority)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct subflow_context *ctx;
 
-	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	ctx = kzalloc(sizeof(*ctx), priority);
+	icsk->icsk_ulp_data = ctx;
+
 	if (!ctx)
 		return NULL;
 
 	pr_debug("subflow=%p", ctx);
 
-	icsk->icsk_ulp_data = ctx;
 	/* might be NULL */
 	ctx->tcp_sock = sock;
 
@@ -57,7 +149,7 @@ static int subflow_ulp_init(struct sock *sk)
 	struct subflow_context *ctx;
 	int err = 0;
 
-	ctx = subflow_create_ctx(sk, sk->sk_socket);
+	ctx = subflow_create_ctx(sk, sk->sk_socket, GFP_KERNEL);
 	if (!ctx) {
 		err = -ENOMEM;
 		goto out;
@@ -80,16 +172,66 @@ static void subflow_ulp_release(struct sock *sk)
 	kfree(ctx);
 }
 
+static void subflow_ulp_clone(const struct request_sock *req,
+			      struct sock *newsk,
+			      const gfp_t priority)
+{
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+
+	/* newsk->sk_socket is NULL at this point */
+	struct subflow_context *subflow = subflow_create_ctx(newsk, NULL,
+							     priority);
+
+	if (!subflow)
+		return;
+
+	subflow->conn = NULL;
+	subflow->conn_finished = 1;
+
+	if (subflow_req->mp_capable) {
+		subflow->mp_capable = 1;
+		subflow->fourth_ack = 1;
+		subflow->remote_key = subflow_req->remote_key;
+		subflow->local_key = subflow_req->local_key;
+	}
+}
+
 static struct tcp_ulp_ops subflow_ulp_ops __read_mostly = {
 	.name		= "mptcp",
 	.owner		= THIS_MODULE,
 	.init		= subflow_ulp_init,
 	.release	= subflow_ulp_release,
+	.clone		= subflow_ulp_clone,
 };
 
+static int subflow_ops_init(struct request_sock_ops *subflow_ops)
+{
+	subflow_ops->obj_size = sizeof(struct subflow_request_sock);
+	subflow_ops->slab_name = "request_sock_subflow";
+
+	subflow_ops->slab = kmem_cache_create(subflow_ops->slab_name,
+					      subflow_ops->obj_size, 0,
+					      SLAB_ACCOUNT |
+					      SLAB_TYPESAFE_BY_RCU,
+					      NULL);
+	if (!subflow_ops->slab)
+		return -ENOMEM;
+
+	return 0;
+}
+
 void subflow_init(void)
 {
+	subflow_request_sock_ops = tcp_request_sock_ops;
+	if (subflow_ops_init(&subflow_request_sock_ops) != 0)
+		panic("MPTCP: failed to init subflow request sock ops\n");
+
+	subflow_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
+	subflow_request_sock_ipv4_ops.init_req = subflow_v4_init_req;
+
 	subflow_specific = ipv4_specific;
+	subflow_specific.conn_request = subflow_conn_request;
+	subflow_specific.syn_recv_sock = subflow_syn_recv_sock;
 	subflow_specific.sk_rx_dst_set = subflow_finish_connect;
 
 	if (tcp_register_ulp(&subflow_ulp_ops) != 0)
-- 
2.22.0



* [RFC PATCH net-next 11/33] mptcp: Add key generation and token tree
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (9 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 10/33] mptcp: Create SUBFLOW socket for incoming connections Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 12/33] mptcp: Add shutdown() socket operation Mat Martineau
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Generate the local keys, IDSN, and token when creating a new
socket. Introduce the token tree, a radix tree indexed by the MPTCP
token itself, to track all tokens in use.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
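Note for reviewers: the token allocation in token_new_request() and
token_new_connect() boils down to "generate a candidate, retry while it
is already in use, then record it". The userspace sketch below only
illustrates that loop. It is not the kernel code: struct token_set,
token_in_use(), token_reserve() and halving_candidate() are hypothetical
names, a flat array stands in for the radix tree, and a single set
replaces the separate request and connection trees. The retry only makes
progress because the generator carries state, much as crypto_seed++ does
in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOKEN_SLOTS 64	/* hypothetical capacity; the patch uses a radix tree */

struct token_set {
	uint32_t tokens[TOKEN_SLOTS];
	size_t count;
};

/* Linear scan standing in for radix_tree_lookup() */
static bool token_in_use(const struct token_set *set, uint32_t token)
{
	size_t i;

	for (i = 0; i < set->count; i++)
		if (set->tokens[i] == token)
			return true;
	return false;
}

/* Ask the generator for candidates until one is unused, then record it.
 * Returns false only when the set is full.
 */
static bool token_reserve(struct token_set *set,
			  uint32_t (*next_candidate)(void *ctx), void *ctx,
			  uint32_t *out)
{
	uint32_t token;

	assert(set && next_candidate && out);
	if (set->count == TOKEN_SLOTS)
		return false;

	do {
		token = next_candidate(ctx);
	} while (token_in_use(set, token));

	set->tokens[set->count++] = token;
	*out = token;
	return true;
}

/* Demo generator that yields 0, 0, 1, 1, 2, 2, ... to force collisions */
static uint32_t halving_candidate(void *ctx)
{
	uint32_t *counter = ctx;

	return (*counter)++ / 2;
}
```

With a zero-initialized set and counter, two consecutive calls to
token_reserve() using halving_candidate() hand out distinct tokens even
though the generator repeats itself. The sketch leaves locking out; the
patch does the lookups and the insert under token_tree_lock.
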
 net/mptcp/Makefile   |   2 +-
 net/mptcp/crypto.c   | 206 +++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.c |  17 +++
 net/mptcp/protocol.h |  26 +++++
 net/mptcp/subflow.c  |  36 ++++++-
 net/mptcp/token.c    | 248 +++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 533 insertions(+), 2 deletions(-)
 create mode 100644 net/mptcp/crypto.c
 create mode 100644 net/mptcp/token.c
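
Note for reviewers: crypto_key_sha1() hashes the 64-bit key as a single
SHA-1 block, so it builds the padding by hand before calling
sha_transform(). The sketch below shows only that byte layout in
userspace C. key_to_sha1_block() and SHA1_BLOCK_BYTES are hypothetical
names used for illustration, and the SHA-1 transform itself is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SHA1_BLOCK_BYTES 64

/* Hypothetical helper (not part of the patch): lay out a 64-bit key as
 * one padded SHA-1 block, mirroring the input buffer that
 * crypto_key_sha1() prepares.
 */
static void key_to_sha1_block(uint64_t key, uint8_t block[SHA1_BLOCK_BYTES])
{
	int i;

	assert(block != NULL);
	memset(block, 0, SHA1_BLOCK_BYTES);

	/* Message: the key in big-endian (network) byte order */
	for (i = 0; i < 8; i++)
		block[i] = (uint8_t)(key >> (56 - 8 * i));

	block[8] = 0x80;	/* padding: first bit after the message is 1 */
	block[63] = 0x40;	/* padding: message length is 64 bits */
}
```

For the key 0x0102030405060708 this gives bytes 01 02 ... 08, then 0x80,
then zeros up to the final byte 0x40. The message length is a 64-bit
big-endian field in the last eight bytes of the block, which is why
setting only byte 63 is enough for an 8-byte message.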

diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index e1ee5aade8b0..178ae81d8b66 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o subflow.o options.o
+mptcp-y := protocol.o subflow.o options.o token.o crypto.o
diff --git a/net/mptcp/crypto.c b/net/mptcp/crypto.c
new file mode 100644
index 000000000000..26a68aa1933a
--- /dev/null
+++ b/net/mptcp/crypto.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP cryptographic functions
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ *
+ * Note: This code is based on mptcp_ctrl.c, mptcp_ipv4.c, and
+ *       mptcp_ipv6.c from multipath-tcp.org, authored by:
+ *
+ *       Sébastien Barré <sebastien.barre@uclouvain.be>
+ *       Christoph Paasch <christoph.paasch@uclouvain.be>
+ *       Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
+ *       Gregory Detal <gregory.detal@uclouvain.be>
+ *       Fabien Duchêne <fabien.duchene@uclouvain.be>
+ *       Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
+ *       Lavkesh Lahngir <lavkesh51@gmail.com>
+ *       Andreas Ripke <ripke@neclab.eu>
+ *       Vlad Dogaru <vlad.dogaru@intel.com>
+ *       Octavian Purdila <octavian.purdila@intel.com>
+ *       John Ronan <jronan@tssg.org>
+ *       Catalin Nicutar <catalin.nicutar@gmail.com>
+ *       Brandon Heller <brandonh@stanford.edu>
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/cryptohash.h>
+#include <linux/random.h>
+#include <linux/siphash.h>
+#include <asm/unaligned.h>
+
+static siphash_key_t crypto_key_secret __read_mostly;
+static hsiphash_key_t crypto_nonce_secret __read_mostly;
+static u32 crypto_seed;
+
+u32 crypto_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
+{
+	return hsiphash_4u32((__force u32)saddr, (__force u32)daddr,
+			    (__force u32)sport << 16 | (__force u32)dport,
+			    crypto_seed++, &crypto_nonce_secret);
+}
+
+u64 crypto_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
+{
+	pr_debug("src=%x:%d, dst=%x:%d", saddr, sport, daddr, dport);
+	return siphash_4u32((__force u32)saddr, (__force u32)daddr,
+			    (__force u32)sport << 16 | (__force u32)dport,
+			    crypto_seed++, &crypto_key_secret);
+}
+
+u32 crypto_v6_get_nonce(const struct in6_addr *saddr,
+			const struct in6_addr *daddr,
+			__be16 sport, __be16 dport)
+{
+	const struct {
+		struct in6_addr saddr;
+		struct in6_addr daddr;
+		u32 seed;
+		__be16 sport;
+		__be16 dport;
+	} __aligned(SIPHASH_ALIGNMENT) combined = {
+		.saddr = *saddr,
+		.daddr = *daddr,
+		.seed = crypto_seed++,
+		.sport = sport,
+		.dport = dport,
+	};
+
+	return hsiphash(&combined, offsetofend(typeof(combined), dport),
+			&crypto_nonce_secret);
+}
+
+u64 crypto_v6_get_key(const struct in6_addr *saddr,
+		      const struct in6_addr *daddr,
+		      __be16 sport, __be16 dport)
+{
+	const struct {
+		struct in6_addr saddr;
+		struct in6_addr daddr;
+		u32 seed;
+		__be16 sport;
+		__be16 dport;
+	} __aligned(SIPHASH_ALIGNMENT) combined = {
+		.saddr = *saddr,
+		.daddr = *daddr,
+		.seed = crypto_seed++,
+		.sport = sport,
+		.dport = dport,
+	};
+
+	return siphash(&combined, offsetofend(typeof(combined), dport),
+		       &crypto_key_secret);
+}
+
+void crypto_key_sha1(u64 key, u32 *token, u64 *idsn)
+{
+	u32 workspace[SHA_WORKSPACE_WORDS];
+	u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
+	u8 input[64];
+
+	memset(workspace, 0, sizeof(workspace));
+
+	/* Initialize input with appropriate padding */
+	memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
+						   * is explicitly set too
+						   */
+	put_unaligned_be64(key, input);
+	input[8] = 0x80; /* Padding: First bit after message = 1 */
+	input[63] = 0x40; /* Padding: Length of the message = 64 bits */
+
+	sha_init(mptcp_hashed_key);
+	sha_transform(mptcp_hashed_key, input, workspace);
+
+	if (token)
+		*token = mptcp_hashed_key[0];
+	if (idsn)
+		*idsn = ((u64)mptcp_hashed_key[3] << 32) + mptcp_hashed_key[4];
+}
+
+void crypto_hmac_sha1(u64 key1, u64 key2, u32 *hash_out,
+		      int arg_num, ...)
+{
+	u32 workspace[SHA_WORKSPACE_WORDS];
+	u8 input[128]; /* two 512-bit blocks */
+	int i;
+	int index;
+	int length;
+	u8 *msg;
+	va_list list;
+	u8 key_1[8];
+	u8 key_2[8];
+
+	memset(workspace, 0, sizeof(workspace));
+
+	put_unaligned_be64(key1, key_1);
+	put_unaligned_be64(key2, key_2);
+
+	/* Generate key xored with ipad */
+	memset(input, 0x36, 64);
+	for (i = 0; i < 8; i++)
+		input[i] ^= key_1[i];
+	for (i = 0; i < 8; i++)
+		input[i + 8] ^= key_2[i];
+
+	va_start(list, arg_num);
+	index = 64;
+	for (i = 0; i < arg_num; i++) {
+		length = va_arg(list, int);
+		msg = va_arg(list, u8 *);
+		WARN_ON(index + length > 125); /* Message is too long */
+		memcpy(&input[index], msg, length);
+		index += length;
+	}
+	va_end(list);
+
+	input[index] = 0x80; /* Padding: First bit after message = 1 */
+	memset(&input[index + 1], 0, (126 - index));
+
+	/* Padding: Length of the message = 512 + message length (bits) */
+	input[126] = 0x02;
+	input[127] = ((index - 64) * 8); /* Message length (bits) */
+
+	sha_init(hash_out);
+	sha_transform(hash_out, input, workspace);
+	memset(workspace, 0, sizeof(workspace));
+
+	sha_transform(hash_out, &input[64], workspace);
+	memset(workspace, 0, sizeof(workspace));
+
+	for (i = 0; i < 5; i++)
+		hash_out[i] = (__force u32)cpu_to_be32(hash_out[i]);
+
+	/* Prepare second part of hmac */
+	memset(input, 0x5C, 64);
+	for (i = 0; i < 8; i++)
+		input[i] ^= key_1[i];
+	for (i = 0; i < 8; i++)
+		input[i + 8] ^= key_2[i];
+
+	memcpy(&input[64], hash_out, 20);
+	input[84] = 0x80;
+	memset(&input[85], 0, 41);
+
+	/* Padding: Length of the message = 512 + 160 bits */
+	input[126] = 0x02;
+	input[127] = 0xA0;
+
+	sha_init(hash_out);
+	sha_transform(hash_out, input, workspace);
+	memset(workspace, 0, sizeof(workspace));
+
+	sha_transform(hash_out, &input[64], workspace);
+
+	for (i = 0; i < 5; i++)
+		hash_out[i] = (__force u32)cpu_to_be32(hash_out[i]);
+}
+
+void crypto_init(void)
+{
+	get_random_bytes((void *)&crypto_key_secret,
+			 sizeof(crypto_key_secret));
+	get_random_bytes((void *)&crypto_nonce_secret,
+			 sizeof(crypto_nonce_secret));
+	crypto_seed = 0;
+}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ea771f537ac0..2f340ef8e281 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -111,6 +111,9 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 	if (subflow->mp_capable) {
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
+		msk->token = subflow->token;
+		pr_debug("token=%u", msk->token);
+		token_update_accept(new_sock->sk, new_mptcp_sock->sk);
 		msk->connection_list = new_sock;
 	} else {
 		msk->subflow = new_sock;
@@ -119,6 +122,15 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 	return new_mptcp_sock->sk;
 }
 
+static void mptcp_destroy(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	pr_debug("msk=%p, subflow=%p", sk, msk->subflow->sk);
+
+	token_destroy(msk->token);
+}
+
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -136,6 +148,8 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 	if (mp_capable) {
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
+		msk->token = subflow->token;
+		pr_debug("token=%u", msk->token);
 		msk->connection_list = msk->subflow;
 		msk->subflow = NULL;
 	}
@@ -149,6 +163,7 @@ static struct proto mptcp_prot = {
 	.close		= mptcp_close,
 	.accept		= mptcp_accept,
 	.shutdown	= tcp_shutdown,
+	.destroy	= mptcp_destroy,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
 	.hash		= inet_hash,
@@ -302,6 +317,8 @@ void __init mptcp_init(void)
 	mptcp_stream_ops.getname = mptcp_getname;
 	mptcp_stream_ops.listen = mptcp_listen;
 
+	token_init();
+	crypto_init();
 	subflow_init();
 
 	if (proto_register(&mptcp_prot, 1) != 0)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 34eb10c279f0..5a8ed316d70e 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -35,6 +35,7 @@ struct mptcp_sock {
 	struct	inet_connection_sock sk;
 	u64	local_key;
 	u64	remote_key;
+	u32	token;
 	struct	socket *connection_list; /* @@ needs to be a list */
 	struct	socket *subflow; /* outgoing connect, listener or !mp_capable */
 };
@@ -53,6 +54,7 @@ struct subflow_request_sock {
 		version : 4;
 	u64	local_key;
 	u64	remote_key;
+	u32	token;
 };
 
 static inline
@@ -65,6 +67,7 @@ struct subflow_request_sock *subflow_rsk(const struct request_sock *rsk)
 struct subflow_context {
 	u64	local_key;
 	u64	remote_key;
+	u32	token;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
 		mp_capable : 1,	    /* remote is MPTCP capable */
@@ -97,4 +100,27 @@ void mptcp_get_options(const struct sk_buff *skb,
 
 void mptcp_finish_connect(struct sock *sk, int mp_capable);
 
+void token_init(void);
+void token_new_request(struct request_sock *req, const struct sk_buff *skb);
+void token_destroy_request(u32 token);
+void token_new_connect(struct sock *sk);
+void token_new_accept(struct sock *sk);
+void token_update_accept(struct sock *sk, struct sock *conn);
+void token_destroy(u32 token);
+
+void crypto_init(void);
+u32 crypto_v4_get_nonce(__be32 saddr, __be32 daddr,
+			__be16 sport, __be16 dport);
+u64 crypto_v4_get_key(__be32 saddr, __be32 daddr,
+		      __be16 sport, __be16 dport);
+u64 crypto_v6_get_key(const struct in6_addr *saddr,
+		      const struct in6_addr *daddr,
+		      __be16 sport, __be16 dport);
+u32 crypto_v6_get_nonce(const struct in6_addr *saddr,
+			const struct in6_addr *daddr,
+			__be16 sport, __be16 dport);
+void crypto_key_sha1(u64 key, u32 *token, u64 *idsn);
+void crypto_hmac_sha1(u64 key1, u64 key2, u32 *hash_out,
+		      int arg_num, ...);
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index fd2bf7621f0e..abae6a42a101 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -15,6 +15,29 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static int subflow_rebuild_header(struct sock *sk)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	if (subflow->request_mptcp && !subflow->token) {
+		pr_debug("subflow=%p", sk);
+		token_new_connect(sk);
+	}
+
+	return inet_sk_rebuild_header(sk);
+}
+
+static void subflow_req_destructor(struct request_sock *req)
+{
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+
+	pr_debug("subflow_req=%p", subflow_req);
+
+	if (subflow_req->mp_capable)
+		token_destroy_request(subflow_req->token);
+	tcp_request_sock_ops.destructor(req);
+}
+
 static void subflow_v4_init_req(struct request_sock *req,
 				const struct sock *sk_listener,
 				struct sk_buff *skb)
@@ -41,6 +64,8 @@ static void subflow_v4_init_req(struct request_sock *req,
 		    listener->request_cksum)
 			subflow_req->checksum = 1;
 		subflow_req->remote_key = rx_opt.mptcp.sndr_key;
+		pr_debug("remote_key=%llu", subflow_req->remote_key);
+		token_new_request(req, skb);
 	} else {
 		subflow_req->mp_capable = 0;
 	}
@@ -107,12 +132,16 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 	child = tcp_v4_syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
 
 	if (child && *own_req) {
-		if (!subflow_ctx(child)) {
+		struct subflow_context *ctx = subflow_ctx(child);
+
+		if (!ctx) {
 			pr_debug("Closing child socket");
 			inet_sk_set_state(child, TCP_CLOSE);
 			sock_set_flag(child, SOCK_DEAD);
 			inet_csk_destroy_sock(child);
 			child = NULL;
+		} else if (ctx->mp_capable) {
+			token_new_accept(child);
 		}
 	}
 
@@ -193,6 +222,8 @@ static void subflow_ulp_clone(const struct request_sock *req,
 		subflow->fourth_ack = 1;
 		subflow->remote_key = subflow_req->remote_key;
 		subflow->local_key = subflow_req->local_key;
+		subflow->token = subflow_req->token;
+		pr_debug("token=%u", subflow->token);
 	}
 }
 
@@ -217,6 +248,8 @@ static int subflow_ops_init(struct request_sock_ops *subflow_ops)
 	if (!subflow_ops->slab)
 		return -ENOMEM;
 
+	subflow_ops->destructor = subflow_req_destructor;
+
 	return 0;
 }
 
@@ -233,6 +266,7 @@ void subflow_init(void)
 	subflow_specific.conn_request = subflow_conn_request;
 	subflow_specific.syn_recv_sock = subflow_syn_recv_sock;
 	subflow_specific.sk_rx_dst_set = subflow_finish_connect;
+	subflow_specific.rebuild_header = subflow_rebuild_header;
 
 	if (tcp_register_ulp(&subflow_ulp_ops) != 0)
 		panic("MPTCP: failed to register subflows to ULP\n");
diff --git a/net/mptcp/token.c b/net/mptcp/token.c
new file mode 100644
index 000000000000..8c15b8134f70
--- /dev/null
+++ b/net/mptcp/token.c
@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP token management
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ *
+ * Note: This code is based on mptcp_ctrl.c from multipath-tcp.org,
+ *       authored by:
+ *
+ *       Sébastien Barré <sebastien.barre@uclouvain.be>
+ *       Christoph Paasch <christoph.paasch@uclouvain.be>
+ *       Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
+ *       Gregory Detal <gregory.detal@uclouvain.be>
+ *       Fabien Duchêne <fabien.duchene@uclouvain.be>
+ *       Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
+ *       Lavkesh Lahngir <lavkesh51@gmail.com>
+ *       Andreas Ripke <ripke@neclab.eu>
+ *       Vlad Dogaru <vlad.dogaru@intel.com>
+ *       Octavian Purdila <octavian.purdila@intel.com>
+ *       John Ronan <jronan@tssg.org>
+ *       Catalin Nicutar <catalin.nicutar@gmail.com>
+ *       Brandon Heller <brandonh@stanford.edu>
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/radix-tree.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <net/sock.h>
+#include <net/inet_common.h>
+#include <net/protocol.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+static struct radix_tree_root token_tree;
+static struct radix_tree_root token_req_tree;
+static spinlock_t token_tree_lock;
+static int token_used;
+
+static bool find_req_token(u32 token)
+{
+	void *used;
+
+	pr_debug("token=%u", token);
+	used = radix_tree_lookup(&token_req_tree, token);
+	return used;
+}
+
+static bool find_token(u32 token)
+{
+	void *used;
+
+	pr_debug("token=%u", token);
+	used = radix_tree_lookup(&token_tree, token);
+	return used;
+}
+
+static void new_req_token(struct request_sock *req,
+			  const struct sk_buff *skb)
+{
+	const struct inet_request_sock *ireq = inet_rsk(req);
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+	u64 local_key;
+
+	if (!IS_ENABLED(CONFIG_IPV6) || skb->protocol == htons(ETH_P_IP)) {
+		local_key = crypto_v4_get_key(ip_hdr(skb)->saddr,
+					      ip_hdr(skb)->daddr,
+					      htons(ireq->ir_num),
+					      ireq->ir_rmt_port);
+#if IS_ENABLED(CONFIG_IPV6)
+	} else {
+		local_key = crypto_v6_get_key(&ipv6_hdr(skb)->saddr,
+					      &ipv6_hdr(skb)->daddr,
+					      htons(ireq->ir_num),
+					      ireq->ir_rmt_port);
+#endif
+	}
+	pr_debug("local_key=%llu:%llx", local_key, local_key);
+	subflow_req->local_key = local_key;
+	crypto_key_sha1(subflow_req->local_key, &subflow_req->token, NULL);
+	pr_debug("token=%u", subflow_req->token);
+}
+
+static void new_token(const struct sock *sk)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+	const struct inet_sock *isk = inet_sk(sk);
+
+	if (sk->sk_family == AF_INET) {
+		subflow->local_key = crypto_v4_get_key(isk->inet_saddr,
+						       isk->inet_daddr,
+						       isk->inet_sport,
+						       isk->inet_dport);
+#if IS_ENABLED(CONFIG_IPV6)
+	} else {
+		subflow->local_key = crypto_v6_get_key(&inet6_sk(sk)->saddr,
+						       &sk->sk_v6_daddr,
+						       isk->inet_sport,
+						       isk->inet_dport);
+#endif
+	}
+	pr_debug("local_key=%llu:%llx", subflow->local_key, subflow->local_key);
+	crypto_key_sha1(subflow->local_key, &subflow->token, NULL);
+	pr_debug("token=%u", subflow->token);
+}
+
+static int insert_req_token(u32 token)
+{
+	void *used = &token_used;
+
+	pr_debug("token=%u", token);
+	return radix_tree_insert(&token_req_tree, token, used);
+}
+
+static int insert_token(u32 token, void *conn)
+{
+	void *used = &token_used;
+
+	if (conn)
+		used = conn;
+
+	pr_debug("token=%u, conn=%p", token, used);
+	return radix_tree_insert(&token_tree, token, used);
+}
+
+static void update_token(u32 token, void *conn)
+{
+	void **slot;
+
+	pr_debug("token=%u, conn=%p", token, conn);
+	slot = radix_tree_lookup_slot(&token_tree, token);
+	if (slot) {
+		if (*slot != &token_used)
+			pr_err("slot ALREADY updated!");
+		*slot = conn;
+	} else {
+		pr_warn("token NOT FOUND!");
+	}
+}
+
+static void destroy_req_token(u32 token)
+{
+	void *cur;
+
+	cur = radix_tree_delete(&token_req_tree, token);
+	if (!cur)
+		pr_warn("token NOT FOUND!");
+}
+
+static struct sock *destroy_token(u32 token)
+{
+	void *conn;
+
+	pr_debug("token=%u", token);
+	conn = radix_tree_delete(&token_tree, token);
+	if (conn && conn != &token_used)
+		return (struct sock *)conn;
+	return NULL;
+}
+
+/* create new local key, idsn, and token for subflow_request */
+void token_new_request(struct request_sock *req,
+		       const struct sk_buff *skb)
+{
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+
+	pr_debug("subflow_req=%p", req);
+	while (1) {
+		new_req_token(req, skb);
+		spin_lock_bh(&token_tree_lock);
+		if (!find_req_token(subflow_req->token) &&
+		    !find_token(subflow_req->token))
+			break;
+		spin_unlock_bh(&token_tree_lock);
+	}
+	insert_req_token(subflow_req->token);
+	spin_unlock_bh(&token_tree_lock);
+}
+
+/* create new local key, idsn, and token for subflow */
+void token_new_connect(struct sock *sk)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	pr_debug("subflow=%p", sk);
+
+	while (1) {
+		new_token(sk);
+		spin_lock_bh(&token_tree_lock);
+		if (!find_req_token(subflow->token) &&
+		    !find_token(subflow->token))
+			break;
+		spin_unlock_bh(&token_tree_lock);
+	}
+	insert_token(subflow->token, subflow->conn);
+	sock_hold(subflow->conn);
+	spin_unlock_bh(&token_tree_lock);
+}
+
+void token_new_accept(struct sock *sk)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	pr_debug("subflow=%p", sk);
+
+	spin_lock_bh(&token_tree_lock);
+	insert_token(subflow->token, NULL);
+	spin_unlock_bh(&token_tree_lock);
+}
+
+void token_update_accept(struct sock *sk, struct sock *conn)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	pr_debug("subflow=%p, conn=%p", sk, conn);
+
+	spin_lock_bh(&token_tree_lock);
+	update_token(subflow->token, conn);
+	sock_hold(conn);
+	spin_unlock_bh(&token_tree_lock);
+}
+
+void token_destroy_request(u32 token)
+{
+	pr_debug("token=%u", token);
+
+	spin_lock_bh(&token_tree_lock);
+	destroy_req_token(token);
+	spin_unlock_bh(&token_tree_lock);
+}
+
+void token_destroy(u32 token)
+{
+	struct sock *conn;
+
+	pr_debug("token=%u", token);
+	spin_lock_bh(&token_tree_lock);
+	conn = destroy_token(token);
+	if (conn)
+		sock_put(conn);
+	spin_unlock_bh(&token_tree_lock);
+}
+
+void token_init(void)
+{
+	INIT_RADIX_TREE(&token_tree, GFP_ATOMIC);
+	INIT_RADIX_TREE(&token_req_tree, GFP_ATOMIC);
+	spin_lock_init(&token_tree_lock);
+}
-- 
2.22.0



* [RFC PATCH net-next 12/33] mptcp: Add shutdown() socket operation
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (10 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 11/33] mptcp: Add key generation and token tree Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 13/33] mptcp: Add setsockopt()/getsockopt() socket operations Mat Martineau
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 net/mptcp/protocol.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2f340ef8e281..6596e594fa5f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -296,6 +296,26 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 	return tcp_poll(file, msk->connection_list, wait);
 }
 
+static int mptcp_shutdown(struct socket *sock, int how)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	int ret = 0;
+
+	pr_debug("sk=%p, how=%d", msk, how);
+
+	if (msk->subflow) {
+		pr_debug("subflow=%p", msk->subflow->sk);
+		ret = kernel_sock_shutdown(msk->subflow, how);
+	}
+
+	if (msk->connection_list) {
+		pr_debug("conn_list->subflow=%p", msk->connection_list->sk);
+		ret = kernel_sock_shutdown(msk->connection_list, how);
+	}
+
+	return ret;
+}
+
 static struct proto_ops mptcp_stream_ops;
 
 static struct inet_protosw mptcp_protosw = {
@@ -316,6 +336,7 @@ void __init mptcp_init(void)
 	mptcp_stream_ops.accept = mptcp_stream_accept;
 	mptcp_stream_ops.getname = mptcp_getname;
 	mptcp_stream_ops.listen = mptcp_listen;
+	mptcp_stream_ops.shutdown = mptcp_shutdown;
 
 	token_init();
 	crypto_init();
-- 
2.22.0



* [RFC PATCH net-next 13/33] mptcp: Add setsockopt()/getsockopt() socket operations
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (11 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 12/33] mptcp: Add shutdown() socket operation Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 14/33] tcp: clean ext on tx recycle Mat Martineau
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 net/mptcp/protocol.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6596e594fa5f..3215601b9c43 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -131,6 +131,52 @@ static void mptcp_destroy(struct sock *sk)
 	token_destroy(msk->token);
 }
 
+static int mptcp_setsockopt(struct sock *sk, int level, int optname,
+			    char __user *uoptval, unsigned int optlen)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *subflow;
+	char __kernel *optval;
+
+	pr_debug("msk=%p", msk);
+	if (msk->connection_list) {
+		subflow = msk->connection_list;
+		pr_debug("conn_list->subflow=%p", subflow_ctx(subflow->sk));
+	} else {
+		subflow = msk->subflow;
+		pr_debug("subflow=%p", subflow_ctx(subflow->sk));
+	}
+
+	/* will be treated as __user in tcp_setsockopt */
+	optval = (char __kernel __force *)uoptval;
+
+	return kernel_setsockopt(subflow, level, optname, optval, optlen);
+}
+
+static int mptcp_getsockopt(struct sock *sk, int level, int optname,
+			    char __user *uoptval, int __user *uoption)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *subflow;
+	char __kernel *optval;
+	int __kernel *option;
+
+	pr_debug("msk=%p", msk);
+	if (msk->connection_list) {
+		subflow = msk->connection_list;
+		pr_debug("conn_list->subflow=%p", subflow_ctx(subflow->sk));
+	} else {
+		subflow = msk->subflow;
+		pr_debug("subflow=%p", subflow_ctx(subflow->sk));
+	}
+
+	/* will be treated as __user in tcp_getsockopt */
+	optval = (char __kernel __force *)uoptval;
+	option = (int __kernel __force *)uoption;
+
+	return kernel_getsockopt(subflow, level, optname, optval, option);
+}
+
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -162,6 +208,8 @@ static struct proto mptcp_prot = {
 	.init		= mptcp_init_sock,
 	.close		= mptcp_close,
 	.accept		= mptcp_accept,
+	.setsockopt	= mptcp_setsockopt,
+	.getsockopt	= mptcp_getsockopt,
 	.shutdown	= tcp_shutdown,
 	.destroy	= mptcp_destroy,
 	.sendmsg	= mptcp_sendmsg,
-- 
2.22.0



* [RFC PATCH net-next 14/33] tcp: clean ext on tx recycle
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (12 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 13/33] mptcp: Add setsockopt()/getsockopt() socket operations Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 15/33] mptcp: Add MPTCP to skb extensions Mat Martineau
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

Otherwise we will find stray, unexpected, or old extension
values on the next iteration.

In tcp_write_xmit() we can end up splitting an already queued
skb in two parts via tso_fragment(). The newly created skb
can be allocated from the tx cache, and the MPTCP stack will not
be aware of it, so nobody sets the MPTCP ext properly.

As a result, we transmit the skb using an old MPTCP DSS map,
which confuses the rx side and corrupts the stream. This requires
some concurrent conditions, so it is not deterministic.

Resetting the ext on recycle fixes all the current MPTCP self-test
issues.

Apparently only MPTCP has issues with this kind of stray ext,
so an alternative would be to add an additional MPTCP hook in
tso_fragment() or in sk_stream_alloc_skb() to always init
the ext.
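
The failure mode can be modelled in userspace. The following is only a
sketch under stated assumptions, not kernel code: toy_skb, toy_sock and
the toy_* helpers are hypothetical stand-ins for sk_buff,
sk->sk_tx_skb_cache, skb_ext_clear(), sk_wmem_free_skb() and the
cache-backed allocation path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an sk_buff carrying an MPTCP extension. */
struct toy_skb {
	bool has_ext;		/* models skb->active_extensions */
	unsigned int map_seq;	/* models the DSS mapping stored in the ext */
};

/* Hypothetical stand-in for a socket with a one-slot tx skb cache. */
struct toy_sock {
	struct toy_skb *tx_cache;	/* models sk->sk_tx_skb_cache */
};

/* Models skb_ext_clear(): drop any extension state before reuse. */
void toy_ext_clear(struct toy_skb *skb)
{
	assert(skb);
	if (skb->has_ext) {
		skb->has_ext = false;
		skb->map_seq = 0;
	}
}

/* Models sk_wmem_free_skb(): park the skb in the cache if it is empty.
 * The 'clear' flag selects the behaviour with or without this patch.
 */
void toy_free_skb(struct toy_sock *sk, struct toy_skb *skb, bool clear)
{
	if (!sk->tx_cache) {
		if (clear)
			toy_ext_clear(skb);
		sk->tx_cache = skb;
	}
}

/* Models an allocation served from the cache, e.g. for tso_fragment(). */
struct toy_skb *toy_alloc_skb(struct toy_sock *sk)
{
	struct toy_skb *skb = sk->tx_cache;

	sk->tx_cache = NULL;
	return skb;
}
```

Without the clear, the recycled buffer still reports an extension and
the previous mapping; with it, the buffer comes back with no extension
state, which is what the MPTCP stack expects from a fresh skb.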

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h | 8 ++++++++
 include/net/sock.h     | 1 +
 2 files changed, 9 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 28bdaf978e72..37387ab9f336 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4024,6 +4024,14 @@ static inline void skb_ext_put(struct sk_buff *skb)
 		__skb_ext_put(skb->extensions);
 }
 
+static inline void skb_ext_clear(struct sk_buff *skb)
+{
+	if (skb->active_extensions) {
+		__skb_ext_put(skb->extensions);
+		skb->active_extensions = 0;
+	}
+}
+
 static inline void __skb_ext_copy(struct sk_buff *dst,
 				  const struct sk_buff *src)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index e9d769c04637..bfa695716721 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1469,6 +1469,7 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 	sk->sk_wmem_queued -= skb->truesize;
 	sk_mem_uncharge(sk, skb->truesize);
 	if (!sk->sk_tx_skb_cache && !skb_cloned(skb)) {
+		skb_ext_clear(skb);
 		skb_zcopy_clear(skb, true);
 		sk->sk_tx_skb_cache = skb;
 		return;
-- 
2.22.0



* [RFC PATCH net-next 15/33] mptcp: Add MPTCP to skb extensions
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (13 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 14/33] tcp: clean ext on tx recycle Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 16/33] tcp: Prevent coalesce/collapse when skb has MPTCP extensions Mat Martineau
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Add an skb_ext_id enum value for MPTCP and make CONFIG_MPTCP select
SKB_EXTENSIONS.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/linux/skbuff.h |  3 +++
 include/net/mptcp.h    | 16 ++++++++++++++++
 net/core/skbuff.c      |  7 +++++++
 net/mptcp/Kconfig      |  1 +
 4 files changed, 27 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 37387ab9f336..5de91656f8ea 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3993,6 +3993,9 @@ enum skb_ext_id {
 #endif
 #ifdef CONFIG_XFRM
 	SKB_EXT_SEC_PATH,
+#endif
+#if IS_ENABLED(CONFIG_MPTCP)
+	SKB_EXT_MPTCP,
 #endif
 	SKB_EXT_NUM, /* must be last */
 };
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index e7cae0f4404a..805206359135 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,6 +8,22 @@
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
+/* MPTCP sk_buff extension data */
+struct mptcp_ext {
+	u64		data_ack;
+	u64		data_seq;
+	u32		subflow_seq;
+	u16		data_len;
+	__sum16		checksum;
+	u8		use_map:1,
+			dsn64:1,
+			use_checksum:1,
+			data_fin:1,
+			use_ack:1,
+			ack64:1,
+			__unused:2;
+};
+
 /* MPTCP option subtypes */
 #define OPTION_MPTCP_MPC_SYN	BIT(0)
 #define OPTION_MPTCP_MPC_SYNACK	BIT(1)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bab9484f1631..826d81ef32f5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -66,6 +66,7 @@
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
 #include <net/xfrm.h>
+#include <net/mptcp.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -3979,6 +3980,9 @@ static const u8 skb_ext_type_len[] = {
 #ifdef CONFIG_XFRM
 	[SKB_EXT_SEC_PATH] = SKB_EXT_CHUNKSIZEOF(struct sec_path),
 #endif
+#if IS_ENABLED(CONFIG_MPTCP)
+	[SKB_EXT_MPTCP] = SKB_EXT_CHUNKSIZEOF(struct mptcp_ext),
+#endif
 };
 
 static __always_inline unsigned int skb_ext_total_length(void)
@@ -3989,6 +3993,9 @@ static __always_inline unsigned int skb_ext_total_length(void)
 #endif
 #ifdef CONFIG_XFRM
 		skb_ext_type_len[SKB_EXT_SEC_PATH] +
+#endif
+#if IS_ENABLED(CONFIG_MPTCP)
+		skb_ext_type_len[SKB_EXT_MPTCP] +
 #endif
 		0;
 }
diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
index d87dfdc210cc..f21190a4f7e9 100644
--- a/net/mptcp/Kconfig
+++ b/net/mptcp/Kconfig
@@ -2,6 +2,7 @@
 config MPTCP
 	bool "Multipath TCP"
 	depends on INET
+	select SKB_EXTENSIONS
 	help
 	  Multipath TCP (MPTCP) connections send and receive data over multiple
 	  subflows in order to utilize multiple network paths. Each subflow
-- 
2.22.0



* [RFC PATCH net-next 16/33] tcp: Prevent coalesce/collapse when skb has MPTCP extensions
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (14 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 15/33] mptcp: Add MPTCP to skb extensions Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 17/33] tcp: Export low-level TCP functions Mat Martineau
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

The MPTCP extension data needs to be preserved as it passes through the
TCP stack. Make sure that these skbs are not appended to others during
coalesce or collapse, so the data remains associated with the payload of
the given skb.
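
The resulting rule can be sketched in userspace as a small predicate.
This is an illustration only: toy_skb and the toy_* functions are
hypothetical stand-ins for sk_buff, tcp_skb_can_collapse_to() and the
new tcp_skb_can_collapse().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two sk_buff properties that matter here. */
struct toy_skb {
	bool eor;	/* models TCP_SKB_CB(skb)->eor */
	bool has_mptcp;	/* models skb_ext_exist(skb, SKB_EXT_MPTCP) */
};

/* Models tcp_skb_can_collapse_to(): nothing may be appended after EOR. */
bool toy_can_collapse_to(const struct toy_skb *to)
{
	return !to->eor;
}

/* Models tcp_skb_can_collapse(): additionally refuse to append an skb
 * that carries its own MPTCP extension, since its metadata would be lost.
 */
bool toy_can_collapse(const struct toy_skb *to, const struct toy_skb *from)
{
	return toy_can_collapse_to(to) && !from->has_mptcp;
}
```

Note the asymmetry: only the skb being appended ("from") is checked for
an MPTCP extension. Data without an extension may still be appended to
an skb that has one.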

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/mptcp.h   | 10 ++++++++++
 include/net/tcp.h     |  8 ++++++++
 net/ipv4/tcp_input.c  | 10 ++++++++--
 net/ipv4/tcp_output.c |  2 +-
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 805206359135..30cfa473e8bf 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -61,6 +61,11 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 bool mptcp_established_options(struct sock *sk, unsigned int *size,
 			       struct mptcp_out_options *opts);
 
+static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
+{
+	return skb_ext_exist(skb, SKB_EXT_MPTCP);
+}
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts);
 
 #else
@@ -108,5 +113,10 @@ static inline bool mptcp_established_options(struct sock *sk,
 	return false;
 }
 
+static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
+{
+	return false;
+}
+
 #endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 23995f8c11fa..af13d91f4a0f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -39,6 +39,7 @@
 #include <net/tcp_states.h>
 #include <net/inet_ecn.h>
 #include <net/dst.h>
+#include <net/mptcp.h>
 
 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
@@ -949,6 +950,13 @@ static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
 	return likely(!TCP_SKB_CB(skb)->eor);
 }
 
+static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
+					const struct sk_buff *from)
+{
+	return likely(tcp_skb_can_collapse_to(to) &&
+		      !mptcp_skb_ext_exist(from));
+}
+
 /* Events passed to congestion control interface */
 enum tcp_ca_event {
 	CA_EVENT_TX_START,	/* first transmit when no packets in flight */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 240eb75c7b84..5e634fdd8e1c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1402,7 +1402,7 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
 	if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED)
 		goto fallback;
 
-	if (!tcp_skb_can_collapse_to(prev))
+	if (!tcp_skb_can_collapse(prev, skb))
 		goto fallback;
 
 	in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
@@ -4362,6 +4362,9 @@ static bool tcp_try_coalesce(struct sock *sk,
 	if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq)
 		return false;
 
+	if (mptcp_skb_ext_exist(from))
+		return false;
+
 #ifdef CONFIG_TLS_DEVICE
 	if (from->decrypted != to->decrypted)
 		return false;
@@ -4869,10 +4872,12 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 
 		/* The first skb to collapse is:
 		 * - not SYN/FIN and
+		 * - does not include an MPTCP skb extension
 		 * - bloated or contains data before "start" or
 		 *   overlaps to the next one.
 		 */
 		if (!(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) &&
+		    !mptcp_skb_ext_exist(skb) &&
 		    (tcp_win_from_space(sk, skb->truesize) > skb->len ||
 		     before(TCP_SKB_CB(skb)->seq, start))) {
 			end_of_skbs = false;
@@ -4888,7 +4893,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 		/* Decided to skip this, advance start seq. */
 		start = TCP_SKB_CB(skb)->end_seq;
 	}
-	if (end_of_skbs ||
+	if (end_of_skbs || mptcp_skb_ext_exist(skb) ||
 	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 		return;
 
@@ -4931,6 +4936,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 				skb = tcp_collapse_one(sk, skb, list, root);
 				if (!skb ||
 				    skb == tail ||
+				    mptcp_skb_ext_exist(skb) ||
 				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 					goto end;
 #ifdef CONFIG_TLS_DEVICE
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a41ba69760f1..4e49b2c40820 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2882,7 +2882,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 		if (!tcp_can_collapse(sk, skb))
 			break;
 
-		if (!tcp_skb_can_collapse_to(to))
+		if (!tcp_skb_can_collapse(to, skb))
 			break;
 
 		space -= skb->len;
-- 
2.22.0



* [RFC PATCH net-next 17/33] tcp: Export low-level TCP functions
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (15 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 16/33] tcp: Prevent coalesce/collapse when skb has MPTCP extensions Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 18/33] mptcp: Write MPTCP DSS headers to outgoing data packets Mat Martineau
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

MPTCP will make use of tcp_send_mss() and tcp_push() when sending
data to specific TCP subflows.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/tcp.h | 3 +++
 net/ipv4/tcp.c    | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index af13d91f4a0f..a937ee196eba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -327,6 +327,9 @@ int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
 			size_t size, int flags);
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags);
+int tcp_send_mss(struct sock *sk, int *size_goal, int flags);
+void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle,
+	      int size_goal);
 void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 866c985a0c04..675806689b68 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -686,8 +686,8 @@ static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
 	       refcount_read(&sk->sk_wmem_alloc) > skb->truesize;
 }
 
-static void tcp_push(struct sock *sk, int flags, int mss_now,
-		     int nonagle, int size_goal)
+void tcp_push(struct sock *sk, int flags, int mss_now,
+	      int nonagle, int size_goal)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
@@ -921,7 +921,7 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	return max(size_goal, mss_now);
 }
 
-static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 {
 	int mss_now;
 
-- 
2.22.0



* [RFC PATCH net-next 18/33] mptcp: Write MPTCP DSS headers to outgoing data packets
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (16 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 17/33] tcp: Export low-level TCP functions Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 19/33] mptcp: Implement MPTCP receive path Mat Martineau
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Per-packet metadata required to write the MPTCP DSS option is written to
the skb_ext area. One write to the socket may contain more than one
packet of data, in which case the DSS option in the first packet will
have a mapping covering all of the data in that write. Packets after the
first do not have a DSS option. This is complicated to handle under
memory pressure, since the first packet (with the DSS mapping) is pushed
to the TCP core before the remaining skbs are allocated.

The current implementation is limited. It will only send up to one page
of data. The number of bytes sent is returned so the caller knows which
bytes were sent and which were not. More work is required to ensure that
it works correctly with full buffers or under memory pressure.

The MPTCP DSS checksum is not yet implemented.
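
As a worked example of the option layout, the following userspace
sketch computes the DSS option length, the option space it consumes
after 4-byte alignment, and the first 32-bit word in host byte order
(before htonl()). The length and flag values mirror the constants
added by this patch. The option kind (30) and the DSS subtype (2) are
taken from RFC 6824 and are assumed to match TCPOPT_MPTCP and
MPTCPOPT_DSS as defined elsewhere in the series. All TOY_* and toy_*
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TCPOPT_MPTCP	30	/* option kind, per RFC 6824 */
#define TOY_MPTCPOPT_DSS	2	/* DSS subtype, per RFC 6824 */

#define TOY_DSS_BASE		4	/* kind, length, subtype, flags */
#define TOY_DSS_ACK64		8	/* 64-bit data ack */
#define TOY_DSS_MAP64		14	/* 64-bit DSN, subflow seq, data len */
#define TOY_DSS_CHECKSUM	2

#define TOY_DSS_DATA_FIN	(1u << 4)
#define TOY_DSS_DSN64		(1u << 3)
#define TOY_DSS_HAS_MAP		(1u << 2)
#define TOY_DSS_ACK64_FLAG	(1u << 1)
#define TOY_DSS_HAS_ACK		(1u << 0)

/* Value of the option's length field. */
unsigned int toy_dss_len(bool use_ack, bool use_map, bool use_csum)
{
	unsigned int len = TOY_DSS_BASE;

	if (use_ack)
		len += TOY_DSS_ACK64;
	if (use_map) {
		len += TOY_DSS_MAP64;
		if (use_csum)
			len += TOY_DSS_CHECKSUM;
	}
	return len;
}

/* Option space consumed in the TCP header, rounded up to 4 bytes. */
unsigned int toy_dss_space(unsigned int len)
{
	return (len + 3u) & ~3u;
}

/* First option word in host byte order: kind, length, subtype, flags. */
uint32_t toy_dss_first_word(unsigned int len, bool use_ack, bool use_map,
			    bool data_fin)
{
	uint32_t flags = 0;

	if (use_ack)
		flags |= TOY_DSS_HAS_ACK | TOY_DSS_ACK64_FLAG;
	if (use_map) {
		flags |= TOY_DSS_HAS_MAP | TOY_DSS_DSN64;
		if (data_fin)
			flags |= TOY_DSS_DATA_FIN;
	}
	return ((uint32_t)TOY_TCPOPT_MPTCP << 24) | ((uint32_t)len << 16) |
	       ((uint32_t)TOY_MPTCPOPT_DSS << 12) | flags;
}
```

With both a 64-bit ack and a 64-bit mapping and no checksum, the length
field is 26 and the option occupies 28 bytes. The two padding bytes
correspond to the NOPs written in place of the checksum.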

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/net/mptcp.h   |  14 ++++-
 net/ipv4/tcp_output.c |  11 ++--
 net/mptcp/options.c   | 143 +++++++++++++++++++++++++++++++++++++++++-
 net/mptcp/protocol.c  | 117 ++++++++++++++++++++++++++++++----
 net/mptcp/protocol.h  |  18 +++++-
 net/mptcp/subflow.c   |   2 +
 net/mptcp/token.c     |  13 ++--
 7 files changed, 289 insertions(+), 29 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 30cfa473e8bf..003150a8e406 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,6 +8,14 @@
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
+/* MPTCP DSS flags */
+
+#define MPTCP_DSS_DATA_FIN	BIT(4)
+#define MPTCP_DSS_DSN64		BIT(3)
+#define MPTCP_DSS_HAS_MAP	BIT(2)
+#define MPTCP_DSS_ACK64		BIT(1)
+#define MPTCP_DSS_HAS_ACK	BIT(0)
+
 /* MPTCP sk_buff extension data */
 struct mptcp_ext {
 	u64		data_ack;
@@ -34,6 +42,7 @@ struct mptcp_out_options {
 	u16 suboptions;
 	u64 sndr_key;
 	u64 rcvr_key;
+	struct mptcp_ext ext_copy;
 #endif
 };
 
@@ -58,7 +67,8 @@ bool mptcp_syn_options(struct sock *sk, unsigned int *size,
 void mptcp_rcv_synsent(struct sock *sk);
 bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 			  struct mptcp_out_options *opts);
-bool mptcp_established_options(struct sock *sk, unsigned int *size,
+bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
+			       unsigned int *size, unsigned int remaining,
 			       struct mptcp_out_options *opts);
 
 static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
@@ -107,7 +117,9 @@ static inline bool mptcp_synack_options(const struct request_sock *req,
 }
 
 static inline bool mptcp_established_options(struct sock *sk,
+					     struct sk_buff *skb,
 					     unsigned int *size,
+					     unsigned int remaining,
 					     struct mptcp_out_options *opts)
 {
 	return false;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4e49b2c40820..5fe9459bbd6a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -796,13 +796,12 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 	 */
 	if (sk_is_mptcp(sk)) {
 		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-		unsigned int opt_size;
+		unsigned int opt_size = 0;
 
-		if (mptcp_established_options(sk, &opt_size, &opts->mptcp)) {
-			if (remaining >= opt_size) {
-				opts->options |= OPTION_MPTCP;
-				size += opt_size;
-			}
+		if (mptcp_established_options(sk, skb, &opt_size, remaining,
+					      &opts->mptcp)) {
+			opts->options |= OPTION_MPTCP;
+			size += opt_size;
 		}
 	}
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index d8e77cd5664d..625cd93fb9a8 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -181,12 +181,13 @@ void mptcp_rcv_synsent(struct sock *sk)
 	}
 }
 
-bool mptcp_established_options(struct sock *sk, unsigned int *size,
-			       struct mptcp_out_options *opts)
+static bool mptcp_established_options_mp(struct sock *sk, unsigned int *size,
+					 unsigned int remaining,
+					 struct mptcp_out_options *opts)
 {
 	struct subflow_context *subflow = subflow_ctx(sk);
 
-	if (subflow->mp_capable && !subflow->fourth_ack) {
+	if (!subflow->fourth_ack && remaining >= TCPOLEN_MPTCP_MPC_ACK) {
 		opts->suboptions = OPTION_MPTCP_MPC_ACK;
 		opts->sndr_key = subflow->local_key;
 		opts->rcvr_key = subflow->remote_key;
@@ -199,6 +200,92 @@ bool mptcp_established_options(struct sock *sk, unsigned int *size,
 	return false;
 }
 
+static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
+					  unsigned int *size,
+					  unsigned int remaining,
+					  struct mptcp_out_options *opts)
+{
+	unsigned int dss_size = 0;
+	struct mptcp_ext *mpext;
+	unsigned int ack_size;
+
+	mpext = skb ? mptcp_get_ext(skb) : NULL;
+
+	if (!skb || (mpext && mpext->use_map)) {
+		unsigned int map_size;
+		bool use_csum;
+
+		map_size = TCPOLEN_MPTCP_DSS_BASE + TCPOLEN_MPTCP_DSS_MAP64;
+		use_csum = subflow_ctx(sk)->use_checksum;
+		if (use_csum)
+			map_size += TCPOLEN_MPTCP_DSS_CHECKSUM;
+
+		if (map_size <= remaining) {
+			remaining -= map_size;
+			dss_size = map_size;
+			if (mpext) {
+				opts->ext_copy.data_seq = mpext->data_seq;
+				opts->ext_copy.subflow_seq = mpext->subflow_seq;
+				opts->ext_copy.data_len = mpext->data_len;
+				opts->ext_copy.checksum = mpext->checksum;
+				opts->ext_copy.use_map = 1;
+				opts->ext_copy.dsn64 = mpext->dsn64;
+				opts->ext_copy.use_checksum = use_csum;
+			}
+		} else {
+			opts->ext_copy.use_map = 0;
+			WARN_ONCE(1, "MPTCP: Map dropped");
+		}
+	}
+
+	if (mpext && mpext->use_ack) {
+		ack_size = TCPOLEN_MPTCP_DSS_ACK64;
+
+		/* Add kind/length/subtype/flag overhead if mapping is not
+		 * populated
+		 */
+		if (dss_size == 0)
+			ack_size += TCPOLEN_MPTCP_DSS_BASE;
+
+		if (ack_size <= remaining) {
+			dss_size += ack_size;
+
+			opts->ext_copy.data_ack = mpext->data_ack;
+			opts->ext_copy.ack64 = 1;
+			opts->ext_copy.use_ack = 1;
+		} else {
+			opts->ext_copy.use_ack = 0;
+			WARN(1, "MPTCP: Ack dropped");
+		}
+	}
+
+	*size = ALIGN(dss_size, 4);
+	return true;
+}
+
+bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
+			       unsigned int *size, unsigned int remaining,
+			       struct mptcp_out_options *opts)
+{
+	unsigned int opt_size = 0;
+
+	if (!subflow_ctx(sk)->mp_capable)
+		return false;
+
+	if (mptcp_established_options_mp(sk, &opt_size, remaining, opts)) {
+		*size += opt_size;
+		remaining -= opt_size;
+		return true;
+	} else if (mptcp_established_options_dss(sk, skb, &opt_size, remaining,
+					       opts)) {
+		*size += opt_size;
+		remaining -= opt_size;
+		return true;
+	}
+
+	return false;
+}
+
 bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 			  struct mptcp_out_options *opts)
 {
@@ -243,4 +330,54 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 			ptr += 2;
 		}
 	}
+
+	if (opts->ext_copy.use_ack || opts->ext_copy.use_map) {
+		struct mptcp_ext *mpext = &opts->ext_copy;
+		u8 len = TCPOLEN_MPTCP_DSS_BASE;
+		u8 flags = 0;
+
+		if (mpext->use_ack) {
+			len += TCPOLEN_MPTCP_DSS_ACK64;
+			flags = MPTCP_DSS_HAS_ACK | MPTCP_DSS_ACK64;
+		}
+
+		if (mpext->use_map) {
+			len += TCPOLEN_MPTCP_DSS_MAP64;
+
+			if (mpext->use_checksum)
+				len += TCPOLEN_MPTCP_DSS_CHECKSUM;
+
+			/* Use only 64-bit mapping flags for now, add
+			 * support for optional 32-bit mappings later.
+			 */
+			flags |= MPTCP_DSS_HAS_MAP | MPTCP_DSS_DSN64;
+			if (mpext->data_fin)
+				flags |= MPTCP_DSS_DATA_FIN;
+		}
+
+		*ptr++ = htonl((TCPOPT_MPTCP << 24) |
+			       (len  << 16) |
+			       (MPTCPOPT_DSS << 12) |
+			       (flags));
+
+		if (mpext->use_ack) {
+			put_unaligned_be64(mpext->data_ack, ptr);
+			ptr += 2;
+		}
+
+		if (mpext->use_map) {
+			__sum16 checksum;
+
+			pr_debug("Writing map values");
+			put_unaligned_be64(mpext->data_seq, ptr);
+			ptr += 2;
+			*ptr++ = htonl(mpext->subflow_seq);
+
+			if (mpext->use_checksum)
+				checksum = mpext->checksum;
+			else
+				checksum = TCPOPT_NOP << 8 | TCPOPT_NOP;
+			*ptr = htonl(mpext->data_len << 16 | checksum);
+		}
+	}
 }
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3215601b9c43..a6e6367c8ed1 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -18,17 +18,104 @@
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *subflow;
+	int mss_now, size_goal, poffset, ret;
+	struct mptcp_ext *mpext = NULL;
+	struct page *page = NULL;
+	struct sk_buff *skb;
+	struct sock *ssk;
+	size_t psize;
 
-	if (msk->connection_list) {
-		subflow = msk->connection_list;
-		pr_debug("conn_list->subflow=%p", subflow_ctx(subflow->sk));
-	} else {
-		subflow = msk->subflow;
-		pr_debug("subflow=%p", subflow_ctx(subflow->sk));
+	pr_debug("msk=%p", msk);
+	if (!msk->connection_list && msk->subflow) {
+		pr_debug("fallback passthrough");
+		return sock_sendmsg(msk->subflow, msg);
 	}
 
-	return sock_sendmsg(subflow, msg);
+	if (!msg_data_left(msg)) {
+		pr_debug("empty send");
+		return sock_sendmsg(msk->connection_list, msg);
+	}
+
+	ssk = msk->connection_list->sk;
+
+	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL))
+		return -ENOTSUPP;
+
+	/* Initial experiment: new page per send.  Real code will
+	 * maintain list of active pages and DSS mappings, append to the
+	 * end and honor zerocopy
+	 */
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	/* Copy to page */
+	poffset = 0;
+	pr_debug("left=%zu", msg_data_left(msg));
+	psize = copy_page_from_iter(page, poffset,
+				    min_t(size_t, msg_data_left(msg),
+					  PAGE_SIZE),
+				    &msg->msg_iter);
+	pr_debug("left=%zu", msg_data_left(msg));
+
+	if (!psize) {
+		put_page(page);
+		return -EINVAL;
+	}
+
+	lock_sock(sk);
+	lock_sock(ssk);
+
+	/* Mark the end of the previous write so the beginning of the
+	 * next write (with its own mptcp skb extension data) is not
+	 * collapsed.
+	 */
+	skb = tcp_write_queue_tail(ssk);
+	if (skb)
+		TCP_SKB_CB(skb)->eor = 1;
+
+	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
+
+	ret = do_tcp_sendpages(ssk, page, poffset, min_t(int, size_goal, psize),
+			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
+	put_page(page);
+	if (ret <= 0)
+		goto error_out;
+
+	if (skb == tcp_write_queue_tail(ssk))
+		pr_err("no new skb %p/%p", sk, ssk);
+
+	skb = tcp_write_queue_tail(ssk);
+
+	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
+
+	if (mpext) {
+		memset(mpext, 0, sizeof(*mpext));
+		mpext->data_ack = msk->ack_seq;
+		mpext->data_seq = msk->write_seq;
+		mpext->subflow_seq = subflow_ctx(ssk)->rel_write_seq;
+		mpext->data_len = ret;
+		mpext->checksum = 0xbeef;
+		mpext->use_map = 1;
+		mpext->dsn64 = 1;
+		mpext->use_ack = 1;
+		mpext->ack64 = 1;
+
+		pr_debug("data_seq=%llu subflow_seq=%u data_len=%u checksum=%u, dsn64=%d",
+			 mpext->data_seq, mpext->subflow_seq, mpext->data_len,
+			 mpext->checksum, mpext->dsn64);
+	} /* TODO: else fallback */
+
+	msk->write_seq += ret;
+	subflow_ctx(ssk)->rel_write_seq += ret;
+
+	tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle, size_goal);
+
+error_out:
+	release_sock(ssk);
+	release_sock(sk);
+
+	return ret;
 }
 
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
@@ -109,11 +196,14 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 	subflow->tcp_sock = new_sock;
 
 	if (subflow->mp_capable) {
-		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
-		pr_debug("token=%u", msk->token);
 		token_update_accept(new_sock->sk, new_mptcp_sock->sk);
+		msk->write_seq = subflow->idsn + 1;
+		subflow->rel_write_seq = 1;
+		msk->remote_key = subflow->remote_key;
+		crypto_key_sha1(msk->remote_key, NULL, &msk->ack_seq);
+		msk->ack_seq++;
 		msk->connection_list = new_sock;
 	} else {
 		msk->subflow = new_sock;
@@ -192,10 +282,13 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 	struct subflow_context *subflow = subflow_ctx(msk->subflow->sk);
 
 	if (mp_capable) {
-		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
-		pr_debug("token=%u", msk->token);
+		msk->write_seq = subflow->idsn + 1;
+		subflow->rel_write_seq = 1;
+		msk->remote_key = subflow->remote_key;
+		crypto_key_sha1(msk->remote_key, NULL, &msk->ack_seq);
+		msk->ack_seq++;
 		msk->connection_list = msk->subflow;
 		msk->subflow = NULL;
 	}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 5a8ed316d70e..79a9ce6c4d31 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -21,6 +21,10 @@
 #define TCPOLEN_MPTCP_MPC_SYN		12
 #define TCPOLEN_MPTCP_MPC_SYNACK	20
 #define TCPOLEN_MPTCP_MPC_ACK		20
+#define TCPOLEN_MPTCP_DSS_BASE		4
+#define TCPOLEN_MPTCP_DSS_ACK64		8
+#define TCPOLEN_MPTCP_DSS_MAP64		14
+#define TCPOLEN_MPTCP_DSS_CHECKSUM	2
 
 /* MPTCP MP_CAPABLE flags */
 #define MPTCP_VERSION_MASK	(0x0F)
@@ -35,6 +39,8 @@ struct mptcp_sock {
 	struct	inet_connection_sock sk;
 	u64	local_key;
 	u64	remote_key;
+	u64	write_seq;
+	u64	ack_seq;
 	u32	token;
 	struct	socket *connection_list; /* @@ needs to be a list */
 	struct	socket *subflow; /* outgoing connect, listener or !mp_capable */
@@ -54,6 +60,7 @@ struct subflow_request_sock {
 		version : 4;
 	u64	local_key;
 	u64	remote_key;
+	u64	idsn;
 	u32	token;
 };
 
@@ -68,12 +75,16 @@ struct subflow_context {
 	u64	local_key;
 	u64	remote_key;
 	u32	token;
+	u32     rel_write_seq;
+	u64     idsn;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
 		mp_capable : 1,	    /* remote is MPTCP capable */
 		fourth_ack : 1,     /* send initial DSS */
 		version : 4,
-		conn_finished : 1;
+		conn_finished : 1,
+		use_checksum : 1;
+
 	struct  socket *tcp_sock;  /* underlying tcp_sock */
 	struct  sock *conn;        /* parent mptcp_sock */
 };
@@ -123,4 +134,9 @@ void crypto_key_sha1(u64 key, u32 *token, u64 *idsn);
 void crypto_hmac_sha1(u64 key1, u64 key2, u32 *hash_out,
 		      int arg_num, ...);
 
+static inline struct mptcp_ext *mptcp_get_ext(struct sk_buff *skb)
+{
+	return (struct mptcp_ext *)skb_ext_find(skb, SKB_EXT_MPTCP);
+}
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index abae6a42a101..bbfdf03489bb 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -188,6 +188,7 @@ static int subflow_ulp_init(struct sock *sk)
 
 	tp->is_mptcp = 1;
 	icsk->icsk_af_ops = &subflow_specific;
+	ctx->use_checksum = 0;
 out:
 	return err;
 }
@@ -216,6 +217,7 @@ static void subflow_ulp_clone(const struct request_sock *req,
 
 	subflow->conn = NULL;
 	subflow->conn_finished = 1;
+	subflow->use_checksum = 0;
 
 	if (subflow_req->mp_capable) {
 		subflow->mp_capable = 1;
diff --git a/net/mptcp/token.c b/net/mptcp/token.c
index 8c15b8134f70..b055a3e82add 100644
--- a/net/mptcp/token.c
+++ b/net/mptcp/token.c
@@ -74,10 +74,11 @@ static void new_req_token(struct request_sock *req,
 					      ireq->ir_rmt_port);
 #endif
 	}
-	pr_debug("local_key=%llu:%llx", local_key, local_key);
 	subflow_req->local_key = local_key;
-	crypto_key_sha1(subflow_req->local_key, &subflow_req->token, NULL);
-	pr_debug("token=%u", subflow_req->token);
+	crypto_key_sha1(subflow_req->local_key, &subflow_req->token,
+			&subflow_req->idsn);
+	pr_debug("local_key=%llu, token=%u, idsn=%llu", subflow_req->local_key,
+		 subflow_req->token, subflow_req->idsn);
 }
 
 static void new_token(const struct sock *sk)
@@ -98,9 +99,9 @@ static void new_token(const struct sock *sk)
 						       isk->inet_dport);
 #endif
 	}
-	pr_debug("local_key=%llu:%llx", subflow->local_key, subflow->local_key);
-	crypto_key_sha1(subflow->local_key, &subflow->token, NULL);
-	pr_debug("token=%u", subflow->token);
+	crypto_key_sha1(subflow->local_key, &subflow->token, &subflow->idsn);
+	pr_debug("local_key=%llu, token=%u, idsn=%llu", subflow->local_key,
+		 subflow->token, subflow->idsn);
 }
 
 static int insert_req_token(u32 token)
-- 
2.22.0



* [RFC PATCH net-next 19/33] mptcp: Implement MPTCP receive path
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (17 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 18/33] mptcp: Write MPTCP DSS headers to outgoing data packets Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 20/33] mptcp: Make connection_list a real list of subflows Mat Martineau
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Parses incoming DSS options and populates outgoing MPTCP ACK
fields. MPTCP fields are parsed from the TCP option header and placed in
an skb extension, allowing the upper MPTCP layer to access MPTCP
options after the skb has gone through the TCP stack.
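
As a rough illustration of the parsing step described above, here is a
minimal userspace sketch (not the kernel code). It assumes the option
buffer starts at the TCP option kind byte; the struct dss_info type and
the dss_parse() and load_be() helpers are hypothetical names used only
for this example. The flag bits and length constants mirror the values
this patch adds to net/mptcp/protocol.h. Unlike the kernel parser, the
sketch returns false on a truncated option instead of leaving the
fields partially filled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits and lengths mirroring net/mptcp/protocol.h in this patch */
#define DSS_DATA_FIN	0x10
#define DSS_DSN64	0x08
#define DSS_HAS_MAP	0x04
#define DSS_ACK64	0x02
#define DSS_HAS_ACK	0x01
#define DSS_FLAG_MASK	0x1F
#define DSS_BASE	4	/* kind + length + subtype + flags */

/* Hypothetical result type for this sketch */
struct dss_info {
	uint64_t data_ack;
	uint64_t data_seq;
	uint32_t subflow_seq;
	uint16_t data_len;
	bool use_ack, ack64, use_map, dsn64, data_fin;
};

/* Read an n-byte big-endian integer (n <= 8) */
uint64_t load_be(const uint8_t *p, size_t n)
{
	uint64_t v = 0;

	while (n--)
		v = (v << 8) | *p++;
	return v;
}

/* opt points at the option kind byte, opsize is the full option length */
bool dss_parse(const uint8_t *opt, size_t opsize, struct dss_info *out)
{
	size_t expected = DSS_BASE;
	const uint8_t *p = opt + DSS_BASE;
	uint8_t flags;

	*out = (struct dss_info){ 0 };
	if (opsize < expected)
		return false;

	flags = opt[3] & DSS_FLAG_MASK;
	out->data_fin = (flags & DSS_DATA_FIN) != 0;

	if (flags & DSS_HAS_ACK) {
		size_t n = (flags & DSS_ACK64) ? 8 : 4;

		expected += n;
		if (opsize < expected)
			return false;
		out->data_ack = load_be(p, n);
		out->use_ack = true;
		out->ack64 = (n == 8);
		p += n;
	}

	if (flags & DSS_HAS_MAP) {
		size_t n = (flags & DSS_DSN64) ? 8 : 4;

		/* DSN, then 4-byte subflow sequence, then 2-byte length */
		expected += n + 4 + 2;
		if (opsize < expected)
			return false;
		out->data_seq = load_be(p, n);
		out->subflow_seq = (uint32_t)load_be(p + n, 4);
		out->data_len = (uint16_t)load_be(p + n + 4, 2);
		out->use_map = true;
		out->dsn64 = (n == 8);
	}

	return true;
}
```

The running expected-size check follows the same order as the patch:
the ACK field is validated and consumed before the mapping, so a short
option is rejected before any field beyond its end is read.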

Outgoing MPTCP ACK values are now populated from a value stored
in the connection socket rather than carried in an skb extension. This
allows sent packet headers to make use of the most up-to-date sequence
number and allows the MPTCP ACK to be populated in TCP ACK packets that
have no payload.
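
The stored value is only useful if it never moves backwards. The
receive path in this patch guards the update with a wraparound-safe
64-bit comparison; a minimal userspace sketch of that idea follows.
before64() matches the helper added to net/mptcp/protocol.c, while
advance_ack() is a hypothetical name for the update step, used here
only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if seq1 precedes seq2 in 64-bit sequence space, tolerating
 * wraparound. Relies on the usual two's-complement conversion to
 * int64_t, as the kernel helper does.
 */
bool before64(uint64_t seq1, uint64_t seq2)
{
	return (int64_t)(seq1 - seq2) < 0;
}

/* Hypothetical helper: return the new connection-level ACK, moving
 * forward only when the freshly mapped DSN is ahead of the old value.
 */
uint64_t advance_ack(uint64_t ack_seq, uint64_t mapped_dsn)
{
	return before64(ack_seq, mapped_dsn) ? mapped_dsn : ack_seq;
}
```

With this rule, duplicate or stale data leaves the ACK unchanged, so
any header built from the connection socket reports the highest
in-order data sequence number seen so far.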

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/linux/tcp.h  |  17 +-
 include/net/mptcp.h  |  16 +-
 net/ipv4/tcp_input.c |   4 +
 net/mptcp/options.c  | 138 ++++++++++++++--
 net/mptcp/protocol.c | 378 ++++++++++++++++++++++++++++++++++++++++---
 net/mptcp/protocol.h |  37 +++--
 net/mptcp/subflow.c  |  53 ++++--
 7 files changed, 574 insertions(+), 69 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fcbe8443aaad..81cfa7834111 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -102,13 +102,26 @@ struct tcp_options_received {
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
 #if IS_ENABLED(CONFIG_MPTCP)
 	struct mptcp_options_received {
+		u64     sndr_key;
+		u64     rcvr_key;
+		u64	data_ack;
+		u64	data_seq;
+		u32	subflow_seq;
+		u16	data_len;
+		__sum16	checksum;
 		u8      mp_capable : 1,
 			mp_join : 1,
 			dss : 1,
 			version : 4;
 		u8      flags;
-		u64     sndr_key;
-		u64     rcvr_key;
+		u8	dss_flags;
+		u8	use_map:1,
+			dsn64:1,
+			use_checksum:1,
+			data_fin:1,
+			use_ack:1,
+			ack64:1,
+			__unused:2;
 	} mptcp;
 #endif
 };
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 003150a8e406..ecc45733d8cf 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,14 +8,6 @@
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
-/* MPTCP DSS flags */
-
-#define MPTCP_DSS_DATA_FIN	BIT(4)
-#define MPTCP_DSS_DSN64		BIT(3)
-#define MPTCP_DSS_HAS_MAP	BIT(2)
-#define MPTCP_DSS_ACK64		BIT(1)
-#define MPTCP_DSS_HAS_ACK	BIT(0)
-
 /* MPTCP sk_buff extension data */
 struct mptcp_ext {
 	u64		data_ack;
@@ -71,6 +63,9 @@ bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 			       unsigned int *size, unsigned int remaining,
 			       struct mptcp_out_options *opts);
 
+void mptcp_attach_dss(struct sock *sk, struct sk_buff *skb,
+		      struct tcp_options_received *opt_rx);
+
 static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
 {
 	return skb_ext_exist(skb, SKB_EXT_MPTCP);
@@ -125,6 +120,11 @@ static inline bool mptcp_established_options(struct sock *sk,
 	return false;
 }
 
+static inline void mptcp_attach_dss(struct sock *sk, struct sk_buff *skb,
+				    struct tcp_options_received *opt_rx)
+{
+}
+
 static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
 {
 	return false;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5e634fdd8e1c..eaa9abd8841d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5650,6 +5650,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 	/* Process urgent data. */
 	tcp_urg(sk, skb, th);
 
+	/* Prepare MPTCP sequence data */
+	if (sk_is_mptcp(sk))
+		mptcp_attach_dss(sk, skb, &tp->rx_opt);
+
 	/* step 7: process the segment text */
 	tcp_data_queue(sk, skb);
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 625cd93fb9a8..6c5aed6351b3 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -14,6 +14,7 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 {
 	struct mptcp_options_received *mp_opt = &opt_rx->mptcp;
 	u8 subtype = *ptr >> 4;
+	int expected_opsize;
 
 	switch (subtype) {
 	/* MPTCPOPT_MP_CAPABLE
@@ -85,6 +86,72 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	case MPTCPOPT_DSS:
 		pr_debug("DSS");
 		mp_opt->dss = 1;
+		ptr++;
+
+		mp_opt->dss_flags = (*ptr++) & MPTCP_DSS_FLAG_MASK;
+		mp_opt->data_fin = (mp_opt->dss_flags & MPTCP_DSS_DATA_FIN) != 0;
+		mp_opt->dsn64 = (mp_opt->dss_flags & MPTCP_DSS_DSN64) != 0;
+		mp_opt->use_map = (mp_opt->dss_flags & MPTCP_DSS_HAS_MAP) != 0;
+		mp_opt->ack64 = (mp_opt->dss_flags & MPTCP_DSS_ACK64) != 0;
+		mp_opt->use_ack = (mp_opt->dss_flags & MPTCP_DSS_HAS_ACK);
+
+		pr_debug("data_fin=%d dsn64=%d use_map=%d ack64=%d use_ack=%d",
+			 mp_opt->data_fin, mp_opt->dsn64,
+			 mp_opt->use_map, mp_opt->ack64,
+			 mp_opt->use_ack);
+
+		expected_opsize = TCPOLEN_MPTCP_DSS_BASE;
+
+		if (mp_opt->use_ack) {
+			if (mp_opt->ack64)
+				expected_opsize += TCPOLEN_MPTCP_DSS_ACK64;
+			else
+				expected_opsize += TCPOLEN_MPTCP_DSS_ACK32;
+
+			if (opsize < expected_opsize)
+				break;
+
+			if (mp_opt->ack64) {
+				mp_opt->data_ack = get_unaligned_be64(ptr);
+				ptr += 8;
+			} else {
+				mp_opt->data_ack = get_unaligned_be32(ptr);
+				ptr += 4;
+			}
+
+			pr_debug("data_ack=%llu", mp_opt->data_ack);
+		}
+
+		if (mp_opt->use_map) {
+			if (mp_opt->dsn64)
+				expected_opsize += TCPOLEN_MPTCP_DSS_MAP64;
+			else
+				expected_opsize += TCPOLEN_MPTCP_DSS_MAP32;
+
+			if (opsize < expected_opsize)
+				break;
+
+			if (mp_opt->dsn64) {
+				mp_opt->data_seq = get_unaligned_be64(ptr);
+				ptr += 8;
+			} else {
+				mp_opt->data_seq = get_unaligned_be32(ptr);
+				ptr += 4;
+			}
+
+			mp_opt->subflow_seq = get_unaligned_be32(ptr);
+			ptr += 4;
+
+			mp_opt->data_len = get_unaligned_be16(ptr);
+			ptr += 2;
+
+			/* Checksum not currently supported */
+			mp_opt->checksum = 0;
+
+			pr_debug("data_seq=%llu subflow_seq=%u data_len=%u ck=%u",
+				 mp_opt->data_seq, mp_opt->subflow_seq,
+				 mp_opt->data_len, mp_opt->checksum);
+		}
 		break;
 
 	/* MPTCPOPT_ADD_ADDR
@@ -238,25 +305,31 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
-	if (mpext && mpext->use_ack) {
-		ack_size = TCPOLEN_MPTCP_DSS_ACK64;
+	ack_size = TCPOLEN_MPTCP_DSS_ACK64;
 
-		/* Add kind/lenght/subtype/flag overhead if mapping is not
-		 * populated
-		 */
-		if (dss_size == 0)
-			ack_size += TCPOLEN_MPTCP_DSS_BASE;
+	/* Add kind/length/subtype/flag overhead if mapping is not populated */
+	if (dss_size == 0)
+		ack_size += TCPOLEN_MPTCP_DSS_BASE;
 
-		if (ack_size <= remaining) {
-			dss_size += ack_size;
+	if (ack_size <= remaining) {
+		struct mptcp_sock *msk;
 
-			opts->ext_copy.data_ack = mpext->data_ack;
-			opts->ext_copy.ack64 = 1;
-			opts->ext_copy.use_ack = 1;
+		dss_size += ack_size;
+
+		msk = mptcp_sk(subflow_ctx(sk)->conn);
+		if (msk) {
+			opts->ext_copy.data_ack = msk->ack_seq;
 		} else {
-			opts->ext_copy.use_ack = 0;
-			WARN(1, "MPTCP: Ack dropped");
+			crypto_key_sha1(subflow_ctx(sk)->remote_key, NULL,
+					&opts->ext_copy.data_ack);
+			opts->ext_copy.data_ack++;
 		}
+
+		opts->ext_copy.ack64 = 1;
+		opts->ext_copy.use_ack = 1;
+	} else {
+		opts->ext_copy.use_ack = 0;
+		WARN(1, "MPTCP: Ack dropped");
 	}
 
 	*size = ALIGN(dss_size, 4);
@@ -304,6 +377,42 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 	return false;
 }
 
+void mptcp_attach_dss(struct sock *sk, struct sk_buff *skb,
+		      struct tcp_options_received *opt_rx)
+{
+	struct mptcp_options_received *mp_opt;
+	struct mptcp_ext *mpext;
+
+	mp_opt = &opt_rx->mptcp;
+
+	if (!mp_opt->dss)
+		return;
+
+	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
+	if (!mpext)
+		return;
+
+	memset(mpext, 0, sizeof(*mpext));
+
+	if (mp_opt->use_map) {
+		mpext->data_seq = mp_opt->data_seq;
+		mpext->subflow_seq = mp_opt->subflow_seq;
+		mpext->data_len = mp_opt->data_len;
+		mpext->checksum = mp_opt->checksum;
+		mpext->use_map = 1;
+		mpext->dsn64 = mp_opt->dsn64;
+		mpext->use_checksum = mp_opt->use_checksum;
+	}
+
+	if (mp_opt->use_ack) {
+		mpext->data_ack = mp_opt->data_ack;
+		mpext->use_ack = 1;
+		mpext->ack64 = mp_opt->ack64;
+	}
+
+	mpext->data_fin = mp_opt->data_fin;
+}
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 {
 	if ((OPTION_MPTCP_MPC_SYN |
@@ -342,6 +451,7 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 		}
 
 		if (mpext->use_map) {
+			pr_debug("Updating DSS length and flags for map");
 			len += TCPOLEN_MPTCP_DSS_MAP64;
 
 			if (mpext->use_checksum)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index a6e6367c8ed1..2e76b7450ce2 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -7,6 +7,8 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/netdevice.h>
+#include <linux/sched/signal.h>
+#include <linux/atomic.h>
 #include <net/sock.h>
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
@@ -15,6 +17,13 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static inline bool before64(__u64 seq1, __u64 seq2)
+{
+	return (__s64)(seq1 - seq2) < 0;
+}
+
+#define after64(seq2, seq1)	before64(seq1, seq2)
+
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -91,15 +100,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	if (mpext) {
 		memset(mpext, 0, sizeof(*mpext));
-		mpext->data_ack = msk->ack_seq;
 		mpext->data_seq = msk->write_seq;
 		mpext->subflow_seq = subflow_ctx(ssk)->rel_write_seq;
 		mpext->data_len = ret;
 		mpext->checksum = 0xbeef;
 		mpext->use_map = 1;
 		mpext->dsn64 = 1;
-		mpext->use_ack = 1;
-		mpext->ack64 = 1;
 
 		pr_debug("data_seq=%llu subflow_seq=%u data_len=%u checksum=%u, dsn64=%d",
 			 mpext->data_seq, mpext->subflow_seq, mpext->data_len,
@@ -118,21 +124,333 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	return ret;
 }
 
+struct mptcp_read_arg {
+	struct msghdr *msg;
+};
+
+static u64 expand_seq(u64 old_seq, u16 old_data_len, u64 seq)
+{
+	if ((u32)seq == (u32)old_seq)
+		return old_seq;
+
+	/* Assume map covers data not mapped yet. */
+	return seq | ((old_seq + old_data_len + 1) & GENMASK_ULL(63, 32));
+}
+
+static u64 get_map_offset(struct subflow_context *subflow)
+{
+	return tcp_sk(mptcp_subflow_tcp_socket(subflow)->sk)->copied_seq -
+		      subflow->ssn_offset -
+		      subflow->map_subflow_seq;
+}
+
+static u64 get_mapped_dsn(struct subflow_context *subflow)
+{
+	return subflow->map_seq + get_map_offset(subflow);
+}
+
+static int mptcp_read_actor(read_descriptor_t *desc, struct sk_buff *skb,
+			    unsigned int offset, size_t len)
+{
+	struct mptcp_read_arg *arg = desc->arg.data;
+	size_t copy_len;
+
+	copy_len = min(desc->count, len);
+
+	if (likely(arg->msg)) {
+		int err;
+
+		err = skb_copy_datagram_msg(skb, offset, arg->msg, copy_len);
+		if (err) {
+			pr_debug("error path");
+			desc->error = err;
+			return err;
+		}
+	} else {
+		pr_debug("Flushing skb payload");
+	}
+
+	/* MSG_PEEK support? Other flags? MSG_TRUNC? */
+
+	desc->count -= copy_len;
+
+	pr_debug("consumed %zu bytes, %zu left", copy_len, desc->count);
+	return copy_len;
+}
+
+static int mptcp_flush_actor(read_descriptor_t *desc, struct sk_buff *skb,
+			     unsigned int offset, size_t len)
+{
+	pr_debug("Flushing one skb with %zu of %zu bytes remaining",
+		 len, len + offset);
+
+	desc->count = 0;
+
+	return len;
+}
+
+enum mapping_status {
+	MAPPING_ADDED,
+	MAPPING_MISSING,
+	MAPPING_EMPTY,
+	MAPPING_DATA_FIN
+};
+
+static enum mapping_status mptcp_get_mapping(struct sock *ssk)
+{
+	struct subflow_context *subflow = subflow_ctx(ssk);
+	struct mptcp_ext *mpext;
+	enum mapping_status ret;
+	struct sk_buff *skb;
+	u64 map_seq;
+
+	skb = skb_peek(&ssk->sk_receive_queue);
+	if (!skb) {
+		pr_debug("Empty queue");
+		return MAPPING_EMPTY;
+	}
+
+	mpext = mptcp_get_ext(skb);
+
+	if (!mpext) {
+		/* This is expected for non-DSS data packets */
+		return MAPPING_MISSING;
+	}
+
+	if (!mpext->use_map) {
+		ret = MAPPING_MISSING;
+		goto del_out;
+	}
+
+	pr_debug("seq=%llu is64=%d ssn=%u data_len=%u ck=%u",
+		 mpext->data_seq, mpext->dsn64, mpext->subflow_seq,
+		 mpext->data_len, mpext->checksum);
+
+	if (mpext->data_len == 0) {
+		pr_err("Infinite mapping not handled");
+		ret = MAPPING_MISSING;
+		goto del_out;
+	} else if (mpext->subflow_seq == 0 &&
+		   mpext->data_fin == 1) {
+		pr_debug("DATA_FIN with no payload");
+		ret = MAPPING_DATA_FIN;
+		goto del_out;
+	}
+
+	if (!mpext->dsn64) {
+		map_seq = expand_seq(subflow->map_seq, subflow->map_data_len,
+				     mpext->data_seq);
+		pr_debug("expanded seq=%llu", map_seq);
+	} else {
+		map_seq = mpext->data_seq;
+	}
+
+	if (subflow->map_valid) {
+		/* due to GSO/TSO we can receive the same mapping multiple
+		 * times, before its expiration.
+		 */
+		if (subflow->map_seq != map_seq ||
+		    subflow->map_subflow_seq != mpext->subflow_seq ||
+		    subflow->map_data_len != mpext->data_len)
+			pr_warn("Replaced mapping before it was done");
+	}
+
+	subflow->map_seq = map_seq;
+	subflow->map_subflow_seq = mpext->subflow_seq;
+	subflow->map_data_len = mpext->data_len;
+	subflow->map_valid = 1;
+	ret = MAPPING_ADDED;
+	pr_debug("new map seq=%llu subflow_seq=%u data_len=%u",
+		 subflow->map_seq, subflow->map_subflow_seq,
+		 subflow->map_data_len);
+
+del_out:
+	__skb_ext_del(skb, SKB_EXT_MPTCP);
+	return ret;
+}
+
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *subflow;
+	struct subflow_context *subflow;
+	struct mptcp_read_arg arg;
+	read_descriptor_t desc;
+	struct tcp_sock *tp;
+	struct sock *ssk;
+	int copied = 0;
+	long timeo;
 
-	if (msk->connection_list) {
-		subflow = msk->connection_list;
-		pr_debug("conn_list->subflow=%p", subflow_ctx(subflow->sk));
-	} else {
-		subflow = msk->subflow;
-		pr_debug("subflow=%p", subflow_ctx(subflow->sk));
+	if (!msk->connection_list) {
+		pr_debug("fallback-read subflow=%p",
+			 subflow_ctx(msk->subflow->sk));
+		return sock_recvmsg(msk->subflow, msg, flags);
+	}
+
+	ssk = msk->connection_list->sk;
+	subflow = subflow_ctx(ssk);
+	tp = tcp_sk(ssk);
+
+	lock_sock(sk);
+	lock_sock(ssk);
+
+	desc.arg.data = &arg;
+	desc.error = 0;
+
+	timeo = sock_rcvtimeo(sk, nonblock);
+
+	len = min_t(size_t, len, INT_MAX);
+
+	while (copied < len) {
+		enum mapping_status status;
+		size_t discard_len = 0;
+		u32 map_remaining;
+		int bytes_read;
+		u64 ack_seq;
+		u64 old_ack;
+		u32 ssn;
+
+		status = mptcp_get_mapping(ssk);
+
+		if (status == MAPPING_ADDED) {
+			/* Common case, but nothing to do here */
+		} else if (status == MAPPING_MISSING) {
+			if (!subflow->map_valid) {
+				pr_debug("Mapping missing, trying next skb");
+
+				arg.msg = NULL;
+				desc.count = SIZE_MAX;
+
+				bytes_read = tcp_read_sock(ssk, &desc,
+							   mptcp_flush_actor);
+
+				if (bytes_read < 0)
+					break;
+
+				continue;
+			}
+		} else if (status == MAPPING_EMPTY) {
+			goto wait_for_data;
+		} else if (status == MAPPING_DATA_FIN) {
+			/* TODO: Handle according to RFC 6824 */
+			if (!copied) {
+				pr_err("Can't read after DATA_FIN");
+				copied = -ENOTCONN;
+			}
+
+			break;
+		}
+
+		ssn = tcp_sk(ssk)->copied_seq - subflow->ssn_offset;
+		old_ack = msk->ack_seq;
+
+		if (unlikely(before(ssn, subflow->map_subflow_seq))) {
+			/* Mapping covers data later in the subflow stream,
+			 * discard unmapped data.
+			 */
+			pr_debug("Mapping covers data later in stream");
+			discard_len = subflow->map_subflow_seq - ssn;
+		} else if (unlikely(!before(ssn, (subflow->map_subflow_seq +
+						  subflow->map_data_len)))) {
+			/* Mapping ends earlier in the subflow stream.
+			 * Invalidate the mapping and try again.
+			 */
+			subflow->map_valid = 0;
+			pr_debug("Invalid mapping ssn=%u map_subflow_seq=%u map_data_len=%u",
+				 ssn, subflow->map_subflow_seq,
+				 subflow->map_data_len);
+			continue;
+		} else {
+			ack_seq = get_mapped_dsn(subflow);
+
+			if (before64(ack_seq, old_ack)) {
+				/* Mapping covers data already received,
+				 * discard data in the current mapping
+				 * and invalidate the map
+				 */
+				u64 map_end_dsn = subflow->map_seq +
+					subflow->map_data_len;
+				discard_len = min(map_end_dsn - ack_seq,
+						  old_ack - ack_seq);
+				subflow->map_valid = 0;
+				pr_debug("Duplicate MPTCP data found");
+			}
+		}
+
+		if (discard_len) {
+			/* Discard data for the current mapping.
+			 */
+			pr_debug("Discard %zu bytes", discard_len);
+
+			arg.msg = NULL;
+			desc.count = discard_len;
+
+			bytes_read = tcp_read_sock(ssk, &desc,
+						   mptcp_read_actor);
+
+			if (bytes_read < 0)
+				break;
+			else if (bytes_read == discard_len)
+				continue;
+			else
+				goto wait_for_data;
+		}
+
+		/* Read mapped data */
+		map_remaining = subflow->map_data_len - get_map_offset(subflow);
+		desc.count = min_t(size_t, len - copied, map_remaining);
+		arg.msg = msg;
+		bytes_read = tcp_read_sock(ssk, &desc, mptcp_read_actor);
+		if (bytes_read < 0)
+			break;
+
+		/* Refresh current MPTCP sequence number based on subflow seq */
+		ack_seq = get_mapped_dsn(subflow);
+
+		if (before64(old_ack, ack_seq))
+			msk->ack_seq = ack_seq;
+
+		if (!before(tcp_sk(ssk)->copied_seq - subflow->ssn_offset,
+			    subflow->map_subflow_seq + subflow->map_data_len)) {
+			subflow->map_valid = 0;
+			pr_debug("Done with mapping: seq=%u data_len=%u",
+				 subflow->map_subflow_seq,
+				 subflow->map_data_len);
+		}
+
+		copied += bytes_read;
+
+wait_for_data:
+		if (copied)
+			break;
+
+		if (tp->urg_data && tp->urg_seq == tp->copied_seq) {
+			pr_err("Urgent data present, cannot proceed");
+			break;
+		}
+
+		if (ssk->sk_err || ssk->sk_state == TCP_CLOSE ||
+		    (ssk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    signal_pending(current)) {
+			pr_debug("nonblock or error");
+			break;
+		}
+
+		/* Handle blocking and retry read if needed.
+		 *
+		 * Wait on MPTCP sock, the subflow will notify via data ready.
+		 */
+
+		pr_debug("block");
+		release_sock(ssk);
+		sk_wait_data(sk, &timeo, NULL);
+		lock_sock(ssk);
 	}
 
-	return sock_recvmsg(subflow, msg, flags);
+	release_sock(ssk);
+	release_sock(sk);
+
+	return copied;
 }
 
 static int mptcp_init_sock(struct sock *sk)
@@ -192,22 +510,29 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 
 	msk = mptcp_sk(new_mptcp_sock->sk);
 	pr_debug("new msk=%p", msk);
-	subflow->conn = new_mptcp_sock->sk;
-	subflow->tcp_sock = new_sock;
 
 	if (subflow->mp_capable) {
+		u64 ack_seq;
+
+		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
 		token_update_accept(new_sock->sk, new_mptcp_sock->sk);
+		msk->connection_list = new_sock;
+
+		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
+		ack_seq++;
+		msk->ack_seq = ack_seq;
+		subflow->map_seq = ack_seq;
+		subflow->map_subflow_seq = 1;
 		subflow->rel_write_seq = 1;
-		msk->remote_key = subflow->remote_key;
-		crypto_key_sha1(msk->remote_key, NULL, &msk->ack_seq);
-		msk->ack_seq++;
-		msk->connection_list = new_sock;
+		subflow->conn = new_mptcp_sock->sk;
+		subflow->tcp_sock = new_sock;
 	} else {
 		msk->subflow = new_sock;
 	}
+	inet_sk_state_store(new_mptcp_sock->sk, TCP_ESTABLISHED);
 
 	return new_mptcp_sock->sk;
 }
@@ -282,17 +607,24 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 	struct subflow_context *subflow = subflow_ctx(msk->subflow->sk);
 
 	if (mp_capable) {
+		u64 ack_seq;
+
+		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
-		msk->write_seq = subflow->idsn + 1;
-		subflow->rel_write_seq = 1;
-		msk->remote_key = subflow->remote_key;
-		crypto_key_sha1(msk->remote_key, NULL, &msk->ack_seq);
-		msk->ack_seq++;
+		pr_debug("msk=%p, token=%u", msk, msk->token);
 		msk->connection_list = msk->subflow;
 		msk->subflow = NULL;
+
+		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
+		msk->write_seq = subflow->idsn + 1;
+		ack_seq++;
+		msk->ack_seq = ack_seq;
+		subflow->map_seq = ack_seq;
+		subflow->map_subflow_seq = 1;
+		subflow->rel_write_seq = 1;
 	}
-	sk->sk_state = TCP_ESTABLISHED;
+	inet_sk_state_store(sk, TCP_ESTABLISHED);
 }
 
 static struct proto mptcp_prot = {
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 79a9ce6c4d31..5c840f76a9b9 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -22,7 +22,9 @@
 #define TCPOLEN_MPTCP_MPC_SYNACK	20
 #define TCPOLEN_MPTCP_MPC_ACK		20
 #define TCPOLEN_MPTCP_DSS_BASE		4
+#define TCPOLEN_MPTCP_DSS_ACK32		4
 #define TCPOLEN_MPTCP_DSS_ACK64		8
+#define TCPOLEN_MPTCP_DSS_MAP32		10
 #define TCPOLEN_MPTCP_DSS_MAP64		14
 #define TCPOLEN_MPTCP_DSS_CHECKSUM	2
 
@@ -33,17 +35,25 @@
 #define MPTCP_CAP_HMAC_SHA1	BIT(0)
 #define MPTCP_CAP_FLAG_MASK	(0x3F)
 
+/* MPTCP DSS flags */
+#define MPTCP_DSS_DATA_FIN	BIT(4)
+#define MPTCP_DSS_DSN64		BIT(3)
+#define MPTCP_DSS_HAS_MAP	BIT(2)
+#define MPTCP_DSS_ACK64		BIT(1)
+#define MPTCP_DSS_HAS_ACK	BIT(0)
+#define MPTCP_DSS_FLAG_MASK	(0x1F)
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
-	struct	inet_connection_sock sk;
-	u64	local_key;
-	u64	remote_key;
-	u64	write_seq;
-	u64	ack_seq;
-	u32	token;
-	struct	socket *connection_list; /* @@ needs to be a list */
-	struct	socket *subflow; /* outgoing connect, listener or !mp_capable */
+	struct inet_connection_sock sk;
+	u64		local_key;
+	u64		remote_key;
+	u64		write_seq;
+	u64		ack_seq;
+	u32		token;
+	struct socket	*connection_list; /* @@ needs to be a list */
+	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 };
 
 static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
@@ -62,6 +72,7 @@ struct subflow_request_sock {
 	u64	remote_key;
 	u64	idsn;
 	u32	token;
+	u32	ssn_offset;
 };
 
 static inline
@@ -77,16 +88,22 @@ struct subflow_context {
 	u32	token;
 	u32     rel_write_seq;
 	u64     idsn;
-	u32	request_mptcp : 1,  /* send MP_CAPABLE */
+	u64	map_seq;
+	u32	map_subflow_seq;
+	u32	ssn_offset;
+	u16	map_data_len;
+	u16	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
 		mp_capable : 1,	    /* remote is MPTCP capable */
 		fourth_ack : 1,     /* send initial DSS */
 		version : 4,
 		conn_finished : 1,
-		use_checksum : 1;
+		use_checksum : 1,
+		map_valid : 1;
 
 	struct  socket *tcp_sock;  /* underlying tcp_sock */
 	struct  sock *conn;        /* parent mptcp_sock */
+	void	(*tcp_sk_data_ready)(struct sock *sk);
 };
 
 static inline struct subflow_context *subflow_ctx(const struct sock *sk)
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index bbfdf03489bb..a82f5091eed8 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -66,6 +66,8 @@ static void subflow_v4_init_req(struct request_sock *req,
 		subflow_req->remote_key = rx_opt.mptcp.sndr_key;
 		pr_debug("remote_key=%llu", subflow_req->remote_key);
 		token_new_request(req, skb);
+		pr_debug("syn seq=%u", TCP_SKB_CB(skb)->seq);
+		subflow_req->ssn_offset = TCP_SKB_CB(skb)->seq;
 	} else {
 		subflow_req->mp_capable = 0;
 	}
@@ -82,6 +84,11 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 			 subflow->remote_key);
 		mptcp_finish_connect(subflow->conn, subflow->mp_capable);
 		subflow->conn_finished = 1;
+
+		if (skb) {
+			pr_debug("synack seq=%u", TCP_SKB_CB(skb)->seq);
+			subflow->ssn_offset = TCP_SKB_CB(skb)->seq;
+		}
 	}
 }
 
@@ -150,6 +157,20 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 
 static struct inet_connection_sock_af_ops subflow_specific;
 
+static void subflow_data_ready(struct sock *sk)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+	struct sock *parent = subflow->conn;
+
+	pr_debug("sk=%p", sk);
+	subflow->tcp_sk_data_ready(sk);
+
+	if (parent) {
+		pr_debug("parent=%p", parent);
+		parent->sk_data_ready(parent);
+	}
+}
+
 static struct subflow_context *subflow_create_ctx(struct sock *sk,
 						  struct socket *sock,
 						  gfp_t priority)
@@ -188,7 +209,9 @@ static int subflow_ulp_init(struct sock *sk)
 
 	tp->is_mptcp = 1;
 	icsk->icsk_af_ops = &subflow_specific;
+	ctx->tcp_sk_data_ready = sk->sk_data_ready;
 	ctx->use_checksum = 0;
+	sk->sk_data_ready = subflow_data_ready;
 out:
 	return err;
 }
@@ -207,25 +230,31 @@ static void subflow_ulp_clone(const struct request_sock *req,
 			      const gfp_t priority)
 {
 	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+	struct subflow_context *old_ctx;
+	struct subflow_context *new_ctx;
+
+	old_ctx = inet_csk(newsk)->icsk_ulp_data;
 
 	/* newsk->sk_socket is NULL at this point */
-	struct subflow_context *subflow = subflow_create_ctx(newsk, NULL,
-							     priority);
+	new_ctx = subflow_create_ctx(newsk, NULL, priority);
 
-	if (!subflow)
+	if (!new_ctx)
 		return;
 
-	subflow->conn = NULL;
-	subflow->conn_finished = 1;
-	subflow->use_checksum = 0;
+	new_ctx->conn = NULL;
+	new_ctx->conn_finished = 1;
+	new_ctx->tcp_sk_data_ready = old_ctx->tcp_sk_data_ready;
+	new_ctx->use_checksum = old_ctx->use_checksum;
 
 	if (subflow_req->mp_capable) {
-		subflow->mp_capable = 1;
-		subflow->fourth_ack = 1;
-		subflow->remote_key = subflow_req->remote_key;
-		subflow->local_key = subflow_req->local_key;
-		subflow->token = subflow_req->token;
-		pr_debug("token=%u", subflow->token);
+		new_ctx->mp_capable = 1;
+		new_ctx->fourth_ack = 1;
+		new_ctx->remote_key = subflow_req->remote_key;
+		new_ctx->local_key = subflow_req->local_key;
+		new_ctx->token = subflow_req->token;
+		new_ctx->ssn_offset = subflow_req->ssn_offset;
+		new_ctx->idsn = subflow_req->idsn;
+		pr_debug("token=%u", new_ctx->token);
 	}
 }
 
-- 
2.22.0



* [RFC PATCH net-next 20/33] mptcp: Make connection_list a real list of subflows
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (18 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 19/33] mptcp: Implement MPTCP receive path Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 21/33] mptcp: add and use mptcp_subflow_hold Mat Martineau
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

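Replace the single connection_list socket pointer in struct mptcp_sock
with conn_list, a list of subflow contexts. Updates to the list are
serialized by the new conn_list_lock spinlock, and readers traverse it
under RCU.

Use the MPTCP socket lock to mutually exclude shutdown and close
execution.

Since mptcp_close() is the only code path that removes entries from
conn_list, mptcp_shutdown() can safely leave the RCU critical section
while walking the list, as long as it holds the socket lock.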
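
For reviewers who want the close-side pattern in isolation, here is a
minimal userspace sketch of "detach the whole list under the lock, then
release the entries without it", which is what mptcp_flush_conn_list()
and mptcp_close() do in this patch. Everything named toy_* is
hypothetical: a C11 atomic_flag stands in for conn_list_lock, a singly
linked list stands in for the kernel list_head, and the RCU grace period
(synchronize_rcu()) is not modelled at all.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct subflow_context and struct mptcp_sock. */
struct toy_subflow {
	struct toy_subflow *next;
	int released;
};

struct toy_msk {
	atomic_flag conn_list_lock;	/* plays the role of conn_list_lock */
	struct toy_subflow *conn_list;	/* singly linked, newest first */
};

static void toy_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void toy_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void toy_msk_init(struct toy_msk *msk)
{
	atomic_flag_clear(&msk->conn_list_lock);
	msk->conn_list = NULL;
}

static void toy_add_subflow(struct toy_msk *msk, struct toy_subflow *sf)
{
	toy_lock(&msk->conn_list_lock);
	sf->next = msk->conn_list;
	msk->conn_list = sf;
	toy_unlock(&msk->conn_list_lock);
}

/* Detach the whole list while holding the lock; return the private copy. */
static struct toy_subflow *toy_flush_conn_list(struct toy_msk *msk)
{
	struct toy_subflow *detached;

	toy_lock(&msk->conn_list_lock);
	detached = msk->conn_list;
	msk->conn_list = NULL;
	toy_unlock(&msk->conn_list_lock);
	return detached;
}

/* Release every detached entry without the lock; return how many. */
static int toy_release_all(struct toy_subflow *sf)
{
	int n = 0;

	while (sf) {
		struct toy_subflow *next = sf->next;	/* read before release */

		sf->released = 1;
		sf = next;
		n++;
	}
	return n;
}
```

The point of the split is that the potentially slow release step runs on
a private list, so the lock is only held for the pointer swap.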

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 222 ++++++++++++++++++++++++++++++++-----------
 net/mptcp/protocol.h |   9 +-
 2 files changed, 172 insertions(+), 59 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2e76b7450ce2..c00e837a1766 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -29,34 +29,48 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	int mss_now, size_goal, poffset, ret;
 	struct mptcp_ext *mpext = NULL;
+	struct subflow_context *subflow;
 	struct page *page = NULL;
+	struct hlist_node *node;
 	struct sk_buff *skb;
 	struct sock *ssk;
 	size_t psize;
 
 	pr_debug("msk=%p", msk);
-	if (!msk->connection_list && msk->subflow) {
+	if (msk->subflow) {
 		pr_debug("fallback passthrough");
 		return sock_sendmsg(msk->subflow, msg);
 	}
 
+	rcu_read_lock();
+	node = rcu_dereference(hlist_first_rcu(&msk->conn_list));
+	subflow = hlist_entry(node, struct subflow_context, node);
+	ssk = mptcp_subflow_tcp_socket(subflow)->sk;
+	sock_hold(ssk);
+	rcu_read_unlock();
+
 	if (!msg_data_left(msg)) {
 		pr_debug("empty send");
-		return sock_sendmsg(msk->connection_list, msg);
+		ret = sock_sendmsg(mptcp_subflow_tcp_socket(subflow), msg);
+		goto put_out;
 	}
 
-	ssk = msk->connection_list->sk;
+	pr_debug("conn_list->subflow=%p", subflow);
 
-	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL))
-		return -ENOTSUPP;
+	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) {
+		ret = -ENOTSUPP;
+		goto put_out;
+	}
 
 	/* Initial experiment: new page per send.  Real code will
 	 * maintain list of active pages and DSS mappings, append to the
 	 * end and honor zerocopy
 	 */
 	page = alloc_page(GFP_KERNEL);
-	if (!page)
-		return -ENOMEM;
+	if (!page) {
+		ret = -ENOMEM;
+		goto put_out;
+	}
 
 	/* Copy to page */
 	poffset = 0;
@@ -68,8 +82,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	pr_debug("left=%zu", msg_data_left(msg));
 
 	if (!psize) {
-		put_page(page);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto put_out;
 	}
 
 	lock_sock(sk);
@@ -87,9 +101,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	ret = do_tcp_sendpages(ssk, page, poffset, min_t(int, size_goal, psize),
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
-	put_page(page);
 	if (ret <= 0)
-		goto error_out;
+		goto release_out;
 
 	if (skb == tcp_write_queue_tail(ssk))
 		pr_err("no new skb %p/%p", sk, ssk);
@@ -117,10 +130,15 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle, size_goal);
 
-error_out:
+release_out:
 	release_sock(ssk);
 	release_sock(sk);
 
+put_out:
+	if (page)
+		put_page(page);
+
+	sock_put(ssk);
 	return ret;
 }
 
@@ -275,20 +293,26 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct subflow_context *subflow;
 	struct mptcp_read_arg arg;
+	struct hlist_node *node;
 	read_descriptor_t desc;
 	struct tcp_sock *tp;
 	struct sock *ssk;
 	int copied = 0;
 	long timeo;
 
-	if (!msk->connection_list) {
+	if (msk->subflow) {
 		pr_debug("fallback-read subflow=%p",
 			 subflow_ctx(msk->subflow->sk));
 		return sock_recvmsg(msk->subflow, msg, flags);
 	}
 
-	ssk = msk->connection_list->sk;
-	subflow = subflow_ctx(ssk);
+	rcu_read_lock();
+	node = rcu_dereference(hlist_first_rcu(&msk->conn_list));
+	subflow = hlist_entry(node, struct subflow_context, node);
+	ssk = mptcp_subflow_tcp_socket(subflow)->sk;
+	sock_hold(ssk);
+	rcu_read_unlock();
+
 	tp = tcp_sk(ssk);
 
 	lock_sock(sk);
@@ -450,6 +474,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	release_sock(ssk);
 	release_sock(sk);
 
+	sock_put(ssk);
+
 	return copied;
 }
 
@@ -459,24 +485,56 @@ static int mptcp_init_sock(struct sock *sk)
 
 	pr_debug("msk=%p", msk);
 
+	INIT_LIST_HEAD_RCU(&msk->conn_list);
+	spin_lock_init(&msk->conn_list_lock);
+
 	return 0;
 }
 
+static void mptcp_flush_conn_list(struct sock *sk, struct list_head *list)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	INIT_LIST_HEAD_RCU(list);
+	spin_lock_bh(&msk->conn_list_lock);
+	list_splice_init(&msk->conn_list, list);
+	spin_unlock_bh(&msk->conn_list_lock);
+
+	if (!list_empty(list))
+		synchronize_rcu();
+}
+
 static void mptcp_close(struct sock *sk, long timeout)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct subflow_context *subflow, *tmp;
+	struct socket *ssk = NULL;
+	struct list_head list;
 
 	inet_sk_state_store(sk, TCP_CLOSE);
 
+	spin_lock_bh(&msk->conn_list_lock);
 	if (msk->subflow) {
-		pr_debug("subflow=%p", subflow_ctx(msk->subflow->sk));
-		sock_release(msk->subflow);
+		ssk = msk->subflow;
+		msk->subflow = NULL;
 	}
+	spin_unlock_bh(&msk->conn_list_lock);
+	if (ssk) {
+		pr_debug("subflow=%p", ssk->sk);
+		sock_release(ssk);
+	}
+
+	/* this is the only place where we can remove any entry from the
+	 * conn_list. Additionally acquiring the socket lock here
+	 * allows for mutual exclusion with mptcp_shutdown().
+	 */
+	lock_sock(sk);
+	mptcp_flush_conn_list(sk, &list);
+	release_sock(sk);
 
-	if (msk->connection_list) {
-		pr_debug("conn_list->subflow=%p",
-			 subflow_ctx(msk->connection_list->sk));
-		sock_release(msk->connection_list);
+	list_for_each_entry_safe(subflow, tmp, &list, node) {
+		pr_debug("conn_list->subflow=%p", subflow);
+		sock_release(mptcp_subflow_tcp_socket(subflow));
 	}
 
 	sock_orphan(sk);
@@ -518,7 +576,10 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
 		token_update_accept(new_sock->sk, new_mptcp_sock->sk);
-		msk->connection_list = new_sock;
+		spin_lock_bh(&msk->conn_list_lock);
+		list_add_rcu(&subflow->node, &msk->conn_list);
+		msk->subflow = NULL;
+		spin_unlock_bh(&msk->conn_list_lock);
 
 		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
@@ -550,46 +611,46 @@ static int mptcp_setsockopt(struct sock *sk, int level, int optname,
 			    char __user *uoptval, unsigned int optlen)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *subflow;
 	char __kernel *optval;
 
-	pr_debug("msk=%p", msk);
-	if (msk->connection_list) {
-		subflow = msk->connection_list;
-		pr_debug("conn_list->subflow=%p", subflow_ctx(subflow->sk));
-	} else {
-		subflow = msk->subflow;
-		pr_debug("subflow=%p", subflow_ctx(subflow->sk));
-	}
-
 	/* will be treated as __user in tcp_setsockopt */
 	optval = (char __kernel __force *)uoptval;
 
-	return kernel_setsockopt(subflow, level, optname, optval, optlen);
+	pr_debug("msk=%p", msk);
+	if (msk->subflow) {
+		pr_debug("subflow=%p", msk->subflow->sk);
+		return kernel_setsockopt(msk->subflow, level, optname, optval,
+					 optlen);
+	}
+
+	/* @@ the meaning of setsockopt() when the socket is connected and
+	 * there are multiple subflows is not defined.
+	 */
+	return 0;
 }
 
 static int mptcp_getsockopt(struct sock *sk, int level, int optname,
 			    char __user *uoptval, int __user *uoption)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *subflow;
 	char __kernel *optval;
 	int __kernel *option;
 
-	pr_debug("msk=%p", msk);
-	if (msk->connection_list) {
-		subflow = msk->connection_list;
-		pr_debug("conn_list->subflow=%p", subflow_ctx(subflow->sk));
-	} else {
-		subflow = msk->subflow;
-		pr_debug("subflow=%p", subflow_ctx(subflow->sk));
-	}
-
 	/* will be treated as __user in tcp_getsockopt */
 	optval = (char __kernel __force *)uoptval;
 	option = (int __kernel __force *)uoption;
 
-	return kernel_getsockopt(subflow, level, optname, optval, option);
+	pr_debug("msk=%p", msk);
+	if (msk->subflow) {
+		pr_debug("subflow=%p", msk->subflow->sk);
+		return kernel_getsockopt(msk->subflow, level, optname, optval,
+					 option);
+	}
+
+	/* @@ the meaning of getsockopt() when the socket is connected and
+	 * there are multiple subflows is not defined.
+	 */
+	return 0;
 }
 
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
@@ -613,8 +674,10 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
 		pr_debug("msk=%p, token=%u", msk, msk->token);
-		msk->connection_list = msk->subflow;
+		spin_lock_bh(&msk->conn_list_lock);
+		list_add_rcu(&subflow->node, &msk->conn_list);
 		msk->subflow = NULL;
+		spin_unlock_bh(&msk->conn_list_lock);
 
 		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
@@ -715,17 +778,32 @@ static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
 			 int peer)
 {
 	struct mptcp_sock *msk = mptcp_sk(sock->sk);
-	struct socket *subflow;
-	int err = -EPERM;
+	struct subflow_context *subflow;
+	struct hlist_node *node;
+	struct sock *ssk;
+	int ret;
 
-	if (msk->connection_list)
-		subflow = msk->connection_list;
-	else
-		subflow = msk->subflow;
+	pr_debug("msk=%p", msk);
 
-	err = inet_getname(subflow, uaddr, peer);
+	if (msk->subflow) {
+		pr_debug("subflow=%p", msk->subflow->sk);
+		return inet_getname(msk->subflow, uaddr, peer);
+	}
 
-	return err;
+	/* @@ the meaning of getname() for the remote peer when the socket
+	 * is connected and there are multiple subflows is not defined.
+	 * For now just use the first subflow on the list.
+	 */
+	rcu_read_lock();
+	node = rcu_dereference(hlist_first_rcu(&msk->conn_list));
+	subflow = hlist_entry(node, struct subflow_context, node);
+	ssk = mptcp_subflow_tcp_socket(subflow)->sk;
+	sock_hold(ssk);
+	rcu_read_unlock();
+
+	ret = inet_getname(mptcp_subflow_tcp_socket(subflow), uaddr, peer);
+	sock_put(ssk);
+	return ret;
 }
 
 static int mptcp_listen(struct socket *sock, int backlog)
@@ -760,31 +838,59 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 			   struct poll_table_struct *wait)
 {
 	const struct mptcp_sock *msk;
+	struct subflow_context *subflow;
 	struct sock *sk = sock->sk;
+	struct hlist_node *node;
+	struct sock *ssk;
+	__poll_t ret;
 
 	msk = mptcp_sk(sk);
 	if (msk->subflow)
 		return tcp_poll(file, msk->subflow, wait);
 
-	return tcp_poll(file, msk->connection_list, wait);
+	rcu_read_lock();
+	node = rcu_dereference(hlist_first_rcu(&msk->conn_list));
+	subflow = hlist_entry(node, struct subflow_context, node);
+	ssk = mptcp_subflow_tcp_socket(subflow)->sk;
+	sock_hold(ssk);
+	rcu_read_unlock();
+
+	ret = tcp_poll(file, ssk->sk_socket, wait);
+	sock_put(ssk);
+	return ret;
 }
 
 static int mptcp_shutdown(struct socket *sock, int how)
 {
 	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	struct subflow_context *subflow;
 	int ret = 0;
 
 	pr_debug("sk=%p, how=%d", msk, how);
 
 	if (msk->subflow) {
 		pr_debug("subflow=%p", msk->subflow->sk);
-		ret = kernel_sock_shutdown(msk->subflow, how);
+		return kernel_sock_shutdown(msk->subflow, how);
 	}
 
-	if (msk->connection_list) {
-		pr_debug("conn_list->subflow=%p", msk->connection_list->sk);
-		ret = kernel_sock_shutdown(msk->connection_list, how);
+	/* Protect against a concurrent mptcp_close(): while the socket lock
+	 * is held nobody can remove entries from conn_list, so walking the
+	 * list stays safe even when we leave the RCU critical section. We
+	 * must drop the RCU lock to call the blocking kernel_sock_shutdown().
+	 * Note: the MPTCP socket lock can't protect conn_list changes,
+	 * as we need to update the list from BH via mptcp_finish_connect().
+	 */
+	lock_sock(sock->sk);
+	rcu_read_lock();
+	list_for_each_entry_rcu(subflow, &msk->conn_list, node) {
+		pr_debug("conn_list->subflow=%p", subflow);
+		rcu_read_unlock();
+		ret = kernel_sock_shutdown(mptcp_subflow_tcp_socket(subflow),
+					   how);
+		rcu_read_lock();
 	}
+	rcu_read_unlock();
+	release_sock(sock->sk);
 
 	return ret;
 }
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 5c840f76a9b9..a1bf093bb37e 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -7,6 +7,8 @@
 #ifndef __MPTCP_PROTOCOL_H
 #define __MPTCP_PROTOCOL_H
 
+#include <linux/spinlock.h>
+
 /* MPTCP option subtypes */
 #define MPTCPOPT_MP_CAPABLE	0
 #define MPTCPOPT_MP_JOIN	1
@@ -52,10 +54,14 @@ struct mptcp_sock {
 	u64		write_seq;
 	u64		ack_seq;
 	u32		token;
-	struct socket	*connection_list; /* @@ needs to be a list */
+	spinlock_t	conn_list_lock;
+	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 };
 
+#define mptcp_for_each_subflow(__msk, __subflow)			\
+	list_for_each_entry_rcu(__subflow, &((__msk)->conn_list), node)
+
 static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 {
 	return (struct mptcp_sock *)sk;
@@ -83,6 +89,7 @@ struct subflow_request_sock *subflow_rsk(const struct request_sock *rsk)
 
 /* MPTCP subflow context */
 struct subflow_context {
+	struct	list_head node;/* conn_list of subflows */
 	u64	local_key;
 	u64	remote_key;
 	u32	token;
-- 
2.22.0



* [RFC PATCH net-next 21/33] mptcp: add and use mptcp_subflow_hold
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (19 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 20/33] mptcp: Make connection_list a real list of subflows Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 22/33] mptcp: add basic kselftest program Mat Martineau
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

Subflow sockets already have their lifetime managed by RCU, so we can
switch to refcount_inc_not_zero() and skip any socket whose refcount
has already dropped to zero, as if it had not been found in the MPTCP
subflow list.

This is required to get rid of synchronize_rcu() from mptcp_close().
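
To illustrate the "increment unless already zero" semantics that
mptcp_subflow_hold() relies on, here is a minimal userspace sketch using
C11 atomics. The names toy_obj, toy_hold() and toy_put() are
hypothetical, and this is not the kernel's refcount_t implementation
(which also handles saturation); it only shows why a reader that finds a
dying object must treat it as absent.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object whose lifetime is controlled by a reference count. */
struct toy_obj {
	atomic_uint refcnt;
};

/* Take a reference only if the object is still alive (refcnt != 0). */
static bool toy_hold(struct toy_obj *o)
{
	unsigned int old = atomic_load_explicit(&o->refcnt,
						memory_order_relaxed);

	while (old != 0) {
		/* On failure 'old' is reloaded, so the zero check repeats. */
		if (atomic_compare_exchange_weak_explicit(&o->refcnt, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;	/* already released: pretend we never saw it */
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool toy_put(struct toy_obj *o)
{
	return atomic_fetch_sub_explicit(&o->refcnt, 1,
					 memory_order_acq_rel) == 1;
}
```

A plain increment would resurrect an object whose count has already hit
zero; the compare-and-swap loop refuses to do that, which is what lets
the list walker skip such entries.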

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/mptcp/protocol.c | 104 +++++++++++++++++++++++++++----------------
 1 file changed, 66 insertions(+), 38 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c00e837a1766..0db4099d9c13 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -24,14 +24,35 @@ static inline bool before64(__u64 seq1, __u64 seq2)
 
 #define after64(seq2, seq1)	before64(seq1, seq2)
 
+static bool mptcp_subflow_hold(struct subflow_context *subflow)
+{
+	struct sock *sk = mptcp_subflow_tcp_socket(subflow)->sk;
+
+	return refcount_inc_not_zero(&sk->sk_refcnt);
+}
+
+static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
+{
+	struct subflow_context *subflow;
+
+	rcu_read_lock();
+	mptcp_for_each_subflow(msk, subflow) {
+		if (mptcp_subflow_hold(subflow)) {
+			rcu_read_unlock();
+			return mptcp_subflow_tcp_socket(subflow)->sk;
+		}
+	}
+
+	rcu_read_unlock();
+	return NULL;
+}
+
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	int mss_now, size_goal, poffset, ret;
 	struct mptcp_ext *mpext = NULL;
-	struct subflow_context *subflow;
 	struct page *page = NULL;
-	struct hlist_node *node;
 	struct sk_buff *skb;
 	struct sock *ssk;
 	size_t psize;
@@ -42,20 +63,17 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		return sock_sendmsg(msk->subflow, msg);
 	}
 
-	rcu_read_lock();
-	node = rcu_dereference(hlist_first_rcu(&msk->conn_list));
-	subflow = hlist_entry(node, struct subflow_context, node);
-	ssk = mptcp_subflow_tcp_socket(subflow)->sk;
-	sock_hold(ssk);
-	rcu_read_unlock();
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk)
+		return -ENOTCONN;
 
 	if (!msg_data_left(msg)) {
 		pr_debug("empty send");
-		ret = sock_sendmsg(mptcp_subflow_tcp_socket(subflow), msg);
+		ret = sock_sendmsg(ssk->sk_socket, msg);
 		goto put_out;
 	}
 
-	pr_debug("conn_list->subflow=%p", subflow);
+	pr_debug("conn_list->subflow=%p", ssk);
 
 	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) {
 		ret = -ENOTSUPP;
@@ -293,7 +311,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct subflow_context *subflow;
 	struct mptcp_read_arg arg;
-	struct hlist_node *node;
 	read_descriptor_t desc;
 	struct tcp_sock *tp;
 	struct sock *ssk;
@@ -306,13 +323,11 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		return sock_recvmsg(msk->subflow, msg, flags);
 	}
 
-	rcu_read_lock();
-	node = rcu_dereference(hlist_first_rcu(&msk->conn_list));
-	subflow = hlist_entry(node, struct subflow_context, node);
-	ssk = mptcp_subflow_tcp_socket(subflow)->sk;
-	sock_hold(ssk);
-	rcu_read_unlock();
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk)
+		return -ENOTCONN;
 
+	subflow = subflow_ctx(ssk);
 	tp = tcp_sk(ssk);
 
 	lock_sock(sk);
@@ -778,8 +793,6 @@ static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
 			 int peer)
 {
 	struct mptcp_sock *msk = mptcp_sk(sock->sk);
-	struct subflow_context *subflow;
-	struct hlist_node *node;
 	struct sock *ssk;
 	int ret;
 
@@ -794,14 +807,11 @@ static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
 	 * is connected and there are multiple subflows is not defined.
 	 * For now just use the first subflow on the list.
 	 */
-	rcu_read_lock();
-	node = rcu_dereference(hlist_first_rcu(&msk->conn_list));
-	subflow = hlist_entry(node, struct subflow_context, node);
-	ssk = mptcp_subflow_tcp_socket(subflow)->sk;
-	sock_hold(ssk);
-	rcu_read_unlock();
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk)
+		return -ENOTCONN;
 
-	ret = inet_getname(mptcp_subflow_tcp_socket(subflow), uaddr, peer);
+	ret = inet_getname(ssk->sk_socket, uaddr, peer);
 	sock_put(ssk);
 	return ret;
 }
@@ -837,26 +847,44 @@ static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
 static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 			   struct poll_table_struct *wait)
 {
-	const struct mptcp_sock *msk;
 	struct subflow_context *subflow;
+	const struct mptcp_sock *msk;
 	struct sock *sk = sock->sk;
-	struct hlist_node *node;
-	struct sock *ssk;
-	__poll_t ret;
+	__poll_t ret = 0;
+	unsigned int i;
 
 	msk = mptcp_sk(sk);
 	if (msk->subflow)
 		return tcp_poll(file, msk->subflow, wait);
 
-	rcu_read_lock();
-	node = rcu_dereference(hlist_first_rcu(&msk->conn_list));
-	subflow = hlist_entry(node, struct subflow_context, node);
-	ssk = mptcp_subflow_tcp_socket(subflow)->sk;
-	sock_hold(ssk);
-	rcu_read_unlock();
+	i = 0;
+	for (;;) {
+		struct subflow_context *tmp = NULL;
+		int j = 0;
+
+		rcu_read_lock();
+		mptcp_for_each_subflow(msk, subflow) {
+			if (j < i) {
+				j++;
+				continue;
+			}
+
+			if (!mptcp_subflow_hold(subflow))
+				continue;
+
+			tmp = subflow;
+			i++;
+			break;
+		}
+		rcu_read_unlock();
+
+		if (!tmp)
+			break;
+
+		ret |= tcp_poll(file, mptcp_subflow_tcp_socket(tmp), wait);
+		sock_put(mptcp_subflow_tcp_socket(tmp)->sk);
+	}
 
-	ret = tcp_poll(file, ssk->sk_socket, wait);
-	sock_put(ssk);
 	return ret;
 }
 
-- 
2.22.0



* [RFC PATCH net-next 22/33] mptcp: add basic kselftest program
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (20 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 21/33] mptcp: add and use mptcp_subflow_hold Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 23/33] mptcp: selftests: switch to netns+veth based tests Mat Martineau
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

Create an MPTCP connection between two processes and transmit data back
and forth.  Data is read from stdin and written (after traversing the
MPTCP connection twice) to stdout.

A wrapper script tests that the data has passed unaltered.

The test runs automatically on "make kselftest".
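
The test program pushes data through the socket in chunks and must cope
with short writes. A minimal sketch of such a "write everything" loop
follows. It assumes a hypothetical helper name (write_all) and a fixed
max_chunk parameter in place of the random chunk sizes that do_write()
in mptcp_connect.c uses; unlike do_write() it also retries on EINTR.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to fd, at most max_chunk bytes per write()
 * call (0 means no limit). Returns len on success, -1 on error.
 */
static ssize_t write_all(int fd, const char *buf, size_t len,
			 size_t max_chunk)
{
	size_t off = 0;

	while (off < len) {
		size_t chunk = len - off;
		ssize_t n;

		if (max_chunk && chunk > max_chunk)
			chunk = max_chunk;

		n = write(fd, buf + off, chunk);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return -1;
		}

		/* write() may accept fewer bytes than requested */
		off += (size_t)n;
	}
	return (ssize_t)off;
}
```

Splitting the payload into small chunks gives the stack more
opportunities to produce differently sized segments, which is the kind
of variation the selftest tries to exercise.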

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/net/mptcp/.gitignore  |   1 +
 tools/testing/selftests/net/mptcp/Makefile    |  11 +
 tools/testing/selftests/net/mptcp/config      |   1 +
 .../selftests/net/mptcp/mptcp_connect.c       | 390 ++++++++++++++++++
 .../selftests/net/mptcp/mptcp_connect.sh      |  48 +++
 6 files changed, 452 insertions(+)
 create mode 100644 tools/testing/selftests/net/mptcp/.gitignore
 create mode 100644 tools/testing/selftests/net/mptcp/Makefile
 create mode 100644 tools/testing/selftests/net/mptcp/config
 create mode 100644 tools/testing/selftests/net/mptcp/mptcp_connect.c
 create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect.sh

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..e949f1be6773 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -29,6 +29,7 @@ TARGETS += memory-hotplug
 TARGETS += mount
 TARGETS += mqueue
 TARGETS += net
+TARGETS += net/mptcp
 TARGETS += netfilter
 TARGETS += networking/timestamping
 TARGETS += nsfs
diff --git a/tools/testing/selftests/net/mptcp/.gitignore b/tools/testing/selftests/net/mptcp/.gitignore
new file mode 100644
index 000000000000..3143fb05a511
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/.gitignore
@@ -0,0 +1 @@
+mptcp_connect
diff --git a/tools/testing/selftests/net/mptcp/Makefile b/tools/testing/selftests/net/mptcp/Makefile
new file mode 100644
index 000000000000..0fc5d45055ee
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+top_srcdir = ../../../../..
+
+CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
+
+TEST_PROGS := mptcp_connect.sh
+
+TEST_GEN_FILES = mptcp_connect
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/net/mptcp/config b/tools/testing/selftests/net/mptcp/config
new file mode 100644
index 000000000000..3bfe60494af8
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/config
@@ -0,0 +1 @@
+CONFIG_MPTCP=y
diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c
new file mode 100644
index 000000000000..78c43624e84f
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c
@@ -0,0 +1,390 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <unistd.h>
+
+#include <sys/poll.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+
+#include <netdb.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+
+extern int optind;
+
+#ifndef IPPROTO_MPTCP
+#define IPPROTO_MPTCP 262
+#endif
+
+static const char *cfg_host;
+static const char *cfg_port	= "12000";
+static int cfg_server_proto	= IPPROTO_MPTCP;
+static int cfg_client_proto	= IPPROTO_MPTCP;
+
+static void die_usage(void)
+{
+	fprintf(stderr, "Usage: mptcp_connect [-c MPTCP|TCP] [-p port] "
+		"[-s MPTCP|TCP]\n");
+	exit(-1);
+}
+
+static const char *getxinfo_strerr(int err)
+{
+	if (err == EAI_SYSTEM)
+		return strerror(errno);
+
+	return gai_strerror(err);
+}
+
+static void xgetaddrinfo(const char *node, const char *service,
+			 const struct addrinfo *hints,
+			 struct addrinfo **res)
+{
+	int err = getaddrinfo(node, service, hints, res);
+
+	if (err) {
+		const char *errstr = getxinfo_strerr(err);
+
+		fprintf(stderr, "Fatal: getaddrinfo(%s:%s): %s\n",
+			node ? node : "", service ? service : "", errstr);
+		exit(1);
+	}
+}
+
+static int sock_listen_mptcp(const char * const listenaddr,
+			     const char * const port)
+{
+	int sock;
+	struct addrinfo hints = {
+		.ai_protocol = IPPROTO_TCP,
+		.ai_socktype = SOCK_STREAM,
+		.ai_flags = AI_PASSIVE | AI_NUMERICHOST
+	};
+
+	hints.ai_family = AF_INET;
+
+	struct addrinfo *a, *addr;
+	int one = 1;
+
+	xgetaddrinfo(listenaddr, port, &hints, &addr);
+
+	for (a = addr; a; a = a->ai_next) {
+		sock = socket(a->ai_family, a->ai_socktype, cfg_server_proto);
+		if (sock < 0) {
+			perror("socket");
+			continue;
+		}
+
+		if (-1 == setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one,
+				     sizeof(one)))
+			perror("setsockopt");
+
+		if (bind(sock, a->ai_addr, a->ai_addrlen) == 0)
+			break; /* success */
+
+		perror("bind");
+		close(sock);
+		sock = -1;
+	}
+
+	if (sock >= 0 && listen(sock, 20))
+		perror("listen");
+
+	freeaddrinfo(addr);
+	return sock;
+}
+
+static int sock_connect_mptcp(const char * const remoteaddr,
+			      const char * const port, int proto)
+{
+	struct addrinfo hints = {
+		.ai_protocol = IPPROTO_TCP,
+		.ai_socktype = SOCK_STREAM,
+	};
+	struct addrinfo *a, *addr;
+	int sock = -1;
+
+	hints.ai_family = AF_INET;
+
+	xgetaddrinfo(remoteaddr, port, &hints, &addr);
+	for (a = addr; a; a = a->ai_next) {
+		sock = socket(a->ai_family, a->ai_socktype, proto);
+		if (sock < 0) {
+			perror("socket");
+			continue;
+		}
+
+		if (connect(sock, a->ai_addr, a->ai_addrlen) == 0)
+			break; /* success */
+
+		perror("connect()");
+		close(sock);
+		sock = -1;
+	}
+
+	freeaddrinfo(addr);
+	return sock;
+}
+
+static size_t do_write(const int fd, char *buf, const size_t len)
+{
+	size_t offset = 0;
+
+	while (offset < len) {
+		unsigned int do_w;
+		size_t written;
+		ssize_t bw;
+
+		do_w = rand() & 0xffff;
+		if (do_w == 0 || do_w > (len - offset))
+			do_w = len - offset;
+
+		bw = write(fd, buf + offset, do_w);
+		if (bw < 0) {
+			perror("write");
+			return 0;
+		}
+
+		written = (size_t)bw;
+		offset += written;
+	}
+	return offset;
+}
+
+static void copyfd_io(int peerfd)
+{
+	struct pollfd fds = { .events = POLLIN };
+
+	fds.fd = peerfd;
+
+	for (;;) {
+		char buf[4096];
+		ssize_t len;
+
+		switch (poll(&fds, 1, -1)) {
+		case -1:
+			if (errno == EINTR)
+				continue;
+			perror("poll");
+			return;
+		case 0:
+			/* should not happen, we requested infinite wait */
+			fputs("Timed out?!", stderr);
+			return;
+		}
+
+		if ((fds.revents & POLLIN) == 0)
+			return;
+
+		len = read(peerfd, buf, sizeof(buf));
+		if (!len)
+			return;
+		if (len < 0) {
+			if (errno == EINTR)
+				continue;
+
+			perror("read");
+			return;
+		}
+
+		if (!do_write(peerfd, buf, len))
+			return;
+	}
+}
+
+int main_loop_s(int listensock)
+{
+	struct sockaddr_storage ss;
+	socklen_t salen;
+	int remotesock;
+
+	salen = sizeof(ss);
+	while ((remotesock = accept(listensock, (struct sockaddr *)&ss,
+				    &salen)) < 0)
+		perror("accept");
+
+	copyfd_io(remotesock);
+	close(remotesock);
+
+	return 0;
+}
+
+static void init_rng(void)
+{
+	int fd = open("/dev/urandom", O_RDONLY);
+	unsigned int foo = 0;
+
+	if (fd >= 0) {
+		read(fd, &foo, sizeof(foo));
+		close(fd);
+	}
+
+	srand(foo);
+}
+
+int main_loop(void)
+{
+	int pollfds = 2, timeout = -1;
+	char start[32];
+	int pipefd[2];
+	ssize_t ret;
+	int fd;
+
+	if (pipe(pipefd)) {
+		perror("pipe");
+		exit(1);
+	}
+
+	switch (fork()) {
+	case 0:
+		close(pipefd[0]);
+
+		init_rng();
+
+		fd = sock_listen_mptcp(NULL, cfg_port);
+		if (fd < 0)
+			return -1;
+
+		write(pipefd[1], "RDY\n", 4);
+		main_loop_s(fd);
+		exit(1);
+	case -1:
+		perror("fork");
+		return -1;
+	default:
+		close(pipefd[1]);
+		break;
+	}
+
+	init_rng();
+	ret = read(pipefd[0], start, (int)sizeof(start));
+	if (ret < 0) {
+		perror("read");
+		return -1;
+	}
+
+	if (ret != 4 || memcmp(start, "RDY\n", 4))
+		return -1;
+
+	/* listener is ready. */
+	fd = sock_connect_mptcp(cfg_host, cfg_port, cfg_client_proto);
+	if (fd < 0)
+		return -1;
+
+	for (;;) {
+		struct pollfd fds[2];
+		char buf[4096];
+		ssize_t len;
+
+		fds[0].fd = fd;
+		fds[0].events = POLLIN;
+		fds[1].fd = 0;
+		fds[1].events = POLLIN;
+		fds[1].revents = 0;
+
+		switch (poll(fds, pollfds, timeout)) {
+		case -1:
+			if (errno == EINTR)
+				continue;
+			perror("poll");
+			return -1;
+		case 0:
+			close(fd);
+			return 0;
+		}
+
+		if (fds[0].revents & POLLIN) {
+			unsigned int blen = rand();
+
+			blen %= sizeof(buf);
+
+			++blen;
+			len = read(fd, buf, blen);
+			if (len < 0) {
+				perror("read");
+				return -1;
+			}
+
+			if (len > blen) {
+				fprintf(stderr, "read returned more data than "
+						"buffer length\n");
+				len = blen;
+			}
+
+			write(1, buf, len);
+		}
+		if (fds[1].revents & POLLIN) {
+			len = read(0, buf, sizeof(buf));
+			if (len == 0) {
+				pollfds = 1;
+				timeout = 1000;
+				continue;
+			}
+
+			if (len < 0) {
+				perror("read");
+				break;
+			}
+
+			do_write(fd, buf, len);
+		}
+	}
+
+	return 1;
+}
+
+int parse_proto(const char *proto)
+{
+	if (!strcasecmp(proto, "MPTCP"))
+		return IPPROTO_MPTCP;
+	if (!strcasecmp(proto, "TCP"))
+		return IPPROTO_TCP;
+	die_usage();
+
+	/* silence compiler warning */
+	return 0;
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "c:p:s:h")) != -1) {
+		switch (c) {
+		case 'c':
+			cfg_client_proto = parse_proto(optarg);
+			break;
+		case 'p':
+			cfg_port = optarg;
+			break;
+		case 's':
+			cfg_server_proto = parse_proto(optarg);
+			break;
+		case 'h':
+			die_usage();
+			break;
+		}
+	}
+
+	if (optind + 1 != argc)
+		die_usage();
+	cfg_host = argv[optind];
+}
+
+int main(int argc, char *argv[])
+{
+	init_rng();
+
+	parse_opts(argc, argv);
+	return main_loop();
+}
diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
new file mode 100755
index 000000000000..efcdda84b62a
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
@@ -0,0 +1,48 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+tmpin=$(mktemp)
+tmpout=$(mktemp)
+
+cleanup()
+{
+	rm -f "$tmpin" "$tmpout"
+}
+
+check_transfer()
+{
+	cl_proto=${1}
+	srv_proto=${2}
+
+	printf "%-8s -> %-8s socket\t\t" ${cl_proto} ${srv_proto}
+
+	./mptcp_connect -c ${cl_proto} -p 43212 -s ${srv_proto} 127.0.0.1  < "$tmpin" > "$tmpout" 2>/dev/null
+	ret=$?
+	if [ ${ret} -ne 0 ]; then
+		echo "[ FAIL ]"
+		echo " exit code ${ret}"
+		return ${ret}
+	fi
+	cmp "$tmpin" "$tmpout" > /dev/null 2>&1
+	if [ $? -ne 0 ]; then
+		echo "[ FAIL ]"
+		ls -l "$tmpin" "$tmpout" 1>&2
+	else
+		echo "[  OK  ]"
+	fi
+}
+
+trap cleanup EXIT
+
+SIZE=$((RANDOM % (1024 * 1024)))
+if [ $SIZE -eq 0 ]; then
+	SIZE=1
+fi
+
+dd if=/dev/urandom of="$tmpin" bs=1 count=$SIZE 2> /dev/null
+
+check_transfer MPTCP MPTCP
+check_transfer MPTCP TCP
+check_transfer TCP MPTCP
+
+exit 0
-- 
2.22.0



* [RFC PATCH net-next 23/33] mptcp: selftests: switch to netns+veth based tests
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (21 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 22/33] mptcp: add basic kselftest program Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:57 ` [RFC PATCH net-next 24/33] mptcp: selftests: Add capture option Mat Martineau
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

... so we can exercise PMTU and MSS handling.
MTU on lo is 64k, so we never had to deal with segmentation either.

This also avoids problems with timewait state in inet_ns: all net
namespaces are torn down before the script exits.

This uncovers several bugs:
1. mptcp_init_sock() passes init_net instead of sock_net(sk), i.e.
   network namespaces are not supported
2. We corrupt TCP option space: I can see invalid TCP headers (too
   short) in tcpdump, and the receiver process hangs without getting
   the data (poll timeout).
   Seems to be gone after adding

    if (size == MAX_TCP_OPTION_SPACE)
          return size;

    BUG_ON(size > MAX_TCP_OPTION_SPACE);

   in tcp_established_options() after writing the MPTCP options.
   SACK writing doesn't handle the 'no option space left' case.
3. Several WARN_ON from networking core are triggered, e.g. due
   to mem accounting being off.  Maybe fixed already with pending
   locking fix in mptcp_recv.
4. "Replaced mapping before it was done" in dmesg.
5. receiver blocking in read(), while TCP socket is in CLOSE-WAIT.
   wait_woken+0xd6/0x170
   sk_wait_data+0x248/0x270
   mptcp_recvmsg+0x5c0/0xd50
6. kmemleak gets noisy, we probably leak a refcount somewhere (iirc
   Davide is already working on this).
7. crash on connect completion, probably same bug that Paolo reported
   already.
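
The option-space problem in item 2 can be illustrated with a small
userspace sketch. This is not the kernel code: the helper name
sack_blocks_that_fit() is hypothetical, and only the three constants
are taken from the kernel's include/net/tcp.h. It shows the kind of
check the SACK writer would need once earlier options (such as MPTCP)
may have used up all 40 bytes.

```c
#include <assert.h>

/* Values as defined in the kernel's include/net/tcp.h. */
#define MAX_TCP_OPTION_SPACE		40
#define TCPOLEN_SACK_BASE_ALIGNED	4
#define TCPOLEN_SACK_PERBLOCK		8

/* Hypothetical helper: how many SACK blocks still fit once `size`
 * bytes of option space have been used by earlier options.
 */
static unsigned int sack_blocks_that_fit(unsigned int size,
					 unsigned int wanted)
{
	unsigned int remaining, fit;

	/* Userspace stand-in for BUG_ON(size > MAX_TCP_OPTION_SPACE). */
	assert(size <= MAX_TCP_OPTION_SPACE);

	remaining = MAX_TCP_OPTION_SPACE - size;
	if (remaining < TCPOLEN_SACK_BASE_ALIGNED + TCPOLEN_SACK_PERBLOCK)
		return 0;	/* no room for even one block: skip SACK */

	fit = (remaining - TCPOLEN_SACK_BASE_ALIGNED) / TCPOLEN_SACK_PERBLOCK;
	return wanted < fit ? wanted : fit;
}
```

With the option space completely filled, the helper reports that no
SACK block fits, which corresponds to the early return quoted in
item 2.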

Since the script has not yet turned up any problems when using only
TCP, these appear to be MPTCP-related bugs rather than problems with
the script or mptcp_connect.c.

Once the above issues are fixed, this will be extended again to
set different/varying MTUs in ns1 and ns4.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 .../selftests/net/mptcp/mptcp_connect.c       | 310 +++++++++---------
 .../selftests/net/mptcp/mptcp_connect.sh      | 246 ++++++++++++--
 2 files changed, 382 insertions(+), 174 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c
index 78c43624e84f..cac71f0ac8f8 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.c
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c
@@ -26,16 +26,19 @@ extern int optind;
 #define IPPROTO_MPTCP 262
 #endif
 
+static bool listen_mode;
+static int  poll_timeout;
+
 static const char *cfg_host;
 static const char *cfg_port	= "12000";
-static int cfg_server_proto	= IPPROTO_MPTCP;
-static int cfg_client_proto	= IPPROTO_MPTCP;
+static int cfg_sock_proto	= IPPROTO_MPTCP;
+
 
 static void die_usage(void)
 {
-	fprintf(stderr, "Usage: mptcp_connect [-c MPTCP|TCP] [-p port] "
-		"[-s MPTCP|TCP]\n");
-	exit(-1);
+	fprintf(stderr, "Usage: mptcp_connect [-s MPTCP|TCP] [-p port] "
+		"[ -l ] [ -t timeout ] connect_address\n");
+	exit(1);
 }
 
 static const char *getxinfo_strerr(int err)
@@ -79,11 +82,9 @@ static int sock_listen_mptcp(const char * const listenaddr,
 	xgetaddrinfo(listenaddr, port, &hints, &addr);
 
 	for (a = addr; a; a = a->ai_next) {
-		sock = socket(a->ai_family, a->ai_socktype, cfg_server_proto);
-		if (sock < 0) {
-			perror("socket");
+		sock = socket(a->ai_family, a->ai_socktype, cfg_sock_proto);
+		if (sock < 0)
 			continue;
-		}
 
 		if (-1 == setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one,
 				     sizeof(one)))
@@ -97,10 +98,19 @@ static int sock_listen_mptcp(const char * const listenaddr,
 		sock = -1;
 	}
 
-	if (sock >= 0 && listen(sock, 20))
+	freeaddrinfo(addr);
+
+	if (sock < 0) {
+		fprintf(stderr, "Could not create listen socket\n");
+		return sock;
+	}
+
+	if (listen(sock, 20)) {
 		perror("listen");
+		close(sock);
+		return -1;
+	}
 
-	freeaddrinfo(addr);
 	return sock;
 }
 
@@ -136,7 +146,7 @@ static int sock_connect_mptcp(const char * const remoteaddr,
 	return sock;
 }
 
-static size_t do_write(const int fd, char *buf, const size_t len)
+static size_t do_rnd_write(const int fd, char *buf, const size_t len)
 {
 	size_t offset = 0;
 
@@ -161,62 +171,149 @@ static size_t do_write(const int fd, char *buf, const size_t len)
 	return offset;
 }
 
-static void copyfd_io(int peerfd)
+static size_t do_write(const int fd, char *buf, const size_t len)
 {
-	struct pollfd fds = { .events = POLLIN };
+	size_t offset = 0;
 
-	fds.fd = peerfd;
+	while (offset < len) {
+		size_t written;
+		ssize_t bw;
+
+		bw = write(fd, buf + offset, len - offset);
+		if (bw < 0) {
+			perror("write");
+			return 0;
+		}
+
+		written = (size_t)bw;
+		offset += written;
+	}
+
+	return offset;
+}
+
+static ssize_t do_rnd_read(const int fd, char *buf, const size_t len)
+{
+	size_t cap = rand();
+
+	cap &= 0xffff;
+
+	if (cap == 0)
+		cap = 1;
+	else if (cap > len)
+		cap = len;
+
+	return read(fd, buf, cap);
+}
+
+static int copyfd_io(int infd, int peerfd, int outfd)
+{
+	struct pollfd fds = {
+		.fd = peerfd,
+		.events = POLLIN | POLLOUT,
+	};
 
 	for (;;) {
-		char buf[4096];
+		char buf[8192];
 		ssize_t len;
 
-		switch (poll(&fds, 1, -1)) {
+		if (fds.events == 0)
+			break;
+
+		switch (poll(&fds, 1, poll_timeout)) {
 		case -1:
 			if (errno == EINTR)
 				continue;
 			perror("poll");
-			return;
+			return 1;
 		case 0:
-			/* should not happen, we requested infinite wait */
-			fputs("Timed out?!", stderr);
-			return;
+			fprintf(stderr, "%s: poll timed out (events: "
+				"POLLIN %u, POLLOUT %u)\n", __func__,
+				fds.events & POLLIN, fds.events & POLLOUT);
+			return 2;
 		}
 
-		if ((fds.revents & POLLIN) == 0)
-			return;
+		if (fds.revents & POLLIN) {
+			len = do_rnd_read(peerfd, buf, sizeof(buf));
+			if (len == 0) {
+				/* no more data to receive:
+				 * peer has closed its write side
+				 */
+				fds.events &= ~POLLIN;
 
-		len = read(peerfd, buf, sizeof(buf));
-		if (!len)
-			return;
-		if (len < 0) {
-			if (errno == EINTR)
-				continue;
+				if ((fds.events & POLLOUT) == 0)
+					/* and nothing more to send */
+					break;
+
+			/* Else, still have data to transmit */
+			} else if (len < 0) {
+				perror("read");
+				return 3;
+			}
 
-			perror("read");
-			return;
+			do_write(outfd, buf, len);
 		}
 
-		if (!do_write(peerfd, buf, len))
-			return;
+		if (fds.revents & POLLOUT) {
+			len = do_rnd_read(infd, buf, sizeof(buf));
+			if (len > 0) {
+				if (!do_rnd_write(peerfd, buf, len))
+					return 111;
+			} else if (len == 0) {
+				/* We have no more data to send. */
+				fds.events &= ~POLLOUT;
+
+				if ((fds.events & POLLIN) == 0)
+					/* ... and peer also closed already */
+					break;
+
+				/* ... but we still receive.
+				 * Close our write side.
+				 */
+				shutdown(peerfd, SHUT_WR);
+			} else {
+				if (errno == EINTR)
+					continue;
+				perror("read");
+				return 4;
+			}
+		}
 	}
+
+	close(peerfd);
+	return 0;
 }
 
 int main_loop_s(int listensock)
 {
 	struct sockaddr_storage ss;
+	struct pollfd polls;
 	socklen_t salen;
 	int remotesock;
 
+	polls.fd = listensock;
+	polls.events = POLLIN;
+
+	switch (poll(&polls, 1, poll_timeout)) {
+	case -1:
+		perror("poll");
+		return 1;
+	case 0:
+		fprintf(stderr, "%s: timed out\n", __func__);
+		close(listensock);
+		return 2;
+	}
+
 	salen = sizeof(ss);
-	while ((remotesock = accept(listensock, (struct sockaddr *)&ss,
-				    &salen)) < 0)
-		perror("accept");
+	remotesock = accept(listensock, (struct sockaddr *)&ss, &salen);
+	if (remotesock >= 0) {
+		copyfd_io(0, remotesock, 1);
+		return 0;
+	}
 
-	copyfd_io(remotesock);
-	close(remotesock);
+	perror("accept");
 
-	return 0;
+	return 1;
 }
 
 static void init_rng(void)
@@ -225,7 +322,10 @@ static void init_rng(void)
 	unsigned int foo;
 
 	if (fd > 0) {
-		read(fd, &foo, sizeof(foo));
+		int ret = read(fd, &foo, sizeof(foo));
+
+		if (ret < 0)
+			srand(fd + foo);
 		close(fd);
 	}
 
@@ -234,113 +334,14 @@ static void init_rng(void)
 
 int main_loop(void)
 {
-	int pollfds = 2, timeout = -1;
-	char start[32];
-	int pipefd[2];
-	ssize_t ret;
 	int fd;
 
-	if (pipe(pipefd)) {
-		perror("pipe");
-		exit(1);
-	}
-
-	switch (fork()) {
-	case 0:
-		close(pipefd[0]);
-
-		init_rng();
-
-		fd = sock_listen_mptcp(NULL, cfg_port);
-		if (fd < 0)
-			return -1;
-
-		write(pipefd[1], "RDY\n", 4);
-		main_loop_s(fd);
-		exit(1);
-	case -1:
-		perror("fork");
-		return -1;
-	default:
-		close(pipefd[1]);
-		break;
-	}
-
-	init_rng();
-	ret = read(pipefd[0], start, (int)sizeof(start));
-	if (ret < 0) {
-		perror("read");
-		return -1;
-	}
-
-	if (ret != 4 || strcmp(start, "RDY\n"))
-		return -1;
-
 	/* listener is ready. */
-	fd = sock_connect_mptcp(cfg_host, cfg_port, cfg_client_proto);
+	fd = sock_connect_mptcp(cfg_host, cfg_port, cfg_sock_proto);
 	if (fd < 0)
-		return -1;
-
-	for (;;) {
-		struct pollfd fds[2];
-		char buf[4096];
-		ssize_t len;
-
-		fds[0].fd = fd;
-		fds[0].events = POLLIN;
-		fds[1].fd = 0;
-		fds[1].events = POLLIN;
-		fds[1].revents = 0;
-
-		switch (poll(fds, pollfds, timeout)) {
-		case -1:
-			if (errno == EINTR)
-				continue;
-			perror("poll");
-			return -1;
-		case 0:
-			close(fd);
-			return 0;
-		}
-
-		if (fds[0].revents & POLLIN) {
-			unsigned int blen = rand();
-
-			blen %= sizeof(buf);
-
-			++blen;
-			len = read(fd, buf, blen);
-			if (len < 0) {
-				perror("read");
-				return -1;
-			}
-
-			if (len > blen) {
-				fprintf(stderr, "read returned more data than "
-						"buffer length\n");
-				len = blen;
-			}
-
-			write(1, buf, len);
-		}
-		if (fds[1].revents & POLLIN) {
-			len = read(0, buf, sizeof(buf));
-			if (len == 0) {
-				pollfds = 1;
-				timeout = 1000;
-				continue;
-			}
+		return 2;
 
-			if (len < 0) {
-				perror("read");
-				break;
-			}
-
-			do_write(fd, buf, len);
-		}
-	}
-
-	return 1;
+	return copyfd_io(0, fd, 1);
 }
 
 int parse_proto(const char *proto)
@@ -349,6 +350,8 @@ int parse_proto(const char *proto)
 		return IPPROTO_MPTCP;
 	if (!strcasecmp(proto, "TCP"))
 		return IPPROTO_TCP;
+
+	fprintf(stderr, "Unknown protocol: %s.", proto);
 	die_usage();
 
 	/* silence compiler warning */
@@ -359,20 +362,25 @@ static void parse_opts(int argc, char **argv)
 {
 	int c;
 
-	while ((c = getopt(argc, argv, "c:p:s:h")) != -1) {
+	while ((c = getopt(argc, argv, "lp:s:ht:")) != -1) {
 		switch (c) {
-		case 'c':
-			cfg_client_proto = parse_proto(optarg);
+		case 'l':
+			listen_mode = true;
 			break;
 		case 'p':
 			cfg_port = optarg;
 			break;
 		case 's':
-			cfg_server_proto = parse_proto(optarg);
+			cfg_sock_proto = parse_proto(optarg);
 			break;
 		case 'h':
 			die_usage();
 			break;
+		case 't':
+			poll_timeout = atoi(optarg) * 1000;
+			if (poll_timeout <= 0)
+				poll_timeout = -1;
+			break;
 		}
 	}
 
@@ -386,5 +394,15 @@ int main(int argc, char *argv[])
 	init_rng();
 
 	parse_opts(argc, argv);
+
+	if (listen_mode) {
+		int fd = sock_listen_mptcp(cfg_host, cfg_port);
+
+		if (fd < 0)
+			return 1;
+
+		return main_loop_s(fd);
+	}
+
 	return main_loop();
 }
diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
index efcdda84b62a..e694dc9d312c 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
@@ -1,48 +1,238 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
-tmpin=$(mktemp)
-tmpout=$(mktemp)
+ret=0
+sin=""
+sout=""
+cin=""
+cout=""
+ksft_skip=4
+timeout=30
+
+TEST_COUNT=0
 
 cleanup()
 {
-	rm -f "$tmpin" "$tmpout"
+	rm -f "$cin" "$cout"
+	rm -f "$sin" "$sout"
+
+	for i in 1 2 3 4; do
+		ip netns del ns$i
+	done
+}
+
+ip -Version > /dev/null 2>&1
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not run test without ip tool"
+	exit $ksft_skip
+fi
+
+sin=$(mktemp)
+sout=$(mktemp)
+cin=$(mktemp)
+cout=$(mktemp)
+trap cleanup EXIT
+
+for i in 1 2 3 4;do
+	ip netns add ns$i || exit $ksft_skip
+	ip -net ns$i link set lo up
+done
+
+#  ns1              ns2                    ns3                     ns4
+# ns1eth2    ns2eth1   ns2eth3      ns3eth2   ns3eth4       ns4eth3
+#                           - drop 1% ->            reorder 25%
+#                           <- TSO off -
+
+ip link add ns1eth2 netns ns1 type veth peer name ns2eth1 netns ns2
+ip link add ns2eth3 netns ns2 type veth peer name ns3eth2 netns ns3
+ip link add ns3eth4 netns ns3 type veth peer name ns4eth3 netns ns4
+
+ip -net ns1 addr add 10.0.1.1/24 dev ns1eth2
+ip -net ns1 link set ns1eth2 up
+ip -net ns1 route add default via 10.0.1.2
+
+ip -net ns2 addr add 10.0.1.2/24 dev ns2eth1
+ip -net ns2 link set ns2eth1 up
+
+ip -net ns2 addr add 10.0.2.1/24 dev ns2eth3
+ip -net ns2 link set ns2eth3 up
+ip -net ns2 route add default via 10.0.2.2
+ip netns exec ns2 sysctl -q net.ipv4.ip_forward=1
+
+ip -net ns3 addr add 10.0.2.2/24 dev ns3eth2
+ip -net ns3 link set ns3eth2 up
+
+ip -net ns3 addr add 10.0.3.2/24 dev ns3eth4
+ip -net ns3 link set ns3eth4 up
+ip -net ns3 route add default via 10.0.2.1
+ip netns exec ns3 ethtool -K ns3eth2 tso off 2>/dev/null
+ip netns exec ns3 sysctl -q net.ipv4.ip_forward=1
+
+ip -net ns4 addr add 10.0.3.1/24 dev ns4eth3
+ip -net ns4 link set ns4eth3 up
+ip -net ns4 route add default via 10.0.3.2
+
+print_file_err()
+{
+	ls -l "$1" 1>&2
+	echo "Trailing bytes are: "
+	tail -c 27 "$1"
 }
 
 check_transfer()
 {
-	cl_proto=${1}
-	srv_proto=${2}
+	in=$1
+	out=$2
+	what=$3
 
-	printf "%-8s -> %-8s socket\t\t" ${cl_proto} ${srv_proto}
+	cmp "$in" "$out" > /dev/null 2>&1
+	if [ $? -ne 0 ] ;then
+		echo "[ FAIL ] $what does not match (in, out):"
+		print_file_err "$in"
+		print_file_err "$out"
 
-	./mptcp_connect -c ${cl_proto} -p 43212 -s ${srv_proto} 127.0.0.1  < "$tmpin" > "$tmpout" 2>/dev/null
-	ret=$?
-	if [ ${ret} -ne 0 ]; then
-		echo "[ FAIL ]"
-		echo " exit code ${ret}"
-		return ${ret}
+		return 1
 	fi
-	cmp "$tmpin" "$tmpout" > /dev/null 2>&1
-	if [ $? -ne 0 ]; then
-		echo "[ FAIL ]"
-		ls -l "$tmpin" "$tmpout" 1>&2
-	else
-		echo "[  OK  ]"
+
+	return 0
+}
+
+do_ping()
+{
+	listener_ns="$1"
+	connector_ns="$2"
+	connect_addr="$3"
+
+	ip netns exec ${connector_ns} ping -q -c 1 $connect_addr >/dev/null
+	if [ $? -ne 0 ] ; then
+		echo "$listener_ns -> $connect_addr connectivity [ FAIL ]" 1>&2
+		ret=1
 	fi
 }
 
-trap cleanup EXIT
+do_transfer()
+{
+	listener_ns="$1"
+	connector_ns="$2"
+	cl_proto="$3"
+	srv_proto="$4"
+	connect_addr="$5"
 
-SIZE=$((RANDOM % (1024 * 1024)))
-if [ $SIZE -eq 0 ]; then
-	SIZE=1
-fi
+	port=$((10000+$TEST_COUNT))
+	TEST_COUNT=$((TEST_COUNT+1))
+
+	:> "$cout"
+	:> "$sout"
+
+	printf "%-4s %-5s -> %-4s (%s:%d) %-5s\t" ${connector_ns} ${cl_proto} ${listener_ns} ${connect_addr} ${port} ${srv_proto}
+
+	ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} 0.0.0.0 < "$sin" > "$sout" &
+	spid=$!
+
+	sleep 1
+
+	ip netns exec ${connector_ns} ./mptcp_connect -t $timeout -p $port -s ${cl_proto} $connect_addr < "$cin" > "$cout" &
+	cpid=$!
+
+	wait $cpid
+	retc=$?
+	wait $spid
+	rets=$?
+
+	if [ ${rets} -ne 0 ] || [ ${retc} -ne 0 ]; then
+		echo "[ FAIL ] client exit code $retc, server $rets" 1>&2
+		echo "\nnetns ${listener_ns} socket stat for $port:" 1>&2
+		ip netns exec ${listener_ns} ss -nita 1>&2 -o "sport = :$port"
+		echo "\nnetns ${connector_ns} socket stat for $port:" 1>&2
+		ip netns exec ${connector_ns} ss -nita 1>&2 -o "dport = :$port"
+
+		return 1
+	fi
+
+	check_transfer $sin $cout "file received by client"
+	retc=$?
+	check_transfer $cin $sout "file received by server"
+	rets=$?
+
+	if [ $retc -eq 0 ] && [ $rets -eq 0 ];then
+		echo "[ OK ]"
+		return 0
+	fi
+
+	return 1
+}
+
+make_file()
+{
+	name=$1
+	who=$2
+
+	SIZE=$((RANDOM % (1024 * 8)))
+	TSIZE=$((SIZE * 1024))
+
+	dd if=/dev/urandom of="$name" bs=1024 count=$SIZE 2> /dev/null
+
+	SIZE=$((RANDOM % 1024))
+	SIZE=$((SIZE + 128))
+	TSIZE=$((TSIZE + SIZE))
+	dd if=/dev/urandom conv=notrunc oflag=append of="$name" bs=1 count=$SIZE 2> /dev/null
+	echo -e "\nMPTCP_TEST_FILE_END_MARKER" >> "$name"
+
+	echo "Created $name (size $TSIZE) containing data sent by $who"
+}
+
+run_tests()
+{
+	listener_ns="$1"
+	connector_ns="$2"
+	connect_addr="$3"
+	lret=0
+
+	for proto in MPTCP TCP;do
+		do_transfer ${listener_ns} ${connector_ns} MPTCP "$proto" ${connect_addr}
+		lret=$?
+		if [ $lret -ne 0 ]; then
+			ret=$lret
+			return
+		fi
+	done
+
+	do_transfer ${listener_ns} ${connector_ns} TCP MPTCP ${connect_addr}
+	lret=$?
+	if [ $lret -ne 0 ]; then
+		ret=$lret
+		return
+	fi
+}
+
+make_file "$cin" "client"
+make_file "$sin" "server"
+
+for sender in 1 2 3 4;do
+	do_ping ns1 ns$sender 10.0.1.1
+
+	do_ping ns2 ns$sender 10.0.1.2
+	do_ping ns2 ns$sender 10.0.2.1
+
+	do_ping ns3 ns$sender 10.0.2.2
+	do_ping ns3 ns$sender 10.0.3.2
+
+	do_ping ns4 ns$sender 10.0.3.1
+done
+
+tc -net ns2 qdisc add dev ns2eth3 root netem loss random 1
+tc -net ns3 qdisc add dev ns3eth4 root netem delay 10ms reorder 25% 50% gap 5
+
+for sender in 1 2 3 4;do
+	run_tests ns1 ns$sender 10.0.1.1
+
+	run_tests ns2 ns$sender 10.0.1.2
+	run_tests ns2 ns$sender 10.0.2.1
 
-dd if=/dev/urandom of="$tmpin" bs=1 count=$SIZE 2> /dev/null
+	run_tests ns3 ns$sender 10.0.2.2
+	run_tests ns3 ns$sender 10.0.3.2
 
-check_transfer MPTCP MPTCP
-check_transfer MPTCP TCP
-check_transfer TCP MPTCP
+	run_tests ns4 ns$sender 10.0.3.1
+done
 
-exit 0
+exit $ret
-- 
2.22.0



* [RFC PATCH net-next 24/33] mptcp: selftests: Add capture option
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (22 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 23/33] mptcp: selftests: switch to netns+veth based tests Mat Martineau
@ 2019-06-17 22:57 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 25/33] mptcp: use sk_page_frag() in sendmsg Mat Martineau
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:57 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Added a "-c" command line option for mptcp_connect.sh to make it easier
to capture packets from each test. The script will use tcpdump to create
one .pcap file per test case, named according to the namespaces,
protocols, and connect address in use. For example, the first test case
writes the capture to ns1-ns1-MPTCP-MPTCP-10.0.1.1.pcap

The stderr output from tcpdump is printed after the test completes to
show tcpdump's "packets dropped by kernel" information.
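
The naming scheme can be written down as a format string. The sketch
below is only an illustration in C of what the script does in shell;
capture_file_name() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the script's capture file naming:
 * <listener ns>-<connector ns>-<client proto>-<server proto>-<addr>.pcap
 * Returns 0 on success, -1 if the name did not fit into buf.
 */
static int capture_file_name(char *buf, size_t len,
			     const char *listener_ns,
			     const char *connector_ns,
			     const char *cl_proto,
			     const char *srv_proto,
			     const char *connect_addr)
{
	int n;

	assert(buf && len > 0);

	n = snprintf(buf, len, "%s-%s-%s-%s-%s.pcap", listener_ns,
		     connector_ns, cl_proto, srv_proto, connect_addr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with ns1, ns1, MPTCP, MPTCP and 10.0.1.1, this produces the
file name given in the example above.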

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 .../selftests/net/mptcp/mptcp_connect.sh      | 33 +++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
index e694dc9d312c..4418163af001 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
@@ -7,6 +7,7 @@ sout=""
 cin=""
 cout=""
 ksft_skip=4
+capture=0
 timeout=30
 
 TEST_COUNT=0
@@ -15,12 +16,19 @@ cleanup()
 {
 	rm -f "$cin" "$cout"
 	rm -f "$sin" "$sout"
+	rm -f "$capout"
 
 	for i in 1 2 3 4; do
 		ip netns del ns$i
 	done
 }
 
+for arg in "$@"; do
+    if [ "$arg" = "-c" ]; then
+	capture=1
+    fi
+done
+
 ip -Version > /dev/null 2>&1
 if [ $? -ne 0 ];then
 	echo "SKIP: Could not run test without ip tool"
@@ -31,6 +39,7 @@ sin=$(mktemp)
 sout=$(mktemp)
 cin=$(mktemp)
 cout=$(mktemp)
+capout=$(mktemp)
 trap cleanup EXIT
 
 for i in 1 2 3 4;do
@@ -123,9 +132,25 @@ do_transfer()
 
 	:> "$cout"
 	:> "$sout"
+	:> "$capout"
 
 	printf "%-4s %-5s -> %-4s (%s:%d) %-5s\t" ${connector_ns} ${cl_proto} ${listener_ns} ${connect_addr} ${port} ${srv_proto}
 
+	if [ $capture -eq 1 ]; then
+	    if [ -z $SUDO_USER ] ; then
+		capuser=""
+	    else
+		capuser="-Z $SUDO_USER"
+	    fi
+
+	    capfile="${listener_ns}-${connector_ns}-${cl_proto}-${srv_proto}-${connect_addr}.pcap"
+
+	    ip netns exec ${listener_ns} tcpdump -i any -s 65535 -B 32768 $capuser -w $capfile > "$capout" 2>&1 &
+	    cappid=$!
+
+	    sleep 1
+	fi
+
 	ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} 0.0.0.0 < "$sin" > "$sout" &
 	spid=$!
 
@@ -139,6 +164,11 @@ do_transfer()
 	wait $spid
 	rets=$?
 
+	if [ $capture -eq 1 ]; then
+	    sleep 1
+	    kill $cappid
+	fi
+
 	if [ ${rets} -ne 0 ] || [ ${retc} -ne 0 ]; then
 		echo "[ FAIL ] client exit code $retc, server $rets" 1>&2
 		echo "\nnetns ${listener_ns} socket stat for $port:" 1>&2
@@ -146,6 +176,7 @@ do_transfer()
 		echo "\nnetns ${connector_ns} socket stat for $port:" 1>&2
 		ip netns exec ${connector_ns} ss -nita 1>&2 -o "dport = :$port"
 
+		cat "$capout"
 		return 1
 	fi
 
@@ -156,9 +187,11 @@ do_transfer()
 
 	if [ $retc -eq 0 ] && [ $rets -eq 0 ];then
 		echo "[ OK ]"
+		cat "$capout"
 		return 0
 	fi
 
+	cat "$capout"
 	return 1
 }
 
-- 
2.22.0



* [RFC PATCH net-next 25/33] mptcp: use sk_page_frag() in sendmsg
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (23 preceding siblings ...)
  2019-06-17 22:57 ` [RFC PATCH net-next 24/33] mptcp: selftests: Add capture option Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 26/33] mptcp: sendmsg() do spool all the provided data Mat Martineau
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

This cleans up the send path a bit and allows better performance.
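
The offset bookkeeping the patch relies on can be sketched in
userspace. struct frag_buf, frag_copy() and frag_commit() are
hypothetical stand-ins for the kernel's struct page_frag handling, and
the 32-byte "page" is an arbitrary size chosen for illustration: data
is copied behind the current offset, limited to the room left, and the
offset only advances by what the transport actually accepted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's struct page_frag. */
struct frag_buf {
	char page[32];
	size_t offset;	/* first unused byte */
	size_t size;	/* usable bytes in page */
};

/* Copy up to `left` bytes behind the current offset and return the
 * number of bytes copied. The offset is not advanced here.
 */
static size_t frag_copy(struct frag_buf *f, const char *src, size_t left)
{
	size_t room, n;

	assert(f->offset <= f->size && f->size <= sizeof(f->page));

	room = f->size - f->offset;
	n = left < room ? left : room;
	memcpy(f->page + f->offset, src, n);
	return n;
}

/* Advance past the bytes the transport accepted. */
static void frag_commit(struct frag_buf *f, size_t sent)
{
	assert(sent <= f->size - f->offset);
	f->offset += sent;
}
```

Keeping copy and commit separate matches the order in the patch, where
pfrag->offset is only increased by the return value of the send call.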

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 41 ++++++++++++++++++++---------------------
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0db4099d9c13..98257a70ac2b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -52,10 +52,11 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	int mss_now, size_goal, poffset, ret;
 	struct mptcp_ext *mpext = NULL;
-	struct page *page = NULL;
+	struct page_frag *pfrag;
 	struct sk_buff *skb;
 	struct sock *ssk;
 	size_t psize;
+	long timeo;
 
 	pr_debug("msk=%p", msk);
 	if (msk->subflow) {
@@ -80,33 +81,33 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		goto put_out;
 	}
 
-	/* Initial experiment: new page per send.  Real code will
-	 * maintain list of active pages and DSS mappings, append to the
-	 * end and honor zerocopy
+	lock_sock(sk);
+	lock_sock(ssk);
+	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+	/* use the mptcp page cache so that we can easily move the data
+	 * from one substream to another, but do per subflow memory accounting
 	 */
-	page = alloc_page(GFP_KERNEL);
-	if (!page) {
-		ret = -ENOMEM;
-		goto put_out;
+	pfrag = sk_page_frag(sk);
+	while (!sk_page_frag_refill(ssk, pfrag)) {
+		ret = sk_stream_wait_memory(ssk, &timeo);
+		if (ret)
+			goto release_out;
 	}
 
 	/* Copy to page */
-	poffset = 0;
+	poffset = pfrag->offset;
 	pr_debug("left=%zu", msg_data_left(msg));
-	psize = copy_page_from_iter(page, poffset,
+	psize = copy_page_from_iter(pfrag->page, poffset,
 				    min_t(size_t, msg_data_left(msg),
-					  PAGE_SIZE),
+					  pfrag->size - poffset),
 				    &msg->msg_iter);
 	pr_debug("left=%zu", msg_data_left(msg));
-
 	if (!psize) {
 		ret = -EINVAL;
-		goto put_out;
+		goto release_out;
 	}
 
-	lock_sock(sk);
-	lock_sock(ssk);
-
 	/* Mark the end of the previous write so the beginning of the
 	 * next write (with its own mptcp skb extension data) is not
 	 * collapsed.
@@ -116,8 +117,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		TCP_SKB_CB(skb)->eor = 1;
 
 	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
-
-	ret = do_tcp_sendpages(ssk, page, poffset, min_t(int, size_goal, psize),
+	psize = min_t(int, size_goal, psize);
+	ret = do_tcp_sendpages(ssk, pfrag->page, poffset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
 		goto release_out;
@@ -143,6 +144,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			 mpext->checksum, mpext->dsn64);
 	} /* TODO: else fallback */
 
+	pfrag->offset += ret;
 	msk->write_seq += ret;
 	subflow_ctx(ssk)->rel_write_seq += ret;
 
@@ -153,9 +155,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	release_sock(sk);
 
 put_out:
-	if (page)
-		put_page(page);
-
 	sock_put(ssk);
 	return ret;
 }
-- 
2.22.0



* [RFC PATCH net-next 26/33] mptcp: sendmsg() do spool all the provided data
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (24 preceding siblings ...)
  2019-06-17 22:58 ` [RFC PATCH net-next 25/33] mptcp: use sk_page_frag() in sendmsg Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 27/33] mptcp: allow collapsing consecutive sendpages on the same substream Mat Martineau
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

This makes MPTCP sendmsg() behaviour more consistent and
improves transmit performance.
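
The return-value convention of the new loop (report the bytes spooled
so far if any, otherwise the error) can be sketched in userspace.
Everything here is a hypothetical model: struct frag_sink and
sink_send_frag() stand in for the subflow and the per-fragment send,
and unlike the patch the sketch also stops on a zero return so it
cannot spin.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sink that accepts at most max_frag bytes per call. */
struct frag_sink {
	char data[64];
	size_t used;
	size_t max_frag;
};

/* Accept one fragment; returns bytes taken or -EAGAIN when full. */
static int sink_send_frag(struct frag_sink *s, const char *buf, size_t len)
{
	size_t room = sizeof(s->data) - s->used;
	size_t n = len < s->max_frag ? len : s->max_frag;

	if (room == 0)
		return -EAGAIN;
	if (n > room)
		n = room;
	memcpy(s->data + s->used, buf, n);
	s->used += n;
	return (int)n;
}

/* Keep sending fragments until all data is spooled or an error occurs.
 * Returns the bytes spooled if any, else the last error code.
 */
static long spool_all(struct frag_sink *s, const char *buf, size_t len)
{
	size_t copied = 0;
	int ret = 0;

	assert(s->max_frag > 0 && s->used <= sizeof(s->data));

	while (copied < len) {
		ret = sink_send_frag(s, buf + copied, len - copied);
		if (ret <= 0)
			break;
		copied += (size_t)ret;
	}
	return copied > 0 ? (long)copied : ret;
}
```

A caller therefore sees a short count when the sink fills up midway
and only sees the error code when nothing at all could be spooled.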

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 110 +++++++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 47 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 98257a70ac2b..d51201c09519 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -47,66 +47,37 @@ static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
 	return NULL;
 }
 
-static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
+			      struct msghdr *msg, long *timeo)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	int mss_now, size_goal, poffset, ret;
 	struct mptcp_ext *mpext = NULL;
+	int mss_now, size_goal, ret;
 	struct page_frag *pfrag;
 	struct sk_buff *skb;
-	struct sock *ssk;
 	size_t psize;
-	long timeo;
-
-	pr_debug("msk=%p", msk);
-	if (msk->subflow) {
-		pr_debug("fallback passthrough");
-		return sock_sendmsg(msk->subflow, msg);
-	}
-
-	ssk = mptcp_subflow_get_ref(msk);
-	if (!ssk)
-		return -ENOTCONN;
-
-	if (!msg_data_left(msg)) {
-		pr_debug("empty send");
-		ret = sock_sendmsg(ssk->sk_socket, msg);
-		goto put_out;
-	}
-
-	pr_debug("conn_list->subflow=%p", ssk);
-
-	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) {
-		ret = -ENOTSUPP;
-		goto put_out;
-	}
-
-	lock_sock(sk);
-	lock_sock(ssk);
-	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 
 	/* use the mptcp page cache so that we can easily move the data
 	 * from one substream to another, but do per subflow memory accounting
 	 */
 	pfrag = sk_page_frag(sk);
 	while (!sk_page_frag_refill(ssk, pfrag)) {
-		ret = sk_stream_wait_memory(ssk, &timeo);
+		ret = sk_stream_wait_memory(ssk, timeo);
 		if (ret)
-			goto release_out;
+			return ret;
 	}
 
-	/* Copy to page */
-	poffset = pfrag->offset;
+	/* compute copy limit */
+	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
+	psize = min_t(int, pfrag->size - pfrag->offset, size_goal);
+
 	pr_debug("left=%zu", msg_data_left(msg));
-	psize = copy_page_from_iter(pfrag->page, poffset,
-				    min_t(size_t, msg_data_left(msg),
-					  pfrag->size - poffset),
+	psize = copy_page_from_iter(pfrag->page, pfrag->offset,
+				    min_t(size_t, msg_data_left(msg), psize),
 				    &msg->msg_iter);
 	pr_debug("left=%zu", msg_data_left(msg));
-	if (!psize) {
-		ret = -EINVAL;
-		goto release_out;
-	}
+	if (!psize)
+		return -EINVAL;
 
 	/* Mark the end of the previous write so the beginning of the
 	 * next write (with its own mptcp skb extension data) is not
@@ -116,12 +87,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (skb)
 		TCP_SKB_CB(skb)->eor = 1;
 
-	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
-	psize = min_t(int, size_goal, psize);
-	ret = do_tcp_sendpages(ssk, pfrag->page, poffset, psize,
+	ret = do_tcp_sendpages(ssk, pfrag->page, pfrag->offset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
-		goto release_out;
+		return ret;
+	if (unlikely(ret < psize))
+		iov_iter_revert(&msg->msg_iter, psize - ret);
 
 	if (skb == tcp_write_queue_tail(ssk))
 		pr_err("no new skb %p/%p", sk, ssk);
@@ -149,8 +120,53 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	subflow_ctx(ssk)->rel_write_seq += ret;
 
 	tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle, size_goal);
+	return ret;
+}
+
+static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	size_t copied = 0;
+	struct sock *ssk;
+	int ret = 0;
+	long timeo;
+
+	pr_debug("msk=%p", msk);
+	if (msk->subflow) {
+		pr_debug("fallback passthrough");
+		return sock_sendmsg(msk->subflow, msg);
+	}
+
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk)
+		return -ENOTCONN;
+
+	if (!msg_data_left(msg)) {
+		pr_debug("empty send");
+		ret = sock_sendmsg(ssk->sk_socket, msg);
+		goto put_out;
+	}
+
+	pr_debug("conn_list->subflow=%p", ssk);
+
+	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) {
+		ret = -ENOTSUPP;
+		goto put_out;
+	}
+
+	lock_sock(sk);
+	lock_sock(ssk);
+	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+	while (msg_data_left(msg)) {
+		ret = mptcp_sendmsg_frag(sk, ssk, msg, &timeo);
+		if (ret < 0)
+			break;
+
+		copied += ret;
+	}
+	if (copied > 0)
+		ret = copied;
 
-release_out:
 	release_sock(ssk);
 	release_sock(sk);
 
-- 
2.22.0



* [RFC PATCH net-next 27/33] mptcp: allow collapsing consecutive sendpages on the same substream
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (25 preceding siblings ...)
  2019-06-17 22:58 ` [RFC PATCH net-next 26/33] mptcp: sendmsg() do spool all the provided data Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 28/33] tcp: Check for filled TCP option space before SACK Mat Martineau
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

If the current sendmsg() lands on the same subflow we used last, we
can try to collapse the data.
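
The MPTCP-level part of the collapse decision can be sketched in
userspace. struct dss_mapping and both helpers are hypothetical, and
the field widths are assumptions; in the patch the same test is applied
to the mptcp_ext of the last queued skb, after tcp_skb_can_collapse_to()
has already agreed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mapping carried in struct mptcp_ext. */
struct dss_mapping {
	uint64_t data_seq;	/* first MPTCP-level sequence number */
	uint16_t data_len;	/* bytes covered by the mapping */
};

/* New data may be appended only if the previous mapping ends exactly
 * at the current MPTCP-level write sequence.
 */
static bool mapping_is_contiguous(const struct dss_mapping *m,
				  uint64_t write_seq)
{
	return m && m->data_seq + m->data_len == write_seq;
}

/* Bytes that can still be appended to the tail skb before it reaches
 * size_goal; 0 means there is no room, so collapsing is not possible.
 */
static int collapse_room(int size_goal, int skb_len)
{
	assert(size_goal >= 0 && skb_len >= 0);
	return size_goal - skb_len > 0 ? size_goal - skb_len : 0;
}
```

For a mapping starting at 1000 with length 200, only a write sequence
of 1200 allows collapsing; a missing mapping never does.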

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 79 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 60 insertions(+), 19 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index d51201c09519..3fb0f3163743 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -47,12 +47,25 @@ static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
 	return NULL;
 }
 
+static inline bool mptcp_skb_can_collapse_to(const struct mptcp_sock *msk,
+					     const struct sk_buff *skb,
+					     const struct mptcp_ext *mpext)
+{
+	if (!tcp_skb_can_collapse_to(skb))
+		return false;
+
+	/* can collapse only if MPTCP level sequence is in order */
+	return mpext && mpext->data_seq + mpext->data_len == msk->write_seq;
+}
+
 static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
-			      struct msghdr *msg, long *timeo)
+			      struct msghdr *msg, long *timeo, int *pmss_now,
+			      int *ps_goal)
 {
+	int mss_now, avail_size, size_goal, ret;
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	bool collapsed, can_collapse = false;
 	struct mptcp_ext *mpext = NULL;
-	int mss_now, size_goal, ret;
 	struct page_frag *pfrag;
 	struct sk_buff *skb;
 	size_t psize;
@@ -69,8 +82,31 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 
 	/* compute copy limit */
 	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
-	psize = min_t(int, pfrag->size - pfrag->offset, size_goal);
+	*pmss_now = mss_now;
+	*ps_goal = size_goal;
+	avail_size = size_goal;
+	skb = tcp_write_queue_tail(ssk);
+	if (skb) {
+		mpext = skb_ext_find(skb, SKB_EXT_MPTCP);
+
+		/* Limit the write to the size available in the
+		 * current skb, if any, so that we create at most a new skb.
+		 * If we run out of space in the current skb (e.g. the window
+		 * size shrunk from last sent) a new skb will be allocated even
+		 * if collapsing was allowed: collapsing is effectively
+		 * disabled.
+		 */
+		can_collapse = mptcp_skb_can_collapse_to(msk, skb, mpext);
+		if (!can_collapse)
+			TCP_SKB_CB(skb)->eor = 1;
+		else if (size_goal - skb->len > 0)
+			avail_size = size_goal - skb->len;
+		else
+			can_collapse = false;
+	}
+	psize = min_t(size_t, pfrag->size - pfrag->offset, avail_size);
 
+	/* Copy to page */
 	pr_debug("left=%zu", msg_data_left(msg));
 	psize = copy_page_from_iter(pfrag->page, pfrag->offset,
 				    min_t(size_t, msg_data_left(msg), psize),
@@ -79,14 +115,9 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	if (!psize)
 		return -EINVAL;
 
-	/* Mark the end of the previous write so the beginning of the
-	 * next write (with its own mptcp skb extension data) is not
-	 * collapsed.
+	/* tell the TCP stack to delay the push so that we can safely
+	 * access the skb after the sendpages call
 	 */
-	skb = tcp_write_queue_tail(ssk);
-	if (skb)
-		TCP_SKB_CB(skb)->eor = 1;
-
 	ret = do_tcp_sendpages(ssk, pfrag->page, pfrag->offset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
@@ -94,13 +125,16 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	if (unlikely(ret < psize))
 		iov_iter_revert(&msg->msg_iter, psize - ret);
 
-	if (skb == tcp_write_queue_tail(ssk))
-		pr_err("no new skb %p/%p", sk, ssk);
+	collapsed = skb == tcp_write_queue_tail(ssk);
+	BUG_ON(collapsed && !can_collapse);
+	if (collapsed) {
+		/* when collapsing mpext always exists */
+		mpext->data_len += ret;
+		goto out;
+	}
 
 	skb = tcp_write_queue_tail(ssk);
-
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
-
 	if (mpext) {
 		memset(mpext, 0, sizeof(*mpext));
 		mpext->data_seq = msk->write_seq;
@@ -113,22 +147,25 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 		pr_debug("data_seq=%llu subflow_seq=%u data_len=%u checksum=%u, dsn64=%d",
 			 mpext->data_seq, mpext->subflow_seq, mpext->data_len,
 			 mpext->checksum, mpext->dsn64);
-	} /* TODO: else fallback */
+	}
+	/* TODO: else fallback; allocation can fail, but we can't easily retire
+	 * skbs from the write_queue, as we need to roll-back TCP status
+	 */
 
+out:
 	pfrag->offset += ret;
 	msk->write_seq += ret;
 	subflow_ctx(ssk)->rel_write_seq += ret;
 
-	tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle, size_goal);
 	return ret;
 }
 
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
+	int mss_now = 0, size_goal = 0, ret = 0;
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	size_t copied = 0;
 	struct sock *ssk;
-	int ret = 0;
 	long timeo;
 
 	pr_debug("msk=%p", msk);
@@ -158,14 +195,18 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	lock_sock(ssk);
 	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 	while (msg_data_left(msg)) {
-		ret = mptcp_sendmsg_frag(sk, ssk, msg, &timeo);
+		ret = mptcp_sendmsg_frag(sk, ssk, msg, &timeo, &mss_now,
+					 &size_goal);
 		if (ret < 0)
 			break;
 
 		copied += ret;
 	}
-	if (copied > 0)
+	if (copied) {
 		ret = copied;
+		tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle,
+			 size_goal);
+	}
 
 	release_sock(ssk);
 	release_sock(sk);
-- 
2.22.0



* [RFC PATCH net-next 28/33] tcp: Check for filled TCP option space before SACK
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (26 preceding siblings ...)
  2019-06-17 22:58 ` [RFC PATCH net-next 27/33] mptcp: allow collapsing consecutive sendpages on the same substream Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 29/33] mptcp: accept: don't leak mptcp socket structure Mat Martineau
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

The SACK code would potentially add four bytes to the expected
TCP option size even if all option space was already used.
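
The effect of the added guard can be illustrated with a simplified
user-space model. This is a sketch under stated assumptions, not the
kernel function: options_with_sack() is a hypothetical helper, and the
constants are redefined locally with the values used by the TCP stack
(40 bytes of option space, a 4-byte aligned SACK header, 8 bytes per
SACK block).

```c
#include <assert.h>

#define MAX_TCP_OPTION_SPACE		40
#define TCPOLEN_SACK_BASE_ALIGNED	4
#define TCPOLEN_SACK_PERBLOCK		8

/* Hypothetical helper: option bytes in use once SACK blocks are considered.
 * size is the space already taken by other options, eff_sacks the number
 * of SACK blocks we would like to send.
 */
static unsigned int options_with_sack(unsigned int size, unsigned int eff_sacks)
{
	unsigned int remaining, blocks;

	/* The guard from the patch: no room left for even the SACK header */
	if (size + TCPOLEN_SACK_BASE_ALIGNED >= MAX_TCP_OPTION_SPACE)
		return size;

	if (!eff_sacks)
		return size;

	remaining = MAX_TCP_OPTION_SPACE - size;
	blocks = (remaining - TCPOLEN_SACK_BASE_ALIGNED) / TCPOLEN_SACK_PERBLOCK;
	if (blocks > eff_sacks)
		blocks = eff_sacks;

	return size + TCPOLEN_SACK_BASE_ALIGNED + blocks * TCPOLEN_SACK_PERBLOCK;
}
```

With the guard in place the returned size never exceeds
MAX_TCP_OPTION_SPACE, whatever the other options have already consumed.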

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/ipv4/tcp_output.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5fe9459bbd6a..e980546e330a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -805,6 +805,9 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 		}
 	}
 
+	if (size + TCPOLEN_SACK_BASE_ALIGNED >= MAX_TCP_OPTION_SPACE)
+		return size;
+
 	eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
 	if (unlikely(eff_sacks)) {
 		const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-- 
2.22.0



* [RFC PATCH net-next 29/33] mptcp: accept: don't leak mptcp socket structure
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (27 preceding siblings ...)
  2019-06-17 22:58 ` [RFC PATCH net-next 28/33] tcp: Check for filled TCP option space before SACK Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 30/33] mptcp: switch sublist to mptcp socket lock protection Mat Martineau
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

accept() is supposed to prepare and return a 'struct sock'.

The caller holds a new inode/socket, and will associate the
returned sock with it.

mptcp_accept(), however, allocates it via 'struct socket' and then
returns socket->sk.
This then leaks the outer socket struct inode returned by sock_create():

unreferenced object 0xffff88810512e8c0 (size 936):
  comm "mptcp_connect", [..]
  backtrace:
    [<00000000872561ba>] alloc_inode+0x35/0xe0
    [<00000000646e04ed>] new_inode_pseudo+0x12/0x80
    [<00000000e2e77036>] sock_alloc+0x26/0x100
    [<00000000870a8688>] __sock_create+0x8f/0x3c0
    [<000000000558b3fa>] mptcp_accept+0x140/0x530
    [<0000000044718d60>] inet_accept+0xac/0x470
    [<00000000da8f3979>] mptcp_stream_accept+0x62/0xa0
    [<00000000c9010499>] __sys_accept4+0x228/0x3c0

To fix this, make several (unfortunately, intrusive) changes:

1. Instead of allocating a new mptcp socket, clone the
   mptcp listen socket socket->sk.

   This gives us an mptcp sock without the inode container.
   We return this sock struct to the caller; the caller will then
   complete creation of the mptcp socket.

2. For the 'compat' (old tcp, not mp capable) case, return the
   tcp socket directly and release the socket coming from
   kernel_accept().

   We can use mptcp_getname() to override the socket->ops to
   tcp.  This will make mptcp_accept work like tcp accept, and
   mptcp won't be involved anymore for the particular socket.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/mptcp/protocol.c | 69 +++++++++++++++++++++++++++++---------------
 1 file changed, 46 insertions(+), 23 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3fb0f3163743..8c7b0f39394e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -617,9 +617,9 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct socket *listener = msk->subflow;
-	struct socket *new_sock;
-	struct socket *new_mptcp_sock;
 	struct subflow_context *subflow;
+	struct socket *new_sock;
+	struct sock *newsk;
 
 	pr_debug("msk=%p, listener=%p", msk, subflow_ctx(listener->sk));
 	*err = kernel_accept(listener, &new_sock, flags);
@@ -629,28 +629,31 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 	subflow = subflow_ctx(new_sock->sk);
 	pr_debug("msk=%p, new subflow=%p, ", msk, subflow);
 
-	*err = sock_create(PF_INET, SOCK_STREAM, IPPROTO_MPTCP,
-			   &new_mptcp_sock);
-	if (*err < 0) {
-		kernel_sock_shutdown(new_sock, SHUT_RDWR);
-		sock_release(new_sock);
-		return NULL;
-	}
-
-	msk = mptcp_sk(new_mptcp_sock->sk);
-	pr_debug("new msk=%p", msk);
-
 	if (subflow->mp_capable) {
+		struct sock *new_mptcp_sock;
 		u64 ack_seq;
 
+		local_bh_disable();
+		new_mptcp_sock = sk_clone_lock(sk, GFP_ATOMIC);
+		if (!new_mptcp_sock) {
+			*err = -ENOBUFS;
+			local_bh_enable();
+			kernel_sock_shutdown(new_sock, SHUT_RDWR);
+			sock_release(new_sock);
+			return NULL;
+		}
+
+		mptcp_init_sock(new_mptcp_sock);
+
+		msk = mptcp_sk(new_mptcp_sock);
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
-		token_update_accept(new_sock->sk, new_mptcp_sock->sk);
-		spin_lock_bh(&msk->conn_list_lock);
+		token_update_accept(new_sock->sk, new_mptcp_sock);
+		spin_lock(&msk->conn_list_lock);
 		list_add_rcu(&subflow->node, &msk->conn_list);
 		msk->subflow = NULL;
-		spin_unlock_bh(&msk->conn_list_lock);
+		spin_unlock(&msk->conn_list_lock);
 
 		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
@@ -659,14 +662,20 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		subflow->map_seq = ack_seq;
 		subflow->map_subflow_seq = 1;
 		subflow->rel_write_seq = 1;
-		subflow->conn = new_mptcp_sock->sk;
 		subflow->tcp_sock = new_sock;
+		newsk = new_mptcp_sock;
+		subflow->conn = new_mptcp_sock;
+		bh_unlock_sock(new_mptcp_sock);
+		local_bh_enable();
+		inet_sk_state_store(newsk, TCP_ESTABLISHED);
 	} else {
-		msk->subflow = new_sock;
+		newsk = new_sock->sk;
+		tcp_sk(newsk)->is_mptcp = 0;
+		new_sock->sk = NULL;
+		sock_release(new_sock);
 	}
-	inet_sk_state_store(new_mptcp_sock->sk, TCP_ESTABLISHED);
 
-	return new_mptcp_sock->sk;
+	return newsk;
 }
 
 static void mptcp_destroy(struct sock *sk)
@@ -854,6 +863,18 @@ static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
 
 	pr_debug("msk=%p", msk);
 
+	if (sock->sk->sk_prot == &tcp_prot) {
+		/* we are being invoked from __sys_accept4, after
+		 * mptcp_accept() has just accepted a non-mp-capable
+		 * flow: sk is a tcp_sk, not an mptcp one.
+		 *
+		 * Hand the socket over to tcp so all further socket ops
+		 * bypass mptcp.
+		 */
+		sock->ops = &inet_stream_ops;
+		return inet_getname(sock, uaddr, peer);
+	}
+
 	if (msk->subflow) {
 		pr_debug("subflow=%p", msk->subflow->sk);
 		return inet_getname(msk->subflow, uaddr, peer);
@@ -967,10 +988,12 @@ static int mptcp_shutdown(struct socket *sock, int how)
 	lock_sock(sock->sk);
 	rcu_read_lock();
 	list_for_each_entry_rcu(subflow, &msk->conn_list, node) {
-		pr_debug("conn_list->subflow=%p", subflow);
+		struct socket *tcp_socket;
+
+		tcp_socket = mptcp_subflow_tcp_socket(subflow);
 		rcu_read_unlock();
-		ret = kernel_sock_shutdown(mptcp_subflow_tcp_socket(subflow),
-					   how);
+		pr_debug("conn_list->subflow=%p", subflow);
+		ret = kernel_sock_shutdown(tcp_socket, how);
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
-- 
2.22.0



* [RFC PATCH net-next 30/33] mptcp: switch sublist to mptcp socket lock protection
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (28 preceding siblings ...)
  2019-06-17 22:58 ` [RFC PATCH net-next 29/33] mptcp: accept: don't leak mptcp socket structure Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 31/33] mptcp: Add path manager interface Mat Martineau
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

The mptcp sublist is currently guarded by rcu, but this comes with
several artifacts that make little sense.

1. There is a synchronize_rcu after stealing the subflow list on
   each mptcp socket close.

   synchronize_rcu() is a very expensive call, and should not be
   needed.

2. There is an extra spinlock to guard the list, yet we can't use
   the lock in some cases because we need to call functions that
   might sleep.

3. There is a 'mptcp_subflow_hold()' function that uses
   an 'atomic_inc_not_zero' call.  This wasn't needed even with
   current code:  The owning mptcp socket holds references on its
   subflows, so a subflow socket that is still found on the list
   will always have a nonzero reference count.

This changes the subflow list to always be guarded by the owning
mptcp socket lock.  This is safe as long as no code path that holds
an mptcp subflow tcp socket lock will try to lock the owning mptcp
socket's lock.

The inverse -- locking the tcp subflow lock while holding the
mptcp lock -- is fine.
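
The ordering rule above can be written down as a small check. This is
only an illustrative user-space sketch: 'enum lock_class' and
lock_order_ok() are hypothetical names, and the kernel relies on the
socket lock and lockdep rather than on anything like this helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes, ordered from outermost to innermost. */
enum lock_class {
	LOCK_NONE = 0,		/* nothing held yet */
	LOCK_MPTCP = 1,		/* owning MPTCP socket lock */
	LOCK_SUBFLOW = 2,	/* subflow TCP socket lock */
};

/* A lock may be taken only if it is strictly inner to the one already
 * held: MPTCP first, then subflow, never the reverse.
 */
static bool lock_order_ok(enum lock_class held, enum lock_class wanted)
{
	return wanted > held;
}
```

The patch makes one deliberate exception in mptcp_finish_connect(),
which takes the MPTCP lock while holding a subflow lock; the comment
added there explains that this is acceptable because that subflow is not
yet on the MPTCP socket's subflow list.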

mptcp_subflow_get_ref() will have to be altered later when we
support multiple subflows so it will pick a 'preferred' subflow
rather than the first one in the list.

v4: - remove all sk_state changes added in v2/v3, they do not
    belong here -- it should be done as a separate change.
    - prefer mptcp_for_each_subflow rather than list_for_each_entry.

v3: use BH locking scheme in mptcp_finish_connect, there is no
    guarantee we are always called from backlog processing.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/mptcp/protocol.c | 159 +++++++++++++++++++------------------------
 net/mptcp/protocol.h |   3 +-
 2 files changed, 72 insertions(+), 90 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8c7b0f39394e..d4ffa47f53ef 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -24,26 +24,20 @@ static inline bool before64(__u64 seq1, __u64 seq2)
 
 #define after64(seq2, seq1)	before64(seq1, seq2)
 
-static bool mptcp_subflow_hold(struct subflow_context *subflow)
-{
-	struct sock *sk = mptcp_subflow_tcp_socket(subflow)->sk;
-
-	return refcount_inc_not_zero(&sk->sk_refcnt);
-}
-
 static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
 {
 	struct subflow_context *subflow;
 
-	rcu_read_lock();
+	sock_owned_by_me((const struct sock *)msk);
+
 	mptcp_for_each_subflow(msk, subflow) {
-		if (mptcp_subflow_hold(subflow)) {
-			rcu_read_unlock();
-			return mptcp_subflow_tcp_socket(subflow)->sk;
-		}
+		struct sock *sk;
+
+		sk = mptcp_subflow_tcp_socket(subflow)->sk;
+		sock_hold(sk);
+		return sk;
 	}
 
-	rcu_read_unlock();
 	return NULL;
 }
 
@@ -174,9 +168,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		return sock_sendmsg(msk->subflow, msg);
 	}
 
+	lock_sock(sk);
 	ssk = mptcp_subflow_get_ref(msk);
-	if (!ssk)
+	if (!ssk) {
+		release_sock(sk);
 		return -ENOTCONN;
+	}
 
 	if (!msg_data_left(msg)) {
 		pr_debug("empty send");
@@ -191,7 +188,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		goto put_out;
 	}
 
-	lock_sock(sk);
 	lock_sock(ssk);
 	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 	while (msg_data_left(msg)) {
@@ -209,9 +205,9 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	}
 
 	release_sock(ssk);
-	release_sock(sk);
 
 put_out:
+	release_sock(sk);
 	sock_put(ssk);
 	return ret;
 }
@@ -379,14 +375,16 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		return sock_recvmsg(msk->subflow, msg, flags);
 	}
 
+	lock_sock(sk);
 	ssk = mptcp_subflow_get_ref(msk);
-	if (!ssk)
+	if (!ssk) {
+		release_sock(sk);
 		return -ENOTCONN;
+	}
 
 	subflow = subflow_ctx(ssk);
 	tp = tcp_sk(ssk);
 
-	lock_sock(sk);
 	lock_sock(ssk);
 
 	desc.arg.data = &arg;
@@ -556,57 +554,36 @@ static int mptcp_init_sock(struct sock *sk)
 
 	pr_debug("msk=%p", msk);
 
-	INIT_LIST_HEAD_RCU(&msk->conn_list);
-	spin_lock_init(&msk->conn_list_lock);
+	INIT_LIST_HEAD(&msk->conn_list);
 
 	return 0;
 }
 
-static void mptcp_flush_conn_list(struct sock *sk, struct list_head *list)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	INIT_LIST_HEAD_RCU(list);
-	spin_lock_bh(&msk->conn_list_lock);
-	list_splice_init(&msk->conn_list, list);
-	spin_unlock_bh(&msk->conn_list_lock);
-
-	if (!list_empty(list))
-		synchronize_rcu();
-}
-
 static void mptcp_close(struct sock *sk, long timeout)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct subflow_context *subflow, *tmp;
 	struct socket *ssk = NULL;
-	struct list_head list;
 
 	inet_sk_state_store(sk, TCP_CLOSE);
 
-	spin_lock_bh(&msk->conn_list_lock);
+	lock_sock(sk);
+
 	if (msk->subflow) {
 		ssk = msk->subflow;
 		msk->subflow = NULL;
 	}
-	spin_unlock_bh(&msk->conn_list_lock);
+
 	if (ssk) {
 		pr_debug("subflow=%p", ssk->sk);
 		sock_release(ssk);
 	}
 
-	/* this is the only place where we can remove any entry from the
-	 * conn_list. Additionally acquiring the socket lock here
-	 * allows for mutual exclusion with mptcp_shutdown().
-	 */
-	lock_sock(sk);
-	mptcp_flush_conn_list(sk, &list);
-	release_sock(sk);
-
-	list_for_each_entry_safe(subflow, tmp, &list, node) {
+	list_for_each_entry_safe(subflow, tmp, &msk->conn_list, node) {
 		pr_debug("conn_list->subflow=%p", subflow);
 		sock_release(mptcp_subflow_tcp_socket(subflow));
 	}
+	release_sock(sk);
 
 	sock_orphan(sk);
 	sock_put(sk);
@@ -633,11 +610,14 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		struct sock *new_mptcp_sock;
 		u64 ack_seq;
 
+		lock_sock(sk);
+
 		local_bh_disable();
 		new_mptcp_sock = sk_clone_lock(sk, GFP_ATOMIC);
 		if (!new_mptcp_sock) {
 			*err = -ENOBUFS;
 			local_bh_enable();
+			release_sock(sk);
 			kernel_sock_shutdown(new_sock, SHUT_RDWR);
 			sock_release(new_sock);
 			return NULL;
@@ -650,10 +630,7 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
 		token_update_accept(new_sock->sk, new_mptcp_sock);
-		spin_lock(&msk->conn_list_lock);
-		list_add_rcu(&subflow->node, &msk->conn_list);
 		msk->subflow = NULL;
-		spin_unlock(&msk->conn_list_lock);
 
 		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
@@ -665,9 +642,11 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		subflow->tcp_sock = new_sock;
 		newsk = new_mptcp_sock;
 		subflow->conn = new_mptcp_sock;
+		list_add(&subflow->node, &msk->conn_list);
 		bh_unlock_sock(new_mptcp_sock);
 		local_bh_enable();
 		inet_sk_state_store(newsk, TCP_ESTABLISHED);
+		release_sock(sk);
 	} else {
 		newsk = new_sock->sk;
 		tcp_sk(newsk)->is_mptcp = 0;
@@ -750,14 +729,33 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 	if (mp_capable) {
 		u64 ack_seq;
 
+		/* sk (new subflow socket) is already locked, but we need
+		 * to lock the parent (mptcp) socket now to add the tcp socket
+		 * to the subflow list.
+		 *
+		 * From lockdep point of view, this creates an ABBA type
+		 * deadlock: Normally (sendmsg, recvmsg, ..), we lock the mptcp
+		 * socket, then acquire a subflow lock.
+		 * Here we do the reverse: "subflow lock, then mptcp lock".
+		 *
+		 * Its alright to do this here, because this subflow is not yet
+		 * on the mptcp sockets subflow list.
+		 *
+		 * IOW, if another CPU has this mptcp socket locked, it cannot
+		 * acquire this particular subflow, because subflow->sk isn't
+		 * on msk->conn_list.
+		 *
+		 * This function can be called either from backlog processing
+		 * (BH will be enabled) or from softirq, so we need to use BH
+		 * locking scheme.
+		 */
+		local_bh_disable();
+		bh_lock_sock_nested(sk);
+
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
 		pr_debug("msk=%p, token=%u", msk, msk->token);
-		spin_lock_bh(&msk->conn_list_lock);
-		list_add_rcu(&subflow->node, &msk->conn_list);
-		msk->subflow = NULL;
-		spin_unlock_bh(&msk->conn_list_lock);
 
 		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
@@ -766,6 +764,11 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 		subflow->map_seq = ack_seq;
 		subflow->map_subflow_seq = 1;
 		subflow->rel_write_seq = 1;
+
+		list_add(&subflow->node, &msk->conn_list);
+		msk->subflow = NULL;
+		bh_unlock_sock(sk);
+		local_bh_enable();
 	}
 	inet_sk_state_store(sk, TCP_ESTABLISHED);
 }
@@ -884,11 +887,15 @@ static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
 	 * is connected and there are multiple subflows is not defined.
 	 * For now just use the first subflow on the list.
 	 */
+	lock_sock(sock->sk);
 	ssk = mptcp_subflow_get_ref(msk);
-	if (!ssk)
+	if (!ssk) {
+		release_sock(sock->sk);
 		return -ENOTCONN;
+	}
 
 	ret = inet_getname(ssk->sk_socket, uaddr, peer);
+	release_sock(sock->sk);
 	sock_put(ssk);
 	return ret;
 }
@@ -928,39 +935,19 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 	const struct mptcp_sock *msk;
 	struct sock *sk = sock->sk;
 	__poll_t ret = 0;
-	unsigned int i;
 
 	msk = mptcp_sk(sk);
 	if (msk->subflow)
 		return tcp_poll(file, msk->subflow, wait);
 
-	i = 0;
-	for (;;) {
-		struct subflow_context *tmp = NULL;
-		int j = 0;
-
-		rcu_read_lock();
-		mptcp_for_each_subflow(msk, subflow) {
-			if (j < i) {
-				j++;
-				continue;
-			}
-
-			if (!mptcp_subflow_hold(subflow))
-				continue;
-
-			tmp = subflow;
-			i++;
-			break;
-		}
-		rcu_read_unlock();
-
-		if (!tmp)
-			break;
+	lock_sock(sk);
+	mptcp_for_each_subflow(msk, subflow) {
+		struct socket *tcp_sock;
 
-		ret |= tcp_poll(file, mptcp_subflow_tcp_socket(tmp), wait);
-		sock_put(mptcp_subflow_tcp_socket(tmp)->sk);
+		tcp_sock = mptcp_subflow_tcp_socket(subflow);
+		ret |= tcp_poll(file, tcp_sock, wait);
 	}
+	release_sock(sk);
 
 	return ret;
 }
@@ -979,24 +966,20 @@ static int mptcp_shutdown(struct socket *sock, int how)
 	}
 
 	/* protect against concurrent mptcp_close(), so that nobody can
-	 * remove entries from the conn list and walking the list breaking
-	 * the RCU critical section is still safe. We need to release the
-	 * RCU lock to call the blocking kernel_sock_shutdown() primitive
-	 * Note: we can't use MPTCP socket lock to protect conn_list changes,
+	 * remove entries from the conn list and walking the list
+	 * is still safe.
+	 *
+	 * We can't use MPTCP socket lock to protect conn_list changes,
 	 * as we need to update it from the BH via the mptcp_finish_connect()
 	 */
 	lock_sock(sock->sk);
-	rcu_read_lock();
-	list_for_each_entry_rcu(subflow, &msk->conn_list, node) {
+	mptcp_for_each_subflow(msk, subflow) {
 		struct socket *tcp_socket;
 
 		tcp_socket = mptcp_subflow_tcp_socket(subflow);
-		rcu_read_unlock();
 		pr_debug("conn_list->subflow=%p", subflow);
 		ret = kernel_sock_shutdown(tcp_socket, how);
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
 	release_sock(sock->sk);
 
 	return ret;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index a1bf093bb37e..556981f9e5fd 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -54,13 +54,12 @@ struct mptcp_sock {
 	u64		write_seq;
 	u64		ack_seq;
 	u32		token;
-	spinlock_t	conn_list_lock;
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 };
 
 #define mptcp_for_each_subflow(__msk, __subflow)			\
-	list_for_each_entry_rcu(__subflow, &((__msk)->conn_list), node)
+	list_for_each_entry(__subflow, &((__msk)->conn_list), node)
 
 static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 {
-- 
2.22.0



* [RFC PATCH net-next 31/33] mptcp: Add path manager interface
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (29 preceding siblings ...)
  2019-06-17 22:58 ` [RFC PATCH net-next 30/33] mptcp: switch sublist to mptcp socket lock protection Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 32/33] mptcp: Add ADD_ADDR handling Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 33/33] mptcp: Add handling of incoming MP_JOIN requests Mat Martineau
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Add enough of a path manager interface to allow sending of ADD_ADDR
when an incoming MPTCP connection is created. For now it can send only
a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection
sock will need to be expanded to handle multiple interfaces and IPv6.

This is a skeleton interface definition for events generated by
MPTCP.
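
The single-address 'pm_data' record can be pictured as a tagged union.
The following is a user-space sketch that mirrors the layout added to
protocol.h, built on the POSIX socket headers; 'struct pm_data_model',
pm_model_set_v4() and pm_model_addr_len() are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical user-space mirror of the pm_data layout: one address,
 * tagged by family, plus the address id advertised to the peer.
 */
struct pm_data_model {
	uint8_t addr_id;
	sa_family_t family;
	union {
		struct in_addr addr;
		struct in6_addr addr6;
	};
};

/* Record a single IPv4 address, the only case this patch can signal. */
static void pm_model_set_v4(struct pm_data_model *pm, uint8_t id,
			    struct in_addr addr)
{
	memset(pm, 0, sizeof(*pm));
	pm->addr_id = id;
	pm->family = AF_INET;
	pm->addr = addr;
}

/* Length in bytes of the stored address, 0 if none is set. */
static size_t pm_model_addr_len(const struct pm_data_model *pm)
{
	switch (pm->family) {
	case AF_INET:
		return sizeof(pm->addr);
	case AF_INET6:
		return sizeof(pm->addr6);
	default:
		return 0;
	}
}
```

Because the record holds exactly one address, supporting several
interfaces would need a list or array of such entries, which is the
expansion the description above anticipates.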

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 net/mptcp/Makefile   |  2 +-
 net/mptcp/pm.c       | 57 ++++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.c |  4 ++++
 net/mptcp/protocol.h | 25 ++++++++++++++++++-
 4 files changed, 86 insertions(+), 2 deletions(-)
 create mode 100644 net/mptcp/pm.c

diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index 178ae81d8b66..7fe7aa64eda0 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o subflow.o options.o token.o crypto.o
+mptcp-y := protocol.o subflow.o options.o token.o crypto.o pm.o
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
new file mode 100644
index 000000000000..512dc110098a
--- /dev/null
+++ b/net/mptcp/pm.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2019, Intel Corporation.
+ */
+#include <linux/kernel.h>
+#include <net/tcp.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+void pm_new_connection(struct mptcp_sock *msk)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_fully_established(struct mptcp_sock *msk)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_connection_closed(struct mptcp_sock *msk)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_subflow_established(struct mptcp_sock *msk, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_subflow_closed(struct mptcp_sock *msk, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_add_addr(struct mptcp_sock *msk, const struct in_addr *addr, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_add_addr6(struct mptcp_sock *msk, const struct in6_addr *addr, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_rm_addr(struct mptcp_sock *msk, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+bool pm_addr_signal(struct mptcp_sock *msk, unsigned int *size,
+		    unsigned int remaining, struct mptcp_out_options *opts)
+{
+	pr_debug("msk=%p", msk);
+
+	return false;
+}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index d4ffa47f53ef..e071fc8191ee 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -632,6 +632,8 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		token_update_accept(new_sock->sk, new_mptcp_sock);
 		msk->subflow = NULL;
 
+		pm_new_connection(msk);
+
 		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
 		ack_seq++;
@@ -757,6 +759,8 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 		msk->token = subflow->token;
 		pr_debug("msk=%p, token=%u", msk, msk->token);
 
+		pm_new_connection(msk);
+
 		crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
 		ack_seq++;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 556981f9e5fd..044665328b79 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -14,7 +14,7 @@
 #define MPTCPOPT_MP_JOIN	1
 #define MPTCPOPT_DSS		2
 #define MPTCPOPT_ADD_ADDR	3
-#define MPTCPOPT_REMOVE_ADDR	4
+#define MPTCPOPT_RM_ADDR	4
 #define MPTCPOPT_MP_PRIO	5
 #define MPTCPOPT_MP_FAIL	6
 #define MPTCPOPT_MP_FASTCLOSE	7
@@ -45,6 +45,17 @@
 #define MPTCP_DSS_HAS_ACK	BIT(0)
 #define MPTCP_DSS_FLAG_MASK	(0x1F)
 
+struct pm_data {
+	u8 addr_id;
+	sa_family_t family;
+	union {
+		struct in_addr addr;
+#if IS_ENABLED(CONFIG_IPV6)
+		struct in6_addr addr6;
+#endif
+	};
+};
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
@@ -56,6 +67,7 @@ struct mptcp_sock {
 	u32		token;
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
+	struct pm_data	pm;
 };
 
 #define mptcp_for_each_subflow(__msk, __subflow)			\
@@ -157,6 +169,17 @@ void crypto_key_sha1(u64 key, u32 *token, u64 *idsn);
 void crypto_hmac_sha1(u64 key1, u64 key2, u32 *hash_out,
 		      int arg_num, ...);
 
+void pm_new_connection(struct mptcp_sock *msk);
+void pm_fully_established(struct mptcp_sock *msk);
+void pm_connection_closed(struct mptcp_sock *msk);
+void pm_subflow_established(struct mptcp_sock *msk, u8 id);
+void pm_subflow_closed(struct mptcp_sock *msk, u8 id);
+void pm_add_addr(struct mptcp_sock *msk, const struct in_addr *addr, u8 id);
+void pm_add_addr6(struct mptcp_sock *msk, const struct in6_addr *addr, u8 id);
+void pm_rm_addr(struct mptcp_sock *msk, u8 id);
+bool pm_addr_signal(struct mptcp_sock *msk, unsigned int *size,
+		    unsigned int remaining, struct mptcp_out_options *opts);
+
 static inline struct mptcp_ext *mptcp_get_ext(struct sk_buff *skb)
 {
 	return (struct mptcp_ext *)skb_ext_find(skb, SKB_EXT_MPTCP);
-- 
2.22.0



* [RFC PATCH net-next 32/33] mptcp: Add ADD_ADDR handling
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (30 preceding siblings ...)
  2019-06-17 22:58 ` [RFC PATCH net-next 31/33] mptcp: Add path manager interface Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  2019-06-17 22:58 ` [RFC PATCH net-next 33/33] mptcp: Add handling of incoming MP_JOIN requests Mat Martineau
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
and RM_ADDR suboptions.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/linux/tcp.h  | 11 +++++
 include/net/mptcp.h  | 16 ++++++--
 net/mptcp/options.c  | 98 +++++++++++++++++++++++++++++++++++++++-----
 net/mptcp/pm.c       | 11 ++++-
 net/mptcp/protocol.h | 16 ++++++++
 5 files changed, 138 insertions(+), 14 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 81cfa7834111..b1d2ff2af0c2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -122,6 +122,16 @@ struct tcp_options_received {
 			use_ack:1,
 			ack64:1,
 			__unused:2;
+		u8	add_addr : 1,
+			rm_addr : 1,
+			family : 4;
+		u8	addr_id;
+		union {
+			struct	in_addr	addr;
+#if IS_ENABLED(CONFIG_IPV6)
+			struct	in6_addr addr6;
+#endif
+		};
 	} mptcp;
 #endif
 };
@@ -135,6 +145,7 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 #endif
 #if IS_ENABLED(CONFIG_MPTCP)
 	rx_opt->mptcp.mp_capable = rx_opt->mptcp.mp_join = 0;
+	rx_opt->mptcp.add_addr = rx_opt->mptcp.rm_addr = 0;
 	rx_opt->mptcp.dss = 0;
 #endif
 }
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index ecc45733d8cf..92c630a25666 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -25,15 +25,25 @@ struct mptcp_ext {
 };
 
 /* MPTCP option subtypes */
-#define OPTION_MPTCP_MPC_SYN	BIT(0)
-#define OPTION_MPTCP_MPC_SYNACK	BIT(1)
-#define OPTION_MPTCP_MPC_ACK	BIT(2)
+#define OPTION_MPTCP_MPC_SYN		BIT(0)
+#define OPTION_MPTCP_MPC_SYNACK		BIT(1)
+#define OPTION_MPTCP_MPC_ACK		BIT(2)
+#define OPTION_MPTCP_ADD_ADDR		BIT(6)
+#define OPTION_MPTCP_ADD_ADDR6		BIT(7)
+#define OPTION_MPTCP_RM_ADDR		BIT(8)
 
 struct mptcp_out_options {
 #if IS_ENABLED(CONFIG_MPTCP)
 	u16 suboptions;
 	u64 sndr_key;
 	u64 rcvr_key;
+	union {
+		struct in_addr addr;
+#if IS_ENABLED(CONFIG_IPV6)
+		struct in6_addr addr6;
+#endif
+	};
+	u8 addr_id;
 	struct mptcp_ext ext_copy;
 #endif
 };
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 6c5aed6351b3..68d0b4bec1dd 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -160,12 +160,51 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	 * 4 or 16 bytes of address (depending on ip version)
 	 * 0 or 2 bytes of port (depending on length)
 	 */
+	case MPTCPOPT_ADD_ADDR:
+		if (opsize != TCPOLEN_MPTCP_ADD_ADDR &&
+		    opsize != TCPOLEN_MPTCP_ADD_ADDR6)
+			break;
+		mp_opt->family = *ptr++ & MPTCP_ADDR_FAMILY_MASK;
+		if (mp_opt->family != MPTCP_ADDR_IPVERSION_4 &&
+		    mp_opt->family != MPTCP_ADDR_IPVERSION_6)
+			break;
+
+		if (mp_opt->family == MPTCP_ADDR_IPVERSION_4 &&
+		    opsize != TCPOLEN_MPTCP_ADD_ADDR)
+			break;
+#if IS_ENABLED(CONFIG_IPV6)
+		if (mp_opt->family == MPTCP_ADDR_IPVERSION_6 &&
+		    opsize != TCPOLEN_MPTCP_ADD_ADDR6)
+			break;
+#endif
+		mp_opt->addr_id = *ptr++;
+		if (mp_opt->family == MPTCP_ADDR_IPVERSION_4) {
+			mp_opt->add_addr = 1;
+			mp_opt->addr.s_addr = get_unaligned_be32(ptr);
+			pr_debug("ADD_ADDR: addr=%x, id=%d",
+				 mp_opt->addr.s_addr, mp_opt->addr_id);
+#if IS_ENABLED(CONFIG_IPV6)
+		} else {
+			mp_opt->add_addr = 1;
+			memcpy(mp_opt->addr6.s6_addr, (u8 *)ptr, 16);
+			pr_debug("ADD_ADDR: addr6=, id=%d", mp_opt->addr_id);
+#endif
+		}
+		break;
 
-	/* MPTCPOPT_REMOVE_ADDR
+	/* MPTCPOPT_RM_ADDR
 	 * 0: 4MSB=subtype, 0000
 	 * 1: Address ID
 	 * Additional bytes: More address IDs (depending on length)
 	 */
+	case MPTCPOPT_RM_ADDR:
+		if (opsize != TCPOLEN_MPTCP_RM_ADDR)
+			break;
+
+		mp_opt->rm_addr = 1;
+		mp_opt->addr_id = *ptr++;
+		pr_debug("RM_ADDR: id=%d", mp_opt->addr_id);
+		break;
 
 	/* MPTCPOPT_MP_PRIO
 	 * 0: 4MSB=subtype, 000, 1LSB=Backup
@@ -336,27 +375,47 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
 	return true;
 }
 
+static bool mptcp_established_options_addr(struct sock *sk,
+					   unsigned int *size,
+					   unsigned int remaining,
+					   struct mptcp_out_options *opts)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
+
+	if (subflow->fourth_ack)
+		return pm_addr_signal(msk, size, remaining, opts);
+
+	return false;
+}
+
 bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 			       unsigned int *size, unsigned int remaining,
 			       struct mptcp_out_options *opts)
 {
 	unsigned int opt_size = 0;
+	bool ret = false;
 
 	if (!subflow_ctx(sk)->mp_capable)
 		return false;
 
+	opts->suboptions = 0;
 	if (mptcp_established_options_mp(sk, &opt_size, remaining, opts)) {
 		*size += opt_size;
 		remaining -= opt_size;
-		return true;
+		ret = true;
 	} else if (mptcp_established_options_dss(sk, skb, &opt_size, remaining,
-					       opts)) {
+						 opts)) {
 		*size += opt_size;
 		remaining -= opt_size;
-		return true;
+		ret = true;
 	}
-
-	return false;
+	if (mptcp_established_options_addr(sk, &opt_size, remaining, opts)) {
+		*size += opt_size;
+		remaining -= opt_size;
+		ret = true;
+	}
+	return ret;
 }
 
 bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
@@ -427,10 +486,8 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 		else
 			len = TCPOLEN_MPTCP_MPC_ACK;
 
-		*ptr++ = htonl((TCPOPT_MPTCP << 24) | (len << 16) |
-			       (MPTCPOPT_MP_CAPABLE << 12) |
-			       ((MPTCP_VERSION_MASK & 0) << 8) |
-			       MPTCP_CAP_HMAC_SHA1);
+		*ptr++ = mptcp_option(MPTCPOPT_MP_CAPABLE, len, 0,
+				      MPTCP_CAP_HMAC_SHA1);
 		put_unaligned_be64(opts->sndr_key, ptr);
 		ptr += 2;
 		if ((OPTION_MPTCP_MPC_SYNACK |
@@ -440,6 +497,27 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 		}
 	}
 
+	if (OPTION_MPTCP_ADD_ADDR & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_ADD_ADDR, TCPOLEN_MPTCP_ADD_ADDR,
+				      MPTCP_ADDR_IPVERSION_4, opts->addr_id);
+		*ptr++ = htonl(opts->addr.s_addr);
+	}
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (OPTION_MPTCP_ADD_ADDR6 & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_ADD_ADDR,
+				      TCPOLEN_MPTCP_ADD_ADDR6,
+				      MPTCP_ADDR_IPVERSION_6, opts->addr_id);
+		memcpy((u8 *)ptr, opts->addr6.s6_addr, 16);
+		ptr += 4;
+	}
+#endif
+
+	if (OPTION_MPTCP_RM_ADDR & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_RM_ADDR, TCPOLEN_MPTCP_RM_ADDR,
+				      0, opts->addr_id);
+	}
+
 	if (opts->ext_copy.use_ack || opts->ext_copy.use_map) {
 		struct mptcp_ext *mpext = &opts->ext_copy;
 		u8 len = TCPOLEN_MPTCP_DSS_BASE;
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 512dc110098a..9e9c681a4544 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -51,7 +51,16 @@ void pm_rm_addr(struct mptcp_sock *msk, u8 id)
 bool pm_addr_signal(struct mptcp_sock *msk, unsigned int *size,
 		    unsigned int remaining, struct mptcp_out_options *opts)
 {
+	if (!msk || !msk->addr_signal)
+		return false;
+
+	if (msk->pm.family == AF_INET && remaining < TCPOLEN_MPTCP_ADD_ADDR)
+		return false;
+
 	pr_debug("msk=%p", msk);
+	opts->suboptions |= OPTION_MPTCP_ADD_ADDR;
+	opts->addr_id = msk->pm.addr_id;
+	opts->addr.s_addr = msk->pm.addr.s_addr;
 
-	return false;
+	return true;
 }
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 044665328b79..4e4c8fc59972 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -8,6 +8,7 @@
 #define __MPTCP_PROTOCOL_H
 
 #include <linux/spinlock.h>
+#include <net/tcp.h>
 
 /* MPTCP option subtypes */
 #define MPTCPOPT_MP_CAPABLE	0
@@ -29,6 +30,9 @@
 #define TCPOLEN_MPTCP_DSS_MAP32		10
 #define TCPOLEN_MPTCP_DSS_MAP64		14
 #define TCPOLEN_MPTCP_DSS_CHECKSUM	2
+#define TCPOLEN_MPTCP_ADD_ADDR		8
+#define TCPOLEN_MPTCP_ADD_ADDR6		20
+#define TCPOLEN_MPTCP_RM_ADDR		4
 
 /* MPTCP MP_CAPABLE flags */
 #define MPTCP_VERSION_MASK	(0x0F)
@@ -45,6 +49,17 @@
 #define MPTCP_DSS_HAS_ACK	BIT(0)
 #define MPTCP_DSS_FLAG_MASK	(0x1F)
 
+/* MPTCP ADD_ADDR flags */
+#define MPTCP_ADDR_FAMILY_MASK	(0x0F)
+#define MPTCP_ADDR_IPVERSION_4	4
+#define MPTCP_ADDR_IPVERSION_6	6
+
+static inline u32 mptcp_option(u8 subopt, u8 len, u8 nib, u8 field)
+{
+	return htonl((TCPOPT_MPTCP << 24) | (len << 16) | (subopt << 12) |
+		     ((nib & 0xF) << 8) | field);
+}
+
 struct pm_data {
 	u8 addr_id;
 	sa_family_t family;
@@ -68,6 +83,7 @@ struct mptcp_sock {
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 	struct pm_data	pm;
+	u8		addr_signal;
 };
 
 #define mptcp_for_each_subflow(__msk, __subflow)			\
-- 
2.22.0



* [RFC PATCH net-next 33/33] mptcp: Add handling of incoming MP_JOIN requests
  2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
                   ` (31 preceding siblings ...)
  2019-06-17 22:58 ` [RFC PATCH net-next 32/33] mptcp: Add ADD_ADDR handling Mat Martineau
@ 2019-06-17 22:58 ` Mat Martineau
  32 siblings, 0 replies; 34+ messages in thread
From: Mat Martineau @ 2019-06-17 22:58 UTC (permalink / raw)
  To: edumazet, netdev
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Process the MP_JOIN option in a SYN packet using the same flow as
MP_CAPABLE. When the third ACK is received, add the subflow to the
MPTCP socket's subflow list instead of adding it to the TCP socket's
accept queue.

The subflow is appended to the end of the subflow list so that it does
not interfere with the operation of the existing subflows. No data is
expected to be transmitted on it.
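
The join step described above can be sketched in userspace as follows.
This is an illustrative model only, not the kernel code: the fixed array
stands in for the token radix tree, the integer refcnt stands in for
sock_hold(), and the hand-rolled tail pointer stands in for
list_add_tail(). All names in the block (struct conn, struct subflow,
lookup_token, finish_join) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one subflow on a connection's list */
struct subflow {
	int id;
	struct subflow *next;
};

/* Hypothetical model of the MPTCP connection socket */
struct conn {
	uint32_t token;
	int refcnt;           /* models the socket reference count */
	struct subflow *head; /* first (oldest) subflow */
	struct subflow *tail; /* last (newest) subflow */
};

/* Linear search standing in for the radix tree lookup */
static struct conn *lookup_token(struct conn *table, size_t n, uint32_t token)
{
	for (size_t i = 0; i < n; i++)
		if (table[i].token == token)
			return &table[i];
	return NULL;
}

/* Returns 0 when the subflow was attached, -1 when the token is unknown */
static int finish_join(struct conn *table, size_t n, uint32_t token,
		       struct subflow *sf)
{
	struct conn *c = lookup_token(table, n, token);

	assert(sf != NULL);
	if (!c)
		return -1; /* the caller would close the child socket */

	c->refcnt++;       /* take a reference on the connection */

	/* Append at the tail so existing subflows keep their position */
	sf->next = NULL;
	if (c->tail)
		c->tail->next = sf;
	else
		c->head = sf;
	c->tail = sf;
	return 0;
}
```

Appending rather than prepending means code that always picks the first
list entry keeps using the original subflow.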

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/linux/tcp.h      |   6 ++
 include/net/mptcp.h      |  14 +++++
 net/ipv4/tcp_minisocks.c |   6 ++
 net/mptcp/options.c      |  58 ++++++++++++++++--
 net/mptcp/protocol.c     |  21 +++++++
 net/mptcp/protocol.h     |  29 ++++++++-
 net/mptcp/subflow.c      |  58 +++++++++++++++---
 net/mptcp/token.c        | 124 +++++++++++++++++++++++++++++++++++++++
 8 files changed, 301 insertions(+), 15 deletions(-)
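
Reviewer-style note, not part of the patch: the MP_JOIN case added to
mptcp_parse_option() tells the three handshake variants apart purely by
option length (12, 16 or 24 bytes). A minimal userspace sketch of that
dispatch is shown below. It assumes option kind 30 and MP_JOIN subtype 1
as given in RFC 6824. The names struct mpj, get_be32, get_be64 and
parse_mp_join are hypothetical, and unlike the kernel parser this sketch
starts at the kind byte rather than at the subtype byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCPOPT_MPTCP             30 /* assumed, RFC 6824 */
#define MPTCPOPT_MP_JOIN         1  /* assumed, RFC 6824 */
#define TCPOLEN_MPTCP_MPJ_SYN    12 /* lengths as defined in this patch */
#define TCPOLEN_MPTCP_MPJ_SYNACK 16
#define TCPOLEN_MPTCP_MPJ_ACK    24
#define MPTCPOPT_BACKUP          0x01
#define MPTCPOPT_HMAC_LEN        20

/* Hypothetical container for the parsed fields */
struct mpj {
	uint8_t backup;
	uint8_t join_id;
	uint32_t token;
	uint32_t nonce;
	uint64_t thmac;
	uint8_t hmac[MPTCPOPT_HMAC_LEN];
};

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t get_be64(const uint8_t *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* Returns the option length on success, -1 on a malformed option */
static int parse_mp_join(const uint8_t *opt, size_t n, struct mpj *out)
{
	assert(out != NULL);
	if (n < 4 || opt[0] != TCPOPT_MPTCP || (size_t)opt[1] != n ||
	    (opt[2] >> 4) != MPTCPOPT_MP_JOIN)
		return -1;

	memset(out, 0, sizeof(*out));
	switch (n) {
	case TCPOLEN_MPTCP_MPJ_SYN:    /* token + sender nonce */
		out->backup = opt[2] & MPTCPOPT_BACKUP;
		out->join_id = opt[3];
		out->token = get_be32(opt + 4);
		out->nonce = get_be32(opt + 8);
		break;
	case TCPOLEN_MPTCP_MPJ_SYNACK: /* truncated HMAC + sender nonce */
		out->backup = opt[2] & MPTCPOPT_BACKUP;
		out->join_id = opt[3];
		out->thmac = get_be64(opt + 4);
		out->nonce = get_be32(opt + 12);
		break;
	case TCPOLEN_MPTCP_MPJ_ACK:    /* reserved byte, then full HMAC */
		memcpy(out->hmac, opt + 4, MPTCPOPT_HMAC_LEN);
		break;
	default:
		return -1;
	}
	return (int)n;
}
```

Any other length is rejected, which mirrors the patch clearing mp_join
when the option size is not one of the three expected values.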

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b1d2ff2af0c2..68ff73ce8ac2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -112,8 +112,14 @@ struct tcp_options_received {
 		u8      mp_capable : 1,
 			mp_join : 1,
 			dss : 1,
+			backup : 1,
 			version : 4;
 		u8      flags;
+		u8      join_id;
+		u32     token;
+		u32     nonce;
+		u64     thmac;
+		u8      hmac[20];
 		u8	dss_flags;
 		u8	use_map:1,
 			dsn64:1,
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 92c630a25666..68e674f453e4 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -28,6 +28,9 @@ struct mptcp_ext {
 #define OPTION_MPTCP_MPC_SYN		BIT(0)
 #define OPTION_MPTCP_MPC_SYNACK		BIT(1)
 #define OPTION_MPTCP_MPC_ACK		BIT(2)
+#define OPTION_MPTCP_MPJ_SYN		BIT(3)
+#define OPTION_MPTCP_MPJ_SYNACK		BIT(4)
+#define OPTION_MPTCP_MPJ_ACK		BIT(5)
 #define OPTION_MPTCP_ADD_ADDR		BIT(6)
 #define OPTION_MPTCP_ADD_ADDR6		BIT(7)
 #define OPTION_MPTCP_RM_ADDR		BIT(8)
@@ -44,6 +47,10 @@ struct mptcp_out_options {
 #endif
 	};
 	u8 addr_id;
+	u8 join_id;
+	u8 backup;
+	u32 nonce;
+	u64 thmac;
 	struct mptcp_ext ext_copy;
 #endif
 };
@@ -83,6 +90,8 @@ static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
 
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts);
 
+bool mptcp_sk_is_subflow(const struct sock *sk);
+
 #else
 
 static inline void mptcp_init(void)
@@ -140,5 +149,10 @@ static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
 	return false;
 }
 
+static inline bool mptcp_sk_is_subflow(const struct sock *sk)
+{
+	return false;
+}
+
 #endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 8bcaf2586b68..081b410592b3 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -766,6 +766,12 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	if (!child)
 		goto listen_overflow;
 
+	if (own_req && sk_is_mptcp(child) && mptcp_sk_is_subflow(child)) {
+		inet_csk_reqsk_queue_drop(sk, req);
+		reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
+		return child;
+	}
+
 	sock_rps_save_rxhash(child, skb);
 	tcp_synack_rtt_meas(child, req);
 	*req_stolen = !own_req;
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 68d0b4bec1dd..58215f19829a 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -54,24 +54,53 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 		break;
 
 	/* MPTCPOPT_MP_JOIN
-	 *
 	 * Initial SYN
 	 * 0: 4MSB=subtype, 000, 1LSB=Backup
 	 * 1: Address ID
 	 * 2-5: Receiver token
 	 * 6-9: Sender random number
-	 *
 	 * SYN/ACK response
 	 * 0: 4MSB=subtype, 000, 1LSB=Backup
 	 * 1: Address ID
 	 * 2-9: Sender truncated HMAC
 	 * 10-13: Sender random number
-	 *
 	 * Third ACK
 	 * 0: 4MSB=subtype, 0000
 	 * 1: 0 (Reserved)
 	 * 2-21: Sender HMAC
 	 */
+	case MPTCPOPT_MP_JOIN:
+		mp_opt->mp_join = 1;
+		if (opsize == TCPOLEN_MPTCP_MPJ_SYN) {
+			mp_opt->backup = *ptr++ & MPTCPOPT_BACKUP;
+			mp_opt->join_id = *ptr++;
+			mp_opt->token = get_unaligned_be32(ptr);
+			ptr += 4;
+			mp_opt->nonce = get_unaligned_be32(ptr);
+			ptr += 4;
+			pr_debug("MP_JOIN bkup=%u, id=%u, token=%u, nonce=%u",
+				 mp_opt->backup, mp_opt->join_id,
+				 mp_opt->token, mp_opt->nonce);
+		} else if (opsize == TCPOLEN_MPTCP_MPJ_SYNACK) {
+			mp_opt->backup = *ptr++ & MPTCPOPT_BACKUP;
+			mp_opt->join_id = *ptr++;
+			mp_opt->thmac = get_unaligned_be64(ptr);
+			ptr += 8;
+			mp_opt->nonce = get_unaligned_be32(ptr);
+			ptr += 4;
+			pr_debug("MP_JOIN bkup=%u, id=%u, thmac=%llu, nonce=%u",
+				 mp_opt->backup, mp_opt->join_id,
+				 mp_opt->thmac, mp_opt->nonce);
+		} else if (opsize == TCPOLEN_MPTCP_MPJ_ACK) {
+			ptr++;
+			memcpy(mp_opt->hmac, ptr, MPTCPOPT_HMAC_LEN);
+			pr_debug("MP_JOIN hmac");
+		} else {
+			pr_warn("MP_JOIN bad option size");
+			mp_opt->mp_join = 0;
+		}
+		break;
+
 
 	/* MPTCPOPT_DSS
 	 * 0: 4MSB=subtype, 0000
@@ -428,10 +457,21 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 		opts->sndr_key = subflow_req->local_key;
 		opts->rcvr_key = subflow_req->remote_key;
 		*size = TCPOLEN_MPTCP_MPC_SYNACK;
-		pr_debug("subflow_req=%p, local_key=%llu, remote_key=%llu",
+		pr_debug("req=%p, local_key=%llu, remote_key=%llu",
 			 subflow_req, subflow_req->local_key,
 			 subflow_req->remote_key);
 		return true;
+	} else if (subflow_req->mp_join) {
+		opts->suboptions = OPTION_MPTCP_MPJ_SYNACK;
+		opts->backup = subflow_req->backup;
+		opts->join_id = subflow_req->local_id;
+		opts->thmac = subflow_req->thmac;
+		opts->nonce = subflow_req->local_nonce;
+		pr_debug("req=%p, bkup=%u, id=%u, thmac=%llu, nonce=%u",
+			 subflow_req, opts->backup, opts->join_id,
+			 opts->thmac, opts->nonce);
+		*size = TCPOLEN_MPTCP_MPJ_SYNACK;
+		return true;
 	}
 	return false;
 }
@@ -518,6 +558,16 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 				      0, opts->addr_id);
 	}
 
+	if (OPTION_MPTCP_MPJ_SYNACK & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_MP_JOIN,
+				      TCPOLEN_MPTCP_MPJ_SYNACK,
+				      opts->backup, opts->join_id);
+		put_unaligned_be64(opts->thmac, ptr);
+		ptr += 2;
+		put_unaligned_be32(opts->nonce, ptr);
+		ptr += 1;
+	}
+
 	if (opts->ext_copy.use_ack || opts->ext_copy.use_map) {
 		struct mptcp_ext *mpext = &opts->ext_copy;
 		u8 len = TCPOLEN_MPTCP_DSS_BASE;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e071fc8191ee..042811a1e01b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -777,6 +777,27 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 	inet_sk_state_store(sk, TCP_ESTABLISHED);
 }
 
+void mptcp_finish_join(struct sock *conn, struct sock *sk)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+	struct mptcp_sock *msk = mptcp_sk(conn);
+
+	pr_debug("msk=%p, subflow=%p", msk, subflow);
+
+	local_bh_disable();
+	bh_lock_sock_nested(sk);
+	list_add_tail(&subflow->node, &msk->conn_list);
+	bh_unlock_sock(sk);
+	local_bh_enable();
+}
+
+bool mptcp_sk_is_subflow(const struct sock *sk)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+
+	return subflow->mp_join == 1;
+}
+
 static struct proto mptcp_prot = {
 	.name		= "MPTCP",
 	.owner		= THIS_MODULE,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 4e4c8fc59972..61e9f15de9d6 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -24,6 +24,9 @@
 #define TCPOLEN_MPTCP_MPC_SYN		12
 #define TCPOLEN_MPTCP_MPC_SYNACK	20
 #define TCPOLEN_MPTCP_MPC_ACK		20
+#define TCPOLEN_MPTCP_MPJ_SYN		12
+#define TCPOLEN_MPTCP_MPJ_SYNACK	16
+#define TCPOLEN_MPTCP_MPJ_ACK		24
 #define TCPOLEN_MPTCP_DSS_BASE		4
 #define TCPOLEN_MPTCP_DSS_ACK32		4
 #define TCPOLEN_MPTCP_DSS_ACK64		8
@@ -34,6 +37,9 @@
 #define TCPOLEN_MPTCP_ADD_ADDR6		20
 #define TCPOLEN_MPTCP_RM_ADDR		4
 
+#define MPTCPOPT_BACKUP		BIT(0)
+#define MPTCPOPT_HMAC_LEN	20
+
 /* MPTCP MP_CAPABLE flags */
 #define MPTCP_VERSION_MASK	(0x0F)
 #define MPTCP_CAP_CHECKSUM_REQD	BIT(7)
@@ -101,11 +107,16 @@ struct subflow_request_sock {
 		checksum : 1,
 		backup : 1,
 		version : 4;
+	u8	local_id;
+	u8	remote_id;
 	u64	local_key;
 	u64	remote_key;
 	u64	idsn;
 	u32	token;
 	u32	ssn_offset;
+	u64	thmac;
+	u32	local_nonce;
+	u32	remote_nonce;
 };
 
 static inline
@@ -128,15 +139,23 @@ struct subflow_context {
 	u16	map_data_len;
 	u16	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
-		mp_capable : 1,	    /* remote is MPTCP capable */
+		mp_capable : 1,     /* remote is MPTCP capable */
+		mp_join : 1,        /* remote is JOINing */
 		fourth_ack : 1,     /* send initial DSS */
 		version : 4,
 		conn_finished : 1,
 		use_checksum : 1,
-		map_valid : 1;
+		map_valid : 1,
+		backup : 1;
+	u32	remote_nonce;
+	u64	thmac;
+	u32	local_nonce;
+	u8	local_id;
+	u8	remote_id;
 
 	struct  socket *tcp_sock;  /* underlying tcp_sock */
 	struct  sock *conn;        /* parent mptcp_sock */
+
 	void	(*tcp_sk_data_ready)(struct sock *sk);
 };
 
@@ -161,13 +180,19 @@ void mptcp_get_options(const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx);
 
 void mptcp_finish_connect(struct sock *sk, int mp_capable);
+void mptcp_finish_join(struct sock *conn, struct sock *sk);
 
 void token_init(void);
 void token_new_request(struct request_sock *req, const struct sk_buff *skb);
+int token_join_request(struct request_sock *req, const struct sk_buff *skb);
+int token_join_valid(struct request_sock *req,
+		     struct tcp_options_received *rx_opt);
 void token_destroy_request(u32 token);
 void token_new_connect(struct sock *sk);
 void token_new_accept(struct sock *sk);
+int token_new_join(struct sock *sk);
 void token_update_accept(struct sock *sk, struct sock *conn);
+void token_release(u32 token);
 void token_destroy(u32 token);
 
 void crypto_init(void);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index a82f5091eed8..a858cc966724 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -54,6 +54,12 @@ static void subflow_v4_init_req(struct request_sock *req,
 	memset(&rx_opt.mptcp, 0, sizeof(rx_opt.mptcp));
 	mptcp_get_options(skb, &rx_opt);
 
+	subflow_req->mp_capable = 0;
+	subflow_req->mp_join = 0;
+
+	if (rx_opt.mptcp.mp_capable && rx_opt.mptcp.mp_join)
+		return;
+
 	if (rx_opt.mptcp.mp_capable && listener->request_mptcp) {
 		subflow_req->mp_capable = 1;
 		if (rx_opt.mptcp.version >= listener->version)
@@ -68,8 +74,18 @@ static void subflow_v4_init_req(struct request_sock *req,
 		token_new_request(req, skb);
 		pr_debug("syn seq=%u", TCP_SKB_CB(skb)->seq);
 		subflow_req->ssn_offset = TCP_SKB_CB(skb)->seq;
-	} else {
-		subflow_req->mp_capable = 0;
+	} else if (rx_opt.mptcp.mp_join && listener->request_mptcp) {
+		subflow_req->mp_join = 1;
+		subflow_req->backup = rx_opt.mptcp.backup;
+		subflow_req->remote_id = rx_opt.mptcp.join_id;
+		subflow_req->token = rx_opt.mptcp.token;
+		subflow_req->remote_nonce = rx_opt.mptcp.nonce;
+		pr_debug("token=%u, remote_nonce=%u", subflow_req->token,
+			 subflow_req->remote_nonce);
+		if (token_join_request(req, skb)) {
+			subflow_req->mp_join = 0;
+			// @@ need to trigger RST
+		}
 	}
 }
 
@@ -134,6 +150,11 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 		    subflow_req->local_key != opt_rx.mptcp.rcvr_key ||
 		    subflow_req->remote_key != opt_rx.mptcp.sndr_key)
 			return NULL;
+	} else if (subflow_req->mp_join) {
+		opt_rx.mptcp.mp_join = 0;
+		mptcp_get_options(skb, &opt_rx);
+		if (!opt_rx.mptcp.mp_join || token_join_valid(req, &opt_rx))
+			return NULL;
 	}
 
 	child = tcp_v4_syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
@@ -141,18 +162,27 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 	if (child && *own_req) {
 		struct subflow_context *ctx = subflow_ctx(child);
 
-		if (!ctx) {
-			pr_debug("Closing child socket");
-			inet_sk_set_state(child, TCP_CLOSE);
-			sock_set_flag(child, SOCK_DEAD);
-			inet_csk_destroy_sock(child);
-			child = NULL;
-		} else if (ctx->mp_capable) {
+		if (!ctx)
+			goto close_child;
+
+		if (ctx->mp_capable) {
 			token_new_accept(child);
+		} else if (ctx->mp_join) {
+			if (token_new_join(child))
+				goto close_child;
+			else
+				mptcp_finish_join(ctx->conn, child);
 		}
 	}
 
 	return child;
+
+close_child:
+	pr_debug("closing child socket");
+	inet_sk_set_state(child, TCP_CLOSE);
+	sock_set_flag(child, SOCK_DEAD);
+	inet_csk_destroy_sock(child);
+	return NULL;
 }
 
 static struct inet_connection_sock_af_ops subflow_specific;
@@ -222,6 +252,8 @@ static void subflow_ulp_release(struct sock *sk)
 
 	pr_debug("subflow=%p", ctx);
 
+	token_release(ctx->token);
+
 	kfree(ctx);
 }
 
@@ -255,6 +287,14 @@ static void subflow_ulp_clone(const struct request_sock *req,
 		new_ctx->ssn_offset = subflow_req->ssn_offset;
 		new_ctx->idsn = subflow_req->idsn;
 		pr_debug("token=%u", new_ctx->token);
+	} else if (subflow_req->mp_join) {
+		new_ctx->mp_join = 1;
+		new_ctx->fourth_ack = 1;
+		new_ctx->backup = subflow_req->backup;
+		new_ctx->local_id = subflow_req->local_id;
+		new_ctx->token = subflow_req->token;
+		new_ctx->thmac = subflow_req->thmac;
+		pr_debug("token=%u", new_ctx->token);
 	}
 }
 
diff --git a/net/mptcp/token.c b/net/mptcp/token.c
index b055a3e82add..c2f4fcb37566 100644
--- a/net/mptcp/token.c
+++ b/net/mptcp/token.c
@@ -54,6 +54,15 @@ static bool find_token(u32 token)
 	return used;
 }
 
+static struct sock *lookup_token(u32 token)
+{
+	void *conn;
+
+	pr_debug("token=%u", token);
+	conn = radix_tree_lookup(&token_tree, token);
+	return (struct sock *)conn;
+}
+
 static void new_req_token(struct request_sock *req,
 			  const struct sk_buff *skb)
 {
@@ -81,6 +90,56 @@ static void new_req_token(struct request_sock *req,
 		 subflow_req->token, subflow_req->idsn);
 }
 
+static void new_req_join(struct request_sock *req, struct sock *sk,
+			 const struct sk_buff *skb)
+{
+	const struct inet_request_sock *ireq = inet_rsk(req);
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	u8 hmac[MPTCPOPT_HMAC_LEN];
+	u32 nonce;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		nonce = crypto_v4_get_nonce(ip_hdr(skb)->saddr,
+					    ip_hdr(skb)->daddr,
+					    htons(ireq->ir_num),
+					    ireq->ir_rmt_port);
+#if IS_ENABLED(CONFIG_IPV6)
+	} else {
+		nonce = crypto_v6_get_nonce(&ipv6_hdr(skb)->saddr,
+					    &ipv6_hdr(skb)->daddr,
+					    htons(ireq->ir_num),
+					    ireq->ir_rmt_port);
+#endif
+	}
+	subflow_req->local_nonce = nonce;
+
+	crypto_hmac_sha1(msk->local_key,
+			 msk->remote_key,
+			 (u32 *)hmac, 2,
+			 4, (u8 *)&subflow_req->local_nonce,
+			 4, (u8 *)&subflow_req->remote_nonce);
+	subflow_req->thmac = *(u64 *)hmac;
+	pr_debug("local_nonce=%u, thmac=%llu", subflow_req->local_nonce,
+		 subflow_req->thmac);
+}
+
+static int new_join_valid(struct request_sock *req, struct sock *sk,
+			  struct tcp_options_received *rx_opt)
+{
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	u8 hmac[MPTCPOPT_HMAC_LEN];
+
+	crypto_hmac_sha1(msk->remote_key,
+			 msk->local_key,
+			 (u32 *)hmac, 2,
+			 4, (u8 *)&subflow_req->remote_nonce,
+			 4, (u8 *)&subflow_req->local_nonce);
+
+	return memcmp(hmac, (char *)rx_opt->mptcp.hmac, MPTCPOPT_HMAC_LEN);
+}
+
 static void new_token(const struct sock *sk)
 {
 	struct subflow_context *subflow = subflow_ctx(sk);
@@ -177,6 +236,42 @@ void token_new_request(struct request_sock *req,
 	spin_unlock_bh(&token_tree_lock);
 }
 
+/* validate received token and create truncated hmac and nonce for SYN-ACK */
+int token_join_request(struct request_sock *req, const struct sk_buff *skb)
+{
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+	struct sock *conn;
+
+	pr_debug("subflow_req=%p, token=%u", subflow_req, subflow_req->token);
+	spin_lock_bh(&token_tree_lock);
+	conn = lookup_token(subflow_req->token);
+	spin_unlock_bh(&token_tree_lock);
+	if (conn) {
+		// @@ get real local address id for this skb->saddr
+		subflow_req->local_id = 0;
+		new_req_join(req, conn, skb);
+		return 0;
+	}
+	return -1;
+}
+
+/* validate hmac received in third ACK */
+int token_join_valid(struct request_sock *req,
+		     struct tcp_options_received *rx_opt)
+{
+	struct subflow_request_sock *subflow_req = subflow_rsk(req);
+	struct sock *conn;
+
+	pr_debug("subflow_req=%p, token=%u", subflow_req, subflow_req->token);
+	spin_lock_bh(&token_tree_lock);
+	conn = lookup_token(subflow_req->token);
+	spin_unlock_bh(&token_tree_lock);
+	if (conn)
+		return new_join_valid(req, conn, rx_opt);
+
+	return -1;
+}
+
 /* create new local key, idsn, and token for subflow */
 void token_new_connect(struct sock *sk)
 {
@@ -220,6 +315,23 @@ void token_update_accept(struct sock *sk, struct sock *conn)
 	spin_unlock_bh(&token_tree_lock);
 }
 
+int token_new_join(struct sock *sk)
+{
+	struct subflow_context *subflow = subflow_ctx(sk);
+	struct sock *conn;
+
+	spin_lock_bh(&token_tree_lock);
+	conn = lookup_token(subflow->token);
+	if (conn) {
+		sock_hold(conn);
+		spin_unlock_bh(&token_tree_lock);
+		subflow->conn = conn;
+		return 0;
+	}
+	spin_unlock_bh(&token_tree_lock);
+	return -1;
+}
+
 void token_destroy_request(u32 token)
 {
 	pr_debug("token=%u", token);
@@ -229,6 +341,18 @@ void token_destroy_request(u32 token)
 	spin_unlock_bh(&token_tree_lock);
 }
 
+void token_release(u32 token)
+{
+	struct sock *conn;
+
+	pr_debug("token=%u", token);
+	spin_lock_bh(&token_tree_lock);
+	conn = lookup_token(token);
+	if (conn)
+		sock_put(conn);
+	spin_unlock_bh(&token_tree_lock);
+}
+
 void token_destroy(u32 token)
 {
 	struct sock *conn;
-- 
2.22.0



end of thread, other threads:[~2019-06-17 22:59 UTC | newest]

Thread overview: 34+ messages
2019-06-17 22:57 [RFC PATCH net-next 00/33] Multipath TCP Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 01/33] tcp: Add MPTCP option number Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 02/33] tcp: Define IPPROTO_MPTCP Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 03/33] mptcp: Add MPTCP socket stubs Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 04/33] mptcp: Handle MPTCP TCP options Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 05/33] mptcp: Associate MPTCP context with TCP socket Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 06/33] tcp: Expose tcp struct and routine for MPTCP Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 07/33] mptcp: Handle MP_CAPABLE options for outgoing connections Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 08/33] mptcp: add mptcp_poll Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 09/33] tcp, ulp: Add clone operation to tcp_ulp_ops Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 10/33] mptcp: Create SUBFLOW socket for incoming connections Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 11/33] mptcp: Add key generation and token tree Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 12/33] mptcp: Add shutdown() socket operation Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 13/33] mptcp: Add setsockopt()/getsockopt() socket operations Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 14/33] tcp: clean ext on tx recycle Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 15/33] mptcp: Add MPTCP to skb extensions Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 16/33] tcp: Prevent coalesce/collapse when skb has MPTCP extensions Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 17/33] tcp: Export low-level TCP functions Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 18/33] mptcp: Write MPTCP DSS headers to outgoing data packets Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 19/33] mptcp: Implement MPTCP receive path Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 20/33] mptcp: Make connection_list a real list of subflows Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 21/33] mptcp: add and use mptcp_subflow_hold Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 22/33] mptcp: add basic kselftest program Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 23/33] mptcp: selftests: switch to netns+veth based tests Mat Martineau
2019-06-17 22:57 ` [RFC PATCH net-next 24/33] mptcp: selftests: Add capture option Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 25/33] mptcp: use sk_page_frag() in sendmsg Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 26/33] mptcp: sendmsg() do spool all the provided data Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 27/33] mptcp: allow collapsing consecutive sendpages on the same substream Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 28/33] tcp: Check for filled TCP option space before SACK Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 29/33] mptcp: accept: don't leak mptcp socket structure Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 30/33] mptcp: switch sublist to mptcp socket lock protection Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 31/33] mptcp: Add path manager interface Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 32/33] mptcp: Add ADD_ADDR handling Mat Martineau
2019-06-17 22:58 ` [RFC PATCH net-next 33/33] mptcp: Add handling of incoming MP_JOIN requests Mat Martineau
