netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 00/45] Multipath TCP
@ 2019-10-02 23:36 Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 01/45] tcp: Add MPTCP option number Mat Martineau
                   ` (46 more replies)
  0 siblings, 47 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

The MPTCP upstreaming community has prepared a net-next RFCv2 patch set
for review.

Clone/fetch:
https://github.com/multipath-tcp/mptcp_net-next.git (tag: netdev-rfcv2)

Browse:
https://github.com/multipath-tcp/mptcp_net-next/tree/netdev-rfcv2


With CONFIG_MPTCP=y, a socket created with IPPROTO_MPTCP will attempt to
create an MPTCP connection but remains compatible with regular
TCP. IPPROTO_TCP socket behavior is unchanged.

This implementation makes use of ULP between the userspace-facing MPTCP
socket and the set of in-kernel TCP sockets it controls. ULP has been
extended for use with listening sockets. skb_ext is used to carry MPTCP
metadata.

The patch set includes self-tests to exercise MPTCP in various
connection and routing scenarios.


We have more work planned to reach the initial feature set for merging,
notably:

* IPv6

* Comply with MPTCPv1 (RFC6824bis). This patch set supports only
  MPTCPv0 (RFC 6824 / experimental)

* Proper MPTCP-level connection closing with DATA_FIN

* Couple receive windows across sibling subflow TCP sockets as required
  by RFC 6824

* Limit subflow ULP visibility to kernel space

* Simple transmit scheduler that respects subflow 'backup' flags


In order to simplify both code review and the development process, we
propose splitting the patch set in to smaller chunks. The first patch
set for merging would include patches 1 to 31 and basic IPv6 support.

Thank you for your review. You can find us at mptcp@lists.01.org and
https://is.gd/mptcp_upstream


Florian Westphal (7):
  mptcp: add mptcp_poll
  mptcp: add basic kselftest for mptcp
  selftests: mptcp: make tc delays random
  selftests: mptcp: extend mptcp_connect tool for ipv6 family
  selftests: mptcp: add accept/getpeer checks
  selftests: mptcp: add ipv6 connectivity
  selftests: mptcp: random ethtool tweaking

Mat Martineau (13):
  tcp: Add MPTCP option number
  net: Make sock protocol value checks more specific
  sock: Make sk_protocol a 16-bit value
  tcp: Define IPPROTO_MPTCP
  mptcp: Add MPTCP socket stubs
  tcp, ulp: Add clone operation to tcp_ulp_ops
  mptcp: Add MPTCP to skb extensions
  tcp: Prevent coalesce/collapse when skb has MPTCP extensions
  tcp: Export low-level TCP functions
  mptcp: Write MPTCP DSS headers to outgoing data packets
  mptcp: Implement MPTCP receive path
  tcp: Check for filled TCP option space before SACK
  mptcp: Make MPTCP socket block/wakeup ignore sk_receive_queue

Matthieu Baerts (1):
  mptcp: new sysctl to control the activation per NS

Paolo Abeni (11):
  tcp: clean ext on tx recycle
  mptcp: use sk_page_frag() in sendmsg
  mptcp: sendmsg() do spool all the provided data
  mptcp: allow collapsing consecutive sendpages on the same substream
  mptcp: harmonize locking on all socket operations.
  mptcp: update per unacked sequence on pkt reception
  mptcp: queue data for mptcp level retransmission
  mptcp: introduce MPTCP retransmission timer
  mptcp: implement memory accounting for mptcp rtx queue
  mptcp: rework mptcp_sendmsg_frag to accept optional dfrag
  mptcp: implement and use MPTCP-level retransmission

Peter Krystad (13):
  mptcp: Handle MPTCP TCP options
  mptcp: Associate MPTCP context with TCP socket
  tcp: Expose tcp struct and routine for MPTCP
  mptcp: Handle MP_CAPABLE options for outgoing connections
  mptcp: Create SUBFLOW socket for incoming connections
  mptcp: Add key generation and token tree
  mptcp: Add shutdown() socket operation
  mptcp: Add setsockopt()/getsockopt() socket operations
  mptcp: Add path manager interface
  mptcp: Add ADD_ADDR handling
  mptcp: Add handling of incoming MP_JOIN requests
  mptcp: Add handling of outgoing MP_JOIN requests
  mptcp: Implement path manager interface commands

 include/linux/skbuff.h                        |   11 +
 include/linux/tcp.h                           |   51 +
 include/net/mptcp.h                           |  149 ++
 include/net/sock.h                            |    6 +-
 include/net/tcp.h                             |   20 +
 include/trace/events/sock.h                   |    5 +-
 include/uapi/linux/in.h                       |    2 +
 net/Kconfig                                   |    1 +
 net/Makefile                                  |    1 +
 net/ax25/af_ax25.c                            |    2 +-
 net/core/skbuff.c                             |    7 +
 net/decnet/af_decnet.c                        |    2 +-
 net/ipv4/inet_connection_sock.c               |    2 +
 net/ipv4/tcp.c                                |    8 +-
 net/ipv4/tcp_input.c                          |   29 +-
 net/ipv4/tcp_ipv4.c                           |    4 +-
 net/ipv4/tcp_minisocks.c                      |    6 +
 net/ipv4/tcp_output.c                         |   62 +-
 net/ipv4/tcp_ulp.c                            |   12 +
 net/mptcp/Kconfig                             |   11 +
 net/mptcp/Makefile                            |    4 +
 net/mptcp/crypto.c                            |  128 ++
 net/mptcp/ctrl.c                              |  112 ++
 net/mptcp/options.c                           |  753 +++++++++
 net/mptcp/pm.c                                |  181 ++
 net/mptcp/protocol.c                          | 1455 +++++++++++++++++
 net/mptcp/protocol.h                          |  319 ++++
 net/mptcp/subflow.c                           |  537 ++++++
 net/mptcp/token.c                             |  217 +++
 tools/include/uapi/linux/in.h                 |    2 +
 tools/testing/selftests/Makefile              |    1 +
 tools/testing/selftests/net/mptcp/.gitignore  |    2 +
 tools/testing/selftests/net/mptcp/Makefile    |   11 +
 tools/testing/selftests/net/mptcp/config      |    1 +
 .../selftests/net/mptcp/mptcp_connect.c       |  522 ++++++
 .../selftests/net/mptcp/mptcp_connect.sh      |  401 +++++
 36 files changed, 5021 insertions(+), 16 deletions(-)
 create mode 100644 include/net/mptcp.h
 create mode 100644 net/mptcp/Kconfig
 create mode 100644 net/mptcp/Makefile
 create mode 100644 net/mptcp/crypto.c
 create mode 100644 net/mptcp/ctrl.c
 create mode 100644 net/mptcp/options.c
 create mode 100644 net/mptcp/pm.c
 create mode 100644 net/mptcp/protocol.c
 create mode 100644 net/mptcp/protocol.h
 create mode 100644 net/mptcp/subflow.c
 create mode 100644 net/mptcp/token.c
 create mode 100644 tools/testing/selftests/net/mptcp/.gitignore
 create mode 100644 tools/testing/selftests/net/mptcp/Makefile
 create mode 100644 tools/testing/selftests/net/mptcp/config
 create mode 100644 tools/testing/selftests/net/mptcp/mptcp_connect.c
 create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect.sh

-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 01/45] tcp: Add MPTCP option number
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 02/45] net: Make sock protocol value checks more specific Mat Martineau
                   ` (45 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

TCP option 30 is allocated for MPTCP by the IANA.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/tcp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c9a3f9688223..382e245a7909 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -182,6 +182,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_SACK             5       /* SACK Block */
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
+#define TCPOPT_MPTCP		30	/* Multipath TCP (RFC6824) */
 #define TCPOPT_FASTOPEN		34	/* Fast open (RFC7413) */
 #define TCPOPT_EXP		254	/* Experimental */
 /* Magic number to be after the option value for sharing TCP
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 02/45] net: Make sock protocol value checks more specific
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 01/45] tcp: Add MPTCP option number Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 03/45] sock: Make sk_protocol a 16-bit value Mat Martineau
                   ` (44 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

SK_PROTOCOL_MAX is only used in two places, for DECNet and AX.25. The
limits have more to do with the those protocol definitions than they do
with the data type of sk_protocol, so remove SK_PROTOCOL_MAX and use
U8_MAX directly.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/sock.h     | 1 -
 net/ax25/af_ax25.c     | 2 +-
 net/decnet/af_decnet.c | 2 +-
 3 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 2c53f1a1d905..c3fc62bd3e2f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -459,7 +459,6 @@ struct sock {
 				sk_userlocks : 4,
 				sk_protocol  : 8,
 				sk_type      : 16;
-#define SK_PROTOCOL_MAX U8_MAX
 	u16			sk_gso_max_segs;
 	u8			sk_pacing_shift;
 	unsigned long	        sk_lingertime;
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index bb222b882b67..be5772f43816 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -808,7 +808,7 @@ static int ax25_create(struct net *net, struct socket *sock, int protocol,
 	struct sock *sk;
 	ax25_cb *ax25;
 
-	if (protocol < 0 || protocol > SK_PROTOCOL_MAX)
+	if (protocol < 0 || protocol > U8_MAX)
 		return -EINVAL;
 
 	if (!net_eq(net, &init_net))
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 0ea75286abf4..7c0c769a8b26 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -670,7 +670,7 @@ static int dn_create(struct net *net, struct socket *sock, int protocol,
 {
 	struct sock *sk;
 
-	if (protocol < 0 || protocol > SK_PROTOCOL_MAX)
+	if (protocol < 0 || protocol > U8_MAX)
 		return -EINVAL;
 
 	if (!net_eq(net, &init_net))
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 03/45] sock: Make sk_protocol a 16-bit value
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 01/45] tcp: Add MPTCP option number Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 02/45] net: Make sock protocol value checks more specific Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 04/45] tcp: Define IPPROTO_MPTCP Mat Martineau
                   ` (43 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Match the 16-bit width of skbuff->protocol. Fills an 8-bit hole so
sizeof(struct sock) does not change.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/sock.h          | 4 ++--
 include/trace/events/sock.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index c3fc62bd3e2f..ca2071555dde 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -457,10 +457,10 @@ struct sock {
 				sk_no_check_tx : 1,
 				sk_no_check_rx : 1,
 				sk_userlocks : 4,
-				sk_protocol  : 8,
+				sk_pacing_shift : 8,
 				sk_type      : 16;
+	u16			sk_protocol;
 	u16			sk_gso_max_segs;
-	u8			sk_pacing_shift;
 	unsigned long	        sk_lingertime;
 	struct proto		*sk_prot_creator;
 	rwlock_t		sk_callback_lock;
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index a0c4b8a30966..dc749c651318 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -147,7 +147,7 @@ TRACE_EVENT(inet_sock_set_state,
 		__field(__u16, sport)
 		__field(__u16, dport)
 		__field(__u16, family)
-		__field(__u8, protocol)
+		__field(__u16, protocol)
 		__array(__u8, saddr, 4)
 		__array(__u8, daddr, 4)
 		__array(__u8, saddr_v6, 16)
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 04/45] tcp: Define IPPROTO_MPTCP
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (2 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 03/45] sock: Make sk_protocol a 16-bit value Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 05/45] mptcp: Add MPTCP socket stubs Mat Martineau
                   ` (42 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

To open a MPTCP socket with socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP),
IPPROTO_MPTCP needs a value that differs from IPPROTO_TCP. The existing
IPPROTO numbers mostly map directly to IANA-specified protocol numbers.
MPTCP does not have a protocol number allocated because MPTCP packets
use the TCP protocol number. Use private number not used OTA.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/trace/events/sock.h   | 3 ++-
 include/uapi/linux/in.h       | 2 ++
 tools/include/uapi/linux/in.h | 2 ++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index dc749c651318..bfbcba315a12 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -19,7 +19,8 @@
 #define inet_protocol_names		\
 		EM(IPPROTO_TCP)			\
 		EM(IPPROTO_DCCP)		\
-		EMe(IPPROTO_SCTP)
+		EM(IPPROTO_SCTP)		\
+		EMe(IPPROTO_MPTCP)
 
 #define tcp_state_names			\
 		EM(TCP_ESTABLISHED)		\
diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index e7ad9d350a28..44df6dc1ff1d 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -76,6 +76,8 @@ enum {
 #define IPPROTO_MPLS		IPPROTO_MPLS
   IPPROTO_RAW = 255,		/* Raw IP packets			*/
 #define IPPROTO_RAW		IPPROTO_RAW
+  IPPROTO_MPTCP = 262,		/* Multipath TCP connection 		*/
+#define IPPROTO_MPTCP		IPPROTO_MPTCP
   IPPROTO_MAX
 };
 #endif
diff --git a/tools/include/uapi/linux/in.h b/tools/include/uapi/linux/in.h
index e7ad9d350a28..44df6dc1ff1d 100644
--- a/tools/include/uapi/linux/in.h
+++ b/tools/include/uapi/linux/in.h
@@ -76,6 +76,8 @@ enum {
 #define IPPROTO_MPLS		IPPROTO_MPLS
   IPPROTO_RAW = 255,		/* Raw IP packets			*/
 #define IPPROTO_RAW		IPPROTO_RAW
+  IPPROTO_MPTCP = 262,		/* Multipath TCP connection 		*/
+#define IPPROTO_MPTCP		IPPROTO_MPTCP
   IPPROTO_MAX
 };
 #endif
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 05/45] mptcp: Add MPTCP socket stubs
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (3 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 04/45] tcp: Define IPPROTO_MPTCP Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 06/45] mptcp: Handle MPTCP TCP options Mat Martineau
                   ` (41 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Implements the infrastructure for MPTCP sockets.

MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
sockets are only managed by the MPTCP socket that owns them and are not
visible from userspace. This commit allows a userspace program to open
an MPTCP socket with:

  sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

The resulting socket is simply a wrapper around a single regular TCP
socket, without any of the MPTCP protocol implemented over the wire.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
---
 include/net/mptcp.h  |  22 +++++++++
 net/Kconfig          |   1 +
 net/Makefile         |   1 +
 net/ipv4/tcp.c       |   2 +
 net/mptcp/Kconfig    |  10 ++++
 net/mptcp/Makefile   |   4 ++
 net/mptcp/protocol.c | 114 +++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  22 +++++++++
 8 files changed, 176 insertions(+)
 create mode 100644 include/net/mptcp.h
 create mode 100644 net/mptcp/Kconfig
 create mode 100644 net/mptcp/Makefile
 create mode 100644 net/mptcp/protocol.c
 create mode 100644 net/mptcp/protocol.h

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
new file mode 100644
index 000000000000..0fe78fddc638
--- /dev/null
+++ b/include/net/mptcp.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#ifndef __NET_MPTCP_H
+#define __NET_MPTCP_H
+
+#ifdef CONFIG_MPTCP
+
+void mptcp_init(void);
+
+#else
+
+static inline void mptcp_init(void)
+{
+}
+
+#endif /* CONFIG_MPTCP */
+#endif /* __NET_MPTCP_H */
diff --git a/net/Kconfig b/net/Kconfig
index 3101bfcbdd7a..4f7e7f3618d0 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -91,6 +91,7 @@ if INET
 source "net/ipv4/Kconfig"
 source "net/ipv6/Kconfig"
 source "net/netlabel/Kconfig"
+source "net/mptcp/Kconfig"
 
 endif # if INET
 
diff --git a/net/Makefile b/net/Makefile
index 449fc0b221f8..306d2e8c12c0 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -87,3 +87,4 @@ endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
 obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
+obj-$(CONFIG_MPTCP)		+= mptcp/
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 79c325a07ba5..5893d08cb204 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -271,6 +271,7 @@
 #include <net/icmp.h>
 #include <net/inet_common.h>
 #include <net/tcp.h>
+#include <net/mptcp.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/sock.h>
@@ -4006,4 +4007,5 @@ void __init tcp_init(void)
 	tcp_metrics_init();
 	BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
 	tcp_tasklet_init();
+	mptcp_init();
 }
diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
new file mode 100644
index 000000000000..d87dfdc210cc
--- /dev/null
+++ b/net/mptcp/Kconfig
@@ -0,0 +1,10 @@
+
+config MPTCP
+	bool "Multipath TCP"
+	depends on INET
+	help
+	  Multipath TCP (MPTCP) connections send and receive data over multiple
+	  subflows in order to utilize multiple network paths. Each subflow
+	  uses the TCP protocol, and TCP options carry header information for
+	  MPTCP.
+
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
new file mode 100644
index 000000000000..659129d1fcbf
--- /dev/null
+++ b/net/mptcp/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_MPTCP) += mptcp.o
+
+mptcp-y := protocol.o
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
new file mode 100644
index 000000000000..e5c87cb08f97
--- /dev/null
+++ b/net/mptcp/protocol.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#define pr_fmt(fmt) "MPTCP: " fmt
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <net/sock.h>
+#include <net/inet_common.h>
+#include <net/inet_hashtables.h>
+#include <net/protocol.h>
+#include <net/tcp.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *subflow = msk->subflow;
+
+	return sock_sendmsg(subflow, msg);
+}
+
+static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+			 int nonblock, int flags, int *addr_len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *subflow = msk->subflow;
+
+	return sock_recvmsg(subflow, msg, flags);
+}
+
+static int mptcp_init_sock(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct net *net = sock_net(sk);
+	struct socket *sf;
+	int err;
+
+	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP, &sf);
+	if (!err) {
+		pr_debug("subflow=%p", sf->sk);
+		msk->subflow = sf;
+	}
+
+	return err;
+}
+
+static void mptcp_close(struct sock *sk, long timeout)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	inet_sk_state_store(sk, TCP_CLOSE);
+
+	if (msk->subflow) {
+		pr_debug("subflow=%p", msk->subflow->sk);
+		sock_release(msk->subflow);
+	}
+
+	sock_orphan(sk);
+	sock_put(sk);
+}
+
+static int mptcp_connect(struct sock *sk, struct sockaddr *saddr, int len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	int err;
+
+	saddr->sa_family = AF_INET;
+
+	pr_debug("msk=%p, subflow=%p", msk, msk->subflow->sk);
+
+	err = kernel_connect(msk->subflow, saddr, len, 0);
+
+	sk->sk_state = TCP_ESTABLISHED;
+
+	return err;
+}
+
+static struct proto mptcp_prot = {
+	.name		= "MPTCP",
+	.owner		= THIS_MODULE,
+	.init		= mptcp_init_sock,
+	.close		= mptcp_close,
+	.accept		= inet_csk_accept,
+	.connect	= mptcp_connect,
+	.shutdown	= tcp_shutdown,
+	.sendmsg	= mptcp_sendmsg,
+	.recvmsg	= mptcp_recvmsg,
+	.hash		= inet_hash,
+	.unhash		= inet_unhash,
+	.get_port	= inet_csk_get_port,
+	.obj_size	= sizeof(struct mptcp_sock),
+	.no_autobind	= 1,
+};
+
+static struct inet_protosw mptcp_protosw = {
+	.type		= SOCK_STREAM,
+	.protocol	= IPPROTO_MPTCP,
+	.prot		= &mptcp_prot,
+	.ops		= &inet_stream_ops,
+};
+
+void __init mptcp_init(void)
+{
+	if (proto_register(&mptcp_prot, 1) != 0)
+		panic("Failed to register MPTCP proto.\n");
+
+	inet_register_protosw(&mptcp_protosw);
+}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
new file mode 100644
index 000000000000..ee04a01bffd3
--- /dev/null
+++ b/net/mptcp/protocol.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#ifndef __MPTCP_PROTOCOL_H
+#define __MPTCP_PROTOCOL_H
+
+/* MPTCP connection sock */
+struct mptcp_sock {
+	/* inet_connection_sock must be the first member */
+	struct inet_connection_sock sk;
+	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
+};
+
+static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
+{
+	return (struct mptcp_sock *)sk;
+}
+
+#endif /* __MPTCP_PROTOCOL_H */
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 06/45] mptcp: Handle MPTCP TCP options
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (4 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 05/45] mptcp: Add MPTCP socket stubs Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 07/45] mptcp: Associate MPTCP context with TCP socket Mat Martineau
                   ` (40 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Currently only MPTCP v0 is supported so ignore v1 MP_CAPABLE option.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/tcp.h   |  15 ++++
 include/net/mptcp.h   |  17 +++++
 net/ipv4/tcp_input.c  |   5 ++
 net/ipv4/tcp_output.c |  13 ++++
 net/mptcp/Makefile    |   2 +-
 net/mptcp/options.c   | 159 ++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h  |  27 +++++++
 7 files changed, 237 insertions(+), 1 deletion(-)
 create mode 100644 net/mptcp/options.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 99617e528ea2..18594f40b310 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -95,6 +95,17 @@ struct tcp_options_received {
 	u8	num_sacks;	/* Number of SACK blocks		*/
 	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
+#if IS_ENABLED(CONFIG_MPTCP)
+	struct mptcp_options_received {
+		u8      mp_capable : 1,
+			mp_join : 1,
+			dss : 1,
+			version : 4;
+		u8      flags;
+		u64     sndr_key;
+		u64     rcvr_key;
+	} mptcp;
+#endif
 };
 
 static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
@@ -104,6 +115,10 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 #if IS_ENABLED(CONFIG_SMC)
 	rx_opt->smc_ok = 0;
 #endif
+#if IS_ENABLED(CONFIG_MPTCP)
+	rx_opt->mptcp.mp_capable = rx_opt->mptcp.mp_join = 0;
+	rx_opt->mptcp.dss = 0;
+#endif
 }
 
 /* This is the max number of SACKS that we'll generate and process. It's safe
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 0fe78fddc638..fc3f9286c667 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,15 +8,32 @@
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
+struct mptcp_out_options {
+#if IS_ENABLED(CONFIG_MPTCP)
+	u16 suboptions;
+	u64 sndr_key;
+	u64 rcvr_key;
+#endif
+};
+
 #ifdef CONFIG_MPTCP
 
 void mptcp_init(void);
 
+void mptcp_parse_option(const unsigned char *ptr, int opsize,
+			struct tcp_options_received *opt_rx);
+void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts);
+
 #else
 
 static inline void mptcp_init(void)
 {
 }
 
+static inline void mptcp_parse_option(const unsigned char *ptr, int opsize,
+				      struct tcp_options_received *opt_rx)
+{
+}
+
 #endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3578357abe30..8e04adff1912 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -79,6 +79,7 @@
 #include <trace/events/tcp.h>
 #include <linux/jump_label_ratelimit.h>
 #include <net/busy_poll.h>
+#include <net/mptcp.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -3918,6 +3919,10 @@ void tcp_parse_options(const struct net *net,
 				 */
 				break;
 #endif
+			case TCPOPT_MPTCP:
+				mptcp_parse_option(ptr, opsize, opt_rx);
+				break;
+
 			case TCPOPT_FASTOPEN:
 				tcp_parse_fastopen_option(
 					opsize - TCPOLEN_FASTOPEN_BASE,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fec6d67bfd14..531929c68822 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -38,6 +38,7 @@
 #define pr_fmt(fmt) "TCP: " fmt
 
 #include <net/tcp.h>
+#include <net/mptcp.h>
 
 #include <linux/compiler.h>
 #include <linux/gfp.h>
@@ -411,6 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_WSCALE		(1 << 3)
 #define OPTION_FAST_OPEN_COOKIE	(1 << 8)
 #define OPTION_SMC		(1 << 9)
+#define OPTION_MPTCP		(1 << 10)
 
 static void smc_options_write(__be32 *ptr, u16 *options)
 {
@@ -436,8 +438,17 @@ struct tcp_out_options {
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
+	struct mptcp_out_options mptcp;
 };
 
+static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
+{
+#if IS_ENABLED(CONFIG_MPTCP)
+	if (unlikely(OPTION_MPTCP & opts->options))
+		mptcp_write_options(ptr, &opts->mptcp);
+#endif
+}
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -546,6 +557,8 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 	}
 
 	smc_options_write(ptr, &options);
+
+	mptcp_options_write(ptr, opts);
 }
 
 static void smc_set_option(const struct tcp_sock *tp,
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index 659129d1fcbf..27a846263f08 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o
+mptcp-y := protocol.o options.o
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
new file mode 100644
index 000000000000..cee4280647fe
--- /dev/null
+++ b/net/mptcp/options.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#include <linux/kernel.h>
+#include <net/tcp.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+void mptcp_parse_option(const unsigned char *ptr, int opsize,
+			struct tcp_options_received *opt_rx)
+{
+	struct mptcp_options_received *mp_opt = &opt_rx->mptcp;
+	u8 subtype = *ptr >> 4;
+
+	switch (subtype) {
+	/* MPTCPOPT_MP_CAPABLE
+	 * 0: 4MSB=subtype, 4LSB=version
+	 * 1: Handshake flags
+	 * 2-9: Sender key
+	 * 10-17: Receiver key (optional)
+	 */
+	case MPTCPOPT_MP_CAPABLE:
+		if (opsize != TCPOLEN_MPTCP_MPC_SYN &&
+		    opsize != TCPOLEN_MPTCP_MPC_SYNACK)
+			break;
+
+		mp_opt->version = *ptr++ & MPTCP_VERSION_MASK;
+		if (mp_opt->version != 0)
+			break;
+
+		mp_opt->flags = *ptr++;
+		if (!((mp_opt->flags & MPTCP_CAP_FLAG_MASK) == MPTCP_CAP_HMAC_SHA1) ||
+		    (mp_opt->flags & MPTCP_CAP_EXTENSIBILITY))
+			break;
+
+		/* RFC 6824, Section 3.1:
+		 * "For the Checksum Required bit (labeled "A"), if either
+		 * host requires the use of checksums, checksums MUST be used.
+		 * In other words, the only way for checksums not to be used
+		 * is if both hosts in their SYNs set A=0."
+		 *
+		 * Section 3.3.0:
+		 * "If a checksum is not present when its use has been
+		 * negotiated, the receiver MUST close the subflow with a RST as it is
+		 * considered broken."
+		 *
+		 * We don't implement DSS checksum - fall back to TCP.
+		 */
+		if (mp_opt->flags & MPTCP_CAP_CHECKSUM_REQD)
+			break;
+
+		mp_opt->mp_capable = 1;
+		mp_opt->sndr_key = get_unaligned_be64(ptr);
+		ptr += 8;
+
+		if (opsize == TCPOLEN_MPTCP_MPC_SYNACK) {
+			mp_opt->rcvr_key = get_unaligned_be64(ptr);
+			ptr += 8;
+			pr_debug("MP_CAPABLE flags=%x, sndr=%llu, rcvr=%llu",
+				 mp_opt->flags, mp_opt->sndr_key,
+				 mp_opt->rcvr_key);
+		} else {
+			pr_debug("MP_CAPABLE flags=%x, sndr=%llu",
+				 mp_opt->flags, mp_opt->sndr_key);
+		}
+		break;
+
+	/* MPTCPOPT_MP_JOIN
+	 * Initial SYN
+	 * 0: 4MSB=subtype, 000, 1LSB=Backup
+	 * 1: Address ID
+	 * 2-5: Receiver token
+	 * 6-9: Sender random number
+	 * SYN/ACK response
+	 * 0: 4MSB=subtype, 000, 1LSB=Backup
+	 * 1: Address ID
+	 * 2-9: Sender truncated HMAC
+	 * 10-13: Sender random number
+	 * Third ACK
+	 * 0: 4MSB=subtype, 0000
+	 * 1: 0 (Reserved)
+	 * 2-21: Sender HMAC
+	 */
+
+	/* MPTCPOPT_DSS
+	 * 0: 4MSB=subtype, 0000
+	 * 1: 3MSB=0, F=Data FIN, m=DSN length, M=has DSN/SSN/DLL/checksum,
+	 *    a=DACK length, A=has DACK
+	 * 0, 4, or 8 bytes of DACK (depending on A/a)
+	 * 0, 4, or 8 bytes of DSN (depending on M/m)
+	 * 0 or 4 bytes of SSN (depending on M)
+	 * 0 or 2 bytes of DLL (depending on M)
+	 * 0 or 2 bytes of checksum (depending on M)
+	 */
+	case MPTCPOPT_DSS:
+		pr_debug("DSS");
+		mp_opt->dss = 1;
+		break;
+
+	/* MPTCPOPT_ADD_ADDR
+	 * 0: 4MSB=subtype, 4LSB=IP version (4 or 6)
+	 * 1: Address ID
+	 * 4 or 16 bytes of address (depending on ip version)
+	 * 0 or 2 bytes of port (depending on length)
+	 */
+
+	/* MPTCPOPT_RM_ADDR
+	 * 0: 4MSB=subtype, 0000
+	 * 1: Address ID
+	 * Additional bytes: More address IDs (depending on length)
+	 */
+
+	/* MPTCPOPT_MP_PRIO
+	 * 0: 4MSB=subtype, 000, 1LSB=Backup
+	 * 1: Address ID (optional, current addr implied if not present)
+	 */
+
+	/* MPTCPOPT_MP_FAIL
+	 * 0: 4MSB=subtype, 0000
+	 * 1: 0 (Reserved)
+	 * 2-9: DSN
+	 */
+
+	/* MPTCPOPT_MP_FASTCLOSE
+	 * 0: 4MSB=subtype, 0000
+	 * 1: 0 (Reserved)
+	 * 2-9: Receiver key
+	 */
+	default:
+		break;
+	}
+}
+
+void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
+{
+	if ((OPTION_MPTCP_MPC_SYN |
+	     OPTION_MPTCP_MPC_ACK) & opts->suboptions) {
+		u8 len;
+
+		if (OPTION_MPTCP_MPC_SYN & opts->suboptions)
+			len = TCPOLEN_MPTCP_MPC_SYN;
+		else
+			len = TCPOLEN_MPTCP_MPC_ACK;
+
+		*ptr++ = htonl((TCPOPT_MPTCP << 24) | (len << 16) |
+			       (MPTCPOPT_MP_CAPABLE << 12) |
+			       ((MPTCP_VERSION_MASK & 0) << 8) |
+			       MPTCP_CAP_HMAC_SHA1);
+		put_unaligned_be64(opts->sndr_key, ptr);
+		ptr += 2;
+		if (OPTION_MPTCP_MPC_ACK & opts->suboptions) {
+			put_unaligned_be64(opts->rcvr_key, ptr);
+			ptr += 2;
+		}
+	}
+}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ee04a01bffd3..26c48003f689 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -7,6 +7,33 @@
 #ifndef __MPTCP_PROTOCOL_H
 #define __MPTCP_PROTOCOL_H
 
+/* MPTCP option bits */
+#define OPTION_MPTCP_MPC_SYN	BIT(0)
+#define OPTION_MPTCP_MPC_SYNACK	BIT(1)
+#define OPTION_MPTCP_MPC_ACK	BIT(2)
+
+/* MPTCP option subtypes */
+#define MPTCPOPT_MP_CAPABLE	0
+#define MPTCPOPT_MP_JOIN	1
+#define MPTCPOPT_DSS		2
+#define MPTCPOPT_ADD_ADDR	3
+#define MPTCPOPT_RM_ADDR	4
+#define MPTCPOPT_MP_PRIO	5
+#define MPTCPOPT_MP_FAIL	6
+#define MPTCPOPT_MP_FASTCLOSE	7
+
+/* MPTCP suboption lengths */
+#define TCPOLEN_MPTCP_MPC_SYN		12
+#define TCPOLEN_MPTCP_MPC_SYNACK	20
+#define TCPOLEN_MPTCP_MPC_ACK		20
+
+/* MPTCP MP_CAPABLE flags */
+#define MPTCP_VERSION_MASK	(0x0F)
+#define MPTCP_CAP_CHECKSUM_REQD	BIT(7)
+#define MPTCP_CAP_EXTENSIBILITY	BIT(6)
+#define MPTCP_CAP_HMAC_SHA1	BIT(0)
+#define MPTCP_CAP_FLAG_MASK	(0x3F)
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 07/45] mptcp: Associate MPTCP context with TCP socket
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (5 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 06/45] mptcp: Handle MPTCP TCP options Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 08/45] tcp: Expose tcp struct and routine for MPTCP Mat Martineau
                   ` (39 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Use ULP to associate a subflow_context structure with each TCP
subflow socket.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
---
 include/linux/tcp.h  |   3 ++
 net/mptcp/Makefile   |   2 +-
 net/mptcp/protocol.c |  51 +++++++++++++++++++--
 net/mptcp/protocol.h |  26 +++++++++++
 net/mptcp/subflow.c  | 106 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 184 insertions(+), 4 deletions(-)
 create mode 100644 net/mptcp/subflow.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 18594f40b310..b3311659c39a 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -394,6 +394,9 @@ struct tcp_sock {
 	u32	mtu_info; /* We received an ICMP_FRAG_NEEDED / ICMPV6_PKT_TOOBIG
 			   * while socket was owned by user.
 			   */
+#if IS_ENABLED(CONFIG_MPTCP)
+	bool	is_mptcp;
+#endif
 
 #ifdef CONFIG_TCP_MD5SIG
 /* TCP AF-Specific parts; only used by MD5 Signature support so far */
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index 27a846263f08..e1ee5aade8b0 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o options.o
+mptcp-y := protocol.o subflow.o options.o
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e5c87cb08f97..c8e12017eddb 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -57,7 +57,7 @@ static void mptcp_close(struct sock *sk, long timeout)
 	inet_sk_state_store(sk, TCP_CLOSE);
 
 	if (msk->subflow) {
-		pr_debug("subflow=%p", msk->subflow->sk);
+		pr_debug("subflow=%p", mptcp_subflow_ctx(msk->subflow->sk));
 		sock_release(msk->subflow);
 	}
 
@@ -72,7 +72,8 @@ static int mptcp_connect(struct sock *sk, struct sockaddr *saddr, int len)
 
 	saddr->sa_family = AF_INET;
 
-	pr_debug("msk=%p, subflow=%p", msk, msk->subflow->sk);
+	pr_debug("msk=%p, subflow=%p", msk,
+		 mptcp_subflow_ctx(msk->subflow->sk));
 
 	err = kernel_connect(msk->subflow, saddr, len, 0);
 
@@ -98,15 +99,59 @@ static struct proto mptcp_prot = {
 	.no_autobind	= 1,
 };
 
+static int mptcp_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	int err = -ENOTSUPP;
+
+	if (uaddr->sa_family != AF_INET) // @@ allow only IPv4 for now
+		return err;
+
+	if (!msk->subflow) {
+		err = mptcp_subflow_create_socket(sock->sk, &msk->subflow);
+		if (err)
+			return err;
+	}
+	return inet_bind(msk->subflow, uaddr, addr_len);
+}
+
+static int mptcp_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+				int addr_len, int flags)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	int err = -ENOTSUPP;
+
+	if (uaddr->sa_family != AF_INET) // @@ allow only IPv4 for now
+		return err;
+
+	if (!msk->subflow) {
+		err = mptcp_subflow_create_socket(sock->sk, &msk->subflow);
+		if (err)
+			return err;
+	}
+
+	return inet_stream_connect(msk->subflow, uaddr, addr_len, flags);
+}
+
+static struct proto_ops mptcp_stream_ops;
+
 static struct inet_protosw mptcp_protosw = {
 	.type		= SOCK_STREAM,
 	.protocol	= IPPROTO_MPTCP,
 	.prot		= &mptcp_prot,
-	.ops		= &inet_stream_ops,
+	.ops		= &mptcp_stream_ops,
+	.flags		= INET_PROTOSW_ICSK,
 };
 
 void __init mptcp_init(void)
 {
+	mptcp_prot.h.hashinfo = tcp_prot.h.hashinfo;
+	mptcp_stream_ops = inet_stream_ops;
+	mptcp_stream_ops.bind = mptcp_bind;
+	mptcp_stream_ops.connect = mptcp_stream_connect;
+
+	mptcp_subflow_init();
+
 	if (proto_register(&mptcp_prot, 1) != 0)
 		panic("Failed to register MPTCP proto.\n");
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 26c48003f689..fe6b31bbad1b 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -46,4 +46,30 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 	return (struct mptcp_sock *)sk;
 }
 
+/* MPTCP subflow context */
+struct mptcp_subflow_context {
+	u32	request_mptcp : 1,  /* send MP_CAPABLE */
+		request_cksum : 1,
+		request_version : 4;
+	struct  socket *tcp_sock;  /* underlying tcp_sock */
+	struct  sock *conn;        /* parent mptcp_sock */
+};
+
+static inline struct mptcp_subflow_context *
+mptcp_subflow_ctx(const struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	return (struct mptcp_subflow_context *)icsk->icsk_ulp_data;
+}
+
+static inline struct socket *
+mptcp_subflow_tcp_socket(const struct mptcp_subflow_context *subflow)
+{
+	return subflow->tcp_sock;
+}
+
+void mptcp_subflow_init(void);
+int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock);
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
new file mode 100644
index 000000000000..5971fb5bfdf2
--- /dev/null
+++ b/net/mptcp/subflow.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <net/sock.h>
+#include <net/inet_common.h>
+#include <net/inet_hashtables.h>
+#include <net/protocol.h>
+#include <net/tcp.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
+{
+	struct mptcp_subflow_context *subflow;
+	struct net *net = sock_net(sk);
+	struct socket *sf;
+	int err;
+
+	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP, &sf);
+	if (err)
+		return err;
+
+	lock_sock(sf->sk);
+	err = tcp_set_ulp(sf->sk, "mptcp");
+	release_sock(sf->sk);
+
+	if (err)
+		return err;
+
+	subflow = mptcp_subflow_ctx(sf->sk);
+	pr_debug("subflow=%p", subflow);
+
+	*new_sock = sf;
+	subflow->conn = sk;
+	subflow->request_mptcp = 1; // @@ if MPTCP enabled
+	subflow->request_cksum = 1; // @@ if checksum enabled
+	subflow->request_version = 0;
+
+	return 0;
+}
+
+static struct mptcp_subflow_context *subflow_create_ctx(struct sock *sk,
+							struct socket *sock)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct mptcp_subflow_context *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	pr_debug("subflow=%p", ctx);
+
+	icsk->icsk_ulp_data = ctx;
+	/* might be NULL */
+	ctx->tcp_sock = sock;
+
+	return ctx;
+}
+
+static int subflow_ulp_init(struct sock *sk)
+{
+	struct tcp_sock *tsk = tcp_sk(sk);
+	struct mptcp_subflow_context *ctx;
+	int err = 0;
+
+	ctx = subflow_create_ctx(sk, sk->sk_socket);
+	if (!ctx) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	pr_debug("subflow=%p", ctx);
+
+	tsk->is_mptcp = 1;
+out:
+	return err;
+}
+
+static void subflow_ulp_release(struct sock *sk)
+{
+	struct mptcp_subflow_context *ctx = mptcp_subflow_ctx(sk);
+
+	pr_debug("subflow=%p", ctx);
+
+	kfree(ctx);
+}
+
+static struct tcp_ulp_ops subflow_ulp_ops __read_mostly = {
+	.name		= "mptcp",
+	.owner		= THIS_MODULE,
+	.init		= subflow_ulp_init,
+	.release	= subflow_ulp_release,
+};
+
+void mptcp_subflow_init(void)
+{
+	if (tcp_register_ulp(&subflow_ulp_ops) != 0)
+		panic("MPTCP: failed to register subflows to ULP\n");
+}
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 08/45] tcp: Expose tcp struct and routine for MPTCP
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (6 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 07/45] mptcp: Associate MPTCP context with TCP socket Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 09/45] mptcp: Handle MP_CAPABLE options for outgoing connections Mat Martineau
                   ` (38 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

tcp_request_sock_ipv4_ops and tcp_v4_init_sock().

This function is needed for MPTCP subflow initialization.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/net/tcp.h   | 3 +++
 net/ipv4/tcp_ipv4.c | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 382e245a7909..509b5539d3d1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -445,6 +445,7 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
 int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb);
 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int tcp_connect(struct sock *sk);
+int tcp_v4_init_sock(struct sock *sk);
 enum tcp_synack_type {
 	TCP_SYNACK_NORMAL,
 	TCP_SYNACK_FASTOPEN,
@@ -1975,6 +1976,8 @@ struct tcp_request_sock_ops {
 			   enum tcp_synack_type synack_type);
 };
 
+extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
+
 #ifdef CONFIG_SYN_COOKIES
 static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
 					 const struct sock *sk, struct sk_buff *skb,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2ee45e3755e9..e3f86dd2d2ec 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1371,7 +1371,7 @@ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
 	.syn_ack_timeout =	tcp_syn_ack_timeout,
 };
 
-static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
 	.mss_clamp	=	TCP_MSS_DEFAULT,
 #ifdef CONFIG_TCP_MD5SIG
 	.req_md5_lookup	=	tcp_v4_md5_lookup,
@@ -2072,7 +2072,7 @@ static const struct tcp_sock_af_ops tcp_sock_ipv4_specific = {
 /* NOTE: A lot of things set to zero explicitly by call to
  *       sk_alloc() so need not be done here.
  */
-static int tcp_v4_init_sock(struct sock *sk)
+int tcp_v4_init_sock(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 09/45] mptcp: Handle MP_CAPABLE options for outgoing connections
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (7 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 08/45] tcp: Expose tcp struct and routine for MPTCP Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 10/45] mptcp: add mptcp_poll Mat Martineau
                   ` (37 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request
for a subflow socket and to the final ACK of the three-way handshake.

Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
responding SYN-ACK is received and notify the MPTCP connection layer.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/net/mptcp.h   |  33 +++++++++
 net/ipv4/tcp_input.c  |   3 +
 net/ipv4/tcp_output.c |  27 ++++++++
 net/mptcp/options.c   |  45 ++++++++++++
 net/mptcp/protocol.c  | 158 +++++++++++++++++++++++++++++++++++++-----
 net/mptcp/protocol.h  |  18 ++++-
 net/mptcp/subflow.c   |  25 ++++++-
 7 files changed, 289 insertions(+), 20 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index fc3f9286c667..e3e248b16016 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -20,8 +20,19 @@ struct mptcp_out_options {
 
 void mptcp_init(void);
 
+static inline bool sk_is_mptcp(const struct sock *sk)
+{
+	return tcp_sk(sk)->is_mptcp;
+}
+
 void mptcp_parse_option(const unsigned char *ptr, int opsize,
 			struct tcp_options_received *opt_rx);
+bool mptcp_syn_options(struct sock *sk, unsigned int *size,
+		       struct mptcp_out_options *opts);
+void mptcp_rcv_synsent(struct sock *sk);
+bool mptcp_established_options(struct sock *sk, unsigned int *size,
+			       struct mptcp_out_options *opts);
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts);
 
 #else
@@ -30,10 +41,32 @@ static inline void mptcp_init(void)
 {
 }
 
+static inline bool sk_is_mptcp(const struct sock *sk)
+{
+	return false;
+}
+
 static inline void mptcp_parse_option(const unsigned char *ptr, int opsize,
 				      struct tcp_options_received *opt_rx)
 {
 }
 
+static inline bool mptcp_syn_options(struct sock *sk, unsigned int *size,
+				     struct mptcp_out_options *opts)
+{
+	return false;
+}
+
+static inline void mptcp_rcv_synsent(struct sock *sk)
+{
+}
+
+static inline bool mptcp_established_options(struct sock *sk,
+					     unsigned int *size,
+					     struct mptcp_out_options *opts)
+{
+	return false;
+}
+
 #endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8e04adff1912..e28f308006da 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5963,6 +5963,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
 		tcp_initialize_rcv_mss(sk);
 
+		if (sk_is_mptcp(sk))
+			mptcp_rcv_synsent(sk);
+
 		/* Remember, tcp_poll() does not lock socket!
 		 * Change state from SYN-SENT only after copied_seq
 		 * is initialized. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 531929c68822..2354500c8ebb 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -663,6 +663,15 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 
 	smc_set_option(tp, opts, &remaining);
 
+	if (sk_is_mptcp(sk)) {
+		unsigned int size;
+
+		if (mptcp_syn_options(sk, &size, &opts->mptcp)) {
+			opts->options |= OPTION_MPTCP;
+			remaining -= size;
+		}
+	}
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -761,6 +770,24 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 		size += TCPOLEN_TSTAMP_ALIGNED;
 	}
 
+	/* MPTCP options have precedence over SACK for the limited TCP
+	 * option space because a MPTCP connection would be forced to
+	 * fall back to regular TCP if a required multipath option is
+	 * missing. SACK still gets a chance to use whatever space is
+	 * left.
+	 */
+	if (sk_is_mptcp(sk)) {
+		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+		unsigned int opt_size;
+
+		if (mptcp_established_options(sk, &opt_size, &opts->mptcp)) {
+			if (remaining >= opt_size) {
+				opts->options |= OPTION_MPTCP;
+				size += opt_size;
+			}
+		}
+	}
+
 	eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
 	if (unlikely(eff_sacks)) {
 		const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index cee4280647fe..94508c0d887a 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -134,6 +134,51 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	}
 }
 
+bool mptcp_syn_options(struct sock *sk, unsigned int *size,
+		       struct mptcp_out_options *opts)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+
+	if (subflow->request_mptcp) {
+		pr_debug("local_key=%llu", subflow->local_key);
+		opts->suboptions = OPTION_MPTCP_MPC_SYN;
+		opts->sndr_key = subflow->local_key;
+		*size = TCPOLEN_MPTCP_MPC_SYN;
+		return true;
+	}
+	return false;
+}
+
+void mptcp_rcv_synsent(struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	pr_debug("subflow=%p", subflow);
+	if (subflow->request_mptcp && tp->rx_opt.mptcp.mp_capable) {
+		subflow->mp_capable = 1;
+		subflow->remote_key = tp->rx_opt.mptcp.sndr_key;
+	}
+}
+
+bool mptcp_established_options(struct sock *sk, unsigned int *size,
+			       struct mptcp_out_options *opts)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+
+	if (subflow->mp_capable && !subflow->fourth_ack) {
+		opts->suboptions = OPTION_MPTCP_MPC_ACK;
+		opts->sndr_key = subflow->local_key;
+		opts->rcvr_key = subflow->remote_key;
+		*size = TCPOLEN_MPTCP_MPC_ACK;
+		subflow->fourth_ack = 1;
+		pr_debug("subflow=%p, local_key=%llu, remote_key=%llu",
+			 subflow, subflow->local_key, subflow->remote_key);
+		return true;
+	}
+	return false;
+}
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 {
 	if ((OPTION_MPTCP_MPC_SYN |
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c8e12017eddb..95c302c59d2e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -17,21 +17,96 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static struct socket *__mptcp_fallback_get_ref(const struct mptcp_sock *msk)
+{
+	sock_owned_by_me((const struct sock *)msk);
+
+	if (!msk->subflow)
+		return NULL;
+
+	sock_hold(msk->subflow->sk);
+	return msk->subflow;
+}
+
+static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+
+	sock_owned_by_me((const struct sock *)msk);
+
+	mptcp_for_each_subflow(msk, subflow) {
+		struct sock *sk;
+
+		sk = mptcp_subflow_tcp_socket(subflow)->sk;
+		sock_hold(sk);
+		return sk;
+	}
+
+	return NULL;
+}
+
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *subflow = msk->subflow;
+	struct socket *ssock;
+	struct sock *ssk;
+	int ret;
+
+	lock_sock(sk);
+	ssock = __mptcp_fallback_get_ref(msk);
+	if (ssock) {
+		release_sock(sk);
+		pr_debug("fallback passthrough");
+		ret = sock_sendmsg(ssock, msg);
+		sock_put(ssock->sk);
+		return ret;
+	}
 
-	return sock_sendmsg(subflow, msg);
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk) {
+		release_sock(sk);
+		return -ENOTCONN;
+	}
+
+	ret = sock_sendmsg(ssk->sk_socket, msg);
+
+	release_sock(sk);
+	sock_put(ssk);
+	return ret;
 }
 
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *subflow = msk->subflow;
+	struct socket *ssock;
+	struct sock *ssk;
+	int copied = 0;
+
+	lock_sock(sk);
+	ssock = __mptcp_fallback_get_ref(msk);
+	if (ssock) {
+		release_sock(sk);
+		pr_debug("fallback-read subflow=%p",
+			 mptcp_subflow_ctx(ssock->sk));
+		copied = sock_recvmsg(ssock, msg, flags);
+		sock_put(ssock->sk);
+		return copied;
+	}
 
-	return sock_recvmsg(subflow, msg, flags);
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk) {
+		release_sock(sk);
+		return -ENOTCONN;
+	}
+
+	copied = sock_recvmsg(ssk->sk_socket, msg, flags);
+
+	release_sock(sk);
+
+	sock_put(ssk);
+
+	return copied;
 }
 
 static int mptcp_init_sock(struct sock *sk)
@@ -47,39 +122,89 @@ static int mptcp_init_sock(struct sock *sk)
 		msk->subflow = sf;
 	}
 
+	INIT_LIST_HEAD(&msk->conn_list);
+
 	return err;
 }
 
 static void mptcp_close(struct sock *sk, long timeout)
 {
+	struct mptcp_subflow_context *subflow, *tmp;
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *ssk = NULL;
 
 	inet_sk_state_store(sk, TCP_CLOSE);
 
+	lock_sock(sk);
+
 	if (msk->subflow) {
-		pr_debug("subflow=%p", mptcp_subflow_ctx(msk->subflow->sk));
-		sock_release(msk->subflow);
+		ssk = msk->subflow;
+		msk->subflow = NULL;
+	}
+
+	if (ssk) {
+		pr_debug("subflow=%p", ssk->sk);
+		sock_release(ssk);
 	}
 
-	sock_orphan(sk);
-	sock_put(sk);
+	list_for_each_entry_safe(subflow, tmp, &msk->conn_list, node) {
+		pr_debug("conn_list->subflow=%p", subflow);
+		sock_release(mptcp_subflow_tcp_socket(subflow));
+	}
+
+	release_sock(sk);
+	sk_common_release(sk);
 }
 
-static int mptcp_connect(struct sock *sk, struct sockaddr *saddr, int len)
+static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	int err;
-
-	saddr->sa_family = AF_INET;
 
 	pr_debug("msk=%p, subflow=%p", msk,
 		 mptcp_subflow_ctx(msk->subflow->sk));
 
-	err = kernel_connect(msk->subflow, saddr, len, 0);
+	return inet_csk_get_port(msk->subflow->sk, snum);
+}
 
-	sk->sk_state = TCP_ESTABLISHED;
+void mptcp_finish_connect(struct sock *sk, int mp_capable)
+{
+	struct mptcp_subflow_context *subflow;
+	struct mptcp_sock *msk = mptcp_sk(sk);
 
-	return err;
+	subflow = mptcp_subflow_ctx(msk->subflow->sk);
+
+	if (mp_capable) {
+		/* sk (new subflow socket) is already locked, but we need
+		 * to lock the parent (mptcp) socket now to add the tcp socket
+		 * to the subflow list.
+		 *
+		 * From lockdep point of view, this creates an ABBA type
+		 * deadlock: Normally (sendmsg, recvmsg, ..), we lock the mptcp
+		 * socket, then acquire a subflow lock.
+		 * Here we do the reverse: "subflow lock, then mptcp lock".
+		 *
+		 * Its alright to do this here, because this subflow is not yet
+		 * on the mptcp sockets subflow list.
+		 *
+		 * IOW, if another CPU has this mptcp socket locked, it cannot
+		 * acquire this particular subflow, because subflow->sk isn't
+		 * on msk->conn_list.
+		 *
+		 * This function can be called either from backlog processing
+		 * (BH will be enabled) or from softirq, so we need to use BH
+		 * locking scheme.
+		 */
+		local_bh_disable();
+		bh_lock_sock_nested(sk);
+
+		msk->remote_key = subflow->remote_key;
+		msk->local_key = subflow->local_key;
+		list_add(&subflow->node, &msk->conn_list);
+		msk->subflow = NULL;
+		bh_unlock_sock(sk);
+		local_bh_enable();
+	}
+	inet_sk_state_store(sk, TCP_ESTABLISHED);
 }
 
 static struct proto mptcp_prot = {
@@ -88,13 +213,12 @@ static struct proto mptcp_prot = {
 	.init		= mptcp_init_sock,
 	.close		= mptcp_close,
 	.accept		= inet_csk_accept,
-	.connect	= mptcp_connect,
 	.shutdown	= tcp_shutdown,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
 	.hash		= inet_hash,
 	.unhash		= inet_unhash,
-	.get_port	= inet_csk_get_port,
+	.get_port	= mptcp_get_port,
 	.obj_size	= sizeof(struct mptcp_sock),
 	.no_autobind	= 1,
 };
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index fe6b31bbad1b..2e000ba7caa4 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -38,9 +38,15 @@
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
 	struct inet_connection_sock sk;
+	u64		local_key;
+	u64		remote_key;
+	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 };
 
+#define mptcp_for_each_subflow(__msk, __subflow)			\
+	list_for_each_entry(__subflow, &((__msk)->conn_list), node)
+
 static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 {
 	return (struct mptcp_sock *)sk;
@@ -48,9 +54,15 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 
 /* MPTCP subflow context */
 struct mptcp_subflow_context {
+	struct	list_head node;/* conn_list of subflows */
+	u64	local_key;
+	u64	remote_key;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
-		request_version : 4;
+		request_version : 4,
+		mp_capable : 1,	    /* remote is MPTCP capable */
+		fourth_ack : 1,     /* send initial DSS */
+		conn_finished : 1;
 	struct  socket *tcp_sock;  /* underlying tcp_sock */
 	struct  sock *conn;        /* parent mptcp_sock */
 };
@@ -72,4 +84,8 @@ mptcp_subflow_tcp_socket(const struct mptcp_subflow_context *subflow)
 void mptcp_subflow_init(void);
 int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock);
 
+extern const struct inet_connection_sock_af_ops ipv4_specific;
+
+void mptcp_finish_connect(struct sock *sk, int mp_capable);
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 5971fb5bfdf2..4690410d9922 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -15,6 +15,22 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+
+	inet_sk_rx_dst_set(sk, skb);
+
+	if (subflow->conn && !subflow->conn_finished) {
+		pr_debug("subflow=%p, remote_key=%llu", mptcp_subflow_ctx(sk),
+			 subflow->remote_key);
+		mptcp_finish_connect(subflow->conn, subflow->mp_capable);
+		subflow->conn_finished = 1;
+	}
+}
+
+static struct inet_connection_sock_af_ops subflow_specific;
+
 int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
 {
 	struct mptcp_subflow_context *subflow;
@@ -66,8 +82,9 @@ static struct mptcp_subflow_context *subflow_create_ctx(struct sock *sk,
 
 static int subflow_ulp_init(struct sock *sk)
 {
-	struct tcp_sock *tsk = tcp_sk(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct mptcp_subflow_context *ctx;
+	struct tcp_sock *tp = tcp_sk(sk);
 	int err = 0;
 
 	ctx = subflow_create_ctx(sk, sk->sk_socket);
@@ -78,7 +95,8 @@ static int subflow_ulp_init(struct sock *sk)
 
 	pr_debug("subflow=%p", ctx);
 
-	tsk->is_mptcp = 1;
+	tp->is_mptcp = 1;
+	icsk->icsk_af_ops = &subflow_specific;
 out:
 	return err;
 }
@@ -101,6 +119,9 @@ static struct tcp_ulp_ops subflow_ulp_ops __read_mostly = {
 
 void mptcp_subflow_init(void)
 {
+	subflow_specific = ipv4_specific;
+	subflow_specific.sk_rx_dst_set = subflow_finish_connect;
+
 	if (tcp_register_ulp(&subflow_ulp_ops) != 0)
 		panic("MPTCP: failed to register subflows to ULP\n");
 }
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 10/45] mptcp: add mptcp_poll
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (8 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 09/45] mptcp: Handle MP_CAPABLE options for outgoing connections Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 11/45] tcp, ulp: Add clone operation to tcp_ulp_ops Mat Martineau
                   ` (36 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

Can't use tcp_poll directly:

BUG: KASAN: slab-out-of-bounds in tcp_poll+0x17f/0x540
Read of size 4 at addr ffff88806ac5e50c by task mptcp_connect/2085
Call Trace:
 tcp_poll+0x17f/0x540
 sock_poll+0x152/0x180

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/mptcp/protocol.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 95c302c59d2e..07508d060b3d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -257,6 +257,36 @@ static int mptcp_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	return inet_stream_connect(msk->subflow, uaddr, addr_len, flags);
 }
 
+static __poll_t mptcp_poll(struct file *file, struct socket *sock,
+			   struct poll_table_struct *wait)
+{
+	struct mptcp_subflow_context *subflow;
+	const struct mptcp_sock *msk;
+	struct sock *sk = sock->sk;
+	struct socket *ssock;
+	__poll_t ret = 0;
+
+	msk = mptcp_sk(sk);
+	lock_sock(sk);
+	ssock = __mptcp_fallback_get_ref(msk);
+	if (ssock) {
+		release_sock(sk);
+		ret = tcp_poll(file, ssock, wait);
+		sock_put(ssock->sk);
+		return ret;
+	}
+
+	mptcp_for_each_subflow(msk, subflow) {
+		struct socket *tcp_sock;
+
+		tcp_sock = mptcp_subflow_tcp_socket(subflow);
+		ret |= tcp_poll(file, tcp_sock, wait);
+	}
+	release_sock(sk);
+
+	return ret;
+}
+
 static struct proto_ops mptcp_stream_ops;
 
 static struct inet_protosw mptcp_protosw = {
@@ -273,6 +303,7 @@ void __init mptcp_init(void)
 	mptcp_stream_ops = inet_stream_ops;
 	mptcp_stream_ops.bind = mptcp_bind;
 	mptcp_stream_ops.connect = mptcp_stream_connect;
+	mptcp_stream_ops.poll = mptcp_poll;
 
 	mptcp_subflow_init();
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 11/45] tcp, ulp: Add clone operation to tcp_ulp_ops
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (9 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 10/45] mptcp: add mptcp_poll Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 12/45] mptcp: Create SUBFLOW socket for incoming connections Mat Martineau
                   ` (35 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

If ULP is used on a listening socket, icsk_ulp_ops and icsk_ulp_data are
copied when the listener is cloned. Sometimes the clone is immediately
deleted, which will invoke the release op on the clone and likely
corrupt the listening socket's icsk_ulp_data.

The clone operation is invoked immediately after the clone is copied and
gives the ULP type an opportunity to set up the clone socket and its
icsk_ulp_data.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/tcp.h               |  5 +++++
 net/ipv4/inet_connection_sock.c |  2 ++
 net/ipv4/tcp_ulp.c              | 12 ++++++++++++
 3 files changed, 19 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 509b5539d3d1..907fd214ff1e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2129,6 +2129,9 @@ struct tcp_ulp_ops {
 	/* diagnostic */
 	int (*get_info)(const struct sock *sk, struct sk_buff *skb);
 	size_t (*get_info_size)(const struct sock *sk);
+	/* clone ulp */
+	void (*clone)(const struct request_sock *req, struct sock *newsk,
+		      const gfp_t priority);
 
 	char		name[TCP_ULP_NAME_MAX];
 	struct module	*owner;
@@ -2139,6 +2142,8 @@ int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
 void tcp_update_ulp(struct sock *sk, struct proto *p);
+void tcp_clone_ulp(const struct request_sock *req,
+		   struct sock *newsk, const gfp_t priority);
 
 #define MODULE_ALIAS_TCP_ULP(name)				\
 	__MODULE_INFO(alias, alias_userspace, name);		\
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index a9183543ca30..64857672e0b5 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -810,6 +810,8 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 		/* Deinitialize accept_queue to trap illegal accesses. */
 		memset(&newicsk->icsk_accept_queue, 0, sizeof(newicsk->icsk_accept_queue));
 
+		tcp_clone_ulp(req, newsk, priority);
+
 		security_inet_csk_clone(newsk, req);
 	}
 	return newsk;
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 4849edb62d52..27f949846450 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -127,6 +127,18 @@ void tcp_cleanup_ulp(struct sock *sk)
 	icsk->icsk_ulp_ops = NULL;
 }
 
+void tcp_clone_ulp(const struct request_sock *req, struct sock *newsk,
+		   const gfp_t priority)
+{
+	struct inet_connection_sock *icsk = inet_csk(newsk);
+
+	if (!icsk->icsk_ulp_ops)
+		return;
+
+	if (icsk->icsk_ulp_ops->clone)
+		icsk->icsk_ulp_ops->clone(req, newsk, priority);
+}
+
 static int __tcp_set_ulp(struct sock *sk, const struct tcp_ulp_ops *ulp_ops)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 12/45] mptcp: Create SUBFLOW socket for incoming connections
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (10 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 11/45] tcp, ulp: Add clone operation to tcp_ulp_ops Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 13/45] mptcp: Add key generation and token tree Mat Martineau
                   ` (34 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Add subflow_request_sock type that extends tcp_request_sock
and add an is_mptcp flag to tcp_request_sock distinguish them.

Override the listen() and accept() methods of the MPTCP
socket proto_ops so they may act on the subflow socket.

Override the conn_request() and syn_recv_sock() handlers
in the inet_connection_sock to handle incoming MPTCP
SYNs and the ACK to the response SYN.

Add handling in tcp_output.c to add MP_CAPABLE to an outgoing
SYN-ACK response for a subflow_request_sock.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/tcp.h   |   3 +
 include/net/mptcp.h   |  19 ++++++
 net/ipv4/tcp_input.c  |   3 +
 net/ipv4/tcp_output.c |  18 ++++++
 net/mptcp/options.c   |  57 ++++++++++++++++-
 net/mptcp/protocol.c  | 138 ++++++++++++++++++++++++++++++++++++++++-
 net/mptcp/protocol.h  |  20 ++++++
 net/mptcp/subflow.c   | 140 ++++++++++++++++++++++++++++++++++++++++--
 8 files changed, 392 insertions(+), 6 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b3311659c39a..220883a7a4db 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -134,6 +134,9 @@ struct tcp_request_sock {
 	const struct tcp_request_sock_ops *af_specific;
 	u64				snt_synack; /* first SYNACK sent time */
 	bool				tfo_listener;
+#if IS_ENABLED(CONFIG_MPTCP)
+	bool				is_mptcp;
+#endif
 	u32				txhash;
 	u32				rcv_isn;
 	u32				snt_isn;
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index e3e248b16016..1a3ed7734bb4 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -25,11 +25,18 @@ static inline bool sk_is_mptcp(const struct sock *sk)
 	return tcp_sk(sk)->is_mptcp;
 }
 
+static inline bool rsk_is_mptcp(const struct request_sock *req)
+{
+	return tcp_rsk(req)->is_mptcp;
+}
+
 void mptcp_parse_option(const unsigned char *ptr, int opsize,
 			struct tcp_options_received *opt_rx);
 bool mptcp_syn_options(struct sock *sk, unsigned int *size,
 		       struct mptcp_out_options *opts);
 void mptcp_rcv_synsent(struct sock *sk);
+bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
+			  struct mptcp_out_options *opts);
 bool mptcp_established_options(struct sock *sk, unsigned int *size,
 			       struct mptcp_out_options *opts);
 
@@ -46,6 +53,11 @@ static inline bool sk_is_mptcp(const struct sock *sk)
 	return false;
 }
 
+static inline bool rsk_is_mptcp(const struct request_sock *req)
+{
+	return false;
+}
+
 static inline void mptcp_parse_option(const unsigned char *ptr, int opsize,
 				      struct tcp_options_received *opt_rx)
 {
@@ -61,6 +73,13 @@ static inline void mptcp_rcv_synsent(struct sock *sk)
 {
 }
 
+static inline bool mptcp_synack_options(const struct request_sock *req,
+					unsigned int *size,
+					struct mptcp_out_options *opts)
+{
+	return false;
+}
+
 static inline bool mptcp_established_options(struct sock *sk,
 					     unsigned int *size,
 					     struct mptcp_out_options *opts)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e28f308006da..218df735c822 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6583,6 +6583,9 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 
 	tcp_rsk(req)->af_specific = af_ops;
 	tcp_rsk(req)->ts_off = 0;
+#if IS_ENABLED(CONFIG_MPTCP)
+	tcp_rsk(req)->is_mptcp = 0;
+#endif
 
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = af_ops->mss_clamp;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2354500c8ebb..e21e8134559a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -594,6 +594,22 @@ static void smc_set_option_cond(const struct tcp_sock *tp,
 #endif
 }
 
+static void mptcp_set_option_cond(const struct request_sock *req,
+				  struct tcp_out_options *opts,
+				  unsigned int *remaining)
+{
+	if (rsk_is_mptcp(req)) {
+		unsigned int size;
+
+		if (mptcp_synack_options(req, &size, &opts->mptcp)) {
+			if (*remaining >= size) {
+				opts->options |= OPTION_MPTCP;
+				*remaining -= size;
+			}
+		}
+	}
+}
+
 /* Compute TCP options for SYN packets. This is not the final
  * network wire format yet.
  */
@@ -733,6 +749,8 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 		}
 	}
 
+	mptcp_set_option_cond(req, opts, &remaining);
+
 	smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
 
 	return MAX_TCP_OPTION_SPACE - remaining;
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 94508c0d887a..4fdd5178fe78 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -134,6 +134,39 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	}
 }
 
+void mptcp_get_options(const struct sk_buff *skb,
+		       struct tcp_options_received *opt_rx)
+{
+	const unsigned char *ptr;
+	const struct tcphdr *th = tcp_hdr(skb);
+	int length = (th->doff * 4) - sizeof(struct tcphdr);
+
+	ptr = (const unsigned char *)(th + 1);
+
+	while (length > 0) {
+		int opcode = *ptr++;
+		int opsize;
+
+		switch (opcode) {
+		case TCPOPT_EOL:
+			return;
+		case TCPOPT_NOP:	/* Ref: RFC 793 section 3.1 */
+			length--;
+			continue;
+		default:
+			opsize = *ptr++;
+			if (opsize < 2) /* "silly options" */
+				return;
+			if (opsize > length)
+				return;	/* don't parse partial options */
+			if (opcode == TCPOPT_MPTCP)
+				mptcp_parse_option(ptr, opsize, opt_rx);
+			ptr += opsize - 2;
+			length -= opsize;
+		}
+	}
+}
+
 bool mptcp_syn_options(struct sock *sk, unsigned int *size,
 		       struct mptcp_out_options *opts)
 {
@@ -179,14 +212,35 @@ bool mptcp_established_options(struct sock *sk, unsigned int *size,
 	return false;
 }
 
+bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
+			  struct mptcp_out_options *opts)
+{
+	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
+
+	if (subflow_req->mp_capable) {
+		opts->suboptions = OPTION_MPTCP_MPC_SYNACK;
+		opts->sndr_key = subflow_req->local_key;
+		opts->rcvr_key = subflow_req->remote_key;
+		*size = TCPOLEN_MPTCP_MPC_SYNACK;
+		pr_debug("subflow_req=%p, local_key=%llu, remote_key=%llu",
+			 subflow_req, subflow_req->local_key,
+			 subflow_req->remote_key);
+		return true;
+	}
+	return false;
+}
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 {
 	if ((OPTION_MPTCP_MPC_SYN |
+	     OPTION_MPTCP_MPC_SYNACK |
 	     OPTION_MPTCP_MPC_ACK) & opts->suboptions) {
 		u8 len;
 
 		if (OPTION_MPTCP_MPC_SYN & opts->suboptions)
 			len = TCPOLEN_MPTCP_MPC_SYN;
+		else if (OPTION_MPTCP_MPC_SYNACK & opts->suboptions)
+			len = TCPOLEN_MPTCP_MPC_SYNACK;
 		else
 			len = TCPOLEN_MPTCP_MPC_ACK;
 
@@ -196,7 +250,8 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 			       MPTCP_CAP_HMAC_SHA1);
 		put_unaligned_be64(opts->sndr_key, ptr);
 		ptr += 2;
-		if (OPTION_MPTCP_MPC_ACK & opts->suboptions) {
+		if ((OPTION_MPTCP_MPC_SYNACK |
+		     OPTION_MPTCP_MPC_ACK) & opts->suboptions) {
 			put_unaligned_be64(opts->rcvr_key, ptr);
 			ptr += 2;
 		}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 07508d060b3d..5605391fc32a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -156,6 +156,65 @@ static void mptcp_close(struct sock *sk, long timeout)
 	sk_common_release(sk);
 }
 
+static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
+				 bool kern)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
+	struct socket *new_sock;
+	struct socket *listener;
+	struct sock *newsk;
+
+	listener = msk->subflow;
+
+	pr_debug("msk=%p, listener=%p", msk, mptcp_subflow_ctx(listener->sk));
+	*err = kernel_accept(listener, &new_sock, flags);
+	if (*err < 0)
+		return NULL;
+
+	subflow = mptcp_subflow_ctx(new_sock->sk);
+	pr_debug("msk=%p, new subflow=%p, ", msk, subflow);
+
+	if (subflow->mp_capable) {
+		struct sock *new_mptcp_sock;
+
+		lock_sock(sk);
+
+		local_bh_disable();
+		new_mptcp_sock = sk_clone_lock(sk, GFP_ATOMIC);
+		if (!new_mptcp_sock) {
+			*err = -ENOBUFS;
+			local_bh_enable();
+			release_sock(sk);
+			kernel_sock_shutdown(new_sock, SHUT_RDWR);
+			sock_release(new_sock);
+			return NULL;
+		}
+
+		mptcp_init_sock(new_mptcp_sock);
+
+		msk = mptcp_sk(new_mptcp_sock);
+		msk->remote_key = subflow->remote_key;
+		msk->local_key = subflow->local_key;
+		msk->subflow = NULL;
+
+		newsk = new_mptcp_sock;
+		subflow->conn = new_mptcp_sock;
+		list_add(&subflow->node, &msk->conn_list);
+		bh_unlock_sock(new_mptcp_sock);
+		local_bh_enable();
+		inet_sk_state_store(newsk, TCP_ESTABLISHED);
+		release_sock(sk);
+	} else {
+		newsk = new_sock->sk;
+		tcp_sk(newsk)->is_mptcp = 0;
+		new_sock->sk = NULL;
+		sock_release(new_sock);
+	}
+
+	return newsk;
+}
+
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -212,7 +271,7 @@ static struct proto mptcp_prot = {
 	.owner		= THIS_MODULE,
 	.init		= mptcp_init_sock,
 	.close		= mptcp_close,
-	.accept		= inet_csk_accept,
+	.accept		= mptcp_accept,
 	.shutdown	= tcp_shutdown,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
@@ -257,6 +316,80 @@ static int mptcp_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	return inet_stream_connect(msk->subflow, uaddr, addr_len, flags);
 }
 
+static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
+			 int peer)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	struct socket *ssock;
+	struct sock *ssk;
+	int ret;
+
+	if (sock->sk->sk_prot == &tcp_prot) {
+		/* we are being invoked from __sys_accept4, after
+		 * mptcp_accept() has just accepted a non-mp-capable
+		 * flow: sk is a tcp_sk, not an mptcp one.
+		 *
+		 * Hand the socket over to tcp so all further socket ops
+		 * bypass mptcp.
+		 */
+		sock->ops = &inet_stream_ops;
+		return inet_getname(sock, uaddr, peer);
+	}
+
+	lock_sock(sock->sk);
+	ssock = __mptcp_fallback_get_ref(msk);
+	if (ssock) {
+		release_sock(sock->sk);
+		pr_debug("subflow=%p", ssock->sk);
+		ret = inet_getname(ssock, uaddr, peer);
+		sock_put(ssock->sk);
+		return ret;
+	}
+
+	/* @@ the meaning of getname() for the remote peer when the socket
+	 * is connected and there are multiple subflows is not defined.
+	 * For now just use the first subflow on the list.
+	 */
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk) {
+		release_sock(sock->sk);
+		return -ENOTCONN;
+	}
+
+	ret = inet_getname(ssk->sk_socket, uaddr, peer);
+	release_sock(sock->sk);
+	sock_put(ssk);
+	return ret;
+}
+
+static int mptcp_listen(struct socket *sock, int backlog)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	int err;
+
+	pr_debug("msk=%p", msk);
+
+	if (!msk->subflow) {
+		err = mptcp_subflow_create_socket(sock->sk, &msk->subflow);
+		if (err)
+			return err;
+	}
+	return inet_listen(msk->subflow, backlog);
+}
+
+static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
+			       int flags, bool kern)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+
+	pr_debug("msk=%p", msk);
+
+	if (!msk->subflow)
+		return -EINVAL;
+
+	return inet_accept(sock, newsock, flags, kern);
+}
+
 static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 			   struct poll_table_struct *wait)
 {
@@ -304,6 +437,9 @@ void __init mptcp_init(void)
 	mptcp_stream_ops.bind = mptcp_bind;
 	mptcp_stream_ops.connect = mptcp_stream_connect;
 	mptcp_stream_ops.poll = mptcp_poll;
+	mptcp_stream_ops.accept = mptcp_stream_accept;
+	mptcp_stream_ops.getname = mptcp_getname;
+	mptcp_stream_ops.listen = mptcp_listen;
 
 	mptcp_subflow_init();
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 2e000ba7caa4..d228d1d3c8c3 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -52,6 +52,23 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 	return (struct mptcp_sock *)sk;
 }
 
+struct mptcp_subflow_request_sock {
+	struct	tcp_request_sock sk;
+	u8	mp_capable : 1,
+		mp_join : 1,
+		checksum : 1,
+		backup : 1,
+		version : 4;
+	u64	local_key;
+	u64	remote_key;
+};
+
+static inline struct mptcp_subflow_request_sock *
+mptcp_subflow_rsk(const struct request_sock *rsk)
+{
+	return (struct mptcp_subflow_request_sock *)rsk;
+}
+
 /* MPTCP subflow context */
 struct mptcp_subflow_context {
 	struct	list_head node;/* conn_list of subflows */
@@ -86,6 +103,9 @@ int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock);
 
 extern const struct inet_connection_sock_af_ops ipv4_specific;
 
+void mptcp_get_options(const struct sk_buff *skb,
+		       struct tcp_options_received *opt_rx);
+
 void mptcp_finish_connect(struct sock *sk, int mp_capable);
 
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 4690410d9922..b4e86456d4d6 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -15,6 +15,37 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static void subflow_v4_init_req(struct request_sock *req,
+				const struct sock *sk_listener,
+				struct sk_buff *skb)
+{
+	struct mptcp_subflow_context *listener = mptcp_subflow_ctx(sk_listener);
+	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
+	struct tcp_options_received rx_opt;
+
+	tcp_rsk(req)->is_mptcp = 1;
+	pr_debug("subflow_req=%p, listener=%p", subflow_req, listener);
+
+	tcp_request_sock_ipv4_ops.init_req(req, sk_listener, skb);
+
+	memset(&rx_opt.mptcp, 0, sizeof(rx_opt.mptcp));
+	mptcp_get_options(skb, &rx_opt);
+
+	if (rx_opt.mptcp.mp_capable && listener->request_mptcp) {
+		subflow_req->mp_capable = 1;
+		if (rx_opt.mptcp.version >= listener->request_version)
+			subflow_req->version = listener->request_version;
+		else
+			subflow_req->version = rx_opt.mptcp.version;
+		if ((rx_opt.mptcp.flags & MPTCP_CAP_CHECKSUM_REQD) ||
+		    listener->request_cksum)
+			subflow_req->checksum = 1;
+		subflow_req->remote_key = rx_opt.mptcp.sndr_key;
+	} else {
+		subflow_req->mp_capable = 0;
+	}
+}
+
 static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
@@ -29,6 +60,56 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
+static struct request_sock_ops subflow_request_sock_ops;
+static struct tcp_request_sock_ops subflow_request_sock_ipv4_ops;
+
+static int subflow_conn_request(struct sock *sk, struct sk_buff *skb)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+
+	pr_debug("subflow=%p", subflow);
+
+	/* Never answer to SYNs sent to broadcast or multicast */
+	if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+		goto drop;
+
+	return tcp_conn_request(&subflow_request_sock_ops,
+				&subflow_request_sock_ipv4_ops,
+				sk, skb);
+drop:
+	tcp_listendrop(sk);
+	return 0;
+}
+
+static struct sock *subflow_syn_recv_sock(const struct sock *sk,
+					  struct sk_buff *skb,
+					  struct request_sock *req,
+					  struct dst_entry *dst,
+					  struct request_sock *req_unhash,
+					  bool *own_req)
+{
+	struct mptcp_subflow_context *listener = mptcp_subflow_ctx(sk);
+	struct sock *child;
+
+	pr_debug("listener=%p, req=%p, conn=%p", listener, req, listener->conn);
+
+	/* if the sk is MP_CAPABLE, we already received the client key */
+
+	child = tcp_v4_syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
+
+	if (child && *own_req) {
+		if (!mptcp_subflow_ctx(child)) {
+			pr_debug("Closing child socket");
+			inet_sk_set_state(child, TCP_CLOSE);
+			sock_set_flag(child, SOCK_DEAD);
+			inet_csk_destroy_sock(child);
+			child = NULL;
+		}
+	}
+
+	return child;
+}
+
 static struct inet_connection_sock_af_ops subflow_specific;
 
 int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
@@ -62,18 +143,20 @@ int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
 }
 
 static struct mptcp_subflow_context *subflow_create_ctx(struct sock *sk,
-							struct socket *sock)
+							struct socket *sock,
+							gfp_t priority)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct mptcp_subflow_context *ctx;
 
-	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	ctx = kzalloc(sizeof(*ctx), priority);
+	icsk->icsk_ulp_data = ctx;
+
 	if (!ctx)
 		return NULL;
 
 	pr_debug("subflow=%p", ctx);
 
-	icsk->icsk_ulp_data = ctx;
 	/* might be NULL */
 	ctx->tcp_sock = sock;
 
@@ -87,7 +170,7 @@ static int subflow_ulp_init(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	int err = 0;
 
-	ctx = subflow_create_ctx(sk, sk->sk_socket);
+	ctx = subflow_create_ctx(sk, sk->sk_socket, GFP_KERNEL);
 	if (!ctx) {
 		err = -ENOMEM;
 		goto out;
@@ -110,16 +193,65 @@ static void subflow_ulp_release(struct sock *sk)
 	kfree(ctx);
 }
 
+static void subflow_ulp_clone(const struct request_sock *req,
+			      struct sock *newsk,
+			      const gfp_t priority)
+{
+	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
+	struct mptcp_subflow_context *new_ctx;
+
+	/* newsk->sk_socket is NULL at this point */
+	new_ctx = subflow_create_ctx(newsk, NULL, priority);
+	if (!new_ctx)
+		return;
+
+	new_ctx->conn = NULL;
+	new_ctx->conn_finished = 1;
+
+	if (subflow_req->mp_capable) {
+		new_ctx->mp_capable = 1;
+		new_ctx->fourth_ack = 1;
+		new_ctx->remote_key = subflow_req->remote_key;
+		new_ctx->local_key = subflow_req->local_key;
+	}
+}
+
 static struct tcp_ulp_ops subflow_ulp_ops __read_mostly = {
 	.name		= "mptcp",
 	.owner		= THIS_MODULE,
 	.init		= subflow_ulp_init,
 	.release	= subflow_ulp_release,
+	.clone		= subflow_ulp_clone,
 };
 
+static int subflow_ops_init(struct request_sock_ops *subflow_ops)
+{
+	subflow_ops->obj_size = sizeof(struct mptcp_subflow_request_sock);
+	subflow_ops->slab_name = "request_sock_subflow";
+
+	subflow_ops->slab = kmem_cache_create(subflow_ops->slab_name,
+					      subflow_ops->obj_size, 0,
+					      SLAB_ACCOUNT |
+					      SLAB_TYPESAFE_BY_RCU,
+					      NULL);
+	if (!subflow_ops->slab)
+		return -ENOMEM;
+
+	return 0;
+}
+
 void mptcp_subflow_init(void)
 {
+	subflow_request_sock_ops = tcp_request_sock_ops;
+	if (subflow_ops_init(&subflow_request_sock_ops) != 0)
+		panic("MPTCP: failed to init subflow request sock ops\n");
+
+	subflow_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
+	subflow_request_sock_ipv4_ops.init_req = subflow_v4_init_req;
+
 	subflow_specific = ipv4_specific;
+	subflow_specific.conn_request = subflow_conn_request;
+	subflow_specific.syn_recv_sock = subflow_syn_recv_sock;
 	subflow_specific.sk_rx_dst_set = subflow_finish_connect;
 
 	if (tcp_register_ulp(&subflow_ulp_ops) != 0)
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 13/45] mptcp: Add key generation and token tree
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (11 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 12/45] mptcp: Create SUBFLOW socket for incoming connections Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 14/45] mptcp: Add shutdown() socket operation Mat Martineau
                   ` (33 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Generate the local keys, IDSN, and token when creating a new socket.
Introduce the token tree to track all tokens in use using a radix tree
with the MPTCP token itself as the index.

Will be used to obtain the MPTCP parent socket to handle incoming joins.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/Makefile   |   2 +-
 net/mptcp/crypto.c   | 128 ++++++++++++++++++++++++++++
 net/mptcp/protocol.c |  11 +++
 net/mptcp/protocol.h |  32 +++++++
 net/mptcp/subflow.c  |  65 +++++++++++++--
 net/mptcp/token.c    | 195 +++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 424 insertions(+), 9 deletions(-)
 create mode 100644 net/mptcp/crypto.c
 create mode 100644 net/mptcp/token.c

diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index e1ee5aade8b0..178ae81d8b66 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o subflow.o options.o
+mptcp-y := protocol.o subflow.o options.o token.o crypto.o
diff --git a/net/mptcp/crypto.c b/net/mptcp/crypto.c
new file mode 100644
index 000000000000..0d7b10939ba6
--- /dev/null
+++ b/net/mptcp/crypto.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP cryptographic functions
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ *
+ * Note: This code is based on mptcp_ctrl.c, mptcp_ipv4.c, and
+ *       mptcp_ipv6 from multipath-tcp.org, authored by:
+ *
+ *       Sébastien Barré <sebastien.barre@uclouvain.be>
+ *       Christoph Paasch <christoph.paasch@uclouvain.be>
+ *       Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
+ *       Gregory Detal <gregory.detal@uclouvain.be>
+ *       Fabien Duchêne <fabien.duchene@uclouvain.be>
+ *       Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
+ *       Lavkesh Lahngir <lavkesh51@gmail.com>
+ *       Andreas Ripke <ripke@neclab.eu>
+ *       Vlad Dogaru <vlad.dogaru@intel.com>
+ *       Octavian Purdila <octavian.purdila@intel.com>
+ *       John Ronan <jronan@tssg.org>
+ *       Catalin Nicutar <catalin.nicutar@gmail.com>
+ *       Brandon Heller <brandonh@stanford.edu>
+ */
+
+#include <linux/kernel.h>
+#include <linux/cryptohash.h>
+#include <linux/random.h>
+#include <linux/siphash.h>
+#include <asm/unaligned.h>
+
+#include "protocol.h"
+
+void mptcp_crypto_key_sha1(u64 key, u32 *token, u64 *idsn)
+{
+	u32 workspace[SHA_WORKSPACE_WORDS];
+	u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
+	u8 input[64];
+
+	memset(workspace, 0, sizeof(workspace));
+
+	/* Initialize input with appropriate padding */
+	memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
+						   * is explicitly set too
+						   */
+	put_unaligned_be64(key, input);
+	input[8] = 0x80; /* Padding: First bit after message = 1 */
+	input[63] = 0x40; /* Padding: Length of the message = 64 bits */
+
+	sha_init(mptcp_hashed_key);
+	sha_transform(mptcp_hashed_key, input, workspace);
+
+	if (token)
+		*token = mptcp_hashed_key[0];
+	if (idsn)
+		*idsn = ((u64)mptcp_hashed_key[3] << 32) + mptcp_hashed_key[4];
+}
+
+void mptcp_crypto_hmac_sha1(u64 key1, u64 key2, u32 nonce1, u32 nonce2,
+			    u32 *hash_out)
+{
+	u32 workspace[SHA_WORKSPACE_WORDS];
+	u8 input[128]; /* 2 512-bit blocks */
+	int i;
+	int index;
+	u8 key_1[8];
+	u8 key_2[8];
+	u8 nonce_1[4];
+	u8 nonce_2[4];
+
+	memset(workspace, 0, sizeof(workspace));
+
+	put_unaligned_be64(key1, key_1);
+	put_unaligned_be64(key2, key_2);
+	put_unaligned_be32(nonce1, nonce_1);
+	put_unaligned_be32(nonce2, nonce_2);
+
+	/* Generate key xored with ipad */
+	memset(input, 0x36, 64);
+	for (i = 0; i < 8; i++)
+		input[i] ^= key_1[i];
+	for (i = 0; i < 8; i++)
+		input[i + 8] ^= key_2[i];
+
+	index = 64;
+	memcpy(&input[index], nonce_1, 4);
+	index = 68;
+	memcpy(&input[index], nonce_2, 4);
+	index = 72;
+
+	input[index] = 0x80; /* Padding: First bit after message = 1 */
+	memset(&input[index + 1], 0, (126 - index));
+
+	/* Padding: Length of the message = 512 + message length (bits) */
+	input[126] = 0x02;
+	input[127] = ((index - 64) * 8); /* Message length (bits) */
+
+	sha_init(hash_out);
+	sha_transform(hash_out, input, workspace);
+	memset(workspace, 0, sizeof(workspace));
+
+	sha_transform(hash_out, &input[64], workspace);
+	memset(workspace, 0, sizeof(workspace));
+
+	for (i = 0; i < 5; i++)
+		hash_out[i] = (__force u32)cpu_to_be32(hash_out[i]);
+
+	/* Prepare second part of hmac */
+	memset(input, 0x5C, 64);
+	for (i = 0; i < 8; i++)
+		input[i] ^= key_1[i];
+	for (i = 0; i < 8; i++)
+		input[i + 8] ^= key_2[i];
+
+	memcpy(&input[64], hash_out, 20);
+	input[84] = 0x80;
+	memset(&input[85], 0, 41);
+
+	/* Padding: Length of the message = 512 + 160 bits */
+	input[126] = 0x02;
+	input[127] = 0xA0;
+
+	sha_init(hash_out);
+	sha_transform(hash_out, input, workspace);
+	memset(workspace, 0, sizeof(workspace));
+
+	sha_transform(hash_out, &input[64], workspace);
+
+	for (i = 0; i < 5; i++)
+		hash_out[i] = (__force u32)cpu_to_be32(hash_out[i]);
+}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 5605391fc32a..0a9c447db159 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -133,6 +133,7 @@ static void mptcp_close(struct sock *sk, long timeout)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct socket *ssk = NULL;
 
+	mptcp_token_destroy(msk->token);
 	inet_sk_state_store(sk, TCP_CLOSE);
 
 	lock_sock(sk);
@@ -198,6 +199,10 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		msk->local_key = subflow->local_key;
 		msk->subflow = NULL;
 
+		msk->token = subflow->token;
+
+		mptcp_token_update_accept(new_sock->sk, new_mptcp_sock);
+		msk->subflow = NULL;
 		newsk = new_mptcp_sock;
 		subflow->conn = new_mptcp_sock;
 		list_add(&subflow->node, &msk->conn_list);
@@ -215,6 +220,10 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 	return newsk;
 }
 
+static void mptcp_destroy(struct sock *sk)
+{
+}
+
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -258,6 +267,7 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
+		msk->token = subflow->token;
 		list_add(&subflow->node, &msk->conn_list);
 		msk->subflow = NULL;
 		bh_unlock_sock(sk);
@@ -273,6 +283,7 @@ static struct proto mptcp_prot = {
 	.close		= mptcp_close,
 	.accept		= mptcp_accept,
 	.shutdown	= tcp_shutdown,
+	.destroy	= mptcp_destroy,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
 	.hash		= inet_hash,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index d228d1d3c8c3..dea6d4f32f38 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -7,6 +7,10 @@
 #ifndef __MPTCP_PROTOCOL_H
 #define __MPTCP_PROTOCOL_H
 
+#include <linux/random.h>
+#include <linux/tcp.h>
+#include <net/inet_connection_sock.h>
+
 /* MPTCP option bits */
 #define OPTION_MPTCP_MPC_SYN	BIT(0)
 #define OPTION_MPTCP_MPC_SYNACK	BIT(1)
@@ -40,6 +44,7 @@ struct mptcp_sock {
 	struct inet_connection_sock sk;
 	u64		local_key;
 	u64		remote_key;
+	u32		token;
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 };
@@ -61,6 +66,8 @@ struct mptcp_subflow_request_sock {
 		version : 4;
 	u64	local_key;
 	u64	remote_key;
+	u64	idsn;
+	u32	token;
 };
 
 static inline struct mptcp_subflow_request_sock *
@@ -74,6 +81,8 @@ struct mptcp_subflow_context {
 	struct	list_head node;/* conn_list of subflows */
 	u64	local_key;
 	u64	remote_key;
+	u64	idsn;
+	u32	token;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
 		request_version : 4,
@@ -108,4 +117,27 @@ void mptcp_get_options(const struct sk_buff *skb,
 
 void mptcp_finish_connect(struct sock *sk, int mp_capable);
 
+int mptcp_token_new_request(struct request_sock *req);
+void mptcp_token_destroy_request(u32 token);
+int mptcp_token_new_connect(struct sock *sk);
+int mptcp_token_new_accept(u32 token);
+void mptcp_token_update_accept(struct sock *sk, struct sock *conn);
+void mptcp_token_destroy(u32 token);
+
+void mptcp_crypto_key_sha1(u64 key, u32 *token, u64 *idsn);
+static inline void mptcp_crypto_key_gen_sha1(u64 *key, u32 *token, u64 *idsn)
+{
+	/* we might consider a faster version that computes the key as a
+	 * hash of some information available in the MPTCP socket. Use
+	 * random data at the moment, as it's probably the safest option
+	 * in case multiple sockets are opened in different namespaces at
+	 * the same time.
+	 */
+	get_random_bytes(key, sizeof(u64));
+	mptcp_crypto_key_sha1(*key, token, idsn);
+}
+
+void mptcp_crypto_hmac_sha1(u64 key1, u64 key2, u32 nonce1, u32 nonce2,
+			    u32 *hash_out);
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index b4e86456d4d6..43fb1ae51b03 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -4,6 +4,8 @@
  * Copyright (c) 2017 - 2019, Intel Corporation.
  */
 
+#define pr_fmt(fmt) "MPTCP: " fmt
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/netdevice.h>
@@ -15,6 +17,33 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static int subflow_rebuild_header(struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	int err = 0;
+
+	if (subflow->request_mptcp && !subflow->token) {
+		pr_debug("subflow=%p", sk);
+		err = mptcp_token_new_connect(sk);
+	}
+
+	if (err)
+		return err;
+
+	return inet_sk_rebuild_header(sk);
+}
+
+static void subflow_req_destructor(struct request_sock *req)
+{
+	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
+
+	pr_debug("subflow_req=%p", subflow_req);
+
+	if (subflow_req->mp_capable)
+		mptcp_token_destroy_request(subflow_req->token);
+	tcp_request_sock_ops.destructor(req);
+}
+
 static void subflow_v4_init_req(struct request_sock *req,
 				const struct sock *sk_listener,
 				struct sk_buff *skb)
@@ -32,7 +61,12 @@ static void subflow_v4_init_req(struct request_sock *req,
 	mptcp_get_options(skb, &rx_opt);
 
 	if (rx_opt.mptcp.mp_capable && listener->request_mptcp) {
-		subflow_req->mp_capable = 1;
+		int err;
+
+		err = mptcp_token_new_request(req);
+		if (err == 0)
+			subflow_req->mp_capable = 1;
+
 		if (rx_opt.mptcp.version >= listener->request_version)
 			subflow_req->version = listener->request_version;
 		else
@@ -98,16 +132,25 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 	child = tcp_v4_syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
 
 	if (child && *own_req) {
-		if (!mptcp_subflow_ctx(child)) {
-			pr_debug("Closing child socket");
-			inet_sk_set_state(child, TCP_CLOSE);
-			sock_set_flag(child, SOCK_DEAD);
-			inet_csk_destroy_sock(child);
-			child = NULL;
+		struct mptcp_subflow_context *ctx = mptcp_subflow_ctx(child);
+
+		if (!ctx)
+			goto close_child;
+
+		if (ctx->mp_capable) {
+			if (mptcp_token_new_accept(ctx->token))
+				goto close_child;
 		}
 	}
 
 	return child;
+
+close_child:
+	pr_debug("closing child socket");
+	inet_sk_set_state(child, TCP_CLOSE);
+	sock_set_flag(child, SOCK_DEAD);
+	inet_csk_destroy_sock(child);
+	return NULL;
 }
 
 static struct inet_connection_sock_af_ops subflow_specific;
@@ -134,6 +177,7 @@ int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
 	pr_debug("subflow=%p", subflow);
 
 	*new_sock = sf;
+	sock_hold(sk);
 	subflow->conn = sk;
 	subflow->request_mptcp = 1; // @@ if MPTCP enabled
 	subflow->request_cksum = 1; // @@ if checksum enabled
@@ -188,7 +232,8 @@ static void subflow_ulp_release(struct sock *sk)
 {
 	struct mptcp_subflow_context *ctx = mptcp_subflow_ctx(sk);
 
-	pr_debug("subflow=%p", ctx);
+	if (ctx->conn)
+		sock_put(ctx->conn);
 
 	kfree(ctx);
 }
@@ -213,6 +258,7 @@ static void subflow_ulp_clone(const struct request_sock *req,
 		new_ctx->fourth_ack = 1;
 		new_ctx->remote_key = subflow_req->remote_key;
 		new_ctx->local_key = subflow_req->local_key;
+		new_ctx->token = subflow_req->token;
 	}
 }
 
@@ -237,6 +283,8 @@ static int subflow_ops_init(struct request_sock_ops *subflow_ops)
 	if (!subflow_ops->slab)
 		return -ENOMEM;
 
+	subflow_ops->destructor = subflow_req_destructor;
+
 	return 0;
 }
 
@@ -253,6 +301,7 @@ void mptcp_subflow_init(void)
 	subflow_specific.conn_request = subflow_conn_request;
 	subflow_specific.syn_recv_sock = subflow_syn_recv_sock;
 	subflow_specific.sk_rx_dst_set = subflow_finish_connect;
+	subflow_specific.rebuild_header = subflow_rebuild_header;
 
 	if (tcp_register_ulp(&subflow_ulp_ops) != 0)
 		panic("MPTCP: failed to register subflows to ULP\n");
diff --git a/net/mptcp/token.c b/net/mptcp/token.c
new file mode 100644
index 000000000000..7e579deef6a2
--- /dev/null
+++ b/net/mptcp/token.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP token management
+ * Copyright (c) 2017 - 2019, Intel Corporation.
+ *
+ * Note: This code is based on mptcp_ctrl.c from multipath-tcp.org,
+ *       authored by:
+ *
+ *       Sébastien Barré <sebastien.barre@uclouvain.be>
+ *       Christoph Paasch <christoph.paasch@uclouvain.be>
+ *       Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
+ *       Gregory Detal <gregory.detal@uclouvain.be>
+ *       Fabien Duchêne <fabien.duchene@uclouvain.be>
+ *       Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
+ *       Lavkesh Lahngir <lavkesh51@gmail.com>
+ *       Andreas Ripke <ripke@neclab.eu>
+ *       Vlad Dogaru <vlad.dogaru@intel.com>
+ *       Octavian Purdila <octavian.purdila@intel.com>
+ *       John Ronan <jronan@tssg.org>
+ *       Catalin Nicutar <catalin.nicutar@gmail.com>
+ *       Brandon Heller <brandonh@stanford.edu>
+ */
+
+#define pr_fmt(fmt) "MPTCP: " fmt
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/radix-tree.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <net/sock.h>
+#include <net/inet_common.h>
+#include <net/protocol.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+static RADIX_TREE(token_tree, GFP_ATOMIC);
+static RADIX_TREE(token_req_tree, GFP_ATOMIC);
+static DEFINE_SPINLOCK(token_tree_lock);
+static int token_used __read_mostly;
+
+/**
+ * mptcp_token_new_request - create new key/idsn/token for subflow_request
+ * @req - the request socket
+ *
+ * This function is called when a new mptcp connection is coming in.
+ *
+ * It creates a unique token to identify the new mptcp connection,
+ * a secret local key and the initial data sequence number (idsn).
+ *
+ * Returns 0 on success.
+ */
+int mptcp_token_new_request(struct request_sock *req)
+{
+	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
+	int err;
+
+	while (1) {
+		u32 token;
+
+		mptcp_crypto_key_gen_sha1(&subflow_req->local_key,
+					  &subflow_req->token,
+					  &subflow_req->idsn);
+		pr_debug("req=%p local_key=%llu, token=%u, idsn=%llu\n",
+			 req, subflow_req->local_key, subflow_req->token,
+			 subflow_req->idsn);
+
+		token = subflow_req->token;
+		spin_lock_bh(&token_tree_lock);
+		if (!radix_tree_lookup(&token_req_tree, token) &&
+		    !radix_tree_lookup(&token_tree, token))
+			break;
+		spin_unlock_bh(&token_tree_lock);
+	}
+
+	err = radix_tree_insert(&token_req_tree,
+				subflow_req->token, &token_used);
+	spin_unlock_bh(&token_tree_lock);
+	return err;
+}
+
+/**
+ * mptcp_token_new_connect - create new key/idsn/token for subflow
+ * @sk - the socket that will initiate a connection
+ *
+ * This function is called when a new outgoing mptcp connection is
+ * initiated.
+ *
+ * It creates a unique token to identify the new mptcp connection,
+ * a secret local key and the initial data sequence number (idsn).
+ *
+ * On success, the mptcp connection can be found again using
+ * the computed token at a later time, this is needed to process
+ * join requests.
+ *
+ * returns 0 on success.
+ */
+int mptcp_token_new_connect(struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	struct sock *mptcp_sock = subflow->conn;
+	int err;
+
+	while (1) {
+		u32 token;
+
+		mptcp_crypto_key_gen_sha1(&subflow->local_key, &subflow->token,
+					  &subflow->idsn);
+
+		pr_debug("ssk=%p, local_key=%llu, token=%u, idsn=%llu\n",
+			 sk, subflow->local_key, subflow->token, subflow->idsn);
+
+		token = subflow->token;
+		spin_lock_bh(&token_tree_lock);
+		if (!radix_tree_lookup(&token_req_tree, token) &&
+		    !radix_tree_lookup(&token_tree, token))
+			break;
+		spin_unlock_bh(&token_tree_lock);
+	}
+	err = radix_tree_insert(&token_tree, subflow->token, mptcp_sock);
+	spin_unlock_bh(&token_tree_lock);
+
+	return err;
+}
+
+/**
+ * mptcp_token_new_accept - insert token for later processing
+ * @token: the token to insert to the tree
+ *
+ * Called when a SYN packet creates a new logical connection, i.e.
+ * is not a join request.
+ *
+ * We don't have an mptcp socket yet at that point.
+ * This is paired with mptcp_token_update_accept, called on accept().
+ */
+int mptcp_token_new_accept(u32 token)
+{
+	int err;
+
+	spin_lock_bh(&token_tree_lock);
+	err = radix_tree_insert(&token_tree, token, &token_used);
+	spin_unlock_bh(&token_tree_lock);
+
+	return err;
+}
+
+/**
+ * mptcp_token_update_accept - update token to map to mptcp socket
+ * @conn: the new struct mptcp_sock
+ * @sk: the initial subflow for this mptcp socket
+ *
+ * Called when the first mptcp socket is created on accept to
+ * refresh the dummy mapping (done to reserve the token) with
+ * the mptcp_socket structure that wasn't allocated before.
+ */
+void mptcp_token_update_accept(struct sock *sk, struct sock *conn)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	void __rcu **slot;
+
+	spin_lock_bh(&token_tree_lock);
+	slot = radix_tree_lookup_slot(&token_tree, subflow->token);
+	WARN_ON_ONCE(!slot);
+	if (slot) {
+		WARN_ON_ONCE(*slot != &token_used);
+		*slot = conn;
+	}
+	spin_unlock_bh(&token_tree_lock);
+}
+
+/**
+ * mptcp_token_destroy_request - remove mptcp connection/token
+ * @token - token of mptcp connection to remove
+ *
+ * Remove not-yet-fully-established incoming connection identified
+ * by @token.
+ */
+void mptcp_token_destroy_request(u32 token)
+{
+	spin_lock_bh(&token_tree_lock);
+	radix_tree_delete(&token_req_tree, token);
+	spin_unlock_bh(&token_tree_lock);
+}
+
+/**
+ * mptcp_token_destroy - remove mptcp connection/token
+ * @token - token of mptcp connection to remove
+ *
+ * Remove the connection identified by @token.
+ */
+void mptcp_token_destroy(u32 token)
+{
+	spin_lock_bh(&token_tree_lock);
+	radix_tree_delete(&token_tree, token);
+	spin_unlock_bh(&token_tree_lock);
+}
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 14/45] mptcp: Add shutdown() socket operation
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (12 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 13/45] mptcp: Add key generation and token tree Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 15/45] mptcp: Add setsockopt()/getsockopt() socket operations Mat Martineau
                   ` (32 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Call shutdown on all subflows in use on the given socket, or on the
fallback socket.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 net/mptcp/protocol.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0a9c447db159..dba4d7a4a61a 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -431,6 +431,37 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 	return ret;
 }
 
+static int mptcp_shutdown(struct socket *sock, int how)
+{
+	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	struct mptcp_subflow_context *subflow;
+	struct socket *ssock;
+	int ret = 0;
+
+	pr_debug("sk=%p, how=%d", msk, how);
+
+	lock_sock(sock->sk);
+	ssock = __mptcp_fallback_get_ref(msk);
+	if (ssock) {
+		release_sock(sock->sk);
+		pr_debug("subflow=%p", ssock->sk);
+		ret = kernel_sock_shutdown(ssock, how);
+		sock_put(ssock->sk);
+		return ret;
+	}
+
+	mptcp_for_each_subflow(msk, subflow) {
+		struct socket *tcp_socket;
+
+		tcp_socket = mptcp_subflow_tcp_socket(subflow);
+		pr_debug("conn_list->subflow=%p", subflow);
+		ret = kernel_sock_shutdown(tcp_socket, how);
+	}
+	release_sock(sock->sk);
+
+	return ret;
+}
+
 static struct proto_ops mptcp_stream_ops;
 
 static struct inet_protosw mptcp_protosw = {
@@ -451,6 +482,7 @@ void __init mptcp_init(void)
 	mptcp_stream_ops.accept = mptcp_stream_accept;
 	mptcp_stream_ops.getname = mptcp_getname;
 	mptcp_stream_ops.listen = mptcp_listen;
+	mptcp_stream_ops.shutdown = mptcp_shutdown;
 
 	mptcp_subflow_init();
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 15/45] mptcp: Add setsockopt()/getsockopt() socket operations
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (13 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 14/45] mptcp: Add shutdown() socket operation Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 16/45] tcp: clean ext on tx recycle Mat Martineau
                   ` (31 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

set/getsockopt behaviour with multiple subflows is undefined.
Therefore, for now, we return -EOPNOTSUPP unless we're in fallback mode.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 67 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index dba4d7a4a61a..41588758f74b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -28,6 +28,17 @@ static struct socket *__mptcp_fallback_get_ref(const struct mptcp_sock *msk)
 	return msk->subflow;
 }
 
+static struct socket *mptcp_fallback_get_ref(const struct mptcp_sock *msk)
+{
+	struct socket *ssock;
+
+	lock_sock((struct sock *)msk);
+	ssock = __mptcp_fallback_get_ref(msk);
+	release_sock((struct sock *)msk);
+
+	return ssock;
+}
+
 static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -224,6 +235,60 @@ static void mptcp_destroy(struct sock *sk)
 {
 }
 
+static int mptcp_setsockopt(struct sock *sk, int level, int optname,
+			    char __user *uoptval, unsigned int optlen)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	char __kernel *optval;
+	struct socket *ssock;
+	int ret;
+
+	/* will be treated as __user in tcp_setsockopt */
+	optval = (char __kernel __force *)uoptval;
+
+	pr_debug("msk=%p", msk);
+	ssock = mptcp_fallback_get_ref(msk);
+	if (ssock) {
+		pr_debug("subflow=%p", ssock->sk);
+		ret = kernel_setsockopt(ssock, level, optname, optval, optlen);
+		sock_put(ssock->sk);
+		return ret;
+	}
+
+	/* @@ the meaning of setsockopt() when the socket is connected and
+	 * there are multiple subflows is not defined.
+	 */
+	return -EOPNOTSUPP;
+}
+
+static int mptcp_getsockopt(struct sock *sk, int level, int optname,
+			    char __user *uoptval, int __user *uoption)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	char __kernel *optval;
+	int __kernel *option;
+	struct socket *ssock;
+	int ret;
+
+	/* will be treated as __user in tcp_getsockopt */
+	optval = (char __kernel __force *)uoptval;
+	option = (int __kernel __force *)uoption;
+
+	pr_debug("msk=%p", msk);
+	ssock = mptcp_fallback_get_ref(msk);
+	if (ssock) {
+		pr_debug("subflow=%p", ssock->sk);
+		ret = kernel_getsockopt(ssock, level, optname, optval, option);
+		sock_put(ssock->sk);
+		return ret;
+	}
+
+	/* @@ the meaning of getsockopt() when the socket is connected and
+	 * there are multiple subflows is not defined.
+	 */
+	return -EOPNOTSUPP;
+}
+
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -282,6 +347,8 @@ static struct proto mptcp_prot = {
 	.init		= mptcp_init_sock,
 	.close		= mptcp_close,
 	.accept		= mptcp_accept,
+	.setsockopt	= mptcp_setsockopt,
+	.getsockopt	= mptcp_getsockopt,
 	.shutdown	= tcp_shutdown,
 	.destroy	= mptcp_destroy,
 	.sendmsg	= mptcp_sendmsg,
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 16/45] tcp: clean ext on tx recycle
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (14 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 15/45] mptcp: Add setsockopt()/getsockopt() socket operations Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 17/45] mptcp: Add MPTCP to skb extensions Mat Martineau
                   ` (30 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

Otherwise we will find stray/unexpected/old extensions
value on next iteration.

On tcp_write_xmit() we can end-up splitting an already queued
skb in two parts, via tso_fragment(). The newly created skb
can be allocated via the tx cache and the mptcp stack will not
be aware of it, so nobody set properly the MPTCP ext.

End result, we transmit the skb using an hold MPTCP DSS map
and that confuses the rx side/corrupt the stream. It requires
some concurrent conditions, so it's not deterministic.

Resetting the ext on recycle fixes all the current mptcp self tests
issues.

Apparently only MPTCP has issues with this kind of stray ext,
so an alternative would be add an additional mptcp hook in
tso_fragment() or in sk_stream_alloc_skb() to always init
the ext.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h | 8 ++++++++
 include/net/sock.h     | 1 +
 2 files changed, 9 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7d3b1a513ef..e7a7abd62026 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4099,6 +4099,14 @@ static inline void skb_ext_put(struct sk_buff *skb)
 		__skb_ext_put(skb->extensions);
 }
 
+static inline void skb_ext_clear(struct sk_buff *skb)
+{
+	if (skb->active_extensions) {
+		__skb_ext_put(skb->extensions);
+		skb->active_extensions = 0;
+	}
+}
+
 static inline void __skb_ext_copy(struct sk_buff *dst,
 				  const struct sk_buff *src)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index ca2071555dde..b9a085d0bb18 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1470,6 +1470,7 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 	sk_mem_uncharge(sk, skb->truesize);
 	if (static_branch_unlikely(&tcp_tx_skb_cache_key) &&
 	    !sk->sk_tx_skb_cache && !skb_cloned(skb)) {
+		skb_ext_clear(skb);
 		skb_zcopy_clear(skb, true);
 		sk->sk_tx_skb_cache = skb;
 		return;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 17/45] mptcp: Add MPTCP to skb extensions
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (15 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 16/45] tcp: clean ext on tx recycle Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 18/45] tcp: Prevent coalesce/collapse when skb has MPTCP extensions Mat Martineau
                   ` (29 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Add enum value for MPTCP and update config dependencies

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/linux/skbuff.h |  3 +++
 include/net/mptcp.h    | 16 ++++++++++++++++
 net/core/skbuff.c      |  7 +++++++
 net/mptcp/Kconfig      |  1 +
 4 files changed, 27 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7a7abd62026..5a9b26ff44e9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4068,6 +4068,9 @@ enum skb_ext_id {
 #endif
 #if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
 	TC_SKB_EXT,
+#endif
+#if IS_ENABLED(CONFIG_MPTCP)
+	SKB_EXT_MPTCP,
 #endif
 	SKB_EXT_NUM, /* must be last */
 };
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 1a3ed7734bb4..6921b7fb6f0d 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,6 +8,22 @@
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
+/* MPTCP sk_buff extension data */
+struct mptcp_ext {
+	u64		data_ack;
+	u64		data_seq;
+	u32		subflow_seq;
+	u16		data_len;
+	__sum16		checksum;
+	u8		use_map:1,
+			dsn64:1,
+			use_checksum:1,
+			data_fin:1,
+			use_ack:1,
+			ack64:1,
+			__unused:2;
+};
+
 struct mptcp_out_options {
 #if IS_ENABLED(CONFIG_MPTCP)
 	u16 suboptions;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 01d65206f4fb..20094629e103 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -68,6 +68,7 @@
 #include <net/ip6_checksum.h>
 #include <net/xfrm.h>
 #include <net/mpls.h>
+#include <net/mptcp.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -4109,6 +4110,9 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
 	[TC_SKB_EXT] = SKB_EXT_CHUNKSIZEOF(struct tc_skb_ext),
 #endif
+#if IS_ENABLED(CONFIG_MPTCP)
+	[SKB_EXT_MPTCP] = SKB_EXT_CHUNKSIZEOF(struct mptcp_ext),
+#endif
 };
 
 static __always_inline unsigned int skb_ext_total_length(void)
@@ -4122,6 +4126,9 @@ static __always_inline unsigned int skb_ext_total_length(void)
 #endif
 #if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
 		skb_ext_type_len[TC_SKB_EXT] +
+#endif
+#if IS_ENABLED(CONFIG_MPTCP)
+		skb_ext_type_len[SKB_EXT_MPTCP] +
 #endif
 		0;
 }
diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
index d87dfdc210cc..f21190a4f7e9 100644
--- a/net/mptcp/Kconfig
+++ b/net/mptcp/Kconfig
@@ -2,6 +2,7 @@
 config MPTCP
 	bool "Multipath TCP"
 	depends on INET
+	select SKB_EXTENSIONS
 	help
 	  Multipath TCP (MPTCP) connections send and receive data over multiple
 	  subflows in order to utilize multiple network paths. Each subflow
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 18/45] tcp: Prevent coalesce/collapse when skb has MPTCP extensions
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (16 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 17/45] mptcp: Add MPTCP to skb extensions Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 19/45] tcp: Export low-level TCP functions Mat Martineau
                   ` (28 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

The MPTCP extension data needs to be preserved as it passes through the
TCP stack. Make sure that these skbs are not appended to others during
coalesce or collapse, so the data remains associated with the payload of
the given skb.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/mptcp.h   | 10 ++++++++++
 include/net/tcp.h     |  8 ++++++++
 net/ipv4/tcp_input.c  | 10 ++++++++--
 net/ipv4/tcp_output.c |  2 +-
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 6921b7fb6f0d..5ec275708036 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -56,6 +56,11 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 bool mptcp_established_options(struct sock *sk, unsigned int *size,
 			       struct mptcp_out_options *opts);
 
+static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
+{
+	return skb_ext_exist(skb, SKB_EXT_MPTCP);
+}
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts);
 
 #else
@@ -103,5 +108,10 @@ static inline bool mptcp_established_options(struct sock *sk,
 	return false;
 }
 
+static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
+{
+	return false;
+}
+
 #endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 907fd214ff1e..ddc098b33999 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -39,6 +39,7 @@
 #include <net/tcp_states.h>
 #include <net/inet_ecn.h>
 #include <net/dst.h>
+#include <net/mptcp.h>
 
 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
@@ -962,6 +963,13 @@ static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
 	return likely(!TCP_SKB_CB(skb)->eor);
 }
 
+static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
+					const struct sk_buff *from)
+{
+	return likely(tcp_skb_can_collapse_to(to) &&
+		      !mptcp_skb_ext_exist(from));
+}
+
 /* Events passed to congestion control interface */
 enum tcp_ca_event {
 	CA_EVENT_TX_START,	/* first transmit when no packets in flight */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 218df735c822..04267684babf 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1421,7 +1421,7 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
 	if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED)
 		goto fallback;
 
-	if (!tcp_skb_can_collapse_to(prev))
+	if (!tcp_skb_can_collapse(prev, skb))
 		goto fallback;
 
 	in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
@@ -4423,6 +4423,9 @@ static bool tcp_try_coalesce(struct sock *sk,
 	if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq)
 		return false;
 
+	if (mptcp_skb_ext_exist(from))
+		return false;
+
 #ifdef CONFIG_TLS_DEVICE
 	if (from->decrypted != to->decrypted)
 		return false;
@@ -4931,10 +4934,12 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 
 		/* The first skb to collapse is:
 		 * - not SYN/FIN and
+		 * - does not include a MPTCP skb extension
 		 * - bloated or contains data before "start" or
 		 *   overlaps to the next one.
 		 */
 		if (!(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) &&
+		    !mptcp_skb_ext_exist(skb) &&
 		    (tcp_win_from_space(sk, skb->truesize) > skb->len ||
 		     before(TCP_SKB_CB(skb)->seq, start))) {
 			end_of_skbs = false;
@@ -4950,7 +4955,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 		/* Decided to skip this, advance start seq. */
 		start = TCP_SKB_CB(skb)->end_seq;
 	}
-	if (end_of_skbs ||
+	if (end_of_skbs || mptcp_skb_ext_exist(skb) ||
 	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 		return;
 
@@ -4993,6 +4998,7 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 				skb = tcp_collapse_one(sk, skb, list, root);
 				if (!skb ||
 				    skb == tail ||
+				    mptcp_skb_ext_exist(skb) ||
 				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 					goto end;
 #ifdef CONFIG_TLS_DEVICE
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e21e8134559a..415b0414d738 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2911,7 +2911,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 		if (!tcp_can_collapse(sk, skb))
 			break;
 
-		if (!tcp_skb_can_collapse_to(to))
+		if (!tcp_skb_can_collapse(to, skb))
 			break;
 
 		space -= skb->len;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 19/45] tcp: Export low-level TCP functions
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (17 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 18/45] tcp: Prevent coalesce/collapse when skb has MPTCP extensions Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 20/45] mptcp: Write MPTCP DSS headers to outgoing data packets Mat Martineau
                   ` (27 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

MPTCP will make use of tcp_send_mss() and tcp_push() when sending
data to specific TCP subflows.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/tcp.h | 3 +++
 net/ipv4/tcp.c    | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ddc098b33999..3994e6b0db27 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -330,6 +330,9 @@ int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
 			size_t size, int flags);
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags);
+int tcp_send_mss(struct sock *sk, int *size_goal, int flags);
+void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle,
+	      int size_goal);
 void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5893d08cb204..5450e91dd82c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -691,8 +691,8 @@ static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
 	       refcount_read(&sk->sk_wmem_alloc) > skb->truesize;
 }
 
-static void tcp_push(struct sock *sk, int flags, int mss_now,
-		     int nonagle, int size_goal)
+void tcp_push(struct sock *sk, int flags, int mss_now,
+	      int nonagle, int size_goal)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
@@ -926,7 +926,7 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	return max(size_goal, mss_now);
 }
 
-static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 {
 	int mss_now;
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 20/45] mptcp: Write MPTCP DSS headers to outgoing data packets
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (18 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 19/45] tcp: Export low-level TCP functions Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 21/45] mptcp: Implement MPTCP receive path Mat Martineau
                   ` (26 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Per-packet metadata required to write the MPTCP DSS option is written to
the skb_ext area. One write to the socket may contain more than one
packet of data, in which case the DSS option in the first packet will
have a mapping covering all of the data in that write. Packets after the
first do not have a DSS option. This is complicated to handle under
memory pressure, since the first packet (with the DSS mapping) is pushed
to the TCP core before the remaining skbs are allocated.

The current implementation is limited. It will only send up to one page
of data. The number of bytes sent is returned so the caller knows which
bytes were sent and which were not. More work is required to ensure that
it works correctly with full buffers or under memory pressure.

The MPTCP DSS checksum is not yet implemented.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/mptcp.h   |  14 +++-
 net/ipv4/tcp_output.c |  11 ++--
 net/mptcp/options.c   | 146 +++++++++++++++++++++++++++++++++++++++++-
 net/mptcp/protocol.c  | 107 ++++++++++++++++++++++++++++++-
 net/mptcp/protocol.h  |  16 ++++-
 net/mptcp/subflow.c   |   2 +
 6 files changed, 283 insertions(+), 13 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 5ec275708036..43fcd4989f37 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,6 +8,14 @@
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
+/* MPTCP DSS flags */
+
+#define MPTCP_DSS_DATA_FIN	BIT(4)
+#define MPTCP_DSS_DSN64		BIT(3)
+#define MPTCP_DSS_HAS_MAP	BIT(2)
+#define MPTCP_DSS_ACK64		BIT(1)
+#define MPTCP_DSS_HAS_ACK	BIT(0)
+
 /* MPTCP sk_buff extension data */
 struct mptcp_ext {
 	u64		data_ack;
@@ -29,6 +37,7 @@ struct mptcp_out_options {
 	u16 suboptions;
 	u64 sndr_key;
 	u64 rcvr_key;
+	struct mptcp_ext ext_copy;
 #endif
 };
 
@@ -53,7 +62,8 @@ bool mptcp_syn_options(struct sock *sk, unsigned int *size,
 void mptcp_rcv_synsent(struct sock *sk);
 bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 			  struct mptcp_out_options *opts);
-bool mptcp_established_options(struct sock *sk, unsigned int *size,
+bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
+			       unsigned int *size, unsigned int remaining,
 			       struct mptcp_out_options *opts);
 
 static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
@@ -102,7 +112,9 @@ static inline bool mptcp_synack_options(const struct request_sock *req,
 }
 
 static inline bool mptcp_established_options(struct sock *sk,
+					     struct sk_buff *skb,
 					     unsigned int *size,
+					     unsigned int remaining,
 					     struct mptcp_out_options *opts)
 {
 	return false;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 415b0414d738..3de804531231 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -796,13 +796,12 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 	 */
 	if (sk_is_mptcp(sk)) {
 		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-		unsigned int opt_size;
+		unsigned int opt_size = 0;
 
-		if (mptcp_established_options(sk, &opt_size, &opts->mptcp)) {
-			if (remaining >= opt_size) {
-				opts->options |= OPTION_MPTCP;
-				size += opt_size;
-			}
+		if (mptcp_established_options(sk, skb, &opt_size, remaining,
+					      &opts->mptcp)) {
+			opts->options |= OPTION_MPTCP;
+			size += opt_size;
 		}
 	}
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 4fdd5178fe78..36691e439796 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -194,12 +194,13 @@ void mptcp_rcv_synsent(struct sock *sk)
 	}
 }
 
-bool mptcp_established_options(struct sock *sk, unsigned int *size,
-			       struct mptcp_out_options *opts)
+static bool mptcp_established_options_mp(struct sock *sk, unsigned int *size,
+					 unsigned int remaining,
+					 struct mptcp_out_options *opts)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
 
-	if (subflow->mp_capable && !subflow->fourth_ack) {
+	if (!subflow->fourth_ack && remaining >= TCPOLEN_MPTCP_MPC_ACK) {
 		opts->suboptions = OPTION_MPTCP_MPC_ACK;
 		opts->sndr_key = subflow->local_key;
 		opts->rcvr_key = subflow->remote_key;
@@ -212,6 +213,95 @@ bool mptcp_established_options(struct sock *sk, unsigned int *size,
 	return false;
 }
 
+static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
+					  unsigned int *size,
+					  unsigned int remaining,
+					  struct mptcp_out_options *opts)
+{
+	unsigned int dss_size = 0;
+	struct mptcp_ext *mpext;
+	unsigned int ack_size;
+
+	mpext = skb ? mptcp_get_ext(skb) : NULL;
+
+	if (!skb || (mpext && mpext->use_map)) {
+		unsigned int map_size;
+		bool use_csum;
+
+		map_size = TCPOLEN_MPTCP_DSS_BASE + TCPOLEN_MPTCP_DSS_MAP64;
+		use_csum = mptcp_subflow_ctx(sk)->use_checksum;
+		if (use_csum)
+			map_size += TCPOLEN_MPTCP_DSS_CHECKSUM;
+
+		if (map_size <= remaining) {
+			remaining -= map_size;
+			dss_size = map_size;
+			if (mpext) {
+				opts->ext_copy.data_seq = mpext->data_seq;
+				opts->ext_copy.subflow_seq = mpext->subflow_seq;
+				opts->ext_copy.data_len = mpext->data_len;
+				opts->ext_copy.checksum = mpext->checksum;
+				opts->ext_copy.use_map = 1;
+				opts->ext_copy.dsn64 = mpext->dsn64;
+				opts->ext_copy.use_checksum = use_csum;
+			}
+		} else {
+			opts->ext_copy.use_map = 0;
+			WARN_ONCE(1, "MPTCP: Map dropped");
+		}
+	}
+
+	if (mpext && mpext->use_ack) {
+		ack_size = TCPOLEN_MPTCP_DSS_ACK64;
+
+		/* Add kind/length/subtype/flag overhead if mapping is not
+		 * populated
+		 */
+		if (dss_size == 0)
+			ack_size += TCPOLEN_MPTCP_DSS_BASE;
+
+		if (ack_size <= remaining) {
+			dss_size += ack_size;
+
+			opts->ext_copy.data_ack = mpext->data_ack;
+			opts->ext_copy.ack64 = 1;
+			opts->ext_copy.use_ack = 1;
+		} else {
+			opts->ext_copy.use_ack = 0;
+			WARN(1, "MPTCP: Ack dropped");
+		}
+	}
+
+	if (!dss_size)
+		return false;
+
+	*size = ALIGN(dss_size, 4);
+	return true;
+}
+
+bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
+			       unsigned int *size, unsigned int remaining,
+			       struct mptcp_out_options *opts)
+{
+	unsigned int opt_size = 0;
+
+	if (!mptcp_subflow_ctx(sk)->mp_capable)
+		return false;
+
+	if (mptcp_established_options_mp(sk, &opt_size, remaining, opts)) {
+		*size += opt_size;
+		remaining -= opt_size;
+		return true;
+	} else if (mptcp_established_options_dss(sk, skb, &opt_size, remaining,
+					       opts)) {
+		*size += opt_size;
+		remaining -= opt_size;
+		return true;
+	}
+
+	return false;
+}
+
 bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 			  struct mptcp_out_options *opts)
 {
@@ -256,4 +346,54 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 			ptr += 2;
 		}
 	}
+
+	if (opts->ext_copy.use_ack || opts->ext_copy.use_map) {
+		struct mptcp_ext *mpext = &opts->ext_copy;
+		u8 len = TCPOLEN_MPTCP_DSS_BASE;
+		u8 flags = 0;
+
+		if (mpext->use_ack) {
+			len += TCPOLEN_MPTCP_DSS_ACK64;
+			flags = MPTCP_DSS_HAS_ACK | MPTCP_DSS_ACK64;
+		}
+
+		if (mpext->use_map) {
+			len += TCPOLEN_MPTCP_DSS_MAP64;
+
+			if (mpext->use_checksum)
+				len += TCPOLEN_MPTCP_DSS_CHECKSUM;
+
+			/* Use only 64-bit mapping flags for now, add
+			 * support for optional 32-bit mappings later.
+			 */
+			flags |= MPTCP_DSS_HAS_MAP | MPTCP_DSS_DSN64;
+			if (mpext->data_fin)
+				flags |= MPTCP_DSS_DATA_FIN;
+		}
+
+		*ptr++ = htonl((TCPOPT_MPTCP << 24) |
+			       (len  << 16) |
+			       (MPTCPOPT_DSS << 12) |
+			       (flags));
+
+		if (mpext->use_ack) {
+			put_unaligned_be64(mpext->data_ack, ptr);
+			ptr += 2;
+		}
+
+		if (mpext->use_map) {
+			__u16 checksum;
+
+			pr_debug("Writing map values");
+			put_unaligned_be64(mpext->data_seq, ptr);
+			ptr += 2;
+			*ptr++ = htonl(mpext->subflow_seq);
+
+			if (mpext->use_checksum)
+				checksum = (u16 __force)mpext->checksum;
+			else
+				checksum = TCPOPT_NOP << 8 | TCPOPT_NOP;
+			*ptr = htonl(mpext->data_len << 16 | checksum);
+		}
+	}
 }
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 41588758f74b..feabf3dfc988 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -58,10 +58,15 @@ static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
 
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
+	int mss_now = 0, size_goal = 0, ret = 0;
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct socket *ssock;
 	struct sock *ssk;
-	int ret;
+	struct mptcp_ext *mpext = NULL;
+	struct page *page = NULL;
+	struct sk_buff *skb;
+	size_t psize;
+	int poffset;
 
 	lock_sock(sk);
 	ssock = __mptcp_fallback_get_ref(msk);
@@ -79,8 +84,90 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		return -ENOTCONN;
 	}
 
-	ret = sock_sendmsg(ssk->sk_socket, msg);
+	if (!msg_data_left(msg)) {
+		pr_debug("empty send");
+		ret = sock_sendmsg(ssk->sk_socket, msg);
+		goto put_out;
+	}
+
+	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) {
+		ret = -ENOTSUPP;
+		goto put_out;
+	}
+
+	/* Initial experiment: new page per send.  Real code will
+	 * maintain list of active pages and DSS mappings, append to the
+	 * end and honor zerocopy
+	 */
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		ret = -ENOMEM;
+		goto put_out;
+	}
+
+	/* Copy to page */
+	poffset = 0;
+	pr_debug("left=%zu", msg_data_left(msg));
+	psize = copy_page_from_iter(page, poffset,
+				    min_t(size_t, msg_data_left(msg),
+					  PAGE_SIZE),
+				    &msg->msg_iter);
+	pr_debug("left=%zu", msg_data_left(msg));
+
+	if (!psize) {
+		ret = -EINVAL;
+		goto put_out;
+	}
+
+	lock_sock(ssk);
+
+	/* Mark the end of the previous write so the beginning of the
+	 * next write (with its own mptcp skb extension data) is not
+	 * collapsed.
+	 */
+	skb = tcp_write_queue_tail(ssk);
+	if (skb)
+		TCP_SKB_CB(skb)->eor = 1;
+
+	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
+
+	ret = do_tcp_sendpages(ssk, page, poffset, min_t(int, size_goal, psize),
+			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
+	if (ret <= 0)
+		goto put_out;
+
+	if (skb == tcp_write_queue_tail(ssk))
+		pr_err("no new skb %p/%p", sk, ssk);
+
+	skb = tcp_write_queue_tail(ssk);
+
+	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
+
+	if (mpext) {
+		memset(mpext, 0, sizeof(*mpext));
+		mpext->data_ack = msk->ack_seq;
+		mpext->data_seq = msk->write_seq;
+		mpext->subflow_seq = mptcp_subflow_ctx(ssk)->rel_write_seq;
+		mpext->data_len = ret;
+		mpext->checksum = 0xbeef;
+		mpext->use_map = 1;
+		mpext->dsn64 = 1;
+		mpext->use_ack = 1;
+		mpext->ack64 = 1;
 
+		pr_debug("data_seq=%llu subflow_seq=%u data_len=%u checksum=%u, dsn64=%d",
+			 mpext->data_seq, mpext->subflow_seq, mpext->data_len,
+			 mpext->checksum, mpext->dsn64);
+	} /* TODO: else fallback */
+
+	msk->write_seq += ret;
+	mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
+
+	tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle, size_goal);
+
+put_out:
+	if (page)
+		put_page(page);
 	release_sock(sk);
 	sock_put(ssk);
 	return ret;
@@ -189,6 +276,7 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 
 	if (subflow->mp_capable) {
 		struct sock *new_mptcp_sock;
+		u64 ack_seq;
 
 		lock_sock(sk);
 
@@ -214,6 +302,14 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 
 		mptcp_token_update_accept(new_sock->sk, new_mptcp_sock);
 		msk->subflow = NULL;
+
+		mptcp_crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
+		msk->write_seq = subflow->idsn + 1;
+		ack_seq++;
+		msk->ack_seq = ack_seq;
+		msk->remote_key = subflow->remote_key;
+		msk->ack_seq++;
+		subflow->rel_write_seq = 1;
 		newsk = new_mptcp_sock;
 		subflow->conn = new_mptcp_sock;
 		list_add(&subflow->node, &msk->conn_list);
@@ -307,6 +403,8 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 	subflow = mptcp_subflow_ctx(msk->subflow->sk);
 
 	if (mp_capable) {
+		u64 ack_seq;
+
 		/* sk (new subflow socket) is already locked, but we need
 		 * to lock the parent (mptcp) socket now to add the tcp socket
 		 * to the subflow list.
@@ -333,6 +431,11 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
+		msk->write_seq = subflow->idsn + 1;
+		subflow->rel_write_seq = 1;
+		mptcp_crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
+		ack_seq++;
+		msk->ack_seq = ack_seq;
 		list_add(&subflow->node, &msk->conn_list);
 		msk->subflow = NULL;
 		bh_unlock_sock(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index dea6d4f32f38..2b0347612550 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -30,6 +30,10 @@
 #define TCPOLEN_MPTCP_MPC_SYN		12
 #define TCPOLEN_MPTCP_MPC_SYNACK	20
 #define TCPOLEN_MPTCP_MPC_ACK		20
+#define TCPOLEN_MPTCP_DSS_BASE		4
+#define TCPOLEN_MPTCP_DSS_ACK64		8
+#define TCPOLEN_MPTCP_DSS_MAP64		14
+#define TCPOLEN_MPTCP_DSS_CHECKSUM	2
 
 /* MPTCP MP_CAPABLE flags */
 #define MPTCP_VERSION_MASK	(0x0F)
@@ -44,6 +48,8 @@ struct mptcp_sock {
 	struct inet_connection_sock sk;
 	u64		local_key;
 	u64		remote_key;
+	u64		write_seq;
+	u64		ack_seq;
 	u32		token;
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
@@ -83,12 +89,15 @@ struct mptcp_subflow_context {
 	u64	remote_key;
 	u64	idsn;
 	u32	token;
+	u32     rel_write_seq;
 	u32	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
 		request_version : 4,
 		mp_capable : 1,	    /* remote is MPTCP capable */
 		fourth_ack : 1,     /* send initial DSS */
-		conn_finished : 1;
+		conn_finished : 1,
+		use_checksum : 1;
+
 	struct  socket *tcp_sock;  /* underlying tcp_sock */
 	struct  sock *conn;        /* parent mptcp_sock */
 };
@@ -140,4 +149,9 @@ static inline void mptcp_crypto_key_gen_sha1(u64 *key, u32 *token, u64 *idsn)
 void mptcp_crypto_hmac_sha1(u64 key1, u64 key2, u32 nonce1, u32 nonce2,
 			    u32 *hash_out);
 
+static inline struct mptcp_ext *mptcp_get_ext(struct sk_buff *skb)
+{
+	return (struct mptcp_ext *)skb_ext_find(skb, SKB_EXT_MPTCP);
+}
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 43fb1ae51b03..260352af20c9 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -224,6 +224,7 @@ static int subflow_ulp_init(struct sock *sk)
 
 	tp->is_mptcp = 1;
 	icsk->icsk_af_ops = &subflow_specific;
+	ctx->use_checksum = 0;
 out:
 	return err;
 }
@@ -252,6 +253,7 @@ static void subflow_ulp_clone(const struct request_sock *req,
 
 	new_ctx->conn = NULL;
 	new_ctx->conn_finished = 1;
+	new_ctx->use_checksum = 0;
 
 	if (subflow_req->mp_capable) {
 		new_ctx->mp_capable = 1;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 21/45] mptcp: Implement MPTCP receive path
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (19 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 20/45] mptcp: Write MPTCP DSS headers to outgoing data packets Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 22/45] mptcp: use sk_page_frag() in sendmsg Mat Martineau
                   ` (25 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

Parses incoming DSS options and populates outgoing MPTCP ACK
fields. MPTCP fields are parsed from the TCP option header and placed in
an skb extension, allowing the upper MPTCP layer to access MPTCP
options after the skb has gone through the TCP stack.

Outgoing MPTCP ACK values are now populated from an atomic value stored
in the connection socket rather than carried in an skb extension. This
allows sent packet headers to make use of the most up-to-date sequence
number and allows the MPTCP ACK to be populated in TCP ACK packets that
have no payload.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 include/linux/tcp.h  |  17 ++-
 include/net/mptcp.h  |  16 +--
 net/ipv4/tcp_input.c |   8 +-
 net/mptcp/options.c  | 138 +++++++++++++++++--
 net/mptcp/protocol.c | 318 +++++++++++++++++++++++++++++++++++++++++--
 net/mptcp/protocol.h |  32 ++++-
 net/mptcp/subflow.c  |  36 ++++-
 7 files changed, 523 insertions(+), 42 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 220883a7a4db..0f0d6c188f52 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -97,13 +97,26 @@ struct tcp_options_received {
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
 #if IS_ENABLED(CONFIG_MPTCP)
 	struct mptcp_options_received {
+		u64     sndr_key;
+		u64     rcvr_key;
+		u64	data_ack;
+		u64	data_seq;
+		u32	subflow_seq;
+		u16	data_len;
+		__sum16	checksum;
 		u8      mp_capable : 1,
 			mp_join : 1,
 			dss : 1,
 			version : 4;
 		u8      flags;
-		u64     sndr_key;
-		u64     rcvr_key;
+		u8	dss_flags;
+		u8	use_map:1,
+			dsn64:1,
+			use_checksum:1,
+			data_fin:1,
+			use_ack:1,
+			ack64:1,
+			__unused:2;
 	} mptcp;
 #endif
 };
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 43fcd4989f37..21438bcacb14 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,14 +8,6 @@
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
-/* MPTCP DSS flags */
-
-#define MPTCP_DSS_DATA_FIN	BIT(4)
-#define MPTCP_DSS_DSN64		BIT(3)
-#define MPTCP_DSS_HAS_MAP	BIT(2)
-#define MPTCP_DSS_ACK64		BIT(1)
-#define MPTCP_DSS_HAS_ACK	BIT(0)
-
 /* MPTCP sk_buff extension data */
 struct mptcp_ext {
 	u64		data_ack;
@@ -65,6 +57,8 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 			       unsigned int *size, unsigned int remaining,
 			       struct mptcp_out_options *opts);
+void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
+			    struct tcp_options_received *opt_rx);
 
 static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
 {
@@ -120,6 +114,12 @@ static inline bool mptcp_established_options(struct sock *sk,
 	return false;
 }
 
+static inline void mptcp_incoming_options(struct sock *sk,
+					  struct sk_buff *skb,
+					  struct tcp_options_received *opt_rx)
+{
+}
+
 static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
 {
 	return false;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 04267684babf..dac6027a6c2e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4764,6 +4764,9 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 	bool fragstolen;
 	int eaten;
 
+	if (sk_is_mptcp(sk))
+		mptcp_incoming_options(sk, skb, &tp->rx_opt);
+
 	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
 		__kfree_skb(skb);
 		return;
@@ -6332,8 +6335,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 	case TCP_CLOSE_WAIT:
 	case TCP_CLOSING:
 	case TCP_LAST_ACK:
-		if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
+		if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
+			if (sk_is_mptcp(sk))
+				mptcp_incoming_options(sk, skb, &tp->rx_opt);
 			break;
+		}
 		/* fall through */
 	case TCP_FIN_WAIT1:
 	case TCP_FIN_WAIT2:
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 36691e439796..caea320cd2a6 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -14,6 +14,7 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 {
 	struct mptcp_options_received *mp_opt = &opt_rx->mptcp;
 	u8 subtype = *ptr >> 4;
+	int expected_opsize;
 
 	switch (subtype) {
 	/* MPTCPOPT_MP_CAPABLE
@@ -98,6 +99,72 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	case MPTCPOPT_DSS:
 		pr_debug("DSS");
 		mp_opt->dss = 1;
+		ptr++;
+
+		mp_opt->dss_flags = (*ptr++) & MPTCP_DSS_FLAG_MASK;
+		mp_opt->data_fin = (mp_opt->dss_flags & MPTCP_DSS_DATA_FIN) != 0;
+		mp_opt->dsn64 = (mp_opt->dss_flags & MPTCP_DSS_DSN64) != 0;
+		mp_opt->use_map = (mp_opt->dss_flags & MPTCP_DSS_HAS_MAP) != 0;
+		mp_opt->ack64 = (mp_opt->dss_flags & MPTCP_DSS_ACK64) != 0;
+		mp_opt->use_ack = (mp_opt->dss_flags & MPTCP_DSS_HAS_ACK);
+
+		pr_debug("data_fin=%d dsn64=%d use_map=%d ack64=%d use_ack=%d",
+			 mp_opt->data_fin, mp_opt->dsn64,
+			 mp_opt->use_map, mp_opt->ack64,
+			 mp_opt->use_ack);
+
+		expected_opsize = TCPOLEN_MPTCP_DSS_BASE;
+
+		if (mp_opt->use_ack) {
+			if (mp_opt->ack64)
+				expected_opsize += TCPOLEN_MPTCP_DSS_ACK64;
+			else
+				expected_opsize += TCPOLEN_MPTCP_DSS_ACK32;
+
+			if (opsize < expected_opsize)
+				break;
+
+			if (mp_opt->ack64) {
+				mp_opt->data_ack = get_unaligned_be64(ptr);
+				ptr += 8;
+			} else {
+				mp_opt->data_ack = get_unaligned_be32(ptr);
+				ptr += 4;
+			}
+
+			pr_debug("data_ack=%llu", mp_opt->data_ack);
+		}
+
+		if (mp_opt->use_map) {
+			if (mp_opt->dsn64)
+				expected_opsize += TCPOLEN_MPTCP_DSS_MAP64;
+			else
+				expected_opsize += TCPOLEN_MPTCP_DSS_MAP32;
+
+			if (opsize < expected_opsize)
+				break;
+
+			if (mp_opt->dsn64) {
+				mp_opt->data_seq = get_unaligned_be64(ptr);
+				ptr += 8;
+			} else {
+				mp_opt->data_seq = get_unaligned_be32(ptr);
+				ptr += 4;
+			}
+
+			mp_opt->subflow_seq = get_unaligned_be32(ptr);
+			ptr += 4;
+
+			mp_opt->data_len = get_unaligned_be16(ptr);
+			ptr += 2;
+
+			/* Checksum not currently supported */
+			mp_opt->checksum = 0;
+
+			pr_debug("data_seq=%llu subflow_seq=%u data_len=%u ck=%u",
+				 mp_opt->data_seq, mp_opt->subflow_seq,
+				 mp_opt->data_len, mp_opt->checksum);
+		}
 		break;
 
 	/* MPTCPOPT_ADD_ADDR
@@ -251,25 +318,31 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
-	if (mpext && mpext->use_ack) {
-		ack_size = TCPOLEN_MPTCP_DSS_ACK64;
+	ack_size = TCPOLEN_MPTCP_DSS_ACK64;
 
-		/* Add kind/length/subtype/flag overhead if mapping is not
-		 * populated
-		 */
-		if (dss_size == 0)
-			ack_size += TCPOLEN_MPTCP_DSS_BASE;
+	/* Add kind/length/subtype/flag overhead if mapping is not populated */
+	if (dss_size == 0)
+		ack_size += TCPOLEN_MPTCP_DSS_BASE;
 
-		if (ack_size <= remaining) {
-			dss_size += ack_size;
+	if (ack_size <= remaining) {
+		struct mptcp_sock *msk;
 
-			opts->ext_copy.data_ack = mpext->data_ack;
-			opts->ext_copy.ack64 = 1;
-			opts->ext_copy.use_ack = 1;
+		dss_size += ack_size;
+
+		msk = mptcp_sk(mptcp_subflow_ctx(sk)->conn);
+		if (msk) {
+			opts->ext_copy.data_ack = msk->ack_seq;
 		} else {
-			opts->ext_copy.use_ack = 0;
-			WARN(1, "MPTCP: Ack dropped");
+			mptcp_crypto_key_sha1(mptcp_subflow_ctx(sk)->remote_key,
+					      NULL, &opts->ext_copy.data_ack);
+			opts->ext_copy.data_ack++;
 		}
+
+		opts->ext_copy.ack64 = 1;
+		opts->ext_copy.use_ack = 1;
+	} else {
+		opts->ext_copy.use_ack = 0;
+		WARN(1, "MPTCP: Ack dropped");
 	}
 
 	if (!dss_size)
@@ -320,6 +393,42 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 	return false;
 }
 
+void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
+			    struct tcp_options_received *opt_rx)
+{
+	struct mptcp_options_received *mp_opt;
+	struct mptcp_ext *mpext;
+
+	mp_opt = &opt_rx->mptcp;
+
+	if (!mp_opt->dss)
+		return;
+
+	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
+	if (!mpext)
+		return;
+
+	memset(mpext, 0, sizeof(*mpext));
+
+	if (mp_opt->use_map) {
+		mpext->data_seq = mp_opt->data_seq;
+		mpext->subflow_seq = mp_opt->subflow_seq;
+		mpext->data_len = mp_opt->data_len;
+		mpext->checksum = mp_opt->checksum;
+		mpext->use_map = 1;
+		mpext->dsn64 = mp_opt->dsn64;
+		mpext->use_checksum = mp_opt->use_checksum;
+	}
+
+	if (mp_opt->use_ack) {
+		mpext->data_ack = mp_opt->data_ack;
+		mpext->use_ack = 1;
+		mpext->ack64 = mp_opt->ack64;
+	}
+
+	mpext->data_fin = mp_opt->data_fin;
+}
+
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 {
 	if ((OPTION_MPTCP_MPC_SYN |
@@ -358,6 +467,7 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 		}
 
 		if (mpext->use_map) {
+			pr_debug("Updating DSS length and flags for map");
 			len += TCPOLEN_MPTCP_DSS_MAP64;
 
 			if (mpext->use_checksum)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index feabf3dfc988..fdcfffce0ec9 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -9,6 +9,8 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/netdevice.h>
+#include <linux/sched/signal.h>
+#include <linux/atomic.h>
 #include <net/sock.h>
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
@@ -145,15 +147,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	if (mpext) {
 		memset(mpext, 0, sizeof(*mpext));
-		mpext->data_ack = msk->ack_seq;
 		mpext->data_seq = msk->write_seq;
 		mpext->subflow_seq = mptcp_subflow_ctx(ssk)->rel_write_seq;
 		mpext->data_len = ret;
 		mpext->checksum = 0xbeef;
 		mpext->use_map = 1;
 		mpext->dsn64 = 1;
-		mpext->use_ack = 1;
-		mpext->ack64 = 1;
 
 		pr_debug("data_seq=%llu subflow_seq=%u data_len=%u checksum=%u, dsn64=%d",
 			 mpext->data_seq, mpext->subflow_seq, mpext->data_len,
@@ -173,13 +172,158 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	return ret;
 }
 
+struct mptcp_read_arg {
+	struct msghdr *msg;
+};
+
+static u64 expand_seq(u64 old_seq, u16 old_data_len, u64 seq)
+{
+	if ((u32)seq == (u32)old_seq)
+		return old_seq;
+
+	/* Assume map covers data not mapped yet. */
+	return seq | ((old_seq + old_data_len + 1) & GENMASK_ULL(63, 32));
+}
+
+static u64 get_map_offset(struct mptcp_subflow_context *subflow)
+{
+	return tcp_sk(mptcp_subflow_tcp_socket(subflow)->sk)->copied_seq -
+		      subflow->ssn_offset -
+		      subflow->map_subflow_seq;
+}
+
+static u64 get_mapped_dsn(struct mptcp_subflow_context *subflow)
+{
+	return subflow->map_seq + get_map_offset(subflow);
+}
+
+static int mptcp_read_actor(read_descriptor_t *desc, struct sk_buff *skb,
+			    unsigned int offset, size_t len)
+{
+	struct mptcp_read_arg *arg = desc->arg.data;
+	size_t copy_len;
+
+	copy_len = min(desc->count, len);
+
+	if (likely(arg->msg)) {
+		int err;
+
+		err = skb_copy_datagram_msg(skb, offset, arg->msg, copy_len);
+		if (err) {
+			pr_debug("error path");
+			desc->error = err;
+			return err;
+		}
+	} else {
+		pr_debug("Flushing skb payload");
+	}
+
+	// MSG_PEEK support? Other flags? MSG_TRUNC?
+
+	desc->count -= copy_len;
+
+	pr_debug("consumed %zu bytes, %zu left", copy_len, desc->count);
+	return copy_len;
+}
+
+enum mapping_status {
+	MAPPING_ADDED,
+	MAPPING_MISSING,
+	MAPPING_EMPTY,
+	MAPPING_DATA_FIN
+};
+
+static enum mapping_status mptcp_get_mapping(struct sock *ssk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	struct mptcp_ext *mpext;
+	enum mapping_status ret;
+	struct sk_buff *skb;
+	u64 map_seq;
+
+	skb = skb_peek(&ssk->sk_receive_queue);
+	if (!skb) {
+		pr_debug("Empty queue");
+		return MAPPING_EMPTY;
+	}
+
+	mpext = mptcp_get_ext(skb);
+
+	if (!mpext) {
+		/* This is expected for non-DSS data packets */
+		return MAPPING_MISSING;
+	}
+
+	if (!mpext->use_map) {
+		ret = MAPPING_MISSING;
+		goto del_out;
+	}
+
+	pr_debug("seq=%llu is64=%d ssn=%u data_len=%u ck=%u",
+		 mpext->data_seq, mpext->dsn64, mpext->subflow_seq,
+		 mpext->data_len, mpext->checksum);
+
+	if (mpext->data_len == 0) {
+		pr_err("Infinite mapping not handled");
+		ret = MAPPING_MISSING;
+		goto del_out;
+	} else if (mpext->subflow_seq == 0 &&
+		   mpext->data_fin == 1) {
+		pr_debug("DATA_FIN with no payload");
+		ret = MAPPING_DATA_FIN;
+		goto del_out;
+	}
+
+	if (!mpext->dsn64) {
+		map_seq = expand_seq(subflow->map_seq, subflow->map_data_len,
+				     mpext->data_seq);
+		pr_debug("expanded seq=%llu", subflow->map_seq);
+	} else {
+		map_seq = mpext->data_seq;
+	}
+
+	if (subflow->map_valid) {
+		/* due to GSO/TSO we can receive the same mapping multiple
+		 * times, before it's expiration.
+		 */
+		if (subflow->map_seq != map_seq ||
+		    subflow->map_subflow_seq != mpext->subflow_seq ||
+		    subflow->map_data_len != mpext->data_len)
+			pr_warn("Replaced mapping before it was done");
+	}
+
+	subflow->map_seq = map_seq;
+	subflow->map_subflow_seq = mpext->subflow_seq;
+	subflow->map_data_len = mpext->data_len;
+	subflow->map_valid = 1;
+	ret = MAPPING_ADDED;
+	pr_debug("new map seq=%llu subflow_seq=%u data_len=%u",
+		 subflow->map_seq, subflow->map_subflow_seq,
+		 subflow->map_data_len);
+
+del_out:
+	__skb_ext_del(skb, SKB_EXT_MPTCP);
+	return ret;
+}
+
+static void warn_bad_map(struct mptcp_subflow_context *subflow, u32 ssn)
+{
+	WARN_ONCE(1, "Bad mapping: ssn=%d map_seq=%d map_data_len=%d",
+		  ssn, subflow->map_subflow_seq, subflow->map_data_len);
+}
+
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
+	struct mptcp_read_arg arg;
+	read_descriptor_t desc;
 	struct socket *ssock;
+	struct tcp_sock *tp;
 	struct sock *ssk;
 	int copied = 0;
+	long timeo;
 
 	lock_sock(sk);
 	ssock = __mptcp_fallback_get_ref(msk);
@@ -198,8 +342,158 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		return -ENOTCONN;
 	}
 
-	copied = sock_recvmsg(ssk->sk_socket, msg, flags);
+	subflow = mptcp_subflow_ctx(ssk);
+	tp = tcp_sk(ssk);
+
+	lock_sock(ssk);
+
+	desc.arg.data = &arg;
+	desc.error = 0;
+
+	timeo = sock_rcvtimeo(sk, nonblock);
+
+	len = min_t(size_t, len, INT_MAX);
+
+	while (copied < len) {
+		enum mapping_status status;
+		u32 map_remaining;
+		int bytes_read;
+		u64 ack_seq;
+		u64 old_ack;
+		u32 ssn;
+
+		status = mptcp_get_mapping(ssk);
+
+		if (status == MAPPING_ADDED) {
+			/* Common case, but nothing to do here */
+		} else if (status == MAPPING_MISSING) {
+			struct sk_buff *skb = skb_peek(&ssk->sk_receive_queue);
+
+			if (!skb->len) {
+				/* the TCP stack deliver 0 len FIN pkt
+				 * to the receive queue, that is the only
+				 * 0len pkts ever expected here, and we can
+				 * admit no mapping only for 0 len pkts
+				 */
+				if (!(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
+					WARN_ONCE(1, "0len seq %d:%d flags %x",
+						  TCP_SKB_CB(skb)->seq,
+						  TCP_SKB_CB(skb)->end_seq,
+						  TCP_SKB_CB(skb)->tcp_flags);
+				sk_eat_skb(sk, skb);
+				continue;
+			}
+			if (!subflow->map_valid) {
+				WARN_ONCE(1, "corrupted stream: missing mapping");
+				copied = -EBADFD;
+				break;
+			}
+		} else if (status == MAPPING_EMPTY) {
+			goto wait_for_data;
+		} else if (status == MAPPING_DATA_FIN) {
+			/* TODO: Handle according to RFC 6824 */
+			if (!copied) {
+				pr_err("Can't read after DATA_FIN");
+				copied = -ENOTCONN;
+			}
+
+			break;
+		}
+
+		ssn = tcp_sk(ssk)->copied_seq - subflow->ssn_offset;
+		old_ack = msk->ack_seq;
+
+		if (unlikely(before(ssn, subflow->map_subflow_seq))) {
+			/* Mapping covers data later in the subflow stream,
+			 * currently unsupported.
+			 */
+			warn_bad_map(subflow, ssn);
+			copied = -EBADFD;
+			break;
+		} else if (unlikely(!before(ssn, (subflow->map_subflow_seq +
+						  subflow->map_data_len)))) {
+			/* Mapping ends earlier in the subflow stream.
+			 * Invalid
+			 */
+			warn_bad_map(subflow, ssn);
+			copied = -EBADFD;
+			break;
+		}
+
+		ack_seq = get_mapped_dsn(subflow);
+		map_remaining = subflow->map_data_len - get_map_offset(subflow);
+
+		if (before64(ack_seq, old_ack)) {
+			/* Mapping covers data already received, discard data
+			 * in the current mapping
+			 */
+			arg.msg = NULL;
+			desc.count = min_t(size_t, old_ack - ack_seq,
+					   map_remaining);
+			pr_debug("Dup data, map len=%d acked=%lld dropped=%ld",
+				 map_remaining, old_ack - ack_seq, desc.count);
+		} else {
+			arg.msg = msg;
+			desc.count = min_t(size_t, len - copied, map_remaining);
+		}
+
+		/* Read mapped data */
+		bytes_read = tcp_read_sock(ssk, &desc, mptcp_read_actor);
+		if (bytes_read < 0)
+			break;
+
+		/* Refresh current MPTCP sequence number based on subflow seq */
+		ack_seq = get_mapped_dsn(subflow);
+
+		if (before64(old_ack, ack_seq))
+			msk->ack_seq = ack_seq;
+
+		if (!before(tcp_sk(ssk)->copied_seq - subflow->ssn_offset,
+			    subflow->map_subflow_seq + subflow->map_data_len)) {
+			subflow->map_valid = 0;
+			pr_debug("Done with mapping: seq=%u data_len=%u",
+				 subflow->map_subflow_seq,
+				 subflow->map_data_len);
+		}
 
+		if (arg.msg)
+			copied += bytes_read;
+
+		/* The 'wait_for_data' code path can terminate the receive loop
+		 * in a number of scenarios: check if more data is pending
+		 * before giving up.
+		 */
+		if (!skb_queue_empty(&ssk->sk_receive_queue))
+			continue;
+
+wait_for_data:
+		if (copied)
+			break;
+
+		if (tp->urg_data && tp->urg_seq == tp->copied_seq) {
+			pr_err("Urgent data present, cannot proceed");
+			break;
+		}
+
+		if (ssk->sk_err || ssk->sk_state == TCP_CLOSE ||
+		    (ssk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    signal_pending(current)) {
+			pr_debug("nonblock or error");
+			break;
+		}
+
+		/* Handle blocking and retry read if needed.
+		 *
+		 * Wait on MPTCP sock, the subflow will notify via data ready.
+		 */
+
+		pr_debug("block");
+		release_sock(ssk);
+		sk_wait_data(sk, &timeo, NULL);
+		lock_sock(ssk);
+	}
+
+	release_sock(ssk);
 	release_sock(sk);
 
 	sock_put(ssk);
@@ -296,8 +590,6 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		msk = mptcp_sk(new_mptcp_sock);
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
-		msk->subflow = NULL;
-
 		msk->token = subflow->token;
 
 		mptcp_token_update_accept(new_sock->sk, new_mptcp_sock);
@@ -307,9 +599,10 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		msk->write_seq = subflow->idsn + 1;
 		ack_seq++;
 		msk->ack_seq = ack_seq;
-		msk->remote_key = subflow->remote_key;
-		msk->ack_seq++;
+		subflow->map_seq = ack_seq;
+		subflow->map_subflow_seq = 1;
 		subflow->rel_write_seq = 1;
+		subflow->tcp_sock = new_sock;
 		newsk = new_mptcp_sock;
 		subflow->conn = new_mptcp_sock;
 		list_add(&subflow->node, &msk->conn_list);
@@ -431,11 +724,16 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
-		msk->write_seq = subflow->idsn + 1;
-		subflow->rel_write_seq = 1;
+
+		pr_debug("msk=%p, token=%u", msk, msk->token);
+
 		mptcp_crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
+		msk->write_seq = subflow->idsn + 1;
 		ack_seq++;
 		msk->ack_seq = ack_seq;
+		subflow->map_seq = ack_seq;
+		subflow->map_subflow_seq = 1;
+		subflow->rel_write_seq = 1;
 		list_add(&subflow->node, &msk->conn_list);
 		msk->subflow = NULL;
 		bh_unlock_sock(sk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 2b0347612550..7dae12cfcf14 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -31,7 +31,9 @@
 #define TCPOLEN_MPTCP_MPC_SYNACK	20
 #define TCPOLEN_MPTCP_MPC_ACK		20
 #define TCPOLEN_MPTCP_DSS_BASE		4
+#define TCPOLEN_MPTCP_DSS_ACK32		4
 #define TCPOLEN_MPTCP_DSS_ACK64		8
+#define TCPOLEN_MPTCP_DSS_MAP32		10
 #define TCPOLEN_MPTCP_DSS_MAP64		14
 #define TCPOLEN_MPTCP_DSS_CHECKSUM	2
 
@@ -42,6 +44,14 @@
 #define MPTCP_CAP_HMAC_SHA1	BIT(0)
 #define MPTCP_CAP_FLAG_MASK	(0x3F)
 
+/* MPTCP DSS flags */
+#define MPTCP_DSS_DATA_FIN	BIT(4)
+#define MPTCP_DSS_DSN64		BIT(3)
+#define MPTCP_DSS_HAS_MAP	BIT(2)
+#define MPTCP_DSS_ACK64		BIT(1)
+#define MPTCP_DSS_HAS_ACK	BIT(0)
+#define MPTCP_DSS_FLAG_MASK	(0x1F)
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
@@ -74,6 +84,7 @@ struct mptcp_subflow_request_sock {
 	u64	remote_key;
 	u64	idsn;
 	u32	token;
+	u32	ssn_offset;
 };
 
 static inline struct mptcp_subflow_request_sock *
@@ -88,18 +99,25 @@ struct mptcp_subflow_context {
 	u64	local_key;
 	u64	remote_key;
 	u64	idsn;
+	u64	map_seq;
 	u32	token;
 	u32     rel_write_seq;
-	u32	request_mptcp : 1,  /* send MP_CAPABLE */
+	u32	map_subflow_seq;
+	u32	ssn_offset;
+	u16	map_data_len;
+	u16	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
 		request_version : 4,
 		mp_capable : 1,	    /* remote is MPTCP capable */
 		fourth_ack : 1,     /* send initial DSS */
 		conn_finished : 1,
-		use_checksum : 1;
+		use_checksum : 1,
+		map_valid : 1;
 
 	struct  socket *tcp_sock;  /* underlying tcp_sock */
 	struct  sock *conn;        /* parent mptcp_sock */
+	void	(*tcp_sk_data_ready)(struct sock *sk);
+	struct	rcu_head rcu;
 };
 
 static inline struct mptcp_subflow_context *
@@ -107,7 +125,8 @@ mptcp_subflow_ctx(const struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-	return (struct mptcp_subflow_context *)icsk->icsk_ulp_data;
+	/* Use RCU on icsk_ulp_data only for sock diag code */
+	return (__force struct mptcp_subflow_context *)icsk->icsk_ulp_data;
 }
 
 static inline struct socket *
@@ -154,4 +173,11 @@ static inline struct mptcp_ext *mptcp_get_ext(struct sk_buff *skb)
 	return (struct mptcp_ext *)skb_ext_find(skb, SKB_EXT_MPTCP);
 }
 
+static inline bool before64(__u64 seq1, __u64 seq2)
+{
+	return (__s64)(seq1 - seq2) < 0;
+}
+
+#define after64(seq2, seq1)	before64(seq1, seq2)
+
 #endif /* __MPTCP_PROTOCOL_H */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 260352af20c9..1cde27227fd9 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -75,6 +75,7 @@ static void subflow_v4_init_req(struct request_sock *req,
 		    listener->request_cksum)
 			subflow_req->checksum = 1;
 		subflow_req->remote_key = rx_opt.mptcp.sndr_key;
+		subflow_req->ssn_offset = TCP_SKB_CB(skb)->seq;
 	} else {
 		subflow_req->mp_capable = 0;
 	}
@@ -91,6 +92,11 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 			 subflow->remote_key);
 		mptcp_finish_connect(subflow->conn, subflow->mp_capable);
 		subflow->conn_finished = 1;
+
+		if (skb) {
+			pr_debug("synack seq=%u", TCP_SKB_CB(skb)->seq);
+			subflow->ssn_offset = TCP_SKB_CB(skb)->seq;
+		}
 	}
 }
 
@@ -155,6 +161,20 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 
 static struct inet_connection_sock_af_ops subflow_specific;
 
+static void subflow_data_ready(struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	struct sock *parent = subflow->conn;
+
+	pr_debug("sk=%p", sk);
+	subflow->tcp_sk_data_ready(sk);
+
+	if (parent) {
+		pr_debug("parent=%p", parent);
+		parent->sk_data_ready(parent);
+	}
+}
+
 int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
 {
 	struct mptcp_subflow_context *subflow;
@@ -194,10 +214,9 @@ static struct mptcp_subflow_context *subflow_create_ctx(struct sock *sk,
 	struct mptcp_subflow_context *ctx;
 
 	ctx = kzalloc(sizeof(*ctx), priority);
-	icsk->icsk_ulp_data = ctx;
-
 	if (!ctx)
 		return NULL;
+	rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
 
 	pr_debug("subflow=%p", ctx);
 
@@ -224,7 +243,9 @@ static int subflow_ulp_init(struct sock *sk)
 
 	tp->is_mptcp = 1;
 	icsk->icsk_af_ops = &subflow_specific;
+	ctx->tcp_sk_data_ready = sk->sk_data_ready;
 	ctx->use_checksum = 0;
+	sk->sk_data_ready = subflow_data_ready;
 out:
 	return err;
 }
@@ -233,10 +254,13 @@ static void subflow_ulp_release(struct sock *sk)
 {
 	struct mptcp_subflow_context *ctx = mptcp_subflow_ctx(sk);
 
+	if (!ctx)
+		return;
+
 	if (ctx->conn)
 		sock_put(ctx->conn);
 
-	kfree(ctx);
+	kfree_rcu(ctx, rcu);
 }
 
 static void subflow_ulp_clone(const struct request_sock *req,
@@ -244,6 +268,7 @@ static void subflow_ulp_clone(const struct request_sock *req,
 			      const gfp_t priority)
 {
 	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
+	struct mptcp_subflow_context *old_ctx = mptcp_subflow_ctx(newsk);
 	struct mptcp_subflow_context *new_ctx;
 
 	/* newsk->sk_socket is NULL at this point */
@@ -253,7 +278,8 @@ static void subflow_ulp_clone(const struct request_sock *req,
 
 	new_ctx->conn = NULL;
 	new_ctx->conn_finished = 1;
-	new_ctx->use_checksum = 0;
+	new_ctx->tcp_sk_data_ready = old_ctx->tcp_sk_data_ready;
+	new_ctx->use_checksum = old_ctx->use_checksum;
 
 	if (subflow_req->mp_capable) {
 		new_ctx->mp_capable = 1;
@@ -261,6 +287,8 @@ static void subflow_ulp_clone(const struct request_sock *req,
 		new_ctx->remote_key = subflow_req->remote_key;
 		new_ctx->local_key = subflow_req->local_key;
 		new_ctx->token = subflow_req->token;
+		new_ctx->ssn_offset = subflow_req->ssn_offset;
+		new_ctx->idsn = subflow_req->idsn;
 	}
 }
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 22/45] mptcp: use sk_page_frag() in sendmsg
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (20 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 21/45] mptcp: Implement MPTCP receive path Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 23/45] mptcp: sendmsg() do spool all the provided data Mat Martineau
                   ` (24 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

This clean-up a bit the send path, and allows better performances.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index fdcfffce0ec9..da983ea4fb5e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -65,10 +65,11 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct socket *ssock;
 	struct sock *ssk;
 	struct mptcp_ext *mpext = NULL;
-	struct page *page = NULL;
+	struct page_frag *pfrag;
 	struct sk_buff *skb;
 	size_t psize;
 	int poffset;
+	long timeo;
 
 	lock_sock(sk);
 	ssock = __mptcp_fallback_get_ref(msk);
@@ -97,32 +98,32 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		goto put_out;
 	}
 
-	/* Initial experiment: new page per send.  Real code will
-	 * maintain list of active pages and DSS mappings, append to the
-	 * end and honor zerocopy
+	lock_sock(ssk);
+	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+	/* use the mptcp page cache so that we can easily move the data
+	 * from one substream to another, but do per subflow memory accounting
 	 */
-	page = alloc_page(GFP_KERNEL);
-	if (!page) {
-		ret = -ENOMEM;
-		goto put_out;
+	pfrag = sk_page_frag(sk);
+	while (!sk_page_frag_refill(ssk, pfrag)) {
+		ret = sk_stream_wait_memory(ssk, &timeo);
+		if (ret)
+			goto put_out;
 	}
 
 	/* Copy to page */
-	poffset = 0;
+	poffset = pfrag->offset;
 	pr_debug("left=%zu", msg_data_left(msg));
-	psize = copy_page_from_iter(page, poffset,
+	psize = copy_page_from_iter(pfrag->page, poffset,
 				    min_t(size_t, msg_data_left(msg),
-					  PAGE_SIZE),
+					  pfrag->size - poffset),
 				    &msg->msg_iter);
 	pr_debug("left=%zu", msg_data_left(msg));
-
 	if (!psize) {
 		ret = -EINVAL;
 		goto put_out;
 	}
 
-	lock_sock(ssk);
-
 	/* Mark the end of the previous write so the beginning of the
 	 * next write (with its own mptcp skb extension data) is not
 	 * collapsed.
@@ -132,8 +133,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		TCP_SKB_CB(skb)->eor = 1;
 
 	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
-
-	ret = do_tcp_sendpages(ssk, page, poffset, min_t(int, size_goal, psize),
+	psize = min_t(int, size_goal, psize);
+	ret = do_tcp_sendpages(ssk, pfrag->page, poffset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
 		goto put_out;
@@ -159,14 +160,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			 mpext->checksum, mpext->dsn64);
 	} /* TODO: else fallback */
 
+	pfrag->offset += ret;
 	msk->write_seq += ret;
 	mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
 
 	tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle, size_goal);
 
 put_out:
-	if (page)
-		put_page(page);
 	release_sock(sk);
 	sock_put(ssk);
 	return ret;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 23/45] mptcp: sendmsg() do spool all the provided data
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (21 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 22/45] mptcp: use sk_page_frag() in sendmsg Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 24/45] mptcp: allow collapsing consecutive sendpages on the same substream Mat Martineau
                   ` (23 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

This makes mptcp sendmsg() behaviour more consistent and
improves xmit performances.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 126 ++++++++++++++++++++++++-------------------
 1 file changed, 71 insertions(+), 55 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index da983ea4fb5e..758369256f9b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -58,71 +58,37 @@ static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
 	return NULL;
 }
 
-static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
+			      struct msghdr *msg, long *timeo)
 {
 	int mss_now = 0, size_goal = 0, ret = 0;
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct socket *ssock;
-	struct sock *ssk;
 	struct mptcp_ext *mpext = NULL;
 	struct page_frag *pfrag;
 	struct sk_buff *skb;
 	size_t psize;
-	int poffset;
-	long timeo;
-
-	lock_sock(sk);
-	ssock = __mptcp_fallback_get_ref(msk);
-	if (ssock) {
-		release_sock(sk);
-		pr_debug("fallback passthrough");
-		ret = sock_sendmsg(ssock, msg);
-		sock_put(ssock->sk);
-		return ret;
-	}
-
-	ssk = mptcp_subflow_get_ref(msk);
-	if (!ssk) {
-		release_sock(sk);
-		return -ENOTCONN;
-	}
-
-	if (!msg_data_left(msg)) {
-		pr_debug("empty send");
-		ret = sock_sendmsg(ssk->sk_socket, msg);
-		goto put_out;
-	}
-
-	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) {
-		ret = -ENOTSUPP;
-		goto put_out;
-	}
-
-	lock_sock(ssk);
-	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 
 	/* use the mptcp page cache so that we can easily move the data
 	 * from one substream to another, but do per subflow memory accounting
 	 */
 	pfrag = sk_page_frag(sk);
 	while (!sk_page_frag_refill(ssk, pfrag)) {
-		ret = sk_stream_wait_memory(ssk, &timeo);
+		ret = sk_stream_wait_memory(ssk, timeo);
 		if (ret)
-			goto put_out;
+			return ret;
 	}
 
-	/* Copy to page */
-	poffset = pfrag->offset;
+	/* compute copy limit */
+	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
+	psize = min_t(int, pfrag->size - pfrag->offset, size_goal);
+
 	pr_debug("left=%zu", msg_data_left(msg));
-	psize = copy_page_from_iter(pfrag->page, poffset,
-				    min_t(size_t, msg_data_left(msg),
-					  pfrag->size - poffset),
+	psize = copy_page_from_iter(pfrag->page, pfrag->offset,
+				    min_t(size_t, msg_data_left(msg), psize),
 				    &msg->msg_iter);
 	pr_debug("left=%zu", msg_data_left(msg));
-	if (!psize) {
-		ret = -EINVAL;
-		goto put_out;
-	}
+	if (!psize)
+		return -EINVAL;
 
 	/* Mark the end of the previous write so the beginning of the
 	 * next write (with its own mptcp skb extension data) is not
@@ -132,20 +98,15 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (skb)
 		TCP_SKB_CB(skb)->eor = 1;
 
-	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
-	psize = min_t(int, size_goal, psize);
-	ret = do_tcp_sendpages(ssk, pfrag->page, poffset, psize,
+	ret = do_tcp_sendpages(ssk, pfrag->page, pfrag->offset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
-		goto put_out;
-
-	if (skb == tcp_write_queue_tail(ssk))
-		pr_err("no new skb %p/%p", sk, ssk);
+		return ret;
+	if (unlikely(ret < psize))
+		iov_iter_revert(&msg->msg_iter, psize - ret);
 
 	skb = tcp_write_queue_tail(ssk);
-
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
-
 	if (mpext) {
 		memset(mpext, 0, sizeof(*mpext));
 		mpext->data_seq = msk->write_seq;
@@ -165,6 +126,61 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
 
 	tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle, size_goal);
+	return ret;
+}
+
+static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct socket *ssock;
+	size_t copied = 0;
+	struct sock *ssk;
+	int ret = 0;
+	long timeo;
+
+	lock_sock(sk);
+	ssock = __mptcp_fallback_get_ref(msk);
+	if (ssock) {
+		release_sock(sk);
+		pr_debug("fallback passthrough");
+		ret = sock_sendmsg(ssock, msg);
+		sock_put(ssock->sk);
+		return ret;
+	}
+
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk) {
+		release_sock(sk);
+		return -ENOTCONN;
+	}
+
+	if (!msg_data_left(msg)) {
+		pr_debug("empty send");
+		ret = sock_sendmsg(ssk->sk_socket, msg);
+		goto put_out;
+	}
+
+	pr_debug("conn_list->subflow=%p", ssk);
+
+	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) {
+		ret = -ENOTSUPP;
+		goto put_out;
+	}
+
+	lock_sock(ssk);
+	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+	while (msg_data_left(msg)) {
+		ret = mptcp_sendmsg_frag(sk, ssk, msg, &timeo);
+		if (ret < 0)
+			break;
+
+		copied += ret;
+	}
+
+	if (copied > 0)
+		ret = copied;
+
+	release_sock(ssk);
 
 put_out:
 	release_sock(sk);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 24/45] mptcp: allow collapsing consecutive sendpages on the same substream
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (22 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 23/45] mptcp: sendmsg() do spool all the provided data Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 25/45] tcp: Check for filled TCP option space before SACK Mat Martineau
                   ` (22 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

If the current sendmsg() lands on the same subflow we used last, we
can try to collapse the data.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 74 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 758369256f9b..6b31c0e460d9 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -58,11 +58,24 @@ static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
 	return NULL;
 }
 
+static inline bool mptcp_skb_can_collapse_to(const struct mptcp_sock *msk,
+					     const struct sk_buff *skb,
+					     const struct mptcp_ext *mpext)
+{
+	if (!tcp_skb_can_collapse_to(skb))
+		return false;
+
+	/* can collapse only if MPTCP level sequence is in order */
+	return mpext && mpext->data_seq + mpext->data_len == msk->write_seq;
+}
+
 static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
-			      struct msghdr *msg, long *timeo)
+			      struct msghdr *msg, long *timeo, int *pmss_now,
+			      int *ps_goal)
 {
-	int mss_now = 0, size_goal = 0, ret = 0;
+	int mss_now, avail_size, size_goal, ret;
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	bool collapsed, can_collapse = false;
 	struct mptcp_ext *mpext = NULL;
 	struct page_frag *pfrag;
 	struct sk_buff *skb;
@@ -80,8 +93,29 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 
 	/* compute copy limit */
 	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
-	psize = min_t(int, pfrag->size - pfrag->offset, size_goal);
+	*pmss_now = mss_now;
+	*ps_goal = size_goal;
+	avail_size = size_goal;
+	skb = tcp_write_queue_tail(ssk);
+	if (skb) {
+		mpext = skb_ext_find(skb, SKB_EXT_MPTCP);
+
+		/* Limit the write to the size available in the
+		 * current skb, if any, so that we create at most a new skb.
+		 * Explicitly tells TCP internals to avoid collapsing on later
+		 * queue management operation, to avoid breaking the ext <->
+		 * SSN association set here
+		 */
+		can_collapse = (size_goal - skb->len > 0) &&
+			      mptcp_skb_can_collapse_to(msk, skb, mpext);
+		if (!can_collapse)
+			TCP_SKB_CB(skb)->eor = 1;
+		else
+			avail_size = size_goal - skb->len;
+	}
+	psize = min_t(size_t, pfrag->size - pfrag->offset, avail_size);
 
+	/* Copy to page */
 	pr_debug("left=%zu", msg_data_left(msg));
 	psize = copy_page_from_iter(pfrag->page, pfrag->offset,
 				    min_t(size_t, msg_data_left(msg), psize),
@@ -90,14 +124,9 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	if (!psize)
 		return -EINVAL;
 
-	/* Mark the end of the previous write so the beginning of the
-	 * next write (with its own mptcp skb extension data) is not
-	 * collapsed.
+	/* tell the TCP stack to delay the push so that we can safely
+	 * access the skb after the sendpages call
 	 */
-	skb = tcp_write_queue_tail(ssk);
-	if (skb)
-		TCP_SKB_CB(skb)->eor = 1;
-
 	ret = do_tcp_sendpages(ssk, pfrag->page, pfrag->offset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
@@ -105,6 +134,14 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	if (unlikely(ret < psize))
 		iov_iter_revert(&msg->msg_iter, psize - ret);
 
+	collapsed = skb == tcp_write_queue_tail(ssk);
+	if (collapsed) {
+		WARN_ON_ONCE(!can_collapse);
+		/* when collapsing mpext always exists */
+		mpext->data_len += ret;
+		goto out;
+	}
+
 	skb = tcp_write_queue_tail(ssk);
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
 	if (mpext) {
@@ -119,23 +156,26 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 		pr_debug("data_seq=%llu subflow_seq=%u data_len=%u checksum=%u, dsn64=%d",
 			 mpext->data_seq, mpext->subflow_seq, mpext->data_len,
 			 mpext->checksum, mpext->dsn64);
-	} /* TODO: else fallback */
+	}
+	/* TODO: else fallback; allocation can fail, but we can't easily retire
+	 * skbs from the write_queue, as we need to roll-back TCP status
+	 */
 
+out:
 	pfrag->offset += ret;
 	msk->write_seq += ret;
 	mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
 
-	tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle, size_goal);
 	return ret;
 }
 
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
+	int mss_now = 0, size_goal = 0, ret = 0;
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct socket *ssock;
 	size_t copied = 0;
 	struct sock *ssk;
-	int ret = 0;
 	long timeo;
 
 	lock_sock(sk);
@@ -170,15 +210,19 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	lock_sock(ssk);
 	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 	while (msg_data_left(msg)) {
-		ret = mptcp_sendmsg_frag(sk, ssk, msg, &timeo);
+		ret = mptcp_sendmsg_frag(sk, ssk, msg, &timeo, &mss_now,
+					 &size_goal);
 		if (ret < 0)
 			break;
 
 		copied += ret;
 	}
 
-	if (copied > 0)
+	if (copied) {
 		ret = copied;
+		tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle,
+			 size_goal);
+	}
 
 	release_sock(ssk);
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 25/45] tcp: Check for filled TCP option space before SACK
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (23 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 24/45] mptcp: allow collapsing consecutive sendpages on the same substream Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 26/45] mptcp: Add path manager interface Mat Martineau
                   ` (21 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

The SACK code would potentially add four bytes to the expected
TCP option size even if all option space was already used.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/ipv4/tcp_output.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3de804531231..dd9be5169aa3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -805,6 +805,9 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 		}
 	}
 
+	if (size + TCPOLEN_SACK_BASE_ALIGNED >= MAX_TCP_OPTION_SPACE)
+		return size;
+
 	eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
 	if (unlikely(eff_sacks)) {
 		const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 26/45] mptcp: Add path manager interface
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (24 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 25/45] tcp: Check for filled TCP option space before SACK Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 27/45] mptcp: Add ADD_ADDR handling Mat Martineau
                   ` (20 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Add enough of a path manager interface to allow sending of ADD_ADDR
when an incoming MPTCP connection is created. Capable of sending only
a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection
sock will need to be expanded to handle multiple interfaces and IPv6.

This is a skeleton interface definition for events generated by
MPTCP.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/linux/tcp.h  |   9 ++++
 net/mptcp/Makefile   |   2 +-
 net/mptcp/options.c  |  16 ++++++
 net/mptcp/pm.c       | 124 +++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.c |   6 ++-
 net/mptcp/protocol.h |  48 +++++++++++++++++
 6 files changed, 203 insertions(+), 2 deletions(-)
 create mode 100644 net/mptcp/pm.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 0f0d6c188f52..31e546fe9643 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -117,6 +117,15 @@ struct tcp_options_received {
 			use_ack:1,
 			ack64:1,
 			__unused:2;
+		u8	add_addr : 1,
+			family : 4;
+		u8	addr_id;
+		union {
+			struct	in_addr	addr;
+#if IS_ENABLED(CONFIG_IPV6)
+			struct	in6_addr addr6;
+#endif
+		};
 	} mptcp;
 #endif
 };
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index 178ae81d8b66..7fe7aa64eda0 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o subflow.o options.o token.o crypto.o
+mptcp-y := protocol.o subflow.o options.o token.o crypto.o pm.o
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index caea320cd2a6..2981c0daa12c 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -396,11 +396,23 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
 			    struct tcp_options_received *opt_rx)
 {
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
 	struct mptcp_options_received *mp_opt;
 	struct mptcp_ext *mpext;
 
 	mp_opt = &opt_rx->mptcp;
 
+	if (msk && mp_opt->add_addr) {
+		if (mp_opt->family == MPTCP_ADDR_IPVERSION_4)
+			pm_add_addr(msk, &mp_opt->addr, mp_opt->addr_id);
+#if IS_ENABLED(CONFIG_IPV6)
+		else if (mp_opt->family == MPTCP_ADDR_IPVERSION_6)
+			pm_add_addr6(msk, &mp_opt->addr6, mp_opt->addr_id);
+#endif
+		mp_opt->add_addr = 0;
+	}
+
 	if (!mp_opt->dss)
 		return;
 
@@ -427,6 +439,10 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
 	}
 
 	mpext->data_fin = mp_opt->data_fin;
+
+	if (msk)
+		pm_fully_established(msk);
+
 }
 
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
new file mode 100644
index 000000000000..933dd805c9b2
--- /dev/null
+++ b/net/mptcp/pm.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2019, Intel Corporation.
+ */
+#include <linux/kernel.h>
+#include <net/tcp.h>
+#include <net/mptcp.h>
+#include "protocol.h"
+
+/* path manager command handlers */
+
+int pm_announce_addr(u32 token, sa_family_t family, u8 local_id,
+		     struct in_addr *addr)
+{
+	return -ENOTSUPP;
+}
+
+int pm_remove_addr(u32 token, u8 local_id)
+{
+	return -ENOTSUPP;
+}
+
+int pm_create_subflow(u32 token, u8 remote_id)
+{
+	return -ENOTSUPP;
+}
+
+int pm_remove_subflow(u32 token, u8 remote_id)
+{
+	return -ENOTSUPP;
+}
+
+/* path manager event handlers */
+
+void pm_new_connection(struct mptcp_sock *msk, int server_side)
+{
+	pr_debug("msk=%p", msk);
+
+	msk->pm.server_side = server_side;
+}
+
+void pm_fully_established(struct mptcp_sock *msk)
+{
+	pr_debug("msk=%p", msk);
+
+	msk->pm.fully_established = 1;
+}
+
+void pm_connection_closed(struct mptcp_sock *msk)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_subflow_established(struct mptcp_sock *msk, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_subflow_closed(struct mptcp_sock *msk, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_add_addr(struct mptcp_sock *msk, const struct in_addr *addr, u8 id)
+{
+	pr_debug("msk=%p, addr=%x, remote_id=%d", msk, addr->s_addr, id);
+
+	msk->pm.remote_addr.s_addr = addr->s_addr;
+	msk->pm.remote_id = id;
+	msk->pm.remote_family = AF_INET;
+	msk->pm.remote_valid = 1;
+}
+
+void pm_add_addr6(struct mptcp_sock *msk, const struct in6_addr *addr, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+void pm_rm_addr(struct mptcp_sock *msk, u8 id)
+{
+	pr_debug("msk=%p", msk);
+}
+
+/* path manager helpers */
+
+int pm_addr_signal(struct mptcp_sock *msk, u8 *id,
+		   struct sockaddr_storage *saddr)
+{
+	struct sockaddr_in *addr = (struct sockaddr_in *)saddr;
+
+	if (!msk->pm.local_valid)
+		return -1;
+
+	if (msk->pm.local_family != AF_INET)
+		return -1;
+
+	*id = msk->pm.local_id;
+	addr->sin_family = msk->pm.local_family;
+	addr->sin_addr.s_addr = msk->pm.local_addr.s_addr;
+
+	return 0;
+}
+
+int pm_get_local_id(struct request_sock *req, struct sock *sk,
+		    const struct sk_buff *skb)
+{
+	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (!msk->pm.local_valid)
+		return -1;
+
+	/* @@ check if address actually matches... */
+
+	pr_debug("msk=%p, addr_id=%d", msk, msk->pm.local_id);
+	subflow_req->local_id = msk->pm.local_id;
+
+	return 0;
+}
+
+void pm_init(void)
+{
+}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6b31c0e460d9..932362e6047e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -655,6 +655,8 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 		mptcp_token_update_accept(new_sock->sk, new_mptcp_sock);
 		msk->subflow = NULL;
 
+		pm_new_connection(msk, 1);
+
 		mptcp_crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
 		ack_seq++;
@@ -784,9 +786,10 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 		msk->remote_key = subflow->remote_key;
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
-
 		pr_debug("msk=%p, token=%u", msk, msk->token);
 
+		pm_new_connection(msk, 0);
+
 		mptcp_crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
 		ack_seq++;
@@ -1013,6 +1016,7 @@ void __init mptcp_init(void)
 	mptcp_stream_ops.shutdown = mptcp_shutdown;
 
 	mptcp_subflow_init();
+	pm_init();
 
 	if (proto_register(&mptcp_prot, 1) != 0)
 		panic("Failed to register MPTCP proto.\n");
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 7dae12cfcf14..599c380145e3 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -52,6 +52,38 @@
 #define MPTCP_DSS_HAS_ACK	BIT(0)
 #define MPTCP_DSS_FLAG_MASK	(0x1F)
 
+/* MPTCP ADD_ADDR flags */
+#define MPTCP_ADDR_IPVERSION_4	4
+#define MPTCP_ADDR_IPVERSION_6	6
+
+struct mptcp_pm_data {
+	u8	local_valid;
+	u8	local_id;
+	sa_family_t local_family;
+	union {
+		struct in_addr local_addr;
+#if IS_ENABLED(CONFIG_IPV6)
+		struct in6_addr local_addr6;
+#endif
+	};
+	u8	remote_valid;
+	u8	remote_id;
+	sa_family_t remote_family;
+	union {
+		struct in_addr remote_addr;
+#if IS_ENABLED(CONFIG_IPV6)
+		struct in6_addr remote_addr6;
+#endif
+	};
+	u8	server_side : 1,
+		fully_established : 1;
+
+	/* for interim path manager */
+	struct	work_struct addr_work;
+	struct	work_struct subflow_work;
+	u32	token;
+};
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
@@ -63,6 +95,7 @@ struct mptcp_sock {
 	u32		token;
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
+	struct mptcp_pm_data	pm;
 };
 
 #define mptcp_for_each_subflow(__msk, __subflow)			\
@@ -80,6 +113,7 @@ struct mptcp_subflow_request_sock {
 		checksum : 1,
 		backup : 1,
 		version : 4;
+	u8	local_id;
 	u64	local_key;
 	u64	remote_key;
 	u64	idsn;
@@ -168,6 +202,20 @@ static inline void mptcp_crypto_key_gen_sha1(u64 *key, u32 *token, u64 *idsn)
 void mptcp_crypto_hmac_sha1(u64 key1, u64 key2, u32 nonce1, u32 nonce2,
 			    u32 *hash_out);
 
+void pm_init(void);
+void pm_new_connection(struct mptcp_sock *msk, int server_side);
+void pm_fully_established(struct mptcp_sock *msk);
+void pm_connection_closed(struct mptcp_sock *msk);
+void pm_subflow_established(struct mptcp_sock *msk, u8 id);
+void pm_subflow_closed(struct mptcp_sock *msk, u8 id);
+void pm_add_addr(struct mptcp_sock *msk, const struct in_addr *addr, u8 id);
+void pm_add_addr6(struct mptcp_sock *msk, const struct in6_addr *addr, u8 id);
+void pm_rm_addr(struct mptcp_sock *msk, u8 id);
+int pm_addr_signal(struct mptcp_sock *msk, u8 *id,
+		   struct sockaddr_storage *saddr);
+int pm_get_local_id(struct request_sock *req, struct sock *sk,
+		    const struct sk_buff *skb);
+
 static inline struct mptcp_ext *mptcp_get_ext(struct sk_buff *skb)
 {
 	return (struct mptcp_ext *)skb_ext_find(skb, SKB_EXT_MPTCP);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 27/45] mptcp: Add ADD_ADDR handling
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (25 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 26/45] mptcp: Add path manager interface Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 28/45] mptcp: Add handling of incoming MP_JOIN requests Mat Martineau
                   ` (19 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
and RM_ADDR suboptions.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/linux/tcp.h  |   2 +
 include/net/mptcp.h  |   7 +++
 net/mptcp/options.c  | 115 +++++++++++++++++++++++++++++++++++++++----
 net/mptcp/protocol.h |  16 +++++-
 4 files changed, 130 insertions(+), 10 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 31e546fe9643..b3289e69cac1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -118,6 +118,7 @@ struct tcp_options_received {
 			ack64:1,
 			__unused:2;
 		u8	add_addr : 1,
+			rm_addr : 1,
 			family : 4;
 		u8	addr_id;
 		union {
@@ -139,6 +140,7 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 #endif
 #if IS_ENABLED(CONFIG_MPTCP)
 	rx_opt->mptcp.mp_capable = rx_opt->mptcp.mp_join = 0;
+	rx_opt->mptcp.add_addr = rx_opt->mptcp.rm_addr = 0;
 	rx_opt->mptcp.dss = 0;
 #endif
 }
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 21438bcacb14..f4a9962ea85f 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -29,6 +29,13 @@ struct mptcp_out_options {
 	u16 suboptions;
 	u64 sndr_key;
 	u64 rcvr_key;
+	union {
+		struct in_addr addr;
+#if IS_ENABLED(CONFIG_IPV6)
+		struct in6_addr addr6;
+#endif
+	};
+	u8 addr_id;
 	struct mptcp_ext ext_copy;
 #endif
 };
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 2981c0daa12c..6fd2c8b5001c 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -173,12 +173,51 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	 * 4 or 16 bytes of address (depending on ip version)
 	 * 0 or 2 bytes of port (depending on length)
 	 */
+	case MPTCPOPT_ADD_ADDR:
+		if (opsize != TCPOLEN_MPTCP_ADD_ADDR &&
+		    opsize != TCPOLEN_MPTCP_ADD_ADDR6)
+			break;
+		mp_opt->family = *ptr++ & MPTCP_ADDR_FAMILY_MASK;
+		if (mp_opt->family != MPTCP_ADDR_IPVERSION_4 &&
+		    mp_opt->family != MPTCP_ADDR_IPVERSION_6)
+			break;
+
+		if (mp_opt->family == MPTCP_ADDR_IPVERSION_4 &&
+		    opsize != TCPOLEN_MPTCP_ADD_ADDR)
+			break;
+#if IS_ENABLED(CONFIG_IPV6)
+		if (mp_opt->family == MPTCP_ADDR_IPVERSION_6 &&
+		    opsize != TCPOLEN_MPTCP_ADD_ADDR6)
+			break;
+#endif
+		mp_opt->addr_id = *ptr++;
+		if (mp_opt->family == MPTCP_ADDR_IPVERSION_4) {
+			mp_opt->add_addr = 1;
+			memcpy((u8 *)&mp_opt->addr.s_addr, (u8 *)ptr, 4);
+			pr_debug("ADD_ADDR: addr=%x, id=%d",
+				 mp_opt->addr.s_addr, mp_opt->addr_id);
+#if IS_ENABLED(CONFIG_IPV6)
+		} else {
+			mp_opt->add_addr = 1;
+			memcpy(mp_opt->addr6.s6_addr, (u8 *)ptr, 16);
+			pr_debug("ADD_ADDR: addr6=, id=%d", mp_opt->addr_id);
+#endif
+		}
+		break;
 
 	/* MPTCPOPT_RM_ADDR
 	 * 0: 4MSB=subtype, 0000
 	 * 1: Address ID
 	 * Additional bytes: More address IDs (depending on length)
 	 */
+	case MPTCPOPT_RM_ADDR:
+		if (opsize != TCPOLEN_MPTCP_RM_ADDR)
+			break;
+
+		mp_opt->rm_addr = 1;
+		mp_opt->addr_id = *ptr++;
+		pr_debug("RM_ADDR: id=%d", mp_opt->addr_id);
+		break;
 
 	/* MPTCPOPT_MP_PRIO
 	 * 0: 4MSB=subtype, 000, 1LSB=Backup
@@ -352,27 +391,65 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
 	return true;
 }
 
+static bool mptcp_established_options_addr(struct sock *sk,
+					   unsigned int *size,
+					   unsigned int remaining,
+					   struct mptcp_out_options *opts)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
+	struct sockaddr_storage saddr;
+	u8 id;
+
+	if (!msk)
+		return false;
+
+	if (!msk->pm.fully_established || !msk->addr_signal)
+		return false;
+
+	if (pm_addr_signal(msk, &id, &saddr))
+		return false;
+
+	if (saddr.ss_family == AF_INET && remaining < TCPOLEN_MPTCP_ADD_ADDR)
+		return false;
+
+	opts->suboptions |= OPTION_MPTCP_ADD_ADDR;
+	opts->addr_id = id;
+	opts->addr.s_addr = ((struct sockaddr_in *)&saddr)->sin_addr.s_addr;
+	*size = TCPOLEN_MPTCP_ADD_ADDR;
+
+	msk->addr_signal = 0;
+
+	return true;
+}
+
 bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 			       unsigned int *size, unsigned int remaining,
 			       struct mptcp_out_options *opts)
 {
 	unsigned int opt_size = 0;
+	bool ret = false;
 
 	if (!mptcp_subflow_ctx(sk)->mp_capable)
 		return false;
 
+	opts->suboptions = 0;
 	if (mptcp_established_options_mp(sk, &opt_size, remaining, opts)) {
 		*size += opt_size;
 		remaining -= opt_size;
-		return true;
+		ret = true;
 	} else if (mptcp_established_options_dss(sk, skb, &opt_size, remaining,
-					       opts)) {
+						 opts)) {
 		*size += opt_size;
 		remaining -= opt_size;
-		return true;
+		ret = true;
 	}
-
-	return false;
+	if (mptcp_established_options_addr(sk, &opt_size, remaining, opts)) {
+		*size += opt_size;
+		remaining -= opt_size;
+		ret = true;
+	}
+	return ret;
 }
 
 bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
@@ -459,10 +536,8 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 		else
 			len = TCPOLEN_MPTCP_MPC_ACK;
 
-		*ptr++ = htonl((TCPOPT_MPTCP << 24) | (len << 16) |
-			       (MPTCPOPT_MP_CAPABLE << 12) |
-			       ((MPTCP_VERSION_MASK & 0) << 8) |
-			       MPTCP_CAP_HMAC_SHA1);
+		*ptr++ = mptcp_option(MPTCPOPT_MP_CAPABLE, len, 0,
+				      MPTCP_CAP_HMAC_SHA1);
 		put_unaligned_be64(opts->sndr_key, ptr);
 		ptr += 2;
 		if ((OPTION_MPTCP_MPC_SYNACK |
@@ -472,6 +547,28 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 		}
 	}
 
+	if (OPTION_MPTCP_ADD_ADDR & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_ADD_ADDR, TCPOLEN_MPTCP_ADD_ADDR,
+				      MPTCP_ADDR_IPVERSION_4, opts->addr_id);
+		memcpy((u8 *)ptr, (u8 *)&opts->addr.s_addr, 4);
+		ptr += 1;
+	}
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (OPTION_MPTCP_ADD_ADDR6 & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_ADD_ADDR,
+				      TCPOLEN_MPTCP_ADD_ADDR6,
+				      MPTCP_ADDR_IPVERSION_6, opts->addr_id);
+		memcpy((u8 *)ptr, opts->addr6.s6_addr, 16);
+		ptr += 4;
+	}
+#endif
+
+	if (OPTION_MPTCP_RM_ADDR & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_RM_ADDR, TCPOLEN_MPTCP_RM_ADDR,
+				      0, opts->addr_id);
+	}
+
 	if (opts->ext_copy.use_ack || opts->ext_copy.use_map) {
 		struct mptcp_ext *mpext = &opts->ext_copy;
 		u8 len = TCPOLEN_MPTCP_DSS_BASE;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 599c380145e3..fd21ed6c03b7 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -8,13 +8,16 @@
 #define __MPTCP_PROTOCOL_H
 
 #include <linux/random.h>
-#include <linux/tcp.h>
+#include <net/tcp.h>
 #include <net/inet_connection_sock.h>
 
 /* MPTCP option bits */
 #define OPTION_MPTCP_MPC_SYN	BIT(0)
 #define OPTION_MPTCP_MPC_SYNACK	BIT(1)
 #define OPTION_MPTCP_MPC_ACK	BIT(2)
+#define OPTION_MPTCP_ADD_ADDR	BIT(6)
+#define OPTION_MPTCP_ADD_ADDR6	BIT(7)
+#define OPTION_MPTCP_RM_ADDR	BIT(8)
 
 /* MPTCP option subtypes */
 #define MPTCPOPT_MP_CAPABLE	0
@@ -36,6 +39,9 @@
 #define TCPOLEN_MPTCP_DSS_MAP32		10
 #define TCPOLEN_MPTCP_DSS_MAP64		14
 #define TCPOLEN_MPTCP_DSS_CHECKSUM	2
+#define TCPOLEN_MPTCP_ADD_ADDR		8
+#define TCPOLEN_MPTCP_ADD_ADDR6		20
+#define TCPOLEN_MPTCP_RM_ADDR		4
 
 /* MPTCP MP_CAPABLE flags */
 #define MPTCP_VERSION_MASK	(0x0F)
@@ -53,9 +59,16 @@
 #define MPTCP_DSS_FLAG_MASK	(0x1F)
 
 /* MPTCP ADD_ADDR flags */
+#define MPTCP_ADDR_FAMILY_MASK	(0x0F)
 #define MPTCP_ADDR_IPVERSION_4	4
 #define MPTCP_ADDR_IPVERSION_6	6
 
+static inline __be32 mptcp_option(u8 subopt, u8 len, u8 nib, u8 field)
+{
+	return htonl((TCPOPT_MPTCP << 24) | (len << 16) | (subopt << 12) |
+		     ((nib & 0xF) << 8) | field);
+}
+
 struct mptcp_pm_data {
 	u8	local_valid;
 	u8	local_id;
@@ -96,6 +109,7 @@ struct mptcp_sock {
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 	struct mptcp_pm_data	pm;
+	u8		addr_signal;
 };
 
 #define mptcp_for_each_subflow(__msk, __subflow)			\
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 28/45] mptcp: Add handling of incoming MP_JOIN requests
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (26 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 27/45] mptcp: Add ADD_ADDR handling Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 29/45] mptcp: harmonize locking on all socket operations Mat Martineau
                   ` (18 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Process the MP_JOIN option in a SYN packet with the same flow
as MP_CAPABLE but when the third ACK is received add the
subflow to the MPTCP socket subflow list instead of adding it to
the TCP socket accept queue.

The subflow is added at the end of the subflow list so it will not
interfere with the existing subflows operation and no data is
expected to be transmitted on it.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 include/linux/tcp.h      |   6 +++
 include/net/mptcp.h      |  11 ++++
 net/ipv4/tcp_minisocks.c |   6 +++
 net/mptcp/options.c      |  55 +++++++++++++++++++-
 net/mptcp/protocol.c     |  22 ++++++++
 net/mptcp/protocol.h     |  27 +++++++++-
 net/mptcp/subflow.c      | 105 ++++++++++++++++++++++++++++++++++++++-
 net/mptcp/token.c        |  22 ++++++++
 8 files changed, 249 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b3289e69cac1..3b69ab99e666 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -107,8 +107,14 @@ struct tcp_options_received {
 		u8      mp_capable : 1,
 			mp_join : 1,
 			dss : 1,
+			backup : 1,
 			version : 4;
 		u8      flags;
+		u8      join_id;
+		u32     token;
+		u32     nonce;
+		u64     thmac;
+		u8      hmac[20];
 		u8	dss_flags;
 		u8	use_map:1,
 			dsn64:1,
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index f4a9962ea85f..bb2dd193c0c5 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -36,6 +36,10 @@ struct mptcp_out_options {
 #endif
 	};
 	u8 addr_id;
+	u8 join_id;
+	u8 backup;
+	u32 nonce;
+	u64 thmac;
 	struct mptcp_ext ext_copy;
 #endif
 };
@@ -74,6 +78,8 @@ static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
 
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts);
 
+bool mptcp_sk_is_subflow(const struct sock *sk);
+
 #else
 
 static inline void mptcp_init(void)
@@ -132,5 +138,10 @@ static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
 	return false;
 }
 
+static inline bool mptcp_sk_is_subflow(const struct sock *sk)
+{
+	return false;
+}
+
 #endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb140a5db8c0..c519684d081b 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -767,6 +767,12 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	if (!child)
 		goto listen_overflow;
 
+	if (own_req && sk_is_mptcp(child) && mptcp_sk_is_subflow(child)) {
+		inet_csk_reqsk_queue_drop(sk, req);
+		reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
+		return child;
+	}
+
 	sock_rps_save_rxhash(child, skb);
 	tcp_synack_rtt_meas(child, req);
 	*req_stolen = !own_req;
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 6fd2c8b5001c..f5e0b1d0931b 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -85,6 +85,38 @@ void mptcp_parse_option(const unsigned char *ptr, int opsize,
 	 * 1: 0 (Reserved)
 	 * 2-21: Sender HMAC
 	 */
+	case MPTCPOPT_MP_JOIN:
+		mp_opt->mp_join = 1;
+		if (opsize == TCPOLEN_MPTCP_MPJ_SYN) {
+			mp_opt->backup = *ptr++ & MPTCPOPT_BACKUP;
+			mp_opt->join_id = *ptr++;
+			mp_opt->token = get_unaligned_be32(ptr);
+			ptr += 4;
+			mp_opt->nonce = get_unaligned_be32(ptr);
+			ptr += 4;
+			pr_debug("MP_JOIN bkup=%u, id=%u, token=%u, nonce=%u",
+				 mp_opt->backup, mp_opt->join_id,
+				 mp_opt->token, mp_opt->nonce);
+		} else if (opsize == TCPOLEN_MPTCP_MPJ_SYNACK) {
+			mp_opt->backup = *ptr++ & MPTCPOPT_BACKUP;
+			mp_opt->join_id = *ptr++;
+			mp_opt->thmac = get_unaligned_be64(ptr);
+			ptr += 8;
+			mp_opt->nonce = get_unaligned_be32(ptr);
+			ptr += 4;
+			pr_debug("MP_JOIN bkup=%u, id=%u, thmac=%llu, nonce=%u",
+				 mp_opt->backup, mp_opt->join_id,
+				 mp_opt->thmac, mp_opt->nonce);
+		} else if (opsize == TCPOLEN_MPTCP_MPJ_ACK) {
+			ptr += 2;
+			memcpy(mp_opt->hmac, ptr, MPTCPOPT_HMAC_LEN);
+			pr_debug("MP_JOIN hmac");
+		} else {
+			pr_warn("MP_JOIN bad option size");
+			mp_opt->mp_join = 0;
+		}
+		break;
+
 
 	/* MPTCPOPT_DSS
 	 * 0: 4MSB=subtype, 0000
@@ -462,10 +494,21 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 		opts->sndr_key = subflow_req->local_key;
 		opts->rcvr_key = subflow_req->remote_key;
 		*size = TCPOLEN_MPTCP_MPC_SYNACK;
-		pr_debug("subflow_req=%p, local_key=%llu, remote_key=%llu",
+		pr_debug("req=%p, local_key=%llu, remote_key=%llu",
 			 subflow_req, subflow_req->local_key,
 			 subflow_req->remote_key);
 		return true;
+	} else if (subflow_req->mp_join) {
+		opts->suboptions = OPTION_MPTCP_MPJ_SYNACK;
+		opts->backup = subflow_req->backup;
+		opts->join_id = subflow_req->local_id;
+		opts->thmac = subflow_req->thmac;
+		opts->nonce = subflow_req->local_nonce;
+		pr_debug("req=%p, bkup=%u, id=%u, thmac=%llu, nonce=%u",
+			 subflow_req, opts->backup, opts->join_id,
+			 opts->thmac, opts->nonce);
+		*size = TCPOLEN_MPTCP_MPJ_SYNACK;
+		return true;
 	}
 	return false;
 }
@@ -569,6 +612,16 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 				      0, opts->addr_id);
 	}
 
+	if (OPTION_MPTCP_MPJ_SYNACK & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_MP_JOIN,
+				      TCPOLEN_MPTCP_MPJ_SYNACK,
+				      opts->backup, opts->join_id);
+		put_unaligned_be64(opts->thmac, ptr);
+		ptr += 2;
+		put_unaligned_be32(opts->nonce, ptr);
+		ptr += 1;
+	}
+
 	if (opts->ext_copy.use_ack || opts->ext_copy.use_map) {
 		struct mptcp_ext *mpext = &opts->ext_copy;
 		u8 len = TCPOLEN_MPTCP_DSS_BASE;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 932362e6047e..32d9963c492d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -805,6 +805,28 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 	inet_sk_state_store(sk, TCP_ESTABLISHED);
 }
 
+void mptcp_finish_join(struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
+
+	pr_debug("msk=%p, subflow=%p", msk, subflow);
+
+	local_bh_disable();
+	bh_lock_sock_nested(subflow->conn);
+	list_add_tail(&subflow->node, &msk->conn_list);
+	bh_unlock_sock(subflow->conn);
+	local_bh_enable();
+	inet_sk_state_store(sk, TCP_ESTABLISHED);
+}
+
+bool mptcp_sk_is_subflow(const struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+
+	return subflow->mp_join == 1;
+}
+
 static struct proto mptcp_prot = {
 	.name		= "MPTCP",
 	.owner		= THIS_MODULE,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index fd21ed6c03b7..16fe1c766d98 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -15,6 +15,9 @@
 #define OPTION_MPTCP_MPC_SYN	BIT(0)
 #define OPTION_MPTCP_MPC_SYNACK	BIT(1)
 #define OPTION_MPTCP_MPC_ACK	BIT(2)
+#define OPTION_MPTCP_MPJ_SYN	BIT(3)
+#define OPTION_MPTCP_MPJ_SYNACK	BIT(4)
+#define OPTION_MPTCP_MPJ_ACK	BIT(5)
 #define OPTION_MPTCP_ADD_ADDR	BIT(6)
 #define OPTION_MPTCP_ADD_ADDR6	BIT(7)
 #define OPTION_MPTCP_RM_ADDR	BIT(8)
@@ -33,6 +36,9 @@
 #define TCPOLEN_MPTCP_MPC_SYN		12
 #define TCPOLEN_MPTCP_MPC_SYNACK	20
 #define TCPOLEN_MPTCP_MPC_ACK		20
+#define TCPOLEN_MPTCP_MPJ_SYN		12
+#define TCPOLEN_MPTCP_MPJ_SYNACK	16
+#define TCPOLEN_MPTCP_MPJ_ACK		24
 #define TCPOLEN_MPTCP_DSS_BASE		4
 #define TCPOLEN_MPTCP_DSS_ACK32		4
 #define TCPOLEN_MPTCP_DSS_ACK64		8
@@ -43,6 +49,9 @@
 #define TCPOLEN_MPTCP_ADD_ADDR6		20
 #define TCPOLEN_MPTCP_RM_ADDR		4
 
+#define MPTCPOPT_BACKUP		BIT(0)
+#define MPTCPOPT_HMAC_LEN	20
+
 /* MPTCP MP_CAPABLE flags */
 #define MPTCP_VERSION_MASK	(0x0F)
 #define MPTCP_CAP_CHECKSUM_REQD	BIT(7)
@@ -128,11 +137,15 @@ struct mptcp_subflow_request_sock {
 		backup : 1,
 		version : 4;
 	u8	local_id;
+	u8	remote_id;
 	u64	local_key;
 	u64	remote_key;
 	u64	idsn;
 	u32	token;
 	u32	ssn_offset;
+	u64	thmac;
+	u32	local_nonce;
+	u32	remote_nonce;
 };
 
 static inline struct mptcp_subflow_request_sock *
@@ -156,14 +169,22 @@ struct mptcp_subflow_context {
 	u16	request_mptcp : 1,  /* send MP_CAPABLE */
 		request_cksum : 1,
 		request_version : 4,
-		mp_capable : 1,	    /* remote is MPTCP capable */
+		mp_capable : 1,     /* remote is MPTCP capable */
+		mp_join : 1,        /* remote is JOINing */
 		fourth_ack : 1,     /* send initial DSS */
 		conn_finished : 1,
 		use_checksum : 1,
-		map_valid : 1;
+		map_valid : 1,
+		backup : 1;
+	u32	remote_nonce;
+	u64	thmac;
+	u32	local_nonce;
+	u8	local_id;
+	u8	remote_id;
 
 	struct  socket *tcp_sock;  /* underlying tcp_sock */
 	struct  sock *conn;        /* parent mptcp_sock */
+
 	void	(*tcp_sk_data_ready)(struct sock *sk);
 	struct	rcu_head rcu;
 };
@@ -192,12 +213,14 @@ void mptcp_get_options(const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx);
 
 void mptcp_finish_connect(struct sock *sk, int mp_capable);
+void mptcp_finish_join(struct sock *sk);
 
 int mptcp_token_new_request(struct request_sock *req);
 void mptcp_token_destroy_request(u32 token);
 int mptcp_token_new_connect(struct sock *sk);
 int mptcp_token_new_accept(u32 token);
 void mptcp_token_update_accept(struct sock *sk, struct sock *conn);
+struct mptcp_sock *mptcp_token_get_sock(u32 token);
 void mptcp_token_destroy(u32 token);
 
 void mptcp_crypto_key_sha1(u64 key, u32 *token, u64 *idsn);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 1cde27227fd9..1c3330ab2f30 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -9,6 +9,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/netdevice.h>
+#include <crypto/algapi.h>
 #include <net/sock.h>
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
@@ -44,6 +45,38 @@ static void subflow_req_destructor(struct request_sock *req)
 	tcp_request_sock_ops.destructor(req);
 }
 
+/* validate received token and create truncated hmac and nonce for SYN-ACK */
+static bool subflow_token_join_request(struct request_sock *req,
+				       const struct sk_buff *skb)
+{
+	struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
+	u8 hmac[MPTCPOPT_HMAC_LEN];
+	struct mptcp_sock *msk;
+
+	msk = mptcp_token_get_sock(subflow_req->token);
+	if (!msk) {
+		pr_debug("subflow_req=%p, token=%u - not found\n",
+			 subflow_req, subflow_req->token);
+		return false;
+	}
+
+	if (pm_get_local_id(req, (struct sock *)msk, skb)) {
+		sock_put((struct sock *)msk);
+		return false;
+	}
+
+	get_random_bytes(&subflow_req->local_nonce, sizeof(u32));
+
+	mptcp_crypto_hmac_sha1(msk->local_key, msk->remote_key,
+			       subflow_req->local_nonce,
+			       subflow_req->remote_nonce, (u32 *)hmac);
+
+	subflow_req->thmac = get_unaligned_be64(hmac);
+
+	sock_put((struct sock *)msk);
+	return true;
+}
+
 static void subflow_v4_init_req(struct request_sock *req,
 				const struct sock *sk_listener,
 				struct sk_buff *skb)
@@ -60,6 +93,12 @@ static void subflow_v4_init_req(struct request_sock *req,
 	memset(&rx_opt.mptcp, 0, sizeof(rx_opt.mptcp));
 	mptcp_get_options(skb, &rx_opt);
 
+	subflow_req->mp_capable = 0;
+	subflow_req->mp_join = 0;
+
+	if (rx_opt.mptcp.mp_capable && rx_opt.mptcp.mp_join)
+		return;
+
 	if (rx_opt.mptcp.mp_capable && listener->request_mptcp) {
 		int err;
 
@@ -76,8 +115,18 @@ static void subflow_v4_init_req(struct request_sock *req,
 			subflow_req->checksum = 1;
 		subflow_req->remote_key = rx_opt.mptcp.sndr_key;
 		subflow_req->ssn_offset = TCP_SKB_CB(skb)->seq;
-	} else {
-		subflow_req->mp_capable = 0;
+	} else if (rx_opt.mptcp.mp_join && listener->request_mptcp) {
+		subflow_req->mp_join = 1;
+		subflow_req->backup = rx_opt.mptcp.backup;
+		subflow_req->remote_id = rx_opt.mptcp.join_id;
+		subflow_req->token = rx_opt.mptcp.token;
+		subflow_req->remote_nonce = rx_opt.mptcp.nonce;
+		pr_debug("token=%u, remote_nonce=%u", subflow_req->token,
+			 subflow_req->remote_nonce);
+		if (!subflow_token_join_request(req, skb)) {
+			subflow_req->mp_join = 0;
+			// @@ need to trigger RST
+		}
 	}
 }
 
@@ -121,6 +170,32 @@ static int subflow_conn_request(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
+/* validate hmac received in third ACK */
+static bool subflow_hmac_valid(const struct request_sock *req,
+			       const struct tcp_options_received *rx_opt)
+{
+	const struct mptcp_subflow_request_sock *subflow_req;
+	u8 hmac[MPTCPOPT_HMAC_LEN];
+	struct mptcp_sock *msk;
+	bool ret;
+
+	subflow_req = mptcp_subflow_rsk(req);
+	msk = mptcp_token_get_sock(subflow_req->token);
+	if (!msk)
+		return false;
+
+	mptcp_crypto_hmac_sha1(msk->remote_key, msk->local_key,
+			       subflow_req->remote_nonce,
+			       subflow_req->local_nonce, (u32 *)hmac);
+
+	ret = true;
+	if (crypto_memneq(hmac, rx_opt->mptcp.hmac, sizeof(hmac)))
+		ret = false;
+
+	sock_put((struct sock *)msk);
+	return ret;
+}
+
 static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 					  struct sk_buff *skb,
 					  struct request_sock *req,
@@ -129,11 +204,21 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 					  bool *own_req)
 {
 	struct mptcp_subflow_context *listener = mptcp_subflow_ctx(sk);
+	struct mptcp_subflow_request_sock *subflow_req;
+	struct tcp_options_received opt_rx;
 	struct sock *child;
 
 	pr_debug("listener=%p, req=%p, conn=%p", listener, req, listener->conn);
 
 	/* if the sk is MP_CAPABLE, we already received the client key */
+	subflow_req = mptcp_subflow_rsk(req);
+	if (!subflow_req->mp_capable && subflow_req->mp_join) {
+		opt_rx.mptcp.mp_join = 0;
+		mptcp_get_options(skb, &opt_rx);
+		if (!opt_rx.mptcp.mp_join ||
+		    !subflow_hmac_valid(req, &opt_rx))
+			return NULL;
+	}
 
 	child = tcp_v4_syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
 
@@ -146,6 +231,15 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 		if (ctx->mp_capable) {
 			if (mptcp_token_new_accept(ctx->token))
 				goto close_child;
+		} else if (ctx->mp_join) {
+			struct mptcp_sock *owner;
+
+			owner = mptcp_token_get_sock(ctx->token);
+			if (!owner)
+				goto close_child;
+
+			ctx->conn = (struct sock *)owner;
+			mptcp_finish_join(child);
 		}
 	}
 
@@ -289,6 +383,13 @@ static void subflow_ulp_clone(const struct request_sock *req,
 		new_ctx->token = subflow_req->token;
 		new_ctx->ssn_offset = subflow_req->ssn_offset;
 		new_ctx->idsn = subflow_req->idsn;
+	} else if (subflow_req->mp_join) {
+		new_ctx->mp_join = 1;
+		new_ctx->fourth_ack = 1;
+		new_ctx->backup = subflow_req->backup;
+		new_ctx->local_id = subflow_req->local_id;
+		new_ctx->token = subflow_req->token;
+		new_ctx->thmac = subflow_req->thmac;
 	}
 }
 
diff --git a/net/mptcp/token.c b/net/mptcp/token.c
index 7e579deef6a2..a0d8f6e323b2 100644
--- a/net/mptcp/token.c
+++ b/net/mptcp/token.c
@@ -167,6 +167,28 @@ void mptcp_token_update_accept(struct sock *sk, struct sock *conn)
 	spin_unlock_bh(&token_tree_lock);
 }
 
+/**
+ * mptcp_token_get_sock - retrieve mptcp connection sock using its token
+ * @token - token of the mptcp connection to retrieve
+ *
+ * This function returns the mptcp connection structure with the given token.
+ * A reference count on the mptcp socket returned is taken.
+ *
+ * returns NULL if no connection with the given token value exists.
+ */
+struct mptcp_sock *mptcp_token_get_sock(u32 token)
+{
+	struct sock *conn;
+
+	spin_lock_bh(&token_tree_lock);
+	conn = radix_tree_lookup(&token_tree, token);
+	if (conn)
+		sock_hold(conn);
+	spin_unlock_bh(&token_tree_lock);
+
+	return mptcp_sk(conn);
+}
+
 /**
  * mptcp_token_destroy_request - remove mptcp connection/token
  * @token - token of mptcp connection to remove
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 29/45] mptcp: harmonize locking on all socket operations.
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (27 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 28/45] mptcp: Add handling of incoming MP_JOIN requests Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 30/45] mptcp: new sysctl to control the activation per NS Mat Martineau
                   ` (17 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

The locking schema implied by sendmsg(), recvmsg(), etc.
requires acquiring the msk's socket lock before manipulating
the msk internal status.

Additionally, we can't acquire the msk->subflow socket lock while holding
the msk lock, due to mptcp_finish_connect().

Many socket operations do not enforce the required locking, e.g. we have
several patterns alike:

	if (msk->subflow)
		// do something with msk->subflow

or:

	if (!msk->subflow)
		// allocate msk->subflow

all without any lock acquired.

They can race with each other and with mptcp_finish_connect() causing
UAF, null ptr dereference and/or memory leaks.

This patch ensures that all mptcp socket operations access and manipulate
msk->subflow under the msk socket lock. To avoid breaking the locking
assumption introduced by mptcp_finish_connect(), while avoiding UAF
issues, we acquire a reference to the msk->subflow, where needed.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
---
 net/mptcp/protocol.c | 82 +++++++++++++++++++++++++++++++++-----------
 net/mptcp/subflow.c  |  3 --
 2 files changed, 62 insertions(+), 23 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 32d9963c492d..8512cf5e0e0f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -178,6 +178,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct sock *ssk;
 	long timeo;
 
+	pr_debug("msk=%p", msk);
 	lock_sock(sk);
 	ssock = __mptcp_fallback_get_ref(msk);
 	if (ssock) {
@@ -846,38 +847,72 @@ static struct proto mptcp_prot = {
 	.no_autobind	= 1,
 };
 
+static struct socket *mptcp_socket_create_get(struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+	struct sock *sk = (struct sock *)msk;
+	struct socket *ssock;
+	int err;
+
+	lock_sock(sk);
+	ssock = __mptcp_fallback_get_ref(msk);
+	if (ssock)
+		goto release;
+
+	err = mptcp_subflow_create_socket(sk, &ssock);
+	if (err) {
+		ssock = ERR_PTR(err);
+		goto release;
+	}
+
+	msk->subflow = ssock;
+	subflow = mptcp_subflow_ctx(msk->subflow->sk);
+	subflow->request_mptcp = 1; /* @@ if MPTCP enabled */
+	subflow->request_cksum = 0; /* checksum not supported */
+	subflow->request_version = 0; /* only v0 supported */
+
+	sock_hold(ssock->sk);
+
+release:
+	release_sock(sk);
+	return ssock;
+}
+
 static int mptcp_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	struct socket *ssock;
 	int err = -ENOTSUPP;
 
 	if (uaddr->sa_family != AF_INET) // @@ allow only IPv4 for now
 		return err;
 
-	if (!msk->subflow) {
-		err = mptcp_subflow_create_socket(sock->sk, &msk->subflow);
-		if (err)
-			return err;
-	}
-	return inet_bind(msk->subflow, uaddr, addr_len);
+	ssock = mptcp_socket_create_get(msk);
+	if (IS_ERR(ssock))
+		return PTR_ERR(ssock);
+
+	err = inet_bind(ssock, uaddr, addr_len);
+	sock_put(ssock->sk);
+	return err;
 }
 
 static int mptcp_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 				int addr_len, int flags)
 {
 	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	struct socket *ssock;
 	int err = -ENOTSUPP;
 
 	if (uaddr->sa_family != AF_INET) // @@ allow only IPv4 for now
 		return err;
 
-	if (!msk->subflow) {
-		err = mptcp_subflow_create_socket(sock->sk, &msk->subflow);
-		if (err)
-			return err;
-	}
+	ssock = mptcp_socket_create_get(msk);
+	if (IS_ERR(ssock))
+		return PTR_ERR(ssock);
 
-	return inet_stream_connect(msk->subflow, uaddr, addr_len, flags);
+	err = inet_stream_connect(ssock, uaddr, addr_len, flags);
+	sock_put(ssock->sk);
+	return err;
 }
 
 static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
@@ -929,29 +964,36 @@ static int mptcp_getname(struct socket *sock, struct sockaddr *uaddr,
 static int mptcp_listen(struct socket *sock, int backlog)
 {
 	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	struct socket *ssock;
 	int err;
 
 	pr_debug("msk=%p", msk);
 
-	if (!msk->subflow) {
-		err = mptcp_subflow_create_socket(sock->sk, &msk->subflow);
-		if (err)
-			return err;
-	}
-	return inet_listen(msk->subflow, backlog);
+	ssock = mptcp_socket_create_get(msk);
+	if (IS_ERR(ssock))
+		return PTR_ERR(ssock);
+
+	err = inet_listen(ssock, backlog);
+	sock_put(ssock->sk);
+	return err;
 }
 
 static int mptcp_stream_accept(struct socket *sock, struct socket *newsock,
 			       int flags, bool kern)
 {
 	struct mptcp_sock *msk = mptcp_sk(sock->sk);
+	struct socket *ssock;
+	int err;
 
 	pr_debug("msk=%p", msk);
 
-	if (!msk->subflow)
+	ssock = mptcp_fallback_get_ref(msk);
+	if (!ssock)
 		return -EINVAL;
 
-	return inet_accept(sock, newsock, flags, kern);
+	err = inet_accept(sock, newsock, flags, kern);
+	sock_put(ssock->sk);
+	return err;
 }
 
 static __poll_t mptcp_poll(struct file *file, struct socket *sock,
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 1c3330ab2f30..04f232ff1df0 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -293,9 +293,6 @@ int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
 	*new_sock = sf;
 	sock_hold(sk);
 	subflow->conn = sk;
-	subflow->request_mptcp = 1; // @@ if MPTCP enabled
-	subflow->request_cksum = 1; // @@ if checksum enabled
-	subflow->request_version = 0;
 
 	return 0;
 }
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 30/45] mptcp: new sysctl to control the activation per NS
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (28 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 29/45] mptcp: harmonize locking on all socket operations Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 31/45] mptcp: add basic kselftest for mptcp Mat Martineau
                   ` (16 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Matthieu Baerts, cpaasch, fw, pabeni, peter.krystad, dcaratti

From: Matthieu Baerts <matthieu.baerts@tessares.net>

New MPTCP sockets will return -ENOPROTOOPT if MPTCP support is disabled
for the current net namespace.

For security reasons, it is interesting to have a global switch for
MPTCP. To start, MPTCP will be disabled by default and only privileged
users will be able to modify this. The reason is that because MPTCP is
new, it will not be tested and reviewed by many and security issues can
then take time to be discovered and fixed.

The value of this new sysctl can be different per namespace. We can then
restrict the usage of MPTCP to the selected NS. In case of serious
issues with MPTCP, administrators can now easily turn MPTCP off.

Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
---
 net/mptcp/Makefile   |   2 +-
 net/mptcp/ctrl.c     | 112 +++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.c |  25 +++++-----
 net/mptcp/protocol.h |   4 ++
 4 files changed, 129 insertions(+), 14 deletions(-)
 create mode 100644 net/mptcp/ctrl.c

diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index 7fe7aa64eda0..289fdf4339c1 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_MPTCP) += mptcp.o
 
-mptcp-y := protocol.o subflow.o options.o token.o crypto.o pm.o
+mptcp-y := protocol.o subflow.o options.o token.o crypto.o pm.o ctrl.o
diff --git a/net/mptcp/ctrl.c b/net/mptcp/ctrl.c
new file mode 100644
index 000000000000..8d9f15f02369
--- /dev/null
+++ b/net/mptcp/ctrl.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2019, Tessares SA.
+ */
+
+#include <linux/sysctl.h>
+
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#include "protocol.h"
+
+#define MPTCP_SYSCTL_PATH "net/mptcp"
+
+static int mptcp_pernet_id;
+struct mptcp_pernet {
+	struct ctl_table_header *ctl_table_hdr;
+
+	int mptcp_enabled;
+};
+
+static struct mptcp_pernet *mptcp_get_pernet(struct net *net)
+{
+	return net_generic(net, mptcp_pernet_id);
+}
+
+int mptcp_is_enabled(struct net *net)
+{
+	return mptcp_get_pernet(net)->mptcp_enabled;
+}
+
+static struct ctl_table mptcp_sysctl_table[] = {
+	{
+		.procname = "enabled",
+		.maxlen = sizeof(int),
+		.mode = 0644,
+		/* users with CAP_NET_ADMIN or root (not and) can change thi
+		 * value, same as other sysctl or the 'net' tree.
+		 */
+		.proc_handler = proc_dointvec,
+	},
+	{}
+};
+
+static int mptcp_pernet_new_table(struct net *net, struct mptcp_pernet *pernet)
+{
+	struct ctl_table_header *hdr;
+	struct ctl_table *table;
+
+	table = mptcp_sysctl_table;
+	if (!net_eq(net, &init_net)) {
+		table = kmemdup(table, sizeof(mptcp_sysctl_table), GFP_KERNEL);
+		if (!table)
+			goto err_alloc;
+	}
+
+	table[0].data = &pernet->mptcp_enabled;
+
+	hdr = register_net_sysctl(net, MPTCP_SYSCTL_PATH, table);
+	if (!hdr)
+		goto err_reg;
+
+	pernet->ctl_table_hdr = hdr;
+
+	return 0;
+
+err_reg:
+	if (!net_eq(net, &init_net))
+		kfree(table);
+err_alloc:
+	return -ENOMEM;
+}
+
+static void mptcp_pernet_del_table(struct mptcp_pernet *pernet)
+{
+	struct ctl_table *table = pernet->ctl_table_hdr->ctl_table_arg;
+
+	unregister_net_sysctl_table(pernet->ctl_table_hdr);
+
+	kfree(table);
+}
+
+static int __net_init mptcp_net_init(struct net *net)
+{
+	struct mptcp_pernet *pernet = mptcp_get_pernet(net);
+
+	return mptcp_pernet_new_table(net, pernet);
+}
+
+/* Note: the callback will only be called per extra netns */
+static void __net_exit mptcp_net_exit(struct net *net)
+{
+	struct mptcp_pernet *pernet = mptcp_get_pernet(net);
+
+	mptcp_pernet_del_table(pernet);
+}
+
+static struct pernet_operations mptcp_pernet_ops = {
+	.init = mptcp_net_init,
+	.exit = mptcp_net_exit,
+	.id = &mptcp_pernet_id,
+	.size = sizeof(struct mptcp_pernet),
+};
+
+void __init mptcp_init(void)
+{
+	mptcp_proto_init();
+
+	if (register_pernet_subsys(&mptcp_pernet_ops) < 0)
+		panic("Failed to register MPTCP pernet subsystem.\n");
+}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8512cf5e0e0f..fa99337ca773 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -562,22 +562,21 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	return copied;
 }
 
-static int mptcp_init_sock(struct sock *sk)
+static int __mptcp_init_sock(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct net *net = sock_net(sk);
-	struct socket *sf;
-	int err;
-
-	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP, &sf);
-	if (!err) {
-		pr_debug("subflow=%p", sf->sk);
-		msk->subflow = sf;
-	}
 
 	INIT_LIST_HEAD(&msk->conn_list);
 
-	return err;
+	return 0;
+}
+
+static int mptcp_init_sock(struct sock *sk)
+{
+	if (!mptcp_is_enabled(sock_net(sk)))
+		return -ENOPROTOOPT;
+
+	return __mptcp_init_sock(sk);
 }
 
 static void mptcp_close(struct sock *sk, long timeout)
@@ -646,7 +645,7 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 			return NULL;
 		}
 
-		mptcp_init_sock(new_mptcp_sock);
+		__mptcp_init_sock(new_mptcp_sock);
 
 		msk = mptcp_sk(new_mptcp_sock);
 		msk->remote_key = subflow->remote_key;
@@ -1067,7 +1066,7 @@ static struct inet_protosw mptcp_protosw = {
 	.flags		= INET_PROTOSW_ICSK,
 };
 
-void __init mptcp_init(void)
+void mptcp_proto_init(void)
 {
 	mptcp_prot.h.hashinfo = tcp_prot.h.hashinfo;
 	mptcp_stream_ops = inet_stream_ops;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 16fe1c766d98..394f2477e6f8 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -204,11 +204,15 @@ mptcp_subflow_tcp_socket(const struct mptcp_subflow_context *subflow)
 	return subflow->tcp_sock;
 }
 
+int mptcp_is_enabled(struct net *net);
+
 void mptcp_subflow_init(void);
 int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock);
 
 extern const struct inet_connection_sock_af_ops ipv4_specific;
 
+void mptcp_proto_init(void);
+
 void mptcp_get_options(const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx);
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 31/45] mptcp: add basic kselftest for mptcp
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (29 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 30/45] mptcp: new sysctl to control the activation per NS Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 32/45] mptcp: Add handling of outgoing MP_JOIN requests Mat Martineau
                   ` (15 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts, Mat Martineau

From: Florian Westphal <fw@strlen.de>

Add mpcp_connect tool:
xmit two files back and forth between two processes.
Wrapper script tests that data was transmitted without corruption.

The "-c" command line option for mptcp_connect.sh is there for debugging:

The script will use tcpdump to create one .pcap file per test case, named
according to the namespaces, protocols, and connect address in use.
For example, the first test case writes the capture to
ns1-ns1-MPTCP-MPTCP-10.0.1.1.pcap.

The stderr output from tcpdump is printed after the test completes to
show tcpdump's "packets dropped by kernel" information.

Also check that userspace can't create MPTCP sockets when mptcp.enabled
sysctl is off.

Will run automatically on "make kselftest".

This patch merged following enhancements:
From Mat Martineau:
* mptcp: selftests: Add capture option
From Matthieu Baerts: Add sysctl test
From Paolo Abeni: Use tc/ethtool to test more skb handling code paths

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/net/mptcp/.gitignore  |   2 +
 tools/testing/selftests/net/mptcp/Makefile    |  11 +
 tools/testing/selftests/net/mptcp/config      |   1 +
 .../selftests/net/mptcp/mptcp_connect.c       | 408 ++++++++++++++++++
 .../selftests/net/mptcp/mptcp_connect.sh      | 295 +++++++++++++
 6 files changed, 718 insertions(+)
 create mode 100644 tools/testing/selftests/net/mptcp/.gitignore
 create mode 100644 tools/testing/selftests/net/mptcp/Makefile
 create mode 100644 tools/testing/selftests/net/mptcp/config
 create mode 100644 tools/testing/selftests/net/mptcp/mptcp_connect.c
 create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect.sh

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index c3feccb99ff5..4a404ee1b980 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -29,6 +29,7 @@ TARGETS += memory-hotplug
 TARGETS += mount
 TARGETS += mqueue
 TARGETS += net
+TARGETS += net/mptcp
 TARGETS += netfilter
 TARGETS += networking/timestamping
 TARGETS += nsfs
diff --git a/tools/testing/selftests/net/mptcp/.gitignore b/tools/testing/selftests/net/mptcp/.gitignore
new file mode 100644
index 000000000000..d72f07642738
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/.gitignore
@@ -0,0 +1,2 @@
+mptcp_connect
+*.pcap
diff --git a/tools/testing/selftests/net/mptcp/Makefile b/tools/testing/selftests/net/mptcp/Makefile
new file mode 100644
index 000000000000..0fc5d45055ee
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+top_srcdir = ../../../../..
+
+CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
+
+TEST_PROGS := mptcp_connect.sh
+
+TEST_GEN_FILES = mptcp_connect
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/net/mptcp/config b/tools/testing/selftests/net/mptcp/config
new file mode 100644
index 000000000000..3bfe60494af8
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/config
@@ -0,0 +1 @@
+CONFIG_MPTCP=y
diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c
new file mode 100644
index 000000000000..cac71f0ac8f8
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c
@@ -0,0 +1,408 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <unistd.h>
+
+#include <sys/poll.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+
+#include <netdb.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+
+extern int optind;
+
+#ifndef IPPROTO_MPTCP
+#define IPPROTO_MPTCP 262
+#endif
+
+static bool listen_mode;
+static int  poll_timeout;
+
+static const char *cfg_host;
+static const char *cfg_port	= "12000";
+static int cfg_sock_proto	= IPPROTO_MPTCP;
+
+
+static void die_usage(void)
+{
+	fprintf(stderr, "Usage: mptcp_connect [-s MPTCP|TCP] [-p port] "
+		"[ -l ] [ -t timeout ] connect_address\n");
+	exit(1);
+}
+
+static const char *getxinfo_strerr(int err)
+{
+	if (err == EAI_SYSTEM)
+		return strerror(errno);
+
+	return gai_strerror(err);
+}
+
+static void xgetaddrinfo(const char *node, const char *service,
+			 const struct addrinfo *hints,
+			 struct addrinfo **res)
+{
+	int err = getaddrinfo(node, service, hints, res);
+
+	if (err) {
+		const char *errstr = getxinfo_strerr(err);
+
+		fprintf(stderr, "Fatal: getaddrinfo(%s:%s): %s\n",
+			node ? node : "", service ? service : "", errstr);
+		exit(1);
+	}
+}
+
+static int sock_listen_mptcp(const char * const listenaddr,
+			     const char * const port)
+{
+	int sock;
+	struct addrinfo hints = {
+		.ai_protocol = IPPROTO_TCP,
+		.ai_socktype = SOCK_STREAM,
+		.ai_flags = AI_PASSIVE | AI_NUMERICHOST
+	};
+
+	hints.ai_family = AF_INET;
+
+	struct addrinfo *a, *addr;
+	int one = 1;
+
+	xgetaddrinfo(listenaddr, port, &hints, &addr);
+
+	for (a = addr; a; a = a->ai_next) {
+		sock = socket(a->ai_family, a->ai_socktype, cfg_sock_proto);
+		if (sock < 0)
+			continue;
+
+		if (-1 == setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one,
+				     sizeof(one)))
+			perror("setsockopt");
+
+		if (bind(sock, a->ai_addr, a->ai_addrlen) == 0)
+			break; /* success */
+
+		perror("bind");
+		close(sock);
+		sock = -1;
+	}
+
+	freeaddrinfo(addr);
+
+	if (sock < 0) {
+		fprintf(stderr, "Could not create listen socket\n");
+		return sock;
+	}
+
+	if (listen(sock, 20)) {
+		perror("listen");
+		close(sock);
+		return -1;
+	}
+
+	return sock;
+}
+
+static int sock_connect_mptcp(const char * const remoteaddr,
+			      const char * const port, int proto)
+{
+	struct addrinfo hints = {
+		.ai_protocol = IPPROTO_TCP,
+		.ai_socktype = SOCK_STREAM,
+	};
+	struct addrinfo *a, *addr;
+	int sock = -1;
+
+	hints.ai_family = AF_INET;
+
+	xgetaddrinfo(remoteaddr, port, &hints, &addr);
+	for (a = addr; a; a = a->ai_next) {
+		sock = socket(a->ai_family, a->ai_socktype, proto);
+		if (sock < 0) {
+			perror("socket");
+			continue;
+		}
+
+		if (connect(sock, a->ai_addr, a->ai_addrlen) == 0)
+			break; /* success */
+
+		perror("connect()");
+		close(sock);
+		sock = -1;
+	}
+
+	freeaddrinfo(addr);
+	return sock;
+}
+
+static size_t do_rnd_write(const int fd, char *buf, const size_t len)
+{
+	size_t offset = 0;
+
+	while (offset < len) {
+		unsigned int do_w;
+		size_t written;
+		ssize_t bw;
+
+		do_w = rand() & 0xffff;
+		if (do_w == 0 || do_w > (len - offset))
+			do_w = len - offset;
+
+		bw = write(fd, buf + offset, do_w);
+		if (bw < 0) {
+			perror("write");
+			return 0;
+		}
+
+		written = (size_t)bw;
+		offset += written;
+	}
+	return offset;
+}
+
+static size_t do_write(const int fd, char *buf, const size_t len)
+{
+	size_t offset = 0;
+
+	while (offset < len) {
+		size_t written;
+		ssize_t bw;
+
+		bw = write(fd, buf + offset, len - offset);
+		if (bw < 0) {
+			perror("write");
+			return 0;
+		}
+
+		written = (size_t)bw;
+		offset += written;
+	}
+
+	return offset;
+}
+
+static ssize_t do_rnd_read(const int fd, char *buf, const size_t len)
+{
+	size_t cap = rand();
+
+	cap &= 0xffff;
+
+	if (cap == 0)
+		cap = 1;
+	else if (cap > len)
+		cap = len;
+
+	return read(fd, buf, cap);
+}
+
+static int copyfd_io(int infd, int peerfd, int outfd)
+{
+	struct pollfd fds = {
+		.fd = peerfd,
+		.events = POLLIN | POLLOUT,
+	};
+
+	for (;;) {
+		char buf[8192];
+		ssize_t len;
+
+		if (fds.events == 0)
+			break;
+
+		switch (poll(&fds, 1, poll_timeout)) {
+		case -1:
+			if (errno == EINTR)
+				continue;
+			perror("poll");
+			return 1;
+		case 0:
+			fprintf(stderr, "%s: poll timed out (events: "
+				"POLLIN %u, POLLOUT %u)\n", __func__,
+				fds.events & POLLIN, fds.events & POLLOUT);
+			return 2;
+		}
+
+		if (fds.revents & POLLIN) {
+			len = do_rnd_read(peerfd, buf, sizeof(buf));
+			if (len == 0) {
+				/* no more data to receive:
+				 * peer has closed its write side
+				 */
+				fds.events &= ~POLLIN;
+
+				if ((fds.events & POLLOUT) == 0)
+					/* and nothing more to send */
+					break;
+
+			/* Else, still have data to transmit */
+			} else if (len < 0) {
+				perror("read");
+				return 3;
+			}
+
+			do_write(outfd, buf, len);
+		}
+
+		if (fds.revents & POLLOUT) {
+			len = do_rnd_read(infd, buf, sizeof(buf));
+			if (len > 0) {
+				if (!do_rnd_write(peerfd, buf, len))
+					return 111;
+			} else if (len == 0) {
+				/* We have no more data to send. */
+				fds.events &= ~POLLOUT;
+
+				if ((fds.events & POLLIN) == 0)
+					/* ... and peer also closed already */
+					break;
+
+				/* ... but we still receive.
+				 * Close our write side.
+				 */
+				shutdown(peerfd, SHUT_WR);
+			} else {
+				if (errno == EINTR)
+					continue;
+				perror("read");
+				return 4;
+			}
+		}
+	}
+
+	close(peerfd);
+	return 0;
+}
+
+int main_loop_s(int listensock)
+{
+	struct sockaddr_storage ss;
+	struct pollfd polls;
+	socklen_t salen;
+	int remotesock;
+
+	polls.fd = listensock;
+	polls.events = POLLIN;
+
+	switch (poll(&polls, 1, poll_timeout)) {
+	case -1:
+		perror("poll");
+		return 1;
+	case 0:
+		fprintf(stderr, "%s: timed out\n", __func__);
+		close(listensock);
+		return 2;
+	}
+
+	salen = sizeof(ss);
+	remotesock = accept(listensock, (struct sockaddr *)&ss, &salen);
+	if (remotesock >= 0) {
+		copyfd_io(0, remotesock, 1);
+		return 0;
+	}
+
+	perror("accept");
+
+	return 1;
+}
+
+static void init_rng(void)
+{
+	int fd = open("/dev/urandom", O_RDONLY);
+	unsigned int foo;
+
+	if (fd > 0) {
+		int ret = read(fd, &foo, sizeof(foo));
+
+		if (ret < 0)
+			srand(fd + foo);
+		close(fd);
+	}
+
+	srand(foo);
+}
+
+int main_loop(void)
+{
+	int fd;
+
+	/* listener is ready. */
+	fd = sock_connect_mptcp(cfg_host, cfg_port, cfg_sock_proto);
+	if (fd < 0)
+		return 2;
+
+	return copyfd_io(0, fd, 1);
+}
+
+int parse_proto(const char *proto)
+{
+	if (!strcasecmp(proto, "MPTCP"))
+		return IPPROTO_MPTCP;
+	if (!strcasecmp(proto, "TCP"))
+		return IPPROTO_TCP;
+
+	fprintf(stderr, "Unknown protocol: %s.", proto);
+	die_usage();
+
+	/* silence compiler warning */
+	return 0;
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "lp:s:ht:")) != -1) {
+		switch (c) {
+		case 'l':
+			listen_mode = true;
+			break;
+		case 'p':
+			cfg_port = optarg;
+			break;
+		case 's':
+			cfg_sock_proto = parse_proto(optarg);
+			break;
+		case 'h':
+			die_usage();
+			break;
+		case 't':
+			poll_timeout = atoi(optarg) * 1000;
+			if (poll_timeout <= 0)
+				poll_timeout = -1;
+			break;
+		}
+	}
+
+	if (optind + 1 != argc)
+		die_usage();
+	cfg_host = argv[optind];
+}
+
+int main(int argc, char *argv[])
+{
+	init_rng();
+
+	parse_opts(argc, argv);
+
+	if (listen_mode) {
+		int fd = sock_listen_mptcp(cfg_host, cfg_port);
+
+		if (fd < 0)
+			return 1;
+
+		return main_loop_s(fd);
+	}
+
+	return main_loop();
+}
diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
new file mode 100755
index 000000000000..d029bdc5946d
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
@@ -0,0 +1,295 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+ret=0
+sin=""
+sout=""
+cin=""
+cout=""
+ksft_skip=4
+capture=0
+timeout=30
+
+TEST_COUNT=0
+
+cleanup()
+{
+	rm -f "$cin" "$cout"
+	rm -f "$sin" "$sout"
+	rm -f "$capout"
+
+	for i in 1 2 3 4; do
+		ip netns del ns$i
+	done
+}
+
+for arg in "$@"; do
+    if [ "$arg" = "-c" ]; then
+	capture=1
+    fi
+done
+
+ip -Version > /dev/null 2>&1
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not run test without ip tool"
+	exit $ksft_skip
+fi
+
+sin=$(mktemp)
+sout=$(mktemp)
+cin=$(mktemp)
+cout=$(mktemp)
+capout=$(mktemp)
+trap cleanup EXIT
+
+for i in 1 2 3 4;do
+	ip netns add ns$i || exit $ksft_skip
+	ip -net ns$i link set lo up
+	ip netns exec ns$i sysctl -q net.mptcp.enabled=1
+done
+
+#  ns1              ns2                    ns3                     ns4
+# ns1eth2    ns2eth1   ns2eth3      ns3eth2   ns3eth4       ns4eth3
+#                           - drop 1% ->            reorder 25%
+#                           <- TSO off -
+
+ip link add ns1eth2 netns ns1 type veth peer name ns2eth1 netns ns2
+ip link add ns2eth3 netns ns2 type veth peer name ns3eth2 netns ns3
+ip link add ns3eth4 netns ns3 type veth peer name ns4eth3 netns ns4
+
+ip -net ns1 addr add 10.0.1.1/24 dev ns1eth2
+ip -net ns1 link set ns1eth2 up
+ip -net ns1 route add default via 10.0.1.2
+
+ip -net ns2 addr add 10.0.1.2/24 dev ns2eth1
+ip -net ns2 link set ns2eth1 up
+
+ip -net ns2 addr add 10.0.2.1/24 dev ns2eth3
+ip -net ns2 link set ns2eth3 up
+ip -net ns2 route add default via 10.0.2.2
+ip netns exec ns2 sysctl -q net.ipv4.ip_forward=1
+
+ip -net ns3 addr add 10.0.2.2/24 dev ns3eth2
+ip -net ns3 link set ns3eth2 up
+
+ip -net ns3 addr add 10.0.3.2/24 dev ns3eth4
+ip -net ns3 link set ns3eth4 up
+ip -net ns3 route add default via 10.0.2.1
+ip netns exec ns3 ethtool -K ns3eth2 tso off 2>/dev/null
+ip netns exec ns3 sysctl -q net.ipv4.ip_forward=1
+
+ip -net ns4 addr add 10.0.3.1/24 dev ns4eth3
+ip -net ns4 link set ns4eth3 up
+ip -net ns4 route add default via 10.0.3.2
+
+print_file_err()
+{
+	ls -l "$1" 1>&2
+	echo "Trailing bytes are: "
+	tail -c 27 "$1"
+}
+
+check_transfer()
+{
+	in=$1
+	out=$2
+	what=$3
+
+	cmp "$in" "$out" > /dev/null 2>&1
+	if [ $? -ne 0 ] ;then
+		echo "[ FAIL ] $what does not match (in, out):"
+		print_file_err "$in"
+		print_file_err "$out"
+
+		return 1
+	fi
+
+	return 0
+}
+
+check_mptcp_disabled()
+{
+	disabled_ns="ns_disabled"
+	ip netns add ${disabled_ns} || exit $ksft_skip
+	# by default: sysctl net.mptcp.enabled=0
+
+	local err=0
+	LANG=C ip netns exec ${disabled_ns} ./mptcp_connect -t $timeout -p 10000 -s MPTCP 127.0.0.1 < "$cin" 2>&1 | \
+		grep -q "^socket: Protocol not available$" && err=1
+	ip netns delete ${disabled_ns}
+
+	if [ ${err} -eq 0 ]; then
+		echo -e "MPTCP is not disabled by default as expected\t[ FAIL ]"
+		ret=1
+		return 1
+	fi
+
+	echo -e "MPTCP is disabled by default as expected\t[ OK ]"
+	return 0
+}
+
+do_ping()
+{
+	listener_ns="$1"
+	connector_ns="$2"
+	connect_addr="$3"
+
+	ip netns exec ${connector_ns} ping -q -c 1 $connect_addr >/dev/null
+	if [ $? -ne 0 ] ; then
+		echo "$listener_ns -> $connect_addr connectivity [ FAIL ]" 1>&2
+		ret=1
+	fi
+}
+
+do_transfer()
+{
+	listener_ns="$1"
+	connector_ns="$2"
+	cl_proto="$3"
+	srv_proto="$4"
+	connect_addr="$5"
+
+	port=$((10000+$TEST_COUNT))
+	TEST_COUNT=$((TEST_COUNT+1))
+
+	:> "$cout"
+	:> "$sout"
+	:> "$capout"
+
+	printf "%-4s %-5s -> %-4s (%s:%d) %-5s\t" ${connector_ns} ${cl_proto} ${listener_ns} ${connect_addr} ${port} ${srv_proto}
+
+	if [ $capture -eq 1 ]; then
+	    if [ -z $SUDO_USER ] ; then
+		capuser=""
+	    else
+		capuser="-Z $SUDO_USER"
+	    fi
+
+	    capfile="${listener_ns}-${connector_ns}-${cl_proto}-${srv_proto}-${connect_addr}.pcap"
+
+	    ip netns exec ${listener_ns} tcpdump -i any -s 65535 -B 32768 $capuser -w $capfile > "$capout" 2>&1 &
+	    cappid=$!
+
+	    sleep 1
+	fi
+
+	ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} 0.0.0.0 < "$sin" > "$sout" &
+	spid=$!
+
+	sleep 1
+
+	ip netns exec ${connector_ns} ./mptcp_connect -t $timeout -p $port -s ${cl_proto} $connect_addr < "$cin" > "$cout" &
+	cpid=$!
+
+	wait $cpid
+	retc=$?
+	wait $spid
+	rets=$?
+
+	if [ $capture -eq 1 ]; then
+	    sleep 1
+	    kill $cappid
+	fi
+
+	if [ ${rets} -ne 0 ] || [ ${retc} -ne 0 ]; then
+		echo "[ FAIL ] client exit code $retc, server $rets" 1>&2
+		echo "\nnetns ${listener_ns} socket stat for $port:" 1>&2
+		ip netns exec ${listener_ns} ss -nita 1>&2 -o "sport = :$port"
+		echo "\nnetns ${connector_ns} socket stat for $port:" 1>&2
+		ip netns exec ${connector_ns} ss -nita 1>&2 -o "dport = :$port"
+
+		cat "$capout"
+		return 1
+	fi
+
+	check_transfer $sin $cout "file received by client"
+	retc=$?
+	check_transfer $cin $sout "file received by server"
+	rets=$?
+
+	if [ $retc -eq 0 ] && [ $rets -eq 0 ];then
+		echo "[ OK ]"
+		cat "$capout"
+		return 0
+	fi
+
+	cat "$capout"
+	return 1
+}
+
+make_file()
+{
+	name=$1
+	who=$2
+
+	SIZE=$((RANDOM % (1024 * 8)))
+	TSIZE=$((SIZE * 1024))
+
+	dd if=/dev/urandom of="$name" bs=1024 count=$SIZE 2> /dev/null
+
+	SIZE=$((RANDOM % 1024))
+	SIZE=$((SIZE + 128))
+	TSIZE=$((TSIZE + SIZE))
+	dd if=/dev/urandom conf=notrunc of="$name" bs=1 count=$SIZE 2> /dev/null
+	echo -e "\nMPTCP_TEST_FILE_END_MARKER" >> "$name"
+
+	echo "Created $name (size $TSIZE) containing data sent by $who"
+}
+
+run_tests()
+{
+	listener_ns="$1"
+	connector_ns="$2"
+	connect_addr="$3"
+	lret=0
+
+	for proto in MPTCP TCP;do
+		do_transfer ${listener_ns} ${connector_ns} MPTCP "$proto" ${connect_addr}
+		lret=$?
+		if [ $lret -ne 0 ]; then
+			ret=$lret
+			return
+		fi
+	done
+
+	do_transfer ${listener_ns} ${connector_ns} TCP MPTCP ${connect_addr}
+	lret=$?
+	if [ $lret -ne 0 ]; then
+		ret=$lret
+		return
+	fi
+}
+
+make_file "$cin" "client"
+make_file "$sin" "server"
+
+check_mptcp_disabled
+
+for sender in 1 2 3 4;do
+	do_ping ns1 ns$sender 10.0.1.1
+
+	do_ping ns2 ns$sender 10.0.1.2
+	do_ping ns2 ns$sender 10.0.2.1
+
+	do_ping ns3 ns$sender 10.0.2.2
+	do_ping ns3 ns$sender 10.0.3.2
+
+	do_ping ns4 ns$sender 10.0.3.1
+done
+
+tc -net ns2 qdisc add dev ns2eth3 root netem loss random 1
+tc -net ns3 qdisc add dev ns3eth4 root netem delay 10ms reorder 25% 50% gap 5
+
+for sender in 1 2 3 4;do
+	run_tests ns1 ns$sender 10.0.1.1
+
+	run_tests ns2 ns$sender 10.0.1.2
+	run_tests ns2 ns$sender 10.0.2.1
+
+	run_tests ns3 ns$sender 10.0.2.2
+	run_tests ns3 ns$sender 10.0.3.2
+
+	run_tests ns4 ns$sender 10.0.3.1
+done
+
+exit $ret
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 32/45] mptcp: Add handling of outgoing MP_JOIN requests
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (30 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 31/45] mptcp: add basic kselftest for mptcp Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 33/45] mptcp: Implement path manager interface commands Mat Martineau
                   ` (14 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Subflow creation may be initiated by the path manager when
the primary connection is fully established and a remote
address has been received via ADD_ADDR.

Create an in-kernel sock and use kernel_connect() to
initiate connection. When a valid SYN-ACK is received the
new sock is added to the tail of the mptcp sock conn_list
where it will not interfere with data flow on the original
connection.

Data flow and connection failover not addressed by this commit.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/mptcp.h  |  2 +
 net/mptcp/options.c  | 51 +++++++++++++++++++++--
 net/mptcp/protocol.c |  1 +
 net/mptcp/protocol.h |  9 ++++
 net/mptcp/subflow.c  | 98 +++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 156 insertions(+), 5 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index bb2dd193c0c5..50cd1b31ebdd 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -40,6 +40,8 @@ struct mptcp_out_options {
 	u8 backup;
 	u32 nonce;
 	u64 thmac;
+	u32 token;
+	u8 hmac[20];
 	struct mptcp_ext ext_copy;
 #endif
 };
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index f5e0b1d0931b..ce298ecc64f5 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -316,6 +316,16 @@ bool mptcp_syn_options(struct sock *sk, unsigned int *size,
 		opts->sndr_key = subflow->local_key;
 		*size = TCPOLEN_MPTCP_MPC_SYN;
 		return true;
+	} else if (subflow->request_join) {
+		pr_debug("remote_token=%u, nonce=%u", subflow->remote_token,
+			 subflow->local_nonce);
+		opts->suboptions = OPTION_MPTCP_MPJ_SYN;
+		opts->join_id = subflow->remote_id;
+		opts->token = subflow->remote_token;
+		opts->nonce = subflow->local_nonce;
+		opts->backup = subflow->request_bkup;
+		*size = TCPOLEN_MPTCP_MPJ_SYN;
+		return true;
 	}
 	return false;
 }
@@ -325,10 +335,17 @@ void mptcp_rcv_synsent(struct sock *sk)
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	pr_debug("subflow=%p", subflow);
 	if (subflow->request_mptcp && tp->rx_opt.mptcp.mp_capable) {
 		subflow->mp_capable = 1;
 		subflow->remote_key = tp->rx_opt.mptcp.sndr_key;
+		pr_debug("subflow=%p, remote_key=%llu", subflow,
+			 subflow->remote_key);
+	} else if (subflow->request_join && tp->rx_opt.mptcp.mp_join) {
+		subflow->mp_join = 1;
+		subflow->thmac = tp->rx_opt.mptcp.thmac;
+		subflow->remote_nonce = tp->rx_opt.mptcp.nonce;
+		pr_debug("subflow=%p, thmac=%llu, remote_nonce=%u", subflow,
+			 subflow->thmac, subflow->remote_nonce);
 	}
 }
 
@@ -338,7 +355,8 @@ static bool mptcp_established_options_mp(struct sock *sk, unsigned int *size,
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
 
-	if (!subflow->fourth_ack && remaining >= TCPOLEN_MPTCP_MPC_ACK) {
+	if (subflow->mp_capable && !subflow->fourth_ack &&
+	    remaining >= TCPOLEN_MPTCP_MPC_ACK) {
 		opts->suboptions = OPTION_MPTCP_MPC_ACK;
 		opts->sndr_key = subflow->local_key;
 		opts->rcvr_key = subflow->remote_key;
@@ -347,6 +365,14 @@ static bool mptcp_established_options_mp(struct sock *sk, unsigned int *size,
 		pr_debug("subflow=%p, local_key=%llu, remote_key=%llu",
 			 subflow, subflow->local_key, subflow->remote_key);
 		return true;
+	} else if (subflow->mp_join && !subflow->fourth_ack &&
+		   remaining >= TCPOLEN_MPTCP_MPJ_ACK) {
+		opts->suboptions = OPTION_MPTCP_MPJ_ACK;
+		memcpy(opts->hmac, subflow->hmac, MPTCPOPT_HMAC_LEN);
+		*size = TCPOLEN_MPTCP_MPJ_ACK;
+		subflow->fourth_ack = 1;
+		pr_debug("subflow=%p", subflow);
+		return true;
 	}
 	return false;
 }
@@ -459,10 +485,11 @@ bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 			       unsigned int *size, unsigned int remaining,
 			       struct mptcp_out_options *opts)
 {
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
 	unsigned int opt_size = 0;
 	bool ret = false;
 
-	if (!mptcp_subflow_ctx(sk)->mp_capable)
+	if (!subflow->mp_capable && !subflow->mp_join)
 		return false;
 
 	opts->suboptions = 0;
@@ -562,7 +589,6 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
 
 	if (msk)
 		pm_fully_established(msk);
-
 }
 
 void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
@@ -612,6 +638,16 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 				      0, opts->addr_id);
 	}
 
+	if (OPTION_MPTCP_MPJ_SYN & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_MP_JOIN,
+				      TCPOLEN_MPTCP_MPJ_SYN,
+				      opts->backup, opts->join_id);
+		put_unaligned_be32(opts->token, ptr);
+		ptr += 1;
+		put_unaligned_be32(opts->nonce, ptr);
+		ptr += 1;
+	}
+
 	if (OPTION_MPTCP_MPJ_SYNACK & opts->suboptions) {
 		*ptr++ = mptcp_option(MPTCPOPT_MP_JOIN,
 				      TCPOLEN_MPTCP_MPJ_SYNACK,
@@ -622,6 +658,13 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 		ptr += 1;
 	}
 
+	if (OPTION_MPTCP_MPJ_ACK & opts->suboptions) {
+		*ptr++ = mptcp_option(MPTCPOPT_MP_JOIN,
+				      TCPOLEN_MPTCP_MPJ_ACK, 0, 0);
+		memcpy(ptr, opts->hmac, MPTCPOPT_HMAC_LEN);
+		ptr += 5;
+	}
+
 	if (opts->ext_copy.use_ack || opts->ext_copy.use_map) {
 		struct mptcp_ext *mpext = &opts->ext_copy;
 		u8 len = TCPOLEN_MPTCP_DSS_BASE;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index fa99337ca773..445800eae767 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -787,6 +787,7 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 		msk->local_key = subflow->local_key;
 		msk->token = subflow->token;
 		pr_debug("msk=%p, token=%u", msk, msk->token);
+		msk->dport = ntohs(inet_sk(msk->subflow->sk)->inet_dport);
 
 		pm_new_connection(msk, 0);
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 394f2477e6f8..4a1171b75ec6 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -49,8 +49,10 @@
 #define TCPOLEN_MPTCP_ADD_ADDR6		20
 #define TCPOLEN_MPTCP_RM_ADDR		4
 
+/* MPTCP MP_JOIN flags */
 #define MPTCPOPT_BACKUP		BIT(0)
 #define MPTCPOPT_HMAC_LEN	20
+#define MPTCPOPT_THMAC_LEN	8
 
 /* MPTCP MP_CAPABLE flags */
 #define MPTCP_VERSION_MASK	(0x0F)
@@ -115,6 +117,7 @@ struct mptcp_sock {
 	u64		write_seq;
 	u64		ack_seq;
 	u32		token;
+	u16		dport;
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 	struct mptcp_pm_data	pm;
@@ -167,7 +170,9 @@ struct mptcp_subflow_context {
 	u32	ssn_offset;
 	u16	map_data_len;
 	u16	request_mptcp : 1,  /* send MP_CAPABLE */
+		request_join : 1,   /* send MP_JOIN */
 		request_cksum : 1,
+		request_bkup : 1,
 		request_version : 4,
 		mp_capable : 1,     /* remote is MPTCP capable */
 		mp_join : 1,        /* remote is JOINing */
@@ -179,6 +184,8 @@ struct mptcp_subflow_context {
 	u32	remote_nonce;
 	u64	thmac;
 	u32	local_nonce;
+	u32	remote_token;
+	u8	hmac[MPTCPOPT_HMAC_LEN];
 	u8	local_id;
 	u8	remote_id;
 
@@ -207,6 +214,8 @@ mptcp_subflow_tcp_socket(const struct mptcp_subflow_context *subflow)
 int mptcp_is_enabled(struct net *net);
 
 void mptcp_subflow_init(void);
+int mptcp_subflow_connect(struct sock *sk, struct sockaddr_in *local,
+			  struct sockaddr_in *remote, u8 remote_id);
 int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock);
 
 extern const struct inet_connection_sock_af_ops ipv4_specific;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 04f232ff1df0..257e52d9595e 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -26,6 +26,13 @@ static int subflow_rebuild_header(struct sock *sk)
 	if (subflow->request_mptcp && !subflow->token) {
 		pr_debug("subflow=%p", sk);
 		err = mptcp_token_new_connect(sk);
+	} else if (subflow->request_join && !subflow->local_nonce) {
+		pr_debug("subflow=%p", sk);
+		mptcp_token_get_sock(subflow->token);
+
+		do {
+			get_random_bytes(&subflow->local_nonce, sizeof(u32));
+		} while (!subflow->local_nonce);
 	}
 
 	if (err)
@@ -130,13 +137,35 @@ static void subflow_v4_init_req(struct request_sock *req,
 	}
 }
 
+/* validate received truncated hmac and create hmac for third ACK */
+static bool subflow_thmac_valid(struct mptcp_subflow_context *subflow)
+{
+	u8 hmac[MPTCPOPT_HMAC_LEN];
+	u64 thmac;
+
+	mptcp_crypto_hmac_sha1(subflow->remote_key, subflow->local_key,
+			       subflow->remote_nonce, subflow->local_nonce,
+			       (u32 *)hmac);
+
+	thmac = get_unaligned_be64(hmac);
+	pr_debug("subflow=%p, token=%u, thmac=%llu, subflow->thmac=%llu\n",
+		 subflow, subflow->token,
+		 (unsigned long long)thmac,
+		 (unsigned long long)subflow->thmac);
+
+	return thmac == subflow->thmac;
+}
+
 static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
 
 	inet_sk_rx_dst_set(sk, skb);
 
-	if (subflow->conn && !subflow->conn_finished) {
+	if (!subflow->conn)
+		return;
+
+	if (subflow->mp_capable && !subflow->conn_finished) {
 		pr_debug("subflow=%p, remote_key=%llu", mptcp_subflow_ctx(sk),
 			 subflow->remote_key);
 		mptcp_finish_connect(subflow->conn, subflow->mp_capable);
@@ -146,6 +175,23 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 			pr_debug("synack seq=%u", TCP_SKB_CB(skb)->seq);
 			subflow->ssn_offset = TCP_SKB_CB(skb)->seq;
 		}
+	} else if (subflow->mp_join && !subflow->conn_finished) {
+		pr_debug("subflow=%p, thmac=%llu, remote_nonce=%u",
+			 subflow, subflow->thmac,
+			 subflow->remote_nonce);
+		if (!subflow_thmac_valid(subflow)) {
+			subflow->mp_join = 0;
+			// @@ need to trigger RST
+			return;
+		}
+
+		mptcp_crypto_hmac_sha1(subflow->local_key, subflow->remote_key,
+				       subflow->local_nonce,
+				       subflow->remote_nonce,
+				       (u32 *)subflow->hmac);
+
+		mptcp_finish_join(sk);
+		subflow->conn_finished = 1;
 	}
 }
 
@@ -269,6 +315,56 @@ static void subflow_data_ready(struct sock *sk)
 	}
 }
 
+int mptcp_subflow_connect(struct sock *sk, struct sockaddr_in *local,
+			  struct sockaddr_in *remote, u8 remote_id)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_subflow_context *subflow;
+	struct socket *sf;
+	u32 remote_token;
+	int err;
+
+	lock_sock(sk);
+	err = mptcp_subflow_create_socket(sk, &sf);
+	if (err) {
+		release_sock(sk);
+		return err;
+	}
+
+	subflow = mptcp_subflow_ctx(sf->sk);
+	subflow->remote_key = msk->remote_key;
+	subflow->local_key = msk->local_key;
+	subflow->token = msk->token;
+
+	sock_hold(sf->sk);
+	release_sock(sk);
+
+	err = kernel_bind(sf, (struct sockaddr *)local,
+			  sizeof(struct sockaddr_in));
+	if (err)
+		goto failed;
+
+	mptcp_crypto_key_sha1(subflow->remote_key, &remote_token, NULL);
+	pr_debug("msk=%p remote_token=%u", msk, remote_token);
+	subflow->remote_token = remote_token;
+	subflow->remote_id = remote_id;
+	subflow->request_join = 1;
+	subflow->request_bkup = 1;
+
+	err = kernel_connect(sf, (struct sockaddr *)remote,
+			     sizeof(struct sockaddr_in), O_NONBLOCK);
+	if (err && err != -EINPROGRESS)
+		goto failed;
+
+	sock_put(sf->sk);
+	return err;
+
+failed:
+	sock_put(sf->sk);
+	sock_release(sf);
+	return err;
+}
+
 int mptcp_subflow_create_socket(struct sock *sk, struct socket **new_sock)
 {
 	struct mptcp_subflow_context *subflow;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 33/45] mptcp: Implement path manager interface commands
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (31 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 32/45] mptcp: Add handling of outgoing MP_JOIN requests Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 34/45] mptcp: Make MPTCP socket block/wakeup ignore sk_receive_queue Mat Martineau
                   ` (13 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Peter Krystad, cpaasch, fw, pabeni, dcaratti, matthieu.baerts

From: Peter Krystad <peter.krystad@linux.intel.com>

Use the addr_signal flag to indicate to the subflow layer
that a local address may be announced, and call subflow_connect()
to initiate a secondary subflow.

Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/mptcp/pm.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 60 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 933dd805c9b2..7c7c00f9f7e8 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -13,17 +13,74 @@
 int pm_announce_addr(u32 token, sa_family_t family, u8 local_id,
 		     struct in_addr *addr)
 {
-	return -ENOTSUPP;
+	struct mptcp_sock *msk = mptcp_token_get_sock(token);
+	int err = 0;
+
+	if (!msk)
+		return -EINVAL;
+
+	if (msk->pm.local_valid) {
+		err = -EBADR;
+		goto announce_put;
+	}
+
+	pr_debug("msk=%p, local_id=%d, addr=%x", msk, local_id, addr->s_addr);
+	msk->pm.local_valid = 1;
+	msk->pm.local_id = local_id;
+	msk->pm.local_family = family;
+	msk->pm.local_addr.s_addr = addr->s_addr;
+	msk->addr_signal = 1;
+
+announce_put:
+	sock_put((struct sock *)msk);
+	return err;
 }
 
 int pm_remove_addr(u32 token, u8 local_id)
 {
-	return -ENOTSUPP;
+	struct mptcp_sock *msk = mptcp_token_get_sock(token);
+
+	if (!msk)
+		return -EINVAL;
+
+	pr_debug("msk=%p", msk);
+	msk->pm.local_valid = 0;
+
+	sock_put((struct sock *)msk);
+	return 0;
 }
 
 int pm_create_subflow(u32 token, u8 remote_id)
 {
-	return -ENOTSUPP;
+	struct mptcp_sock *msk = mptcp_token_get_sock(token);
+	struct sockaddr_in remote;
+	struct sockaddr_in local;
+	int err;
+
+	if (!msk)
+		return -EINVAL;
+
+	pr_debug("msk=%p", msk);
+
+	if (!msk->pm.remote_valid || remote_id != msk->pm.remote_id) {
+		err = -EBADR;
+		goto create_put;
+	}
+
+	local.sin_family = AF_INET;
+	local.sin_port = 0;
+	local.sin_addr.s_addr = htonl(INADDR_ANY);
+
+	remote.sin_family = msk->pm.remote_family;
+	remote.sin_port = htons(msk->dport);
+	remote.sin_addr.s_addr = msk->pm.remote_addr.s_addr;
+
+	err = mptcp_subflow_connect((struct sock *)msk, &local, &remote,
+				    remote_id);
+
+create_put:
+	sock_put((struct sock *)msk);
+	return err;
 }
 
 int pm_remove_subflow(u32 token, u8 remote_id)
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 34/45] mptcp: Make MPTCP socket block/wakeup ignore sk_receive_queue
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (32 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 33/45] mptcp: Implement path manager interface commands Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 35/45] mptcp: update per unacked sequence on pkt reception Mat Martineau
                   ` (12 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Mat Martineau, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

The MPTCP-level socket doesn't use sk_receive_queue, so it was possible
for mptcp_recvmsg() to remain blocked when there was data ready for it
to read. When the MPTCP socket is waiting for additional data and it
releases the subflow socket lock, the subflow may have incoming packets
ready to process and it sometimes called subflow_data_ready() before the
MPTCP socket called sk_wait_data().

This change adds a new function for the MPTCP socket to use when waiting
for a data ready signal. Atomic bitops with memory barriers are used to
set, test, and clear a MPTCP socket flag that indicates waiting subflow
data. This flag replaces the sk_receive_queue checks used by other
socket types.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/protocol.c | 31 ++++++++++++++++++++++++++++++-
 net/mptcp/protocol.h |  4 ++++
 net/mptcp/subflow.c  |  5 +++++
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 445800eae767..c8ee20963887 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -367,6 +367,31 @@ static enum mapping_status mptcp_get_mapping(struct sock *ssk)
 	return ret;
 }
 
+static void mptcp_wait_data(struct sock *sk, long *timeo)
+{
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	int data_ready;
+
+	add_wait_queue(sk_sleep(sk), &wait);
+	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+
+	release_sock(sk);
+
+	smp_mb__before_atomic();
+	data_ready = test_and_clear_bit(MPTCP_DATA_READY, &msk->flags);
+	smp_mb__after_atomic();
+
+	if (!data_ready)
+		*timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, *timeo);
+
+	sched_annotate_sleep();
+	lock_sock(sk);
+
+	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+	remove_wait_queue(sk_sleep(sk), &wait);
+}
+
 static void warn_bad_map(struct mptcp_subflow_context *subflow, u32 ssn)
 {
 	WARN_ONCE(1, "Bad mapping: ssn=%d map_seq=%d map_data_len=%d",
@@ -423,6 +448,10 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		u64 old_ack;
 		u32 ssn;
 
+		smp_mb__before_atomic();
+		clear_bit(MPTCP_DATA_READY, &msk->flags);
+		smp_mb__after_atomic();
+
 		status = mptcp_get_mapping(ssk);
 
 		if (status == MAPPING_ADDED) {
@@ -550,7 +579,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 		pr_debug("block");
 		release_sock(ssk);
-		sk_wait_data(sk, &timeo, NULL);
+		mptcp_wait_data(sk, &timeo);
 		lock_sock(ssk);
 	}
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 4a1171b75ec6..56df4f46f313 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -74,6 +74,9 @@
 #define MPTCP_ADDR_IPVERSION_4	4
 #define MPTCP_ADDR_IPVERSION_6	6
 
+/* MPTCP socket flags */
+#define MPTCP_DATA_READY	BIT(0)
+
 static inline __be32 mptcp_option(u8 subopt, u8 len, u8 nib, u8 field)
 {
 	return htonl((TCPOPT_MPTCP << 24) | (len << 16) | (subopt << 12) |
@@ -117,6 +120,7 @@ struct mptcp_sock {
 	u64		write_seq;
 	u64		ack_seq;
 	u32		token;
+	unsigned long	flags;
 	u16		dport;
 	struct list_head conn_list;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 257e52d9595e..7a94049587cc 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -311,6 +311,11 @@ static void subflow_data_ready(struct sock *sk)
 
 	if (parent) {
 		pr_debug("parent=%p", parent);
+
+		smp_mb__before_atomic();
+		set_bit(MPTCP_DATA_READY, &mptcp_sk(parent)->flags);
+		smp_mb__after_atomic();
+
 		parent->sk_data_ready(parent);
 	}
 }
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 35/45] mptcp: update per unacked sequence on pkt reception
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (33 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 34/45] mptcp: Make MPTCP socket block/wakeup ignore sk_receive_queue Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 36/45] mptcp: queue data for mptcp level retransmission Mat Martineau
                   ` (11 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

So that we keep per unacked sequence number consistent; since
we update per msk data, use an atomic64 cmpxcgh() to protect
against concurrent updates from multiple subflows.

Initialize the snd_una at connect()/accept() time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/options.c  | 45 ++++++++++++++++++++++++++++++++++++++------
 net/mptcp/protocol.c |  2 ++
 net/mptcp/protocol.h |  1 +
 3 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index ce298ecc64f5..2427fff98091 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -540,6 +540,39 @@ bool mptcp_synack_options(const struct request_sock *req, unsigned int *size,
 	return false;
 }
 
+static u64 expand_ack(u64 old_ack, u64 cur_ack, bool use_64bit)
+{
+	u32 old_ack32, cur_ack32;
+
+	if (use_64bit)
+		return cur_ack;
+
+	old_ack32 = (u32)old_ack;
+	cur_ack32 = (u32)cur_ack;
+	cur_ack = (old_ack & GENMASK_ULL(63, 32)) + cur_ack32;
+	if (unlikely(before(cur_ack32, old_ack32)))
+		return cur_ack + (1LL << 32);
+	return cur_ack;
+}
+
+void update_una(struct mptcp_sock *msk, struct mptcp_options_received *mp_opt)
+{
+	u64 new_snd_una, snd_una, old_snd_una = atomic64_read(&msk->snd_una);
+
+	/* avoid ack expansion on update conflict, to reduce the risk of
+	 * wrongly expanding to a future ack sequence number, which is way
+	 * more dangerous than missing an ack
+	 */
+	new_snd_una = expand_ack(old_snd_una, mp_opt->data_ack, mp_opt->ack64);
+	while (after64(new_snd_una, old_snd_una)) {
+		snd_una = old_snd_una;
+		old_snd_una = atomic64_cmpxchg(&msk->snd_una, snd_una,
+					       new_snd_una);
+		if (old_snd_una == snd_una)
+			break;
+	}
+}
+
 void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
 			    struct tcp_options_received *opt_rx)
 {
@@ -563,6 +596,12 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
 	if (!mp_opt->dss)
 		return;
 
+	/* we can't wait for recvmsg() to update the ack_seq, otherwise
+	 * monodirectional flows will stuck
+	 */
+	if (msk && mp_opt->use_ack)
+		update_una(msk, mp_opt);
+
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
 	if (!mpext)
 		return;
@@ -579,12 +618,6 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
 		mpext->use_checksum = mp_opt->use_checksum;
 	}
 
-	if (mp_opt->use_ack) {
-		mpext->data_ack = mp_opt->data_ack;
-		mpext->use_ack = 1;
-		mpext->ack64 = mp_opt->ack64;
-	}
-
 	mpext->data_fin = mp_opt->data_fin;
 
 	if (msk)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c8ee20963887..2c64435aedd8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -688,6 +688,7 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 
 		mptcp_crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
+		atomic64_set(&msk->snd_una, msk->write_seq);
 		ack_seq++;
 		msk->ack_seq = ack_seq;
 		subflow->map_seq = ack_seq;
@@ -822,6 +823,7 @@ void mptcp_finish_connect(struct sock *sk, int mp_capable)
 
 		mptcp_crypto_key_sha1(msk->remote_key, NULL, &ack_seq);
 		msk->write_seq = subflow->idsn + 1;
+		atomic64_set(&msk->snd_una, msk->write_seq);
 		ack_seq++;
 		msk->ack_seq = ack_seq;
 		subflow->map_seq = ack_seq;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 56df4f46f313..45646e38aa4c 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -119,6 +119,7 @@ struct mptcp_sock {
 	u64		remote_key;
 	u64		write_seq;
 	u64		ack_seq;
+	atomic64_t	snd_una;
 	u32		token;
 	unsigned long	flags;
 	u16		dport;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 36/45] mptcp: queue data for mptcp level retransmission
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (34 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 35/45] mptcp: update per unacked sequence on pkt reception Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 37/45] mptcp: introduce MPTCP retransmission timer Mat Martineau
                   ` (10 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

keep the send page fragment on an MPTCP level retransmission queue.
the queue entries are allocated inside the page frag allocator,
acquiring an additional reference to the page for each list entry.

Also switch to a custom page frag refill function, to ensure that
the current page fragment can always host an MPTCP rtx queue entry.

The MPTCP rtx queue is flushed at disconnect() and close() time

Note that now we need to call __mptcp_init_sock() regarless of mptcp
enable status, as the destructor will try to walk the rtx_queue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 131 ++++++++++++++++++++++++++++++++++++++++---
 net/mptcp/protocol.h |  20 +++++++
 2 files changed, 143 insertions(+), 8 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2c64435aedd8..c2e89e8cbc0e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -69,14 +69,74 @@ static inline bool mptcp_skb_can_collapse_to(const struct mptcp_sock *msk,
 	return mpext && mpext->data_seq + mpext->data_len == msk->write_seq;
 }
 
+static inline bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk,
+					      const struct page_frag *pfrag,
+					      const struct mptcp_data_frag *df)
+{
+	return df && pfrag->page == df->page &&
+		df->data_seq + df->data_len == msk->write_seq;
+}
+
+static void dfrag_clear(struct mptcp_data_frag *dfrag)
+{
+	list_del(&dfrag->list);
+	put_page(dfrag->page);
+}
+
+static void mptcp_clean_una(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_data_frag *dtmp, *dfrag;
+	u64 snd_una = atomic64_read(&msk->snd_una);
+
+	list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list) {
+		if (after64(dfrag->data_seq + dfrag->data_len, snd_una))
+			break;
+
+		dfrag_clear(dfrag);
+	}
+}
+
+/* ensure we get enough memory for the frag hdr, beyond some minimal amount of
+ * data
+ */
+bool mptcp_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+	if (likely(skb_page_frag_refill(32U + sizeof(struct mptcp_data_frag),
+					pfrag, sk->sk_allocation)))
+		return true;
+
+	sk->sk_prot->enter_memory_pressure(sk);
+	sk_stream_moderate_sndbuf(sk);
+	return false;
+}
+
+static inline struct mptcp_data_frag *
+mptcp_carve_data_frag(const struct mptcp_sock *msk, struct page_frag *pfrag,
+		      int orig_offset)
+{
+	int offset = ALIGN(orig_offset, sizeof(long));
+	struct mptcp_data_frag *dfrag;
+
+	dfrag = (struct mptcp_data_frag *)(page_to_virt(pfrag->page) + offset);
+	dfrag->data_len = 0;
+	dfrag->data_seq = msk->write_seq;
+	dfrag->overhead = offset - orig_offset + sizeof(struct mptcp_data_frag);
+	dfrag->offset = offset + sizeof(struct mptcp_data_frag);
+	dfrag->page = pfrag->page;
+
+	return dfrag;
+}
+
 static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 			      struct msghdr *msg, long *timeo, int *pmss_now,
 			      int *ps_goal)
 {
-	int mss_now, avail_size, size_goal, ret;
+	int mss_now, avail_size, size_goal, offset, ret, frag_truesize = 0;
+	bool dfrag_collapsed, collapsed, can_collapse = false;
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	bool collapsed, can_collapse = false;
 	struct mptcp_ext *mpext = NULL;
+	struct mptcp_data_frag *dfrag;
 	struct page_frag *pfrag;
 	struct sk_buff *skb;
 	size_t psize;
@@ -85,10 +145,15 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	 * from one substream to another, but do per subflow memory accounting
 	 */
 	pfrag = sk_page_frag(sk);
-	while (!sk_page_frag_refill(ssk, pfrag)) {
+	while (!mptcp_page_frag_refill(ssk, pfrag)) {
 		ret = sk_stream_wait_memory(ssk, timeo);
 		if (ret)
 			return ret;
+
+		/* id sk_stream_wait_memory() sleeps snd_una can change
+		 * significantly, refresh the rtx queue
+		 */
+		mptcp_clean_una(sk);
 	}
 
 	/* compute copy limit */
@@ -113,11 +178,23 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 		else
 			avail_size = size_goal - skb->len;
 	}
-	psize = min_t(size_t, pfrag->size - pfrag->offset, avail_size);
+
+	/* reuse tail pfrag, if possible, or carve a new one from the page
+	 * allocator
+	 */
+	dfrag = mptcp_rtx_tail(sk);
+	offset = pfrag->offset;
+	dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag);
+	if (!dfrag_collapsed) {
+		dfrag = mptcp_carve_data_frag(msk, pfrag, offset);
+		offset = dfrag->offset;
+		frag_truesize = dfrag->overhead;
+	}
+	psize = min_t(size_t, pfrag->size - offset, avail_size);
 
 	/* Copy to page */
 	pr_debug("left=%zu", msg_data_left(msg));
-	psize = copy_page_from_iter(pfrag->page, pfrag->offset,
+	psize = copy_page_from_iter(pfrag->page, offset,
 				    min_t(size_t, msg_data_left(msg), psize),
 				    &msg->msg_iter);
 	pr_debug("left=%zu", msg_data_left(msg));
@@ -127,13 +204,24 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	/* tell the TCP stack to delay the push so that we can safely
 	 * access the skb after the sendpages call
 	 */
-	ret = do_tcp_sendpages(ssk, pfrag->page, pfrag->offset, psize,
+	ret = do_tcp_sendpages(ssk, pfrag->page, offset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
 		return ret;
+
+	frag_truesize += ret;
 	if (unlikely(ret < psize))
 		iov_iter_revert(&msg->msg_iter, psize - ret);
 
+	/* send successful, keep track of sent data for mptcp-level
+	 * retransmission
+	 */
+	dfrag->data_len += ret;
+	if (!dfrag_collapsed) {
+		get_page(dfrag->page);
+		list_add_tail(&dfrag->list, &msk->rtx_queue);
+	}
+
 	collapsed = skb == tcp_write_queue_tail(ssk);
 	if (collapsed) {
 		WARN_ON_ONCE(!can_collapse);
@@ -162,7 +250,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	 */
 
 out:
-	pfrag->offset += ret;
+	pfrag->offset += frag_truesize;
 	msk->write_seq += ret;
 	mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
 
@@ -209,6 +297,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	}
 
 	lock_sock(ssk);
+	mptcp_clean_una(sk);
 	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 	while (msg_data_left(msg)) {
 		ret = mptcp_sendmsg_frag(sk, ssk, msg, &timeo, &mss_now,
@@ -596,16 +685,31 @@ static int __mptcp_init_sock(struct sock *sk)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
 	INIT_LIST_HEAD(&msk->conn_list);
+	INIT_LIST_HEAD(&msk->rtx_queue);
 
 	return 0;
 }
 
 static int mptcp_init_sock(struct sock *sk)
 {
+	int ret = __mptcp_init_sock(sk);
+
+	if (ret)
+		return ret;
+
 	if (!mptcp_is_enabled(sock_net(sk)))
 		return -ENOPROTOOPT;
 
-	return __mptcp_init_sock(sk);
+	return 0;
+}
+
+static void __mptcp_clear_xmit(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+	struct mptcp_data_frag *dtmp, *dfrag;
+
+	list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list)
+		dfrag_clear(dfrag);
 }
 
 static void mptcp_close(struct sock *sk, long timeout)
@@ -634,10 +738,20 @@ static void mptcp_close(struct sock *sk, long timeout)
 		sock_release(mptcp_subflow_tcp_socket(subflow));
 	}
 
+	__mptcp_clear_xmit(sk);
 	release_sock(sk);
+
 	sk_common_release(sk);
 }
 
+static int mptcp_disconnect(struct sock *sk, int flags)
+{
+	lock_sock(sk);
+	__mptcp_clear_xmit(sk);
+	release_sock(sk);
+	return tcp_disconnect(sk, flags);
+}
+
 static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 				 bool kern)
 {
@@ -863,6 +977,7 @@ static struct proto mptcp_prot = {
 	.name		= "MPTCP",
 	.owner		= THIS_MODULE,
 	.init		= mptcp_init_sock,
+	.disconnect	= mptcp_disconnect,
 	.close		= mptcp_close,
 	.accept		= mptcp_accept,
 	.setsockopt	= mptcp_setsockopt,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 45646e38aa4c..e09db0f85bfc 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -111,6 +111,15 @@ struct mptcp_pm_data {
 	u32	token;
 };
 
+struct mptcp_data_frag {
+	struct list_head list;
+	u64 data_seq;
+	int data_len;
+	int offset;
+	int overhead;
+	struct page *page;
+};
+
 /* MPTCP connection sock */
 struct mptcp_sock {
 	/* inet_connection_sock must be the first member */
@@ -124,6 +133,7 @@ struct mptcp_sock {
 	unsigned long	flags;
 	u16		dport;
 	struct list_head conn_list;
+	struct list_head rtx_queue;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
 	struct mptcp_pm_data	pm;
 	u8		addr_signal;
@@ -137,6 +147,16 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 	return (struct mptcp_sock *)sk;
 }
 
+static inline struct mptcp_data_frag *mptcp_rtx_tail(const struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (list_empty(&msk->rtx_queue))
+		return NULL;
+
+	return list_last_entry(&msk->rtx_queue, struct mptcp_data_frag, list);
+}
+
 struct mptcp_subflow_request_sock {
 	struct	tcp_request_sock sk;
 	u8	mp_capable : 1,
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 37/45] mptcp: introduce MPTCP retransmission timer
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (35 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 36/45] mptcp: queue data for mptcp level retransmission Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 38/45] mptcp: implement memory accounting for mptcp rtx queue Mat Martineau
                   ` (9 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

The timer will be used to schedule retransmission. It's
frequency is based on the current subflow RTO estimation and
is reset on every una_seq update

The timer is clearer for good by __mptcp_clear_xmit()

Also clean MPTCP rtx queue before each transmission

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/options.c  |  4 +-
 net/mptcp/protocol.c | 99 ++++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  2 +
 3 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 2427fff98091..5e575999e281 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -568,8 +568,10 @@ void update_una(struct mptcp_sock *msk, struct mptcp_options_received *mp_opt)
 		snd_una = old_snd_una;
 		old_snd_una = atomic64_cmpxchg(&msk->snd_una, snd_una,
 					       new_snd_una);
-		if (old_snd_una == snd_una)
+		if (old_snd_una == snd_una) {
+			mptcp_reset_timer((struct sock *)msk);
 			break;
+		}
 	}
 }
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c2e89e8cbc0e..1f7a090ed25c 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -19,6 +19,41 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static void mptcp_set_timeout(const struct sock *sk, const struct sock *ssk)
+{
+	unsigned long tout = ssk && inet_csk(ssk)->icsk_pending ?
+			     inet_csk(ssk)->icsk_timeout - jiffies : 0;
+
+	if (tout <= 0)
+		tout = mptcp_sk(sk)->timer_ival;
+	mptcp_sk(sk)->timer_ival =  tout > 0 ? tout : TCP_RTO_MIN;
+}
+
+bool mptcp_timer_pending(struct sock *sk)
+{
+	return timer_pending(&inet_csk(sk)->icsk_retransmit_timer);
+}
+
+void mptcp_reset_timer(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	unsigned long tout;
+
+	/* should never be called with mptcp level timer cleared */
+	tout = READ_ONCE(mptcp_sk(sk)->timer_ival);
+	if (WARN_ON_ONCE(!tout))
+		tout = TCP_RTO_MIN;
+	sk_reset_timer(sk, &icsk->icsk_retransmit_timer, jiffies + tout);
+}
+
+static void mptcp_stop_timer(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	sk_stop_timer(sk, &icsk->icsk_retransmit_timer);
+	mptcp_sk(sk)->timer_ival = 0;
+}
+
 static struct socket *__mptcp_fallback_get_ref(const struct mptcp_sock *msk)
 {
 	sock_owned_by_me((const struct sock *)msk);
@@ -308,10 +343,15 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		copied += ret;
 	}
 
+	mptcp_set_timeout(sk, ssk);
 	if (copied) {
 		ret = copied;
 		tcp_push(ssk, msg->msg_flags, mss_now, tcp_sk(ssk)->nonagle,
 			 size_goal);
+
+		/* start the timer, if it's not pending */
+		if (!mptcp_timer_pending(sk))
+			mptcp_reset_timer(sk);
 	}
 
 	release_sock(ssk);
@@ -680,6 +720,35 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	return copied;
 }
 
+static void mptcp_retransmit_handler(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (atomic64_read(&msk->snd_una) == msk->write_seq)
+		mptcp_stop_timer(sk);
+	else
+		mptcp_reset_timer(sk);
+}
+
+static void mptcp_retransmit_timer(struct timer_list *t)
+{
+	struct inet_connection_sock *icsk = from_timer(icsk, t,
+						       icsk_retransmit_timer);
+	struct sock *sk = &icsk->icsk_inet.sk;
+
+	bh_lock_sock(sk);
+	if (!sock_owned_by_user(sk)) {
+		mptcp_retransmit_handler(sk);
+	} else {
+		/* delegate our work to tcp_release_cb() */
+		if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED,
+				      &sk->sk_tsq_flags))
+			sock_hold(sk);
+	}
+	bh_unlock_sock(sk);
+	sock_put(sk);
+}
+
 static int __mptcp_init_sock(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -687,6 +756,9 @@ static int __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->conn_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
 
+	/* re-use the csk retrans timer for MPTCP-level retrans */
+	timer_setup(&msk->sk.icsk_retransmit_timer, mptcp_retransmit_timer, 0);
+
 	return 0;
 }
 
@@ -708,6 +780,8 @@ static void __mptcp_clear_xmit(struct sock *sk)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_data_frag *dtmp, *dfrag;
 
+	sk_stop_timer(sk, &msk->sk.icsk_retransmit_timer);
+
 	list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list)
 		dfrag_clear(dfrag);
 }
@@ -884,6 +958,30 @@ static int mptcp_getsockopt(struct sock *sk, int level, int optname,
 	return -EOPNOTSUPP;
 }
 
+#define MPTCP_DEFERRED_ALL TCPF_WRITE_TIMER_DEFERRED
+
+/* this is very alike tcp_release_cb() but we must handle differently a
+ * different set of events
+ */
+static void mptcp_release_cb(struct sock *sk)
+{
+	unsigned long flags, nflags;
+
+	do {
+		flags = sk->sk_tsq_flags;
+		if (!(flags & MPTCP_DEFERRED_ALL))
+			return;
+		nflags = flags & ~MPTCP_DEFERRED_ALL;
+	} while (cmpxchg(&sk->sk_tsq_flags, flags, nflags) != flags);
+
+	sock_release_ownership(sk);
+
+	if (flags & TCPF_WRITE_TIMER_DEFERRED) {
+		mptcp_retransmit_handler(sk);
+		__sock_put(sk);
+	}
+}
+
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -986,6 +1084,7 @@ static struct proto mptcp_prot = {
 	.destroy	= mptcp_destroy,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
+	.release_cb	= mptcp_release_cb,
 	.hash		= inet_hash,
 	.unhash		= inet_unhash,
 	.get_port	= mptcp_get_port,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index e09db0f85bfc..ef35d0aa8eb7 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -129,6 +129,7 @@ struct mptcp_sock {
 	u64		write_seq;
 	u64		ack_seq;
 	atomic64_t	snd_una;
+	unsigned long	timer_ival;
 	u32		token;
 	unsigned long	flags;
 	u16		dport;
@@ -252,6 +253,7 @@ void mptcp_get_options(const struct sk_buff *skb,
 
 void mptcp_finish_connect(struct sock *sk, int mp_capable);
 void mptcp_finish_join(struct sock *sk);
+void mptcp_reset_timer(struct sock *sk);
 
 int mptcp_token_new_request(struct request_sock *req);
 void mptcp_token_destroy_request(u32 token);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 38/45] mptcp: implement memory accounting for mptcp rtx queue
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (36 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 37/45] mptcp: introduce MPTCP retransmission timer Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 39/45] mptcp: rework mptcp_sendmsg_frag to accept optional dfrag Mat Martineau
                   ` (8 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

Charge the data on the rtx queue to the master MPTCP socket, too.
Such memory in uncharged when the data is acked/dequeued.

Also account mptcp sockets inuse via a protocol specific pcpu
counter.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 1f7a090ed25c..8319ee807481 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -19,6 +19,8 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+static struct percpu_counter mptcp_sockets_allocated;
+
 static void mptcp_set_timeout(const struct sock *sk, const struct sock *ssk)
 {
 	unsigned long tout = ssk && inet_csk(ssk)->icsk_pending ?
@@ -112,9 +114,10 @@ static inline bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk,
 		df->data_seq + df->data_len == msk->write_seq;
 }
 
-static void dfrag_clear(struct mptcp_data_frag *dfrag)
+static void dfrag_clear(struct sock *sk, struct mptcp_data_frag *dfrag)
 {
 	list_del(&dfrag->list);
+	sk_mem_uncharge(sk, dfrag->data_len + dfrag->overhead);
 	put_page(dfrag->page);
 }
 
@@ -128,8 +131,9 @@ static void mptcp_clean_una(struct sock *sk)
 		if (after64(dfrag->data_seq + dfrag->data_len, snd_una))
 			break;
 
-		dfrag_clear(dfrag);
+		dfrag_clear(sk, dfrag);
 	}
+	sk_mem_reclaim_partial(sk);
 }
 
 /* ensure we get enough memory for the frag hdr, beyond some minimal amount of
@@ -236,6 +240,9 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	if (!psize)
 		return -EINVAL;
 
+	if (!sk_wmem_schedule(sk, psize + dfrag->overhead))
+		return -ENOMEM;
+
 	/* tell the TCP stack to delay the push so that we can safely
 	 * access the skb after the sendpages call
 	 */
@@ -257,6 +264,11 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 		list_add_tail(&dfrag->list, &msk->rtx_queue);
 	}
 
+	/* charge data on mptcp rtx queue to the master socket
+	 * Note: we charge such data both to sk and ssk
+	 */
+	sk->sk_forward_alloc -= frag_truesize;
+
 	collapsed = skb == tcp_write_queue_tail(ssk);
 	if (collapsed) {
 		WARN_ON_ONCE(!can_collapse);
@@ -769,6 +781,8 @@ static int mptcp_init_sock(struct sock *sk)
 	if (ret)
 		return ret;
 
+	sk_sockets_allocated_inc(sk);
+
 	if (!mptcp_is_enabled(sock_net(sk)))
 		return -ENOPROTOOPT;
 
@@ -783,7 +797,7 @@ static void __mptcp_clear_xmit(struct sock *sk)
 	sk_stop_timer(sk, &msk->sk.icsk_retransmit_timer);
 
 	list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list)
-		dfrag_clear(dfrag);
+		dfrag_clear(sk, dfrag);
 }
 
 static void mptcp_close(struct sock *sk, long timeout)
@@ -902,6 +916,7 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 
 static void mptcp_destroy(struct sock *sk)
 {
+	sk_sockets_allocated_dec(sk);
 }
 
 static int mptcp_setsockopt(struct sock *sk, int level, int optname,
@@ -1088,6 +1103,11 @@ static struct proto mptcp_prot = {
 	.hash		= inet_hash,
 	.unhash		= inet_unhash,
 	.get_port	= mptcp_get_port,
+	.sockets_allocated	= &mptcp_sockets_allocated,
+	.memory_allocated	= &tcp_memory_allocated,
+	.memory_pressure	= &tcp_memory_pressure,
+	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
+	.sysctl_mem	= sysctl_tcp_mem,
 	.obj_size	= sizeof(struct mptcp_sock),
 	.no_autobind	= 1,
 };
@@ -1324,6 +1344,9 @@ void mptcp_proto_init(void)
 	mptcp_stream_ops.listen = mptcp_listen;
 	mptcp_stream_ops.shutdown = mptcp_shutdown;
 
+	if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL))
+		panic("Failed to allocate MPTCP pcpu counter\n");
+
 	mptcp_subflow_init();
 	pm_init();
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 39/45] mptcp: rework mptcp_sendmsg_frag to accept optional dfrag
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (37 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 38/45] mptcp: implement memory accounting for mptcp rtx queue Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 40/45] mptcp: implement and use MPTCP-level retransmission Mat Martineau
                   ` (7 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

This will simplify mptcp-level retransmission implementation
in the next patch. If dfrag is provided by the caller, skip
kernel space memory allocation and use data and metadata
provided by the dfrag itself.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 133 +++++++++++++++++++++++++------------------
 1 file changed, 78 insertions(+), 55 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8319ee807481..6366dfd65aa8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -95,7 +95,7 @@ static struct sock *mptcp_subflow_get_ref(const struct mptcp_sock *msk)
 	return NULL;
 }
 
-static inline bool mptcp_skb_can_collapse_to(const struct mptcp_sock *msk,
+static inline bool mptcp_skb_can_collapse_to(u64 write_seq,
 					     const struct sk_buff *skb,
 					     const struct mptcp_ext *mpext)
 {
@@ -103,7 +103,7 @@ static inline bool mptcp_skb_can_collapse_to(const struct mptcp_sock *msk,
 		return false;
 
 	/* can collapse only if MPTCP level sequence is in order */
-	return mpext && mpext->data_seq + mpext->data_len == msk->write_seq;
+	return mpext && mpext->data_seq + mpext->data_len == write_seq;
 }
 
 static inline bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk,
@@ -168,31 +168,44 @@ mptcp_carve_data_frag(const struct mptcp_sock *msk, struct page_frag *pfrag,
 }
 
 static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
-			      struct msghdr *msg, long *timeo, int *pmss_now,
+			      struct msghdr *msg, struct mptcp_data_frag *dfrag,
+			      long *timeo, int *pmss_now,
 			      int *ps_goal)
 {
 	int mss_now, avail_size, size_goal, offset, ret, frag_truesize = 0;
 	bool dfrag_collapsed, collapsed, can_collapse = false;
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct mptcp_ext *mpext = NULL;
-	struct mptcp_data_frag *dfrag;
+	bool retransmission = !!dfrag;
 	struct page_frag *pfrag;
 	struct sk_buff *skb;
+	struct page *page;
+	u64 *write_seq;
 	size_t psize;
+	int *poffset;
 
 	/* use the mptcp page cache so that we can easily move the data
 	 * from one substream to another, but do per subflow memory accounting
+	 * Note: pfrag is used only !retransmission, but the compiler if
+	 * fooled into a warning if we don't init here
 	 */
 	pfrag = sk_page_frag(sk);
-	while (!mptcp_page_frag_refill(ssk, pfrag)) {
-		ret = sk_stream_wait_memory(ssk, timeo);
-		if (ret)
-			return ret;
-
-		/* id sk_stream_wait_memory() sleeps snd_una can change
-		 * significantly, refresh the rtx queue
-		 */
-		mptcp_clean_una(sk);
+	if (!retransmission) {
+		while (!mptcp_page_frag_refill(ssk, pfrag)) {
+			ret = sk_stream_wait_memory(ssk, timeo);
+			if (ret)
+				return ret;
+
+			/* id sk_stream_wait_memory() sleeps snd_una can change
+			 * significantly, refresh the rtx queue
+			 */
+			mptcp_clean_una(sk);
+		}
+		write_seq = &msk->write_seq;
+		page = pfrag->page;
+	} else {
+		write_seq = &dfrag->data_seq;
+		page = dfrag->page;
 	}
 
 	/* compute copy limit */
@@ -211,63 +224,73 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 		 * SSN association set here
 		 */
 		can_collapse = (size_goal - skb->len > 0) &&
-			      mptcp_skb_can_collapse_to(msk, skb, mpext);
+			      mptcp_skb_can_collapse_to(*write_seq, skb, mpext);
 		if (!can_collapse)
 			TCP_SKB_CB(skb)->eor = 1;
 		else
 			avail_size = size_goal - skb->len;
 	}
 
-	/* reuse tail pfrag, if possible, or carve a new one from the page
-	 * allocator
-	 */
-	dfrag = mptcp_rtx_tail(sk);
-	offset = pfrag->offset;
-	dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag);
-	if (!dfrag_collapsed) {
-		dfrag = mptcp_carve_data_frag(msk, pfrag, offset);
+	if (!retransmission) {
+		/* reuse tail pfrag, if possible, or carve a new one from the
+		 * page allocator
+		 */
+		dfrag = mptcp_rtx_tail(sk);
+		offset = pfrag->offset;
+		dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag);
+		if (!dfrag_collapsed) {
+			dfrag = mptcp_carve_data_frag(msk, pfrag, offset);
+			offset = dfrag->offset;
+			frag_truesize = dfrag->overhead;
+		}
+		poffset = &pfrag->offset;
+		psize = min_t(size_t, pfrag->size - offset, avail_size);
+
+		/* Copy to page */
+		pr_debug("left=%zu", msg_data_left(msg));
+		psize = copy_page_from_iter(pfrag->page, offset,
+					    min_t(size_t, msg_data_left(msg),
+						  psize),
+					    &msg->msg_iter);
+		pr_debug("left=%zu", msg_data_left(msg));
+		if (!psize)
+			return -EINVAL;
+
+		if (!sk_wmem_schedule(sk, psize + dfrag->overhead))
+			return -ENOMEM;
+	} else {
 		offset = dfrag->offset;
-		frag_truesize = dfrag->overhead;
+		poffset = &dfrag->offset;
+		psize = min_t(size_t, dfrag->data_len, avail_size);
 	}
-	psize = min_t(size_t, pfrag->size - offset, avail_size);
-
-	/* Copy to page */
-	pr_debug("left=%zu", msg_data_left(msg));
-	psize = copy_page_from_iter(pfrag->page, offset,
-				    min_t(size_t, msg_data_left(msg), psize),
-				    &msg->msg_iter);
-	pr_debug("left=%zu", msg_data_left(msg));
-	if (!psize)
-		return -EINVAL;
-
-	if (!sk_wmem_schedule(sk, psize + dfrag->overhead))
-		return -ENOMEM;
 
 	/* tell the TCP stack to delay the push so that we can safely
 	 * access the skb after the sendpages call
 	 */
-	ret = do_tcp_sendpages(ssk, pfrag->page, offset, psize,
+	ret = do_tcp_sendpages(ssk, page, offset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
 		return ret;
 
 	frag_truesize += ret;
-	if (unlikely(ret < psize))
-		iov_iter_revert(&msg->msg_iter, psize - ret);
+	if (!retransmission) {
+		if (unlikely(ret < psize))
+			iov_iter_revert(&msg->msg_iter, psize - ret);
 
-	/* send successful, keep track of sent data for mptcp-level
-	 * retransmission
-	 */
-	dfrag->data_len += ret;
-	if (!dfrag_collapsed) {
-		get_page(dfrag->page);
-		list_add_tail(&dfrag->list, &msk->rtx_queue);
-	}
+		/* send successful, keep track of sent data for mptcp-level
+		 * retransmission
+		 */
+		dfrag->data_len += ret;
+		if (!dfrag_collapsed) {
+			get_page(dfrag->page);
+			list_add_tail(&dfrag->list, &msk->rtx_queue);
+		}
 
-	/* charge data on mptcp rtx queue to the master socket
-	 * Note: we charge such data both to sk and ssk
-	 */
-	sk->sk_forward_alloc -= frag_truesize;
+		/* charge data on mptcp rtx queue to the master socket
+		 * Note: we charge such data both to sk and ssk
+		 */
+		sk->sk_forward_alloc -= frag_truesize;
+	}
 
 	collapsed = skb == tcp_write_queue_tail(ssk);
 	if (collapsed) {
@@ -281,7 +304,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	mpext = skb_ext_add(skb, SKB_EXT_MPTCP);
 	if (mpext) {
 		memset(mpext, 0, sizeof(*mpext));
-		mpext->data_seq = msk->write_seq;
+		mpext->data_seq = *write_seq;
 		mpext->subflow_seq = mptcp_subflow_ctx(ssk)->rel_write_seq;
 		mpext->data_len = ret;
 		mpext->checksum = 0xbeef;
@@ -297,8 +320,8 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	 */
 
 out:
-	pfrag->offset += frag_truesize;
-	msk->write_seq += ret;
+	*poffset += frag_truesize;
+	*write_seq += ret;
 	mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
 
 	return ret;
@@ -347,7 +370,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	mptcp_clean_una(sk);
 	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 	while (msg_data_left(msg)) {
-		ret = mptcp_sendmsg_frag(sk, ssk, msg, &timeo, &mss_now,
+		ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now,
 					 &size_goal);
 		if (ret < 0)
 			break;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 40/45] mptcp: implement and use MPTCP-level retransmission
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (38 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 39/45] mptcp: rework mptcp_sendmsg_frag to accept optional dfrag Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 41/45] selftests: mptcp: make tc delays random Mat Martineau
                   ` (6 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Paolo Abeni, cpaasch, fw, peter.krystad, dcaratti, matthieu.baerts

From: Paolo Abeni <pabeni@redhat.com>

On timeout event, schedule a work queue to do the retransmission.
Retransmission code resemple closely sendmsg() implementation and
re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags'
sake - and peeking the relevant dfrag from the rtx head.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 81 ++++++++++++++++++++++++++++++++++++++++++--
 net/mptcp/protocol.h | 11 ++++++
 2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6366dfd65aa8..b54fecf528b9 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -759,10 +759,12 @@ static void mptcp_retransmit_handler(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
-	if (atomic64_read(&msk->snd_una) == msk->write_seq)
+	if (atomic64_read(&msk->snd_una) == msk->write_seq) {
 		mptcp_stop_timer(sk);
-	else
-		mptcp_reset_timer(sk);
+	} else {
+		if (schedule_work(&msk->rtx_work))
+			sock_hold(sk);
+	}
 }
 
 static void mptcp_retransmit_timer(struct timer_list *t)
@@ -784,6 +786,66 @@ static void mptcp_retransmit_timer(struct timer_list *t)
 	sock_put(sk);
 }
 
+static void mptcp_retransmit(struct work_struct *work)
+{
+	int orig_len, orig_offset, ret, mss_now = 0, size_goal = 0;
+	struct mptcp_data_frag *dfrag;
+	struct sock *ssk, *sk;
+	struct mptcp_sock *msk;
+	u64 orig_write_seq;
+	size_t copied = 0;
+	struct msghdr msg;
+	long timeo = 0;
+
+	msk = container_of(work, struct mptcp_sock, rtx_work);
+	sk = &msk->sk.icsk_inet.sk;
+
+	lock_sock(sk);
+	mptcp_clean_una(sk);
+	dfrag = mptcp_rtx_head(sk);
+	if (!dfrag)
+		goto unlock;
+
+	ssk = mptcp_subflow_get_ref(msk);
+	if (!ssk)
+		goto reset_unlock;
+
+	lock_sock(ssk);
+
+	msg.msg_flags = MSG_DONTWAIT;
+	orig_len = dfrag->data_len;
+	orig_offset = dfrag->offset;
+	orig_write_seq = dfrag->data_seq;
+	while (dfrag->data_len > 0) {
+		ret = mptcp_sendmsg_frag(sk, ssk, &msg, dfrag, &timeo, &mss_now,
+					 &size_goal);
+		if (ret < 0)
+			break;
+
+		copied += ret;
+		dfrag->data_len -= ret;
+	}
+	if (copied)
+		tcp_push(ssk, msg.msg_flags, mss_now, tcp_sk(ssk)->nonagle,
+			 size_goal);
+
+	dfrag->data_seq = orig_write_seq;
+	dfrag->offset = orig_offset;
+	dfrag->data_len = orig_len;
+
+	mptcp_set_timeout(sk, ssk);
+	release_sock(ssk);
+	sock_put(ssk);
+
+reset_unlock:
+	if (!mptcp_timer_pending(sk))
+		mptcp_reset_timer(sk);
+
+unlock:
+	release_sock(sk);
+	sock_put(sk);
+}
+
 static int __mptcp_init_sock(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -791,6 +853,8 @@ static int __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->conn_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
 
+	INIT_WORK(&msk->rtx_work, mptcp_retransmit);
+
 	/* re-use the csk retrans timer for MPTCP-level retrans */
 	timer_setup(&msk->sk.icsk_retransmit_timer, mptcp_retransmit_timer, 0);
 
@@ -823,6 +887,14 @@ static void __mptcp_clear_xmit(struct sock *sk)
 		dfrag_clear(sk, dfrag);
 }
 
+static void mptcp_cancel_rtx_work(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (cancel_work_sync(&msk->rtx_work))
+		sock_put(sk);
+}
+
 static void mptcp_close(struct sock *sk, long timeout)
 {
 	struct mptcp_subflow_context *subflow, *tmp;
@@ -852,6 +924,8 @@ static void mptcp_close(struct sock *sk, long timeout)
 	__mptcp_clear_xmit(sk);
 	release_sock(sk);
 
+	mptcp_cancel_rtx_work(sk);
+
 	sk_common_release(sk);
 }
 
@@ -860,6 +934,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
 	lock_sock(sk);
 	__mptcp_clear_xmit(sk);
 	release_sock(sk);
+	mptcp_cancel_rtx_work(sk);
 	return tcp_disconnect(sk, flags);
 }
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ef35d0aa8eb7..0c6bc8617cf4 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -133,6 +133,7 @@ struct mptcp_sock {
 	u32		token;
 	unsigned long	flags;
 	u16		dport;
+	struct work_struct rtx_work;
 	struct list_head conn_list;
 	struct list_head rtx_queue;
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */
@@ -158,6 +159,16 @@ static inline struct mptcp_data_frag *mptcp_rtx_tail(const struct sock *sk)
 	return list_last_entry(&msk->rtx_queue, struct mptcp_data_frag, list);
 }
 
+static inline struct mptcp_data_frag *mptcp_rtx_head(const struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (list_empty(&msk->rtx_queue))
+		return NULL;
+
+	return list_first_entry(&msk->rtx_queue, struct mptcp_data_frag, list);
+}
+
 struct mptcp_subflow_request_sock {
 	struct	tcp_request_sock sk;
 	u8	mp_capable : 1,
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 41/45] selftests: mptcp: make tc delays random
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (39 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 40/45] mptcp: implement and use MPTCP-level retransmission Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 42/45] selftests: mptcp: extend mptcp_connect tool for ipv6 family Mat Martineau
                   ` (5 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

test with less predictable setups:
tc qdisc delay is now random, same for reordering and loss.
Main motivation is to cover more scenarious without a large
increase in test-time.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../selftests/net/mptcp/mptcp_connect.sh      | 28 +++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
index d029bdc5946d..fb9bf9f4fc8b 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
@@ -277,8 +277,32 @@ for sender in 1 2 3 4;do
 	do_ping ns4 ns$sender 10.0.3.1
 done
 
-tc -net ns2 qdisc add dev ns2eth3 root netem loss random 1
-tc -net ns3 qdisc add dev ns3eth4 root netem delay 10ms reorder 25% 50% gap 5
+loss=$((RANDOM%101))
+if [ $loss -eq 100 ] ;then
+	loss=1%
+	tc -net ns2 qdisc add dev ns2eth3 root netem loss random $loss
+elif [ $loss -ge 10 ]; then
+	loss=0.$loss%
+	tc -net ns2 qdisc add dev ns2eth3 root netem loss random $loss
+elif [ $loss -ge 1 ]; then
+	loss=0.0$loss%
+	tc -net ns2 qdisc add dev ns2eth3 root netem loss random $loss
+fi
+
+delay=$((RANDOM%1200))
+reorder1=$((RANDOM%25))
+reorder2=$((RANDOM%50))
+gap=$((RANDOM%100))
+if [ $gap -gt 0 ]; then
+	gap="gap $gap"
+else
+	gap=""
+fi
+
+if [ $reorder1 -gt 0 ] && [ $reorder2 -gt 0 ]; then
+  tc -net ns3 qdisc add dev ns3eth4 root netem delay ${delay}ms reorder ${reorder1}% ${reorder2}% $gap
+fi
+echo "INFO: Using loss of $loss, delay $delay ms, reorder: $reorder1, $reorder2 $gap on ns3eth4"
 
 for sender in 1 2 3 4;do
 	run_tests ns1 ns$sender 10.0.1.1
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 42/45] selftests: mptcp: extend mptcp_connect tool for ipv6 family
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (40 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 41/45] selftests: mptcp: make tc delays random Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 43/45] selftests: mptcp: add accept/getpeer checks Mat Martineau
                   ` (4 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

At this time socket() will fail when requesting an ipv6 mptcp socket.
This is ok for now, as the test script won't request ipv6 tests yet.

Tested with a tcp <-> tcp connection.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../testing/selftests/net/mptcp/mptcp_connect.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c
index cac71f0ac8f8..32a1630b7fa3 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.c
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c
@@ -32,11 +32,11 @@ static int  poll_timeout;
 static const char *cfg_host;
 static const char *cfg_port	= "12000";
 static int cfg_sock_proto	= IPPROTO_MPTCP;
-
+static int pf = AF_INET;
 
 static void die_usage(void)
 {
-	fprintf(stderr, "Usage: mptcp_connect [-s MPTCP|TCP] [-p port] "
+	fprintf(stderr, "Usage: mptcp_connect [ -6 ] [-s MPTCP|TCP] [-p port] "
 		"[ -l ] [ -t timeout ] connect_address\n");
 	exit(1);
 }
@@ -74,12 +74,13 @@ static int sock_listen_mptcp(const char * const listenaddr,
 		.ai_flags = AI_PASSIVE | AI_NUMERICHOST
 	};
 
-	hints.ai_family = AF_INET;
+	hints.ai_family = pf;
 
 	struct addrinfo *a, *addr;
 	int one = 1;
 
 	xgetaddrinfo(listenaddr, port, &hints, &addr);
+	hints.ai_family = pf;
 
 	for (a = addr; a; a = a->ai_next) {
 		sock = socket(a->ai_family, a->ai_socktype, cfg_sock_proto);
@@ -124,7 +125,7 @@ static int sock_connect_mptcp(const char * const remoteaddr,
 	struct addrinfo *a, *addr;
 	int sock = -1;
 
-	hints.ai_family = AF_INET;
+	hints.ai_family = pf;
 
 	xgetaddrinfo(remoteaddr, port, &hints, &addr);
 	for (a = addr; a; a = a->ai_next) {
@@ -362,7 +363,7 @@ static void parse_opts(int argc, char **argv)
 {
 	int c;
 
-	while ((c = getopt(argc, argv, "lp:s:ht:")) != -1) {
+	while ((c = getopt(argc, argv, "6lp:s:ht:")) != -1) {
 		switch (c) {
 		case 'l':
 			listen_mode = true;
@@ -376,6 +377,9 @@ static void parse_opts(int argc, char **argv)
 		case 'h':
 			die_usage();
 			break;
+		case '6':
+			pf = AF_INET6;
+			break;
 		case 't':
 			poll_timeout = atoi(optarg) * 1000;
 			if (poll_timeout <= 0)
@@ -387,6 +391,9 @@ static void parse_opts(int argc, char **argv)
 	if (optind + 1 != argc)
 		die_usage();
 	cfg_host = argv[optind];
+
+	if (strchr(cfg_host, ':'))
+		pf = AF_INET6;
 }
 
 int main(int argc, char *argv[])
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 43/45] selftests: mptcp: add accept/getpeer checks
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (41 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 42/45] selftests: mptcp: extend mptcp_connect tool for ipv6 family Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 44/45] selftests: mptcp: add ipv6 connectivity Mat Martineau
                   ` (3 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

Check that the result coming from accept matches that of getpeername.
For initiator side, check getpeername matches address passed to connect().

At this time, kernel returns the address of the first subflow in the list.
Right now, we do not yet implement fate-sharing, i.e. if the original
subflow goes away, the result of getpeername would change.

There are different ways to fix that.  One would be to copy the quadruple
from the first subflow socket to the mptcp socket.

If we do that, this test should catch a possible bug (such as accidental
reversal of saddr/daddr).

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../selftests/net/mptcp/mptcp_connect.c       | 107 ++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c
index 32a1630b7fa3..8df2b1b24695 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.c
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c
@@ -49,6 +49,21 @@ static const char *getxinfo_strerr(int err)
 	return gai_strerror(err);
 }
 
+static void xgetnameinfo(const struct sockaddr *addr, socklen_t addrlen,
+			 char *host, socklen_t hostlen,
+			 char *serv, socklen_t servlen)
+{
+	int flags = NI_NUMERICHOST | NI_NUMERICSERV;
+	int err = getnameinfo(addr, addrlen, host, hostlen, serv, servlen, flags);
+
+	if (err) {
+		const char *errstr = getxinfo_strerr(err);
+
+		fprintf(stderr, "Fatal: getnameinfo: %s\n", errstr);
+		exit(1);
+	}
+}
+
 static void xgetaddrinfo(const char *node, const char *service,
 			 const struct addrinfo *hints,
 			 struct addrinfo **res)
@@ -285,6 +300,94 @@ static int copyfd_io(int infd, int peerfd, int outfd)
 	return 0;
 }
 
+static void check_sockaddr(int pf, struct sockaddr_storage *ss,
+			   socklen_t salen)
+{
+	char addr[INET6_ADDRSTRLEN];
+	char serv[INET6_ADDRSTRLEN];
+	struct sockaddr_in6 *sin6;
+	struct sockaddr_in *sin;
+	int wanted_size = 0;
+
+	switch (pf) {
+	case AF_INET:
+		wanted_size = sizeof(*sin);
+		sin = (void *)ss;
+		if (!sin->sin_port)
+			fprintf(stderr, "accept: something wrong: ip connection from port 0");
+		break;
+	case AF_INET6:
+		wanted_size = sizeof(*sin6);
+		sin6 = (void *)ss;
+		if (!sin6->sin6_port)
+			fprintf(stderr, "accept: something wrong: ipv6 connection from port 0");
+		break;
+	default:
+		fprintf(stderr, "accept: Unknown pf %d, salen %u\n", pf, salen);
+		return;
+	}
+
+	if (salen != wanted_size)
+		fprintf(stderr, "accept: size mismatch, got %d expected %d\n",
+				(int)salen, wanted_size);
+
+	if (ss->ss_family != pf)
+		fprintf(stderr, "accept: pf mismatch, expect %d, ss_family is %d\n",
+				(int)ss->ss_family, pf);
+}
+
+static void check_getpeername(int fd, struct sockaddr_storage *ss, socklen_t salen)
+{
+	struct sockaddr_storage peerss;
+	socklen_t peersalen = sizeof(peerss);
+
+	if (getpeername(fd, (struct sockaddr *)&peerss, &peersalen)) {
+		perror("getpeername");
+		return;
+	}
+
+	if (peersalen != salen) {
+		fprintf(stderr, "%s: %d vs %d\n", __func__, peersalen, salen);
+		return;
+	}
+
+	if (memcmp(ss, &peerss, peersalen)) {
+		char a[INET6_ADDRSTRLEN];
+		char b[INET6_ADDRSTRLEN];
+		char c[INET6_ADDRSTRLEN];
+		char d[INET6_ADDRSTRLEN];
+
+		xgetnameinfo((struct sockaddr *)ss, salen,
+			     a, sizeof(a), b, sizeof(b));
+
+		xgetnameinfo((struct sockaddr *)&peerss, peersalen,
+			     c, sizeof(c), d, sizeof(d));
+
+		fprintf(stderr, "%s: memcmp failure: accept %s vs peername %s, %s vs %s salen %d vs %d\n",
+			__func__, a, c, b, d, peersalen, salen);
+	}
+}
+
+static void check_getpeername_connect(int fd)
+{
+	struct sockaddr_storage ss;
+	socklen_t salen = sizeof(ss);
+	char a[INET6_ADDRSTRLEN];
+	char b[INET6_ADDRSTRLEN];
+
+	if (!getpeername(fd, (struct sockaddr *)&ss, &salen)) {
+		perror("getpeername");
+		return;
+	}
+
+	xgetnameinfo((struct sockaddr *)&ss, salen,
+		     a, sizeof(a), b, sizeof(b));
+
+	if (strcmp(cfg_host, a) || strcmp(cfg_port, b))
+		fprintf(stderr, "%s: %s vs %s, %s vs %s\n", __func__,
+			cfg_host, a, cfg_port, b);
+}
+
 int main_loop_s(int listensock)
 {
 	struct sockaddr_storage ss;
@@ -308,6 +411,9 @@ int main_loop_s(int listensock)
 	salen = sizeof(ss);
 	remotesock = accept(listensock, (struct sockaddr *)&ss, &salen);
 	if (remotesock >= 0) {
+		check_sockaddr(pf, &ss, salen);
+		check_getpeername(remotesock, &ss, salen);
+
 		copyfd_io(0, remotesock, 1);
 		return 0;
 	}
@@ -342,6 +448,7 @@ int main_loop(void)
 	if (fd < 0)
 		return 2;
 
+	check_getpeername_connect(fd);
 	return copyfd_io(0, fd, 1);
 }
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 44/45] selftests: mptcp: add ipv6 connectivity
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (42 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 43/45] selftests: mptcp: add accept/getpeer checks Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:36 ` [RFC PATCH v2 45/45] selftests: mptcp: random ethtool tweaking Mat Martineau
                   ` (2 subsequent siblings)
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

prepare for ipv6 mptcp tests.
Once someone starts to implement mptcp v6 support, just set ipv6=true in
the script and the selftest will attempt to connect via ipv6.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../selftests/net/mptcp/mptcp_connect.sh      | 77 ++++++++++++++++---
 1 file changed, 66 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
index fb9bf9f4fc8b..615691434a34 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
@@ -9,6 +9,7 @@ cout=""
 ksft_skip=4
 capture=0
 timeout=30
+ipv6=false
 
 TEST_COUNT=0
 
@@ -58,29 +59,48 @@ ip link add ns2eth3 netns ns2 type veth peer name ns3eth2 netns ns3
 ip link add ns3eth4 netns ns3 type veth peer name ns4eth3 netns ns4
 
 ip -net ns1 addr add 10.0.1.1/24 dev ns1eth2
+if $ipv6 ; then
+	ip -net ns1 addr add dead:beef:1::1/64 dev ns1eth2
+	if [ $? -ne 0 ] ;then
+		echo "SKIP: Can't add ipv6 address, skip ipv6 tests" 1>&2
+		ipv6=false
+	fi
+fi
+
 ip -net ns1 link set ns1eth2 up
 ip -net ns1 route add default via 10.0.1.2
+$ipv6 && ip -net ns1 route add default via dead:beef:1::2
 
 ip -net ns2 addr add 10.0.1.2/24 dev ns2eth1
+$ipv6 && ip -net ns2 addr add dead:beef:1::2/64 dev ns2eth1
 ip -net ns2 link set ns2eth1 up
 
 ip -net ns2 addr add 10.0.2.1/24 dev ns2eth3
+$ipv6 && ip -net ns2 addr add dead:beef:2::1/64 dev ns2eth3
 ip -net ns2 link set ns2eth3 up
 ip -net ns2 route add default via 10.0.2.2
+$ipv6 && ip -net ns2 route add default via dead:beef:2::2
 ip netns exec ns2 sysctl -q net.ipv4.ip_forward=1
+$ipv6 && ip netns exec ns2 sysctl -q net.ipv6.conf.all.forwarding=1
 
 ip -net ns3 addr add 10.0.2.2/24 dev ns3eth2
+$ipv6 && ip -net ns3 addr add dead:beef:2::2/64 dev ns3eth2
 ip -net ns3 link set ns3eth2 up
 
 ip -net ns3 addr add 10.0.3.2/24 dev ns3eth4
+$ipv6 && ip -net ns3 addr add dead:beef:3::2/64 dev ns3eth4
 ip -net ns3 link set ns3eth4 up
 ip -net ns3 route add default via 10.0.2.1
+$ipv6 && ip -net ns3 route add default via dead:beef:2::1
 ip netns exec ns3 ethtool -K ns3eth2 tso off 2>/dev/null
 ip netns exec ns3 sysctl -q net.ipv4.ip_forward=1
+$ipv6 && ip netns exec ns3 sysctl -q net.ipv6.conf.all.forwarding=1
 
 ip -net ns4 addr add 10.0.3.1/24 dev ns4eth3
+$ipv6 && ip -net ns4 addr add dead:beef:3::1/64 dev ns4eth3
 ip -net ns4 link set ns4eth3 up
 ip -net ns4 route add default via 10.0.3.2
+$ipv6 && ip -net ns4 route add default via dead:beef:3::2
 
 print_file_err()
 {
@@ -138,7 +158,11 @@ do_ping()
 	if [ $? -ne 0 ] ; then
 		echo "$listener_ns -> $connect_addr connectivity [ FAIL ]" 1>&2
 		ret=1
+
+		return 1
 	fi
+
+	return 0
 }
 
 do_transfer()
@@ -148,6 +172,7 @@ do_transfer()
 	cl_proto="$3"
 	srv_proto="$4"
 	connect_addr="$5"
+	local_addr="$6"
 
 	port=$((10000+$TEST_COUNT))
 	TEST_COUNT=$((TEST_COUNT+1))
@@ -173,7 +198,7 @@ do_transfer()
 	    sleep 1
 	fi
 
-	ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} 0.0.0.0 < "$sin" > "$sout" &
+	ip netns exec ${listener_ns} ./mptcp_connect -t $timeout -l -p $port -s ${srv_proto} $local_addr < "$sin" > "$sout" &
 	spid=$!
 
 	sleep 1
@@ -241,23 +266,26 @@ run_tests()
 	listener_ns="$1"
 	connector_ns="$2"
 	connect_addr="$3"
+	local_addr="$4"
 	lret=0
 
 	for proto in MPTCP TCP;do
-		do_transfer ${listener_ns} ${connector_ns} MPTCP "$proto" ${connect_addr}
+		do_transfer ${listener_ns} ${connector_ns} MPTCP "$proto" ${connect_addr} ${local_addr}
 		lret=$?
 		if [ $lret -ne 0 ]; then
 			ret=$lret
-			return
+			return 1
 		fi
 	done
 
-	do_transfer ${listener_ns} ${connector_ns} TCP MPTCP ${connect_addr}
+	do_transfer ${listener_ns} ${connector_ns} TCP MPTCP ${connect_addr} ${local_addr}
 	lret=$?
 	if [ $lret -ne 0 ]; then
 		ret=$lret
-		return
+		return 1
 	fi
+
+	return 0
 }
 
 make_file "$cin" "client"
@@ -265,16 +293,31 @@ make_file "$sin" "server"
 
 check_mptcp_disabled
 
+# Allow DAD to finish
+$ipv6 && sleep 2
+
 for sender in 1 2 3 4;do
 	do_ping ns1 ns$sender 10.0.1.1
+	if $ipv6;then
+		do_ping ns1 ns$sender dead:beef:1::1
+		if [ $? -ne 0 ]; then
+			echo "SKIP: IPv6 tests" 2>&1
+			ipv6=false
+		fi
+	fi
 
 	do_ping ns2 ns$sender 10.0.1.2
+	$ipv6 && do_ping ns2 ns$sender dead:beef:1::2
 	do_ping ns2 ns$sender 10.0.2.1
+	$ipv6 && do_ping ns2 ns$sender dead:beef:2::1
 
 	do_ping ns3 ns$sender 10.0.2.2
+	$ipv6 && do_ping ns3 ns$sender dead:beef:2::2
 	do_ping ns3 ns$sender 10.0.3.2
+	$ipv6 && do_ping ns3 ns$sender dead:beef:3::2
 
 	do_ping ns4 ns$sender 10.0.3.1
+	$ipv6 && do_ping ns4 ns$sender dead:beef:3::1
 done
 
 loss=$((RANDOM%101))
@@ -305,15 +348,27 @@ fi
 echo "INFO: Using loss of $loss, delay $delay ms, reorder: $reorder1, $reorder2 $gap on ns3eth4"
 
 for sender in 1 2 3 4;do
-	run_tests ns1 ns$sender 10.0.1.1
+	run_tests ns1 ns$sender 10.0.1.1 0.0.0.0
+	if $ipv6;then
+		run_tests ns1 ns$sender dead:beef:1::1 ::
+		if [ $? -ne 0 ] ;then
+			echo "SKIP: IPv6 tests" 2>&1
+			ipv6=false
+		fi
+	fi
 
-	run_tests ns2 ns$sender 10.0.1.2
-	run_tests ns2 ns$sender 10.0.2.1
+	run_tests ns2 ns$sender 10.0.1.2 0.0.0.0
+	$ipv6 && run_tests ns2 ns$sender dead:beef:1::2 ::
+	run_tests ns2 ns$sender 10.0.2.1 0.0.0.0
+	$ipv6 && run_tests ns2 ns$sender dead:beef:2::1 ::
 
-	run_tests ns3 ns$sender 10.0.2.2
-	run_tests ns3 ns$sender 10.0.3.2
+	run_tests ns3 ns$sender 10.0.2.2 0.0.0.0
+	$ipv6 && run_tests ns3 ns$sender dead:beef:2::2 ::
+	run_tests ns3 ns$sender 10.0.3.2 0.0.0.0
+	$ipv6 && run_tests ns3 ns$sender dead:beef:3::2 ::
 
-	run_tests ns4 ns$sender 10.0.3.1
+	run_tests ns4 ns$sender 10.0.3.1 0.0.0.0
+	$ipv6 && run_tests ns4 ns$sender dead:beef:3::1 ::
 done
 
 exit $ret
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v2 45/45] selftests: mptcp: random ethtool tweaking
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (43 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 44/45] selftests: mptcp: add ipv6 connectivity Mat Martineau
@ 2019-10-02 23:36 ` Mat Martineau
  2019-10-02 23:53 ` [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
  2019-10-03  0:12 ` David Miller
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:36 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: Florian Westphal, cpaasch, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Florian Westphal <fw@strlen.de>

Instead of unconditionally disabling TSO in ns3, turn off any of
gso/tso/gro in ns3 and/or ns4.

This gets us various combinations of GRO/GSO/TSO without a large
impact on test time.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../selftests/net/mptcp/mptcp_connect.sh      | 29 ++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
index 615691434a34..40cce8d1772e 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
@@ -92,7 +92,6 @@ $ipv6 && ip -net ns3 addr add dead:beef:3::2/64 dev ns3eth4
 ip -net ns3 link set ns3eth4 up
 ip -net ns3 route add default via 10.0.2.1
 $ipv6 && ip -net ns3 route add default via dead:beef:2::1
-ip netns exec ns3 ethtool -K ns3eth2 tso off 2>/dev/null
 ip netns exec ns3 sysctl -q net.ipv4.ip_forward=1
 $ipv6 && ip netns exec ns3 sysctl -q net.ipv6.conf.all.forwarding=1
 
@@ -102,6 +101,34 @@ ip -net ns4 link set ns4eth3 up
 ip -net ns4 route add default via 10.0.3.2
 $ipv6 && ip -net ns4 route add default via dead:beef:3::2
 
+set_ethtool_flags() {
+	ns=$1
+	dev=$2
+
+	r=$RANDOM
+
+	pick1=$((r & 1))
+	r=$((r>>1))
+	pick2=$((r & 1))
+	r=$((r>>1))
+	pick3=$((r & 1))
+
+	comma=""
+	flags=""
+	[ $pick1 -ne 0 ] && flags="tso"
+	[ $pick1 -ne 0 ] && comma=","
+	[ $pick2 -ne 0 ] && flags=${flags}${comma}"gso"
+	[ $pick3 -ne 0 ] && flags=${flags}${comma}"gro"
+
+	[ -z $flags ] && return
+
+	ip netns exec $ns ethtool -K $dev $flags off 2>/dev/null
+	[ $? -eq 0 ] && echo "INFO: set $ns dev $dev: ethtool -K $flags off"
+}
+
+set_ethtool_flags ns3 ns3eth2
+set_ethtool_flags ns4 ns4eth3
+
 print_file_err()
 {
 	ls -l "$1" 1>&2
-- 
2.23.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v2 00/45] Multipath TCP
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (44 preceding siblings ...)
  2019-10-02 23:36 ` [RFC PATCH v2 45/45] selftests: mptcp: random ethtool tweaking Mat Martineau
@ 2019-10-02 23:53 ` Mat Martineau
  2019-10-03  0:12 ` David Miller
  46 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-02 23:53 UTC (permalink / raw)
  To: netdev, edumazet
  Cc: cpaasch, fw, pabeni, peter.krystad, dcaratti, matthieu.baerts


On Wed, 2 Oct 2019, Mat Martineau wrote:

> The MPTCP upstreaming community has prepared a net-next RFCv2 patch set
> for review.
>
> Clone/fetch:
> https://github.com/multipath-tcp/mptcp_net-next.git (tag: netdev-rfcv2)
>
> Browse:
> https://github.com/multipath-tcp/mptcp_net-next/tree/netdev-rfcv2


Huge apologies everyone: while the branches referenced above are correct, 
the email patch series I just posted was not generated from the correct 
git commits.

Please disregard, I will send v3 shortly.

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v2 00/45] Multipath TCP
  2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
                   ` (45 preceding siblings ...)
  2019-10-02 23:53 ` [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
@ 2019-10-03  0:12 ` David Miller
  2019-10-03  0:27   ` Mat Martineau
  46 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2019-10-03  0:12 UTC (permalink / raw)
  To: mathew.j.martineau
  Cc: netdev, edumazet, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts

From: Mat Martineau <mathew.j.martineau@linux.intel.com>
Date: Wed,  2 Oct 2019 16:36:10 -0700

> The MPTCP upstreaming community has prepared a net-next RFCv2 patch set
> for review.

Nobody is going to read 45 patches and properly review them.

And I do mean nobody.

Please make smaller, more reasonable (like 12-20 MAX), patch sets to
start building up the MPTCP infrastructure.

This is for your sake as well as everyone else's.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v2 00/45] Multipath TCP
  2019-10-03  0:12 ` David Miller
@ 2019-10-03  0:27   ` Mat Martineau
  0 siblings, 0 replies; 49+ messages in thread
From: Mat Martineau @ 2019-10-03  0:27 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, edumazet, cpaasch, fw, pabeni, peter.krystad, dcaratti,
	matthieu.baerts


On Wed, 2 Oct 2019, David Miller wrote:

> From: Mat Martineau <mathew.j.martineau@linux.intel.com>
> Date: Wed,  2 Oct 2019 16:36:10 -0700
>
>> The MPTCP upstreaming community has prepared a net-next RFCv2 patch set
>> for review.
>
> Nobody is going to read 45 patches and properly review them.
>
> And I do mean nobody.
>
> Please make smaller, more reasonable (like 12-20 MAX), patch sets to
> start building up the MPTCP infrastructure.
>
> This is for your sake as well as everyone else's.

Thanks David.

We were proposing a shorter series for our later non-RFC posting, but will 
do some further squashing and partitioning to fit in the under 12-20 range 
for all future MPTCP patch sets posted on netdev.

And for my sake as well as everyone elses, I'll skip resending the 
correction to this series since it's only slightly shorter.

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2019-10-03  0:27 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-02 23:36 [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 01/45] tcp: Add MPTCP option number Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 02/45] net: Make sock protocol value checks more specific Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 03/45] sock: Make sk_protocol a 16-bit value Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 04/45] tcp: Define IPPROTO_MPTCP Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 05/45] mptcp: Add MPTCP socket stubs Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 06/45] mptcp: Handle MPTCP TCP options Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 07/45] mptcp: Associate MPTCP context with TCP socket Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 08/45] tcp: Expose tcp struct and routine for MPTCP Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 09/45] mptcp: Handle MP_CAPABLE options for outgoing connections Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 10/45] mptcp: add mptcp_poll Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 11/45] tcp, ulp: Add clone operation to tcp_ulp_ops Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 12/45] mptcp: Create SUBFLOW socket for incoming connections Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 13/45] mptcp: Add key generation and token tree Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 14/45] mptcp: Add shutdown() socket operation Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 15/45] mptcp: Add setsockopt()/getsockopt() socket operations Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 16/45] tcp: clean ext on tx recycle Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 17/45] mptcp: Add MPTCP to skb extensions Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 18/45] tcp: Prevent coalesce/collapse when skb has MPTCP extensions Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 19/45] tcp: Export low-level TCP functions Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 20/45] mptcp: Write MPTCP DSS headers to outgoing data packets Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 21/45] mptcp: Implement MPTCP receive path Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 22/45] mptcp: use sk_page_frag() in sendmsg Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 23/45] mptcp: sendmsg() do spool all the provided data Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 24/45] mptcp: allow collapsing consecutive sendpages on the same substream Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 25/45] tcp: Check for filled TCP option space before SACK Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 26/45] mptcp: Add path manager interface Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 27/45] mptcp: Add ADD_ADDR handling Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 28/45] mptcp: Add handling of incoming MP_JOIN requests Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 29/45] mptcp: harmonize locking on all socket operations Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 30/45] mptcp: new sysctl to control the activation per NS Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 31/45] mptcp: add basic kselftest for mptcp Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 32/45] mptcp: Add handling of outgoing MP_JOIN requests Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 33/45] mptcp: Implement path manager interface commands Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 34/45] mptcp: Make MPTCP socket block/wakeup ignore sk_receive_queue Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 35/45] mptcp: update per unacked sequence on pkt reception Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 36/45] mptcp: queue data for mptcp level retransmission Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 37/45] mptcp: introduce MPTCP retransmission timer Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 38/45] mptcp: implement memory accounting for mptcp rtx queue Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 39/45] mptcp: rework mptcp_sendmsg_frag to accept optional dfrag Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 40/45] mptcp: implement and use MPTCP-level retransmission Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 41/45] selftests: mptcp: make tc delays random Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 42/45] selftests: mptcp: extend mptcp_connect tool for ipv6 family Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 43/45] selftests: mptcp: add accept/getpeer checks Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 44/45] selftests: mptcp: add ipv6 connectivity Mat Martineau
2019-10-02 23:36 ` [RFC PATCH v2 45/45] selftests: mptcp: random ethtool tweaking Mat Martineau
2019-10-02 23:53 ` [RFC PATCH v2 00/45] Multipath TCP Mat Martineau
2019-10-03  0:12 ` David Miller
2019-10-03  0:27   ` Mat Martineau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).