* [PATCH mptcp-net v2 0/6] mptcp: mp_fail related fixes
From: Paolo Abeni @ 2022-06-15 20:28 UTC
  To: mptcp

While cooking a follow-up for "mptcp: invoke MP_FAIL response when
needed" I stumbled upon a few other MP_FAIL related issues hit by
the self-tests while mangling the above.

All the patches are for the -net tree, and should be applied in order
around the squashed-to patch.

v1 -> v2:
 - take care of UaF in mptcp_worker with new patch 6/6

Paolo Abeni (6):
  mptcp: fix error mibs accounting
  mptcp: introduce MAPPING_BAD_CSUM
  Squash-to: "mptcp: invoke MP_FAIL response when needed"
  mptcp: fix shutdown vs fallback race
  mptcp: consistent map handling on failure
  mptcp: fix race on unaccepted mptcp sockets

 net/mptcp/options.c  |   8 +--
 net/mptcp/pm.c       |   5 +-
 net/mptcp/protocol.c |  61 +++++++++++++++++++----
 net/mptcp/protocol.h |  25 ++++++++--
 net/mptcp/subflow.c  | 113 ++++++++++++++++++++++++++++++++++---------
 5 files changed, 168 insertions(+), 44 deletions(-)

-- 
2.35.3



* [PATCH mptcp-net v2 1/6] mptcp: fix error mibs accounting
From: Paolo Abeni @ 2022-06-15 20:28 UTC
  To: mptcp

The current accounting for MP_FAIL and MP_FASTCLOSE is not very
accurate: both counters can be incremented even when the related option
is not actually sent. Move the accounting into the correct place.
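
For illustration, here is a minimal standalone C sketch of the pattern
this patch enforces (not kernel code; all names are made up): the
counter is bumped only on the path that actually writes the option, so
setting the intent flag alone can no longer inflate the MIB:

#include <stdio.h>
#include <stdbool.h>

static unsigned long mib_mp_fail_tx;

/* mimics the option-writing helper: count only when the option is
 * really emitted
 */
static bool write_mp_fail_option(bool has_room)
{
	if (!has_room)
		return false;	/* option dropped: no accounting */
	mib_mp_fail_tx++;	/* accounted on the actual send path */
	return true;
}

int main(void)
{
	write_mp_fail_option(false);	/* not sent for lack of space */
	write_mp_fail_option(true);	/* actually sent */
	printf("MPFailTx: %lu\n", mib_mp_fail_tx);	/* prints 1 */
	return 0;
}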

Fixes: eb7f33654dc1 ("mptcp: add the mibs for MP_FAIL")
Fixes: 1e75629cb964 ("mptcp: add the mibs for MP_FASTCLOSE")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/options.c | 6 +++---
 net/mptcp/pm.c      | 1 -
 net/mptcp/subflow.c | 4 +---
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index be3b918a6d15..0bfa6662447c 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -765,6 +765,7 @@ static noinline bool mptcp_established_options_rst(struct sock *sk, struct sk_bu
 	opts->suboptions |= OPTION_MPTCP_RST;
 	opts->reset_transient = subflow->reset_transient;
 	opts->reset_reason = subflow->reset_reason;
+	MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPRSTTX);
 
 	return true;
 }
@@ -788,6 +789,7 @@ static bool mptcp_established_options_fastclose(struct sock *sk,
 	opts->rcvr_key = msk->remote_key;
 
 	pr_debug("FASTCLOSE key=%llu", opts->rcvr_key);
+	MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPFASTCLOSETX);
 	return true;
 }
 
@@ -807,7 +809,7 @@ static bool mptcp_established_options_mp_fail(struct sock *sk,
 	*size = TCPOLEN_MPTCP_FAIL;
 	opts->suboptions |= OPTION_MPTCP_FAIL;
 	opts->fail_seq = subflow->map_seq;
-
+	MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPFAILTX);
 	pr_debug("MP_FAIL fail_seq=%llu", opts->fail_seq);
 
 	return true;
@@ -833,13 +835,11 @@ bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 		    mptcp_established_options_mp_fail(sk, &opt_size, remaining, opts)) {
 			*size += opt_size;
 			remaining -= opt_size;
-			MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPFASTCLOSETX);
 		}
 		/* MP_RST can be used with MP_FASTCLOSE and MP_FAIL if there is room */
 		if (mptcp_established_options_rst(sk, skb, &opt_size, remaining, opts)) {
 			*size += opt_size;
 			remaining -= opt_size;
-			MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPRSTTX);
 		}
 		return true;
 	}
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 2a57d95d5492..3c7f07bb124e 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -309,7 +309,6 @@ void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
 		pr_debug("send MP_FAIL response and infinite map");
 
 		subflow->send_mp_fail = 1;
-		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPFAILTX);
 		subflow->send_infinite_map = 1;
 	} else {
 		pr_debug("MP_FAIL response received");
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 5351d54e514a..57d2d8d933d0 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -958,10 +958,8 @@ static enum mapping_status validate_data_csum(struct sock *ssk, struct sk_buff *
 				 subflow->map_data_csum);
 	if (unlikely(csum)) {
 		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_DATACSUMERR);
-		if (subflow->mp_join || subflow->valid_csum_seen) {
+		if (subflow->mp_join || subflow->valid_csum_seen)
 			subflow->send_mp_fail = 1;
-			MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_MPFAILTX);
-		}
 		return subflow->mp_join ? MAPPING_INVALID : MAPPING_DUMMY;
 	}
 
-- 
2.35.3



* [PATCH mptcp-net v2 2/6] mptcp: introduce MAPPING_BAD_CSUM
From: Paolo Abeni @ 2022-06-15 20:28 UTC
  To: mptcp

This allows moving a couple of conditionals out of the fast path,
making the code easier to follow, and will simplify the next
patch.
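
As a rough standalone sketch of the idea (illustrative only; the
helper and values are made up), the fast path merely classifies the
error with a dedicated status, while the policy decision is taken
once, in the slow path:

#include <stdio.h>

enum mapping_status { MAPPING_OK, MAPPING_INVALID, MAPPING_DUMMY, MAPPING_BAD_CSUM };

/* fast path: report the checksum error, take no policy decision */
static enum mapping_status validate_csum(int csum_ok)
{
	return csum_ok ? MAPPING_OK : MAPPING_BAD_CSUM;
}

int main(void)
{
	enum mapping_status status = validate_csum(0);

	/* slow path: a single place reacts to every non-OK status */
	if (status == MAPPING_BAD_CSUM)
		puts("bad csum: decide between MP_FAIL and subflow reset");
	else if (status != MAPPING_OK)
		puts("fall back to TCP");
	return 0;
}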

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/subflow.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 57d2d8d933d0..98b12a9c4eb5 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -843,7 +843,8 @@ enum mapping_status {
 	MAPPING_INVALID,
 	MAPPING_EMPTY,
 	MAPPING_DATA_FIN,
-	MAPPING_DUMMY
+	MAPPING_DUMMY,
+	MAPPING_BAD_CSUM
 };
 
 static void dbg_bad_map(struct mptcp_subflow_context *subflow, u32 ssn)
@@ -958,9 +959,7 @@ static enum mapping_status validate_data_csum(struct sock *ssk, struct sk_buff *
 				 subflow->map_data_csum);
 	if (unlikely(csum)) {
 		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_DATACSUMERR);
-		if (subflow->mp_join || subflow->valid_csum_seen)
-			subflow->send_mp_fail = 1;
-		return subflow->mp_join ? MAPPING_INVALID : MAPPING_DUMMY;
+		return MAPPING_BAD_CSUM;
 	}
 
 	subflow->valid_csum_seen = 1;
@@ -1178,10 +1177,8 @@ static bool subflow_check_data_avail(struct sock *ssk)
 
 		status = get_mapping_status(ssk, msk);
 		trace_subflow_check_data_avail(status, skb_peek(&ssk->sk_receive_queue));
-		if (unlikely(status == MAPPING_INVALID))
-			goto fallback;
-
-		if (unlikely(status == MAPPING_DUMMY))
+		if (unlikely(status == MAPPING_INVALID || status == MAPPING_DUMMY ||
+			     status == MAPPING_BAD_CSUM))
 			goto fallback;
 
 		if (status != MAPPING_OK)
@@ -1223,7 +1220,10 @@ static bool subflow_check_data_avail(struct sock *ssk)
 fallback:
 	if (!__mptcp_check_fallback(msk)) {
 		/* RFC 8684 section 3.7. */
-		if (subflow->send_mp_fail) {
+		if (status == MAPPING_BAD_CSUM &&
+		    (subflow->mp_join || subflow->valid_csum_seen)) {
+			subflow->send_mp_fail = 1;
+
 			if (!READ_ONCE(msk->allow_infinite_fallback)) {
 				ssk->sk_err = EBADMSG;
 				tcp_set_state(ssk, TCP_CLOSE);
-- 
2.35.3



* [PATCH mptcp-net v2 3/6] Squash-to: "mptcp: invoke MP_FAIL response when needed"
From: Paolo Abeni @ 2022-06-15 20:28 UTC
  To: mptcp

This tries to address a few issues outstanding in the mentioned
patch:
- we explicitly need to reset the timeout timer for mp_fail's sake
- we need to explicitly generate a tcp ack for mp_fail, otherwise
  there are no guarantees for such option being sent out
- the timeout timer handling needs some care, as it's still
  shared between mp_fail and the msk socket timeout.
- we can re-use msk->first for msk->fail_ssk, as only the first/mpc
  subflow can fail without reset. That additionally avoids the need
  to clear fail_ssk on the relevant subflow close.
- fail_tout would need some additional annotation. Just to be on the
  safe side, move its manipulation under the ssk socket lock.

The last two paragraphs of the squash-to commit message should be
replaced with:

"""
It leverages the fact that only the MPC/first subflow can gracefully
fail to avoid unneeded subflow traversal: the failing subflow can
only be msk->first.

A new 'fail_tout' field is added to the subflow context to record the
MP_FAIL response timeout, and such field is used to reliably share the
timeout timer between the MP_FAIL event and the MPTCP socket close timeout.

Finally, a new ack is generated to send out the MP_FAIL notification as
soon as we hit the relevant condition, instead of waiting a possibly
unbounded time for the next data packet.
"""

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/pm.c       |  4 +++-
 net/mptcp/protocol.c | 54 ++++++++++++++++++++++++++++++++++++--------
 net/mptcp/protocol.h |  4 ++--
 net/mptcp/subflow.c  | 24 ++++++++++++++++++--
 4 files changed, 72 insertions(+), 14 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 3c7f07bb124e..45e2a48397b9 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -305,13 +305,15 @@ void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
 	if (!READ_ONCE(msk->allow_infinite_fallback))
 		return;
 
-	if (!msk->fail_ssk) {
+	if (!subflow->fail_tout) {
 		pr_debug("send MP_FAIL response and infinite map");
 
 		subflow->send_mp_fail = 1;
 		subflow->send_infinite_map = 1;
+		tcp_send_ack(sk);
 	} else {
 		pr_debug("MP_FAIL response received");
+		WRITE_ONCE(subflow->fail_tout, 0);
 	}
 }
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index a0f9f3831509..50026b8da625 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -500,7 +500,7 @@ static void mptcp_set_timeout(struct sock *sk)
 	__mptcp_set_timeout(sk, tout);
 }
 
-static bool tcp_can_send_ack(const struct sock *ssk)
+static inline bool tcp_can_send_ack(const struct sock *ssk)
 {
 	return !((1 << inet_sk_state_load(ssk)) &
 	       (TCPF_SYN_SENT | TCPF_SYN_RECV | TCPF_TIME_WAIT | TCPF_CLOSE | TCPF_LISTEN));
@@ -2490,24 +2490,56 @@ static void __mptcp_retrans(struct sock *sk)
 		mptcp_reset_timer(sk);
 }
 
+/* schedule the timeout timer for the nearest relevant event: either
+ * close timeout or mp_fail timeout. Both of them could be not
+ * scheduled yet
+ */
+void mptcp_reset_timeout(struct mptcp_sock *msk, unsigned long fail_tout)
+{
+	struct sock *sk = (struct sock *)msk;
+	unsigned long timeout, close_timeout;
+
+	if (!fail_tout && !sock_flag(sk, SOCK_DEAD))
+		return;
+
+	close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies + TCP_TIMEWAIT_LEN;
+
+	/* the following is basically time_min(close_timeout, fail_tout) */
+	if (!fail_tout)
+		timeout = close_timeout;
+	else if (!sock_flag(sk, SOCK_DEAD))
+		timeout = fail_tout;
+	else if (time_after(close_timeout, fail_tout))
+		timeout = fail_tout;
+	else
+		timeout = close_timeout;
+
+	sk_reset_timer(sk, &sk->sk_timer, timeout);
+}
+
 static void mptcp_mp_fail_no_response(struct mptcp_sock *msk)
 {
-	struct sock *ssk = msk->fail_ssk;
+	struct sock *ssk = msk->first;
 	bool slow;
 
+	if (!ssk)
+		return;
+
 	pr_debug("MP_FAIL doesn't respond, reset the subflow");
 
 	slow = lock_sock_fast(ssk);
 	mptcp_subflow_reset(ssk);
+	WRITE_ONCE(mptcp_subflow_ctx(ssk)->fail_tout, 0);
 	unlock_sock_fast(ssk, slow);
 
-	msk->fail_ssk = NULL;
+	mptcp_reset_timeout(msk, 0);
 }
 
 static void mptcp_worker(struct work_struct *work)
 {
 	struct mptcp_sock *msk = container_of(work, struct mptcp_sock, work);
 	struct sock *sk = &msk->sk.icsk_inet.sk;
+	unsigned long fail_tout;
 	int state;
 
 	lock_sock(sk);
@@ -2544,7 +2576,8 @@ static void mptcp_worker(struct work_struct *work)
 	if (test_and_clear_bit(MPTCP_WORK_RTX, &msk->flags))
 		__mptcp_retrans(sk);
 
-	if (msk->fail_ssk && time_after(jiffies, msk->fail_tout))
+	fail_tout = msk->first ? READ_ONCE(mptcp_subflow_ctx(msk->first)->fail_tout) : 0;
+	if (fail_tout && time_after(jiffies, fail_tout))
 		mptcp_mp_fail_no_response(msk);
 
 unlock:
@@ -2572,8 +2605,6 @@ static int __mptcp_init_sock(struct sock *sk)
 	WRITE_ONCE(msk->csum_enabled, mptcp_is_checksum_enabled(sock_net(sk)));
 	WRITE_ONCE(msk->allow_infinite_fallback, true);
 	msk->recovery = false;
-	msk->fail_ssk = NULL;
-	msk->fail_tout = 0;
 
 	mptcp_pm_data_init(msk);
 
@@ -2804,7 +2835,9 @@ static void __mptcp_destroy_sock(struct sock *sk)
 static void mptcp_close(struct sock *sk, long timeout)
 {
 	struct mptcp_subflow_context *subflow;
+	struct mptcp_sock *msk = mptcp_sk(sk);
 	bool do_cancel_work = false;
+	unsigned long fail_tout = 0;
 
 	lock_sock(sk);
 	sk->sk_shutdown = SHUTDOWN_MASK;
@@ -2822,10 +2855,13 @@ static void mptcp_close(struct sock *sk, long timeout)
 cleanup:
 	/* orphan all the subflows */
 	inet_csk(sk)->icsk_mtup.probe_timestamp = tcp_jiffies32;
-	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 		bool slow = lock_sock_fast_nested(ssk);
 
+		if (ssk == msk->first)
+			fail_tout = subflow->fail_tout;
+
 		sock_orphan(ssk);
 		unlock_sock_fast(ssk, slow);
 	}
@@ -2834,13 +2870,13 @@ static void mptcp_close(struct sock *sk, long timeout)
 	sock_hold(sk);
 	pr_debug("msk=%p state=%d", sk, sk->sk_state);
 	if (mptcp_sk(sk)->token)
-		mptcp_event(MPTCP_EVENT_CLOSED, mptcp_sk(sk), NULL, GFP_KERNEL);
+		mptcp_event(MPTCP_EVENT_CLOSED, msk, NULL, GFP_KERNEL);
 
 	if (sk->sk_state == TCP_CLOSE) {
 		__mptcp_destroy_sock(sk);
 		do_cancel_work = true;
 	} else {
-		sk_reset_timer(sk, &sk->sk_timer, jiffies + TCP_TIMEWAIT_LEN);
+		mptcp_reset_timeout(msk, fail_tout);
 	}
 	release_sock(sk);
 	if (do_cancel_work)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index bef7dea9f358..077a717799a0 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -306,8 +306,6 @@ struct mptcp_sock {
 
 	u32 setsockopt_seq;
 	char		ca_name[TCP_CA_NAME_MAX];
-	struct sock	*fail_ssk;
-	unsigned long	fail_tout;
 };
 
 #define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock)
@@ -484,6 +482,7 @@ struct mptcp_subflow_context {
 	u8	stale_count;
 
 	long	delegated_status;
+	unsigned long	fail_tout;
 
 	);
 
@@ -677,6 +676,7 @@ void mptcp_get_options(const struct sk_buff *skb,
 
 void mptcp_finish_connect(struct sock *sk);
 void __mptcp_set_connected(struct sock *sk);
+void mptcp_reset_timeout(struct mptcp_sock *msk, unsigned long fail_tout);
 static inline bool mptcp_is_fully_established(struct sock *sk)
 {
 	return inet_sk_state_load(sk) == TCP_ESTABLISHED &&
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 98b12a9c4eb5..040901c1f40c 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1158,6 +1158,27 @@ static bool subflow_can_fallback(struct mptcp_subflow_context *subflow)
 		return !subflow->fully_established;
 }
 
+static void mptcp_subflow_fail(struct mptcp_sock *msk, struct sock *ssk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	unsigned long fail_tout;
+
+	/* graceful failure can happen only on the MPC subflow */
+	if (WARN_ON_ONCE(ssk != READ_ONCE(msk->first)))
+		return;
+
+	/* we don't need extreme accuracy here, use a zero fail_tout as a
+	 * special value meaning no fail timeout at all
+	 */
+	fail_tout = jiffies + TCP_RTO_MAX;
+	if (!fail_tout)
+		fail_tout = 1;
+	WRITE_ONCE(subflow->fail_tout, fail_tout);
+	tcp_send_ack(ssk);
+
+	mptcp_reset_timeout(msk, subflow->fail_tout);
+}
+
 static bool subflow_check_data_avail(struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -1233,8 +1254,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 				while ((skb = skb_peek(&ssk->sk_receive_queue)))
 					sk_eat_skb(ssk, skb);
 			} else {
-				msk->fail_ssk = ssk;
-				msk->fail_tout = jiffies + TCP_RTO_MAX;
+				mptcp_subflow_fail(msk, ssk);
 			}
 			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
 			return true;
-- 
2.35.3



* [PATCH mptcp-net v2 4/6] mptcp: fix shutdown vs fallback race
From: Paolo Abeni @ 2022-06-15 20:28 UTC
  To: mptcp

If the MPTCP socket shutdown happens before a fallback
to TCP, and all the pending data have already been spooled,
we never close the TCP connection.

Address the issue by explicitly checking for this critical condition
at fallback time.
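
A standalone sketch of the race and of the check added below
(illustrative only; the struct and helper are made up, the real fix
lives in mptcp_do_fallback()):

#include <stdio.h>
#include <stdbool.h>

struct fake_conn {
	bool snd_data_fin_enable;	/* MPTCP-level shutdown pending */
	bool tcp_fin_sent;		/* TCP-level FIN already emitted */
};

static void do_fallback(struct fake_conn *c)
{
	/* after the fallback no DATA_FIN will follow, so a shutdown
	 * issued earlier must be propagated to plain TCP here,
	 * otherwise the connection would never close
	 */
	if (c->snd_data_fin_enable && !c->tcp_fin_sent) {
		c->tcp_fin_sent = true;
		puts("propagating shutdown to the TCP subflow");
	}
}

int main(void)
{
	struct fake_conn c = { .snd_data_fin_enable = true };

	do_fallback(&c);	/* closes the connection despite the race */
	return 0;
}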

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/options.c  |  2 +-
 net/mptcp/protocol.c |  2 +-
 net/mptcp/protocol.h | 19 ++++++++++++++++---
 net/mptcp/subflow.c  |  2 +-
 4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 0bfa6662447c..57eab237c837 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -966,7 +966,7 @@ static bool check_fully_established(struct mptcp_sock *msk, struct sock *ssk,
 			goto reset;
 		subflow->mp_capable = 0;
 		pr_fallback(msk);
-		__mptcp_do_fallback(msk);
+		mptcp_do_fallback(ssk);
 		return false;
 	}
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 50026b8da625..4f4d6c66e190 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1245,7 +1245,7 @@ static void mptcp_update_infinite_map(struct mptcp_sock *msk,
 	MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_INFINITEMAPTX);
 	mptcp_subflow_ctx(ssk)->send_infinite_map = 0;
 	pr_fallback(msk);
-	__mptcp_do_fallback(msk);
+	mptcp_do_fallback(ssk);
 }
 
 static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 077a717799a0..ad9b02b1b3e6 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -941,12 +941,25 @@ static inline void __mptcp_do_fallback(struct mptcp_sock *msk)
 	set_bit(MPTCP_FALLBACK_DONE, &msk->flags);
 }
 
-static inline void mptcp_do_fallback(struct sock *sk)
+static inline void mptcp_do_fallback(struct sock *ssk)
 {
-	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
-	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	struct sock *sk = subflow->conn;
+	struct mptcp_sock *msk;
 
+	msk = mptcp_sk(sk);
 	__mptcp_do_fallback(msk);
+	if (READ_ONCE(msk->snd_data_fin_enable) && !(ssk->sk_shutdown & SEND_SHUTDOWN)) {
+		gfp_t saved_allocation = ssk->sk_allocation;
+
+	/* we are in an atomic (BH) scope, override ssk default for data
+		 * fin allocation
+		 */
+		ssk->sk_allocation = GFP_ATOMIC;
+		ssk->sk_shutdown |= SEND_SHUTDOWN;
+		tcp_shutdown(ssk, SEND_SHUTDOWN);
+		ssk->sk_allocation = saved_allocation;
+	}
 }
 
 #define pr_fallback(a) pr_debug("%s:fallback to TCP (msk=%p)", __func__, a)
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 040901c1f40c..75fdc6474d0e 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1273,7 +1273,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 			return false;
 		}
 
-		__mptcp_do_fallback(msk);
+		mptcp_do_fallback(ssk);
 	}
 
 	skb = skb_peek(&ssk->sk_receive_queue);
-- 
2.35.3



* [PATCH mptcp-net v2 5/6] mptcp: consistent map handling on failure
From: Paolo Abeni @ 2022-06-15 20:28 UTC
  To: mptcp

When the MPTCP receive path reaches a non-fatal fallback condition, e.g.
when the MPC socket must fall back to TCP, the existing code is a little
self-inconsistent: it reports that new data is available - returning
true - but sets the MPC flag to the opposite value.

As a consequence, read operations in some exceptional scenarios may
block unexpectedly.

Address the issue by setting the correct MPC read status. Additionally
avoid some code duplication in the fatal fallback scenario.
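
A tiny standalone sketch of the inconsistency being fixed (illustrative
only; names are made up): the boolean result and the recorded status
must agree, otherwise a reader can block even though the caller was
told data is available:

#include <stdio.h>
#include <stdbool.h>

enum { NODATA, DATA_AVAIL } data_avail = NODATA;

static bool check_data_avail(bool nonfatal_fallback)
{
	if (nonfatal_fallback) {
		/* keep the returned value and the status consistent */
		data_avail = DATA_AVAIL;
		return true;
	}
	data_avail = NODATA;
	return false;
}

int main(void)
{
	bool avail = check_data_avail(true);

	printf("avail=%d status_ok=%d\n", avail, data_avail == DATA_AVAIL);
	return 0;
}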

Fixes: 9c81be0dbc89 ("mptcp: add MP_FAIL response support")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/subflow.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 75fdc6474d0e..bec36e2bcc32 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1246,17 +1246,12 @@ static bool subflow_check_data_avail(struct sock *ssk)
 			subflow->send_mp_fail = 1;
 
 			if (!READ_ONCE(msk->allow_infinite_fallback)) {
-				ssk->sk_err = EBADMSG;
-				tcp_set_state(ssk, TCP_CLOSE);
 				subflow->reset_transient = 0;
 				subflow->reset_reason = MPTCP_RST_EMIDDLEBOX;
-				tcp_send_active_reset(ssk, GFP_ATOMIC);
-				while ((skb = skb_peek(&ssk->sk_receive_queue)))
-					sk_eat_skb(ssk, skb);
-			} else {
-				mptcp_subflow_fail(msk, ssk);
+				goto reset;
 			}
-			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+			mptcp_subflow_fail(msk, ssk);
+			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_DATA_AVAIL);
 			return true;
 		}
 
@@ -1264,10 +1259,14 @@ static bool subflow_check_data_avail(struct sock *ssk)
 			/* fatal protocol error, close the socket.
 			 * subflow_error_report() will introduce the appropriate barriers
 			 */
-			ssk->sk_err = EBADMSG;
-			tcp_set_state(ssk, TCP_CLOSE);
 			subflow->reset_transient = 0;
 			subflow->reset_reason = MPTCP_RST_EMPTCP;
+
+reset:
+			ssk->sk_err = EBADMSG;
+			tcp_set_state(ssk, TCP_CLOSE);
+			while ((skb = skb_peek(&ssk->sk_receive_queue)))
+				sk_eat_skb(ssk, skb);
 			tcp_send_active_reset(ssk, GFP_ATOMIC);
 			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
 			return false;
-- 
2.35.3



* [PATCH mptcp-net v2 6/6] mptcp: fix race on unaccepted mptcp sockets
From: Paolo Abeni @ 2022-06-15 20:28 UTC
  To: mptcp

When the listener socket owning the relevant request is closed,
it frees the unaccepted subflows, and that causes later deletion
of the paired MPTCP sockets.

The MPTCP socket's worker can run in the time interval between such
delete operations. When that happens, any access to msk->first will
cause a UaF access, as the subflow cleanup did not clear such field
in the MPTCP socket.

Address the issue by explicitly traversing the listener socket accept
queue at close time and performing the needed cleanup on the pending
msk.

Note that the locking is a bit tricky, as we need to acquire the msk
socket lock while still owning the subflow socket one.
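
The lock-ordering constraint can be sketched with plain pthread mutexes
(illustrative only, not the kernel locking primitives): taking each msk
lock while holding the listener subflow lock would invert the usual
msk -> subflow order, so the listener lock is dropped first and
reacquired afterwards:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t listener_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t msk_lock = PTHREAD_MUTEX_INITIALIZER;

/* caller holds listener_lock, mirroring mptcp_subflow_queue_clean() */
static void queue_clean(void)
{
	pthread_mutex_unlock(&listener_lock);	/* avoid ABBA deadlock */

	pthread_mutex_lock(&msk_lock);
	puts("cleaning up a pending msk under its own lock");
	pthread_mutex_unlock(&msk_lock);

	pthread_mutex_lock(&listener_lock);	/* restore caller's state */
}

int main(void)
{
	pthread_mutex_lock(&listener_lock);
	queue_clean();
	pthread_mutex_unlock(&listener_lock);
	return 0;
}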

Fixes: 86e39e04482b ("mptcp: keep track of local endpoint still available for each msk")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c |  5 +++++
 net/mptcp/protocol.h |  2 ++
 net/mptcp/subflow.c  | 50 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4f4d6c66e190..35e3060233c9 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2318,6 +2318,11 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 		kfree_rcu(subflow, rcu);
 	} else {
 		/* otherwise tcp will dispose of the ssk and subflow ctx */
+		if (ssk->sk_state == TCP_LISTEN) {
+			tcp_set_state(ssk, TCP_CLOSE);
+			mptcp_subflow_queue_clean(ssk);
+			inet_csk_listen_stop(ssk);
+		}
 		__tcp_close(ssk, 0);
 
 		/* close acquired an extra ref */
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index ad9b02b1b3e6..95c9ace1437b 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -306,6 +306,7 @@ struct mptcp_sock {
 
 	u32 setsockopt_seq;
 	char		ca_name[TCP_CA_NAME_MAX];
+	struct mptcp_sock	*dl_next;
 };
 
 #define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock)
@@ -610,6 +611,7 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 		     struct mptcp_subflow_context *subflow);
 void mptcp_subflow_send_ack(struct sock *ssk);
 void mptcp_subflow_reset(struct sock *ssk);
+void mptcp_subflow_queue_clean(struct sock *ssk);
 void mptcp_sock_graft(struct sock *sk, struct socket *parent);
 struct socket *__mptcp_nmpc_socket(const struct mptcp_sock *msk);
 
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index bec36e2bcc32..02a12f271a60 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1717,6 +1717,56 @@ static void subflow_state_change(struct sock *sk)
 	}
 }
 
+void mptcp_subflow_queue_clean(struct sock *listener_ssk)
+{
+	struct request_sock_queue *queue = &inet_csk(listener_ssk)->icsk_accept_queue;
+	struct mptcp_sock *msk, *next, *head = NULL;
+	struct request_sock *req;
+
+	/* build a list of all unaccepted mptcp sockets */
+	spin_lock_bh(&queue->rskq_lock);
+	for (req = queue->rskq_accept_head; req; req = req->dl_next) {
+		struct mptcp_subflow_context *subflow;
+		struct sock *ssk = req->sk;
+		struct mptcp_sock *msk;
+
+		if (!sk_is_mptcp(ssk))
+			continue;
+
+		subflow = mptcp_subflow_ctx(ssk);
+		if (!subflow || !subflow->conn)
+			continue;
+
+		/* skip if already in list */
+		msk = mptcp_sk(subflow->conn);
+		if (msk->dl_next || msk == head)
+			continue;
+
+		msk->dl_next = head;
+		head = msk;
+	}
+	spin_unlock_bh(&queue->rskq_lock);
+	if (!head)
+		return;
+
+	/* can't acquire the msk socket lock under the subflow one,
+	 * or will cause ABBA deadlock
+	 */
+	release_sock(listener_ssk);
+
+	for (msk = head; msk; msk = next) {
+		struct sock *sk = (struct sock *)msk;
+		bool slow;
+
+		slow = lock_sock_fast_nested(sk);
+		next = msk->dl_next;
+		msk->first = NULL;
+		msk->dl_next = NULL;
+		unlock_sock_fast(sk, slow);
+	}
+	lock_sock(listener_ssk);
+}
+
 static int subflow_ulp_init(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
-- 
2.35.3



* Re: mptcp: fix race on unaccepted mptcp sockets: Tests Results
From: MPTCP CI @ 2022-06-15 22:10 UTC
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal:
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/5240636109488128
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5240636109488128/summary/summary.txt

- KVM Validation: debug:
  - Unstable: 1 failed test(s): selftest_diag 🔴:
  - Task: https://cirrus-ci.com/task/6366536016330752
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6366536016330752/summary/summary.txt

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/bac04f5bed54


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-debug

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts already made to have a stable
test suite when executed on a public CI like this one, it is possible
that some reported issues are not due to your modifications. Still, do
not hesitate to help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (Tessares)


* Re: [PATCH mptcp-net v2 3/6] Squash-to: "mptcp: invoke MP_FAIL response when needed"
From: Mat Martineau @ 2022-06-16  0:59 UTC
  To: Paolo Abeni; +Cc: mptcp

On Wed, 15 Jun 2022, Paolo Abeni wrote:

> This tries to address a few issues outstanding in the mentioned
> patch:
> - we explicitly need to reset the timeout timer for mp_fail's sake
> - we need to explicitly generate a tcp ack for mp_fail, otherwise
>  there are no guarantees for such option being sent out
> - the timeout timer handling needs some care, as it's still
>  shared between mp_fail and the msk socket timeout.
> - we can re-use msk->first for msk->fail_ssk, as only the first/mpc
>  subflow can fail without reset. That additionally avoids the need
>  to clear fail_ssk on the relevant subflow close.
> - fail_tout would need some additional annotation. Just to be on the
>  safe side, move its manipulation under the ssk socket lock.
>
> The last two paragraphs of the squash-to commit message should be
> replaced with:
>
> """
> It leverages the fact that only the MPC/first subflow can gracefully
> fail to avoid unneeded subflow traversal: the failing subflow can
> only be msk->first.
>
> A new 'fail_tout' field is added to the subflow context to record the
> MP_FAIL response timeout, and such field is used to reliably share the
> timeout timer between the MP_FAIL event and the MPTCP socket close timeout.
>
> Finally, a new ack is generated to send out the MP_FAIL notification as
> soon as we hit the relevant condition, instead of waiting a possibly
> unbounded time for the next data packet.
> """
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/pm.c       |  4 +++-
> net/mptcp/protocol.c | 54 ++++++++++++++++++++++++++++++++++++--------
> net/mptcp/protocol.h |  4 ++--
> net/mptcp/subflow.c  | 24 ++++++++++++++++++--
> 4 files changed, 72 insertions(+), 14 deletions(-)
>
> diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
> index 3c7f07bb124e..45e2a48397b9 100644
> --- a/net/mptcp/pm.c
> +++ b/net/mptcp/pm.c
> @@ -305,13 +305,15 @@ void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
> 	if (!READ_ONCE(msk->allow_infinite_fallback))
> 		return;
>
> -	if (!msk->fail_ssk) {
> +	if (!subflow->fail_tout) {
> 		pr_debug("send MP_FAIL response and infinite map");
>
> 		subflow->send_mp_fail = 1;
> 		subflow->send_infinite_map = 1;
> +		tcp_send_ack(sk);
> 	} else {
> 		pr_debug("MP_FAIL response received");
> +		WRITE_ONCE(subflow->fail_tout, 0);
> 	}
> }
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index a0f9f3831509..50026b8da625 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -500,7 +500,7 @@ static void mptcp_set_timeout(struct sock *sk)
> 	__mptcp_set_timeout(sk, tout);
> }
>
> -static bool tcp_can_send_ack(const struct sock *ssk)
> +static inline bool tcp_can_send_ack(const struct sock *ssk)
> {
> 	return !((1 << inet_sk_state_load(ssk)) &
> 	       (TCPF_SYN_SENT | TCPF_SYN_RECV | TCPF_TIME_WAIT | TCPF_CLOSE | TCPF_LISTEN));
> @@ -2490,24 +2490,56 @@ static void __mptcp_retrans(struct sock *sk)
> 		mptcp_reset_timer(sk);
> }
>
> +/* schedule the timeout timer for the nearest relevant event: either
> + * close timeout or mp_fail timeout. Both of them could be not
> + * scheduled yet
> + */
> +void mptcp_reset_timeout(struct mptcp_sock *msk, unsigned long fail_tout)
> +{
> +	struct sock *sk = (struct sock *)msk;
> +	unsigned long timeout, close_timeout;
> +
> +	if (!fail_tout && !sock_flag(sk, SOCK_DEAD))
> +		return;
> +
> +	close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies + TCP_TIMEWAIT_LEN;
> +
> +	/* the following is basically time_min(close_timeout, fail_tout) */
> +	if (!fail_tout)
> +		timeout = close_timeout;
> +	else if (!sock_flag(sk, SOCK_DEAD))
> +		timeout = fail_tout;
> +	else if (time_after(close_timeout, fail_tout))
> +		timeout = fail_tout;
> +	else
> +		timeout = close_timeout;
> +
> +	sk_reset_timer(sk, &sk->sk_timer, timeout);
> +}

Hi Paolo -

The above function seems more complex than needed. If mptcp_close() has 
been called, the fail timeout should be considered canceled and the 
"close" has exclusive use of sk_timer. subflow->fail_tout can be set to 0 
and sk_timer can be set only based on TCP_TIMEWAIT_LEN.

TCP_RTO_MAX for the MP_FAIL timeout was kind of arbitrary - the spec 
doesn't say what exactly to do. We didn't want the connection to be in the 
"unacknowledged" MP_FAIL state forever, but also didn't want to give up 
too fast.

Given that, do you think the complexity is justified to possibly reset the 
subflow earlier in this "unacknowledged MP_FAIL followed by a close" 
scenario?


> +
> static void mptcp_mp_fail_no_response(struct mptcp_sock *msk)
> {
> -	struct sock *ssk = msk->fail_ssk;
> +	struct sock *ssk = msk->first;
> 	bool slow;
>
> +	if (!ssk)
> +		return;
> +
> 	pr_debug("MP_FAIL doesn't respond, reset the subflow");
>
> 	slow = lock_sock_fast(ssk);
> 	mptcp_subflow_reset(ssk);
> +	WRITE_ONCE(mptcp_subflow_ctx(ssk)->fail_tout, 0);
> 	unlock_sock_fast(ssk, slow);
>
> -	msk->fail_ssk = NULL;
> +	mptcp_reset_timeout(msk, 0);
> }
>
> static void mptcp_worker(struct work_struct *work)
> {
> 	struct mptcp_sock *msk = container_of(work, struct mptcp_sock, work);
> 	struct sock *sk = &msk->sk.icsk_inet.sk;
> +	unsigned long fail_tout;
> 	int state;
>
> 	lock_sock(sk);
> @@ -2544,7 +2576,8 @@ static void mptcp_worker(struct work_struct *work)
> 	if (test_and_clear_bit(MPTCP_WORK_RTX, &msk->flags))
> 		__mptcp_retrans(sk);
>
> -	if (msk->fail_ssk && time_after(jiffies, msk->fail_tout))
> +	fail_tout = msk->first ? READ_ONCE(mptcp_subflow_ctx(msk->first)->fail_tout) : 0;
> +	if (fail_tout && time_after(jiffies, fail_tout))
> 		mptcp_mp_fail_no_response(msk);
>
> unlock:
> @@ -2572,8 +2605,6 @@ static int __mptcp_init_sock(struct sock *sk)
> 	WRITE_ONCE(msk->csum_enabled, mptcp_is_checksum_enabled(sock_net(sk)));
> 	WRITE_ONCE(msk->allow_infinite_fallback, true);
> 	msk->recovery = false;
> -	msk->fail_ssk = NULL;
> -	msk->fail_tout = 0;
>
> 	mptcp_pm_data_init(msk);
>
> @@ -2804,7 +2835,9 @@ static void __mptcp_destroy_sock(struct sock *sk)
> static void mptcp_close(struct sock *sk, long timeout)
> {
> 	struct mptcp_subflow_context *subflow;
> +	struct mptcp_sock *msk = mptcp_sk(sk);
> 	bool do_cancel_work = false;
> +	unsigned long fail_tout = 0;
>
> 	lock_sock(sk);
> 	sk->sk_shutdown = SHUTDOWN_MASK;
> @@ -2822,10 +2855,13 @@ static void mptcp_close(struct sock *sk, long timeout)
> cleanup:
> 	/* orphan all the subflows */
> 	inet_csk(sk)->icsk_mtup.probe_timestamp = tcp_jiffies32;
> -	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
> +	mptcp_for_each_subflow(msk, subflow) {
> 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
> 		bool slow = lock_sock_fast_nested(ssk);
>
> +		if (ssk == msk->first)
> +			fail_tout = subflow->fail_tout;
> +
> 		sock_orphan(ssk);
> 		unlock_sock_fast(ssk, slow);
> 	}
> @@ -2834,13 +2870,13 @@ static void mptcp_close(struct sock *sk, long timeout)
> 	sock_hold(sk);
> 	pr_debug("msk=%p state=%d", sk, sk->sk_state);
> 	if (mptcp_sk(sk)->token)
> -		mptcp_event(MPTCP_EVENT_CLOSED, mptcp_sk(sk), NULL, GFP_KERNEL);
> +		mptcp_event(MPTCP_EVENT_CLOSED, msk, NULL, GFP_KERNEL);
>
> 	if (sk->sk_state == TCP_CLOSE) {
> 		__mptcp_destroy_sock(sk);
> 		do_cancel_work = true;
> 	} else {
> -		sk_reset_timer(sk, &sk->sk_timer, jiffies + TCP_TIMEWAIT_LEN);
> +		mptcp_reset_timeout(msk, fail_tout);
> 	}
> 	release_sock(sk);
> 	if (do_cancel_work)
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index bef7dea9f358..077a717799a0 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -306,8 +306,6 @@ struct mptcp_sock {
>
> 	u32 setsockopt_seq;
> 	char		ca_name[TCP_CA_NAME_MAX];
> -	struct sock	*fail_ssk;
> -	unsigned long	fail_tout;
> };
>
> #define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock)
> @@ -484,6 +482,7 @@ struct mptcp_subflow_context {
> 	u8	stale_count;
>
> 	long	delegated_status;
> +	unsigned long	fail_tout;
>
> 	);
>
> @@ -677,6 +676,7 @@ void mptcp_get_options(const struct sk_buff *skb,
>
> void mptcp_finish_connect(struct sock *sk);
> void __mptcp_set_connected(struct sock *sk);
> +void mptcp_reset_timeout(struct mptcp_sock *msk, unsigned long fail_tout);
> static inline bool mptcp_is_fully_established(struct sock *sk)
> {
> 	return inet_sk_state_load(sk) == TCP_ESTABLISHED &&
> diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> index 98b12a9c4eb5..040901c1f40c 100644
> --- a/net/mptcp/subflow.c
> +++ b/net/mptcp/subflow.c
> @@ -1158,6 +1158,27 @@ static bool subflow_can_fallback(struct mptcp_subflow_context *subflow)
> 		return !subflow->fully_established;
> }
>
> +static void mptcp_subflow_fail(struct mptcp_sock *msk, struct sock *ssk)
> +{
> +	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
> +	unsigned long fail_tout;
> +
> +	/* graceful failure can happen only on the MPC subflow */
> +	if (WARN_ON_ONCE(ssk != READ_ONCE(msk->first)))
> +		return;
> +
> +	/* we don't need extreme accuracy here, use a zero fail_tout as a
> +	 * special value meaning no fail timeout at all
> +	 */
> +	fail_tout = jiffies + TCP_RTO_MAX;
> +	if (!fail_tout)
> +		fail_tout = 1;

Ah, I was wondering how jiffies wrap-around and the '0' special value were 
handled! Good idea.

> +	WRITE_ONCE(subflow->fail_tout, fail_tout);
> +	tcp_send_ack(ssk);
> +
> +	mptcp_reset_timeout(msk, subflow->fail_tout);
> +}
> +
> static bool subflow_check_data_avail(struct sock *ssk)
> {
> 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
> @@ -1233,8 +1254,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
> 				while ((skb = skb_peek(&ssk->sk_receive_queue)))
> 					sk_eat_skb(ssk, skb);
> 			} else {
> -				msk->fail_ssk = ssk;
> -				msk->fail_tout = jiffies + TCP_RTO_MAX;
> +				mptcp_subflow_fail(msk, ssk);
> 			}
> 			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
> 			return true;
> -- 
> 2.35.3
>
>
>

--
Mat Martineau
Intel


* Re: [PATCH mptcp-net v2 3/6] Squash-to: "mptcp: invoke MP_FAIL response when needed"
From: Paolo Abeni @ 2022-06-16  7:11 UTC
  To: Mat Martineau; +Cc: mptcp

On Wed, 2022-06-15 at 17:59 -0700, Mat Martineau wrote:
> On Wed, 15 Jun 2022, Paolo Abeni wrote:
> 
> > This tries to address a few issues outstanding in the mentioned
> > patch:
> > - we explicitly need to reset the timeout timer for mp_fail's sake
> > - we need to explicitly generate a tcp ack for mp_fail, otherwise
> >  there are no guarantees for such option being sent out
> > - the timeout timer handling needs some care, as it's still
> >  shared between mp_fail and the msk socket timeout.
> > - we can re-use msk->first for msk->fail_ssk, as only the first/mpc
> >  subflow can fail without reset. That additionally avoids the need
> >  to clear fail_ssk on the relevant subflow close.
> > - fail_tout would need some additional annotation. Just to be on the
> >  safe side, move its manipulation under the ssk socket lock.
> > 
> > The last two paragraphs of the squash-to commit message should be
> > replaced with:
> > 
> > """
> > It leverages the fact that only the MPC/first subflow can gracefully
> > fail to avoid unneeded subflow traversal: the failing subflow can
> > only be msk->first.
> > 
> > A new 'fail_tout' field is added to the subflow context to record the
> > MP_FAIL response timeout, and such field is used to reliably share the
> > timeout timer between the MP_FAIL event and the MPTCP socket close timeout.
> > 
> > Finally, a new ack is generated to send out the MP_FAIL notification as
> > soon as we hit the relevant condition, instead of waiting a possibly
> > unbounded time for the next data packet.
> > """
> > 
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > net/mptcp/pm.c       |  4 +++-
> > net/mptcp/protocol.c | 54 ++++++++++++++++++++++++++++++++++++--------
> > net/mptcp/protocol.h |  4 ++--
> > net/mptcp/subflow.c  | 24 ++++++++++++++++++--
> > 4 files changed, 72 insertions(+), 14 deletions(-)
> > 
> > diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
> > index 3c7f07bb124e..45e2a48397b9 100644
> > --- a/net/mptcp/pm.c
> > +++ b/net/mptcp/pm.c
> > @@ -305,13 +305,15 @@ void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
> > 	if (!READ_ONCE(msk->allow_infinite_fallback))
> > 		return;
> > 
> > -	if (!msk->fail_ssk) {
> > +	if (!subflow->fail_tout) {
> > 		pr_debug("send MP_FAIL response and infinite map");
> > 
> > 		subflow->send_mp_fail = 1;
> > 		subflow->send_infinite_map = 1;
> > +		tcp_send_ack(sk);
> > 	} else {
> > 		pr_debug("MP_FAIL response received");
> > +		WRITE_ONCE(subflow->fail_tout, 0);
> > 	}
> > }
> > 
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index a0f9f3831509..50026b8da625 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -500,7 +500,7 @@ static void mptcp_set_timeout(struct sock *sk)
> > 	__mptcp_set_timeout(sk, tout);
> > }
> > 
> > -static bool tcp_can_send_ack(const struct sock *ssk)
> > +static inline bool tcp_can_send_ack(const struct sock *ssk)
> > {
> > 	return !((1 << inet_sk_state_load(ssk)) &
> > 	       (TCPF_SYN_SENT | TCPF_SYN_RECV | TCPF_TIME_WAIT | TCPF_CLOSE | TCPF_LISTEN));
> > @@ -2490,24 +2490,56 @@ static void __mptcp_retrans(struct sock *sk)
> > 		mptcp_reset_timer(sk);
> > }
> > 
> > +/* schedule the timeout timer for the nearest relevant event: either
> > + * close timeout or mp_fail timeout. Both of them could be not
> > + * scheduled yet
> > + */
> > +void mptcp_reset_timeout(struct mptcp_sock *msk, unsigned long fail_tout)
> > +{
> > +	struct sock *sk = (struct sock *)msk;
> > +	unsigned long timeout, close_timeout;
> > +
> > +	if (!fail_tout && !sock_flag(sk, SOCK_DEAD))
> > +		return;
> > +
> > +	close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies + TCP_TIMEWAIT_LEN;
> > +
> > +	/* the following is basically time_min(close_timeout, fail_tout) */
> > +	if (!fail_tout)
> > +		timeout = close_timeout;
> > +	else if (!sock_flag(sk, SOCK_DEAD))
> > +		timeout = fail_tout;
> > +	else if (time_after(close_timeout, fail_tout))
> > +		timeout = fail_tout;
> > +	else
> > +		timeout = close_timeout;
> > +
> > +	sk_reset_timer(sk, &sk->sk_timer, timeout);
> > +}
> 
> Hi Paolo -
> 
> The above function seems more complex than needed. If mptcp_close() has 
> been called, the fail timeout should be considered canceled and the 
> "close" has exclusive use of sk_timer. subflow->fail_tout can be set to 0 
> and sk_timer can be set only based on TCP_TIMEWAIT_LEN.

I agree the above function is non-trivial; on the flip side, it helps
keep all the timeout logic in a single place.

Note that msk->first->fail_tout is under the subflow socket lock
protection, and this function is not always invoked under such lock, so
we will need to either add more read/write once annotations or split
the logic in the caller.
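
For reference, the deadline-selection logic under discussion can be
sketched standalone (illustrative; plain C types instead of the kernel
ones, and a fake after() helper mirroring time_after()):

#include <stdio.h>
#include <stdbool.h>

/* wrap-safe comparison, same idea as the kernel's time_after() */
static bool after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* nearest of two deadlines, where fail_tout == 0 means "not armed"
 * and a non-DEAD socket has no close deadline at all
 */
static unsigned long nearest_timeout(unsigned long close_tout,
				     unsigned long fail_tout, bool dead)
{
	if (!fail_tout)
		return close_tout;
	if (!dead)
		return fail_tout;
	return after(close_tout, fail_tout) ? fail_tout : close_tout;
}

int main(void)
{
	printf("%lu\n", nearest_timeout(500, 100, true));	/* 100 */
	printf("%lu\n", nearest_timeout(500, 0, true));		/* 500 */
	return 0;
}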

Side note: I agree a bit with David Laight wrt *_once proliferation,
possibly with less divisive language: 

https://lore.kernel.org/netdev/cea2c2c39d0e4f27b2e75cdbc8fce09d@AcuMS.aculab.com/

;)

*_ONCE are a bit hard to track and difficult to verify formally, so I
tried hard to reduce their usage. Currently we have only a single READ
operation outside the relevant lock, which is AFAIK the 'safest'
possible scenario with such annotations. All the write operations are
under the subflow socket lock.

> TCP_RTO_MAX for the MP_FAIL timeout was kind of arbitrary - the spec 
> doesn't say what exactly to do. We didn't want the connection to be in the 
> "unacknowledged" MP_FAIL state forever, but also didn't want to give up 
> too fast.

This function doesn't make any assumptions about the relative timeout
lengths, and should fit the case if we make them configurable someday.

> Given that, do you think the complexity is justified to possibly reset the 
> subflow earlier in this "unacknowledged MP_FAIL followed by a close" 
> scenario?

IMHO yes ;) But if you have a strong opinion otherwise I can refactor
the code...

Cheers,

Paolo



* Re: [PATCH mptcp-net v2 3/6] Squash-to: "mptcp: invoke MP_FAIL response when needed"
From: Mat Martineau @ 2022-06-17  0:27 UTC
  To: Paolo Abeni; +Cc: mptcp

On Thu, 16 Jun 2022, Paolo Abeni wrote:

> On Wed, 2022-06-15 at 17:59 -0700, Mat Martineau wrote:
>> On Wed, 15 Jun 2022, Paolo Abeni wrote:
>>
>>> This tries to address a few issues outstanding in the mentioned
>>> patch:
>>> - we explicitly need to reset the timeout timer for mp_fail's sake
>>> - we need to explicitly generate a tcp ack for mp_fail, otherwise
>>>  there are no guarantees for such option being sent out
>>> - the timeout timer handling needs some care, as it's still
>>>  shared between mp_fail and the msk socket timeout.
>>> - we can re-use msk->first for msk->fail_ssk, as only the first/mpc
>>>  subflow can fail without reset. That additionally avoids the need
>>>  to clear fail_ssk on the relevant subflow close.
>>> - fail_tout would need some additional annotation. Just to be on the
>>>  safe side, move its manipulation under the ssk socket lock.
>>>
>>> The last two paragraphs of the squash-to commit message should be
>>> replaced with:
>>>
>>> """
>>> It leverages the fact that only the MPC/first subflow can gracefully
>>> fail to avoid unneeded subflow traversal: the failing subflow can
>>> only be msk->first.
>>>
>>> A new 'fail_tout' field is added to the subflow context to record the
>>> MP_FAIL response timeout, and such field is used to reliably share the
>>> timeout timer between the MP_FAIL event and the MPTCP socket close timeout.
>>>
>>> Finally, a new ack is generated to send out the MP_FAIL notification as
>>> soon as we hit the relevant condition, instead of waiting a possibly
>>> unbounded time for the next data packet.
>>> """
>>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>> net/mptcp/pm.c       |  4 +++-
>>> net/mptcp/protocol.c | 54 ++++++++++++++++++++++++++++++++++++--------
>>> net/mptcp/protocol.h |  4 ++--
>>> net/mptcp/subflow.c  | 24 ++++++++++++++++++--
>>> 4 files changed, 72 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
>>> index 3c7f07bb124e..45e2a48397b9 100644
>>> --- a/net/mptcp/pm.c
>>> +++ b/net/mptcp/pm.c
>>> @@ -305,13 +305,15 @@ void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
>>> 	if (!READ_ONCE(msk->allow_infinite_fallback))
>>> 		return;
>>>
>>> -	if (!msk->fail_ssk) {
>>> +	if (!subflow->fail_tout) {
>>> 		pr_debug("send MP_FAIL response and infinite map");
>>>
>>> 		subflow->send_mp_fail = 1;
>>> 		subflow->send_infinite_map = 1;
>>> +		tcp_send_ack(sk);
>>> 	} else {
>>> 		pr_debug("MP_FAIL response received");
>>> +		WRITE_ONCE(subflow->fail_tout, 0);
>>> 	}
>>> }
>>>
>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>> index a0f9f3831509..50026b8da625 100644
>>> --- a/net/mptcp/protocol.c
>>> +++ b/net/mptcp/protocol.c
>>> @@ -500,7 +500,7 @@ static void mptcp_set_timeout(struct sock *sk)
>>> 	__mptcp_set_timeout(sk, tout);
>>> }
>>>
>>> -static bool tcp_can_send_ack(const struct sock *ssk)
>>> +static inline bool tcp_can_send_ack(const struct sock *ssk)
>>> {
>>> 	return !((1 << inet_sk_state_load(ssk)) &
>>> 	       (TCPF_SYN_SENT | TCPF_SYN_RECV | TCPF_TIME_WAIT | TCPF_CLOSE | TCPF_LISTEN));
>>> @@ -2490,24 +2490,56 @@ static void __mptcp_retrans(struct sock *sk)
>>> 		mptcp_reset_timer(sk);
>>> }
>>>
>>> +/* schedule the timeout timer for the nearest relevant event: either
>>> + * close timeout or mp_fail timeout. Both of them could be not
>>> + * scheduled yet
>>> + */
>>> +void mptcp_reset_timeout(struct mptcp_sock *msk, unsigned long fail_tout)
>>> +{
>>> +	struct sock *sk = (struct sock *)msk;
>>> +	unsigned long timeout, close_timeout;
>>> +
>>> +	if (!fail_tout && !sock_flag(sk, SOCK_DEAD))
>>> +		return;
>>> +
>>> +	close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies + TCP_TIMEWAIT_LEN;
>>> +
>>> +	/* the following is basically time_min(close_timeout, fail_tout) */
>>> +	if (!fail_tout)
>>> +		timeout = close_timeout;
>>> +	else if (!sock_flag(sk, SOCK_DEAD))
>>> +		timeout = fail_tout;
>>> +	else if (time_after(close_timeout, fail_tout))
>>> +		timeout = fail_tout;
>>> +	else
>>> +		timeout = close_timeout;
>>> +
>>> +	sk_reset_timer(sk, &sk->sk_timer, timeout);
>>> +}
>>
>> Hi Paolo -
>>
>> The above function seems more complex than needed. If mptcp_close() has
>> been called, the fail timeout should be considered canceled and the
>> "close" has exclusive use of sk_timer. subflow->fail_tout can be set to 0
>> and sk_timer can be set only based on TCP_TIMEWAIT_LEN.
>
> I agree the above function is non trivial, on the flip side, it helps
> keeping all the timeout logic in a single place.
>
> Note that msk->first->fail_tout is under the subflow socket lock
> protection, and this function is not always invoked under such lock, so
> we will need to either add more read/write once annotation or split the
> logic in the caller.
>
> Side note: I agree a bit with David Laight wrt *_once proliferation,
> possibly with less divisive language:
> https://lore.kernel.org/netdev/cea2c2c39d0e4f27b2e75cdbc8fce09d@AcuMS.aculab.com/
>
> ;)

Heh - yeah, I saw that exchange too :)

>
> *_ONCE are a bit hard to track and difficult to verify formally, so I
> tried hard to reduce their usage. Currently we have only a single READ
> operation outside the relevant lock, which is AFAIK the 'safest'
> possible scenario with such annotations. All the write operations are
> under the subflow socket lock.
>
>> TCP_RTO_MAX for the MP_FAIL timeout was kind of arbitrary - the spec
>> doesn't say what exactly to do. We didn't want the connection to be in the
>> "unacknowledged" MP_FAIL state forever, but also didn't want to give up
>> too fast.
>
> This function doesn't make any assumptions about the relative timeout
> lengths, and should fit the case if we make them configurable someday.
>
>> Given that, do you think the complexity is justified to possibly reset the
>> subflow earlier in this "unacknowledged MP_FAIL followed by a close"
>> scenario?
>
> IMHO yes ;) But if you have a strong opinion otherwise I can refactor
> the code...
>

I do think it's better to cancel the MP_FAIL timeout once mptcp_close() is
called. It's not just about the code in mptcp_reset_timeout(): it doesn't
seem meaningful to continue MP_FAIL handling after close, nor to have
various scenarios to consider/test/etc. for timeout or MP_FAIL echo
ordering.


--
Mat Martineau
Intel


