mptcp.lists.linux.dev archive mirror
* [PATCH mptcp-next 0/7] mptcp: refactor active backup
@ 2021-06-28 15:54 Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 1/7] mptcp: more accurate timeout Paolo Abeni
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Paolo Abeni @ 2021-06-28 15:54 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

This series addresses a bunch of issues somewhat related to active
backup handling. The most visible functional issue addressed here is:

https://github.com/multipath-tcp/mptcp_net-next/issues/191

This series also adds some specific self-tests, covering both
active-backup switch-over and proper usage of the backup link.

A new netns parameter is introduced:

stale_loss_cnt - the max number of mptcp rtx timeouts with no progress
	and outstanding data over a single subflow needed to declare
	such subflow 'stale': no more data will be queued on it until
	some ack is observed.
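
In rough pseudo-C terms - a sketch only, with hypothetical helpers
(data_outstanding(), peer_progressed()), not the code actually added
by this series - the rule reads:

	/* sketch: each mptcp rtx timeout finding outstanding data and
	 * no new input from the peer bumps a per-subflow counter; once
	 * it crosses stale_loss_cnt the subflow is marked stale and no
	 * more data is queued on it until some ack is observed
	 */
	if (data_outstanding(ssk) && !peer_progressed(ssk) &&
	    ++subflow->stale_count > stale_loss_cnt)
		subflow->stale = 1;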

This parameter is currently not configurable: I'm undecided whether it
should live under the pm_netlink APIs or under sysfs. I have a slight
preference for the latter. Any opinion welcome!

This has been only lightly (but painfully) tested; my proposal is to
let it stage in the export branch for some time.

Paolo Abeni (7):
  mptcp: more accurate timeout
  mptcp: less aggressive retransmission strategy
  mptcp: handle pending data on closed subflow
  mptcp: faster active backup recovery
  mptcp: add mibs for stale subflows processing
  mptcp: backup flag from incoming MPJ ack option
  selftests: mptcp: add testcase for active-backup

 net/mptcp/mib.c                               |   2 +
 net/mptcp/mib.h                               |   2 +
 net/mptcp/options.c                           |   9 +-
 net/mptcp/pm.c                                |  21 ++
 net/mptcp/pm_netlink.c                        |  40 ++++
 net/mptcp/protocol.c                          | 132 ++++++++++---
 net/mptcp/protocol.h                          |  19 +-
 net/mptcp/subflow.c                           |   6 +-
 .../testing/selftests/net/mptcp/mptcp_join.sh | 185 +++++++++++++++---
 9 files changed, 352 insertions(+), 64 deletions(-)

-- 
2.26.3



* [PATCH mptcp-next 1/7] mptcp: more accurate timeout
  2021-06-28 15:54 [PATCH mptcp-next 0/7] mptcp: refactor active backup Paolo Abeni
@ 2021-06-28 15:54 ` Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 2/7] mptcp: less aggressive retransmission strategy Paolo Abeni
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2021-06-28 15:54 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

As reported by Maxim, we have a lot of MPTCP-level
retransmissions when multiple links with different latencies
are in use.

This patch refactors the mptcp-level timeout accounting so that
the maximum of all the active subflows' timeouts is used. To avoid
traversing the subflow list multiple times, the update is
performed inside the packet scheduler.

Additionally, clean up the timeout handling a bit.
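
As a purely illustrative example: with two active subflows whose TCP
rtx timers would fire in 200ms and 800ms, the scheduler now arms the
MPTCP-level timer at 800ms, so the fast-but-lossy path alone can no
longer trigger spurious MPTCP-level retransmissions. In sketch form,
using the helpers from the patch below:

	long tout = 0;

	mptcp_for_each_subflow(msk, subflow)
		tout = max(tout, mptcp_timeout_from_ssk(subflow->tcp_sock));
	mptcp_set_timeout(sk, tout);	/* arms the max across subflows */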

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4cc94ee425e4..291f0e2279a8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -411,13 +411,8 @@ static void mptcp_set_datafin_timeout(const struct sock *sk)
 				       TCP_RTO_MIN << icsk->icsk_retransmits);
 }
 
-static void mptcp_set_timeout(const struct sock *sk, const struct sock *ssk)
+static void mptcp_set_timeout(struct sock *sk, long tout)
 {
-	long tout = ssk && inet_csk(ssk)->icsk_pending ?
-				      inet_csk(ssk)->icsk_timeout - jiffies : 0;
-
-	if (tout <= 0)
-		tout = mptcp_sk(sk)->timer_ival;
 	mptcp_sk(sk)->timer_ival = tout > 0 ? tout : TCP_RTO_MIN;
 }
 
@@ -531,7 +526,6 @@ static bool mptcp_check_data_fin(struct sock *sk)
 		}
 
 		ret = true;
-		mptcp_set_timeout(sk, NULL);
 		mptcp_send_ack(msk);
 		mptcp_close_wake_up(sk);
 	}
@@ -791,10 +785,7 @@ static void mptcp_reset_timer(struct sock *sk)
 	if (unlikely(inet_sk_state_load(sk) == TCP_CLOSE))
 		return;
 
-	/* should never be called with mptcp level timer cleared */
-	tout = READ_ONCE(mptcp_sk(sk)->timer_ival);
-	if (WARN_ON_ONCE(!tout))
-		tout = TCP_RTO_MIN;
+	tout = mptcp_sk(sk)->timer_ival;
 	sk_reset_timer(sk, &icsk->icsk_retransmit_timer, jiffies + tout);
 }
 
@@ -1366,20 +1357,32 @@ struct subflow_send_info {
 	u64 ratio;
 };
 
+static long mptcp_timeout_from_ssk(const struct sock *ssk)
+{
+	return inet_csk(ssk)->icsk_pending ? inet_csk(ssk)->icsk_timeout - jiffies : 0;
+}
+
+/* implement the mptcp packet scheduler;
+ * returns the subflow that will transmit the next DSS
+ * additionally updates the rtx timeout
+ */
 static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 {
 	struct subflow_send_info send_info[2];
 	struct mptcp_subflow_context *subflow;
+	struct sock *sk = (struct sock *)msk;
 	int i, nr_active = 0;
 	struct sock *ssk;
+	long tout = 0;
 	u64 ratio;
 	u32 pace;
 
-	sock_owned_by_me((struct sock *)msk);
+	sock_owned_by_me(sk);
 
 	if (__mptcp_check_fallback(msk)) {
 		if (!msk->first)
 			return NULL;
+		mptcp_set_timeout(sk, mptcp_timeout_from_ssk(msk->first));
 		return sk_stream_memory_free(msk->first) ? msk->first : NULL;
 	}
 
@@ -1400,6 +1403,7 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 		if (!mptcp_subflow_active(subflow))
 			continue;
 
+		tout = max(tout, mptcp_timeout_from_ssk(ssk));
 		nr_active += !subflow->backup;
 		if (!sk_stream_memory_free(subflow->tcp_sock) || !tcp_sk(ssk)->snd_wnd)
 			continue;
@@ -1415,6 +1419,7 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 			send_info[subflow->backup].ratio = ratio;
 		}
 	}
+	mptcp_set_timeout(sk, tout);
 
 	/* pick the best backup if no other subflow is active */
 	if (!nr_active)
@@ -1433,7 +1438,6 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 static void mptcp_push_release(struct sock *sk, struct sock *ssk,
 			       struct mptcp_sendmsg_info *info)
 {
-	mptcp_set_timeout(sk, ssk);
 	tcp_push(ssk, 0, info->mss_now, tcp_sk(ssk)->nonagle, info->size_goal);
 	release_sock(ssk);
 }
@@ -1567,7 +1571,6 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk)
 	 */
 	__mptcp_update_wmem(sk);
 	if (copied) {
-		mptcp_set_timeout(sk, ssk);
 		tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle,
 			 info.size_goal);
 		if (!mptcp_timer_pending(sk))
@@ -2313,7 +2316,6 @@ static void __mptcp_retrans(struct sock *sk)
 			 info.size_goal);
 	}
 
-	mptcp_set_timeout(sk, ssk);
 	release_sock(ssk);
 
 reset_timer:
@@ -2384,6 +2386,7 @@ static int __mptcp_init_sock(struct sock *sk)
 	msk->wmem_reserved = 0;
 	WRITE_ONCE(msk->rmem_released, 0);
 	msk->tx_pending_data = 0;
+	msk->timer_ival = TCP_RTO_MIN;
 
 	msk->first = NULL;
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -2476,7 +2479,6 @@ void mptcp_subflow_shutdown(struct sock *sk, struct sock *ssk, int how)
 			tcp_shutdown(ssk, how);
 		} else {
 			pr_debug("Sending DATA_FIN on subflow %p", ssk);
-			mptcp_set_timeout(sk, ssk);
 			tcp_send_ack(ssk);
 			if (!mptcp_timer_pending(sk))
 				mptcp_reset_timer(sk);
-- 
2.26.3



* [PATCH mptcp-next 2/7] mptcp: less aggressive retransmission strategy
  2021-06-28 15:54 [PATCH mptcp-next 0/7] mptcp: refactor active backup Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 1/7] mptcp: more accurate timeout Paolo Abeni
@ 2021-06-28 15:54 ` Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 3/7] mptcp: handle pending data on closed subflow Paolo Abeni
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2021-06-28 15:54 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

The current mptcp re-inject strategy is very aggressive:
we have mptcp-level retransmissions even on single-subflow
connections, if the in-use link is lossy.

Let's be a little more conservative: do the retransmission
only if at least one subflow has both its write and rtx
queues empty.

Additionally, use the backup subflows only if the active
subflows are stale - no progress in at least one rtx period.
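
Condensed to a sketch that mirrors the hunk below, the new pick rule
becomes:

	/* prefer a subflow with empty write and rtx queues; while
	 * skipping busy subflows, track the least-stale one seen, and
	 * fall back to a backup subflow only when every subflow went
	 * at least one full rtx period without progress
	 */
	if (pick)
		return pick;
	return min_stale_count > 1 ? backup : NULL;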

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/pm.c       | 17 +++++++++++++++++
 net/mptcp/protocol.c | 22 ++++++++++++++--------
 net/mptcp/protocol.h |  5 ++++-
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 639271e09604..9ff17c5205ce 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -308,6 +308,23 @@ int mptcp_pm_get_local_id(struct mptcp_sock *msk, struct sock_common *skc)
 	return mptcp_pm_nl_get_local_id(msk, skc);
 }
 
+void mptcp_pm_subflow_chk_stale(const struct mptcp_sock *msk, struct sock *ssk)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	u32 rcv_tstamp = READ_ONCE(tcp_sk(ssk)->rcv_tstamp);
+
+	/* keep track of rtx periods with no progress */
+	if (!subflow->stale_count) {
+		subflow->stale_rcv_tstamp = rcv_tstamp;
+		subflow->stale_count++;
+	} else if (subflow->stale_rcv_tstamp == rcv_tstamp) {
+		if (subflow->stale_count < U8_MAX)
+			subflow->stale_count++;
+	} else {
+		subflow->stale_count = 0;
+	}
+}
+
 void mptcp_pm_data_init(struct mptcp_sock *msk)
 {
 	msk->pm.add_addr_signaled = 0;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 291f0e2279a8..51f608831fae 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2088,8 +2088,9 @@ static void mptcp_timeout_timer(struct timer_list *t)
  */
 static struct sock *mptcp_subflow_get_retrans(const struct mptcp_sock *msk)
 {
+	struct sock *backup = NULL, *pick = NULL;
 	struct mptcp_subflow_context *subflow;
-	struct sock *backup = NULL;
+	int min_stale_count = INT_MAX;
 
 	sock_owned_by_me((const struct sock *)msk);
 
@@ -2102,11 +2103,11 @@ static struct sock *mptcp_subflow_get_retrans(const struct mptcp_sock *msk)
 		if (!mptcp_subflow_active(subflow))
 			continue;
 
-		/* still data outstanding at TCP level?  Don't retransmit. */
-		if (!tcp_write_queue_empty(ssk)) {
-			if (inet_csk(ssk)->icsk_ca_state >= TCP_CA_Loss)
-				continue;
-			return NULL;
+		/* still data outstanding at TCP level? skip this */
+		if (!tcp_rtx_and_write_queues_empty(ssk)) {
+			mptcp_pm_subflow_chk_stale(msk, ssk);
+			min_stale_count = min_t(int, min_stale_count, subflow->stale_count);
+			continue;
 		}
 
 		if (subflow->backup) {
@@ -2115,10 +2116,15 @@ static struct sock *mptcp_subflow_get_retrans(const struct mptcp_sock *msk)
 			continue;
 		}
 
-		return ssk;
+		if (!pick)
+			pick = ssk;
 	}
 
-	return backup;
+	if (pick)
+		return pick;
+
+	/* use backup only if there is no progress anywhere */
+	return min_stale_count > 1 ? backup : NULL;
 }
 
 static void mptcp_dispose_initial_subflow(struct mptcp_sock *msk)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 0f0c026c5f8b..6a3cbdb597e2 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -439,11 +439,13 @@ struct mptcp_subflow_context {
 	u8	reset_seen:1;
 	u8	reset_transient:1;
 	u8	reset_reason:4;
+	u8	stale_count;
 
 	long	delegated_status;
 	struct	list_head delegated_node;   /* link into delegated_action, protected by local BH */
 
-	u32 setsockopt_seq;
+	u32	setsockopt_seq;
+	u32	stale_rcv_tstamp;
 
 	struct	sock *tcp_sock;	    /* tcp sk backpointer */
 	struct	sock *conn;	    /* parent mptcp_sock */
@@ -690,6 +692,7 @@ void mptcp_crypto_hmac_sha(u64 key1, u64 key2, u8 *msg, int len, void *hmac);
 
 void __init mptcp_pm_init(void);
 void mptcp_pm_data_init(struct mptcp_sock *msk);
+void mptcp_pm_subflow_chk_stale(const struct mptcp_sock *msk, struct sock *ssk);
 void mptcp_pm_new_connection(struct mptcp_sock *msk, const struct sock *ssk, int server_side);
 void mptcp_pm_fully_established(struct mptcp_sock *msk, const struct sock *ssk, gfp_t gfp);
 bool mptcp_pm_allow_new_subflow(struct mptcp_sock *msk);
-- 
2.26.3



* [PATCH mptcp-next 3/7] mptcp: handle pending data on closed subflow
  2021-06-28 15:54 [PATCH mptcp-next 0/7] mptcp: refactor active backup Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 1/7] mptcp: more accurate timeout Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 2/7] mptcp: less aggressive retransmission strategy Paolo Abeni
@ 2021-06-28 15:54 ` Paolo Abeni
  2021-07-09  0:44   ` Mat Martineau
  2021-06-28 15:54 ` [PATCH mptcp-next 4/7] mptcp: faster active backup recovery Paolo Abeni
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2021-06-28 15:54 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

The PM can close an active subflow, e.g. due to an ingress RM_ADDR
option. Such a subflow could carry data still unacked at the
MPTCP level, both in the write and the rtx_queue, which has
never reached the other peer.

Currently the mptcp-level retransmission will deliver such data,
but at a very low rate (at most 1 DSM for each MPTCP rtx interval).

We can speed up the recovery a lot by moving all the unacked data
into the tcp write_queue, so that it will be pushed again via other
subflows, at the speed allowed by them.

Also make the new helper available for later patches.
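
The ack_update_msk() change below is the subtle part: once the rtx
queue is re-injected, msk->snd_nxt is rewound, so an incoming data_ack
in the window (snd_nxt, recovery_snd_nxt] still refers to data that
really left the host. In sketch form:

	/* during recovery, only acks above recovery_snd_nxt (or any
	 * overshoot while not in recovery) are truly bogus
	 */
	if (after64(new_snd_una, snd_nxt) &&
	    (!msk->recovery || after64(new_snd_una, msk->recovery_snd_nxt)))
		new_snd_una = old_snd_una;	/* ignore the ack */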

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/options.c  |  9 ++++++---
 net/mptcp/protocol.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.h |  3 +++
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index b5850afea343..c82ad34a0abc 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -975,9 +975,12 @@ static void ack_update_msk(struct mptcp_sock *msk,
 	old_snd_una = msk->snd_una;
 	new_snd_una = mptcp_expand_seq(old_snd_una, mp_opt->data_ack, mp_opt->ack64);
 
-	/* ACK for data not even sent yet? Ignore. */
-	if (after64(new_snd_una, snd_nxt))
-		new_snd_una = old_snd_una;
+	/* ACK for data not even sent yet and even above recovery bound? Ignore.*/
+	if (unlikely(after64(new_snd_una, snd_nxt))) {
+		if (!READ_ONCE(msk->recovery) ||
+		    after64(new_snd_una, READ_ONCE(msk->recovery_snd_nxt)))
+			new_snd_una = old_snd_una;
+	}
 
 	new_wnd_end = new_snd_una + tcp_sk(ssk)->snd_wnd;
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 51f608831fae..b0a7eba202fc 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1032,6 +1032,9 @@ static void __mptcp_clean_una(struct sock *sk)
 	if (__mptcp_check_fallback(msk))
 		msk->snd_una = READ_ONCE(msk->snd_nxt);
 
+	if (unlikely(msk->recovery) && after64(msk->snd_una, msk->recovery_snd_nxt))
+		WRITE_ONCE(msk->recovery, false);
+
 	snd_una = msk->snd_una;
 	list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list) {
 		if (after64(dfrag->data_seq + dfrag->data_len, snd_una))
@@ -2135,6 +2138,45 @@ static void mptcp_dispose_initial_subflow(struct mptcp_sock *msk)
 	}
 }
 
+bool __mptcp_retransmit_pending_data(struct sock *sk, const struct sock *ssk)
+{
+	struct mptcp_data_frag *cur, *rtx_head;
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (__mptcp_check_fallback(mptcp_sk(sk)))
+		return false;
+
+	if (tcp_rtx_and_write_queues_empty(sk))
+		return false;
+
+	/* the closing socket has some data untransmitted and/or unacked:
+	 * some data in the mptcp rtx queue has not really xmitted yet.
+	 * keep it simple and re-inject the whole mptcp level rtx queue
+	 */
+	mptcp_clean_una_wakeup(sk);
+	rtx_head = mptcp_rtx_head(sk);
+	if (!rtx_head)
+		return false;
+
+	/* will accept ack for re-injected data before re-sending them */
+	if (!msk->recovery || after64(msk->snd_nxt, msk->recovery_snd_nxt))
+		WRITE_ONCE(msk->recovery_snd_nxt, msk->snd_nxt);
+	WRITE_ONCE(msk->recovery, true);
+
+	msk->first_pending = rtx_head;
+	msk->tx_pending_data += msk->snd_nxt - rtx_head->data_seq;
+	msk->snd_nxt = rtx_head->data_seq;
+	msk->snd_burst = 0;
+
+	/* be sure to clear the "sent status" on all re-injected fragments */
+	list_for_each_entry(cur, &msk->rtx_queue, list) {
+		if (!cur->already_sent)
+			break;
+		cur->already_sent = 0;
+	}
+	return true;
+}
+
 /* subflow sockets can be either outgoing (connect) or incoming
  * (accept).
  *
@@ -2147,6 +2189,7 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 			      struct mptcp_subflow_context *subflow)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	bool need_push;
 
 	list_del(&subflow->node);
 
@@ -2158,6 +2201,7 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 	if (ssk->sk_socket)
 		sock_orphan(ssk);
 
+	need_push = __mptcp_retransmit_pending_data(sk, ssk);
 	subflow->disposable = 1;
 
 	/* if ssk hit tcp_done(), tcp_cleanup_ulp() cleared the related ops
@@ -2185,6 +2229,9 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 
 	if (msk->subflow && ssk == msk->subflow->sk)
 		mptcp_dispose_initial_subflow(msk);
+
+	if (need_push)
+		__mptcp_push_pending(sk, 0);
 }
 
 void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
@@ -2397,6 +2444,7 @@ static int __mptcp_init_sock(struct sock *sk)
 	msk->first = NULL;
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
 	WRITE_ONCE(msk->csum_enabled, mptcp_is_checksum_enabled(sock_net(sk)));
+	WRITE_ONCE(msk->recovery, false);
 
 	mptcp_pm_data_init(msk);
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 6a3cbdb597e2..0218b777cdc3 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -222,6 +222,7 @@ struct mptcp_sock {
 	u64		local_key;
 	u64		remote_key;
 	u64		write_seq;
+	u64		recovery_snd_nxt;	/* in recovery mode accept up to this seq */
 	u64		snd_nxt;
 	u64		ack_seq;
 	u64		rcv_wnd_sent;
@@ -236,6 +237,7 @@ struct mptcp_sock {
 	u32		token;
 	int		rmem_released;
 	unsigned long	flags;
+	bool		recovery;		/* closing subflow write queue reinjected */
 	bool		can_ack;
 	bool		fully_established;
 	bool		rcv_data_fin;
@@ -557,6 +559,7 @@ int mptcp_is_checksum_enabled(struct net *net);
 int mptcp_allow_join_id0(struct net *net);
 void mptcp_subflow_fully_established(struct mptcp_subflow_context *subflow,
 				     struct mptcp_options_received *mp_opt);
+bool __mptcp_retransmit_pending_data(struct sock *sk, const struct sock *ssk);
 bool mptcp_subflow_data_available(struct sock *sk);
 void __init mptcp_subflow_init(void);
 void mptcp_subflow_shutdown(struct sock *sk, struct sock *ssk, int how);
-- 
2.26.3



* [PATCH mptcp-next 4/7] mptcp: faster active backup recovery
  2021-06-28 15:54 [PATCH mptcp-next 0/7] mptcp: refactor active backup Paolo Abeni
                   ` (2 preceding siblings ...)
  2021-06-28 15:54 ` [PATCH mptcp-next 3/7] mptcp: handle pending data on closed subflow Paolo Abeni
@ 2021-06-28 15:54 ` Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 5/7] mptcp: add mibs for stale subflows processing Paolo Abeni
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2021-06-28 15:54 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

The msk can use backup subflows to transmit in-sequence data
only if there is no other active subflow. In active backup
scenarios, the MPTCP connection can make forward progress only
via MPTCP retransmissions - rtx can pick backup subflows.

This patch introduces a new flag for MPTCP subflows: if the
underlying TCP connection made no progress for a long time,
and there are other less problematic subflows available, the
given subflow becomes stale.

Stale subflows are not considered active: if all non-backup
subflows become stale, the MPTCP scheduler can pick backup
subflows for plain transmissions.

Stale subflows can return to the active state as soon as any
reply from the peer is observed.

Active backup scenarios can now leverage the available b/w
with no restriction.
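
In sketch form, the per-subflow transitions implemented below (locking
omitted) are:

	/* a stale subflow is skipped by the scheduler while the peer's
	 * rcv_tstamp stands still; any new incoming data clears the
	 * flag and the subflow is considered again for transmissions
	 */
	if (subflow->stale && rcv_tstamp != subflow->stale_rcv_tstamp)
		mptcp_subflow_set_active(subflow);	/* recover */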

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/pm.c         |  2 ++
 net/mptcp/pm_netlink.c | 39 +++++++++++++++++++++++++++++++++++++++
 net/mptcp/protocol.c   | 27 ++++++++++++++++++++++++---
 net/mptcp/protocol.h   | 11 +++++++++--
 4 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 9ff17c5205ce..d8a85fe92360 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -320,8 +320,10 @@ void mptcp_pm_subflow_chk_stale(const struct mptcp_sock *msk, struct sock *ssk)
 	} else if (subflow->stale_rcv_tstamp == rcv_tstamp) {
 		if (subflow->stale_count < U8_MAX)
 			subflow->stale_count++;
+		mptcp_pm_nl_subflow_chk_stale(msk, ssk);
 	} else {
 		subflow->stale_count = 0;
+		mptcp_subflow_set_active(subflow);
 	}
 }
 
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index d2591ebf01d9..d93e5f73b5cb 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -47,6 +47,7 @@ struct pm_nl_pernet {
 	spinlock_t		lock;
 	struct list_head	local_addr_list;
 	unsigned int		addrs;
+	unsigned int		stale_loss_cnt;
 	unsigned int		add_addr_signal_max;
 	unsigned int		add_addr_accept_max;
 	unsigned int		local_addr_max;
@@ -900,6 +901,43 @@ static const struct nla_policy mptcp_pm_policy[MPTCP_PM_ATTR_MAX + 1] = {
 	[MPTCP_PM_ATTR_SUBFLOWS]	= { .type	= NLA_U32,	},
 };
 
+void mptcp_pm_nl_subflow_chk_stale(const struct mptcp_sock *msk, struct sock *ssk)
+{
+	struct mptcp_subflow_context *iter, *subflow = mptcp_subflow_ctx(ssk);
+	struct sock *sk = (struct sock *)msk;
+	unsigned int active_max_loss_cnt;
+	struct net *net = sock_net(sk);
+	struct pm_nl_pernet *pernet;
+	unsigned int stale_loss_cnt;
+	bool slow, push = false;
+
+	pernet = net_generic(net, pm_nl_pernet_id);
+	stale_loss_cnt = READ_ONCE(pernet->stale_loss_cnt);
+
+	if (subflow->stale || !stale_loss_cnt || subflow->stale_count <= stale_loss_cnt)
+		return;
+
+	/* look for another available subflow not in loss state */
+	active_max_loss_cnt = max_t(int, stale_loss_cnt - 1, 1);
+	mptcp_for_each_subflow(msk, iter) {
+		if (iter != subflow && mptcp_subflow_active(iter) &&
+		    iter->stale_count < active_max_loss_cnt) {
+			/* we have some alternatives, try to mark this subflow as idle ... */
+			slow = lock_sock_fast(ssk);
+			if (!tcp_rtx_and_write_queues_empty(ssk)) {
+				subflow->stale = 1;
+				push = __mptcp_retransmit_pending_data(sk, ssk);
+			}
+			unlock_sock_fast(ssk, slow);
+
+			/* pending data on the idle subflow: retransmit */
+			if (push)
+				__mptcp_push_pending(sk, 0);
+			return;
+		}
+	}
+}
+
 static int mptcp_pm_family_to_addr(int family)
 {
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
@@ -1923,6 +1961,7 @@ static int __net_init pm_nl_init_net(struct net *net)
 
 	INIT_LIST_HEAD_RCU(&pernet->local_addr_list);
 	pernet->next_id = 1;
+	pernet->stale_loss_cnt = 4;
 	spin_lock_init(&pernet->lock);
 
 	/* No need to initialize other pernet fields, the struct is zeroed at
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b0a7eba202fc..fc41e4a59b8f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1365,6 +1365,27 @@ static long mptcp_timeout_from_ssk(const struct sock *ssk)
 	return inet_csk(ssk)->icsk_pending ? inet_csk(ssk)->icsk_timeout - jiffies : 0;
 }
 
+void mptcp_subflow_set_active(struct mptcp_subflow_context *subflow)
+{
+	if (!subflow->stale)
+		return;
+
+	subflow->stale = 0;
+}
+
+bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
+{
+	if (unlikely(subflow->stale)) {
+		u32 rcv_tstamp = READ_ONCE(tcp_sk(mptcp_subflow_tcp_sock(subflow))->rcv_tstamp);
+
+		if (subflow->stale_rcv_tstamp == rcv_tstamp)
+			return false;
+
+		mptcp_subflow_set_active(subflow);
+	}
+	return __mptcp_subflow_active(subflow);
+}
+
 /* implement the mptcp packet scheduler;
  * returns the subflow that will transmit the next DSS
  * additionally updates the rtx timeout
@@ -1445,7 +1466,7 @@ static void mptcp_push_release(struct sock *sk, struct sock *ssk,
 	release_sock(ssk);
 }
 
-static void __mptcp_push_pending(struct sock *sk, unsigned int flags)
+void __mptcp_push_pending(struct sock *sk, unsigned int flags)
 {
 	struct sock *prev_ssk = NULL, *ssk = NULL;
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -2089,7 +2110,7 @@ static void mptcp_timeout_timer(struct timer_list *t)
  *
  * A backup subflow is returned only if that is the only kind available.
  */
-static struct sock *mptcp_subflow_get_retrans(const struct mptcp_sock *msk)
+static struct sock *mptcp_subflow_get_retrans(struct mptcp_sock *msk)
 {
 	struct sock *backup = NULL, *pick = NULL;
 	struct mptcp_subflow_context *subflow;
@@ -2103,7 +2124,7 @@ static struct sock *mptcp_subflow_get_retrans(const struct mptcp_sock *msk)
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
-		if (!mptcp_subflow_active(subflow))
+		if (!__mptcp_subflow_active(subflow))
 			continue;
 
 		/* still data outstanding at TCP level? skip this */
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 0218b777cdc3..6cc9059c6a40 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -429,7 +429,8 @@ struct mptcp_subflow_context {
 		send_mp_prio : 1,
 		rx_eof : 1,
 		can_ack : 1,        /* only after processing the remote a key */
-		disposable : 1;	    /* ctx can be free at ulp release time */
+		disposable : 1,	    /* ctx can be free at ulp release time */
+		stale : 1;	    /* unable to snd/rcv data, do not use for xmit */
 	enum mptcp_data_avail data_avail;
 	u32	remote_nonce;
 	u64	thmac;
@@ -560,6 +561,7 @@ int mptcp_allow_join_id0(struct net *net);
 void mptcp_subflow_fully_established(struct mptcp_subflow_context *subflow,
 				     struct mptcp_options_received *mp_opt);
 bool __mptcp_retransmit_pending_data(struct sock *sk, const struct sock *ssk);
+void __mptcp_push_pending(struct sock *sk, unsigned int flags);
 bool mptcp_subflow_data_available(struct sock *sk);
 void __init mptcp_subflow_init(void);
 void mptcp_subflow_shutdown(struct sock *sk, struct sock *ssk, int how);
@@ -578,7 +580,7 @@ void mptcp_info2sockaddr(const struct mptcp_addr_info *info,
 			 struct sockaddr_storage *addr,
 			 unsigned short family);
 
-static inline bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
+static inline bool __mptcp_subflow_active(struct mptcp_subflow_context *subflow)
 {
 	struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
@@ -590,6 +592,10 @@ static inline bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
 	return ((1 << ssk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT));
 }
 
+void mptcp_subflow_set_active(struct mptcp_subflow_context *subflow);
+
+bool mptcp_subflow_active(struct mptcp_subflow_context *subflow);
+
 static inline void mptcp_subflow_tcp_fallback(struct sock *sk,
 					      struct mptcp_subflow_context *ctx)
 {
@@ -696,6 +702,7 @@ void mptcp_crypto_hmac_sha(u64 key1, u64 key2, u8 *msg, int len, void *hmac);
 void __init mptcp_pm_init(void);
 void mptcp_pm_data_init(struct mptcp_sock *msk);
 void mptcp_pm_subflow_chk_stale(const struct mptcp_sock *msk, struct sock *ssk);
+void mptcp_pm_nl_subflow_chk_stale(const struct mptcp_sock *msk, struct sock *ssk);
 void mptcp_pm_new_connection(struct mptcp_sock *msk, const struct sock *ssk, int server_side);
 void mptcp_pm_fully_established(struct mptcp_sock *msk, const struct sock *ssk, gfp_t gfp);
 bool mptcp_pm_allow_new_subflow(struct mptcp_sock *msk);
-- 
2.26.3



* [PATCH mptcp-next 5/7] mptcp: add mibs for stale subflows processing
  2021-06-28 15:54 [PATCH mptcp-next 0/7] mptcp: refactor active backup Paolo Abeni
                   ` (3 preceding siblings ...)
  2021-06-28 15:54 ` [PATCH mptcp-next 4/7] mptcp: faster active backup recovery Paolo Abeni
@ 2021-06-28 15:54 ` Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 6/7] mptcp: backup flag from incoming MPJ ack option Paolo Abeni
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2021-06-28 15:54 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

This allows monitoring exceptional events like
active backup scenarios.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/mib.c        | 2 ++
 net/mptcp/mib.h        | 2 ++
 net/mptcp/pm.c         | 2 ++
 net/mptcp/pm_netlink.c | 1 +
 net/mptcp/protocol.c   | 1 +
 5 files changed, 8 insertions(+)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index ff2cc0e3273d..3a7c4e7b2d79 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -45,6 +45,8 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("MPPrioTx", MPTCP_MIB_MPPRIOTX),
 	SNMP_MIB_ITEM("MPPrioRx", MPTCP_MIB_MPPRIORX),
 	SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
+	SNMP_MIB_ITEM("SubflowStale", MPTCP_MIB_SUBFLOWSTALE),
+	SNMP_MIB_ITEM("SubflowRecover", MPTCP_MIB_SUBFLOWRECOVER),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 0663cb12b448..8ec16c991aac 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -38,6 +38,8 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_MPPRIOTX,		/* Transmit a MP_PRIO */
 	MPTCP_MIB_MPPRIORX,		/* Received a MP_PRIO */
 	MPTCP_MIB_RCVPRUNED,		/* Incoming packet dropped due to memory limit */
+	MPTCP_MIB_SUBFLOWSTALE,		/* Subflows entered 'stale' status */
+	MPTCP_MIB_SUBFLOWRECOVER,	/* Subflows returned to active status after being stale */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index d8a85fe92360..0ed3e565f8f8 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -10,6 +10,8 @@
 #include <net/mptcp.h>
 #include "protocol.h"
 
+#include "mib.h"
+
 /* path manager command handlers */
 
 int mptcp_pm_announce_addr(struct mptcp_sock *msk,
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index d93e5f73b5cb..5c2f76ee5923 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -927,6 +927,7 @@ void mptcp_pm_nl_subflow_chk_stale(const struct mptcp_sock *msk, struct sock *ss
 			if (!tcp_rtx_and_write_queues_empty(ssk)) {
 				subflow->stale = 1;
 				push = __mptcp_retransmit_pending_data(sk, ssk);
+				MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_SUBFLOWSTALE);
 			}
 			unlock_sock_fast(ssk, slow);
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index fc41e4a59b8f..3b27b5c3791b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1371,6 +1371,7 @@ void mptcp_subflow_set_active(struct mptcp_subflow_context *subflow)
 		return;
 
 	subflow->stale = 0;
+	MPTCP_INC_STATS(sock_net(mptcp_subflow_tcp_sock(subflow)), MPTCP_MIB_SUBFLOWRECOVER);
 }
 
 bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
-- 
2.26.3



* [PATCH mptcp-next 6/7] mptcp: backup flag from incoming MPJ ack option
  2021-06-28 15:54 [PATCH mptcp-next 0/7] mptcp: refactor active backup Paolo Abeni
                   ` (4 preceding siblings ...)
  2021-06-28 15:54 ` [PATCH mptcp-next 5/7] mptcp: add mibs for stale subflows processing Paolo Abeni
@ 2021-06-28 15:54 ` Paolo Abeni
  2021-06-28 15:54 ` [PATCH mptcp-next 7/7] selftests: mptcp: add testcase for active-backup Paolo Abeni
  2021-07-09  1:13 ` [PATCH mptcp-next 0/7] mptcp: refactor active backup Mat Martineau
  7 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2021-06-28 15:54 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

The parsed incoming backup flag is not propagated
to the subflow itself; as a consequence, the client
may end up using such a subflow to send data.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/191
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/subflow.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 0b5d4a3eadcd..04bd8783f52c 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -440,10 +440,12 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
 			goto do_reset;
 		}
 
+		subflow->backup = mp_opt.backup;
 		subflow->thmac = mp_opt.thmac;
 		subflow->remote_nonce = mp_opt.nonce;
-		pr_debug("subflow=%p, thmac=%llu, remote_nonce=%u", subflow,
-			 subflow->thmac, subflow->remote_nonce);
+		pr_debug("subflow=%p, thmac=%llu, remote_nonce=%u backup=%d",
+			 subflow, subflow->thmac, subflow->remote_nonce,
+			 subflow->backup);
 
 		if (!subflow_thmac_valid(subflow)) {
 			MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_JOINACKMAC);
-- 
2.26.3



* [PATCH mptcp-next 7/7] selftests: mptcp: add testcase for active-backup
  2021-06-28 15:54 [PATCH mptcp-next 0/7] mptcp: refactor active backup Paolo Abeni
                   ` (5 preceding siblings ...)
  2021-06-28 15:54 ` [PATCH mptcp-next 6/7] mptcp: backup flag from incoming MPJ ack option Paolo Abeni
@ 2021-06-28 15:54 ` Paolo Abeni
  2021-07-09  0:51   ` Mat Martineau
  2021-07-09  1:13 ` [PATCH mptcp-next 0/7] mptcp: refactor active backup Mat Martineau
  7 siblings, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2021-06-28 15:54 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

Add more test cases for the link failure scenario,
including recovery from link failure using only
backup subflows and bi-directional transfers.

Additionally, explicitly check the stale counters.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 .../testing/selftests/net/mptcp/mptcp_join.sh | 185 +++++++++++++++---
 1 file changed, 156 insertions(+), 29 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 9a191c1a5de8..e7c021d199e3 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -3,8 +3,10 @@
 
 ret=0
 sin=""
+sinfail=""
 sout=""
 cin=""
+cinfail=""
 cinsent=""
 cout=""
 ksft_skip=4
@@ -88,8 +90,8 @@ cleanup_partial()
 
 cleanup()
 {
-	rm -f "$cin" "$cout"
-	rm -f "$sin" "$sout" "$cinsent"
+	rm -f "$cin" "$cout" "$sinfail"
+	rm -f "$sin" "$sout" "$cinsent" "$cinfail"
 	cleanup_partial
 }
 
@@ -211,11 +213,15 @@ link_failure()
 {
 	ns="$1"
 
-	l=$((RANDOM%4))
-	l=$((l+1))
+	if [ -z "$FAILING_LINKS" ]; then
+		l=$((RANDOM%4))
+		FAILING_LINKS=$((l+1))
+	fi
 
-	veth="ns1eth$l"
-	ip -net "$ns" link set "$veth" down
+	for l in $FAILING_LINKS; do
+		veth="ns1eth$l"
+		ip -net "$ns" link set "$veth" down
+	done
 }
 
 # $1: IP address
@@ -280,10 +286,17 @@ do_transfer()
 		local_addr="0.0.0.0"
 	fi
 
-	timeout ${timeout_test} \
-		ip netns exec ${listener_ns} \
-			$mptcp_connect -t ${timeout_poll} -l -p $port -s ${srv_proto} \
-				${local_addr} < "$sin" > "$sout" &
+	if [ "$test_link_fail" -eq 2 ];then
+		timeout ${timeout_test} \
+			ip netns exec ${listener_ns} \
+				$mptcp_connect -t ${timeout_poll} -l -p $port -s ${cl_proto} \
+					${local_addr} < "$sinfail" > "$sout" &
+	else
+		timeout ${timeout_test} \
+			ip netns exec ${listener_ns} \
+				$mptcp_connect -t ${timeout_poll} -l -p $port -s ${srv_proto} \
+					${local_addr} < "$sin" > "$sout" &
+	fi
 	spid=$!
 
 	sleep 1
@@ -294,7 +307,7 @@ do_transfer()
 				$mptcp_connect -t ${timeout_poll} -p $port -s ${cl_proto} \
 					$connect_addr < "$cin" > "$cout" &
 	else
-		( cat "$cin" ; sleep 2; link_failure $listener_ns ; cat "$cin" ) | \
+		( cat "$cinfail" ; sleep 2; link_failure $listener_ns ; cat "$cinfail" ) | \
 			tee "$cinsent" | \
 			timeout ${timeout_test} \
 				ip netns exec ${connector_ns} \
@@ -434,7 +447,11 @@ do_transfer()
 		return 1
 	fi
 
-	check_transfer $sin $cout "file received by client"
+	if [ "$test_link_fail" -eq 2 ];then
+		check_transfer $sinfail $cout "file received by client"
+	else
+		check_transfer $sin $cout "file received by client"
+	fi
 	retc=$?
 	if [ "$test_link_fail" -eq 0 ];then
 		check_transfer $cin $sout "file received by server"
@@ -477,29 +494,29 @@ run_tests()
 	lret=0
 	oldin=""
 
-	if [ "$test_linkfail" -eq 1 ];then
+	# create the input file for the failure test when
+	# the first failure test runs
+	if [ "$test_linkfail" -ne 0 -a -z "$cinfail" ]; then
+		size=$((RANDOM%64))
+		size=$((size+1))
+		size=$((size*2048))
+
+		cinfail=$(mktemp)
+		make_file "$cinfail" "client" $size
+	fi
+
+	if [ "$test_linkfail" -eq 2 -a -z "$sinfail" ]; then
 		size=$((RANDOM%1024))
 		size=$((size+1))
 		size=$((size*128))
 
-		oldin=$(mktemp)
-		cp "$cin" "$oldin"
-		make_file "$cin" "client" $size
+		sinfail=$(mktemp)
+		make_file "$sinfail" "server" $size
 	fi
 
 	do_transfer ${listener_ns} ${connector_ns} MPTCP MPTCP ${connect_addr} \
 		${test_linkfail} ${addr_nr_ns1} ${addr_nr_ns2} ${speed} ${bkup}
 	lret=$?
-
-	if [ "$test_linkfail" -eq 1 ];then
-		cp "$oldin" "$cin"
-		rm -f "$oldin"
-	fi
-
-	if [ $lret -ne 0 ]; then
-		ret=$lret
-		return
-	fi
 }
 
 chk_csum_nr()
@@ -593,6 +610,38 @@ chk_join_nr()
 	fi
 }
 
+chk_stale_nr()
+{
+	local ns=$1
+	local stale_min=$2
+	local stale_max=$3
+	local stale_delta=$4
+	local dump_stats
+	local stale_nr
+	local recover_nr
+
+	printf "%-39s %s" " " "stale"
+	stale_nr=`ip netns exec $ns nstat -as | grep MPTcpExtSubflowStale | awk '{print $2}'`
+	[ -z "$stale_nr" ] && stale_nr=0
+	recover_nr=`ip netns exec $ns nstat -as | grep MPTcpExtSubflowRecover | awk '{print $2}'`
+	[ -z "$recover_nr" ] && recover_nr=0
+
+	if [ $stale_nr -lt $stale_min -o $stale_nr -gt $stale_max -o $((stale_nr - $recover_nr)) -ne $stale_delta ]; then
+		echo "[fail] got $stale_nr stale[s] $recover_nr recover[s], " \
+		     " expected stale in range [$stale_min..$stale_max]," \
+		     " stale-recover delta $stale_delta "
+		ret=1
+		dump_stats=1
+	else
+		echo "[ ok ]"
+	fi
+
+	if [ "${dump_stats}" = 1 ]; then
+		echo $ns stats
+		ip netns exec $ns nstat -as | grep MPTcp
+	fi
+}
+
 chk_add_nr()
 {
 	local add_nr=$1
@@ -801,6 +850,27 @@ chk_prio_nr()
 	fi
 }
 
+chk_link_usage()
+{
+	local ns=$1
+	local link=$2
+	local out=$3
+	local expected_rate=$4
+	local tx_link=`ip netns exec $ns cat /sys/class/net/$link/statistics/tx_bytes`
+	local tx_total=`ls -l $out | awk '{print $5}'`
+	local tx_rate=$((tx_link * 100 / $tx_total))
+	local tolerance=5
+
+	printf "%-39s %s" " " "link usage "
+	if [ $tx_rate -lt $((expected_rate - $tolerance)) -o \
+	     $tx_rate -gt $((expected_rate + $tolerance)) ]; then
+		echo "[fail] got $tx_rate% usage, expected $expected_rate%"
+		ret=1
+	else
+		echo "[ ok ]"
+	fi
+}
+
 subflows_tests()
 {
 	reset
@@ -925,13 +995,67 @@ link_failure_tests()
 	# accept and use add_addr with additional subflows and link loss
 	reset
 	ip netns exec $ns1 ./pm_nl_ctl limits 0 3
-	ip netns exec $ns1 ./pm_nl_ctl add 10.0.2.1 flags signal
+	ip netns exec $ns1 ./pm_nl_ctl add 10.0.2.1 dev ns1eth2 flags signal
 	ip netns exec $ns2 ./pm_nl_ctl limits 1 3
-	ip netns exec $ns2 ./pm_nl_ctl add 10.0.3.2 flags subflow
-	ip netns exec $ns2 ./pm_nl_ctl add 10.0.4.2 flags subflow
+	ip netns exec $ns2 ./pm_nl_ctl add 10.0.3.2 dev ns2eth3 flags subflow
+	ip netns exec $ns2 ./pm_nl_ctl add 10.0.4.2 dev ns2eth4 flags subflow
 	run_tests $ns1 $ns2 10.0.1.1 1
 	chk_join_nr "multiple flows, signal, link failure" 3 3 3
 	chk_add_nr 1 1
+	chk_stale_nr $ns2 1 5 1
+
+	# accept and use add_addr with additional subflows and link loss
+	# for bidirectional transfer
+	reset
+	ip netns exec $ns1 ./pm_nl_ctl limits 0 3
+	ip netns exec $ns1 ./pm_nl_ctl add 10.0.2.1 dev ns1eth2 flags signal
+	ip netns exec $ns2 ./pm_nl_ctl limits 1 3
+	ip netns exec $ns2 ./pm_nl_ctl add 10.0.3.2 dev ns2eth3 flags subflow
+	ip netns exec $ns2 ./pm_nl_ctl add 10.0.4.2 dev ns2eth4 flags subflow
+	run_tests $ns1 $ns2 10.0.1.1 2
+	chk_join_nr "multiple flows, signal, bidirectional, link failure" 3 3 3
+	chk_add_nr 1 1
+	chk_stale_nr $ns2 1 5 1
+
+	# 2 subflows plus 1 backup subflow with a lossy link, backup
+	# will never be used
+	reset
+	ip netns exec $ns1 ./pm_nl_ctl limits 0 2
+	ip netns exec $ns1 ./pm_nl_ctl add 10.0.2.1 dev ns1eth2 flags signal
+	ip netns exec $ns2 ./pm_nl_ctl limits 1 2
+	export FAILING_LINKS="1"
+	ip netns exec $ns2 ./pm_nl_ctl add 10.0.3.2 dev ns2eth3 flags subflow,backup
+	run_tests $ns1 $ns2 10.0.1.1 1
+	chk_join_nr "backup subflow unused with link failure" 2 2 2
+	chk_add_nr 1 1
+	chk_link_usage $ns2 ns2eth3 $cinsent 0
+
+	# 2 lossy links after half transfer, backup will get half of
+	# the traffic
+	reset
+	ip netns exec $ns1 ./pm_nl_ctl limits 0 2
+	ip netns exec $ns1 ./pm_nl_ctl add 10.0.2.1 dev ns1eth2 flags signal
+	ip netns exec $ns2 ./pm_nl_ctl limits 1 2
+	ip netns exec $ns2 ./pm_nl_ctl add 10.0.3.2 dev ns2eth3 flags subflow,backup
+	export FAILING_LINKS="1 2"
+	run_tests $ns1 $ns2 10.0.1.1 1
+	chk_join_nr "backup subflow used due to multiple links failure" 2 2 2
+	chk_add_nr 1 1
+	chk_stale_nr $ns2 2 4 2
+	chk_link_usage $ns2 ns2eth3 $cinsent 50
+
+	# use a backup subflow with the first subflow on a lossy link
+	# for bidirectional transfer
+	reset
+	ip netns exec $ns1 ./pm_nl_ctl limits 0 2
+	ip netns exec $ns1 ./pm_nl_ctl add 10.0.2.1 dev ns1eth2 flags signal
+	ip netns exec $ns2 ./pm_nl_ctl limits 1 3
+	ip netns exec $ns2 ./pm_nl_ctl add 10.0.3.2 dev ns2eth3 flags subflow,backup
+	run_tests $ns1 $ns2 10.0.1.1 2
+	chk_join_nr "backup subflow in use, bidirectional, link failure" 2 2 2
+	chk_add_nr 1 1
+	chk_stale_nr $ns2 2 4 2
+	chk_link_usage $ns2 ns2eth3 $cinsent 50
 }
 
 add_addr_timeout_tests()
@@ -1288,6 +1412,9 @@ v4mapped_tests()
 
 backup_tests()
 {
+	local rx_backup
+	local tx_backup
+
 	# single subflow, backup
 	reset
 	ip netns exec $ns1 ./pm_nl_ctl limits 0 1
-- 
2.26.3



* Re: [PATCH mptcp-next 3/7] mptcp: handle pending data on closed subflow
  2021-06-28 15:54 ` [PATCH mptcp-next 3/7] mptcp: handle pending data on closed subflow Paolo Abeni
@ 2021-07-09  0:44   ` Mat Martineau
  0 siblings, 0 replies; 12+ messages in thread
From: Mat Martineau @ 2021-07-09  0:44 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp, fwestpha

On Mon, 28 Jun 2021, Paolo Abeni wrote:

> The PM can close an active subflow, e.g. due to an ingress RM_ADDR
> option. Such a subflow could carry data still unacked at the
> MPTCP level, both in the write and the rtx_queue, which has
> never reached the other peer.
>
> Currently the mptcp-level retransmission will deliver such data,
> but at a very low rate (at most 1 DSM for each MPTCP rtx interval).
>
> We can speed up the recovery a lot by moving all the unacked data
> into the tcp write_queue, so that it will be pushed again via other
> subflows, at the speed allowed by them.
>
> Also make the new helper available for later patches.
>
> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/mptcp/options.c  |  9 ++++++---
> net/mptcp/protocol.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> net/mptcp/protocol.h |  3 +++
> 3 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/net/mptcp/options.c b/net/mptcp/options.c
> index b5850afea343..c82ad34a0abc 100644
> --- a/net/mptcp/options.c
> +++ b/net/mptcp/options.c
> @@ -975,9 +975,12 @@ static void ack_update_msk(struct mptcp_sock *msk,
> 	old_snd_una = msk->snd_una;
> 	new_snd_una = mptcp_expand_seq(old_snd_una, mp_opt->data_ack, mp_opt->ack64);
>
> -	/* ACK for data not even sent yet? Ignore. */
> -	if (after64(new_snd_una, snd_nxt))
> -		new_snd_una = old_snd_una;
> +	/* ACK for data not even sent yet and even above recovery bound? Ignore.*/
> +	if (unlikely(after64(new_snd_una, snd_nxt))) {
> +		if (!READ_ONCE(msk->recovery) ||
> +		    after64(new_snd_una, READ_ONCE(msk->recovery_snd_nxt)))
> +			new_snd_una = old_snd_una;
> +	}
>
> 	new_wnd_end = new_snd_una + tcp_sk(ssk)->snd_wnd;
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 51f608831fae..b0a7eba202fc 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1032,6 +1032,9 @@ static void __mptcp_clean_una(struct sock *sk)
> 	if (__mptcp_check_fallback(msk))
> 		msk->snd_una = READ_ONCE(msk->snd_nxt);
>
> +	if (unlikely(msk->recovery) && after64(msk->snd_una, msk->recovery_snd_nxt))
> +		WRITE_ONCE(msk->recovery, false);
> +
> 	snd_una = msk->snd_una;
> 	list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list) {
> 		if (after64(dfrag->data_seq + dfrag->data_len, snd_una))
> @@ -2135,6 +2138,45 @@ static void mptcp_dispose_initial_subflow(struct mptcp_sock *msk)
> 	}
> }
>
> +bool __mptcp_retransmit_pending_data(struct sock *sk, const struct sock *ssk)

One small thing: ssk is unused.
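
A minimal cleanup - assuming nothing later in the series ends up using
it - would be dropping the parameter here, in the protocol.h prototype
and at the call sites:

	bool __mptcp_retransmit_pending_data(struct sock *sk)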

> +{
> +	struct mptcp_data_frag *cur, *rtx_head;
> +	struct mptcp_sock *msk = mptcp_sk(sk);
> +
> +	if (__mptcp_check_fallback(mptcp_sk(sk)))
> +		return false;
> +
> +	if (tcp_rtx_and_write_queues_empty(sk))
> +		return false;
> +
> +	/* the closing socket has some data untransmitted and/or unacked:
> +	 * some data in the mptcp rtx queue has not really xmitted yet.
> +	 * keep it simple and re-inject the whole mptcp level rtx queue
> +	 */
> +	mptcp_clean_una_wakeup(sk);
> +	rtx_head = mptcp_rtx_head(sk);
> +	if (!rtx_head)
> +		return false;
> +
> +	/* will accept ack for re-injected data before re-sending them */
> +	if (!msk->recovery || after64(msk->snd_nxt, msk->recovery_snd_nxt))
> +		WRITE_ONCE(msk->recovery_snd_nxt, msk->snd_nxt);
> +	WRITE_ONCE(msk->recovery, true);
> +
> +	msk->first_pending = rtx_head;
> +	msk->tx_pending_data += msk->snd_nxt - rtx_head->data_seq;
> +	msk->snd_nxt = rtx_head->data_seq;
> +	msk->snd_burst = 0;
> +
> +	/* be sure to clear the "sent status" on all re-injected fragments */
> +	list_for_each_entry(cur, &msk->rtx_queue, list) {
> +		if (!cur->already_sent)
> +			break;
> +		cur->already_sent = 0;
> +	}
> +	return true;
> +}
> +

--
Mat Martineau
Intel


* Re: [PATCH mptcp-next 7/7] selftests: mptcp: add testcase for active-backup
  2021-06-28 15:54 ` [PATCH mptcp-next 7/7] selftests: mptcp: add testcase for active-backup Paolo Abeni
@ 2021-07-09  0:51   ` Mat Martineau
  2021-07-09  7:04     ` Paolo Abeni
  0 siblings, 1 reply; 12+ messages in thread
From: Mat Martineau @ 2021-07-09  0:51 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp, fwestpha

On Mon, 28 Jun 2021, Paolo Abeni wrote:

> Add more test cases for the link failure scenario,
> including recovery from link failure using only
> backup subflows and bi-directional transfers.
>
> Additionally, explicitly check the stale counters.
>

I get a lot of intermittent failures due to the expected stale count 
range, like this:

15 multiple flows, signal, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
                                         add[ ok ] - echo  [ ok ]
                                         stale[fail] got 19 stale[s] 18 recover[s],   expected stale in range [1..5],  stale-recover delta 1

02 multiple flows, signal, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
                                         add[ ok ] - echo  [ ok ]
                                         stale[fail] got 11 stale[s] 10 recover[s],   expected stale in range [1..5],  stale-recover delta 1

02 multiple flows, signal, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
                                         add[ ok ] - echo  [ ok ]
                                         stale[fail] got 51 stale[s] 50 recover[s],   expected stale in range [1..5],  stale-recover delta 1


and


18 backup subflow in use, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
                                         add[ ok ] - echo  [ ok ]
                                         stale[fail] got 8 stale[s] 6 recover[s],   expected stale in range [2..4],  stale-recover delta 2

05 backup subflow in use, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
                                         add[ ok ] - echo  [ ok ]
                                         stale[fail] got 8 stale[s] 6 recover[s],   expected stale in range [2..4],  stale-recover delta 2

I also saw one instance of a poll timeout failure in the "backup subflow 
in use, bidirectional, link failure" case. Let me know if you want the 
logs for that.


> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> .../testing/selftests/net/mptcp/mptcp_join.sh | 185 +++++++++++++++---
> 1 file changed, 156 insertions(+), 29 deletions(-)
>
> diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
> index 9a191c1a5de8..e7c021d199e3 100755
> --- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
> +++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh

...

> @@ -1288,6 +1412,9 @@ v4mapped_tests()
>
> backup_tests()
> {
> +	local rx_backup
> +	local tx_backup
> +

These aren't used (yet?)

> 	# single subflow, backup
> 	reset
> 	ip netns exec $ns1 ./pm_nl_ctl limits 0 1

--
Mat Martineau
Intel


* Re: [PATCH mptcp-next 0/7] mptcp: refactor active backup
  2021-06-28 15:54 [PATCH mptcp-next 0/7] mptcp: refactor active backup Paolo Abeni
                   ` (6 preceding siblings ...)
  2021-06-28 15:54 ` [PATCH mptcp-next 7/7] selftests: mptcp: add testcase for active-backup Paolo Abeni
@ 2021-07-09  1:13 ` Mat Martineau
  7 siblings, 0 replies; 12+ messages in thread
From: Mat Martineau @ 2021-07-09  1:13 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp, fwestpha

On Mon, 28 Jun 2021, Paolo Abeni wrote:

> This series addresses a bunch of issues somewhat related to active
> backup handling. The most visible functional issue addressed here is:
>
> https://github.com/multipath-tcp/mptcp_net-next/issues/191
>
> This series also adds some specific self-tests, covering both
> active-backup switch-over and proper usage of the backup link.
>
> A new netns parameter is introduced:
>
> stale_loss_cnt - the max number of mptcp rtx timeouts with no progress
> 	and outstanding data over a single subflow needed to declare
> 	such subflow 'stale': no more data will be queued on it until
> 	some ack is observed.
>
> This parameter is currently not configurable: I'm undecided whether it
> should live under the pm_netlink APIs or under sysfs. I have a slight
> preference for the latter. Any opinion welcome!

I also lean toward sysfs/sysctl. If anyone has reasons to prefer netlink 
for this setting I'd like to hear them!

>
> This has been only lightly (but painfully) tested; my proposal is to
> let it stage in the export branch for some time.
>

Yes, I agree on staging in the export branch. I had only a couple of
small code changes to suggest (fixes could easily be squashed). I think
the more important consideration is how noisy the tests would be in CI.
The self-tests did not consistently pass for me; there's more detail
about that in my patch 7/7 reply. I would suggest getting the
mptcp_join.sh tests at least to a place of "intermittent failure"
rather than "rare success" before adding to the export branch.


- Mat


> Paolo Abeni (7):
>  mptcp: more accurate timeout
>  mptcp: less aggressive retransmission strategy
>  mptcp: handle pending data on closed subflow
>  mptcp: faster active backup recovery
>  mptcp: add mibs for stale subflows processing
>  mptcp: backup flag from incoming MPJ ack option
>  selftests: mptcp: add testcase for active-backup
>
> net/mptcp/mib.c                               |   2 +
> net/mptcp/mib.h                               |   2 +
> net/mptcp/options.c                           |   9 +-
> net/mptcp/pm.c                                |  21 ++
> net/mptcp/pm_netlink.c                        |  40 ++++
> net/mptcp/protocol.c                          | 132 ++++++++++---
> net/mptcp/protocol.h                          |  19 +-
> net/mptcp/subflow.c                           |   6 +-
> .../testing/selftests/net/mptcp/mptcp_join.sh | 185 +++++++++++++++---
> 9 files changed, 352 insertions(+), 64 deletions(-)
>
> -- 
> 2.26.3
>
>
>

--
Mat Martineau
Intel


* Re: [PATCH mptcp-next 7/7] selftests: mptcp: add testcase for active-backup
  2021-07-09  0:51   ` Mat Martineau
@ 2021-07-09  7:04     ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2021-07-09  7:04 UTC (permalink / raw)
  To: Mat Martineau; +Cc: mptcp, fwestpha

On Thu, 2021-07-08 at 17:51 -0700, Mat Martineau wrote:
> On Mon, 28 Jun 2021, Paolo Abeni wrote:
> 
> > Add more test cases for the link failure scenario,
> > including recovery from link failure using only
> > backup subflows and bi-directional transfers.
> > 
> > Additionally, explicitly check the stale counters.
> > 
> 
> I get a lot of intermittent failures due to the expected stale count 
> range, like this:
> 
> 15 multiple flows, signal, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
>                                          add[ ok ] - echo  [ ok ]
>                                          stale[fail] got 19 stale[s] 18 recover[s],   expected stale in range [1..5],  stale-recover delta 1
> 
> 02 multiple flows, signal, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
>                                          add[ ok ] - echo  [ ok ]
>                                          stale[fail] got 11 stale[s] 10 recover[s],   expected stale in range [1..5],  stale-recover delta 1
> 
> 02 multiple flows, signal, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
>                                          add[ ok ] - echo  [ ok ]
>                                          stale[fail] got 51 stale[s] 50 recover[s],   expected stale in range [1..5],  stale-recover delta 1
> 
> 
> and
> 
> 
> 18 backup subflow in use, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
>                                          add[ ok ] - echo  [ ok ]
>                                          stale[fail] got 8 stale[s] 6 recover[s],   expected stale in range [2..4],  stale-recover delta 2
> 
> 05 backup subflow in use, bidirectional, link failure syn[ ok ] - synack[ ok ] - ack[ ok ]
>                                          add[ ok ] - echo  [ ok ]
>                                          stale[fail] got 8 stale[s] 6 recover[s],   expected stale in range [2..4],  stale-recover delta 2

Ouch, I thought I had solved these ones :(

I agree on not merging before addressing the above. I can't have a look
before next week.

Cheers,

Paolo


