* [PATCH v2 0/5] mptcp: improve mptcp-level window tracking
@ 2022-04-21 14:20 Paolo Abeni
  2022-04-21 14:20 ` [PATCH v2 1/5] mptcp: really share subflow snd_wnd Paolo Abeni
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Paolo Abeni @ 2022-04-21 14:20 UTC (permalink / raw)
  To: mptcp

I've been chasing bad/unstable performance with multiple subflows
on very high speed links.

It looks like the root cause lies in the current mptcp-level
window handling. There are apparently a few different
sub-issues:

- the rcv_wnd is not effectively shared on the tx side, as each
  subflow takes into account only the value received by the underlying
  TCP connection. This is addressed in patch 1/5

- The mptcp-level offered wnd right edge is currently allowed to shrink.
  Reading RFC 8684, section 3.3.4:

"""
   The receive window is relative to the DATA_ACK.  As in TCP, a
   receiver MUST NOT shrink the right edge of the receive window (i.e.,
   DATA_ACK + receive window).  The receiver will use the data sequence
   number to tell if a packet should be accepted at the connection
   level.
"""

I read the above as: we need to reflect window right-edge tracking
on the wire, see patch 4/5.

- The offered window right edge tracking can happen concurrently on
  multiple subflows, but there is no mutex protection. We need an
  additional atomic operation - still patch 4/5

This series additionally adds a few new MIBs to track all the above
(and to confirm that the suspected races actually take place).

I could not access again the host where the issue was most
noticeable; still, in the current setup the throughput changes from
an unstable [6-18] Gbps to a very stable 19 Gbps.
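To make the quoted RFC rule concrete, here is a toy, single-threaded C model of the "right edge may only grow" constraint. This is only an illustration with made-up names, not the kernel code; in particular it ignores 64-bit sequence wraparound, which the kernel handles with before64()/after64():

```c
#include <stdint.h>

/* Highest right edge (DATA_ACK + window) announced so far. */
static uint64_t rcv_wnd_sent;

/* Return the window to put on the wire for a given DATA_ACK:
 * never announce a right edge below one already advertised. */
uint32_t announced_window(uint64_t data_ack, uint32_t rcv_wnd)
{
	uint64_t new_edge = data_ack + rcv_wnd;

	if (new_edge < rcv_wnd_sent)
		/* would shrink: re-announce the old right edge instead */
		return (uint32_t)(rcv_wnd_sent - data_ack);

	rcv_wnd_sent = new_edge;
	return rcv_wnd;
}
```

E.g. after announcing DATA_ACK=1000 with a 500-byte window (right edge 1500), a later DATA_ACK=1100 with only a 100-byte window would shrink the edge to 1200, so a 400-byte window is announced instead, keeping the edge at 1500.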

 v1 -> v2:
 - pass only the TCP header to tcp_options_write (Mat)
 - fix build issues on some 32 bit arches (intel bot)

RFC -> v1:
 - added patch 3/5 to address Mat's comment, and rebased the
   following patches on top of it - I hope Eric may tolerate that,
   it's more a hope than a guess ;)

Paolo Abeni (5):
  mptcp: really share subflow snd_wnd
  mptcp: add mib for xmit window sharing
  tcp: allow MPTCP to update the announced window.
  mptcp: never shrink offered window
  mptcp: add more offered MIBs counter.

 include/net/mptcp.h   |  2 +-
 net/ipv4/tcp_output.c | 14 ++++++-----
 net/mptcp/mib.c       |  4 +++
 net/mptcp/mib.h       |  6 +++++
 net/mptcp/options.c   | 58 +++++++++++++++++++++++++++++++++++++------
 net/mptcp/protocol.c  | 32 +++++++++++++++---------
 net/mptcp/protocol.h  |  2 +-
 7 files changed, 90 insertions(+), 28 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/5] mptcp: really share subflow snd_wnd
  2022-04-21 14:20 [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Paolo Abeni
@ 2022-04-21 14:20 ` Paolo Abeni
  2022-04-21 14:20 ` [PATCH v2 2/5] mptcp: add mib for xmit window sharing Paolo Abeni
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2022-04-21 14:20 UTC (permalink / raw)
  To: mptcp

As per the RFC, mptcp subflows use a "shared" snd_wnd: the effective
window is the maximum among the current values received on all
subflows. Without this feature, a data transfer using multiple
subflows could block.

Window sharing is currently implemented on the RX side:
__tcp_select_window uses the mptcp-level receive buffer to compute
the announced window.

That is not enough: the TCP stack will stick to the window size
received on the given subflow; we need to propagate the msk window
value to each subflow at xmit time.

Change the packet scheduler to ignore the subflow-level window
and use the msk-level one instead.
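In plain C, the resulting per-subflow clamp amounts to something like the sketch below (a hypothetical standalone helper mirroring the shape of mptcp_check_allowed_size(), not the kernel function itself, and without the fallback and wraparound handling):

```c
#include <stdint.h>

/* Cap the amount of data a subflow may send with the MPTCP-level
 * window: bytes in [data_seq, wnd_end) are still allowed. */
uint32_t mptcp_allowed_size(uint64_t wnd_end, uint64_t data_seq,
			    uint32_t avail_size)
{
	uint64_t mptcp_snd_wnd = wnd_end > data_seq ? wnd_end - data_seq : 0;

	return mptcp_snd_wnd < avail_size ?
	       (uint32_t)mptcp_snd_wnd : avail_size;
}
```

The actual patch additionally lifts the subflow's tcp snd_wnd up to the msk-level value when the former is smaller, so the TCP stack itself stops limiting the transfer.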

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6d59bfdd6bbd..0223a44d6b29 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1141,19 +1141,20 @@ struct mptcp_sendmsg_info {
 	bool data_lock_held;
 };
 
-static int mptcp_check_allowed_size(struct mptcp_sock *msk, u64 data_seq,
-				    int avail_size)
+static int mptcp_check_allowed_size(const struct mptcp_sock *msk, struct sock *ssk,
+				    u64 data_seq, int avail_size)
 {
 	u64 window_end = mptcp_wnd_end(msk);
+	u64 mptcp_snd_wnd;
 
 	if (__mptcp_check_fallback(msk))
 		return avail_size;
 
-	if (!before64(data_seq + avail_size, window_end)) {
-		u64 allowed_size = window_end - data_seq;
+	mptcp_snd_wnd = window_end - data_seq;
+	avail_size = min_t(unsigned int, mptcp_snd_wnd, avail_size);
 
-		return min_t(unsigned int, allowed_size, avail_size);
-	}
+	if (unlikely(tcp_sk(ssk)->snd_wnd < mptcp_snd_wnd))
+		tcp_sk(ssk)->snd_wnd = min_t(u64, U32_MAX, mptcp_snd_wnd);
 
 	return avail_size;
 }
@@ -1305,7 +1306,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 	}
 
 	/* Zero window and all data acked? Probe. */
-	copy = mptcp_check_allowed_size(msk, data_seq, copy);
+	copy = mptcp_check_allowed_size(msk, ssk, data_seq, copy);
 	if (copy == 0) {
 		u64 snd_una = READ_ONCE(msk->snd_una);
 
@@ -1498,11 +1499,16 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 	 * to check that subflow has a non empty cwin.
 	 */
 	ssk = send_info[SSK_MODE_ACTIVE].ssk;
-	if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd)
+	if (!ssk || !sk_stream_memory_free(ssk))
 		return NULL;
 
-	burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd);
+	burst = min_t(int, MPTCP_SEND_BURST_SIZE, mptcp_wnd_end(msk) - msk->snd_nxt);
 	wmem = READ_ONCE(ssk->sk_wmem_queued);
+	if (!burst) {
+		msk->last_snd = NULL;
+		return ssk;
+	}
+
 	subflow = mptcp_subflow_ctx(ssk);
 	subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem +
 					   READ_ONCE(ssk->sk_pacing_rate) * burst,
-- 
2.35.1



* [PATCH v2 2/5] mptcp: add mib for xmit window sharing
  2022-04-21 14:20 [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Paolo Abeni
  2022-04-21 14:20 ` [PATCH v2 1/5] mptcp: really share subflow snd_wnd Paolo Abeni
@ 2022-04-21 14:20 ` Paolo Abeni
  2022-04-21 14:20 ` [PATCH v2 3/5] tcp: allow MPTCP to update the announced window Paolo Abeni
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2022-04-21 14:20 UTC (permalink / raw)
  To: mptcp

Bump a counter when the subflow snd_wnd is overridden by the
msk-level one, for observability's sake.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/mib.c      | 1 +
 net/mptcp/mib.h      | 1 +
 net/mptcp/protocol.c | 4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index d93a8c9996fd..6a6f8151375a 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -56,6 +56,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("RcvPruned", MPTCP_MIB_RCVPRUNED),
 	SNMP_MIB_ITEM("SubflowStale", MPTCP_MIB_SUBFLOWSTALE),
 	SNMP_MIB_ITEM("SubflowRecover", MPTCP_MIB_SUBFLOWRECOVER),
+	SNMP_MIB_ITEM("SndWndShared", MPTCP_MIB_SNDWNDSHARED),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 529d07af9e14..2411510bef66 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -49,6 +49,7 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_RCVPRUNED,		/* Incoming packet dropped due to memory limit */
 	MPTCP_MIB_SUBFLOWSTALE,		/* Subflows entered 'stale' status */
 	MPTCP_MIB_SUBFLOWRECOVER,	/* Subflows returned to active status after being stale */
+	MPTCP_MIB_SNDWNDSHARED,		/* Subflow snd wnd is overridden by msk's one */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0223a44d6b29..098f8f7d6366 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1153,8 +1153,10 @@ static int mptcp_check_allowed_size(const struct mptcp_sock *msk, struct sock *s
 	mptcp_snd_wnd = window_end - data_seq;
 	avail_size = min_t(unsigned int, mptcp_snd_wnd, avail_size);
 
-	if (unlikely(tcp_sk(ssk)->snd_wnd < mptcp_snd_wnd))
+	if (unlikely(tcp_sk(ssk)->snd_wnd < mptcp_snd_wnd)) {
 		tcp_sk(ssk)->snd_wnd = min_t(u64, U32_MAX, mptcp_snd_wnd);
+		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_SNDWNDSHARED);
+	}
 
 	return avail_size;
 }
-- 
2.35.1



* [PATCH v2 3/5] tcp: allow MPTCP to update the announced window.
  2022-04-21 14:20 [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Paolo Abeni
  2022-04-21 14:20 ` [PATCH v2 1/5] mptcp: really share subflow snd_wnd Paolo Abeni
  2022-04-21 14:20 ` [PATCH v2 2/5] mptcp: add mib for xmit window sharing Paolo Abeni
@ 2022-04-21 14:20 ` Paolo Abeni
  2022-04-21 14:20 ` [PATCH v2 4/5] mptcp: never shrink offered window Paolo Abeni
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2022-04-21 14:20 UTC (permalink / raw)
  To: mptcp

The MPTCP RFC requires that the MPTCP-level receive window's
right edge never moves backward. Currently the MPTCP code
enforces such constraint while tracking the right edge, but it
does not reflect it on the wire, as MPTCP lacks a suitable hook
to update the TCP header accordingly.

This change modifies the existing mptcp_write_options() hook,
providing the current packet's TCP header to the MPTCP protocol,
so that the next patch can implement the above-mentioned
constraint.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v1 -> v2:
 - Use a single hdr/opt argument (Mat)
---
 include/net/mptcp.h   |  2 +-
 net/ipv4/tcp_output.c | 14 ++++++++------
 net/mptcp/options.c   |  2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 877077b53200..6b07011c060d 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -125,7 +125,7 @@ bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 			       struct mptcp_out_options *opts);
 bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb);
 
-void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
+void mptcp_write_options(struct tcphdr *th, __be32 *ptr, struct tcp_sock *tp,
 			 struct mptcp_out_options *opts);
 
 void mptcp_diag_fill_info(struct mptcp_sock *msk, struct mptcp_info *info);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c221f3bce975..7d64c3fc5d40 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -444,12 +444,13 @@ struct tcp_out_options {
 	struct mptcp_out_options mptcp;
 };
 
-static void mptcp_options_write(__be32 *ptr, const struct tcp_sock *tp,
+static void mptcp_options_write(struct tcphdr *th, __be32 *ptr,
+				struct tcp_sock *tp,
 				struct tcp_out_options *opts)
 {
 #if IS_ENABLED(CONFIG_MPTCP)
 	if (unlikely(OPTION_MPTCP & opts->options))
-		mptcp_write_options(ptr, tp, &opts->mptcp);
+		mptcp_write_options(th, ptr, tp, &opts->mptcp);
 #endif
 }
 
@@ -605,9 +606,10 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
  * At least SACK_PERM as the first option is known to lead to a disaster
  * (but it may well be that other scenarios fail similarly).
  */
-static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
 			      struct tcp_out_options *opts)
 {
+	__be32 *ptr = (__be32 *)(th + 1);
 	u16 options = opts->options;	/* mungable copy */
 
 	if (unlikely(OPTION_MD5 & options)) {
@@ -701,7 +703,7 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 
 	smc_options_write(ptr, &options);
 
-	mptcp_options_write(ptr, tp, opts);
+	mptcp_options_write(th, ptr, tp, opts);
 }
 
 static void smc_set_option(const struct tcp_sock *tp,
@@ -1354,7 +1356,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 		th->window	= htons(min(tp->rcv_wnd, 65535U));
 	}
 
-	tcp_options_write((__be32 *)(th + 1), tp, &opts);
+	tcp_options_write(th, tp, &opts);
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Calculate the MD5 hash, as we have all we need now */
@@ -3590,7 +3592,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 
 	/* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
 	th->window = htons(min(req->rsk_rcv_wnd, 65535U));
-	tcp_options_write((__be32 *)(th + 1), NULL, &opts);
+	tcp_options_write(th, NULL, &opts);
 	th->doff = (tcp_header_size >> 2);
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index e05d9458a025..2570911735ab 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1265,7 +1265,7 @@ static u16 mptcp_make_csum(const struct mptcp_ext *mpext)
 				 ~csum_unfold(mpext->csum));
 }
 
-void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
+void mptcp_write_options(struct tcphdr *th, __be32 *ptr, struct tcp_sock *tp,
 			 struct mptcp_out_options *opts)
 {
 	const struct sock *ssk = (const struct sock *)tp;
-- 
2.35.1



* [PATCH v2 4/5] mptcp: never shrink offered window
  2022-04-21 14:20 [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Paolo Abeni
                   ` (2 preceding siblings ...)
  2022-04-21 14:20 ` [PATCH v2 3/5] tcp: allow MPTCP to update the announced window Paolo Abeni
@ 2022-04-21 14:20 ` Paolo Abeni
  2022-04-21 14:20 ` [PATCH v2 5/5] mptcp: add more offered MIBs counter Paolo Abeni
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2022-04-21 14:20 UTC (permalink / raw)
  To: mptcp

As per the RFC, the offered MPTCP-level window should never shrink.
While we currently track the right edge, we don't enforce the
above constraint on the wire.
Additionally, concurrent xmit on different subflows can end up
producing erroneous right-edge updates.

Address the above by explicitly updating the announced window and
protecting the update with an additional atomic operation (sic)
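The "additional atomic operation" is a lock-free monotonic-max update: a cmpxchg loop that only ever moves the stored right edge forward, no matter how many subflows race on it. A C11 userspace sketch of the idea (illustrative only; the kernel version uses atomic64_cmpxchg and wrap-safe after64()/before64() comparisons):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Shared, msk-level: highest right edge announced on any subflow. */
static _Atomic uint64_t rcv_wnd_sent;

/* Try to advance the announced right edge; returns true if this
 * caller won the race and moved it forward. */
bool advance_right_edge(uint64_t new_edge)
{
	uint64_t old = atomic_load(&rcv_wnd_sent);

	while (new_edge > old) {
		/* On failure 'old' is reloaded with the current value,
		 * so the loop re-checks whether we still advance it. */
		if (atomic_compare_exchange_weak(&rcv_wnd_sent, &old,
						 new_edge))
			return true;
	}
	return false;
}
```

A caller that loses the race (returns false) knows another subflow already announced at least as large a window, and re-announces that larger edge instead.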

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 v1 -> v2:
 - convert rcv_wnd_sent to atomic64: cmpxchg64 is not supported by
   every arch, while atomic64_t has no such issue

RFC -> v1:
 - rebased on previous patch
---
 net/mptcp/options.c  | 52 ++++++++++++++++++++++++++++++++++++++------
 net/mptcp/protocol.c |  8 +++----
 net/mptcp/protocol.h |  2 +-
 3 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 2570911735ab..df58130da4dc 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1224,20 +1224,58 @@ bool mptcp_incoming_options(struct sock *sk, struct sk_buff *skb)
 	return true;
 }
 
-static void mptcp_set_rwin(const struct tcp_sock *tp)
+static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 {
 	const struct sock *ssk = (const struct sock *)tp;
-	const struct mptcp_subflow_context *subflow;
+	struct mptcp_subflow_context *subflow;
+	u64 ack_seq, rcv_wnd_old, rcv_wnd_new;
 	struct mptcp_sock *msk;
-	u64 ack_seq;
+	u32 new_win;
+	u64 win;
 
 	subflow = mptcp_subflow_ctx(ssk);
 	msk = mptcp_sk(subflow->conn);
 
-	ack_seq = READ_ONCE(msk->ack_seq) + tp->rcv_wnd;
+	ack_seq = READ_ONCE(msk->ack_seq);
+	rcv_wnd_new = ack_seq + tp->rcv_wnd;
+
+	rcv_wnd_old = atomic64_read(&msk->rcv_wnd_sent);
+	if (after64(rcv_wnd_new, rcv_wnd_old)) {
+		u64 rcv_wnd;
+
+		for (;;) {
+			rcv_wnd = atomic64_cmpxchg(&msk->rcv_wnd_sent, rcv_wnd_old, rcv_wnd_new);
+
+			if (rcv_wnd == rcv_wnd_old)
+				break;
+			if (before64(rcv_wnd_new, rcv_wnd))
+				goto raise_win;
+			rcv_wnd_old = rcv_wnd;
+		};
+		return;
+	}
+
+	if (rcv_wnd_new != rcv_wnd_old) {
+raise_win:
+		win = rcv_wnd_old - ack_seq;
+		tp->rcv_wnd = min_t(u64, win, U32_MAX);
+		new_win = tp->rcv_wnd;
 
-	if (after64(ack_seq, READ_ONCE(msk->rcv_wnd_sent)))
-		WRITE_ONCE(msk->rcv_wnd_sent, ack_seq);
+		/* Make sure we do not exceed the maximum possible
+		 * scaled window.
+		 */
+		if (unlikely(th->syn))
+			new_win = min(new_win, 65535U) << tp->rx_opt.rcv_wscale;
+		if (!tp->rx_opt.rcv_wscale &&
+		    sock_net(ssk)->ipv4.sysctl_tcp_workaround_signed_windows)
+			new_win = min(new_win, MAX_TCP_WINDOW);
+		else
+			new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale));
+
+		/* RFC1323 scaling applied */
+		new_win >>= tp->rx_opt.rcv_wscale;
+		th->window = htons(new_win);
+	}
 }
 
 u16 __mptcp_make_csum(u64 data_seq, u32 subflow_seq, u16 data_len, __wsum sum)
@@ -1554,7 +1592,7 @@ void mptcp_write_options(struct tcphdr *th, __be32 *ptr, struct tcp_sock *tp,
 	}
 
 	if (tp)
-		mptcp_set_rwin(tp);
+		mptcp_set_rwin(tp, th);
 }
 
 __be32 mptcp_get_reset_option(const struct sk_buff *skb)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 098f8f7d6366..653757ea0aca 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -216,7 +216,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 
 	seq = MPTCP_SKB_CB(skb)->map_seq;
 	end_seq = MPTCP_SKB_CB(skb)->end_seq;
-	max_seq = READ_ONCE(msk->rcv_wnd_sent);
+	max_seq = atomic64_read(&msk->rcv_wnd_sent);
 
 	pr_debug("msk=%p seq=%llx limit=%llx empty=%d", msk, seq, max_seq,
 		 RB_EMPTY_ROOT(&msk->out_of_order_queue));
@@ -225,7 +225,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 		mptcp_drop(sk, skb);
 		pr_debug("oow by %lld, rcv_wnd_sent %llu\n",
 			 (unsigned long long)end_seq - (unsigned long)max_seq,
-			 (unsigned long long)msk->rcv_wnd_sent);
+			 (unsigned long long)atomic64_read(&msk->rcv_wnd_sent));
 		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_NODSSWINDOW);
 		return;
 	}
@@ -3008,7 +3008,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
 		mptcp_crypto_key_sha(msk->remote_key, NULL, &ack_seq);
 		ack_seq++;
 		WRITE_ONCE(msk->ack_seq, ack_seq);
-		WRITE_ONCE(msk->rcv_wnd_sent, ack_seq);
+		atomic64_set(&msk->rcv_wnd_sent, ack_seq);
 	}
 
 #if !IS_ENABLED(CONFIG_KASAN)
@@ -3303,9 +3303,9 @@ void mptcp_finish_connect(struct sock *ssk)
 	WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
 	WRITE_ONCE(msk->snd_nxt, msk->write_seq);
 	WRITE_ONCE(msk->ack_seq, ack_seq);
-	WRITE_ONCE(msk->rcv_wnd_sent, ack_seq);
 	WRITE_ONCE(msk->can_ack, 1);
 	WRITE_ONCE(msk->snd_una, msk->write_seq);
+	atomic64_set(&msk->rcv_wnd_sent, ack_seq);
 
 	mptcp_pm_new_connection(msk, ssk, 0);
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index f542aeaa5b09..4672901d0dfe 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -257,7 +257,7 @@ struct mptcp_sock {
 	u64		write_seq;
 	u64		snd_nxt;
 	u64		ack_seq;
-	u64		rcv_wnd_sent;
+	atomic64_t	rcv_wnd_sent;
 	u64		rcv_data_fin_seq;
 	int		rmem_fwd_alloc;
 	struct sock	*last_snd;
-- 
2.35.1



* [PATCH v2 5/5] mptcp: add more offered MIBs counter.
  2022-04-21 14:20 [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Paolo Abeni
                   ` (3 preceding siblings ...)
  2022-04-21 14:20 ` [PATCH v2 4/5] mptcp: never shrink offered window Paolo Abeni
@ 2022-04-21 14:20 ` Paolo Abeni
  2022-04-21 16:08   ` mptcp: add more offered MIBs counter.: Tests Results MPTCP CI
  2022-04-22  0:04 ` [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Mat Martineau
  2022-04-22 15:18 ` Matthieu Baerts
  6 siblings, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2022-04-21 14:20 UTC (permalink / raw)
  To: mptcp

Track the exceptional handling of the MPTCP-level offered window
with a few more counters, for observability's sake.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/mib.c     | 3 +++
 net/mptcp/mib.h     | 5 +++++
 net/mptcp/options.c | 6 +++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index 6a6f8151375a..0dac2863c6e1 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -57,6 +57,9 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("SubflowStale", MPTCP_MIB_SUBFLOWSTALE),
 	SNMP_MIB_ITEM("SubflowRecover", MPTCP_MIB_SUBFLOWRECOVER),
 	SNMP_MIB_ITEM("SndWndShared", MPTCP_MIB_SNDWNDSHARED),
+	SNMP_MIB_ITEM("RcvWndShared", MPTCP_MIB_RCVWNDSHARED),
+	SNMP_MIB_ITEM("RcvWndConflictUpdate", MPTCP_MIB_RCVWNDCONFLICTUPDATE),
+	SNMP_MIB_ITEM("RcvWndConflict", MPTCP_MIB_RCVWNDCONFLICT),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 2411510bef66..2be3596374f4 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -50,6 +50,11 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_SUBFLOWSTALE,		/* Subflows entered 'stale' status */
 	MPTCP_MIB_SUBFLOWRECOVER,	/* Subflows returned to active status after being stale */
 	MPTCP_MIB_SNDWNDSHARED,		/* Subflow snd wnd is overridden by msk's one */
+	MPTCP_MIB_RCVWNDSHARED,		/* Subflow rcv wnd is overridden by msk's one */
+	MPTCP_MIB_RCVWNDCONFLICTUPDATE,	/* subflow rcv wnd is overridden by msk's one due to
+					 * conflict with another subflow while updating msk rcv wnd
+					 */
+	MPTCP_MIB_RCVWNDCONFLICT,	/* Conflict while updating msk rcv wnd */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index df58130da4dc..5e337f5303cc 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1248,8 +1248,11 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 
 			if (rcv_wnd == rcv_wnd_old)
 				break;
-			if (before64(rcv_wnd_new, rcv_wnd))
+			if (before64(rcv_wnd_new, rcv_wnd)) {
+				MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDCONFLICTUPDATE);
 				goto raise_win;
+			}
+			MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDCONFLICT);
 			rcv_wnd_old = rcv_wnd;
 		};
 		return;
@@ -1275,6 +1278,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 		/* RFC1323 scaling applied */
 		new_win >>= tp->rx_opt.rcv_wscale;
 		th->window = htons(new_win);
+		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDSHARED);
 	}
 }
 
-- 
2.35.1



* Re: mptcp: add more offered MIBs counter.: Tests Results
  2022-04-21 14:20 ` [PATCH v2 5/5] mptcp: add more offered MIBs counter Paolo Abeni
@ 2022-04-21 16:08   ` MPTCP CI
  0 siblings, 0 replies; 9+ messages in thread
From: MPTCP CI @ 2022-04-21 16:08 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal:
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/5291819050205184
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5291819050205184/summary/summary.txt

- KVM Validation: debug:
  - Unstable: 2 failed test(s): selftest_diag selftest_mptcp_join - Critical: KMemLeak ❌:
  - Task: https://cirrus-ci.com/task/6417718957047808
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6417718957047808/summary/summary.txt

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/d2204656a1fb


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-debug

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (Tessares)


* Re: [PATCH v2 0/5] mptcp: improve mptcp-level window tracking
  2022-04-21 14:20 [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Paolo Abeni
                   ` (4 preceding siblings ...)
  2022-04-21 14:20 ` [PATCH v2 5/5] mptcp: add more offered MIBs counter Paolo Abeni
@ 2022-04-22  0:04 ` Mat Martineau
  2022-04-22 15:18 ` Matthieu Baerts
  6 siblings, 0 replies; 9+ messages in thread
From: Mat Martineau @ 2022-04-22  0:04 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp

On Thu, 21 Apr 2022, Paolo Abeni wrote:

> I've been chasing bad/unstable performance with multiple subflows
> on very high speed links.
>
> [...]
>
> v1 -> v2:
> - pass only the TCP header to tcp_options_write (Mat)
> - fix build issues on some 32 bit arches (intel bot)

v2 looks good for the export branch, thanks Paolo.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>


--
Mat Martineau
Intel


* Re: [PATCH v2 0/5] mptcp: improve mptcp-level window tracking
  2022-04-21 14:20 [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Paolo Abeni
                   ` (5 preceding siblings ...)
  2022-04-22  0:04 ` [PATCH v2 0/5] mptcp: improve mptcp-level window tracking Mat Martineau
@ 2022-04-22 15:18 ` Matthieu Baerts
  6 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts @ 2022-04-22 15:18 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

Hi Paolo, Mat,

On 21/04/2022 16:20, Paolo Abeni wrote:
> I've been chasing bad/unstable performance with multiple subflows
> on very high speed links.
>
> [...]

Thank you for the patches and reviews!

Now in our tree (feat. for net-next) with Mat's RvB tag:

New patches for t/upstream:
- 61ed3e818378: mptcp: really share subflow snd_wnd
- d5b00c55441f: mptcp: add mib for xmit window sharing
- 0b814d52c6bb: tcp: allow MPTCP to update the announced window
- 87ce505746f5: mptcp: never shrink offered window
- dac3ff7c87fe: mptcp: add more offered MIBs counter
- Results: 9709ecfa06e8..5b7d53ca0bd1 (export)

Builds and tests are now in progress:

https://cirrus-ci.com/github/multipath-tcp/mptcp_net-next/export/20220422T151603
https://github.com/multipath-tcp/mptcp_net-next/actions/workflows/build-validation.yml?query=branch:export

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net


