* [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor
@ 2021-09-02 15:52 Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 1/4] tcp: expose the tcp_mark_push() and skb_entail() helpers Paolo Abeni
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: Paolo Abeni @ 2021-09-02 15:52 UTC (permalink / raw)
To: mptcp
Eric wants to revert tcp_tx_skb_cache, but MPTCP relies on it
for skb allocation, so before the revert we need to refactor our
xmit path.
Patch 1 exposes some needed helpers (endorsed by Eric),
patch 2 contains the nasty new code,
patch 3 reverts some core TCP changes that are not needed anymore,
and patch 4 is Eric's revert.
This iteration should hopefully solve the out-of-bounds (OoB) issues I
observed with the previous attempt; the changes are documented in patch 2.
Eric Dumazet (1):
tcp: remove sk_{tr}x_skb_cache
Paolo Abeni (3):
tcp: expose the tcp_mark_push() and skb_entail() helpers
mptcp: stop relying on tcp_tx_skb_cache.
Partially revert "tcp: factor out tcp_build_frag()"
Documentation/networking/ip-sysctl.rst | 8 --
include/net/sock.h | 19 ----
include/net/tcp.h | 4 +-
net/ipv4/af_inet.c | 4 -
net/ipv4/sysctl_net_ipv4.c | 12 --
net/ipv4/tcp.c | 147 +++++++++----------------
net/ipv4/tcp_ipv4.c | 6 -
net/ipv6/tcp_ipv6.c | 6 -
net/mptcp/protocol.c | 132 +++++++++++++---------
9 files changed, 132 insertions(+), 206 deletions(-)
--
2.26.3
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 mptcp-next 1/4] tcp: expose the tcp_mark_push() and skb_entail() helpers
2021-09-02 15:52 [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Paolo Abeni
@ 2021-09-02 15:52 ` Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 2/4] mptcp: stop relying on tcp_tx_skb_cache Paolo Abeni
` (3 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Paolo Abeni @ 2021-09-02 15:52 UTC (permalink / raw)
To: mptcp
They will be used by the next patch.
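For readers following along, the behavior of the two helpers this patch exports can be modeled in plain userspace C. This is an illustrative sketch, not kernel code: the field names mirror struct tcp_sock, but the real tcp_mark_push() also sets TCPHDR_PSH on the skb, and the real forced_push() uses after() to handle sequence-number wraparound, both omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the relevant struct tcp_sock fields. */
struct toy_tp {
	uint32_t write_seq;   /* next sequence number to be written */
	uint32_t pushed_seq;  /* last sequence number marked for push */
	uint32_t max_window;  /* largest window seen from the peer */
};

/* Record the push point (the kernel helper also flags the skb). */
static void toy_mark_push(struct toy_tp *tp)
{
	tp->pushed_seq = tp->write_seq;
}

/* Push when more than half the max window is queued but unpushed. */
static bool toy_forced_push(const struct toy_tp *tp)
{
	return tp->write_seq > tp->pushed_seq + (tp->max_window >> 1);
}
```

With write_seq 1000 ahead of pushed_seq 0 and a max window of 1000, toy_forced_push() reports true; marking the push point resets it.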
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
include/net/tcp.h | 2 ++
net/ipv4/tcp.c | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3166dc15d7d6..dc52ea8adfc7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -581,6 +581,8 @@ __u32 cookie_v6_init_sequence(const struct sk_buff *skb, __u16 *mss);
#endif
/* tcp_output.c */
+void skb_entail(struct sock *sk, struct sk_buff *skb);
+void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb);
void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
int nonagle);
int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8b48df73c85..7a3e632b0048 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -647,7 +647,7 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
}
EXPORT_SYMBOL(tcp_ioctl);
-static inline void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
+void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
{
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
tp->pushed_seq = tp->write_seq;
@@ -658,7 +658,7 @@ static inline bool forced_push(const struct tcp_sock *tp)
return after(tp->write_seq, tp->pushed_seq + (tp->max_window >> 1));
}
-static void skb_entail(struct sock *sk, struct sk_buff *skb)
+void skb_entail(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
--
2.26.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 mptcp-next 2/4] mptcp: stop relying on tcp_tx_skb_cache.
2021-09-02 15:52 [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 1/4] tcp: expose the tcp_mark_push() and skb_entail() helpers Paolo Abeni
@ 2021-09-02 15:52 ` Paolo Abeni
2021-09-03 0:30 ` Mat Martineau
` (2 more replies)
2021-09-02 15:52 ` [PATCH v2 mptcp-next 3/4] Partially revert "tcp: factor out tcp_build_frag()" Paolo Abeni
` (2 subsequent siblings)
4 siblings, 3 replies; 15+ messages in thread
From: Paolo Abeni @ 2021-09-02 15:52 UTC (permalink / raw)
To: mptcp
We want to revert the skb TX cache, but MPTCP is currently
using it unconditionally.
Rework the MPTCP tx code so that tcp_tx_skb_cache is not
needed anymore: do the whole coalescing check, skb allocation
and skb initialization/update inside mptcp_sendmsg_frag(), much
like the current TCP code.
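The coalesce-or-allocate decision described above can be sketched as a standalone model. All names and the frag limit below are illustrative, not the kernel's; the real code additionally consults skb_can_coalesce() against the dfrag page and marks the old skb for push before allocating a new one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_SKB_FRAGS 17	/* illustrative frag-slot limit */

/* Toy view of the write-queue tail skb. */
struct tail_skb {
	int len;		/* bytes already queued in this skb */
	int nr_frags;		/* page fragments attached so far */
	bool can_coalesce;	/* new data contiguous with last frag */
};

/* Returns true when the tail skb can be reused; *copy is capped to
 * the room left toward size_goal, mirroring "copy -= skb->len".
 */
static bool reuse_tail_skb(const struct tail_skb *skb, int size_goal,
			   int *copy)
{
	if (!skb || size_goal <= skb->len)
		return false;	/* no room: allocate a new skb */
	if (!skb->can_coalesce && skb->nr_frags >= TOY_MAX_SKB_FRAGS)
		return false;	/* frag slots exhausted: push + alloc */
	*copy = size_goal - skb->len;
	return true;
}
```

A tail skb with 100 queued bytes and free frag slots is reused for up to size_goal - 100 more bytes; a full or absent tail forces allocation, which is exactly when nr_frags must be re-read from the fresh skb.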
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v1 -> v2:
- hopefully fix the OoB access by fetching nr_frags on newly allocated skbs
---
net/mptcp/protocol.c | 132 +++++++++++++++++++++++++------------------
1 file changed, 77 insertions(+), 55 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index faf6e7000d18..101e61bb2a80 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1224,6 +1224,7 @@ static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp)
if (likely(__mptcp_add_ext(skb, gfp))) {
skb_reserve(skb, MAX_TCP_HEADER);
skb->reserved_tailroom = skb->end - skb->tail;
+ INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
return skb;
}
__kfree_skb(skb);
@@ -1233,31 +1234,23 @@ static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp)
return NULL;
}
-static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp)
+static struct sk_buff *__mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp)
{
struct sk_buff *skb;
- if (ssk->sk_tx_skb_cache) {
- skb = ssk->sk_tx_skb_cache;
- if (unlikely(!skb_ext_find(skb, SKB_EXT_MPTCP) &&
- !__mptcp_add_ext(skb, gfp)))
- return false;
- return true;
- }
-
skb = __mptcp_do_alloc_tx_skb(sk, gfp);
if (!skb)
- return false;
+ return NULL;
if (likely(sk_wmem_schedule(ssk, skb->truesize))) {
- ssk->sk_tx_skb_cache = skb;
- return true;
+ skb_entail(ssk, skb);
+ return skb;
}
kfree_skb(skb);
- return false;
+ return NULL;
}
-static bool mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, bool data_lock_held)
+static struct sk_buff *mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, bool data_lock_held)
{
gfp_t gfp = data_lock_held ? GFP_ATOMIC : sk->sk_allocation;
@@ -1287,23 +1280,29 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
struct mptcp_sendmsg_info *info)
{
u64 data_seq = dfrag->data_seq + info->sent;
+ int offset = dfrag->offset + info->sent;
struct mptcp_sock *msk = mptcp_sk(sk);
bool zero_window_probe = false;
struct mptcp_ext *mpext = NULL;
- struct sk_buff *skb, *tail;
- bool must_collapse = false;
- int size_bias = 0;
- int avail_size;
- size_t ret = 0;
+ bool can_coalesce = false;
+ bool reuse_skb = true;
+ struct sk_buff *skb;
+ size_t copy;
+ int i;
pr_debug("msk=%p ssk=%p sending dfrag at seq=%llu len=%u already sent=%u",
msk, ssk, dfrag->data_seq, dfrag->data_len, info->sent);
+ if (WARN_ON_ONCE(info->sent > info->limit ||
+ info->limit > dfrag->data_len))
+ return 0;
+
/* compute send limit */
info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
- avail_size = info->size_goal;
+ copy = info->size_goal;
+
skb = tcp_write_queue_tail(ssk);
- if (skb) {
+ if (skb && (copy > skb->len)) {
/* Limit the write to the size available in the
* current skb, if any, so that we create at most a new skb.
* Explicitly tells TCP internals to avoid collapsing on later
@@ -1316,53 +1315,76 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
goto alloc_skb;
}
- must_collapse = (info->size_goal - skb->len > 0) &&
- (skb_shinfo(skb)->nr_frags < sysctl_max_skb_frags);
- if (must_collapse) {
- size_bias = skb->len;
- avail_size = info->size_goal - skb->len;
+ i = skb_shinfo(skb)->nr_frags;
+ can_coalesce = skb_can_coalesce(skb, i, dfrag->page, offset);
+ if (!can_coalesce && i >= sysctl_max_skb_frags) {
+ tcp_mark_push(tcp_sk(ssk), skb);
+ goto alloc_skb;
}
- }
+ copy -= skb->len;
+ } else {
alloc_skb:
- if (!must_collapse && !ssk->sk_tx_skb_cache &&
- !mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held))
- return 0;
+ skb = mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held);
+ if (!skb)
+ return -ENOMEM;
+
+ i = skb_shinfo(skb)->nr_frags;
+ reuse_skb = false;
+ mpext = skb_ext_find(skb, SKB_EXT_MPTCP);
+ }
/* Zero window and all data acked? Probe. */
- avail_size = mptcp_check_allowed_size(msk, data_seq, avail_size);
- if (avail_size == 0) {
+ copy = mptcp_check_allowed_size(msk, data_seq, copy);
+ if (copy == 0) {
u64 snd_una = READ_ONCE(msk->snd_una);
- if (skb || snd_una != msk->snd_nxt)
+ if (skb || snd_una != msk->snd_nxt) {
+ tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
return 0;
+ }
+
zero_window_probe = true;
data_seq = snd_una - 1;
- avail_size = 1;
- }
+ copy = 1;
- if (WARN_ON_ONCE(info->sent > info->limit ||
- info->limit > dfrag->data_len))
- return 0;
+ /* all mptcp-level data is acked, no skbs should be present into the
+ * ssk write queue
+ */
+ WARN_ON_ONCE(reuse_skb);
+ }
- ret = info->limit - info->sent;
- tail = tcp_build_frag(ssk, avail_size + size_bias, info->flags,
- dfrag->page, dfrag->offset + info->sent, &ret);
- if (!tail) {
- tcp_remove_empty_skb(sk, tcp_write_queue_tail(ssk));
+ copy = min_t(size_t, copy, info->limit - info->sent);
+ if (!sk_wmem_schedule(ssk, copy)) {
+ tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
return -ENOMEM;
}
- /* if the tail skb is still the cached one, collapsing really happened.
- */
- if (skb == tail) {
- TCP_SKB_CB(tail)->tcp_flags &= ~TCPHDR_PSH;
- mpext->data_len += ret;
+ if (can_coalesce) {
+ skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+ } else {
+ get_page(dfrag->page);
+ skb_fill_page_desc(skb, i, dfrag->page, offset, copy);
+ }
+
+ skb->len += copy;
+ skb->data_len += copy;
+ skb->truesize += copy;
+ sk_wmem_queued_add(ssk, copy);
+ sk_mem_charge(ssk, copy);
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ WRITE_ONCE(tcp_sk(ssk)->write_seq, tcp_sk(ssk)->write_seq + copy);
+ TCP_SKB_CB(skb)->end_seq += copy;
+ tcp_skb_pcount_set(skb, 0);
+
+ /* on skb reuse we just need to update the DSS len */
+ if (reuse_skb) {
+ TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
+ mpext->data_len += copy;
WARN_ON_ONCE(zero_window_probe);
goto out;
}
- mpext = skb_ext_find(tail, SKB_EXT_MPTCP);
if (WARN_ON_ONCE(!mpext)) {
/* should never reach here, stream corrupted */
return -EINVAL;
@@ -1371,7 +1393,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
memset(mpext, 0, sizeof(*mpext));
mpext->data_seq = data_seq;
mpext->subflow_seq = mptcp_subflow_ctx(ssk)->rel_write_seq;
- mpext->data_len = ret;
+ mpext->data_len = copy;
mpext->use_map = 1;
mpext->dsn64 = 1;
@@ -1380,18 +1402,18 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
mpext->dsn64);
if (zero_window_probe) {
- mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
+ mptcp_subflow_ctx(ssk)->rel_write_seq += copy;
mpext->frozen = 1;
if (READ_ONCE(msk->csum_enabled))
- mptcp_update_data_checksum(tail, ret);
+ mptcp_update_data_checksum(skb, copy);
tcp_push_pending_frames(ssk);
return 0;
}
out:
if (READ_ONCE(msk->csum_enabled))
- mptcp_update_data_checksum(tail, ret);
- mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
- return ret;
+ mptcp_update_data_checksum(skb, copy);
+ mptcp_subflow_ctx(ssk)->rel_write_seq += copy;
+ return copy;
}
#define MPTCP_SEND_BURST_SIZE ((1 << 16) - \
--
2.26.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 mptcp-next 3/4] Partially revert "tcp: factor out tcp_build_frag()"
2021-09-02 15:52 [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 1/4] tcp: expose the tcp_mark_push() and skb_entail() helpers Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 2/4] mptcp: stop relying on tcp_tx_skb_cache Paolo Abeni
@ 2021-09-02 15:52 ` Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 4/4] tcp: remove sk_{tr}x_skb_cache Paolo Abeni
2021-09-03 20:11 ` [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Mat Martineau
4 siblings, 0 replies; 15+ messages in thread
From: Paolo Abeni @ 2021-09-02 15:52 UTC (permalink / raw)
To: mptcp
This is a partial revert of commit b796d04bd014 ("tcp:
factor out tcp_build_frag()").
MPTCP was the only user of the tcp_build_frag() helper, and after
the previous patch MPTCP does not use it anymore. Let's
avoid exposing TCP internals.
The revert is partial, as tcp_remove_empty_skb(), exposed
by the same commit, is still required.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
include/net/tcp.h | 2 -
net/ipv4/tcp.c | 117 ++++++++++++++++++++--------------------------
2 files changed, 51 insertions(+), 68 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index dc52ea8adfc7..91f4397c4c08 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -330,8 +330,6 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
int flags);
int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
size_t size, int flags);
-struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
- struct page *page, int offset, size_t *size);
ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
size_t size, int flags);
int tcp_send_mss(struct sock *sk, int *size_goal, int flags);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7a3e632b0048..caf0c50d86bc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -963,68 +963,6 @@ void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
}
}
-struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
- struct page *page, int offset, size_t *size)
-{
- struct sk_buff *skb = tcp_write_queue_tail(sk);
- struct tcp_sock *tp = tcp_sk(sk);
- bool can_coalesce;
- int copy, i;
-
- if (!skb || (copy = size_goal - skb->len) <= 0 ||
- !tcp_skb_can_collapse_to(skb)) {
-new_segment:
- if (!sk_stream_memory_free(sk))
- return NULL;
-
- skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
- tcp_rtx_and_write_queues_empty(sk));
- if (!skb)
- return NULL;
-
-#ifdef CONFIG_TLS_DEVICE
- skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
-#endif
- skb_entail(sk, skb);
- copy = size_goal;
- }
-
- if (copy > *size)
- copy = *size;
-
- i = skb_shinfo(skb)->nr_frags;
- can_coalesce = skb_can_coalesce(skb, i, page, offset);
- if (!can_coalesce && i >= sysctl_max_skb_frags) {
- tcp_mark_push(tp, skb);
- goto new_segment;
- }
- if (!sk_wmem_schedule(sk, copy))
- return NULL;
-
- if (can_coalesce) {
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
- } else {
- get_page(page);
- skb_fill_page_desc(skb, i, page, offset, copy);
- }
-
- if (!(flags & MSG_NO_SHARED_FRAGS))
- skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
-
- skb->len += copy;
- skb->data_len += copy;
- skb->truesize += copy;
- sk_wmem_queued_add(sk, copy);
- sk_mem_charge(sk, copy);
- skb->ip_summed = CHECKSUM_PARTIAL;
- WRITE_ONCE(tp->write_seq, tp->write_seq + copy);
- TCP_SKB_CB(skb)->end_seq += copy;
- tcp_skb_pcount_set(skb, 0);
-
- *size = copy;
- return skb;
-}
-
ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
@@ -1060,13 +998,60 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
goto out_err;
while (size > 0) {
- struct sk_buff *skb;
- size_t copy = size;
+ struct sk_buff *skb = tcp_write_queue_tail(sk);
+ int copy, i;
+ bool can_coalesce;
- skb = tcp_build_frag(sk, size_goal, flags, page, offset, &copy);
- if (!skb)
+ if (!skb || (copy = size_goal - skb->len) <= 0 ||
+ !tcp_skb_can_collapse_to(skb)) {
+new_segment:
+ if (!sk_stream_memory_free(sk))
+ goto wait_for_space;
+
+ skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
+ tcp_rtx_and_write_queues_empty(sk));
+ if (!skb)
+ goto wait_for_space;
+
+#ifdef CONFIG_TLS_DEVICE
+ skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
+#endif
+ skb_entail(sk, skb);
+ copy = size_goal;
+ }
+
+ if (copy > size)
+ copy = size;
+
+ i = skb_shinfo(skb)->nr_frags;
+ can_coalesce = skb_can_coalesce(skb, i, page, offset);
+ if (!can_coalesce && i >= sysctl_max_skb_frags) {
+ tcp_mark_push(tp, skb);
+ goto new_segment;
+ }
+ if (!sk_wmem_schedule(sk, copy))
goto wait_for_space;
+ if (can_coalesce) {
+ skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+ } else {
+ get_page(page);
+ skb_fill_page_desc(skb, i, page, offset, copy);
+ }
+
+ if (!(flags & MSG_NO_SHARED_FRAGS))
+ skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+
+ skb->len += copy;
+ skb->data_len += copy;
+ skb->truesize += copy;
+ sk_wmem_queued_add(sk, copy);
+ sk_mem_charge(sk, copy);
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ WRITE_ONCE(tp->write_seq, tp->write_seq + copy);
+ TCP_SKB_CB(skb)->end_seq += copy;
+ tcp_skb_pcount_set(skb, 0);
+
if (!copied)
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
--
2.26.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 mptcp-next 4/4] tcp: remove sk_{tr}x_skb_cache
2021-09-02 15:52 [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Paolo Abeni
` (2 preceding siblings ...)
2021-09-02 15:52 ` [PATCH v2 mptcp-next 3/4] Partially revert "tcp: factor out tcp_build_frag()" Paolo Abeni
@ 2021-09-02 15:52 ` Paolo Abeni
2021-09-03 20:11 ` [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Mat Martineau
4 siblings, 0 replies; 15+ messages in thread
From: Paolo Abeni @ 2021-09-02 15:52 UTC (permalink / raw)
To: mptcp
From: Eric Dumazet <edumazet@google.com>
This reverts the following patches:
2e05fcae83c41eb2df10558338dc600dc783af47 ("tcp: fix compile error if !CONFIG_SYSCTL")
4f661542a40217713f2cee0bb6678fbb30d9d367 ("tcp: fix zerocopy and notsent_lowat issues")
472c2e07eef045145bc1493cc94a01c87140780a ("tcp: add one skb cache for tx")
8b27dae5a2e89a61c46c6dbc76c040c0e6d0ed4c ("tcp: add one skb cache for rx")
Having a cache of one skb (in each direction) per TCP socket is fragile,
since it can cause a significant increase in memory needs, and it is not
good enough for high-speed flows anyway, where more than one skb
is needed.
We want instead to add a generic infrastructure, with more flexible per-cpu
caches, for alien NUMA nodes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
Documentation/networking/ip-sysctl.rst | 8 --------
include/net/sock.h | 19 -------------------
net/ipv4/af_inet.c | 4 ----
net/ipv4/sysctl_net_ipv4.c | 12 ------------
net/ipv4/tcp.c | 26 --------------------------
net/ipv4/tcp_ipv4.c | 6 ------
net/ipv6/tcp_ipv6.c | 6 ------
7 files changed, 81 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index d91ab28718d4..16b8bf72feaf 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -989,14 +989,6 @@ tcp_challenge_ack_limit - INTEGER
in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
Default: 1000
-tcp_rx_skb_cache - BOOLEAN
- Controls a per TCP socket cache of one skb, that might help
- performance of some workloads. This might be dangerous
- on systems with a lot of TCP sockets, since it increases
- memory usage.
-
- Default: 0 (disabled)
-
UDP variables
=============
diff --git a/include/net/sock.h b/include/net/sock.h
index 66a9a90f9558..708b9de3cdbb 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -262,7 +262,6 @@ struct bpf_local_storage;
* @sk_dst_cache: destination cache
* @sk_dst_pending_confirm: need to confirm neighbour
* @sk_policy: flow policy
- * @sk_rx_skb_cache: cache copy of recently accessed RX skb
* @sk_receive_queue: incoming packets
* @sk_wmem_alloc: transmit queue bytes committed
* @sk_tsq_flags: TCP Small Queues flags
@@ -328,7 +327,6 @@ struct bpf_local_storage;
* @sk_peek_off: current peek_offset value
* @sk_send_head: front of stuff to transmit
* @tcp_rtx_queue: TCP re-transmit queue [union with @sk_send_head]
- * @sk_tx_skb_cache: cache copy of recently accessed TX skb
* @sk_security: used by security modules
* @sk_mark: generic packet mark
* @sk_cgrp_data: cgroup data for this cgroup
@@ -393,7 +391,6 @@ struct sock {
atomic_t sk_drops;
int sk_rcvlowat;
struct sk_buff_head sk_error_queue;
- struct sk_buff *sk_rx_skb_cache;
struct sk_buff_head sk_receive_queue;
/*
* The backlog queue is special, it is always used with
@@ -442,7 +439,6 @@ struct sock {
struct sk_buff *sk_send_head;
struct rb_root tcp_rtx_queue;
};
- struct sk_buff *sk_tx_skb_cache;
struct sk_buff_head sk_write_queue;
__s32 sk_peek_off;
int sk_write_pending;
@@ -1555,18 +1551,10 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
__sk_mem_reclaim(sk, 1 << 20);
}
-DECLARE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
{
sk_wmem_queued_add(sk, -skb->truesize);
sk_mem_uncharge(sk, skb->truesize);
- if (static_branch_unlikely(&tcp_tx_skb_cache_key) &&
- !sk->sk_tx_skb_cache && !skb_cloned(skb)) {
- skb_ext_reset(skb);
- skb_zcopy_clear(skb, true);
- sk->sk_tx_skb_cache = skb;
- return;
- }
__kfree_skb(skb);
}
@@ -2575,7 +2563,6 @@ static inline void skb_setup_tx_timestamp(struct sk_buff *skb, __u16 tsflags)
&skb_shinfo(skb)->tskey);
}
-DECLARE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
/**
* sk_eat_skb - Release a skb if it is no longer needed
* @sk: socket to eat this skb from
@@ -2587,12 +2574,6 @@ DECLARE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
{
__skb_unlink(skb, &sk->sk_receive_queue);
- if (static_branch_unlikely(&tcp_rx_skb_cache_key) &&
- !sk->sk_rx_skb_cache) {
- sk->sk_rx_skb_cache = skb;
- skb_orphan(skb);
- return;
- }
__kfree_skb(skb);
}
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9dc7613e589d..63eda8cb0d26 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -133,10 +133,6 @@ void inet_sock_destruct(struct sock *sk)
struct inet_sock *inet = inet_sk(sk);
__skb_queue_purge(&sk->sk_receive_queue);
- if (sk->sk_rx_skb_cache) {
- __kfree_skb(sk->sk_rx_skb_cache);
- sk->sk_rx_skb_cache = NULL;
- }
__skb_queue_purge(&sk->sk_error_queue);
sk_mem_reclaim(sk);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 6f1e64d49232..6eb43dc91218 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -594,18 +594,6 @@ static struct ctl_table ipv4_table[] = {
.extra1 = &sysctl_fib_sync_mem_min,
.extra2 = &sysctl_fib_sync_mem_max,
},
- {
- .procname = "tcp_rx_skb_cache",
- .data = &tcp_rx_skb_cache_key.key,
- .mode = 0644,
- .proc_handler = proc_do_static_key,
- },
- {
- .procname = "tcp_tx_skb_cache",
- .data = &tcp_tx_skb_cache_key.key,
- .mode = 0644,
- .proc_handler = proc_do_static_key,
- },
{ }
};
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index caf0c50d86bc..cbb0f807be46 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -325,11 +325,6 @@ struct tcp_splice_state {
unsigned long tcp_memory_pressure __read_mostly;
EXPORT_SYMBOL_GPL(tcp_memory_pressure);
-DEFINE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
-EXPORT_SYMBOL(tcp_rx_skb_cache_key);
-
-DEFINE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
-
void tcp_enter_memory_pressure(struct sock *sk)
{
unsigned long val;
@@ -866,18 +861,6 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
{
struct sk_buff *skb;
- if (likely(!size)) {
- skb = sk->sk_tx_skb_cache;
- if (skb) {
- skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
- sk->sk_tx_skb_cache = NULL;
- pskb_trim(skb, 0);
- INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
- skb_shinfo(skb)->tx_flags = 0;
- memset(TCP_SKB_CB(skb), 0, sizeof(struct tcp_skb_cb));
- return skb;
- }
- }
/* The TCP header must be at least 32-bit aligned. */
size = ALIGN(size, 4);
@@ -2905,11 +2888,6 @@ void tcp_write_queue_purge(struct sock *sk)
sk_wmem_free_skb(sk, skb);
}
tcp_rtx_queue_purge(sk);
- skb = sk->sk_tx_skb_cache;
- if (skb) {
- __kfree_skb(skb);
- sk->sk_tx_skb_cache = NULL;
- }
INIT_LIST_HEAD(&tcp_sk(sk)->tsorted_sent_queue);
sk_mem_reclaim(sk);
tcp_clear_all_retrans_hints(tcp_sk(sk));
@@ -2946,10 +2924,6 @@ int tcp_disconnect(struct sock *sk, int flags)
tcp_clear_xmit_timers(sk);
__skb_queue_purge(&sk->sk_receive_queue);
- if (sk->sk_rx_skb_cache) {
- __kfree_skb(sk->sk_rx_skb_cache);
- sk->sk_rx_skb_cache = NULL;
- }
WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
tp->urg_data = 0;
tcp_write_queue_purge(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2e62e0d6373a..29a57bd159f0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1941,7 +1941,6 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
int tcp_v4_rcv(struct sk_buff *skb)
{
struct net *net = dev_net(skb->dev);
- struct sk_buff *skb_to_free;
int sdif = inet_sdif(skb);
int dif = inet_iif(skb);
const struct iphdr *iph;
@@ -2082,17 +2081,12 @@ int tcp_v4_rcv(struct sk_buff *skb)
tcp_segs_in(tcp_sk(sk), skb);
ret = 0;
if (!sock_owned_by_user(sk)) {
- skb_to_free = sk->sk_rx_skb_cache;
- sk->sk_rx_skb_cache = NULL;
ret = tcp_v4_do_rcv(sk, skb);
} else {
if (tcp_add_backlog(sk, skb))
goto discard_and_relse;
- skb_to_free = NULL;
}
bh_unlock_sock(sk);
- if (skb_to_free)
- __kfree_skb(skb_to_free);
put_and_return:
if (refcounted)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0ce52d46e4f8..8cf5ff2e9504 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1618,7 +1618,6 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
{
- struct sk_buff *skb_to_free;
int sdif = inet6_sdif(skb);
int dif = inet6_iif(skb);
const struct tcphdr *th;
@@ -1754,17 +1753,12 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
tcp_segs_in(tcp_sk(sk), skb);
ret = 0;
if (!sock_owned_by_user(sk)) {
- skb_to_free = sk->sk_rx_skb_cache;
- sk->sk_rx_skb_cache = NULL;
ret = tcp_v6_do_rcv(sk, skb);
} else {
if (tcp_add_backlog(sk, skb))
goto discard_and_relse;
- skb_to_free = NULL;
}
bh_unlock_sock(sk);
- if (skb_to_free)
- __kfree_skb(skb_to_free);
put_and_return:
if (refcounted)
sock_put(sk);
--
2.26.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 mptcp-next 2/4] mptcp: stop relying on tcp_tx_skb_cache.
2021-09-02 15:52 ` [PATCH v2 mptcp-next 2/4] mptcp: stop relying on tcp_tx_skb_cache Paolo Abeni
@ 2021-09-03 0:30 ` Mat Martineau
2021-09-03 13:58 ` Paolo Abeni
2021-09-03 11:28 ` Matthieu Baerts
2021-09-04 8:00 ` Matthieu Baerts
2 siblings, 1 reply; 15+ messages in thread
From: Mat Martineau @ 2021-09-03 0:30 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
On Thu, 2 Sep 2021, Paolo Abeni wrote:
> We want to revert the skb TX cache, but MPTCP is currently
> using it unconditionally.
>
> Rework the MPTCP tx code, so that tcp_tx_skb_cache is not
> needed anymore: do the whole coalescing check, skb allocation
> skb initialization/update inside mptcp_sendmsg_frag(), quite
> alike the current TCP code.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v1 -> v2:
> - hopefully fix OoB, fetching nr_frags on new skbs
> ---
> net/mptcp/protocol.c | 132 +++++++++++++++++++++++++------------------
> 1 file changed, 77 insertions(+), 55 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index faf6e7000d18..101e61bb2a80 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1224,6 +1224,7 @@ static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp)
> if (likely(__mptcp_add_ext(skb, gfp))) {
> skb_reserve(skb, MAX_TCP_HEADER);
> skb->reserved_tailroom = skb->end - skb->tail;
> + INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
Is this related to tx_skb_cache? Looks like it could be a -net fix.
> return skb;
> }
> __kfree_skb(skb);
> @@ -1233,31 +1234,23 @@ static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp)
> return NULL;
> }
>
> -static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp)
> +static struct sk_buff *__mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp)
> {
> struct sk_buff *skb;
>
> - if (ssk->sk_tx_skb_cache) {
> - skb = ssk->sk_tx_skb_cache;
> - if (unlikely(!skb_ext_find(skb, SKB_EXT_MPTCP) &&
> - !__mptcp_add_ext(skb, gfp)))
> - return false;
> - return true;
> - }
> -
> skb = __mptcp_do_alloc_tx_skb(sk, gfp);
> if (!skb)
> - return false;
> + return NULL;
>
> if (likely(sk_wmem_schedule(ssk, skb->truesize))) {
> - ssk->sk_tx_skb_cache = skb;
> - return true;
> + skb_entail(ssk, skb);
> + return skb;
> }
> kfree_skb(skb);
> - return false;
> + return NULL;
> }
>
> -static bool mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, bool data_lock_held)
> +static struct sk_buff *mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, bool data_lock_held)
> {
> gfp_t gfp = data_lock_held ? GFP_ATOMIC : sk->sk_allocation;
>
> @@ -1287,23 +1280,29 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> struct mptcp_sendmsg_info *info)
> {
> u64 data_seq = dfrag->data_seq + info->sent;
> + int offset = dfrag->offset + info->sent;
> struct mptcp_sock *msk = mptcp_sk(sk);
> bool zero_window_probe = false;
> struct mptcp_ext *mpext = NULL;
> - struct sk_buff *skb, *tail;
> - bool must_collapse = false;
> - int size_bias = 0;
> - int avail_size;
> - size_t ret = 0;
> + bool can_coalesce = false;
> + bool reuse_skb = true;
> + struct sk_buff *skb;
> + size_t copy;
> + int i;
>
> pr_debug("msk=%p ssk=%p sending dfrag at seq=%llu len=%u already sent=%u",
> msk, ssk, dfrag->data_seq, dfrag->data_len, info->sent);
>
> + if (WARN_ON_ONCE(info->sent > info->limit ||
> + info->limit > dfrag->data_len))
> + return 0;
> +
> /* compute send limit */
> info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
> - avail_size = info->size_goal;
> + copy = info->size_goal;
> +
> skb = tcp_write_queue_tail(ssk);
> - if (skb) {
> + if (skb && (copy > skb->len)) {
> /* Limit the write to the size available in the
> * current skb, if any, so that we create at most a new skb.
> * Explicitly tells TCP internals to avoid collapsing on later
> @@ -1316,53 +1315,76 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> goto alloc_skb;
> }
>
> - must_collapse = (info->size_goal - skb->len > 0) &&
> - (skb_shinfo(skb)->nr_frags < sysctl_max_skb_frags);
> - if (must_collapse) {
> - size_bias = skb->len;
> - avail_size = info->size_goal - skb->len;
> + i = skb_shinfo(skb)->nr_frags;
> + can_coalesce = skb_can_coalesce(skb, i, dfrag->page, offset);
> + if (!can_coalesce && i >= sysctl_max_skb_frags) {
> + tcp_mark_push(tcp_sk(ssk), skb);
> + goto alloc_skb;
> }
> - }
>
> + copy -= skb->len;
> + } else {
> alloc_skb:
> - if (!must_collapse && !ssk->sk_tx_skb_cache &&
> - !mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held))
> - return 0;
> + skb = mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held);
> + if (!skb)
> + return -ENOMEM;
> +
> + i = skb_shinfo(skb)->nr_frags;
> + reuse_skb = false;
> + mpext = skb_ext_find(skb, SKB_EXT_MPTCP);
> + }
>
> /* Zero window and all data acked? Probe. */
> - avail_size = mptcp_check_allowed_size(msk, data_seq, avail_size);
> - if (avail_size == 0) {
> + copy = mptcp_check_allowed_size(msk, data_seq, copy);
> + if (copy == 0) {
> u64 snd_una = READ_ONCE(msk->snd_una);
>
> - if (skb || snd_una != msk->snd_nxt)
> + if (skb || snd_una != msk->snd_nxt) {
> + tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
> return 0;
> + }
> +
> zero_window_probe = true;
> data_seq = snd_una - 1;
> - avail_size = 1;
> - }
> + copy = 1;
>
> - if (WARN_ON_ONCE(info->sent > info->limit ||
> - info->limit > dfrag->data_len))
> - return 0;
> + /* all mptcp-level data is acked, no skbs should be present into the
> + * ssk write queue
> + */
> + WARN_ON_ONCE(reuse_skb);
> + }
>
> - ret = info->limit - info->sent;
> - tail = tcp_build_frag(ssk, avail_size + size_bias, info->flags,
> - dfrag->page, dfrag->offset + info->sent, &ret);
> - if (!tail) {
> - tcp_remove_empty_skb(sk, tcp_write_queue_tail(ssk));
> + copy = min_t(size_t, copy, info->limit - info->sent);
> + if (!sk_wmem_schedule(ssk, copy)) {
> + tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
> return -ENOMEM;
> }
>
> - /* if the tail skb is still the cached one, collapsing really happened.
> - */
> - if (skb == tail) {
> - TCP_SKB_CB(tail)->tcp_flags &= ~TCPHDR_PSH;
> - mpext->data_len += ret;
> + if (can_coalesce) {
> + skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> + } else {
> + get_page(dfrag->page);
> + skb_fill_page_desc(skb, i, dfrag->page, offset, copy);
> + }
> +
> + skb->len += copy;
> + skb->data_len += copy;
> + skb->truesize += copy;
> + sk_wmem_queued_add(ssk, copy);
> + sk_mem_charge(ssk, copy);
> + skb->ip_summed = CHECKSUM_PARTIAL;
> + WRITE_ONCE(tcp_sk(ssk)->write_seq, tcp_sk(ssk)->write_seq + copy);
> + TCP_SKB_CB(skb)->end_seq += copy;
> + tcp_skb_pcount_set(skb, 0);
> +
> + /* on skb reuse we just need to update the DSS len */
> + if (reuse_skb) {
> + TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
> + mpext->data_len += copy;
Should have the WARN_ON_ONCE(!mpext) before this block - while it
shouldn't happen, it's slightly less impossible in the reuse_skb case.
I'll start some tests running.
Mat
> WARN_ON_ONCE(zero_window_probe);
> goto out;
> }
>
> - mpext = skb_ext_find(tail, SKB_EXT_MPTCP);
> if (WARN_ON_ONCE(!mpext)) {
> /* should never reach here, stream corrupted */
> return -EINVAL;
> @@ -1371,7 +1393,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> memset(mpext, 0, sizeof(*mpext));
> mpext->data_seq = data_seq;
> mpext->subflow_seq = mptcp_subflow_ctx(ssk)->rel_write_seq;
> - mpext->data_len = ret;
> + mpext->data_len = copy;
> mpext->use_map = 1;
> mpext->dsn64 = 1;
>
> @@ -1380,18 +1402,18 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> mpext->dsn64);
>
> if (zero_window_probe) {
> - mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
> + mptcp_subflow_ctx(ssk)->rel_write_seq += copy;
> mpext->frozen = 1;
> if (READ_ONCE(msk->csum_enabled))
> - mptcp_update_data_checksum(tail, ret);
> + mptcp_update_data_checksum(skb, copy);
> tcp_push_pending_frames(ssk);
> return 0;
> }
> out:
> if (READ_ONCE(msk->csum_enabled))
> - mptcp_update_data_checksum(tail, ret);
> - mptcp_subflow_ctx(ssk)->rel_write_seq += ret;
> - return ret;
> + mptcp_update_data_checksum(skb, copy);
> + mptcp_subflow_ctx(ssk)->rel_write_seq += copy;
> + return copy;
> }
>
> #define MPTCP_SEND_BURST_SIZE ((1 << 16) - \
> --
> 2.26.3
>
>
>
--
Mat Martineau
Intel
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache.
2021-09-02 15:52 ` [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache Paolo Abeni
2021-09-03 0:30 ` Mat Martineau
@ 2021-09-03 11:28 ` Matthieu Baerts
2021-09-03 17:18 ` Paolo Abeni
2021-09-04 8:00 ` Matthieu Baerts
2 siblings, 1 reply; 15+ messages in thread
From: Matthieu Baerts @ 2021-09-03 11:28 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
Hi Paolo,
On 02/09/2021 17:52, Paolo Abeni wrote:
> We want to revert the skb TX cache, but MPTCP is currently
> using it unconditionally.
One small detail if you have to resend a new version: for the title,
s/relaying/relying/
mptcp: stop relying on tcp_tx_skb_cache
Cheers,
Matt
--
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net
* Re: [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache.
2021-09-03 0:30 ` Mat Martineau
@ 2021-09-03 13:58 ` Paolo Abeni
2021-09-03 17:07 ` Mat Martineau
0 siblings, 1 reply; 15+ messages in thread
From: Paolo Abeni @ 2021-09-03 13:58 UTC (permalink / raw)
To: Mat Martineau; +Cc: mptcp
On Thu, 2021-09-02 at 17:30 -0700, Mat Martineau wrote:
> On Thu, 2 Sep 2021, Paolo Abeni wrote:
>
> > We want to revert the skb TX cache, but MPTCP is currently
> > using it unconditionally.
> >
> > Rework the MPTCP tx code, so that tcp_tx_skb_cache is not
> > needed anymore: do the whole coalescing check, skb allocation,
> > skb initialization/update inside mptcp_sendmsg_frag(), quite
> > alike the current TCP code.
> >
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > v1 -> v2:
> > - hopefully fix OoB, fetching nr_frags on new skbs
> > ---
> > net/mptcp/protocol.c | 132 +++++++++++++++++++++++++------------------
> > 1 file changed, 77 insertions(+), 55 deletions(-)
> >
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index faf6e7000d18..101e61bb2a80 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -1224,6 +1224,7 @@ static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp)
> > if (likely(__mptcp_add_ext(skb, gfp))) {
> > skb_reserve(skb, MAX_TCP_HEADER);
> > skb->reserved_tailroom = skb->end - skb->tail;
> > + INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
>
> Is this related to tx_skb_cache? Looks like it could be a -net fix.
Eheh you nicely noticed the relevant details as usual ;)
No, this is not needed for -net. Before this patch, tcp_tsorted_anchor
initialization was performed by sk_stream_alloc_skb(), invoked by
tcp_build_frag(), which we no longer reach.
>
> > return skb;
> > }
> > __kfree_skb(skb);
> > @@ -1233,31 +1234,23 @@ static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp)
> > return NULL;
> > }
> >
> > -static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp)
> > +static struct sk_buff *__mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp)
> > {
> > struct sk_buff *skb;
> >
> > - if (ssk->sk_tx_skb_cache) {
> > - skb = ssk->sk_tx_skb_cache;
> > - if (unlikely(!skb_ext_find(skb, SKB_EXT_MPTCP) &&
> > - !__mptcp_add_ext(skb, gfp)))
> > - return false;
> > - return true;
> > - }
> > -
> > skb = __mptcp_do_alloc_tx_skb(sk, gfp);
> > if (!skb)
> > - return false;
> > + return NULL;
> >
> > if (likely(sk_wmem_schedule(ssk, skb->truesize))) {
> > - ssk->sk_tx_skb_cache = skb;
> > - return true;
> > + skb_entail(ssk, skb);
> > + return skb;
> > }
> > kfree_skb(skb);
> > - return false;
> > + return NULL;
> > }
> >
> > -static bool mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, bool data_lock_held)
> > +static struct sk_buff *mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, bool data_lock_held)
> > {
> > gfp_t gfp = data_lock_held ? GFP_ATOMIC : sk->sk_allocation;
> >
> > @@ -1287,23 +1280,29 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> > struct mptcp_sendmsg_info *info)
> > {
> > u64 data_seq = dfrag->data_seq + info->sent;
> > + int offset = dfrag->offset + info->sent;
> > struct mptcp_sock *msk = mptcp_sk(sk);
> > bool zero_window_probe = false;
> > struct mptcp_ext *mpext = NULL;
> > - struct sk_buff *skb, *tail;
> > - bool must_collapse = false;
> > - int size_bias = 0;
> > - int avail_size;
> > - size_t ret = 0;
> > + bool can_coalesce = false;
> > + bool reuse_skb = true;
> > + struct sk_buff *skb;
> > + size_t copy;
> > + int i;
> >
> > pr_debug("msk=%p ssk=%p sending dfrag at seq=%llu len=%u already sent=%u",
> > msk, ssk, dfrag->data_seq, dfrag->data_len, info->sent);
> >
> > + if (WARN_ON_ONCE(info->sent > info->limit ||
> > + info->limit > dfrag->data_len))
> > + return 0;
> > +
> > /* compute send limit */
> > info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
> > - avail_size = info->size_goal;
> > + copy = info->size_goal;
> > +
> > skb = tcp_write_queue_tail(ssk);
> > - if (skb) {
> > + if (skb && (copy > skb->len)) {
> > /* Limit the write to the size available in the
> > * current skb, if any, so that we create at most a new skb.
> > * Explicitly tells TCP internals to avoid collapsing on later
> > @@ -1316,53 +1315,76 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> > goto alloc_skb;
> > }
> >
> > - must_collapse = (info->size_goal - skb->len > 0) &&
> > - (skb_shinfo(skb)->nr_frags < sysctl_max_skb_frags);
> > - if (must_collapse) {
> > - size_bias = skb->len;
> > - avail_size = info->size_goal - skb->len;
> > + i = skb_shinfo(skb)->nr_frags;
> > + can_coalesce = skb_can_coalesce(skb, i, dfrag->page, offset);
> > + if (!can_coalesce && i >= sysctl_max_skb_frags) {
> > + tcp_mark_push(tcp_sk(ssk), skb);
> > + goto alloc_skb;
> > }
> > - }
> >
> > + copy -= skb->len;
> > + } else {
> > alloc_skb:
> > - if (!must_collapse && !ssk->sk_tx_skb_cache &&
> > - !mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held))
> > - return 0;
> > + skb = mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held);
> > + if (!skb)
> > + return -ENOMEM;
> > +
> > + i = skb_shinfo(skb)->nr_frags;
> > + reuse_skb = false;
> > + mpext = skb_ext_find(skb, SKB_EXT_MPTCP);
> > + }
> >
> > /* Zero window and all data acked? Probe. */
> > - avail_size = mptcp_check_allowed_size(msk, data_seq, avail_size);
> > - if (avail_size == 0) {
> > + copy = mptcp_check_allowed_size(msk, data_seq, copy);
> > + if (copy == 0) {
> > u64 snd_una = READ_ONCE(msk->snd_una);
> >
> > - if (skb || snd_una != msk->snd_nxt)
> > + if (skb || snd_una != msk->snd_nxt) {
> > + tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
> > return 0;
> > + }
> > +
> > zero_window_probe = true;
> > data_seq = snd_una - 1;
> > - avail_size = 1;
> > - }
> > + copy = 1;
> >
> > - if (WARN_ON_ONCE(info->sent > info->limit ||
> > - info->limit > dfrag->data_len))
> > - return 0;
> > + /* all mptcp-level data is acked, no skbs should be present into the
> > + * ssk write queue
> > + */
> > + WARN_ON_ONCE(reuse_skb);
> > + }
> >
> > - ret = info->limit - info->sent;
> > - tail = tcp_build_frag(ssk, avail_size + size_bias, info->flags,
> > - dfrag->page, dfrag->offset + info->sent, &ret);
> > - if (!tail) {
> > - tcp_remove_empty_skb(sk, tcp_write_queue_tail(ssk));
> > + copy = min_t(size_t, copy, info->limit - info->sent);
> > + if (!sk_wmem_schedule(ssk, copy)) {
> > + tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
> > return -ENOMEM;
> > }
> >
> > - /* if the tail skb is still the cached one, collapsing really happened.
> > - */
> > - if (skb == tail) {
> > - TCP_SKB_CB(tail)->tcp_flags &= ~TCPHDR_PSH;
> > - mpext->data_len += ret;
> > + if (can_coalesce) {
> > + skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> > + } else {
> > + get_page(dfrag->page);
> > + skb_fill_page_desc(skb, i, dfrag->page, offset, copy);
> > + }
> > +
> > + skb->len += copy;
> > + skb->data_len += copy;
> > + skb->truesize += copy;
> > + sk_wmem_queued_add(ssk, copy);
> > + sk_mem_charge(ssk, copy);
> > + skb->ip_summed = CHECKSUM_PARTIAL;
> > + WRITE_ONCE(tcp_sk(ssk)->write_seq, tcp_sk(ssk)->write_seq + copy);
> > + TCP_SKB_CB(skb)->end_seq += copy;
> > + tcp_skb_pcount_set(skb, 0);
> > +
> > + /* on skb reuse we just need to update the DSS len */
> > + if (reuse_skb) {
> > + TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
> > + mpext->data_len += copy;
>
> Should have the WARN_ON_ONCE(!mpext) before this block - while it
> shouldn't happen, it's slightly less impossible in the reuse_skb case.
I wondered about it a bit, and I *think* it is not needed: we reuse the
skb only if mpext != NULL, see mptcp_skb_can_collapse_to().
> I'll start some tests running.
Thanks!
Paolo
* Re: [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache.
2021-09-03 13:58 ` Paolo Abeni
@ 2021-09-03 17:07 ` Mat Martineau
0 siblings, 0 replies; 15+ messages in thread
From: Mat Martineau @ 2021-09-03 17:07 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
On Fri, 3 Sep 2021, Paolo Abeni wrote:
> On Thu, 2021-09-02 at 17:30 -0700, Mat Martineau wrote:
>> On Thu, 2 Sep 2021, Paolo Abeni wrote:
>>
>>> We want to revert the skb TX cache, but MPTCP is currently
>>> using it unconditionally.
>>>
>>> Rework the MPTCP tx code, so that tcp_tx_skb_cache is not
>>> needed anymore: do the whole coalescing check, skb allocation,
>>> skb initialization/update inside mptcp_sendmsg_frag(), quite
>>> alike the current TCP code.
>>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>> v1 -> v2:
>>> - hopefully fix OoB, fetching nr_frags on new skbs
>>> ---
>>> net/mptcp/protocol.c | 132 +++++++++++++++++++++++++------------------
>>> 1 file changed, 77 insertions(+), 55 deletions(-)
>>>
>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>> index faf6e7000d18..101e61bb2a80 100644
>>> --- a/net/mptcp/protocol.c
>>> +++ b/net/mptcp/protocol.c
>>> @@ -1224,6 +1224,7 @@ static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp)
>>> if (likely(__mptcp_add_ext(skb, gfp))) {
>>> skb_reserve(skb, MAX_TCP_HEADER);
>>> skb->reserved_tailroom = skb->end - skb->tail;
>>> + INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
>>
>> Is this related to tx_skb_cache? Looks like it could be a -net fix.
>
> Eheh you nicely noticed the relevant details as usual ;)
>
> No, this is not needed for -net. Before this patch, tcp_tsorted_anchor
> initialization was performed by sk_stream_alloc_skb(), invoked by
> tcp_build_frag(), which we no longer reach.
>
>>
>>> return skb;
>>> }
>>> __kfree_skb(skb);
>>> @@ -1233,31 +1234,23 @@ static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp)
>>> return NULL;
>>> }
>>>
>>> -static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp)
>>> +static struct sk_buff *__mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp)
>>> {
>>> struct sk_buff *skb;
>>>
>>> - if (ssk->sk_tx_skb_cache) {
>>> - skb = ssk->sk_tx_skb_cache;
>>> - if (unlikely(!skb_ext_find(skb, SKB_EXT_MPTCP) &&
>>> - !__mptcp_add_ext(skb, gfp)))
>>> - return false;
>>> - return true;
>>> - }
>>> -
>>> skb = __mptcp_do_alloc_tx_skb(sk, gfp);
>>> if (!skb)
>>> - return false;
>>> + return NULL;
>>>
>>> if (likely(sk_wmem_schedule(ssk, skb->truesize))) {
>>> - ssk->sk_tx_skb_cache = skb;
>>> - return true;
>>> + skb_entail(ssk, skb);
>>> + return skb;
>>> }
>>> kfree_skb(skb);
>>> - return false;
>>> + return NULL;
>>> }
>>>
>>> -static bool mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, bool data_lock_held)
>>> +static struct sk_buff *mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, bool data_lock_held)
>>> {
>>> gfp_t gfp = data_lock_held ? GFP_ATOMIC : sk->sk_allocation;
>>>
>>> @@ -1287,23 +1280,29 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
>>> struct mptcp_sendmsg_info *info)
>>> {
>>> u64 data_seq = dfrag->data_seq + info->sent;
>>> + int offset = dfrag->offset + info->sent;
>>> struct mptcp_sock *msk = mptcp_sk(sk);
>>> bool zero_window_probe = false;
>>> struct mptcp_ext *mpext = NULL;
>>> - struct sk_buff *skb, *tail;
>>> - bool must_collapse = false;
>>> - int size_bias = 0;
>>> - int avail_size;
>>> - size_t ret = 0;
>>> + bool can_coalesce = false;
>>> + bool reuse_skb = true;
>>> + struct sk_buff *skb;
>>> + size_t copy;
>>> + int i;
>>>
>>> pr_debug("msk=%p ssk=%p sending dfrag at seq=%llu len=%u already sent=%u",
>>> msk, ssk, dfrag->data_seq, dfrag->data_len, info->sent);
>>>
>>> + if (WARN_ON_ONCE(info->sent > info->limit ||
>>> + info->limit > dfrag->data_len))
>>> + return 0;
>>> +
>>> /* compute send limit */
>>> info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
>>> - avail_size = info->size_goal;
>>> + copy = info->size_goal;
>>> +
>>> skb = tcp_write_queue_tail(ssk);
>>> - if (skb) {
>>> + if (skb && (copy > skb->len)) {
>>> /* Limit the write to the size available in the
>>> * current skb, if any, so that we create at most a new skb.
>>> * Explicitly tells TCP internals to avoid collapsing on later
>>> @@ -1316,53 +1315,76 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
>>> goto alloc_skb;
>>> }
>>>
>>> - must_collapse = (info->size_goal - skb->len > 0) &&
>>> - (skb_shinfo(skb)->nr_frags < sysctl_max_skb_frags);
>>> - if (must_collapse) {
>>> - size_bias = skb->len;
>>> - avail_size = info->size_goal - skb->len;
>>> + i = skb_shinfo(skb)->nr_frags;
>>> + can_coalesce = skb_can_coalesce(skb, i, dfrag->page, offset);
>>> + if (!can_coalesce && i >= sysctl_max_skb_frags) {
>>> + tcp_mark_push(tcp_sk(ssk), skb);
>>> + goto alloc_skb;
>>> }
>>> - }
>>>
>>> + copy -= skb->len;
>>> + } else {
>>> alloc_skb:
>>> - if (!must_collapse && !ssk->sk_tx_skb_cache &&
>>> - !mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held))
>>> - return 0;
>>> + skb = mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held);
>>> + if (!skb)
>>> + return -ENOMEM;
>>> +
>>> + i = skb_shinfo(skb)->nr_frags;
>>> + reuse_skb = false;
>>> + mpext = skb_ext_find(skb, SKB_EXT_MPTCP);
>>> + }
>>>
>>> /* Zero window and all data acked? Probe. */
>>> - avail_size = mptcp_check_allowed_size(msk, data_seq, avail_size);
>>> - if (avail_size == 0) {
>>> + copy = mptcp_check_allowed_size(msk, data_seq, copy);
>>> + if (copy == 0) {
>>> u64 snd_una = READ_ONCE(msk->snd_una);
>>>
>>> - if (skb || snd_una != msk->snd_nxt)
>>> + if (skb || snd_una != msk->snd_nxt) {
>>> + tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
>>> return 0;
>>> + }
>>> +
>>> zero_window_probe = true;
>>> data_seq = snd_una - 1;
>>> - avail_size = 1;
>>> - }
>>> + copy = 1;
>>>
>>> - if (WARN_ON_ONCE(info->sent > info->limit ||
>>> - info->limit > dfrag->data_len))
>>> - return 0;
>>> + /* all mptcp-level data is acked, no skbs should be present into the
>>> + * ssk write queue
>>> + */
>>> + WARN_ON_ONCE(reuse_skb);
>>> + }
>>>
>>> - ret = info->limit - info->sent;
>>> - tail = tcp_build_frag(ssk, avail_size + size_bias, info->flags,
>>> - dfrag->page, dfrag->offset + info->sent, &ret);
>>> - if (!tail) {
>>> - tcp_remove_empty_skb(sk, tcp_write_queue_tail(ssk));
>>> + copy = min_t(size_t, copy, info->limit - info->sent);
>>> + if (!sk_wmem_schedule(ssk, copy)) {
>>> + tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
>>> return -ENOMEM;
>>> }
>>>
>>> - /* if the tail skb is still the cached one, collapsing really happened.
>>> - */
>>> - if (skb == tail) {
>>> - TCP_SKB_CB(tail)->tcp_flags &= ~TCPHDR_PSH;
>>> - mpext->data_len += ret;
>>> + if (can_coalesce) {
>>> + skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
>>> + } else {
>>> + get_page(dfrag->page);
>>> + skb_fill_page_desc(skb, i, dfrag->page, offset, copy);
>>> + }
>>> +
>>> + skb->len += copy;
>>> + skb->data_len += copy;
>>> + skb->truesize += copy;
>>> + sk_wmem_queued_add(ssk, copy);
>>> + sk_mem_charge(ssk, copy);
>>> + skb->ip_summed = CHECKSUM_PARTIAL;
>>> + WRITE_ONCE(tcp_sk(ssk)->write_seq, tcp_sk(ssk)->write_seq + copy);
>>> + TCP_SKB_CB(skb)->end_seq += copy;
>>> + tcp_skb_pcount_set(skb, 0);
>>> +
>>> + /* on skb reuse we just need to update the DSS len */
>>> + if (reuse_skb) {
>>> + TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
>>> + mpext->data_len += copy;
>>
>> Should have the WARN_ON_ONCE(!mpext) before this block - while it
>> shouldn't happen, it's slightly less impossible in the reuse_skb case.
>
> I wondered about it a bit, and I *think* it is not needed: we reuse the
> skb only if mpext != NULL, see mptcp_skb_can_collapse_to().
>
I missed that check; you're right that at this point in the code reuse_skb
== true guarantees that mpext != NULL. Since reuse_skb is only set to
false in a path where mpext has been successfully allocated, it makes
sense to drop the WARN_ON_ONCE().
>> I'll start some tests running.
I didn't see any issues in syzkaller overnight, or in a few selftest runs.
--
Mat Martineau
Intel
* Re: [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache.
2021-09-03 11:28 ` Matthieu Baerts
@ 2021-09-03 17:18 ` Paolo Abeni
2021-09-03 17:47 ` Matthieu Baerts
0 siblings, 1 reply; 15+ messages in thread
From: Paolo Abeni @ 2021-09-03 17:18 UTC (permalink / raw)
To: Matthieu Baerts; +Cc: mptcp
On Fri, 2021-09-03 at 13:28 +0200, Matthieu Baerts wrote:
> Hi Paolo,
>
> On 02/09/2021 17:52, Paolo Abeni wrote:
> > We want to revert the skb TX cache, but MPTCP is currently
> > using it unconditionally.
>
> One small detail if you have to resend a new version: for the title,
> s/relaying/relying/
>
> mptcp: stop relying on tcp_tx_skb_cache
I guess/hope a repost is not needed; would a squash-to patch and/or
direct edit while merging be enough?
Cheers,
Paolo
* Re: [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache.
2021-09-03 17:18 ` Paolo Abeni
@ 2021-09-03 17:47 ` Matthieu Baerts
0 siblings, 0 replies; 15+ messages in thread
From: Matthieu Baerts @ 2021-09-03 17:47 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
3 Sep 2021 19:18:33 Paolo Abeni <pabeni@redhat.com>:
> On Fri, 2021-09-03 at 13:28 +0200, Matthieu Baerts wrote:
>> Hi Paolo,
>>
>> On 02/09/2021 17:52, Paolo Abeni wrote:
>>> We want to revert the skb TX cache, but MPTCP is currently
>>> using it unconditionally.
>>
>> One small detail if you have to resend a new version: for the title,
>> s/relaying/relying/
>>
>> mptcp: stop relying on tcp_tx_skb_cache
>
> I guess/hope a repost is not needed; would a squash-to patch and/or
> direct edit while merging be enough?
Sure I can directly edit the commit title.
I can apply the series if it is OK for Mat.
Cheers,
Matt
--
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net
* Re: [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor
2021-09-02 15:52 [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Paolo Abeni
` (3 preceding siblings ...)
2021-09-02 15:52 ` [PATCH v2 mptcp-next 4/4] tcp: remove sk_{tr}x_skb_cache Paolo Abeni
@ 2021-09-03 20:11 ` Mat Martineau
2021-09-04 8:00 ` Matthieu Baerts
4 siblings, 1 reply; 15+ messages in thread
From: Mat Martineau @ 2021-09-03 20:11 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
On Thu, 2 Sep 2021, Paolo Abeni wrote:
> Eric wants to revert the tcp_tx_skb_cache. MPTCP relies on it
> for skb allocation. Before the revert we need to refactor our
> xmit path.
>
> Patch 1 exposes some needed helpers (endorsed by Eric)
> Patch 2 contains the nasty new code
> Patch 3 revert some core TCP changes not needed anymore
> and patch 4 is Eric's revert.
>
> This iteration should hopefully solve the OoB issues I observed with
> the previous attempt, changes documented in patch 2.
>
> Eric Dumazet (1):
> tcp: remove sk_{tr}x_skb_cache
>
> Paolo Abeni (3):
> tcp: expose the tcp_mark_push() and skb_entail() helpers
> mptcp: stop relaying on tcp_tx_skb_cache.
> Partially revert "tcp: factor out tcp_build_frag()"
>
> Documentation/networking/ip-sysctl.rst | 8 --
> include/net/sock.h | 19 ----
> include/net/tcp.h | 4 +-
> net/ipv4/af_inet.c | 4 -
> net/ipv4/sysctl_net_ipv4.c | 12 --
> net/ipv4/tcp.c | 147 +++++++++----------------
> net/ipv4/tcp_ipv4.c | 6 -
> net/ipv6/tcp_ipv6.c | 6 -
> net/mptcp/protocol.c | 132 +++++++++++++---------
> 9 files changed, 132 insertions(+), 206 deletions(-)
>
> --
> 2.26.3
It's ok to apply the series to the export branch. Before we upstream it,
can Paolo send a squash-to patch for #2 that removes the
WARN_ON_ONCE(!mpext)?
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
--
Mat Martineau
Intel
* Re: [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor
2021-09-03 20:11 ` [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Mat Martineau
@ 2021-09-04 8:00 ` Matthieu Baerts
0 siblings, 0 replies; 15+ messages in thread
From: Matthieu Baerts @ 2021-09-04 8:00 UTC (permalink / raw)
To: Mat Martineau, Paolo Abeni; +Cc: mptcp
Hi Paolo, Mat,
On 03/09/2021 22:11, Mat Martineau wrote:
> On Thu, 2 Sep 2021, Paolo Abeni wrote:
>
>> Eric wants to revert the tcp_tx_skb_cache. MPTCP relies on it
>> for skb allocation. Before the revert we need to refactor our
>> xmit path.
>>
>> Patch 1 exposes some needed helpers (endorsed by Eric)
>> Patch 2 contains the nasty new code
>> Patch 3 revert some core TCP changes not needed anymore
>> and patch 4 is Eric's revert.
>>
>> This iteration should hopefully solve the OoB issues I observed with
>> the previous attempt, changes documented in patch 2.
>>
>> Eric Dumazet (1):
>> tcp: remove sk_{tr}x_skb_cache
>>
>> Paolo Abeni (3):
>> tcp: expose the tcp_mark_push() and skb_entail() helpers
>> mptcp: stop relaying on tcp_tx_skb_cache.
>> Partially revert "tcp: factor out tcp_build_frag()"
>>
>> Documentation/networking/ip-sysctl.rst | 8 --
>> include/net/sock.h | 19 ----
>> include/net/tcp.h | 4 +-
>> net/ipv4/af_inet.c | 4 -
>> net/ipv4/sysctl_net_ipv4.c | 12 --
>> net/ipv4/tcp.c | 147 +++++++++----------------
>> net/ipv4/tcp_ipv4.c | 6 -
>> net/ipv6/tcp_ipv6.c | 6 -
>> net/mptcp/protocol.c | 132 +++++++++++++---------
>> 9 files changed, 132 insertions(+), 206 deletions(-)
>>
>> --
>> 2.26.3
Thank you for the patches and the reviews
> It's ok to apply the series to the export branch. Before we upstream it,
> can Paolo send a squash-to patch for #2 that removes the
> WARN_ON_ONCE(!mpext)?
>
> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
These patches are now in our tree -- features for net-next -- with:
- Mat's RvB tag on patch 2/4
- Mat's Ack on patches 1/4 and 3-4/4
- Paolo's Ack on patch 4/4
- fixed a typo and removed the trailing dot in the title of patch 2/4
- a small checkpatch fix in patch 2/4: Unnecessary parentheses
- no checkpatch fixes in patch 3/4: it brings back old and
"non-compliant" code in TCP
- (I hope it is OK) reduced commit SHA lengths in patch 4/4's message to
fix checkpatch errors.
Commits:
- ef9706ecd17e: tcp: expose the tcp_mark_push() and skb_entail() helpers
- 0ed0c4ad3483: mptcp: stop relying on tcp_tx_skb_cache
- ee51bf92cdcc: Partially revert "tcp: factor out tcp_build_frag()"
- ead8eb1fc223: tcp: remove sk_{tr}x_skb_cache
- Results: 7755d2ada382..d692ad305b2a
WARN_ON_ONCE(!mpext):
- 618f6578f614: mptcp: remove no longer needed check
- Results: d692ad305b2a..7a2d18a25e65
Builds and tests are now in progress:
https://cirrus-ci.com/github/multipath-tcp/mptcp_net-next/export/20210904T080009
https://github.com/multipath-tcp/mptcp_net-next/actions/workflows/build-validation.yml?query=branch:export/20210904T080009
Cheers,
Matt
--
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net
* Re: [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache.
2021-09-02 15:52 ` [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache Paolo Abeni
2021-09-03 0:30 ` Mat Martineau
2021-09-03 11:28 ` Matthieu Baerts
@ 2021-09-04 8:00 ` Matthieu Baerts
2021-09-06 7:10 ` Paolo Abeni
2 siblings, 1 reply; 15+ messages in thread
From: Matthieu Baerts @ 2021-09-04 8:00 UTC (permalink / raw)
To: Paolo Abeni; +Cc: mptcp
Hi Paolo,
On 02/09/2021 17:52, Paolo Abeni wrote:
> We want to revert the skb TX cache, but MPTCP is currently
> using it unconditionally.
>
> Rework the MPTCP tx code, so that tcp_tx_skb_cache is not
> needed anymore: do the whole coalescing check, skb allocation,
> skb initialization/update inside mptcp_sendmsg_frag(), quite
> alike the current TCP code.
(...)
> @@ -1287,23 +1280,29 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> struct mptcp_sendmsg_info *info)
> {
> u64 data_seq = dfrag->data_seq + info->sent;
> + int offset = dfrag->offset + info->sent;
> struct mptcp_sock *msk = mptcp_sk(sk);
> bool zero_window_probe = false;
> struct mptcp_ext *mpext = NULL;
> - struct sk_buff *skb, *tail;
> - bool must_collapse = false;
> - int size_bias = 0;
> - int avail_size;
> - size_t ret = 0;
> + bool can_coalesce = false;
> + bool reuse_skb = true;
> + struct sk_buff *skb;
> + size_t copy;
> + int i;
>
> pr_debug("msk=%p ssk=%p sending dfrag at seq=%llu len=%u already sent=%u",
> msk, ssk, dfrag->data_seq, dfrag->data_len, info->sent);
>
> + if (WARN_ON_ONCE(info->sent > info->limit ||
> + info->limit > dfrag->data_len))
> + return 0;
> +
> /* compute send limit */
> info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
> - avail_size = info->size_goal;
> + copy = info->size_goal;
> +
> skb = tcp_write_queue_tail(ssk);
> - if (skb) {
> + if (skb && (copy > skb->len)) {
> /* Limit the write to the size available in the
> * current skb, if any, so that we create at most a new skb.
> * Explicitly tells TCP internals to avoid collapsing on later
> @@ -1316,53 +1315,76 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> goto alloc_skb;
> }
>
> - must_collapse = (info->size_goal - skb->len > 0) &&
> - (skb_shinfo(skb)->nr_frags < sysctl_max_skb_frags);
> - if (must_collapse) {
> - size_bias = skb->len;
> - avail_size = info->size_goal - skb->len;
> + i = skb_shinfo(skb)->nr_frags;
> + can_coalesce = skb_can_coalesce(skb, i, dfrag->page, offset);
> + if (!can_coalesce && i >= sysctl_max_skb_frags) {
> + tcp_mark_push(tcp_sk(ssk), skb);
> + goto alloc_skb;
> }
> - }
>
> + copy -= skb->len;
> + } else {
> alloc_skb:
> - if (!must_collapse && !ssk->sk_tx_skb_cache &&
> - !mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held))
> - return 0;
> + skb = mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held);
> + if (!skb)
> + return -ENOMEM;
> +
> + i = skb_shinfo(skb)->nr_frags;
> + reuse_skb = false;
> + mpext = skb_ext_find(skb, SKB_EXT_MPTCP);
> + }
>
> /* Zero window and all data acked? Probe. */
> - avail_size = mptcp_check_allowed_size(msk, data_seq, avail_size);
> - if (avail_size == 0) {
> + copy = mptcp_check_allowed_size(msk, data_seq, copy);
> + if (copy == 0) {
> u64 snd_una = READ_ONCE(msk->snd_una);
>
> - if (skb || snd_una != msk->snd_nxt)
> + if (skb || snd_una != msk->snd_nxt) {
If I'm not mistaken, 'skb' will always be != NULL here.
Should we only have these 2 next lines if 'copy == 0'?
> + tcp_remove_empty_skb(ssk, tcp_write_queue_tail(ssk));
> return 0;
> + }
> +
> zero_window_probe = true;
> data_seq = snd_una - 1;
> - avail_size = 1;
> - }
Cheers,
Matt
--
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net
* Re: [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache.
2021-09-04 8:00 ` Matthieu Baerts
@ 2021-09-06 7:10 ` Paolo Abeni
0 siblings, 0 replies; 15+ messages in thread
From: Paolo Abeni @ 2021-09-06 7:10 UTC (permalink / raw)
To: Matthieu Baerts; +Cc: mptcp
On Sat, 2021-09-04 at 10:00 +0200, Matthieu Baerts wrote:
> Hi Paolo,
>
> On 02/09/2021 17:52, Paolo Abeni wrote:
> > We want to revert the skb TX cache, but MPTCP is currently
> > using it unconditionally.
> >
> > Rework the MPTCP tx code, so that tcp_tx_skb_cache is not
> > needed anymore: do the whole coalescing check, skb allocation,
> > skb initialization/update inside mptcp_sendmsg_frag(), quite
> > alike the current TCP code.
>
> (...)
>
> > @@ -1287,23 +1280,29 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> > struct mptcp_sendmsg_info *info)
> > {
> > u64 data_seq = dfrag->data_seq + info->sent;
> > + int offset = dfrag->offset + info->sent;
> > struct mptcp_sock *msk = mptcp_sk(sk);
> > bool zero_window_probe = false;
> > struct mptcp_ext *mpext = NULL;
> > - struct sk_buff *skb, *tail;
> > - bool must_collapse = false;
> > - int size_bias = 0;
> > - int avail_size;
> > - size_t ret = 0;
> > + bool can_coalesce = false;
> > + bool reuse_skb = true;
> > + struct sk_buff *skb;
> > + size_t copy;
> > + int i;
> >
> > pr_debug("msk=%p ssk=%p sending dfrag at seq=%llu len=%u already sent=%u",
> > msk, ssk, dfrag->data_seq, dfrag->data_len, info->sent);
> >
> > + if (WARN_ON_ONCE(info->sent > info->limit ||
> > + info->limit > dfrag->data_len))
> > + return 0;
> > +
> > /* compute send limit */
> > info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags);
> > - avail_size = info->size_goal;
> > + copy = info->size_goal;
> > +
> > skb = tcp_write_queue_tail(ssk);
> > - if (skb) {
> > + if (skb && (copy > skb->len)) {
> > /* Limit the write to the size available in the
> > * current skb, if any, so that we create at most a new skb.
> > * Explicitly tells TCP internals to avoid collapsing on later
> > @@ -1316,53 +1315,76 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
> > goto alloc_skb;
> > }
> >
> > - must_collapse = (info->size_goal - skb->len > 0) &&
> > - (skb_shinfo(skb)->nr_frags < sysctl_max_skb_frags);
> > - if (must_collapse) {
> > - size_bias = skb->len;
> > - avail_size = info->size_goal - skb->len;
> > + i = skb_shinfo(skb)->nr_frags;
> > + can_coalesce = skb_can_coalesce(skb, i, dfrag->page, offset);
> > + if (!can_coalesce && i >= sysctl_max_skb_frags) {
> > + tcp_mark_push(tcp_sk(ssk), skb);
> > + goto alloc_skb;
> > }
> > - }
> >
> > + copy -= skb->len;
> > + } else {
> > alloc_skb:
> > - if (!must_collapse && !ssk->sk_tx_skb_cache &&
> > - !mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held))
> > - return 0;
> > + skb = mptcp_alloc_tx_skb(sk, ssk, info->data_lock_held);
> > + if (!skb)
> > + return -ENOMEM;
> > +
> > + i = skb_shinfo(skb)->nr_frags;
> > + reuse_skb = false;
> > + mpext = skb_ext_find(skb, SKB_EXT_MPTCP);
> > + }
> >
> > /* Zero window and all data acked? Probe. */
> > - avail_size = mptcp_check_allowed_size(msk, data_seq, avail_size);
> > - if (avail_size == 0) {
> > + copy = mptcp_check_allowed_size(msk, data_seq, copy);
> > + if (copy == 0) {
> > u64 snd_una = READ_ONCE(msk->snd_una);
> >
> > - if (skb || snd_una != msk->snd_nxt)
> > + if (skb || snd_una != msk->snd_nxt) {
>
> If I'm not mistaken, 'skb' will always be != NULL here.
>
> Should we only have these 2 next lines if 'copy == 0'?
We still need to check for 'snd_una != msk->snd_nxt'. I'll send a
squash-to patch.
Thanks!
Paolo
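For readers following along, the two decisions debated above can be modeled in plain, userspace C: first the reuse-vs-allocate choice on the tail skb from the quoted hunk, then the zero-window handling that Matthieu and Paolo discuss (probe with one byte only when everything sent so far has been acked). This is a standalone sketch with illustrative names and a hardcoded MAX_SKB_FRAGS, not kernel code, and the probe logic reflects the behavior as described in the thread, not the final squash-to patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SKB_FRAGS 17 /* assumed value; the kernel's is tunable */

/* Simplified stand-in for the tail skb state queried in the hunk:
 * skb->len, skb_shinfo(skb)->nr_frags and skb_can_coalesce(). */
struct tail_state {
	size_t len;
	int nr_frags;
	bool can_coalesce;
};

/* Mirrors the reworked mptcp_sendmsg_frag() tail check: reuse the
 * tail skb when it is still below size_goal and the new page frag
 * either coalesces with the last frag or a frag slot is free;
 * otherwise a fresh skb gets the full size_goal budget. */
static size_t xmit_budget(const struct tail_state *tail, size_t size_goal,
			  bool *reuse)
{
	*reuse = false;
	if (tail && size_goal > tail->len &&
	    (tail->can_coalesce || tail->nr_frags < MAX_SKB_FRAGS)) {
		*reuse = true;
		return size_goal - tail->len; /* fill the tail skb up */
	}
	return size_goal; /* alloc_skb path */
}

/* Zero-window handling: clamp the budget to the announced window;
 * when nothing fits and all in-flight data was acked
 * (snd_una == snd_nxt), send a 1-byte zero-window probe. */
static size_t window_budget(size_t copy, size_t wnd_room,
			    unsigned long long snd_una,
			    unsigned long long snd_nxt)
{
	if (copy > wnd_room)
		copy = wnd_room;
	if (copy == 0 && snd_una == snd_nxt)
		return 1; /* zero-window probe */
	return copy;
}
```

Running the tail check with a 400-byte tail skb against a 1000-byte size_goal yields a 600-byte budget with reuse, while a full frag list that cannot coalesce falls through to the allocation path, matching the `tcp_mark_push()` + `goto alloc_skb` branch in the quoted diff.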
Thread overview: 15+ messages
2021-09-02 15:52 [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 1/4] tcp: expose the tcp_mark_push() and skb_entail() helpers Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 2/4] mptcp: stop relaying on tcp_tx_skb_cache Paolo Abeni
2021-09-03 0:30 ` Mat Martineau
2021-09-03 13:58 ` Paolo Abeni
2021-09-03 17:07 ` Mat Martineau
2021-09-03 11:28 ` Matthieu Baerts
2021-09-03 17:18 ` Paolo Abeni
2021-09-03 17:47 ` Matthieu Baerts
2021-09-04 8:00 ` Matthieu Baerts
2021-09-06 7:10 ` Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 3/4] Partially revert "tcp: factor out tcp_build_frag()" Paolo Abeni
2021-09-02 15:52 ` [PATCH v2 mptcp-next 4/4] tcp: remove sk_{tr}x_skb_cache Paolo Abeni
2021-09-03 20:11 ` [PATCH v2 mptcp-next 0/4] mptcp: just another xmit path refactor Mat Martineau
2021-09-04 8:00 ` Matthieu Baerts