* [RFC PATCH 0/4] mptcp: just another receive path refactor
@ 2021-05-25 17:37 Paolo Abeni
  2021-05-25 17:37 ` [RFC PATCH 1/4] mptcp: wake-up readers only for in sequence data Paolo Abeni
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Paolo Abeni @ 2021-05-25 17:37 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

Some recent issues showed that we have perhaps too much complexity
in the receive path and memory accounting.

After the recent changes in release_cb() we can drop most of that
complexity in favour of a more traditional RX/memory accounting scheme.

This is the result.

The first 2 patches are actually bugfixes, even if I don't understand
why we almost never hit the condition addressed by patch 1/4 (while
the patched kernel hits it frequently).

The 3rd patch introduces the major change, and the fourth is just
cleanup.

This could have some negative performance effects, as on average more
locking is required for each packet. I'm doing some perf tests and will
report the results.

Paolo Abeni (4):
  mptcp: wake-up readers only for in sequence data.
  mptcp: don't clear MPTCP_DATA_READY in sk_wait_event()
  mptcp: move the whole rx path under msk socket lock protection
  mptcp: cleanup mem accounting.

 net/mptcp/protocol.c | 298 +++++++++----------------------------------
 net/mptcp/protocol.h |  20 +--
 net/mptcp/subflow.c  |  15 +--
 3 files changed, 68 insertions(+), 265 deletions(-)

-- 
2.26.3



* [RFC PATCH 1/4] mptcp: wake-up readers only for in sequence data.
  2021-05-25 17:37 [RFC PATCH 0/4] mptcp: just another receive path refactor Paolo Abeni
@ 2021-05-25 17:37 ` Paolo Abeni
  2021-05-25 17:37 ` [RFC PATCH 2/4] mptcp: don't clear MPTCP_DATA_READY in sk_wait_event() Paolo Abeni
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2021-05-25 17:37 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

Currently we rely on the subflow->data_avail field,
which is subject to races:

	ssk1
		skb len = 500 DSS(seq=1, len=1000, off=0)
		# data_avail == MPTCP_SUBFLOW_DATA_AVAIL

	ssk2
		skb len = 500 DSS(seq = 501, len=1000)
		# data_avail == MPTCP_SUBFLOW_DATA_AVAIL

	ssk1
		skb len = 500 DSS(seq = 1, len=1000, off =500)
		# still data_avail == MPTCP_SUBFLOW_DATA_AVAIL,
		# as the skb is covered by a pre-existing map,
		# which was in-sequence at reception time.

Instead we can explicitly check whether some data has been received
in-sequence, propagating the info from __mptcp_move_skbs_from_subflow().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 25 +++++++++----------------
 net/mptcp/protocol.h |  1 -
 net/mptcp/subflow.c  | 15 +++++----------
 3 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index bb029dd4ff5e..e85ec0e84e06 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -677,7 +677,7 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
 /* In most cases we will be able to lock the mptcp socket.  If its already
  * owned, we need to defer to the work queue to avoid ABBA deadlock.
  */
-static void move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
+static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 {
 	struct sock *sk = (struct sock *)msk;
 	unsigned int moved = 0;
@@ -698,6 +698,8 @@ static void move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	if (mptcp_pending_data_fin(sk, NULL))
 		mptcp_schedule_work(sk);
 	mptcp_data_unlock(sk);
+
+	return moved > 0;
 }
 
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
@@ -705,7 +707,6 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	int sk_rbuf, ssk_rbuf;
-	bool wake;
 
 	/* The peer can send data while we are shutting down this
 	 * subflow at msk destruction time, but we must avoid enqueuing
@@ -714,28 +715,20 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	if (unlikely(subflow->disposable))
 		return;
 
-	/* move_skbs_to_msk below can legitly clear the data_avail flag,
-	 * but we will need later to properly woke the reader, cache its
-	 * value
-	 */
-	wake = subflow->data_avail == MPTCP_SUBFLOW_DATA_AVAIL;
-	if (wake)
-		set_bit(MPTCP_DATA_READY, &msk->flags);
-
 	ssk_rbuf = READ_ONCE(ssk->sk_rcvbuf);
 	sk_rbuf = READ_ONCE(sk->sk_rcvbuf);
 	if (unlikely(ssk_rbuf > sk_rbuf))
 		sk_rbuf = ssk_rbuf;
 
-	/* over limit? can't append more skbs to msk */
+	/* over limit? can't append more skbs to msk. Also, no need to wake up */
 	if (atomic_read(&sk->sk_rmem_alloc) > sk_rbuf)
-		goto wake;
-
-	move_skbs_to_msk(msk, ssk);
+		return;
 
-wake:
-	if (wake)
+	/* Wake-up the reader only for in-sequence data */
+	if (move_skbs_to_msk(msk, ssk)) {
+		set_bit(MPTCP_DATA_READY, &msk->flags);
 		sk->sk_data_ready(sk);
+	}
 }
 
 static bool mptcp_do_flush_join_list(struct mptcp_sock *msk)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 520098188d80..2f22046a7565 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -374,7 +374,6 @@ mptcp_subflow_rsk(const struct request_sock *rsk)
 enum mptcp_data_avail {
 	MPTCP_SUBFLOW_NODATA,
 	MPTCP_SUBFLOW_DATA_AVAIL,
-	MPTCP_SUBFLOW_OOO_DATA
 };
 
 struct mptcp_delegated_action {
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 6b1cd4257edf..f8323a759af1 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1137,18 +1137,13 @@ static bool subflow_check_data_avail(struct sock *ssk)
 		ack_seq = mptcp_subflow_get_mapped_dsn(subflow);
 		pr_debug("msk ack_seq=%llx subflow ack_seq=%llx", old_ack,
 			 ack_seq);
-		if (ack_seq == old_ack) {
-			subflow->data_avail = MPTCP_SUBFLOW_DATA_AVAIL;
-			break;
-		} else if (after64(ack_seq, old_ack)) {
-			subflow->data_avail = MPTCP_SUBFLOW_OOO_DATA;
-			break;
+		if (unlikely(before64(ack_seq, old_ack))) {
+			mptcp_subflow_discard_data(ssk, skb, old_ack - ack_seq);
+			continue;
 		}
 
-		/* only accept in-sequence mapping. Old values are spurious
-		 * retransmission
-		 */
-		mptcp_subflow_discard_data(ssk, skb, old_ack - ack_seq);
+		subflow->data_avail = MPTCP_SUBFLOW_DATA_AVAIL;
+		break;
 	}
 	return true;
 
-- 
2.26.3



* [RFC PATCH 2/4] mptcp: don't clear MPTCP_DATA_READY in sk_wait_event()
  2021-05-25 17:37 [RFC PATCH 0/4] mptcp: just another receive path refactor Paolo Abeni
  2021-05-25 17:37 ` [RFC PATCH 1/4] mptcp: wake-up readers only for in sequence data Paolo Abeni
@ 2021-05-25 17:37 ` Paolo Abeni
  2021-05-25 17:37 ` [RFC PATCH 3/4] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2021-05-25 17:37 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

If we don't entirely flush the receive queue, we need to set the
MPTCP_DATA_READY bit again later. We can simply avoid clearing it.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e85ec0e84e06..6050431f4c86 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1707,7 +1707,7 @@ static void mptcp_wait_data(struct sock *sk, long *timeo)
 	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 
 	sk_wait_event(sk, timeo,
-		      test_and_clear_bit(MPTCP_DATA_READY, &msk->flags), &wait);
+		      test_bit(MPTCP_DATA_READY, &msk->flags), &wait);
 
 	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 	remove_wait_queue(sk_sleep(sk), &wait);
@@ -2028,10 +2028,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		 */
 		if (unlikely(__mptcp_move_skbs(msk)))
 			set_bit(MPTCP_DATA_READY, &msk->flags);
-	} else if (unlikely(!test_bit(MPTCP_DATA_READY, &msk->flags))) {
-		/* data to read but mptcp_wait_data() cleared DATA_READY */
-		set_bit(MPTCP_DATA_READY, &msk->flags);
 	}
+
 out_err:
 	if (cmsg_flags && copied >= 0) {
 		if (cmsg_flags & MPTCP_CMSG_TS)
-- 
2.26.3



* [RFC PATCH 3/4] mptcp: move the whole rx path under msk socket lock protection
  2021-05-25 17:37 [RFC PATCH 0/4] mptcp: just another receive path refactor Paolo Abeni
  2021-05-25 17:37 ` [RFC PATCH 1/4] mptcp: wake-up readers only for in sequence data Paolo Abeni
  2021-05-25 17:37 ` [RFC PATCH 2/4] mptcp: don't clear MPTCP_DATA_READY in sk_wait_event() Paolo Abeni
@ 2021-05-25 17:37 ` Paolo Abeni
  2021-05-26  0:06   ` Mat Martineau
  2021-05-25 17:37 ` [RFC PATCH 4/4] mptcp: cleanup mem accounting Paolo Abeni
  2021-05-28 15:18 ` [RFC PATCH 0/4] mptcp: just another receive path refactor Paolo Abeni
  4 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2021-05-25 17:37 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

After commit c2e6048fa1cf ("mptcp: fix race in release_cb") it's
pretty straightforward to move the whole MPTCP rx path under the msk
socket lock, leveraging the release_cb.

We can drop a bunch of spin_lock pairs in the receive functions, use
a single receive queue and invoke __mptcp_move_skbs only when subflows
ask for it.

This will allow more cleanup in the next patch
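
In short, the data ready path becomes (condensed from the hunks below,
just to show the shape of the change, no new code here):

	void mptcp_data_ready(struct sock *sk, struct sock *ssk)
	{
		mptcp_data_lock(sk);
		if (!sock_owned_by_user(sk))
			__mptcp_data_ready(sk, ssk);
		else
			set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->flags);
		mptcp_data_unlock(sk);
	}

with mptcp_release_cb() doing the actual dequeue on behalf of the
subflow when the socket was owned:

	if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk))
		sk->sk_data_ready(sk);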

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 94 ++++++++++++++++++--------------------------
 net/mptcp/protocol.h |  2 +-
 2 files changed, 39 insertions(+), 57 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6050431f4c86..57deea409d0c 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -682,11 +682,6 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	struct sock *sk = (struct sock *)msk;
 	unsigned int moved = 0;
 
-	if (inet_sk_state_load(sk) == TCP_CLOSE)
-		return;
-
-	mptcp_data_lock(sk);
-
 	__mptcp_move_skbs_from_subflow(msk, ssk, &moved);
 	__mptcp_ofo_queue(msk);
 
@@ -697,12 +692,11 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 	 */
 	if (mptcp_pending_data_fin(sk, NULL))
 		mptcp_schedule_work(sk);
-	mptcp_data_unlock(sk);
 
 	return moved > 0;
 }
 
-void mptcp_data_ready(struct sock *sk, struct sock *ssk)
+static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -731,6 +725,16 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	}
 }
 
+void mptcp_data_ready(struct sock *sk, struct sock *ssk)
+{
+	mptcp_data_lock(sk);
+	if (!sock_owned_by_user(sk))
+		__mptcp_data_ready(sk, ssk);
+	else
+		set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->flags);
+	mptcp_data_unlock(sk);
+}
+
 static bool mptcp_do_flush_join_list(struct mptcp_sock *msk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -967,7 +971,6 @@ static bool mptcp_wmem_alloc(struct sock *sk, int size)
 	if (msk->wmem_reserved >= size)
 		goto account;
 
-	mptcp_data_lock(sk);
 	if (!sk_wmem_schedule(sk, size)) {
 		mptcp_data_unlock(sk);
 		return false;
@@ -975,7 +978,6 @@ static bool mptcp_wmem_alloc(struct sock *sk, int size)
 
 	sk->sk_forward_alloc -= size;
 	msk->wmem_reserved += size;
-	mptcp_data_unlock(sk);
 
 account:
 	msk->wmem_reserved -= size;
@@ -1002,12 +1004,10 @@ static void mptcp_mem_reclaim_partial(struct sock *sk)
 	if (msk->wmem_reserved < 0)
 		return;
 
-	mptcp_data_lock(sk);
 	sk->sk_forward_alloc += msk->wmem_reserved;
 	sk_mem_reclaim_partial(sk);
 	msk->wmem_reserved = sk->sk_forward_alloc;
 	sk->sk_forward_alloc = 0;
-	mptcp_data_unlock(sk);
 }
 
 static void dfrag_uncharge(struct sock *sk, int len)
@@ -1092,9 +1092,7 @@ static void __mptcp_clean_una_wakeup(struct sock *sk)
 
 static void mptcp_clean_una_wakeup(struct sock *sk)
 {
-	mptcp_data_lock(sk);
 	__mptcp_clean_una_wakeup(sk);
-	mptcp_data_unlock(sk);
 }
 
 static void mptcp_enter_memory_pressure(struct sock *sk)
@@ -1713,16 +1711,22 @@ static void mptcp_wait_data(struct sock *sk, long *timeo)
 	remove_wait_queue(sk_sleep(sk), &wait);
 }
 
-static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
+static bool __mptcp_move_skbs(struct sock *sk);
+
+static int __mptcp_recvmsg_mskq(struct sock *sk,
 				struct msghdr *msg,
 				size_t len, int flags,
 				struct scm_timestamping_internal *tss,
 				int *cmsg_flags)
 {
+	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct sk_buff *skb, *tmp;
 	int copied = 0;
 
-	skb_queue_walk_safe(&msk->receive_queue, skb, tmp) {
+	if (skb_queue_empty(&sk->sk_receive_queue) && !__mptcp_move_skbs(sk))
+		return 0;
+
+	skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
 		u32 offset = MPTCP_SKB_CB(skb)->offset;
 		u32 data_len = skb->len - offset;
 		u32 count = min_t(size_t, len - copied, data_len);
@@ -1754,7 +1758,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
 			/* we will bulk release the skb memory later */
 			skb->destructor = NULL;
 			msk->rmem_released += skb->truesize;
-			__skb_unlink(skb, &msk->receive_queue);
+			__skb_unlink(skb, &sk->sk_receive_queue);
 			__kfree_skb(skb);
 		}
 
@@ -1875,16 +1879,9 @@ static void __mptcp_update_rmem(struct sock *sk)
 	msk->rmem_released = 0;
 }
 
-static void __mptcp_splice_receive_queue(struct sock *sk)
+static bool __mptcp_move_skbs(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	skb_queue_splice_tail_init(&sk->sk_receive_queue, &msk->receive_queue);
-}
-
-static bool __mptcp_move_skbs(struct mptcp_sock *msk)
-{
-	struct sock *sk = (struct sock *)msk;
 	unsigned int moved = 0;
 	bool ret, done;
 
@@ -1893,18 +1890,12 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
 		struct sock *ssk = mptcp_subflow_recv_lookup(msk);
 		bool slowpath;
 
-		/* we can have data pending in the subflows only if the msk
-		 * receive buffer was full at subflow_data_ready() time,
-		 * that is an unlikely slow path.
-		 */
-		if (likely(!ssk))
+		if (unlikely(!ssk))
 			break;
 
 		slowpath = lock_sock_fast(ssk);
-		mptcp_data_lock(sk);
 		__mptcp_update_rmem(sk);
 		done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
-		mptcp_data_unlock(sk);
 		tcp_cleanup_rbuf(ssk, moved);
 		unlock_sock_fast(ssk, slowpath);
 	} while (!done);
@@ -1912,17 +1903,16 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
 	/* acquire the data lock only if some input data is pending */
 	ret = moved > 0;
 	if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) ||
-	    !skb_queue_empty_lockless(&sk->sk_receive_queue)) {
-		mptcp_data_lock(sk);
+	    !skb_queue_empty(&sk->sk_receive_queue)) {
 		__mptcp_update_rmem(sk);
 		ret |= __mptcp_ofo_queue(msk);
-		__mptcp_splice_receive_queue(sk);
-		mptcp_data_unlock(sk);
 		mptcp_cleanup_rbuf(msk);
 	}
-	if (ret)
+	if (ret) {
+		set_bit(MPTCP_DATA_READY, &msk->flags);
 		mptcp_check_data_fin((struct sock *)msk);
-	return !skb_queue_empty(&msk->receive_queue);
+	}
+	return ret;
 }
 
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
@@ -1938,7 +1928,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
 
-	mptcp_lock_sock(sk, __mptcp_splice_receive_queue(sk));
+	lock_sock(sk);
 	if (unlikely(sk->sk_state == TCP_LISTEN)) {
 		copied = -ENOTCONN;
 		goto out_err;
@@ -1952,7 +1942,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	while (copied < len) {
 		int bytes_read;
 
-		bytes_read = __mptcp_recvmsg_mskq(msk, msg, len - copied, flags, &tss, &cmsg_flags);
+		bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, &tss, &cmsg_flags);
 		if (unlikely(bytes_read < 0)) {
 			if (!copied)
 				copied = bytes_read;
@@ -1964,9 +1954,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		/* be sure to advertise window change */
 		mptcp_cleanup_rbuf(msk);
 
-		if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk))
-			continue;
-
 		/* only the master socket status is relevant here. The exit
 		 * conditions mirror closely tcp_recvmsg()
 		 */
@@ -1993,7 +1980,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 				/* race breaker: the shutdown could be after the
 				 * previous receive queue check
 				 */
-				if (__mptcp_move_skbs(msk))
+				if (__mptcp_move_skbs(sk))
 					continue;
 				break;
 			}
@@ -2018,16 +2005,11 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		mptcp_wait_data(sk, &timeo);
 	}
 
-	if (skb_queue_empty_lockless(&sk->sk_receive_queue) &&
-	    skb_queue_empty(&msk->receive_queue)) {
-		/* entire backlog drained, clear DATA_READY. */
-		clear_bit(MPTCP_DATA_READY, &msk->flags);
-
-		/* .. race-breaker: ssk might have gotten new data
-		 * after last __mptcp_move_skbs() returned false.
+	if (skb_queue_empty(&sk->sk_receive_queue)) {
+		/* entire backlog drained, clear DATA_READY.
+		 * release_cb/__mptcp_move_skbs() will eventually set it again if needed
 		 */
-		if (unlikely(__mptcp_move_skbs(msk)))
-			set_bit(MPTCP_DATA_READY, &msk->flags);
+		clear_bit(MPTCP_DATA_READY, &msk->flags);
 	}
 
 out_err:
@@ -2376,7 +2358,6 @@ static int __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->join_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
 	INIT_WORK(&msk->work, mptcp_worker);
-	__skb_queue_head_init(&msk->receive_queue);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
 	msk->wmem_reserved = 0;
@@ -2828,9 +2809,6 @@ void mptcp_destroy_common(struct mptcp_sock *msk)
 
 	__mptcp_clear_xmit(sk);
 
-	/* move to sk_receive_queue, sk_stream_kill_queues will purge it */
-	skb_queue_splice_tail_init(&msk->receive_queue, &sk->sk_receive_queue);
-
 	skb_rbtree_purge(&msk->out_of_order_queue);
 	mptcp_token_destroy(msk);
 	mptcp_pm_free_anno_list(msk);
@@ -2882,6 +2860,8 @@ static void mptcp_release_cb(struct sock *sk)
 			flags |= BIT(MPTCP_PUSH_PENDING);
 		if (test_and_clear_bit(MPTCP_RETRANSMIT, &mptcp_sk(sk)->flags))
 			flags |= BIT(MPTCP_RETRANSMIT);
+		if (test_and_clear_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->flags))
+			flags |= BIT(MPTCP_DEQUEUE);
 		if (!flags)
 			break;
 
@@ -2898,6 +2878,8 @@ static void mptcp_release_cb(struct sock *sk)
 			__mptcp_push_pending(sk, 0);
 		if (flags & BIT(MPTCP_RETRANSMIT))
 			__mptcp_retrans(sk);
+		if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk))
+			sk->sk_data_ready(sk);
 
 		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 2f22046a7565..d392ee44deb3 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -111,6 +111,7 @@
 #define MPTCP_ERROR_REPORT	8
 #define MPTCP_RETRANSMIT	9
 #define MPTCP_WORK_SYNC_SETSOCKOPT 10
+#define MPTCP_DEQUEUE		11
 
 static inline bool before64(__u64 seq1, __u64 seq2)
 {
@@ -244,7 +245,6 @@ struct mptcp_sock {
 	struct work_struct work;
 	struct sk_buff  *ooo_last_skb;
 	struct rb_root  out_of_order_queue;
-	struct sk_buff_head receive_queue;
 	int		tx_pending_data;
 	int		size_goal_cache;
 	struct list_head conn_list;
-- 
2.26.3



* [RFC PATCH 4/4] mptcp: cleanup mem accounting.
  2021-05-25 17:37 [RFC PATCH 0/4] mptcp: just another receive path refactor Paolo Abeni
                   ` (2 preceding siblings ...)
  2021-05-25 17:37 ` [RFC PATCH 3/4] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
@ 2021-05-25 17:37 ` Paolo Abeni
  2021-05-26  0:12   ` Mat Martineau
  2021-05-28 15:18 ` [RFC PATCH 0/4] mptcp: just another receive path refactor Paolo Abeni
  4 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2021-05-25 17:37 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

After the previous patch, updating sk_forward_alloc is cheap and
we can drop a lot of complexity from the MPTCP memory accounting,
removing the bulk fwd mem allocations for wmem and rmem.
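
In practice the tx path now simply charges the forward memory directly
on the msk (condensed from the sendmsg hunks below):

	frag_truesize += psize;
	if (!sk_wmem_schedule(sk, frag_truesize))
		goto wait_for_memory;
	/* ... copy the data, then account it: */
	sk_wmem_queued_add(sk, frag_truesize);
	sk_mem_charge(sk, frag_truesize);

while the rx path uncharges each skb as it is consumed, mirroring what
the sock_rfree() destructor would have done:

	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
	sk_mem_uncharge(sk, skb->truesize);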

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 175 ++++---------------------------------------
 net/mptcp/protocol.h |  17 +----
 2 files changed, 14 insertions(+), 178 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 57deea409d0c..1a9ac2986581 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -900,116 +900,6 @@ static bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk,
 		df->data_seq + df->data_len == msk->write_seq;
 }
 
-static int mptcp_wmem_with_overhead(int size)
-{
-	return size + ((sizeof(struct mptcp_data_frag) * size) >> PAGE_SHIFT);
-}
-
-static void __mptcp_wmem_reserve(struct sock *sk, int size)
-{
-	int amount = mptcp_wmem_with_overhead(size);
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	WARN_ON_ONCE(msk->wmem_reserved);
-	if (WARN_ON_ONCE(amount < 0))
-		amount = 0;
-
-	if (amount <= sk->sk_forward_alloc)
-		goto reserve;
-
-	/* under memory pressure try to reserve at most a single page
-	 * otherwise try to reserve the full estimate and fallback
-	 * to a single page before entering the error path
-	 */
-	if ((tcp_under_memory_pressure(sk) && amount > PAGE_SIZE) ||
-	    !sk_wmem_schedule(sk, amount)) {
-		if (amount <= PAGE_SIZE)
-			goto nomem;
-
-		amount = PAGE_SIZE;
-		if (!sk_wmem_schedule(sk, amount))
-			goto nomem;
-	}
-
-reserve:
-	msk->wmem_reserved = amount;
-	sk->sk_forward_alloc -= amount;
-	return;
-
-nomem:
-	/* we will wait for memory on next allocation */
-	msk->wmem_reserved = -1;
-}
-
-static void __mptcp_update_wmem(struct sock *sk)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-#ifdef CONFIG_LOCKDEP
-	WARN_ON_ONCE(!lockdep_is_held(&sk->sk_lock.slock));
-#endif
-
-	if (!msk->wmem_reserved)
-		return;
-
-	if (msk->wmem_reserved < 0)
-		msk->wmem_reserved = 0;
-	if (msk->wmem_reserved > 0) {
-		sk->sk_forward_alloc += msk->wmem_reserved;
-		msk->wmem_reserved = 0;
-	}
-}
-
-static bool mptcp_wmem_alloc(struct sock *sk, int size)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	/* check for pre-existing error condition */
-	if (msk->wmem_reserved < 0)
-		return false;
-
-	if (msk->wmem_reserved >= size)
-		goto account;
-
-	if (!sk_wmem_schedule(sk, size)) {
-		mptcp_data_unlock(sk);
-		return false;
-	}
-
-	sk->sk_forward_alloc -= size;
-	msk->wmem_reserved += size;
-
-account:
-	msk->wmem_reserved -= size;
-	return true;
-}
-
-static void mptcp_wmem_uncharge(struct sock *sk, int size)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	if (msk->wmem_reserved < 0)
-		msk->wmem_reserved = 0;
-	msk->wmem_reserved += size;
-}
-
-static void mptcp_mem_reclaim_partial(struct sock *sk)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	/* if we are experiencing a transint allocation error,
-	 * the forward allocation memory has been already
-	 * released
-	 */
-	if (msk->wmem_reserved < 0)
-		return;
-
-	sk->sk_forward_alloc += msk->wmem_reserved;
-	sk_mem_reclaim_partial(sk);
-	msk->wmem_reserved = sk->sk_forward_alloc;
-	sk->sk_forward_alloc = 0;
-}
-
 static void dfrag_uncharge(struct sock *sk, int len)
 {
 	sk_mem_uncharge(sk, len);
@@ -1066,12 +956,8 @@ static void __mptcp_clean_una(struct sock *sk)
 	}
 
 out:
-	if (cleaned) {
-		if (tcp_under_memory_pressure(sk)) {
-			__mptcp_update_wmem(sk);
-			sk_mem_reclaim_partial(sk);
-		}
-	}
+	if (cleaned && tcp_under_memory_pressure(sk))
+		sk_mem_reclaim_partial(sk);
 
 	if (snd_una == READ_ONCE(msk->snd_nxt)) {
 		if (msk->timer_ival && !mptcp_data_fin_enabled(msk))
@@ -1083,18 +969,10 @@ static void __mptcp_clean_una(struct sock *sk)
 
 static void __mptcp_clean_una_wakeup(struct sock *sk)
 {
-#ifdef CONFIG_LOCKDEP
-	WARN_ON_ONCE(!lockdep_is_held(&sk->sk_lock.slock));
-#endif
 	__mptcp_clean_una(sk);
 	mptcp_write_space(sk);
 }
 
-static void mptcp_clean_una_wakeup(struct sock *sk)
-{
-	__mptcp_clean_una_wakeup(sk);
-}
-
 static void mptcp_enter_memory_pressure(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow;
@@ -1229,7 +1107,7 @@ static bool mptcp_must_reclaim_memory(struct sock *sk, struct sock *ssk)
 static bool mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk)
 {
 	if (unlikely(mptcp_must_reclaim_memory(sk, ssk)))
-		mptcp_mem_reclaim_partial(sk);
+		sk_mem_reclaim_partial(sk);
 	return __mptcp_alloc_tx_skb(sk, ssk, sk->sk_allocation);
 }
 
@@ -1533,10 +1411,8 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk)
 				goto out;
 			}
 
-			if (unlikely(mptcp_must_reclaim_memory(sk, ssk))) {
-				__mptcp_update_wmem(sk);
+			if (unlikely(mptcp_must_reclaim_memory(sk, ssk)))
 				sk_mem_reclaim_partial(sk);
-			}
 			if (!__mptcp_alloc_tx_skb(sk, ssk, GFP_ATOMIC))
 				goto out;
 
@@ -1560,7 +1436,6 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk)
 	/* __mptcp_alloc_tx_skb could have released some wmem and we are
 	 * not going to flush it via release_sock()
 	 */
-	__mptcp_update_wmem(sk);
 	if (copied) {
 		mptcp_set_timeout(sk, ssk);
 		tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle,
@@ -1598,7 +1473,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	/* silently ignore everything else */
 	msg->msg_flags &= MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL;
 
-	mptcp_lock_sock(sk, __mptcp_wmem_reserve(sk, min_t(size_t, 1 << 20, len)));
+	lock_sock(sk);
 
 	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 
@@ -1611,8 +1486,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	pfrag = sk_page_frag(sk);
 
 	while (msg_data_left(msg)) {
-		int total_ts, frag_truesize = 0;
 		struct mptcp_data_frag *dfrag;
+		int frag_truesize = 0;
 		bool dfrag_collapsed;
 		size_t psize, offset;
 
@@ -1644,14 +1519,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		offset = dfrag->offset + dfrag->data_len;
 		psize = pfrag->size - offset;
 		psize = min_t(size_t, psize, msg_data_left(msg));
-		total_ts = psize + frag_truesize;
+		frag_truesize += psize;
 
-		if (!mptcp_wmem_alloc(sk, total_ts))
+		if (!sk_wmem_schedule(sk, frag_truesize))
 			goto wait_for_memory;
 
 		if (copy_page_from_iter(dfrag->page, offset, psize,
 					&msg->msg_iter) != psize) {
-			mptcp_wmem_uncharge(sk, psize + frag_truesize);
 			ret = -EFAULT;
 			goto out;
 		}
@@ -1659,7 +1533,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		/* data successfully copied into the write queue */
 		copied += psize;
 		dfrag->data_len += psize;
-		frag_truesize += psize;
 		pfrag->offset += frag_truesize;
 		WRITE_ONCE(msk->write_seq, msk->write_seq + psize);
 		msk->tx_pending_data += psize;
@@ -1668,6 +1541,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		 * Note: we charge such data both to sk and ssk
 		 */
 		sk_wmem_queued_add(sk, frag_truesize);
+		sk_mem_charge(sk, frag_truesize);
 		if (!dfrag_collapsed) {
 			get_page(dfrag->page);
 			list_add_tail(&dfrag->list, &msk->rtx_queue);
@@ -1719,7 +1593,6 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
 				struct scm_timestamping_internal *tss,
 				int *cmsg_flags)
 {
-	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct sk_buff *skb, *tmp;
 	int copied = 0;
 
@@ -1755,9 +1628,10 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
 		}
 
 		if (!(flags & MSG_PEEK)) {
-			/* we will bulk release the skb memory later */
+			/* avoid the indirect call, we know the destructor is sock_rfree */
 			skb->destructor = NULL;
-			msk->rmem_released += skb->truesize;
+			atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+			sk_mem_uncharge(sk, skb->truesize);
 			__skb_unlink(skb, &sk->sk_receive_queue);
 			__kfree_skb(skb);
 		}
@@ -1867,17 +1741,6 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 	msk->rcvq_space.time = mstamp;
 }
 
-static void __mptcp_update_rmem(struct sock *sk)
-{
-	struct mptcp_sock *msk = mptcp_sk(sk);
-
-	if (!msk->rmem_released)
-		return;
-
-	atomic_sub(msk->rmem_released, &sk->sk_rmem_alloc);
-	sk_mem_uncharge(sk, msk->rmem_released);
-	msk->rmem_released = 0;
-}
 
 static bool __mptcp_move_skbs(struct sock *sk)
 {
@@ -1894,7 +1757,6 @@ static bool __mptcp_move_skbs(struct sock *sk)
 			break;
 
 		slowpath = lock_sock_fast(ssk);
-		__mptcp_update_rmem(sk);
 		done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
 		tcp_cleanup_rbuf(ssk, moved);
 		unlock_sock_fast(ssk, slowpath);
@@ -1904,7 +1766,6 @@ static bool __mptcp_move_skbs(struct sock *sk)
 	ret = moved > 0;
 	if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) ||
 	    !skb_queue_empty(&sk->sk_receive_queue)) {
-		__mptcp_update_rmem(sk);
 		ret |= __mptcp_ofo_queue(msk);
 		mptcp_cleanup_rbuf(msk);
 	}
@@ -2250,7 +2111,7 @@ static void __mptcp_retrans(struct sock *sk)
 	struct sock *ssk;
 	int ret;
 
-	mptcp_clean_una_wakeup(sk);
+	__mptcp_clean_una_wakeup(sk);
 	dfrag = mptcp_rtx_head(sk);
 	if (!dfrag) {
 		if (mptcp_data_fin_enabled(msk)) {
@@ -2360,8 +2221,6 @@ static int __mptcp_init_sock(struct sock *sk)
 	INIT_WORK(&msk->work, mptcp_worker);
 	msk->out_of_order_queue = RB_ROOT;
 	msk->first_pending = NULL;
-	msk->wmem_reserved = 0;
-	msk->rmem_released = 0;
 	msk->tx_pending_data = 0;
 
 	msk->ack_hint = NULL;
@@ -2576,8 +2435,6 @@ static void __mptcp_destroy_sock(struct sock *sk)
 
 	sk->sk_prot->destroy(sk);
 
-	WARN_ON_ONCE(msk->wmem_reserved);
-	WARN_ON_ONCE(msk->rmem_released);
 	sk_stream_kill_queues(sk);
 	xfrm_sk_free_policy(sk);
 
@@ -2889,12 +2746,6 @@ static void mptcp_release_cb(struct sock *sk)
 		__mptcp_clean_una_wakeup(sk);
 	if (test_and_clear_bit(MPTCP_ERROR_REPORT, &mptcp_sk(sk)->flags))
 		__mptcp_error_report(sk);
-
-	/* push_pending may touch wmem_reserved, ensure we do the cleanup
-	 * later
-	 */
-	__mptcp_update_wmem(sk);
-	__mptcp_update_rmem(sk);
 }
 
 void mptcp_subflow_process_delegated(struct sock *ssk)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index d392ee44deb3..94ca8d6e2f97 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -223,7 +223,6 @@ struct mptcp_sock {
 	u64		ack_seq;
 	u64		rcv_wnd_sent;
 	u64		rcv_data_fin_seq;
-	int		wmem_reserved;
 	struct sock	*last_snd;
 	int		snd_burst;
 	int		old_wspace;
@@ -231,7 +230,6 @@ struct mptcp_sock {
 	u64		wnd_end;
 	unsigned long	timer_ival;
 	u32		token;
-	int		rmem_released;
 	unsigned long	flags;
 	bool		can_ack;
 	bool		fully_established;
@@ -265,19 +263,6 @@ struct mptcp_sock {
 	char		ca_name[TCP_CA_NAME_MAX];
 };
 
-#define mptcp_lock_sock(___sk, cb) do {					\
-	struct sock *__sk = (___sk); /* silence macro reuse warning */	\
-	might_sleep();							\
-	spin_lock_bh(&__sk->sk_lock.slock);				\
-	if (__sk->sk_lock.owned)					\
-		__lock_sock(__sk);					\
-	cb;								\
-	__sk->sk_lock.owned = 1;					\
-	spin_unlock(&__sk->sk_lock.slock);				\
-	mutex_acquire(&__sk->sk_lock.dep_map, 0, 0, _RET_IP_);		\
-	local_bh_enable();						\
-} while (0)
-
 #define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock)
 #define mptcp_data_unlock(sk) spin_unlock_bh(&(sk)->sk_lock.slock)
 
@@ -296,7 +281,7 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
 
 static inline int __mptcp_space(const struct sock *sk)
 {
-	return tcp_space(sk) + READ_ONCE(mptcp_sk(sk)->rmem_released);
+	return tcp_space(sk);
 }
 
 static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
-- 
2.26.3



* Re: [RFC PATCH 3/4] mptcp: move the whole rx path under msk socket lock protection
  2021-05-25 17:37 ` [RFC PATCH 3/4] mptcp: move the whole rx path under msk socket lock protection Paolo Abeni
@ 2021-05-26  0:06   ` Mat Martineau
  2021-05-26 10:50     ` Paolo Abeni
  0 siblings, 1 reply; 10+ messages in thread
From: Mat Martineau @ 2021-05-26  0:06 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp, fwestpha

On Tue, 25 May 2021, Paolo Abeni wrote:

> After commit c2e6048fa1cf ("mptcp: fix race in release_cb") it's
> pretty straightforward to move the whole MPTCP rx path under the msk
> socket lock, leveraging the release_cb.
>
> We can drop a bunch of spin_lock pairs in the receive functions, use
> a single receive queue and invoke __mptcp_move_skbs only when subflows
> ask for it.
>
> This will allow more cleanup in the next patch
>

Like you said in the cover letter, the perf data will really help with 
understanding the performance tradeoff.

I'm a little paranoid about the locking changes, since the 
mptcp_data_lock() is used to protect several things. What do you think 
about a debug patch (maybe temporarily in the export branch, but not 
upstreamed) that used spin_is_locked() or assert_spin_locked() to double 
check that there's still lock coverage where we expect it?
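
Something along these lines is what I have in mind - helper name and
config guard are made up, just to illustrate the idea:

	/* debug only: assert that the msk data lock (sk_lock.slock) is
	 * held where we still expect coverage after this refactor
	 */
	#ifdef CONFIG_MPTCP_DEBUG_LOCKING
	#define mptcp_assert_data_lock(sk) \
		assert_spin_locked(&(sk)->sk_lock.slock)
	#else
	#define mptcp_assert_data_lock(sk) do { } while (0)
	#endif

sprinkled into the helpers that used to take mptcp_data_lock().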

Overall, I like this direction - hope the performance looks ok!

-Mat



--
Mat Martineau
Intel


* Re: [RFC PATCH 4/4] mptcp: cleanup mem accounting.
  2021-05-25 17:37 ` [RFC PATCH 4/4] mptcp: cleanup mem accounting Paolo Abeni
@ 2021-05-26  0:12   ` Mat Martineau
  2021-05-26 10:42     ` Paolo Abeni
  0 siblings, 1 reply; 10+ messages in thread
From: Mat Martineau @ 2021-05-26  0:12 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: mptcp, fwestpha

On Tue, 25 May 2021, Paolo Abeni wrote:

> After the previous patch, updating sk_forward_alloc is cheap and
> we can drop a lot of complexity from the MPTCP memory accounting,
> removing the bulk fwd mem allocations for wmem and rmem.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>
> @@ -296,7 +281,7 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
>
> static inline int __mptcp_space(const struct sock *sk)
> {
> -	return tcp_space(sk) + READ_ONCE(mptcp_sk(sk)->rmem_released);
> +	return tcp_space(sk);
> }

Minor - looks like __mptcp_space() isn't needed any more either.

>
> static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
> -- 
> 2.26.3
>
>
>

--
Mat Martineau
Intel


* Re: [RFC PATCH 4/4] mptcp: cleanup mem accounting.
  2021-05-26  0:12   ` Mat Martineau
@ 2021-05-26 10:42     ` Paolo Abeni
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2021-05-26 10:42 UTC (permalink / raw)
  To: Mat Martineau; +Cc: mptcp, fwestpha

On Tue, 2021-05-25 at 17:12 -0700, Mat Martineau wrote:
> On Tue, 25 May 2021, Paolo Abeni wrote:
> > @@ -296,7 +281,7 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk)
> > 
> > static inline int __mptcp_space(const struct sock *sk)
> > {
> > -	return tcp_space(sk) + READ_ONCE(mptcp_sk(sk)->rmem_released);
> > +	return tcp_space(sk);
> > }
> 
> Minor - looks like __mptcp_space() isn't needed any more either.

Exactly! I *guess* there is also some additional follow-up
cleanup, but I stopped here to avoid this thing becoming too big.

Anyhow, I can add the above as an additional patch.

Cheers,

Paolo



* Re: [RFC PATCH 3/4] mptcp: move the whole rx path under msk socket lock protection
  2021-05-26  0:06   ` Mat Martineau
@ 2021-05-26 10:50     ` Paolo Abeni
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2021-05-26 10:50 UTC (permalink / raw)
  To: Mat Martineau; +Cc: mptcp, fwestpha

On Tue, 2021-05-25 at 17:06 -0700, Mat Martineau wrote:
> On Tue, 25 May 2021, Paolo Abeni wrote:
> 
> > After commit c2e6048fa1cf ("mptcp: fix race in release_cb") it's
> > pretty straightforward to move the whole MPTCP rx path under the msk
> > socket lock, leveraging the release_cb.
> > 
> > We can drop a bunch of spin_lock pairs in the receive functions, use
> > a single receive queue and invoke __mptcp_move_skbs only when subflows
> > ask for it.
> > 
> > This will allow more cleanup in the next patch
> > 
> 
> Like you said in the cover letter, the perf data will really help with 
> understanding the performance tradeoff.
> 
> I'm a little paranoid about the locking changes, since the 
> mptcp_data_lock() is used to protect several things. What do you think 
> about a debug patch (maybe temporarily in the export branch, but not 
> upstreamed) that used spin_is_locked() or assert_spin_locked() to double 
> check that there's still lock coverage where we expect it?

I thought about that. The "problem" is that the relevant lockdep
assertion is 'lockdep_sock_is_held()', which is both quite simple to
verify with code inspection only and not very accurate:
lockdep_sock_is_held() can be true while the caller does not hold the
appropriate lock, e.g. if lockdep_is_held(&sk->sk_lock.slock) is true
but sk->sk_lock.owned is set by someone else.
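
For reference, lockdep_sock_is_held() is roughly (from
include/net/sock.h):

	static inline bool lockdep_sock_is_held(const struct sock *sk)
	{
		return lockdep_is_held(&sk->sk_lock) ||
		       lockdep_is_held(&sk->sk_lock.slock);
	}

so it is satisfied by either the owner lock or the slock, and cannot
tell the two apart.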

TL;DR: a lock assertion can be added, but very likely with little value.
Please let me know if I should add it in the next iteration.

On the flip/positive side, I kept this thing running for the whole
weekend without any issue ;)

Thanks!

Paolo




* Re: [RFC PATCH 0/4] mptcp: just another receive path refactor
  2021-05-25 17:37 [RFC PATCH 0/4] mptcp: just another receive path refactor Paolo Abeni
                   ` (3 preceding siblings ...)
  2021-05-25 17:37 ` [RFC PATCH 4/4] mptcp: cleanup mem accounting Paolo Abeni
@ 2021-05-28 15:18 ` Paolo Abeni
  4 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2021-05-28 15:18 UTC (permalink / raw)
  To: mptcp; +Cc: fwestpha

On Tue, 2021-05-25 at 19:37 +0200, Paolo Abeni wrote:
> This could have some negative performance effects, as on average more
> locking is required for each packet. I'm doing some perf tests and will
> report the results.

There are several different possible scenarios:

1) single subflow, ksoftirq && user-space process run on the same CPU
2) multiple subflows, ksoftirqs && user-space process run on the same
CPU
3) single subflow, ksoftirq && user-space process run on different CPUs
4) multiple subflows, ksoftirqs && user-space process run on different
CPUs

With a single subflow, the most common scenario is ksoftirq &&
user-space process run on the same CPU. With multiple subflows on
reasonable server H/W we should likely observe a more mixed situation:
softirqs running on multiple CPUs, one of them also hosting the
user-space process. I don't have data for that yet.

The figures:

scenario	export branch 		RX path refactor	delta
1)		23Mbps			21Mbps			-8%
2)		30Mbps			19Mbps			-37%
3)		17.8Mbps		17.5Mbps		noise range
4)		1-3Mbps			1-3Mbps			???

The last scenario uncovered a bug: we likely don't send MPTCP-level ACKs
frequently enough under some conditions. That *could* possibly be
related to:

https://github.com/multipath-tcp/mptcp_net-next/issues/137

but I'm unsure about that.

The delta in scenario 2) is quite significant.

The root cause is that in that scenario the user-space process is the
bottleneck: it keeps a CPU fully busy, spending most of the available
cycles memcpying the data to user space.

With the current export branch, the skb movement/enqueuing happens
entirely inside the ksoftirqd processes.

On top of the RX path refactor, some skb handling is performed by
mptcp_release_cb() inside the scope of the user-space process. That
reduces the number of CPU cycles available for memcpying the data and
thus also reduces the overall throughput.

I experimented with a different approach - e.g. keeping the skbs
accounted to the incoming subflows - but that does not look feasible.

Input wanted: WDYT of the above?

Thanks!

Paolo



