* Re: [MPTCP] [RFC PATCH 1/4] mptcp: use sk_page_frag() in sendmsg
@ 2019-04-08 18:20 Mat Martineau
From: Mat Martineau @ 2019-04-08 18:20 UTC
  To: mptcp


On Mon, 8 Apr 2019, Paolo Abeni wrote:

> Hi,
>
> On Fri, 2019-04-05 at 15:17 -0700, Mat Martineau wrote:
>> @@ -80,33 +80,32 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>> 		goto put_out;
>>> 	}
>>>
>>> -	/* Initial experiment: new page per send.  Real code will
>>> -	 * maintain list of active pages and DSS mappings, append to the
>>> -	 * end and honor zerocopy
>>> -	 */
>>> -	page = alloc_page(GFP_KERNEL);
>>
>> Allocating a new page per send is very inefficient for small sends, but
>> was a placeholder for managing an MPTCP-level send buffer.
>>
>> For each connection, we have one MPTCP-level socket, and some number of
>> subflow sockets that it controls. Sent data needs to be stored by the
>> MPTCP-level socket until it sees a relevant MPTCP DATA_ACK from any one of
>> the subflows and can purge the ack'd data.
>>
>> To meet that requirement, the idea was to have the MPTCP-level socket
>> buffer the send data in a set of pages. Each sendmsg() call would append
>> to the current page while space is available, so each page would fill with
>> contiguous data (in the MPTCP sequence space). If data was sent with
>> MSG_ZEROCOPY the userspace-provided pages would be used instead. The
>> buffered data could then be sent (or resent) on multiple subflows by using
>> do_tcp_sendpages() to create independent skbs referencing the one
>> MPTCP-level copy of the data.
>>
>> For example:
>>
>>   * Userspace sends 1024 bytes of data
>>
>>   * MPTCP-level socket copies 1024 bytes to a page
>>
>>   * Four separate skbs are built, referencing the same page and offset for
>> the 1024 bytes
>>
>>   * Each of the four skbs is near-simultaneously queued on a separate
>> subflow
>>
>>   * One subflow is faster than the others, and gets the data sent and
>> DATA_ACK'd quickly
>>
>>   * MPTCP-level socket releases the buffer page (but the data might still
>> be referenced by a queued skb on a slow subflow)
>>
>> As a first step with a single subflow, I kept it simple by allocating one
>> page per send and not keeping track of that page in the MPTCP-level
>> socket.
>
> Thank you for the detailed feedback.
> To double check I'm on the same page: most/all of the above
> infrastructure and features are missing in the current code base.

That's correct.

>
> I *think* they could be added on top of this series with some
> changes...

Ok, good! I was hoping the additional background would help inform design 
choices at this stage rather than after you have done a lot more work.

>
>>> -	if (!page) {
>>> -		ret = -ENOMEM;
>>> -		goto put_out;
>>> +	lock_sock(sk);
>>> +	lock_sock(ssk);
>>> +
>>> +	/* use the subflow page cache so that memory accounting is coherent */
>>> +	pfrag = sk_page_frag(ssk);
>>
>> Given the plan to keep the paged data buffered by the MPTCP-level socket,
>> is sk_page_frag() the best way to get a page that might be used for skbs
>> in multiple subflows?
>
> ... but some additional data structure would be needed (RB-tree for the
> pending page fragments ?!?) ...

Right.

>
>> Can memory accounting work properly if the page frag comes from sk (the
>> MPTCP-level socket) instead of ssk?
>
> AFAICS, it should. I have not tested it.
>
> Yep, using sk's page frag should be better.
>
> I'll try to give it a spin, unless you prefer a different approach.
>

This looks like a good approach to me. Thanks!
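
For concreteness, here is a minimal sketch of that direction, reusing the 
helpers the posted patch already calls and just swapping sk in for ssk 
(the surrounding mptcp_sendmsg() context and error handling are assumed 
to stay as in the patch):

	/* illustrative only: take the frag from the MPTCP-level socket (sk)
	 * rather than the subflow (ssk), since the MPTCP-level socket is
	 * what has to keep the data around until DATA_ACK
	 */
	pfrag = sk_page_frag(sk);
	if (!sk_page_frag_refill(sk, pfrag)) {
		long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);

		ret = sk_stream_wait_memory(sk, &timeo);
		if (ret)
			goto release_out;
	}

That keeps the pending data tied to the socket that has to hold on to it 
until the DATA_ACK arrives.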

--
Mat Martineau
Intel


* Re: [MPTCP] [RFC PATCH 1/4] mptcp: use sk_page_frag() in sendmsg
@ 2019-04-08 16:45 Paolo Abeni
From: Paolo Abeni @ 2019-04-08 16:45 UTC
  To: mptcp


Hi,

On Fri, 2019-04-05 at 15:17 -0700, Mat Martineau wrote:
> @@ -80,33 +80,32 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > 		goto put_out;
> > 	}
> > 
> > -	/* Initial experiment: new page per send.  Real code will
> > -	 * maintain list of active pages and DSS mappings, append to the
> > -	 * end and honor zerocopy
> > -	 */
> > -	page = alloc_page(GFP_KERNEL);
> 
> Allocating a new page per send is very inefficient for small sends, but 
> was a placeholder for managing an MPTCP-level send buffer.
> 
> For each connection, we have one MPTCP-level socket, and some number of 
> subflow sockets that it controls. Sent data needs to be stored by the 
> MPTCP-level socket until it sees a relevant MPTCP DATA_ACK from any one of 
> the subflows and can purge the ack'd data.
> 
> To meet that requirement, the idea was to have the MPTCP-level socket 
> buffer the send data in a set of pages. Each sendmsg() call would append 
> to the current page while space is available, so each page would fill with 
> contiguous data (in the MPTCP sequence space). If data was sent with 
> MSG_ZEROCOPY the userspace-provided pages would be used instead. The 
> buffered data could then be sent (or resent) on multiple subflows by using 
> do_tcp_sendpages() to create independent skbs referencing the one 
> MPTCP-level copy of the data.
> 
> For example:
> 
>   * Userspace sends 1024 bytes of data
> 
>   * MPTCP-level socket copies 1024 bytes to a page
> 
>   * Four separate skbs are built, referencing the same page and offset for 
> the 1024 bytes
> 
>   * Each of the four skbs is near-simultaneously queued on a separate 
> subflow
> 
>   * One subflow is faster than the others, and gets the data sent and 
> DATA_ACK'd quickly
> 
>   * MPTCP-level socket releases the buffer page (but the data might still 
> be referenced by a queued skb on a slow subflow)
> 
> As a first step with a single subflow, I kept it simple by allocating one 
> page per send and not keeping track of that page in the MPTCP-level 
> socket.

Thank you for the detailed feedback. 
To double check I'm on the same page: most/all of the above
infrastructure and features are missing in the current code base.

I *think* they could be added on top of this series with some
changes...

> > -	if (!page) {
> > -		ret = -ENOMEM;
> > -		goto put_out;
> > +	lock_sock(sk);
> > +	lock_sock(ssk);
> > +
> > +	/* use the subflow page cache so that memory accounting is coherent */
> > +	pfrag = sk_page_frag(ssk);
> 
> Given the plan to keep the paged data buffered by the MPTCP-level socket, 
> is sk_page_frag() the best way to get a page that might be used for skbs 
> in multiple subflows?

... but some additional data structure would be needed (RB-tree for the
pending page fragments ?!?) ...
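
Something along these lines, maybe - every name below (mptcp_pending_frag, 
mptcp_pending_insert, the rb_root it lives in) is invented for the sketch, 
nothing like it exists in the tree yet:

/* needs <linux/rbtree.h>; all names are illustrative */
struct mptcp_pending_frag {
	struct rb_node	node;	/* keyed by dsn */
	u64		dsn;	/* MPTCP data sequence number of this chunk */
	struct page	*page;	/* holds its own page reference */
	u32		offset;
	u32		len;
};

/* insert in DSN order; the frag would be removed (and its page reference
 * dropped) once a DATA_ACK covering dsn + len arrives on any subflow
 */
static void mptcp_pending_insert(struct rb_root *root,
				 struct mptcp_pending_frag *frag)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct mptcp_pending_frag *cur;

		parent = *p;
		cur = rb_entry(parent, struct mptcp_pending_frag, node);
		if (frag->dsn < cur->dsn)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
	rb_link_node(&frag->node, parent, p);
	rb_insert_color(&frag->node, root);
}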

> Can memory accounting work properly if the page frag comes from sk (the 
> MPTCP-level socket) instead of ssk?

AFAICS, it should. I have not tested it.

Yep, using sk's page frag should be better.

I'll try to give it a spin, unless you prefer a different approach.

Thanks,

Paolo



* Re: [MPTCP] [RFC PATCH 1/4] mptcp: use sk_page_frag() in sendmsg
@ 2019-04-05 22:17 Mat Martineau
From: Mat Martineau @ 2019-04-05 22:17 UTC
  To: mptcp



On Fri, 5 Apr 2019, Paolo Abeni wrote:

> This cleans up the send path a bit and allows better performance.

Paolo -

Thanks for going over the sendmsg code in detail and working to enhance 
it. Some parts of the design were intentionally inefficient, since they 
were intermediate steps toward something more complex. Other parts I may 
have simply chosen the wrong tool for the job - so I appreciate the 
improvements to the code.

I'll comment on the design choices in context below:

>
> Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
> ---
> v1 -> v2
> - ret != 0 -> !ret
> - goto put_out -> goto release_out; - when we error out under lock
> ---
> net/mptcp/protocol.c | 41 +++++++++++++++++++----------------------
> 1 file changed, 19 insertions(+), 22 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 1e73e307a6d2..7fe8dff9f5f2 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -52,7 +52,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 	struct mptcp_sock *msk = mptcp_sk(sk);
> 	int mss_now, size_goal, poffset, ret;
> 	struct mptcp_ext *mpext = NULL;
> -	struct page *page = NULL;
> +	struct page_frag *pfrag;
> 	struct sk_buff *skb;
> 	struct sock *ssk;
> 	size_t psize;
> @@ -80,33 +80,32 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 		goto put_out;
> 	}
>
> -	/* Initial experiment: new page per send.  Real code will
> -	 * maintain list of active pages and DSS mappings, append to the
> -	 * end and honor zerocopy
> -	 */
> -	page = alloc_page(GFP_KERNEL);

Allocating a new page per send is very inefficient for small sends, but 
was a placeholder for managing an MPTCP-level send buffer.

For each connection, we have one MPTCP-level socket, and some number of 
subflow sockets that it controls. Sent data needs to be stored by the 
MPTCP-level socket until it sees a relevant MPTCP DATA_ACK from any one of 
the subflows and can purge the ack'd data.

To meet that requirement, the idea was to have the MPTCP-level socket 
buffer the send data in a set of pages. Each sendmsg() call would append 
to the current page while space is available, so each page would fill with 
contiguous data (in the MPTCP sequence space). If data was sent with 
MSG_ZEROCOPY the userspace-provided pages would be used instead. The 
buffered data could then be sent (or resent) on multiple subflows by using 
do_tcp_sendpages() to create independent skbs referencing the one 
MPTCP-level copy of the data.

For example:

  * Userspace sends 1024 bytes of data

  * MPTCP-level socket copies 1024 bytes to a page

  * Four separate skbs are built, referencing the same page and offset for 
the 1024 bytes

  * Each of the four skbs is near-simultaneously queued on a separate 
subflow

  * One subflow is faster than the others, and gets the data sent and 
DATA_ACK'd quickly

  * MPTCP-level socket releases the buffer page (but the data might still 
be referenced by a queued skb on a slow subflow)


As a first step with a single subflow, I kept it simple by allocating one 
page per send and not keeping track of that page in the MPTCP-level 
socket.
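
To make the page-sharing part concrete, the kind of helper I have in mind 
for later looks roughly like this - the subflow list (msk->subflows) and 
the mptcp_subflow fields are invented for the sketch, none of that exists 
yet, and error handling is omitted:

/* the skbs built by do_tcp_sendpages() hold their own references on the
 * page, so every subflow ends up with independent skbs backed by the same
 * MPTCP-level copy of the data, and that copy can be released on DATA_ACK
 * regardless of how far a slow subflow has gotten
 */
static void mptcp_push_pending(struct mptcp_sock *msk, struct page *page,
			       int offset, size_t len, int flags)
{
	struct mptcp_subflow *subflow;	/* hypothetical subflow descriptor */

	list_for_each_entry(subflow, &msk->subflows, node) {
		struct sock *ssk = subflow->sk;

		lock_sock(ssk);
		do_tcp_sendpages(ssk, page, offset, len,
				 flags | MSG_SENDPAGE_NOTLAST);
		release_sock(ssk);
	}
}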


> -	if (!page) {
> -		ret = -ENOMEM;
> -		goto put_out;
> +	lock_sock(sk);
> +	lock_sock(ssk);
> +
> +	/* use the subflow page cache so that memory accounting is coherent */
> +	pfrag = sk_page_frag(ssk);

Given the plan to keep the paged data buffered by the MPTCP-level socket, 
is sk_page_frag() the best way to get a page that might be used for skbs 
in multiple subflows?

Can memory accounting work properly if the page frag comes from sk (the 
MPTCP-level socket) instead of ssk?


Regards,

Mat


> +	if (!sk_page_frag_refill(ssk, pfrag)) {
> +		long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> +
> +		ret = sk_stream_wait_memory(sk, &timeo);
> +		if (ret)
> +			goto release_out;
> 	}
>
> 	/* Copy to page */
> -	poffset = 0;
> +	poffset = pfrag->offset;
> 	pr_debug("left=%zu", msg_data_left(msg));
> -	psize = copy_page_from_iter(page, poffset,
> +	psize = copy_page_from_iter(pfrag->page, poffset,
> 				    min_t(size_t, msg_data_left(msg),
> -					  PAGE_SIZE),
> +					  pfrag->size - poffset),
> 				    &msg->msg_iter);
> 	pr_debug("left=%zu", msg_data_left(msg));
> -
> 	if (!psize) {
> 		ret = -EINVAL;
> -		goto put_out;
> +		goto release_out;
> 	}
>
> -	lock_sock(sk);
> -	lock_sock(ssk);
> -
> 	/* Mark the end of the previous write so the beginning of the
> 	 * next write (with its own mptcp skb extension data) is not
> 	 * collapsed.
> @@ -116,8 +115,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 		TCP_SKB_CB(skb)->eor = 1;
>
> 	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
> -
> -	ret = do_tcp_sendpages(ssk, page, poffset, min_t(int, size_goal, psize),
> +	psize = min_t(int, size_goal, psize);
> +	ret = do_tcp_sendpages(ssk, pfrag->page, poffset, psize,
> 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
> 	if (ret <= 0)
> 		goto release_out;
> @@ -143,6 +142,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 			 mpext->checksum, mpext->dsn64);
> 	} /* TODO: else fallback */
>
> +	pfrag->offset += ret;
> 	msk->write_seq += ret;
> 	subflow_ctx(ssk)->rel_write_seq += ret;
>
> @@ -153,9 +153,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 	release_sock(sk);
>
> put_out:
> -	if (page)
> -		put_page(page);
> -
> 	sock_put(ssk);
> 	return ret;
> }
> -- 
> 2.20.1
>
>

--
Mat Martineau
Intel


* [MPTCP] [RFC PATCH 1/4] mptcp: use sk_page_frag() in sendmsg
@ 2019-04-05 14:46 Paolo Abeni
From: Paolo Abeni @ 2019-04-05 14:46 UTC
  To: mptcp


This cleans up the send path a bit and allows better performance.

Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
v1 -> v2
 - ret != 0 -> !ret
 - goto put_out -> goto release_out; - when we error out under lock
---
 net/mptcp/protocol.c | 41 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 1e73e307a6d2..7fe8dff9f5f2 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -52,7 +52,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	int mss_now, size_goal, poffset, ret;
 	struct mptcp_ext *mpext = NULL;
-	struct page *page = NULL;
+	struct page_frag *pfrag;
 	struct sk_buff *skb;
 	struct sock *ssk;
 	size_t psize;
@@ -80,33 +80,32 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		goto put_out;
 	}
 
-	/* Initial experiment: new page per send.  Real code will
-	 * maintain list of active pages and DSS mappings, append to the
-	 * end and honor zerocopy
-	 */
-	page = alloc_page(GFP_KERNEL);
-	if (!page) {
-		ret = -ENOMEM;
-		goto put_out;
+	lock_sock(sk);
+	lock_sock(ssk);
+
+	/* use the subflow page cache so that memory accounting is coherent */
+	pfrag = sk_page_frag(ssk);
+	if (!sk_page_frag_refill(ssk, pfrag)) {
+		long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+		ret = sk_stream_wait_memory(sk, &timeo);
+		if (ret)
+			goto release_out;
 	}
 
 	/* Copy to page */
-	poffset = 0;
+	poffset = pfrag->offset;
 	pr_debug("left=%zu", msg_data_left(msg));
-	psize = copy_page_from_iter(page, poffset,
+	psize = copy_page_from_iter(pfrag->page, poffset,
 				    min_t(size_t, msg_data_left(msg),
-					  PAGE_SIZE),
+					  pfrag->size - poffset),
 				    &msg->msg_iter);
 	pr_debug("left=%zu", msg_data_left(msg));
-
 	if (!psize) {
 		ret = -EINVAL;
-		goto put_out;
+		goto release_out;
 	}
 
-	lock_sock(sk);
-	lock_sock(ssk);
-
 	/* Mark the end of the previous write so the beginning of the
 	 * next write (with its own mptcp skb extension data) is not
 	 * collapsed.
@@ -116,8 +115,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		TCP_SKB_CB(skb)->eor = 1;
 
 	mss_now = tcp_send_mss(ssk, &size_goal, msg->msg_flags);
-
-	ret = do_tcp_sendpages(ssk, page, poffset, min_t(int, size_goal, psize),
+	psize = min_t(int, size_goal, psize);
+	ret = do_tcp_sendpages(ssk, pfrag->page, poffset, psize,
 			       msg->msg_flags | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
 		goto release_out;
@@ -143,6 +142,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			 mpext->checksum, mpext->dsn64);
 	} /* TODO: else fallback */
 
+	pfrag->offset += ret;
 	msk->write_seq += ret;
 	subflow_ctx(ssk)->rel_write_seq += ret;
 
@@ -153,9 +153,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	release_sock(sk);
 
 put_out:
-	if (page)
-		put_page(page);
-
 	sock_put(ssk);
 	return ret;
 }
-- 
2.20.1


