* [PATCH v1] sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
From: Chuck Lever @ 2017-06-30 16:03 UTC
  To: bfields; +Cc: linux-nfs

svcrdma needs 259 pages for 1MB NFSv4.0 READ requests:

 - 1 page for the transport header and head iovec
 - 256 pages for the data payload
 - 1 page for the trailing GETATTR request (since NFSD XDR decoding
   does not look for a tail iovec, the GETATTR is stuck at the end
   of the rqstp->rq_arg.pages list)
 - 1 page for building the reply xdr_buf

But RPCSVC_MAXPAGES is already 259 (on x86_64). The problem is that
svc_alloc_arg never allocates that many pages. To address this:

1. The final element of rq_pages always points to NULL. To
   accommodate up to 259 pages in rq_pages, add an extra element
   to rq_pages for the array termination sentinel.

2. Adjust the calculation of "pages" to match how RPCSVC_MAXPAGES
   is calculated, so it can go up to 259. Bruce noted that the
   calculation assumes sv_max_mesg is a multiple of PAGE_SIZE,
   which might not always be true. I didn't change this assumption.

3. Change the loop boundaries to allow 259 pages to be allocated.

Additional clean-up: WARN_ON_ONCE adds an extra conditional branch,
which is basically never taken. And there's no need to dump the
stack here because svc_alloc_arg has only one caller.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
Hi Bruce-

This is what I proposed yesterday evening. Instead of changing
the value of RPCSVC_MAXPAGES, ensure that svc_alloc_arg can
safely allocate that many pages.


 include/linux/sunrpc/svc.h |    2 +-
 net/sunrpc/svc_xprt.c      |   10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 11cef5a..d741399 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -246,7 +246,7 @@ struct svc_rqst {
 	size_t			rq_xprt_hlen;	/* xprt header len */
 	struct xdr_buf		rq_arg;
 	struct xdr_buf		rq_res;
-	struct page *		rq_pages[RPCSVC_MAXPAGES];
+	struct page		*rq_pages[RPCSVC_MAXPAGES + 1];
 	struct page *		*rq_respages;	/* points into rq_pages */
 	struct page *		*rq_next_page; /* next reply page to use */
 	struct page *		*rq_page_end;  /* one past the last page */
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 7bfe1fb..d16a8b4 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -659,11 +659,13 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 	int i;
 
 	/* now allocate needed pages.  If we get a failure, sleep briefly */
-	pages = (serv->sv_max_mesg + PAGE_SIZE) / PAGE_SIZE;
-	WARN_ON_ONCE(pages >= RPCSVC_MAXPAGES);
-	if (pages >= RPCSVC_MAXPAGES)
+	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
+	if (pages > RPCSVC_MAXPAGES) {
+		pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
+			     pages, RPCSVC_MAXPAGES);
 		/* use as many pages as possible */
-		pages = RPCSVC_MAXPAGES - 1;
+		pages = RPCSVC_MAXPAGES;
+	}
 	for (i = 0; i < pages ; i++)
 		while (rqstp->rq_pages[i] == NULL) {
 			struct page *p = alloc_page(GFP_KERNEL);



* Re: [PATCH v1] sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
From: J. Bruce Fields @ 2017-06-30 20:16 UTC
  To: Chuck Lever; +Cc: linux-nfs

I like that better.

I'm still pretty confused.

The long comment above RPCSVC_MAXPAGES convinces me that it's
calculating the maximum number of pages a TCP request might need.  But
apparently rq_pages has never been large enough to hold that many pages.
Has nobody ever tried an unaligned 1MB read?

We could use comments each time elsewhere when we add another random 1
or 2 somewhere.  What does sv_max_mesg really mean, why are we adding 2
to it in svc_alloc_arg, etc.

Might make more sense if RPCSVC_MAXPAYLOAD was explicitly calculated as
a max over the compiled-in transports, too.

I feel like we've actually been through this before, but somehow I'm
still confused every time I look at it.

--b.


* Re: [PATCH v1] sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
From: Chuck Lever @ 2017-06-30 20:33 UTC
  To: J. Bruce Fields; +Cc: Linux NFS Mailing List


> On Jun 30, 2017, at 4:16 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> I like that better.
> 
> I'm still pretty confused.
> 
> The long comment above RPCSVC_MAXPAGES convinces me that it's
> calculating the maximum number of pages a TCP request might need.  But
> apparently rq_pages has never been large enough to hold that many pages.
> Has nobody ever tried an unaligned 1MB read?

I'm not sure how the TCP transport works. If it tries to fill the
head iovec completely, then goes into the page list, then it might
not ever need the extra page on the end for the GETATTR.

For NFS/RDMA, the transport currently aligns the incoming data so
it always starts at byte 0 of the first page in the xdr_buf's page
list. That's going to have to change if we want direct placement
of the write payload, and at that point an unaligned payload will
be possible.


> We could use comments each time elsewhere when we add another random 1
> or 2 somewhere.  What does sv_max_mesg really mean, why are we adding 2
> to it in svc_alloc_arg, etc.

I'm open to suggestions. A comment in svc_alloc_arg might just
refer weary readers to the comment in front of the definition of
RPCSVC_MAXPAGES.
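
For reference, RPCSVC_MAXPAGES is defined in include/linux/sunrpc/svc.h
along these lines (a sketch from memory; the long comment above it is
elided here):

	#define RPCSVC_MAXPAYLOAD	(1*1024*1024u)
	#define RPCSVC_MAXPAGES	((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 2 + 1)

With a 1MB maximum payload and 4KB pages, that's 256 + 2 + 1 = 259.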

Let me know if there's anything else you want to see changed, or
if this patch is now acceptable.


> Might make more sense if RPCSVC_MAXPAYLOAD was explicitly calculated as
> a max over the compiled-in transports, too.

IMO sv_max_mesg could be expressed in pages, instead of bytes, so
that the computation in svc_alloc_arg isn't repeated for every RPC.
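
(For example, a hypothetical sv_max_pages field computed once at
service creation time; a sketch only, not existing code:

	serv->sv_max_pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT; /* hypothetical */

and svc_alloc_arg would simply read it.)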


> I feel like we've actually been through this before, but somehow I'm
> still confused every time I look at it.
> 
> --b.
> 
> On Fri, Jun 30, 2017 at 12:03:54PM -0400, Chuck Lever wrote:
>> svcrdma needs 259 pages for 1MB NFSv4.0 READ requests:

This should be

"svcrdma needs 259 pages for 1MB NFSv4.0 WRITE requests:"


--
Chuck Lever

* Re: [PATCH v1] sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
From: J. Bruce Fields @ 2017-06-30 20:47 UTC
  To: Chuck Lever; +Cc: Linux NFS Mailing List

On Fri, Jun 30, 2017 at 04:33:23PM -0400, Chuck Lever wrote:
> 
> > On Jun 30, 2017, at 4:16 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > I like that better.
> > 
> > I'm still pretty confused.
> > 
> > The long comment above RPCSVC_MAXPAGES convinces me that it's
> > calculating the maximum number of pages a TCP request might need.  But
> > apparently rq_pages has never been large enough to hold that many pages.
> > Has nobody ever tried an unaligned 1MB read?
> 
> I'm not sure how the TCP transport works. If it tries to fill the
> head iovec completely, then goes into the page list, then it might
> not ever need the extra page on the end for the GETATTR.

I think an unaligned read of N pages ends up with rq_pages like:

	1st page: read request
	2nd page: head+tail of reply
	3rd page: partial first page of read data
	N-1 pages: read data
	last page: partial last page of read data

So, N+3 pages.
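
(With N = 256 for a 1MB read and 4KB pages, that's 259 pages, i.e.
exactly RPCSVC_MAXPAGES.)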

> For NFS/RDMA, the transport currently aligns the incoming data so
> it always starts at byte 0 of the first page in the xdr_buf's page
> list. That's going to have to change if we want direct placement
> of the write payload, and at that point an unaligned payload will
> be possible.
> 
> 
> > We could use comments each time elsewhere when we add another random 1
> > or 2 somewhere.  What does sv_max_mesg really mean, why are we adding 2
> > to it in svc_alloc_arg, etc.
> 
> I'm open to suggestions. A comment in svc_alloc_arg might just
> refer weary readers to the comment in front of the definition of
> RPCSVC_MAXPAGES.

There are 3 extra pages in rq_pages beyond what's needed to hold a max
IO size buffer, and I'd like the comment to explain *which* two of
those three pages we're accounting for here.  I could figure that out
if I knew what sv_max_mesg meant, but the comment there just says 1
page is added for "overhead", which isn't helpful.

> Let me know if there's anything else you want to see changed, or
> if this patch is now acceptable.
> 
> 
> > Might make more sense if RPCSVC_MAXPAYLOAD was explicitly calculated as
> > a max over the compiled-in transports, too.
> 
> IMO sv_max_mesg could be expressed in pages, instead of bytes, so
> that the computation in svc_alloc_arg isn't repeated for every RPC.

Makes sense to me.

> 
> 
> > I feel like we've actually been through this before, but somehow I'm
> > still confused every time I look at it.
> > 
> > --b.
> > 
> > On Fri, Jun 30, 2017 at 12:03:54PM -0400, Chuck Lever wrote:
> >> svcrdma needs 259 pages for 1MB NFSv4.0 READ requests:
> 
> This should be
> 
> "svcrdma needs 259 pages for 1MB NFSv4.0 WRITE requests:"

Oh, got it.

--b.


* Re: [PATCH v1] sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
From: Chuck Lever @ 2017-06-30 21:23 UTC
  To: J. Bruce Fields; +Cc: Linux NFS Mailing List


> On Jun 30, 2017, at 4:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Fri, Jun 30, 2017 at 04:33:23PM -0400, Chuck Lever wrote:
>> 
>>> On Jun 30, 2017, at 4:16 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> 
>>> I like that better.
>>> 
>>> I'm still pretty confused.
>>> 
>>> The long comment above RPCSVC_MAXPAGES convinces me that it's
>>> calculating the maximum number of pages a TCP request might need.  But
>>> apparently rq_pages has never been large enough to hold that many pages.
>>> Has nobody ever tried an unaligned 1MB read?
>> 
>> I'm not sure how the TCP transport works. If it tries to fill the
>> head iovec completely, then goes into the page list, then it might
>> not ever need the extra page on the end for the GETATTR.
> 
> I think an unaligned read of N pages ends up with rq_pages like:
> 
> 	1st page: read request
> 	2nd page: head+tail of reply
> 	3rd page: partial first page of read data
> 	N-1 pages: read data
> 	last page: partial last page of read data
> 
> So, N+3 pages.

OK. Again, I'm not aware of a workload that specifically asks
for an unaligned 1MB READ. Payloads of that size are generally
page-aligned at least, and probably aligned to 1MB, because
they are usually reads into a client's page cache.

For an NFSv4.0 WRITE request on NFS/RDMA we need the same
number of pages, but for different reasons:

	1st page: transport header and head of Call
	N pages: write data
	N+1 page: trailing NFSv4 operations and MIC
	last page: reply

So, N+3 pages.

I would argue we probably want to accommodate an unaligned 1MB
write here eventually, by adding one more page. But I don't
know of a workload that would generate one today.


>> For NFS/RDMA, the transport currently aligns the incoming data so
>> it always starts at byte 0 of the first page in the xdr_buf's page
>> list. That's going to have to change if we want direct placement
>> of the write payload, and at that point an unaligned payload will
>> be possible.
>> 
>> 
>>> We could use comments each time elsewhere when we add another random 1
>>> or 2 somewhere.  What does sv_max_mesg really mean, why are we adding 2
>>> to it in svc_alloc_arg, etc.
>> 
>> I'm open to suggestions. A comment in svc_alloc_arg might just
>> refer weary readers to the comment in front of the definition of
>> RPCSVC_MAXPAGES.
> 
> There are 3 extra pages in rq_pages beyond what's needed to hold a max
> IO size buffer, and I'd like the comment to explain *which* two of
> those three pages we're accounting for here.  I could figure that out
> if I knew what sv_max_mesg meant, but the comment there just says 1
> page is added for "overhead", which isn't helpful.

So here's how sv_max_mesg is set:

serv->sv_max_mesg  = roundup(serv->sv_max_payload + PAGE_SIZE, PAGE_SIZE);

It's the max payload size (1MB) plus a page. Let's assume this
extra page accounts for a non-page-aligned payload.

In svc_alloc_arg, the "pages" calculation was adding just another
page, which gives us only N+2 pages. It should be making the same
calculation as the one described in front of RPCSVC_MAXPAGES to
give us N+3.
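
Worked out with 4KB pages, as a sanity check on the numbers:

	sv_max_mesg = roundup(1048576 + 4096, 4096) = 1052672  (257 pages)
	old: (1052672 + 4096) / 4096    = 258  (N+2)
	new: (1052672 + 2 * 4096) >> 12 = 259  (N+3, == RPCSVC_MAXPAGES)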

At a guess, the two additional pages in svc_alloc_arg are for
the request, and for the head+tail of the reply.

The problem is, the pages are accounted for differently for an
NFS READ or an NFS WRITE (see above). So this kind of comment
doesn't make sense to me in a generic function like svc_alloc_arg.

I would stick to saying "this allows the allocation of up to
RPCSVC_MAXPAGES pages" and leave it to other areas of the code
to decide how to use them.
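
Something like this, say (one possible wording, not text from the
tree):

	/* Allocate up to RPCSVC_MAXPAGES pages; how they are divided
	 * between Call, payload, and Reply is up to the transport
	 * and the upper layer.
	 */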


--
Chuck Lever

* Re: [PATCH v1] sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
From: Chuck Lever @ 2017-07-05 14:36 UTC
  To: J. Bruce Fields; +Cc: Linux NFS Mailing List

Hi Bruce-


I would like to get this resolved quickly so that my nfsd-rdma-for-4.13
series can be included in v4.13.

Is the v1 patch acceptable, or do we need a comment here? If a comment
is necessary, please provide the text you want to see.


>>> Let me know if there's anything else you want to see changed, or
>>> if this patch is now acceptable.
>>> 
>>> 
>>>> Might make more sense if RPCSVC_MAXPAYLOAD was explicitly calculated as
>>>> a max over the compiled-in transports, too.
>>> 
>>> IMO sv_max_mesg could be expressed in pages, instead of bytes, so
>>> that the computation in svc_alloc_arg isn't repeated for every RPC.
>> 
>> Makes sense to me.
>> 
>>> 
>>> 
>>>> I feel like we've actually been through this before, but somehow I'm
>>>> still confused every time I look at it.
>>>> 
>>>> --b.
>>>> 
>>>> On Fri, Jun 30, 2017 at 12:03:54PM -0400, Chuck Lever wrote:
>>>>> svcrdma needs 259 pages for 1MB NFSv4.0 READ requests:
>>> 
>>> This should be
>>> 
>>> "svcrdma needs 259 pages for 1MB NFSv4.0 WRITE requests:"
>> 
>> Oh, got it.
>> 
>> --b.
>> 
>>> 
>>> 
>>>>> - 1 page for the transport header and head iovec
>>>>> - 256 pages for the data payload
>>>>> - 1 page for the trailing GETATTR request (since NFSD XDR decoding
>>>>> does not look for a tail iovec, the GETATTR is stuck at the end
>>>>> of the rqstp->rq_arg.pages list)
>>>>> - 1 page for building the reply xdr_buf
>>>>> 
>>>>> But RPCSVC_MAXPAGES is already 259 (on x86_64). The problem is that
>>>>> svc_alloc_arg never allocates that many pages. To address this:
>>>>> 
>>>>> 1. The final element of rq_pages always points to NULL. To
>>>>> accommodate up to 259 pages in rq_pages, add an extra element
>>>>> to rq_pages for the array termination sentinel.
>>>>> 
>>>>> 2. Adjust the calculation of "pages" to match how RPCSVC_MAXPAGES
>>>>> is calculated, so it can go up to 259. Bruce noted that the
>>>>> calculation assumes sv_max_mesg is a multiple of PAGE_SIZE,
>>>>> which might not always be true. I didn't change this assumption.
>>>>> 
>>>>> 3. Change the loop boundaries to allow 259 pages to be allocated.
>>>>> 
>>>>> Additional clean-up: WARN_ON_ONCE adds an extra conditional branch,
>>>>> which is basically never taken. And there's no need to dump the
>>>>> stack here because svc_alloc_arg has only one caller.
>>>>> 
>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>> ---
>>>>> Hi Bruce-
>>>>> 
>>>>> This is what I proposed yesterday evening. Instead of changing
>>>>> the value of RPCSVC_MAXPAGES, ensure that svc_alloc_arg can
>>>>> safely allocate that many pages.
>>>>> 
>>>>> 
>>>>> include/linux/sunrpc/svc.h |    2 +-
>>>>> net/sunrpc/svc_xprt.c      |   10 ++++++----
>>>>> 2 files changed, 7 insertions(+), 5 deletions(-)
>>>>> 
>>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>>> index 11cef5a..d741399 100644
>>>>> --- a/include/linux/sunrpc/svc.h
>>>>> +++ b/include/linux/sunrpc/svc.h
>>>>> @@ -246,7 +246,7 @@ struct svc_rqst {
>>>>> 	size_t			rq_xprt_hlen;	/* xprt header len */
>>>>> 	struct xdr_buf		rq_arg;
>>>>> 	struct xdr_buf		rq_res;
>>>>> -	struct page *		rq_pages[RPCSVC_MAXPAGES];
>>>>> +	struct page		*rq_pages[RPCSVC_MAXPAGES + 1];
>>>>> 	struct page *		*rq_respages;	/* points into rq_pages */
>>>>> 	struct page *		*rq_next_page; /* next reply page to use */
>>>>> 	struct page *		*rq_page_end;  /* one past the last page */
>>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>>>> index 7bfe1fb..d16a8b4 100644
>>>>> --- a/net/sunrpc/svc_xprt.c
>>>>> +++ b/net/sunrpc/svc_xprt.c
>>>>> @@ -659,11 +659,13 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>>>> 	int i;
>>>>> 
>>>>> 	/* now allocate needed pages.  If we get a failure, sleep briefly */
>>>>> -	pages = (serv->sv_max_mesg + PAGE_SIZE) / PAGE_SIZE;
>>>>> -	WARN_ON_ONCE(pages >= RPCSVC_MAXPAGES);
>>>>> -	if (pages >= RPCSVC_MAXPAGES)
>>>>> +	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>>>>> +	if (pages > RPCSVC_MAXPAGES) {
>>>>> +		pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
>>>>> +			     pages, RPCSVC_MAXPAGES);
>>>>> 		/* use as many pages as possible */
>>>>> -		pages = RPCSVC_MAXPAGES - 1;
>>>>> +		pages = RPCSVC_MAXPAGES;
>>>>> +	}
>>>>> 	for (i = 0; i < pages ; i++)
>>>>> 		while (rqstp->rq_pages[i] == NULL) {
>>>>> 			struct page *p = alloc_page(GFP_KERNEL);
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> 
>>> --
>>> Chuck Lever
>>> 
>>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Chuck Lever
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1] sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
From: J. Bruce Fields @ 2017-07-06 14:02 UTC
  To: Chuck Lever; +Cc: Linux NFS Mailing List

On Wed, Jul 05, 2017 at 10:36:17AM -0400, Chuck Lever wrote:
> I would like to get this resolved quickly so that my nfsd-rdma-for-4.13
> series can be included in v4.13.

It will be in 4.13.

> Is the v1 patch acceptable, or do we need a comment here? If a comment
> is necessary, please provide the text you want to see.

It's fine, we can do more later.

The holdup is I'm still confused about the non-RDMA case: it looks to me
like we're not reserving quite enough for an unaligned maximum-length
read, but a pynfs test doesn't seem to be hitting any problem, so I
think I'm misunderstanding that case....

--b.


* Re: [PATCH v1] sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
From: J. Bruce Fields @ 2017-07-11 20:50 UTC
  To: Chuck Lever; +Cc: Linux NFS Mailing List

On Thu, Jul 06, 2017 at 10:02:16AM -0400, J. Bruce Fields wrote:
> On Wed, Jul 05, 2017 at 10:36:17AM -0400, Chuck Lever wrote:
> > I would like to get this resolved quickly so that my nfsd-rdma-for-4.13
> > series can be included in v4.13.
> 
> It will be in 4.13.
> 
> > Is the v1 patch acceptable, or do we need a comment here? If a comment
> > is necessary, please provide the text you want to see.
> 
> It's fine, we can do more later.
> 
> The holdup is I'm still confused about the non-RDMA case: it looks to me
> like we're not reserving quite enough for an unaligned maximum-length
> read, but a pynfs test doesn't seem to be hitting any problem, so I
> think I'm misunderstanding that case....

OK, I see my confusion: that final NULL that svc_alloc_arg sets isn't a
"sentinel NULL" exactly.  There's no code that assumes that rq_pages is
NULL-terminated.

We initialize that final entry to NULL just so that nfsd_splice_actor
doesn't call put_page() on an uninitialized value.  That final entry is
*only* needed in the splice case, to handle maximum-length
non-page-aligned zero-copy reads.  And in the splice case we don't need
an actual allocated page, we're just going to put_page() whatever's
there to replace with an entry from the page cache....
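
The replacement step looks roughly like this (paraphrased, not the
verbatim nfsd_splice_actor body; "page" here is the page-cache page
handed to us by splice):

	get_page(page);
	if (*rqstp->rq_next_page)
		put_page(*rqstp->rq_next_page); /* drop whatever was there */
	*(rqstp->rq_next_page++) = page;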

So, in the TCP case, to handle a maximum-length read request, assuming
4k pages, we need 2 pages for the request and the head and tail of the
reply, plus either 256 allocated pages to copy data into (the readv
case), or 257 array entries, not necessarily allocated, to store
page-cache references (the splice case).
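
Tabulated:

	readv case:  1 + 1 + 256 allocated pages = 258 pages
	splice case: 1 + 1 + 257 array entries   = 259 entries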

Hence a 259-entry array with only 258 allocated entries and one NULL.

--b.
