* [PATCH] Fix regression in NFSRDMA server
@ 2014-03-25 20:14 Steve Wise
  2014-03-28  2:08 ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Steve Wise @ 2014-03-25 20:14 UTC (permalink / raw)
  To: trond.myklebust, bfields; +Cc: linux-nfs, tom

From: Tom Tucker <tom@ogc.us>

The server regression was caused by the addition of rq_next_page
(afc59400d6c65bad66d4ad0b2daf879cbff8e23e). A few places that update
the rq_respages array were missed and left rq_next_page stale.

Signed-off-by: Tom Tucker <tom@ogc.us>
Tested-by: Steve Wise <swise@ogc.us>
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   12 ++++--------
 net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    1 +
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0ce7552..8d904e4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
 		sge_no++;
 	}
 	rqstp->rq_respages = &rqstp->rq_pages[sge_no];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
 
 	/* We should never run out of SGE because the limit is defined to
 	 * support the max allowed RPC data length
@@ -169,6 +170,7 @@ static int map_read_chunks(struct svcxprt_rdma *xprt,
 		 */
 		head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
 		rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
+		rqstp->rq_next_page = rqstp->rq_respages + 1;
 
 		byte_count -= sge_bytes;
 		ch_bytes -= sge_bytes;
@@ -276,6 +278,7 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
 
 	/* rq_respages points one past arg pages */
 	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
 
 	/* Create the reply and chunk maps */
 	offset = 0;
@@ -520,13 +523,6 @@ next_sge:
 	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
 		rqstp->rq_pages[ch_no] = NULL;
 
-	/*
-	 * Detach res pages. If svc_release sees any it will attempt to
-	 * put them.
-	 */
-	while (rqstp->rq_next_page != rqstp->rq_respages)
-		*(--rqstp->rq_next_page) = NULL;
-
 	return err;
 }
 
@@ -550,7 +546,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
 
 	/* rq_respages starts after the last arg page */
 	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
-	rqstp->rq_next_page = &rqstp->rq_arg.pages[page_no];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
 
 	/* Rebuild rq_arg head and tail. */
 	rqstp->rq_arg.head[0] = head->arg.head[0];
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index c1d124d..11e90f8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -625,6 +625,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
 		if (page_no+1 >= sge_no)
 			ctxt->sge[page_no+1].length = 0;
 	}
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
 	BUG_ON(sge_no > rdma->sc_max_sge);
 	memset(&send_wr, 0, sizeof send_wr);
 	ctxt->wr_op = IB_WR_SEND;
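
Every hunk above restores the same invariant between the two
response-page cursors in struct svc_rqst: rq_respages points at the
first page available for the reply, and rq_next_page points one past
the last reply page in use.  A minimal sketch of the pattern each hunk
applies (illustrative, not a verbatim excerpt from the tree):

	/*
	 * Whenever the transport moves rq_respages it must also reset
	 * rq_next_page, because the generic cleanup walks from
	 * rq_next_page back down to rq_respages and puts every page in
	 * between.
	 */
	rqstp->rq_respages = &rqstp->rq_pages[sge_no];	/* first reply page */
	rqstp->rq_next_page = rqstp->rq_respages + 1;	/* one page in use */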



* Re: [PATCH] Fix regression in NFSRDMA server
  2014-03-25 20:14 [PATCH] Fix regression in NFSRDMA server Steve Wise
@ 2014-03-28  2:08 ` J. Bruce Fields
  2014-03-28 13:38   ` Steve Wise
  2014-03-28 15:21   ` Tom Tucker
  0 siblings, 2 replies; 8+ messages in thread
From: J. Bruce Fields @ 2014-03-28  2:08 UTC (permalink / raw)
  To: Steve Wise; +Cc: trond.myklebust, linux-nfs, tom

On Tue, Mar 25, 2014 at 03:14:57PM -0500, Steve Wise wrote:
> From: Tom Tucker <tom@ogc.us>
> 
> The server regression was caused by the addition of rq_next_page
> (afc59400d6c65bad66d4ad0b2daf879cbff8e23e). A few places that update
> the rq_respages array were missed and left rq_next_page stale.

Apologies.  (But, it could happen again--could we set up some regular
testing?  It doesn't have to be anything fancy, just cthon over
rdma--really, just read and write over rdma--would probably catch a
lot.)

Also: I don't get why all these rq_next_page initializations are
required.  Why isn't the initialization at the top of svc_process()
enough?  Is rdma using it before we get to that point?  The only use of
it I see off hand is in the while loop that you're deleting.
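
For reference, the initialization meant here sits at the top of
svc_process() in net/sunrpc/svc.c and, in kernels of this era, looks
roughly like the following.  Reproduced from memory, so treat it as a
sketch rather than an authoritative excerpt:

	int svc_process(struct svc_rqst *rqstp)
	{
		/* ... */
		/*
		 * Setup response xdr_buf.
		 * Initially it has just one page.
		 */
		rqstp->rq_next_page = &rqstp->rq_respages[1];
		/* ... */
	}

Any request that is dropped or completed without reaching this point
keeps whatever value the transport last left in rq_next_page.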

--b.

> 
> Signed-off-by: Tom Tucker <tom@ogc.us>
> Tested-by: Steve Wise <swise@ogc.us>
> ---
> 
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   12 ++++--------
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    1 +
>  2 files changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 0ce7552..8d904e4 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>  		sge_no++;
>  	}
>  	rqstp->rq_respages = &rqstp->rq_pages[sge_no];
> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>  
>  	/* We should never run out of SGE because the limit is defined to
>  	 * support the max allowed RPC data length
> @@ -169,6 +170,7 @@ static int map_read_chunks(struct svcxprt_rdma *xprt,
>  		 */
>  		head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
>  		rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
> +		rqstp->rq_next_page = rqstp->rq_respages + 1;
>  
>  		byte_count -= sge_bytes;
>  		ch_bytes -= sge_bytes;
> @@ -276,6 +278,7 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
>  
>  	/* rq_respages points one past arg pages */
>  	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>  
>  	/* Create the reply and chunk maps */
>  	offset = 0;
> @@ -520,13 +523,6 @@ next_sge:
>  	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
>  		rqstp->rq_pages[ch_no] = NULL;
>  
> -	/*
> -	 * Detach res pages. If svc_release sees any it will attempt to
> -	 * put them.
> -	 */
> -	while (rqstp->rq_next_page != rqstp->rq_respages)
> -		*(--rqstp->rq_next_page) = NULL;
> -
>  	return err;
>  }
>  
> @@ -550,7 +546,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
>  
>  	/* rq_respages starts after the last arg page */
>  	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> -	rqstp->rq_next_page = &rqstp->rq_arg.pages[page_no];
> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>  
>  	/* Rebuild rq_arg head and tail. */
>  	rqstp->rq_arg.head[0] = head->arg.head[0];
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index c1d124d..11e90f8 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -625,6 +625,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>  		if (page_no+1 >= sge_no)
>  			ctxt->sge[page_no+1].length = 0;
>  	}
> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>  	BUG_ON(sge_no > rdma->sc_max_sge);
>  	memset(&send_wr, 0, sizeof send_wr);
>  	ctxt->wr_op = IB_WR_SEND;
> 


* RE: [PATCH] Fix regression in NFSRDMA server
  2014-03-28  2:08 ` J. Bruce Fields
@ 2014-03-28 13:38   ` Steve Wise
  2014-03-28 15:21   ` Tom Tucker
  1 sibling, 0 replies; 8+ messages in thread
From: Steve Wise @ 2014-03-28 13:38 UTC (permalink / raw)
  To: 'J. Bruce Fields'; +Cc: trond.myklebust, linux-nfs, tom

> >
> > The server regression was caused by the addition of rq_next_page
> > (afc59400d6c65bad66d4ad0b2daf879cbff8e23e). A few places that update
> > the rq_respages array were missed and left rq_next_page stale.
> 
> Apologies.  (But, it could happen again--could we set up some regular
> testing?  It doesn't have to be anything fancy, just cthon over
> rdma--really, just read and write over rdma--would probably catch a
> lot.)
> 

Chelsio is adding NFSRDMA testing to their QA flow over their iWARP devices.





* Re: [PATCH] Fix regression in NFSRDMA server
  2014-03-28  2:08 ` J. Bruce Fields
  2014-03-28 13:38   ` Steve Wise
@ 2014-03-28 15:21   ` Tom Tucker
  2014-03-28 21:26     ` J. Bruce Fields
  1 sibling, 1 reply; 8+ messages in thread
From: Tom Tucker @ 2014-03-28 15:21 UTC (permalink / raw)
  To: J. Bruce Fields, Steve Wise; +Cc: trond.myklebust, linux-nfs

Hi Bruce,

On 3/27/14 9:08 PM, J. Bruce Fields wrote:
> On Tue, Mar 25, 2014 at 03:14:57PM -0500, Steve Wise wrote:
>> From: Tom Tucker <tom@ogc.us>
>>
>> The server regression was caused by the addition of rq_next_page
>> (afc59400d6c65bad66d4ad0b2daf879cbff8e23e). A few places that update
>> the rq_respages array were missed and left rq_next_page stale.
> Apologies.  (But, it could happen again--could we set up some regular
> testing?  It doesn't have to be anything fancy, just cthon over
> rdma--really, just read and write over rdma--would probably catch a
> lot.)

I think Chelsio is going to be adding some NFSRDMA regression testing to 
their system test.

> Also: I don't get why all these rq_next_page initializations are
> required.  Why isn't the initialization at the top of svc_process()
> enough?  Is rdma using it before we get to that point?  The only use of
> it I see off hand is in the while loop that you're deleting.

I didn't apply tremendous deductive powers here, I just added updates to 
rq_next_page wherever the transport messed with rq_respages. That said, 
NFS WRITE is likely the culprit since the write is completed as a deferral 
and therefore the request doesn't go through svc_process, so if 
rq_next_page is bogus, the cleanup will free/re-use pages that are 
actually in use by the transport.
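
The cleanup in question is the generic release path: when the server is
done with an rqstp, svc_xprt_release() frees the response pages with a
loop along these lines (a sketch of svc_free_res_pages() from
include/linux/sunrpc/svc.h, reproduced from memory):

	static inline void svc_free_res_pages(struct svc_rqst *rqstp)
	{
		/*
		 * Walk back from one-past-the-last reply page, dropping
		 * each page reference.  If rq_next_page is stale, this
		 * puts pages the transport still owns.
		 */
		while (rqstp->rq_next_page != rqstp->rq_respages) {
			struct page **pp = --rqstp->rq_next_page;
			if (*pp) {
				put_page(*pp);
				*pp = NULL;
			}
		}
	}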

Tom
> --b.
>
>> Signed-off-by: Tom Tucker <tom@ogc.us>
>> Tested-by: Steve Wise <swise@ogc.us>
>> ---
>>
>>   net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   12 ++++--------
>>   net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    1 +
>>   2 files changed, 5 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index 0ce7552..8d904e4 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>>   		sge_no++;
>>   	}
>>   	rqstp->rq_respages = &rqstp->rq_pages[sge_no];
>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>>   
>>   	/* We should never run out of SGE because the limit is defined to
>>   	 * support the max allowed RPC data length
>> @@ -169,6 +170,7 @@ static int map_read_chunks(struct svcxprt_rdma *xprt,
>>   		 */
>>   		head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
>>   		rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
>> +		rqstp->rq_next_page = rqstp->rq_respages + 1;
>>   
>>   		byte_count -= sge_bytes;
>>   		ch_bytes -= sge_bytes;
>> @@ -276,6 +278,7 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
>>   
>>   	/* rq_respages points one past arg pages */
>>   	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>>   
>>   	/* Create the reply and chunk maps */
>>   	offset = 0;
>> @@ -520,13 +523,6 @@ next_sge:
>>   	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
>>   		rqstp->rq_pages[ch_no] = NULL;
>>   
>> -	/*
>> -	 * Detach res pages. If svc_release sees any it will attempt to
>> -	 * put them.
>> -	 */
>> -	while (rqstp->rq_next_page != rqstp->rq_respages)
>> -		*(--rqstp->rq_next_page) = NULL;
>> -
>>   	return err;
>>   }
>>   
>> @@ -550,7 +546,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
>>   
>>   	/* rq_respages starts after the last arg page */
>>   	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
>> -	rqstp->rq_next_page = &rqstp->rq_arg.pages[page_no];
>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>>   
>>   	/* Rebuild rq_arg head and tail. */
>>   	rqstp->rq_arg.head[0] = head->arg.head[0];
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> index c1d124d..11e90f8 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> @@ -625,6 +625,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>>   		if (page_no+1 >= sge_no)
>>   			ctxt->sge[page_no+1].length = 0;
>>   	}
>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>>   	BUG_ON(sge_no > rdma->sc_max_sge);
>>   	memset(&send_wr, 0, sizeof send_wr);
>>   	ctxt->wr_op = IB_WR_SEND;
>>



* Re: [PATCH] Fix regression in NFSRDMA server
  2014-03-28 15:21   ` Tom Tucker
@ 2014-03-28 21:26     ` J. Bruce Fields
  2014-03-28 21:31       ` Steve Wise
  2014-03-29  0:11       ` Tom Tucker
  0 siblings, 2 replies; 8+ messages in thread
From: J. Bruce Fields @ 2014-03-28 21:26 UTC (permalink / raw)
  To: Tom Tucker; +Cc: Steve Wise, trond.myklebust, linux-nfs

On Fri, Mar 28, 2014 at 10:21:27AM -0500, Tom Tucker wrote:
> Hi Bruce,
> 
> On 3/27/14 9:08 PM, J. Bruce Fields wrote:
> >On Tue, Mar 25, 2014 at 03:14:57PM -0500, Steve Wise wrote:
> >>From: Tom Tucker <tom@ogc.us>
> >>
> >>The server regression was caused by the addition of rq_next_page
> >>(afc59400d6c65bad66d4ad0b2daf879cbff8e23e). A few places that update
> >>the rq_respages array were missed and left rq_next_page stale.
> >Apologies.  (But, it could happen again--could we set up some regular
> >testing?  It doesn't have to be anything fancy, just cthon over
> >rdma--really, just read and write over rdma--would probably catch a
> >lot.)
> 
> I think Chelsio is going to be adding some NFSRDMA regression
> testing to their system test.

OK.  Do you know who there is setting that up?  I'd be curious exactly
what kernels they intend to test and how they plan to report results.

> >Also: I don't get why all these rq_next_page initializations are
> >required.  Why isn't the initialization at the top of svc_process()
> >enough?  Is rdma using it before we get to that point?  The only use of
> >it I see off hand is in the while loop that you're deleting.
> 
> I didn't apply tremendous deductive powers here, I just added
> updates to rq_next_page wherever the transport messed with
> rq_respages. That said, NFS WRITE is likely the culprit since the
> write is completed as a deferral and therefore the request doesn't
> go through svc_process, so if rq_next_page is bogus, the cleanup
> will free/re-use pages that are actually in use by the transport.

Ugh, OK, without tracing through the code I guess I can see how that
would happen.  Remind me why it's using deferrals?

Applying the patch.

--b.

> 
> Tom
> >--b.
> >
> >>Signed-off-by: Tom Tucker <tom@ogc.us>
> >>Tested-by: Steve Wise <swise@ogc.us>
> >>---
> >>
> >>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   12 ++++--------
> >>  net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    1 +
> >>  2 files changed, 5 insertions(+), 8 deletions(-)
> >>
> >>diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>index 0ce7552..8d904e4 100644
> >>--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>@@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
> >>  		sge_no++;
> >>  	}
> >>  	rqstp->rq_respages = &rqstp->rq_pages[sge_no];
> >>+	rqstp->rq_next_page = rqstp->rq_respages + 1;
> >>  	/* We should never run out of SGE because the limit is defined to
> >>  	 * support the max allowed RPC data length
> >>@@ -169,6 +170,7 @@ static int map_read_chunks(struct svcxprt_rdma *xprt,
> >>  		 */
> >>  		head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
> >>  		rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
> >>+		rqstp->rq_next_page = rqstp->rq_respages + 1;
> >>  		byte_count -= sge_bytes;
> >>  		ch_bytes -= sge_bytes;
> >>@@ -276,6 +278,7 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
> >>  	/* rq_respages points one past arg pages */
> >>  	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> >>+	rqstp->rq_next_page = rqstp->rq_respages + 1;
> >>  	/* Create the reply and chunk maps */
> >>  	offset = 0;
> >>@@ -520,13 +523,6 @@ next_sge:
> >>  	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
> >>  		rqstp->rq_pages[ch_no] = NULL;
> >>-	/*
> >>-	 * Detach res pages. If svc_release sees any it will attempt to
> >>-	 * put them.
> >>-	 */
> >>-	while (rqstp->rq_next_page != rqstp->rq_respages)
> >>-		*(--rqstp->rq_next_page) = NULL;
> >>-
> >>  	return err;
> >>  }
> >>@@ -550,7 +546,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
> >>  	/* rq_respages starts after the last arg page */
> >>  	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> >>-	rqstp->rq_next_page = &rqstp->rq_arg.pages[page_no];
> >>+	rqstp->rq_next_page = rqstp->rq_respages + 1;
> >>  	/* Rebuild rq_arg head and tail. */
> >>  	rqstp->rq_arg.head[0] = head->arg.head[0];
> >>diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>index c1d124d..11e90f8 100644
> >>--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>@@ -625,6 +625,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
> >>  		if (page_no+1 >= sge_no)
> >>  			ctxt->sge[page_no+1].length = 0;
> >>  	}
> >>+	rqstp->rq_next_page = rqstp->rq_respages + 1;
> >>  	BUG_ON(sge_no > rdma->sc_max_sge);
> >>  	memset(&send_wr, 0, sizeof send_wr);
> >>  	ctxt->wr_op = IB_WR_SEND;
> >>
> 


* RE: [PATCH] Fix regression in NFSRDMA server
  2014-03-28 21:26     ` J. Bruce Fields
@ 2014-03-28 21:31       ` Steve Wise
  2014-03-29  0:11       ` Tom Tucker
  1 sibling, 0 replies; 8+ messages in thread
From: Steve Wise @ 2014-03-28 21:31 UTC (permalink / raw)
  To: 'J. Bruce Fields', 'Tom Tucker'
  Cc: trond.myklebust, linux-nfs, 'Indranil Choudhury'

+Indranil

Indranil Choudhury is the QA contact.  

Steve
> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Friday, March 28, 2014 4:27 PM
> To: Tom Tucker
> Cc: Steve Wise; trond.myklebust@primarydata.com; linux-nfs@vger.kernel.org
> Subject: Re: [PATCH] Fix regression in NFSRDMA server
> 
> On Fri, Mar 28, 2014 at 10:21:27AM -0500, Tom Tucker wrote:
> > Hi Bruce,
> >
> > On 3/27/14 9:08 PM, J. Bruce Fields wrote:
> > >On Tue, Mar 25, 2014 at 03:14:57PM -0500, Steve Wise wrote:
> > >>From: Tom Tucker <tom@ogc.us>
> > >>
> > >>The server regression was caused by the addition of rq_next_page
> > >>(afc59400d6c65bad66d4ad0b2daf879cbff8e23e). A few places that update
> > >>the rq_respages array were missed and left rq_next_page stale.
> > >Apologies.  (But, it could happen again--could we set up some regular
> > >testing?  It doesn't have to be anything fancy, just cthon over
> > >rdma--really, just read and write over rdma--would probably catch a
> > >lot.)
> >
> > I think Chelsio is going to be adding some NFSRDMA regression
> > testing to their system test.
> 
> OK.  Do you know who there is setting that up?  I'd be curious exactly
> what kernels they intend to test and how they plan to report results.
> 
> > >Also: I don't get why all these rq_next_page initializations are
> > >required.  Why isn't the initialization at the top of svc_process()
> > >enough?  Is rdma using it before we get to that point?  The only use of
> > >it I see off hand is in the while loop that you're deleting.
> >
> > I didn't apply tremendous deductive powers here, I just added
> > updates to rq_next_page wherever the transport messed with
> > rq_respages. That said, NFS WRITE is likely the culprit since the
> > write is completed as a deferral and therefore the request doesn't
> > go through svc_process, so if rq_next_page is bogus, the cleanup
> > will free/re-use pages that are actually in use by the transport.
> 
> Ugh, OK, without tracing through the code I guess I can see how that
> would happen.  Remind me why it's using deferrals?
> 
> Applying the patch.
> 
> --b.
> 
> >
> > Tom
> > >--b.
> > >
> > >>Signed-off-by: Tom Tucker <tom@ogc.us>
> > >>Tested-by: Steve Wise <swise@ogc.us>
> > >>---
> > >>
> > >>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   12 ++++--------
> > >>  net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    1 +
> > >>  2 files changed, 5 insertions(+), 8 deletions(-)
> > >>
> > >>diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > >>index 0ce7552..8d904e4 100644
> > >>--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > >>+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > >>@@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
> > >>  		sge_no++;
> > >>  	}
> > >>  	rqstp->rq_respages = &rqstp->rq_pages[sge_no];
> > >>+	rqstp->rq_next_page = rqstp->rq_respages + 1;
> > >>  	/* We should never run out of SGE because the limit is defined to
> > >>  	 * support the max allowed RPC data length
> > >>@@ -169,6 +170,7 @@ static int map_read_chunks(struct svcxprt_rdma *xprt,
> > >>  		 */
> > >>  		head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
> > >>  		rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
> > >>+		rqstp->rq_next_page = rqstp->rq_respages + 1;
> > >>  		byte_count -= sge_bytes;
> > >>  		ch_bytes -= sge_bytes;
> > >>@@ -276,6 +278,7 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
> > >>  	/* rq_respages points one past arg pages */
> > >>  	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> > >>+	rqstp->rq_next_page = rqstp->rq_respages + 1;
> > >>  	/* Create the reply and chunk maps */
> > >>  	offset = 0;
> > >>@@ -520,13 +523,6 @@ next_sge:
> > >>  	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
> > >>  		rqstp->rq_pages[ch_no] = NULL;
> > >>-	/*
> > >>-	 * Detach res pages. If svc_release sees any it will attempt to
> > >>-	 * put them.
> > >>-	 */
> > >>-	while (rqstp->rq_next_page != rqstp->rq_respages)
> > >>-		*(--rqstp->rq_next_page) = NULL;
> > >>-
> > >>  	return err;
> > >>  }
> > >>@@ -550,7 +546,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
> > >>  	/* rq_respages starts after the last arg page */
> > >>  	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> > >>-	rqstp->rq_next_page = &rqstp->rq_arg.pages[page_no];
> > >>+	rqstp->rq_next_page = rqstp->rq_respages + 1;
> > >>  	/* Rebuild rq_arg head and tail. */
> > >>  	rqstp->rq_arg.head[0] = head->arg.head[0];
> > >>diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > >>index c1d124d..11e90f8 100644
> > >>--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > >>+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > >>@@ -625,6 +625,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
> > >>  		if (page_no+1 >= sge_no)
> > >>  			ctxt->sge[page_no+1].length = 0;
> > >>  	}
> > >>+	rqstp->rq_next_page = rqstp->rq_respages + 1;
> > >>  	BUG_ON(sge_no > rdma->sc_max_sge);
> > >>  	memset(&send_wr, 0, sizeof send_wr);
> > >>  	ctxt->wr_op = IB_WR_SEND;
> > >>
> >



* Re: [PATCH] Fix regression in NFSRDMA server
  2014-03-28 21:26     ` J. Bruce Fields
  2014-03-28 21:31       ` Steve Wise
@ 2014-03-29  0:11       ` Tom Tucker
  2014-03-29  0:51         ` J. Bruce Fields
  1 sibling, 1 reply; 8+ messages in thread
From: Tom Tucker @ 2014-03-29  0:11 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Steve Wise, trond.myklebust, linux-nfs

Hi Bruce,

On 3/28/14 4:26 PM, J. Bruce Fields wrote:
> On Fri, Mar 28, 2014 at 10:21:27AM -0500, Tom Tucker wrote:
>> Hi Bruce,
>>
>> On 3/27/14 9:08 PM, J. Bruce Fields wrote:
>>> On Tue, Mar 25, 2014 at 03:14:57PM -0500, Steve Wise wrote:
>>>> From: Tom Tucker <tom@ogc.us>
>>>>
>>>> The server regression was caused by the addition of rq_next_page
>>>> (afc59400d6c65bad66d4ad0b2daf879cbff8e23e). A few places that update
>>>> the rq_respages array were missed and left rq_next_page stale.
>>> Apologies.  (But, it could happen again--could we set up some regular
>>> testing?  It doesn't have to be anything fancy, just cthon over
>>> rdma--really, just read and write over rdma--would probably catch a
>>> lot.)
>> I think Chelsio is going to be adding some NFSRDMA regression
>> testing to their system test.
> OK.  Do you know who there is setting that up?  I'd be curious exactly
> what kernels they intend to test and how they plan to report results.
>

I don't know, Steve can weigh in on this...

>>> Also: I don't get why all these rq_next_page initializations are
>>> required.  Why isn't the initialization at the top of svc_process()
>>> enough?  Is rdma using it before we get to that point?  The only use of
>>> it I see off hand is in the while loop that you're deleting.
>> I didn't apply tremendous deductive powers here, I just added
>> updates to rq_next_page wherever the transport messed with
>> rq_respages. That said, NFS WRITE is likely the culprit since the
>> write is completed as a deferral and therefore the request doesn't
>> go through svc_process, so if rq_next_page is bogus, the cleanup
>> will free/re-use pages that are actually in use by the transport.
> Ugh, OK, without tracing through the code I guess I can see how that
> would happen.  Remind me why it's using deferrals?

The server fetches the write data from the client using RDMA READ. So 
the request says ... "here's where the data is in my memory", and then 
the server issues an RDMA READ to fetch it. When the read completes, the 
deferred request is completed.
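
Schematically, the deferral path looks like this; a simplified sketch
of the svc_recv()/svc_rdma_recvfrom() interaction, not a verbatim
excerpt:

	len = svc_rdma_recvfrom(rqstp);	/* xpo_recvfrom, from svc_recv() */
	if (len == 0) {
		/*
		 * Read chunks were present: RDMA READs have been posted
		 * and the request parked until they complete.  svc_recv()
		 * treats a zero length as "no request yet" and releases
		 * the rqstp, whose page cleanup is driven by rq_next_page,
		 * without ever reaching the re-initialization at the top
		 * of svc_process().  A stale rq_next_page here frees
		 * pages the posted RDMA READs still target.
		 */
	}

When the READs complete, rdma_read_complete() (patched above) rebuilds
rq_arg and the request is requeued and processed normally.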

Tom

> Applying the patch.
>
> --b.
>
>> Tom
>>> --b.
>>>
>>>> Signed-off-by: Tom Tucker <tom@ogc.us>
>>>> Tested-by: Steve Wise <swise@ogc.us>
>>>> ---
>>>>
>>>>   net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   12 ++++--------
>>>>   net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    1 +
>>>>   2 files changed, 5 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>>> index 0ce7552..8d904e4 100644
>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>>> @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>>>>   		sge_no++;
>>>>   	}
>>>>   	rqstp->rq_respages = &rqstp->rq_pages[sge_no];
>>>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>>>>   	/* We should never run out of SGE because the limit is defined to
>>>>   	 * support the max allowed RPC data length
>>>> @@ -169,6 +170,7 @@ static int map_read_chunks(struct svcxprt_rdma *xprt,
>>>>   		 */
>>>>   		head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
>>>>   		rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
>>>> +		rqstp->rq_next_page = rqstp->rq_respages + 1;
>>>>   		byte_count -= sge_bytes;
>>>>   		ch_bytes -= sge_bytes;
>>>> @@ -276,6 +278,7 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
>>>>   	/* rq_respages points one past arg pages */
>>>>   	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
>>>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>>>>   	/* Create the reply and chunk maps */
>>>>   	offset = 0;
>>>> @@ -520,13 +523,6 @@ next_sge:
>>>>   	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
>>>>   		rqstp->rq_pages[ch_no] = NULL;
>>>> -	/*
>>>> -	 * Detach res pages. If svc_release sees any it will attempt to
>>>> -	 * put them.
>>>> -	 */
>>>> -	while (rqstp->rq_next_page != rqstp->rq_respages)
>>>> -		*(--rqstp->rq_next_page) = NULL;
>>>> -
>>>>   	return err;
>>>>   }
>>>> @@ -550,7 +546,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
>>>>   	/* rq_respages starts after the last arg page */
>>>>   	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
>>>> -	rqstp->rq_next_page = &rqstp->rq_arg.pages[page_no];
>>>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>>>>   	/* Rebuild rq_arg head and tail. */
>>>>   	rqstp->rq_arg.head[0] = head->arg.head[0];
>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> index c1d124d..11e90f8 100644
>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> @@ -625,6 +625,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>>>>   		if (page_no+1 >= sge_no)
>>>>   			ctxt->sge[page_no+1].length = 0;
>>>>   	}
>>>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>>>>   	BUG_ON(sge_no > rdma->sc_max_sge);
>>>>   	memset(&send_wr, 0, sizeof send_wr);
>>>>   	ctxt->wr_op = IB_WR_SEND;
>>>>



* Re: [PATCH] Fix regression in NFSRDMA server
  2014-03-29  0:11       ` Tom Tucker
@ 2014-03-29  0:51         ` J. Bruce Fields
  0 siblings, 0 replies; 8+ messages in thread
From: J. Bruce Fields @ 2014-03-29  0:51 UTC (permalink / raw)
  To: Tom Tucker; +Cc: Steve Wise, trond.myklebust, linux-nfs

On Fri, Mar 28, 2014 at 07:11:56PM -0500, Tom Tucker wrote:
> Hi Bruce,
> 
> On 3/28/14 4:26 PM, J. Bruce Fields wrote:
> >On Fri, Mar 28, 2014 at 10:21:27AM -0500, Tom Tucker wrote:
> >>Hi Bruce,
> >>
> >>On 3/27/14 9:08 PM, J. Bruce Fields wrote:
> >>>On Tue, Mar 25, 2014 at 03:14:57PM -0500, Steve Wise wrote:
> >>>>From: Tom Tucker <tom@ogc.us>
> >>>>
> >>>>The server regression was caused by the addition of rq_next_page
> >>>>(afc59400d6c65bad66d4ad0b2daf879cbff8e23e). A few places that update
> >>>>the rq_respages array were missed and left rq_next_page stale.
> >>>Apologies.  (But, it could happen again--could we set up some regular
> >>>testing?  It doesn't have to be anything fancy, just cthon over
> >>>rdma--really, just read and write over rdma--would probably catch a
> >>>lot.)
> >>I think Chelsio is going to be adding some NFSRDMA regression
> >>testing to their system test.
> >OK.  Do you know who there is setting that up?  I'd be curious exactly
> >what kernels they intend to test and how they plan to report results.
> >
> 
> I don't know, Steve can weigh in on this...
> 
> >>>Also: I don't get why all these rq_next_page initializations are
> >>>required.  Why isn't the initialization at the top of svc_process()
> >>>enough?  Is rdma using it before we get to that point?  The only use of
> >>>it I see off hand is in the while loop that you're deleting.
> >>I didn't apply tremendous deductive powers here, I just added
> >>updates to rq_next_page wherever the transport messed with
> >>rq_respages. That said, NFS WRITE is likely the culprit since the
> >>write is completed as a deferral and therefore the request doesn't
> >>go through svc_process, so if rq_next_page is bogus, the cleanup
> >>will free/re-use pages that are actually in use by the transport.
> >Ugh, OK, without tracing through the code I guess I can see how that
> >would happen.  Remind me why it's using deferrals?
> 
> The server fetches the write data from the client using RDMA READ.
> So the request says ... "here's where the data is in my memory", and
> then the server issues an RDMA READ to fetch it. When the read
> completes, the deferred request is completed.

That makes sense, but I'm not sure what you mean by deferring.

The tcp code can also receive a request over multiple recvfroms.  See
Trond's hack in 31d68ef65c7d4 "SUNRPC: Don't wait for full record to
receive tcp data".

--b.
