* [PATCH v1 0/4] NFS/RDMA server patches for v4.19
@ 2018-07-27 15:18 Chuck Lever
  2018-07-27 15:18 ` [PATCH v1 1/4] svcrdma: Avoid releasing a page in svc_xprt_release() Chuck Lever
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Chuck Lever @ 2018-07-27 15:18 UTC (permalink / raw)
  To: bfields; +Cc: linux-rdma, linux-nfs

Hi Bruce et al. -

This short series includes clean-ups related to performance-related
work that you and I have discussed in the past. Let me give an
update on the progress of that work as context for these patches.

We had discussed moving the generation of RDMA Read requests from
->recvfrom up into the NFSD proc functions that handle WRITE and
SYMLINK operations. There were two reasons for this change:

1. To enable the upper layer to choose the pages that act as the
RDMA Read sink buffer, rather than always using anonymous pages
for this purpose

2. To reduce the average latency of ->recvfrom calls on RPC/RDMA,
which are serialized per transport connection

I was able to successfully prototype this change. The scope of this
prototype was limited to exploring how to move the RDMA Read code.
I have not yet tried to implement a per-FH page selection mechanism.

There was no measurable performance impact of this change. The good
news is that this confirms that the RDMA Read code can be moved
upstairs without negative performance consequences. The not-so-good
news:

- Serialization of ->recvfrom might not be the problem that I
predicted.

- I don't have a macro benchmark that mixes small NFS requests with
NFS requests with Read chunks in a way that can assess the
serialization issue.

- The most significant current bottleneck for NFS WRITE performance
is on the Linux client, which obscures performance improvements in
the server-side NFS WRITE path. The bottleneck is generic, not
related to the use of pNFS or the choice of transport type.

There were two architecture-related findings as well:

1. We were considering the need to double the size of the
svc_rqst::rq_pages array in order to reserve enough pages to con-
currently handle late RDMA Reads and building an RPC Reply. I found
a way to implement this prototype that did not require doubling the
size of this array. The transport reserves enough pages in that
array before ->recvfrom returns, and advances the respages pointer
accordingly. The upper layer can begin constructing the RPC Reply
immediately while the payload pages are filled later.
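
The reservation amounts to something like the following, as an
illustrative sketch only (read_chunk_length and sink_pages are
made-up names here, and this is not the actual prototype code):

	/* In ->recvfrom, before returning to svc_process(): set aside
	 * enough rq_pages[] slots to act as the RDMA Read sink buffer,
	 * then advance rq_respages past them so the upper layer builds
	 * its Reply in the slots that follow.
	 */
	unsigned int sink_pages = DIV_ROUND_UP(read_chunk_length, PAGE_SIZE);

	rqstp->rq_respages = &rqstp->rq_pages[sink_pages];
	rqstp->rq_next_page = rqstp->rq_respages + 1;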

2. The prototype encountered an architectural issue with the
server's DRC. A checksum is computed on non-idempotent RPC Calls to
place them in the server's DRC hash table. That checksum includes
the first hundred or so bytes of the payload -- the data that would
be pulled over using RDMA Read. When RDMA Read is delayed, the
payload is not available for checksum computation.
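
Roughly, that checksum walk looks like this (a simplified sketch of
the logic in fs/nfsd/nfscache.c, collapsed to a single payload page;
not the exact code):

	struct xdr_buf *buf = &rqstp->rq_arg;
	size_t csum_len = min_t(size_t, buf->head[0].iov_len + buf->page_len,
				RC_CSUMLEN);
	size_t len = min(buf->head[0].iov_len, csum_len);
	__wsum csum;

	/* Checksum the head, then continue into the payload pages */
	csum = csum_partial(buf->head[0].iov_base, len, 0);
	csum_len -= len;
	if (csum_len)
		csum = csum_partial(page_address(buf->pages[0]), csum_len, csum);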

To complete my prototype, I disabled the server's DRC. Going forward
with this work will require some thought about how to deal with non-
idempotent requests with Read chunks. Some possibilities:

- For RPC Calls with Read chunks, don't include the payload in the
checksum. This could be done by providing a per-transport checksum
callout that would manage the details.

- Support late RDMA Reads for session-based versions of NFS, but not
for earlier versions of NFS which utilize the legacy DRC.

- Adopt an entirely different DRC hashing mechanism.

---

Chuck Lever (4):
      svcrdma: Avoid releasing a page in svc_xprt_release()
      svcrdma: Clean up Read chunk path
      NFSD: Refactor the generic write vector fill helper
      NFSD: Handle full-length symlinks


 fs/nfsd/nfs3proc.c                      |    5 ++
 fs/nfsd/nfs4proc.c                      |   23 ++-------
 fs/nfsd/nfsproc.c                       |    5 ++
 include/linux/sunrpc/svc.h              |    4 +-
 net/sunrpc/svc.c                        |   78 ++++++++++++-------------------
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |    9 ++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c       |   32 +++++--------
 net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    4 +-
 8 files changed, 66 insertions(+), 94 deletions(-)

--
Chuck Lever


* [PATCH v1 1/4] svcrdma: Avoid releasing a page in svc_xprt_release()
  2018-07-27 15:18 [PATCH v1 0/4] NFS/RDMA server patches for v4.19 Chuck Lever
@ 2018-07-27 15:18 ` Chuck Lever
  2018-07-27 15:18 ` [PATCH v1 2/4] svcrdma: Clean up Read chunk path Chuck Lever
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever @ 2018-07-27 15:18 UTC (permalink / raw)
  To: bfields; +Cc: linux-rdma, linux-nfs

svc_xprt_release() invokes svc_free_res_pages(), which releases
pages between rq_respages and rq_next_page.

Historically, the RPC/RDMA transport has set these two pointers to
be different by one, which means:

- one page gets released when svc_recv returns 0. This normally
happens whenever one or more RDMA Reads need to be dispatched to
complete construction of an RPC Call.

- one page gets released after every call to svc_send.

In both cases, this released page is immediately refilled by
svc_alloc_arg. There does not seem to be a reason for releasing this
page.
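
For reference, that release loop does roughly the following
(paraphrased from net/sunrpc/svc_xprt.c, not a verbatim copy):

	static void svc_free_res_pages(struct svc_rqst *rqstp)
	{
		while (rqstp->rq_next_page != rqstp->rq_respages) {
			struct page **pp = --rqstp->rq_next_page;

			if (*pp) {
				put_page(*pp);
				*pp = NULL;
			}
		}
	}

A transport that leaves rq_next_page equal to rq_respages therefore
gives this loop nothing to release.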

To avoid this unnecessary memory allocator traffic, set rq_next_page
more carefully.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |    9 ++++++---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    4 +++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 841fca1..ddb7def 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -366,9 +366,6 @@ static void svc_rdma_build_arg_xdr(struct svc_rqst *rqstp,
 	arg->page_base = 0;
 	arg->buflen = ctxt->rc_byte_len;
 	arg->len = ctxt->rc_byte_len;
-
-	rqstp->rq_respages = &rqstp->rq_pages[0];
-	rqstp->rq_next_page = rqstp->rq_respages + 1;
 }
 
 /* This accommodates the largest possible Write chunk,
@@ -730,6 +727,12 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 
 	svc_rdma_build_arg_xdr(rqstp, ctxt);
 
+	/* Prevent svc_xprt_release from releasing pages in rq_pages
+	 * if we return 0 or an error.
+	 */
+	rqstp->rq_respages = rqstp->rq_pages;
+	rqstp->rq_next_page = rqstp->rq_respages;
+
 	p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
 	ret = svc_rdma_xdr_decode_req(&rqstp->rq_arg);
 	if (ret < 0)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 4a3efae..b958fb6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -657,7 +657,9 @@ static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
 		ctxt->sc_pages[i] = rqstp->rq_respages[i];
 		rqstp->rq_respages[i] = NULL;
 	}
-	rqstp->rq_next_page = rqstp->rq_respages + 1;
+
+	/* Prevent svc_xprt_release from releasing pages in rq_pages */
+	rqstp->rq_next_page = rqstp->rq_respages;
 }
 
 /* Prepare the portion of the RPC Reply that will be transmitted



* [PATCH v1 2/4] svcrdma: Clean up Read chunk path
  2018-07-27 15:18 [PATCH v1 0/4] NFS/RDMA server patches for v4.19 Chuck Lever
  2018-07-27 15:18 ` [PATCH v1 1/4] svcrdma: Avoid releasing a page in svc_xprt_release() Chuck Lever
@ 2018-07-27 15:18 ` Chuck Lever
  2018-07-27 15:19 ` [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper Chuck Lever
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever @ 2018-07-27 15:18 UTC (permalink / raw)
  To: bfields; +Cc: linux-rdma, linux-nfs

Simplify the error handling at the tail of svc_rdma_recv_read_chunk()
by re-arranging rq_pages[] housekeeping and documenting it properly.

NB: In this path, svc_rdma_recvfrom returns zero. Therefore no
subsequent reply processing is done on the svc_rqstp, and thus the
rq_respages field does not need to be updated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_rw.c |   32 +++++++++++++-------------------
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index ce3ea84..a1ac999 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -679,6 +679,7 @@ static int svc_rdma_build_read_chunk(struct svc_rqst *rqstp,
 				     struct svc_rdma_read_info *info,
 				     __be32 *p)
 {
+	unsigned int i;
 	int ret;
 
 	ret = -EINVAL;
@@ -701,6 +702,12 @@ static int svc_rdma_build_read_chunk(struct svc_rqst *rqstp,
 		info->ri_chunklen += rs_length;
 	}
 
+	/* Pages under I/O have been copied to head->rc_pages.
+	 * Prevent their premature release by svc_xprt_release() .
+	 */
+	for (i = 0; i < info->ri_readctxt->rc_page_count; i++)
+		rqstp->rq_pages[i] = NULL;
+
 	return ret;
 }
 
@@ -816,7 +823,6 @@ int svc_rdma_recv_read_chunk(struct svcxprt_rdma *rdma, struct svc_rqst *rqstp,
 			     struct svc_rdma_recv_ctxt *head, __be32 *p)
 {
 	struct svc_rdma_read_info *info;
-	struct page **page;
 	int ret;
 
 	/* The request (with page list) is constructed in
@@ -843,27 +849,15 @@ int svc_rdma_recv_read_chunk(struct svcxprt_rdma *rdma, struct svc_rqst *rqstp,
 		ret = svc_rdma_build_normal_read_chunk(rqstp, info, p);
 	else
 		ret = svc_rdma_build_pz_read_chunk(rqstp, info, p);
-
-	/* Mark the start of the pages that can be used for the reply */
-	if (info->ri_pageoff > 0)
-		info->ri_pageno++;
-	rqstp->rq_respages = &rqstp->rq_pages[info->ri_pageno];
-	rqstp->rq_next_page = rqstp->rq_respages + 1;
-
 	if (ret < 0)
-		goto out;
+		goto out_err;
 
 	ret = svc_rdma_post_chunk_ctxt(&info->ri_cc);
-
-out:
-	/* Read sink pages have been moved from rqstp->rq_pages to
-	 * head->rc_arg.pages. Force svc_recv to refill those slots
-	 * in rq_pages.
-	 */
-	for (page = rqstp->rq_pages; page < rqstp->rq_respages; page++)
-		*page = NULL;
-
 	if (ret < 0)
-		svc_rdma_read_info_free(info);
+		goto out_err;
+	return 0;
+
+out_err:
+	svc_rdma_read_info_free(info);
 	return ret;
 }



* [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper
  2018-07-27 15:18 [PATCH v1 0/4] NFS/RDMA server patches for v4.19 Chuck Lever
  2018-07-27 15:18 ` [PATCH v1 1/4] svcrdma: Avoid releasing a page in svc_xprt_release() Chuck Lever
  2018-07-27 15:18 ` [PATCH v1 2/4] svcrdma: Clean up Read chunk path Chuck Lever
@ 2018-07-27 15:19 ` Chuck Lever
  2018-11-15 16:32   ` J. Bruce Fields
  2018-07-27 15:19 ` [PATCH v1 4/4] NFSD: Handle full-length symlinks Chuck Lever
  2018-08-01 14:36 ` [PATCH v1 0/4] NFS/RDMA server patches for v4.19 J. Bruce Fields
  4 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2018-07-27 15:19 UTC (permalink / raw)
  To: bfields; +Cc: linux-rdma, linux-nfs

fill_in_write_vector() is nearly the same logic as
svc_fill_write_vector(), but there are a few differences so that
the former can handle multiple WRITE payloads in a single COMPOUND.

svc_fill_write_vector() can be adjusted so that it can be used in
the NFSv4 WRITE code path too. Instead of assuming the pages are
coming from rq_arg.pages, have the caller pass in the page list.

The immediate benefit is a reduction of code duplication. It also
prevents the NFSv4 WRITE decoder from passing an empty vector
element when the transport has provided the payload in the xdr_buf's
page array.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c         |    3 ++-
 fs/nfsd/nfs4proc.c         |   23 ++++-------------------
 fs/nfsd/nfsproc.c          |    3 ++-
 include/linux/sunrpc/svc.h |    1 +
 net/sunrpc/svc.c           |   11 ++++-------
 5 files changed, 13 insertions(+), 28 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 6259a4b..8d1c2d1a 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -202,7 +202,8 @@
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->committed = argp->stable;
-	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
+	nvecs = svc_fill_write_vector(rqstp, rqstp->rq_arg.pages,
+				      &argp->first, cnt);
 	if (!nvecs)
 		RETURN_STATUS(nfserr_io);
 	nfserr = nfsd_write(rqstp, &resp->fh, argp->offset,
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 5d99e88..1d1080c 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -982,24 +982,6 @@ static __be32 nfsd4_do_lookupp(struct svc_rqst *rqstp, struct svc_fh *fh)
 	return status;
 }
 
-static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
-{
-        int i = 1;
-        int buflen = write->wr_buflen;
-
-        vec[0].iov_base = write->wr_head.iov_base;
-        vec[0].iov_len = min_t(int, buflen, write->wr_head.iov_len);
-        buflen -= vec[0].iov_len;
-
-        while (buflen) {
-                vec[i].iov_base = page_address(write->wr_pagelist[i - 1]);
-                vec[i].iov_len = min_t(int, PAGE_SIZE, buflen);
-                buflen -= vec[i].iov_len;
-                i++;
-        }
-        return i;
-}
-
 static __be32
 nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	    union nfsd4_op_u *u)
@@ -1027,7 +1009,10 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
 	write->wr_how_written = write->wr_stable_how;
 	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
 
-	nvecs = fill_in_write_vector(rqstp->rq_vec, write);
+	nvecs = svc_fill_write_vector(rqstp, write->wr_pagelist,
+				      &write->wr_head, write->wr_buflen);
+	if (!nvecs)
+		return nfserr_io;
 	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
 
 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, filp,
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index f107f9f..a6faee5 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -218,7 +218,8 @@
 		SVCFH_fmt(&argp->fh),
 		argp->len, argp->offset);
 
-	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
+	nvecs = svc_fill_write_vector(rqstp, rqstp->rq_arg.pages,
+				      &argp->first, cnt);
 	if (!nvecs)
 		return nfserr_io;
 	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 574368e..43f88bd 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -496,6 +496,7 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
 struct svc_pool *  svc_pool_for_cpu(struct svc_serv *serv, int cpu);
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
 unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
+					 struct page **pages,
 					 struct kvec *first, size_t total);
 char		  *svc_fill_symlink_pathname(struct svc_rqst *rqstp,
 					     struct kvec *first, size_t total);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 30a4226..2194ed5 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1537,16 +1537,16 @@ u32 svc_max_payload(const struct svc_rqst *rqstp)
 /**
  * svc_fill_write_vector - Construct data argument for VFS write call
  * @rqstp: svc_rqst to operate on
+ * @pages: list of pages containing data payload
  * @first: buffer containing first section of write payload
  * @total: total number of bytes of write payload
  *
- * Returns the number of elements populated in the data argument array.
+ * Fills in rqstp::rq_vec, and returns the number of elements.
  */
-unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
-				   size_t total)
+unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct page **pages,
+				   struct kvec *first, size_t total)
 {
 	struct kvec *vec = rqstp->rq_vec;
-	struct page **pages;
 	unsigned int i;
 
 	/* Some types of transport can present the write payload
@@ -1560,14 +1560,11 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
 		++i;
 	}
 
-	WARN_ON_ONCE(rqstp->rq_arg.page_base != 0);
-	pages = rqstp->rq_arg.pages;
 	while (total) {
 		vec[i].iov_base = page_address(*pages);
 		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
 		total -= vec[i].iov_len;
 		++i;
-
 		++pages;
 	}
 



* [PATCH v1 4/4] NFSD: Handle full-length symlinks
  2018-07-27 15:18 [PATCH v1 0/4] NFS/RDMA server patches for v4.19 Chuck Lever
                   ` (2 preceding siblings ...)
  2018-07-27 15:19 ` [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper Chuck Lever
@ 2018-07-27 15:19 ` Chuck Lever
  2018-08-01 14:14   ` J. Bruce Fields
  2018-08-01 14:36 ` [PATCH v1 0/4] NFS/RDMA server patches for v4.19 J. Bruce Fields
  4 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2018-07-27 15:19 UTC (permalink / raw)
  To: bfields; +Cc: linux-rdma, linux-nfs

I've given up on the idea of zero-copy handling of SYMLINK on the
server side. This is because the Linux VFS symlink API requires the
symlink pathname to be in a NUL-terminated kmalloc'd buffer. The
NUL-termination is going to be problematic (watching out for
landing on a page boundary and dealing with a 4096-byte pathname).
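
For example, with a PAGE_SIZE of 4096, a full-length NFSv3 symlink
target (NFS3_MAXPATHLEN == PATH_MAX == 4096 bytes) exactly fills a
page, leaving no room in that page for the terminating NUL; the
current code sidesteps this only by capping the pathname at
PAGE_SIZE - 1 bytes.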

I don't believe that SYMLINK creation is on a performance path or is
requested frequently enough that it will cause noticeable CPU cache
pollution due to data copies.

There will be two places where a transport callout will be necessary
to fill in the rqstp: one will be in the svc_fill_symlink_pathname()
helper that is used by NFSv2 and NFSv3, and the other will be in
nfsd4_decode_create().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c         |    2 +
 fs/nfsd/nfsproc.c          |    2 +
 include/linux/sunrpc/svc.h |    3 +-
 net/sunrpc/svc.c           |   67 ++++++++++++++++----------------------------
 4 files changed, 31 insertions(+), 43 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 8d1c2d1a..9eb8086 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -290,6 +290,7 @@
 		RETURN_STATUS(nfserr_nametoolong);
 
 	argp->tname = svc_fill_symlink_pathname(rqstp, &argp->first,
+						page_address(rqstp->rq_arg.pages[0]),
 						argp->tlen);
 	if (IS_ERR(argp->tname))
 		RETURN_STATUS(nfserrno(PTR_ERR(argp->tname)));
@@ -303,6 +304,7 @@
 	fh_init(&resp->fh, NFS3_FHSIZE);
 	nfserr = nfsd_symlink(rqstp, &resp->dirfh, argp->fname, argp->flen,
 						   argp->tname, &resp->fh);
+	kfree(argp->tname);
 	RETURN_STATUS(nfserr);
 }
 
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index a6faee5..0d20fd1 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -454,6 +454,7 @@
 		return nfserr_nametoolong;
 
 	argp->tname = svc_fill_symlink_pathname(rqstp, &argp->first,
+						page_address(rqstp->rq_arg.pages[0]),
 						argp->tlen);
 	if (IS_ERR(argp->tname))
 		return nfserrno(PTR_ERR(argp->tname));
@@ -466,6 +467,7 @@
 	nfserr = nfsd_symlink(rqstp, &argp->ffh, argp->fname, argp->flen,
 						 argp->tname, &newfh);
 
+	kfree(argp->tname);
 	fh_put(&argp->ffh);
 	fh_put(&newfh);
 	return nfserr;
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 43f88bd..73e130a 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -499,7 +499,8 @@ unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
 					 struct page **pages,
 					 struct kvec *first, size_t total);
 char		  *svc_fill_symlink_pathname(struct svc_rqst *rqstp,
-					     struct kvec *first, size_t total);
+					     struct kvec *first, void *p,
+					     size_t total);
 
 #define	RPC_MAX_ADDRBUFLEN	(63U)
 
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 2194ed5..d13e05f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1577,65 +1577,48 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct page **pages,
  * svc_fill_symlink_pathname - Construct pathname argument for VFS symlink call
  * @rqstp: svc_rqst to operate on
  * @first: buffer containing first section of pathname
+ * @p: buffer containing remaining section of pathname
  * @total: total length of the pathname argument
  *
- * Returns pointer to a NUL-terminated string, or an ERR_PTR. The buffer is
- * released automatically when @rqstp is recycled.
+ * The VFS symlink API demands a NUL-terminated pathname in mapped memory.
+ * Returns pointer to a NUL-terminated string, or an ERR_PTR. Caller must free
+ * the returned string.
  */
 char *svc_fill_symlink_pathname(struct svc_rqst *rqstp, struct kvec *first,
-				size_t total)
+				void *p, size_t total)
 {
-	struct xdr_buf *arg = &rqstp->rq_arg;
-	struct page **pages;
-	char *result;
-
-	/* VFS API demands a NUL-terminated pathname. This function
-	 * uses a page from @rqstp as the pathname buffer, to enable
-	 * direct placement. Thus the total buffer size is PAGE_SIZE.
-	 * Space in this buffer for NUL-termination requires that we
-	 * cap the size of the returned symlink pathname just a
-	 * little early.
-	 */
-	if (total > PAGE_SIZE - 1)
-		return ERR_PTR(-ENAMETOOLONG);
+	size_t len, remaining;
+	char *result, *dst;
 
-	/* Some types of transport can present the pathname entirely
-	 * in rq_arg.pages. If not, then copy the pathname into one
-	 * page.
-	 */
-	pages = arg->pages;
-	WARN_ON_ONCE(arg->page_base != 0);
-	if (first->iov_base == 0) {
-		result = page_address(*pages);
-		result[total] = '\0';
-	} else {
-		size_t len, remaining;
-		char *dst;
+	result = kmalloc(total + 1, GFP_KERNEL);
+	if (!result)
+		return ERR_PTR(-ESERVERFAULT);
 
-		result = page_address(*(rqstp->rq_next_page++));
-		dst = result;
-		remaining = total;
+	dst = result;
+	remaining = total;
 
-		len = min_t(size_t, total, first->iov_len);
+	len = min_t(size_t, total, first->iov_len);
+	if (len) {
 		memcpy(dst, first->iov_base, len);
 		dst += len;
 		remaining -= len;
+	}
 
-		/* No more than one page left */
-		if (remaining) {
-			len = min_t(size_t, remaining, PAGE_SIZE);
-			memcpy(dst, page_address(*pages), len);
-			dst += len;
-		}
-
-		*dst = '\0';
+	if (remaining) {
+		len = min_t(size_t, remaining, PAGE_SIZE);
+		memcpy(dst, p, len);
+		dst += len;
 	}
 
-	/* Sanity check: we don't allow the pathname argument to
+	*dst = '\0';
+
+	/* Sanity check: Linux doesn't allow the pathname argument to
 	 * contain a NUL byte.
 	 */
-	if (strlen(result) != total)
+	if (strlen(result) != total) {
+		kfree(result);
 		return ERR_PTR(-EINVAL);
+	}
 	return result;
 }
 EXPORT_SYMBOL_GPL(svc_fill_symlink_pathname);



* Re: [PATCH v1 4/4] NFSD: Handle full-length symlinks
  2018-07-27 15:19 ` [PATCH v1 4/4] NFSD: Handle full-length symlinks Chuck Lever
@ 2018-08-01 14:14   ` J. Bruce Fields
  2018-08-01 14:16     ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2018-08-01 14:14 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

On Fri, Jul 27, 2018 at 11:19:10AM -0400, Chuck Lever wrote:
> I've given up on the idea of zero-copy handling of SYMLINK on the
> server side. This is because the Linux VFS symlink API requires the
> symlink pathname to be in a NUL-terminated kmalloc'd buffer. The
> NUL-termination is going to be problematic (watching out for
> landing on a page boundary and dealing with a 4096-byte pathname).
> 
> I don't believe that SYMLINK creation is on a performance path or is
> requested frequently enough that it will cause noticeable CPU cache
> pollution due to data copies.

Sounds fine.

The limits here are weird.  In the v2 case the maximum length is
NFS_MAXPATHLEN == 1024.  OK, I see, I guess that limit was imposed by
the NFSv2 protocol.  In the v3 case it's NFS3_MAXPATHLEN == PATH_MAX ==
4096.  But we were imposing an artificial max of 4095 just so we could
fit the result in one page with null termination.

OK, makes sense to me (though I can't recall anyone actually complaining
about that limit).

--b.


> 
> There will be two places where a transport callout will be necessary
> to fill in the rqstp: one will be in the svc_fill_symlink_pathname()
> helper that is used by NFSv2 and NFSv3, and the other will be in
> nfsd4_decode_create().
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/nfs3proc.c         |    2 +
>  fs/nfsd/nfsproc.c          |    2 +
>  include/linux/sunrpc/svc.h |    3 +-
>  net/sunrpc/svc.c           |   67 ++++++++++++++++----------------------------
>  4 files changed, 31 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index 8d1c2d1a..9eb8086 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -290,6 +290,7 @@
>  		RETURN_STATUS(nfserr_nametoolong);
>  
>  	argp->tname = svc_fill_symlink_pathname(rqstp, &argp->first,
> +						page_address(rqstp->rq_arg.pages[0]),
>  						argp->tlen);
>  	if (IS_ERR(argp->tname))
>  		RETURN_STATUS(nfserrno(PTR_ERR(argp->tname)));
> @@ -303,6 +304,7 @@
>  	fh_init(&resp->fh, NFS3_FHSIZE);
>  	nfserr = nfsd_symlink(rqstp, &resp->dirfh, argp->fname, argp->flen,
>  						   argp->tname, &resp->fh);
> +	kfree(argp->tname);
>  	RETURN_STATUS(nfserr);
>  }
>  
> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> index a6faee5..0d20fd1 100644
> --- a/fs/nfsd/nfsproc.c
> +++ b/fs/nfsd/nfsproc.c
> @@ -454,6 +454,7 @@
>  		return nfserr_nametoolong;
>  
>  	argp->tname = svc_fill_symlink_pathname(rqstp, &argp->first,
> +						page_address(rqstp->rq_arg.pages[0]),
>  						argp->tlen);
>  	if (IS_ERR(argp->tname))
>  		return nfserrno(PTR_ERR(argp->tname));
> @@ -466,6 +467,7 @@
>  	nfserr = nfsd_symlink(rqstp, &argp->ffh, argp->fname, argp->flen,
>  						 argp->tname, &newfh);
>  
> +	kfree(argp->tname);
>  	fh_put(&argp->ffh);
>  	fh_put(&newfh);
>  	return nfserr;
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 43f88bd..73e130a 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -499,7 +499,8 @@ unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
>  					 struct page **pages,
>  					 struct kvec *first, size_t total);
>  char		  *svc_fill_symlink_pathname(struct svc_rqst *rqstp,
> -					     struct kvec *first, size_t total);
> +					     struct kvec *first, void *p,
> +					     size_t total);
>  
>  #define	RPC_MAX_ADDRBUFLEN	(63U)
>  
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 2194ed5..d13e05f 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -1577,65 +1577,48 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct page **pages,
>   * svc_fill_symlink_pathname - Construct pathname argument for VFS symlink call
>   * @rqstp: svc_rqst to operate on
>   * @first: buffer containing first section of pathname
> + * @p: buffer containing remaining section of pathname
>   * @total: total length of the pathname argument
>   *
> - * Returns pointer to a NUL-terminated string, or an ERR_PTR. The buffer is
> - * released automatically when @rqstp is recycled.
> + * The VFS symlink API demands a NUL-terminated pathname in mapped memory.
> + * Returns pointer to a NUL-terminated string, or an ERR_PTR. Caller must free
> + * the returned string.
>   */
>  char *svc_fill_symlink_pathname(struct svc_rqst *rqstp, struct kvec *first,
> -				size_t total)
> +				void *p, size_t total)
>  {
> -	struct xdr_buf *arg = &rqstp->rq_arg;
> -	struct page **pages;
> -	char *result;
> -
> -	/* VFS API demands a NUL-terminated pathname. This function
> -	 * uses a page from @rqstp as the pathname buffer, to enable
> -	 * direct placement. Thus the total buffer size is PAGE_SIZE.
> -	 * Space in this buffer for NUL-termination requires that we
> -	 * cap the size of the returned symlink pathname just a
> -	 * little early.
> -	 */
> -	if (total > PAGE_SIZE - 1)
> -		return ERR_PTR(-ENAMETOOLONG);
> +	size_t len, remaining;
> +	char *result, *dst;
>  
> -	/* Some types of transport can present the pathname entirely
> -	 * in rq_arg.pages. If not, then copy the pathname into one
> -	 * page.
> -	 */
> -	pages = arg->pages;
> -	WARN_ON_ONCE(arg->page_base != 0);
> -	if (first->iov_base == 0) {
> -		result = page_address(*pages);
> -		result[total] = '\0';
> -	} else {
> -		size_t len, remaining;
> -		char *dst;
> +	result = kmalloc(total + 1, GFP_KERNEL);
> +	if (!result)
> +		return ERR_PTR(-ESERVERFAULT);
>  
> -		result = page_address(*(rqstp->rq_next_page++));
> -		dst = result;
> -		remaining = total;
> +	dst = result;
> +	remaining = total;
>  
> -		len = min_t(size_t, total, first->iov_len);
> +	len = min_t(size_t, total, first->iov_len);
> +	if (len) {
>  		memcpy(dst, first->iov_base, len);
>  		dst += len;
>  		remaining -= len;
> +	}
>  
> -		/* No more than one page left */
> -		if (remaining) {
> -			len = min_t(size_t, remaining, PAGE_SIZE);
> -			memcpy(dst, page_address(*pages), len);
> -			dst += len;
> -		}
> -
> -		*dst = '\0';
> +	if (remaining) {
> +		len = min_t(size_t, remaining, PAGE_SIZE);
> +		memcpy(dst, p, len);
> +		dst += len;
>  	}
>  
> -	/* Sanity check: we don't allow the pathname argument to
> +	*dst = '\0';
> +
> +	/* Sanity check: Linux doesn't allow the pathname argument to
>  	 * contain a NUL byte.
>  	 */
> -	if (strlen(result) != total)
> +	if (strlen(result) != total) {
> +		kfree(result);
>  		return ERR_PTR(-EINVAL);
> +	}
>  	return result;
>  }
>  EXPORT_SYMBOL_GPL(svc_fill_symlink_pathname);


* Re: [PATCH v1 4/4] NFSD: Handle full-length symlinks
  2018-08-01 14:14   ` J. Bruce Fields
@ 2018-08-01 14:16     ` Chuck Lever
  0 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever @ 2018-08-01 14:16 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-rdma, Linux NFS Mailing List



> On Aug 1, 2018, at 10:14 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Fri, Jul 27, 2018 at 11:19:10AM -0400, Chuck Lever wrote:
>> I've given up on the idea of zero-copy handling of SYMLINK on the
>> server side. This is because the Linux VFS symlink API requires the
>> symlink pathname to be in a NUL-terminated kmalloc'd buffer. The
>> NUL-termination is going to be problematic (watching out for
>> landing on a page boundary and dealing with a 4096-byte pathname).
>>
>> I don't believe that SYMLINK creation is on a performance path or is
>> requested frequently enough that it will cause noticeable CPU cache
>> pollution due to data copies.
>
> Sounds fine.
>
> The limits here are weird.  In the v2 case the maximum length is
> NFS_MAXPATHLEN == 1024.  OK, I see, I guess that limit was imposed by
> the NFSv2 protocol.

Right.


> In the v3 case it's NFS3_MAXPATHLEN == PATH_MAX ==
> 4096.  But we were imposing an artificial max of 4095 just so we could
> fit the result in one page with null termination.
>
> OK, makes sense to me (though I can't recall anyone actually complaining
> about that limit).

I haven't heard complaints either. I'm just trying to make this
code a little more sensible. :-)


> --b.
>
>
>>
>> There will be two places where a transport callout will be necessary
>> to fill in the rqstp: one will be in the svc_fill_symlink_pathname()
>> helper that is used by NFSv2 and NFSv3, and the other will be in
>> nfsd4_decode_create().
>>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> fs/nfsd/nfs3proc.c         |    2 +
>> fs/nfsd/nfsproc.c          |    2 +
>> include/linux/sunrpc/svc.h |    3 +-
>> net/sunrpc/svc.c           |   67 ++++++++++++++----------------------------
>> 4 files changed, 31 insertions(+), 43 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
>> index 8d1c2d1a..9eb8086 100644
>> --- a/fs/nfsd/nfs3proc.c
>> +++ b/fs/nfsd/nfs3proc.c
>> @@ -290,6 +290,7 @@
>> 		RETURN_STATUS(nfserr_nametoolong);
>>
>> 	argp->tname = svc_fill_symlink_pathname(rqstp, &argp->first,
>> +						page_address(rqstp->rq_arg.pages[0]),
>> 						argp->tlen);
>> 	if (IS_ERR(argp->tname))
>> 		RETURN_STATUS(nfserrno(PTR_ERR(argp->tname)));
>> @@ -303,6 +304,7 @@
>> 	fh_init(&resp->fh, NFS3_FHSIZE);
>> 	nfserr = nfsd_symlink(rqstp, &resp->dirfh, argp->fname, argp->flen,
>> 						   argp->tname, &resp->fh);
>> +	kfree(argp->tname);
>> 	RETURN_STATUS(nfserr);
>> }
>>
>> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
>> index a6faee5..0d20fd1 100644
>> --- a/fs/nfsd/nfsproc.c
>> +++ b/fs/nfsd/nfsproc.c
>> @@ -454,6 +454,7 @@
>> 		return nfserr_nametoolong;
>>
>> 	argp->tname = svc_fill_symlink_pathname(rqstp, &argp->first,
>> +						page_address(rqstp->rq_arg.pages[0]),
>> 						argp->tlen);
>> 	if (IS_ERR(argp->tname))
>> 		return nfserrno(PTR_ERR(argp->tname));
>> @@ -466,6 +467,7 @@
>> 	nfserr = nfsd_symlink(rqstp, &argp->ffh, argp->fname, argp->flen,
>> 						 argp->tname, &newfh);
>>
>> +	kfree(argp->tname);
>> 	fh_put(&argp->ffh);
>> 	fh_put(&newfh);
>> 	return nfserr;
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index 43f88bd..73e130a 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -499,7 +499,8 @@ unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
>> 					 struct page **pages,
>> 					 struct kvec *first, size_t total);
>> char		  *svc_fill_symlink_pathname(struct svc_rqst *rqstp,
>> -					     struct kvec *first, size_t total);
>> +					     struct kvec *first, void *p,
>> +					     size_t total);
>>
>> #define	RPC_MAX_ADDRBUFLEN	(63U)
>>
>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> index 2194ed5..d13e05f 100644
>> --- a/net/sunrpc/svc.c
>> +++ b/net/sunrpc/svc.c
>> @@ -1577,65 +1577,48 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct page **pages,
>>  * svc_fill_symlink_pathname - Construct pathname argument for VFS symlink call
>>  * @rqstp: svc_rqst to operate on
>>  * @first: buffer containing first section of pathname
>> + * @p: buffer containing remaining section of pathname
>>  * @total: total length of the pathname argument
>>  *
>> - * Returns pointer to a NUL-terminated string, or an ERR_PTR. The buffer is
>> - * released automatically when @rqstp is recycled.
>> + * The VFS symlink API demands a NUL-terminated pathname in mapped memory.
>> + * Returns pointer to a NUL-terminated string, or an ERR_PTR. Caller must free
>> + * the returned string.
>>  */
>> char *svc_fill_symlink_pathname(struct svc_rqst *rqstp, struct kvec *first,
>> -				size_t total)
>> +				void *p, size_t total)
>> {
>> -	struct xdr_buf *arg = &rqstp->rq_arg;
>> -	struct page **pages;
>> -	char *result;
>> -
>> -	/* VFS API demands a NUL-terminated pathname. This function
>> -	 * uses a page from @rqstp as the pathname buffer, to enable
>> -	 * direct placement. Thus the total buffer size is PAGE_SIZE.
>> -	 * Space in this buffer for NUL-termination requires that we
>> -	 * cap the size of the returned symlink pathname just a
>> -	 * little early.
>> -	 */
>> -	if (total > PAGE_SIZE - 1)
>> -		return ERR_PTR(-ENAMETOOLONG);
>> +	size_t len, remaining;
>> +	char *result, *dst;
>>
>> -	/* Some types of transport can present the pathname entirely
>> -	 * in rq_arg.pages. If not, then copy the pathname into one
>> -	 * page.
>> -	 */
>> -	pages = arg->pages;
>> -	WARN_ON_ONCE(arg->page_base != 0);
>> -	if (first->iov_base == 0) {
>> -		result = page_address(*pages);
>> -		result[total] = '\0';
>> -	} else {
>> -		size_t len, remaining;
>> -		char *dst;
>> +	result = kmalloc(total + 1, GFP_KERNEL);
>> +	if (!result)
>> +		return ERR_PTR(-ESERVERFAULT);
>>
>> -		result = page_address(*(rqstp->rq_next_page++));
>> -		dst = result;
>> -		remaining = total;
>> +	dst = result;
>> +	remaining = total;
>>
>> -		len = min_t(size_t, total, first->iov_len);
>> +	len = min_t(size_t, total, first->iov_len);
>> +	if (len) {
>> 		memcpy(dst, first->iov_base, len);
>> 		dst += len;
>> 		remaining -= len;
>> +	}
>>
>> -		/* No more than one page left */
>> -		if (remaining) {
>> -			len = min_t(size_t, remaining, PAGE_SIZE);
>> -			memcpy(dst, page_address(*pages), len);
>> -			dst += len;
>> -		}
>> -
>> -		*dst = '\0';
>> +	if (remaining) {
>> +		len = min_t(size_t, remaining, PAGE_SIZE);
>> +		memcpy(dst, p, len);
>> +		dst += len;
>> 	}
>>
>> -	/* Sanity check: we don't allow the pathname argument to
>> +	*dst = '\0';
>> +
>> +	/* Sanity check: Linux doesn't allow the pathname argument to
>> 	 * contain a NUL byte.
>> 	 */
>> -	if (strlen(result) != total)
>> +	if (strlen(result) != total) {
>> +		kfree(result);
>> 		return ERR_PTR(-EINVAL);
>> +	}
>> 	return result;
>> }
>> EXPORT_SYMBOL_GPL(svc_fill_symlink_pathname);

--
Chuck Lever





* Re: [PATCH v1 0/4] NFS/RDMA server patches for v4.19
  2018-07-27 15:18 [PATCH v1 0/4] NFS/RDMA server patches for v4.19 Chuck Lever
                   ` (3 preceding siblings ...)
  2018-07-27 15:19 ` [PATCH v1 4/4] NFSD: Handle full-length symlinks Chuck Lever
@ 2018-08-01 14:36 ` J. Bruce Fields
  2018-08-01 15:14   ` Chuck Lever
  4 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2018-08-01 14:36 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

On Fri, Jul 27, 2018 at 11:18:48AM -0400, Chuck Lever wrote:
> This short series includes clean-ups related to performance-related
> work that you and I have discussed in the past.

Thanks, applying.

> Let me give an
> update on the progress of that work as context for these patches.
> 
> We had discussed moving the generation of RDMA Read requests from
> ->recvfrom up into the NFSD proc functions that handle WRITE and
> SYMLINK operations. There were two reasons for this change:
> 
> 1. To enable the upper layer to choose the pages that act as the
> RDMA Read sink buffer, rather than always using anonymous pages
> for this purpose
> 
> 2. To reduce the average latency of ->recvfrom calls on RPC/RDMA,
> which are serialized per transport connection
> 
> I was able to successfully prototype this change. The scope of this
> prototype was limited to exploring how to move the RDMA Read code.
> I have not yet tried to implement a per-FH page selection mechanism.
> 
> There was no measurable performance impact of this change. The good
> news is that this confirms that the RDMA Read code can be moved
> upstairs without negative performance consequences. The not-so-good
> news:
> 
> - Serialization of ->recvfrom might not be the problem that I
> predicted.
> 
> - I don't have a macro benchmark that mixes small NFS requests with
> NFS requests with Read chunks in a way that can assess the
> serialization issue.
> 
> - The most significant current bottleneck for NFS WRITE performance
> is on the Linux client, which obscures performance improvements in
> the server-side NFS WRITE path. The bottleneck is generic, not
> related to the use of pNFS or the choice of transport type.

Would it be practical to test with an artificial workload that takes the
client out of the test?

But if there are client issues then they need to be fixed anyway, so
the better use of time may be fixing those first, and then we get to
test the server with more realistic workloads....

> To complete my prototype, I disabled the server's DRC. Going forward
> with this work will require some thought about how to deal with non-
> idempotent requests with Read chunks. Some possibilities:
> 
> - For RPC Calls with Read chunks, don't include the payload in the
> checksum. This could be done by providing a per-transport checksum
> callout that would manage the details.

I believe the checksum is there to prevent rpc's being incorrectly
treated as replays on xid wraparound.  If we're skipping it in the case
of writes (for example), then we may be slightly increasing the chance
of data corruption in some cases.  I don't know if it's significant.

> - Support late RDMA Reads for session-based versions of NFS, but not
> for earlier versions of NFS which utilize the legacy DRC.

To me that sounds like it might be simplest to start off with?

> - Adopt an entirely different DRC hashing mechanism.

I guess we could delay the hashing somehow too.

But I'd rather not invest a lot of time trying to make NFSv2/v3 better,
the priority for older protocols is just to avoid regressions.

--b.

> 
> ---
> 
> Chuck Lever (4):
>       svcrdma: Avoid releasing a page in svc_xprt_release()
>       svcrdma: Clean up Read chunk path
>       NFSD: Refactor the generic write vector fill helper
>       NFSD: Handle full-length symlinks
> 
> 
>  fs/nfsd/nfs3proc.c                      |    5 ++
>  fs/nfsd/nfs4proc.c                      |   23 ++-------
>  fs/nfsd/nfsproc.c                       |    5 ++
>  include/linux/sunrpc/svc.h              |    4 +-
>  net/sunrpc/svc.c                        |   78 ++++++++++++-------------------
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |    9 ++--
>  net/sunrpc/xprtrdma/svc_rdma_rw.c       |   32 +++++--------
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c   |    4 +-
>  8 files changed, 66 insertions(+), 94 deletions(-)
> 
> --
> Chuck Lever


* Re: [PATCH v1 0/4] NFS/RDMA server patches for v4.19
  2018-08-01 14:36 ` [PATCH v1 0/4] NFS/RDMA server patches for v4.19 J. Bruce Fields
@ 2018-08-01 15:14   ` Chuck Lever
  2018-08-01 15:20     ` Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2018-08-01 15:14 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-rdma, Linux NFS Mailing List



> On Aug 1, 2018, at 10:36 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Fri, Jul 27, 2018 at 11:18:48AM -0400, Chuck Lever wrote:
>> This short series includes clean-ups related to performance-related
>> work that you and I have discussed in the past.
> 
> Thanks, applying.

Thanks for looking!


>> Let me give an
>> update on the progress of that work as context for these patches.
>> 
>> We had discussed moving the generation of RDMA Read requests from
>> ->recvfrom up into the NFSD proc functions that handle WRITE and
>> SYMLINK operations. There were two reasons for this change:
>> 
>> 1. To enable the upper layer to choose the pages that act as the
>> RDMA Read sink buffer, rather than always using anonymous pages
>> for this purpose
>> 
>> 2. To reduce the average latency of ->recvfrom calls on RPC/RDMA,
>> which are serialized per transport connection
>> 
>> I was able to successfully prototype this change. The scope of this
>> prototype was limited to exploring how to move the RDMA Read code.
>> I have not yet tried to implement a per-FH page selection mechanism.
>> 
>> There was no measurable performance impact of this change. The good
>> news is that this confirms that the RDMA Read code can be moved
>> upstairs without negative performance consequences. The not-so-good
>> news:
>> 
>> - Serialization of ->recvfrom might not be the problem that I
>> predicted.
>> 
>> - I don't have a macro benchmark that mixes small NFS requests with
>> NFS requests with Read chunks in a way that can assess the
>> serialization issue.
>> 
>> - The most significant current bottleneck for NFS WRITE performance
>> is on the Linux client, which obscures performance improvements in
>> the server-side NFS WRITE path. The bottleneck is generic, not
>> related to the use of pNFS or the choice of transport type.
> 
> Would it be practical to test with an artificial workload that takes the
> client out of the test?

For instance, that is what SPEC NFS does: it has a synthetic client.


> But if there are client issues then they need to be fixed anyway, so
> the better use of time may be fixing those first, and then we get to
> test the server with more realistic workloads....

I've been dabbling with the client a little, and have more to
discuss whenever there is a f2f meeting.


>> To complete my prototype, I disabled the server's DRC. Going forward
>> with this work will require some thought about how to deal with non-
>> idempotent requests with Read chunks. Some possibilities:
>> 
>> - For RPC Calls with Read chunks, don't include the payload in the
>> checksum. This could be done by providing a per-transport checksum
>> callout that would manage the details.
> 
> I believe the checksum is there to prevent rpc's being incorrectly
> treated as replays on xid wraparound.  If we're skipping it in the case
> of writes (for example), then we may be slightly increasing the chance
> of data corruption in some cases.  I don't know if it's significant.

One thing Parav has suggested is the use of T10 PI signatures.
We had discussed using it to offload the computation of the
checksum on the inline part (RDMA Send), but that's variable
size, and PI likes multiples of 512 bytes.

Instead, perhaps we could grab a PI signature on the payload
(the Read chunk, whose size will often be a multiple of a page),
and toss that into the checksum computation. Dunno if that would
be practical in general, or for write payloads that are not
appropriately length-aligned.


>> - Support late RDMA Reads for session-based versions of NFS, but not
>> for earlier versions of NFS which utilize the legacy DRC.
> 
> To me that sounds like it might be simplest to start off with?
> 
>> - Adopt an entirely different DRC hashing mechanism.
> 
> I guess we could delay the hashing somehow too.

There is plenty of room for creativity.


> But I'd rather not invest a lot of time trying to make NFSv2/v3 better,
> the priority for older protocols is just to avoid regressions.

Fair 'nuf.

I'm not clear where NFSv4.0 fits here. I realize most of us would
consider NFSv4.0 to be a "legacy protocol". Does the Linux server
utilize the DRC for NFSv4.0 ?


--
Chuck Lever





* Re: [PATCH v1 0/4] NFS/RDMA server patches for v4.19
  2018-08-01 15:14   ` Chuck Lever
@ 2018-08-01 15:20     ` Bruce Fields
  0 siblings, 0 replies; 15+ messages in thread
From: Bruce Fields @ 2018-08-01 15:20 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On Wed, Aug 01, 2018 at 11:14:14AM -0400, Chuck Lever wrote:
> > On Aug 1, 2018, at 10:36 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > But I'd rather not invest a lot of time trying to make NFSv2/v3 better,
> > the priority for older protocols is just to avoid regressions.
> 
> Fair 'nuf.
> 
> I'm not clear where NFSv4.0 fits here. I realize most of us would
> consider NFSv4.0 to be a "legacy protocol". Does the Linux server
> utilize the DRC for NFSv4.0 ?

If I remember correctly, it uses the DRC and decides whether to cache
particular compounds based on the mix of operation numbers making up the
compound.  (I wrote that code but haven't looked at it recently.)

--b.


* Re: [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper
  2018-07-27 15:19 ` [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper Chuck Lever
@ 2018-11-15 16:32   ` J. Bruce Fields
  2018-11-16  0:19     ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2018-11-15 16:32 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

Um, I somehow managed to overlook a pynfs regression till just now:

On Fri, Jul 27, 2018 at 11:19:05AM -0400, Chuck Lever wrote:
> fill_in_write_vector() is nearly the same logic as
> svc_fill_write_vector(), but there are a few differences so that
> the former can handle multiple WRITE payloads in a single COMPOUND.
> 
> svc_fill_write_vector() can be adjusted so that it can be used in
> the NFSv4 WRITE code path too. Instead of assuming the pages are
> coming from rq_arg.pages, have the caller pass in the page list.
> 
> The immediate benefit is a reduction of code duplication. It also
> prevents the NFSv4 WRITE decoder from passing an empty vector
> element when the transport has provided the payload in the xdr_buf's
> page array.
> 
...
> @@ -1027,7 +1009,10 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
>  	write->wr_how_written = write->wr_stable_how;
>  	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
>  
> -	nvecs = fill_in_write_vector(rqstp->rq_vec, write);
> +	nvecs = svc_fill_write_vector(rqstp, write->wr_pagelist,
> +				      &write->wr_head, write->wr_buflen);
> +	if (!nvecs)
> +		return nfserr_io;

Do you remember why you added this check?

It's causing zero-length writes to fail, in violation of the spec:

	"If the count is zero, the WRITE will succeed and return a count
	of zero subject to permissions checking."

I'm not seeing a reason why it wouldn't be safe just to remove that
check.
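
A zero-length WRITE presumably arrives here with wr_buflen == 0 and an
empty wr_head, so neither loop in the helper runs and nvecs is 0.  A
condensed sketch of the helper's body after this series (simplified,
not the exact code):

	unsigned int i = 0;

	if (first->iov_len) {		/* payload bytes in the head kvec */
		vec[i].iov_base = first->iov_base;
		vec[i].iov_len = min_t(size_t, total, first->iov_len);
		total -= vec[i].iov_len;
		++i;
	}
	while (total) {			/* remaining payload bytes in @pages */
		vec[i].iov_base = page_address(*pages);
		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
		total -= vec[i].iov_len;
		++i;
		++pages;
	}
	return i;			/* 0 only when there is no payload */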

--b.

>  	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
>  
>  	status = nfsd_vfs_write(rqstp, &cstate->current_fh, filp,
> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> index f107f9f..a6faee5 100644
> --- a/fs/nfsd/nfsproc.c
> +++ b/fs/nfsd/nfsproc.c
> @@ -218,7 +218,8 @@
>  		SVCFH_fmt(&argp->fh),
>  		argp->len, argp->offset);
>  
> -	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
> +	nvecs = svc_fill_write_vector(rqstp, rqstp->rq_arg.pages,
> +				      &argp->first, cnt);
>  	if (!nvecs)
>  		return nfserr_io;
>  	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 574368e..43f88bd 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -496,6 +496,7 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
>  struct svc_pool *  svc_pool_for_cpu(struct svc_serv *serv, int cpu);
>  char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
>  unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
> +					 struct page **pages,
>  					 struct kvec *first, size_t total);
>  char		  *svc_fill_symlink_pathname(struct svc_rqst *rqstp,
>  					     struct kvec *first, size_t total);
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 30a4226..2194ed5 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -1537,16 +1537,16 @@ u32 svc_max_payload(const struct svc_rqst *rqstp)
>  /**
>   * svc_fill_write_vector - Construct data argument for VFS write call
>   * @rqstp: svc_rqst to operate on
> + * @pages: list of pages containing data payload
>   * @first: buffer containing first section of write payload
>   * @total: total number of bytes of write payload
>   *
> - * Returns the number of elements populated in the data argument array.
> + * Fills in rqstp::rq_vec, and returns the number of elements.
>   */
> -unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
> -				   size_t total)
> +unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct page **pages,
> +				   struct kvec *first, size_t total)
>  {
>  	struct kvec *vec = rqstp->rq_vec;
> -	struct page **pages;
>  	unsigned int i;
>  
>  	/* Some types of transport can present the write payload
> @@ -1560,14 +1560,11 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
>  		++i;
>  	}
>  
> -	WARN_ON_ONCE(rqstp->rq_arg.page_base != 0);
> -	pages = rqstp->rq_arg.pages;
>  	while (total) {
>  		vec[i].iov_base = page_address(*pages);
>  		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
>  		total -= vec[i].iov_len;
>  		++i;
> -
>  		++pages;
>  	}
>  


* Re: [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper
  2018-11-15 16:32   ` J. Bruce Fields
@ 2018-11-16  0:19     ` Chuck Lever
  2018-11-19 19:54       ` Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2018-11-16  0:19 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-rdma, Linux NFS Mailing List



> On Nov 15, 2018, at 8:32 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> Um, I somehow managed to overlook a pynfs regression till just now:
> 
> On Fri, Jul 27, 2018 at 11:19:05AM -0400, Chuck Lever wrote:
>> fill_in_write_vector() is nearly the same logic as
>> svc_fill_write_vector(), but there are a few differences so that
>> the former can handle multiple WRITE payloads in a single COMPOUND.
>> 
>> svc_fill_write_vector() can be adjusted so that it can be used in
>> the NFSv4 WRITE code path too. Instead of assuming the pages are
>> coming from rq_arg.pages, have the caller pass in the page list.
>> 
>> The immediate benefit is a reduction of code duplication. It also
>> prevents the NFSv4 WRITE decoder from passing an empty vector
>> element when the transport has provided the payload in the xdr_buf's
>> page array.
>> 
> ...
>> @@ -1027,7 +1009,10 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
>> 	write->wr_how_written = write->wr_stable_how;
>> 	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
>> 
>> -	nvecs = fill_in_write_vector(rqstp->rq_vec, write);
>> +	nvecs = svc_fill_write_vector(rqstp, write->wr_pagelist,
>> +				      &write->wr_head, write->wr_buflen);
>> +	if (!nvecs)
>> +		return nfserr_io;
> 
> Do you remember why you added this check?

Yes, at one point svc_fill_write_vector() called the transport
layer to do some processing, and I needed a way for it to
indicate a graceful failure.


> It's causing zero-length writes to fail, in violation of the spec:
> 
> 	"If the count is zero, the WRITE will succeed and return a count
> 	of zero subject to permissions checking."

RFC 7530, Section 16.36.4.


> I'm not seeing a reason why it wouldn't be safe just to remove that
> check.

Or, make the check more specific:

	if (!nvecs && write->wr_buflen)
		return nfserr_io;

This is a little more bullet proof, in case of bugs in the
transport layer.


> --b.
> 
>> 	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
>> 
>> 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, filp,
>> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
>> index f107f9f..a6faee5 100644
>> --- a/fs/nfsd/nfsproc.c
>> +++ b/fs/nfsd/nfsproc.c
>> @@ -218,7 +218,8 @@
>> 		SVCFH_fmt(&argp->fh),
>> 		argp->len, argp->offset);
>> 
>> -	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
>> +	nvecs = svc_fill_write_vector(rqstp, rqstp->rq_arg.pages,
>> +				      &argp->first, cnt);
>> 	if (!nvecs)
>> 		return nfserr_io;
>> 	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index 574368e..43f88bd 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -496,6 +496,7 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
>> struct svc_pool *  svc_pool_for_cpu(struct svc_serv *serv, int cpu);
>> char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
>> unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
>> +					 struct page **pages,
>> 					 struct kvec *first, size_t total);
>> char		  *svc_fill_symlink_pathname(struct svc_rqst *rqstp,
>> 					     struct kvec *first, size_t total);
>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> index 30a4226..2194ed5 100644
>> --- a/net/sunrpc/svc.c
>> +++ b/net/sunrpc/svc.c
>> @@ -1537,16 +1537,16 @@ u32 svc_max_payload(const struct svc_rqst *rqstp)
>> /**
>>  * svc_fill_write_vector - Construct data argument for VFS write call
>>  * @rqstp: svc_rqst to operate on
>> + * @pages: list of pages containing data payload
>>  * @first: buffer containing first section of write payload
>>  * @total: total number of bytes of write payload
>>  *
>> - * Returns the number of elements populated in the data argument array.
>> + * Fills in rqstp::rq_vec, and returns the number of elements.
>>  */
>> -unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
>> -				   size_t total)
>> +unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct page **pages,
>> +				   struct kvec *first, size_t total)
>> {
>> 	struct kvec *vec = rqstp->rq_vec;
>> -	struct page **pages;
>> 	unsigned int i;
>> 
>> 	/* Some types of transport can present the write payload
>> @@ -1560,14 +1560,11 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
>> 		++i;
>> 	}
>> 
>> -	WARN_ON_ONCE(rqstp->rq_arg.page_base != 0);
>> -	pages = rqstp->rq_arg.pages;
>> 	while (total) {
>> 		vec[i].iov_base = page_address(*pages);
>> 		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
>> 		total -= vec[i].iov_len;
>> 		++i;
>> -
>> 		++pages;
>> 	}
>> 

--
Chuck Lever





* Re: [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper
  2018-11-16  0:19     ` Chuck Lever
@ 2018-11-19 19:54       ` Bruce Fields
  2018-11-19 19:58         ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Bruce Fields @ 2018-11-19 19:54 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On Thu, Nov 15, 2018 at 04:19:52PM -0800, Chuck Lever wrote:
> 
> 
> > On Nov 15, 2018, at 8:32 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > Um, I somehow managed to overlook a pynfs regression till just now:
> > 
> > On Fri, Jul 27, 2018 at 11:19:05AM -0400, Chuck Lever wrote:
> >> fill_in_write_vector() is nearly the same logic as
> >> svc_fill_write_vector(), but there are a few differences so that
> >> the former can handle multiple WRITE payloads in a single COMPOUND.
> >> 
> >> svc_fill_write_vector() can be adjusted so that it can be used in
> >> the NFSv4 WRITE code path too. Instead of assuming the pages are
> >> coming from rq_arg.pages, have the caller pass in the page list.
> >> 
> >> The immediate benefit is a reduction of code duplication. It also
> >> prevents the NFSv4 WRITE decoder from passing an empty vector
> >> element when the transport has provided the payload in the xdr_buf's
> >> page array.
> >> 
> > ...
> >> @@ -1027,7 +1009,10 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
> >> 	write->wr_how_written = write->wr_stable_how;
> >> 	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
> >> 
> >> -	nvecs = fill_in_write_vector(rqstp->rq_vec, write);
> >> +	nvecs = svc_fill_write_vector(rqstp, write->wr_pagelist,
> >> +				      &write->wr_head, write->wr_buflen);
> >> +	if (!nvecs)
> >> +		return nfserr_io;
> > 
> > Do you remember why you added this check?
> 
> Yes, at one point svc_fill_write_vector() called the transport
> layer to do some processing, and I needed a way for it to
> indicate a graceful failure.
> 
> 
> > It's causing zero-length writes to fail, in violation of the spec:
> > 
> > 	"If the count is zero, the WRITE will succeed and return a count
> > 	of zero subject to permissions checking."
> 
> RFC 7530, Section 16.36.4.
> 
> 
> > I'm not seeing a reason why it wouldn't be safe just to remove that
> > check.
> 
> Or, make the check more specific:
> 
> 	if (!nvecs && write->wr_buflen)
> 		return nfserr_io;
> 
> This is a little more bullet proof, in case of bugs in the
> transport layer.

The current svc_fill_write_vector() is pretty short and obviously can't
hit that case, so I'd just find that confusing.  We can re-add that
check if and when it's needed.
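
(For reference, a condensed sketch of the refactored helper,
reconstructed from the hunks quoted earlier in the thread rather than
copied verbatim: it returns zero only when both the head kvec and the
page payload are empty, i.e. only for a zero-length write.)

	unsigned int svc_fill_write_vector(struct svc_rqst *rqstp,
					   struct page **pages,
					   struct kvec *first, size_t total)
	{
		struct kvec *vec = rqstp->rq_vec;
		unsigned int i = 0;

		/* Payload bytes that were decoded into the head kvec. */
		if (first->iov_len) {
			vec[i].iov_base = first->iov_base;
			vec[i].iov_len = min_t(size_t, total, first->iov_len);
			total -= vec[i].iov_len;
			++i;
		}

		/* The rest of the payload lives in the caller-provided pages. */
		while (total) {
			vec[i].iov_base = page_address(*pages);
			vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
			total -= vec[i].iov_len;
			++i;
			++pages;
		}

		return i;
	}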

--b.


* Re: [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper
  2018-11-19 19:54       ` Bruce Fields
@ 2018-11-19 19:58         ` Chuck Lever
  2018-11-19 19:59           ` Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2018-11-19 19:58 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-rdma, Linux NFS Mailing List



> On Nov 19, 2018, at 2:54 PM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Thu, Nov 15, 2018 at 04:19:52PM -0800, Chuck Lever wrote:
>> 
>> 
>>> On Nov 15, 2018, at 8:32 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> 
>>> Um, I somehow managed to overlook a pynfs regression till just now:
>>> 
>>> On Fri, Jul 27, 2018 at 11:19:05AM -0400, Chuck Lever wrote:
>>>> fill_in_write_vector() is nearly the same logic as
>>>> svc_fill_write_vector(), but there are a few differences so that
>>>> the former can handle multiple WRITE payloads in a single COMPOUND.
>>>> 
>>>> svc_fill_write_vector() can be adjusted so that it can be used in
>>>> the NFSv4 WRITE code path too. Instead of assuming the pages are
>>>> coming from rq_arg.pages, have the caller pass in the page list.
>>>> 
>>>> The immediate benefit is a reduction of code duplication. It also
>>>> prevents the NFSv4 WRITE decoder from passing an empty vector
>>>> element when the transport has provided the payload in the xdr_buf's
>>>> page array.
>>>> 
>>> ...
>>>> @@ -1027,7 +1009,10 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
>>>> 	write->wr_how_written = write->wr_stable_how;
>>>> 	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
>>>> 
>>>> -	nvecs = fill_in_write_vector(rqstp->rq_vec, write);
>>>> +	nvecs = svc_fill_write_vector(rqstp, write->wr_pagelist,
>>>> +				      &write->wr_head, write->wr_buflen);
>>>> +	if (!nvecs)
>>>> +		return nfserr_io;
>>> 
>>> Do you remember why you added this check?
>> 
>> Yes, at one point svc_fill_write_vector() called the transport
>> layer to do some processing, and I needed a way for it to
>> indicate a graceful failure.
>> 
>> 
>>> It's causing zero-length writes to fail, in violation of the spec:
>>> 
>>> 	"If the count is zero, the WRITE will succeed and return a count
>>> 	of zero subject to permissions checking."
>> 
>> RFC 7530, Section 16.36.4.
>> 
>> 
>>> I'm not seeing a reason why it wouldn't be safe just to remove that
>>> check.
>> 
>> Or, make the check more specific:
>> 
>> 	if (!nvecs && write->wr_buflen)
>> 		return nfserr_io;
>> 
>> This is a little more bullet proof, in case of bugs in the
>> transport layer.
> 
> The current svc_fill_write_vector is pretty short and obviously can't
> hit that case, so I'd just find that confusing.  We can re-add that
> check if and when it's needed.

Should I send you a patch?


--
Chuck Lever





* Re: [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper
  2018-11-19 19:58         ` Chuck Lever
@ 2018-11-19 19:59           ` Bruce Fields
  0 siblings, 0 replies; 15+ messages in thread
From: Bruce Fields @ 2018-11-19 19:59 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On Mon, Nov 19, 2018 at 02:58:17PM -0500, Chuck Lever wrote:
> Should I send you a patch?

I'm applying the following.

--b.

commit fc4ee8d8a5f7
Author: J. Bruce Fields <bfields@redhat.com>
Date:   Thu Nov 15 11:21:40 2018 -0500

    nfsd4: zero-length WRITE should succeed
    
    Zero-length writes are legal; from RFC 5661, Section 18.32.3: "If the count
    is zero, the WRITE will succeed and return a count of zero subject to
    permissions checking".
    
    This check is unnecessary and is causing zero-length writes to fail.
    
    Cc: stable@vger.kernel.org
    Fixes: 3fd9557aec91 ("NFSD: Refactor the generic write vector fill helper")
    Cc: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index d505990dac7c..c364acbb6aba 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1016,8 +1016,6 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 
 	nvecs = svc_fill_write_vector(rqstp, write->wr_pagelist,
 				      &write->wr_head, write->wr_buflen);
-	if (!nvecs)
-		return nfserr_io;
 	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
 
 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, filp,


Thread overview: 15+ messages
2018-07-27 15:18 [PATCH v1 0/4] NFS/RDMA server patches for v4.19 Chuck Lever
2018-07-27 15:18 ` [PATCH v1 1/4] svcrdma: Avoid releasing a page in svc_xprt_release() Chuck Lever
2018-07-27 15:18 ` [PATCH v1 2/4] svcrdma: Clean up Read chunk path Chuck Lever
2018-07-27 15:19 ` [PATCH v1 3/4] NFSD: Refactor the generic write vector fill helper Chuck Lever
2018-11-15 16:32   ` J. Bruce Fields
2018-11-16  0:19     ` Chuck Lever
2018-11-19 19:54       ` Bruce Fields
2018-11-19 19:58         ` Chuck Lever
2018-11-19 19:59           ` Bruce Fields
2018-07-27 15:19 ` [PATCH v1 4/4] NFSD: Handle full-length symlinks Chuck Lever
2018-08-01 14:14   ` J. Bruce Fields
2018-08-01 14:16     ` Chuck Lever
2018-08-01 14:36 ` [PATCH v1 0/4] NFS/RDMA server patches for v4.19 J. Bruce Fields
2018-08-01 15:14   ` Chuck Lever
2018-08-01 15:20     ` Bruce Fields
