* [PATCH v1 0/2] NFS decoder clean-ups, redux
@ 2018-03-19 18:56 Chuck Lever
  2018-03-19 18:56 ` [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders Chuck Lever
  2018-03-19 18:56 ` [PATCH v1 2/2] NFSD: Clean up symlink " Chuck Lever
  0 siblings, 2 replies; 8+ messages in thread
From: Chuck Lever @ 2018-03-19 18:56 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

Hi Bruce-

These two are prerequisites for some transport-level changes I'd
like to propose for v4.17.

As you may recall, I proposed these two for v4.16. We had some
discussion about the changes you wanted, but I'm not sure I now
recall your requests accurately.

I recall you felt OK with "NFSD: Clean up write argument XDR
decoders" but wanted some changes to "NFSD: Clean up symlink
argument XDR decoders".

Did you prefer that the symlink decoder allocate a page rather
than pull one out of rq_pages[]?

---

Chuck Lever (2):
      NFSD: Clean up write argument XDR decoders
      NFSD: Clean up symlink argument XDR decoders


 fs/nfsd/nfs3proc.c         |   18 ++++++-
 fs/nfsd/nfs3xdr.c          |   67 ++++++---------------------
 fs/nfsd/nfs4proc.c         |   37 +++++----------
 fs/nfsd/nfs4xdr.c          |   11 +++-
 fs/nfsd/nfsproc.c          |   23 ++++++---
 fs/nfsd/nfsxdr.c           |   63 +++++++++++++------------
 fs/nfsd/xdr.h              |    3 +
 fs/nfsd/xdr3.h             |    3 +
 fs/nfsd/xdr4.h             |    3 +
 include/linux/sunrpc/svc.h |    4 ++
 net/sunrpc/svc.c           |  109 ++++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 217 insertions(+), 124 deletions(-)

--
Chuck Lever


* [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders
  2018-03-19 18:56 [PATCH v1 0/2] NFS decoder clean-ups, redux Chuck Lever
@ 2018-03-19 18:56 ` Chuck Lever
  2018-03-23 21:32   ` J. Bruce Fields
  2018-03-19 18:56 ` [PATCH v1 2/2] NFSD: Clean up symlink " Chuck Lever
  1 sibling, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2018-03-19 18:56 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

Move common code in NFSD's write arg decoders into a helper. The
immediate benefit is reduction of code duplication and some nice
micro-optimizations (see below).

However, in the long term, this helper could perform a per-transport
call-out to fill the rq_vec (say, using RDMA Reads).

The legacy WRITE decoders and procs are changed to work like NFSv4,
which constructs the rq_vec just before calling vfs_writev.

Why? Calling a transport call-out from the proc instead of the XDR
decoder means that the incoming FH can be resolved to a particular
filesystem and file. This would allow pages from the backing file to
be presented to the transport to be filled, rather than presenting
anonymous pages and copying or swapping them into the file's page
cache later.

I also prefer using the pages in rq_arg.pages, instead of pulling
the data pages directly out of the rqstp::rq_pages array. This is
currently the way the NFSv3 write decoder works, but the other two
do not seem to take this approach. Fixing this removes the only
reference to rq_pages found in NFSD, eliminating an NFSD assumption
about how transports use the pages in rq_pages.

Lastly, avoid setting up the first element of rq_vec as a zero-
length buffer. This happens with an RDMA transport when a normal
Read chunk is present because the data payload is in rq_arg's
page list (none of it is in the head buffer).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c         |    8 ++++++--
 fs/nfsd/nfs3xdr.c          |   16 ++++------------
 fs/nfsd/nfs4proc.c         |   30 ++++++------------------------
 fs/nfsd/nfs4xdr.c          |    1 -
 fs/nfsd/nfsproc.c          |    9 +++++++--
 fs/nfsd/nfsxdr.c           |   14 ++------------
 fs/nfsd/xdr.h              |    2 +-
 fs/nfsd/xdr3.h             |    2 +-
 fs/nfsd/xdr4.h             |    1 -
 include/linux/sunrpc/svc.h |    2 ++
 net/sunrpc/svc.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 1d0ce3c..2dd95eb 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -192,6 +192,7 @@
 	struct nfsd3_writeres *resp = rqstp->rq_resp;
 	__be32	nfserr;
 	unsigned long cnt = argp->len;
+	unsigned int nvecs;
 
 	dprintk("nfsd: WRITE(3)    %s %d bytes at %Lu%s\n",
 				SVCFH_fmt(&argp->fh),
@@ -201,9 +202,12 @@
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->committed = argp->stable;
+	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
+	if (!nvecs)
+		RETURN_STATUS(nfserr_io);
 	nfserr = nfsd_write(rqstp, &resp->fh, argp->offset,
-				rqstp->rq_vec, argp->vlen,
-				&cnt, resp->committed);
+			    rqstp->rq_vec, nvecs, &cnt,
+			    resp->committed);
 	resp->count = cnt;
 	RETURN_STATUS(nfserr);
 }
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 1a70581..e19fc5d 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -391,7 +391,7 @@ void fill_post_wcc(struct svc_fh *fhp)
 nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
 {
 	struct nfsd3_writeargs *args = rqstp->rq_argp;
-	unsigned int len, v, hdr, dlen;
+	unsigned int len, hdr, dlen;
 	u32 max_blocksize = svc_max_payload(rqstp);
 	struct kvec *head = rqstp->rq_arg.head;
 	struct kvec *tail = rqstp->rq_arg.tail;
@@ -433,17 +433,9 @@ void fill_post_wcc(struct svc_fh *fhp)
 		args->count = max_blocksize;
 		len = args->len = max_blocksize;
 	}
-	rqstp->rq_vec[0].iov_base = (void*)p;
-	rqstp->rq_vec[0].iov_len = head->iov_len - hdr;
-	v = 0;
-	while (len > rqstp->rq_vec[v].iov_len) {
-		len -= rqstp->rq_vec[v].iov_len;
-		v++;
-		rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
-		rqstp->rq_vec[v].iov_len = PAGE_SIZE;
-	}
-	rqstp->rq_vec[v].iov_len = len;
-	args->vlen = v + 1;
+
+	args->first.iov_base = (void *)p;
+	args->first.iov_len = head->iov_len - hdr;
 	return 1;
 }
 
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 0df37e0..cf30544 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -974,24 +974,6 @@ static __be32 nfsd4_do_lookupp(struct svc_rqst *rqstp, struct svc_fh *fh)
 	return status;
 }
 
-static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
-{
-        int i = 1;
-        int buflen = write->wr_buflen;
-
-        vec[0].iov_base = write->wr_head.iov_base;
-        vec[0].iov_len = min_t(int, buflen, write->wr_head.iov_len);
-        buflen -= vec[0].iov_len;
-
-        while (buflen) {
-                vec[i].iov_base = page_address(write->wr_pagelist[i - 1]);
-                vec[i].iov_len = min_t(int, PAGE_SIZE, buflen);
-                buflen -= vec[i].iov_len;
-                i++;
-        }
-        return i;
-}
-
 static __be32
 nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	    union nfsd4_op_u *u)
@@ -1000,8 +982,8 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
 	stateid_t *stateid = &write->wr_stateid;
 	struct file *filp = NULL;
 	__be32 status = nfs_ok;
+	unsigned int nvecs;
 	unsigned long cnt;
-	int nvecs;
 
 	if (write->wr_offset >= OFFSET_MAX)
 		return nfserr_inval;
@@ -1019,12 +1001,12 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
 	write->wr_how_written = write->wr_stable_how;
 	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
 
-	nvecs = fill_in_write_vector(rqstp->rq_vec, write);
-	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
-
+	nvecs = svc_fill_write_vector(rqstp, &write->wr_head, cnt);
+	if (!nvecs)
+		return nfserr_io;
 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, filp,
-				write->wr_offset, rqstp->rq_vec, nvecs, &cnt,
-				write->wr_how_written);
+				write->wr_offset, rqstp->rq_vec, nvecs,
+				&cnt, write->wr_how_written);
 	fput(filp);
 
 	write->wr_bytes_written = cnt;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index f75cc7d..d33c8aa 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1286,7 +1286,6 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compoundargs *argp, struct xdr_ne
 	}
 	write->wr_head.iov_base = p;
 	write->wr_head.iov_len = avail;
-	write->wr_pagelist = argp->pagelist;
 
 	len = XDR_QUADLEN(write->wr_buflen) << 2;
 	if (len >= avail) {
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 43c0419..1995ea6 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -212,13 +212,18 @@
 	struct nfsd_attrstat *resp = rqstp->rq_resp;
 	__be32	nfserr;
 	unsigned long cnt = argp->len;
+	unsigned int nvecs;
 
 	dprintk("nfsd: WRITE    %s %d bytes at %d\n",
 		SVCFH_fmt(&argp->fh),
 		argp->len, argp->offset);
 
-	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh), argp->offset,
-				rqstp->rq_vec, argp->vlen, &cnt, NFS_DATA_SYNC);
+	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
+	if (!nvecs)
+		return nfserr_io;
+	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
+			    argp->offset, rqstp->rq_vec, nvecs,
+			    &cnt, NFS_DATA_SYNC);
 	return nfsd_return_attrs(nfserr, resp);
 }
 
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index 79b6064..db24ae8 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -287,7 +287,6 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f
 	struct nfsd_writeargs *args = rqstp->rq_argp;
 	unsigned int len, hdr, dlen;
 	struct kvec *head = rqstp->rq_arg.head;
-	int v;
 
 	p = decode_fh(p, &args->fh);
 	if (!p)
@@ -323,17 +322,8 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f
 	if (dlen < XDR_QUADLEN(len)*4)
 		return 0;
 
-	rqstp->rq_vec[0].iov_base = (void*)p;
-	rqstp->rq_vec[0].iov_len = head->iov_len - hdr;
-	v = 0;
-	while (len > rqstp->rq_vec[v].iov_len) {
-		len -= rqstp->rq_vec[v].iov_len;
-		v++;
-		rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
-		rqstp->rq_vec[v].iov_len = PAGE_SIZE;
-	}
-	rqstp->rq_vec[v].iov_len = len;
-	args->vlen = v + 1;
+	args->first.iov_base = (void *)p;
+	args->first.iov_len = head->iov_len - hdr;
 	return 1;
 }
 
diff --git a/fs/nfsd/xdr.h b/fs/nfsd/xdr.h
index 2f4f22e..a765c41 100644
--- a/fs/nfsd/xdr.h
+++ b/fs/nfsd/xdr.h
@@ -34,7 +34,7 @@ struct nfsd_writeargs {
 	svc_fh			fh;
 	__u32			offset;
 	int			len;
-	int			vlen;
+	struct kvec		first;
 };
 
 struct nfsd_createargs {
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 056bf8a..deccf7f 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -41,7 +41,7 @@ struct nfsd3_writeargs {
 	__u32			count;
 	int			stable;
 	__u32			len;
-	int			vlen;
+	struct kvec		first;
 };
 
 struct nfsd3_createargs {
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index bc29511..d56219d 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -390,7 +390,6 @@ struct nfsd4_write {
 	u32		wr_stable_how;      /* request */
 	u32		wr_buflen;          /* request */
 	struct kvec	wr_head;
-	struct page **	wr_pagelist;        /* request */
 
 	u32		wr_bytes_written;   /* response */
 	u32		wr_how_written;     /* response */
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 1939709..6dc7879 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -494,6 +494,8 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
 void		   svc_reserve(struct svc_rqst *rqstp, int space);
 struct svc_pool *  svc_pool_for_cpu(struct svc_serv *serv, int cpu);
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
+unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
+					 struct kvec *first, size_t total);
 
 #define	RPC_MAX_ADDRBUFLEN	(63U)
 
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index f19987f..a155e2d 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1533,3 +1533,45 @@ u32 svc_max_payload(const struct svc_rqst *rqstp)
 	return max;
 }
 EXPORT_SYMBOL_GPL(svc_max_payload);
+
+/**
+ * svc_fill_write_vector - Construct data argument for VFS write call
+ * @rqstp: svc_rqst to operate on
+ * @first: buffer containing first section of write payload
+ * @total: total number of bytes of write payload
+ *
+ * Returns the number of elements populated in the data argument array.
+ */
+unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
+				   size_t total)
+{
+	struct kvec *vec = rqstp->rq_vec;
+	struct page **pages;
+	unsigned int i;
+
+	/* Some types of transport can present the write payload
+	 * entirely in rq_arg.pages. In this case, @first is empty.
+	 */
+	i = 0;
+	if (first->iov_len) {
+		vec[i].iov_base = first->iov_base;
+		vec[i].iov_len = min_t(size_t, total, first->iov_len);
+		total -= vec[i].iov_len;
+		++i;
+	}
+
+	WARN_ON_ONCE(rqstp->rq_arg.page_base != 0);
+	pages = rqstp->rq_arg.pages;
+	while (total) {
+		vec[i].iov_base = page_address(*pages);
+		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
+		total -= vec[i].iov_len;
+		++i;
+
+		++pages;
+	}
+
+	WARN_ON_ONCE(i > ARRAY_SIZE(rqstp->rq_vec));
+	return i;
+}
+EXPORT_SYMBOL_GPL(svc_fill_write_vector);



* [PATCH v1 2/2] NFSD: Clean up symlink argument XDR decoders
  2018-03-19 18:56 [PATCH v1 0/2] NFS decoder clean-ups, redux Chuck Lever
  2018-03-19 18:56 ` [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders Chuck Lever
@ 2018-03-19 18:56 ` Chuck Lever
  1 sibling, 0 replies; 8+ messages in thread
From: Chuck Lever @ 2018-03-19 18:56 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

Move common code in NFSD's symlink arg decoders into a helper. The
immediate benefits include:

 - one fewer data copies on transports that support DDP
 - no memory allocation in the NFSv4 XDR decoder
 - consistent error checking across all versions
 - reduction of code duplication
 - support for both legal forms of SYMLINK requests on RDMA
   transports for all versions of NFS (in particular, NFSv2, for
   completeness)

In the long term, this helper is an appropriate spot to perform a
per-transport call-out to fill the pathname argument using, say,
RDMA Reads.

Filling the pathname in the proc function also means that eventually
the incoming filehandle can be interpreted so that filesystem-
specific memory can be allocated as a sink for the pathname
argument, rather than using anonymous pages.

I'm wondering why the current code punts a zero-length SYMLINK.
Is it illegal to create a zero-length symlink on Linux?

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c         |   10 +++++++
 fs/nfsd/nfs3xdr.c          |   51 ++++++++-------------------------
 fs/nfsd/nfs4proc.c         |    7 +++++
 fs/nfsd/nfs4xdr.c          |   10 +++++--
 fs/nfsd/nfsproc.c          |   14 +++++----
 fs/nfsd/nfsxdr.c           |   49 +++++++++++++++++++-------------
 fs/nfsd/xdr.h              |    1 +
 fs/nfsd/xdr3.h             |    1 +
 fs/nfsd/xdr4.h             |    2 +
 include/linux/sunrpc/svc.h |    2 +
 net/sunrpc/svc.c           |   67 ++++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 146 insertions(+), 68 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 2dd95eb..6259a4b 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -283,6 +283,16 @@
 	struct nfsd3_diropres *resp = rqstp->rq_resp;
 	__be32	nfserr;
 
+	if (argp->tlen == 0)
+		RETURN_STATUS(nfserr_inval);
+	if (argp->tlen > NFS3_MAXPATHLEN)
+		RETURN_STATUS(nfserr_nametoolong);
+
+	argp->tname = svc_fill_symlink_pathname(rqstp, &argp->first,
+						argp->tlen);
+	if (IS_ERR(argp->tname))
+		RETURN_STATUS(nfserrno(PTR_ERR(argp->tname)));
+
 	dprintk("nfsd: SYMLINK(3)  %s %.*s -> %.*s\n",
 				SVCFH_fmt(&argp->ffh),
 				argp->flen, argp->fname,
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index e19fc5d..3192b54 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -481,51 +481,24 @@ void fill_post_wcc(struct svc_fh *fhp)
 nfs3svc_decode_symlinkargs(struct svc_rqst *rqstp, __be32 *p)
 {
 	struct nfsd3_symlinkargs *args = rqstp->rq_argp;
-	unsigned int len, avail;
-	char *old, *new;
-	struct kvec *vec;
+	char *base = (char *)p;
+	size_t dlen;
 
 	if (!(p = decode_fh(p, &args->ffh)) ||
-	    !(p = decode_filename(p, &args->fname, &args->flen))
-		)
+	    !(p = decode_filename(p, &args->fname, &args->flen)))
 		return 0;
 	p = decode_sattr3(p, &args->attrs);
 
-	/* now decode the pathname, which might be larger than the first page.
-	 * As we have to check for nul's anyway, we copy it into a new page
-	 * This page appears in the rq_res.pages list, but as pages_len is always
-	 * 0, it won't get in the way
-	 */
-	len = ntohl(*p++);
-	if (len == 0 || len > NFS3_MAXPATHLEN || len >= PAGE_SIZE)
-		return 0;
-	args->tname = new = page_address(*(rqstp->rq_next_page++));
-	args->tlen = len;
-	/* first copy and check from the first page */
-	old = (char*)p;
-	vec = &rqstp->rq_arg.head[0];
-	if ((void *)old > vec->iov_base + vec->iov_len)
-		return 0;
-	avail = vec->iov_len - (old - (char*)vec->iov_base);
-	while (len && avail && *old) {
-		*new++ = *old++;
-		len--;
-		avail--;
-	}
-	/* now copy next page if there is one */
-	if (len && !avail && rqstp->rq_arg.page_len) {
-		avail = min_t(unsigned int, rqstp->rq_arg.page_len, PAGE_SIZE);
-		old = page_address(rqstp->rq_arg.pages[0]);
-	}
-	while (len && avail && *old) {
-		*new++ = *old++;
-		len--;
-		avail--;
-	}
-	*new = '\0';
-	if (len)
-		return 0;
+	args->tlen = ntohl(*p++);
 
+	args->first.iov_base = p;
+	args->first.iov_len = rqstp->rq_arg.head[0].iov_len;
+	args->first.iov_len -= (char *)p - base;
+
+	dlen = args->first.iov_len + rqstp->rq_arg.page_len +
+	       rqstp->rq_arg.tail[0].iov_len;
+	if (dlen < XDR_QUADLEN(args->tlen) << 2)
+		return 0;
 	return 1;
 }
 
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index cf30544..c8dfcc4 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -605,6 +605,13 @@ static void gen_boot_verifier(nfs4_verifier *verifier, struct net *net)
 
 	switch (create->cr_type) {
 	case NF4LNK:
+		if (create->cr_datalen > NFS4_MAXPATHLEN)
+			return nfserr_nametoolong;
+		create->cr_data =
+			svc_fill_symlink_pathname(rqstp, &create->cr_first,
+						  create->cr_datalen);
+		if (IS_ERR(create->cr_data))
+			return nfserrno(PTR_ERR(create->cr_data));
 		status = nfsd_symlink(rqstp, &cstate->current_fh,
 				      create->cr_name, create->cr_namelen,
 				      create->cr_data, &resfh);
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index d33c8aa..8e286ce 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -649,6 +649,7 @@ static __be32 nfsd4_decode_bind_conn_to_session(struct nfsd4_compoundargs *argp,
 static __be32
 nfsd4_decode_create(struct nfsd4_compoundargs *argp, struct nfsd4_create *create)
 {
+	struct kvec *head;
 	DECODE_HEAD;
 
 	READ_BUF(4);
@@ -657,10 +658,13 @@ static __be32 nfsd4_decode_bind_conn_to_session(struct nfsd4_compoundargs *argp,
 	case NF4LNK:
 		READ_BUF(4);
 		create->cr_datalen = be32_to_cpup(p++);
+		if (create->cr_datalen == 0)
+			return nfserr_inval;
+		head = argp->rqstp->rq_arg.head;
+		create->cr_first.iov_base = p;
+		create->cr_first.iov_len = head->iov_len;
+		create->cr_first.iov_len -= (char *)p - (char *)head->iov_base;
 		READ_BUF(create->cr_datalen);
-		create->cr_data = svcxdr_dupstr(argp, p, create->cr_datalen);
-		if (!create->cr_data)
-			return nfserr_jukebox;
 		break;
 	case NF4BLK:
 	case NF4CHR:
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 1995ea6..f107f9f 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -449,17 +449,19 @@
 	struct svc_fh	newfh;
 	__be32		nfserr;
 
+	if (argp->tlen > NFS_MAXPATHLEN)
+		return nfserr_nametoolong;
+
+	argp->tname = svc_fill_symlink_pathname(rqstp, &argp->first,
+						argp->tlen);
+	if (IS_ERR(argp->tname))
+		return nfserrno(PTR_ERR(argp->tname));
+
 	dprintk("nfsd: SYMLINK  %s %.*s -> %.*s\n",
 		SVCFH_fmt(&argp->ffh), argp->flen, argp->fname,
 		argp->tlen, argp->tname);
 
 	fh_init(&newfh, NFS_FHSIZE);
-	/*
-	 * Crazy hack: the request fits in a page, and already-decoded
-	 * attributes follow argp->tname, so it's safe to just write a
-	 * null to ensure it's null-terminated:
-	 */
-	argp->tname[argp->tlen] = '\0';
 	nfserr = nfsd_symlink(rqstp, &argp->ffh, argp->fname, argp->flen,
 						 argp->tname, &newfh);
 
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index db24ae8..a43e826 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -71,22 +71,6 @@ __be32 *nfs2svc_decode_fh(__be32 *p, struct svc_fh *fhp)
 }
 
 static __be32 *
-decode_pathname(__be32 *p, char **namp, unsigned int *lenp)
-{
-	char		*name;
-	unsigned int	i;
-
-	if ((p = xdr_decode_string_inplace(p, namp, lenp, NFS_MAXPATHLEN)) != NULL) {
-		for (i = 0, name = *namp; i < *lenp; i++, name++) {
-			if (*name == '\0')
-				return NULL;
-		}
-	}
-
-	return p;
-}
-
-static __be32 *
 decode_sattr(__be32 *p, struct iattr *iap)
 {
 	u32	tmp, tmp1;
@@ -384,14 +368,39 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f
 nfssvc_decode_symlinkargs(struct svc_rqst *rqstp, __be32 *p)
 {
 	struct nfsd_symlinkargs *args = rqstp->rq_argp;
+	char *base = (char *)p;
+	size_t xdrlen;
 
 	if (   !(p = decode_fh(p, &args->ffh))
-	    || !(p = decode_filename(p, &args->fname, &args->flen))
-	    || !(p = decode_pathname(p, &args->tname, &args->tlen)))
+	    || !(p = decode_filename(p, &args->fname, &args->flen)))
 		return 0;
-	p = decode_sattr(p, &args->attrs);
 
-	return xdr_argsize_check(rqstp, p);
+	args->tlen = ntohl(*p++);
+	if (args->tlen == 0)
+		return 0;
+
+	args->first.iov_base = p;
+	args->first.iov_len = rqstp->rq_arg.head[0].iov_len;
+	args->first.iov_len -= (char *)p - base;
+
+	/* This request is never larger than a page. Therefore,
+	 * transport will deliver either:
+	 * 1. pathname in the pagelist -> sattr is in the tail.
+	 * 2. everything in the head buffer -> sattr is in the head.
+	 */
+	if (rqstp->rq_arg.page_len) {
+		if (args->tlen != rqstp->rq_arg.page_len)
+			return 0;
+		p = rqstp->rq_arg.tail[0].iov_base;
+	} else {
+		xdrlen = XDR_QUADLEN(args->tlen);
+		if (xdrlen > args->first.iov_len - (8 * sizeof(__be32)))
+			return 0;
+		p += xdrlen;
+	}
+	decode_sattr(p, &args->attrs);
+
+	return 1;
 }
 
 int
diff --git a/fs/nfsd/xdr.h b/fs/nfsd/xdr.h
index a765c41..ea7cca3 100644
--- a/fs/nfsd/xdr.h
+++ b/fs/nfsd/xdr.h
@@ -72,6 +72,7 @@ struct nfsd_symlinkargs {
 	char *			tname;
 	unsigned int		tlen;
 	struct iattr		attrs;
+	struct kvec		first;
 };
 
 struct nfsd_readdirargs {
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index deccf7f..2cb29e9 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -90,6 +90,7 @@ struct nfsd3_symlinkargs {
 	char *			tname;
 	unsigned int		tlen;
 	struct iattr		attrs;
+	struct kvec		first;
 };
 
 struct nfsd3_readdirargs {
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index d56219d..b485cd1 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -110,6 +110,7 @@ struct nfsd4_create {
 		struct {
 			u32 datalen;
 			char *data;
+			struct kvec first;
 		} link;   /* NF4LNK */
 		struct {
 			u32 specdata1;
@@ -124,6 +125,7 @@ struct nfsd4_create {
 };
 #define cr_datalen	u.link.datalen
 #define cr_data		u.link.data
+#define cr_first	u.link.first
 #define cr_specdata1	u.dev.specdata1
 #define cr_specdata2	u.dev.specdata2
 
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 6dc7879..4eeeca8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -496,6 +496,8 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
 unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
 					 struct kvec *first, size_t total);
+char		  *svc_fill_symlink_pathname(struct svc_rqst *rqstp,
+					     struct kvec *first, size_t total);
 
 #define	RPC_MAX_ADDRBUFLEN	(63U)
 
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index a155e2d..30a4226 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1575,3 +1575,70 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
 	return i;
 }
 EXPORT_SYMBOL_GPL(svc_fill_write_vector);
+
+/**
+ * svc_fill_symlink_pathname - Construct pathname argument for VFS symlink call
+ * @rqstp: svc_rqst to operate on
+ * @first: buffer containing first section of pathname
+ * @total: total length of the pathname argument
+ *
+ * Returns pointer to a NUL-terminated string, or an ERR_PTR. The buffer is
+ * released automatically when @rqstp is recycled.
+ */
+char *svc_fill_symlink_pathname(struct svc_rqst *rqstp, struct kvec *first,
+				size_t total)
+{
+	struct xdr_buf *arg = &rqstp->rq_arg;
+	struct page **pages;
+	char *result;
+
+	/* VFS API demands a NUL-terminated pathname. This function
+	 * uses a page from @rqstp as the pathname buffer, to enable
+	 * direct placement. Thus the total buffer size is PAGE_SIZE.
+	 * Space in this buffer for NUL-termination requires that we
+	 * cap the size of the returned symlink pathname just a
+	 * little early.
+	 */
+	if (total > PAGE_SIZE - 1)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	/* Some types of transport can present the pathname entirely
+	 * in rq_arg.pages. If not, then copy the pathname into one
+	 * page.
+	 */
+	pages = arg->pages;
+	WARN_ON_ONCE(arg->page_base != 0);
+	if (first->iov_base == 0) {
+		result = page_address(*pages);
+		result[total] = '\0';
+	} else {
+		size_t len, remaining;
+		char *dst;
+
+		result = page_address(*(rqstp->rq_next_page++));
+		dst = result;
+		remaining = total;
+
+		len = min_t(size_t, total, first->iov_len);
+		memcpy(dst, first->iov_base, len);
+		dst += len;
+		remaining -= len;
+
+		/* No more than one page left */
+		if (remaining) {
+			len = min_t(size_t, remaining, PAGE_SIZE);
+			memcpy(dst, page_address(*pages), len);
+			dst += len;
+		}
+
+		*dst = '\0';
+	}
+
+	/* Sanity check: we don't allow the pathname argument to
+	 * contain a NUL byte.
+	 */
+	if (strlen(result) != total)
+		return ERR_PTR(-EINVAL);
+	return result;
+}
+EXPORT_SYMBOL_GPL(svc_fill_symlink_pathname);



* Re: [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders
  2018-03-19 18:56 ` [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders Chuck Lever
@ 2018-03-23 21:32   ` J. Bruce Fields
  2018-03-23 21:55     ` Chuck Lever
  0 siblings, 1 reply; 8+ messages in thread
From: J. Bruce Fields @ 2018-03-23 21:32 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs

If I understand correctly, in the v4 case you change the way the page
data is passed by using rq_arg.pages instead of a list of pages for each
write.

I don't think that handles the case of compounds with multiple writes.

If all the writes are contained in a page, then you may be OK, since the
wr_head iovec is all the information you need to pass the write data
from the xdr code to the proc code.

But if there are multiple larger writes, you also need to know where the
page data is.  You're depending on rq_arg.pages for that, but there's
only one of those per compound.

To catch the regression I think we'd need a test that sends a compound
with two greater-than-one-page writes and distinct write data and then
reads back the results to check they're correct.  I don't think we have
one currently, though it wouldn't be hard to do in pynfs.

It's honestly not an important case, but I'd rather not break it.

--b.

On Mon, Mar 19, 2018 at 02:56:48PM -0400, Chuck Lever wrote:
> Move common code in NFSD's write arg decoders into a helper. The
> immediate benefit is reduction of code duplication and some nice
> micro-optimizations (see below).
> 
> However, in the long term, this helper could perform a per-transport
> call-out to fill the rq_vec (say, using RDMA Reads).
> 
> The legacy WRITE decoders and procs are changed to work like NFSv4,
> which constructs the rq_vec just before it is about to call
> vfs_writev.
> 
> Why? Calling a transport call-out from the proc instead of the XDR
> decoder means that the incoming FH can be resolved to a particular
> filesystem and file. This would allow pages from the backing file to
> be presented to the transport to be filled, rather than presenting
> anonymous pages and copying or swapping them into the file's page
> cache later.
> 
> I also prefer using the pages in rq_arg.pages, instead of pulling
> the data pages directly out of the rqstp::rq_pages array. This is
> currently the way the NFSv3 write decoder works, but the other two
> do not seem to take this approach. Fixing this removes the only
> reference to rq_pages found in NFSD, eliminating an NFSD assumption
> about how transports use the pages in rq_pages.
> 
> Lastly, avoid setting up the first element of rq_vec as a zero-
> length buffer. This happens with an RDMA transport when a normal
> Read chunk is present because the data payload is in rq_arg's
> page list (none of it is in the head buffer).
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/nfs3proc.c         |    8 ++++++--
>  fs/nfsd/nfs3xdr.c          |   16 ++++------------
>  fs/nfsd/nfs4proc.c         |   30 ++++++------------------------
>  fs/nfsd/nfs4xdr.c          |    1 -
>  fs/nfsd/nfsproc.c          |    9 +++++++--
>  fs/nfsd/nfsxdr.c           |   14 ++------------
>  fs/nfsd/xdr.h              |    2 +-
>  fs/nfsd/xdr3.h             |    2 +-
>  fs/nfsd/xdr4.h             |    1 -
>  include/linux/sunrpc/svc.h |    2 ++
>  net/sunrpc/svc.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
>  11 files changed, 71 insertions(+), 56 deletions(-)
> 
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index 1d0ce3c..2dd95eb 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -192,6 +192,7 @@
>  	struct nfsd3_writeres *resp = rqstp->rq_resp;
>  	__be32	nfserr;
>  	unsigned long cnt = argp->len;
> +	unsigned int nvecs;
>  
>  	dprintk("nfsd: WRITE(3)    %s %d bytes at %Lu%s\n",
>  				SVCFH_fmt(&argp->fh),
> @@ -201,9 +202,12 @@
>  
>  	fh_copy(&resp->fh, &argp->fh);
>  	resp->committed = argp->stable;
> +	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
> +	if (!nvecs)
> +		RETURN_STATUS(nfserr_io);
>  	nfserr = nfsd_write(rqstp, &resp->fh, argp->offset,
> -				rqstp->rq_vec, argp->vlen,
> -				&cnt, resp->committed);
> +			    rqstp->rq_vec, nvecs, &cnt,
> +			    resp->committed);
>  	resp->count = cnt;
>  	RETURN_STATUS(nfserr);
>  }
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 1a70581..e19fc5d 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -391,7 +391,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>  nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
>  {
>  	struct nfsd3_writeargs *args = rqstp->rq_argp;
> -	unsigned int len, v, hdr, dlen;
> +	unsigned int len, hdr, dlen;
>  	u32 max_blocksize = svc_max_payload(rqstp);
>  	struct kvec *head = rqstp->rq_arg.head;
>  	struct kvec *tail = rqstp->rq_arg.tail;
> @@ -433,17 +433,9 @@ void fill_post_wcc(struct svc_fh *fhp)
>  		args->count = max_blocksize;
>  		len = args->len = max_blocksize;
>  	}
> -	rqstp->rq_vec[0].iov_base = (void*)p;
> -	rqstp->rq_vec[0].iov_len = head->iov_len - hdr;
> -	v = 0;
> -	while (len > rqstp->rq_vec[v].iov_len) {
> -		len -= rqstp->rq_vec[v].iov_len;
> -		v++;
> -		rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
> -		rqstp->rq_vec[v].iov_len = PAGE_SIZE;
> -	}
> -	rqstp->rq_vec[v].iov_len = len;
> -	args->vlen = v + 1;
> +
> +	args->first.iov_base = (void *)p;
> +	args->first.iov_len = head->iov_len - hdr;
>  	return 1;
>  }
>  
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 0df37e0..cf30544 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -974,24 +974,6 @@ static __be32 nfsd4_do_lookupp(struct svc_rqst *rqstp, struct svc_fh *fh)
>  	return status;
>  }
>  
> -static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
> -{
> -        int i = 1;
> -        int buflen = write->wr_buflen;
> -
> -        vec[0].iov_base = write->wr_head.iov_base;
> -        vec[0].iov_len = min_t(int, buflen, write->wr_head.iov_len);
> -        buflen -= vec[0].iov_len;
> -
> -        while (buflen) {
> -                vec[i].iov_base = page_address(write->wr_pagelist[i - 1]);
> -                vec[i].iov_len = min_t(int, PAGE_SIZE, buflen);
> -                buflen -= vec[i].iov_len;
> -                i++;
> -        }
> -        return i;
> -}
> -
>  static __be32
>  nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	    union nfsd4_op_u *u)
> @@ -1000,8 +982,8 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
>  	stateid_t *stateid = &write->wr_stateid;
>  	struct file *filp = NULL;
>  	__be32 status = nfs_ok;
> +	unsigned int nvecs;
>  	unsigned long cnt;
> -	int nvecs;
>  
>  	if (write->wr_offset >= OFFSET_MAX)
>  		return nfserr_inval;
> @@ -1019,12 +1001,12 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
>  	write->wr_how_written = write->wr_stable_how;
>  	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
>  
> -	nvecs = fill_in_write_vector(rqstp->rq_vec, write);
> -	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
> -
> +	nvecs = svc_fill_write_vector(rqstp, &write->wr_head, cnt);
> +	if (!nvecs)
> +		return nfserr_io;
>  	status = nfsd_vfs_write(rqstp, &cstate->current_fh, filp,
> -				write->wr_offset, rqstp->rq_vec, nvecs, &cnt,
> -				write->wr_how_written);
> +				write->wr_offset, rqstp->rq_vec, nvecs,
> +				&cnt, write->wr_how_written);
>  	fput(filp);
>  
>  	write->wr_bytes_written = cnt;
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index f75cc7d..d33c8aa 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1286,7 +1286,6 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compoundargs *argp, struct xdr_ne
>  	}
>  	write->wr_head.iov_base = p;
>  	write->wr_head.iov_len = avail;
> -	write->wr_pagelist = argp->pagelist;
>  
>  	len = XDR_QUADLEN(write->wr_buflen) << 2;
>  	if (len >= avail) {
> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> index 43c0419..1995ea6 100644
> --- a/fs/nfsd/nfsproc.c
> +++ b/fs/nfsd/nfsproc.c
> @@ -212,13 +212,18 @@
>  	struct nfsd_attrstat *resp = rqstp->rq_resp;
>  	__be32	nfserr;
>  	unsigned long cnt = argp->len;
> +	unsigned int nvecs;
>  
>  	dprintk("nfsd: WRITE    %s %d bytes at %d\n",
>  		SVCFH_fmt(&argp->fh),
>  		argp->len, argp->offset);
>  
> -	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh), argp->offset,
> -				rqstp->rq_vec, argp->vlen, &cnt, NFS_DATA_SYNC);
> +	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
> +	if (!nvecs)
> +		return nfserr_io;
> +	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
> +			    argp->offset, rqstp->rq_vec, nvecs,
> +			    &cnt, NFS_DATA_SYNC);
>  	return nfsd_return_attrs(nfserr, resp);
>  }
>  
> diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
> index 79b6064..db24ae8 100644
> --- a/fs/nfsd/nfsxdr.c
> +++ b/fs/nfsd/nfsxdr.c
> @@ -287,7 +287,6 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f
>  	struct nfsd_writeargs *args = rqstp->rq_argp;
>  	unsigned int len, hdr, dlen;
>  	struct kvec *head = rqstp->rq_arg.head;
> -	int v;
>  
>  	p = decode_fh(p, &args->fh);
>  	if (!p)
> @@ -323,17 +322,8 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f
>  	if (dlen < XDR_QUADLEN(len)*4)
>  		return 0;
>  
> -	rqstp->rq_vec[0].iov_base = (void*)p;
> -	rqstp->rq_vec[0].iov_len = head->iov_len - hdr;
> -	v = 0;
> -	while (len > rqstp->rq_vec[v].iov_len) {
> -		len -= rqstp->rq_vec[v].iov_len;
> -		v++;
> -		rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
> -		rqstp->rq_vec[v].iov_len = PAGE_SIZE;
> -	}
> -	rqstp->rq_vec[v].iov_len = len;
> -	args->vlen = v + 1;
> +	args->first.iov_base = (void *)p;
> +	args->first.iov_len = head->iov_len - hdr;
>  	return 1;
>  }
>  
> diff --git a/fs/nfsd/xdr.h b/fs/nfsd/xdr.h
> index 2f4f22e..a765c41 100644
> --- a/fs/nfsd/xdr.h
> +++ b/fs/nfsd/xdr.h
> @@ -34,7 +34,7 @@ struct nfsd_writeargs {
>  	svc_fh			fh;
>  	__u32			offset;
>  	int			len;
> -	int			vlen;
> +	struct kvec		first;
>  };
>  
>  struct nfsd_createargs {
> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
> index 056bf8a..deccf7f 100644
> --- a/fs/nfsd/xdr3.h
> +++ b/fs/nfsd/xdr3.h
> @@ -41,7 +41,7 @@ struct nfsd3_writeargs {
>  	__u32			count;
>  	int			stable;
>  	__u32			len;
> -	int			vlen;
> +	struct kvec		first;
>  };
>  
>  struct nfsd3_createargs {
> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
> index bc29511..d56219d 100644
> --- a/fs/nfsd/xdr4.h
> +++ b/fs/nfsd/xdr4.h
> @@ -390,7 +390,6 @@ struct nfsd4_write {
>  	u32		wr_stable_how;      /* request */
>  	u32		wr_buflen;          /* request */
>  	struct kvec	wr_head;
> -	struct page **	wr_pagelist;        /* request */
>  
>  	u32		wr_bytes_written;   /* response */
>  	u32		wr_how_written;     /* response */
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 1939709..6dc7879 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -494,6 +494,8 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
>  void		   svc_reserve(struct svc_rqst *rqstp, int space);
>  struct svc_pool *  svc_pool_for_cpu(struct svc_serv *serv, int cpu);
>  char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
> +unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
> +					 struct kvec *first, size_t total);
>  
>  #define	RPC_MAX_ADDRBUFLEN	(63U)
>  
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index f19987f..a155e2d 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -1533,3 +1533,45 @@ u32 svc_max_payload(const struct svc_rqst *rqstp)
>  	return max;
>  }
>  EXPORT_SYMBOL_GPL(svc_max_payload);
> +
> +/**
> + * svc_fill_write_vector - Construct data argument for VFS write call
> + * @rqstp: svc_rqst to operate on
> + * @first: buffer containing first section of write payload
> + * @total: total number of bytes of write payload
> + *
> + * Returns the number of elements populated in the data argument array.
> + */
> +unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
> +				   size_t total)
> +{
> +	struct kvec *vec = rqstp->rq_vec;
> +	struct page **pages;
> +	unsigned int i;
> +
> +	/* Some types of transport can present the write payload
> +	 * entirely in rq_arg.pages. In this case, @first is empty.
> +	 */
> +	i = 0;
> +	if (first->iov_len) {
> +		vec[i].iov_base = first->iov_base;
> +		vec[i].iov_len = min_t(size_t, total, first->iov_len);
> +		total -= vec[i].iov_len;
> +		++i;
> +	}
> +
> +	WARN_ON_ONCE(rqstp->rq_arg.page_base != 0);
> +	pages = rqstp->rq_arg.pages;
> +	while (total) {
> +		vec[i].iov_base = page_address(*pages);
> +		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
> +		total -= vec[i].iov_len;
> +		++i;
> +
> +		++pages;
> +	}
> +
> +	WARN_ON_ONCE(i > ARRAY_SIZE(rqstp->rq_vec));
> +	return i;
> +}
> +EXPORT_SYMBOL_GPL(svc_fill_write_vector);
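As a sanity check on the loop above, the same splitting logic can be modeled in plain user-space C. This is only a sketch: `fill_write_vector`, the fixed `PAGE_SIZE`, and the bare page array are stand-ins for the kernel's svc_rqst fields, not a real API.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define MAX_VEC 8

struct kvec {
	void	*iov_base;
	size_t	 iov_len;
};

/*
 * User-space model of the svc_fill_write_vector() loop: describe
 * @total payload bytes as an iovec array, taking the first chunk
 * from the head buffer (when non-empty) and the rest from whole
 * pages.
 */
static unsigned int fill_write_vector(struct kvec *vec,
				      const struct kvec *first,
				      char **pages, size_t total)
{
	unsigned int i = 0;

	/* A transport may place the whole payload in the page list;
	 * then @first is empty and gets no (zero-length) element. */
	if (first->iov_len) {
		vec[i].iov_base = first->iov_base;
		vec[i].iov_len = total < first->iov_len ?
					total : first->iov_len;
		total -= vec[i].iov_len;
		i++;
	}
	while (total) {
		vec[i].iov_base = *pages++;
		vec[i].iov_len = total < PAGE_SIZE ? total : PAGE_SIZE;
		total -= vec[i].iov_len;
		i++;
	}
	return i;
}
```

Note that when the head kvec is empty (the RDMA Read chunk case described in the patch), the first populated element is a full page rather than a zero-length buffer.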

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders
  2018-03-23 21:32   ` J. Bruce Fields
@ 2018-03-23 21:55     ` Chuck Lever
  2018-03-23 22:37       ` Chuck Lever
  0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2018-03-23 21:55 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List



> On Mar 23, 2018, at 5:32 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> If I understand correctly, in the v4 case you change the way the page
> data is passed by using rq_arg.pages instead of a list of pages for each
> write.
>
> I don't think that handles the case of compounds with multiple writes.
>
> If all the writes are contained in a page, then you may be OK, since the
> wr_head iovec is all the information you need to pass the write data
> from the xdr code to the proc code.
>
> But if there are multiple larger writes, you also need to know where the
> page data is.  You're depending on rq_arg.pages for that, but there's
> only one of those per compound.

I thought we were going to handle that case by chaining xdr_bufs--
one per NFSv4 WRITE operation. There has to be some structured way
to pass distinct write payloads up to the NFS server, otherwise
direct placement is impossible.


> To catch the regression I think we'd need a test that sends a compound
> with two greater-than-one-page writes and distinct write data and then
> reads back the results to check they're correct.  I don't think we have
> one currently, though it wouldn't be hard to do in pynfs.
>
> It's honestly not an important case, but I'd rather not break it.

I believe Solaris handles direct writev() using multiple WRITE
operations per COMPOUND, but I could be wrong.


> --b.
>
> On Mon, Mar 19, 2018 at 02:56:48PM -0400, Chuck Lever wrote:
>> Move common code in NFSD's write arg decoders into a helper. The
>> immediate benefit is reduction of code duplication and some nice
>> micro-optimizations (see below).
>>
>> However, in the long term, this helper could perform a per-transport
>> call-out to fill the rq_vec (say, using RDMA Reads).
>>
>> The legacy WRITE decoders and procs are changed to work like NFSv4,
>> which constructs the rq_vec just before it is about to call
>> vfs_writev.
>>
>> Why? Calling a transport call-out from the proc instead of the XDR
>> decoder means that the incoming FH can be resolved to a particular
>> filesystem and file. This would allow pages from the backing file to
>> be presented to the transport to be filled, rather than presenting
>> anonymous pages and copying or swapping them into the file's page
>> cache later.
>>
>> I also prefer using the pages in rq_arg.pages, instead of pulling
>> the data pages directly out of the rqstp::rq_pages array. This is
>> currently the way the NFSv3 write decoder works, but the other two
>> do not seem to take this approach. Fixing this removes the only
>> reference to rq_pages found in NFSD, eliminating an NFSD assumption
>> about how transports use the pages in rq_pages.
>>
>> Lastly, avoid setting up the first element of rq_vec as a zero-
>> length buffer. This happens with an RDMA transport when a normal
>> Read chunk is present because the data payload is in rq_arg's
>> page list (none of it is in the head buffer).
>>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> fs/nfsd/nfs3proc.c         |    8 ++++++--
>> fs/nfsd/nfs3xdr.c          |   16 ++++------------
>> fs/nfsd/nfs4proc.c         |   30 ++++++------------------------
>> fs/nfsd/nfs4xdr.c          |    1 -
>> fs/nfsd/nfsproc.c          |    9 +++++++--
>> fs/nfsd/nfsxdr.c           |   14 ++------------
>> fs/nfsd/xdr.h              |    2 +-
>> fs/nfsd/xdr3.h             |    2 +-
>> fs/nfsd/xdr4.h             |    1 -
>> include/linux/sunrpc/svc.h |    2 ++
>> net/sunrpc/svc.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
>> 11 files changed, 71 insertions(+), 56 deletions(-)

--
Chuck Lever





* Re: [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders
  2018-03-23 21:55     ` Chuck Lever
@ 2018-03-23 22:37       ` Chuck Lever
  2018-03-24  2:40         ` Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2018-03-23 22:37 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List



> On Mar 23, 2018, at 5:55 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>=20
>> On Mar 23, 2018, at 5:32 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>
>> If I understand correctly, in the v4 case you change the way the page
>> data is passed by using rq_arg.pages instead of a list of pages for each
>> write.
>>
>> I don't think that handles the case of compounds with multiple writes.
>>
>> If all the writes are contained in a page, then you may be OK, since the
>> wr_head iovec is all the information you need to pass the write data
>> from the xdr code to the proc code.
>>
>> But if there are multiple larger writes, you also need to know where the
>> page data is.  You're depending on rq_arg.pages for that, but there's
>> only one of those per compound.
>
> I thought we were going to handle that case by chaining xdr_bufs--
> one per NFSv4 WRITE operation. There has to be some structured way
> to pass distinct write payloads up to the NFS server, otherwise
> direct placement is impossible.

Thinking on this a little more, I think you are saying the new shared
decoder is adequate for NFSv2 and NFSv3, but not for NFSv4. Would you
accept a patch that kept the NFSv2 and NFSv3 parts of this patch, but
dropped the NFSv4-related hunks?

Likewise for the symlink decoder patch?

And we can still use a transport call-out in the XDR decoders, just as
we have discussed. (I think I've misremembered our discussion about
chaining xdr_bufs).

It's just that the XDR decoder can't be shared between NFSv4 and the
NFSv2/NFSv3 code paths.
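
To make the multiple-WRITE hazard concrete: if one shared page run carries two payloads back to back, the second payload starts mid-page, which a helper that always begins at pages[0] with a zero page_base cannot describe. A minimal arithmetic sketch, with hypothetical payload sizes (this is not kernel code):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Locate where a payload that begins @start bytes into a shared
 * page run actually lives: which page index, and at what byte
 * offset within that page. */
static void locate_payload(size_t start, size_t *page, size_t *offset)
{
	*page = start / PAGE_SIZE;
	*offset = start % PAGE_SIZE;
}
```

A 6000-byte first WRITE leaves the second WRITE's data starting 1904 bytes into page 1; recovering that position needs per-write placement information, not just the compound-wide page array.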


>> To catch the regression I think we'd need a test that sends a compound
>> with two greater-than-one-page writes and distinct write data and then
>> reads back the results to check they're correct.  I don't think we have
>> one currently, though it wouldn't be hard to do in pynfs.
>>
>> It's honestly not an important case, but I'd rather not break it.
>
> I believe Solaris handles direct writev() using multiple WRITE
> operations per COMPOUND, but I could be wrong.
>
>
>> --b.
>>=20
>> On Mon, Mar 19, 2018 at 02:56:48PM -0400, Chuck Lever wrote:
>>> Move common code in NFSD's write arg decoders into a helper. The
>>> immediate benefit is reduction of code duplication and some nice
>>> micro-optimizations (see below).
>>>=20
>>> However, in the long term, this helper could perform a per-transport
>>> call-out to fill the rq_vec (say, using RDMA Reads).
>>>=20
>>> The legacy WRITE decoders and procs are changed to work like NFSv4,
>>> which constructs the rq_vec just before it is about to call
>>> vfs_writev.
>>>=20
>>> Why? Calling a transport call-out from the proc instead of the XDR
>>> decoder means that the incoming FH can be resolved to a particular
>>> filesystem and file. This would allow pages from the backing file to
>>> be presented to the transport to be filled, rather than presenting
>>> anonymous pages and copying or swapping them into the file's page
>>> cache later.
>>>=20
>>> I also prefer using the pages in rq_arg.pages, instead of pulling
>>> the data pages directly out of the rqstp::rq_pages array. This is
>>> currently the way the NFSv3 write decoder works, but the other two
>>> do not seem to take this approach. Fixing this removes the only
>>> reference to rq_pages found in NFSD, eliminating an NFSD assumption
>>> about how transports use the pages in rq_pages.
>>>=20
>>> Lastly, avoid setting up the first element of rq_vec as a zero-
>>> length buffer. This happens with an RDMA transport when a normal
>>> Read chunk is present because the data payload is in rq_arg's
>>> page list (none of it is in the head buffer).
>>>=20
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>> fs/nfsd/nfs3proc.c         |    8 ++++++--
>>> fs/nfsd/nfs3xdr.c          |   16 ++++------------
>>> fs/nfsd/nfs4proc.c         |   30 ++++++------------------------
>>> fs/nfsd/nfs4xdr.c          |    1 -
>>> fs/nfsd/nfsproc.c          |    9 +++++++--
>>> fs/nfsd/nfsxdr.c           |   14 ++------------
>>> fs/nfsd/xdr.h              |    2 +-
>>> fs/nfsd/xdr3.h             |    2 +-
>>> fs/nfsd/xdr4.h             |    1 -
>>> include/linux/sunrpc/svc.h |    2 ++
>>> net/sunrpc/svc.c           |   42 =
++++++++++++++++++++++++++++++++++++++++++
>>> 11 files changed, 71 insertions(+), 56 deletions(-)
>>>=20
>>> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
>>> index 1d0ce3c..2dd95eb 100644
>>> --- a/fs/nfsd/nfs3proc.c
>>> +++ b/fs/nfsd/nfs3proc.c
>>> @@ -192,6 +192,7 @@
>>> 	struct nfsd3_writeres *resp =3D rqstp->rq_resp;
>>> 	__be32	nfserr;
>>> 	unsigned long cnt =3D argp->len;
>>> +	unsigned int nvecs;
>>>=20
>>> 	dprintk("nfsd: WRITE(3)    %s %d bytes at %Lu%s\n",
>>> 				SVCFH_fmt(&argp->fh),
>>> @@ -201,9 +202,12 @@
>>>=20
>>> 	fh_copy(&resp->fh, &argp->fh);
>>> 	resp->committed =3D argp->stable;
>>> +	nvecs =3D svc_fill_write_vector(rqstp, &argp->first, cnt);
>>> +	if (!nvecs)
>>> +		RETURN_STATUS(nfserr_io);
>>> 	nfserr =3D nfsd_write(rqstp, &resp->fh, argp->offset,
>>> -				rqstp->rq_vec, argp->vlen,
>>> -				&cnt, resp->committed);
>>> +			    rqstp->rq_vec, nvecs, &cnt,
>>> +			    resp->committed);
>>> 	resp->count =3D cnt;
>>> 	RETURN_STATUS(nfserr);
>>> }
>>> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
>>> index 1a70581..e19fc5d 100644
>>> --- a/fs/nfsd/nfs3xdr.c
>>> +++ b/fs/nfsd/nfs3xdr.c
>>> @@ -391,7 +391,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>>> nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
>>> {
>>> 	struct nfsd3_writeargs *args =3D rqstp->rq_argp;
>>> -	unsigned int len, v, hdr, dlen;
>>> +	unsigned int len, hdr, dlen;
>>> 	u32 max_blocksize =3D svc_max_payload(rqstp);
>>> 	struct kvec *head =3D rqstp->rq_arg.head;
>>> 	struct kvec *tail =3D rqstp->rq_arg.tail;
>>> @@ -433,17 +433,9 @@ void fill_post_wcc(struct svc_fh *fhp)
>>> 		args->count =3D max_blocksize;
>>> 		len =3D args->len =3D max_blocksize;
>>> 	}
>>> -	rqstp->rq_vec[0].iov_base =3D (void*)p;
>>> -	rqstp->rq_vec[0].iov_len =3D head->iov_len - hdr;
>>> -	v =3D 0;
>>> -	while (len > rqstp->rq_vec[v].iov_len) {
>>> -		len -=3D rqstp->rq_vec[v].iov_len;
>>> -		v++;
>>> -		rqstp->rq_vec[v].iov_base =3D =
page_address(rqstp->rq_pages[v]);
>>> -		rqstp->rq_vec[v].iov_len =3D PAGE_SIZE;
>>> -	}
>>> -	rqstp->rq_vec[v].iov_len =3D len;
>>> -	args->vlen =3D v + 1;
>>> +
>>> +	args->first.iov_base =3D (void *)p;
>>> +	args->first.iov_len =3D head->iov_len - hdr;
>>> 	return 1;
>>> }
>>>=20
>>> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
>>> index 0df37e0..cf30544 100644
>>> --- a/fs/nfsd/nfs4proc.c
>>> +++ b/fs/nfsd/nfs4proc.c
>>> @@ -974,24 +974,6 @@ static __be32 nfsd4_do_lookupp(struct svc_rqst =
*rqstp, struct svc_fh *fh)
>>> 	return status;
>>> }
>>>=20
>>> -static int fill_in_write_vector(struct kvec *vec, struct =
nfsd4_write *write)
>>> -{
>>> -        int i =3D 1;
>>> -        int buflen =3D write->wr_buflen;
>>> -
>>> -        vec[0].iov_base =3D write->wr_head.iov_base;
>>> -        vec[0].iov_len =3D min_t(int, buflen, =
write->wr_head.iov_len);
>>> -        buflen -=3D vec[0].iov_len;
>>> -
>>> -        while (buflen) {
>>> -                vec[i].iov_base =3D =
page_address(write->wr_pagelist[i - 1]);
>>> -                vec[i].iov_len =3D min_t(int, PAGE_SIZE, buflen);
>>> -                buflen -=3D vec[i].iov_len;
>>> -                i++;
>>> -        }
>>> -        return i;
>>> -}
>>> -
>>> static __be32
>>> nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state =
*cstate,
>>> 	    union nfsd4_op_u *u)
>>> @@ -1000,8 +982,8 @@ static int fill_in_write_vector(struct kvec =
*vec, struct nfsd4_write *write)
>>> 	stateid_t *stateid =3D &write->wr_stateid;
>>> 	struct file *filp =3D NULL;
>>> 	__be32 status =3D nfs_ok;
>>> +	unsigned int nvecs;
>>> 	unsigned long cnt;
>>> -	int nvecs;
>>>=20
>>> 	if (write->wr_offset >=3D OFFSET_MAX)
>>> 		return nfserr_inval;
>>> @@ -1019,12 +1001,12 @@ static int fill_in_write_vector(struct kvec =
*vec, struct nfsd4_write *write)
>>> 	write->wr_how_written =3D write->wr_stable_how;
>>> 	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
>>>=20
>>> -	nvecs =3D fill_in_write_vector(rqstp->rq_vec, write);
>>> -	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
>>> -
>>> +	nvecs =3D svc_fill_write_vector(rqstp, &write->wr_head, cnt);
>>> +	if (!nvecs)
>>> +		return nfserr_io;
>>> 	status =3D nfsd_vfs_write(rqstp, &cstate->current_fh, filp,
>>> -				write->wr_offset, rqstp->rq_vec, nvecs, =
&cnt,
>>> -				write->wr_how_written);
>>> +				write->wr_offset, rqstp->rq_vec, nvecs,
>>> +				&cnt, write->wr_how_written);
>>> 	fput(filp);
>>>=20
>>> 	write->wr_bytes_written =3D cnt;
>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>> index f75cc7d..d33c8aa 100644
>>> --- a/fs/nfsd/nfs4xdr.c
>>> +++ b/fs/nfsd/nfs4xdr.c
>>> @@ -1286,7 +1286,6 @@ static __be32 nfsd4_decode_opaque(struct =
nfsd4_compoundargs *argp, struct xdr_ne
>>> 	}
>>> 	write->wr_head.iov_base =3D p;
>>> 	write->wr_head.iov_len =3D avail;
>>> -	write->wr_pagelist =3D argp->pagelist;
>>>=20
>>> 	len =3D XDR_QUADLEN(write->wr_buflen) << 2;
>>> 	if (len >=3D avail) {
>>> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
>>> index 43c0419..1995ea6 100644
>>> --- a/fs/nfsd/nfsproc.c
>>> +++ b/fs/nfsd/nfsproc.c
>>> @@ -212,13 +212,18 @@
>>> 	struct nfsd_attrstat *resp =3D rqstp->rq_resp;
>>> 	__be32	nfserr;
>>> 	unsigned long cnt =3D argp->len;
>>> +	unsigned int nvecs;
>>>=20
>>> 	dprintk("nfsd: WRITE    %s %d bytes at %d\n",
>>> 		SVCFH_fmt(&argp->fh),
>>> 		argp->len, argp->offset);
>>>=20
>>> -	nfserr =3D nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh), =
argp->offset,
>>> -				rqstp->rq_vec, argp->vlen, &cnt, =
NFS_DATA_SYNC);
>>> +	nvecs =3D svc_fill_write_vector(rqstp, &argp->first, cnt);
>>> +	if (!nvecs)
>>> +		return nfserr_io;
>>> +	nfserr =3D nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
>>> +			    argp->offset, rqstp->rq_vec, nvecs,
>>> +			    &cnt, NFS_DATA_SYNC);
>>> 	return nfsd_return_attrs(nfserr, resp);
>>> }
>>>=20
>>> diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
>>> index 79b6064..db24ae8 100644
>>> --- a/fs/nfsd/nfsxdr.c
>>> +++ b/fs/nfsd/nfsxdr.c
>>> @@ -287,7 +287,6 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f
>>> 	struct nfsd_writeargs *args = rqstp->rq_argp;
>>> 	unsigned int len, hdr, dlen;
>>> 	struct kvec *head = rqstp->rq_arg.head;
>>> -	int v;
>>>
>>> 	p = decode_fh(p, &args->fh);
>>> 	if (!p)
>>> @@ -323,17 +322,8 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f
>>> 	if (dlen < XDR_QUADLEN(len)*4)
>>> 		return 0;
>>>
>>> -	rqstp->rq_vec[0].iov_base = (void*)p;
>>> -	rqstp->rq_vec[0].iov_len = head->iov_len - hdr;
>>> -	v = 0;
>>> -	while (len > rqstp->rq_vec[v].iov_len) {
>>> -		len -= rqstp->rq_vec[v].iov_len;
>>> -		v++;
>>> -		rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
>>> -		rqstp->rq_vec[v].iov_len = PAGE_SIZE;
>>> -	}
>>> -	rqstp->rq_vec[v].iov_len = len;
>>> -	args->vlen = v + 1;
>>> +	args->first.iov_base = (void *)p;
>>> +	args->first.iov_len = head->iov_len - hdr;
>>> 	return 1;
>>> }
>>>
>>> diff --git a/fs/nfsd/xdr.h b/fs/nfsd/xdr.h
>>> index 2f4f22e..a765c41 100644
>>> --- a/fs/nfsd/xdr.h
>>> +++ b/fs/nfsd/xdr.h
>>> @@ -34,7 +34,7 @@ struct nfsd_writeargs {
>>> 	svc_fh			fh;
>>> 	__u32			offset;
>>> 	int			len;
>>> -	int			vlen;
>>> +	struct kvec		first;
>>> };
>>>
>>> struct nfsd_createargs {
>>> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
>>> index 056bf8a..deccf7f 100644
>>> --- a/fs/nfsd/xdr3.h
>>> +++ b/fs/nfsd/xdr3.h
>>> @@ -41,7 +41,7 @@ struct nfsd3_writeargs {
>>> 	__u32			count;
>>> 	int			stable;
>>> 	__u32			len;
>>> -	int			vlen;
>>> +	struct kvec		first;
>>> };
>>>
>>> struct nfsd3_createargs {
>>> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
>>> index bc29511..d56219d 100644
>>> --- a/fs/nfsd/xdr4.h
>>> +++ b/fs/nfsd/xdr4.h
>>> @@ -390,7 +390,6 @@ struct nfsd4_write {
>>> 	u32		wr_stable_how;      /* request */
>>> 	u32		wr_buflen;          /* request */
>>> 	struct kvec	wr_head;
>>> -	struct page **	wr_pagelist;        /* request */
>>>
>>> 	u32		wr_bytes_written;   /* response */
>>> 	u32		wr_how_written;     /* response */
>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>> index 1939709..6dc7879 100644
>>> --- a/include/linux/sunrpc/svc.h
>>> +++ b/include/linux/sunrpc/svc.h
>>> @@ -494,6 +494,8 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
>>> void		   svc_reserve(struct svc_rqst *rqstp, int space);
>>> struct svc_pool *  svc_pool_for_cpu(struct svc_serv *serv, int cpu);
>>> char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
>>> +unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
>>> +					 struct kvec *first, size_t total);
>>>
>>> #define	RPC_MAX_ADDRBUFLEN	(63U)
>>>=20
>>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>>> index f19987f..a155e2d 100644
>>> --- a/net/sunrpc/svc.c
>>> +++ b/net/sunrpc/svc.c
>>> @@ -1533,3 +1533,45 @@ u32 svc_max_payload(const struct svc_rqst *rqstp)
>>> 	return max;
>>> }
>>> EXPORT_SYMBOL_GPL(svc_max_payload);
>>> +
>>> +/**
>>> + * svc_fill_write_vector - Construct data argument for VFS write call
>>> + * @rqstp: svc_rqst to operate on
>>> + * @first: buffer containing first section of write payload
>>> + * @total: total number of bytes of write payload
>>> + *
>>> + * Returns the number of elements populated in the data argument array.
>>> + */
>>> +unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first,
>>> +				   size_t total)
>>> +{
>>> +	struct kvec *vec = rqstp->rq_vec;
>>> +	struct page **pages;
>>> +	unsigned int i;
>>> +
>>> +	/* Some types of transport can present the write payload
>>> +	 * entirely in rq_arg.pages. In this case, @first is empty.
>>> +	 */
>>> +	i = 0;
>>> +	if (first->iov_len) {
>>> +		vec[i].iov_base = first->iov_base;
>>> +		vec[i].iov_len = min_t(size_t, total, first->iov_len);
>>> +		total -= vec[i].iov_len;
>>> +		++i;
>>> +	}
>>> +
>>> +	WARN_ON_ONCE(rqstp->rq_arg.page_base != 0);
>>> +	pages = rqstp->rq_arg.pages;
>>> +	while (total) {
>>> +		vec[i].iov_base = page_address(*pages);
>>> +		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
>>> +		total -= vec[i].iov_len;
>>> +		++i;
>>> +
>>> +		++pages;
>>> +	}
>>> +
>>> +	WARN_ON_ONCE(i > ARRAY_SIZE(rqstp->rq_vec));
>>> +	return i;
>>> +}
>>> +EXPORT_SYMBOL_GPL(svc_fill_write_vector);
>
> --
> Chuck Lever
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
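[Editor's note: the logic of the svc_fill_write_vector() helper quoted above can be modeled in user space roughly as follows. This is a sketch only: the local `struct kvec`, `PAGE_SIZE`, and the plain page-pointer array stand in for their kernel counterparts, and the `WARN_ON_ONCE` sanity checks are elided.]

```c
#include <stddef.h>

#define PAGE_SIZE 4096
#define RQ_MAX_VEC 8

struct kvec {
	void   *iov_base;
	size_t  iov_len;
};

/* User-space model of svc_fill_write_vector(): build an iovec-style
 * array from an optional "first" segment (payload bytes that landed in
 * the head buffer) followed by whole pages, consuming @total bytes.
 * Returns the number of elements populated. */
static unsigned int fill_write_vector(struct kvec *vec,
				      const struct kvec *first,
				      void **pages, size_t total)
{
	unsigned int i = 0;

	/* Some transports present the payload entirely in pages; then
	 * first->iov_len is zero and the head segment is skipped. */
	if (first->iov_len) {
		vec[i].iov_base = first->iov_base;
		vec[i].iov_len = total < first->iov_len ? total : first->iov_len;
		total -= vec[i].iov_len;
		++i;
	}

	while (total) {
		vec[i].iov_base = *pages++;
		vec[i].iov_len = total < PAGE_SIZE ? total : PAGE_SIZE;
		total -= vec[i].iov_len;
		++i;
	}
	return i;
}
```

For example, a payload of 100 head bytes plus PAGE_SIZE + 10 page bytes fills three vector elements of 100, 4096, and 10 bytes.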




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders
  2018-03-23 22:37       ` Chuck Lever
@ 2018-03-24  2:40         ` Bruce Fields
  2018-03-24 16:00           ` Chuck Lever
  0 siblings, 1 reply; 8+ messages in thread
From: Bruce Fields @ 2018-03-24  2:40 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS Mailing List

On Fri, Mar 23, 2018 at 06:37:51PM -0400, Chuck Lever wrote:
> 
> 
> > On Mar 23, 2018, at 5:55 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > 
> > 
> > 
> >> On Mar 23, 2018, at 5:32 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >> 
> >> If I understand correctly, in the v4 case you change the way the page
> >> data is passed by using rq_arg.pages instead of a list of pages for each
> >> write.
> >> 
> >> I don't think that handles the case of compounds with multiple writes.
> >> 
> >> If all the writes are contained in a page, then you may be OK, since the
> >> wr_head iovec is all the information you need to pass the write data
> >> from the xdr code to the proc code.
> >> 
> >> But if there are multiple larger writes, you also need to know where the
> >> page data is.  You're depending on rq_arg.pages for that, but there's
> >> only one of those per compound.
> > 
> > I thought we were going to handle that case by chaining xdr_bufs--
> > one per NFSv4 WRITE operation. There has to be some structured way
> > to pass distinct write payloads up to the NFS server, otherwise
> > direct placement is impossible.
> 
> Thinking on this a little more, I think you are saying the new shared
> decoder is adequate for NFSv2 and NFSv3, but not for NFSv4.

Yes, it would be a regression in the v4 case, but not in the v2 or v3
cases.

> Would you
> accept a patch that kept the NFSv2 and NFSv3 parts of this patch, but
> dropped the NFSv4-related hunks?
> 
> Likewise for the symlink decoder patch?

I'll have to take another look, but, could be.

> And we can still use a transport call-out in the XDR decoders, just as
> we have discussed. (I think I've misremembered our discussion about
> chaining xdr_bufs).

Maybe chaining xdr bufs would make sense, I haven't thought about it.

But we currently handle multiple IOs per compound without chaining xdr
bufs.

--b.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders
  2018-03-24  2:40         ` Bruce Fields
@ 2018-03-24 16:00           ` Chuck Lever
  0 siblings, 0 replies; 8+ messages in thread
From: Chuck Lever @ 2018-03-24 16:00 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List



> On Mar 23, 2018, at 10:40 PM, Bruce Fields <bfields@fieldses.org> wrote:
>
> On Fri, Mar 23, 2018 at 06:37:51PM -0400, Chuck Lever wrote:
>>
>>
>>> On Mar 23, 2018, at 5:55 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>>
>>>
>>>> On Mar 23, 2018, at 5:32 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>>>
>>>> If I understand correctly, in the v4 case you change the way the page
>>>> data is passed by using rq_arg.pages instead of a list of pages for each
>>>> write.
>>>>
>>>> I don't think that handles the case of compounds with multiple writes.
>>>>
>>>> If all the writes are contained in a page, then you may be OK, since the
>>>> wr_head iovec is all the information you need to pass the write data
>>>> from the xdr code to the proc code.
>>>>
>>>> But if there are multiple larger writes, you also need to know where the
>>>> page data is.  You're depending on rq_arg.pages for that, but there's
>>>> only one of those per compound.
>>>
>>> I thought we were going to handle that case by chaining xdr_bufs--
>>> one per NFSv4 WRITE operation. There has to be some structured way
>>> to pass distinct write payloads up to the NFS server, otherwise
>>> direct placement is impossible.
>>
>> Thinking on this a little more, I think you are saying the new shared
>> decoder is adequate for NFSv2 and NFSv3, but not for NFSv4.
>
> Yes, it would be a regression in the v4 case, but not in the v2 or v3
> cases.
>
>> Would you
>> accept a patch that kept the NFSv2 and NFSv3 parts of this patch, but
>> dropped the NFSv4-related hunks?
>>
>> Likewise for the symlink decoder patch?
>
> I'll have to take another look, but, could be.

Thanks!


>> And we can still use a transport call-out in the XDR decoders, just as
>> we have discussed. (I think I've misremembered our discussion about
>> chaining xdr_bufs).
>
> Maybe chaining xdr bufs would make sense, I haven't thought about it.

Hrm, chaining was your idea, actually. But chaining is not needed if
each RDMA Read is driven by the WRITE XDR decoder. For each WRITE
payload, the XDR decoder could pass an array of pages, a starting
offset into the first page, and a byte count, and the transport could
fill those pages.

A detail would be how to handle both a request that provides only an
unstructured byte stream and a request where the payloads are
structured. TCP provides only the former, RPC/RDMA can provide either
type.
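[Editor's note: a user-space sketch of what such a call-out might look like for a stream transport, where "filling" the pages is just a copy out of the unstructured receive buffer. All names here are illustrative, not a proposed kernel API; an RDMA transport would instead direct the hardware to place data in these pages.]

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Model of a per-payload transport call-out: the WRITE decoder hands
 * the transport a page array, a starting offset into the first page,
 * and a byte count; the transport places the payload there. For a
 * stream transport (TCP) this degenerates to a memcpy from the
 * receive buffer. Returns the number of bytes placed. */
static size_t fill_payload_pages(const char *stream, size_t stream_len,
				 char **pages, size_t offset, size_t count)
{
	size_t copied = 0;
	unsigned int pg = 0;

	if (count > stream_len)
		count = stream_len;

	while (copied < count) {
		size_t space = PAGE_SIZE - offset;
		size_t n = count - copied < space ? count - copied : space;

		memcpy(pages[pg] + offset, stream + copied, n);
		copied += n;
		offset = 0;	/* only the first page starts mid-page */
		pg++;
	}
	return copied;
}
```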


> But we currently handle multiple IOs per compound without chaining xdr
> bufs.

Yes, by using one or more pull-up data copies. The point is we want to
enable the transport hardware to place each payload where no re-alignment
is needed.
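[Editor's note: the "pull-up" mentioned here can be illustrated in user space as the copy that flattens a scattered payload into one contiguous buffer -- exactly the per-WRITE cost direct placement would avoid. A sketch; the local `struct kvec` stands in for the kernel's.]

```c
#include <stddef.h>
#include <string.h>

struct kvec {
	void   *iov_base;
	size_t  iov_len;
};

/* Flatten a payload scattered across several segments into one
 * contiguous buffer at @dst. This models the copy the server pays
 * when a payload does not arrive already aligned where the
 * filesystem wants it. Returns the total byte count. */
static size_t pull_up(char *dst, const struct kvec *vec, unsigned int nvecs)
{
	size_t off = 0;
	unsigned int i;

	for (i = 0; i < nvecs; i++) {
		memcpy(dst + off, vec[i].iov_base, vec[i].iov_len);
		off += vec[i].iov_len;
	}
	return off;	/* payload is now contiguous at dst */
}
```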


--
Chuck Lever




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-03-24 16:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-19 18:56 [PATCH v1 0/2] NFS decoder clean-ups, redux Chuck Lever
2018-03-19 18:56 ` [PATCH v1 1/2] NFSD: Clean up write argument XDR decoders Chuck Lever
2018-03-23 21:32   ` J. Bruce Fields
2018-03-23 21:55     ` Chuck Lever
2018-03-23 22:37       ` Chuck Lever
2018-03-24  2:40         ` Bruce Fields
2018-03-24 16:00           ` Chuck Lever
2018-03-19 18:56 ` [PATCH v1 2/2] NFSD: Clean up symlink " Chuck Lever
