linux-rdma.vger.kernel.org archive mirror
* [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders
@ 2021-08-31 19:05 Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 1/6] SUNRPC: Capture value of xdr_buf::page_base Chuck Lever
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Chuck Lever @ 2021-08-31 19:05 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-rdma

Hi Bruce-

Here is part of what we discussed recently about trying to align
pages in NFS WRITE requests so that splice can be used. This
series updates server-side RPC and svcrdma to ensure that an
aligned xdr_buf::pages array is presented to NFSD, which is then
converted into an aligned rq_vec for the VFS layer.

The next step would be to look at how to make the best use of the
aligned rq_vec. My naive thought is that where there is a PAGE_SIZE
entry in rq_vec and there is no page in the file's page cache at
that offset, the transport-provided page can be flipped into place.
Might work for replacing whole pages as well, but baby steps first.

This series has been exercised a bit with both TCP and RDMA, but no
guarantees that it is completely bug-free. NFSv4 compounds with
multiple WRITE payloads on RDMA are treated like TCP: the RPC
message is contained in an unstructured stream of unaligned pages.

Comments encouraged.

---

Chuck Lever (6):
      SUNRPC: Capture value of xdr_buf::page_base
      SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
      NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
      SUNRPC: svc_fill_write_vector() must handle non-zero page_bases
      NFSD: Add a transport hook for pulling argument payloads
      svcrdma: Pull Read chunks in ->xpo_argument_payload


 fs/nfsd/nfs3proc.c                       |   3 +-
 fs/nfsd/nfs3xdr.c                        |  16 +--
 fs/nfsd/nfs4proc.c                       |   3 +-
 fs/nfsd/nfs4xdr.c                        |   6 +
 fs/nfsd/nfsproc.c                        |   3 +-
 fs/nfsd/nfsxdr.c                         |  13 +--
 fs/nfsd/xdr.h                            |   2 +-
 fs/nfsd/xdr3.h                           |   2 +-
 include/linux/sunrpc/svc.h               |   6 +-
 include/linux/sunrpc/svc_rdma.h          |   8 ++
 include/linux/sunrpc/svc_xprt.h          |   3 +
 include/trace/events/rpcrdma.h           |  26 +++++
 include/trace/events/sunrpc.h            |  20 +++-
 net/sunrpc/svc.c                         |  38 +++++--
 net/sunrpc/svcsock.c                     |   8 ++
 net/sunrpc/xdr.c                         |  32 +++---
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  37 +++++-
 net/sunrpc/xprtrdma/svc_rdma_rw.c        | 139 ++++++++++++++++++++---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   1 +
 19 files changed, 292 insertions(+), 74 deletions(-)

--
Chuck Lever


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH RFC 1/6] SUNRPC: Capture value of xdr_buf::page_base
  2021-08-31 19:05 [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders Chuck Lever
@ 2021-08-31 19:05 ` Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 2/6] SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases Chuck Lever
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2021-08-31 19:05 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-rdma

This value will be non-zero more often once subsequent patches are
applied. Knowing its value can provide important diagnostic
information.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/sunrpc.h |   20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index d323f5a049c8..f13de06b7c1d 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -62,6 +62,7 @@ DECLARE_EVENT_CLASS(rpc_xdr_buf_class,
 		__field(size_t, head_len)
 		__field(const void *, tail_base)
 		__field(size_t, tail_len)
+		__field(unsigned int, page_base)
 		__field(unsigned int, page_len)
 		__field(unsigned int, msg_len)
 	),
@@ -74,14 +75,17 @@ DECLARE_EVENT_CLASS(rpc_xdr_buf_class,
 		__entry->head_len = xdr->head[0].iov_len;
 		__entry->tail_base = xdr->tail[0].iov_base;
 		__entry->tail_len = xdr->tail[0].iov_len;
+		__entry->page_base = xdr->page_base;
 		__entry->page_len = xdr->page_len;
 		__entry->msg_len = xdr->len;
 	),
 
-	TP_printk("task:%u@%u head=[%p,%zu] page=%u tail=[%p,%zu] len=%u",
+	TP_printk("task:%u@%u head=[%p,%zu] page=%u(%u) tail=[%p,%zu] len=%u",
 		__entry->task_id, __entry->client_id,
-		__entry->head_base, __entry->head_len, __entry->page_len,
-		__entry->tail_base, __entry->tail_len, __entry->msg_len
+		__entry->head_base, __entry->head_len,
+		__entry->page_len, __entry->page_base,
+		__entry->tail_base, __entry->tail_len,
+		__entry->msg_len
 	)
 );
 
@@ -1525,6 +1529,7 @@ DECLARE_EVENT_CLASS(svc_xdr_buf_class,
 		__field(size_t, head_len)
 		__field(const void *, tail_base)
 		__field(size_t, tail_len)
+		__field(unsigned int, page_base)
 		__field(unsigned int, page_len)
 		__field(unsigned int, msg_len)
 	),
@@ -1535,14 +1540,17 @@ DECLARE_EVENT_CLASS(svc_xdr_buf_class,
 		__entry->head_len = xdr->head[0].iov_len;
 		__entry->tail_base = xdr->tail[0].iov_base;
 		__entry->tail_len = xdr->tail[0].iov_len;
+		__entry->page_base = xdr->page_base;
 		__entry->page_len = xdr->page_len;
 		__entry->msg_len = xdr->len;
 	),
 
-	TP_printk("xid=0x%08x head=[%p,%zu] page=%u tail=[%p,%zu] len=%u",
+	TP_printk("xid=0x%08x head=[%p,%zu] page=%u(%u) tail=[%p,%zu] len=%u",
 		__entry->xid,
-		__entry->head_base, __entry->head_len, __entry->page_len,
-		__entry->tail_base, __entry->tail_len, __entry->msg_len
+		__entry->head_base, __entry->head_len,
+		__entry->page_len, __entry->page_base,
+		__entry->tail_base, __entry->tail_len,
+		__entry->msg_len
 	)
 );
 




* [PATCH RFC 2/6] SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
  2021-08-31 19:05 [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 1/6] SUNRPC: Capture value of xdr_buf::page_base Chuck Lever
@ 2021-08-31 19:05 ` Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 3/6] NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment() Chuck Lever
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2021-08-31 19:05 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-rdma

xdr_stream_subsegment() was introduced in commit c1346a1216ab
("NFSD: Replace the internals of the READ_BUF() macro").

There are two call sites for xdr_stream_subsegment(). One is
nfsd4_decode_write(), and the other is nfsd4_decode_setxattr().
Currently neither call site invokes this API when
xdr_buf::page_base is non-zero.

However, I'm about to add a case where page_base will sometimes not
be zero when nfsd4_decode_write() invokes this API. Replace the
logic in xdr_stream_subsegment() that advances to the next data item
in the xdr_stream with something more generic in order to handle
this new use case.
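For reference, the amount the stream is advanced is @nbytes rounded up
to the XDR four-byte boundary. A quick userspace sketch of that
arithmetic (the helper names mirror the kernel's xdr_pad_size() and
xdr_align_size(), but this is an illustration, not the kernel code):

```c
#include <assert.h>

/* XDR encodes each data item padded to a multiple of four bytes
 * (RFC 4506). These helpers restate that rule. */
static unsigned int xdr_pad_size(unsigned int nbytes)
{
	return (4 - (nbytes & 3)) & 3;	/* zero pad bytes after the item */
}

static unsigned int xdr_align_size(unsigned int nbytes)
{
	return nbytes + xdr_pad_size(nbytes);	/* on-the-wire size */
}
```

So a 5-byte opaque payload advances the stream by 8 bytes, which is
what the `remaining + xdr_pad_size(nbytes)` expression in the patch
accounts for.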

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xdr.c |   32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index ca10ba2626f2..df194cc07035 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -1633,7 +1633,7 @@ EXPORT_SYMBOL_GPL(xdr_buf_subsegment);
  * Sets up @subbuf to represent a portion of @xdr. The portion
  * starts at the current offset in @xdr, and extends for a length
  * of @nbytes. If this is successful, @xdr is advanced to the next
- * position following that portion.
+ * XDR data item following that portion.
  *
  * Return values:
  *   %true: @subbuf has been initialized, and @xdr has been advanced.
@@ -1642,29 +1642,31 @@ EXPORT_SYMBOL_GPL(xdr_buf_subsegment);
 bool xdr_stream_subsegment(struct xdr_stream *xdr, struct xdr_buf *subbuf,
 			   unsigned int nbytes)
 {
-	unsigned int remaining, offset, len;
+	unsigned int start = xdr_stream_pos(xdr);
+	unsigned int remaining, len;
 
-	if (xdr_buf_subsegment(xdr->buf, subbuf, xdr_stream_pos(xdr), nbytes))
+	/* Extract @subbuf and bounds-check the fn arguments */
+	if (xdr_buf_subsegment(xdr->buf, subbuf, start, nbytes))
 		return false;
 
-	if (subbuf->head[0].iov_len)
-		if (!__xdr_inline_decode(xdr, subbuf->head[0].iov_len))
-			return false;
-
-	remaining = subbuf->page_len;
-	offset = subbuf->page_base;
-	while (remaining) {
-		len = min_t(unsigned int, remaining, PAGE_SIZE) - offset;
-
+	/* Advance @xdr by @nbytes */
+	for (remaining = nbytes; remaining;) {
 		if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr))
 			return false;
-		if (!__xdr_inline_decode(xdr, len))
-			return false;
 
+		len = (char *)xdr->end - (char *)xdr->p;
+		if (remaining <= len) {
+			xdr->p = (__be32 *)((char *)xdr->p +
+					(remaining + xdr_pad_size(nbytes)));
+			break;
+		}
+
+		xdr->p = (__be32 *)((char *)xdr->p + len);
+		xdr->end = xdr->p;
 		remaining -= len;
-		offset = 0;
 	}
 
+	xdr_stream_set_pos(xdr, start + nbytes);
 	return true;
 }
 EXPORT_SYMBOL_GPL(xdr_stream_subsegment);




* [PATCH RFC 3/6] NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
  2021-08-31 19:05 [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 1/6] SUNRPC: Capture value of xdr_buf::page_base Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 2/6] SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases Chuck Lever
@ 2021-08-31 19:05 ` Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 4/6] SUNRPC: svc_fill_write_vector() must handle non-zero page_bases Chuck Lever
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2021-08-31 19:05 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-rdma

Refactor.

Now that the NFSv2 and NFSv3 XDR decoders have been converted to
use xdr_streams, the WRITE decoder functions can use
xdr_stream_subsegment() to extract the WRITE payload into its own
xdr_buf, just as the NFSv4 WRITE XDR decoder currently does.

That makes it possible to pass the first kvec, pages array + length,
page_base, and total payload length via a single function parameter.

The payload's page_base is not yet assigned or used, but will be in
subsequent patches.
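As a rough illustration of what the extracted payload xdr_buf carries,
here is a userspace model of the subsegment arithmetic. The function
and struct names are invented for the sketch; only the
(head_len, page_base, page_len, len) fields mirror struct xdr_buf:

```c
#include <assert.h>

/* Simplified model of the sub-buffer that xdr_stream_subsegment()
 * produces for a WRITE payload. @head is the size of the message's
 * head kvec, @pos is the payload's starting offset in the message. */
struct sub_buf {
	unsigned int head_len;
	unsigned int page_base;
	unsigned int page_len;
	unsigned int len;
};

static struct sub_buf make_subsegment(unsigned int head, unsigned int pos,
				      unsigned int nbytes)
{
	struct sub_buf sb = { .len = nbytes };

	if (pos < head) {
		/* Payload begins in the head kvec */
		sb.head_len = head - pos;
		if (sb.head_len > nbytes)
			sb.head_len = nbytes;
		sb.page_base = 0;
	} else {
		/* Payload begins somewhere in the pages array */
		sb.head_len = 0;
		sb.page_base = pos - head;
	}
	sb.page_len = nbytes - sb.head_len;
	return sb;
}
```

The point of the single-parameter API change is visible here: the
payload's head portion, page portion, and in-page offset travel as one
self-describing object instead of four loose arguments.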

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c         |    3 +--
 fs/nfsd/nfs3xdr.c          |   12 ++----------
 fs/nfsd/nfs4proc.c         |    3 +--
 fs/nfsd/nfsproc.c          |    3 +--
 fs/nfsd/nfsxdr.c           |    9 +--------
 fs/nfsd/xdr.h              |    2 +-
 fs/nfsd/xdr3.h             |    2 +-
 include/linux/sunrpc/svc.h |    3 +--
 net/sunrpc/svc.c           |   11 ++++++-----
 9 files changed, 15 insertions(+), 33 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 17715a6c7a40..4418517f6f12 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -201,8 +201,7 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->committed = argp->stable;
-	nvecs = svc_fill_write_vector(rqstp, rqstp->rq_arg.pages,
-				      &argp->first, cnt);
+	nvecs = svc_fill_write_vector(rqstp, &argp->payload);
 	if (!nvecs) {
 		resp->status = nfserr_io;
 		goto out;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..91d4e2b8b854 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -621,9 +621,6 @@ nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
 	struct xdr_stream *xdr = &rqstp->rq_arg_stream;
 	struct nfsd3_writeargs *args = rqstp->rq_argp;
 	u32 max_blocksize = svc_max_payload(rqstp);
-	struct kvec *head = rqstp->rq_arg.head;
-	struct kvec *tail = rqstp->rq_arg.tail;
-	size_t remaining;
 
 	if (!svcxdr_decode_nfs_fh3(xdr, &args->fh))
 		return 0;
@@ -641,17 +638,12 @@ nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
 	/* request sanity */
 	if (args->count != args->len)
 		return 0;
-	remaining = head->iov_len + rqstp->rq_arg.page_len + tail->iov_len;
-	remaining -= xdr_stream_pos(xdr);
-	if (remaining < xdr_align_size(args->len))
-		return 0;
 	if (args->count > max_blocksize) {
 		args->count = max_blocksize;
 		args->len = max_blocksize;
 	}
-
-	args->first.iov_base = xdr->p;
-	args->first.iov_len = head->iov_len - xdr_stream_pos(xdr);
+	if (!xdr_stream_subsegment(xdr, &args->payload, args->count))
+		return 0;
 
 	return 1;
 }
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 486c5dba4b65..55c9551fa74e 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1033,8 +1033,7 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 
 	write->wr_how_written = write->wr_stable_how;
 
-	nvecs = svc_fill_write_vector(rqstp, write->wr_payload.pages,
-				      write->wr_payload.head, write->wr_buflen);
+	nvecs = svc_fill_write_vector(rqstp, &write->wr_payload);
 	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
 
 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, nf,
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 90fcd6178823..eea5b59b6a6c 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -234,8 +234,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 		SVCFH_fmt(&argp->fh),
 		argp->len, argp->offset);
 
-	nvecs = svc_fill_write_vector(rqstp, rqstp->rq_arg.pages,
-				      &argp->first, cnt);
+	nvecs = svc_fill_write_vector(rqstp, &argp->payload);
 	if (!nvecs) {
 		resp->status = nfserr_io;
 		goto out;
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index a06c05fe3b42..26a42f87c240 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -325,10 +325,7 @@ nfssvc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
 {
 	struct xdr_stream *xdr = &rqstp->rq_arg_stream;
 	struct nfsd_writeargs *args = rqstp->rq_argp;
-	struct kvec *head = rqstp->rq_arg.head;
-	struct kvec *tail = rqstp->rq_arg.tail;
 	u32 beginoffset, totalcount;
-	size_t remaining;
 
 	if (!svcxdr_decode_fhandle(xdr, &args->fh))
 		return 0;
@@ -346,12 +343,8 @@ nfssvc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
 		return 0;
 	if (args->len > NFSSVC_MAXBLKSIZE_V2)
 		return 0;
-	remaining = head->iov_len + rqstp->rq_arg.page_len + tail->iov_len;
-	remaining -= xdr_stream_pos(xdr);
-	if (remaining < xdr_align_size(args->len))
+	if (!xdr_stream_subsegment(xdr, &args->payload, args->len))
 		return 0;
-	args->first.iov_base = xdr->p;
-	args->first.iov_len = head->iov_len - xdr_stream_pos(xdr);
 
 	return 1;
 }
diff --git a/fs/nfsd/xdr.h b/fs/nfsd/xdr.h
index f45b4bc93f52..80fd6d7f3404 100644
--- a/fs/nfsd/xdr.h
+++ b/fs/nfsd/xdr.h
@@ -33,7 +33,7 @@ struct nfsd_writeargs {
 	svc_fh			fh;
 	__u32			offset;
 	int			len;
-	struct kvec		first;
+	struct xdr_buf		payload;
 };
 
 struct nfsd_createargs {
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 933008382bbe..712c117300cb 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -40,7 +40,7 @@ struct nfsd3_writeargs {
 	__u32			count;
 	int			stable;
 	__u32			len;
-	struct kvec		first;
+	struct xdr_buf		payload;
 };
 
 struct nfsd3_createargs {
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index f0f846fa396e..5a277acf2667 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -532,8 +532,7 @@ int		   svc_encode_result_payload(struct svc_rqst *rqstp,
 					     unsigned int offset,
 					     unsigned int length);
 unsigned int	   svc_fill_write_vector(struct svc_rqst *rqstp,
-					 struct page **pages,
-					 struct kvec *first, size_t total);
+					 struct xdr_buf *payload);
 char		  *svc_fill_symlink_pathname(struct svc_rqst *rqstp,
 					     struct kvec *first, void *p,
 					     size_t total);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index bfcbaf7b3822..34ced7f538ee 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1693,16 +1693,17 @@ EXPORT_SYMBOL_GPL(svc_encode_result_payload);
 /**
  * svc_fill_write_vector - Construct data argument for VFS write call
  * @rqstp: svc_rqst to operate on
- * @pages: list of pages containing data payload
- * @first: buffer containing first section of write payload
- * @total: total number of bytes of write payload
+ * @payload: xdr_buf containing only the write data payload
  *
  * Fills in rqstp::rq_vec, and returns the number of elements.
  */
-unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct page **pages,
-				   struct kvec *first, size_t total)
+unsigned int svc_fill_write_vector(struct svc_rqst *rqstp,
+				   struct xdr_buf *payload)
 {
+	struct page **pages = payload->pages;
+	struct kvec *first = payload->head;
 	struct kvec *vec = rqstp->rq_vec;
+	size_t total = payload->len;
 	unsigned int i;
 
 	/* Some types of transport can present the write payload




* [PATCH RFC 4/6] SUNRPC: svc_fill_write_vector() must handle non-zero page_bases
  2021-08-31 19:05 [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders Chuck Lever
                   ` (2 preceding siblings ...)
  2021-08-31 19:05 ` [PATCH RFC 3/6] NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment() Chuck Lever
@ 2021-08-31 19:05 ` Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 5/6] NFSD: Add a transport hook for pulling argument payloads Chuck Lever
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2021-08-31 19:05 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-rdma

svc_fill_write_vector() was introduced in commit 8154ef2776aa
("NFSD: Clean up legacy NFS WRITE argument XDR decoders").

I'm about to add a case where page_base will sometimes not be zero
when the NFS WRITE proc functions invoke this API. Refactor the
function's internals to handle this case properly.
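The per-page loop can be exercised in userspace. This sketch (assuming
4KB pages; the function name is invented, and the kernel fills struct
kvec rather than a bare length array) splits a payload of @total bytes
that begins @page_base bytes into its first page into per-page slice
lengths, the way the patched loop fills rq_vec:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u		/* assume 4KB pages for the sketch */

/* Returns the number of slices written into @slice_len[]. Only the
 * first slice is shortened by the in-page offset; subsequent pages
 * start at offset zero, exactly as in the patched loop. */
static unsigned int fill_vector(size_t total, size_t page_base,
				size_t *slice_len, unsigned int max)
{
	size_t offset = page_base;
	unsigned int i = 0;

	while (total && i < max) {
		size_t len = total < PAGE_SIZE - offset ?
			     total : PAGE_SIZE - offset;
		slice_len[i++] = len;
		total -= len;
		offset = 0;
	}
	return i;
}
```

An 8192-byte payload starting 100 bytes into its first page becomes
three slices (3996, 4096, 100), which is why a non-zero page_base can
cost one extra rq_vec entry.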

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/svc.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 34ced7f538ee..3d9f9da98aed 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1705,6 +1705,7 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp,
 	struct kvec *vec = rqstp->rq_vec;
 	size_t total = payload->len;
 	unsigned int i;
+	size_t offset;
 
 	/* Some types of transport can present the write payload
 	 * entirely in rq_arg.pages. In this case, @first is empty.
@@ -1717,12 +1718,14 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp,
 		++i;
 	}
 
+	offset = payload->page_base;
 	while (total) {
-		vec[i].iov_base = page_address(*pages);
-		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
+		vec[i].iov_base = page_address(*pages) + offset;
+		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE - offset);
 		total -= vec[i].iov_len;
 		++i;
 		++pages;
+		offset = 0;
 	}
 
 	WARN_ON_ONCE(i > ARRAY_SIZE(rqstp->rq_vec));




* [PATCH RFC 5/6] NFSD: Add a transport hook for pulling argument payloads
  2021-08-31 19:05 [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders Chuck Lever
                   ` (3 preceding siblings ...)
  2021-08-31 19:05 ` [PATCH RFC 4/6] SUNRPC: svc_fill_write_vector() must handle non-zero page_bases Chuck Lever
@ 2021-08-31 19:05 ` Chuck Lever
  2021-08-31 19:05 ` [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload Chuck Lever
  2021-08-31 20:42 ` [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders J. Bruce Fields
  6 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2021-08-31 19:05 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-rdma

Refactor.

The new hook is a no-op at the moment, but it will soon be used to
pull RDMA Read chunks such that the payload is aligned to the pages
in the target file's page cache.
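The shape of the hook is ordinary transport-method dispatch. A minimal
userspace model (struct and function names invented for the sketch; in
the kernel the table is struct svc_xprt_ops and the caller is
svc_decode_argument_payload() via rqstp->rq_xprt->xpt_ops):

```c
#include <assert.h>

/* One entry in a per-transport method table */
struct xprt_ops {
	int (*argument_payload)(unsigned int offset, unsigned int length);
};

/* Socket transports need no payload special-casing, so their method
 * is a no-op that always succeeds, as svc_sock_argument_payload()
 * does in this patch. */
static int sock_argument_payload(unsigned int offset, unsigned int length)
{
	(void)offset;
	(void)length;
	return 0;
}

static const struct xprt_ops sock_ops = {
	.argument_payload = sock_argument_payload,
};

/* Generic decoding code dispatches without knowing the transport */
static int decode_argument_payload(const struct xprt_ops *ops,
				   unsigned int offset, unsigned int length)
{
	return ops->argument_payload(offset, length);
}
```

Only svcrdma supplies a non-trivial method, in the next patch.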

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3xdr.c                        |    4 ++++
 fs/nfsd/nfs4xdr.c                        |    6 ++++++
 fs/nfsd/nfsxdr.c                         |    4 ++++
 include/linux/sunrpc/svc.h               |    3 +++
 include/linux/sunrpc/svc_rdma.h          |    2 ++
 include/linux/sunrpc/svc_xprt.h          |    3 +++
 net/sunrpc/svc.c                         |   20 ++++++++++++++++++++
 net/sunrpc/svcsock.c                     |    8 ++++++++
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   22 ++++++++++++++++++++++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |    1 +
 10 files changed, 73 insertions(+)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 91d4e2b8b854..1bb61a7d125a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -642,6 +642,8 @@ nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
 		args->count = max_blocksize;
 		args->len = max_blocksize;
 	}
+	if (svc_decode_argument_payload(rqstp, args->offset, args->len) < 0)
+		return 0;
 	if (!xdr_stream_subsegment(xdr, &args->payload, args->count))
 		return 0;
 
@@ -705,6 +707,8 @@ nfs3svc_decode_symlinkargs(struct svc_rqst *rqstp, __be32 *p)
 	remaining -= xdr_stream_pos(xdr);
 	if (remaining < xdr_align_size(args->tlen))
 		return 0;
+	if (svc_decode_argument_payload(rqstp, 0, args->tlen) < 0)
+		return 0;
 
 	args->first.iov_base = xdr->p;
 	args->first.iov_len = head->iov_len - xdr_stream_pos(xdr);
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..e919f409cb9c 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -813,6 +813,9 @@ nfsd4_decode_create(struct nfsd4_compoundargs *argp, struct nfsd4_create *create
 		p = xdr_inline_decode(argp->xdr, create->cr_datalen);
 		if (!p)
 			return nfserr_bad_xdr;
+		if (svc_decode_argument_payload(argp->rqstp, 0,
+						create->cr_datalen) < 0)
+			return nfserr_bad_xdr;
 		create->cr_data = svcxdr_dupstr(argp, p, create->cr_datalen);
 		if (!create->cr_data)
 			return nfserr_jukebox;
@@ -1410,6 +1413,9 @@ nfsd4_decode_write(struct nfsd4_compoundargs *argp, struct nfsd4_write *write)
 		return nfserr_bad_xdr;
 	if (xdr_stream_decode_u32(argp->xdr, &write->wr_buflen) < 0)
 		return nfserr_bad_xdr;
+	if (svc_decode_argument_payload(argp->rqstp, write->wr_offset,
+					write->wr_buflen) < 0)
+		return nfserr_bad_xdr;
 	if (!xdr_stream_subsegment(argp->xdr, &write->wr_payload, write->wr_buflen))
 		return nfserr_bad_xdr;
 
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index 26a42f87c240..7c15f31f91da 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -343,6 +343,8 @@ nfssvc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
 		return 0;
 	if (args->len > NFSSVC_MAXBLKSIZE_V2)
 		return 0;
+	if (svc_decode_argument_payload(rqstp, args->offset, args->len) < 0)
+		return 0;
 	if (!xdr_stream_subsegment(xdr, &args->payload, args->len))
 		return 0;
 
@@ -397,6 +399,8 @@ nfssvc_decode_symlinkargs(struct svc_rqst *rqstp, __be32 *p)
 	if (args->tlen == 0)
 		return 0;
 
+	if (svc_decode_argument_payload(rqstp, 0, args->tlen) < 0)
+		return 0;
 	args->first.iov_len = head->iov_len - xdr_stream_pos(xdr);
 	args->first.iov_base = xdr_inline_decode(xdr, args->tlen);
 	if (!args->first.iov_base)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 5a277acf2667..59d5016d9ec4 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -528,6 +528,9 @@ void		   svc_reserve(struct svc_rqst *rqstp, int space);
 struct svc_pool *  svc_pool_for_cpu(struct svc_serv *serv, int cpu);
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
 const char *	   svc_proc_name(const struct svc_rqst *rqstp);
+int		   svc_decode_argument_payload(struct svc_rqst *rqstp,
+					       unsigned int offset,
+					       unsigned int length);
 int		   svc_encode_result_payload(struct svc_rqst *rqstp,
 					     unsigned int offset,
 					     unsigned int length);
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 24aa159d29a7..f660244cc8ba 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -178,6 +178,8 @@ extern void svc_rdma_recv_ctxt_put(struct svcxprt_rdma *rdma,
 extern void svc_rdma_flush_recv_queues(struct svcxprt_rdma *rdma);
 extern void svc_rdma_release_rqst(struct svc_rqst *rqstp);
 extern int svc_rdma_recvfrom(struct svc_rqst *);
+extern int svc_rdma_argument_payload(struct svc_rqst *rqstp,
+				     unsigned int offset, unsigned int length);
 
 /* svc_rdma_rw.c */
 extern void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma);
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 571f605bc91e..2d4c61f3307e 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -20,6 +20,9 @@ struct svc_xprt_ops {
 	struct svc_xprt	*(*xpo_accept)(struct svc_xprt *);
 	int		(*xpo_has_wspace)(struct svc_xprt *);
 	int		(*xpo_recvfrom)(struct svc_rqst *);
+	int		(*xpo_argument_payload)(struct svc_rqst *rqstp,
+						unsigned int offset,
+						unsigned int length);
 	int		(*xpo_sendto)(struct svc_rqst *);
 	int		(*xpo_result_payload)(struct svc_rqst *, unsigned int,
 					      unsigned int);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 3d9f9da98aed..6eca2f420371 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1673,6 +1673,26 @@ const char *svc_proc_name(const struct svc_rqst *rqstp)
 }
 
 
+/**
+ * svc_decode_argument_payload - set up pages containing an argument
+ * @rqstp: svc_rqst to operate on
+ * @offset: byte offset of payload in file's page cache
+ * @length: size of payload, in bytes
+ *
+ * This function can modify rqstp->rq_arg.page_base and the content
+ * of rqstp->rq_arg.pages, but no other fields of rq_arg are changed.
+ *
+ * Returns zero on success, or a negative errno if a permanent error
+ * occurred.
+ */
+int svc_decode_argument_payload(struct svc_rqst *rqstp, unsigned int offset,
+				unsigned int length)
+{
+	return rqstp->rq_xprt->xpt_ops->xpo_argument_payload(rqstp, offset,
+							     length);
+}
+EXPORT_SYMBOL_GPL(svc_decode_argument_payload);
+
 /**
  * svc_encode_result_payload - mark a range of bytes as a result payload
  * @rqstp: svc_rqst to operate on
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 478f857cdaed..05eb63c182fd 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -181,6 +181,12 @@ static void svc_set_cmsg_data(struct svc_rqst *rqstp, struct cmsghdr *cmh)
 	}
 }
 
+static int svc_sock_argument_payload(struct svc_rqst *rqstp,
+				     unsigned int offset, unsigned int length)
+{
+	return 0;
+}
+
 static int svc_sock_result_payload(struct svc_rqst *rqstp, unsigned int offset,
 				   unsigned int length)
 {
@@ -637,6 +643,7 @@ static struct svc_xprt *svc_udp_create(struct svc_serv *serv,
 static const struct svc_xprt_ops svc_udp_ops = {
 	.xpo_create = svc_udp_create,
 	.xpo_recvfrom = svc_udp_recvfrom,
+	.xpo_argument_payload = svc_sock_argument_payload,
 	.xpo_sendto = svc_udp_sendto,
 	.xpo_result_payload = svc_sock_result_payload,
 	.xpo_release_rqst = svc_udp_release_rqst,
@@ -1208,6 +1215,7 @@ static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
 static const struct svc_xprt_ops svc_tcp_ops = {
 	.xpo_create = svc_tcp_create,
 	.xpo_recvfrom = svc_tcp_recvfrom,
+	.xpo_argument_payload = svc_sock_argument_payload,
 	.xpo_sendto = svc_tcp_sendto,
 	.xpo_result_payload = svc_sock_result_payload,
 	.xpo_release_rqst = svc_tcp_release_rqst,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index cf76a6ad127b..08a620b370ae 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -867,3 +867,25 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 	svc_rdma_recv_ctxt_put(rdma_xprt, ctxt);
 	return 0;
 }
+
+/**
+ * svc_rdma_argument_payload - special processing for an argument payload
+ * @rqstp: svc_rqst to operate on
+ * @offset: offset of payload in file's page cache
+ * @length: size of payload, in bytes
+ *
+ * Pull an RDMA Read payload chunk into rqstp->rq_arg.
+ *
+ * Return values:
+ *   %0 if successful or nothing needed to be done
+ *   %-EMSGSIZE on XDR buffer overflow
+ *   %-EINVAL if client provided too many segments
+ *   %-ENOMEM if rdma_rw context pool was exhausted
+ *   %-ENOTCONN if posting failed (connection is lost)
+ *   %-EIO if rdma_rw initialization failed (DMA mapping, etc)
+ */
+int svc_rdma_argument_payload(struct svc_rqst *rqstp, unsigned int offset,
+			      unsigned int length)
+{
+	return 0;
+}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 94b20fb47135..ee7cc112504c 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -79,6 +79,7 @@ static void svc_rdma_kill_temp_xprt(struct svc_xprt *);
 static const struct svc_xprt_ops svc_rdma_ops = {
 	.xpo_create = svc_rdma_create,
 	.xpo_recvfrom = svc_rdma_recvfrom,
+	.xpo_argument_payload = svc_rdma_argument_payload,
 	.xpo_sendto = svc_rdma_sendto,
 	.xpo_result_payload = svc_rdma_result_payload,
 	.xpo_release_rqst = svc_rdma_release_rqst,




* [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload
  2021-08-31 19:05 [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders Chuck Lever
                   ` (4 preceding siblings ...)
  2021-08-31 19:05 ` [PATCH RFC 5/6] NFSD: Add a transport hook for pulling argument payloads Chuck Lever
@ 2021-08-31 19:05 ` Chuck Lever
  2021-08-31 19:08   ` Chuck Lever III
  2021-08-31 20:40   ` J. Bruce Fields
  2021-08-31 20:42 ` [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders J. Bruce Fields
  6 siblings, 2 replies; 13+ messages in thread
From: Chuck Lever @ 2021-08-31 19:05 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, linux-rdma

This enables the XDR decoder to determine how the payload sink
buffer needs to be aligned before the RDMA Reads are set up, so that
re-alignment of large RDMA Read payloads can be avoided.
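The page math that makes alignment matter shows up at the bottom of
this patch: the scatter-gather entry count for a Read segment grows
when the payload starts at a non-zero page offset. A userspace
restatement (assuming 4KB pages; the kernel macros are re-defined
locally for the sketch):

```c
#include <assert.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1u << PAGE_SHIFT)	/* assume 4KB pages */
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Mirrors the sge count computed in svc_rdma_build_read_segment():
 * the number of pages a Read segment of @len bytes touches when it
 * starts @pageoff bytes into its first page. */
static unsigned int read_segment_sges(unsigned int pageoff,
				      unsigned int len)
{
	return PAGE_ALIGN(pageoff + len) >> PAGE_SHIFT;
}
```

A page-aligned 4KB segment needs one sge; the same segment starting
100 bytes into a page spills onto a second page and needs two. Pulling
the chunk at the offset the decoder supplies avoids that spill.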

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h         |    6 +
 include/trace/events/rpcrdma.h          |   26 ++++++
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   17 +++-
 net/sunrpc/xprtrdma/svc_rdma_rw.c       |  139 ++++++++++++++++++++++++++++---
 4 files changed, 169 insertions(+), 19 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f660244cc8ba..8d80a759a909 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -192,6 +192,12 @@ extern int svc_rdma_send_reply_chunk(struct svcxprt_rdma *rdma,
 extern int svc_rdma_process_read_list(struct svcxprt_rdma *rdma,
 				      struct svc_rqst *rqstp,
 				      struct svc_rdma_recv_ctxt *head);
+extern void svc_rdma_prepare_read_chunk(struct svc_rqst *rqstp,
+					struct svc_rdma_recv_ctxt *head);
+extern int svc_rdma_pull_read_chunk(struct svcxprt_rdma *rdma,
+				    struct svc_rqst *rqstp,
+				    struct svc_rdma_recv_ctxt *ctxt,
+				    unsigned int offset, unsigned int length);
 
 /* svc_rdma_sendto.c */
 extern void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma);
diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index 5954ce036173..30440cca321a 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -2136,6 +2136,32 @@ TRACE_EVENT(svcrdma_sq_post_err,
 	)
 );
 
+TRACE_EVENT(svcrdma_arg_payload,
+	TP_PROTO(
+		const struct svc_rqst *rqstp,
+		unsigned int offset,
+		unsigned int length
+	),
+
+	TP_ARGS(rqstp, offset, length),
+
+	TP_STRUCT__entry(
+		__field(u32, xid)
+		__field(u32, offset)
+		__field(u32, length)
+	),
+
+	TP_fast_assign(
+		__entry->xid = __be32_to_cpu(rqstp->rq_xid);
+		__entry->offset = offset_in_page(offset);
+		__entry->length = length;
+	),
+
+	TP_printk("xid=0x%08x offset=%u length=%u",
+		__entry->xid, __entry->offset, __entry->length
+	)
+);
+
 #endif /* _TRACE_RPCRDMA_H */
 
 #include <trace/define_trace.h>
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 08a620b370ae..cd9c0fb1a470 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -838,12 +838,13 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 
 	svc_rdma_get_inv_rkey(rdma_xprt, ctxt);
 
-	if (!pcl_is_empty(&ctxt->rc_read_pcl) ||
-	    !pcl_is_empty(&ctxt->rc_call_pcl)) {
+	if (!pcl_is_empty(&ctxt->rc_call_pcl) ||
+	    ctxt->rc_read_pcl.cl_count > 1) {
 		ret = svc_rdma_process_read_list(rdma_xprt, rqstp, ctxt);
 		if (ret < 0)
 			goto out_readfail;
-	}
+	} else if (ctxt->rc_read_pcl.cl_count == 1)
+		svc_rdma_prepare_read_chunk(rqstp, ctxt);
 
 	rqstp->rq_xprt_ctxt = ctxt;
 	rqstp->rq_prot = IPPROTO_MAX;
@@ -887,5 +888,13 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 int svc_rdma_argument_payload(struct svc_rqst *rqstp, unsigned int offset,
 			      unsigned int length)
 {
-	return 0;
+	struct svc_rdma_recv_ctxt *ctxt = rqstp->rq_xprt_ctxt;
+	struct svc_xprt *xprt = rqstp->rq_xprt;
+	struct svcxprt_rdma *rdma =
+		container_of(xprt, struct svcxprt_rdma, sc_xprt);
+
+	if (!pcl_is_empty(&ctxt->rc_call_pcl) ||
+	    ctxt->rc_read_pcl.cl_count != 1)
+		return 0;
+	return svc_rdma_pull_read_chunk(rdma, rqstp, ctxt, offset, length);
 }
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 29b7d477891c..5f03dfd2fa03 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -707,19 +707,24 @@ static int svc_rdma_build_read_segment(struct svc_rdma_read_info *info,
 
 	len = segment->rs_length;
 	sge_no = PAGE_ALIGN(info->ri_pageoff + len) >> PAGE_SHIFT;
+
+	trace_printk("pageoff=%u len=%u sges=%u\n",
+		info->ri_pageoff, len, sge_no);
+
 	ctxt = svc_rdma_get_rw_ctxt(cc->cc_rdma, sge_no);
 	if (!ctxt)
 		return -ENOMEM;
 	ctxt->rw_nents = sge_no;
+	head->rc_page_count += sge_no;
 
 	sg = ctxt->rw_sg_table.sgl;
 	for (sge_no = 0; sge_no < ctxt->rw_nents; sge_no++) {
 		seg_len = min_t(unsigned int, len,
 				PAGE_SIZE - info->ri_pageoff);
 
-		if (!info->ri_pageoff)
-			head->rc_page_count++;
-
+		trace_printk("  page=%p seg_len=%u offset=%u\n",
+			rqstp->rq_pages[info->ri_pageno], seg_len,
+			info->ri_pageoff);
 		sg_set_page(sg, rqstp->rq_pages[info->ri_pageno],
 			    seg_len, info->ri_pageoff);
 		sg = sg_next(sg);
@@ -804,15 +809,14 @@ static int svc_rdma_copy_inline_range(struct svc_rdma_read_info *info,
 	unsigned int page_no, numpages;
 
 	numpages = PAGE_ALIGN(info->ri_pageoff + remaining) >> PAGE_SHIFT;
+	head->rc_page_count += numpages;
+
 	for (page_no = 0; page_no < numpages; page_no++) {
 		unsigned int page_len;
 
 		page_len = min_t(unsigned int, remaining,
 				 PAGE_SIZE - info->ri_pageoff);
 
-		if (!info->ri_pageoff)
-			head->rc_page_count++;
-
 		dst = page_address(rqstp->rq_pages[info->ri_pageno]);
 		memcpy(dst + info->ri_pageno, src + offset, page_len);
 
@@ -1092,15 +1096,8 @@ static noinline int svc_rdma_read_special(struct svc_rdma_read_info *info)
  * @rqstp: set of pages to use as Read sink buffers
  * @head: pages under I/O collect here
  *
- * The RPC/RDMA protocol assumes that the upper layer's XDR decoders
- * pull each Read chunk as they decode an incoming RPC message.
- *
- * On Linux, however, the server needs to have a fully-constructed RPC
- * message in rqstp->rq_arg when there is a positive return code from
- * ->xpo_recvfrom. So the Read list is safety-checked immediately when
- * it is received, then here the whole Read list is pulled all at once.
- * The ingress RPC message is fully reconstructed once all associated
- * RDMA Reads have completed.
+ * Handle complex Read chunk cases fully before svc_rdma_recvfrom()
+ * returns.
  *
  * Return values:
  *   %1: all needed RDMA Reads were posted successfully,
@@ -1159,3 +1156,115 @@ int svc_rdma_process_read_list(struct svcxprt_rdma *rdma,
 	svc_rdma_read_info_free(info);
 	return ret;
 }
+
+/**
+ * svc_rdma_prepare_read_chunk - Prepare rq_arg for Read chunk
+ * @rqstp: set of pages to use as Read sink buffers
+ * @head: pages under I/O collect here
+ *
+ * The Read chunk will be pulled when the upper layer's XDR
+ * decoder calls svc_decode_argument_payload(). In the meantime,
+ * fake up rq_arg.page_len and .len to reflect the size of the
+ * yet-to-be-pulled payload.
+ */
+void svc_rdma_prepare_read_chunk(struct svc_rqst *rqstp,
+				 struct svc_rdma_recv_ctxt *head)
+{
+	struct svc_rdma_chunk *chunk = pcl_first_chunk(&head->rc_read_pcl);
+	unsigned int length = xdr_align_size(chunk->ch_length);
+	struct xdr_buf *buf = &rqstp->rq_arg;
+
+	buf->tail[0].iov_base = buf->head[0].iov_base + chunk->ch_position;
+	buf->tail[0].iov_len = buf->head[0].iov_len - chunk->ch_position;
+	buf->head[0].iov_len = chunk->ch_position;
+
+	buf->page_len = length;
+	buf->len += length;
+	buf->buflen += length;
+
+	/*
+	 * rq_respages starts after the last arg page. Note that we
+	 * don't know the offset yet, so add an extra page as slack.
+	 */
+	length += PAGE_SIZE * 2 - 1;
+	rqstp->rq_respages = &rqstp->rq_pages[length >> PAGE_SHIFT];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
+}
+
+/**
+ * svc_rdma_pull_read_chunk - Pull one Read chunk from the client
+ * @rdma: controlling RDMA transport
+ * @rqstp: set of pages to use as Read sink buffers
+ * @head: pages under I/O collect here
+ * @offset: offset of payload in file's page cache
+ * @length: size of payload, in bytes
+ *
+ * Once the upper layer's XDR decoder has decoded the length of
+ * the payload and its offset, we can be clever about setting up
+ * the RDMA Read sink buffer so that the VFS does not have to
+ * re-align the payload once it is received.
+ *
+ * Caveat: To keep things simple, this is an optimization that is
+ *	   used only when there is a single Read chunk in the Read
+ *	   list.
+ *
+ * Return values:
+ *   %0: all needed RDMA Reads were posted successfully,
+ *   %-EINVAL: client provided the wrong chunk size,
+ *   %-ENOMEM: rdma_rw context pool was exhausted,
+ *   %-ENOTCONN: posting failed (connection is lost),
+ *   %-EIO: rdma_rw initialization failed (DMA mapping, etc).
+ */
+int svc_rdma_pull_read_chunk(struct svcxprt_rdma *rdma, struct svc_rqst *rqstp,
+			     struct svc_rdma_recv_ctxt *head,
+			     unsigned int offset, unsigned int length)
+{
+	struct svc_rdma_read_info *info;
+	struct svc_rdma_chunk_ctxt *cc;
+	struct svc_rdma_chunk *chunk;
+	int ret;
+
+	trace_svcrdma_arg_payload(rqstp, offset, length);
+
+	/* Sanity: the Requester must have provided enough
+	 * bytes to fill the XDR opaque.
+	 */
+	chunk = pcl_first_chunk(&head->rc_read_pcl);
+	if (length > chunk->ch_length)
+		return -EINVAL;
+
+	info = svc_rdma_read_info_alloc(rdma);
+	if (!info)
+		return -ENOMEM;
+	cc = &info->ri_cc;
+	info->ri_rqst = rqstp;
+	info->ri_readctxt = head;
+	info->ri_pageno = 0;
+	info->ri_pageoff = offset_in_page(offset);
+	info->ri_totalbytes = 0;
+
+	ret = svc_rdma_build_read_chunk(info, chunk);
+	if (ret < 0)
+		goto out_err;
+	rqstp->rq_arg.pages = &info->ri_rqst->rq_pages[0];
+	rqstp->rq_arg.page_base = offset_in_page(offset);
+	rqstp->rq_arg.buflen += offset_in_page(offset);
+
+	trace_svcrdma_post_read_chunk(&cc->cc_cid, cc->cc_sqecount);
+	init_completion(&cc->cc_done);
+	ret = svc_rdma_post_chunk_ctxt(cc);
+	if (ret < 0)
+		goto out_err;
+
+	ret = 0;
+	wait_for_completion(&cc->cc_done);
+	if (cc->cc_status != IB_WC_SUCCESS)
+		ret = -EIO;
+
+	/* Ensure svc_rdma_recv_ctxt_put() does not try to release pages */
+	head->rc_page_count = 0;
+
+out_err:
+	svc_rdma_read_info_free(info);
+	return ret;
+}



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload
  2021-08-31 19:05 ` [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload Chuck Lever
@ 2021-08-31 19:08   ` Chuck Lever III
  2021-08-31 20:40   ` J. Bruce Fields
  1 sibling, 0 replies; 13+ messages in thread
From: Chuck Lever III @ 2021-08-31 19:08 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List, linux-rdma



> On Aug 31, 2021, at 3:05 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> This enables the XDR decoder to figure out how the payload sink
> buffer needs to be aligned before setting up the RDMA Reads.
> Then re-alignment of large RDMA Read payloads can be avoided.

Should say "can be avoided" but isn't yet. All of these patches
are prerequisites to that behavior.
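
To illustrate the intended end state: once the WRITE decoder reports the
payload's file offset, the sink buffer's page_base can be chosen so the
payload lands page-aligned with the file's page cache. A rough standalone
sketch of that arithmetic, assuming 4KB pages; the `ex_`-prefixed names
are invented for illustration (the kernel uses offset_in_page() and
PAGE_ALIGN()):

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's page arithmetic, assuming
 * 4KB pages; these mirror PAGE_ALIGN() and offset_in_page(). */
#define EX_PAGE_SIZE  4096u
#define EX_PAGE_SHIFT 12
#define ex_page_align(x)     (((x) + EX_PAGE_SIZE - 1) & ~(EX_PAGE_SIZE - 1))
#define ex_offset_in_page(x) ((x) & (EX_PAGE_SIZE - 1))

/* Starting the RDMA Read at the same in-page offset that the payload
 * occupies in the file means every subsequent payload byte lands on a
 * page boundary matching the file's page cache. */
static unsigned int ex_sink_page_base(unsigned int file_offset)
{
	return ex_offset_in_page(file_offset);
}

/* Number of sink pages (SGEs) a payload of @length bytes occupies when
 * it starts at @page_base within its first page; compare the sge_no
 * computation in svc_rdma_build_read_segment(). */
static unsigned int ex_sink_page_count(unsigned int page_base,
				       unsigned int length)
{
	return ex_page_align(page_base + length) >> EX_PAGE_SHIFT;
}
```

With an unaligned sink buffer, a 64KB payload starting 148 bytes into a
page spans 17 pages rather than 16, which is why the preparation step
reserves an extra page of slack.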



--
Chuck Lever





* Re: [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload
  2021-08-31 19:05 ` [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload Chuck Lever
  2021-08-31 19:08   ` Chuck Lever III
@ 2021-08-31 20:40   ` J. Bruce Fields
  2021-08-31 21:19     ` Chuck Lever III
  1 sibling, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2021-08-31 20:40 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, linux-rdma

How does this deal with compounds with multiple writes?

We don't need to handle those efficiently, but we do need to handle
them.

--b.

On Tue, Aug 31, 2021 at 03:05:46PM -0400, Chuck Lever wrote:
> This enables the XDR decoder to figure out how the payload sink
> buffer needs to be aligned before setting up the RDMA Reads.
> Then re-alignment of large RDMA Read payloads can be avoided.

* Re: [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders
  2021-08-31 19:05 [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders Chuck Lever
                   ` (5 preceding siblings ...)
  2021-08-31 19:05 ` [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload Chuck Lever
@ 2021-08-31 20:42 ` J. Bruce Fields
  2021-08-31 21:23   ` Chuck Lever III
  6 siblings, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2021-08-31 20:42 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, linux-rdma

On Tue, Aug 31, 2021 at 03:05:09PM -0400, Chuck Lever wrote:
> Hi Bruce-
> 
> Here is part of what we discussed recently about trying to align
> pages in NFS WRITE requests so that splice can be used. This
> series updates server-side RPC and svcrdma to ensure that an
> aligned xdr_buf::pages array is presented to NFSD, which is then
> converted into an aligned rq_vec for the VFS layer.

Seems sensible to me.

Do you have a git tree?  It didn't apply cleanly to 5.14 when I tried,
but I didn't stop to figure out why.

> The next step would be to look at how to make the best use of the
> aligned rq_vec.

Have you done any performance comparison just with this?

Doesn't seem like it should make a significant difference, but it might
be interesting to check anyway.

--b.

> My naive thought is that where there is a PAGE_SIZE
> entry in rq_vec and there is no page in the file's page cache at
> that offset, the transport-provided page can be flipped into place.
> Might work for replacing whole pages as well, but baby steps first.
> 
> This series has been exercised a bit with both TCP and RDMA, but no
> guarantees that it is completely bug-free. NFSv4 compounds with
> multiple WRITE payloads on RDMA are treated like TCP: the RPC
> message is contained in an unstructured stream of unaligned pages.
> 
> Comments encouraged.
> 
> ---
> 
> Chuck Lever (6):
>       SUNRPC: Capture value of xdr_buf::page_base
>       SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
>       NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
>       SUNRPC: svc_fill_write_vector() must handle non-zero page_bases
>       NFSD: Add a transport hook for pulling argument payloads
>       svcrdma: Pull Read chunks in ->xpo_argument_payload
> 
> 
>  fs/nfsd/nfs3proc.c                       |   3 +-
>  fs/nfsd/nfs3xdr.c                        |  16 +--
>  fs/nfsd/nfs4proc.c                       |   3 +-
>  fs/nfsd/nfs4xdr.c                        |   6 +
>  fs/nfsd/nfsproc.c                        |   3 +-
>  fs/nfsd/nfsxdr.c                         |  13 +--
>  fs/nfsd/xdr.h                            |   2 +-
>  fs/nfsd/xdr3.h                           |   2 +-
>  include/linux/sunrpc/svc.h               |   6 +-
>  include/linux/sunrpc/svc_rdma.h          |   8 ++
>  include/linux/sunrpc/svc_xprt.h          |   3 +
>  include/trace/events/rpcrdma.h           |  26 +++++
>  include/trace/events/sunrpc.h            |  20 +++-
>  net/sunrpc/svc.c                         |  38 +++++--
>  net/sunrpc/svcsock.c                     |   8 ++
>  net/sunrpc/xdr.c                         |  32 +++---
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  37 +++++-
>  net/sunrpc/xprtrdma/svc_rdma_rw.c        | 139 ++++++++++++++++++++---
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   1 +
>  19 files changed, 292 insertions(+), 74 deletions(-)
> 
> --
> Chuck Lever


* Re: [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload
  2021-08-31 20:40   ` J. Bruce Fields
@ 2021-08-31 21:19     ` Chuck Lever III
  2021-08-31 21:24       ` Bruce Fields
  0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever III @ 2021-08-31 21:19 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List, linux-rdma


> On Aug 31, 2021, at 4:40 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> How does this deal with compounds with multiple writes?

Explained in the cover letter.


> We don't need to handle those efficiently, but we do need to handle
> them.
> 
> --b.
> 
> On Tue, Aug 31, 2021 at 03:05:46PM -0400, Chuck Lever wrote:
>> This enables the XDR decoder to figure out how the payload sink
>> buffer needs to be aligned before setting up the RDMA Reads.
>> Then re-alignment of large RDMA Read payloads can be avoided.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> include/linux/sunrpc/svc_rdma.h         |    6 +
>> include/trace/events/rpcrdma.h          |   26 ++++++
>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   17 +++-
>> net/sunrpc/xprtrdma/svc_rdma_rw.c       |  139 ++++++++++++++++++++++++++++---
>> 4 files changed, 169 insertions(+), 19 deletions(-)
>> 
>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>> index f660244cc8ba..8d80a759a909 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -192,6 +192,12 @@ extern int svc_rdma_send_reply_chunk(struct svcxprt_rdma *rdma,
>> extern int svc_rdma_process_read_list(struct svcxprt_rdma *rdma,
>> 				      struct svc_rqst *rqstp,
>> 				      struct svc_rdma_recv_ctxt *head);
>> +extern void svc_rdma_prepare_read_chunk(struct svc_rqst *rqstp,
>> +					struct svc_rdma_recv_ctxt *head);
>> +extern int svc_rdma_pull_read_chunk(struct svcxprt_rdma *rdma,
>> +				    struct svc_rqst *rqstp,
>> +				    struct svc_rdma_recv_ctxt *ctxt,
>> +				    unsigned int offset, unsigned int length);
>> 
>> /* svc_rdma_sendto.c */
>> extern void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma);
>> diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
>> index 5954ce036173..30440cca321a 100644
>> --- a/include/trace/events/rpcrdma.h
>> +++ b/include/trace/events/rpcrdma.h
>> @@ -2136,6 +2136,32 @@ TRACE_EVENT(svcrdma_sq_post_err,
>> 	)
>> );
>> 
>> +TRACE_EVENT(svcrdma_arg_payload,
>> +	TP_PROTO(
>> +		const struct svc_rqst *rqstp,
>> +		unsigned int offset,
>> +		unsigned int length
>> +	),
>> +
>> +	TP_ARGS(rqstp, offset, length),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(u32, xid)
>> +		__field(u32, offset)
>> +		__field(u32, length)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->xid = __be32_to_cpu(rqstp->rq_xid);
>> +		__entry->offset = offset_in_page(offset);
>> +		__entry->length = length;
>> +	),
>> +
>> +	TP_printk("xid=0x%08x offset=%u length=%u",
>> +		__entry->xid, __entry->offset, __entry->length
>> +	)
>> +);
>> +
>> #endif /* _TRACE_RPCRDMA_H */
>> 
>> #include <trace/define_trace.h>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index 08a620b370ae..cd9c0fb1a470 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -838,12 +838,13 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
>> 
>> 	svc_rdma_get_inv_rkey(rdma_xprt, ctxt);
>> 
>> -	if (!pcl_is_empty(&ctxt->rc_read_pcl) ||
>> -	    !pcl_is_empty(&ctxt->rc_call_pcl)) {
>> +	if (!pcl_is_empty(&ctxt->rc_call_pcl) ||
>> +	    ctxt->rc_read_pcl.cl_count > 1) {
>> 		ret = svc_rdma_process_read_list(rdma_xprt, rqstp, ctxt);
>> 		if (ret < 0)
>> 			goto out_readfail;
>> -	}
>> +	} else if (ctxt->rc_read_pcl.cl_count == 1)
>> +		svc_rdma_prepare_read_chunk(rqstp, ctxt);
>> 
>> 	rqstp->rq_xprt_ctxt = ctxt;
>> 	rqstp->rq_prot = IPPROTO_MAX;
>> @@ -887,5 +888,13 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
>> int svc_rdma_argument_payload(struct svc_rqst *rqstp, unsigned int offset,
>> 			      unsigned int length)
>> {
>> -	return 0;
>> +	struct svc_rdma_recv_ctxt *ctxt = rqstp->rq_xprt_ctxt;
>> +	struct svc_xprt *xprt = rqstp->rq_xprt;
>> +	struct svcxprt_rdma *rdma =
>> +		container_of(xprt, struct svcxprt_rdma, sc_xprt);
>> +
>> +	if (!pcl_is_empty(&ctxt->rc_call_pcl) ||
>> +	    ctxt->rc_read_pcl.cl_count != 1)
>> +		return 0;
>> +	return svc_rdma_pull_read_chunk(rdma, rqstp, ctxt, offset, length);
>> }
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>> index 29b7d477891c..5f03dfd2fa03 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>> @@ -707,19 +707,24 @@ static int svc_rdma_build_read_segment(struct svc_rdma_read_info *info,
>> 
>> 	len = segment->rs_length;
>> 	sge_no = PAGE_ALIGN(info->ri_pageoff + len) >> PAGE_SHIFT;
>> +
>> +	trace_printk("pageoff=%u len=%u sges=%u\n",
>> +		info->ri_pageoff, len, sge_no);
>> +
>> 	ctxt = svc_rdma_get_rw_ctxt(cc->cc_rdma, sge_no);
>> 	if (!ctxt)
>> 		return -ENOMEM;
>> 	ctxt->rw_nents = sge_no;
>> +	head->rc_page_count += sge_no;
>> 
>> 	sg = ctxt->rw_sg_table.sgl;
>> 	for (sge_no = 0; sge_no < ctxt->rw_nents; sge_no++) {
>> 		seg_len = min_t(unsigned int, len,
>> 				PAGE_SIZE - info->ri_pageoff);
>> 
>> -		if (!info->ri_pageoff)
>> -			head->rc_page_count++;
>> -
>> +		trace_printk("  page=%p seg_len=%u offset=%u\n",
>> +			rqstp->rq_pages[info->ri_pageno], seg_len,
>> +			info->ri_pageoff);
>> 		sg_set_page(sg, rqstp->rq_pages[info->ri_pageno],
>> 			    seg_len, info->ri_pageoff);
>> 		sg = sg_next(sg);
>> @@ -804,15 +809,14 @@ static int svc_rdma_copy_inline_range(struct svc_rdma_read_info *info,
>> 	unsigned int page_no, numpages;
>> 
>> 	numpages = PAGE_ALIGN(info->ri_pageoff + remaining) >> PAGE_SHIFT;
>> +	head->rc_page_count += numpages;
>> +
>> 	for (page_no = 0; page_no < numpages; page_no++) {
>> 		unsigned int page_len;
>> 
>> 		page_len = min_t(unsigned int, remaining,
>> 				 PAGE_SIZE - info->ri_pageoff);
>> 
>> -		if (!info->ri_pageoff)
>> -			head->rc_page_count++;
>> -
>> 		dst = page_address(rqstp->rq_pages[info->ri_pageno]);
>> 		memcpy(dst + info->ri_pageoff, src + offset, page_len);
>> 
>> @@ -1092,15 +1096,8 @@ static noinline int svc_rdma_read_special(struct svc_rdma_read_info *info)
>>  * @rqstp: set of pages to use as Read sink buffers
>>  * @head: pages under I/O collect here
>>  *
>> - * The RPC/RDMA protocol assumes that the upper layer's XDR decoders
>> - * pull each Read chunk as they decode an incoming RPC message.
>> - *
>> - * On Linux, however, the server needs to have a fully-constructed RPC
>> - * message in rqstp->rq_arg when there is a positive return code from
>> - * ->xpo_recvfrom. So the Read list is safety-checked immediately when
>> - * it is received, then here the whole Read list is pulled all at once.
>> - * The ingress RPC message is fully reconstructed once all associated
>> - * RDMA Reads have completed.
>> + * Handle complex Read chunk cases fully before svc_rdma_recvfrom()
>> + * returns.
>>  *
>>  * Return values:
>>  *   %1: all needed RDMA Reads were posted successfully,
>> @@ -1159,3 +1156,115 @@ int svc_rdma_process_read_list(struct svcxprt_rdma *rdma,
>> 	svc_rdma_read_info_free(info);
>> 	return ret;
>> }
>> +
>> +/**
>> + * svc_rdma_prepare_read_chunk - Prepare rq_arg for Read chunk
>> + * @rqstp: set of pages to use as Read sink buffers
>> + * @head: pages under I/O collect here
>> + *
>> + * The Read chunk will be pulled when the upper layer's XDR
>> + * decoder calls svc_decode_argument_payload(). In the meantime,
>> + * fake up rq_arg.page_len and .len to reflect the size of the
>> + * yet-to-be-pulled payload.
>> + */
>> +void svc_rdma_prepare_read_chunk(struct svc_rqst *rqstp,
>> +				 struct svc_rdma_recv_ctxt *head)
>> +{
>> +	struct svc_rdma_chunk *chunk = pcl_first_chunk(&head->rc_read_pcl);
>> +	unsigned int length = xdr_align_size(chunk->ch_length);
>> +	struct xdr_buf *buf = &rqstp->rq_arg;
>> +
>> +	buf->tail[0].iov_base = buf->head[0].iov_base + chunk->ch_position;
>> +	buf->tail[0].iov_len = buf->head[0].iov_len - chunk->ch_position;
>> +	buf->head[0].iov_len = chunk->ch_position;
>> +
>> +	buf->page_len = length;
>> +	buf->len += length;
>> +	buf->buflen += length;
>> +
>> +	/*
>> +	 * rq_respages starts after the last arg page. Note that we
>> +	 * don't know the offset yet, so add an extra page as slack.
>> +	 */
>> +	length += PAGE_SIZE * 2 - 1;
>> +	rqstp->rq_respages = &rqstp->rq_pages[length >> PAGE_SHIFT];
>> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
>> +}
>> +
>> +/**
>> + * svc_rdma_pull_read_chunk - Pull one Read chunk from the client
>> + * @rdma: controlling RDMA transport
>> + * @rqstp: set of pages to use as Read sink buffers
>> + * @head: pages under I/O collect here
>> + * @offset: offset of payload in file's page cache
>> + * @length: size of payload, in bytes
>> + *
>> + * Once the upper layer's XDR decoder has decoded the length of
>> + * the payload and its offset, we can be clever about setting up
>> + * the RDMA Read sink buffer so that the VFS does not have to
>> + * re-align the payload once it is received.
>> + *
>> + * Caveat: To keep things simple, this is an optimization that is
>> + *	   used only when there is a single Read chunk in the Read
>> + *	   list.
>> + *
>> + * Return values:
>> + *   %0: all needed RDMA Reads were posted successfully,
>> + *   %-EINVAL: client provided the wrong chunk size,
>> + *   %-ENOMEM: rdma_rw context pool was exhausted,
>> + *   %-ENOTCONN: posting failed (connection is lost),
>> + *   %-EIO: rdma_rw initialization failed (DMA mapping, etc).
>> + */
>> +int svc_rdma_pull_read_chunk(struct svcxprt_rdma *rdma, struct svc_rqst *rqstp,
>> +			     struct svc_rdma_recv_ctxt *head,
>> +			     unsigned int offset, unsigned int length)
>> +{
>> +	struct svc_rdma_read_info *info;
>> +	struct svc_rdma_chunk_ctxt *cc;
>> +	struct svc_rdma_chunk *chunk;
>> +	int ret;
>> +
>> +	trace_svcrdma_arg_payload(rqstp, offset, length);
>> +
>> +	/* Sanity: the Requester must have provided enough
>> +	 * bytes to fill the XDR opaque.
>> +	 */
>> +	chunk = pcl_first_chunk(&head->rc_read_pcl);
>> +	if (length > chunk->ch_length)
>> +		return -EINVAL;
>> +
>> +	info = svc_rdma_read_info_alloc(rdma);
>> +	if (!info)
>> +		return -ENOMEM;
>> +	cc = &info->ri_cc;
>> +	info->ri_rqst = rqstp;
>> +	info->ri_readctxt = head;
>> +	info->ri_pageno = 0;
>> +	info->ri_pageoff = offset_in_page(offset);
>> +	info->ri_totalbytes = 0;
>> +
>> +	ret = svc_rdma_build_read_chunk(info, chunk);
>> +	if (ret < 0)
>> +		goto out_err;
>> +	rqstp->rq_arg.pages = &info->ri_rqst->rq_pages[0];
>> +	rqstp->rq_arg.page_base = offset_in_page(offset);
>> +	rqstp->rq_arg.buflen += offset_in_page(offset);
>> +
>> +	trace_svcrdma_post_read_chunk(&cc->cc_cid, cc->cc_sqecount);
>> +	init_completion(&cc->cc_done);
>> +	ret = svc_rdma_post_chunk_ctxt(cc);
>> +	if (ret < 0)
>> +		goto out_err;
>> +
>> +	ret = 0;
>> +	wait_for_completion(&cc->cc_done);
>> +	if (cc->cc_status != IB_WC_SUCCESS)
>> +		ret = -EIO;
>> +
>> +	/* Ensure svc_rdma_recv_ctxt_put() does not try to release pages */
>> +	head->rc_page_count = 0;
>> +
>> +out_err:
>> +	svc_rdma_read_info_free(info);
>> +	return ret;
>> +}
>> 

--
Chuck Lever





* Re: [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders
  2021-08-31 20:42 ` [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders J. Bruce Fields
@ 2021-08-31 21:23   ` Chuck Lever III
  0 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever III @ 2021-08-31 21:23 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List, linux-rdma



> On Aug 31, 2021, at 4:42 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Tue, Aug 31, 2021 at 03:05:09PM -0400, Chuck Lever wrote:
>> Hi Bruce-
>> 
>> Here is part of what we discussed recently about trying to align
>> pages in NFS WRITE requests so that splice can be used. This
>> series updates server-side RPC and svcrdma to ensure that an
>> aligned xdr_buf::pages array is presented to NFSD, which is then
>> converted into an aligned rq_vec for the VFS layer.
> 
> Seems sensible to me.
> 
> Do you have a git tree?

I don't yet, but I can set up a topic branch once we're a little
further along.


> It didn't apply cleanly to 5.14 when I tried,
> but I didn't stop to figure out why.

It might apply to 5.15-pre now that nfsd-5.15 has been merged.


>> The next step would be to look at how to make the best use of the
>> aligned rq_vec.
> 
> Have you done any performance comparison just with this?

Not yet. I just got it behaving correctly with the usual tests.
Baby steps, sir.


> Doesn't seem like it should make a significant difference, but it might
> be interesting to check anyway.

I expect it to add a small amount of latency to NFS WRITEs, since
RDMA Reads are done just a little later than before.


> --b.
> 
>> My naive thought is that where there is a PAGE_SIZE
>> entry in rq_vec and there is no page in the file's page cache at
>> that offset, the transport-provided page can be flipped into place.
>> Might work for replacing whole pages as well, but baby steps first.
>> 
>> This series has been exercised a bit with both TCP and RDMA, but no
>> guarantees that it is completely bug-free. NFSv4 compounds with
>> multiple WRITE payloads on RDMA are treated like TCP: the RPC
>> message is contained in an unstructured stream of unaligned pages.
>> 
>> Comments encouraged.
>> 
>> ---
>> 
>> Chuck Lever (6):
>>      SUNRPC: Capture value of xdr_buf::page_base
>>      SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
>>      NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
>>      SUNRPC: svc_fill_write_vector() must handle non-zero page_bases
>>      NFSD: Add a transport hook for pulling argument payloads
>>      svcrdma: Pull Read chunks in ->xpo_argument_payload
>> 
>> 
>> fs/nfsd/nfs3proc.c                       |   3 +-
>> fs/nfsd/nfs3xdr.c                        |  16 +--
>> fs/nfsd/nfs4proc.c                       |   3 +-
>> fs/nfsd/nfs4xdr.c                        |   6 +
>> fs/nfsd/nfsproc.c                        |   3 +-
>> fs/nfsd/nfsxdr.c                         |  13 +--
>> fs/nfsd/xdr.h                            |   2 +-
>> fs/nfsd/xdr3.h                           |   2 +-
>> include/linux/sunrpc/svc.h               |   6 +-
>> include/linux/sunrpc/svc_rdma.h          |   8 ++
>> include/linux/sunrpc/svc_xprt.h          |   3 +
>> include/trace/events/rpcrdma.h           |  26 +++++
>> include/trace/events/sunrpc.h            |  20 +++-
>> net/sunrpc/svc.c                         |  38 +++++--
>> net/sunrpc/svcsock.c                     |   8 ++
>> net/sunrpc/xdr.c                         |  32 +++---
>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  37 +++++-
>> net/sunrpc/xprtrdma/svc_rdma_rw.c        | 139 ++++++++++++++++++++---
>> net/sunrpc/xprtrdma/svc_rdma_transport.c |   1 +
>> 19 files changed, 292 insertions(+), 74 deletions(-)
>> 
>> --
>> Chuck Lever

--
Chuck Lever





* Re: [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload
  2021-08-31 21:19     ` Chuck Lever III
@ 2021-08-31 21:24       ` Bruce Fields
  0 siblings, 0 replies; 13+ messages in thread
From: Bruce Fields @ 2021-08-31 21:24 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Linux NFS Mailing List, linux-rdma

On Tue, Aug 31, 2021 at 09:19:23PM +0000, Chuck Lever III wrote:
> 
> > On Aug 31, 2021, at 4:40 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > How does this deal with compounds with multiple writes?
> 
> Explained in the cover letter.

Whoops, I see now, thanks.

--b.

> 
> 
> > We don't need to handle those efficiently, but we do need to handle
> > them.
> > 
> > --b.
> > 
> > On Tue, Aug 31, 2021 at 03:05:46PM -0400, Chuck Lever wrote:
> >> This enables the XDR decoder to figure out how the payload sink
> >> buffer needs to be aligned before setting up the RDMA Reads.
> >> Then re-alignment of large RDMA Read payloads can be avoided.
> >> 
> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> >> ---
> >> include/linux/sunrpc/svc_rdma.h         |    6 +
> >> include/trace/events/rpcrdma.h          |   26 ++++++
> >> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   17 +++-
> >> net/sunrpc/xprtrdma/svc_rdma_rw.c       |  139 ++++++++++++++++++++++++++++---
> >> 4 files changed, 169 insertions(+), 19 deletions(-)
> >> 
> >> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> >> index f660244cc8ba..8d80a759a909 100644
> >> --- a/include/linux/sunrpc/svc_rdma.h
> >> +++ b/include/linux/sunrpc/svc_rdma.h
> >> @@ -192,6 +192,12 @@ extern int svc_rdma_send_reply_chunk(struct svcxprt_rdma *rdma,
> >> extern int svc_rdma_process_read_list(struct svcxprt_rdma *rdma,
> >> 				      struct svc_rqst *rqstp,
> >> 				      struct svc_rdma_recv_ctxt *head);
> >> +extern void svc_rdma_prepare_read_chunk(struct svc_rqst *rqstp,
> >> +					struct svc_rdma_recv_ctxt *head);
> >> +extern int svc_rdma_pull_read_chunk(struct svcxprt_rdma *rdma,
> >> +				    struct svc_rqst *rqstp,
> >> +				    struct svc_rdma_recv_ctxt *ctxt,
> >> +				    unsigned int offset, unsigned int length);
> >> 
> >> /* svc_rdma_sendto.c */
> >> extern void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma);
> >> diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
> >> index 5954ce036173..30440cca321a 100644
> >> --- a/include/trace/events/rpcrdma.h
> >> +++ b/include/trace/events/rpcrdma.h
> >> @@ -2136,6 +2136,32 @@ TRACE_EVENT(svcrdma_sq_post_err,
> >> 	)
> >> );
> >> 
> >> +TRACE_EVENT(svcrdma_arg_payload,
> >> +	TP_PROTO(
> >> +		const struct svc_rqst *rqstp,
> >> +		unsigned int offset,
> >> +		unsigned int length
> >> +	),
> >> +
> >> +	TP_ARGS(rqstp, offset, length),
> >> +
> >> +	TP_STRUCT__entry(
> >> +		__field(u32, xid)
> >> +		__field(u32, offset)
> >> +		__field(u32, length)
> >> +	),
> >> +
> >> +	TP_fast_assign(
> >> +		__entry->xid = __be32_to_cpu(rqstp->rq_xid);
> >> +		__entry->offset = offset_in_page(offset);
> >> +		__entry->length = length;
> >> +	),
> >> +
> >> +	TP_printk("xid=0x%08x offset=%u length=%u",
> >> +		__entry->xid, __entry->offset, __entry->length
> >> +	)
> >> +);
> >> +
> >> #endif /* _TRACE_RPCRDMA_H */
> >> 
> >> #include <trace/define_trace.h>
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >> index 08a620b370ae..cd9c0fb1a470 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >> @@ -838,12 +838,13 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> >> 
> >> 	svc_rdma_get_inv_rkey(rdma_xprt, ctxt);
> >> 
> >> -	if (!pcl_is_empty(&ctxt->rc_read_pcl) ||
> >> -	    !pcl_is_empty(&ctxt->rc_call_pcl)) {
> >> +	if (!pcl_is_empty(&ctxt->rc_call_pcl) ||
> >> +	    ctxt->rc_read_pcl.cl_count > 1) {
> >> 		ret = svc_rdma_process_read_list(rdma_xprt, rqstp, ctxt);
> >> 		if (ret < 0)
> >> 			goto out_readfail;
> >> -	}
> >> +	} else if (ctxt->rc_read_pcl.cl_count == 1)
> >> +		svc_rdma_prepare_read_chunk(rqstp, ctxt);
> >> 
> >> 	rqstp->rq_xprt_ctxt = ctxt;
> >> 	rqstp->rq_prot = IPPROTO_MAX;
> >> @@ -887,5 +888,13 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> >> int svc_rdma_argument_payload(struct svc_rqst *rqstp, unsigned int offset,
> >> 			      unsigned int length)
> >> {
> >> -	return 0;
> >> +	struct svc_rdma_recv_ctxt *ctxt = rqstp->rq_xprt_ctxt;
> >> +	struct svc_xprt *xprt = rqstp->rq_xprt;
> >> +	struct svcxprt_rdma *rdma =
> >> +		container_of(xprt, struct svcxprt_rdma, sc_xprt);
> >> +
> >> +	if (!pcl_is_empty(&ctxt->rc_call_pcl) ||
> >> +	    ctxt->rc_read_pcl.cl_count != 1)
> >> +		return 0;
> >> +	return svc_rdma_pull_read_chunk(rdma, rqstp, ctxt, offset, length);
> >> }
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> >> index 29b7d477891c..5f03dfd2fa03 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> >> @@ -707,19 +707,24 @@ static int svc_rdma_build_read_segment(struct svc_rdma_read_info *info,
> >> 
> >> 	len = segment->rs_length;
> >> 	sge_no = PAGE_ALIGN(info->ri_pageoff + len) >> PAGE_SHIFT;
> >> +
> >> +	trace_printk("pageoff=%u len=%u sges=%u\n",
> >> +		info->ri_pageoff, len, sge_no);
> >> +
> >> 	ctxt = svc_rdma_get_rw_ctxt(cc->cc_rdma, sge_no);
> >> 	if (!ctxt)
> >> 		return -ENOMEM;
> >> 	ctxt->rw_nents = sge_no;
> >> +	head->rc_page_count += sge_no;
> >> 
> >> 	sg = ctxt->rw_sg_table.sgl;
> >> 	for (sge_no = 0; sge_no < ctxt->rw_nents; sge_no++) {
> >> 		seg_len = min_t(unsigned int, len,
> >> 				PAGE_SIZE - info->ri_pageoff);
> >> 
> >> -		if (!info->ri_pageoff)
> >> -			head->rc_page_count++;
> >> -
> >> +		trace_printk("  page=%p seg_len=%u offset=%u\n",
> >> +			rqstp->rq_pages[info->ri_pageno], seg_len,
> >> +			info->ri_pageoff);
> >> 		sg_set_page(sg, rqstp->rq_pages[info->ri_pageno],
> >> 			    seg_len, info->ri_pageoff);
> >> 		sg = sg_next(sg);
> >> @@ -804,15 +809,14 @@ static int svc_rdma_copy_inline_range(struct svc_rdma_read_info *info,
> >> 	unsigned int page_no, numpages;
> >> 
> >> 	numpages = PAGE_ALIGN(info->ri_pageoff + remaining) >> PAGE_SHIFT;
> >> +	head->rc_page_count += numpages;
> >> +
> >> 	for (page_no = 0; page_no < numpages; page_no++) {
> >> 		unsigned int page_len;
> >> 
> >> 		page_len = min_t(unsigned int, remaining,
> >> 				 PAGE_SIZE - info->ri_pageoff);
> >> 
> >> -		if (!info->ri_pageoff)
> >> -			head->rc_page_count++;
> >> -
> >> 		dst = page_address(rqstp->rq_pages[info->ri_pageno]);
> >> 		memcpy(dst + info->ri_pageoff, src + offset, page_len);
> >> 
> >> @@ -1092,15 +1096,8 @@ static noinline int svc_rdma_read_special(struct svc_rdma_read_info *info)
> >>  * @rqstp: set of pages to use as Read sink buffers
> >>  * @head: pages under I/O collect here
> >>  *
> >> - * The RPC/RDMA protocol assumes that the upper layer's XDR decoders
> >> - * pull each Read chunk as they decode an incoming RPC message.
> >> - *
> >> - * On Linux, however, the server needs to have a fully-constructed RPC
> >> - * message in rqstp->rq_arg when there is a positive return code from
> >> - * ->xpo_recvfrom. So the Read list is safety-checked immediately when
> >> - * it is received, then here the whole Read list is pulled all at once.
> >> - * The ingress RPC message is fully reconstructed once all associated
> >> - * RDMA Reads have completed.
> >> + * Handle complex Read chunk cases fully before svc_rdma_recvfrom()
> >> + * returns.
> >>  *
> >>  * Return values:
> >>  *   %1: all needed RDMA Reads were posted successfully,
> >> @@ -1159,3 +1156,115 @@ int svc_rdma_process_read_list(struct svcxprt_rdma *rdma,
> >> 	svc_rdma_read_info_free(info);
> >> 	return ret;
> >> }
> >> +
> >> +/**
> >> + * svc_rdma_prepare_read_chunk - Prepare rq_arg for Read chunk
> >> + * @rqstp: set of pages to use as Read sink buffers
> >> + * @head: pages under I/O collect here
> >> + *
> >> + * The Read chunk will be pulled when the upper layer's XDR
> >> + * decoder calls svc_decode_argument_payload(). In the meantime,
> >> + * fake up rq_arg.page_len and .len to reflect the size of the
> >> + * yet-to-be-pulled payload.
> >> + */
> >> +void svc_rdma_prepare_read_chunk(struct svc_rqst *rqstp,
> >> +				 struct svc_rdma_recv_ctxt *head)
> >> +{
> >> +	struct svc_rdma_chunk *chunk = pcl_first_chunk(&head->rc_read_pcl);
> >> +	unsigned int length = xdr_align_size(chunk->ch_length);
> >> +	struct xdr_buf *buf = &rqstp->rq_arg;
> >> +
> >> +	buf->tail[0].iov_base = buf->head[0].iov_base + chunk->ch_position;
> >> +	buf->tail[0].iov_len = buf->head[0].iov_len - chunk->ch_position;
> >> +	buf->head[0].iov_len = chunk->ch_position;
> >> +
> >> +	buf->page_len = length;
> >> +	buf->len += length;
> >> +	buf->buflen += length;
> >> +
> >> +	/*
> >> +	 * rq_respages starts after the last arg page. Note that we
> >> +	 * don't know the offset yet, so add an extra page as slack.
> >> +	 */
> >> +	length += PAGE_SIZE * 2 - 1;
> >> +	rqstp->rq_respages = &rqstp->rq_pages[length >> PAGE_SHIFT];
> >> +	rqstp->rq_next_page = rqstp->rq_respages + 1;
> >> +}
> >> +
> >> +/**
> >> + * svc_rdma_pull_read_chunk - Pull one Read chunk from the client
> >> + * @rdma: controlling RDMA transport
> >> + * @rqstp: set of pages to use as Read sink buffers
> >> + * @head: pages under I/O collect here
> >> + * @offset: offset of payload in file's page cache
> >> + * @length: size of payload, in bytes
> >> + *
> >> + * Once the upper layer's XDR decoder has decoded the length of
> >> + * the payload and its offset, we can be clever about setting up
> >> + * the RDMA Read sink buffer so that the VFS does not have to
> >> + * re-align the payload once it is received.
> >> + *
> >> + * Caveat: To keep things simple, this is an optimization that is
> >> + *	   used only when there is a single Read chunk in the Read
> >> + *	   list.
> >> + *
> >> + * Return values:
> >> + *   %0: all needed RDMA Reads were posted successfully,
> >> + *   %-EINVAL: client provided the wrong chunk size,
> >> + *   %-ENOMEM: rdma_rw context pool was exhausted,
> >> + *   %-ENOTCONN: posting failed (connection is lost),
> >> + *   %-EIO: rdma_rw initialization failed (DMA mapping, etc).
> >> + */
> >> +int svc_rdma_pull_read_chunk(struct svcxprt_rdma *rdma, struct svc_rqst *rqstp,
> >> +			     struct svc_rdma_recv_ctxt *head,
> >> +			     unsigned int offset, unsigned int length)
> >> +{
> >> +	struct svc_rdma_read_info *info;
> >> +	struct svc_rdma_chunk_ctxt *cc;
> >> +	struct svc_rdma_chunk *chunk;
> >> +	int ret;
> >> +
> >> +	trace_svcrdma_arg_payload(rqstp, offset, length);
> >> +
> >> +	/* Sanity: the Requester must have provided enough
> >> +	 * bytes to fill the XDR opaque.
> >> +	 */
> >> +	chunk = pcl_first_chunk(&head->rc_read_pcl);
> >> +	if (length > chunk->ch_length)
> >> +		return -EINVAL;
> >> +
> >> +	info = svc_rdma_read_info_alloc(rdma);
> >> +	if (!info)
> >> +		return -ENOMEM;
> >> +	cc = &info->ri_cc;
> >> +	info->ri_rqst = rqstp;
> >> +	info->ri_readctxt = head;
> >> +	info->ri_pageno = 0;
> >> +	info->ri_pageoff = offset_in_page(offset);
> >> +	info->ri_totalbytes = 0;
> >> +
> >> +	ret = svc_rdma_build_read_chunk(info, chunk);
> >> +	if (ret < 0)
> >> +		goto out_err;
> >> +	rqstp->rq_arg.pages = &info->ri_rqst->rq_pages[0];
> >> +	rqstp->rq_arg.page_base = offset_in_page(offset);
> >> +	rqstp->rq_arg.buflen += offset_in_page(offset);
> >> +
> >> +	trace_svcrdma_post_read_chunk(&cc->cc_cid, cc->cc_sqecount);
> >> +	init_completion(&cc->cc_done);
> >> +	ret = svc_rdma_post_chunk_ctxt(cc);
> >> +	if (ret < 0)
> >> +		goto out_err;
> >> +
> >> +	ret = 0;
> >> +	wait_for_completion(&cc->cc_done);
> >> +	if (cc->cc_status != IB_WC_SUCCESS)
> >> +		ret = -EIO;
> >> +
> >> +	/* Ensure svc_rdma_recv_ctxt_put() does not try to release pages */
> >> +	head->rc_page_count = 0;
> >> +
> >> +out_err:
> >> +	svc_rdma_read_info_free(info);
> >> +	return ret;
> >> +}
> >> 
> 
> --
> Chuck Lever
> 
> 


Thread overview: 13+ messages
2021-08-31 19:05 [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders Chuck Lever
2021-08-31 19:05 ` [PATCH RFC 1/6] SUNRPC: Capture value of xdr_buf::page_base Chuck Lever
2021-08-31 19:05 ` [PATCH RFC 2/6] SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases Chuck Lever
2021-08-31 19:05 ` [PATCH RFC 3/6] NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment() Chuck Lever
2021-08-31 19:05 ` [PATCH RFC 4/6] SUNRPC: svc_fill_write_vector() must handle non-zero page_bases Chuck Lever
2021-08-31 19:05 ` [PATCH RFC 5/6] NFSD: Add a transport hook for pulling argument payloads Chuck Lever
2021-08-31 19:05 ` [PATCH RFC 6/6] svcrdma: Pull Read chunks in ->xpo_argument_payload Chuck Lever
2021-08-31 19:08   ` Chuck Lever III
2021-08-31 20:40   ` J. Bruce Fields
2021-08-31 21:19     ` Chuck Lever III
2021-08-31 21:24       ` Bruce Fields
2021-08-31 20:42 ` [PATCH RFC 0/6] NFSD: Pull Read chunks in XDR decoders J. Bruce Fields
2021-08-31 21:23   ` Chuck Lever III
