* [PATCH v1 00/14] Server-side NFS/RDMA changes for v4.12
@ 2017-03-16 15:52 ` Chuck Lever
  0 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:52 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

This series overhauls the "reply send" side of the RPC-over-RDMA
transport to use the new rdma_rw API. Benefits include:

<> Better code modularity, less code duplication with other ULPs

<> Ability for svcrdma to use any registration mode for RDMA Writes

<> Correctly handles RPCs that have both a Write and a Reply chunk

<> Much better handling of Write chunk overrun


No significant performance changes noticed with this overhaul.

Additions outweigh deletions for two reasons: there are more large
block comments in the new code, and code to handle "call receive"
is also added in svc_rdma_rw.c, but not used yet.


Available in the "nfsd-rdma-for-4.12" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git


Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.12

---

Chuck Lever (14):
      svcrdma: Move send_wr to svc_rdma_op_ctxt
      svcrdma: Add svc_rdma_map_reply_hdr()
      svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
      svcrdma: Add helper to save pages under I/O
      svcrdma: Introduce local rdma_rw API helpers
      svcrdma: Use rdma_rw API in RPC reply path
      svcrdma: Clean up RDMA_ERROR path
      svcrdma: Report Write/Reply chunk overruns
      svcrdma: Clean up RPC-over-RDMA backchannel reply processing
      svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
      svcrdma: Remove old RDMA Write completion handlers
      svcrdma: Remove the req_map cache
      svcrdma: Clean out old XDR encoders
      svcrdma: Clean up svc_rdma_post_recv() error handling


 include/linux/sunrpc/rpc_rdma.h            |    3 
 include/linux/sunrpc/svc_rdma.h            |   81 +--
 net/sunrpc/xprtrdma/Makefile               |    2 
 net/sunrpc/xprtrdma/svc_rdma.c             |    8 
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |   74 +-
 net/sunrpc/xprtrdma/svc_rdma_marshal.c     |  148 +++--
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c    |   80 ++-
 net/sunrpc/xprtrdma/svc_rdma_rw.c          |  785 +++++++++++++++++++++++++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c      |  872 ++++++++++++----------------
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |  176 ++----
 10 files changed, 1438 insertions(+), 791 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/svc_rdma_rw.c

--
Chuck Lever
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH v1 01/14] svcrdma: Move send_wr to svc_rdma_op_ctxt
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:52     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:52 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up: Move the ib_send_wr off the stack, and move common send WR
setup code into a helper.

This is a refactoring change only.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h            |    3 +
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |   12 +-----
 net/sunrpc/xprtrdma/svc_rdma_sendto.c      |   57 ++++++++++++++++------------
 3 files changed, 37 insertions(+), 35 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index b105f73..fa3ef11 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -85,6 +85,7 @@ struct svc_rdma_op_ctxt {
 	enum dma_data_direction direction;
 	int count;
 	unsigned int mapped_sges;
+	struct ib_send_wr send_wr;
 	struct ib_sge sge[RPCSVC_MAXPAGES];
 	struct page *pages[RPCSVC_MAXPAGES];
 };
@@ -227,6 +228,8 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
 /* svc_rdma_sendto.c */
 extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
 			    struct svc_rdma_req_map *, bool);
+extern void svc_rdma_build_send_wr(struct svc_rdma_op_ctxt *ctxt,
+				   int num_sge);
 extern int svc_rdma_sendto(struct svc_rqst *);
 extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
 				int);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index ff1df40..6741ed0 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -104,7 +104,6 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
 	struct xdr_buf *sndbuf = &rqst->rq_snd_buf;
 	struct svc_rdma_op_ctxt *ctxt;
 	struct svc_rdma_req_map *vec;
-	struct ib_send_wr send_wr;
 	int ret;
 
 	vec = svc_rdma_get_req_map(rdma);
@@ -132,15 +131,8 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
 	}
 	svc_rdma_count_mappings(rdma, ctxt);
 
-	memset(&send_wr, 0, sizeof(send_wr));
-	ctxt->cqe.done = svc_rdma_wc_send;
-	send_wr.wr_cqe = &ctxt->cqe;
-	send_wr.sg_list = ctxt->sge;
-	send_wr.num_sge = 1;
-	send_wr.opcode = IB_WR_SEND;
-	send_wr.send_flags = IB_SEND_SIGNALED;
-
-	ret = svc_rdma_send(rdma, &send_wr);
+	svc_rdma_build_send_wr(ctxt, 1);
+	ret = svc_rdma_send(rdma, &ctxt->send_wr);
 	if (ret) {
 		ret = -EIO;
 		goto out_unmap;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 515221b..fdf8e3d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -435,6 +435,28 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
 	return -EIO;
 }
 
+/**
+ * svc_rdma_build_send_wr - Set up a Send Work Request
+ * @ctxt: op_ctxt for transmitting the Send WR
+ *
+ */
+void svc_rdma_build_send_wr(struct svc_rdma_op_ctxt *ctxt, int num_sge)
+{
+	struct ib_send_wr *send_wr;
+
+	send_wr = &ctxt->send_wr;
+	send_wr->next = NULL;
+	ctxt->cqe.done = svc_rdma_wc_send;
+	send_wr->wr_cqe = &ctxt->cqe;
+	send_wr->sg_list = ctxt->sge;
+	send_wr->num_sge = num_sge;
+	send_wr->opcode = IB_WR_SEND;
+	send_wr->send_flags = IB_SEND_SIGNALED;
+
+	dprintk("svcrdma: posting Send WR with %u sge(s)\n",
+		send_wr->num_sge);
+}
+
 /* This function prepares the portion of the RPCRDMA message to be
  * sent in the RDMA_SEND. This function is called after data sent via
  * RDMA has already been transmitted. There are three cases:
@@ -460,7 +482,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
 		      u32 inv_rkey)
 {
 	struct svc_rdma_op_ctxt *ctxt;
-	struct ib_send_wr send_wr;
+	struct ib_send_wr *send_wr;
 	u32 xdr_off;
 	int sge_no;
 	int sge_bytes;
@@ -524,19 +546,14 @@ static int send_reply(struct svcxprt_rdma *rdma,
 		pr_err("svcrdma: Too many sges (%d)\n", sge_no);
 		goto err;
 	}
-	memset(&send_wr, 0, sizeof send_wr);
-	ctxt->cqe.done = svc_rdma_wc_send;
-	send_wr.wr_cqe = &ctxt->cqe;
-	send_wr.sg_list = ctxt->sge;
-	send_wr.num_sge = sge_no;
-	if (inv_rkey) {
-		send_wr.opcode = IB_WR_SEND_WITH_INV;
-		send_wr.ex.invalidate_rkey = inv_rkey;
-	} else
-		send_wr.opcode = IB_WR_SEND;
-	send_wr.send_flags =  IB_SEND_SIGNALED;
 
-	ret = svc_rdma_send(rdma, &send_wr);
+	svc_rdma_build_send_wr(ctxt, sge_no);
+	send_wr = &ctxt->send_wr;
+	if (inv_rkey) {
+		send_wr->opcode = IB_WR_SEND_WITH_INV;
+		send_wr->ex.invalidate_rkey = inv_rkey;
+	}
+	ret = svc_rdma_send(rdma, send_wr);
 	if (ret)
 		goto err;
 
@@ -652,7 +669,6 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
 			 int status)
 {
-	struct ib_send_wr err_wr;
 	struct page *p;
 	struct svc_rdma_op_ctxt *ctxt;
 	enum rpcrdma_errcode err;
@@ -692,17 +708,8 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
 	}
 	svc_rdma_count_mappings(xprt, ctxt);
 
-	/* Prepare SEND WR */
-	memset(&err_wr, 0, sizeof(err_wr));
-	ctxt->cqe.done = svc_rdma_wc_send;
-	err_wr.wr_cqe = &ctxt->cqe;
-	err_wr.sg_list = ctxt->sge;
-	err_wr.num_sge = 1;
-	err_wr.opcode = IB_WR_SEND;
-	err_wr.send_flags = IB_SEND_SIGNALED;
-
-	/* Post It */
-	ret = svc_rdma_send(xprt, &err_wr);
+	svc_rdma_build_send_wr(ctxt, 1);
+	ret = svc_rdma_send(xprt, &ctxt->send_wr);
 	if (ret) {
 		dprintk("svcrdma: Error %d posting send for protocol error\n",
 			ret);
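
The pattern in this patch (one Send WR embedded in the per-operation
context, filled in by a shared helper, with callers overriding only what
differs) can be sketched in user space as follows. This is a minimal
illustration, not the kernel code: every demo_* type, constant, and
function below is a hypothetical stand-in for struct ib_send_wr,
struct svc_rdma_op_ctxt, and svc_rdma_build_send_wr().

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel definitions; NOT the real ones. */
enum demo_opcode { DEMO_WR_SEND, DEMO_WR_SEND_WITH_INV };
#define DEMO_SEND_SIGNALED 1u
#define DEMO_MAX_SGE 4

struct demo_sge {
	unsigned long addr;
	unsigned int length;
	unsigned int lkey;
};

struct demo_send_wr {
	struct demo_send_wr *next;
	struct demo_sge *sg_list;
	int num_sge;
	enum demo_opcode opcode;
	unsigned int send_flags;
	unsigned int invalidate_rkey;
};

/* The WR is a member of the context, so it lives as long as the context
 * does instead of disappearing when the posting function returns.
 */
struct demo_op_ctxt {
	struct demo_send_wr send_wr;
	struct demo_sge sge[DEMO_MAX_SGE];
};

/* Common setup shared by every sender. */
static void demo_build_send_wr(struct demo_op_ctxt *ctxt, int num_sge)
{
	struct demo_send_wr *wr = &ctxt->send_wr;

	wr->next = NULL;
	wr->sg_list = ctxt->sge;
	wr->num_sge = num_sge;
	wr->opcode = DEMO_WR_SEND;
	wr->send_flags = DEMO_SEND_SIGNALED;
}

/* A caller that wants remote invalidation overrides the opcode after the
 * common setup, as send_reply() does in the diff above.
 */
static struct demo_send_wr *demo_prepare_reply(struct demo_op_ctxt *ctxt,
					       int num_sge,
					       unsigned int inv_rkey)
{
	demo_build_send_wr(ctxt, num_sge);
	if (inv_rkey) {
		ctxt->send_wr.opcode = DEMO_WR_SEND_WITH_INV;
		ctxt->send_wr.invalidate_rkey = inv_rkey;
	}
	return &ctxt->send_wr;
}
```

Callers that send a plain one-SGE message (the backchannel and error
paths in the diff) would call only the build helper and post
&ctxt->send_wr unchanged.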


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 02/14] svcrdma: Add svc_rdma_map_reply_hdr()
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:52     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:52 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Introduce a helper to DMA-map a reply's transport header before
sending it. This will in part replace the map vector cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h            |    3 +
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |   38 +++++------------
 net/sunrpc/xprtrdma/svc_rdma_sendto.c      |   61 ++++++++++++++++++++++------
 3 files changed, 62 insertions(+), 40 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index fa3ef11..ac05495 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -228,6 +228,9 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
 /* svc_rdma_sendto.c */
 extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
 			    struct svc_rdma_req_map *, bool);
+extern int svc_rdma_map_reply_hdr(struct svcxprt_rdma *rdma,
+				  struct svc_rdma_op_ctxt *ctxt,
+				  __be32 *rdma_resp, unsigned int len);
 extern void svc_rdma_build_send_wr(struct svc_rdma_op_ctxt *ctxt,
 				   int num_sge);
 extern int svc_rdma_sendto(struct svc_rqst *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 6741ed0..71ad9cd 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -101,52 +101,36 @@ int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
 static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
 			      struct rpc_rqst *rqst)
 {
-	struct xdr_buf *sndbuf = &rqst->rq_snd_buf;
 	struct svc_rdma_op_ctxt *ctxt;
-	struct svc_rdma_req_map *vec;
 	int ret;
 
-	vec = svc_rdma_get_req_map(rdma);
-	ret = svc_rdma_map_xdr(rdma, sndbuf, vec, false);
-	if (ret)
+	ctxt = svc_rdma_get_context(rdma);
+
+	/* rpcrdma_bc_send_request builds the transport header and
+	 * the backchannel RPC message in the same buffer. Thus only
+	 * one SGE is needed to send both.
+	 */
+	ret = svc_rdma_map_reply_hdr(rdma, ctxt, rqst->rq_buffer,
+				     rqst->rq_snd_buf.len);
+	if (ret < 0)
 		goto out_err;
 
 	ret = svc_rdma_repost_recv(rdma, GFP_NOIO);
 	if (ret)
 		goto out_err;
 
-	ctxt = svc_rdma_get_context(rdma);
-	ctxt->pages[0] = virt_to_page(rqst->rq_buffer);
-	ctxt->count = 1;
-
-	ctxt->direction = DMA_TO_DEVICE;
-	ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
-	ctxt->sge[0].length = sndbuf->len;
-	ctxt->sge[0].addr =
-	    ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0,
-			    sndbuf->len, DMA_TO_DEVICE);
-	if (ib_dma_mapping_error(rdma->sc_cm_id->device, ctxt->sge[0].addr)) {
-		ret = -EIO;
-		goto out_unmap;
-	}
-	svc_rdma_count_mappings(rdma, ctxt);
-
 	svc_rdma_build_send_wr(ctxt, 1);
 	ret = svc_rdma_send(rdma, &ctxt->send_wr);
 	if (ret) {
+		svc_rdma_unmap_dma(ctxt);
+		svc_rdma_put_context(ctxt, 1);
 		ret = -EIO;
-		goto out_unmap;
 	}
 
 out_err:
-	svc_rdma_put_req_map(rdma, vec);
 	dprintk("svcrdma: %s returns %d\n", __func__, ret);
 	return ret;
 
-out_unmap:
-	svc_rdma_unmap_dma(ctxt);
-	svc_rdma_put_context(ctxt, 1);
-	goto out_err;
 }
 
 /* Server-side transport endpoint wants a whole page for its send
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index fdf8e3d..0e55b34 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -217,6 +217,49 @@ static u32 svc_rdma_get_inv_rkey(struct rpcrdma_msg *rdma_argp,
 	return 0;
 }
 
+static int svc_rdma_dma_map_page(struct svcxprt_rdma *rdma,
+				 struct svc_rdma_op_ctxt *ctxt,
+				 unsigned int sge_no,
+				 struct page *page,
+				 unsigned int offset,
+				 unsigned int len)
+{
+	struct ib_device *dev = rdma->sc_cm_id->device;
+	dma_addr_t dma_addr;
+
+	dma_addr = ib_dma_map_page(dev, page, offset, len, DMA_TO_DEVICE);
+	if (ib_dma_mapping_error(dev, dma_addr))
+		return -EIO;
+
+	ctxt->sge[sge_no].addr = dma_addr;
+	ctxt->sge[sge_no].length = len;
+	ctxt->sge[sge_no].lkey = rdma->sc_pd->local_dma_lkey;
+	svc_rdma_count_mappings(rdma, ctxt);
+	return 0;
+}
+
+/**
+ * svc_rdma_map_reply_hdr - DMA map the transport header buffer
+ * @rdma: controlling transport
+ * @ctxt: op_ctxt for the Send WR
+ * @rdma_resp: buffer containing transport header
+ * @len: length of transport header
+ *
+ * Returns:
+ *	%0 if the header is DMA mapped,
+ *	%-EIO if DMA mapping failed.
+ */
+int svc_rdma_map_reply_hdr(struct svcxprt_rdma *rdma,
+			   struct svc_rdma_op_ctxt *ctxt,
+			   __be32 *rdma_resp,
+			   unsigned int len)
+{
+	ctxt->direction = DMA_TO_DEVICE;
+	ctxt->pages[0] = virt_to_page(rdma_resp);
+	ctxt->count = 1;
+	return svc_rdma_dma_map_page(rdma, ctxt, 0, ctxt->pages[0], 0, len);
+}
+
 /* Assumptions:
  * - The specified write_len can be represented in sc_max_sge * PAGE_SIZE
  */
@@ -691,22 +734,14 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
 		err = ERR_VERS;
 	length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
 
+	/* Map transport header; no RPC message payload */
 	ctxt = svc_rdma_get_context(xprt);
-	ctxt->direction = DMA_TO_DEVICE;
-	ctxt->count = 1;
-	ctxt->pages[0] = p;
-
-	/* Prepare SGE for local address */
-	ctxt->sge[0].lkey = xprt->sc_pd->local_dma_lkey;
-	ctxt->sge[0].length = length;
-	ctxt->sge[0].addr = ib_dma_map_page(xprt->sc_cm_id->device,
-					    p, 0, length, DMA_TO_DEVICE);
-	if (ib_dma_mapping_error(xprt->sc_cm_id->device, ctxt->sge[0].addr)) {
-		dprintk("svcrdma: Error mapping buffer for protocol error\n");
-		svc_rdma_put_context(ctxt, 1);
+	ret = svc_rdma_map_reply_hdr(xprt, ctxt, &rmsgp->rm_xid, length);
+	if (ret) {
+		dprintk("svcrdma: Error %d mapping send for protocol error\n",
+			ret);
 		return;
 	}
-	svc_rdma_count_mappings(xprt, ctxt);
 
 	svc_rdma_build_send_wr(ctxt, 1);
 	ret = svc_rdma_send(xprt, &ctxt->send_wr);
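
The helper added by this patch follows a map-then-record order: map the
page, check for a mapping error, and only on success fill in the SGE and
bump the mapping count. A rough user-space sketch of that ordering is
shown below. The mapper is entirely made up (it pretends odd addresses
cannot be mapped), and all demo_* names are hypothetical stand-ins for
ib_dma_map_page(), ib_dma_mapping_error(), and the op_ctxt fields.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_MAPPING_ERROR UINT64_MAX
#define DEMO_MAX_SGE 4

struct demo_map_sge {
	uint64_t addr;
	unsigned int length;
	unsigned int lkey;
};

struct demo_map_ctxt {
	unsigned int mapped_sges;	/* how many SGEs must be unmapped later */
	struct demo_map_sge sge[DEMO_MAX_SGE];
};

/* Hypothetical mapper: odd CPU addresses fail, others get a fixed offset. */
static uint64_t demo_dma_map(uintptr_t cpu_addr, unsigned int len)
{
	(void)len;
	if (cpu_addr & 1)
		return DEMO_MAPPING_ERROR;
	return (uint64_t)cpu_addr + 0x1000;
}

/* Map one buffer into SGE slot @sge_no. The context is left untouched
 * when mapping fails, so the unmap path never sees a half-filled SGE.
 */
static int demo_map_one_sge(struct demo_map_ctxt *ctxt, unsigned int sge_no,
			    uintptr_t cpu_addr, unsigned int len,
			    unsigned int lkey)
{
	uint64_t dma_addr = demo_dma_map(cpu_addr, len);

	if (dma_addr == DEMO_MAPPING_ERROR)
		return -EIO;

	ctxt->sge[sge_no].addr = dma_addr;
	ctxt->sge[sge_no].length = len;
	ctxt->sge[sge_no].lkey = lkey;
	ctxt->mapped_sges++;
	return 0;
}
```

In this sketch a failed call returns -EIO and leaves mapped_sges as it
was; the caller still owns the context and has to release it.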


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 03/14] svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:52     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:52 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The Send Queue depth is temporarily reduced to 1 SQE per credit. The
new rdma_rw API does an internal computation, during QP creation, to
increase the depth of the Send Queue to handle RDMA Read and Write
operations.

This change has to come before the NFSD code paths are updated to
use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
the size of the SQ too much, resulting in memory allocation failures
during QP creation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h          |    1 -
 net/sunrpc/xprtrdma/svc_rdma.c           |    2 --
 net/sunrpc/xprtrdma/svc_rdma_transport.c |    2 +-
 3 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index ac05495..f066349 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -182,7 +182,6 @@ struct svcxprt_rdma {
 /* The default ORD value is based on two outstanding full-size writes with a
  * page size of 4k, or 32k * 2 ops / 4k = 16 outstanding RDMA_READ.  */
 #define RPCRDMA_ORD             (64/4)
-#define RPCRDMA_SQ_DEPTH_MULT   8
 #define RPCRDMA_MAX_REQUESTS    32
 #define RPCRDMA_MAX_REQ_SIZE    4096
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index c846ca9..9124441 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -247,8 +247,6 @@ int svc_rdma_init(void)
 	dprintk("SVCRDMA Module Init, register RPC RDMA transport\n");
 	dprintk("\tsvcrdma_ord      : %d\n", svcrdma_ord);
 	dprintk("\tmax_requests     : %u\n", svcrdma_max_requests);
-	dprintk("\tsq_depth         : %u\n",
-		svcrdma_max_requests * RPCRDMA_SQ_DEPTH_MULT);
 	dprintk("\tmax_bc_requests  : %u\n", svcrdma_max_bc_requests);
 	dprintk("\tmax_inline       : %d\n", svcrdma_max_req_size);
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index c13a5c3..b84cd53 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1013,7 +1013,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 					    svcrdma_max_bc_requests);
 	newxprt->sc_rq_depth = newxprt->sc_max_requests +
 			       newxprt->sc_max_bc_requests;
-	newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_rq_depth;
+	newxprt->sc_sq_depth = newxprt->sc_rq_depth;
 	atomic_set(&newxprt->sc_sq_avail, newxprt->sc_sq_depth);
 
 	if (!svc_rdma_prealloc_ctxts(newxprt))

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 04/14] svcrdma: Add helper to save pages under I/O
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:52     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:52 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up: extract the logic to save pages under I/O into a helper to
add a big documenting comment without adding clutter in the send
path.

This is a refactoring change only.
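
The ownership hand-off the helper performs can be sketched in user space
as follows. This is only a model under stated assumptions: the struct and
function names (model_rqst, model_ctxt, model_save_io_pages) and the
MODEL_MAX_PAGES bound are hypothetical, array indices stand in for the
kernel's page pointers, and slot 0 of the context's page array is assumed
to be already occupied by the caller.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PAGES 8	/* hypothetical bound, for illustration only */

/* Stand-in for the parts of svc_rqst used here. */
struct model_rqst {
	void *respages[MODEL_MAX_PAGES];
	size_t next_page;	/* index one past the last response page in use */
};

/* Stand-in for the parts of svc_rdma_op_ctxt used here. */
struct model_ctxt {
	void *pages[MODEL_MAX_PAGES + 1];	/* slot 0 is assumed taken */
	int count;
};

/* Move every in-use response page from the request to the context,
 * so that the context (not the request) releases them later.
 */
static void model_save_io_pages(struct model_rqst *rqstp,
				struct model_ctxt *ctxt)
{
	size_t i, pages = rqstp->next_page;

	assert(pages <= MODEL_MAX_PAGES);

	ctxt->count += (int)pages;
	for (i = 0; i < pages; i++) {
		ctxt->pages[i + 1] = rqstp->respages[i];
		rqstp->respages[i] = NULL;	/* request no longer owns it */
	}
	rqstp->next_page = 1;
}
```

After the call, the request's response array holds only NULL entries for
the transferred pages, so releasing the request cannot free a page that
is still under I/O.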

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |   31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 0e55b34..b4028bc3 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -478,6 +478,23 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
 	return -EIO;
 }
 
+/* The svc_rqst and all resources it owns are released as soon as
+ * svc_rdma_sendto returns. Transfer pages under I/O to the ctxt
+ * so they are released by the Send completion handler.
+ */
+static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
+				   struct svc_rdma_op_ctxt *ctxt)
+{
+	int i, pages = rqstp->rq_next_page - rqstp->rq_respages;
+
+	ctxt->count += pages;
+	for (i = 0; i < pages; i++) {
+		ctxt->pages[i + 1] = rqstp->rq_respages[i];
+		rqstp->rq_respages[i] = NULL;
+	}
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
+}
+
 /**
  * svc_rdma_build_send_wr - Set up a Send Work Request
  * @ctxt: op_ctxt for transmitting the Send WR
@@ -529,8 +546,6 @@ static int send_reply(struct svcxprt_rdma *rdma,
 	u32 xdr_off;
 	int sge_no;
 	int sge_bytes;
-	int page_no;
-	int pages;
 	int ret = -EIO;
 
 	/* Prepare the context */
@@ -573,17 +588,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
 		goto err;
 	}
 
-	/* Save all respages in the ctxt and remove them from the
-	 * respages array. They are our pages until the I/O
-	 * completes.
-	 */
-	pages = rqstp->rq_next_page - rqstp->rq_respages;
-	for (page_no = 0; page_no < pages; page_no++) {
-		ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
-		ctxt->count++;
-		rqstp->rq_respages[page_no] = NULL;
-	}
-	rqstp->rq_next_page = rqstp->rq_respages + 1;
+	svc_rdma_save_io_pages(rqstp, ctxt);
 
 	if (sge_no > rdma->sc_max_sge) {
 		pr_err("svcrdma: Too many sges (%d)\n", sge_no);

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 05/14] svcrdma: Introduce local rdma_rw API helpers
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:53     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:53 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The plan is to replace the local bespoke code that constructs and
posts RDMA Read and Write Work Requests with calls to the rdma_rw
API. This shares code with other RDMA-enabled ULPs that manages the
gory details of buffer registration and posting Work Requests.

Some design notes:

 o svc_xprt reference counting is modified, since one rdma_rw_ctx
   generates one completion, no matter how many Write WRs are
   posted. To accommodate the new reference counting scheme, a new
   version of svc_rdma_send() is introduced.

 o The structure of RPC-over-RDMA transport headers is flexible,
   allowing multiple segments per Reply with arbitrary alignment.
   Thus I did not take the further step of chaining Write WRs with
   the Send WR containing the RPC Reply message. The Write and Send
   WRs continue to be built by separate pieces of code.

 o The current code builds the transport header as it is
   constructing Write WRs. I've replaced that with marshaling of
   transport header data items in a separate step. This is because
   the exact structure of client-provided segments may not align
   with the components of the server's reply xdr_buf, or the pages
   in the page list. Thus parts of each client-provided segment may
   be written at different points in the send path.

 o Since the Write list and Reply chunk marshaling code is being
   replaced, I took the opportunity to replace some of the C
   structure-based XDR encoding code with more portable code that
   instead uses pointer arithmetic.
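
The last two notes can be illustrated with a user-space sketch that
copies one chunk word by word and rewrites each segment's length to the
number of bytes actually consumed. It follows the shape of the
xdr_encode_write_chunk() helper this patch adds, but it is only a sketch:
copy_chunk() and SEGMENT_WORDS are hypothetical names, and htonl()/ntohl()
stand in for the kernel's cpu_to_be32()/be32_to_cpu().

```c
#include <arpa/inet.h>	/* htonl(), ntohl() */
#include <assert.h>
#include <stdint.h>

/* One segment on the wire: handle, length, 64-bit offset (two words). */
#define SEGMENT_WORDS 4

/* Copy one chunk (discriminator, segment count, segments) from src to
 * dst using pointer arithmetic on big-endian words. Each segment's
 * length is replaced by the number of bytes consumed in that segment.
 *
 * Returns the number of segments in the chunk.
 */
static unsigned int copy_chunk(uint32_t *dst, const uint32_t *src,
			       unsigned int remaining)
{
	unsigned int i, nsegs;

	assert(dst && src);

	*dst++ = *src++;		/* list discriminator */
	nsegs = ntohl(*src);
	*dst++ = *src++;		/* segment count */

	for (i = 0; i < nsegs; i++) {
		uint32_t seg_len;

		*dst++ = *src++;	/* segment's RDMA handle */

		seg_len = ntohl(*src++);
		if (remaining >= seg_len) {
			/* entire segment was consumed */
			*dst++ = htonl(seg_len);
			remaining -= seg_len;
		} else {
			/* segment only partly filled, or unused */
			*dst++ = htonl(remaining);
			remaining = 0;
		}

		*dst++ = *src++;	/* offset, upper 32 bits */
		*dst++ = *src++;	/* offset, lower 32 bits */
	}

	return nsegs;
}
```

For example, consuming 5000 bytes from a chunk of two 4096-byte segments
leaves the first length at 4096 and sets the second to 904.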

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h          |   22 +
 net/sunrpc/xprtrdma/Makefile             |    2 
 net/sunrpc/xprtrdma/svc_rdma_marshal.c   |  114 ++++
 net/sunrpc/xprtrdma/svc_rdma_rw.c        |  785 ++++++++++++++++++++++++++++++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    |    2 
 net/sunrpc/xprtrdma/svc_rdma_transport.c |    4 
 6 files changed, 925 insertions(+), 4 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/svc_rdma_rw.c

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f066349..5fc9f6e 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -145,12 +145,15 @@ struct svcxprt_rdma {
 	u32		     sc_max_requests;	/* Max requests */
 	u32		     sc_max_bc_requests;/* Backward credits */
 	int                  sc_max_req_size;	/* Size of each RQ WR buf */
+	u8		     sc_port_num;
 
 	struct ib_pd         *sc_pd;
 
 	spinlock_t	     sc_ctxt_lock;
 	struct list_head     sc_ctxts;
 	int		     sc_ctxt_used;
+	spinlock_t	     sc_rw_ctxt_lock;
+	struct list_head     sc_rw_ctxts;
 	spinlock_t	     sc_map_lock;
 	struct list_head     sc_maps;
 
@@ -209,10 +212,15 @@ extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
 extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
 				     struct rpcrdma_msg *,
 				     enum rpcrdma_errcode, __be32 *);
-extern void svc_rdma_xdr_encode_write_list(struct rpcrdma_msg *, int);
+extern void svc_rdma_old_encode_write_list(struct rpcrdma_msg *rmsgp,
+					   int chunks);
 extern void svc_rdma_xdr_encode_reply_array(struct rpcrdma_write_array *, int);
 extern void svc_rdma_xdr_encode_array_chunk(struct rpcrdma_write_array *, int,
 					    __be32, __be64, u32);
+extern void svc_rdma_xdr_encode_write_list(__be32 *rdma_resp, __be32 *wr_ch,
+					   unsigned int consumed);
+extern void svc_rdma_xdr_encode_reply_chunk(__be32 *rdma_resp, __be32 *rp_ch,
+					    unsigned int consumed);
 extern unsigned int svc_rdma_xdr_get_reply_hdr_len(__be32 *rdma_resp);
 
 /* svc_rdma_recvfrom.c */
@@ -224,6 +232,18 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
 				struct svc_rdma_op_ctxt *, int *, u32 *,
 				u32, u32, u64, bool);
 
+/* svc_rdma_rw.c */
+extern void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma);
+extern int svc_rdma_recv_read_list(struct svcxprt_rdma *rdma, __be32 *ch,
+				   struct svc_rdma_op_ctxt *head,
+				   struct svc_rqst *rqstp);
+extern int svc_rdma_send_write_list(struct svcxprt_rdma *rdma,
+				    __be32 *wr_ch, __be32 *rdma_resp,
+				    struct xdr_buf *xdr);
+extern int svc_rdma_send_reply_chunk(struct svcxprt_rdma *rdma,
+				     __be32 *wr_lst, __be32 *rp_ch,
+				     __be32 *rdma_resp, struct xdr_buf *xdr);
+
 /* svc_rdma_sendto.c */
 extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
 			    struct svc_rdma_req_map *, bool);
diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index ef19fa4..c1ae814 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -4,5 +4,5 @@ rpcrdma-y := transport.o rpc_rdma.o verbs.o \
 	fmr_ops.o frwr_ops.o \
 	svc_rdma.o svc_rdma_backchannel.o svc_rdma_transport.o \
 	svc_rdma_marshal.o svc_rdma_sendto.o svc_rdma_recvfrom.o \
-	module.o
+	svc_rdma_rw.o module.o
 rpcrdma-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel.o
diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
index 1c4aabf..bf3ca7e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
@@ -217,7 +217,7 @@ unsigned int svc_rdma_xdr_get_reply_hdr_len(__be32 *rdma_resp)
 	return (unsigned long)p - (unsigned long)rdma_resp;
 }
 
-void svc_rdma_xdr_encode_write_list(struct rpcrdma_msg *rmsgp, int chunks)
+void svc_rdma_old_encode_write_list(struct rpcrdma_msg *rmsgp, int chunks)
 {
 	struct rpcrdma_write_array *ary;
 
@@ -255,3 +255,115 @@ void svc_rdma_xdr_encode_array_chunk(struct rpcrdma_write_array *ary,
 	seg->rs_offset = rs_offset;
 	seg->rs_length = cpu_to_be32(write_len);
 }
+
+/* One Write chunk is copied from Call transport header to Reply
+ * transport header. Each segment's length field is updated to
+ * reflect number of bytes consumed in the segment.
+ *
+ * Returns number of segments in this chunk.
+ */
+static unsigned int xdr_encode_write_chunk(__be32 *dst, __be32 *src,
+					   unsigned int remaining)
+{
+	unsigned int i, nsegs;
+	u32 seg_len;
+
+	/* Write list discriminator */
+	*dst++ = *src++;
+
+	/* number of segments in this chunk */
+	nsegs = be32_to_cpup(src);
+	*dst++ = *src++;
+
+	for (i = nsegs; i; i--) {
+		/* segment's RDMA handle */
+		*dst++ = *src++;
+
+		/* bytes returned in this segment */
+		seg_len = be32_to_cpu(*src);
+		if (remaining >= seg_len) {
+			/* entire segment was consumed */
+			*dst = *src;
+			remaining -= seg_len;
+		} else {
+			/* segment only partly filled */
+			*dst = cpu_to_be32(remaining);
+			remaining = 0;
+		}
+		dst++; src++;
+
+		/* segment's RDMA offset */
+		*dst++ = *src++;
+		*dst++ = *src++;
+	}
+
+	return nsegs;
+}
+
+/**
+ * svc_rdma_xdr_encode_write_list - Encode Reply's Write list
+ * @rdma_resp: Reply's transport header
+ * @wr_ch: Write list in Call's transport header
+ * @consumed: total Write chunk bytes consumed in Reply
+ *
+ * The client provided a Write list in the Call message. Fill in
+ * the segments in the first Write chunk in the Reply's transport
+ * header with the number of bytes consumed in each segment.
+ * Remaining chunks are returned unused.
+ *
+ * Assumptions:
+ *  - Server can consume only one Write chunk.
+ */
+void svc_rdma_xdr_encode_write_list(__be32 *rdma_resp, __be32 *wr_ch,
+				    unsigned int consumed)
+{
+	unsigned int nsegs;
+	__be32 *p, *q;
+
+	/* RPC-over-RDMA V1 replies never have a Read list. */
+	p = rdma_resp + rpcrdma_fixed_maxsz + 1;
+
+	q = wr_ch;
+	while (*q != xdr_zero) {
+		nsegs = xdr_encode_write_chunk(p, q, consumed);
+		q += 2 + nsegs * rpcrdma_segment_maxsz;
+		p += 2 + nsegs * rpcrdma_segment_maxsz;
+		consumed = 0;
+	}
+
+	/* Terminate Write list */
+	*p++ = xdr_zero;
+
+	/* Reply chunk discriminator; may be replaced later */
+	*p = xdr_zero;
+}
+
+/**
+ * svc_rdma_xdr_encode_reply_chunk - Encode Reply's Reply chunk
+ * @rdma_resp: Reply's transport header
+ * @rp_ch: Reply chunk in Call's transport header
+ * @consumed: total Reply chunk bytes consumed in Reply
+ *
+ * The client provided a Reply chunk in the Call message. Fill in
+ * the segments in the Reply chunk in the Reply message with the
+ * number of bytes consumed in each segment.
+ *
+ * Assumptions:
+ * - Reply chunk is smaller than or equal in size to Reply
+ */
+void svc_rdma_xdr_encode_reply_chunk(__be32 *rdma_resp, __be32 *rp_ch,
+				     unsigned int consumed)
+{
+	__be32 *p;
+
+	/* Find the Reply chunk in the Reply's xprt header.
+	 * RPC-over-RDMA V1 replies never have a Read list.
+	 */
+	p = rdma_resp + rpcrdma_fixed_maxsz + 1;
+
+	/* Skip past Write list */
+	while (*p++ != xdr_zero)
+		p += 1 + be32_to_cpup(p) * rpcrdma_segment_maxsz;
+
+	xdr_encode_write_chunk(p, rp_ch, consumed);
+}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
new file mode 100644
index 0000000..1e76227
--- /dev/null
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -0,0 +1,785 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
+ */
+
+#include <linux/sunrpc/rpc_rdma.h>
+#include <linux/sunrpc/svc_rdma.h>
+#include <linux/sunrpc/debug.h>
+
+#include <rdma/rw.h>
+
+#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
+
+/* Each R/W context contains state for one chain of RDMA Read or
+ * Write Work Requests (one RDMA segment to be read from or written
+ * back to the client).
+ *
+ * Each WR chain handles a single contiguous server-side buffer,
+ * because some registration modes (e.g. FRWR) do not support a
+ * discontiguous scatterlist.
+ *
+ * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
+ * from a client may contain a unique R_key, so each WR chain moves
+ * one segment (or less) at a time.
+ *
+ * The scatterlist makes this data structure just over 8KB in size
+ * on 4KB-page platforms. As the size of this structure increases
+ * past one page, it becomes more likely that allocating one of these
+ * will fail. Therefore, these contexts are created on demand, but
+ * cached and reused until the controlling svcxprt_rdma is destroyed.
+ */
+struct svc_rdma_rw_ctxt {
+	struct list_head	rw_list;
+	struct ib_cqe		rw_cqe;
+	struct svcxprt_rdma	*rw_rdma;
+	int			rw_nents;
+	int			rw_wrcount;
+	enum dma_data_direction	rw_dir;
+	struct svc_rdma_op_ctxt	*rw_readctxt;
+	struct rdma_rw_ctx	rw_ctx;
+	struct scatterlist	rw_sg[RPCSVC_MAXPAGES];
+};
+
+static struct svc_rdma_rw_ctxt *
+svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma)
+{
+	struct svc_rdma_rw_ctxt *ctxt;
+
+	svc_xprt_get(&rdma->sc_xprt);
+
+	spin_lock(&rdma->sc_rw_ctxt_lock);
+	if (list_empty(&rdma->sc_rw_ctxts))
+		goto out_empty;
+
+	ctxt = list_first_entry(&rdma->sc_rw_ctxts,
+				struct svc_rdma_rw_ctxt, rw_list);
+	list_del_init(&ctxt->rw_list);
+	spin_unlock(&rdma->sc_rw_ctxt_lock);
+
+out:
+	ctxt->rw_dir = DMA_NONE;
+	return ctxt;
+
+out_empty:
+	spin_unlock(&rdma->sc_rw_ctxt_lock);
+
+	ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
+	if (!ctxt) {
+		svc_xprt_put(&rdma->sc_xprt);
+		return NULL;
+	}
+
+	ctxt->rw_rdma = rdma;
+	INIT_LIST_HEAD(&ctxt->rw_list);
+	sg_init_table(ctxt->rw_sg, ARRAY_SIZE(ctxt->rw_sg));
+	goto out;
+}
+
+static void svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt)
+{
+	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
+
+	if (ctxt->rw_dir != DMA_NONE)
+		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
+				    rdma->sc_port_num,
+				    ctxt->rw_sg, ctxt->rw_nents,
+				    ctxt->rw_dir);
+
+	spin_lock(&rdma->sc_rw_ctxt_lock);
+	list_add(&ctxt->rw_list, &rdma->sc_rw_ctxts);
+	spin_unlock(&rdma->sc_rw_ctxt_lock);
+
+	svc_xprt_put(&rdma->sc_xprt);
+}
+
+/**
+ * svc_rdma_destroy_rw_ctxts - Free R/W contexts
+ * @rdma: xprt about to be destroyed
+ *
+ */
+void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma)
+{
+	struct svc_rdma_rw_ctxt *ctxt;
+
+	while (!list_empty(&rdma->sc_rw_ctxts)) {
+		ctxt = list_first_entry(&rdma->sc_rw_ctxts,
+					struct svc_rdma_rw_ctxt, rw_list);
+		list_del(&ctxt->rw_list);
+		kfree(ctxt);
+	}
+}
+
+/**
+ * svc_rdma_wc_write_ctx - Handle completion of an RDMA Write ctx
+ * @cq: controlling Completion Queue
+ * @wc: Work Completion
+ *
+ * Invoked in soft IRQ context.
+ *
+ * Assumptions:
+ * - Write completion is not responsible for freeing pages under
+ *   I/O.
+ */
+static void svc_rdma_wc_write_ctx(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ib_cqe *cqe = wc->wr_cqe;
+	struct svc_rdma_rw_ctxt *ctxt =
+			container_of(cqe, struct svc_rdma_rw_ctxt, rw_cqe);
+	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
+
+	atomic_add(ctxt->rw_wrcount, &rdma->sc_sq_avail);
+	wake_up(&rdma->sc_send_wait);
+
+	if (wc->status != IB_WC_SUCCESS)
+		goto flush;
+
+out:
+	svc_rdma_put_rw_ctxt(ctxt);
+	return;
+
+flush:
+	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
+	if (wc->status != IB_WC_WR_FLUSH_ERR)
+		pr_err("svcrdma: write ctx: %s (%u/0x%x)\n",
+		       ib_wc_status_msg(wc->status),
+		       wc->status, wc->vendor_err);
+	goto out;
+}
+
+/**
+ * svc_rdma_wc_read_ctx - Handle completion of an RDMA Read ctx
+ * @cq: controlling Completion Queue
+ * @wc: Work Completion
+ *
+ * Invoked in soft IRQ context.
+ */
+static void svc_rdma_wc_read_ctx(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ib_cqe *cqe = wc->wr_cqe;
+	struct svc_rdma_rw_ctxt *ctxt =
+			container_of(cqe, struct svc_rdma_rw_ctxt, rw_cqe);
+	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
+	struct svc_rdma_op_ctxt *head;
+
+	atomic_add(ctxt->rw_wrcount, &rdma->sc_sq_avail);
+	wake_up(&rdma->sc_send_wait);
+
+	if (wc->status != IB_WC_SUCCESS)
+		goto flush;
+
+	head = ctxt->rw_readctxt;
+	if (!head)
+		goto out;
+
+	spin_lock(&rdma->sc_rq_dto_lock);
+	list_add_tail(&head->list, &rdma->sc_read_complete_q);
+	spin_unlock(&rdma->sc_rq_dto_lock);
+	set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
+	svc_xprt_enqueue(&rdma->sc_xprt);
+
+out:
+	svc_rdma_put_rw_ctxt(ctxt);
+	return;
+
+flush:
+	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
+	if (wc->status != IB_WC_WR_FLUSH_ERR)
+		pr_err("svcrdma: read ctx: %s (%u/0x%x)\n",
+		       ib_wc_status_msg(wc->status),
+		       wc->status, wc->vendor_err);
+	goto out;
+}
+
+/* This function sleeps when the transport's Send Queue is congested.
+ *
+ * Assumptions:
+ * - If ib_post_send() succeeds, only one completion is expected,
+ *   even if one or more WRs are flushed. This is true when posting
+ *   an rdma_rw_ctx or when posting a single signaled WR.
+ */
+static int svc_rdma_post_send(struct svcxprt_rdma *rdma,
+			      struct ib_send_wr *first_wr,
+			      int num_wrs)
+{
+	struct svc_xprt *xprt = &rdma->sc_xprt;
+	struct ib_send_wr *bad_wr;
+	int ret;
+
+	do {
+		if ((atomic_sub_return(num_wrs, &rdma->sc_sq_avail) > 0)) {
+			ret = ib_post_send(rdma->sc_qp, first_wr, &bad_wr);
+			if (ret)
+				break;
+			return 0;
+		}
+
+		atomic_inc(&rdma_stat_sq_starve);
+		atomic_add(num_wrs, &rdma->sc_sq_avail);
+		wait_event(rdma->sc_send_wait,
+			   atomic_read(&rdma->sc_sq_avail) > num_wrs);
+	} while (1);
+
+	pr_err("svcrdma: post_send rc=%d; SQ avail=%d/%u\n",
+	       ret, atomic_read(&rdma->sc_sq_avail), rdma->sc_sq_depth);
+	set_bit(XPT_CLOSE, &xprt->xpt_flags);
+
+	/* If even one was posted, there will be a completion. */
+	if (bad_wr != first_wr)
+		return 0;
+
+	atomic_add(num_wrs, &rdma->sc_sq_avail);
+	wake_up(&rdma->sc_send_wait);
+	return -ENOTCONN;
+}
+
+static int svc_rdma_send_write_ctx(struct svcxprt_rdma *rdma,
+				   struct svc_rdma_rw_ctxt *ctxt,
+				   u64 offset, u32 rkey)
+{
+	struct ib_send_wr *first_wr;
+	int ret;
+
+	ret = rdma_rw_ctx_init(&ctxt->rw_ctx,
+			       rdma->sc_qp, rdma->sc_port_num,
+			       ctxt->rw_sg, ctxt->rw_nents,
+			       0, offset, rkey, DMA_TO_DEVICE);
+	if (ret < 0)
+		goto out_init;
+
+	ctxt->rw_wrcount = ret;
+	ctxt->rw_dir = DMA_TO_DEVICE;
+	ctxt->rw_cqe.done = svc_rdma_wc_write_ctx;
+	first_wr = rdma_rw_ctx_wrs(&ctxt->rw_ctx,
+				   rdma->sc_qp, rdma->sc_port_num,
+				   &ctxt->rw_cqe, NULL);
+	atomic_add(ret, &rdma_stat_write);
+	return svc_rdma_post_send(rdma, first_wr, ret);
+
+out_init:
+	pr_err("svcrdma: rdma_rw_ctx_init failed: %d\n", ret);
+	return -EIO;
+}
+
+/* Common information for sending a Write chunk.
+ *  - Tracks progress of writing one chunk
+ *  - Stores arguments for the SGL constructor function
+ */
+struct svc_rdma_write_info {
+	struct svcxprt_rdma	*wi_rdma;
+
+	/* write state of this chunk */
+	unsigned int		wi_bytes_consumed;
+	unsigned int		wi_seg_off;
+	unsigned int		wi_seg_no;
+	unsigned int		wi_nsegs;
+	__be32			*wi_segs;
+
+	/* SGL constructor arguments */
+	struct xdr_buf		*wi_xdr;
+	unsigned char		*wi_base;
+	unsigned int		wi_next_off;
+};
+
+static void svc_rdma_init_write_info(struct svcxprt_rdma *rdma, __be32 *chunk,
+				     struct svc_rdma_write_info *info)
+{
+	info->wi_rdma = rdma;
+	info->wi_bytes_consumed = 0;
+	info->wi_seg_off = 0;
+	info->wi_seg_no = 0;
+	info->wi_nsegs = be32_to_cpup(chunk + 1);
+	info->wi_segs = chunk + 2;
+}
+
+/* Build and DMA-map an SGL that covers one kvec in an xdr_buf
+ */
+static void svc_rdma_vec_to_sg(struct svc_rdma_write_info *info,
+			       unsigned int len,
+			       struct svc_rdma_rw_ctxt *ctxt)
+{
+	sg_set_buf(&ctxt->rw_sg[0], info->wi_base, len);
+	info->wi_base += len;
+
+	ctxt->rw_nents = 1;
+}
+
+/* Build and DMA-map an SGL that covers the pagelist of an xdr_buf
+ */
+static void svc_rdma_pagelist_to_sg(struct svc_rdma_write_info *info,
+				    unsigned int remaining,
+				    struct svc_rdma_rw_ctxt *ctxt)
+{
+	unsigned int sge_no, sge_bytes, page_off, page_no;
+	struct scatterlist *sg = ctxt->rw_sg;
+	struct xdr_buf *xdr = info->wi_xdr;
+
+	page_no = (info->wi_next_off + xdr->page_base) >> PAGE_SHIFT;
+	page_off = (info->wi_next_off + xdr->page_base) & ~PAGE_MASK;
+	info->wi_next_off += remaining;
+
+	sge_no = 0;
+	do {
+		sge_bytes = min_t(unsigned int, remaining,
+				  PAGE_SIZE - page_off);
+
+		sg_set_page(&sg[sge_no++], xdr->pages[page_no],
+			    sge_bytes, page_off);
+
+		remaining -= sge_bytes;
+		page_no++;
+		page_off = 0;
+	} while (remaining);
+
+	ctxt->rw_nents = sge_no;
+}
+
+/* Post Write WRs to send a portion of an xdr_buf containing
+ * an RPC Reply.
+ */
+static int
+svc_rdma_send_writes(struct svc_rdma_write_info *info,
+		     void (*constructor)(struct svc_rdma_write_info *info,
+					 unsigned int len,
+					 struct svc_rdma_rw_ctxt *ctxt),
+		     unsigned int total)
+{
+	struct svcxprt_rdma *rdma = info->wi_rdma;
+	unsigned int remaining, seg_no, seg_off;
+	struct svc_rdma_rw_ctxt *ctxt;
+	__be32 *seg;
+	int ret;
+
+	if (total == 0)
+		return 0;
+
+	remaining = total;
+	seg_no = info->wi_seg_no;
+	seg_off = info->wi_seg_off;
+	seg = info->wi_segs + seg_no * rpcrdma_segment_maxsz;
+	do {
+		unsigned int write_len;
+		u32 rs_length, rs_handle;
+		u64 rs_offset;
+
+		if (seg_no >= info->wi_nsegs)
+			goto out_overflow;
+
+		ctxt = svc_rdma_get_rw_ctxt(rdma);
+		if (!ctxt)
+			goto out_noctx;
+
+		rs_handle = be32_to_cpu(*seg++);
+		rs_length = be32_to_cpu(*seg++);
+		seg = xdr_decode_hyper(seg, &rs_offset);
+
+		write_len = min(remaining, rs_length - seg_off);
+		constructor(info, write_len, ctxt);
+		ret = svc_rdma_send_write_ctx(rdma, ctxt, rs_offset + seg_off,
+					      rs_handle);
+		if (ret < 0)
+			goto out_senderr;
+
+		if (write_len == rs_length - seg_off) {
+			seg_no++;
+			seg_off = 0;
+		} else {
+			seg_off += write_len;
+		}
+		remaining -= write_len;
+	} while (remaining);
+
+	info->wi_bytes_consumed += total;
+	info->wi_seg_no = seg_no;
+	info->wi_seg_off = seg_off;
+	return 0;
+
+out_overflow:
+	dprintk("svcrdma: inadequate space in Write chunk (%u)\n",
+		info->wi_nsegs);
+	return -E2BIG;
+
+out_noctx:
+	dprintk("svcrdma: no R/W ctxs available\n");
+	return -ENOMEM;
+
+out_senderr:
+	svc_rdma_put_rw_ctxt(ctxt);
+	pr_err("svcrdma: failed to write pagelist (%d)\n", ret);
+	return ret;
+}
+
+/* Send one of an xdr_buf's kvecs by itself. To send a Reply
+ * chunk, the whole RPC Reply is written back to the client.
+ * This function writes either the head or tail of the xdr_buf
+ * containing the Reply.
+ */
+static int svc_rdma_send_xdr_kvec(struct svc_rdma_write_info *info,
+				  struct kvec *vec)
+{
+	info->wi_base = vec->iov_base;
+
+	return svc_rdma_send_writes(info, svc_rdma_vec_to_sg,
+				    vec->iov_len);
+}
+
+/* Send an xdr_buf's page list by itself. A Write chunk is
+ * just the page list. A Reply chunk is the head, page list,
+ * and tail. This function is shared between the two types
+ * of chunk.
+ */
+static int svc_rdma_send_xdr_pagelist(struct svc_rdma_write_info *info,
+				      struct xdr_buf *xdr)
+{
+	info->wi_xdr = xdr;
+	info->wi_next_off = 0;
+
+	return svc_rdma_send_writes(info, svc_rdma_pagelist_to_sg,
+				    xdr->page_len);
+}
+
+/**
+ * svc_rdma_send_write_list - Write all chunks in the Write list
+ * @rdma: controlling RDMA transport
+ * @wr_ch: Write list provided by client
+ * @rdma_resp: buffer containing transport header under construction
+ * @xdr: xdr_buf carrying an RPC Reply
+ *
+ * Returns:
+ *	%0 if all needed RDMA Writes were posted successfully,
+ *	%-E2BIG if the payload was larger than the Write chunk,
+ *	%-ENOMEM if rdma_rw context pool was exhausted,
+ *	%-ENOTCONN if posting failed (connection is lost),
+ *	%-EIO if rdma_rw initialization failed (DMA mapping, etc).
+ *
+ * Assumptions:
+ *  - Only one Write chunk, and it's the xdr_buf's entire pagelist
+ */
+int svc_rdma_send_write_list(struct svcxprt_rdma *rdma,
+			     __be32 *wr_ch, __be32 *rdma_resp,
+			     struct xdr_buf *xdr)
+{
+	struct svc_rdma_write_info info;
+	int ret;
+
+	svc_rdma_init_write_info(rdma, wr_ch, &info);
+
+	ret = svc_rdma_send_xdr_pagelist(&info, xdr);
+
+	svc_rdma_xdr_encode_write_list(rdma_resp, wr_ch,
+				       info.wi_bytes_consumed);
+	return ret;
+}
+
+/**
+ * svc_rdma_send_reply_chunk - Write all segments in the Reply chunk
+ * @rdma: controlling RDMA transport
+ * @wr_lst: Write list provided by client
+ * @rp_ch: Reply chunk provided by client
+ * @rdma_resp: buffer containing transport header for Reply
+ * @xdr: xdr_buf carrying an RPC Reply
+ *
+ * Returns:
+ *	%0 if all needed RDMA Writes were posted successfully,
+ *	%-E2BIG if the payload was larger than the Reply chunk,
+ *	%-ENOMEM if rdma_rw context pool was exhausted,
+ *	%-ENOTCONN if posting failed (connection is lost),
+ *	%-EIO if rdma_rw initialization failed (DMA mapping, etc).
+ *
+ * Assumptions:
+ *  - The Reply chunk always carries the whole xdr_buf
+ */
+int svc_rdma_send_reply_chunk(struct svcxprt_rdma *rdma, __be32 *wr_lst,
+			      __be32 *rp_ch, __be32 *rdma_resp,
+			      struct xdr_buf *xdr)
+{
+	struct svc_rdma_write_info info;
+	int ret;
+
+	svc_rdma_init_write_info(rdma, rp_ch, &info);
+
+	ret = svc_rdma_send_xdr_kvec(&info, &xdr->head[0]);
+	if (ret < 0)
+		goto out;
+
+	/* When Write list entries are present, server has already
+	 * transmitted the pagelist payload via a Write chunk. Thus
+	 * we can skip the pagelist here.
+	 */
+	if (!wr_lst) {
+		ret = svc_rdma_send_xdr_pagelist(&info, xdr);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = svc_rdma_send_xdr_kvec(&info, &xdr->tail[0]);
+
+out:
+	svc_rdma_xdr_encode_reply_chunk(rdma_resp, rp_ch,
+					info.wi_bytes_consumed);
+	return ret;
+}
+
+/* Pull one Read chunk (segment) from the client.
+ *
+ * Returns zero if one or more RDMA Reads have been posted.  Otherwise,
+ * returns a negative errno if there is a Read list present but RDMA
+ * Reads could not be posted.
+ *
+ * For incoming Reads, @rqstp provides a page list containing sink pages.
+ * As pages are prepared for I/O, they are transferred to @head. After
+ * all Reads in the list have completed, svc_rdma_recvfrom builds an
+ * xdr_buf from the page list in @head.
+ *
+ * On entry, *page_no and *page_offset point into the rqstp's page list.
+ * On return, *page_no and *page_offset are updated to point to the next
+ * position in the page list.
+ */
+static int svc_rdma_recv_read_segment(struct svcxprt_rdma *rdma,
+				      struct svc_rdma_op_ctxt *head,
+				      struct svc_rqst *rqstp,
+				      unsigned int *page_no,
+				      unsigned int *page_offset,
+				      u32 rkey, u32 len, u64 offset,
+				      bool last)
+{
+	struct svc_rdma_rw_ctxt *ctxt;
+	struct ib_send_wr *first_wr;
+	unsigned int pg_no, seg_no;
+	u32 pg_off;
+	int ret;
+
+	dprintk("svcrdma: reading segment %u@0x%016llx:0x%08x\n",
+		len, offset, rkey);
+
+	ctxt = svc_rdma_get_rw_ctxt(rdma);
+	if (!ctxt)
+		return -ENOMEM;
+
+	pg_off = *page_offset;
+	pg_no = *page_no;
+	ctxt->rw_nents = PAGE_ALIGN(*page_offset + len) >> PAGE_SHIFT;
+	for (seg_no = 0; seg_no < ctxt->rw_nents; seg_no++) {
+		unsigned int seg_len = min_t(unsigned int, len,
+					     PAGE_SIZE - pg_off);
+
+		head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
+		head->arg.page_len += seg_len;
+		head->arg.len += seg_len;
+		if (!pg_off)
+			head->count++;
+
+		sg_set_page(&ctxt->rw_sg[seg_no], rqstp->rq_arg.pages[pg_no],
+			    seg_len, pg_off);
+
+		rqstp->rq_respages = &rqstp->rq_arg.pages[pg_no + 1];
+		rqstp->rq_next_page = rqstp->rq_respages + 1;
+
+		pg_off += seg_len;
+		if (pg_off == PAGE_SIZE) {
+			pg_off = 0;
+			pg_no++;
+		}
+		len -= seg_len;
+	}
+
+	ret = rdma_rw_ctx_init(&ctxt->rw_ctx,
+			       rdma->sc_qp, rdma->sc_port_num,
+			       ctxt->rw_sg, ctxt->rw_nents,
+			       0, offset, rkey, DMA_FROM_DEVICE);
+	if (ret < 0)
+		goto out_init;
+
+	ctxt->rw_wrcount = ret;
+	ctxt->rw_dir = DMA_FROM_DEVICE;
+	ctxt->rw_cqe.done = svc_rdma_wc_read_ctx;
+	ctxt->rw_readctxt = last ? head : NULL;
+	first_wr = rdma_rw_ctx_wrs(&ctxt->rw_ctx,
+				   rdma->sc_qp, rdma->sc_port_num,
+				   &ctxt->rw_cqe, NULL);
+	atomic_add(ret, &rdma_stat_read);
+	ret = svc_rdma_post_send(rdma, first_wr, ret);
+	if (ret)
+		goto out_send;
+
+	*page_no = pg_no;
+	*page_offset = pg_off;
+	return 0;
+
+out_init:
+	pr_err("svcrdma: rdma_rw_ctx_init failed: %d\n", ret);
+	ret = -EIO;
+out_send:
+	svc_rdma_put_rw_ctxt(ctxt);
+	return ret;
+}
+
+/* If there was additional inline content, append it to the end of arg.pages.
+ * Tail copy has to be done after the reader function has determined how many
+ * pages were consumed for RDMA Read.
+ */
+static int svc_rdma_copy_tail(struct svc_rqst *rqstp,
+			      struct svc_rdma_op_ctxt *head, u32 position,
+			      unsigned int page_offset, unsigned int page_no)
+{
+	char *srcp, *destp;
+	u32 byte_count;
+
+	srcp = head->arg.head[0].iov_base + position;
+	byte_count = head->arg.head[0].iov_len - position;
+	if (byte_count > PAGE_SIZE) {
+		dprintk("svcrdma: large tail unsupported\n");
+		return 0;
+	}
+
+	/* Fit as much of the tail on the current page as possible */
+	if (page_offset != PAGE_SIZE) {
+		destp = page_address(rqstp->rq_arg.pages[page_no]);
+		destp += page_offset;
+		while (byte_count--) {
+			*destp++ = *srcp++;
+			page_offset++;
+			if (page_offset == PAGE_SIZE && byte_count)
+				goto more;
+		}
+		goto done;
+	}
+
+more:
+	/* Fit the rest on the next page */
+	page_no++;
+	destp = page_address(rqstp->rq_arg.pages[page_no]);
+	while (byte_count--)
+		*destp++ = *srcp++;
+
+	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no + 1];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
+
+done:
+	byte_count = head->arg.head[0].iov_len - position;
+	head->arg.page_len += byte_count;
+	head->arg.len += byte_count;
+	head->arg.buflen += byte_count;
+	return 1;
+}
+
+static unsigned int svc_rdma_read_chunk_count(__be32 *p)
+{
+	unsigned int nsegs;
+
+	for (nsegs = 0; *p != xdr_zero; p += rpcrdma_readchunk_maxsz)
+		nsegs++;
+	return nsegs;
+}
+
+/**
+ * svc_rdma_recv_read_list - Pull read chunks from the client
+ * @rdma: controlling RDMA transport
+ * @ch: pointer to Read list in the incoming transport header
+ * @head: pages under I/O collect here
+ * @rqstp: set of pages to use as Read sink buffers
+ *
+ * Returns:
+ *	%0 if there is no Read list,
+ *	%1 if all needed RDMA Reads were posted successfully,
+ *	%-EINVAL if the Read chunk data is too large,
+ *	%-ENOMEM if rdma_rw context pool was exhausted,
+ *	%-ENOTCONN if posting failed (connection is lost),
+ *	%-EIO if rdma_rw initialization failed (DMA mapping, etc).
+ *
+ * Assumptions:
+ * - Clients can send multiple Read chunks in a Read list, but
+ *   the chunks must all have the same value in their Position
+ *   field.
+ */
+int svc_rdma_recv_read_list(struct svcxprt_rdma *rdma, __be32 *ch,
+			    struct svc_rdma_op_ctxt *head,
+			    struct svc_rqst *rqstp)
+{
+	unsigned int page_no, page_offset;
+	u32 position;
+	__be32 *p;
+	bool last;
+	int ret;
+
+	p = ch;
+
+	/* Sanity check */
+	if (svc_rdma_read_chunk_count(p++) > RPCSVC_MAXPAGES)
+		return -EINVAL;
+
+	/* "head" keeps all the pages that comprise the request.
+	 */
+	head->arg.head[0] = rqstp->rq_arg.head[0];
+	head->arg.tail[0] = rqstp->rq_arg.tail[0];
+	head->hdr_count = head->count;
+	head->arg.page_base = 0;
+	head->arg.page_len = 0;
+	head->arg.len = rqstp->rq_arg.len;
+	head->arg.buflen = rqstp->rq_arg.buflen;
+
+	/* RDMA_NOMSG: RDMA Read data should land just after Receive data.
+	 */
+	position = be32_to_cpu(*p++);
+	if (position == 0) {
+		head->arg.pages = &head->pages[0];
+		page_offset = head->byte_len;
+	} else {
+		head->arg.pages = &head->pages[head->count];
+		page_offset = 0;
+	}
+
+	/* This server implementation supports only one Read chunk (of one
+	 * or more segments) per message. The list walk is terminated once
+	 * "position" changes.
+	 */
+	page_no = 0;
+	last = false;
+	while (!last) {
+		u32 rs_handle, rs_length;
+		u64 rs_offset;
+
+		rs_handle = be32_to_cpu(*p++);
+		rs_length = be32_to_cpu(*p++);
+		p = xdr_decode_hyper(p, &rs_offset);
+
+		/* Examine next read segment */
+		if (*p == xdr_zero ||
+		    ((*p != xdr_zero) && (be32_to_cpu(*(p + 1)) != position)))
+			last = true;
+
+		ret = svc_rdma_recv_read_segment(rdma, head, rqstp,
+						 &page_no, &page_offset,
+						 rs_handle, rs_length,
+						 rs_offset, last);
+		if (ret < 0)
+			goto out;
+
+		p += 2;
+	}
+
+	/* Read list may need XDR round-up (see RFC 5666, s. 3.7) */
+	if (page_offset & 3) {
+		u32 pad = 4 - (page_offset & 3);
+
+		head->arg.tail[0].iov_len += pad;
+		head->arg.len += pad;
+		head->arg.buflen += pad;
+		page_offset += pad;
+	}
+
+	ret = 1;
+	if (position && position < head->arg.head[0].iov_len)
+		ret = svc_rdma_copy_tail(rqstp, head, position,
+					 page_offset, page_no);
+	head->arg.head[0].iov_len = position;
+	head->position = position;
+
+ out:
+	/* Detach arg pages. svc_recv will replenish them */
+	for (page_no = 0;
+	     &rqstp->rq_pages[page_no] < rqstp->rq_respages; page_no++)
+		rqstp->rq_pages[page_no] = NULL;
+	return ret;
+}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index b4028bc3..e4b8800 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -406,7 +406,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
 		}
 	}
 	/* Update the req with the number of chunks actually used */
-	svc_rdma_xdr_encode_write_list(rdma_resp, chunk_no);
+	svc_rdma_old_encode_write_list(rdma_resp, chunk_no);
 
 	return rqstp->rq_res.page_len;
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b84cd53..90fabad 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -560,6 +560,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 	INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
 	INIT_LIST_HEAD(&cma_xprt->sc_frmr_q);
 	INIT_LIST_HEAD(&cma_xprt->sc_ctxts);
+	INIT_LIST_HEAD(&cma_xprt->sc_rw_ctxts);
 	INIT_LIST_HEAD(&cma_xprt->sc_maps);
 	init_waitqueue_head(&cma_xprt->sc_send_wait);
 
@@ -567,6 +568,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 	spin_lock_init(&cma_xprt->sc_rq_dto_lock);
 	spin_lock_init(&cma_xprt->sc_frmr_q_lock);
 	spin_lock_init(&cma_xprt->sc_ctxt_lock);
+	spin_lock_init(&cma_xprt->sc_rw_ctxt_lock);
 	spin_lock_init(&cma_xprt->sc_map_lock);
 
 	/*
@@ -998,6 +1000,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		newxprt, newxprt->sc_cm_id);
 
 	dev = newxprt->sc_cm_id->device;
+	newxprt->sc_port_num = newxprt->sc_cm_id->port_num;
 
 	/* Qualify the transport resource defaults with the
 	 * capabilities of this particular device */
@@ -1247,6 +1250,7 @@ static void __svc_rdma_free(struct work_struct *work)
 	}
 
 	rdma_dealloc_frmr_q(rdma);
+	svc_rdma_destroy_rw_ctxts(rdma);
 	svc_rdma_destroy_ctxts(rdma);
 	svc_rdma_destroy_maps(rdma);
 


* [PATCH v1 05/14] svcrdma: Introduce local rdma_rw API helpers
@ 2017-03-16 15:53     ` Chuck Lever
  0 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:53 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The plan is to replace the local bespoke code that constructs and
posts RDMA Read and Write Work Requests with calls to the rdma_rw
API. The rdma_rw API is shared with other RDMA-enabled ULPs, and it
manages the gory details of buffer registration and posting Work
Requests.

Some design notes:

 o svc_xprt reference counting is modified, since one rdma_rw_ctx
   generates one completion, no matter how many Write WRs are
   posted. To accommodate the new reference counting scheme, a new
   version of svc_rdma_send() is introduced.

 o The structure of RPC-over-RDMA transport headers is flexible,
   allowing multiple segments per Reply with arbitrary alignment.
   Thus I did not take the further step of chaining Write WRs with
   the Send WR containing the RPC Reply message. The Write and Send
   WRs continue to be built by separate pieces of code.

 o The current code builds the transport header as it is
   constructing Write WRs. I've replaced that with marshaling of
   transport header data items in a separate step. This is because
   the exact structure of client-provided segments may not align
   with the components of the server's reply xdr_buf, or the pages
   in the page list. Thus parts of each client-provided segment may
   be written at different points in the send path.

 o Since the Write list and Reply chunk marshaling code is being
   replaced, I took the opportunity to replace some of the C
   structure-based XDR encoding code with more portable code that
   instead uses pointer arithmetic.
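
To illustrate the last two notes, here is a minimal user-space sketch of
pointer-arithmetic marshaling for one chunk. It is loosely modeled on
xdr_encode_write_chunk() in this patch but is not the kernel code:
encode_chunk(), example_second_seg_len(), SEG_WORDS, and the sample
handles and offsets are hypothetical names and values chosen only for
the example, and htonl()/ntohl() stand in for the kernel's byte-order
helpers.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Assumed layout of one segment: handle, length, 64-bit offset
 * carried as two 32-bit words.
 */
#define SEG_WORDS 4

/* Copy one chunk from the Call header (src) to the Reply header (dst),
 * trimming each segment's length field to the number of payload bytes
 * actually written into it. Returns the number of segments.
 */
static unsigned int encode_chunk(uint32_t *dst, const uint32_t *src,
				 unsigned int remaining)
{
	unsigned int i, nsegs;

	*dst++ = *src++;		/* list discriminator */
	nsegs = ntohl(*src);
	*dst++ = *src++;		/* segment count */

	for (i = 0; i < nsegs; i++) {
		uint32_t seg_len;

		*dst++ = *src++;	/* handle */

		seg_len = ntohl(*src);
		if (remaining >= seg_len) {
			*dst = *src;	/* segment fully consumed */
			remaining -= seg_len;
		} else {
			*dst = htonl(remaining); /* partly filled or unused */
			remaining = 0;
		}
		dst++;
		src++;

		*dst++ = *src++;	/* offset, first word */
		*dst++ = *src++;	/* offset, second word */
	}
	return nsegs;
}

/* Hypothetical example: two 4096-byte segments, 5000 bytes written. */
static unsigned int example_second_seg_len(void)
{
	const uint32_t call[2 + 2 * SEG_WORDS] = {
		htonl(1), htonl(2),
		htonl(0x1234), htonl(4096), htonl(0), htonl(0x10000),
		htonl(0x5678), htonl(4096), htonl(0), htonl(0x20000),
	};
	uint32_t reply[2 + 2 * SEG_WORDS];

	encode_chunk(reply, call, 5000);
	return ntohl(reply[2 + SEG_WORDS + 1]);
}
```

In this example the first segment keeps its full length of 4096 bytes
and the second is trimmed to 904, so example_second_seg_len() returns
904. Because the header is walked as a flat array of 32-bit words, no C
structure has to match the on-the-wire layout.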

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h          |   22 +
 net/sunrpc/xprtrdma/Makefile             |    2 
 net/sunrpc/xprtrdma/svc_rdma_marshal.c   |  114 ++++
 net/sunrpc/xprtrdma/svc_rdma_rw.c        |  785 ++++++++++++++++++++++++++++++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    |    2 
 net/sunrpc/xprtrdma/svc_rdma_transport.c |    4 
 6 files changed, 925 insertions(+), 4 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/svc_rdma_rw.c

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f066349..5fc9f6e 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -145,12 +145,15 @@ struct svcxprt_rdma {
 	u32		     sc_max_requests;	/* Max requests */
 	u32		     sc_max_bc_requests;/* Backward credits */
 	int                  sc_max_req_size;	/* Size of each RQ WR buf */
+	u8		     sc_port_num;
 
 	struct ib_pd         *sc_pd;
 
 	spinlock_t	     sc_ctxt_lock;
 	struct list_head     sc_ctxts;
 	int		     sc_ctxt_used;
+	spinlock_t	     sc_rw_ctxt_lock;
+	struct list_head     sc_rw_ctxts;
 	spinlock_t	     sc_map_lock;
 	struct list_head     sc_maps;
 
@@ -209,10 +212,15 @@ extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
 extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
 				     struct rpcrdma_msg *,
 				     enum rpcrdma_errcode, __be32 *);
-extern void svc_rdma_xdr_encode_write_list(struct rpcrdma_msg *, int);
+extern void svc_rdma_old_encode_write_list(struct rpcrdma_msg *rmsgp,
+					   int chunks);
 extern void svc_rdma_xdr_encode_reply_array(struct rpcrdma_write_array *, int);
 extern void svc_rdma_xdr_encode_array_chunk(struct rpcrdma_write_array *, int,
 					    __be32, __be64, u32);
+extern void svc_rdma_xdr_encode_write_list(__be32 *rdma_resp, __be32 *wr_ch,
+					   unsigned int consumed);
+extern void svc_rdma_xdr_encode_reply_chunk(__be32 *rdma_resp, __be32 *rp_ch,
+					    unsigned int consumed);
 extern unsigned int svc_rdma_xdr_get_reply_hdr_len(__be32 *rdma_resp);
 
 /* svc_rdma_recvfrom.c */
@@ -224,6 +232,18 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
 				struct svc_rdma_op_ctxt *, int *, u32 *,
 				u32, u32, u64, bool);
 
+/* svc_rdma_rw.c */
+extern void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma);
+extern int svc_rdma_recv_read_list(struct svcxprt_rdma *rdma, __be32 *ch,
+				   struct svc_rdma_op_ctxt *head,
+				   struct svc_rqst *rqstp);
+extern int svc_rdma_send_write_list(struct svcxprt_rdma *rdma,
+				    __be32 *wr_ch, __be32 *rdma_resp,
+				    struct xdr_buf *xdr);
+extern int svc_rdma_send_reply_chunk(struct svcxprt_rdma *rdma,
+				     __be32 *wr_lst, __be32 *rp_ch,
+				     __be32 *rdma_resp, struct xdr_buf *xdr);
+
 /* svc_rdma_sendto.c */
 extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
 			    struct svc_rdma_req_map *, bool);
diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index ef19fa4..c1ae814 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -4,5 +4,5 @@ rpcrdma-y := transport.o rpc_rdma.o verbs.o \
 	fmr_ops.o frwr_ops.o \
 	svc_rdma.o svc_rdma_backchannel.o svc_rdma_transport.o \
 	svc_rdma_marshal.o svc_rdma_sendto.o svc_rdma_recvfrom.o \
-	module.o
+	svc_rdma_rw.o module.o
 rpcrdma-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel.o
diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
index 1c4aabf..bf3ca7e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
@@ -217,7 +217,7 @@ unsigned int svc_rdma_xdr_get_reply_hdr_len(__be32 *rdma_resp)
 	return (unsigned long)p - (unsigned long)rdma_resp;
 }
 
-void svc_rdma_xdr_encode_write_list(struct rpcrdma_msg *rmsgp, int chunks)
+void svc_rdma_old_encode_write_list(struct rpcrdma_msg *rmsgp, int chunks)
 {
 	struct rpcrdma_write_array *ary;
 
@@ -255,3 +255,115 @@ void svc_rdma_xdr_encode_array_chunk(struct rpcrdma_write_array *ary,
 	seg->rs_offset = rs_offset;
 	seg->rs_length = cpu_to_be32(write_len);
 }
+
+/* One Write chunk is copied from Call transport header to Reply
+ * transport header. Each segment's length field is updated to
+ * reflect number of bytes consumed in the segment.
+ *
+ * Returns number of segments in this chunk.
+ */
+static unsigned int xdr_encode_write_chunk(__be32 *dst, __be32 *src,
+					   unsigned int remaining)
+{
+	unsigned int i, nsegs;
+	u32 seg_len;
+
+	/* Write list discriminator */
+	*dst++ = *src++;
+
+	/* number of segments in this chunk */
+	nsegs = be32_to_cpup(src);
+	*dst++ = *src++;
+
+	for (i = nsegs; i; i--) {
+		/* segment's RDMA handle */
+		*dst++ = *src++;
+
+		/* bytes returned in this segment */
+		seg_len = be32_to_cpu(*src);
+		if (remaining >= seg_len) {
+			/* entire segment was consumed */
+			*dst = *src;
+			remaining -= seg_len;
+		} else {
+			/* segment only partly filled */
+			*dst = cpu_to_be32(remaining);
+			remaining = 0;
+		}
+		dst++; src++;
+
+		/* segment's RDMA offset */
+		*dst++ = *src++;
+		*dst++ = *src++;
+	}
+
+	return nsegs;
+}
+
+/**
+ * svc_rdma_xdr_encode_write_list - Encode Reply's Write list
+ * @rdma_resp: Reply's transport header
+ * @wr_ch: Write list in Call's transport header
+ * @consumed: total Write chunk bytes consumed in Reply
+ *
+ * The client provided a Write list in the Call message. Fill in
+ * the segments in the first Write chunk in the Reply's transport
+ * header with the number of bytes consumed in each segment.
+ * Remaining chunks are returned unused.
+ *
+ * Assumptions:
+ *  - Server can consume only one Write chunk.
+ */
+void svc_rdma_xdr_encode_write_list(__be32 *rdma_resp, __be32 *wr_ch,
+				    unsigned int consumed)
+{
+	unsigned int nsegs;
+	__be32 *p, *q;
+
+	/* RPC-over-RDMA V1 replies never have a Read list. */
+	p = rdma_resp + rpcrdma_fixed_maxsz + 1;
+
+	q = wr_ch;
+	while (*q != xdr_zero) {
+		nsegs = xdr_encode_write_chunk(p, q, consumed);
+		q += 2 + nsegs * rpcrdma_segment_maxsz;
+		p += 2 + nsegs * rpcrdma_segment_maxsz;
+		consumed = 0;
+	}
+
+	/* Terminate Write list */
+	*p++ = xdr_zero;
+
+	/* Reply chunk discriminator; may be replaced later */
+	*p = xdr_zero;
+}
+
+/**
+ * svc_rdma_xdr_encode_reply_chunk - Encode Reply's Reply chunk
+ * @rdma_resp: Reply's transport header
+ * @rp_ch: Reply chunk in Call's transport header
+ * @consumed: total Reply chunk bytes consumed in Reply
+ *
+ * The client provided a Reply chunk in the Call message. Fill in
+ * the segments in the Reply chunk in the Reply message with the
+ * number of bytes consumed in each segment.
+ *
+ * Assumptions:
+ * - Reply is smaller than or equal in size to the Reply chunk
+ */
+void svc_rdma_xdr_encode_reply_chunk(__be32 *rdma_resp, __be32 *rp_ch,
+				     unsigned int consumed)
+{
+	__be32 *p;
+
+	/* Find the Reply chunk in the Reply's xprt header.
+	 * RPC-over-RDMA V1 replies never have a Read list.
+	 */
+	p = rdma_resp + rpcrdma_fixed_maxsz + 1;
+
+	/* Skip past Write list */
+	while (*p++ != xdr_zero)
+		p += 1 + be32_to_cpup(p) * rpcrdma_segment_maxsz;
+
+	xdr_encode_write_chunk(p, rp_ch, consumed);
+}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
new file mode 100644
index 0000000..1e76227
--- /dev/null
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -0,0 +1,785 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
+ */
+
+#include <linux/sunrpc/rpc_rdma.h>
+#include <linux/sunrpc/svc_rdma.h>
+#include <linux/sunrpc/debug.h>
+
+#include <rdma/rw.h>
+
+#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
+
+/* Each R/W context contains state for one chain of RDMA Read or
+ * Write Work Requests (one RDMA segment to be read from or written
+ * back to the client).
+ *
+ * Each WR chain handles a single contiguous server-side buffer,
+ * because some registration modes (e.g. FRWR) do not support a
+ * discontiguous scatterlist.
+ *
+ * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
+ * from a client may contain a unique R_key, so each WR chain moves
+ * one segment (or less) at a time.
+ *
+ * The scatterlist makes this data structure just over 8KB in size
+ * on 4KB-page platforms. As the size of this structure increases
+ * past one page, it becomes more likely that allocating one of these
+ * will fail. Therefore, these contexts are created on demand, but
+ * cached and reused until the controlling svcxprt_rdma is destroyed.
+ */
+struct svc_rdma_rw_ctxt {
+	struct list_head	rw_list;
+	struct ib_cqe		rw_cqe;
+	struct svcxprt_rdma	*rw_rdma;
+	int			rw_nents;
+	int			rw_wrcount;
+	enum dma_data_direction	rw_dir;
+	struct svc_rdma_op_ctxt	*rw_readctxt;
+	struct rdma_rw_ctx	rw_ctx;
+	struct scatterlist	rw_sg[RPCSVC_MAXPAGES];
+};
+
+static struct svc_rdma_rw_ctxt *
+svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma)
+{
+	struct svc_rdma_rw_ctxt *ctxt;
+
+	svc_xprt_get(&rdma->sc_xprt);
+
+	spin_lock(&rdma->sc_rw_ctxt_lock);
+	if (list_empty(&rdma->sc_rw_ctxts))
+		goto out_empty;
+
+	ctxt = list_first_entry(&rdma->sc_rw_ctxts,
+				struct svc_rdma_rw_ctxt, rw_list);
+	list_del_init(&ctxt->rw_list);
+	spin_unlock(&rdma->sc_rw_ctxt_lock);
+
+out:
+	ctxt->rw_dir = DMA_NONE;
+	return ctxt;
+
+out_empty:
+	spin_unlock(&rdma->sc_rw_ctxt_lock);
+
+	ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
+	if (!ctxt) {
+		svc_xprt_put(&rdma->sc_xprt);
+		return NULL;
+	}
+
+	ctxt->rw_rdma = rdma;
+	INIT_LIST_HEAD(&ctxt->rw_list);
+	sg_init_table(ctxt->rw_sg, ARRAY_SIZE(ctxt->rw_sg));
+	goto out;
+}
+
+static void svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt)
+{
+	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
+
+	if (ctxt->rw_dir != DMA_NONE)
+		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
+				    rdma->sc_port_num,
+				    ctxt->rw_sg, ctxt->rw_nents,
+				    ctxt->rw_dir);
+
+	spin_lock(&rdma->sc_rw_ctxt_lock);
+	list_add(&ctxt->rw_list, &rdma->sc_rw_ctxts);
+	spin_unlock(&rdma->sc_rw_ctxt_lock);
+
+	svc_xprt_put(&rdma->sc_xprt);
+}
+
+/**
+ * svc_rdma_destroy_rw_ctxts - Free cached R/W contexts
+ * @rdma: xprt about to be destroyed
+ *
+ */
+void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma)
+{
+	struct svc_rdma_rw_ctxt *ctxt;
+
+	while (!list_empty(&rdma->sc_rw_ctxts)) {
+		ctxt = list_first_entry(&rdma->sc_rw_ctxts,
+					struct svc_rdma_rw_ctxt, rw_list);
+		list_del(&ctxt->rw_list);
+		kfree(ctxt);
+	}
+}
+
+/**
+ * svc_rdma_wc_write_ctx - Handle completion of an RDMA Write ctx
+ * @cq: controlling Completion Queue
+ * @wc: Work Completion
+ *
+ * Invoked in soft IRQ context.
+ *
+ * Assumptions:
+ * - Write completion is not responsible for freeing pages under
+ *   I/O.
+ */
+static void svc_rdma_wc_write_ctx(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ib_cqe *cqe = wc->wr_cqe;
+	struct svc_rdma_rw_ctxt *ctxt =
+			container_of(cqe, struct svc_rdma_rw_ctxt, rw_cqe);
+	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
+
+	atomic_add(ctxt->rw_wrcount, &rdma->sc_sq_avail);
+	wake_up(&rdma->sc_send_wait);
+
+	if (wc->status != IB_WC_SUCCESS)
+		goto flush;
+
+out:
+	svc_rdma_put_rw_ctxt(ctxt);
+	return;
+
+flush:
+	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
+	if (wc->status != IB_WC_WR_FLUSH_ERR)
+		pr_err("svcrdma: write ctx: %s (%u/0x%x)\n",
+		       ib_wc_status_msg(wc->status),
+		       wc->status, wc->vendor_err);
+	goto out;
+}
+
+/**
+ * svc_rdma_wc_read_ctx - Handle completion of an RDMA Read ctx
+ * @cq: controlling Completion Queue
+ * @wc: Work Completion
+ *
+ * Invoked in soft IRQ context.
+ */
+static void svc_rdma_wc_read_ctx(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ib_cqe *cqe = wc->wr_cqe;
+	struct svc_rdma_rw_ctxt *ctxt =
+			container_of(cqe, struct svc_rdma_rw_ctxt, rw_cqe);
+	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
+	struct svc_rdma_op_ctxt *head;
+
+	atomic_add(ctxt->rw_wrcount, &rdma->sc_sq_avail);
+	wake_up(&rdma->sc_send_wait);
+
+	if (wc->status != IB_WC_SUCCESS)
+		goto flush;
+
+	head = ctxt->rw_readctxt;
+	if (!head)
+		goto out;
+
+	spin_lock(&rdma->sc_rq_dto_lock);
+	list_add_tail(&head->list, &rdma->sc_read_complete_q);
+	spin_unlock(&rdma->sc_rq_dto_lock);
+	set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
+	svc_xprt_enqueue(&rdma->sc_xprt);
+
+out:
+	svc_rdma_put_rw_ctxt(ctxt);
+	return;
+
+flush:
+	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
+	if (wc->status != IB_WC_WR_FLUSH_ERR)
+		pr_err("svcrdma: read ctx: %s (%u/0x%x)\n",
+		       ib_wc_status_msg(wc->status),
+		       wc->status, wc->vendor_err);
+	goto out;
+}
+
+/* This function sleeps when the transport's Send Queue is congested.
+ *
+ * Assumptions:
+ * - If ib_post_send() succeeds, only one completion is expected,
+ *   even if one or more WRs are flushed. This is true when posting
+ *   an rdma_rw_ctx or when posting a single signaled WR.
+ */
+static int svc_rdma_post_send(struct svcxprt_rdma *rdma,
+			      struct ib_send_wr *first_wr,
+			      int num_wrs)
+{
+	struct svc_xprt *xprt = &rdma->sc_xprt;
+	struct ib_send_wr *bad_wr;
+	int ret;
+
+	do {
+		if ((atomic_sub_return(num_wrs, &rdma->sc_sq_avail) > 0)) {
+			ret = ib_post_send(rdma->sc_qp, first_wr, &bad_wr);
+			if (ret)
+				break;
+			return 0;
+		}
+
+		atomic_inc(&rdma_stat_sq_starve);
+		atomic_add(num_wrs, &rdma->sc_sq_avail);
+		wait_event(rdma->sc_send_wait,
+			   atomic_read(&rdma->sc_sq_avail) > num_wrs);
+	} while (1);
+
+	pr_err("svcrdma: post_send rc=%d; SQ avail=%d/%u\n",
+	       ret, atomic_read(&rdma->sc_sq_avail), rdma->sc_sq_depth);
+	set_bit(XPT_CLOSE, &xprt->xpt_flags);
+
+	/* If even one was posted, there will be a completion. */
+	if (bad_wr != first_wr)
+		return 0;
+
+	atomic_add(num_wrs, &rdma->sc_sq_avail);
+	wake_up(&rdma->sc_send_wait);
+	return -ENOTCONN;
+}
+
+static int svc_rdma_send_write_ctx(struct svcxprt_rdma *rdma,
+				   struct svc_rdma_rw_ctxt *ctxt,
+				   u64 offset, u32 rkey)
+{
+	struct ib_send_wr *first_wr;
+	int ret;
+
+	ret = rdma_rw_ctx_init(&ctxt->rw_ctx,
+			       rdma->sc_qp, rdma->sc_port_num,
+			       ctxt->rw_sg, ctxt->rw_nents,
+			       0, offset, rkey, DMA_TO_DEVICE);
+	if (ret < 0)
+		goto out_init;
+
+	ctxt->rw_wrcount = ret;
+	ctxt->rw_dir = DMA_TO_DEVICE;
+	ctxt->rw_cqe.done = svc_rdma_wc_write_ctx;
+	first_wr = rdma_rw_ctx_wrs(&ctxt->rw_ctx,
+				   rdma->sc_qp, rdma->sc_port_num,
+				   &ctxt->rw_cqe, NULL);
+	atomic_add(ret, &rdma_stat_write);
+	return svc_rdma_post_send(rdma, first_wr, ret);
+
+out_init:
+	pr_err("svcrdma: rdma_rw_ctx_init failed: %d\n", ret);
+	return -EIO;
+}
+
+/* Common information for sending a Write chunk.
+ *  - Tracks progress of writing one chunk
+ *  - Stores arguments for the SGL constructor function
+ */
+struct svc_rdma_write_info {
+	struct svcxprt_rdma	*wi_rdma;
+
+	/* write state of this chunk */
+	unsigned int		wi_bytes_consumed;
+	unsigned int		wi_seg_off;
+	unsigned int		wi_seg_no;
+	unsigned int		wi_nsegs;
+	__be32			*wi_segs;
+
+	/* SGL constructor arguments */
+	struct xdr_buf		*wi_xdr;
+	unsigned char		*wi_base;
+	unsigned int		wi_next_off;
+};
+
+static void svc_rdma_init_write_info(struct svcxprt_rdma *rdma, __be32 *chunk,
+				     struct svc_rdma_write_info *info)
+{
+	info->wi_rdma = rdma;
+	info->wi_bytes_consumed = 0;
+	info->wi_seg_off = 0;
+	info->wi_seg_no = 0;
+	info->wi_nsegs = be32_to_cpup(chunk + 1);
+	info->wi_segs = chunk + 2;
+}
+
+/* Build and DMA-map an SGL that covers one kvec in an xdr_buf
+ */
+static void svc_rdma_vec_to_sg(struct svc_rdma_write_info *info,
+			       unsigned int len,
+			       struct svc_rdma_rw_ctxt *ctxt)
+{
+	sg_set_buf(&ctxt->rw_sg[0], info->wi_base, len);
+	info->wi_base += len;
+
+	ctxt->rw_nents = 1;
+}
+
+/* Build and DMA-map an SGL that covers the pagelist of an xdr_buf
+ */
+static void svc_rdma_pagelist_to_sg(struct svc_rdma_write_info *info,
+				    unsigned int remaining,
+				    struct svc_rdma_rw_ctxt *ctxt)
+{
+	unsigned int sge_no, sge_bytes, page_off, page_no;
+	struct scatterlist *sg = ctxt->rw_sg;
+	struct xdr_buf *xdr = info->wi_xdr;
+
+	page_no = (info->wi_next_off + xdr->page_base) >> PAGE_SHIFT;
+	page_off = (info->wi_next_off + xdr->page_base) & ~PAGE_MASK;
+	info->wi_next_off += remaining;
+
+	sge_no = 0;
+	do {
+		sge_bytes = min_t(unsigned int, remaining,
+				  PAGE_SIZE - page_off);
+
+		sg_set_page(&sg[sge_no++], xdr->pages[page_no],
+			    sge_bytes, page_off);
+
+		remaining -= sge_bytes;
+		page_no++;
+		page_off = 0;
+	} while (remaining);
+
+	ctxt->rw_nents = sge_no;
+}
+
+/* Post Write WRs to send a portion of an xdr_buf containing
+ * an RPC Reply.
+ */
+static int
+svc_rdma_send_writes(struct svc_rdma_write_info *info,
+		     void (*constructor)(struct svc_rdma_write_info *info,
+					 unsigned int len,
+					 struct svc_rdma_rw_ctxt *ctxt),
+		     unsigned int total)
+{
+	struct svcxprt_rdma *rdma = info->wi_rdma;
+	unsigned int remaining, seg_no, seg_off;
+	struct svc_rdma_rw_ctxt *ctxt;
+	__be32 *seg;
+	int ret;
+
+	if (total == 0)
+		return 0;
+
+	remaining = total;
+	seg_no = info->wi_seg_no;
+	seg_off = info->wi_seg_off;
+	seg = info->wi_segs + seg_no * rpcrdma_segment_maxsz;
+	do {
+		unsigned int write_len;
+		u32 rs_length, rs_handle;
+		u64 rs_offset;
+
+		if (seg_no >= info->wi_nsegs)
+			goto out_overflow;
+
+		ctxt = svc_rdma_get_rw_ctxt(rdma);
+		if (!ctxt)
+			goto out_noctx;
+
+		rs_handle = be32_to_cpu(*seg++);
+		rs_length = be32_to_cpu(*seg++);
+		seg = xdr_decode_hyper(seg, &rs_offset);
+
+		write_len = min(remaining, rs_length - seg_off);
+		constructor(info, write_len, ctxt);
+		ret = svc_rdma_send_write_ctx(rdma, ctxt, rs_offset + seg_off,
+					      rs_handle);
+		if (ret < 0)
+			goto out_senderr;
+
+		if (write_len == rs_length - seg_off) {
+			seg_no++;
+			seg_off = 0;
+		} else {
+			seg_off += write_len;
+		}
+		remaining -= write_len;
+	} while (remaining);
+
+	info->wi_bytes_consumed += total;
+	info->wi_seg_no = seg_no;
+	info->wi_seg_off = seg_off;
+	return 0;
+
+out_overflow:
+	dprintk("svcrdma: inadequate space in Write chunk (%u)\n",
+		info->wi_nsegs);
+	return -E2BIG;
+
+out_noctx:
+	dprintk("svcrdma: no R/W ctxs available\n");
+	return -ENOMEM;
+
+out_senderr:
+	svc_rdma_put_rw_ctxt(ctxt);
+	pr_err("svcrdma: failed to write pagelist (%d)\n", ret);
+	return ret;
+}
+
+/* Send one of an xdr_buf's kvecs by itself. To send a Reply
+ * chunk, the whole RPC Reply is written back to the client.
+ * This function writes either the head or tail of the xdr_buf
+ * containing the Reply.
+ */
+static int svc_rdma_send_xdr_kvec(struct svc_rdma_write_info *info,
+				  struct kvec *vec)
+{
+	info->wi_base = vec->iov_base;
+
+	return svc_rdma_send_writes(info, svc_rdma_vec_to_sg,
+				    vec->iov_len);
+}
+
+/* Send an xdr_buf's page list by itself. A Write chunk is
+ * just the page list. A Reply chunk is the head, page list,
+ * and tail. This function is shared between the two types
+ * of chunk.
+ */
+static int svc_rdma_send_xdr_pagelist(struct svc_rdma_write_info *info,
+				      struct xdr_buf *xdr)
+{
+	info->wi_xdr = xdr;
+	info->wi_next_off = 0;
+
+	return svc_rdma_send_writes(info, svc_rdma_pagelist_to_sg,
+				    xdr->page_len);
+}
+
+/**
+ * svc_rdma_send_write_list - Write all chunks in the Write list
+ * @rdma: controlling RDMA transport
+ * @wr_ch: Write list provided by client
+ * @rdma_resp: buffer containing transport header under construction
+ * @xdr: xdr_buf carrying an RPC Reply
+ *
+ * Returns:
+ *	%0 if all needed RDMA Writes were posted successfully,
+ *	%-E2BIG if the payload was larger than the Write chunk,
+ *	%-ENOMEM if rdma_rw context pool was exhausted,
+ *	%-ENOTCONN if posting failed (connection is lost),
+ *	%-EIO if rdma_rw initialization failed (DMA mapping, etc).
+ *
+ * Assumptions:
+ *  - Only one Write chunk, and it's the xdr_buf's entire pagelist
+ */
+int svc_rdma_send_write_list(struct svcxprt_rdma *rdma,
+			     __be32 *wr_ch, __be32 *rdma_resp,
+			     struct xdr_buf *xdr)
+{
+	struct svc_rdma_write_info info;
+	int ret;
+
+	svc_rdma_init_write_info(rdma, wr_ch, &info);
+
+	ret = svc_rdma_send_xdr_pagelist(&info, xdr);
+
+	svc_rdma_xdr_encode_write_list(rdma_resp, wr_ch,
+				       info.wi_bytes_consumed);
+	return ret;
+}
+
+/**
+ * svc_rdma_send_reply_chunk - Write all segments in the Reply chunk
+ * @rdma: controlling RDMA transport
+ * @wr_lst: Write list provided by client
+ * @rp_ch: Reply chunk provided by client
+ * @rdma_resp: buffer containing transport header for Reply
+ * @xdr: xdr_buf carrying an RPC Reply
+ *
+ * Returns:
+ *	%0 if all needed RDMA Writes were posted successfully,
+ *	%-E2BIG if the payload was larger than the Reply chunk,
+ *	%-ENOMEM if rdma_rw context pool was exhausted,
+ *	%-ENOTCONN if posting failed (connection is lost),
+ *	%-EIO if rdma_rw initialization failed (DMA mapping, etc).
+ *
+ * Assumptions:
+ *  - The Reply chunk always carries the whole xdr_buf
+ */
+int svc_rdma_send_reply_chunk(struct svcxprt_rdma *rdma, __be32 *wr_lst,
+			      __be32 *rp_ch, __be32 *rdma_resp,
+			      struct xdr_buf *xdr)
+{
+	struct svc_rdma_write_info info;
+	int ret;
+
+	svc_rdma_init_write_info(rdma, rp_ch, &info);
+
+	ret = svc_rdma_send_xdr_kvec(&info, &xdr->head[0]);
+	if (ret < 0)
+		goto out;
+
+	/* When Write list entries are present, the server has already
+	 * transmitted the pagelist payload via a Write chunk. Thus
+	 * we can skip the pagelist here.
+	 */
+	if (!wr_lst) {
+		ret = svc_rdma_send_xdr_pagelist(&info, xdr);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = svc_rdma_send_xdr_kvec(&info, &xdr->tail[0]);
+
+out:
+	svc_rdma_xdr_encode_reply_chunk(rdma_resp, rp_ch,
+					info.wi_bytes_consumed);
+	return ret;
+}
+
+/* Pull one Read chunk (segment) from the client.
+ *
+ * Returns zero if one or more RDMA Reads have been posted.  Otherwise,
+ * returns a negative errno if there is a Read list present but RDMA
+ * Reads could not be posted.
+ *
+ * For incoming Reads, @rqstp provides a page list containing sink pages.
+ * As pages are prepared for I/O, they are transferred to @head. After
+ * all Reads in the list have completed, svc_rdma_recvfrom builds an
+ * xdr_buf from the page list in @head.
+ *
+ * On entry, *page_no and *page_offset point into the rqstp's page list.
+ * On return, *page_no and *page_offset are updated to point to the next
+ * position in the page list.
+ */
+static int svc_rdma_recv_read_segment(struct svcxprt_rdma *rdma,
+				      struct svc_rdma_op_ctxt *head,
+				      struct svc_rqst *rqstp,
+				      unsigned int *page_no,
+				      unsigned int *page_offset,
+				      u32 rkey, u32 len, u64 offset,
+				      bool last)
+{
+	struct svc_rdma_rw_ctxt *ctxt;
+	struct ib_send_wr *first_wr;
+	unsigned int pg_no, seg_no;
+	u32 pg_off;
+	int ret;
+
+	dprintk("svcrdma: reading segment %u@0x%016llx:0x%08x\n",
+		len, offset, rkey);
+
+	ctxt = svc_rdma_get_rw_ctxt(rdma);
+	if (!ctxt)
+		return -ENOMEM;
+
+	pg_off = *page_offset;
+	pg_no = *page_no;
+	ctxt->rw_nents = PAGE_ALIGN(*page_offset + len) >> PAGE_SHIFT;
+	for (seg_no = 0; seg_no < ctxt->rw_nents; seg_no++) {
+		unsigned int seg_len = min_t(unsigned int, len,
+					     PAGE_SIZE - pg_off);
+
+		head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
+		head->arg.page_len += seg_len;
+		head->arg.len += seg_len;
+		if (!pg_off)
+			head->count++;
+
+		sg_set_page(&ctxt->rw_sg[seg_no], rqstp->rq_arg.pages[pg_no],
+			    seg_len, pg_off);
+
+		rqstp->rq_respages = &rqstp->rq_arg.pages[pg_no + 1];
+		rqstp->rq_next_page = rqstp->rq_respages + 1;
+
+		pg_off += seg_len;
+		if (pg_off == PAGE_SIZE) {
+			pg_off = 0;
+			pg_no++;
+		}
+		len -= seg_len;
+	}
+
+	ret = rdma_rw_ctx_init(&ctxt->rw_ctx,
+			       rdma->sc_qp, rdma->sc_port_num,
+			       ctxt->rw_sg, ctxt->rw_nents,
+			       0, offset, rkey, DMA_FROM_DEVICE);
+	if (ret < 0)
+		goto out_init;
+
+	ctxt->rw_wrcount = ret;
+	ctxt->rw_dir = DMA_FROM_DEVICE;
+	ctxt->rw_cqe.done = svc_rdma_wc_read_ctx;
+	ctxt->rw_readctxt = last ? head : NULL;
+	first_wr = rdma_rw_ctx_wrs(&ctxt->rw_ctx,
+				   rdma->sc_qp, rdma->sc_port_num,
+				   &ctxt->rw_cqe, NULL);
+	atomic_add(ret, &rdma_stat_read);
+	ret = svc_rdma_post_send(rdma, first_wr, ret);
+	if (ret)
+		goto out_send;
+
+	*page_no = pg_no;
+	*page_offset = pg_off;
+	return 0;
+
+out_init:
+	pr_err("svcrdma: rdma_rw_ctx_init failed: %d\n", ret);
+	ret = -EIO;
+out_send:
+	svc_rdma_put_rw_ctxt(ctxt);
+	return ret;
+}
+
+/* If there was additional inline content, append it to the end of arg.pages.
+ * Tail copy has to be done after the reader function has determined how many
+ * pages were consumed for RDMA Read.
+ */
+static int svc_rdma_copy_tail(struct svc_rqst *rqstp,
+			      struct svc_rdma_op_ctxt *head, u32 position,
+			      unsigned int page_offset, unsigned int page_no)
+{
+	char *srcp, *destp;
+	u32 byte_count;
+
+	srcp = head->arg.head[0].iov_base + position;
+	byte_count = head->arg.head[0].iov_len - position;
+	if (byte_count > PAGE_SIZE) {
+		dprintk("svcrdma: large tail unsupported\n");
+		return 0;
+	}
+
+	/* Fit as much of the tail on the current page as possible */
+	if (page_offset != PAGE_SIZE) {
+		destp = page_address(rqstp->rq_arg.pages[page_no]);
+		destp += page_offset;
+		while (byte_count--) {
+			*destp++ = *srcp++;
+			page_offset++;
+			if (page_offset == PAGE_SIZE && byte_count)
+				goto more;
+		}
+		goto done;
+	}
+
+more:
+	/* Fit the rest on the next page */
+	page_no++;
+	destp = page_address(rqstp->rq_arg.pages[page_no]);
+	while (byte_count--)
+		*destp++ = *srcp++;
+
+	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no + 1];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
+
+done:
+	byte_count = head->arg.head[0].iov_len - position;
+	head->arg.page_len += byte_count;
+	head->arg.len += byte_count;
+	head->arg.buflen += byte_count;
+	return 1;
+}
+
+static unsigned int svc_rdma_read_chunk_count(__be32 *p)
+{
+	unsigned int nsegs;
+
+	for (nsegs = 0; *p != xdr_zero; p += rpcrdma_readchunk_maxsz)
+		nsegs++;
+	return nsegs;
+}
+
+/**
+ * svc_rdma_recv_read_list - Pull read chunks from the client
+ * @rdma: controlling RDMA transport
+ * @ch: pointer to Read list in the incoming transport header
+ * @head: pages under I/O collect here
+ * @rqstp: set of pages to use as Read sink buffers
+ *
+ * Returns:
+ *	%0 if there is no Read list,
+ *	%1 if all needed RDMA Reads were posted successfully,
+ *	%-EINVAL if the Read chunk data is too large,
+ *	%-ENOMEM if rdma_rw context pool was exhausted,
+ *	%-ENOTCONN if posting failed (connection is lost),
+ *	%-EIO if rdma_rw initialization failed (DMA mapping, etc).
+ *
+ * Assumptions:
+ * - Clients can send multiple Read chunks in a Read list, but
+ *   the chunks must all have the same value in their Position
+ *   field.
+ */
+int svc_rdma_recv_read_list(struct svcxprt_rdma *rdma, __be32 *ch,
+			    struct svc_rdma_op_ctxt *head,
+			    struct svc_rqst *rqstp)
+{
+	unsigned int page_no, page_offset;
+	u32 position;
+	__be32 *p;
+	bool last;
+	int ret;
+
+	p = ch;
+
+	/* Sanity check */
+	if (svc_rdma_read_chunk_count(p++) > RPCSVC_MAXPAGES)
+		return -EINVAL;
+
+	/* "head" keeps all the pages that comprise the request.
+	 */
+	head->arg.head[0] = rqstp->rq_arg.head[0];
+	head->arg.tail[0] = rqstp->rq_arg.tail[0];
+	head->hdr_count = head->count;
+	head->arg.page_base = 0;
+	head->arg.page_len = 0;
+	head->arg.len = rqstp->rq_arg.len;
+	head->arg.buflen = rqstp->rq_arg.buflen;
+
+	/* RDMA_NOMSG: RDMA Read data should land just after Receive data.
+	 */
+	position = be32_to_cpu(*p++);
+	if (position == 0) {
+		head->arg.pages = &head->pages[0];
+		page_offset = head->byte_len;
+	} else {
+		head->arg.pages = &head->pages[head->count];
+		page_offset = 0;
+	}
+
+	/* This server implementation supports only one Read chunk (of one
+	 * or more segments) per message. The list walk is terminated once
+	 * "position" changes.
+	 */
+	page_no = 0;
+	last = false;
+	while (!last) {
+		u32 rs_handle, rs_length;
+		u64 rs_offset;
+
+		rs_handle = be32_to_cpu(*p++);
+		rs_length = be32_to_cpu(*p++);
+		p = xdr_decode_hyper(p, &rs_offset);
+
+		/* Examine next read segment */
+		if (*p == xdr_zero ||
+		    ((*p != xdr_zero) && (be32_to_cpu(*(p + 1)) != position)))
+			last = true;
+
+		ret = svc_rdma_recv_read_segment(rdma, head, rqstp,
+						 &page_no, &page_offset,
+						 rs_handle, rs_length,
+						 rs_offset, last);
+		if (ret < 0)
+			goto out;
+
+		p += 2;
+	}
+
+	/* Read list may need XDR round-up (see RFC 5666, s. 3.7) */
+	if (page_offset & 3) {
+		u32 pad = 4 - (page_offset & 3);
+
+		head->arg.tail[0].iov_len += pad;
+		head->arg.len += pad;
+		head->arg.buflen += pad;
+		page_offset += pad;
+	}
+
+	ret = 1;
+	if (position && position < head->arg.head[0].iov_len)
+		ret = svc_rdma_copy_tail(rqstp, head, position,
+					 page_offset, page_no);
+	head->arg.head[0].iov_len = position;
+	head->position = position;
+
+ out:
+	/* Detach arg pages. svc_recv will replenish them */
+	for (page_no = 0;
+	     &rqstp->rq_pages[page_no] < rqstp->rq_respages; page_no++)
+		rqstp->rq_pages[page_no] = NULL;
+	return ret;
+}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index b4028bc3..e4b8800 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -406,7 +406,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
 		}
 	}
 	/* Update the req with the number of chunks actually used */
-	svc_rdma_xdr_encode_write_list(rdma_resp, chunk_no);
+	svc_rdma_old_encode_write_list(rdma_resp, chunk_no);
 
 	return rqstp->rq_res.page_len;
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b84cd53..90fabad 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -560,6 +560,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 	INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
 	INIT_LIST_HEAD(&cma_xprt->sc_frmr_q);
 	INIT_LIST_HEAD(&cma_xprt->sc_ctxts);
+	INIT_LIST_HEAD(&cma_xprt->sc_rw_ctxts);
 	INIT_LIST_HEAD(&cma_xprt->sc_maps);
 	init_waitqueue_head(&cma_xprt->sc_send_wait);
 
@@ -567,6 +568,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 	spin_lock_init(&cma_xprt->sc_rq_dto_lock);
 	spin_lock_init(&cma_xprt->sc_frmr_q_lock);
 	spin_lock_init(&cma_xprt->sc_ctxt_lock);
+	spin_lock_init(&cma_xprt->sc_rw_ctxt_lock);
 	spin_lock_init(&cma_xprt->sc_map_lock);
 
 	/*
@@ -998,6 +1000,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		newxprt, newxprt->sc_cm_id);
 
 	dev = newxprt->sc_cm_id->device;
+	newxprt->sc_port_num = newxprt->sc_cm_id->port_num;
 
 	/* Qualify the transport resource defaults with the
 	 * capabilities of this particular device */
@@ -1247,6 +1250,7 @@ static void __svc_rdma_free(struct work_struct *work)
 	}
 
 	rdma_dealloc_frmr_q(rdma);
+	svc_rdma_destroy_rw_ctxts(rdma);
 	svc_rdma_destroy_ctxts(rdma);
 	svc_rdma_destroy_maps(rdma);
 


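The page-walking loop in svc_rdma_recv_read_segment() and the XDR
round-up in svc_rdma_recv_read_list() in the patch above can be
modelled in user space. What follows is only a sketch under stated
assumptions: MODEL_PAGE_SIZE, struct sink_pos, split_segment() and
xdr_round_up_pad() are hypothetical names that do not appear in the
patch, a 4KB page size is assumed, and no scatterlist or page array
is actually touched.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4KB pages */

/* Hypothetical stand-in for the (*page_no, *page_offset) pair. */
struct sink_pos {
	unsigned int page_no;
	unsigned int page_off;
};

/* Walk one Read segment of @len bytes across the sink pages.
 * Returns the number of scatterlist entries the segment would
 * need, and advances @pos to the next free byte.
 */
static unsigned int split_segment(struct sink_pos *pos, unsigned int len)
{
	/* Same arithmetic as PAGE_ALIGN(off + len) >> PAGE_SHIFT */
	unsigned int nents =
		(pos->page_off + len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	unsigned int i;

	for (i = 0; i < nents; i++) {
		unsigned int room = MODEL_PAGE_SIZE - pos->page_off;
		unsigned int seg_len = len < room ? len : room;

		pos->page_off += seg_len;
		if (pos->page_off == MODEL_PAGE_SIZE) {
			pos->page_off = 0;
			pos->page_no++;
		}
		len -= seg_len;
	}

	/* Every byte of the segment must have found a home. */
	assert(len == 0);
	return nents;
}

/* Bytes of XDR padding needed to reach a 4-byte boundary. */
static unsigned int xdr_round_up_pad(unsigned int page_off)
{
	return (page_off & 3) ? 4 - (page_off & 3) : 0;
}
```

For example, an 8192-byte segment that starts 100 bytes into page 0
needs three entries (3996 + 4096 + 100 bytes) and leaves the position
at page 2, offset 100.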

* [PATCH v1 06/14] svcrdma: Use rdma_rw API in RPC reply path
From: Chuck Lever @ 2017-03-16 15:53 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA

The current svcrdma sendto code path posts one RDMA Write WR at a
time. Each of these Writes typically carries a small number of pages
(for instance, up to 30 pages for mlx4 devices). That means a 1MB
NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
and one for the Send WR carrying the actual RPC Reply message.
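
The arithmetic behind that count can be checked with a small
user-space helper. This is a sketch only: old_path_post_count() is a
hypothetical name, and the 4KB page size used in the example call is
an assumption that the text above does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Number of ib_post_send() calls when each RDMA Write WR is posted
 * on its own and can carry at most @pages_per_write pages, plus one
 * final post for the Send WR.
 */
static unsigned int old_path_post_count(size_t payload_bytes,
					size_t page_size,
					unsigned int pages_per_write)
{
	size_t pages, writes;

	assert(page_size > 0 && pages_per_write > 0);

	pages = (payload_bytes + page_size - 1) / page_size;
	writes = (pages + pages_per_write - 1) / pages_per_write;
	return (unsigned int)writes + 1;	/* + 1 for the Send WR */
}
```

With a 1MB payload, 4KB pages, and 30 pages per Write, the payload
spans 256 pages, which rounds up to 9 Write WRs; the helper returns
10 once the Send WR is counted.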

Instead, use the new rdma_rw API. This gives two main benefits:

1. All Write WRs for one RDMA segment are posted in a single chain.
Just one doorbell for each RPC's Write chunk data.

2. The Write path can now use FRWR to register the Write buffers.
If the device's maximum page list depth is large, this means a
single Write WR is needed for each RPC's Write chunk data.

But also, the details of Write WR chain construction and memory
registration are taken care of elsewhere. svcrdma can focus on
the details of the RPC-over-RDMA protocol.

Note also that the new code introduces support for RPCs that
carry both a Write list and a Reply chunk. This combination might
be used for an NFSv4 READ where the data payload is large and the
RPC Reply message is still larger than the inline threshold.
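
How each part of the reply xdr_buf travels in these combinations can
be summarized with a simplified model. This is a sketch, not the
patch's code: enum reply_dest, struct reply_placement and
place_reply() are hypothetical names, the model assumes a single
Write chunk that carries the entire page list, and it ignores the
XDR pad handling in the tail.

```c
#include <assert.h>
#include <stdbool.h>

enum reply_dest {
	VIA_SEND,		/* inline, in the RDMA Send */
	VIA_WRITE_CHUNK,	/* RDMA Write into a Write chunk */
	VIA_REPLY_CHUNK,	/* RDMA Write into the Reply chunk */
};

struct reply_placement {
	enum reply_dest head;
	enum reply_dest pagelist;
	enum reply_dest tail;
};

/* Decide where the head, page list, and tail of the reply go,
 * given which chunk types the client provided.
 */
static struct reply_placement place_reply(bool has_write_list,
					  bool has_reply_chunk)
{
	/* Without a Reply chunk, the message body is sent inline. */
	enum reply_dest body = has_reply_chunk ? VIA_REPLY_CHUNK : VIA_SEND;
	struct reply_placement p;

	p.head = body;
	p.tail = body;
	/* A Write chunk, when present, takes the page list so it is
	 * not transmitted a second time with the rest of the body.
	 */
	p.pagelist = has_write_list ? VIA_WRITE_CHUNK : body;
	return p;
}
```

In the combined case, place_reply(true, true) puts the page list in
the Write chunk and the head and tail in the Reply chunk, leaving
only the transport header for the Send.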

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    6 
 net/sunrpc/xprtrdma/svc_rdma_sendto.c      |  594 +++++++++++-----------------
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |    2 
 3 files changed, 231 insertions(+), 371 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 71ad9cd..24c26f4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -90,9 +90,9 @@ int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
  * Caller holds the connection's mutex and has already marshaled
  * the RPC/RDMA request.
  *
- * This is similar to svc_rdma_reply, but takes an rpc_rqst
- * instead, does not support chunks, and avoids blocking memory
- * allocation.
+ * This is similar to svc_rdma_send_reply_msg, but takes a struct
+ * rpc_rqst instead, does not support chunks, and avoids blocking
+ * memory allocation.
  *
  * XXX: There is still an opportunity to block in svc_rdma_send()
  * if there are no SQ entries to post the Send. This may occur if
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index e4b8800..be6b11a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2016 Oracle. All rights reserved.
  * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
  * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
  *
@@ -40,6 +41,63 @@
  * Author: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  */
 
+/* Operation
+ *
+ * The main entry point is svc_rdma_sendto. This is called by the
+ * RPC server when an RPC Reply is ready to be transmitted to a client.
+ *
+ * The passed-in svc_rqst contains a struct xdr_buf which holds an
+ * XDR-encoded RPC Reply message. sendto must construct the RPC-over-RDMA
+ * transport header, post all Write WRs needed for this Reply, then post
+ * a Send WR conveying the transport header and the RPC message itself to
+ * the client.
+ *
+ * svc_rdma_sendto must fully transmit the Reply before returning, as
+ * the svc_rqst will be recycled as soon as sendto returns. Remaining
+ * resources referred to by the svc_rqst are also recycled at that time.
+ * Therefore any resources that must remain longer must be detached
+ * from the svc_rqst and released later.
+ *
+ * Page Management
+ *
+ * The I/O that performs Reply transmission is asynchronous, and may
+ * complete well after sendto returns. Thus pages under I/O must be
+ * removed from the svc_rqst before sendto returns.
+ *
+ * The logic here depends on Send Queue and completion ordering. Since
+ * the Send WR is always posted last, it will always complete last. Thus
+ * when it completes, it is guaranteed that all previous Write WRs have
+ * also completed.
+ *
+ * Write WRs are constructed and posted. Each Write segment gets its own
+ * svc_rdma_rw_ctxt, allowing the Write completion handler to find and
+ * DMA-unmap the pages under I/O for that Write segment. The Write
+ * completion handler does not release any pages.
+ *
+ * When the Send WR is constructed, it also gets its own svc_rdma_op_ctxt.
+ * The ownership of all of the Reply's pages is transferred into that
+ * ctxt, the Send WR is posted, and sendto returns.
+ *
+ * The svc_rdma_op_ctxt is presented when the Send WR completes. The
+ * Send completion handler finally releases the Reply's pages.
+ *
+ * This mechanism also assumes that completions on the transport's Send
+ * Completion Queue do not run in parallel. Otherwise a Write completion
+ * and Send completion running at the same time could release pages that
+ * are still DMA-mapped.
+ *
+ * Error Handling
+ *
+ * - If the Send WR is posted successfully, it will either complete
+ *   successfully, or get flushed. Either way, the Send completion
+ *   handler releases the Reply's pages.
+ * - If the Send WR cannot be posted, the forward path releases
+ *   the Reply's pages.
+ *
+ * This handles the case, without the use of page reference counting,
+ * where two different Write segments send portions of the same page.
+ */
+
 #include <linux/sunrpc/debug.h>
 #include <linux/sunrpc/rpc_rdma.h>
 #include <linux/spinlock.h>
@@ -123,45 +181,14 @@ int svc_rdma_map_xdr(struct svcxprt_rdma *xprt,
 	return 0;
 }
 
-static dma_addr_t dma_map_xdr(struct svcxprt_rdma *xprt,
-			      struct xdr_buf *xdr,
-			      u32 xdr_off, size_t len, int dir)
-{
-	struct page *page;
-	dma_addr_t dma_addr;
-	if (xdr_off < xdr->head[0].iov_len) {
-		/* This offset is in the head */
-		xdr_off += (unsigned long)xdr->head[0].iov_base & ~PAGE_MASK;
-		page = virt_to_page(xdr->head[0].iov_base);
-	} else {
-		xdr_off -= xdr->head[0].iov_len;
-		if (xdr_off < xdr->page_len) {
-			/* This offset is in the page list */
-			xdr_off += xdr->page_base;
-			page = xdr->pages[xdr_off >> PAGE_SHIFT];
-			xdr_off &= ~PAGE_MASK;
-		} else {
-			/* This offset is in the tail */
-			xdr_off -= xdr->page_len;
-			xdr_off += (unsigned long)
-				xdr->tail[0].iov_base & ~PAGE_MASK;
-			page = virt_to_page(xdr->tail[0].iov_base);
-		}
-	}
-	dma_addr = ib_dma_map_page(xprt->sc_cm_id->device, page, xdr_off,
-				   min_t(size_t, PAGE_SIZE, len), dir);
-	return dma_addr;
-}
-
 /* Parse the RPC Call's transport header.
  */
-static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
-				      struct rpcrdma_write_array **write,
-				      struct rpcrdma_write_array **reply)
+static void svc_rdma_get_write_arrays(__be32 *rdma_argp,
+				      __be32 **write, __be32 **reply)
 {
 	__be32 *p;
 
-	p = (__be32 *)&rmsgp->rm_body.rm_chunks[0];
+	p = rdma_argp + rpcrdma_fixed_maxsz;
 
 	/* Read list */
 	while (*p++ != xdr_zero)
@@ -169,7 +196,7 @@ static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
 
 	/* Write list */
 	if (*p != xdr_zero) {
-		*write = (struct rpcrdma_write_array *)p;
+		*write = p;
 		while (*p++ != xdr_zero)
 			p += 1 + be32_to_cpu(*p) * 4;
 	} else {
@@ -179,7 +206,7 @@ static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
 
 	/* Reply chunk */
 	if (*p != xdr_zero)
-		*reply = (struct rpcrdma_write_array *)p;
+		*reply = p;
 	else
 		*reply = NULL;
 }
@@ -189,31 +216,50 @@ static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
  * Invalidate, and responder chooses one rkey to invalidate.
  *
  * Find a candidate rkey to invalidate when sending a reply.  Picks the
- * first rkey it finds in the chunks lists.
+ * first R_key it finds in the chunk lists.
  *
  * Returns zero if RPC's chunk lists are empty.
  */
-static u32 svc_rdma_get_inv_rkey(struct rpcrdma_msg *rdma_argp,
-				 struct rpcrdma_write_array *wr_ary,
-				 struct rpcrdma_write_array *rp_ary)
+static u32 svc_rdma_get_inv_rkey(__be32 *rdma_argp,
+				 __be32 *wr_lst, __be32 *rp_ch)
 {
-	struct rpcrdma_read_chunk *rd_ary;
-	struct rpcrdma_segment *arg_ch;
+	__be32 *p;
 
-	rd_ary = (struct rpcrdma_read_chunk *)&rdma_argp->rm_body.rm_chunks[0];
-	if (rd_ary->rc_discrim != xdr_zero)
-		return be32_to_cpu(rd_ary->rc_target.rs_handle);
+	p = rdma_argp + rpcrdma_fixed_maxsz;
+	if (*p != xdr_zero)
+		p += 2;
+	else if (wr_lst && be32_to_cpup(wr_lst + 1))
+		p = wr_lst + 2;
+	else if (rp_ch && be32_to_cpup(rp_ch + 1))
+		p = rp_ch + 2;
+	else
+		return 0;
+	return be32_to_cpup(p);
+}
 
-	if (wr_ary && be32_to_cpu(wr_ary->wc_nchunks)) {
-		arg_ch = &wr_ary->wc_array[0].wc_target;
-		return be32_to_cpu(arg_ch->rs_handle);
-	}
+/* ib_dma_map_page() is used here because svc_rdma_dma_unmap()
+ * is used during completion to DMA-unmap this memory, and
+ * it uses ib_dma_unmap_page() exclusively.
+ */
+static int svc_rdma_dma_map_buf(struct svcxprt_rdma *rdma,
+				struct svc_rdma_op_ctxt *ctxt,
+				unsigned int sge_no,
+				unsigned char *base,
+				unsigned int len)
+{
+	unsigned long offset = (unsigned long)base & ~PAGE_MASK;
+	struct ib_device *dev = rdma->sc_cm_id->device;
+	dma_addr_t dma_addr;
 
-	if (rp_ary && be32_to_cpu(rp_ary->wc_nchunks)) {
-		arg_ch = &rp_ary->wc_array[0].wc_target;
-		return be32_to_cpu(arg_ch->rs_handle);
-	}
+	dma_addr = ib_dma_map_page(dev, virt_to_page(base),
+				   offset, len, DMA_TO_DEVICE);
+	if (ib_dma_mapping_error(dev, dma_addr))
+		return -EIO;
 
+	ctxt->sge[sge_no].addr = dma_addr;
+	ctxt->sge[sge_no].length = len;
+	ctxt->sge[sge_no].lkey = rdma->sc_pd->local_dma_lkey;
+	svc_rdma_count_mappings(rdma, ctxt);
 	return 0;
 }
 
@@ -260,222 +306,73 @@ int svc_rdma_map_reply_hdr(struct svcxprt_rdma *rdma,
 	return svc_rdma_dma_map_page(rdma, ctxt, 0, ctxt->pages[0], 0, len);
 }
 
-/* Assumptions:
- * - The specified write_len can be represented in sc_max_sge * PAGE_SIZE
+/* Load the xdr_buf into the ctxt's sge array, and DMA map each
+ * element as it is added.
+ *
+ * Returns the number of sge elements loaded on success, or
+ * a negative errno on failure.
  */
-static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
-		      u32 rmr, u64 to,
-		      u32 xdr_off, int write_len,
-		      struct svc_rdma_req_map *vec)
+static int svc_rdma_map_reply_msg(struct svcxprt_rdma *rdma,
+				  struct svc_rdma_op_ctxt *ctxt,
+				  struct xdr_buf *xdr, __be32 *wr_lst)
 {
-	struct ib_rdma_wr write_wr;
-	struct ib_sge *sge;
-	int xdr_sge_no;
-	int sge_no;
-	int sge_bytes;
-	int sge_off;
-	int bc;
-	struct svc_rdma_op_ctxt *ctxt;
+	unsigned int len, sge_no, remaining, page_off;
+	struct page **ppages;
+	unsigned char *base;
+	u32 xdr_pad;
+	int ret;
 
-	if (vec->count > RPCSVC_MAXPAGES) {
-		pr_err("svcrdma: Too many pages (%lu)\n", vec->count);
-		return -EIO;
-	}
+	sge_no = 1;
 
-	dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, "
-		"write_len=%d, vec->sge=%p, vec->count=%lu\n",
-		rmr, (unsigned long long)to, xdr_off,
-		write_len, vec->sge, vec->count);
+	ret = svc_rdma_dma_map_buf(rdma, ctxt, sge_no++,
+				   xdr->head[0].iov_base,
+				   xdr->head[0].iov_len);
+	if (ret < 0)
+		return ret;
 
-	ctxt = svc_rdma_get_context(xprt);
-	ctxt->direction = DMA_TO_DEVICE;
-	sge = ctxt->sge;
-
-	/* Find the SGE associated with xdr_off */
-	for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count;
-	     xdr_sge_no++) {
-		if (vec->sge[xdr_sge_no].iov_len > bc)
-			break;
-		bc -= vec->sge[xdr_sge_no].iov_len;
-	}
+	/* If a Write chunk is present, the xdr_buf's page list
+	 * is not included inline. However the Upper Layer may
+	 * have added XDR padding in the tail buffer, and that
+	 * should not be included inline.
+	 */
+	if (wr_lst) {
+		base = xdr->tail[0].iov_base;
+		len = xdr->tail[0].iov_len;
+		xdr_pad = xdr_padsize(xdr->page_len);
 
-	sge_off = bc;
-	bc = write_len;
-	sge_no = 0;
-
-	/* Copy the remaining SGE */
-	while (bc != 0) {
-		sge_bytes = min_t(size_t,
-			  bc, vec->sge[xdr_sge_no].iov_len-sge_off);
-		sge[sge_no].length = sge_bytes;
-		sge[sge_no].addr =
-			dma_map_xdr(xprt, &rqstp->rq_res, xdr_off,
-				    sge_bytes, DMA_TO_DEVICE);
-		xdr_off += sge_bytes;
-		if (ib_dma_mapping_error(xprt->sc_cm_id->device,
-					 sge[sge_no].addr))
-			goto err;
-		svc_rdma_count_mappings(xprt, ctxt);
-		sge[sge_no].lkey = xprt->sc_pd->local_dma_lkey;
-		ctxt->count++;
-		sge_off = 0;
-		sge_no++;
-		xdr_sge_no++;
-		if (xdr_sge_no > vec->count) {
-			pr_err("svcrdma: Too many sges (%d)\n", xdr_sge_no);
-			goto err;
+		if (len && xdr_pad) {
+			base += xdr_pad;
+			len -= xdr_pad;
 		}
-		bc -= sge_bytes;
-		if (sge_no == xprt->sc_max_sge)
-			break;
-	}
-
-	/* Prepare WRITE WR */
-	memset(&write_wr, 0, sizeof write_wr);
-	ctxt->cqe.done = svc_rdma_wc_write;
-	write_wr.wr.wr_cqe = &ctxt->cqe;
-	write_wr.wr.sg_list = &sge[0];
-	write_wr.wr.num_sge = sge_no;
-	write_wr.wr.opcode = IB_WR_RDMA_WRITE;
-	write_wr.wr.send_flags = IB_SEND_SIGNALED;
-	write_wr.rkey = rmr;
-	write_wr.remote_addr = to;
-
-	/* Post It */
-	atomic_inc(&rdma_stat_write);
-	if (svc_rdma_send(xprt, &write_wr.wr))
-		goto err;
-	return write_len - bc;
- err:
-	svc_rdma_unmap_dma(ctxt);
-	svc_rdma_put_context(ctxt, 0);
-	return -EIO;
-}
-
-noinline
-static int send_write_chunks(struct svcxprt_rdma *xprt,
-			     struct rpcrdma_write_array *wr_ary,
-			     struct rpcrdma_msg *rdma_resp,
-			     struct svc_rqst *rqstp,
-			     struct svc_rdma_req_map *vec)
-{
-	u32 xfer_len = rqstp->rq_res.page_len;
-	int write_len;
-	u32 xdr_off;
-	int chunk_off;
-	int chunk_no;
-	int nchunks;
-	struct rpcrdma_write_array *res_ary;
-	int ret;
 
-	res_ary = (struct rpcrdma_write_array *)
-		&rdma_resp->rm_body.rm_chunks[1];
-
-	/* Write chunks start at the pagelist */
-	nchunks = be32_to_cpu(wr_ary->wc_nchunks);
-	for (xdr_off = rqstp->rq_res.head[0].iov_len, chunk_no = 0;
-	     xfer_len && chunk_no < nchunks;
-	     chunk_no++) {
-		struct rpcrdma_segment *arg_ch;
-		u64 rs_offset;
-
-		arg_ch = &wr_ary->wc_array[chunk_no].wc_target;
-		write_len = min(xfer_len, be32_to_cpu(arg_ch->rs_length));
-
-		/* Prepare the response chunk given the length actually
-		 * written */
-		xdr_decode_hyper((__be32 *)&arg_ch->rs_offset, &rs_offset);
-		svc_rdma_xdr_encode_array_chunk(res_ary, chunk_no,
-						arg_ch->rs_handle,
-						arg_ch->rs_offset,
-						write_len);
-		chunk_off = 0;
-		while (write_len) {
-			ret = send_write(xprt, rqstp,
-					 be32_to_cpu(arg_ch->rs_handle),
-					 rs_offset + chunk_off,
-					 xdr_off,
-					 write_len,
-					 vec);
-			if (ret <= 0)
-				goto out_err;
-			chunk_off += ret;
-			xdr_off += ret;
-			xfer_len -= ret;
-			write_len -= ret;
-		}
+		goto tail;
 	}
-	/* Update the req with the number of chunks actually used */
-	svc_rdma_old_encode_write_list(rdma_resp, chunk_no);
 
-	return rqstp->rq_res.page_len;
+	ppages = xdr->pages + (xdr->page_base >> PAGE_SHIFT);
+	page_off = xdr->page_base & ~PAGE_MASK;
+	remaining = xdr->page_len;
+	while (remaining) {
+		len = min_t(u32, PAGE_SIZE - page_off, remaining);
 
-out_err:
-	pr_err("svcrdma: failed to send write chunks, rc=%d\n", ret);
-	return -EIO;
-}
-
-noinline
-static int send_reply_chunks(struct svcxprt_rdma *xprt,
-			     struct rpcrdma_write_array *rp_ary,
-			     struct rpcrdma_msg *rdma_resp,
-			     struct svc_rqst *rqstp,
-			     struct svc_rdma_req_map *vec)
-{
-	u32 xfer_len = rqstp->rq_res.len;
-	int write_len;
-	u32 xdr_off;
-	int chunk_no;
-	int chunk_off;
-	int nchunks;
-	struct rpcrdma_segment *ch;
-	struct rpcrdma_write_array *res_ary;
-	int ret;
+		ret = svc_rdma_dma_map_page(rdma, ctxt, sge_no++,
+					    *ppages++, page_off, len);
+		if (ret < 0)
+			return ret;
 
-	/* XXX: need to fix when reply lists occur with read-list and or
-	 * write-list */
-	res_ary = (struct rpcrdma_write_array *)
-		&rdma_resp->rm_body.rm_chunks[2];
-
-	/* xdr offset starts at RPC message */
-	nchunks = be32_to_cpu(rp_ary->wc_nchunks);
-	for (xdr_off = 0, chunk_no = 0;
-	     xfer_len && chunk_no < nchunks;
-	     chunk_no++) {
-		u64 rs_offset;
-		ch = &rp_ary->wc_array[chunk_no].wc_target;
-		write_len = min(xfer_len, be32_to_cpu(ch->rs_length));
-
-		/* Prepare the reply chunk given the length actually
-		 * written */
-		xdr_decode_hyper((__be32 *)&ch->rs_offset, &rs_offset);
-		svc_rdma_xdr_encode_array_chunk(res_ary, chunk_no,
-						ch->rs_handle, ch->rs_offset,
-						write_len);
-		chunk_off = 0;
-		while (write_len) {
-			ret = send_write(xprt, rqstp,
-					 be32_to_cpu(ch->rs_handle),
-					 rs_offset + chunk_off,
-					 xdr_off,
-					 write_len,
-					 vec);
-			if (ret <= 0)
-				goto out_err;
-			chunk_off += ret;
-			xdr_off += ret;
-			xfer_len -= ret;
-			write_len -= ret;
-		}
+		remaining -= len;
+		page_off = 0;
 	}
-	/* Update the req with the number of chunks actually used */
-	svc_rdma_xdr_encode_reply_array(res_ary, chunk_no);
 
-	return rqstp->rq_res.len;
+	base = xdr->tail[0].iov_base;
+	len = xdr->tail[0].iov_len;
+tail:
+	if (len) {
+		ret = svc_rdma_dma_map_buf(rdma, ctxt, sge_no++, base, len);
+		if (ret < 0)
+			return ret;
+	}
 
-out_err:
-	pr_err("svcrdma: failed to send reply chunks, rc=%d\n", ret);
-	return -EIO;
+	return sge_no - 1;
 }
 
 /* The svc_rqst and all resources it owns are released as soon as
@@ -517,89 +414,62 @@ void svc_rdma_build_send_wr(struct svc_rdma_op_ctxt *ctxt, int num_sge)
 		send_wr->num_sge);
 }
 
-/* This function prepares the portion of the RPCRDMA message to be
- * sent in the RDMA_SEND. This function is called after data sent via
- * RDMA has already been transmitted. There are three cases:
- * - The RPCRDMA header, RPC header, and payload are all sent in a
- *   single RDMA_SEND. This is the "inline" case.
- * - The RPCRDMA header and some portion of the RPC header and data
- *   are sent via this RDMA_SEND and another portion of the data is
- *   sent via RDMA.
- * - The RPCRDMA header [NOMSG] is sent in this RDMA_SEND and the RPC
- *   header and data are all transmitted via RDMA.
- * In all three cases, this function prepares the RPCRDMA header in
- * sge[0], the 'type' parameter indicates the type to place in the
- * RPCRDMA header, and the 'byte_count' field indicates how much of
- * the XDR to include in this RDMA_SEND. NB: The offset of the payload
- * to send is zero in the XDR.
+/* Prepare the portion of the RPC Reply that will be transmitted
+ * via RDMA Send. The RPC-over-RDMA transport header is prepared
+ * in sge[0], and the RPC xdr_buf is prepared in following sges.
+ *
+ * Depending on whether a Write list or Reply chunk is present,
+ * the server may send all, a portion of, or none of the xdr_buf.
+ * In the last case, only the transport header (sge[0]) is
+ * transmitted.
+ *
+ * RDMA Send is the last step of transmitting an RPC reply. Pages
+ * involved in the earlier RDMA Writes are here transferred out
+ * of the rqstp and into the ctxt's page array. These pages are
+ * DMA unmapped by each Write completion, but the subsequent Send
+ * completion finally releases these pages.
+ *
+ * Assumptions:
+ * - The Reply's transport header will never be larger than a page.
  */
-static int send_reply(struct svcxprt_rdma *rdma,
-		      struct svc_rqst *rqstp,
-		      struct page *page,
-		      struct rpcrdma_msg *rdma_resp,
-		      struct svc_rdma_req_map *vec,
-		      int byte_count,
-		      u32 inv_rkey)
+static int svc_rdma_send_reply_msg(struct svcxprt_rdma *rdma,
+				   __be32 *rdma_argp, __be32 *rdma_resp,
+				   struct svc_rqst *rqstp,
+				   __be32 *wr_lst, __be32 *rp_ch)
 {
 	struct svc_rdma_op_ctxt *ctxt;
 	struct ib_send_wr *send_wr;
-	u32 xdr_off;
-	int sge_no;
-	int sge_bytes;
-	int ret = -EIO;
+	int ret;
+
+	dprintk("svcrdma: sending %s reply: head=%zu, pagelen=%u, tail=%zu\n",
+		(rp_ch ? "RDMA_NOMSG" : "RDMA_MSG"),
+		rqstp->rq_res.head[0].iov_len,
+		rqstp->rq_res.page_len,
+		rqstp->rq_res.tail[0].iov_len);
 
-	/* Prepare the context */
 	ctxt = svc_rdma_get_context(rdma);
-	ctxt->direction = DMA_TO_DEVICE;
-	ctxt->pages[0] = page;
-	ctxt->count = 1;
 
-	/* Prepare the SGE for the RPCRDMA Header */
-	ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
-	ctxt->sge[0].length =
-	    svc_rdma_xdr_get_reply_hdr_len((__be32 *)rdma_resp);
-	ctxt->sge[0].addr =
-	    ib_dma_map_page(rdma->sc_cm_id->device, page, 0,
-			    ctxt->sge[0].length, DMA_TO_DEVICE);
-	if (ib_dma_mapping_error(rdma->sc_cm_id->device, ctxt->sge[0].addr))
+	ret = svc_rdma_map_reply_hdr(rdma, ctxt, rdma_resp,
+				     svc_rdma_xdr_get_reply_hdr_len(rdma_resp));
+	if (ret < 0)
 		goto err;
-	svc_rdma_count_mappings(rdma, ctxt);
-
-	ctxt->direction = DMA_TO_DEVICE;
 
-	/* Map the payload indicated by 'byte_count' */
-	xdr_off = 0;
-	for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) {
-		sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
-		byte_count -= sge_bytes;
-		ctxt->sge[sge_no].addr =
-			dma_map_xdr(rdma, &rqstp->rq_res, xdr_off,
-				    sge_bytes, DMA_TO_DEVICE);
-		xdr_off += sge_bytes;
-		if (ib_dma_mapping_error(rdma->sc_cm_id->device,
-					 ctxt->sge[sge_no].addr))
+	if (!rp_ch) {
+		ret = svc_rdma_map_reply_msg(rdma, ctxt,
+					     &rqstp->rq_res, wr_lst);
+		if (ret < 0)
 			goto err;
-		svc_rdma_count_mappings(rdma, ctxt);
-		ctxt->sge[sge_no].lkey = rdma->sc_pd->local_dma_lkey;
-		ctxt->sge[sge_no].length = sge_bytes;
-	}
-	if (byte_count != 0) {
-		pr_err("svcrdma: Could not map %d bytes\n", byte_count);
-		goto err;
 	}
 
 	svc_rdma_save_io_pages(rqstp, ctxt);
 
-	if (sge_no > rdma->sc_max_sge) {
-		pr_err("svcrdma: Too many sges (%d)\n", sge_no);
-		goto err;
-	}
-
-	svc_rdma_build_send_wr(ctxt, sge_no);
+	svc_rdma_build_send_wr(ctxt, 1 + ret);
 	send_wr = &ctxt->send_wr;
-	if (inv_rkey) {
-		send_wr->opcode = IB_WR_SEND_WITH_INV;
-		send_wr->ex.invalidate_rkey = inv_rkey;
+	if (rdma->sc_snd_w_inv) {
+		send_wr->ex.invalidate_rkey =
+			svc_rdma_get_inv_rkey(rdma_argp, wr_lst, rp_ch);
+		if (send_wr->ex.invalidate_rkey)
+			send_wr->opcode = IB_WR_SEND_WITH_INV;
 	}
 	ret = svc_rdma_send(rdma, send_wr);
 	if (ret)
@@ -607,7 +477,8 @@ static int send_reply(struct svcxprt_rdma *rdma,
 
 	return 0;
 
- err:
+err:
+	pr_err("svcrdma: failed to post Send WR (%d)\n", ret);
 	svc_rdma_unmap_dma(ctxt);
 	svc_rdma_put_context(ctxt, 1);
 	return ret;
@@ -617,39 +488,35 @@ void svc_rdma_prep_reply_hdr(struct svc_rqst *rqstp)
 {
 }
 
+/**
+ * svc_rdma_sendto - Transmit an RPC reply
+ * @rqstp: processed RPC request, reply XDR already in ::rq_res
+ *
+ * Any resources still associated with @rqstp are released upon return.
+ * If no reply message was possible, the connection is closed.
+ *
+ * Returns:
+ *	%0 if an RPC reply has been successfully posted,
+ *	%-ENOMEM if a resource shortage occurred (connection is lost),
+ *	%-ENOTCONN if posting failed (connection is lost).
+ */
 int svc_rdma_sendto(struct svc_rqst *rqstp)
 {
 	struct svc_xprt *xprt = rqstp->rq_xprt;
 	struct svcxprt_rdma *rdma =
 		container_of(xprt, struct svcxprt_rdma, sc_xprt);
-	struct rpcrdma_msg *rdma_argp;
-	struct rpcrdma_msg *rdma_resp;
-	struct rpcrdma_write_array *wr_ary, *rp_ary;
-	int ret;
-	int inline_bytes;
+	__be32 *p, *rdma_argp, *rdma_resp, *wr_lst, *rp_ch;
 	struct page *res_page;
-	struct svc_rdma_req_map *vec;
-	u32 inv_rkey;
-	__be32 *p;
-
-	dprintk("svcrdma: sending response for rqstp=%p\n", rqstp);
+	int ret;
 
-	/* Get the RDMA request header. The receive logic always
-	 * places this at the start of page 0.
+	/* Find the call's chunk lists to decide how to send the reply.
+	 * Receive places the Call's xprt header at the start of page 0.
 	 */
 	rdma_argp = page_address(rqstp->rq_pages[0]);
-	svc_rdma_get_write_arrays(rdma_argp, &wr_ary, &rp_ary);
+	svc_rdma_get_write_arrays(rdma_argp, &wr_lst, &rp_ch);
 
-	inv_rkey = 0;
-	if (rdma->sc_snd_w_inv)
-		inv_rkey = svc_rdma_get_inv_rkey(rdma_argp, wr_ary, rp_ary);
-
-	/* Build an req vec for the XDR */
-	vec = svc_rdma_get_req_map(rdma);
-	ret = svc_rdma_map_xdr(rdma, &rqstp->rq_res, vec, wr_ary != NULL);
-	if (ret)
-		goto err0;
-	inline_bytes = rqstp->rq_res.len;
+	dprintk("svcrdma: preparing response for XID 0x%08x\n",
+		be32_to_cpup(rdma_argp));
 
 	/* Create the RDMA response header. xprt->xpt_mutex,
 	 * acquired in svc_send(), serializes RPC replies. The
@@ -663,51 +530,42 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 		goto err0;
 	rdma_resp = page_address(res_page);
 
-	p = &rdma_resp->rm_xid;
-	*p++ = rdma_argp->rm_xid;
-	*p++ = rdma_argp->rm_vers;
+	p = rdma_resp;
+	*p++ = *rdma_argp;
+	*p++ = *(rdma_argp + 1);
 	*p++ = rdma->sc_fc_credits;
-	*p++ = rp_ary ? rdma_nomsg : rdma_msg;
+	*p++ = rp_ch ? rdma_nomsg : rdma_msg;
 
 	/* Start with empty chunks */
 	*p++ = xdr_zero;
 	*p++ = xdr_zero;
 	*p   = xdr_zero;
 
-	/* Send any write-chunk data and build resp write-list */
-	if (wr_ary) {
-		ret = send_write_chunks(rdma, wr_ary, rdma_resp, rqstp, vec);
+	if (wr_lst) {
+		ret = svc_rdma_send_write_list(rdma, wr_lst, rdma_resp,
+					       &rqstp->rq_res);
 		if (ret < 0)
 			goto err1;
-		inline_bytes -= ret + xdr_padsize(ret);
 	}
-
-	/* Send any reply-list data and update resp reply-list */
-	if (rp_ary) {
-		ret = send_reply_chunks(rdma, rp_ary, rdma_resp, rqstp, vec);
+	if (rp_ch) {
+		ret = svc_rdma_send_reply_chunk(rdma, wr_lst, rp_ch,
+						rdma_resp, &rqstp->rq_res);
 		if (ret < 0)
 			goto err1;
-		inline_bytes -= ret;
 	}
 
-	/* Post a fresh Receive buffer _before_ sending the reply */
 	ret = svc_rdma_post_recv(rdma, GFP_KERNEL);
 	if (ret)
 		goto err1;
-
-	ret = send_reply(rdma, rqstp, res_page, rdma_resp, vec,
-			 inline_bytes, inv_rkey);
+	ret = svc_rdma_send_reply_msg(rdma, rdma_argp, rdma_resp, rqstp,
+				      wr_lst, rp_ch);
 	if (ret < 0)
 		goto err0;
-
-	svc_rdma_put_req_map(rdma, vec);
-	dprintk("svcrdma: send_reply returns %d\n", ret);
-	return ret;
+	return 0;
 
  err1:
 	put_page(res_page);
  err0:
-	svc_rdma_put_req_map(rdma, vec);
 	pr_err("svcrdma: Could not send reply, err=%d. Closing transport.\n",
 	       ret);
 	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 90fabad..fecf220 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1052,6 +1052,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	memset(&qp_attr, 0, sizeof qp_attr);
 	qp_attr.event_handler = qp_event_handler;
 	qp_attr.qp_context = &newxprt->sc_xprt;
+	qp_attr.port_num = newxprt->sc_cm_id->port_num;
+	qp_attr.cap.max_rdma_ctxs = newxprt->sc_max_requests;
 	qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
 	qp_attr.cap.max_recv_wr = newxprt->sc_rq_depth;
 	qp_attr.cap.max_send_sge = newxprt->sc_max_sge;



* [PATCH v1 06/14] svcrdma: Use rdma_rw API in RPC reply path
@ 2017-03-16 15:53     ` Chuck Lever
  0 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:53 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The current svcrdma sendto code path posts one RDMA Write WR at a
time. Each of these Writes typically carries a small number of pages
(for instance, up to 30 pages for mlx4 devices). That means a 1MB
NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
and one for the Send WR carrying the actual RPC Reply message.
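
As a rough check of the arithmetic above, here is a minimal sketch. It
assumes a 4KB page size, which is not stated in this message, and the
helper names are made up for illustration; they do not exist in svcrdma.

```c
#include <assert.h>
#include <stddef.h>

/* Number of RDMA Write WRs needed when each WR can carry at most
 * pages_per_wr pages. Both divisions round up, so a partially
 * filled page or WR is still counted.
 */
static size_t write_wrs_needed(size_t payload_bytes, size_t page_size,
			       size_t pages_per_wr)
{
	size_t pages;

	assert(page_size > 0 && pages_per_wr > 0);
	pages = (payload_bytes + page_size - 1) / page_size;
	return (pages + pages_per_wr - 1) / pages_per_wr;
}

/* Total post calls in the one-WR-at-a-time path: one per Write WR,
 * plus one for the Send WR that carries the RPC Reply message.
 */
static size_t old_path_post_calls(size_t payload_bytes, size_t page_size,
				  size_t pages_per_wr)
{
	return write_wrs_needed(payload_bytes, page_size, pages_per_wr) + 1;
}
```

With a 1MB payload, 4096-byte pages, and 30 pages per WR, that is 256
pages, hence 9 Write WRs and 10 post calls in total.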

Instead, use the new rdma_rw API. This gives two main benefits:

1. All Write WRs for one RDMA segment are posted in a single chain.
Just one doorbell for each RPC's Write chunk data.

2. The Write path can now use FRWR to register the Write buffers.
If the device's maximum page list depth is large, this means a
single Write WR is needed for each RPC's Write chunk data.
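
To illustrate the first benefit, here is a toy model of posting a WR
chain. It is only a sketch: struct toy_wr, struct toy_sq, and both
helpers are hypothetical stand-ins, not the verbs or rdma_rw API. The
point is that a chain is handed over in one call, so the doorbell is
rung once no matter how many WRs are linked.

```c
#include <assert.h>
#include <stddef.h>

/* Toy work request: only the chain link matters here. */
struct toy_wr {
	struct toy_wr *next;
};

/* Toy send queue that counts doorbells and queued WRs. */
struct toy_sq {
	unsigned int doorbells;
	unsigned int wrs_queued;
};

/* Link the first n entries of an array into a NULL-terminated chain. */
static struct toy_wr *toy_chain(struct toy_wr *wrs, size_t n)
{
	size_t i;

	assert(n > 0);
	for (i = 0; i + 1 < n; i++)
		wrs[i].next = &wrs[i + 1];
	wrs[n - 1].next = NULL;
	return &wrs[0];
}

/* Post a whole chain with one call: every WR on the chain is
 * queued, but the doorbell is rung only once.
 */
static void toy_post_send(struct toy_sq *sq, struct toy_wr *first)
{
	struct toy_wr *wr;

	for (wr = first; wr; wr = wr->next)
		sq->wrs_queued++;
	sq->doorbells++;
}
```

Posting nine unchained WRs through toy_post_send() costs nine
doorbells; posting the same nine as one chain costs one.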

But also, the details of Write WR chain construction and memory
registration are taken care of elsewhere. svcrdma can focus on
the details of the RPC-over-RDMA protocol.

Note also that the new code introduces support for RPCs that
carry both a Write list and a Reply chunk. This combination might
be used for an NFSv4 READ where the data payload is large and the
RPC Reply message is still larger than the inline threshold.
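
A user-space sketch of how the chunk combination decides what travels
inline in the Send, loosely following svc_rdma_send_reply_msg() and
svc_rdma_map_reply_msg() in the diff below. The function names and the
boolean parameters are illustrative only, and the underflow guard on
the tail is an addition that the patch itself does not have.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes of XDR padding needed to round len up to a 4-byte boundary. */
static size_t xdr_pad_size(size_t len)
{
	return (4 - (len & 3)) & 3;
}

/* How many bytes of the reply's xdr_buf are sent inline, given
 * which chunks the client provided in the Call.
 */
static size_t inline_reply_bytes(size_t head_len, size_t page_len,
				 size_t tail_len,
				 bool has_write_list, bool has_reply_chunk)
{
	size_t pad;

	/* Reply chunk: the whole RPC message goes by RDMA Write,
	 * so only the transport header is sent (RDMA_NOMSG).
	 */
	if (has_reply_chunk)
		return 0;

	/* Write list only: the page list goes by RDMA Write. Any XDR
	 * pad at the front of the tail is skipped too. The length
	 * check keeps a short tail from underflowing.
	 */
	if (has_write_list) {
		pad = xdr_pad_size(page_len);
		if (pad && tail_len >= pad)
			tail_len -= pad;
		return head_len + tail_len;
	}

	/* No chunks: head, page list, and tail are all sent inline. */
	return head_len + page_len + tail_len;
}
```

When both a Write list and a Reply chunk are present, the Reply chunk
check wins and nothing from the xdr_buf is sent inline.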

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    6 
 net/sunrpc/xprtrdma/svc_rdma_sendto.c      |  594 +++++++++++-----------------
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |    2 
 3 files changed, 231 insertions(+), 371 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 71ad9cd..24c26f4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -90,9 +90,9 @@ int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
  * Caller holds the connection's mutex and has already marshaled
  * the RPC/RDMA request.
  *
- * This is similar to svc_rdma_reply, but takes an rpc_rqst
- * instead, does not support chunks, and avoids blocking memory
- * allocation.
+ * This is similar to svc_rdma_send_reply_msg, but takes a struct
+ * rpc_rqst instead, does not support chunks, and avoids blocking
+ * memory allocation.
  *
  * XXX: There is still an opportunity to block in svc_rdma_send()
  * if there are no SQ entries to post the Send. This may occur if
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index e4b8800..be6b11a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2016 Oracle. All rights reserved.
  * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
  * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
  *
@@ -40,6 +41,63 @@
  * Author: Tom Tucker <tom@opengridcomputing.com>
  */
 
+/* Operation
+ *
+ * The main entry point is svc_rdma_sendto. This is called by the
+ * RPC server when an RPC Reply is ready to be transmitted to a client.
+ *
+ * The passed-in svc_rqst contains a struct xdr_buf which holds an
+ * XDR-encoded RPC Reply message. sendto must construct the RPC-over-RDMA
+ * transport header, post all Write WRs needed for this Reply, then post
+ * a Send WR conveying the transport header and the RPC message itself to
+ * the client.
+ *
+ * svc_rdma_sendto must fully transmit the Reply before returning, as
+ * the svc_rqst will be recycled as soon as sendto returns. Remaining
+ * resources referred to by the svc_rqst are also recycled at that time.
+ * Therefore any resources that must remain longer must be detached
+ * from the svc_rqst and released later.
+ *
+ * Page Management
+ *
+ * The I/O that performs Reply transmission is asynchronous, and may
+ * complete well after sendto returns. Thus pages under I/O must be
+ * removed from the svc_rqst before sendto returns.
+ *
+ * The logic here depends on Send Queue and completion ordering. Since
+ * the Send WR is always posted last, it will always complete last. Thus
+ * when it completes, it is guaranteed that all previous Write WRs have
+ * also completed.
+ *
+ * Write WRs are constructed and posted. Each Write segment gets its own
+ * svc_rdma_rw_ctxt, allowing the Write completion handler to find and
+ * DMA-unmap the pages under I/O for that Write segment. The Write
+ * completion handler does not release any pages.
+ *
+ * When the Send WR is constructed, it also gets its own svc_rdma_op_ctxt.
+ * Ownership of all of the Reply's pages is transferred into that
+ * ctxt, the Send WR is posted, and sendto returns.
+ *
+ * The svc_rdma_op_ctxt is presented when the Send WR completes. The
+ * Send completion handler finally releases the Reply's pages.
+ *
+ * This mechanism also assumes that completions on the transport's Send
+ * Completion Queue do not run in parallel. Otherwise a Write completion
+ * and Send completion running at the same time could release pages that
+ * are still DMA-mapped.
+ *
+ * Error Handling
+ *
+ * - If the Send WR is posted successfully, it will either complete
+ *   successfully, or get flushed. Either way, the Send completion
+ *   handler releases the Reply's pages.
+ * - If the Send WR cannot be posted, the forward path releases
+ *   the Reply's pages.
+ *
+ * This handles the case, without the use of page reference counting,
+ * where two different Write segments send portions of the same page.
+ */
+
 #include <linux/sunrpc/debug.h>
 #include <linux/sunrpc/rpc_rdma.h>
 #include <linux/spinlock.h>
@@ -123,45 +181,14 @@ int svc_rdma_map_xdr(struct svcxprt_rdma *xprt,
 	return 0;
 }
 
-static dma_addr_t dma_map_xdr(struct svcxprt_rdma *xprt,
-			      struct xdr_buf *xdr,
-			      u32 xdr_off, size_t len, int dir)
-{
-	struct page *page;
-	dma_addr_t dma_addr;
-	if (xdr_off < xdr->head[0].iov_len) {
-		/* This offset is in the head */
-		xdr_off += (unsigned long)xdr->head[0].iov_base & ~PAGE_MASK;
-		page = virt_to_page(xdr->head[0].iov_base);
-	} else {
-		xdr_off -= xdr->head[0].iov_len;
-		if (xdr_off < xdr->page_len) {
-			/* This offset is in the page list */
-			xdr_off += xdr->page_base;
-			page = xdr->pages[xdr_off >> PAGE_SHIFT];
-			xdr_off &= ~PAGE_MASK;
-		} else {
-			/* This offset is in the tail */
-			xdr_off -= xdr->page_len;
-			xdr_off += (unsigned long)
-				xdr->tail[0].iov_base & ~PAGE_MASK;
-			page = virt_to_page(xdr->tail[0].iov_base);
-		}
-	}
-	dma_addr = ib_dma_map_page(xprt->sc_cm_id->device, page, xdr_off,
-				   min_t(size_t, PAGE_SIZE, len), dir);
-	return dma_addr;
-}
-
 /* Parse the RPC Call's transport header.
  */
-static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
-				      struct rpcrdma_write_array **write,
-				      struct rpcrdma_write_array **reply)
+static void svc_rdma_get_write_arrays(__be32 *rdma_argp,
+				      __be32 **write, __be32 **reply)
 {
 	__be32 *p;
 
-	p = (__be32 *)&rmsgp->rm_body.rm_chunks[0];
+	p = rdma_argp + rpcrdma_fixed_maxsz;
 
 	/* Read list */
 	while (*p++ != xdr_zero)
@@ -169,7 +196,7 @@ static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
 
 	/* Write list */
 	if (*p != xdr_zero) {
-		*write = (struct rpcrdma_write_array *)p;
+		*write = p;
 		while (*p++ != xdr_zero)
 			p += 1 + be32_to_cpu(*p) * 4;
 	} else {
@@ -179,7 +206,7 @@ static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
 
 	/* Reply chunk */
 	if (*p != xdr_zero)
-		*reply = (struct rpcrdma_write_array *)p;
+		*reply = p;
 	else
 		*reply = NULL;
 }
@@ -189,31 +216,50 @@ static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
  * Invalidate, and responder chooses one rkey to invalidate.
  *
  * Find a candidate rkey to invalidate when sending a reply.  Picks the
- * first rkey it finds in the chunks lists.
+ * first R_key it finds in the chunk lists.
  *
  * Returns zero if RPC's chunk lists are empty.
  */
-static u32 svc_rdma_get_inv_rkey(struct rpcrdma_msg *rdma_argp,
-				 struct rpcrdma_write_array *wr_ary,
-				 struct rpcrdma_write_array *rp_ary)
+static u32 svc_rdma_get_inv_rkey(__be32 *rdma_argp,
+				 __be32 *wr_lst, __be32 *rp_ch)
 {
-	struct rpcrdma_read_chunk *rd_ary;
-	struct rpcrdma_segment *arg_ch;
+	__be32 *p;
 
-	rd_ary = (struct rpcrdma_read_chunk *)&rdma_argp->rm_body.rm_chunks[0];
-	if (rd_ary->rc_discrim != xdr_zero)
-		return be32_to_cpu(rd_ary->rc_target.rs_handle);
+	p = rdma_argp + rpcrdma_fixed_maxsz;
+	if (*p != xdr_zero)
+		p += 2;
+	else if (wr_lst && be32_to_cpup(wr_lst + 1))
+		p = wr_lst + 2;
+	else if (rp_ch && be32_to_cpup(rp_ch + 1))
+		p = rp_ch + 2;
+	else
+		return 0;
+	return be32_to_cpup(p);
+}
 
-	if (wr_ary && be32_to_cpu(wr_ary->wc_nchunks)) {
-		arg_ch = &wr_ary->wc_array[0].wc_target;
-		return be32_to_cpu(arg_ch->rs_handle);
-	}
+/* ib_dma_map_page() is used here because svc_rdma_dma_unmap()
+ * is used during completion to DMA-unmap this memory, and
+ * it uses ib_dma_unmap_page() exclusively.
+ */
+static int svc_rdma_dma_map_buf(struct svcxprt_rdma *rdma,
+				struct svc_rdma_op_ctxt *ctxt,
+				unsigned int sge_no,
+				unsigned char *base,
+				unsigned int len)
+{
+	unsigned long offset = (unsigned long)base & ~PAGE_MASK;
+	struct ib_device *dev = rdma->sc_cm_id->device;
+	dma_addr_t dma_addr;
 
-	if (rp_ary && be32_to_cpu(rp_ary->wc_nchunks)) {
-		arg_ch = &rp_ary->wc_array[0].wc_target;
-		return be32_to_cpu(arg_ch->rs_handle);
-	}
+	dma_addr = ib_dma_map_page(dev, virt_to_page(base),
+				   offset, len, DMA_TO_DEVICE);
+	if (ib_dma_mapping_error(dev, dma_addr))
+		return -EIO;
 
+	ctxt->sge[sge_no].addr = dma_addr;
+	ctxt->sge[sge_no].length = len;
+	ctxt->sge[sge_no].lkey = rdma->sc_pd->local_dma_lkey;
+	svc_rdma_count_mappings(rdma, ctxt);
 	return 0;
 }
 
@@ -260,222 +306,73 @@ int svc_rdma_map_reply_hdr(struct svcxprt_rdma *rdma,
 	return svc_rdma_dma_map_page(rdma, ctxt, 0, ctxt->pages[0], 0, len);
 }
 
-/* Assumptions:
- * - The specified write_len can be represented in sc_max_sge * PAGE_SIZE
+/* Load the xdr_buf into the ctxt's sge array, and DMA map each
+ * element as it is added.
+ *
+ * Returns the number of sge elements loaded on success, or
+ * a negative errno on failure.
  */
-static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
-		      u32 rmr, u64 to,
-		      u32 xdr_off, int write_len,
-		      struct svc_rdma_req_map *vec)
+static int svc_rdma_map_reply_msg(struct svcxprt_rdma *rdma,
+				  struct svc_rdma_op_ctxt *ctxt,
+				  struct xdr_buf *xdr, __be32 *wr_lst)
 {
-	struct ib_rdma_wr write_wr;
-	struct ib_sge *sge;
-	int xdr_sge_no;
-	int sge_no;
-	int sge_bytes;
-	int sge_off;
-	int bc;
-	struct svc_rdma_op_ctxt *ctxt;
+	unsigned int len, sge_no, remaining, page_off;
+	struct page **ppages;
+	unsigned char *base;
+	u32 xdr_pad;
+	int ret;
 
-	if (vec->count > RPCSVC_MAXPAGES) {
-		pr_err("svcrdma: Too many pages (%lu)\n", vec->count);
-		return -EIO;
-	}
+	sge_no = 1;
 
-	dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, "
-		"write_len=%d, vec->sge=%p, vec->count=%lu\n",
-		rmr, (unsigned long long)to, xdr_off,
-		write_len, vec->sge, vec->count);
+	ret = svc_rdma_dma_map_buf(rdma, ctxt, sge_no++,
+				   xdr->head[0].iov_base,
+				   xdr->head[0].iov_len);
+	if (ret < 0)
+		return ret;
 
-	ctxt = svc_rdma_get_context(xprt);
-	ctxt->direction = DMA_TO_DEVICE;
-	sge = ctxt->sge;
-
-	/* Find the SGE associated with xdr_off */
-	for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count;
-	     xdr_sge_no++) {
-		if (vec->sge[xdr_sge_no].iov_len > bc)
-			break;
-		bc -= vec->sge[xdr_sge_no].iov_len;
-	}
+	/* If a Write chunk is present, the xdr_buf's page list
+	 * is not included inline. However the Upper Layer may
+	 * have added XDR padding in the tail buffer, and that
+	 * should not be included inline.
+	 */
+	if (wr_lst) {
+		base = xdr->tail[0].iov_base;
+		len = xdr->tail[0].iov_len;
+		xdr_pad = xdr_padsize(xdr->page_len);
 
-	sge_off = bc;
-	bc = write_len;
-	sge_no = 0;
-
-	/* Copy the remaining SGE */
-	while (bc != 0) {
-		sge_bytes = min_t(size_t,
-			  bc, vec->sge[xdr_sge_no].iov_len-sge_off);
-		sge[sge_no].length = sge_bytes;
-		sge[sge_no].addr =
-			dma_map_xdr(xprt, &rqstp->rq_res, xdr_off,
-				    sge_bytes, DMA_TO_DEVICE);
-		xdr_off += sge_bytes;
-		if (ib_dma_mapping_error(xprt->sc_cm_id->device,
-					 sge[sge_no].addr))
-			goto err;
-		svc_rdma_count_mappings(xprt, ctxt);
-		sge[sge_no].lkey = xprt->sc_pd->local_dma_lkey;
-		ctxt->count++;
-		sge_off = 0;
-		sge_no++;
-		xdr_sge_no++;
-		if (xdr_sge_no > vec->count) {
-			pr_err("svcrdma: Too many sges (%d)\n", xdr_sge_no);
-			goto err;
+		if (len && xdr_pad) {
+			base += xdr_pad;
+			len -= xdr_pad;
 		}
-		bc -= sge_bytes;
-		if (sge_no == xprt->sc_max_sge)
-			break;
-	}
-
-	/* Prepare WRITE WR */
-	memset(&write_wr, 0, sizeof write_wr);
-	ctxt->cqe.done = svc_rdma_wc_write;
-	write_wr.wr.wr_cqe = &ctxt->cqe;
-	write_wr.wr.sg_list = &sge[0];
-	write_wr.wr.num_sge = sge_no;
-	write_wr.wr.opcode = IB_WR_RDMA_WRITE;
-	write_wr.wr.send_flags = IB_SEND_SIGNALED;
-	write_wr.rkey = rmr;
-	write_wr.remote_addr = to;
-
-	/* Post It */
-	atomic_inc(&rdma_stat_write);
-	if (svc_rdma_send(xprt, &write_wr.wr))
-		goto err;
-	return write_len - bc;
- err:
-	svc_rdma_unmap_dma(ctxt);
-	svc_rdma_put_context(ctxt, 0);
-	return -EIO;
-}
-
-noinline
-static int send_write_chunks(struct svcxprt_rdma *xprt,
-			     struct rpcrdma_write_array *wr_ary,
-			     struct rpcrdma_msg *rdma_resp,
-			     struct svc_rqst *rqstp,
-			     struct svc_rdma_req_map *vec)
-{
-	u32 xfer_len = rqstp->rq_res.page_len;
-	int write_len;
-	u32 xdr_off;
-	int chunk_off;
-	int chunk_no;
-	int nchunks;
-	struct rpcrdma_write_array *res_ary;
-	int ret;
 
-	res_ary = (struct rpcrdma_write_array *)
-		&rdma_resp->rm_body.rm_chunks[1];
-
-	/* Write chunks start at the pagelist */
-	nchunks = be32_to_cpu(wr_ary->wc_nchunks);
-	for (xdr_off = rqstp->rq_res.head[0].iov_len, chunk_no = 0;
-	     xfer_len && chunk_no < nchunks;
-	     chunk_no++) {
-		struct rpcrdma_segment *arg_ch;
-		u64 rs_offset;
-
-		arg_ch = &wr_ary->wc_array[chunk_no].wc_target;
-		write_len = min(xfer_len, be32_to_cpu(arg_ch->rs_length));
-
-		/* Prepare the response chunk given the length actually
-		 * written */
-		xdr_decode_hyper((__be32 *)&arg_ch->rs_offset, &rs_offset);
-		svc_rdma_xdr_encode_array_chunk(res_ary, chunk_no,
-						arg_ch->rs_handle,
-						arg_ch->rs_offset,
-						write_len);
-		chunk_off = 0;
-		while (write_len) {
-			ret = send_write(xprt, rqstp,
-					 be32_to_cpu(arg_ch->rs_handle),
-					 rs_offset + chunk_off,
-					 xdr_off,
-					 write_len,
-					 vec);
-			if (ret <= 0)
-				goto out_err;
-			chunk_off += ret;
-			xdr_off += ret;
-			xfer_len -= ret;
-			write_len -= ret;
-		}
+		goto tail;
 	}
-	/* Update the req with the number of chunks actually used */
-	svc_rdma_old_encode_write_list(rdma_resp, chunk_no);
 
-	return rqstp->rq_res.page_len;
+	ppages = xdr->pages + (xdr->page_base >> PAGE_SHIFT);
+	page_off = xdr->page_base & ~PAGE_MASK;
+	remaining = xdr->page_len;
+	while (remaining) {
+		len = min_t(u32, PAGE_SIZE - page_off, remaining);
 
-out_err:
-	pr_err("svcrdma: failed to send write chunks, rc=%d\n", ret);
-	return -EIO;
-}
-
-noinline
-static int send_reply_chunks(struct svcxprt_rdma *xprt,
-			     struct rpcrdma_write_array *rp_ary,
-			     struct rpcrdma_msg *rdma_resp,
-			     struct svc_rqst *rqstp,
-			     struct svc_rdma_req_map *vec)
-{
-	u32 xfer_len = rqstp->rq_res.len;
-	int write_len;
-	u32 xdr_off;
-	int chunk_no;
-	int chunk_off;
-	int nchunks;
-	struct rpcrdma_segment *ch;
-	struct rpcrdma_write_array *res_ary;
-	int ret;
+		ret = svc_rdma_dma_map_page(rdma, ctxt, sge_no++,
+					    *ppages++, page_off, len);
+		if (ret < 0)
+			return ret;
 
-	/* XXX: need to fix when reply lists occur with read-list and or
-	 * write-list */
-	res_ary = (struct rpcrdma_write_array *)
-		&rdma_resp->rm_body.rm_chunks[2];
-
-	/* xdr offset starts at RPC message */
-	nchunks = be32_to_cpu(rp_ary->wc_nchunks);
-	for (xdr_off = 0, chunk_no = 0;
-	     xfer_len && chunk_no < nchunks;
-	     chunk_no++) {
-		u64 rs_offset;
-		ch = &rp_ary->wc_array[chunk_no].wc_target;
-		write_len = min(xfer_len, be32_to_cpu(ch->rs_length));
-
-		/* Prepare the reply chunk given the length actually
-		 * written */
-		xdr_decode_hyper((__be32 *)&ch->rs_offset, &rs_offset);
-		svc_rdma_xdr_encode_array_chunk(res_ary, chunk_no,
-						ch->rs_handle, ch->rs_offset,
-						write_len);
-		chunk_off = 0;
-		while (write_len) {
-			ret = send_write(xprt, rqstp,
-					 be32_to_cpu(ch->rs_handle),
-					 rs_offset + chunk_off,
-					 xdr_off,
-					 write_len,
-					 vec);
-			if (ret <= 0)
-				goto out_err;
-			chunk_off += ret;
-			xdr_off += ret;
-			xfer_len -= ret;
-			write_len -= ret;
-		}
+		remaining -= len;
+		page_off = 0;
 	}
-	/* Update the req with the number of chunks actually used */
-	svc_rdma_xdr_encode_reply_array(res_ary, chunk_no);
 
-	return rqstp->rq_res.len;
+	base = xdr->tail[0].iov_base;
+	len = xdr->tail[0].iov_len;
+tail:
+	if (len) {
+		ret = svc_rdma_dma_map_buf(rdma, ctxt, sge_no++, base, len);
+		if (ret < 0)
+			return ret;
+	}
 
-out_err:
-	pr_err("svcrdma: failed to send reply chunks, rc=%d\n", ret);
-	return -EIO;
+	return sge_no - 1;
 }
 
 /* The svc_rqst and all resources it owns are released as soon as
@@ -517,89 +414,62 @@ void svc_rdma_build_send_wr(struct svc_rdma_op_ctxt *ctxt, int num_sge)
 		send_wr->num_sge);
 }
 
-/* This function prepares the portion of the RPCRDMA message to be
- * sent in the RDMA_SEND. This function is called after data sent via
- * RDMA has already been transmitted. There are three cases:
- * - The RPCRDMA header, RPC header, and payload are all sent in a
- *   single RDMA_SEND. This is the "inline" case.
- * - The RPCRDMA header and some portion of the RPC header and data
- *   are sent via this RDMA_SEND and another portion of the data is
- *   sent via RDMA.
- * - The RPCRDMA header [NOMSG] is sent in this RDMA_SEND and the RPC
- *   header and data are all transmitted via RDMA.
- * In all three cases, this function prepares the RPCRDMA header in
- * sge[0], the 'type' parameter indicates the type to place in the
- * RPCRDMA header, and the 'byte_count' field indicates how much of
- * the XDR to include in this RDMA_SEND. NB: The offset of the payload
- * to send is zero in the XDR.
+/* Prepare the portion of the RPC Reply that will be transmitted
+ * via RDMA Send. The RPC-over-RDMA transport header is prepared
+ * in sge[0], and the RPC xdr_buf is prepared in following sges.
+ *
+ * Depending on whether a Write list or Reply chunk is present,
+ * the server may send all, a portion of, or none of the xdr_buf.
+ * In the latter case, only the transport header (sge[0]) is
+ * transmitted.
+ *
+ * RDMA Send is the last step of transmitting an RPC reply. Pages
+ * involved in the earlier RDMA Writes are here transferred out
+ * of the rqstp and into the ctxt's page array. These pages are
+ * DMA unmapped by each Write completion, but the subsequent Send
+ * completion finally releases these pages.
+ *
+ * Assumptions:
+ * - The Reply's transport header will never be larger than a page.
  */
-static int send_reply(struct svcxprt_rdma *rdma,
-		      struct svc_rqst *rqstp,
-		      struct page *page,
-		      struct rpcrdma_msg *rdma_resp,
-		      struct svc_rdma_req_map *vec,
-		      int byte_count,
-		      u32 inv_rkey)
+static int svc_rdma_send_reply_msg(struct svcxprt_rdma *rdma,
+				   __be32 *rdma_argp, __be32 *rdma_resp,
+				   struct svc_rqst *rqstp,
+				   __be32 *wr_lst, __be32 *rp_ch)
 {
 	struct svc_rdma_op_ctxt *ctxt;
 	struct ib_send_wr *send_wr;
-	u32 xdr_off;
-	int sge_no;
-	int sge_bytes;
-	int ret = -EIO;
+	int ret;
+
+	dprintk("svcrdma: sending %s reply: head=%zu, pagelen=%u, tail=%zu\n",
+		(rp_ch ? "RDMA_NOMSG" : "RDMA_MSG"),
+		rqstp->rq_res.head[0].iov_len,
+		rqstp->rq_res.page_len,
+		rqstp->rq_res.tail[0].iov_len);
 
-	/* Prepare the context */
 	ctxt = svc_rdma_get_context(rdma);
-	ctxt->direction = DMA_TO_DEVICE;
-	ctxt->pages[0] = page;
-	ctxt->count = 1;
 
-	/* Prepare the SGE for the RPCRDMA Header */
-	ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
-	ctxt->sge[0].length =
-	    svc_rdma_xdr_get_reply_hdr_len((__be32 *)rdma_resp);
-	ctxt->sge[0].addr =
-	    ib_dma_map_page(rdma->sc_cm_id->device, page, 0,
-			    ctxt->sge[0].length, DMA_TO_DEVICE);
-	if (ib_dma_mapping_error(rdma->sc_cm_id->device, ctxt->sge[0].addr))
+	ret = svc_rdma_map_reply_hdr(rdma, ctxt, rdma_resp,
+				     svc_rdma_xdr_get_reply_hdr_len(rdma_resp));
+	if (ret < 0)
 		goto err;
-	svc_rdma_count_mappings(rdma, ctxt);
-
-	ctxt->direction = DMA_TO_DEVICE;
 
-	/* Map the payload indicated by 'byte_count' */
-	xdr_off = 0;
-	for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) {
-		sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
-		byte_count -= sge_bytes;
-		ctxt->sge[sge_no].addr =
-			dma_map_xdr(rdma, &rqstp->rq_res, xdr_off,
-				    sge_bytes, DMA_TO_DEVICE);
-		xdr_off += sge_bytes;
-		if (ib_dma_mapping_error(rdma->sc_cm_id->device,
-					 ctxt->sge[sge_no].addr))
+	if (!rp_ch) {
+		ret = svc_rdma_map_reply_msg(rdma, ctxt,
+					     &rqstp->rq_res, wr_lst);
+		if (ret < 0)
 			goto err;
-		svc_rdma_count_mappings(rdma, ctxt);
-		ctxt->sge[sge_no].lkey = rdma->sc_pd->local_dma_lkey;
-		ctxt->sge[sge_no].length = sge_bytes;
-	}
-	if (byte_count != 0) {
-		pr_err("svcrdma: Could not map %d bytes\n", byte_count);
-		goto err;
 	}
 
 	svc_rdma_save_io_pages(rqstp, ctxt);
 
-	if (sge_no > rdma->sc_max_sge) {
-		pr_err("svcrdma: Too many sges (%d)\n", sge_no);
-		goto err;
-	}
-
-	svc_rdma_build_send_wr(ctxt, sge_no);
+	svc_rdma_build_send_wr(ctxt, 1 + ret);
 	send_wr = &ctxt->send_wr;
-	if (inv_rkey) {
-		send_wr->opcode = IB_WR_SEND_WITH_INV;
-		send_wr->ex.invalidate_rkey = inv_rkey;
+	if (rdma->sc_snd_w_inv) {
+		send_wr->ex.invalidate_rkey =
+			svc_rdma_get_inv_rkey(rdma_argp, wr_lst, rp_ch);
+		if (send_wr->ex.invalidate_rkey)
+			send_wr->opcode = IB_WR_SEND_WITH_INV;
 	}
 	ret = svc_rdma_send(rdma, send_wr);
 	if (ret)
@@ -607,7 +477,8 @@ static int send_reply(struct svcxprt_rdma *rdma,
 
 	return 0;
 
- err:
+err:
+	pr_err("svcrdma: failed to post Send WR (%d)\n", ret);
 	svc_rdma_unmap_dma(ctxt);
 	svc_rdma_put_context(ctxt, 1);
 	return ret;
@@ -617,39 +488,35 @@ void svc_rdma_prep_reply_hdr(struct svc_rqst *rqstp)
 {
 }
 
+/**
+ * svc_rdma_sendto - Transmit an RPC reply
+ * @rqstp: processed RPC request, reply XDR already in ::rq_res
+ *
+ * Any resources still associated with @rqstp are released upon return.
+ * If no reply message was possible, the connection is closed.
+ *
+ * Returns:
+ *	%0 if an RPC reply has been successfully posted,
+ *	%-ENOMEM if a resource shortage occurred (connection is lost),
+ *	%-ENOTCONN if posting failed (connection is lost).
+ */
 int svc_rdma_sendto(struct svc_rqst *rqstp)
 {
 	struct svc_xprt *xprt = rqstp->rq_xprt;
 	struct svcxprt_rdma *rdma =
 		container_of(xprt, struct svcxprt_rdma, sc_xprt);
-	struct rpcrdma_msg *rdma_argp;
-	struct rpcrdma_msg *rdma_resp;
-	struct rpcrdma_write_array *wr_ary, *rp_ary;
-	int ret;
-	int inline_bytes;
+	__be32 *p, *rdma_argp, *rdma_resp, *wr_lst, *rp_ch;
 	struct page *res_page;
-	struct svc_rdma_req_map *vec;
-	u32 inv_rkey;
-	__be32 *p;
-
-	dprintk("svcrdma: sending response for rqstp=%p\n", rqstp);
+	int ret;
 
-	/* Get the RDMA request header. The receive logic always
-	 * places this at the start of page 0.
+	/* Find the call's chunk lists to decide how to send the reply.
+	 * Receive places the Call's xprt header at the start of page 0.
 	 */
 	rdma_argp = page_address(rqstp->rq_pages[0]);
-	svc_rdma_get_write_arrays(rdma_argp, &wr_ary, &rp_ary);
+	svc_rdma_get_write_arrays(rdma_argp, &wr_lst, &rp_ch);
 
-	inv_rkey = 0;
-	if (rdma->sc_snd_w_inv)
-		inv_rkey = svc_rdma_get_inv_rkey(rdma_argp, wr_ary, rp_ary);
-
-	/* Build an req vec for the XDR */
-	vec = svc_rdma_get_req_map(rdma);
-	ret = svc_rdma_map_xdr(rdma, &rqstp->rq_res, vec, wr_ary != NULL);
-	if (ret)
-		goto err0;
-	inline_bytes = rqstp->rq_res.len;
+	dprintk("svcrdma: preparing response for XID 0x%08x\n",
+		be32_to_cpup(rdma_argp));
 
 	/* Create the RDMA response header. xprt->xpt_mutex,
 	 * acquired in svc_send(), serializes RPC replies. The
@@ -663,51 +530,42 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 		goto err0;
 	rdma_resp = page_address(res_page);
 
-	p = &rdma_resp->rm_xid;
-	*p++ = rdma_argp->rm_xid;
-	*p++ = rdma_argp->rm_vers;
+	p = rdma_resp;
+	*p++ = *rdma_argp;
+	*p++ = *(rdma_argp + 1);
 	*p++ = rdma->sc_fc_credits;
-	*p++ = rp_ary ? rdma_nomsg : rdma_msg;
+	*p++ = rp_ch ? rdma_nomsg : rdma_msg;
 
 	/* Start with empty chunks */
 	*p++ = xdr_zero;
 	*p++ = xdr_zero;
 	*p   = xdr_zero;
 
-	/* Send any write-chunk data and build resp write-list */
-	if (wr_ary) {
-		ret = send_write_chunks(rdma, wr_ary, rdma_resp, rqstp, vec);
+	if (wr_lst) {
+		ret = svc_rdma_send_write_list(rdma, wr_lst, rdma_resp,
+					       &rqstp->rq_res);
 		if (ret < 0)
 			goto err1;
-		inline_bytes -= ret + xdr_padsize(ret);
 	}
-
-	/* Send any reply-list data and update resp reply-list */
-	if (rp_ary) {
-		ret = send_reply_chunks(rdma, rp_ary, rdma_resp, rqstp, vec);
+	if (rp_ch) {
+		ret = svc_rdma_send_reply_chunk(rdma, wr_lst, rp_ch,
+						rdma_resp, &rqstp->rq_res);
 		if (ret < 0)
 			goto err1;
-		inline_bytes -= ret;
 	}
 
-	/* Post a fresh Receive buffer _before_ sending the reply */
 	ret = svc_rdma_post_recv(rdma, GFP_KERNEL);
 	if (ret)
 		goto err1;
-
-	ret = send_reply(rdma, rqstp, res_page, rdma_resp, vec,
-			 inline_bytes, inv_rkey);
+	ret = svc_rdma_send_reply_msg(rdma, rdma_argp, rdma_resp, rqstp,
+				      wr_lst, rp_ch);
 	if (ret < 0)
 		goto err0;
-
-	svc_rdma_put_req_map(rdma, vec);
-	dprintk("svcrdma: send_reply returns %d\n", ret);
-	return ret;
+	return 0;
 
  err1:
 	put_page(res_page);
  err0:
-	svc_rdma_put_req_map(rdma, vec);
 	pr_err("svcrdma: Could not send reply, err=%d. Closing transport.\n",
 	       ret);
 	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 90fabad..fecf220 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1052,6 +1052,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	memset(&qp_attr, 0, sizeof qp_attr);
 	qp_attr.event_handler = qp_event_handler;
 	qp_attr.qp_context = &newxprt->sc_xprt;
+	qp_attr.port_num = newxprt->sc_cm_id->port_num;
+	qp_attr.cap.max_rdma_ctxs = newxprt->sc_max_requests;
 	qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
 	qp_attr.cap.max_recv_wr = newxprt->sc_rq_depth;
 	qp_attr.cap.max_send_sge = newxprt->sc_max_sge;


* [PATCH v1 07/14] svcrdma: Clean up RDMA_ERROR path
From: Chuck Lever @ 2017-03-16 15:53 UTC
  To: linux-rdma, linux-nfs

Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can
be refactored to reduce code duplication and remove C structure-
based XDR encoding. It is also relocated to the source file that
contains its only caller.

This is a refactoring change only.
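
For readers following along outside the kernel tree, here is a minimal
user-space sketch of the encoding that svc_rdma_send_error now does
through a __be32 pointer. It is an illustration only, not the kernel
code: encode_rdma_error() and its parameters are hypothetical names,
uint32_t with htonl() stands in for __be32 with cpu_to_be32(), and the
numeric values (RDMA_ERROR = 4, ERR_VERS = 1, ERR_CHUNK = 2, version 1)
come from the RPC-over-RDMA Version One protocol definition rather than
from this patch.

```c
#include <arpa/inet.h>	/* htonl() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the RPC-over-RDMA Version One protocol definition */
enum { XP_RDMA_ERROR = 4, XP_ERR_VERS = 1, XP_ERR_CHUNK = 2, XP_VERSION = 1 };

/*
 * Hypothetical helper: build an RDMA_ERROR transport header in @out.
 * @call points at the first two words (XID, version) of the rejected
 * call, already in network byte order. @credits is in host byte order.
 * Returns the length of the encoded header in bytes.
 */
static size_t encode_rdma_error(uint32_t *out, const uint32_t *call,
				uint32_t credits, int bad_version)
{
	uint32_t *p = out;

	assert(out && call);

	*p++ = call[0];			/* XID, echoed unchanged */
	*p++ = call[1];			/* version, echoed unchanged */
	*p++ = htonl(credits);
	*p++ = htonl(XP_RDMA_ERROR);
	if (bad_version) {
		*p++ = htonl(XP_ERR_VERS);
		*p++ = htonl(XP_VERSION);	/* lowest supported version */
		*p++ = htonl(XP_VERSION);	/* highest supported version */
	} else {
		*p++ = htonl(XP_ERR_CHUNK);
	}
	return (size_t)(p - out) * sizeof(*out);
}
```

With this layout an ERR_VERS reply is seven XDR words (28 bytes) and an
ERR_CHUNK reply is five XDR words (20 bytes).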

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/rpc_rdma.h         |    3 ++
 include/linux/sunrpc/svc_rdma.h         |    5 ---
 net/sunrpc/xprtrdma/svc_rdma_marshal.c  |   19 -----------
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   53 ++++++++++++++++++++++++++++++-
 net/sunrpc/xprtrdma/svc_rdma_sendto.c   |   44 --------------------------
 5 files changed, 55 insertions(+), 69 deletions(-)

diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
index 245fc59..b7e85b3 100644
--- a/include/linux/sunrpc/rpc_rdma.h
+++ b/include/linux/sunrpc/rpc_rdma.h
@@ -143,6 +143,9 @@ enum rpcrdma_proc {
 #define rdma_done	cpu_to_be32(RDMA_DONE)
 #define rdma_error	cpu_to_be32(RDMA_ERROR)
 
+#define err_vers	cpu_to_be32(ERR_VERS)
+#define err_chunk	cpu_to_be32(ERR_CHUNK)
+
 /*
  * Private extension to RPC-over-RDMA Version One.
  * Message passed during RDMA-CM connection set-up.
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 5fc9f6e..498a086 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -209,9 +209,6 @@ extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
 
 /* svc_rdma_marshal.c */
 extern int svc_rdma_xdr_decode_req(struct xdr_buf *);
-extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
-				     struct rpcrdma_msg *,
-				     enum rpcrdma_errcode, __be32 *);
 extern void svc_rdma_old_encode_write_list(struct rpcrdma_msg *rmsgp,
 					   int chunks);
 extern void svc_rdma_xdr_encode_reply_array(struct rpcrdma_write_array *, int);
@@ -253,8 +250,6 @@ extern int svc_rdma_map_reply_hdr(struct svcxprt_rdma *rdma,
 extern void svc_rdma_build_send_wr(struct svc_rdma_op_ctxt *ctxt,
 				   int num_sge);
 extern int svc_rdma_sendto(struct svc_rqst *);
-extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
-				int);
 
 /* svc_rdma_transport.c */
 extern void svc_rdma_wc_send(struct ib_cq *, struct ib_wc *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
index bf3ca7e..24a8151 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
@@ -167,25 +167,6 @@ int svc_rdma_xdr_decode_req(struct xdr_buf *rq_arg)
 	return -EINVAL;
 }
 
-int svc_rdma_xdr_encode_error(struct svcxprt_rdma *xprt,
-			      struct rpcrdma_msg *rmsgp,
-			      enum rpcrdma_errcode err, __be32 *va)
-{
-	__be32 *startp = va;
-
-	*va++ = rmsgp->rm_xid;
-	*va++ = rmsgp->rm_vers;
-	*va++ = xprt->sc_fc_credits;
-	*va++ = rdma_error;
-	*va++ = cpu_to_be32(err);
-	if (err == ERR_VERS) {
-		*va++ = rpcrdma_version;
-		*va++ = rpcrdma_version;
-	}
-
-	return (int)((unsigned long)va - (unsigned long)startp);
-}
-
 /**
  * svc_rdma_xdr_get_reply_hdr_length - Get length of Reply transport header
  * @rdma_resp: buffer containing Reply transport header
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index f7b2daf..efa9f12 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -558,6 +558,57 @@ static void rdma_read_complete(struct svc_rqst *rqstp,
 	rqstp->rq_arg.buflen = head->arg.buflen;
 }
 
+static void svc_rdma_send_error(struct svcxprt_rdma *xprt,
+				__be32 *rdma_argp, int status)
+{
+	struct svc_rdma_op_ctxt *ctxt;
+	__be32 *p, *err_msgp;
+	unsigned int length;
+	struct page *page;
+	int ret;
+
+	ret = svc_rdma_repost_recv(xprt, GFP_KERNEL);
+	if (ret)
+		return;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return;
+	err_msgp = page_address(page);
+
+	p = err_msgp;
+	*p++ = *rdma_argp;
+	*p++ = *(rdma_argp + 1);
+	*p++ = xprt->sc_fc_credits;
+	*p++ = rdma_error;
+	if (status == -EPROTONOSUPPORT) {
+		*p++ = err_vers;
+		*p++ = rpcrdma_version;
+		*p++ = rpcrdma_version;
+	} else {
+		*p++ = err_chunk;
+	}
+	length = (unsigned long)p - (unsigned long)err_msgp;
+
+	/* Map transport header; no RPC message payload */
+	ctxt = svc_rdma_get_context(xprt);
+	ret = svc_rdma_map_reply_hdr(xprt, ctxt, err_msgp, length);
+	if (ret) {
+		dprintk("svcrdma: Error %d mapping send for protocol error\n",
+			ret);
+		return;
+	}
+
+	svc_rdma_build_send_wr(ctxt, 1);
+	ret = svc_rdma_send(xprt, &ctxt->send_wr);
+	if (ret) {
+		dprintk("svcrdma: Error %d posting send for protocol error\n",
+			ret);
+		svc_rdma_unmap_dma(ctxt);
+		svc_rdma_put_context(ctxt, 1);
+	}
+}
+
 /* By convention, backchannel calls arrive via rdma_msg type
  * messages, and never populate the chunk lists. This makes
  * the RPC/RDMA header small and fixed in size, so it is
@@ -686,7 +737,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 	return ret;
 
 out_err:
-	svc_rdma_send_error(rdma_xprt, rmsgp, ret);
+	svc_rdma_send_error(rdma_xprt, &rmsgp->rm_xid, ret);
 	svc_rdma_put_context(ctxt, 0);
 	return 0;
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index be6b11a..1b230dc 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -571,47 +571,3 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
 	return -ENOTCONN;
 }
-
-void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
-			 int status)
-{
-	struct page *p;
-	struct svc_rdma_op_ctxt *ctxt;
-	enum rpcrdma_errcode err;
-	__be32 *va;
-	int length;
-	int ret;
-
-	ret = svc_rdma_repost_recv(xprt, GFP_KERNEL);
-	if (ret)
-		return;
-
-	p = alloc_page(GFP_KERNEL);
-	if (!p)
-		return;
-	va = page_address(p);
-
-	/* XDR encode an error reply */
-	err = ERR_CHUNK;
-	if (status == -EPROTONOSUPPORT)
-		err = ERR_VERS;
-	length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
-
-	/* Map transport header; no RPC message payload */
-	ctxt = svc_rdma_get_context(xprt);
-	ret = svc_rdma_map_reply_hdr(xprt, ctxt, &rmsgp->rm_xid, length);
-	if (ret) {
-		dprintk("svcrdma: Error %d mapping send for protocol error\n",
-			ret);
-		return;
-	}
-
-	svc_rdma_build_send_wr(ctxt, 1);
-	ret = svc_rdma_send(xprt, &ctxt->send_wr);
-	if (ret) {
-		dprintk("svcrdma: Error %d posting send for protocol error\n",
-			ret);
-		svc_rdma_unmap_dma(ctxt);
-		svc_rdma_put_context(ctxt, 1);
-	}
-}


* [PATCH v1 08/14] svcrdma: Report Write/Reply chunk overruns
From: Chuck Lever @ 2017-03-16 15:53 UTC
  To: linux-rdma, linux-nfs

Observed at Connectathon 2017.

If a client has underestimated the size of a Write or Reply chunk,
the Linux server writes as much payload data as it can, then it
recognizes there was a problem and closes the connection without
sending the transport header.

This creates a couple of problems:

<> The client never receives an indication of the server-side failure,
   so it continues to retransmit the bad RPC. Forward progress on
   the transport is blocked.

<> The reply payload pages are not moved out of the svc_rqst, thus
   they can be released by the RPC server before the RDMA Writes
   have completed.

The new rdma_rw-based helpers return a distinct error code (-E2BIG)
when a Write/Reply chunk overrun occurs, so it's now easy for the
caller (svc_rdma_sendto) to recognize this case.

Instead of dropping the connection, post an RDMA_ERROR message. The
client now sees an RDMA_ERROR and can properly terminate the RPC
transaction.

As part of the new logic, set up the same delayed release for these
payload pages as would have occurred in the normal case.
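
The decision described above can be sketched in user-space C as
follows. This is a simplified illustration under stated assumptions,
not the patch itself: ov_classify() and ov_rewrite_as_err_chunk() are
hypothetical names, uint32_t with htonl() stands in for __be32, and
the values RDMA_ERROR = 4 and ERR_CHUNK = 2 come from the
RPC-over-RDMA Version One protocol definition.

```c
#include <arpa/inet.h>	/* htonl() */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum { OV_RDMA_ERROR = 4, OV_ERR_CHUNK = 2 };

enum ov_action { OV_CLOSE_TRANSPORT, OV_SEND_RDMA_ERROR };

/*
 * Only a chunk overrun (-E2BIG) is reported back to the client.
 * Any other failure still results in closing the connection.
 */
static enum ov_action ov_classify(int ret)
{
	return ret == -E2BIG ? OV_SEND_RDMA_ERROR : OV_CLOSE_TRANSPORT;
}

/*
 * Rewrite an already-built reply header in place as an RDMA_ERROR
 * message. Words 0-2 (XID, version, credits) are left untouched.
 * Returns the new header length in bytes.
 */
static size_t ov_rewrite_as_err_chunk(uint32_t *hdr)
{
	uint32_t *p = hdr + 3;

	assert(hdr);

	*p++ = htonl(OV_RDMA_ERROR);
	*p++ = htonl(OV_ERR_CHUNK);
	return (size_t)(p - hdr) * sizeof(*hdr);
}
```

The rewritten header is five XDR words long, which is where the
constant 20 passed to svc_rdma_map_reply_hdr() comes from.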

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |   59 ++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 1b230dc..489e602 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -484,6 +484,49 @@ static int svc_rdma_send_reply_msg(struct svcxprt_rdma *rdma,
 	return ret;
 }
 
+/* Given the client-provided Write and Reply chunks, the server was not
+ * able to form a complete reply. Return an RDMA_ERROR message so the
+ * client can retire this RPC transaction. As above, the Send completion
+ * routine releases payload pages that were part of a previous RDMA Write.
+ *
+ * Remote Invalidation is skipped for simplicity.
+ */
+static int svc_rdma_send_error_msg(struct svcxprt_rdma *rdma,
+				   __be32 *rdma_resp, struct svc_rqst *rqstp)
+{
+	struct svc_rdma_op_ctxt *ctxt;
+	__be32 *p;
+	int ret;
+
+	ctxt = svc_rdma_get_context(rdma);
+
+	/* Replace the original transport header with an
+	 * RDMA_ERROR response. XID etc are preserved.
+	 */
+	p = rdma_resp + 3;
+	*p++ = rdma_error;
+	*p   = err_chunk;
+
+	ret = svc_rdma_map_reply_hdr(rdma, ctxt, rdma_resp, 20);
+	if (ret < 0)
+		goto err;
+
+	svc_rdma_save_io_pages(rqstp, ctxt);
+
+	svc_rdma_build_send_wr(ctxt, 1 + ret);
+	ret = svc_rdma_send(rdma, &ctxt->send_wr);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	pr_err("svcrdma: failed to post Send WR (%d)\n", ret);
+	svc_rdma_unmap_dma(ctxt);
+	svc_rdma_put_context(ctxt, 1);
+	return ret;
+}
+
 void svc_rdma_prep_reply_hdr(struct svc_rqst *rqstp)
 {
 }
@@ -545,13 +588,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 		ret = svc_rdma_send_write_list(rdma, wr_lst, rdma_resp,
 					       &rqstp->rq_res);
 		if (ret < 0)
-			goto err1;
+			goto err2;
 	}
 	if (rp_ch) {
 		ret = svc_rdma_send_reply_chunk(rdma, wr_lst, rp_ch,
 						rdma_resp, &rqstp->rq_res);
 		if (ret < 0)
-			goto err1;
+			goto err2;
 	}
 
 	ret = svc_rdma_post_recv(rdma, GFP_KERNEL);
@@ -563,6 +606,18 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 		goto err0;
 	return 0;
 
+ err2:
+	if (ret != -E2BIG)
+		goto err1;
+
+	ret = svc_rdma_post_recv(rdma, GFP_KERNEL);
+	if (ret)
+		goto err1;
+	ret = svc_rdma_send_error_msg(rdma, rdma_resp, rqstp);
+	if (ret < 0)
+		goto err0;
+	return 0;
+
  err1:
 	put_page(res_page);
  err0:


* [PATCH v1 09/14] svcrdma: Clean up RPC-over-RDMA backchannel reply processing
From: Chuck Lever @ 2017-03-16 15:53 UTC
  To: linux-rdma, linux-nfs

Replace C structure-based XDR decoding with pointer arithmetic.
Walking the header as an array of __be32 words does not depend on
how the compiler lays out a C structure, so it is considered more
portable.
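
As an illustration of the word-by-word style, here is a user-space
sketch of the backchannel reply check. It is not the kernel function:
bc_is_reply() is a hypothetical name, uint32_t with htonl() stands in
for __be32, the caller is assumed to supply at least nine words, and
the values RDMA_MSG = 0 and RPC_CALL = 0 come from the RPC-over-RDMA
and ONC RPC protocol definitions.

```c
#include <arpa/inet.h>	/* htonl() */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { BC_RDMA_MSG = 0, BC_RPC_CALL = 0 };

/*
 * Hypothetical helper: @hdr points at a transport header followed
 * directly by the RPC header, at least nine 32-bit words in total.
 */
static bool bc_is_reply(const uint32_t *hdr)
{
	const uint32_t *p = hdr + 3;	/* skip XID, version, credits */

	assert(hdr);

	if (*p++ != htonl(BC_RDMA_MSG))
		return false;

	/* Read list, Write list, and Reply chunk must all be empty */
	if (*p++ != 0)
		return false;
	if (*p++ != 0)
		return false;
	if (*p++ != 0)
		return false;

	/* The RPC header's XID must match the transport header's XID */
	if (*p++ != hdr[0])
		return false;

	/* A call is forward-channel traffic, not a backchannel reply */
	return *p != htonl(BC_RPC_CALL);
}
```

Each field is reached by advancing p one word at a time, so the
offsets follow the XDR wire layout and not any structure definition.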

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h            |    2 +-
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |   18 ++++++++++++++----
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c    |   27 +++++++++++++++------------
 3 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 498a086..6181c24 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -204,7 +204,7 @@ static inline void svc_rdma_count_mappings(struct svcxprt_rdma *rdma,
 
 /* svc_rdma_backchannel.c */
 extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
-				    struct rpcrdma_msg *rmsgp,
+				    __be32 *rdma_resp,
 				    struct xdr_buf *rcvbuf);
 
 /* svc_rdma_marshal.c */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 24c26f4..3484f40a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -12,7 +12,17 @@
 
 #undef SVCRDMA_BACKCHANNEL_DEBUG
 
-int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+/**
+ * svc_rdma_handle_bc_reply - Process incoming backchannel reply
+ * @xprt: controlling backchannel transport
+ * @rdma_resp: pointer to incoming transport header
+ * @rcvbuf: XDR buffer into which to decode the reply
+ *
+ * Returns:
+ *	%0 if @rcvbuf is filled in, xprt_complete_rqst called,
+ *	%-EAGAIN if server should call ->recvfrom again.
+ */
+int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, __be32 *rdma_resp,
 			     struct xdr_buf *rcvbuf)
 {
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
@@ -27,13 +37,13 @@ int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
 
 	p = (__be32 *)src->iov_base;
 	len = src->iov_len;
-	xid = rmsgp->rm_xid;
+	xid = *rdma_resp;
 
 #ifdef SVCRDMA_BACKCHANNEL_DEBUG
 	pr_info("%s: xid=%08x, length=%zu\n",
 		__func__, be32_to_cpu(xid), len);
 	pr_info("%s: RPC/RDMA: %*ph\n",
-		__func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+		__func__, (int)RPCRDMA_HDRLEN_MIN, rdma_resp);
 	pr_info("%s:      RPC: %*ph\n",
 		__func__, (int)len, p);
 #endif
@@ -53,7 +63,7 @@ int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
 		goto out_unlock;
 	memcpy(dst->iov_base, p, len);
 
-	credits = be32_to_cpu(rmsgp->rm_credit);
+	credits = be32_to_cpup(rdma_resp + 2);
 	if (credits == 0)
 		credits = 1;	/* don't deadlock */
 	else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index efa9f12..9315ee6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -614,28 +614,30 @@ static void svc_rdma_send_error(struct svcxprt_rdma *xprt,
  * the RPC/RDMA header small and fixed in size, so it is
  * straightforward to check the RPC header's direction field.
  */
-static bool
-svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+static bool svc_rdma_is_backchannel_reply(struct svc_xprt *xprt,
+					  __be32 *rdma_resp)
 {
-	__be32 *p = (__be32 *)rmsgp;
+	__be32 *p;
 
 	if (!xprt->xpt_bc_xprt)
 		return false;
 
-	if (rmsgp->rm_type != rdma_msg)
+	p = rdma_resp + 3;
+	if (*p++ != rdma_msg)
 		return false;
-	if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+
+	if (*p++ != xdr_zero)
 		return false;
-	if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+	if (*p++ != xdr_zero)
 		return false;
-	if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+	if (*p++ != xdr_zero)
 		return false;
 
-	/* sanity */
-	if (p[7] != rmsgp->rm_xid)
+	/* XID sanity */
+	if (*p++ != *rdma_resp)
 		return false;
 	/* call direction */
-	if (p[8] == cpu_to_be32(RPC_CALL))
+	if (*p == cpu_to_be32(RPC_CALL))
 		return false;
 
 	return true;
@@ -701,8 +703,9 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 		goto out_drop;
 	rqstp->rq_xprt_hlen = ret;
 
-	if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
-		ret = svc_rdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
+	if (svc_rdma_is_backchannel_reply(xprt, &rmsgp->rm_xid)) {
+		ret = svc_rdma_handle_bc_reply(xprt->xpt_bc_xprt,
+					       &rmsgp->rm_xid,
 					       &rqstp->rq_arg);
 		svc_rdma_put_context(ctxt, 0);
 		if (ret)


* [PATCH v1 09/14] svcrdma: Clean up RPC-over-RDMA backchannel reply processing
@ 2017-03-16 15:53     ` Chuck Lever
  0 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:53 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h            |    2 +-
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |   18 ++++++++++++++----
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c    |   27 +++++++++++++++------------
 3 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 498a086..6181c24 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -204,7 +204,7 @@ static inline void svc_rdma_count_mappings(struct svcxprt_rdma *rdma,
 
 /* svc_rdma_backchannel.c */
 extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
-				    struct rpcrdma_msg *rmsgp,
+				    __be32 *rdma_resp,
 				    struct xdr_buf *rcvbuf);
 
 /* svc_rdma_marshal.c */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 24c26f4..3484f40a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -12,7 +12,17 @@
 
 #undef SVCRDMA_BACKCHANNEL_DEBUG
 
-int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+/**
+ * svc_rdma_handle_bc_reply - Process incoming backchannel reply
+ * @xprt: controlling backchannel transport
+ * @rdma_resp: pointer to incoming transport header
+ * @rcvbuf: XDR buffer into which to decode the reply
+ *
+ * Returns:
+ *	%0 if @rcvbuf is filled in, xprt_complete_rqst called,
+ *	%-EAGAIN if server should call ->recvfrom again.
+ */
+int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, __be32 *rdma_resp,
 			     struct xdr_buf *rcvbuf)
 {
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
@@ -27,13 +37,13 @@ int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
 
 	p = (__be32 *)src->iov_base;
 	len = src->iov_len;
-	xid = rmsgp->rm_xid;
+	xid = *rdma_resp;
 
 #ifdef SVCRDMA_BACKCHANNEL_DEBUG
 	pr_info("%s: xid=%08x, length=%zu\n",
 		__func__, be32_to_cpu(xid), len);
 	pr_info("%s: RPC/RDMA: %*ph\n",
-		__func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+		__func__, (int)RPCRDMA_HDRLEN_MIN, rdma_resp);
 	pr_info("%s:      RPC: %*ph\n",
 		__func__, (int)len, p);
 #endif
@@ -53,7 +63,7 @@ int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
 		goto out_unlock;
 	memcpy(dst->iov_base, p, len);
 
-	credits = be32_to_cpu(rmsgp->rm_credit);
+	credits = be32_to_cpup(rdma_resp + 2);
 	if (credits == 0)
 		credits = 1;	/* don't deadlock */
 	else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index efa9f12..9315ee6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -614,28 +614,30 @@ static void svc_rdma_send_error(struct svcxprt_rdma *xprt,
  * the RPC/RDMA header small and fixed in size, so it is
  * straightforward to check the RPC header's direction field.
  */
-static bool
-svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+static bool svc_rdma_is_backchannel_reply(struct svc_xprt *xprt,
+					  __be32 *rdma_resp)
 {
-	__be32 *p = (__be32 *)rmsgp;
+	__be32 *p;
 
 	if (!xprt->xpt_bc_xprt)
 		return false;
 
-	if (rmsgp->rm_type != rdma_msg)
+	p = rdma_resp + 3;
+	if (*p++ != rdma_msg)
 		return false;
-	if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+
+	if (*p++ != xdr_zero)
 		return false;
-	if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+	if (*p++ != xdr_zero)
 		return false;
-	if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+	if (*p++ != xdr_zero)
 		return false;
 
-	/* sanity */
-	if (p[7] != rmsgp->rm_xid)
+	/* XID sanity */
+	if (*p++ != *rdma_resp)
 		return false;
 	/* call direction */
-	if (p[8] == cpu_to_be32(RPC_CALL))
+	if (*p == cpu_to_be32(RPC_CALL))
 		return false;
 
 	return true;
@@ -701,8 +703,9 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 		goto out_drop;
 	rqstp->rq_xprt_hlen = ret;
 
-	if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
-		ret = svc_rdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
+	if (svc_rdma_is_backchannel_reply(xprt, &rmsgp->rm_xid)) {
+		ret = svc_rdma_handle_bc_reply(xprt->xpt_bc_xprt,
+					       &rmsgp->rm_xid,
 					       &rqstp->rq_arg);
 		svc_rdma_put_context(ctxt, 0);
 		if (ret)
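
The pointer arithmetic in svc_rdma_is_backchannel_reply() and
svc_rdma_handle_bc_reply() relies on a fixed layout of 32-bit
big-endian words at the start of the receive buffer. The following is
a userspace sketch of that layout, not kernel code. The enum names and
the sketch_ prefixes are hypothetical; the word offsets follow the
diff (credits at word 2, message type at word 3, three chunk list
discriminators, then the RPC XID and direction at words 7 and 8). The
check of xprt->xpt_bc_xprt is omitted here.

```c
#include <arpa/inet.h>	/* htonl, ntohl */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Word offsets into the receive buffer (hypothetical names) */
enum {
	SKETCH_HDR_XID		= 0,
	SKETCH_HDR_VERS		= 1,
	SKETCH_HDR_CREDITS	= 2,
	SKETCH_HDR_PROC		= 3,
	SKETCH_HDR_READ_LIST	= 4,
	SKETCH_HDR_WRITE_LIST	= 5,
	SKETCH_HDR_REPLY_CHUNK	= 6,
	SKETCH_RPC_XID		= 7,
	SKETCH_RPC_DIRECTION	= 8,
};

#define SKETCH_RDMA_MSG	0u	/* RPC-over-RDMA procedure RDMA_MSG */
#define SKETCH_RPC_CALL	0u	/* RPC message type CALL */

/* True if the buffer looks like a reply to a backchannel call */
static bool sketch_is_bc_reply(const uint32_t *hdr)
{
	assert(hdr != NULL);

	if (ntohl(hdr[SKETCH_HDR_PROC]) != SKETCH_RDMA_MSG)
		return false;

	/* No Read list, no Write list, no Reply chunk */
	if (hdr[SKETCH_HDR_READ_LIST] != 0)
		return false;
	if (hdr[SKETCH_HDR_WRITE_LIST] != 0)
		return false;
	if (hdr[SKETCH_HDR_REPLY_CHUNK] != 0)
		return false;

	/* XID sanity: transport header and RPC header must agree */
	if (hdr[SKETCH_RPC_XID] != hdr[SKETCH_HDR_XID])
		return false;

	/* A call is forward-channel traffic, not a backchannel reply */
	if (ntohl(hdr[SKETCH_RPC_DIRECTION]) == SKETCH_RPC_CALL)
		return false;

	return true;
}

/* Clamp the advertised credits the way the bc reply handler does */
static uint32_t sketch_bc_credits(const uint32_t *hdr, uint32_t max_requests)
{
	uint32_t credits = ntohl(hdr[SKETCH_HDR_CREDITS]);

	if (credits == 0)
		credits = 1;	/* don't deadlock */
	else if (credits > max_requests)
		credits = max_requests;
	return credits;
}
```

Passing a bare __be32 pointer instead of a struct rpcrdma_msg pointer
is what makes this kind of offset-based access the natural form.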



* [PATCH v1 10/14] svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:53     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:53 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The sge array in struct svc_rdma_op_ctxt is no longer used for
sending RDMA Write WRs. It need only accommodate the construction of
Send and Receive WRs. The maximum inline size is the largest payload
it needs to handle now.
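
As a rough illustration of the new array bound, the sketch below
evaluates 1 + RPCRDMA_MAX_INLINE_THRESH / PAGE_SIZE for a given page
size. It is a userspace approximation: sketch_sge_count() is a
hypothetical helper, and the page size is passed as a parameter
because PAGE_SIZE is a kernel build-time constant.

```c
#include <assert.h>
#include <stddef.h>

/* Values introduced by this patch */
enum {
	RPCRDMA_DEF_INLINE_THRESH = 4096,
	RPCRDMA_MAX_INLINE_THRESH = 65536
};

/* Number of sge entries the patch reserves for a given page size:
 * one per page of the largest inline payload, plus one more.
 */
static size_t sketch_sge_count(size_t page_size)
{
	assert(page_size != 0);
	return 1 + RPCRDMA_MAX_INLINE_THRESH / page_size;
}
```

With 4KB pages this gives 17 entries, and with 64KB pages it gives 2.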

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h |    9 +++++++--
 net/sunrpc/xprtrdma/svc_rdma.c  |    6 +++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 6181c24..6b60f2e 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -48,6 +48,12 @@
 #include <rdma/rdma_cm.h>
 #define SVCRDMA_DEBUG
 
+/* Default and maximum inline threshold sizes */
+enum {
+	RPCRDMA_DEF_INLINE_THRESH = 4096,
+	RPCRDMA_MAX_INLINE_THRESH = 65536
+};
+
 /* RPC/RDMA parameters and stats */
 extern unsigned int svcrdma_ord;
 extern unsigned int svcrdma_max_requests;
@@ -86,7 +92,7 @@ struct svc_rdma_op_ctxt {
 	int count;
 	unsigned int mapped_sges;
 	struct ib_send_wr send_wr;
-	struct ib_sge sge[RPCSVC_MAXPAGES];
+	struct ib_sge sge[1 + RPCRDMA_MAX_INLINE_THRESH / PAGE_SIZE];
 	struct page *pages[RPCSVC_MAXPAGES];
 };
 
@@ -186,7 +192,6 @@ struct svcxprt_rdma {
  * page size of 4k, or 32k * 2 ops / 4k = 16 outstanding RDMA_READ.  */
 #define RPCRDMA_ORD             (64/4)
 #define RPCRDMA_MAX_REQUESTS    32
-#define RPCRDMA_MAX_REQ_SIZE    4096
 
 /* Typical ULP usage of BC requests is NFSv4.1 backchannel. Our
  * current NFSv4.1 implementation supports one backchannel slot.
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 9124441..a4a8f69 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -58,9 +58,9 @@
 unsigned int svcrdma_max_bc_requests = RPCRDMA_MAX_BC_REQUESTS;
 static unsigned int min_max_requests = 4;
 static unsigned int max_max_requests = 16384;
-unsigned int svcrdma_max_req_size = RPCRDMA_MAX_REQ_SIZE;
-static unsigned int min_max_inline = 4096;
-static unsigned int max_max_inline = 65536;
+unsigned int svcrdma_max_req_size = RPCRDMA_DEF_INLINE_THRESH;
+static unsigned int min_max_inline = RPCRDMA_DEF_INLINE_THRESH;
+static unsigned int max_max_inline = RPCRDMA_MAX_INLINE_THRESH;
 
 atomic_t rdma_stat_recv;
 atomic_t rdma_stat_read;

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH v1 11/14] svcrdma: Remove old RDMA Write completion handlers
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:53     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:53 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up. RDMA Write completions are now handled by the rdma_rw API.
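
For context, the handler deleted by this patch follows the usual
embedded-cqe pattern: the completion carries a pointer to a cqe that
lives inside the per-operation context, and the handler recovers the
context with container_of(). The sketch below shows that pattern in
userspace. All sketch_ names are hypothetical stand-ins, and the
"released" flag merely marks where the real handler unmaps DMA and
puts the context.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature of struct ib_cqe */
struct sketch_cqe {
	void (*done)(struct sketch_cqe *cqe);
};

/* Hypothetical miniature of struct svc_rdma_op_ctxt */
struct sketch_op_ctxt {
	int count;
	int released;
	struct sketch_cqe cqe;	/* embedded in the context */
};

/* Completion handler: map the cqe back to its owning context */
static void sketch_wc_write(struct sketch_cqe *cqe)
{
	struct sketch_op_ctxt *ctxt;

	assert(cqe != NULL);
	ctxt = sketch_container_of(cqe, struct sketch_op_ctxt, cqe);

	/* Stands in for svc_rdma_unmap_dma() and svc_rdma_put_context() */
	ctxt->released = 1;
}
```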

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h          |    1 -
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   18 ------------------
 2 files changed, 19 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 6b60f2e..e526c55 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -258,7 +258,6 @@ extern void svc_rdma_build_send_wr(struct svc_rdma_op_ctxt *ctxt,
 
 /* svc_rdma_transport.c */
 extern void svc_rdma_wc_send(struct ib_cq *, struct ib_wc *);
-extern void svc_rdma_wc_write(struct ib_cq *, struct ib_wc *);
 extern void svc_rdma_wc_reg(struct ib_cq *, struct ib_wc *);
 extern void svc_rdma_wc_read(struct ib_cq *, struct ib_wc *);
 extern void svc_rdma_wc_inv(struct ib_cq *, struct ib_wc *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index fecf220..8ce68bf 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -473,24 +473,6 @@ void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
 }
 
 /**
- * svc_rdma_wc_write - Invoked by RDMA provider for each polled Write WC
- * @cq:        completion queue
- * @wc:        completed WR
- *
- */
-void svc_rdma_wc_write(struct ib_cq *cq, struct ib_wc *wc)
-{
-	struct ib_cqe *cqe = wc->wr_cqe;
-	struct svc_rdma_op_ctxt *ctxt;
-
-	svc_rdma_send_wc_common_put(cq, wc, "write");
-
-	ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
-	svc_rdma_unmap_dma(ctxt);
-	svc_rdma_put_context(ctxt, 0);
-}
-
-/**
  * svc_rdma_wc_reg - Invoked by RDMA provider for each polled FASTREG WC
  * @cq:        completion queue
  * @wc:        completed WR


* [PATCH v1 12/14] svcrdma: Remove the req_map cache
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:54     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:54 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

req_maps are no longer used by the send path and can thus be removed.
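
When comparing the old and new send paths, one detail of the removed
svc_rdma_map_xdr() is its treatment of XDR padding: if a Write chunk
is present and the page data is not a multiple of four bytes, the pad
bytes at the front of the tail are skipped. The sketch below restates
only that arithmetic in userspace. sketch_tail_len() is a hypothetical
helper, and the assertion that the tail is at least as long as the pad
is an assumption the removed code does not check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as xdr_padsize() in svc_rdma_sendto.c */
static uint32_t sketch_xdr_padsize(uint32_t len)
{
	return (len & 3) ? (4 - (len & 3)) : 0;
}

/* Number of tail bytes the removed mapping code would have mapped */
static size_t sketch_tail_len(uint32_t page_len, size_t tail_len,
			      int write_chunk_present)
{
	uint32_t pad = sketch_xdr_padsize(page_len);

	if (tail_len && write_chunk_present && pad) {
		assert(tail_len >= pad);	/* assumption, see above */
		tail_len -= pad;
	}
	return tail_len;
}
```

For example, five bytes of page data need three pad bytes, so a
seven-byte tail contributes four bytes when a Write chunk is present
and all seven bytes otherwise.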

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h          |   34 ------------
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    |   68 ------------------------
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   84 ------------------------------
 3 files changed, 1 insertion(+), 185 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index e526c55..7b1f886 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -96,23 +96,6 @@ struct svc_rdma_op_ctxt {
 	struct page *pages[RPCSVC_MAXPAGES];
 };
 
-/*
- * NFS_ requests are mapped on the client side by the chunk lists in
- * the RPCRDMA header. During the fetching of the RPC from the client
- * and the writing of the reply to the client, the memory in the
- * client and the memory in the server must be mapped as contiguous
- * vaddr/len for access by the hardware. These data strucures keep
- * these mappings.
- *
- * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the
- * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the
- * 'ch' field maps the read-list of the RPCRDMA header to the 'sge'
- * mapping of the reply.
- */
-struct svc_rdma_chunk_sge {
-	int start;		/* sge no for this chunk */
-	int count;		/* sge count for this chunk */
-};
 struct svc_rdma_fastreg_mr {
 	struct ib_mr *mr;
 	struct scatterlist *sg;
@@ -121,15 +104,7 @@ struct svc_rdma_fastreg_mr {
 	enum dma_data_direction direction;
 	struct list_head frmr_list;
 };
-struct svc_rdma_req_map {
-	struct list_head free;
-	unsigned long count;
-	union {
-		struct kvec sge[RPCSVC_MAXPAGES];
-		struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
-		unsigned long lkey[RPCSVC_MAXPAGES];
-	};
-};
+
 #define RDMACTXT_F_LAST_CTXT	2
 
 #define	SVCRDMA_DEVCAP_FAST_REG		1	/* fast mr registration */
@@ -160,8 +135,6 @@ struct svcxprt_rdma {
 	int		     sc_ctxt_used;
 	spinlock_t	     sc_rw_ctxt_lock;
 	struct list_head     sc_rw_ctxts;
-	spinlock_t	     sc_map_lock;
-	struct list_head     sc_maps;
 
 	struct list_head     sc_rq_dto_q;
 	spinlock_t	     sc_rq_dto_lock;
@@ -247,8 +220,6 @@ extern int svc_rdma_send_reply_chunk(struct svcxprt_rdma *rdma,
 				     __be32 *rdma_resp, struct xdr_buf *xdr);
 
 /* svc_rdma_sendto.c */
-extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
-			    struct svc_rdma_req_map *, bool);
 extern int svc_rdma_map_reply_hdr(struct svcxprt_rdma *rdma,
 				  struct svc_rdma_op_ctxt *ctxt,
 				  __be32 *rdma_resp, unsigned int len);
@@ -268,9 +239,6 @@ extern void svc_rdma_build_send_wr(struct svc_rdma_op_ctxt *ctxt,
 extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
 extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
 extern void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt);
-extern struct svc_rdma_req_map *svc_rdma_get_req_map(struct svcxprt_rdma *);
-extern void svc_rdma_put_req_map(struct svcxprt_rdma *,
-				 struct svc_rdma_req_map *);
 extern struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *);
 extern void svc_rdma_put_frmr(struct svcxprt_rdma *,
 			      struct svc_rdma_fastreg_mr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 489e602..022569c 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -113,74 +113,6 @@ static u32 xdr_padsize(u32 len)
 	return (len & 3) ? (4 - (len & 3)) : 0;
 }
 
-int svc_rdma_map_xdr(struct svcxprt_rdma *xprt,
-		     struct xdr_buf *xdr,
-		     struct svc_rdma_req_map *vec,
-		     bool write_chunk_present)
-{
-	int sge_no;
-	u32 sge_bytes;
-	u32 page_bytes;
-	u32 page_off;
-	int page_no;
-
-	if (xdr->len !=
-	    (xdr->head[0].iov_len + xdr->page_len + xdr->tail[0].iov_len)) {
-		pr_err("svcrdma: %s: XDR buffer length error\n", __func__);
-		return -EIO;
-	}
-
-	/* Skip the first sge, this is for the RPCRDMA header */
-	sge_no = 1;
-
-	/* Head SGE */
-	vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
-	vec->sge[sge_no].iov_len = xdr->head[0].iov_len;
-	sge_no++;
-
-	/* pages SGE */
-	page_no = 0;
-	page_bytes = xdr->page_len;
-	page_off = xdr->page_base;
-	while (page_bytes) {
-		vec->sge[sge_no].iov_base =
-			page_address(xdr->pages[page_no]) + page_off;
-		sge_bytes = min_t(u32, page_bytes, (PAGE_SIZE - page_off));
-		page_bytes -= sge_bytes;
-		vec->sge[sge_no].iov_len = sge_bytes;
-
-		sge_no++;
-		page_no++;
-		page_off = 0; /* reset for next time through loop */
-	}
-
-	/* Tail SGE */
-	if (xdr->tail[0].iov_len) {
-		unsigned char *base = xdr->tail[0].iov_base;
-		size_t len = xdr->tail[0].iov_len;
-		u32 xdr_pad = xdr_padsize(xdr->page_len);
-
-		if (write_chunk_present && xdr_pad) {
-			base += xdr_pad;
-			len -= xdr_pad;
-		}
-
-		if (len) {
-			vec->sge[sge_no].iov_base = base;
-			vec->sge[sge_no].iov_len = len;
-			sge_no++;
-		}
-	}
-
-	dprintk("svcrdma: %s: sge_no %d page_no %d "
-		"page_base %u page_len %u head_len %zu tail_len %zu\n",
-		__func__, sge_no, page_no, xdr->page_base, xdr->page_len,
-		xdr->head[0].iov_len, xdr->tail[0].iov_len);
-
-	vec->count = sge_no;
-	return 0;
-}
-
 /* Parse the RPC Call's transport header.
  */
 static void svc_rdma_get_write_arrays(__be32 *rdma_argp,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 8ce68bf..9176a35 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -271,85 +271,6 @@ static void svc_rdma_destroy_ctxts(struct svcxprt_rdma *xprt)
 	}
 }
 
-static struct svc_rdma_req_map *alloc_req_map(gfp_t flags)
-{
-	struct svc_rdma_req_map *map;
-
-	map = kmalloc(sizeof(*map), flags);
-	if (map)
-		INIT_LIST_HEAD(&map->free);
-	return map;
-}
-
-static bool svc_rdma_prealloc_maps(struct svcxprt_rdma *xprt)
-{
-	unsigned int i;
-
-	/* One for each receive buffer on this connection. */
-	i = xprt->sc_max_requests;
-
-	while (i--) {
-		struct svc_rdma_req_map *map;
-
-		map = alloc_req_map(GFP_KERNEL);
-		if (!map) {
-			dprintk("svcrdma: No memory for request map\n");
-			return false;
-		}
-		list_add(&map->free, &xprt->sc_maps);
-	}
-	return true;
-}
-
-struct svc_rdma_req_map *svc_rdma_get_req_map(struct svcxprt_rdma *xprt)
-{
-	struct svc_rdma_req_map *map = NULL;
-
-	spin_lock(&xprt->sc_map_lock);
-	if (list_empty(&xprt->sc_maps))
-		goto out_empty;
-
-	map = list_first_entry(&xprt->sc_maps,
-			       struct svc_rdma_req_map, free);
-	list_del_init(&map->free);
-	spin_unlock(&xprt->sc_map_lock);
-
-out:
-	map->count = 0;
-	return map;
-
-out_empty:
-	spin_unlock(&xprt->sc_map_lock);
-
-	/* Pre-allocation amount was incorrect */
-	map = alloc_req_map(GFP_NOIO);
-	if (map)
-		goto out;
-
-	WARN_ONCE(1, "svcrdma: empty request map list?\n");
-	return NULL;
-}
-
-void svc_rdma_put_req_map(struct svcxprt_rdma *xprt,
-			  struct svc_rdma_req_map *map)
-{
-	spin_lock(&xprt->sc_map_lock);
-	list_add(&map->free, &xprt->sc_maps);
-	spin_unlock(&xprt->sc_map_lock);
-}
-
-static void svc_rdma_destroy_maps(struct svcxprt_rdma *xprt)
-{
-	while (!list_empty(&xprt->sc_maps)) {
-		struct svc_rdma_req_map *map;
-
-		map = list_first_entry(&xprt->sc_maps,
-				       struct svc_rdma_req_map, free);
-		list_del(&map->free);
-		kfree(map);
-	}
-}
-
 /* QP event handler */
 static void qp_event_handler(struct ib_event *event, void *context)
 {
@@ -543,7 +464,6 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 	INIT_LIST_HEAD(&cma_xprt->sc_frmr_q);
 	INIT_LIST_HEAD(&cma_xprt->sc_ctxts);
 	INIT_LIST_HEAD(&cma_xprt->sc_rw_ctxts);
-	INIT_LIST_HEAD(&cma_xprt->sc_maps);
 	init_waitqueue_head(&cma_xprt->sc_send_wait);
 
 	spin_lock_init(&cma_xprt->sc_lock);
@@ -551,7 +471,6 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 	spin_lock_init(&cma_xprt->sc_frmr_q_lock);
 	spin_lock_init(&cma_xprt->sc_ctxt_lock);
 	spin_lock_init(&cma_xprt->sc_rw_ctxt_lock);
-	spin_lock_init(&cma_xprt->sc_map_lock);
 
 	/*
 	 * Note that this implies that the underlying transport support
@@ -1003,8 +922,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 
 	if (!svc_rdma_prealloc_ctxts(newxprt))
 		goto errout;
-	if (!svc_rdma_prealloc_maps(newxprt))
-		goto errout;
 
 	/*
 	 * Limit ORD based on client limit, local device limit, and
@@ -1236,7 +1153,6 @@ static void __svc_rdma_free(struct work_struct *work)
 	rdma_dealloc_frmr_q(rdma);
 	svc_rdma_destroy_rw_ctxts(rdma);
 	svc_rdma_destroy_ctxts(rdma);
-	svc_rdma_destroy_maps(rdma);
 
 	/* Destroy the QP if present (not a listener) */
 	if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))


* [PATCH v1 13/14] svcrdma: Clean out old XDR encoders
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:54     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:54 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up: The old XDR encoders have been replaced and are no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h        |    5 ----
 net/sunrpc/xprtrdma/svc_rdma_marshal.c |   39 --------------------------------
 2 files changed, 44 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 7b1f886..f88f6b0 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -187,11 +187,6 @@ extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
 
 /* svc_rdma_marshal.c */
 extern int svc_rdma_xdr_decode_req(struct xdr_buf *);
-extern void svc_rdma_old_encode_write_list(struct rpcrdma_msg *rmsgp,
-					   int chunks);
-extern void svc_rdma_xdr_encode_reply_array(struct rpcrdma_write_array *, int);
-extern void svc_rdma_xdr_encode_array_chunk(struct rpcrdma_write_array *, int,
-					    __be32, __be64, u32);
 extern void svc_rdma_xdr_encode_write_list(__be32 *rdma_resp, __be32 *wr_ch,
 					   unsigned int consumed);
 extern void svc_rdma_xdr_encode_reply_chunk(__be32 *rdma_resp, __be32 *rp_ch,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
index 24a8151..658337e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
@@ -198,45 +198,6 @@ unsigned int svc_rdma_xdr_get_reply_hdr_len(__be32 *rdma_resp)
 	return (unsigned long)p - (unsigned long)rdma_resp;
 }
 
-void svc_rdma_old_encode_write_list(struct rpcrdma_msg *rmsgp, int chunks)
-{
-	struct rpcrdma_write_array *ary;
-
-	/* no read-list */
-	rmsgp->rm_body.rm_chunks[0] = xdr_zero;
-
-	/* write-array discrim */
-	ary = (struct rpcrdma_write_array *)
-		&rmsgp->rm_body.rm_chunks[1];
-	ary->wc_discrim = xdr_one;
-	ary->wc_nchunks = cpu_to_be32(chunks);
-
-	/* write-list terminator */
-	ary->wc_array[chunks].wc_target.rs_handle = xdr_zero;
-
-	/* reply-array discriminator */
-	ary->wc_array[chunks].wc_target.rs_length = xdr_zero;
-}
-
-void svc_rdma_xdr_encode_reply_array(struct rpcrdma_write_array *ary,
-				 int chunks)
-{
-	ary->wc_discrim = xdr_one;
-	ary->wc_nchunks = cpu_to_be32(chunks);
-}
-
-void svc_rdma_xdr_encode_array_chunk(struct rpcrdma_write_array *ary,
-				     int chunk_no,
-				     __be32 rs_handle,
-				     __be64 rs_offset,
-				     u32 write_len)
-{
-	struct rpcrdma_segment *seg = &ary->wc_array[chunk_no].wc_target;
-	seg->rs_handle = rs_handle;
-	seg->rs_offset = rs_offset;
-	seg->rs_length = cpu_to_be32(write_len);
-}
-
 /* One Write chunk is copied from Call transport header to Reply
  * transport header. Each segment's length field is updated to
  * reflect number of bytes consumed in the segment.


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 14/14] svcrdma: Clean up svc_rdma_post_recv() error handling
  2017-03-16 15:52 ` Chuck Lever
@ 2017-03-16 15:54     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-16 15:54 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Distinguish and document failure modes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   66 +++++++++++++++++++++---------
 1 file changed, 47 insertions(+), 19 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 9176a35..95af982 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -486,6 +486,17 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 	return cma_xprt;
 }
 
+/**
+ * svc_rdma_post_recv - Post one Receive buffer
+ * @xprt: controlling transport
+ * @flags: memory allocation flags
+ *
+ * Returns:
+ *	%0 if Receive was posted successfully,
+ *	%-EINVAL if arguments are not correct,
+ *	%-ENOMEM if a resource shortage occurred,
+ *	%-EIO if DMA mapping failed.
+ */
 int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
 {
 	struct ib_recv_wr recv_wr, *bad_recv_wr;
@@ -501,14 +512,14 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
 	ctxt->direction = DMA_FROM_DEVICE;
 	ctxt->cqe.done = svc_rdma_wc_receive;
 	for (sge_no = 0; buflen < xprt->sc_max_req_size; sge_no++) {
-		if (sge_no >= xprt->sc_max_sge) {
-			pr_err("svcrdma: Too many sges (%d)\n", sge_no);
-			goto err_put_ctxt;
-		}
+		if (sge_no >= xprt->sc_max_sge)
+			goto err_sges;
+		ret = -ENOMEM;
 		page = alloc_page(flags);
 		if (!page)
 			goto err_put_ctxt;
 		ctxt->pages[sge_no] = page;
+		ret = -EIO;
 		pa = ib_dma_map_page(xprt->sc_cm_id->device,
 				     page, 0, PAGE_SIZE,
 				     DMA_FROM_DEVICE);
@@ -528,32 +539,49 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
 
 	svc_xprt_get(&xprt->sc_xprt);
 	ret = ib_post_recv(xprt->sc_qp, &recv_wr, &bad_recv_wr);
-	if (ret) {
-		svc_rdma_unmap_dma(ctxt);
-		svc_rdma_put_context(ctxt, 1);
-		svc_xprt_put(&xprt->sc_xprt);
-	}
-	return ret;
+	if (ret)
+		goto err_post;
+	return 0;
+
+ err_sges:
+	ret = -EINVAL;
+	pr_err("svcrdma: Too many sges (%d)\n", sge_no);
 
  err_put_ctxt:
 	svc_rdma_unmap_dma(ctxt);
 	svc_rdma_put_context(ctxt, 1);
-	return -ENOMEM;
+	return ret;
+
+ err_post:
+	svc_rdma_unmap_dma(ctxt);
+	svc_rdma_put_context(ctxt, 1);
+	svc_xprt_put(&xprt->sc_xprt);
+	return ret;
 }
 
+/**
+ * svc_rdma_repost_recv - Post one Receive buffer, disconnect on failure
+ * @xprt: controlling transport
+ * @flags: memory allocation flags
+ *
+ * Returns:
+ *	%0 if Receive was posted successfully,
+ *	%-ENOTCONN if posting failed (connection is lost).
+ */
 int svc_rdma_repost_recv(struct svcxprt_rdma *xprt, gfp_t flags)
 {
 	int ret = 0;
 
 	ret = svc_rdma_post_recv(xprt, flags);
-	if (ret) {
-		pr_err("svcrdma: could not post a receive buffer, err=%d.\n",
-		       ret);
-		pr_err("svcrdma: closing transport %p.\n", xprt);
-		set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
-		ret = -ENOTCONN;
-	}
-	return ret;
+	if (ret)
+		goto err_disconnect;
+	return 0;
+
+ err_disconnect:
+	pr_err("svcrdma: could not post a receive buffer, err=%d.\n", ret);
+	pr_err("svcrdma: closing transport %p.\n", xprt);
+	set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+	return -ENOTCONN;
 }
 
 static void
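
The error-handling shape used in this patch (set the error code just before each step that can fail, then unwind through stacked labels) can be sketched in user space. This is only a model under stated assumptions: `struct recv_ctxt`, `post_recv_model()`, `ctxt_release()`, `MAX_SGE`, and the `fail_map_at` parameter are hypothetical names, `malloc()` stands in for page allocation, and a simulated failure stands in for a DMA mapping error.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_SGE 4	/* hypothetical per-request buffer limit */

/* Hypothetical stand-in for the per-request context. */
struct recv_ctxt {
	void *pages[MAX_SGE];
	unsigned int count;
};

/* Release every buffer recorded in the context so far. */
static void ctxt_release(struct recv_ctxt *ctxt)
{
	while (ctxt->count)
		free(ctxt->pages[--ctxt->count]);
}

/*
 * Model of posting a receive that needs @nbufs buffers.
 * @fail_map_at: buffer index at which the simulated mapping
 * step fails, or -1 for no failure.
 *
 * Returns 0 on success (the caller then owns the buffers),
 * -EINVAL if more than MAX_SGE buffers are needed,
 * -ENOMEM if allocation fails, -EIO if mapping fails.
 */
static int post_recv_model(struct recv_ctxt *ctxt, unsigned int nbufs,
			   int fail_map_at)
{
	unsigned int i;
	int ret;

	assert(ctxt != NULL);
	ctxt->count = 0;
	for (i = 0; i < nbufs; i++) {
		if (i >= MAX_SGE)
			goto err_sges;
		ret = -ENOMEM;
		ctxt->pages[i] = malloc(64);
		if (!ctxt->pages[i])
			goto err_put_ctxt;
		ctxt->count++;
		ret = -EIO;
		if ((int)i == fail_map_at)
			goto err_put_ctxt;
	}
	return 0;

err_sges:
	ret = -EINVAL;
	/* fall through to the common cleanup */
err_put_ctxt:
	ctxt_release(ctxt);
	return ret;
}
```

Because every failure path funnels through the same cleanup, the caller sees a distinct error code per failure mode while the context is always left empty on error.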


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 01/14] svcrdma: Move send_wr to svc_rdma_op_ctxt
  2017-03-16 15:52     ` Chuck Lever
@ 2017-03-21 17:49         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-21 17:49 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 02/14] svcrdma: Add svc_rdma_map_reply_hdr()
  2017-03-16 15:52     ` Chuck Lever
@ 2017-03-21 17:54         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-21 17:54 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs


>  	svc_rdma_build_send_wr(ctxt, 1);
>  	ret = svc_rdma_send(rdma, &ctxt->send_wr);
>  	if (ret) {
> +		svc_rdma_unmap_dma(ctxt);
> +		svc_rdma_put_context(ctxt, 1);
>  		ret = -EIO;
> -		goto out_unmap;
>  	}

Any specific reason to not go with the goto scheme?
Can't this function grow more error paths in the future?

btw, I'm assuming svc_rdma_unmap_dma() is the opposite of
svc_rdma_map_reply_hdr() ?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 03/14] svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  2017-03-16 15:52     ` Chuck Lever
@ 2017-03-21 17:58         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-21 17:58 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs


> The Send Queue depth is temporarily reduced to 1 SQE per credit. The
> new rdma_rw API does an internal computation, during QP creation, to
> increase the depth of the Send Queue to handle RDMA Read and Write
> operations.
>
> This change has to come before the NFSD code paths are updated to
> use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
> the size of the SQ too much, resulting in memory allocation failures
> during QP creation.

I agree this needs to happen, but turns out you don't have any
guarantees of the maximum size of the sq depending on your max_sge
parameter. I'd recommend having a fall-back shrunken-size sq allocation
implemented like srpt does.

We don't have it in nvmet-rdma nor iser, but it's a good thing to have...
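
A minimal sketch of the fall-back idea suggested above: if creating the queue pair fails for lack of memory, retry with a smaller send queue until a floor is reached. This is a user-space model, not code taken from srpt or svcrdma. `create_qp_fn`, `create_qp_with_fallback()`, `fake_create_qp()`, and the halving policy are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical creation hook: returns 0 on success or a negative errno. */
typedef int (*create_qp_fn)(unsigned int sq_depth, void *arg);

/*
 * Try to create a QP with @sq_depth send queue entries. On -ENOMEM,
 * halve the depth and retry, never going below @min_depth.
 *
 * Returns the depth actually obtained, or a negative errno if no
 * attempt succeeded.
 */
static int create_qp_with_fallback(create_qp_fn create, void *arg,
				   unsigned int sq_depth,
				   unsigned int min_depth)
{
	int ret;

	assert(create != NULL);
	if (min_depth == 0 || sq_depth < min_depth)
		return -EINVAL;

	for (;;) {
		ret = create(sq_depth, arg);
		if (ret == 0)
			return (int)sq_depth;
		/* Only a resource shortage is worth retrying. */
		if (ret != -ENOMEM || sq_depth / 2 < min_depth)
			return ret;
		sq_depth /= 2;
	}
}

/* Simulated device: refuses any depth above the limit passed in @arg. */
static int fake_create_qp(unsigned int sq_depth, void *arg)
{
	unsigned int limit = *(unsigned int *)arg;

	return sq_depth > limit ? -ENOMEM : 0;
}
```

With a simulated limit of 100 entries, a request for 512 would settle on 64 after three halvings. A real transport would also have to scale its credit limit to match the depth it actually obtained.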

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 04/14] svcrdma: Add helper to save pages under I/O
  2017-03-16 15:52     ` Chuck Lever
@ 2017-03-21 18:01         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-21 18:01 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Looks fine,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 02/14] svcrdma: Add svc_rdma_map_reply_hdr()
  2017-03-21 17:54         ` Sagi Grimberg
@ 2017-03-21 18:40             ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-21 18:40 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: List Linux RDMA Mailing, Linux NFS Mailing List


> On Mar 21, 2017, at 1:54 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>> 	svc_rdma_build_send_wr(ctxt, 1);
>> 	ret = svc_rdma_send(rdma, &ctxt->send_wr);
>> 	if (ret) {
>> +		svc_rdma_unmap_dma(ctxt);
>> +		svc_rdma_put_context(ctxt, 1);
>> 		ret = -EIO;
>> -		goto out_unmap;
>> 	}
> 
> Any specific reason to not go with the goto scheme?

Only one "goto out_unmap" call site is left, and this
isn't a performance critical path.


> Can't this function grow more error paths in the future?

I can't think of any. The only time this code changes is
when overhauls like this happen.

I don't mind putting the out_unmap label back.


> btw, I'm assuming svc_rdma_unmap_dma() is the opposite of
> svc_rdma_map_reply_hdr() ?

svc_rdma_map_reply_hdr() DMA-maps the transport header buffer.

svc_rdma_unmap_dma() DMA-unmaps everything associated with
the ctxt.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 03/14] svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  2017-03-21 17:58         ` Sagi Grimberg
@ 2017-03-21 18:44             ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-21 18:44 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-rdma, Linux NFS Mailing List


> On Mar 21, 2017, at 1:58 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>> The Send Queue depth is temporarily reduced to 1 SQE per credit. The
>> new rdma_rw API does an internal computation, during QP creation, to
>> increase the depth of the Send Queue to handle RDMA Read and Write
>> operations.
>> 
>> This change has to come before the NFSD code paths are updated to
>> use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
>> the size of the SQ too much, resulting in memory allocation failures
>> during QP creation.
> 
> I agree this needs to happen, but turns out you don't have any
> guarantees of the maximum size of the sq depending on your max_sge
> parameter.

That's true. However, this is meant to be temporary while I'm
working out details of the rdma_rw API conversion. More work
in this area comes in the next series:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-rw-api


> I'd recommend having a fall-back shrunken-size sq allocation
> implemented like srpt does.

Agree it should be done. Would it be OK to wait until the dust
settles here, or do you think it's a hard requirement for
accepting this series?


> We don't have it in nvmet-rdma nor iser, but it's a good thing to have...


--
Chuck Lever




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 02/14] svcrdma: Add svc_rdma_map_reply_hdr()
  2017-03-21 18:40             ` Chuck Lever
@ 2017-03-22 13:07                 ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 13:07 UTC (permalink / raw)
  To: Chuck Lever; +Cc: List Linux RDMA Mailing, Linux NFS Mailing List


> I can't think of any. The only time this code changes is
> when overhauls like this happen.
>
> I don't mind putting the out_unmap label back.

Your call...

>> btw, I'm assuming svc_rdma_unmap_dma() is the opposite of
>> svc_rdma_map_reply_hdr() ?
>
> svc_rdma_map_reply_hdr() DMA-maps the transport header buffer.
>
> svc_rdma_unmap_dma() DMA-unmaps everything associated with
> the ctxt.

a bit non trivial, but ok I guess...

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 03/14] svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  2017-03-21 18:44             ` Chuck Lever
@ 2017-03-22 13:09                 ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 13:09 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List


>> I agree this needs to happen, but turns out you don't have any
>> guarantees of the maximum size of the sq depending on your max_sge
>> parameter.
>
> That's true. However, this is meant to be temporary while I'm
> working out details of the rdma_rw API conversion. More work
> in this area comes in the next series:
>
> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-rw-api

Thanks for the pointer...

>> I'd recommend having a fall-back shrunken-size sq allocation
>> implemented like srpt does.
>
> Agree it should be done. Would it be OK to wait until the dust
> settles here, or do you think it's a hard requirement for
> accepting this series?

It isn't and can definitely be added incrementally...

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 03/14] svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  2017-03-22 13:09                 ` Sagi Grimberg
@ 2017-03-22 13:36                     ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-22 13:36 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: List Linux RDMA Mailing, Linux NFS Mailing List


> On Mar 22, 2017, at 9:09 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>> I agree this needs to happen, but turns out you don't have any
>>> guarantees of the maximum size of the sq depending on your max_sge
>>> parameter.
>> 
>> That's true. However, this is meant to be temporary while I'm
>> working out details of the rdma_rw API conversion. More work
>> in this area comes in the next series:
>> 
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-rw-api
> 
> Thanks for the pointer...
> 
>>> I'd recommend having a fall-back shrunken-size sq allocation
>>> implemented like srpt does.
>> 
>> Agree it should be done. Would it be OK to wait until the dust
>> settles here, or do you think it's a hard requirement for
>> accepting this series?
> 
> It isn't and can definitely be added incrementally...

Roughly speaking, I think there needs to be an rdma_rw API that
assists the ULP with setting its CQ and SQ sizes, since rdma_rw
hides the registration mode (one of which, at least, consumes
more SQEs than the other).

I'd like to introduce one new function call that surfaces the
factor used to compute how many additional SQEs rdma_rw will
need. The ULP will invoke it before allocating new Send CQs.

I'll try to provide an RFC in the nfsd-rdma-rw-api topic branch.
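
To make the sizing idea concrete, here is a rough sketch of what such a helper could look like. It is not an existing rdma_rw interface: `rw_sqe_factor()` and `sq_depth_needed()` are hypothetical names, and the factor values (1 without memory registration, 3 with it) are illustrative assumptions rather than numbers taken from the rdma_rw code.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: SQEs consumed per R/W context. The values are
 * assumptions for illustration: one WR for the RDMA Read or Write
 * itself, plus two more (register and invalidate) when the device
 * needs memory registration.
 */
static unsigned int rw_sqe_factor(bool uses_mr)
{
	return uses_mr ? 3 : 1;
}

/*
 * Hypothetical sizing rule for a ULP: one SQE per credit for Sends,
 * plus factor * @rw_ctxs for RDMA Read and Write contexts.
 */
static unsigned int sq_depth_needed(unsigned int credits,
				    unsigned int rw_ctxs, bool uses_mr)
{
	return credits + rw_sqe_factor(uses_mr) * rw_ctxs;
}
```

Under these assumptions, 32 credits with 32 R/W contexts would need 64 SQEs without registration and 128 with it, which is the kind of difference the ULP cannot see while the registration mode stays hidden.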


--
Chuck Lever




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 05/14] svcrdma: Introduce local rdma_rw API helpers
  2017-03-16 15:53     ` Chuck Lever
@ 2017-03-22 14:17         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 14:17 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs


> The plan is to replace the local bespoke code that constructs and
> posts RDMA Read and Write Work Requests with calls to the rdma_rw
> API. This shares code with other RDMA-enabled ULPs that manages the
> gory details of buffer registration and posting Work Requests.
>
> Some design notes:
>
>  o svc_xprt reference counting is modified, since one rdma_rw_ctx
>    generates one completion, no matter how many Write WRs are
>    posted. To accommodate the new reference counting scheme, a new
>    version of svc_rdma_send() is introduced.
>
>  o The structure of RPC-over-RDMA transport headers is flexible,
>    allowing multiple segments per Reply with arbitrary alignment.
>    Thus I did not take the further step of chaining Write WRs with
>    the Send WR containing the RPC Reply message. The Write and Send
>    WRs continue to be built by separate pieces of code.
>
>  o The current code builds the transport header as it is construct-
>    ing Write WRs. I've replaced that with marshaling of transport
>    header data items in a separate step. This is because the exact
>    structure of client-provided segments may not align with the
>    components of the server's reply xdr_buf, or the pages in the
>    page list. Thus parts of each client-provided segment may be
>    written at different points in the send path.
>
>  o Since the Write list and Reply chunk marshaling code is being
>    replaced, I took the opportunity to replace some of the C
>    structure-based XDR encoding code with more portable code that
>    instead uses pointer arithmetic.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

To be honest, it's difficult to review this patch, but it's probably
difficult to split it too...

> +
> +/* One Write chunk is copied from Call transport header to Reply
> + * transport header. Each segment's length field is updated to
> + * reflect number of bytes consumed in the segment.
> + *
> + * Returns number of segments in this chunk.
> + */
> +static unsigned int xdr_encode_write_chunk(__be32 *dst, __be32 *src,
> +					   unsigned int remaining)

Is this only for data-in-reply (send operation)? I don't see why you
would need to modify that for RDMA operations.

Perhaps I'd try to split the data-in-reply code from the actual rdma
conversion. It might be helpful to comprehend.

> +{
> +	unsigned int i, nsegs;
> +	u32 seg_len;
> +
> +	/* Write list discriminator */
> +	*dst++ = *src++;

I had to actually run a test program to understand the precedence
here, so parentheses would've helped :)

> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> new file mode 100644
> index 0000000..1e76227
> --- /dev/null
> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> @@ -0,0 +1,785 @@
> +/*
> + * Copyright (c) 2016 Oracle.  All rights reserved.
> + *
> + * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
> + */
> +
> +#include <linux/sunrpc/rpc_rdma.h>
> +#include <linux/sunrpc/svc_rdma.h>
> +#include <linux/sunrpc/debug.h>
> +
> +#include <rdma/rw.h>
> +
> +#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
> +
> +/* Each R/W context contains state for one chain of RDMA Read or
> + * Write Work Requests (one RDMA segment to be read from or written
> + * back to the client).
> + *
> + * Each WR chain handles a single contiguous server-side buffer,
> + * because some registration modes (eg. FRWR) do not support a
> + * discontiguous scatterlist.
> + *
> + * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
> + * from a client may contain a unique R_key, so each WR chain moves
> + * one segment (or less) at a time.
> + *
> + * The scatterlist makes this data structure just over 8KB in size
> + * on 4KB-page platforms. As the size of this structure increases
> + * past one page, it becomes more likely that allocating one of these
> + * will fail. Therefore, these contexts are created on demand, but
> + * cached and reused until the controlling svcxprt_rdma is destroyed.
> + */
> +struct svc_rdma_rw_ctxt {
> +	struct list_head	rw_list;
> +	struct ib_cqe		rw_cqe;
> +	struct svcxprt_rdma	*rw_rdma;
> +	int			rw_nents;
> +	int			rw_wrcount;
> +	enum dma_data_direction	rw_dir;
> +	struct svc_rdma_op_ctxt	*rw_readctxt;
> +	struct rdma_rw_ctx	rw_ctx;
> +	struct scatterlist	rw_sg[RPCSVC_MAXPAGES];

Have you considered using sg_table with sg_alloc_table_chained?

See lib/sg_pool.c and nvme-rdma as a consumer.

> +};
> +
> +static struct svc_rdma_rw_ctxt *
> +svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma)
> +{
> +	struct svc_rdma_rw_ctxt *ctxt;
> +
> +	svc_xprt_get(&rdma->sc_xprt);
> +
> +	spin_lock(&rdma->sc_rw_ctxt_lock);
> +	if (list_empty(&rdma->sc_rw_ctxts))
> +		goto out_empty;
> +
> +	ctxt = list_first_entry(&rdma->sc_rw_ctxts,
> +				struct svc_rdma_rw_ctxt, rw_list);
> +	list_del_init(&ctxt->rw_list);
> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
> +
> +out:
> +	ctxt->rw_dir = DMA_NONE;
> +	return ctxt;
> +
> +out_empty:
> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
> +
> +	ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
> +	if (!ctxt) {
> +		svc_xprt_put(&rdma->sc_xprt);
> +		return NULL;
> +	}
> +
> +	ctxt->rw_rdma = rdma;
> +	INIT_LIST_HEAD(&ctxt->rw_list);
> +	sg_init_table(ctxt->rw_sg, ARRAY_SIZE(ctxt->rw_sg));
> +	goto out;
> +}
> +
> +static void svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt)
> +{
> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
> +
> +	if (ctxt->rw_dir != DMA_NONE)
> +		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
> +				    rdma->sc_port_num,
> +				    ctxt->rw_sg, ctxt->rw_nents,
> +				    ctxt->rw_dir);
> +

It's a bit odd to see a put_rw_ctxt that also destroys the context,
which doesn't exactly pair with get_rw_ctxt.

Maybe it'd be useful to explicitly do that outside the put.

> +/**
> + * svc_rdma_destroy_rw_ctxts - Free write contexts
> + * @rdma: xprt about to be destroyed
> + *
> + */
> +void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma)
> +{
> +	struct svc_rdma_rw_ctxt *ctxt;
> +
> +	while (!list_empty(&rdma->sc_rw_ctxts)) {
> +		ctxt = list_first_entry(&rdma->sc_rw_ctxts,
> +					struct svc_rdma_rw_ctxt, rw_list);
> +		list_del(&ctxt->rw_list);
> +		kfree(ctxt);
> +	}
> +}
> +
> +/**
> + * svc_rdma_wc_write_ctx - Handle completion of an RDMA Write ctx
> + * @cq: controlling Completion Queue
> + * @wc: Work Completion
> + *
> + * Invoked in soft IRQ context.
> + *
> + * Assumptions:
> + * - Write completion is not responsible for freeing pages under
> + *   I/O.
> + */
> +static void svc_rdma_wc_write_ctx(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ib_cqe *cqe = wc->wr_cqe;
> +	struct svc_rdma_rw_ctxt *ctxt =
> +			container_of(cqe, struct svc_rdma_rw_ctxt, rw_cqe);
> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
> +
> +	atomic_add(ctxt->rw_wrcount, &rdma->sc_sq_avail);
> +	wake_up(&rdma->sc_send_wait);
> +
> +	if (wc->status != IB_WC_SUCCESS)
> +		goto flush;
> +
> +out:
> +	svc_rdma_put_rw_ctxt(ctxt);
> +	return;
> +
> +flush:
> +	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
> +	if (wc->status != IB_WC_WR_FLUSH_ERR)
> +		pr_err("svcrdma: write ctx: %s (%u/0x%x)\n",
> +		       ib_wc_status_msg(wc->status),
> +		       wc->status, wc->vendor_err);
> +	goto out;
> +}
> +
> +/**
> + * svc_rdma_wc_read_ctx - Handle completion of an RDMA Read ctx
> + * @cq: controlling Completion Queue
> + * @wc: Work Completion
> + *
> + * Invoked in soft IRQ context.

? in soft IRQ?

> + */
> +static void svc_rdma_wc_read_ctx(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ib_cqe *cqe = wc->wr_cqe;
> +	struct svc_rdma_rw_ctxt *ctxt =
> +			container_of(cqe, struct svc_rdma_rw_ctxt, rw_cqe);
> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
> +	struct svc_rdma_op_ctxt *head;
> +
> +	atomic_add(ctxt->rw_wrcount, &rdma->sc_sq_avail);
> +	wake_up(&rdma->sc_send_wait);
> +
> +	if (wc->status != IB_WC_SUCCESS)
> +		goto flush;
> +
> +	head = ctxt->rw_readctxt;
> +	if (!head)
> +		goto out;
> +
> +	spin_lock(&rdma->sc_rq_dto_lock);
> +	list_add_tail(&head->list, &rdma->sc_read_complete_q);
> +	spin_unlock(&rdma->sc_rq_dto_lock);

Not sure what sc_read_complete_q does... what post processing is
needed for completed reads?

> +/* This function sleeps when the transport's Send Queue is congested.

Is this easy to trigger?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 07/14] svcrdma: Clean up RDMA_ERROR path
  2017-03-16 15:53     ` Chuck Lever
@ 2017-03-22 14:18         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 14:18 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 08/14] svcrdma: Report Write/Reply chunk overruns
  2017-03-16 15:53     ` Chuck Lever
@ 2017-03-22 14:20         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 14:20 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 10/14] svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
  2017-03-16 15:53     ` Chuck Lever
@ 2017-03-22 14:21         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 14:21 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 11/14] svcrdma: Remove old RDMA Write completion handlers
  2017-03-16 15:53     ` Chuck Lever
@ 2017-03-22 14:22         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 14:22 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 12/14] svcrdma: Remove the req_map cache
  2017-03-16 15:54     ` Chuck Lever
@ 2017-03-22 14:22         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 14:22 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Nice,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 13/14] svcrdma: Clean out old XDR encoders
  2017-03-16 15:54     ` Chuck Lever
@ 2017-03-22 14:23         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 14:23 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 05/14] svcrdma: Introduce local rdma_rw API helpers
  2017-03-22 14:17         ` Sagi Grimberg
@ 2017-03-22 15:41             ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-22 15:41 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-rdma, Linux NFS Mailing List


> On Mar 22, 2017, at 10:17 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
>> 
>> The plan is to replace the local bespoke code that constructs and
>> posts RDMA Read and Write Work Requests with calls to the rdma_rw
>> API. This shares code with other RDMA-enabled ULPs that manages the
>> gory details of buffer registration and posting Work Requests.
>> 
>> Some design notes:
>> 
>> o svc_xprt reference counting is modified, since one rdma_rw_ctx
>>   generates one completion, no matter how many Write WRs are
>>   posted. To accommodate the new reference counting scheme, a new
>>   version of svc_rdma_send() is introduced.
>> 
>> o The structure of RPC-over-RDMA transport headers is flexible,
>>   allowing multiple segments per Reply with arbitrary alignment.
>>   Thus I did not take the further step of chaining Write WRs with
>>   the Send WR containing the RPC Reply message. The Write and Send
>>   WRs continue to be built by separate pieces of code.
>> 
>> o The current code builds the transport header as it is construct-
>>   ing Write WRs. I've replaced that with marshaling of transport
>>   header data items in a separate step. This is because the exact
>>   structure of client-provided segments may not align with the
>>   components of the server's reply xdr_buf, or the pages in the
>>   page list. Thus parts of each client-provided segment may be
>>   written at different points in the send path.
>> 
>> o Since the Write list and Reply chunk marshaling code is being
>>   replaced, I took the opportunity to replace some of the C
>>   structure-based XDR encoding code with more portable code that
>>   instead uses pointer arithmetic.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
> To be honest, it's difficult to review this patch, but it's probably
> difficult to split it too...

I agree this is an unfortunately large heap of code. However, I'm
somewhat constrained by Bruce's requirement that I introduce new
code and use it in the same (or an adjacent) patch.

In the next version of this series, the read code (below) has been
removed from this patch, since it's not actually used until
nfsd-rdma-rw-api.


>> +
>> +/* One Write chunk is copied from Call transport header to Reply
>> + * transport header. Each segment's length field is updated to
>> + * reflect number of bytes consumed in the segment.
>> + *
>> + * Returns number of segments in this chunk.
>> + */
>> +static unsigned int xdr_encode_write_chunk(__be32 *dst, __be32 *src,
>> +					   unsigned int remaining)
> 
> Is this only for data-in-reply (send operation)? I don't see why you
> would need to modify that for RDMA operations.

The sendto path encodes the chunk list in the transport header
as it is posting the RDMA Writes. So this function is used when
there are RDMA Writes before the actual RPC Reply.
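
The body of xdr_encode_write_chunk() is not quoted in full in this
message, so the following is only a user-space sketch of what the
function's block comment describes. It assumes the usual
RPC-over-RDMA segment layout (a handle word, a length word, and a
two-word offset) and uses a hypothetical name, encode_write_chunk(),
to make clear it is not the patch's code.

```c
#include <arpa/inet.h>	/* htonl(), ntohl() */
#include <assert.h>
#include <stdint.h>

/* Copy one Write chunk from src to dst (both in network byte order).
 * Assumed layout: discriminator, segment count, then for each
 * segment a handle word, a length word, and a two-word offset.
 * "remaining" is the number of payload bytes not yet accounted for.
 * Returns the number of segments in the chunk.
 */
static unsigned int encode_write_chunk(uint32_t *dst, const uint32_t *src,
				       unsigned int remaining)
{
	unsigned int i, nsegs;
	uint32_t seg_len;

	*dst++ = *src++;		/* Write list discriminator */

	nsegs = ntohl(*src);
	*dst++ = *src++;		/* segment count */

	for (i = 0; i < nsegs; i++) {
		*dst++ = *src++;	/* handle (R_key) */

		seg_len = ntohl(*src++);
		if (remaining >= seg_len) {
			/* the whole segment was consumed */
			*dst++ = htonl(seg_len);
			remaining -= seg_len;
		} else {
			/* segment only partly filled, or unused */
			*dst++ = htonl(remaining);
			remaining = 0;
		}

		*dst++ = *src++;	/* offset, upper word */
		*dst++ = *src++;	/* offset, lower word */
	}

	return nsegs;
}
```

For a chunk with two 4096-byte segments and 5000 bytes of payload,
the first segment's length is copied unchanged and the second is
rewritten to 904.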

I could add this code in a preceding patch, but again, Bruce
likes to see all the code added and used at the same time.


> Perhaps I'd try to split the data-in-reply code from the actual rdma
> conversion. It might be helpful to comprehend.

I'm not sure what you mean, but it might be that we are using
these terms a little differently.


>> +{
>> +	unsigned int i, nsegs;
>> +	u32 seg_len;
>> +
>> +	/* Write list discriminator */
>> +	*dst++ = *src++;
> 
> I had to actually run a test program to understand the precedence
> here, so parentheses would've helped :)

*dst++ = *src++ is a common idiom in networking code and
XDR encoders/decoders, though it is a little old-fashioned.
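
For readers unfamiliar with the idiom, a small stand-alone sketch
follows. The helper name copy_words() is made up for this example.
Postfix ++ binds more tightly than unary *, so the statement copies
the word each pointer currently refers to and then advances both
pointers; the copied values themselves are never incremented.

```c
#include <assert.h>
#include <stdint.h>

/* Copy n 32-bit words from src to dst and return the advanced dst
 * cursor, in the style of an XDR encoder.
 */
static uint32_t *copy_words(uint32_t *dst, const uint32_t *src,
			    unsigned int n)
{
	while (n--)
		*dst++ = *src++;	/* same as *(dst++) = *(src++) */
	return dst;
}
```

Returning the advanced cursor lets a caller chain several encoding
steps without tracking an index separately.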


>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>> new file mode 100644
>> index 0000000..1e76227
>> --- /dev/null
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>> @@ -0,0 +1,785 @@
>> +/*
>> + * Copyright (c) 2016 Oracle.  All rights reserved.
>> + *
>> + * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
>> + */
>> +
>> +#include <linux/sunrpc/rpc_rdma.h>
>> +#include <linux/sunrpc/svc_rdma.h>
>> +#include <linux/sunrpc/debug.h>
>> +
>> +#include <rdma/rw.h>
>> +
>> +#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
>> +
>> +/* Each R/W context contains state for one chain of RDMA Read or
>> + * Write Work Requests (one RDMA segment to be read from or written
>> + * back to the client).
>> + *
>> + * Each WR chain handles a single contiguous server-side buffer,
>> + * because some registration modes (eg. FRWR) do not support a
>> + * discontiguous scatterlist.
>> + *
>> + * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
>> + * from a client may contain a unique R_key, so each WR chain moves
>> + * one segment (or less) at a time.
>> + *
>> + * The scatterlist makes this data structure just over 8KB in size
>> + * on 4KB-page platforms. As the size of this structure increases
>> + * past one page, it becomes more likely that allocating one of these
>> + * will fail. Therefore, these contexts are created on demand, but
>> + * cached and reused until the controlling svcxprt_rdma is destroyed.
>> + */
>> +struct svc_rdma_rw_ctxt {
>> +	struct list_head	rw_list;
>> +	struct ib_cqe		rw_cqe;
>> +	struct svcxprt_rdma	*rw_rdma;
>> +	int			rw_nents;
>> +	int			rw_wrcount;
>> +	enum dma_data_direction	rw_dir;
>> +	struct svc_rdma_op_ctxt	*rw_readctxt;
>> +	struct rdma_rw_ctx	rw_ctx;
>> +	struct scatterlist	rw_sg[RPCSVC_MAXPAGES];
> 
> Have you considered using sg_table with sg_alloc_table_chained?
> 
> See lib/sg_pool.c and nvme-rdma as a consumer.

That might be newer than my patches. I'll have a look.


>> +};
>> +
>> +static struct svc_rdma_rw_ctxt *
>> +svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma)
>> +{
>> +	struct svc_rdma_rw_ctxt *ctxt;
>> +
>> +	svc_xprt_get(&rdma->sc_xprt);
>> +
>> +	spin_lock(&rdma->sc_rw_ctxt_lock);
>> +	if (list_empty(&rdma->sc_rw_ctxts))
>> +		goto out_empty;
>> +
>> +	ctxt = list_first_entry(&rdma->sc_rw_ctxts,
>> +				struct svc_rdma_rw_ctxt, rw_list);
>> +	list_del_init(&ctxt->rw_list);
>> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
>> +
>> +out:
>> +	ctxt->rw_dir = DMA_NONE;
>> +	return ctxt;
>> +
>> +out_empty:
>> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
>> +
>> +	ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
>> +	if (!ctxt) {
>> +		svc_xprt_put(&rdma->sc_xprt);
>> +		return NULL;
>> +	}
>> +
>> +	ctxt->rw_rdma = rdma;
>> +	INIT_LIST_HEAD(&ctxt->rw_list);
>> +	sg_init_table(ctxt->rw_sg, ARRAY_SIZE(ctxt->rw_sg));
>> +	goto out;
>> +}
>> +
>> +static void svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt)
>> +{
>> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
>> +
>> +	if (ctxt->rw_dir != DMA_NONE)
>> +		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
>> +				    rdma->sc_port_num,
>> +				    ctxt->rw_sg, ctxt->rw_nents,
>> +				    ctxt->rw_dir);
>> +
> 
> It's a bit odd to see a put_rw_ctxt that also destroys the context,
> which doesn't exactly pair with get_rw_ctxt.
> 
> Maybe it'd be useful to explicitly do that outside the put.

The pairing is not obvious, but it is this:

svc_rdma_send_writes() does the svc_rdma_get_rw_ctxt().

-> If posting succeeds, svc_rdma_wc_write_ctx() puts the ctxt.

-> If posting fails, svc_rdma_send_writes() puts the ctxt.

Do you have a suggestion about how this could be more
intuitively documented?

IIRC I combined these because the rdma_rw_ctx_destroy is
always done just before putting the ctxt back on the free
list. It eliminates some code duplication, and ensures
the ctxt is always ready for the next svc_rdma_get_rw_ctxt.
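
The pairing can be modelled in user space roughly as follows. This is
a sketch under stated assumptions, not the svcrdma code: the types
and the names ctxt_get() and ctxt_put() are hypothetical, locking is
omitted (the patch takes sc_rw_ctxt_lock), and a counter stands in
for the call to rdma_rw_ctx_destroy().

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for enum dma_data_direction. */
enum dir { DIR_NONE, DIR_TO_DEVICE };

struct rw_ctxt {
	struct rw_ctxt *next;	/* free-list link */
	enum dir dir;		/* DIR_NONE when no mapping is live */
	int teardowns;		/* counts simulated rdma_rw_ctx_destroy() */
};

struct ctxt_cache {
	struct rw_ctxt *free;	/* cached contexts, ready for reuse */
};

/* Take a cached context, or allocate one if the cache is empty. */
static struct rw_ctxt *ctxt_get(struct ctxt_cache *cache)
{
	struct rw_ctxt *ctxt = cache->free;

	if (ctxt) {
		cache->free = ctxt->next;
	} else {
		ctxt = calloc(1, sizeof(*ctxt));
		if (!ctxt)
			return NULL;
	}
	ctxt->next = NULL;
	ctxt->dir = DIR_NONE;
	return ctxt;
}

/* Tear down any live mapping, then return the context to the cache.
 * Each get is matched by exactly one put: from the completion
 * handler if posting succeeded, or from the poster if it failed.
 */
static void ctxt_put(struct ctxt_cache *cache, struct rw_ctxt *ctxt)
{
	if (ctxt->dir != DIR_NONE) {
		ctxt->teardowns++;	/* real code: rdma_rw_ctx_destroy() */
		ctxt->dir = DIR_NONE;
	}
	ctxt->next = cache->free;
	cache->free = ctxt;
}
```

Folding the teardown into the put means neither caller has to
remember it, at the cost of the put doing more than its name
suggests, which is the trade-off being discussed here.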


>> +/**
>> + * svc_rdma_destroy_rw_ctxts - Free write contexts
>> + * @rdma: xprt about to be destroyed
>> + *
>> + */
>> +void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma)
>> +{
>> +	struct svc_rdma_rw_ctxt *ctxt;
>> +
>> +	while (!list_empty(&rdma->sc_rw_ctxts)) {
>> +		ctxt = list_first_entry(&rdma->sc_rw_ctxts,
>> +					struct svc_rdma_rw_ctxt, rw_list);
>> +		list_del(&ctxt->rw_list);
>> +		kfree(ctxt);
>> +	}
>> +}
>> +
>> +/**
>> + * svc_rdma_wc_write_ctx - Handle completion of an RDMA Write ctx
>> + * @cq: controlling Completion Queue
>> + * @wc: Work Completion
>> + *
>> + * Invoked in soft IRQ context.
>> + *
>> + * Assumptions:
>> + * - Write completion is not responsible for freeing pages under
>> + *   I/O.
>> + */
>> +static void svc_rdma_wc_write_ctx(struct ib_cq *cq, struct ib_wc *wc)
>> +{
>> +	struct ib_cqe *cqe = wc->wr_cqe;
>> +	struct svc_rdma_rw_ctxt *ctxt =
>> +			container_of(cqe, struct svc_rdma_rw_ctxt, rw_cqe);
>> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
>> +
>> +	atomic_add(ctxt->rw_wrcount, &rdma->sc_sq_avail);
>> +	wake_up(&rdma->sc_send_wait);
>> +
>> +	if (wc->status != IB_WC_SUCCESS)
>> +		goto flush;
>> +
>> +out:
>> +	svc_rdma_put_rw_ctxt(ctxt);
>> +	return;
>> +
>> +flush:
>> +	set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
>> +	if (wc->status != IB_WC_WR_FLUSH_ERR)
>> +		pr_err("svcrdma: write ctx: %s (%u/0x%x)\n",
>> +		       ib_wc_status_msg(wc->status),
>> +		       wc->status, wc->vendor_err);
>> +	goto out;
>> +}
>> +
>> +/**
>> + * svc_rdma_wc_read_ctx - Handle completion of an RDMA Read ctx
>> + * @cq: controlling Completion Queue
>> + * @wc: Work Completion
>> + *
>> + * Invoked in soft IRQ context.
> 
> ? in soft IRQ?

Not sure I understand this comment?


> 
>> + */
>> +static void svc_rdma_wc_read_ctx(struct ib_cq *cq, struct ib_wc *wc)
>> +{
>> +	struct ib_cqe *cqe = wc->wr_cqe;
>> +	struct svc_rdma_rw_ctxt *ctxt =
>> +			container_of(cqe, struct svc_rdma_rw_ctxt, rw_cqe);
>> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
>> +	struct svc_rdma_op_ctxt *head;
>> +
>> +	atomic_add(ctxt->rw_wrcount, &rdma->sc_sq_avail);
>> +	wake_up(&rdma->sc_send_wait);
>> +
>> +	if (wc->status != IB_WC_SUCCESS)
>> +		goto flush;
>> +
>> +	head = ctxt->rw_readctxt;
>> +	if (!head)
>> +		goto out;
>> +
>> +	spin_lock(&rdma->sc_rq_dto_lock);
>> +	list_add_tail(&head->list, &rdma->sc_read_complete_q);
>> +	spin_unlock(&rdma->sc_rq_dto_lock);
> 
> Not sure what sc_read_complete_q does... what post processing is
> needed for completed reads?

I postponed this until nfsd-rdma-rw-api. Briefly, yes, there's
a lot of work to do when receiving an RPC Call with Read chunks.


>> +/* This function sleeps when the transport's Send Queue is congested.
> 
> Is this easy to trigger?

Not really, but it does happen.

This is one of the problems with RPC-over-RDMA. It's not practical
for the server to size its SQ large enough for every possible
scenario. And, as you observed before, some HCAs/RNICs will have
limited SQ capabilities.

--
Chuck Lever




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 03/14] svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  2017-03-22 13:36                     ` Chuck Lever
@ 2017-03-22 19:06                         ` Sagi Grimberg
  -1 siblings, 0 replies; 70+ messages in thread
From: Sagi Grimberg @ 2017-03-22 19:06 UTC (permalink / raw)
  To: Chuck Lever; +Cc: List Linux RDMA Mailing, Linux NFS Mailing List


> Roughly speaking, I think there needs to be an rdma_rw API that
> assists the ULP with setting its CQ and SQ sizes, since rdma_rw
> hides the registration mode (one of which, at least, consumes
> more SQEs than the other).

Hiding the registration mode was largely the motivation for
this... It buys us a simplified implementation and inherently supports
both IB and iWARP (which was annoying and only existed in svc, but
was still suboptimal).

> I'd like to introduce one new function call that surfaces the
> factor used to compute how many additional SQEs that rdma_rw will
> need. The ULP will invoke it before allocating new Send CQs.

I see your point... We should probably get a sense of how to
size the completion queue. I think that this issue is solved by
the CQ pool API that Christoph sent a while ago but that was never
pursued.

The basic idea is that the core would create a pool of long CQs
and then assign queue pairs to them depending on the SQ+RQ depth.
If we were to pick it up, would you consider using it?

> I'll try to provide an RFC in the nfsd-rdma-rw-api topic branch.

Cool, let's see what you had in mind...

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 03/14] svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  2017-03-22 19:06                         ` Sagi Grimberg
@ 2017-03-22 19:30                             ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-22 19:30 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: List Linux RDMA Mailing, Linux NFS Mailing List


> On Mar 22, 2017, at 3:06 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>> Roughly speaking, I think there needs to be an rdma_rw API that
>> assists the ULP with setting its CQ and SQ sizes, since rdma_rw
>> hides the registration mode (one of which, at least, consumes
>> more SQEs than the other).
> 
> Hiding the registration mode was largely the motivation for
> this... It buys us a simplified implementation and inherently supports
> both IB and iWARP (which was annoying and only existed in svc, but
> was still suboptimal).
> 
>> I'd like to introduce one new function call that surfaces the
>> factor used to compute how many additional SQEs that rdma_rw will
>> need. The ULP will invoke it before allocating new Send CQs.
> 
> I see your point... We should probably get a sense of how to
> size the completion queue. I think that this issue is solved by
> the CQ pool API that Christoph sent a while ago but that was never
> pursued.
> 
> The basic idea is that the core would create a pool of long CQs
> and then assign queue pairs to them depending on the SQ+RQ depth.
> If we were to pick it up, would you consider using it?

I will certainly take a look at it. But I don't think that's
enough.

The ULP is also responsible for managing send queue accounting,
and possibly queuing WRs when a send queue is full. So it still
needs to know the maximum number of send WRs that can be posted
at one time. For svc_rdma, this is sc_sq_avail.

I believe that the ULP needs to know the actual number of SQEs
both for determining CQ size, and for knowing when to plug the
send queue.
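
To make the accounting concrete, here is a minimal userspace sketch
using C11 atomics. It is not the svcrdma implementation: struct
sq_account, sq_init(), sq_reserve() and sq_release() are hypothetical
names, "avail" only plays the role that sc_sq_avail plays for
svc_rdma, and a real transport would sleep or queue the WRs where
this sketch simply reports failure.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical send queue accounting state. */
struct sq_account {
	atomic_int avail;	/* SQEs not currently in use */
};

static void sq_init(struct sq_account *sq, int depth)
{
	atomic_init(&sq->avail, depth);
}

/* Try to claim nr_wrs SQEs before posting a WR chain. Returns false
 * if the send queue would be over-subscribed.
 */
static bool sq_reserve(struct sq_account *sq, int nr_wrs)
{
	int before = atomic_fetch_sub(&sq->avail, nr_wrs);

	if (before - nr_wrs >= 0)
		return true;

	/* Not enough room: undo the claim so others see the true count. */
	atomic_fetch_add(&sq->avail, nr_wrs);
	return false;
}

/* Called from the completion path: one completion returns every SQE
 * the chain consumed.
 */
static void sq_release(struct sq_account *sq, int nr_wrs)
{
	atomic_fetch_add(&sq->avail, nr_wrs);
}
```

The initial depth passed to sq_init() is the number the ULP has to
know up front. If the R/W layer silently consumes extra SQEs for
registration, that number is wrong and the reservation check stops
protecting the queue.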

This maximum depends on the registration mode, the page list
depth capability of the HCA (relative to the maximum ULP data
payload size), and the page size of the platform.

For example, for NFS, the typical maximum rsize and wsize is 1MB.
The CX-3 Pro cards I have allow 511 pages per MR in FRWR mode.
My systems are x64 using 4KB pages.

So I know that one rdma_rw_ctx can handle 256 pages (or 1MB) of
payload on my system.

An HCA with a smaller page list depth, a system with larger pages,
or an rsize/wsize of 4MB might want a different number of MRs for
the same transport, and thus a larger send queue.
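
The arithmetic behind those numbers can be written out as a short
sketch. The helper names below (div_round_up(), payload_pages(),
mrs_needed()) are hypothetical, and the sketch assumes a page-aligned
payload; an unaligned buffer can touch one extra page.

```c
#include <stddef.h>

/* Integer division that rounds up. */
static size_t div_round_up(size_t n, size_t d)
{
	return (n + d - 1) / d;
}

/* Pages needed to hold a page-aligned payload. */
static size_t payload_pages(size_t payload_bytes, size_t page_size)
{
	return div_round_up(payload_bytes, page_size);
}

/* Memory regions needed when each MR can cover at most
 * max_pages_per_mr pages (the device's page list depth).
 */
static size_t mrs_needed(size_t payload_bytes, size_t page_size,
			 size_t max_pages_per_mr)
{
	return div_round_up(payload_pages(payload_bytes, page_size),
			    max_pages_per_mr);
}
```

With a 1MB payload, 4KB pages and 511 pages per MR, payload_pages()
gives 256 and mrs_needed() gives 1, matching the case above. Raising
the payload to 4MB gives 1024 pages and 3 MRs, and each additional MR
means additional SQEs for registration and invalidation.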

Alternately, we could set a fixed arbitrary send queue size, and
force all ULPs and devices to live with that. That would be much
simpler.


>> I'll try to provide an RFC in the nfsd-rdma-rw-api topic branch.
> 
> Cool, let's see what you had in mind...


--
Chuck Lever




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 05/14] svcrdma: Introduce local rdma_rw API helpers
  2017-03-22 15:41             ` Chuck Lever
@ 2017-03-24 22:19                 ` Chuck Lever
  -1 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-24 22:19 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-rdma, Linux NFS Mailing List


> On Mar 22, 2017, at 11:41 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> 
>> On Mar 22, 2017, at 10:17 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> 
>>> 
>>> The plan is to replace the local bespoke code that constructs and
>>> posts RDMA Read and Write Work Requests with calls to the rdma_rw
>>> API. This shares code with other RDMA-enabled ULPs that manages the
>>> gory details of buffer registration and posting Work Requests.
>>> 
>>> Some design notes:
>>> 
>>> o svc_xprt reference counting is modified, since one rdma_rw_ctx
>>>  generates one completion, no matter how many Write WRs are
>>>  posted. To accommodate the new reference counting scheme, a new
>>>  version of svc_rdma_send() is introduced.
>>> 
>>> o The structure of RPC-over-RDMA transport headers is flexible,
>>>  allowing multiple segments per Reply with arbitrary alignment.
>>>  Thus I did not take the further step of chaining Write WRs with
>>>  the Send WR containing the RPC Reply message. The Write and Send
>>>  WRs continue to be built by separate pieces of code.
>>> 
>>> o The current code builds the transport header as it is construct-
>>>  ing Write WRs. I've replaced that with marshaling of transport
>>>  header data items in a separate step. This is because the exact
>>>  structure of client-provided segments may not align with the
>>>  components of the server's reply xdr_buf, or the pages in the
>>>  page list. Thus parts of each client-provided segment may be
>>>  written at different points in the send path.
>>> 
>>> o Since the Write list and Reply chunk marshaling code is being
>>>  replaced, I took the opportunity to replace some of the C
>>>  structure-based XDR encoding code with more portable code that
>>>  instead uses pointer arithmetic.
>>> 
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>>> new file mode 100644
>>> index 0000000..1e76227
>>> --- /dev/null
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>>> @@ -0,0 +1,785 @@
>>> +/*
>>> + * Copyright (c) 2016 Oracle.  All rights reserved.
>>> + *
>>> + * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
>>> + */
>>> +
>>> +#include <linux/sunrpc/rpc_rdma.h>
>>> +#include <linux/sunrpc/svc_rdma.h>
>>> +#include <linux/sunrpc/debug.h>
>>> +
>>> +#include <rdma/rw.h>
>>> +
>>> +#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
>>> +
>>> +/* Each R/W context contains state for one chain of RDMA Read or
>>> + * Write Work Requests (one RDMA segment to be read from or written
>>> + * back to the client).
>>> + *
>>> + * Each WR chain handles a single contiguous server-side buffer,
>>> + * because some registration modes (eg. FRWR) do not support a
>>> + * discontiguous scatterlist.
>>> + *
>>> + * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
>>> + * from a client may contain a unique R_key, so each WR chain moves
>>> + * one segment (or less) at a time.
>>> + *
>>> + * The scatterlist makes this data structure just over 8KB in size
>>> + * on 4KB-page platforms. As the size of this structure increases
>>> + * past one page, it becomes more likely that allocating one of these
>>> + * will fail. Therefore, these contexts are created on demand, but
>>> + * cached and reused until the controlling svcxprt_rdma is destroyed.
>>> + */
>>> +struct svc_rdma_rw_ctxt {
>>> +	struct list_head	rw_list;
>>> +	struct ib_cqe		rw_cqe;
>>> +	struct svcxprt_rdma	*rw_rdma;
>>> +	int			rw_nents;
>>> +	int			rw_wrcount;
>>> +	enum dma_data_direction	rw_dir;
>>> +	struct svc_rdma_op_ctxt	*rw_readctxt;
>>> +	struct rdma_rw_ctx	rw_ctx;
>>> +	struct scatterlist	rw_sg[RPCSVC_MAXPAGES];
>> 
>> Have you considered using sg_table with sg_alloc_table_chained?
>> 
>> See lib/sg_pool.c and nvme-rdma as a consumer.
> 
> That might be newer than my patches. I'll have a look.

I looked at the consumers of sg_alloc_table_chained, and
all these callers are immediately doing ib_dma_map_sg.

Nothing in svc_rdma_rw.c does a DMA map. It relies on
rdma_rw_ctx_init for that, and that API is passed a
scatterlist.

I don't see how I could use sg_alloc_table_chained here,
unless rdma_rw_ctx_init was modified to take a chained
sg_table instead of a scatterlist argument.

I suppose I could convert the client side to use it?
What do you think?


>>> +};
>>> +
>>> +static struct svc_rdma_rw_ctxt *
>>> +svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma)
>>> +{
>>> +	struct svc_rdma_rw_ctxt *ctxt;
>>> +
>>> +	svc_xprt_get(&rdma->sc_xprt);
>>> +
>>> +	spin_lock(&rdma->sc_rw_ctxt_lock);
>>> +	if (list_empty(&rdma->sc_rw_ctxts))
>>> +		goto out_empty;
>>> +
>>> +	ctxt = list_first_entry(&rdma->sc_rw_ctxts,
>>> +				struct svc_rdma_rw_ctxt, rw_list);
>>> +	list_del_init(&ctxt->rw_list);
>>> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
>>> +
>>> +out:
>>> +	ctxt->rw_dir = DMA_NONE;
>>> +	return ctxt;
>>> +
>>> +out_empty:
>>> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
>>> +
>>> +	ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
>>> +	if (!ctxt) {
>>> +		svc_xprt_put(&rdma->sc_xprt);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	ctxt->rw_rdma = rdma;
>>> +	INIT_LIST_HEAD(&ctxt->rw_list);
>>> +	sg_init_table(ctxt->rw_sg, ARRAY_SIZE(ctxt->rw_sg));
>>> +	goto out;
>>> +}
>>> +
>>> +static void svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt)
>>> +{
>>> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
>>> +
>>> +	if (ctxt->rw_dir != DMA_NONE)
>>> +		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
>>> +				    rdma->sc_port_num,
>>> +				    ctxt->rw_sg, ctxt->rw_nents,
>>> +				    ctxt->rw_dir);
>>> +
>> 
>> It's a bit odd to see a put_rw_ctxt that also destroys the context,
>> which doesn't exactly pair with get_rw_ctxt.
>> 
>> Maybe it'd be useful to explicitly do that outside the put.
> 
> The pairing is not obvious, but it is this:
> 
> svc_rdma_send_writes() does the svc_rdma_get_rw_ctxt().
> 
> -> If posting succeeds, svc_rdma_wc_write_ctx puts the ctx.
> 
> -> If posting fails, svc_rdma_send_writes puts the ctx.
> 
> Do you have a suggestion about how this could be more
> intuitively documented?
> 
> IIRC I combined these because the rdma_rw_ctx_destroy is
> always done just before putting the ctx back on the free
> list. It eliminates some code duplication, and ensures
> the ctx is always ready for the next svc_rdma_get_rw_ctxt().

I fixed this up; I think it is an improvement. Thanks for
the suggestion.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 05/14] svcrdma: Introduce local rdma_rw API helpers
@ 2017-03-24 22:19                 ` Chuck Lever
  0 siblings, 0 replies; 70+ messages in thread
From: Chuck Lever @ 2017-03-24 22:19 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-rdma, Linux NFS Mailing List


> On Mar 22, 2017, at 11:41 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> 
>> On Mar 22, 2017, at 10:17 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> 
>>> 
>>> The plan is to replace the local bespoke code that constructs and
>>> posts RDMA Read and Write Work Requests with calls to the rdma_rw
>>> API. This shares code with other RDMA-enabled ULPs that manages the
>>> gory details of buffer registration and posting Work Requests.
>>> 
>>> Some design notes:
>>> 
>>> o svc_xprt reference counting is modified, since one rdma_rw_ctx
>>>  generates one completion, no matter how many Write WRs are
>>>  posted. To accommodate the new reference counting scheme, a new
>>>  version of svc_rdma_send() is introduced.
>>> 
>>> o The structure of RPC-over-RDMA transport headers is flexible,
>>>  allowing multiple segments per Reply with arbitrary alignment.
>>>  Thus I did not take the further step of chaining Write WRs with
>>>  the Send WR containing the RPC Reply message. The Write and Send
>>>  WRs continue to be built by separate pieces of code.
>>> 
>>> o The current code builds the transport header as it is construct-
>>>  ing Write WRs. I've replaced that with marshaling of transport
>>>  header data items in a separate step. This is because the exact
>>>  structure of client-provided segments may not align with the
>>>  components of the server's reply xdr_buf, or the pages in the
>>>  page list. Thus parts of each client-provided segment may be
>>>  written at different points in the send path.
>>> 
>>> o Since the Write list and Reply chunk marshaling code is being
>>>  replaced, I took the opportunity to replace some of the C
>>>  structure-based XDR encoding code with more portable code that
>>>  instead uses pointer arithmetic.
>>> 
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>>> new file mode 100644
>>> index 0000000..1e76227
>>> --- /dev/null
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
>>> @@ -0,0 +1,785 @@
>>> +/*
>>> + * Copyright (c) 2016 Oracle.  All rights reserved.
>>> + *
>>> + * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
>>> + */
>>> +
>>> +#include <linux/sunrpc/rpc_rdma.h>
>>> +#include <linux/sunrpc/svc_rdma.h>
>>> +#include <linux/sunrpc/debug.h>
>>> +
>>> +#include <rdma/rw.h>
>>> +
>>> +#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
>>> +
>>> +/* Each R/W context contains state for one chain of RDMA Read or
>>> + * Write Work Requests (one RDMA segment to be read from or written
>>> + * back to the client).
>>> + *
>>> + * Each WR chain handles a single contiguous server-side buffer,
>>> + * because some registration modes (eg. FRWR) do not support a
>>> + * discontiguous scatterlist.
>>> + *
>>> + * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
>>> + * from a client may contain a unique R_key, so each WR chain moves
>>> + * one segment (or less) at a time.
>>> + *
>>> + * The scatterlist makes this data structure just over 8KB in size
>>> + * on 4KB-page platforms. As the size of this structure increases
>>> + * past one page, it becomes more likely that allocating one of these
>>> + * will fail. Therefore, these contexts are created on demand, but
>>> + * cached and reused until the controlling svcxprt_rdma is destroyed.
>>> + */
>>> +struct svc_rdma_rw_ctxt {
>>> +	struct list_head	rw_list;
>>> +	struct ib_cqe		rw_cqe;
>>> +	struct svcxprt_rdma	*rw_rdma;
>>> +	int			rw_nents;
>>> +	int			rw_wrcount;
>>> +	enum dma_data_direction	rw_dir;
>>> +	struct svc_rdma_op_ctxt	*rw_readctxt;
>>> +	struct rdma_rw_ctx	rw_ctx;
>>> +	struct scatterlist	rw_sg[RPCSVC_MAXPAGES];
>> 
>> Have you considered using sg_table with sg_alloc_table_chained?
>> 
>> See lib/sg_pool.c and nvme-rdma as a consumer.
> 
> That might be newer than my patches. I'll have a look.

I looked at the consumers of sg_alloc_table_chained, and
all these callers are immediately doing ib_dma_map_sg.

Nothing in svc_rdma_rw.c does a DMA map. It relies on
rdma_rw_ctx_init for that, and that API is passed a
scatterlist.

I don't see how I could use sg_alloc_table_chained here,
unless rdma_rw_ctx_init was modified to take a chained
sg_table instead of a scatterlist argument.

I suppose I could convert the client side to use it?
What do you think?
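
For scale, the size concern behind that suggestion can be worked out
by hand. The sketch below is a userspace model, not kernel code: the
MODEL_* constants and struct model_scatterlist are stand-ins that
assume a 1MB RPCSVC_MAXPAYLOAD, 4KB pages, and a 64-bit scatterlist
layout without SG debugging. Under those assumptions the rw_sg array
alone comes to 259 entries and 8288 bytes, which is consistent with
the "just over 8KB" figure in the block comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values, modeled on include/linux/sunrpc/svc.h. They are
 * not stated anywhere in this thread. */
#define MODEL_PAGE_SIZE   4096u
#define MODEL_MAXPAYLOAD  (1u * 1024u * 1024u)
#define MODEL_MAXPAGES \
	((MODEL_MAXPAYLOAD + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE + 2 + 1)

/* Hypothetical stand-in for struct scatterlist on a 64-bit build.
 * Fixed-width types keep the size the same on any host. */
struct model_scatterlist {
	uint64_t page_link;
	uint32_t offset;
	uint32_t length;
	uint64_t dma_address;
	uint32_t dma_length;	/* followed by 4 bytes of padding */
};

/* Bytes consumed by an embedded array of MODEL_MAXPAGES entries. */
size_t model_sg_array_bytes(void)
{
	return (size_t)MODEL_MAXPAGES * sizeof(struct model_scatterlist);
}

/* Print the numbers so they can be compared against the comment. */
void model_report(void)
{
	printf("entries=%u bytes=%zu pages=%.2f\n",
	       MODEL_MAXPAGES, model_sg_array_bytes(),
	       (double)model_sg_array_bytes() / MODEL_PAGE_SIZE);
}
```

A chained table would be assembled from smaller chunks rather than
one array of this size. Whether that helps here still depends on what
rdma_rw_ctx_init accepts, as noted above.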


>>> +};
>>> +
>>> +static struct svc_rdma_rw_ctxt *
>>> +svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma)
>>> +{
>>> +	struct svc_rdma_rw_ctxt *ctxt;
>>> +
>>> +	svc_xprt_get(&rdma->sc_xprt);
>>> +
>>> +	spin_lock(&rdma->sc_rw_ctxt_lock);
>>> +	if (list_empty(&rdma->sc_rw_ctxts))
>>> +		goto out_empty;
>>> +
>>> +	ctxt = list_first_entry(&rdma->sc_rw_ctxts,
>>> +				struct svc_rdma_rw_ctxt, rw_list);
>>> +	list_del_init(&ctxt->rw_list);
>>> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
>>> +
>>> +out:
>>> +	ctxt->rw_dir = DMA_NONE;
>>> +	return ctxt;
>>> +
>>> +out_empty:
>>> +	spin_unlock(&rdma->sc_rw_ctxt_lock);
>>> +
>>> +	ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
>>> +	if (!ctxt) {
>>> +		svc_xprt_put(&rdma->sc_xprt);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	ctxt->rw_rdma = rdma;
>>> +	INIT_LIST_HEAD(&ctxt->rw_list);
>>> +	sg_init_table(ctxt->rw_sg, ARRAY_SIZE(ctxt->rw_sg));
>>> +	goto out;
>>> +}
>>> +
>>> +static void svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt)
>>> +{
>>> +	struct svcxprt_rdma *rdma = ctxt->rw_rdma;
>>> +
>>> +	if (ctxt->rw_dir != DMA_NONE)
>>> +		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
>>> +				    rdma->sc_port_num,
>>> +				    ctxt->rw_sg, ctxt->rw_nents,
>>> +				    ctxt->rw_dir);
>>> +
>> 
>> It's a bit odd to see a put_rw_ctxt that also destroys the context,
>> which doesn't exactly pair with get_rw_ctxt.
>> 
>> Maybe it'd be useful to explicitly do that outside the put.
> 
> The pairing is not obvious, but it is this:
> 
> svc_rdma_send_writes() does the svc_rdma_get_rw_ctxt.
> 
> -> If posting succeeds, svc_rdma_wc_write_ctx puts the ctx.
> 
> -> If posting fails, svc_rdma_send_writes puts the ctx.
> 
> Do you have a suggestion about how this could be more
> intuitively documented?
> 
> IIRC I combined these because the rdma_rw_ctx_destroy is
> always done just before putting the ctx back on the free
> list. It eliminates some code duplication, and ensures
> the ctx is always ready for the next svc_rdma_get_rw_ctxt.

I fixed this up; I think it is an improvement. Thanks for
the suggestion.
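
The thread does not show the reworked code, so the following is only
a userspace sketch of what separating the two steps might look like.
Every name in it (model_xprt, model_get_ctxt, model_destroy_rw,
model_put_ctxt, model_free_cache) is hypothetical, locking is omitted,
and the rdma_rw_ctx_destroy call is replaced by a counter.

```c
#include <assert.h>
#include <stdlib.h>

enum model_dir { MODEL_DMA_NONE, MODEL_DMA_TO_DEVICE };

struct model_ctxt {
	struct model_ctxt *next;	/* free-list linkage */
	enum model_dir dir;		/* MODEL_DMA_NONE when idle */
};

struct model_xprt {
	struct model_ctxt *free_list;	/* cached, reusable contexts */
	int refcount;			/* stands in for svc_xprt_get/put */
	int destroy_calls;		/* stands in for rdma_rw_ctx_destroy */
};

/* Reuse a cached context if one exists, otherwise allocate one. */
struct model_ctxt *model_get_ctxt(struct model_xprt *x)
{
	struct model_ctxt *c = x->free_list;

	if (c) {
		x->free_list = c->next;
	} else {
		c = malloc(sizeof(*c));
		if (!c)
			return NULL;
	}
	c->next = NULL;
	c->dir = MODEL_DMA_NONE;
	x->refcount++;
	return c;
}

/* Explicit teardown step, called by the user before the put. */
void model_destroy_rw(struct model_xprt *x, struct model_ctxt *c)
{
	if (c->dir == MODEL_DMA_NONE)
		return;
	x->destroy_calls++;
	c->dir = MODEL_DMA_NONE;
}

/* The put now only returns the context to the cache. */
void model_put_ctxt(struct model_xprt *x, struct model_ctxt *c)
{
	c->next = x->free_list;
	x->free_list = c;
	x->refcount--;
}

/* Release every cached context, as at transport teardown. */
void model_free_cache(struct model_xprt *x)
{
	while (x->free_list) {
		struct model_ctxt *c = x->free_list;

		x->free_list = c->next;
		free(c);
	}
}
```

With this split, both the completion path and the post-failure path
would call the destroy step and then the put, so the get and the put
pair up directly.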


--
Chuck Lever





