* [PATCH v1 0/9] NFS/RDMA client patches for 4.5
From: Chuck Lever @ 2015-11-23 22:13 UTC
  To: linux-rdma, linux-nfs

For 4.5, I'd like to address the send queue accounting and
invalidation/unmap ordering issues Jason brought up a couple of
months ago. Here's a first shot at that.

Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5

---

Chuck Lever (9):
      xprtrdma: Add a safety margin for receive buffers
      xprtrdma: Move struct ib_send_wr off the stack
      xprtrdma: Introduce ro_unmap_sync method
      xprtrdma: Add ro_unmap_sync method for FRWR
      xprtrdma: Add ro_unmap_sync method for FMR
      xprtrdma: Add ro_unmap_sync method for all-physical registration
      SUNRPC: Introduce xprt_commit_rqst()
      xprtrdma: Invalidate in the RPC reply handler
      xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').


 include/linux/sunrpc/xprt.h        |    1 
 net/sunrpc/xprt.c                  |   14 +++
 net/sunrpc/xprtrdma/fmr_ops.c      |   64 +++++++++++++
 net/sunrpc/xprtrdma/frwr_ops.c     |  175 +++++++++++++++++++++++++++++++-----
 net/sunrpc/xprtrdma/physical_ops.c |   13 +++
 net/sunrpc/xprtrdma/rpc_rdma.c     |   21 ++++
 net/sunrpc/xprtrdma/verbs.c        |   12 +-
 net/sunrpc/xprtrdma/xprt_rdma.h    |   17 ++-
 8 files changed, 284 insertions(+), 33 deletions(-)

--
Chuck Lever

* [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers
From: Chuck Lever @ 2015-11-23 22:13 UTC
  To: linux-rdma, linux-nfs

Rarely, senders post a Send that is larger than the client's inline
threshold. That can be due to a bug, or the client and server may
not have communicated about their inline limits. RPC-over-RDMA
currently doesn't specify any particular limit on inline size, so
peers have to guess what it is.

It is fatal to the connection if the size of a Send is larger than
the client's receive buffer. The sender is likely to retry with the
same message size, so the workload is stuck at that point.

Follow Postel's robustness principle: be conservative in what you
do, be liberal in what you accept from others. Increase the size of
client receive buffers by a safety margin, and add a warning when
the inline threshold is exceeded during receive.

Note the Linux server's receive buffers are already page-sized.
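
As a minimal sketch of the sizing rule (an illustrative helper, not
part of the patch; it assumes the RPCRDMA_RECV_MARGIN macro this
patch defines):

/* A posted receive must be able to absorb any Send the peer posts;
 * a too-small receive completes in error and kills the connection.
 * Pad the advertised inline threshold to tolerate small overruns.
 */
static size_t rpcrdma_recv_alloc_size(size_t inline_rsize)
{
	return inline_rsize + RPCRDMA_RECV_MARGIN;
}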

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |    7 +++++++
 net/sunrpc/xprtrdma/verbs.c     |    6 +++++-
 net/sunrpc/xprtrdma/xprt_rdma.h |    5 +++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..a169252 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 	int rdmalen, status;
 	unsigned long cwnd;
 	u32 credits;
+	RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
 
 	dprintk("RPC:       %s: incoming rep %p\n", __func__, rep);
 
@@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 		goto out_badstatus;
 	if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
 		goto out_shortreply;
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+	cdata = &r_xprt->rx_data;
+	if (rep->rr_len > cdata->inline_rsize)
+		pr_warn("RPC: %u byte reply exceeds inline threshold\n",
+			rep->rr_len);
+#endif
 
 	headerp = rdmab_to_msg(rep->rr_rdmabuf);
 	if (headerp->rm_vers != rpcrdma_version)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index eadd1655..e3f12e2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
 	if (rep == NULL)
 		goto out;
 
-	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
+	/* The actual size of our receive buffers is increased slightly
+	 * to prevent small receive overruns from killing our connection.
+	 */
+	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
+					       RPCRDMA_RECV_MARGIN,
 					       GFP_KERNEL);
 	if (IS_ERR(rep->rr_rdmabuf)) {
 		rc = PTR_ERR(rep->rr_rdmabuf);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ac7f8d4..1b72ab1 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
 #define RPCRDMA_INLINE_PAD_VALUE(rq)\
 	rpcx_to_rdmad(rq->rq_xprt).padding
 
+/* To help prevent spurious connection shutdown, allow senders to
+ * overrun our receive inline threshold by a small bit.
+ */
+#define RPCRDMA_RECV_MARGIN	(128)
+
 /*
  * Statistics for RPCRDMA
  */


* [PATCH v1 2/9] xprtrdma: Move struct ib_send_wr off the stack
From: Chuck Lever @ 2015-11-23 22:14 UTC
  To: linux-rdma, linux-nfs

For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().

(This will be used for chaining LOCAL_INV requests in a later patch.)
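
Each WR can now be initialized in one call and posted later as part
of a chain, so it has to outlive the function that built it. A
minimal sketch of posting such a chain (hypothetical helper,
assuming persistently allocated WRs like the ones this patch adds):

#include <rdma/ib_verbs.h>

/* Link WRs through ->next, then hand the head of the chain to a
 * single ib_post_send(). The WR memory must stay valid for the
 * duration of the post.
 */
static int post_wr_chain(struct ib_qp *qp, struct ib_send_wr *first,
			 struct ib_send_wr *second)
{
	struct ib_send_wr *bad_wr;

	first->next = second;
	second->next = NULL;
	return ib_post_send(qp, first, &bad_wr);
}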

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   38 ++++++++++++++++++++------------------
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 ++
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 88cf9e7..31a4578 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -319,7 +319,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
 	struct rpcrdma_mw *mw;
 	struct rpcrdma_frmr *frmr;
 	struct ib_mr *mr;
-	struct ib_reg_wr reg_wr;
+	struct ib_reg_wr *reg_wr;
 	struct ib_send_wr *bad_wr;
 	int rc, i, n, dma_nents;
 	u8 key;
@@ -336,6 +336,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
 	frmr = &mw->r.frmr;
 	frmr->fr_state = FRMR_IS_VALID;
 	mr = frmr->fr_mr;
+	reg_wr = &frmr->fr_regwr;
 
 	if (nsegs > ia->ri_max_frmr_depth)
 		nsegs = ia->ri_max_frmr_depth;
@@ -381,19 +382,19 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
 	key = (u8)(mr->rkey & 0x000000FF);
 	ib_update_fast_reg_key(mr, ++key);
 
-	reg_wr.wr.next = NULL;
-	reg_wr.wr.opcode = IB_WR_REG_MR;
-	reg_wr.wr.wr_id = (uintptr_t)mw;
-	reg_wr.wr.num_sge = 0;
-	reg_wr.wr.send_flags = 0;
-	reg_wr.mr = mr;
-	reg_wr.key = mr->rkey;
-	reg_wr.access = writing ?
-			IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-			IB_ACCESS_REMOTE_READ;
+	reg_wr->wr.next = NULL;
+	reg_wr->wr.opcode = IB_WR_REG_MR;
+	reg_wr->wr.wr_id = (uintptr_t)mw;
+	reg_wr->wr.num_sge = 0;
+	reg_wr->wr.send_flags = 0;
+	reg_wr->mr = mr;
+	reg_wr->key = mr->rkey;
+	reg_wr->access = writing ?
+			 IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+			 IB_ACCESS_REMOTE_READ;
 
 	DECR_CQCOUNT(&r_xprt->rx_ep);
-	rc = ib_post_send(ia->ri_id->qp, &reg_wr.wr, &bad_wr);
+	rc = ib_post_send(ia->ri_id->qp, &reg_wr->wr, &bad_wr);
 	if (rc)
 		goto out_senderr;
 
@@ -423,23 +424,24 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
 	struct rpcrdma_mw *mw = seg1->rl_mw;
 	struct rpcrdma_frmr *frmr = &mw->r.frmr;
-	struct ib_send_wr invalidate_wr, *bad_wr;
+	struct ib_send_wr *invalidate_wr, *bad_wr;
 	int rc, nsegs = seg->mr_nsegs;
 
 	dprintk("RPC:       %s: FRMR %p\n", __func__, mw);
 
 	seg1->rl_mw = NULL;
 	frmr->fr_state = FRMR_IS_INVALID;
+	invalidate_wr = &mw->r.frmr.fr_invwr;
 
-	memset(&invalidate_wr, 0, sizeof(invalidate_wr));
-	invalidate_wr.wr_id = (unsigned long)(void *)mw;
-	invalidate_wr.opcode = IB_WR_LOCAL_INV;
-	invalidate_wr.ex.invalidate_rkey = frmr->fr_mr->rkey;
+	memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+	invalidate_wr->wr_id = (uintptr_t)mw;
+	invalidate_wr->opcode = IB_WR_LOCAL_INV;
+	invalidate_wr->ex.invalidate_rkey = frmr->fr_mr->rkey;
 	DECR_CQCOUNT(&r_xprt->rx_ep);
 
 	ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);
 	read_lock(&ia->ri_qplock);
-	rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+	rc = ib_post_send(ia->ri_id->qp, invalidate_wr, &bad_wr);
 	read_unlock(&ia->ri_qplock);
 	if (rc)
 		goto out_err;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 1b72ab1..2c34159 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -207,6 +207,8 @@ struct rpcrdma_frmr {
 	enum rpcrdma_frmr_state		fr_state;
 	struct work_struct		fr_work;
 	struct rpcrdma_xprt		*fr_xprt;
+	struct ib_reg_wr		fr_regwr;
+	struct ib_send_wr		fr_invwr;
 };
 
 struct rpcrdma_fmr {


* [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
From: Chuck Lever @ 2015-11-23 22:14 UTC
  To: linux-rdma, linux-nfs

In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchronously (the MR will be knocked down
and returned to the pool in the background).

To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.

The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.
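
A sketch of the intended call site (patch 8 of this series adds the
real one in rpcrdma_reply_handler; the helper name here is
hypothetical, and the types come from xprt_rdma.h):

/* Fence the server from all of this RPC's memory regions before
 * the waiting application is allowed to touch the reply data.
 */
static void rpcrdma_fence_rpc(struct rpcrdma_xprt *r_xprt,
			      struct rpcrdma_req *req)
{
	if (req->rl_nchunks)
		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
}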

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 2c34159..bd2022e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -371,6 +371,8 @@ struct rpcrdma_xprt;
 struct rpcrdma_memreg_ops {
 	int		(*ro_map)(struct rpcrdma_xprt *,
 				  struct rpcrdma_mr_seg *, int, bool);
+	void		(*ro_unmap_sync)(struct rpcrdma_xprt *,
+					 struct rpcrdma_req *);
 	int		(*ro_unmap)(struct rpcrdma_xprt *,
 				    struct rpcrdma_mr_seg *);
 	int		(*ro_open)(struct rpcrdma_ia *,


* [PATCH v1 4/9] xprtrdma: Add ro_unmap_sync method for FRWR
From: Chuck Lever @ 2015-11-23 22:14 UTC
  To: linux-rdma, linux-nfs

FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.

Note also, DMA unmapping is now done _after_ invalidation.
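
The wait relies on in-order send queue completion: only the last
LOCAL_INV in the chain is signaled, and its completion implies every
earlier WR has completed. A condensed sketch of that pattern, using
the fr_invwr, fr_waiter, and fr_linv_done fields this patch adds
(the helper name is hypothetical):

#include <rdma/ib_verbs.h>
#include <linux/completion.h>

/* Signal only the final WR and sleep until it completes. The real
 * code also warns and recovers MRs when the post fails.
 */
static int post_linv_chain_and_wait(struct ib_qp *qp,
				    struct ib_send_wr *chain,
				    struct rpcrdma_frmr *last)
{
	struct ib_send_wr *bad_wr;
	int rc;

	last->fr_invwr.send_flags = IB_SEND_SIGNALED;
	last->fr_waiter = true;
	init_completion(&last->fr_linv_done);

	rc = ib_post_send(qp, chain, &bad_wr);
	if (rc == 0)
		wait_for_completion(&last->fr_linv_done);
	return rc;
}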

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c  |  137 ++++++++++++++++++++++++++++++++++++++-
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 +
 2 files changed, 135 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 31a4578..5d58008 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -245,12 +245,14 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 		     rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
-/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs
+ * to be reset.
+ *
+ * WARNING: Only wr_id and status are reliable at this point
+ */
 static void
-frwr_sendcompletion(struct ib_wc *wc)
+__frwr_sendcompletion_flush(struct ib_wc *wc, struct rpcrdma_mw *r)
 {
-	struct rpcrdma_mw *r;
-
 	if (likely(wc->status == IB_WC_SUCCESS))
 		return;
 
@@ -261,9 +263,23 @@ frwr_sendcompletion(struct ib_wc *wc)
 	else
 		pr_warn("RPC:       %s: frmr %p error, status %s (%d)\n",
 			__func__, r, ib_wc_status_msg(wc->status), wc->status);
+
 	r->r.frmr.fr_state = FRMR_IS_STALE;
 }
 
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+	struct rpcrdma_mw *r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+	struct rpcrdma_frmr *f = &r->r.frmr;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS))
+		__frwr_sendcompletion_flush(wc, r);
+
+	if (f->fr_waiter)
+		complete(&f->fr_linv_done);
+}
+
 static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
@@ -335,6 +351,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
 	} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
 	frmr = &mw->r.frmr;
 	frmr->fr_state = FRMR_IS_VALID;
+	frmr->fr_waiter = false;
 	mr = frmr->fr_mr;
 	reg_wr = &frmr->fr_regwr;
 
@@ -414,6 +431,117 @@ out_senderr:
 	return rc;
 }
 
+static struct ib_send_wr *
+__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
+{
+	struct rpcrdma_mw *mw = seg->rl_mw;
+	struct rpcrdma_frmr *f = &mw->r.frmr;
+	struct ib_send_wr *invalidate_wr;
+
+	f->fr_waiter = false;
+	f->fr_state = FRMR_IS_INVALID;
+	invalidate_wr = &f->fr_invwr;
+
+	memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+	invalidate_wr->wr_id = (unsigned long)(void *)mw;
+	invalidate_wr->opcode = IB_WR_LOCAL_INV;
+	invalidate_wr->ex.invalidate_rkey = f->fr_mr->rkey;
+
+	return invalidate_wr;
+}
+
+static void
+__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+		 int rc)
+{
+	struct ib_device *device = r_xprt->rx_ia.ri_device;
+	struct rpcrdma_mw *mw = seg->rl_mw;
+	int nsegs = seg->mr_nsegs;
+
+	seg->rl_mw = NULL;
+
+	while (nsegs--)
+		rpcrdma_unmap_one(device, seg++);
+
+	if (!rc)
+		rpcrdma_put_mw(r_xprt, mw);
+	else
+		__frwr_queue_recovery(mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+	struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr;
+	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+	struct rpcrdma_mr_seg *seg;
+	unsigned int i, nchunks;
+	struct rpcrdma_frmr *f;
+	int rc;
+
+	dprintk("RPC:       %s: req %p\n", __func__, req);
+
+	/* ORDER: Invalidate all of the req's MRs first
+	 *
+	 * Chain the LOCAL_INV Work Requests and post them with
+	 * a single ib_post_send() call.
+	 */
+	invalidate_wrs = pos = prev = NULL;
+	seg = NULL;
+	for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+		seg = &req->rl_segments[i];
+
+		pos = __frwr_prepare_linv_wr(seg);
+
+		if (!invalidate_wrs)
+			invalidate_wrs = pos;
+		else
+			prev->next = pos;
+		prev = pos;
+
+		i += seg->mr_nsegs;
+	}
+	f = &seg->rl_mw->r.frmr;
+
+	/* Strong send queue ordering guarantees that when the
+	 * last WR in the chain completes, all WRs in the chain
+	 * are complete.
+	 */
+	f->fr_invwr.send_flags = IB_SEND_SIGNALED;
+	f->fr_waiter = true;
+	init_completion(&f->fr_linv_done);
+	INIT_CQCOUNT(&r_xprt->rx_ep);
+
+	/* Transport disconnect drains the receive CQ before it
+	 * replaces the QP. The RPC reply handler won't call us
+	 * unless ri_id->qp is a valid pointer.
+	 */
+	rc = ib_post_send(ia->ri_id->qp, invalidate_wrs, &bad_wr);
+	if (rc)
+		pr_warn("%s: ib_post_send failed %i\n", __func__, rc);
+
+	wait_for_completion(&f->fr_linv_done);
+
+	/* ORDER: Now DMA unmap all of the req's MRs, and return
+	 * them to the free MW list.
+	 */
+	for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+		seg = &req->rl_segments[i];
+
+		__frwr_dma_unmap(r_xprt, seg, rc);
+
+		i += seg->mr_nsegs;
+		seg->mr_nsegs = 0;
+	}
+
+	req->rl_nchunks = 0;
+}
+
 /* Post a LOCAL_INV Work Request to prevent further remote access
  * via RDMA READ or RDMA WRITE.
  */
@@ -473,6 +601,7 @@ frwr_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
 	.ro_map				= frwr_op_map,
+	.ro_unmap_sync			= frwr_op_unmap_sync,
 	.ro_unmap			= frwr_op_unmap,
 	.ro_open			= frwr_op_open,
 	.ro_maxpages			= frwr_op_maxpages,
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index bd2022e..45dd0b7 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -207,6 +207,8 @@ struct rpcrdma_frmr {
 	enum rpcrdma_frmr_state		fr_state;
 	struct work_struct		fr_work;
 	struct rpcrdma_xprt		*fr_xprt;
+	bool				fr_waiter;
+	struct completion		fr_linv_done;
 	struct ib_reg_wr		fr_regwr;
 	struct ib_send_wr		fr_invwr;
 };


* [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
From: Chuck Lever @ 2015-11-23 22:14 UTC
  To: linux-rdma, linux-nfs

FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.

1. Gather all the MRs for the RPC request onto a list, and invoke
   ib_unmap_fmr() once with that list. This reduces the number of
   doorbells when there is more than one MR to invalidate (see the
   sketch after this list).

2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
   This is critical after invalidating a Write chunk.
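
A condensed sketch of the batching in item 1 (an illustrative
helper; ib_fmr's list member is what ib_unmap_fmr() consumes):

#include <rdma/ib_verbs.h>
#include <linux/list.h>

/* Queue each FMR on a local list, then knock down the whole batch
 * with a single ib_unmap_fmr() call instead of one call per MR.
 */
static int unmap_fmr_batch(struct ib_fmr **fmrs, int count)
{
	LIST_HEAD(unmap_list);
	int i;

	for (i = 0; i < count; i++)
		list_add(&fmrs[i]->list, &unmap_list);

	return ib_unmap_fmr(&unmap_list);
}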

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/fmr_ops.c |   64 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f1e8daf..c14f3a4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -179,6 +179,69 @@ out_maperr:
 	return rc;
 }
 
+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+	struct ib_device *device = r_xprt->rx_ia.ri_device;
+	struct rpcrdma_mw *mw = seg->rl_mw;
+	int nsegs = seg->mr_nsegs;
+
+	seg->rl_mw = NULL;
+
+	while (nsegs--)
+		rpcrdma_unmap_one(device, seg++);
+
+	rpcrdma_put_mw(r_xprt, mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+	struct rpcrdma_mr_seg *seg;
+	unsigned int i, nchunks;
+	struct rpcrdma_mw *mw;
+	LIST_HEAD(unmap_list);
+	int rc;
+
+	dprintk("RPC:       %s: req %p\n", __func__, req);
+
+	/* ORDER: Invalidate all of the req's MRs first
+	 *
+	 * ib_unmap_fmr() is slow, so use a single call instead
+	 * of one call per mapped MR.
+	 */
+	for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+		seg = &req->rl_segments[i];
+		mw = seg->rl_mw;
+
+		list_add(&mw->r.fmr.fmr->list, &unmap_list);
+
+		i += seg->mr_nsegs;
+	}
+	rc = ib_unmap_fmr(&unmap_list);
+	if (rc)
+		pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+
+	/* ORDER: Now DMA unmap all of the req's MRs, and return
+	 * them to the free MW list.
+	 */
+	for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+		seg = &req->rl_segments[i];
+
+		__fmr_dma_unmap(r_xprt, seg);
+
+		i += seg->mr_nsegs;
+		seg->mr_nsegs = 0;
+	}
+
+	req->rl_nchunks = 0;
+}
+
 /* Use the ib_unmap_fmr() verb to prevent further remote
  * access via RDMA READ or RDMA WRITE.
  */
@@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
 	.ro_map				= fmr_op_map,
+	.ro_unmap_sync			= fmr_op_unmap_sync,
 	.ro_unmap			= fmr_op_unmap,
 	.ro_open			= fmr_op_open,
 	.ro_maxpages			= fmr_op_maxpages,


* [PATCH v1 6/9] xprtrdma: Add ro_unmap_sync method for all-physical registration
From: Chuck Lever @ 2015-11-23 22:14 UTC
  To: linux-rdma, linux-nfs

physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/physical_ops.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 617b76f..dbb302e 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -83,6 +83,18 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 	return 1;
 }
 
+/* DMA unmap all memory regions that were mapped for "req".
+ */
+static void
+physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+	struct ib_device *device = r_xprt->rx_ia.ri_device;
+	unsigned int i;
+
+	for (i = 0; req->rl_nchunks; --req->rl_nchunks)
+		rpcrdma_unmap_one(device, &req->rl_segments[i++]);
+}
+
 static void
 physical_op_destroy(struct rpcrdma_buffer *buf)
 {
@@ -90,6 +102,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
 	.ro_map				= physical_op_map,
+	.ro_unmap_sync			= physical_op_unmap_sync,
 	.ro_unmap			= physical_op_unmap,
 	.ro_open			= physical_op_open,
 	.ro_maxpages			= physical_op_maxpages,


* [PATCH v1 7/9] SUNRPC: Introduce xprt_commit_rqst()
From: Chuck Lever @ 2015-11-23 22:14 UTC
  To: linux-rdma, linux-nfs

I'm about to add code in the RPC/RDMA reply handler between the
xprt_lookup_rqst() and xprt_complete_rqst() call sites that needs
to execute outside of spinlock critical sections.

Add a hook to remove an rpc_rqst from the pending list once
the transport knows it's going to invoke xprt_complete_rqst().
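
The locking pattern this hook enables looks roughly like this (a
sketch of the reply handler flow, not the final code):

spin_lock_bh(&xprt->transport_lock);
rqst = xprt_lookup_rqst(xprt, headerp->rm_xid);
...
xprt_commit_rqst(rqst->rq_task);	/* rqst can't be matched again */
spin_unlock_bh(&xprt->transport_lock);

/* safe to sleep here, e.g. in ->ro_unmap_sync(), without
 * holding the transport lock */

spin_lock_bh(&xprt->transport_lock);
xprt_complete_rqst(rqst->rq_task, status);
spin_unlock_bh(&xprt->transport_lock);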

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/xprt.h    |    1 +
 net/sunrpc/xprt.c              |   14 ++++++++++++++
 net/sunrpc/xprtrdma/rpc_rdma.c |    4 ++++
 3 files changed, 19 insertions(+)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..ab6c3a5 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -366,6 +366,7 @@ void			xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
 void			xprt_write_space(struct rpc_xprt *xprt);
 void			xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
 struct rpc_rqst *	xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
+void			xprt_commit_rqst(struct rpc_task *task);
 void			xprt_complete_rqst(struct rpc_task *task, int copied);
 void			xprt_release_rqst_cong(struct rpc_task *task);
 void			xprt_disconnect_done(struct rpc_xprt *xprt);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..a5be4ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
 }
 
 /**
+ * xprt_commit_rqst - remove rqst from pending list early
+ * @task: RPC request to remove
+ *
+ * Caller holds transport lock.
+ */
+void xprt_commit_rqst(struct rpc_task *task)
+{
+	struct rpc_rqst *req = task->tk_rqstp;
+
+	list_del_init(&req->rq_list);
+}
+EXPORT_SYMBOL_GPL(xprt_commit_rqst);
+
+/**
  * xprt_complete_rqst - called when reply processing is complete
  * @task: RPC request that recently completed
  * @copied: actual number of bytes received from the transport
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index a169252..d7b9156 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -811,6 +811,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 	if (req->rl_reply)
 		goto out_duplicate;
 
+	xprt_commit_rqst(rqst->rq_task);
+	spin_unlock_bh(&xprt->transport_lock);
+
 	dprintk("RPC:       %s: reply 0x%p completes request 0x%p\n"
 		"                   RPC request 0x%p xid 0x%08x\n",
 			__func__, rep, req, rqst,
@@ -901,6 +904,7 @@ badheader:
 	else if (credits > r_xprt->rx_buf.rb_max_requests)
 		credits = r_xprt->rx_buf.rb_max_requests;
 
+	spin_lock_bh(&xprt->transport_lock);
 	cwnd = xprt->cwnd;
 	xprt->cwnd = credits << RPC_CWNDSHIFT;
 	if (xprt->cwnd > cwnd)


* [PATCH v1 8/9] xprtrdma: Invalidate in the RPC reply handler
From: Chuck Lever @ 2015-11-23 22:14 UTC
  To: linux-rdma, linux-nfs

There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.

The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.

This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each of those RPCs at the same time.

Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.

Now, the reply handler defers waking the next RPC until invalidation
is complete, so an RPC cannot be dispatched until there are enough
send queue resources for it to proceed.
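
Illustrative arithmetic (not the exact provider sizing): with FRWR,
each RPC can consume up to three send queue entries --

	WQEs per RPC   = 1 FASTREG + 1 Send + 1 LOCAL_INV
	queue depth   ~= max outstanding RPCs x 3

so if LOCAL_INVs from completed RPCs still occupy WQEs when new RPCs
post their FASTREG and Send WRs, the queue overflows. Waiting for
invalidation before waking the next RPC closes that window.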

Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.

Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d7b9156..4ad72ca 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -898,6 +898,16 @@ badheader:
 		break;
 	}
 
+	/* Invalidate and flush the data payloads before waking the
+	 * waiting application. This guarantees the memory region is
+	 * properly fenced from the server before the application
+	 * accesses the data. It also ensures proper send flow
+	 * control: waking the next RPC waits until this RPC has
+	 * relinquished all its Send Queue entries.
+	 */
+	if (req->rl_nchunks)
+		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
+
 	credits = be32_to_cpu(headerp->rm_credit);
 	if (credits == 0)
 		credits = 1;	/* don't deadlock */


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v1 9/9] xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').
@ 2015-11-23 22:15     ` Chuck Lever
  0 siblings, 0 replies; 78+ messages in thread
From: Chuck Lever @ 2015-11-23 22:15 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.

Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c     |    6 ++----
 net/sunrpc/xprtrdma/xprt_rdma.h |    6 ------
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e3f12e2..e3baf05 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -616,10 +616,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
 	/* set trigger for requesting send completion */
 	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
-	if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS)
-		ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS;
-	else if (ep->rep_cqinit <= 2)
-		ep->rep_cqinit = 0;
+	if (ep->rep_cqinit <= 2)
+		ep->rep_cqinit = 0;	/* always signal? */
 	INIT_CQCOUNT(ep);
 	init_waitqueue_head(&ep->rep_connect_wait);
 	INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 45dd0b7..c166aeb 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -88,12 +88,6 @@ struct rpcrdma_ep {
 	struct delayed_work	rep_connect_worker;
 };
 
-/*
- * Force a signaled SEND Work Request every so often,
- * in case the provider needs to do some housekeeping.
- */
-#define RPCRDMA_MAX_UNSIGNALED_SENDS	(32)
-
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers
@ 2015-11-24  0:55         ` Tom Talpey
  0 siblings, 0 replies; 78+ messages in thread
From: Tom Talpey @ 2015-11-24  0:55 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

On 11/23/2015 5:13 PM, Chuck Lever wrote:
> Rarely, senders post a Send that is larger than the client's inline
> threshold. That can be due to a bug, or the client and server may
> not have communicated about their inline limits. RPC-over-RDMA
> currently doesn't specify any particular limit on inline size, so
> peers have to guess what it is.
>
> It is fatal to the connection if the size of a Send is larger than
> the client's receive buffer. The sender is likely to retry with the
> same message size, so the workload is stuck at that point.
>
> Follow Postel's robustness principle: Be conservative in what you
> do, be liberal in what you accept from others. Increase the size of
> client receive buffers by a safety margin, and add a warning when
> the inline threshold is exceeded during receive.

Safety is good, but how do we know the chosen value is enough?
Isn't it better to fail the badly-composed request and be done
with it? Even if the stupid sender loops, which it will do
anyway.

>
> Note the Linux server's receive buffers are already page-sized.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>   net/sunrpc/xprtrdma/rpc_rdma.c  |    7 +++++++
>   net/sunrpc/xprtrdma/verbs.c     |    6 +++++-
>   net/sunrpc/xprtrdma/xprt_rdma.h |    5 +++++
>   3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index c10d969..a169252 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>   	int rdmalen, status;
>   	unsigned long cwnd;
>   	u32 credits;
> +	RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
>
>   	dprintk("RPC:       %s: incoming rep %p\n", __func__, rep);
>
> @@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>   		goto out_badstatus;
>   	if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
>   		goto out_shortreply;
> +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
> +	cdata = &r_xprt->rx_data;
> +	if (rep->rr_len > cdata->inline_rsize)
> +		pr_warn("RPC: %u byte reply exceeds inline threshold\n",
> +			rep->rr_len);
> +#endif
>
>   	headerp = rdmab_to_msg(rep->rr_rdmabuf);
>   	if (headerp->rm_vers != rpcrdma_version)
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index eadd1655..e3f12e2 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>   	if (rep == NULL)
>   		goto out;
>
> -	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
> +	/* The actual size of our receive buffers is increased slightly
> +	 * to prevent small receive overruns from killing our connection.
> +	 */
> +	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
> +					       RPCRDMA_RECV_MARGIN,
>   					       GFP_KERNEL);
>   	if (IS_ERR(rep->rr_rdmabuf)) {
>   		rc = PTR_ERR(rep->rr_rdmabuf);
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index ac7f8d4..1b72ab1 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
>   #define RPCRDMA_INLINE_PAD_VALUE(rq)\
>   	rpcx_to_rdmad(rq->rq_xprt).padding
>
> +/* To help prevent spurious connection shutdown, allow senders to
> + * overrun our receive inline threshold by a small bit.
> + */
> +#define RPCRDMA_RECV_MARGIN	(128)
> +
>   /*
>    * Statistics for RPCRDMA
>    */
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-24  0:57         ` Tom Talpey
  0 siblings, 0 replies; 78+ messages in thread
From: Tom Talpey @ 2015-11-24  0:57 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

On 11/23/2015 5:14 PM, Chuck Lever wrote:
> FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
> is a synchronous verb. However, some improvements can be made here.

I thought FMR support was about to be removed in the core.


>
> 1. Gather all the MRs for the RPC request onto a list, and invoke
>     ib_unmap_fmr() once with that list. This reduces the number of
>     doorbells when there is more than one MR to invalidate
>
> 2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
>     This is critical after invalidating a Write chunk.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>   net/sunrpc/xprtrdma/fmr_ops.c |   64 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 64 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index f1e8daf..c14f3a4 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -179,6 +179,69 @@ out_maperr:
>   	return rc;
>   }
>
> +static void
> +__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
> +{
> +	struct ib_device *device = r_xprt->rx_ia.ri_device;
> +	struct rpcrdma_mw *mw = seg->rl_mw;
> +	int nsegs = seg->mr_nsegs;
> +
> +	seg->rl_mw = NULL;
> +
> +	while (nsegs--)
> +		rpcrdma_unmap_one(device, seg++);
> +
> +	rpcrdma_put_mw(r_xprt, mw);
> +}
> +
> +/* Invalidate all memory regions that were registered for "req".
> + *
> + * Sleeps until it is safe for the host CPU to access the
> + * previously mapped memory regions.
> + */
> +static void
> +fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
> +{
> +	struct rpcrdma_mr_seg *seg;
> +	unsigned int i, nchunks;
> +	struct rpcrdma_mw *mw;
> +	LIST_HEAD(unmap_list);
> +	int rc;
> +
> +	dprintk("RPC:       %s: req %p\n", __func__, req);
> +
> +	/* ORDER: Invalidate all of the req's MRs first
> +	 *
> +	 * ib_unmap_fmr() is slow, so use a single call instead
> +	 * of one call per mapped MR.
> +	 */
> +	for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
> +		seg = &req->rl_segments[i];
> +		mw = seg->rl_mw;
> +
> +		list_add(&mw->r.fmr.fmr->list, &unmap_list);
> +
> +		i += seg->mr_nsegs;
> +	}
> +	rc = ib_unmap_fmr(&unmap_list);
> +	if (rc)
> +		pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
> +
> +	/* ORDER: Now DMA unmap all of the req's MRs, and return
> +	 * them to the free MW list.
> +	 */
> +	for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
> +		seg = &req->rl_segments[i];
> +
> +		__fmr_dma_unmap(r_xprt, seg);
> +
> +		i += seg->mr_nsegs;
> +		seg->mr_nsegs = 0;
> +	}
> +
> +	req->rl_nchunks = 0;
> +}
> +
>   /* Use the ib_unmap_fmr() verb to prevent further remote
>    * access via RDMA READ or RDMA WRITE.
>    */
> @@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
>
>   const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
>   	.ro_map				= fmr_op_map,
> +	.ro_unmap_sync			= fmr_op_unmap_sync,
>   	.ro_unmap			= fmr_op_unmap,
>   	.ro_open			= fmr_op_open,
>   	.ro_maxpages			= fmr_op_maxpages,
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 8/9] xprtrdma: Invalidate in the RPC reply handler
@ 2015-11-24  1:01         ` Tom Talpey
  0 siblings, 0 replies; 78+ messages in thread
From: Tom Talpey @ 2015-11-24  1:01 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

On 11/23/2015 5:14 PM, Chuck Lever wrote:
> There is a window between the time the RPC reply handler wakes the
> waiting RPC task and when xprt_release() invokes ops->buf_free.
> During this time, memory regions containing the data payload may
> still be accessed by a broken or malicious server, but the RPC
> application has already been allowed access to the memory containing
> the RPC request's data payloads.
>
> The server should be fenced from client memory containing RPC data
> payloads _before_ the RPC application is allowed to continue.

No concern, just Hurray for implementing this fundamental
integrity requirement. It's even more important when krb5i
is in play, to avoid truly malicious injection-after-integrity.

>
> This change also more strongly enforces send queue accounting. There
> is a maximum number of RPC calls allowed to be outstanding. When an
> RPC/RDMA transport is set up, just enough send queue resources are
> allocated to handle registration, Send, and invalidation WRs for
> each of those RPCs at the same time.
>
> Before, additional RPC calls could be dispatched while invalidation
> WRs were still consuming send WQEs. When invalidation WRs backed
> up, dispatching additional RPCs resulted in a send queue overrun.
>
> Now, the reply handler prevents RPC dispatch until invalidation is
> complete. This prevents RPC call dispatch until there are enough
> send queue resources to proceed.
>
> Still to do: If an RPC exits early (say, ^C), the reply handler has
> no opportunity to perform invalidation. Currently, xprt_rdma_free()
> still frees remaining RDMA resources, which could deadlock.
> Additional changes are needed to handle invalidation properly in this
> case.
>
> Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>   net/sunrpc/xprtrdma/rpc_rdma.c |   10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index d7b9156..4ad72ca 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -898,6 +898,16 @@ badheader:
>   		break;
>   	}
>
> +	/* Invalidate and flush the data payloads before waking the
> +	 * waiting application. This guarantees the memory region is
> +	 * properly fenced from the server before the application
> +	 * accesses the data. It also ensures proper send flow
> +	 * control: waking the next RPC waits until this RPC has
> +	 * relinquished all its Send Queue entries.
> +	 */
> +	if (req->rl_nchunks)
> +		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
> +
>   	credits = be32_to_cpu(headerp->rm_credit);
>   	if (credits == 0)
>   		credits = 1;	/* don't deadlock */
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers
@ 2015-11-24  1:16             ` Chuck Lever
  0 siblings, 0 replies; 78+ messages in thread
From: Chuck Lever @ 2015-11-24  1:16 UTC (permalink / raw)
  To: Tom Talpey; +Cc: linux-rdma, Linux NFS Mailing List


> On Nov 23, 2015, at 7:55 PM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 11/23/2015 5:13 PM, Chuck Lever wrote:
>> Rarely, senders post a Send that is larger than the client's inline
>> threshold. That can be due to a bug, or the client and server may
>> not have communicated about their inline limits. RPC-over-RDMA
>> currently doesn't specify any particular limit on inline size, so
>> peers have to guess what it is.
>> 
>> It is fatal to the connection if the size of a Send is larger than
>> the client's receive buffer. The sender is likely to retry with the
>> same message size, so the workload is stuck at that point.
>> 
>> Follow Postel's robustness principle: Be conservative in what you
>> do, be liberal in what you accept from others. Increase the size of
>> client receive buffers by a safety margin, and add a warning when
>> the inline threshold is exceeded during receive.
> 
> Safety is good, but how do we know the chosen value is enough?
> Isn't it better to fail the badly-composed request and be done
> with it? Even if the stupid sender loops, which it will do
> anyway.

It’s good enough to compensate for the most common sender bug,
which is that the sender did not account for the 28 bytes of
the RPC-over-RDMA header when it built the send buffer. The
additional 100 byte margin is gravy.

The loop occurs because the server gets a completion error.
The client just sees a connection loss. There’s no way for it
to know it should fail the RPC, so it keeps trying.

Perhaps the server could remember that the reply failed, and
when the client retransmits, it can simply return that XID
with an RDMA_ERROR.
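
For the archives, the arithmetic behind that margin (illustrative
only, using the constants from the patch quoted below):

	len = cdata->inline_rsize + RPCRDMA_RECV_MARGIN;
	/* RPCRDMA_RECV_MARGIN = 128 = 28 (the forgotten RPC-over-RDMA
	 * header) + 100 bytes of additional slack.
	 */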


>> Note the Linux server's receive buffers are already page-sized.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  net/sunrpc/xprtrdma/rpc_rdma.c  |    7 +++++++
>>  net/sunrpc/xprtrdma/verbs.c     |    6 +++++-
>>  net/sunrpc/xprtrdma/xprt_rdma.h |    5 +++++
>>  3 files changed, 17 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index c10d969..a169252 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>  	int rdmalen, status;
>>  	unsigned long cwnd;
>>  	u32 credits;
>> +	RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
>> 
>>  	dprintk("RPC:       %s: incoming rep %p\n", __func__, rep);
>> 
>> @@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>  		goto out_badstatus;
>>  	if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
>>  		goto out_shortreply;
>> +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>> +	cdata = &r_xprt->rx_data;
>> +	if (rep->rr_len > cdata->inline_rsize)
>> +		pr_warn("RPC: %u byte reply exceeds inline threshold\n",
>> +			rep->rr_len);
>> +#endif
>> 
>>  	headerp = rdmab_to_msg(rep->rr_rdmabuf);
>>  	if (headerp->rm_vers != rpcrdma_version)
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index eadd1655..e3f12e2 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>>  	if (rep == NULL)
>>  		goto out;
>> 
>> -	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
>> +	/* The actual size of our receive buffers is increased slightly
>> +	 * to prevent small receive overruns from killing our connection.
>> +	 */
>> +	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
>> +					       RPCRDMA_RECV_MARGIN,
>>  					       GFP_KERNEL);
>>  	if (IS_ERR(rep->rr_rdmabuf)) {
>>  		rc = PTR_ERR(rep->rr_rdmabuf);
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index ac7f8d4..1b72ab1 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
>>  #define RPCRDMA_INLINE_PAD_VALUE(rq)\
>>  	rpcx_to_rdmad(rq->rq_xprt).padding
>> 
>> +/* To help prevent spurious connection shutdown, allow senders to
>> + * overrun our receive inline threshold by a small bit.
>> + */
>> +#define RPCRDMA_RECV_MARGIN	(128)
>> +
>>  /*
>>   * Statistics for RPCRDMA
>>   */
>> 

--
Chuck Lever





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers
@ 2015-11-24  1:22                 ` Tom Talpey
  0 siblings, 0 replies; 78+ messages in thread
From: Tom Talpey @ 2015-11-24  1:22 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On 11/23/2015 8:16 PM, Chuck Lever wrote:
>
>> On Nov 23, 2015, at 7:55 PM, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 11/23/2015 5:13 PM, Chuck Lever wrote:
>>> Rarely, senders post a Send that is larger than the client's inline
>>> threshold. That can be due to a bug, or the client and server may
>>> not have communicated about their inline limits. RPC-over-RDMA
>>> currently doesn't specify any particular limit on inline size, so
>>> peers have to guess what it is.
>>>
>>> It is fatal to the connection if the size of a Send is larger than
>>> the client's receive buffer. The sender is likely to retry with the
>>> same message size, so the workload is stuck at that point.
>>>
>>> Follow Postel's robustness principle: Be conservative in what you
>>> do, be liberal in what you accept from others. Increase the size of
>>> client receive buffers by a safety margin, and add a warning when
>>> the inline threshold is exceeded during receive.
>>
>> Safety is good, but how do we know the chosen value is enough?
>> Isn't it better to fail the badly-composed request and be done
>> with it? Even if the stupid sender loops, which it will do
>> anyway.
>
> It’s good enough to compensate for the most common sender bug,
> which is that the sender did not account for the 28 bytes of
> the RPC-over-RDMA header when it built the send buffer. The
> additional 100 byte margin is gravy.

I think it's good to have sympathy and resilience to differing
designs on the other end of the wire, but I fail to have it for
stupid bugs. Unless this can take down the receiver, fail it fast.

MHO.

>
> The loop occurs because the server gets a completion error.
> The client just sees a connection loss. There’s no way for it
> to know it should fail the RPC, so it keeps trying.
>
> Perhaps the server could remember that the reply failed, and
> when the client retransmits, it can simply return that XID
> with an RDMA_ERROR.
>
>
>>> Note the Linux server's receive buffers are already page-sized.
>>>
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>>   net/sunrpc/xprtrdma/rpc_rdma.c  |    7 +++++++
>>>   net/sunrpc/xprtrdma/verbs.c     |    6 +++++-
>>>   net/sunrpc/xprtrdma/xprt_rdma.h |    5 +++++
>>>   3 files changed, 17 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> index c10d969..a169252 100644
>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> @@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>   	int rdmalen, status;
>>>   	unsigned long cwnd;
>>>   	u32 credits;
>>> +	RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
>>>
>>>   	dprintk("RPC:       %s: incoming rep %p\n", __func__, rep);
>>>
>>> @@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>   		goto out_badstatus;
>>>   	if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
>>>   		goto out_shortreply;
>>> +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>>> +	cdata = &r_xprt->rx_data;
>>> +	if (rep->rr_len > cdata->inline_rsize)
>>> +		pr_warn("RPC: %u byte reply exceeds inline threshold\n",
>>> +			rep->rr_len);
>>> +#endif
>>>
>>>   	headerp = rdmab_to_msg(rep->rr_rdmabuf);
>>>   	if (headerp->rm_vers != rpcrdma_version)
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index eadd1655..e3f12e2 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>>>   	if (rep == NULL)
>>>   		goto out;
>>>
>>> -	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
>>> +	/* The actual size of our receive buffers is increased slightly
>>> +	 * to prevent small receive overruns from killing our connection.
>>> +	 */
>>> +	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
>>> +					       RPCRDMA_RECV_MARGIN,
>>>   					       GFP_KERNEL);
>>>   	if (IS_ERR(rep->rr_rdmabuf)) {
>>>   		rc = PTR_ERR(rep->rr_rdmabuf);
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index ac7f8d4..1b72ab1 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
>>>   #define RPCRDMA_INLINE_PAD_VALUE(rq)\
>>>   	rpcx_to_rdmad(rq->rq_xprt).padding
>>>
>>> +/* To help prevent spurious connection shutdown, allow senders to
>>> + * overrun our receive inline threshold by a small bit.
>>> + */
>>> +#define RPCRDMA_RECV_MARGIN	(128)
>>> +
>>>   /*
>>>    * Statistics for RPCRDMA
>>>    */
>>>
>
> --
> Chuck Lever
>
>
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers
@ 2015-11-24  1:44                     ` Chuck Lever
  0 siblings, 0 replies; 78+ messages in thread
From: Chuck Lever @ 2015-11-24  1:44 UTC (permalink / raw)
  To: Tom Talpey; +Cc: linux-rdma, Linux NFS Mailing List


> On Nov 23, 2015, at 8:22 PM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 11/23/2015 8:16 PM, Chuck Lever wrote:
>> 
>>> On Nov 23, 2015, at 7:55 PM, Tom Talpey <tom@talpey.com> wrote:
>>> 
>>> On 11/23/2015 5:13 PM, Chuck Lever wrote:
>>>> Rarely, senders post a Send that is larger than the client's inline
>>>> threshold. That can be due to a bug, or the client and server may
>>>> not have communicated about their inline limits. RPC-over-RDMA
>>>> currently doesn't specify any particular limit on inline size, so
>>>> peers have to guess what it is.
>>>> 
>>>> It is fatal to the connection if the size of a Send is larger than
>>>> the client's receive buffer. The sender is likely to retry with the
>>>> same message size, so the workload is stuck at that point.
>>>> 
>>>> Follow Postel's robustness principle: Be conservative in what you
>>>> do, be liberal in what you accept from others. Increase the size of
>>>> client receive buffers by a safety margin, and add a warning when
>>>> the inline threshold is exceeded during receive.
>>> 
>>> Safety is good, but how do we know the chosen value is enough?
>>> Isn't it better to fail the badly-composed request and be done
>>> with it? Even if the stupid sender loops, which it will do
>>> anyway.
>> 
>> It’s good enough to compensate for the most common sender bug,
>> which is that the sender did not account for the 28 bytes of
>> the RPC-over-RDMA header when it built the send buffer. The
>> additional 100 byte margin is gravy.
> 
> I think it's good to have sympathy and resilience to differing
> designs on the other end of the wire, but I fail to have it for
> stupid bugs. Unless this can take down the receiver, fail it fast.
> 
> MHO.

See above: the client can’t tell why it’s failed.

Again, the Send on the server side fails with LOCAL_LEN_ERR
and the connection is terminated. The client sees only the
connection loss. There’s no distinction between this and a
cable pull or a server crash, where you do want the client
to retransmit.

I agree it’s a dumb server bug. But the idea is to preserve
the connection, since NFS retransmits are a hazard.
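
(For reference, the 28 bytes mentioned above is the fixed portion of
the RPC-over-RDMA header. A back-of-envelope sketch, using a
hypothetical struct laid out only for the arithmetic:

	/* Four fixed header words plus three empty chunk list
	 * discriminators.  Illustrative layout only. */
	struct demo_rpcrdma_hdr {
		__be32 rm_xid;		/*  4 bytes */
		__be32 rm_vers;		/*  4 bytes */
		__be32 rm_credit;	/*  4 bytes */
		__be32 rm_type;		/*  4 bytes */
		__be32 rm_nochunks[3];	/* 12 bytes: empty read, write,
					 * and reply chunk lists */
	};				/* total: 28 bytes */

so the 128-byte RPCRDMA_RECV_MARGIN covers the common header-size
mistake with roughly 100 bytes to spare.)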

Just floating this idea here, this is v1. This one can be
dropped.


>> The loop occurs because the server gets a completion error.
>> The client just sees a connection loss. There’s no way for it
>> to know it should fail the RPC, so it keeps trying.
>> 
>> Perhaps the server could remember that the reply failed, and
>> when the client retransmits, it can simply return that XID
>> with an RDMA_ERROR.
>> 
>> 
>>>> Note the Linux server's receive buffers are already page-sized.
>>>> 
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> ---
>>>>  net/sunrpc/xprtrdma/rpc_rdma.c  |    7 +++++++
>>>>  net/sunrpc/xprtrdma/verbs.c     |    6 +++++-
>>>>  net/sunrpc/xprtrdma/xprt_rdma.h |    5 +++++
>>>>  3 files changed, 17 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> index c10d969..a169252 100644
>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> @@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>  	int rdmalen, status;
>>>>  	unsigned long cwnd;
>>>>  	u32 credits;
>>>> +	RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
>>>> 
>>>>  	dprintk("RPC:       %s: incoming rep %p\n", __func__, rep);
>>>> 
>>>> @@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>  		goto out_badstatus;
>>>>  	if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
>>>>  		goto out_shortreply;
>>>> +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>>>> +	cdata = &r_xprt->rx_data;
>>>> +	if (rep->rr_len > cdata->inline_rsize)
>>>> +		pr_warn("RPC: %u byte reply exceeds inline threshold\n",
>>>> +			rep->rr_len);
>>>> +#endif
>>>> 
>>>>  	headerp = rdmab_to_msg(rep->rr_rdmabuf);
>>>>  	if (headerp->rm_vers != rpcrdma_version)
>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>>> index eadd1655..e3f12e2 100644
>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>> @@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>>>>  	if (rep == NULL)
>>>>  		goto out;
>>>> 
>>>> -	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
>>>> +	/* The actual size of our receive buffers is increased slightly
>>>> +	 * to prevent small receive overruns from killing our connection.
>>>> +	 */
>>>> +	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
>>>> +					       RPCRDMA_RECV_MARGIN,
>>>>  					       GFP_KERNEL);
>>>>  	if (IS_ERR(rep->rr_rdmabuf)) {
>>>>  		rc = PTR_ERR(rep->rr_rdmabuf);
>>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> index ac7f8d4..1b72ab1 100644
>>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> @@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
>>>>  #define RPCRDMA_INLINE_PAD_VALUE(rq)\
>>>>  	rpcx_to_rdmad(rq->rq_xprt).padding
>>>> 
>>>> +/* To help prevent spurious connection shutdown, allow senders to
>>>> + * overrun our receive inline threshold by a small bit.
>>>> + */
>>>> +#define RPCRDMA_RECV_MARGIN	(128)
>>>> +
>>>>  /*
>>>>   * Statistics for RPCRDMA
>>>>   */
>>>> 
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 
>> 

--
Chuck Lever





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24  6:45         ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2015-11-24  6:45 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs, Sagi Grimberg

On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
> In the current xprtrdma implementation, some memreg strategies
> implement ro_unmap synchronously (the MR is knocked down before the
> method returns) and some asynchronously (the MR will be knocked down
> and returned to the pool in the background).
> 
> To guarantee the MR is truly invalid before the RPC consumer is
> allowed to resume execution, we need an unmap method that is
> always synchronous, invoked from the RPC/RDMA reply handler.
> 
> The new method unmaps all MRs for an RPC. The existing ro_unmap
> method unmaps only one MR at a time.

Do we really want to go down that road?  It seems like we've decided
in general that, while the protocol specs say the MR must be unmapped
before proceeding with the data, that is painful enough that we ignore
the requirement.  E.g. iser only does the local invalidate just before
reusing the MR.

I'd like to hear arguments for and against each method. Instead of
adding more magic to drivers that either optimizes MR performance or
adds clunky workarounds to make it even slower, we should handle the
semantics we agreed upon in common code.
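
For reference, the shape of the hook being discussed, as a hedged
sketch of xprtrdma's per-strategy method table (my reading of the
patch descriptions, not code copied from the series):

	struct rpcrdma_memreg_ops {
		int	(*ro_map)(struct rpcrdma_xprt *,
				  struct rpcrdma_mr_seg *, int, bool);
		int	(*ro_unmap)(struct rpcrdma_xprt *,
				    struct rpcrdma_mr_seg *);
		/* New: does not return until every MR used by the
		 * RPC is invalid; called from the reply handler. */
		void	(*ro_unmap_sync)(struct rpcrdma_xprt *,
					 struct rpcrdma_req *);
		/* other methods omitted */
	};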

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-24  6:52             ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2015-11-24  6:52 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Chuck Lever, linux-rdma, linux-nfs, Sagi Grimberg, santosh.shilimkar

On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:
> On 11/23/2015 5:14 PM, Chuck Lever wrote:
> >FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
> >is a synchronous verb. However, some improvements can be made here.
> 
> I thought FMR support was about to be removed in the core.

Seems like everyone would love to kill it, but no one dares to do
it just yet.  Reasons to keep FMRs:

 - mthca doesn't support FRs but hasn't been staged out
 - RDS only supports FMRs for the IB transport (it does support FRs for
   an entirely separate iWarp transport.  I'd love to know why we can't
   just use that iWarp transport for IB as well).
 - mlx4 hardware might be slower with FRs than FMRs (Or Gerlitz
   mentioned this in reply to the iSER remote invalidation series).

So at least for 4.5 we're unlikely to be able to get rid of it, due
to the RDS issue alone.  We'll then need performance numbers for mlx4,
and figure out how much we care about mthca.
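
To make the difference concrete, a hedged sketch of the two unmap
models (helper names here are hypothetical):

	/* FMR: ib_unmap_fmr() is a synchronous verb.  It does not
	 * return until the mapping has been torn down. */
	static int demo_fmr_unmap(struct ib_fmr *fmr)
	{
		LIST_HEAD(fmr_list);

		list_add(&fmr->list, &fmr_list);
		return ib_unmap_fmr(&fmr_list);
	}

	/* FRWR: invalidation is just another work request; the MR
	 * is not actually invalid until its completion is reaped. */
	static int demo_frwr_invalidate(struct ib_qp *qp, struct ib_mr *mr)
	{
		struct ib_send_wr inv_wr, *bad_wr;

		memset(&inv_wr, 0, sizeof(inv_wr));
		inv_wr.opcode = IB_WR_LOCAL_INV;
		inv_wr.send_flags = IB_SEND_SIGNALED;
		inv_wr.ex.invalidate_rkey = mr->rkey;
		return ib_post_send(qp, &inv_wr, &bad_wr);
	}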

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-24  7:12                 ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2015-11-24  7:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tom Talpey, Chuck Lever, linux-rdma, linux-nfs, Sagi Grimberg,
	santosh.shilimkar

On Mon, Nov 23, 2015 at 10:52:26PM -0800, Christoph Hellwig wrote:
> 
> So at least for 4.5 we're unlikely to be able to get rid of it, due
> to the RDS issue alone.  We'll then need performance numbers for mlx4,
> and figure out how much we care about mthca.

mthca is unfortunately very popular in the used market, it is very
easy to get cards, build a cheap test cluster, etc. :(

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24  7:28             ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2015-11-24  7:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chuck Lever, linux-rdma, linux-nfs, Sagi Grimberg

On Mon, Nov 23, 2015 at 10:45:56PM -0800, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
> > In the current xprtrdma implementation, some memreg strategies
> > implement ro_unmap synchronously (the MR is knocked down before the
> > method returns) and some asynchronously (the MR will be knocked down
> > and returned to the pool in the background).
> > 
> > To guarantee the MR is truly invalid before the RPC consumer is
> > allowed to resume execution, we need an unmap method that is
> > always synchronous, invoked from the RPC/RDMA reply handler.
> > 
> > The new method unmaps all MRs for an RPC. The existing ro_unmap
> > method unmaps only one MR at a time.
> 
> Do we really want to go down that road?  It seems like we've decided
> in general that, while the protocol specs say the MR must be unmapped
> before proceeding with the data, that is painful enough that we
> ignore the requirement.

That is not my impression, I was thinking we keep finding that ULPs
are not implemented correctly. The various clean up exercises keep
exposing flaws.

The common code is intended to drive RDMA properly.

Async invalidating the rkey is fundamentally a security issue and
should be treated as such. The kernel never trades security for
performance without a user opt-in. This is the same logic we've used
for purging the global writable rkey stuff, even though that often had
a performance benefit.

> requirement.  E.g. iser for example only does the local invalidate
> just before reusing the MR.

Ugh :(

> I'd like to hear arguments for and against each method. Instead of
> adding more magic to drivers that either optimizes MR performance or
> adds clunky workarounds to make it even slower, we should handle the
> semantics we agreed upon in common code.

Common code should make it easy to do this right: an invalidate of the
MR ordered before the DMA unmap, which must complete before the buffer
is handed back to the caller, with easy support for send with
invalidate.
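
Something like the following, as a rough sketch (hypothetical helper,
FRWR-style MR; the completion would be signaled from the send CQ
handler):

	static void demo_unmap_sync(struct ib_qp *qp, struct ib_device *dev,
				    struct ib_mr *mr, struct scatterlist *sg,
				    int nents, struct completion *inv_done)
	{
		struct ib_send_wr inv_wr, *bad_wr;

		/* 1. Invalidate the rkey and wait for the completion,
		 *    so the remote can no longer touch the memory. */
		memset(&inv_wr, 0, sizeof(inv_wr));
		inv_wr.opcode = IB_WR_LOCAL_INV;
		inv_wr.send_flags = IB_SEND_SIGNALED;
		inv_wr.ex.invalidate_rkey = mr->rkey;
		if (!ib_post_send(qp, &inv_wr, &bad_wr))
			wait_for_completion(inv_done);

		/* 2. Only now is it safe to DMA unmap... */
		ib_dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);

		/* 3. ...and to hand the buffer back to the caller. */
	}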

If the common code has an opt-in to make some of these steps run
async, and that gives performance, then fine, but the default should be
secure operation.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24 10:59             ` Sagi Grimberg
  0 siblings, 0 replies; 78+ messages in thread
From: Sagi Grimberg @ 2015-11-24 10:59 UTC (permalink / raw)
  To: Christoph Hellwig, Chuck Lever; +Cc: linux-rdma, linux-nfs, Sagi Grimberg



On 24/11/2015 08:45, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
>> In the current xprtrdma implementation, some memreg strategies
>> implement ro_unmap synchronously (the MR is knocked down before the
>> method returns) and some asynchonously (the MR will be knocked down
>> and returned to the pool in the background).
>>
>> To guarantee the MR is truly invalid before the RPC consumer is
>> allowed to resume execution, we need an unmap method that is
>> always synchronous, invoked from the RPC/RDMA reply handler.
>>
>> The new method unmaps all MRs for an RPC. The existing ro_unmap
>> method unmaps only one MR at a time.
>
> Do we really want to go down that road?  It seems like we've decided
> in general that while the protocol specs say MR must be unmapped before
> proceeding with the data that is painful enough to ignore this
> requirement.  E.g. iser for example only does the local invalidate
> just before reusing the MR.

It is painful, too painful. The entire value proposition of RDMA is
low latency, and waiting for the extra HW round trip for a local
invalidation to complete is unacceptable; moreover it adds huge loads
of extra interrupts and cache-line pollution.

As I see it, if we don't wait for local-invalidate to complete before
unmap and IO completion (and no one does) then local invalidate before
re-use is only marginally worse. For iSER, remote invalidate solves this 
(patches submitted!) and I'd say we should push for all the
storage standards to include remote invalidate. There is the question
of multi-rkey transactions, which is why I stated in the past that
arbitrary sg registration is important (which will be submitted soon
for ConnectX-4).

Waiting for local invalidate to complete would be a really big
sacrifice for our storage ULPs.
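
(For reference, remote invalidation shows up at the receiver as a flag
in the completion; a minimal sketch of the check, assuming the consumer
kept the rkey it handed out:

	/* If the peer's Send carried a remote invalidate for our
	 * rkey, the local invalidate round trip can be skipped. */
	static bool demo_remotely_invalidated(const struct ib_wc *wc, u32 rkey)
	{
		return (wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
		       wc->ex.invalidate_rkey == rkey;
	}
)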

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-24 12:36                 ` Tom Talpey
  0 siblings, 0 replies; 78+ messages in thread
From: Tom Talpey @ 2015-11-24 12:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, linux-rdma, linux-nfs, Sagi Grimberg, santosh.shilimkar

On 11/24/2015 1:52 AM, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:
>> On 11/23/2015 5:14 PM, Chuck Lever wrote:
>>> FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
>>> is a synchronous verb. However, some improvements can be made here.
>>
>> I thought FMR support was about to be removed in the core.
>
> Seems like everyone would love to kill it, but no one dares to do
> it just yet.  Reasons to keep FMRs:
>
>   - mthca doesn't support FRs but hasn't been staged out
>   - RDS only supports FMRs for the IB transport (it does support FRs for
>     an entirely separate iWarp transport.  I'd love to know why we can't
>     just use that iWarp transport for IB as well).
>   - mlx4 hardware might be slower with FRs than FMRs (Or Gerlitz
>     mentioned this in reply to the iSER remote invalidation series).

Ok, let me add then.

Reasons to kill FMRs:

   - FMRs are UNSAFE. They protect on page boundaries, not byte
     boundaries, so a length error on an operation can corrupt data
     outside the I/O buffer remotely. To say nothing of maliciousness.

   - FMRs are SLOWER. The supposed performance improvement of FMR
     comes from the "mempool" that is often placed in front of them.
     These pools cache remotely registered regions, in the hope of
     reusing them without having to invalidate in between. See the
     first bullet for the ramifications of that.

   - FMRs are SYNCHRONOUS. The FMR verbs make direct calls into the
     verbs layer which block and/or do not return control to the caller
     as promptly as work queue operations do. The processor spends
     more time servicing the device, typically at raised irq.

   - FMRs are PROPRIETARY. Only one vendor supports them, only one
     of their NICs has no decent alternative, and that NIC is
     functionally obsolete.

If storage stacks continue to support them, I personally feel it
is critical to carefully document the risks to customer data as
part of shipping the code.

Tom.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24 13:43                 ` Tom Talpey
  0 siblings, 0 replies; 78+ messages in thread
From: Tom Talpey @ 2015-11-24 13:43 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Chuck Lever
  Cc: linux-rdma, linux-nfs, Sagi Grimberg

On 11/24/2015 5:59 AM, Sagi Grimberg wrote:
> As I see it, if we don't wait for local-invalidate to complete before
> unmap and IO completion (and no one does)

For the record, that is false. Windows quite strictly performs these
steps, and delivers millions of real 4K IOPS over SMB Direct.

> Waiting for local invalidate to complete would be a really big
> sacrifice for our storage ULPs.

Not waiting would also be a sacrifice, by compromising data integrity.
It's a tough tradeoff, but if you choose to go that way it will be
critical to be honest about the consequences to the data.

Tom.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24 14:39                 ` Chuck Lever
  0 siblings, 0 replies; 78+ messages in thread
From: Chuck Lever @ 2015-11-24 14:39 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma, Linux NFS Mailing List, Sagi Grimberg


> On Nov 24, 2015, at 5:59 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
> 
> 
> On 24/11/2015 08:45, Christoph Hellwig wrote:
>> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
>>> In the current xprtrdma implementation, some memreg strategies
>>> implement ro_unmap synchronously (the MR is knocked down before the
>>> method returns) and some asynchronously (the MR will be knocked down
>>> and returned to the pool in the background).
>>> 
>>> To guarantee the MR is truly invalid before the RPC consumer is
>>> allowed to resume execution, we need an unmap method that is
>>> always synchronous, invoked from the RPC/RDMA reply handler.
>>> 
>>> The new method unmaps all MRs for an RPC. The existing ro_unmap
>>> method unmaps only one MR at a time.
>> 
>> Do we really want to go down that road?  It seems like we've decided
>> in general that, while the protocol specs say the MR must be unmapped
>> before proceeding with the data, that is painful enough that we ignore
>> the requirement.  E.g. iser only does the local invalidate just before
>> reusing the MR.

That leaves the MR exposed to the remote indefinitely. If
the MR is registered for remote write, that seems hazardous.


> It is painful, too painful. The entire value proposition of RDMA is
> low latency, and waiting for the extra HW round trip for a local
> invalidation to complete is unacceptable; moreover it adds huge loads
> of extra interrupts and cache-line pollution.

The killer is the extra context switches, I’ve found.


> As I see it, if we don't wait for local-invalidate to complete before
> unmap and IO completion (and no one does) then local invalidate before
> re-use is only marginally worse. For iSER, remote invalidate solves this (patches submitted!) and I'd say we should push for all the
> storage standards to include remote invalidate.

I agree: the right answer is to use remote invalidation,
and to ensure the order is always:

  1. invalidate the MR
  2. unmap the MR
  3. wake up the consumer

And that is exactly my strategy for NFS/RDMA. I don’t have
a choice: as Tom observed yesterday, krb5i is meaningless
unless the integrity of the data is guaranteed by fencing
the server before the client performs checksumming. I
expect the same is true for T10-PI.


> There is the question
> of multi-rkey transactions, which is why I stated in the past that
> arbitrary sg registration is important (which will be submitted soon
> for ConnectX-4).
> 
> Waiting for local invalidate to complete would be a really big
> sacrifice for our storage ULPs.

I’ve noticed only a marginal loss of performance on modern
hardware.


--
Chuck Lever





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24 14:40                     ` Sagi Grimberg
  0 siblings, 0 replies; 78+ messages in thread
From: Sagi Grimberg @ 2015-11-24 14:40 UTC (permalink / raw)
  To: Tom Talpey, Christoph Hellwig, Chuck Lever
  Cc: linux-rdma, linux-nfs, Sagi Grimberg

Hey Tom,

> On 11/24/2015 5:59 AM, Sagi Grimberg wrote:
>> As I see it, if we don't wait for local-invalidate to complete before
>> unmap and IO completion (and no one does)
>
> For the record, that is false. Windows quite strictly performs these
> steps,

I meant Linux ULPs. I'm not very familiar with Windows SMBD.

> and deliver millions of real 4K IOPS over SMB Direct.

That's very encouraging! Is this client-side scaling?
From my experience, getting a storage client/initiator to scale
up to 1.5-3 MIOPs over a single HCA with a limited number of cores
is really a struggle for each interrupt and cacheline.

Still, the latency of each IO will see a noticeable increase.

>> Waiting for local invalidate to complete would be a really big
>> sacrifice for our storage ULPs.
>
> Not waiting would also be a sacrifice, by compromising data integrity.
> It's a tough tradeoff, but if you choose to go that way it will be
> critical to be honest about the consequences to the data.

That's true. I assume that the best compromise would be remote
invalidation, but some standards need enhancements, and it doesn't
solve multi-rkey transactions.

As storage devices are becoming faster we really should try to do our
best to keep up.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24 14:44                     ` Sagi Grimberg
  0 siblings, 0 replies; 78+ messages in thread
From: Sagi Grimberg @ 2015-11-24 14:44 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, linux-rdma, Linux NFS Mailing List, Sagi Grimberg

Hey Chuck,

>
>> It is painful, too painful. The entire value proposition of RDMA is
>> low latency, and waiting for the extra HW round trip for a local
>> invalidation to complete is unacceptable; moreover it adds huge loads
>> of extra interrupts and cache-line pollution.
>
> The killer is the extra context switches, I’ve found.

That too...

> I’ve noticed only a marginal loss of performance on modern
> hardware.

Would you mind sharing your observations?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24 15:20                         ` Chuck Lever
  0 siblings, 0 replies; 78+ messages in thread
From: Chuck Lever @ 2015-11-24 15:20 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma, Linux NFS Mailing List, Sagi Grimberg


> On Nov 24, 2015, at 9:44 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
> Hey Chuck,
> 
>> 
>>> It is painful, too painful. The entire value proposition of RDMA is
>>> low latency, and waiting for the extra HW round trip for a local
>>> invalidation to complete is unacceptable; moreover it adds huge loads
>>> of extra interrupts and cache-line pollution.
>> 
>> The killer is the extra context switches, I’ve found.
> 
> That too...
> 
>> I’ve noticed only a marginal loss of performance on modern
>> hardware.
> 
> Would you mind sharing your observations?

I’m testing with CX-3 Pro on FDR.

NFS READ and WRITE round trip latency, which includes the cost
of registration and now invalidation, is not noticeably longer.
dbench and fio results are marginally slower (in the neighborhood
of 5%).

For NFS, the cost of invalidation is probably not significant
compared to other bottlenecks in our stack (lock contention and
scheduling overhead are likely the largest contributors).

Notice that xprtrdma chains together all the LOCAL_INV WRs for
an RPC, and only signals the final one. Before, every LOCAL_INV
WR was signaled. So this patch actually reduces the send
completion rate.
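
The pattern, as a hedged sketch (hypothetical helper; the real code
lives in frwr_ops.c):

	/* Chain one LOCAL_INV WR per MR used by the RPC, and request
	 * a completion only for the last one: the HCA executes the
	 * chain in order, so one completion fences them all. */
	static int demo_post_inv_chain(struct ib_qp *qp, struct ib_mr **mrs,
				       struct ib_send_wr *wrs, int count)
	{
		struct ib_send_wr *bad_wr;
		int i;

		for (i = 0; i < count; i++) {
			memset(&wrs[i], 0, sizeof(wrs[i]));
			wrs[i].opcode = IB_WR_LOCAL_INV;
			wrs[i].ex.invalidate_rkey = mrs[i]->rkey;
			wrs[i].next = (i + 1 < count) ? &wrs[i + 1] : NULL;
		}
		wrs[count - 1].send_flags = IB_SEND_SIGNALED;

		return ib_post_send(qp, &wrs[0], &bad_wr);
	}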

The main benefit for NFS of waiting for invalidation to complete
is better send queue accounting. Even without the data integrity
issue, we have to ensure the WQEs consumed by invalidation
requests are released before dispatching another RPC. Otherwise
the send queue can be overrun.


--
Chuck Lever





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
@ 2015-11-24 18:57                     ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2015-11-24 18:57 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Christoph Hellwig, linux-rdma,
	Linux NFS Mailing List, Sagi Grimberg

On Tue, Nov 24, 2015 at 09:39:33AM -0500, Chuck Lever wrote:
> >> Do we really want to go down that road?  It seems like we've decided
> >> in general that while the protocol specs say MR must be unmapped before
> >> proceeding with the data that is painful enough to ignore this
> >> requirement.  E.g. iser for example only does the local invalidate
> >> just before reusing the MR.
> 
> That leaves the MR exposed to the remote indefinitely. If
> the MR is registered for remote write, that seems hazardous.

Agree.

If we have a performance knob it should be to post the invalidate,
then immediately dma unmap and start using the buffer, not to defer
the invalidate potentially indefinitely.  Then it is just a race to be
exploited, not a long-lived window into random page cache memory.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 7/9] SUNRPC: Introduce xprt_commit_rqst()
@ 2015-11-24 19:54         ` Anna Schumaker
  0 siblings, 0 replies; 78+ messages in thread
From: Anna Schumaker @ 2015-11-24 19:54 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Hi Chuck,

On 11/23/2015 05:14 PM, Chuck Lever wrote:
> I'm about to add code in the RPC/RDMA reply handler between the
> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
> to execute outside of spinlock critical sections.
> 
> Add a hook to remove an rpc_rqst from the pending list once
> the transport knows it's going to invoke xprt_complete_rqst().
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/sunrpc/xprt.h    |    1 +
>  net/sunrpc/xprt.c              |   14 ++++++++++++++
>  net/sunrpc/xprtrdma/rpc_rdma.c |    4 ++++
>  3 files changed, 19 insertions(+)
> 
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 69ef5b3..ab6c3a5 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -366,6 +366,7 @@ void			xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
>  void			xprt_write_space(struct rpc_xprt *xprt);
>  void			xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
>  struct rpc_rqst *	xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
> +void			xprt_commit_rqst(struct rpc_task *task);
>  void			xprt_complete_rqst(struct rpc_task *task, int copied);
>  void			xprt_release_rqst_cong(struct rpc_task *task);
>  void			xprt_disconnect_done(struct rpc_xprt *xprt);
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index 2e98f4a..a5be4ab 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
>  }
>  
>  /**
> + * xprt_commit_rqst - remove rqst from pending list early
> + * @task: RPC request to remove

Is xprt_commit_rqst() the right name for this function?  Removing a request from a list isn't how I would expect a commit to work.

Anna

> + *
> + * Caller holds transport lock.
> + */
> +void xprt_commit_rqst(struct rpc_task *task)
> +{
> +	struct rpc_rqst *req = task->tk_rqstp;
> +
> +	list_del_init(&req->rq_list);
> +}
> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
> +
> +/**
>   * xprt_complete_rqst - called when reply processing is complete
>   * @task: RPC request that recently completed
>   * @copied: actual number of bytes received from the transport
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index a169252..d7b9156 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -811,6 +811,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>  	if (req->rl_reply)
>  		goto out_duplicate;
>  
> +	xprt_commit_rqst(rqst->rq_task);
> +	spin_unlock_bh(&xprt->transport_lock);
> +
>  	dprintk("RPC:       %s: reply 0x%p completes request 0x%p\n"
>  		"                   RPC request 0x%p xid 0x%08x\n",
>  			__func__, rep, req, rqst,
> @@ -901,6 +904,7 @@ badheader:
>  	else if (credits > r_xprt->rx_buf.rb_max_requests)
>  		credits = r_xprt->rx_buf.rb_max_requests;
>  
> +	spin_lock_bh(&xprt->transport_lock);
>  	cwnd = xprt->cwnd;
>  	xprt->cwnd = credits << RPC_CWNDSHIFT;
>  	if (xprt->cwnd > cwnd)
> 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v1 7/9] SUNRPC: Introduce xprt_commit_rqst()
@ 2015-11-24 19:56             ` Chuck Lever
  0 siblings, 0 replies; 78+ messages in thread
From: Chuck Lever @ 2015-11-24 19:56 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, Linux NFS Mailing List


> On Nov 24, 2015, at 2:54 PM, Anna Schumaker <Anna.Schumaker@netapp.com> wrote:
> 
> Hi Chuck,
> 
> On 11/23/2015 05:14 PM, Chuck Lever wrote:
>> I'm about to add code in the RPC/RDMA reply handler between the
>> xprt_lookup_rqst() and xprt_complete_rqst() call sites that needs
>> to execute outside of spinlock critical sections.
>> 
>> Add a hook to remove an rpc_rqst from the pending list once
>> the transport knows it's going to invoke xprt_complete_rqst().
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> include/linux/sunrpc/xprt.h    |    1 +
>> net/sunrpc/xprt.c              |   14 ++++++++++++++
>> net/sunrpc/xprtrdma/rpc_rdma.c |    4 ++++
>> 3 files changed, 19 insertions(+)
>> 
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index 69ef5b3..ab6c3a5 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -366,6 +366,7 @@ void			xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
>> void			xprt_write_space(struct rpc_xprt *xprt);
>> void			xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
>> struct rpc_rqst *	xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>> +void			xprt_commit_rqst(struct rpc_task *task);
>> void			xprt_complete_rqst(struct rpc_task *task, int copied);
>> void			xprt_release_rqst_cong(struct rpc_task *task);
>> void			xprt_disconnect_done(struct rpc_xprt *xprt);
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index 2e98f4a..a5be4ab 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
>> }
>> 
>> /**
>> + * xprt_commit_rqst - remove rqst from pending list early
>> + * @task: RPC request to remove
> 
> Is xprt_commit_rqst() the right name for this function?  Removing a request from a list isn't how I would expect a commit to work.

“commit” means the request is committed: we have a parse-able
reply and will proceed to completion. The name does not reflect
the mechanism, but the policy.
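
To make the ordering concrete, the reply handler ends up doing
roughly this (a sketch only; error handling and reply parsing are
elided, and rpcrdma_invalidate() is a stand-in name for the
ro_unmap_sync call that patch 8 adds):

	spin_lock_bh(&xprt->transport_lock);
	rqst = xprt_lookup_rqst(xprt, headerp->rm_xid);

	/* The reply parses, so this request will complete. Take
	 * it off the pending list before dropping the lock. */
	xprt_commit_rqst(rqst->rq_task);
	spin_unlock_bh(&xprt->transport_lock);

	/* Invalidation may sleep, so it runs outside the
	 * critical section. */
	rpcrdma_invalidate(r_xprt, req);

	/* Re-take the lock only for the final bookkeeping. */
	spin_lock_bh(&xprt->transport_lock);
	xprt->cwnd = credits << RPC_CWNDSHIFT;
	xprt_complete_rqst(rqst->rq_task, status);
	spin_unlock_bh(&xprt->transport_lock);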

Any suggestions on a different name?


> Anna
> 
>> + *
>> + * Caller holds transport lock.
>> + */
>> +void xprt_commit_rqst(struct rpc_task *task)
>> +{
>> +	struct rpc_rqst *req = task->tk_rqstp;
>> +
>> +	list_del_init(&req->rq_list);
>> +}
>> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
>> +
>> +/**
>>  * xprt_complete_rqst - called when reply processing is complete
>>  * @task: RPC request that recently completed
>>  * @copied: actual number of bytes received from the transport
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index a169252..d7b9156 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -811,6 +811,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>> 	if (req->rl_reply)
>> 		goto out_duplicate;
>> 
>> +	xprt_commit_rqst(rqst->rq_task);
>> +	spin_unlock_bh(&xprt->transport_lock);
>> +
>> 	dprintk("RPC:       %s: reply 0x%p completes request 0x%p\n"
>> 		"                   RPC request 0x%p xid 0x%08x\n",
>> 			__func__, rep, req, rqst,
>> @@ -901,6 +904,7 @@ badheader:
>> 	else if (credits > r_xprt->rx_buf.rb_max_requests)
>> 		credits = r_xprt->rx_buf.rb_max_requests;
>> 
>> +	spin_lock_bh(&xprt->transport_lock);
>> 	cwnd = xprt->cwnd;
>> 	xprt->cwnd = credits << RPC_CWNDSHIFT;
>> 	if (xprt->cwnd > cwnd)
>> 

--
Chuck Lever





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-24 21:54                 ` santosh shilimkar
  0 siblings, 0 replies; 78+ messages in thread
From: santosh shilimkar @ 2015-11-24 21:54 UTC (permalink / raw)
  To: Christoph Hellwig, Tom Talpey
  Cc: Chuck Lever, linux-rdma, linux-nfs, Sagi Grimberg

Hi Christoph,

On 11/23/2015 10:52 PM, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:
>> On 11/23/2015 5:14 PM, Chuck Lever wrote:
>>> FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
>>> is a synchronous verb. However, some improvements can be made here.
>>
>> I thought FMR support was about to be removed in the core.
>
> Seems like everyone would love to kill it, but no one dares to do
> it just yet.  Reasons to keep FMRs:
>
>   - mthca doesn't support FRs but hasn't been staged out
>   - RDS only supports FMRs for the IB transport (it does support FRs for
>     an entirely separate iWarp transport.  I'd love to know why we can't
>     just use that iWarp transport for IB as well).
>   - mlx4 hardware might be slower with FRs than FMRs (Or mentioned this
>     in a reply to the iSER remote invalidation series).
>
> So at least for 4.5 we're unlikely to be able to get rid of it alone
> due to the RDS issue.  We'll then need performance numbers for mlx4,
> and figure out how much we care about mthca.
>
As already indicated to Sagi [1], RDS IB FR support is work in
progress and I was hoping to get it ready for 4.5. There are a few
issues we found with one of the HCAs, and hence the progress
slowed down. Looking at where we are, the 4.6 merge window seems
realistic for me to get RDS FR support in.

Now, on the iWARP transport itself, it has been a bit tough because
of the lack of hardware to test on. I have been requesting tests in
previous RDS patches and haven't seen any interest so far. If
this continues to be the trend, I might as well get rid of
RDS iWARP support after a couple of merge windows.

Regards,
Santosh
[1] http://www.spinics.net/lists/linux-nfs/msg53909.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-25  9:00                     ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2015-11-25  9:00 UTC (permalink / raw)
  To: santosh shilimkar
  Cc: Tom Talpey, Chuck Lever, linux-rdma, linux-nfs, Sagi Grimberg

On Tue, Nov 24, 2015 at 01:54:02PM -0800, santosh shilimkar wrote:
> As already indicated to Sagi [1], RDS IB FR support is work in
> progress and I was hoping to get it ready for 4.5. There are a few
> issues we found with one of the HCAs, and hence the progress
> slowed down. Looking at where we are, the 4.6 merge window seems
> realistic for me to get RDS FR support in.

Ok.

> Now, on the iWARP transport itself, it has been a bit tough because
> of the lack of hardware to test on. I have been requesting tests in
> previous RDS patches and haven't seen any interest so far. If
> this continues to be the trend, I might as well get rid of
> RDS iWARP support after a couple of merge windows.

I'd say drop the current iWarp transport if it's not testable.  The
only real difference between IB and iWarp is the need to create
an MR for the RDMA READ sink, and we're much better off adding that into
the current IB transport if new iWarp users show up.
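
For reference, the extra iWarp step is just a registration of the
READ sink buffer with REMOTE_WRITE access, roughly like the sketch
below (written from memory against the new ib_map_mr_sg/ib_reg_wr
API and untested; the pd, qp, sg and sg_nents setup is assumed):

	struct ib_reg_wr reg_wr = { };
	struct ib_send_wr *bad_wr;
	struct ib_mr *mr;

	mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, sg_nents);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	if (ib_map_mr_sg(mr, sg, sg_nents, PAGE_SIZE) != sg_nents)
		goto out_dereg;

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	/* iWarp requires REMOTE_WRITE access on an RDMA READ sink MR */
	reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;

	if (ib_post_send(qp, &reg_wr.wr, &bad_wr))
		goto out_dereg;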

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-25 17:09                         ` santosh shilimkar
  0 siblings, 0 replies; 78+ messages in thread
From: santosh shilimkar @ 2015-11-25 17:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tom Talpey, Chuck Lever, linux-rdma, linux-nfs, Sagi Grimberg

On 11/25/2015 1:00 AM, Christoph Hellwig wrote:
> On Tue, Nov 24, 2015 at 01:54:02PM -0800, santosh shilimkar wrote:
>> As already indicated to Sagi [1], RDS IB FR support is work in
>> progress and I was hoping to get it ready for 4.5. There are a few
>> issues we found with one of the HCAs, and hence the progress
>> slowed down. Looking at where we are, the 4.6 merge window seems
>> realistic for me to get RDS FR support in.
>
> Ok.
>
>> Now, on the iWARP transport itself, it has been a bit tough because
>> of the lack of hardware to test on. I have been requesting tests in
>> previous RDS patches and haven't seen any interest so far. If
>> this continues to be the trend, I might as well get rid of
>> RDS iWARP support after a couple of merge windows.
>
> I'd say drop the current iWarp transport if it's not testable.  The
> only real difference between IB and iWarp is the need to create
> an MR for the RDMA READ sink, and we're much better off adding that into
> the current IB transport if new iWarp users show up.

Agree. I will probably do it along with RDS IB FR support in 4.6

Regards,
Santosh

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-25 18:22                             ` Or Gerlitz
  0 siblings, 0 replies; 78+ messages in thread
From: Or Gerlitz @ 2015-11-25 18:22 UTC (permalink / raw)
  To: santosh shilimkar
  Cc: Christoph Hellwig, Tom Talpey, Chuck Lever, linux-rdma,
	linux-nfs, Sagi Grimberg

On Wed, Nov 25, 2015 at 7:09 PM, santosh shilimkar
<santosh.shilimkar@oracle.com> wrote:
>>> As already indicated to Sagi [1], RDS IB FR support is work in
>>> progress and I was hoping to get it ready for 4.5.

This is really good news! Can you please elaborate a bit on the
design changes this move introduces in RDS?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-25 19:17                                 ` santosh shilimkar
  0 siblings, 0 replies; 78+ messages in thread
From: santosh shilimkar @ 2015-11-25 19:17 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Christoph Hellwig, Tom Talpey, Chuck Lever, linux-rdma,
	linux-nfs, Sagi Grimberg

On 11/25/2015 10:22 AM, Or Gerlitz wrote:
> On Wed, Nov 25, 2015 at 7:09 PM, santosh shilimkar
> <santosh.shilimkar@oracle.com> wrote:
>>>> As already indicated to Sagi [1], RDS IB FR support is work in
>>>> progress and I was hoping to get it ready for 4.5.
>
> This is really good news! Can you please elaborate a bit on the
> design changes this move introduces in RDS?
>
Yeah. It has been a bit of a pain point, since the need
was to keep the RDS design the same and retrofit the FR
support so that it can co-exist with the existing deployed
FMR code.

Leaving the details for the code review, but at a very
high level:

- Split the poll CQ handling so that send + FR WR
completions can be handled together. The FR CQ handler
and the reg/inv WR prep mark the MR state (INVALID,
VALID & STALE) appropriately.
- Allocate 2X space on the WR and WC queues during queue
setup.
- Manage the MR reg/inv based on the space available
in the FR WR ring (actually it is just a counter). This
is a bit tricky because RDS does MR operations via
sendmsg() as well as directly through socket APIs, so it
needs coordination.

I'm hoping the above remains true when the code actually
makes it to the list, but that is how things stand as
of now.
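
A rough sketch of the above state tracking and ring
accounting (illustrative names only, not the actual RDS
code):

	/* MR life cycle as the FR CQ handler and WR prep see it */
	enum rds_ib_fr_state {
		FRMR_IS_INVALID,	/* reg WR needed or posted */
		FRMR_IS_VALID,		/* registration complete; usable */
		FRMR_IS_STALE,		/* flushed or failed; recover */
	};

	/* The "FR WR ring" is just a free-slot counter sized for
	 * the 2X queues; a reg or inv WR is posted only if a slot
	 * can be claimed first. */
	static bool rds_ib_claim_fr_slot(atomic_t *fr_slots)
	{
		return atomic_add_unless(fr_slots, -1, 0) != 0;
	}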

Regards,
Santosh


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-25 19:28                             ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2015-11-25 19:28 UTC (permalink / raw)
  To: santosh shilimkar
  Cc: Christoph Hellwig, Tom Talpey, Chuck Lever, linux-rdma,
	linux-nfs, Sagi Grimberg

On Wed, Nov 25, 2015 at 09:09:20AM -0800, santosh shilimkar wrote:
> >I'd say drop the current iWarp transport if it's not testable.  The
> >only real difference between IB and iWarp is the need to create
> >an MR for the RDMA READ sink, and we're much better off adding that into
> >the current IB transport if new iWarp users show up.
> 
> Agree. I will probably do it along with RDS IB FR support in 4.6

Great news!

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-11-26 10:01                                 ` Sagi Grimberg
  0 siblings, 0 replies; 78+ messages in thread
From: Sagi Grimberg @ 2015-11-26 10:01 UTC (permalink / raw)
  To: Jason Gunthorpe, santosh shilimkar
  Cc: Christoph Hellwig, Tom Talpey, Chuck Lever, linux-rdma,
	linux-nfs, Sagi Grimberg



On 25/11/2015 21:28, Jason Gunthorpe wrote:
> On Wed, Nov 25, 2015 at 09:09:20AM -0800, santosh shilimkar wrote:
>>> I'd say drop the current iWarp transport if it's not testable.  The
>>> only real difference between IB and iWarp is the need to create
>>> an MR for the RDMA READ sink, and we're much better off adding that into
>>> the current IB transport if new iWarp users show up.
>>
>> Agree. I will probably do it along with RDS IB FR support in 4.6
>
> Great news!

Great news indeed!

Looking forward to seeing this.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-12-01 15:33                     ` Chuck Lever
  0 siblings, 0 replies; 78+ messages in thread
From: Chuck Lever @ 2015-12-01 15:33 UTC (permalink / raw)
  To: Linux RDMA Mailing List
  Cc: Christoph Hellwig, Jason Gunthorpe, Tom Talpey,
	Linux NFS Mailing List, Sagi Grimberg, santosh shilimkar


> On Nov 24, 2015, at 2:12 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> On Mon, Nov 23, 2015 at 10:52:26PM -0800, Christoph Hellwig wrote:
>> 
>> So at lest for 4.5 we're unlikely to be able to get rid of it alone
>> due to the RDS issue.  We'll then need performance numbers for mlx4,
>> and figure out how much we care about mthca.
> 
> mthca is unfortunately very popular on the used market; it is very
> easy to get cards, build a cheap test cluster, etc. :(

Oracle recently announced Sonoma, which is a SPARC CPU with
an on-chip IB HCA. Oracle plans to publish an open-source
GPL device driver that enables this HCA in Linux for SPARC.
We’d eventually like to contribute it to the upstream
Linux kernel.

At the moment, the device and driver support only FMR. As
you might expect, Oracle needs it to work at least with
in-kernel RDS. Thus we hope to see the life of in-kernel
FMR extended for a bit to accommodate this new device.


--
Chuck Lever





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
@ 2015-12-02 12:27                         ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2015-12-02 12:27 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Linux RDMA Mailing List, Christoph Hellwig, Jason Gunthorpe,
	Tom Talpey, Linux NFS Mailing List, Sagi Grimberg,
	santosh shilimkar

On Tue, Dec 01, 2015 at 10:33:18AM -0500, Chuck Lever wrote:
> Oracle recently announced Sonoma, which is a SPARC CPU with
> an on-chip IB HCA. Oracle plans to publish an open-source
> GPL device driver that enables this HCA in Linux for SPARC.
> We'd eventually like to contribute it to the upstream
> Linux kernel.
> 
> At the moment, the device and driver support only FMR. As
> you might expect, Oracle needs it to work at least with
> in-kernel RDS. Thus we hope to see the life of in-kernel
> FMR extended for a bit to accommodate this new device.

Oh, that's pretty sad news.  I wonder what the engineers were smoking...

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2015-12-02 12:27 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-23 22:13 [PATCH v1 0/9] NFS/RDMA client patches for 4.5 Chuck Lever
2015-11-23 22:13 ` Chuck Lever
     [not found] ` <20151123220627.32702.62667.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2015-11-23 22:13   ` [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers Chuck Lever
2015-11-23 22:13     ` Chuck Lever
     [not found]     ` <20151123221357.32702.59922.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2015-11-24  0:55       ` Tom Talpey
2015-11-24  0:55         ` Tom Talpey
     [not found]         ` <5653B586.705-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2015-11-24  1:16           ` Chuck Lever
2015-11-24  1:16             ` Chuck Lever
     [not found]             ` <D946BAC3-26D5-4801-BD50-9F026EEF6551-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2015-11-24  1:22               ` Tom Talpey
2015-11-24  1:22                 ` Tom Talpey
     [not found]                 ` <5653BBCB.8010107-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2015-11-24  1:44                   ` Chuck Lever
2015-11-24  1:44                     ` Chuck Lever
2015-11-23 22:14   ` [PATCH v1 2/9] xprtrdma: Move struct ib_send_wr off the stack Chuck Lever
2015-11-23 22:14     ` Chuck Lever
2015-11-23 22:14   ` [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method Chuck Lever
2015-11-23 22:14     ` Chuck Lever
     [not found]     ` <20151123221414.32702.87638.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2015-11-24  6:45       ` Christoph Hellwig
2015-11-24  6:45         ` Christoph Hellwig
     [not found]         ` <20151124064556.GA29141-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-11-24  7:28           ` Jason Gunthorpe
2015-11-24  7:28             ` Jason Gunthorpe
2015-11-24 10:59           ` Sagi Grimberg
2015-11-24 10:59             ` Sagi Grimberg
     [not found]             ` <565442F5.7080400-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-11-24 13:43               ` Tom Talpey
2015-11-24 13:43                 ` Tom Talpey
     [not found]                 ` <5654697E.1030809-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2015-11-24 14:40                   ` Sagi Grimberg
2015-11-24 14:40                     ` Sagi Grimberg
2015-11-24 14:39               ` Chuck Lever
2015-11-24 14:39                 ` Chuck Lever
     [not found]                 ` <4B2D7C66-31AC-44F3-A8CC-22CC7136015C-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2015-11-24 14:44                   ` Sagi Grimberg
2015-11-24 14:44                     ` Sagi Grimberg
     [not found]                     ` <565477CC.5070309-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-11-24 15:20                       ` Chuck Lever
2015-11-24 15:20                         ` Chuck Lever
2015-11-24 18:57                   ` Jason Gunthorpe
2015-11-24 18:57                     ` Jason Gunthorpe
2015-11-23 22:14   ` [PATCH v1 4/9] xprtrdma: Add ro_unmap_sync method for FRWR Chuck Lever
2015-11-23 22:14     ` Chuck Lever
2015-11-23 22:14   ` [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR Chuck Lever
2015-11-23 22:14     ` Chuck Lever
     [not found]     ` <20151123221430.32702.86114.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2015-11-24  0:57       ` Tom Talpey
2015-11-24  0:57         ` Tom Talpey
     [not found]         ` <5653B606.3070700-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2015-11-24  6:52           ` Future of FMR support, was: " Christoph Hellwig
2015-11-24  6:52             ` Christoph Hellwig
     [not found]             ` <20151124065225.GB29141-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-11-24  7:12               ` Jason Gunthorpe
2015-11-24  7:12                 ` Jason Gunthorpe
     [not found]                 ` <20151124071215.GC23597-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-12-01 15:33                   ` Chuck Lever
2015-12-01 15:33                     ` Chuck Lever
     [not found]                     ` <AE86B182-6000-4437-8502-9D2C5EC3B09D-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2015-12-02 12:27                       ` Christoph Hellwig
2015-12-02 12:27                         ` Christoph Hellwig
2015-11-24 12:36               ` Tom Talpey
2015-11-24 12:36                 ` Tom Talpey
2015-11-24 21:54               ` santosh shilimkar
2015-11-24 21:54                 ` santosh shilimkar
     [not found]                 ` <5654DC7A.7080807-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2015-11-25  9:00                   ` Christoph Hellwig
2015-11-25  9:00                     ` Christoph Hellwig
     [not found]                     ` <20151125090009.GA11255-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-11-25 17:09                       ` santosh shilimkar
2015-11-25 17:09                         ` santosh shilimkar
     [not found]                         ` <5655EB40.8070508-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2015-11-25 18:22                           ` Or Gerlitz
2015-11-25 18:22                             ` Or Gerlitz
     [not found]                             ` <CAJ3xEMgC5aa8pvvvs0y8O+h_WBk-em07gHvOgP=DFDrv4ydGAg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-11-25 19:17                               ` santosh shilimkar
2015-11-25 19:17                                 ` santosh shilimkar
2015-11-25 19:28                           ` Jason Gunthorpe
2015-11-25 19:28                             ` Jason Gunthorpe
     [not found]                             ` <20151125192824.GD3223-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-11-26 10:01                               ` Sagi Grimberg
2015-11-26 10:01                                 ` Sagi Grimberg
2015-11-23 22:14   ` [PATCH v1 6/9] xprtrdma: Add ro_unmap_sync method for all-physical registration Chuck Lever
2015-11-23 22:14     ` Chuck Lever
2015-11-23 22:14   ` [PATCH v1 7/9] SUNRPC: Introduce xprt_commit_rqst() Chuck Lever
2015-11-23 22:14     ` Chuck Lever
     [not found]     ` <20151123221446.32702.24797.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2015-11-24 19:54       ` Anna Schumaker
2015-11-24 19:54         ` Anna Schumaker
     [not found]         ` <5654C079.7060507-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
2015-11-24 19:56           ` Chuck Lever
2015-11-24 19:56             ` Chuck Lever
2015-11-23 22:14   ` [PATCH v1 8/9] xprtrdma: Invalidate in the RPC reply handler Chuck Lever
2015-11-23 22:14     ` Chuck Lever
     [not found]     ` <20151123221454.32702.76062.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2015-11-24  1:01       ` Tom Talpey
2015-11-24  1:01         ` Tom Talpey
2015-11-23 22:15   ` [PATCH v1 9/9] xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit') Chuck Lever
2015-11-23 22:15     ` Chuck Lever
