* [PATCH 0/8] NFS/RDMA patches for review
@ 2014-04-14 22:22 ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:22 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

I'm interested in comments and testing, especially with
RPCRDMA_REGISTER mode because I don't have that capability set up on
my client yet.

These apply on top of Allen Andrews' and Steve Wise's recent patches,
which are included in the nfs-rdma-client topic branch at

  git://git.linux-nfs.org/projects/cel/cel-2.6.git

---

Chuck Lever (8):
      xprtrdma: Reduce the number of hardway buffer allocations
      xprtrdma: Split the completion queue
      xprtrdma: Make rpcrdma_ep_destroy() return void
      xprtrdma: Simplify rpcrdma_deregister_external() synopsis
      xprtrdma: Remove support for MEMWINDOWS registration mode
      xprtrdma: Disable ALLPHYSICAL mode by default
      xprtrdma: Remove BOUNCEBUFFERS memory registration mode
      xprtrdma: RPC/RDMA must invoke xprt_wake_pending_tasks() in process context


 include/linux/sunrpc/xprtrdma.h |    3 
 net/sunrpc/Kconfig              |   15 +
 net/sunrpc/xprtrdma/rpc_rdma.c  |   56 +----
 net/sunrpc/xprtrdma/transport.c |   32 ---
 net/sunrpc/xprtrdma/verbs.c     |  457 ++++++++++++++-------------------------
 net/sunrpc/xprtrdma/xprt_rdma.h |   10 -
 6 files changed, 195 insertions(+), 378 deletions(-)

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 1/8] xprtrdma: RPC/RDMA must invoke xprt_wake_pending_tasks() in process context
@ 2014-04-14 22:22     ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:22 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

An IB provider can invoke rpcrdma_conn_func() in an IRQ context,
so rpcrdma_conn_func() must not directly invoke generic RPC
functions such as xprt_wake_pending_tasks(). Instead, defer the
wake-up to a delayed work item that runs in process context.
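
The shape of the fix is the standard deferral pattern: the upcall only
schedules a work item, and the work handler, running in process
context, performs the actual wake-up. A minimal sketch of that pattern
follows; the names (my_ep, my_conn_func, wake_rpc_tasks) are
illustrative stand-ins, not the identifiers used in the patch below.

	#include <linux/kernel.h>
	#include <linux/workqueue.h>

	struct my_ep {
		struct delayed_work	connect_worker;
		/* ... connection state ... */
	};

	/* Stand-in for xprt_wake_pending_tasks(); it may sleep or take
	 * bh-safe locks, so it must run in process context. */
	static void wake_rpc_tasks(struct my_ep *ep)
	{
		/* wake RPC tasks waiting on this endpoint */
	}

	/* Work handler: runs later from a kworker, in process context. */
	static void my_connect_worker(struct work_struct *work)
	{
		struct my_ep *ep =
			container_of(work, struct my_ep, connect_worker.work);

		wake_rpc_tasks(ep);
	}

	/* May be called from an IRQ-context upcall: only schedule here. */
	static void my_conn_func(struct my_ep *ep)
	{
		schedule_delayed_work(&ep->connect_worker, 0);
	}

	/* Setup and teardown, mirroring what the patch adds to
	 * rpcrdma_ep_create() and rpcrdma_ep_destroy():
	 *
	 *	INIT_DELAYED_WORK(&ep->connect_worker, my_connect_worker);
	 *	cancel_delayed_work_sync(&ep->connect_worker);
	 */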

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 net/sunrpc/xprtrdma/rpc_rdma.c  |   22 +++++++++++++++-------
 net/sunrpc/xprtrdma/verbs.c     |    3 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |    3 +++
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 400aa1b..c296468 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -676,15 +676,11 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
 	rqst->rq_private_buf = rqst->rq_rcv_buf;
 }
 
-/*
- * This function is called when an async event is posted to
- * the connection which changes the connection state. All it
- * does at this point is mark the connection up/down, the rpc
- * timers do the rest.
- */
 void
-rpcrdma_conn_func(struct rpcrdma_ep *ep)
+rpcrdma_connect_worker(struct work_struct *work)
 {
+	struct rpcrdma_ep *ep =
+		container_of(work, struct rpcrdma_ep, rep_connect_worker.work);
 	struct rpc_xprt *xprt = ep->rep_xprt;
 
 	spin_lock_bh(&xprt->transport_lock);
@@ -701,6 +697,18 @@ rpcrdma_conn_func(struct rpcrdma_ep *ep)
 }
 
 /*
+ * This function is called when an async event is posted to
+ * the connection which changes the connection state. All it
+ * does at this point is mark the connection up/down, the rpc
+ * timers do the rest.
+ */
+void
+rpcrdma_conn_func(struct rpcrdma_ep *ep)
+{
+	schedule_delayed_work(&ep->rep_connect_worker, 0);
+}
+
+/*
  * This function is called when memory window unbind which we are waiting
  * for completes. Just use rr_func (zeroed by upcall) to signal completion.
  */
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5c07d11..422cf89 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -742,6 +742,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 	INIT_CQCOUNT(ep);
 	ep->rep_ia = ia;
 	init_waitqueue_head(&ep->rep_connect_wait);
+	INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
 
 	/*
 	 * Create a single cq for receive dto and mw_bind (only ever
@@ -817,6 +818,8 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 	dprintk("RPC:       %s: entering, connected is %d\n",
 		__func__, ep->rep_connected);
 
+	cancel_delayed_work_sync(&ep->rep_connect_worker);
+
 	if (ia->ri_id->qp) {
 		rc = rpcrdma_ep_disconnect(ep, ia);
 		if (rc)
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 98340a3..c620d13 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -43,6 +43,7 @@
 #include <linux/wait.h> 		/* wait_queue_head_t, etc */
 #include <linux/spinlock.h> 		/* spinlock_t, etc */
 #include <linux/atomic.h>			/* atomic_t, etc */
+#include <linux/workqueue.h>		/* struct work_struct */
 
 #include <rdma/rdma_cm.h>		/* RDMA connection api */
 #include <rdma/ib_verbs.h>		/* RDMA verbs api */
@@ -87,6 +88,7 @@ struct rpcrdma_ep {
 	struct rpc_xprt		*rep_xprt;	/* for rep_func */
 	struct rdma_conn_param	rep_remote_cma;
 	struct sockaddr_storage	rep_remote_addr;
+	struct delayed_work	rep_connect_worker;
 };
 
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
@@ -336,6 +338,7 @@ int rpcrdma_deregister_external(struct rpcrdma_mr_seg *,
 /*
  * RPC/RDMA connection management calls - xprtrdma/rpc_rdma.c
  */
+void rpcrdma_connect_worker(struct work_struct *);
 void rpcrdma_conn_func(struct rpcrdma_ep *);
 void rpcrdma_reply_handler(struct rpcrdma_rep *);
 


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 2/8] xprtrdma: Remove BOUNCEBUFFERS memory registration mode
@ 2014-04-14 22:22     ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:22 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

Clean up: This memory registration mode is slow and was never
meant for use in production environments. Remove it to reduce
implementation complexity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 net/sunrpc/xprtrdma/rpc_rdma.c  |    8 --------
 net/sunrpc/xprtrdma/transport.c |   13 -------------
 net/sunrpc/xprtrdma/verbs.c     |    5 +----
 3 files changed, 1 insertions(+), 25 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c296468..b963e50 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -439,14 +439,6 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 		wtype = rpcrdma_noch;
 	BUG_ON(rtype != rpcrdma_noch && wtype != rpcrdma_noch);
 
-	if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_BOUNCEBUFFERS &&
-	    (rtype != rpcrdma_noch || wtype != rpcrdma_noch)) {
-		/* forced to "pure inline"? */
-		dprintk("RPC:       %s: too much data (%d/%d) for inline\n",
-			__func__, rqst->rq_rcv_buf.len, rqst->rq_snd_buf.len);
-		return -1;
-	}
-
 	hdrlen = 28; /*sizeof *headerp;*/
 	padlen = 0;
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 1eb9c46..8c5035a 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -503,18 +503,6 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
 		 * If the allocation or registration fails, the RPC framework
 		 * will (doggedly) retry.
 		 */
-		if (rpcx_to_rdmax(xprt)->rx_ia.ri_memreg_strategy ==
-				RPCRDMA_BOUNCEBUFFERS) {
-			/* forced to "pure inline" */
-			dprintk("RPC:       %s: too much data (%zd) for inline "
-					"(r/w max %d/%d)\n", __func__, size,
-					rpcx_to_rdmad(xprt).inline_rsize,
-					rpcx_to_rdmad(xprt).inline_wsize);
-			size = req->rl_size;
-			rpc_exit(task, -EIO);		/* fail the operation */
-			rpcx_to_rdmax(xprt)->rx_stats.failed_marshal_count++;
-			goto out;
-		}
 		if (task->tk_flags & RPC_TASK_SWAPPER)
 			nreq = kmalloc(sizeof *req + size, GFP_ATOMIC);
 		else
@@ -543,7 +531,6 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
 		req = nreq;
 	}
 	dprintk("RPC:       %s: size %zd, request 0x%p\n", __func__, size, req);
-out:
 	req->rl_connect_cookie = 0;	/* our reserved value */
 	return req->rl_xdr_buf;
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 422cf89..1fe554f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -557,7 +557,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 	 * adapter.
 	 */
 	switch (memreg) {
-	case RPCRDMA_BOUNCEBUFFERS:
 	case RPCRDMA_REGISTER:
 	case RPCRDMA_FRMR:
 		break;
@@ -778,9 +777,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
 	/* Client offers RDMA Read but does not initiate */
 	ep->rep_remote_cma.initiator_depth = 0;
-	if (ia->ri_memreg_strategy == RPCRDMA_BOUNCEBUFFERS)
-		ep->rep_remote_cma.responder_resources = 0;
-	else if (devattr.max_qp_rd_atom > 32)	/* arbitrary but <= 255 */
+	if (devattr.max_qp_rd_atom > 32)	/* arbitrary but <= 255 */
 		ep->rep_remote_cma.responder_resources = 32;
 	else
 		ep->rep_remote_cma.responder_resources = devattr.max_qp_rd_atom;


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 3/8] xprtrdma: Disable ALLPHYSICAL mode by default
@ 2014-04-14 22:22     ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:22 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

ALLPHYSICAL is not a safe memory registration mode because it
permits NFS servers to write anywhere in a client's memory.  NFS
server bugs could result in client memory being overwritten.

ALLPHYSICAL can be useful for embedded systems whose adapters do
not support more surgical RDMA memory registration and protection
methods. However, enterprise Linux distributions have expressed a
desire to disable it or remove it entirely.

As a compromise (or as a step towards deprecation), add a CONFIG
setting (by default N) to enable ALLPHYSICAL support.

ALLPHYSICAL is now never a fallback choice when HCAs do not support
the selected memory registration mode.
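
A minimal sketch of the compile-time gate this patch introduces: the
CONFIG symbol is the one added to net/sunrpc/Kconfig below, but the
enum, function, and comments here are illustrative stand-ins rather
than the transport's actual code. With the option set to n, the
ALLPHYSICAL case compiles away and that strategy simply cannot be
selected.

	enum memreg_strategy {
		MEMREG_DEFAULT,
		MEMREG_ALLPHYSICAL,
	};

	static void setup_memreg(enum memreg_strategy strategy)
	{
		switch (strategy) {
	#ifdef CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL
		case MEMREG_ALLPHYSICAL:
			/* Fast but unsafe: a single DMA MR exposes all
			 * of client memory to the server. */
			break;
	#endif
		default:
			/* Per-request registration: slower, but only the
			 * memory involved in each RPC is ever exposed. */
			break;
		}
	}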

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 include/linux/sunrpc/xprtrdma.h |    3 ---
 net/sunrpc/Kconfig              |   15 +++++++++++++++
 net/sunrpc/xprtrdma/verbs.c     |   22 +++-------------------
 3 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
index c2f04e1..54e9b47 100644
--- a/include/linux/sunrpc/xprtrdma.h
+++ b/include/linux/sunrpc/xprtrdma.h
@@ -61,9 +61,6 @@
 
 #define RPCRDMA_INLINE_PAD_THRESH  (512)/* payload threshold to pad (bytes) */
 
-/* memory registration strategies */
-#define RPCRDMA_PERSISTENT_REGISTRATION (1)
-
 enum rpcrdma_memreg {
 	RPCRDMA_BOUNCEBUFFERS = 0,
 	RPCRDMA_REGISTER,
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index 0754d0f..af56554 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -58,6 +58,21 @@ config SUNRPC_XPRT_RDMA_CLIENT
 
 	  If unsure, say N.
 
+config SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL
+	bool "Enable ALLPHYSICAL memory registration mode"
+	depends on SUNRPC_XPRT_RDMA_CLIENT
+	default N
+	help
+	  This option enables support for the ALLPHYSICAL memory
+	  registration mode.
+
+	  This mode is very fast but not safe because it registers
+	  and exposes all of local memory. This could allow an
+	  NFS server bug to corrupt client memory.
+
+	  N is almost always the correct choice unless you are
+	  sure what you are doing.
+
 config SUNRPC_XPRT_RDMA_SERVER
 	tristate "RPC over RDMA Server Support"
 	depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1fe554f..6373232 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -506,19 +506,11 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 		break;
 	case RPCRDMA_MTHCAFMR:
 		if (!ia->ri_id->device->alloc_fmr) {
-#if RPCRDMA_PERSISTENT_REGISTRATION
-			dprintk("RPC:       %s: MTHCAFMR registration "
-				"specified but not supported by adapter, "
-				"using riskier RPCRDMA_ALLPHYSICAL\n",
-				__func__);
-			memreg = RPCRDMA_ALLPHYSICAL;
-#else
 			dprintk("RPC:       %s: MTHCAFMR registration "
 				"specified but not supported by adapter, "
 				"using slower RPCRDMA_REGISTER\n",
 				__func__);
 			memreg = RPCRDMA_REGISTER;
-#endif
 		}
 		break;
 	case RPCRDMA_FRMR:
@@ -526,19 +518,11 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 		if ((devattr.device_cap_flags &
 		     (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) !=
 		    (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) {
-#if RPCRDMA_PERSISTENT_REGISTRATION
-			dprintk("RPC:       %s: FRMR registration "
-				"specified but not supported by adapter, "
-				"using riskier RPCRDMA_ALLPHYSICAL\n",
-				__func__);
-			memreg = RPCRDMA_ALLPHYSICAL;
-#else
 			dprintk("RPC:       %s: FRMR registration "
 				"specified but not supported by adapter, "
 				"using slower RPCRDMA_REGISTER\n",
 				__func__);
 			memreg = RPCRDMA_REGISTER;
-#endif
 		} else {
 			/* Mind the ia limit on FRMR page list depth */
 			ia->ri_max_frmr_depth = min_t(unsigned int,
@@ -560,7 +544,7 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 	case RPCRDMA_REGISTER:
 	case RPCRDMA_FRMR:
 		break;
-#if RPCRDMA_PERSISTENT_REGISTRATION
+#ifdef CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL
 	case RPCRDMA_ALLPHYSICAL:
 		mem_priv = IB_ACCESS_LOCAL_WRITE |
 				IB_ACCESS_REMOTE_WRITE |
@@ -1824,7 +1808,7 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
 
 	switch (ia->ri_memreg_strategy) {
 
-#if RPCRDMA_PERSISTENT_REGISTRATION
+#ifdef CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL
 	case RPCRDMA_ALLPHYSICAL:
 		rpcrdma_map_one(ia, seg, writing);
 		seg->mr_rkey = ia->ri_bind_mem->rkey;
@@ -1870,7 +1854,7 @@ rpcrdma_deregister_external(struct rpcrdma_mr_seg *seg,
 
 	switch (ia->ri_memreg_strategy) {
 
-#if RPCRDMA_PERSISTENT_REGISTRATION
+#ifdef CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL
 	case RPCRDMA_ALLPHYSICAL:
 		BUG_ON(nsegs != 1);
 		rpcrdma_unmap_one(ia, seg);


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 4/8] xprtrdma: Remove support for MEMWINDOWS registration mode
@ 2014-04-14 22:22     ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:22 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

The MEMWINDOWS and MEMWINDOWS_ASYNC memory registration modes were
intended as stop-gap modes before the introduction of FRMR. They
are now considered obsolete.

MEMWINDOWS_ASYNC is also considered unsafe because it can leave
client memory registered and exposed for an indeterminate time after
each I/O.

At this point, the MEMWINDOWS modes add needless complexity, so
remove them.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 net/sunrpc/xprtrdma/rpc_rdma.c  |   34 --------
 net/sunrpc/xprtrdma/transport.c |    9 --
 net/sunrpc/xprtrdma/verbs.c     |  165 +--------------------------------------
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 
 4 files changed, 7 insertions(+), 203 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index b963e50..e4af26a 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -202,7 +202,6 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
 		return 0;
 
 	do {
-		/* bind/register the memory, then build chunk from result. */
 		int n = rpcrdma_register_external(seg, nsegs,
 						cur_wchunk != NULL, r_xprt);
 		if (n <= 0)
@@ -701,16 +700,6 @@ rpcrdma_conn_func(struct rpcrdma_ep *ep)
 }
 
 /*
- * This function is called when memory window unbind which we are waiting
- * for completes. Just use rr_func (zeroed by upcall) to signal completion.
- */
-static void
-rpcrdma_unbind_func(struct rpcrdma_rep *rep)
-{
-	wake_up(&rep->rr_unbind);
-}
-
-/*
  * Called as a tasklet to do req/reply match and complete a request
  * Errors must result in the RPC task either being awakened, or
  * allowed to timeout, to discover the errors at that time.
@@ -724,7 +713,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 	struct rpc_xprt *xprt = rep->rr_xprt;
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
 	__be32 *iptr;
-	int i, rdmalen, status;
+	int rdmalen, status;
 
 	/* Check status. If bad, signal disconnect and return rep to pool */
 	if (rep->rr_len == ~0U) {
@@ -853,27 +842,6 @@ badheader:
 		break;
 	}
 
-	/* If using mw bind, start the deregister process now. */
-	/* (Note: if mr_free(), cannot perform it here, in tasklet context) */
-	if (req->rl_nchunks) switch (r_xprt->rx_ia.ri_memreg_strategy) {
-	case RPCRDMA_MEMWINDOWS:
-		for (i = 0; req->rl_nchunks-- > 1;)
-			i += rpcrdma_deregister_external(
-				&req->rl_segments[i], r_xprt, NULL);
-		/* Optionally wait (not here) for unbinds to complete */
-		rep->rr_func = rpcrdma_unbind_func;
-		(void) rpcrdma_deregister_external(&req->rl_segments[i],
-						   r_xprt, rep);
-		break;
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-		for (i = 0; req->rl_nchunks--;)
-			i += rpcrdma_deregister_external(&req->rl_segments[i],
-							 r_xprt, NULL);
-		break;
-	default:
-		break;
-	}
-
 	dprintk("RPC:       %s: xprt_complete_rqst(0x%p, 0x%p, %d)\n",
 			__func__, xprt, rqst, status);
 	xprt_complete_rqst(rqst->rq_task, status);
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c5035a..c23b0c1 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -566,9 +566,7 @@ xprt_rdma_free(void *buffer)
 		__func__, rep, (rep && rep->rr_func) ? " (with waiter)" : "");
 
 	/*
-	 * Finish the deregistration. When using mw bind, this was
-	 * begun in rpcrdma_reply_handler(). In all other modes, we
-	 * do it here, in thread context. The process is considered
+	 * Finish the deregistration.  The process is considered
 	 * complete when the rr_func vector becomes NULL - this
 	 * was put in place during rpcrdma_reply_handler() - the wait
 	 * call below will not block if the dereg is "done". If
@@ -580,11 +578,6 @@ xprt_rdma_free(void *buffer)
 			&req->rl_segments[i], r_xprt, NULL);
 	}
 
-	if (rep && wait_event_interruptible(rep->rr_unbind, !rep->rr_func)) {
-		rep->rr_func = NULL;	/* abandon the callback */
-		req->rl_reply = NULL;
-	}
-
 	if (req->rl_iov.length == 0) {	/* see allocate above */
 		struct rpcrdma_req *oreq = (struct rpcrdma_req *)req->rl_buffer;
 		oreq->rl_reply = req->rl_reply;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6373232..35c6699 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -152,7 +152,7 @@ void rpcrdma_event_process(struct ib_wc *wc)
 	dprintk("RPC:       %s: event rep %p status %X opcode %X length %u\n",
 		__func__, rep, wc->status, wc->opcode, wc->byte_len);
 
-	if (!rep) /* send or bind completion that we don't care about */
+	if (!rep) /* send completion that we don't care about */
 		return;
 
 	if (IB_WC_SUCCESS != wc->status) {
@@ -197,8 +197,6 @@ void rpcrdma_event_process(struct ib_wc *wc)
 			}
 			atomic_set(&rep->rr_buffer->rb_credits, credits);
 		}
-		/* fall through */
-	case IB_WC_BIND_MW:
 		rpcrdma_schedule_tasklet(rep);
 		break;
 	default:
@@ -233,7 +231,7 @@ rpcrdma_cq_poll(struct ib_cq *cq)
 /*
  * rpcrdma_cq_event_upcall
  *
- * This upcall handles recv, send, bind and unbind events.
+ * This upcall handles recv and send events.
  * It is reentrant but processes single events in order to maintain
  * ordering of receives to keep server credits.
  *
@@ -494,16 +492,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 	}
 
 	switch (memreg) {
-	case RPCRDMA_MEMWINDOWS:
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-		if (!(devattr.device_cap_flags & IB_DEVICE_MEM_WINDOW)) {
-			dprintk("RPC:       %s: MEMWINDOWS registration "
-				"specified but not supported by adapter, "
-				"using slower RPCRDMA_REGISTER\n",
-				__func__);
-			memreg = RPCRDMA_REGISTER;
-		}
-		break;
 	case RPCRDMA_MTHCAFMR:
 		if (!ia->ri_id->device->alloc_fmr) {
 			dprintk("RPC:       %s: MTHCAFMR registration "
@@ -551,16 +539,13 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 				IB_ACCESS_REMOTE_READ;
 		goto register_setup;
 #endif
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-	case RPCRDMA_MEMWINDOWS:
-		mem_priv = IB_ACCESS_LOCAL_WRITE |
-				IB_ACCESS_MW_BIND;
-		goto register_setup;
 	case RPCRDMA_MTHCAFMR:
 		if (ia->ri_have_dma_lkey)
 			break;
 		mem_priv = IB_ACCESS_LOCAL_WRITE;
+#ifdef CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL
 	register_setup:
+#endif
 		ia->ri_bind_mem = ib_get_dma_mr(ia->ri_pd, mem_priv);
 		if (IS_ERR(ia->ri_bind_mem)) {
 			printk(KERN_ALERT "%s: ib_get_dma_mr for "
@@ -683,14 +668,6 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 		}
 		break;
 	}
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-	case RPCRDMA_MEMWINDOWS:
-		/* Add room for mw_binds+unbinds - overkill! */
-		ep->rep_attr.cap.max_send_wr++;
-		ep->rep_attr.cap.max_send_wr *= (2 * RPCRDMA_MAX_SEGS);
-		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
-			return -EINVAL;
-		break;
 	default:
 		break;
 	}
@@ -712,14 +689,6 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
 	/* set trigger for requesting send completion */
 	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
-	switch (ia->ri_memreg_strategy) {
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-	case RPCRDMA_MEMWINDOWS:
-		ep->rep_cqinit -= RPCRDMA_MAX_SEGS;
-		break;
-	default:
-		break;
-	}
 	if (ep->rep_cqinit <= 2)
 		ep->rep_cqinit = 0;
 	INIT_CQCOUNT(ep);
@@ -727,11 +696,6 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 	init_waitqueue_head(&ep->rep_connect_wait);
 	INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
 
-	/*
-	 * Create a single cq for receive dto and mw_bind (only ever
-	 * care about unbind, really). Send completions are suppressed.
-	 * Use single threaded tasklet upcalls to maintain ordering.
-	 */
 	ep->rep_cq = ib_create_cq(ia->ri_id->device, rpcrdma_cq_event_upcall,
 				  rpcrdma_cq_async_error_upcall, NULL,
 				  ep->rep_attr.cap.max_recv_wr +
@@ -1004,11 +968,6 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 		len += (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS *
 				sizeof(struct rpcrdma_mw);
 		break;
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-	case RPCRDMA_MEMWINDOWS:
-		len += (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS *
-				sizeof(struct rpcrdma_mw);
-		break;
 	default:
 		break;
 	}
@@ -1039,11 +998,6 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 	}
 	p += cdata->padding;
 
-	/*
-	 * Allocate the fmr's, or mw's for mw_bind chunk registration.
-	 * We "cycle" the mw's in order to minimize rkey reuse,
-	 * and also reduce unbind-to-bind collision.
-	 */
 	INIT_LIST_HEAD(&buf->rb_mws);
 	r = (struct rpcrdma_mw *)p;
 	switch (ia->ri_memreg_strategy) {
@@ -1090,21 +1044,6 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 			++r;
 		}
 		break;
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-	case RPCRDMA_MEMWINDOWS:
-		/* Allocate one extra request's worth, for full cycling */
-		for (i = (buf->rb_max_requests+1) * RPCRDMA_MAX_SEGS; i; i--) {
-			r->r.mw = ib_alloc_mw(ia->ri_pd, IB_MW_TYPE_1);
-			if (IS_ERR(r->r.mw)) {
-				rc = PTR_ERR(r->r.mw);
-				dprintk("RPC:       %s: ib_alloc_mw"
-					" failed %i\n", __func__, rc);
-				goto out;
-			}
-			list_add(&r->mw_list, &buf->rb_mws);
-			++r;
-		}
-		break;
 	default:
 		break;
 	}
@@ -1153,7 +1092,6 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 		memset(rep, 0, sizeof(struct rpcrdma_rep));
 		buf->rb_recv_bufs[i] = rep;
 		buf->rb_recv_bufs[i]->rr_buffer = buf;
-		init_waitqueue_head(&rep->rr_unbind);
 
 		rc = rpcrdma_register_internal(ia, rep->rr_base,
 				len - offsetof(struct rpcrdma_rep, rr_base),
@@ -1187,7 +1125,6 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
 
 	/* clean up in reverse order from create
 	 *   1.  recv mr memory (mr free, then kfree)
-	 *   1a. bind mw memory
 	 *   2.  send mr memory (mr free, then kfree)
 	 *   3.  padding (if any) [moved to rpcrdma_ep_destroy]
 	 *   4.  arrays
@@ -1231,15 +1168,6 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
 					" failed %i\n",
 					__func__, rc);
 			break;
-		case RPCRDMA_MEMWINDOWS_ASYNC:
-		case RPCRDMA_MEMWINDOWS:
-			rc = ib_dealloc_mw(r->r.mw);
-			if (rc)
-				dprintk("RPC:       %s:"
-					" ib_dealloc_mw"
-					" failed %i\n",
-					__func__, rc);
-			break;
 		default:
 			break;
 		}
@@ -1314,15 +1242,12 @@ rpcrdma_buffer_put(struct rpcrdma_req *req)
 	req->rl_niovs = 0;
 	if (req->rl_reply) {
 		buffers->rb_recv_bufs[--buffers->rb_recv_index] = req->rl_reply;
-		init_waitqueue_head(&req->rl_reply->rr_unbind);
 		req->rl_reply->rr_func = NULL;
 		req->rl_reply = NULL;
 	}
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
 	case RPCRDMA_MTHCAFMR:
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-	case RPCRDMA_MEMWINDOWS:
 		/*
 		 * Cycle mw's back in reverse order, and "spin" them.
 		 * This delays and scrambles reuse as much as possible.
@@ -1367,8 +1292,7 @@ rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
 
 /*
  * Put reply buffers back into pool when not attached to
- * request. This happens in error conditions, and when
- * aborting unbinds. Pre-decrement counter/array index.
+ * request. This happens in error conditions.
  */
 void
 rpcrdma_recv_buffer_put(struct rpcrdma_rep *rep)
@@ -1671,74 +1595,6 @@ rpcrdma_deregister_fmr_external(struct rpcrdma_mr_seg *seg,
 }
 
 static int
-rpcrdma_register_memwin_external(struct rpcrdma_mr_seg *seg,
-			int *nsegs, int writing, struct rpcrdma_ia *ia,
-			struct rpcrdma_xprt *r_xprt)
-{
-	int mem_priv = (writing ? IB_ACCESS_REMOTE_WRITE :
-				  IB_ACCESS_REMOTE_READ);
-	struct ib_mw_bind param;
-	int rc;
-
-	*nsegs = 1;
-	rpcrdma_map_one(ia, seg, writing);
-	param.bind_info.mr = ia->ri_bind_mem;
-	param.wr_id = 0ULL;	/* no send cookie */
-	param.bind_info.addr = seg->mr_dma;
-	param.bind_info.length = seg->mr_len;
-	param.send_flags = 0;
-	param.bind_info.mw_access_flags = mem_priv;
-
-	DECR_CQCOUNT(&r_xprt->rx_ep);
-	rc = ib_bind_mw(ia->ri_id->qp, seg->mr_chunk.rl_mw->r.mw, &param);
-	if (rc) {
-		dprintk("RPC:       %s: failed ib_bind_mw "
-			"%u@0x%llx status %i\n",
-			__func__, seg->mr_len,
-			(unsigned long long)seg->mr_dma, rc);
-		rpcrdma_unmap_one(ia, seg);
-	} else {
-		seg->mr_rkey = seg->mr_chunk.rl_mw->r.mw->rkey;
-		seg->mr_base = param.bind_info.addr;
-		seg->mr_nsegs = 1;
-	}
-	return rc;
-}
-
-static int
-rpcrdma_deregister_memwin_external(struct rpcrdma_mr_seg *seg,
-			struct rpcrdma_ia *ia,
-			struct rpcrdma_xprt *r_xprt, void **r)
-{
-	struct ib_mw_bind param;
-	LIST_HEAD(l);
-	int rc;
-
-	BUG_ON(seg->mr_nsegs != 1);
-	param.bind_info.mr = ia->ri_bind_mem;
-	param.bind_info.addr = 0ULL;	/* unbind */
-	param.bind_info.length = 0;
-	param.bind_info.mw_access_flags = 0;
-	if (*r) {
-		param.wr_id = (u64) (unsigned long) *r;
-		param.send_flags = IB_SEND_SIGNALED;
-		INIT_CQCOUNT(&r_xprt->rx_ep);
-	} else {
-		param.wr_id = 0ULL;
-		param.send_flags = 0;
-		DECR_CQCOUNT(&r_xprt->rx_ep);
-	}
-	rc = ib_bind_mw(ia->ri_id->qp, seg->mr_chunk.rl_mw->r.mw, &param);
-	rpcrdma_unmap_one(ia, seg);
-	if (rc)
-		dprintk("RPC:       %s: failed ib_(un)bind_mw,"
-			" status %i\n", __func__, rc);
-	else
-		*r = NULL;	/* will upcall on completion */
-	return rc;
-}
-
-static int
 rpcrdma_register_default_external(struct rpcrdma_mr_seg *seg,
 			int *nsegs, int writing, struct rpcrdma_ia *ia)
 {
@@ -1828,12 +1684,6 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
 		rc = rpcrdma_register_fmr_external(seg, &nsegs, writing, ia);
 		break;
 
-	/* Registration using memory windows */
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-	case RPCRDMA_MEMWINDOWS:
-		rc = rpcrdma_register_memwin_external(seg, &nsegs, writing, ia, r_xprt);
-		break;
-
 	/* Default registration each time */
 	default:
 		rc = rpcrdma_register_default_external(seg, &nsegs, writing, ia);
@@ -1870,11 +1720,6 @@ rpcrdma_deregister_external(struct rpcrdma_mr_seg *seg,
 		rc = rpcrdma_deregister_fmr_external(seg, ia);
 		break;
 
-	case RPCRDMA_MEMWINDOWS_ASYNC:
-	case RPCRDMA_MEMWINDOWS:
-		rc = rpcrdma_deregister_memwin_external(seg, ia, r_xprt, &r);
-		break;
-
 	default:
 		rc = rpcrdma_deregister_default_external(seg, ia);
 		break;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index c620d13..bf08ee0 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -127,7 +127,6 @@ struct rpcrdma_rep {
 	struct rpc_xprt	*rr_xprt;	/* needed for request/reply matching */
 	void (*rr_func)(struct rpcrdma_rep *);/* called by tasklet in softint */
 	struct list_head rr_list;	/* tasklet list */
-	wait_queue_head_t rr_unbind;	/* optional unbind wait */
 	struct ib_sge	rr_iov;		/* for posting */
 	struct ib_mr	*rr_handle;	/* handle for mem in rr_iov */
 	char	rr_base[MAX_RPCRDMAHDR]; /* minimal inline receive buffer */
@@ -162,7 +161,6 @@ struct rpcrdma_mr_seg {		/* chunk descriptors */
 		struct ib_mr	*rl_mr;		/* if registered directly */
 		struct rpcrdma_mw {		/* if registered from region */
 			union {
-				struct ib_mw	*mw;
 				struct ib_fmr	*fmr;
 				struct {
 					struct ib_fast_reg_page_list *fr_pgl;


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 5/8] xprtrdma: Simplify rpcrdma_deregister_external() synopsis
@ 2014-04-14 22:23     ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:23 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

Clean up: All remaining callers of rpcrdma_deregister_external()
pass NULL as the last argument, so remove that argument.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 net/sunrpc/xprtrdma/rpc_rdma.c  |    2 +-
 net/sunrpc/xprtrdma/transport.c |    2 +-
 net/sunrpc/xprtrdma/verbs.c     |    8 +-------
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 +-
 4 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index e4af26a..aac6adb 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -273,7 +273,7 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
 out:
 	for (pos = 0; nchunks--;)
 		pos += rpcrdma_deregister_external(
-				&req->rl_segments[pos], r_xprt, NULL);
+				&req->rl_segments[pos], r_xprt);
 	return 0;
 }
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index c23b0c1..430cabb 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -575,7 +575,7 @@ xprt_rdma_free(void *buffer)
 	for (i = 0; req->rl_nchunks;) {
 		--req->rl_nchunks;
 		i += rpcrdma_deregister_external(
-			&req->rl_segments[i], r_xprt, NULL);
+			&req->rl_segments[i], r_xprt);
 	}
 
 	if (req->rl_iov.length == 0) {	/* see allocate above */
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 35c6699..2b763a2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1697,7 +1697,7 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
 
 int
 rpcrdma_deregister_external(struct rpcrdma_mr_seg *seg,
-		struct rpcrdma_xprt *r_xprt, void *r)
+		struct rpcrdma_xprt *r_xprt)
 {
 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
 	int nsegs = seg->mr_nsegs, rc;
@@ -1724,12 +1724,6 @@ rpcrdma_deregister_external(struct rpcrdma_mr_seg *seg,
 		rc = rpcrdma_deregister_default_external(seg, ia);
 		break;
 	}
-	if (r) {
-		struct rpcrdma_rep *rep = r;
-		void (*func)(struct rpcrdma_rep *) = rep->rr_func;
-		rep->rr_func = NULL;
-		func(rep);	/* dereg done, callback now */
-	}
 	return nsegs;
 }
 
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index bf08ee0..3f44d6a 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -331,7 +331,7 @@ int rpcrdma_deregister_internal(struct rpcrdma_ia *,
 int rpcrdma_register_external(struct rpcrdma_mr_seg *,
 				int, int, struct rpcrdma_xprt *);
 int rpcrdma_deregister_external(struct rpcrdma_mr_seg *,
-				struct rpcrdma_xprt *, void *);
+				struct rpcrdma_xprt *);
 
 /*
  * RPC/RDMA connection management calls - xprtrdma/rpc_rdma.c


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 6/8] xprtrdma: Make rpcrdma_ep_destroy() return void
@ 2014-04-14 22:23     ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:23 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

Clean up: rpcrdma_ep_destroy() returns a value that is used
only to print a debugging message. rpcrdma_ep_destroy() already
prints debugging messages in all error cases.

Make rpcrdma_ep_destroy() return void instead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 net/sunrpc/xprtrdma/transport.c |    8 ++------
 net/sunrpc/xprtrdma/verbs.c     |    7 +------
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 +-
 3 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 430cabb..d18b2a3 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -229,7 +229,6 @@ static void
 xprt_rdma_destroy(struct rpc_xprt *xprt)
 {
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
-	int rc;
 
 	dprintk("RPC:       %s: called\n", __func__);
 
@@ -238,10 +237,7 @@ xprt_rdma_destroy(struct rpc_xprt *xprt)
 	xprt_clear_connected(xprt);
 
 	rpcrdma_buffer_destroy(&r_xprt->rx_buf);
-	rc = rpcrdma_ep_destroy(&r_xprt->rx_ep, &r_xprt->rx_ia);
-	if (rc)
-		dprintk("RPC:       %s: rpcrdma_ep_destroy returned %i\n",
-			__func__, rc);
+	rpcrdma_ep_destroy(&r_xprt->rx_ep, &r_xprt->rx_ia);
 	rpcrdma_ia_close(&r_xprt->rx_ia);
 
 	xprt_rdma_free_addresses(xprt);
@@ -391,7 +387,7 @@ out4:
 	xprt_rdma_free_addresses(xprt);
 	rc = -EINVAL;
 out3:
-	(void) rpcrdma_ep_destroy(new_ep, &new_xprt->rx_ia);
+	rpcrdma_ep_destroy(new_ep, &new_xprt->rx_ia);
 out2:
 	rpcrdma_ia_close(&new_xprt->rx_ia);
 out1:
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 2b763a2..0f8c22c 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -751,11 +751,8 @@ out1:
  * Disconnect and destroy endpoint. After this, the only
  * valid operations on the ep are to free it (if dynamically
  * allocated) or re-create it.
- *
- * The caller's error handling must be sure to not leak the endpoint
- * if this function fails.
  */
-int
+void
 rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
 	int rc;
@@ -785,8 +782,6 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 	if (rc)
 		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
 			__func__, rc);
-
-	return rc;
 }
 
 /*
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 3f44d6a..362a19d 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -301,7 +301,7 @@ void rpcrdma_ia_close(struct rpcrdma_ia *);
  */
 int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
 				struct rpcrdma_create_data_internal *);
-int rpcrdma_ep_destroy(struct rpcrdma_ep *, struct rpcrdma_ia *);
+void rpcrdma_ep_destroy(struct rpcrdma_ep *, struct rpcrdma_ia *);
 int rpcrdma_ep_connect(struct rpcrdma_ep *, struct rpcrdma_ia *);
 int rpcrdma_ep_disconnect(struct rpcrdma_ep *, struct rpcrdma_ia *);
 


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-14 22:23     ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:23 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

The current CQ handler uses the ib_wc.opcode field to distinguish
between event types. However, the contents of that field are not
reliable if the completion status is not IB_WC_SUCCESS.

When an error completion occurs on a send event, the CQ handler
schedules a tasklet with something that is not a struct rpcrdma_rep.
This is never correct behavior, and sometimes it results in a panic.

To resolve this issue, split the completion queue into a send CQ and
a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
wr_id's, and the receive CQ handler now handles only struct
rpcrdma_rep wr_id's.

Fix suggested by Shirley Ma <shirley.ma@oracle.com>

Reported-by: Rafael Reiter <rafael.reiter@ims.co.at>
Fixes: 5c635e09cec0feeeb310968e51dad01040244851
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211
Tested-by: Klemens Senn <klemens.senn@ims.co.at>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 net/sunrpc/xprtrdma/verbs.c     |  228 +++++++++++++++++++++++----------------
 net/sunrpc/xprtrdma/xprt_rdma.h |    1 
 2 files changed, 135 insertions(+), 94 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 0f8c22c..35f5ab6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -142,96 +142,113 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
 	}
 }
 
-static inline
-void rpcrdma_event_process(struct ib_wc *wc)
+static void
+rpcrdma_send_event_process(struct ib_wc *wc)
 {
-	struct rpcrdma_mw *frmr;
-	struct rpcrdma_rep *rep =
-			(struct rpcrdma_rep *)(unsigned long) wc->wr_id;
+	struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
 
-	dprintk("RPC:       %s: event rep %p status %X opcode %X length %u\n",
-		__func__, rep, wc->status, wc->opcode, wc->byte_len);
+	dprintk("RPC:       %s: frmr %p status %X opcode %d\n",
+		__func__, frmr, wc->status, wc->opcode);
 
-	if (!rep) /* send completion that we don't care about */
+	if (wc->wr_id == 0ULL)
 		return;
-
-	if (IB_WC_SUCCESS != wc->status) {
-		dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
-			__func__, wc->opcode, wc->status);
-		rep->rr_len = ~0U;
-		if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
-			rpcrdma_schedule_tasklet(rep);
+	if (wc->status != IB_WC_SUCCESS)
 		return;
-	}
 
-	switch (wc->opcode) {
-	case IB_WC_FAST_REG_MR:
-		frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+	if (wc->opcode == IB_WC_FAST_REG_MR)
 		frmr->r.frmr.state = FRMR_IS_VALID;
-		break;
-	case IB_WC_LOCAL_INV:
-		frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+	else if (wc->opcode == IB_WC_LOCAL_INV)
 		frmr->r.frmr.state = FRMR_IS_INVALID;
-		break;
-	case IB_WC_RECV:
-		rep->rr_len = wc->byte_len;
-		ib_dma_sync_single_for_cpu(
-			rdmab_to_ia(rep->rr_buffer)->ri_id->device,
-			rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
-		/* Keep (only) the most recent credits, after check validity */
-		if (rep->rr_len >= 16) {
-			struct rpcrdma_msg *p =
-					(struct rpcrdma_msg *) rep->rr_base;
-			unsigned int credits = ntohl(p->rm_credit);
-			if (credits == 0) {
-				dprintk("RPC:       %s: server"
-					" dropped credits to 0!\n", __func__);
-				/* don't deadlock */
-				credits = 1;
-			} else if (credits > rep->rr_buffer->rb_max_requests) {
-				dprintk("RPC:       %s: server"
-					" over-crediting: %d (%d)\n",
-					__func__, credits,
-					rep->rr_buffer->rb_max_requests);
-				credits = rep->rr_buffer->rb_max_requests;
-			}
-			atomic_set(&rep->rr_buffer->rb_credits, credits);
-		}
-		rpcrdma_schedule_tasklet(rep);
-		break;
-	default:
-		dprintk("RPC:       %s: unexpected WC event %X\n",
-			__func__, wc->opcode);
-		break;
-	}
 }
 
-static inline int
-rpcrdma_cq_poll(struct ib_cq *cq)
+static int
+rpcrdma_sendcq_poll(struct ib_cq *cq)
 {
 	struct ib_wc wc;
 	int rc;
 
-	for (;;) {
-		rc = ib_poll_cq(cq, 1, &wc);
-		if (rc < 0) {
-			dprintk("RPC:       %s: ib_poll_cq failed %i\n",
-				__func__, rc);
-			return rc;
-		}
-		if (rc == 0)
-			break;
+	while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
+		rpcrdma_send_event_process(&wc);
+	return rc;
+}
 
-		rpcrdma_event_process(&wc);
+/*
+ * Handle send, fast_reg_mr, and local_inv completions.
+ *
+ * Send events are typically suppressed and thus do not result
+ * in an upcall. Occasionally one is signaled, however. This
+ * prevents the provider's completion queue from wrapping and
+ * losing a completion.
+ */
+static void
+rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
+{
+	int rc;
+
+	rc = rpcrdma_sendcq_poll(cq);
+	if (rc) {
+		dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
+			__func__, rc);
+		return;
 	}
+	rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	if (rc) {
+		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
+			__func__, rc);
+		return;
+	}
+	rpcrdma_sendcq_poll(cq);
+}
 
-	return 0;
+static void
+rpcrdma_recv_event_process(struct ib_wc *wc)
+{
+	struct rpcrdma_rep *rep =
+			(struct rpcrdma_rep *)(unsigned long)wc->wr_id;
+
+	dprintk("RPC:       %s: rep %p status %X opcode %X length %u\n",
+		__func__, rep, wc->status, wc->opcode, wc->byte_len);
+
+	if (wc->status != IB_WC_SUCCESS) {
+		rep->rr_len = ~0U;
+		goto out_schedule;
+	}
+	if (wc->opcode != IB_WC_RECV)
+		return;
+
+	rep->rr_len = wc->byte_len;
+	ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
+			rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
+
+	if (rep->rr_len >= 16) {
+		struct rpcrdma_msg *p = (struct rpcrdma_msg *)rep->rr_base;
+		unsigned int credits = ntohl(p->rm_credit);
+
+		if (credits == 0)
+			credits = 1;	/* don't deadlock */
+		else if (credits > rep->rr_buffer->rb_max_requests)
+			credits = rep->rr_buffer->rb_max_requests;
+		atomic_set(&rep->rr_buffer->rb_credits, credits);
+	}
+
+out_schedule:
+	rpcrdma_schedule_tasklet(rep);
+}
+
+static int
+rpcrdma_recvcq_poll(struct ib_cq *cq)
+{
+	struct ib_wc wc;
+	int rc;
+
+	while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
+		rpcrdma_recv_event_process(&wc);
+	return rc;
 }
 
 /*
- * rpcrdma_cq_event_upcall
+ * Handle receive completions.
  *
- * This upcall handles recv and send events.
  * It is reentrant but processes single events in order to maintain
  * ordering of receives to keep server credits.
  *
@@ -240,26 +257,25 @@ rpcrdma_cq_poll(struct ib_cq *cq)
  * connection shutdown. That is, the structures required for
  * the completion of the reply handler must remain intact until
  * all memory has been reclaimed.
- *
- * Note that send events are suppressed and do not result in an upcall.
  */
 static void
-rpcrdma_cq_event_upcall(struct ib_cq *cq, void *context)
+rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
 {
 	int rc;
 
-	rc = rpcrdma_cq_poll(cq);
-	if (rc)
+	rc = rpcrdma_recvcq_poll(cq);
+	if (rc) {
+		dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
+			__func__, rc);
 		return;
-
+	}
 	rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	if (rc) {
-		dprintk("RPC:       %s: ib_req_notify_cq failed %i\n",
+		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
 			__func__, rc);
 		return;
 	}
-
-	rpcrdma_cq_poll(cq);
+	rpcrdma_recvcq_poll(cq);
 }
 
 #ifdef RPC_DEBUG
@@ -613,6 +629,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 				struct rpcrdma_create_data_internal *cdata)
 {
 	struct ib_device_attr devattr;
+	struct ib_cq *sendcq, *recvcq;
 	int rc, err;
 
 	rc = ib_query_device(ia->ri_id->device, &devattr);
@@ -688,7 +705,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
 	if (ep->rep_cqinit <= 2)
 		ep->rep_cqinit = 0;
 	INIT_CQCOUNT(ep);
@@ -696,26 +713,43 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 	init_waitqueue_head(&ep->rep_connect_wait);
 	INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
 
-	ep->rep_cq = ib_create_cq(ia->ri_id->device, rpcrdma_cq_event_upcall,
+	sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
 				  rpcrdma_cq_async_error_upcall, NULL,
-				  ep->rep_attr.cap.max_recv_wr +
 				  ep->rep_attr.cap.max_send_wr + 1, 0);
-	if (IS_ERR(ep->rep_cq)) {
-		rc = PTR_ERR(ep->rep_cq);
-		dprintk("RPC:       %s: ib_create_cq failed: %i\n",
+	if (IS_ERR(sendcq)) {
+		rc = PTR_ERR(sendcq);
+		dprintk("RPC:       %s: failed to create send CQ: %i\n",
 			__func__, rc);
 		goto out1;
 	}
 
-	rc = ib_req_notify_cq(ep->rep_cq, IB_CQ_NEXT_COMP);
+	rc = ib_req_notify_cq(sendcq, IB_CQ_NEXT_COMP);
 	if (rc) {
 		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
 			__func__, rc);
 		goto out2;
 	}
 
-	ep->rep_attr.send_cq = ep->rep_cq;
-	ep->rep_attr.recv_cq = ep->rep_cq;
+	recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
+				  rpcrdma_cq_async_error_upcall, NULL,
+				  ep->rep_attr.cap.max_recv_wr + 1, 0);
+	if (IS_ERR(recvcq)) {
+		rc = PTR_ERR(recvcq);
+		dprintk("RPC:       %s: failed to create recv CQ: %i\n",
+			__func__, rc);
+		goto out2;
+	}
+
+	rc = ib_req_notify_cq(recvcq, IB_CQ_NEXT_COMP);
+	if (rc) {
+		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
+			__func__, rc);
+		ib_destroy_cq(recvcq);
+		goto out2;
+	}
+
+	ep->rep_attr.send_cq = sendcq;
+	ep->rep_attr.recv_cq = recvcq;
 
 	/* Initialize cma parameters */
 
@@ -737,7 +771,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 	return 0;
 
 out2:
-	err = ib_destroy_cq(ep->rep_cq);
+	err = ib_destroy_cq(sendcq);
 	if (err)
 		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
 			__func__, err);
@@ -777,8 +811,14 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 		ep->rep_pad_mr = NULL;
 	}
 
-	rpcrdma_clean_cq(ep->rep_cq);
-	rc = ib_destroy_cq(ep->rep_cq);
+	rpcrdma_clean_cq(ep->rep_attr.recv_cq);
+	rc = ib_destroy_cq(ep->rep_attr.recv_cq);
+	if (rc)
+		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
+			__func__, rc);
+
+	rpcrdma_clean_cq(ep->rep_attr.send_cq);
+	rc = ib_destroy_cq(ep->rep_attr.send_cq);
 	if (rc)
 		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
 			__func__, rc);
@@ -801,7 +841,9 @@ retry:
 		if (rc && rc != -ENOTCONN)
 			dprintk("RPC:       %s: rpcrdma_ep_disconnect"
 				" status %i\n", __func__, rc);
-		rpcrdma_clean_cq(ep->rep_cq);
+
+		rpcrdma_clean_cq(ep->rep_attr.recv_cq);
+		rpcrdma_clean_cq(ep->rep_attr.send_cq);
 
 		xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
 		id = rpcrdma_create_id(xprt, ia,
@@ -910,7 +952,8 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
 	int rc;
 
-	rpcrdma_clean_cq(ep->rep_cq);
+	rpcrdma_clean_cq(ep->rep_attr.recv_cq);
+	rpcrdma_clean_cq(ep->rep_attr.send_cq);
 	rc = rdma_disconnect(ia->ri_id);
 	if (!rc) {
 		/* returns without wait if not connected */
@@ -1793,7 +1836,6 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
 	ib_dma_sync_single_for_cpu(ia->ri_id->device,
 		rep->rr_iov.addr, rep->rr_iov.length, DMA_BIDIRECTIONAL);
 
-	DECR_CQCOUNT(ep);
 	rc = ib_post_recv(ia->ri_id->qp, &recv_wr, &recv_wr_fail);
 
 	if (rc)
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 362a19d..334ab6e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -79,7 +79,6 @@ struct rpcrdma_ep {
 	int			rep_cqinit;
 	int			rep_connected;
 	struct rpcrdma_ia	*rep_ia;
-	struct ib_cq		*rep_cq;
 	struct ib_qp_init_attr	rep_attr;
 	wait_queue_head_t 	rep_connect_wait;
 	struct ib_sge		rep_pad;	/* holds zeroed pad */


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 8/8] xprtrdma: Reduce the number of hardway buffer allocations
@ 2014-04-14 22:23     ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-14 22:23 UTC (permalink / raw)
  To: linux-nfs, linux-rdma

While marshaling an RPC/RDMA request, the inline rsize and wsize
settings determine whether an inline request is used, or whether
read and write chunk lists are built. The current default value of
these settings is 1024. Any RPC request smaller than 1024 bytes is
sent completely inline to the NFS server.

rpcrdma_buffer_create() allocates and pre-registers a set of RPC
buffers for each transport instance, also based on the inline rsize
and wsize settings.

RPC/RDMA requests and replies are built in these buffers. However,
if an RPC/RDMA request is expected to be larger than 1024, a buffer
has to be allocated and registered for that RPC, and deregistered
and released when the RPC is complete. This is known as a
"hardway allocation."

Since the introduction of NFSv4, the size of RPC requests has become
larger, and hardway allocations are thus more frequent. Hardway
allocations are significant overhead, and they waste the existing
RPC buffers pre-allocated by rpcrdma_buffer_create().

We'd like fewer hardway allocations.

Increasing the size of the pre-registered buffers is the most direct
way to do this. However, a blanket increase of the inline thresholds
has interoperability consequences.

On my 64-bit system, rpcrdma_buffer_create() requests roughly 7000
bytes for each RPC request buffer, using kmalloc(). kmalloc()
satisfies a 7000-byte allocation request with an 8192-byte piece of
memory, so nearly 1200 bytes per buffer are allocated but never
used.

So let's round up the size of the pre-allocated buffers, and make
use of the unused space in the kmalloc'd memory.
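
A minimal userspace sketch (not code from this patch) of the rounding
done with fls(): "1 << fls(x)" is the smallest power of two strictly
greater than x, so the roughly 7000-byte request buffer described
above lands exactly on kmalloc()'s 8192-byte size class. The 7000-byte
figure is taken from the description above, not measured here.

	#include <stdio.h>

	/* mimics the kernel's fls(): 1-based index of the most
	 * significant set bit, 0 for x == 0
	 */
	static unsigned int fls_sketch(unsigned int x)
	{
		unsigned int r = 0;

		while (x) {
			x >>= 1;
			r++;
		}
		return r;
	}

	int main(void)
	{
		/* roughly cdata->inline_wsize + sizeof(struct rpcrdma_req) */
		unsigned int req_sz = 7000;
		unsigned int wlen = 1u << fls_sketch(req_sz);

		printf("ask for %u, buffer rounds to %u, %u bytes reclaimed\n",
		       req_sz, wlen, wlen - req_sz);
		return 0;
	}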

This change reduces the amount of hardway allocated memory for an
NFSv4 general connectathon run from 1322092 to 9472 bytes (99%).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---

 net/sunrpc/xprtrdma/verbs.c |   25 +++++++++++++------------
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 35f5ab6..9b038e1 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -50,6 +50,7 @@
 #include <linux/interrupt.h>
 #include <linux/pci.h>	/* for Tavor hack below */
 #include <linux/slab.h>
+#include <asm/bitops.h>
 
 #include "xprt_rdma.h"
 
@@ -976,7 +977,7 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 	struct rpcrdma_ia *ia, struct rpcrdma_create_data_internal *cdata)
 {
 	char *p;
-	size_t len;
+	size_t len, rlen, wlen;
 	int i, rc;
 	struct rpcrdma_mw *r;
 
@@ -1090,16 +1091,16 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 	 * Allocate/init the request/reply buffers. Doing this
 	 * using kmalloc for now -- one for each buf.
 	 */
+	wlen = 1 << fls(cdata->inline_wsize + sizeof(struct rpcrdma_req));
+	rlen = 1 << fls(cdata->inline_rsize + sizeof(struct rpcrdma_rep));
+	dprintk("RPC:       %s: wlen = %zu, rlen = %zu\n",
+		__func__, wlen, rlen);
+
 	for (i = 0; i < buf->rb_max_requests; i++) {
 		struct rpcrdma_req *req;
 		struct rpcrdma_rep *rep;
 
-		len = cdata->inline_wsize + sizeof(struct rpcrdma_req);
-		/* RPC layer requests *double* size + 1K RPC_SLACK_SPACE! */
-		/* Typical ~2400b, so rounding up saves work later */
-		if (len < 4096)
-			len = 4096;
-		req = kmalloc(len, GFP_KERNEL);
+		req = kmalloc(wlen, GFP_KERNEL);
 		if (req == NULL) {
 			dprintk("RPC:       %s: request buffer %d alloc"
 				" failed\n", __func__, i);
@@ -1111,16 +1112,16 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 		buf->rb_send_bufs[i]->rl_buffer = buf;
 
 		rc = rpcrdma_register_internal(ia, req->rl_base,
-				len - offsetof(struct rpcrdma_req, rl_base),
+				wlen - offsetof(struct rpcrdma_req, rl_base),
 				&buf->rb_send_bufs[i]->rl_handle,
 				&buf->rb_send_bufs[i]->rl_iov);
 		if (rc)
 			goto out;
 
-		buf->rb_send_bufs[i]->rl_size = len-sizeof(struct rpcrdma_req);
+		buf->rb_send_bufs[i]->rl_size = wlen -
+						sizeof(struct rpcrdma_req);
 
-		len = cdata->inline_rsize + sizeof(struct rpcrdma_rep);
-		rep = kmalloc(len, GFP_KERNEL);
+		rep = kmalloc(rlen, GFP_KERNEL);
 		if (rep == NULL) {
 			dprintk("RPC:       %s: reply buffer %d alloc failed\n",
 				__func__, i);
@@ -1132,7 +1133,7 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 		buf->rb_recv_bufs[i]->rr_buffer = buf;
 
 		rc = rpcrdma_register_internal(ia, rep->rr_base,
-				len - offsetof(struct rpcrdma_rep, rr_base),
+				rlen - offsetof(struct rpcrdma_rep, rr_base),
 				&buf->rb_recv_bufs[i]->rr_handle,
 				&buf->rb_recv_bufs[i]->rr_iov);
 		if (rc)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/8] NFS/RDMA patches for review
@ 2014-04-15 20:15     ` Steve Wise
  0 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-15 20:15 UTC (permalink / raw)
  To: Chuck Lever, linux-nfs, linux-rdma


Tested-by: Steve Wise <swise@opengridcomputing.com>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/8] NFS/RDMA patches for review
@ 2014-04-15 20:20         ` Steve Wise
  0 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-15 20:20 UTC (permalink / raw)
  To: Chuck Lever, linux-nfs, linux-rdma

My testing was with the default memory registration mode by the way.  
Over mlx4 and cxgb4.

Steve.

On 4/15/2014 3:15 PM, Steve Wise wrote:
>
> Tested-by: Steve Wise <swise@opengridcomputing.com>
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
  2014-04-14 22:23     ` Chuck Lever
@ 2014-04-16 12:48         ` Sagi Grimberg
  -1 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-16 12:48 UTC (permalink / raw)
  To: Chuck Lever, linux-nfs, linux-rdma

On 4/15/2014 1:23 AM, Chuck Lever wrote:
> The current CQ handler uses the ib_wc.opcode field to distinguish
> between event types. However, the contents of that field are not
> reliable if the completion status is not IB_WC_SUCCESS.
>
> When an error completion occurs on a send event, the CQ handler
> schedules a tasklet with something that is not a struct rpcrdma_rep.
> This is never correct behavior, and sometimes it results in a panic.
>
> To resolve this issue, split the completion queue into a send CQ and
> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
> wr_id's, and the receive CQ handler now handles only struct
> rpcrdma_rep wr_id's.

Hey Chuck,

So, two suggestions related (although not directly) to this one.

1. I recommend suppressing Fastreg completions - no one cares that
they succeeded.

2. If I understood correctly, the CQs are loop-polled in interrupt
context. This may become problematic under a stress workload where
the CQ simply never drains (imagine some busy task that keeps
posting WRs as soon as IOs complete). The CQ poll loop would then
go on forever, and other CQs that share the same EQ would starve
(the CPU is hogged by the endless poller).

   This may be solved in two ways:
   - Use blk-iopoll to maintain fairness - the downside is moving
     from interrupt context to soft-irq (slower).
   - Set a configurable budget for the CQ poller - once the budget
     is exhausted, notify the CQ and bail (a rough sketch follows
     below). If another CQ is waiting to get in, it is only fair
     that it be given a chance; otherwise another interrupt on that
     CQ will arrive immediately.

   I noticed this with iSER, which polls from tasklet context
   (under heavy workload).
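
A rough sketch of the budget idea (not code from this series; the
function name and the budget handling are made up for illustration,
while rpcrdma_recv_event_process() and the verbs calls are the ones
used in the patch above):

	/* Poll at most 'budget' completions, then bail so other CQs
	 * sharing the interrupt vector get a turn.  Returns true if
	 * the CQ was drained and re-armed, false if work remains and
	 * the caller should schedule another polling pass.
	 */
	static bool
	rpcrdma_recvcq_poll_budgeted(struct ib_cq *cq, int budget)
	{
		struct ib_wc wc;

		while (budget-- > 0) {
			if (ib_poll_cq(cq, 1, &wc) != 1)
				break;
			rpcrdma_recv_event_process(&wc);
		}
		if (budget < 0)
			return false;	/* budget exhausted, come back later */

		/* Re-arm, and ask whether completions raced in meanwhile. */
		return ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
					    IB_CQ_REPORT_MISSED_EVENTS) == 0;
	}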

Sagi.

> Fix suggested by Shirley Ma <shirley.ma@oracle.com>
>
> Reported-by: Rafael Reiter <rafael.reiter@ims.co.at>
> Fixes: 5c635e09cec0feeeb310968e51dad01040244851
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211
> Tested-by: Klemens Senn <klemens.senn@ims.co.at>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>
>   net/sunrpc/xprtrdma/verbs.c     |  228 +++++++++++++++++++++++----------------
>   net/sunrpc/xprtrdma/xprt_rdma.h |    1
>   2 files changed, 135 insertions(+), 94 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 0f8c22c..35f5ab6 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -142,96 +142,113 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
>   	}
>   }
>   
> -static inline
> -void rpcrdma_event_process(struct ib_wc *wc)
> +static void
> +rpcrdma_send_event_process(struct ib_wc *wc)
>   {
> -	struct rpcrdma_mw *frmr;
> -	struct rpcrdma_rep *rep =
> -			(struct rpcrdma_rep *)(unsigned long) wc->wr_id;
> +	struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>   
> -	dprintk("RPC:       %s: event rep %p status %X opcode %X length %u\n",
> -		__func__, rep, wc->status, wc->opcode, wc->byte_len);
> +	dprintk("RPC:       %s: frmr %p status %X opcode %d\n",
> +		__func__, frmr, wc->status, wc->opcode);
>   
> -	if (!rep) /* send completion that we don't care about */
> +	if (wc->wr_id == 0ULL)
>   		return;
> -
> -	if (IB_WC_SUCCESS != wc->status) {
> -		dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
> -			__func__, wc->opcode, wc->status);
> -		rep->rr_len = ~0U;
> -		if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
> -			rpcrdma_schedule_tasklet(rep);
> +	if (wc->status != IB_WC_SUCCESS)
>   		return;
> -	}
>   
> -	switch (wc->opcode) {
> -	case IB_WC_FAST_REG_MR:
> -		frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
> +	if (wc->opcode == IB_WC_FAST_REG_MR)
>   		frmr->r.frmr.state = FRMR_IS_VALID;
> -		break;
> -	case IB_WC_LOCAL_INV:
> -		frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
> +	else if (wc->opcode == IB_WC_LOCAL_INV)
>   		frmr->r.frmr.state = FRMR_IS_INVALID;
> -		break;
> -	case IB_WC_RECV:
> -		rep->rr_len = wc->byte_len;
> -		ib_dma_sync_single_for_cpu(
> -			rdmab_to_ia(rep->rr_buffer)->ri_id->device,
> -			rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
> -		/* Keep (only) the most recent credits, after check validity */
> -		if (rep->rr_len >= 16) {
> -			struct rpcrdma_msg *p =
> -					(struct rpcrdma_msg *) rep->rr_base;
> -			unsigned int credits = ntohl(p->rm_credit);
> -			if (credits == 0) {
> -				dprintk("RPC:       %s: server"
> -					" dropped credits to 0!\n", __func__);
> -				/* don't deadlock */
> -				credits = 1;
> -			} else if (credits > rep->rr_buffer->rb_max_requests) {
> -				dprintk("RPC:       %s: server"
> -					" over-crediting: %d (%d)\n",
> -					__func__, credits,
> -					rep->rr_buffer->rb_max_requests);
> -				credits = rep->rr_buffer->rb_max_requests;
> -			}
> -			atomic_set(&rep->rr_buffer->rb_credits, credits);
> -		}
> -		rpcrdma_schedule_tasklet(rep);
> -		break;
> -	default:
> -		dprintk("RPC:       %s: unexpected WC event %X\n",
> -			__func__, wc->opcode);
> -		break;
> -	}
>   }
>   
> -static inline int
> -rpcrdma_cq_poll(struct ib_cq *cq)
> +static int
> +rpcrdma_sendcq_poll(struct ib_cq *cq)
>   {
>   	struct ib_wc wc;
>   	int rc;
>   
> -	for (;;) {
> -		rc = ib_poll_cq(cq, 1, &wc);
> -		if (rc < 0) {
> -			dprintk("RPC:       %s: ib_poll_cq failed %i\n",
> -				__func__, rc);
> -			return rc;
> -		}
> -		if (rc == 0)
> -			break;
> +	while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
> +		rpcrdma_send_event_process(&wc);
> +	return rc;
> +}
>   
> -		rpcrdma_event_process(&wc);
> +/*
> + * Handle send, fast_reg_mr, and local_inv completions.
> + *
> + * Send events are typically suppressed and thus do not result
> + * in an upcall. Occasionally one is signaled, however. This
> + * prevents the provider's completion queue from wrapping and
> + * losing a completion.
> + */
> +static void
> +rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
> +{
> +	int rc;
> +
> +	rc = rpcrdma_sendcq_poll(cq);
> +	if (rc) {
> +		dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
> +			__func__, rc);
> +		return;
>   	}
> +	rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
> +	if (rc) {
> +		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
> +			__func__, rc);
> +		return;
> +	}
> +	rpcrdma_sendcq_poll(cq);
> +}
>   
> -	return 0;
> +static void
> +rpcrdma_recv_event_process(struct ib_wc *wc)
> +{
> +	struct rpcrdma_rep *rep =
> +			(struct rpcrdma_rep *)(unsigned long)wc->wr_id;
> +
> +	dprintk("RPC:       %s: rep %p status %X opcode %X length %u\n",
> +		__func__, rep, wc->status, wc->opcode, wc->byte_len);
> +
> +	if (wc->status != IB_WC_SUCCESS) {
> +		rep->rr_len = ~0U;
> +		goto out_schedule;
> +	}
> +	if (wc->opcode != IB_WC_RECV)
> +		return;
> +
> +	rep->rr_len = wc->byte_len;
> +	ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
> +			rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
> +
> +	if (rep->rr_len >= 16) {
> +		struct rpcrdma_msg *p = (struct rpcrdma_msg *)rep->rr_base;
> +		unsigned int credits = ntohl(p->rm_credit);
> +
> +		if (credits == 0)
> +			credits = 1;	/* don't deadlock */
> +		else if (credits > rep->rr_buffer->rb_max_requests)
> +			credits = rep->rr_buffer->rb_max_requests;
> +		atomic_set(&rep->rr_buffer->rb_credits, credits);
> +	}
> +
> +out_schedule:
> +	rpcrdma_schedule_tasklet(rep);
> +}
> +
> +static int
> +rpcrdma_recvcq_poll(struct ib_cq *cq)
> +{
> +	struct ib_wc wc;
> +	int rc;
> +
> +	while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
> +		rpcrdma_recv_event_process(&wc);
> +	return rc;
>   }
>   
>   /*
> - * rpcrdma_cq_event_upcall
> + * Handle receive completions.
>    *
> - * This upcall handles recv and send events.
>    * It is reentrant but processes single events in order to maintain
>    * ordering of receives to keep server credits.
>    *
> @@ -240,26 +257,25 @@ rpcrdma_cq_poll(struct ib_cq *cq)
>    * connection shutdown. That is, the structures required for
>    * the completion of the reply handler must remain intact until
>    * all memory has been reclaimed.
> - *
> - * Note that send events are suppressed and do not result in an upcall.
>    */
>   static void
> -rpcrdma_cq_event_upcall(struct ib_cq *cq, void *context)
> +rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>   {
>   	int rc;
>   
> -	rc = rpcrdma_cq_poll(cq);
> -	if (rc)
> +	rc = rpcrdma_recvcq_poll(cq);
> +	if (rc) {
> +		dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
> +			__func__, rc);
>   		return;
> -
> +	}
>   	rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>   	if (rc) {
> -		dprintk("RPC:       %s: ib_req_notify_cq failed %i\n",
> +		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>   			__func__, rc);
>   		return;
>   	}
> -
> -	rpcrdma_cq_poll(cq);
> +	rpcrdma_recvcq_poll(cq);
>   }
>   
>   #ifdef RPC_DEBUG
> @@ -613,6 +629,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>   				struct rpcrdma_create_data_internal *cdata)
>   {
>   	struct ib_device_attr devattr;
> +	struct ib_cq *sendcq, *recvcq;
>   	int rc, err;
>   
>   	rc = ib_query_device(ia->ri_id->device, &devattr);
> @@ -688,7 +705,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>   		ep->rep_attr.cap.max_recv_sge);
>   
>   	/* set trigger for requesting send completion */
> -	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
> +	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
>   	if (ep->rep_cqinit <= 2)
>   		ep->rep_cqinit = 0;
>   	INIT_CQCOUNT(ep);
> @@ -696,26 +713,43 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>   	init_waitqueue_head(&ep->rep_connect_wait);
>   	INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
>   
> -	ep->rep_cq = ib_create_cq(ia->ri_id->device, rpcrdma_cq_event_upcall,
> +	sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
>   				  rpcrdma_cq_async_error_upcall, NULL,
> -				  ep->rep_attr.cap.max_recv_wr +
>   				  ep->rep_attr.cap.max_send_wr + 1, 0);
> -	if (IS_ERR(ep->rep_cq)) {
> -		rc = PTR_ERR(ep->rep_cq);
> -		dprintk("RPC:       %s: ib_create_cq failed: %i\n",
> +	if (IS_ERR(sendcq)) {
> +		rc = PTR_ERR(sendcq);
> +		dprintk("RPC:       %s: failed to create send CQ: %i\n",
>   			__func__, rc);
>   		goto out1;
>   	}
>   
> -	rc = ib_req_notify_cq(ep->rep_cq, IB_CQ_NEXT_COMP);
> +	rc = ib_req_notify_cq(sendcq, IB_CQ_NEXT_COMP);
>   	if (rc) {
>   		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>   			__func__, rc);
>   		goto out2;
>   	}
>   
> -	ep->rep_attr.send_cq = ep->rep_cq;
> -	ep->rep_attr.recv_cq = ep->rep_cq;
> +	recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
> +				  rpcrdma_cq_async_error_upcall, NULL,
> +				  ep->rep_attr.cap.max_recv_wr + 1, 0);
> +	if (IS_ERR(recvcq)) {
> +		rc = PTR_ERR(recvcq);
> +		dprintk("RPC:       %s: failed to create recv CQ: %i\n",
> +			__func__, rc);
> +		goto out2;
> +	}
> +
> +	rc = ib_req_notify_cq(recvcq, IB_CQ_NEXT_COMP);
> +	if (rc) {
> +		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
> +			__func__, rc);
> +		ib_destroy_cq(recvcq);
> +		goto out2;
> +	}
> +
> +	ep->rep_attr.send_cq = sendcq;
> +	ep->rep_attr.recv_cq = recvcq;
>   
>   	/* Initialize cma parameters */
>   
> @@ -737,7 +771,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>   	return 0;
>   
>   out2:
> -	err = ib_destroy_cq(ep->rep_cq);
> +	err = ib_destroy_cq(sendcq);
>   	if (err)
>   		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>   			__func__, err);
> @@ -777,8 +811,14 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>   		ep->rep_pad_mr = NULL;
>   	}
>   
> -	rpcrdma_clean_cq(ep->rep_cq);
> -	rc = ib_destroy_cq(ep->rep_cq);
> +	rpcrdma_clean_cq(ep->rep_attr.recv_cq);
> +	rc = ib_destroy_cq(ep->rep_attr.recv_cq);
> +	if (rc)
> +		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
> +			__func__, rc);
> +
> +	rpcrdma_clean_cq(ep->rep_attr.send_cq);
> +	rc = ib_destroy_cq(ep->rep_attr.send_cq);
>   	if (rc)
>   		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>   			__func__, rc);
> @@ -801,7 +841,9 @@ retry:
>   		if (rc && rc != -ENOTCONN)
>   			dprintk("RPC:       %s: rpcrdma_ep_disconnect"
>   				" status %i\n", __func__, rc);
> -		rpcrdma_clean_cq(ep->rep_cq);
> +
> +		rpcrdma_clean_cq(ep->rep_attr.recv_cq);
> +		rpcrdma_clean_cq(ep->rep_attr.send_cq);
>   
>   		xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
>   		id = rpcrdma_create_id(xprt, ia,
> @@ -910,7 +952,8 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>   {
>   	int rc;
>   
> -	rpcrdma_clean_cq(ep->rep_cq);
> +	rpcrdma_clean_cq(ep->rep_attr.recv_cq);
> +	rpcrdma_clean_cq(ep->rep_attr.send_cq);
>   	rc = rdma_disconnect(ia->ri_id);
>   	if (!rc) {
>   		/* returns without wait if not connected */
> @@ -1793,7 +1836,6 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
>   	ib_dma_sync_single_for_cpu(ia->ri_id->device,
>   		rep->rr_iov.addr, rep->rr_iov.length, DMA_BIDIRECTIONAL);
>   
> -	DECR_CQCOUNT(ep);
>   	rc = ib_post_recv(ia->ri_id->qp, &recv_wr, &recv_wr_fail);
>   
>   	if (rc)
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 362a19d..334ab6e 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -79,7 +79,6 @@ struct rpcrdma_ep {
>   	int			rep_cqinit;
>   	int			rep_connected;
>   	struct rpcrdma_ia	*rep_ia;
> -	struct ib_cq		*rep_cq;
>   	struct ib_qp_init_attr	rep_attr;
>   	wait_queue_head_t 	rep_connect_wait;
>   	struct ib_sge		rep_pad;	/* holds zeroed pad */
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 12:48         ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-16 12:48 UTC (permalink / raw)
  To: Chuck Lever, linux-nfs, linux-rdma

On 4/15/2014 1:23 AM, Chuck Lever wrote:
> The current CQ handler uses the ib_wc.opcode field to distinguish
> between event types. However, the contents of that field are not
> reliable if the completion status is not IB_WC_SUCCESS.
>
> When an error completion occurs on a send event, the CQ handler
> schedules a tasklet with something that is not a struct rpcrdma_rep.
> This is never correct behavior, and sometimes it results in a panic.
>
> To resolve this issue, split the completion queue into a send CQ and
> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
> wr_id's, and the receive CQ handler now handles only struct
> rpcrdma_rep wr_id's.

Hey Chuck,

So, two suggestions related (although not directly) to this one:

1. I recommend suppressing Fastreg completions - no one cares that they 
succeeded.

2. If I understood correctly, the CQs are loop-polled in interrupt
context.
     This may become problematic under a stress workload where the CQ
simply never drains (imagine some busy task that keeps posting WRs as
soon as IOs complete). The CQ poll loop would then go on forever, and
could starve other CQs that share the same EQ (the CPU is hogged by
the endless poller).
     This can be solved in two ways:
     - Use blk-iopoll to maintain fairness - the downside is moving
from interrupt context to soft-irq (slower).
     - Set a configurable budget for the CQ poller - after the budget
is exhausted, notify the CQ and bail. If another CQ is waiting to get
in, it is only fair to give it a chance; otherwise another interrupt
on that CQ will arrive immediately. (A sketch of this approach is
below.)

     I noticed this with iSER, which polls from tasklet context
(under heavy workload).
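
     Roughly, the budgeted receive-side poller could look like the
sketch below (rpcrdma_recvcq_upcall_budgeted() and RPCRDMA_CQ_BUDGET
are made-up names; rpcrdma_recv_event_process() is the helper added by
the patch, and the same shape would apply to the send CQ):

#define RPCRDMA_CQ_BUDGET	64	/* assumed tunable, not in the patch */

static void
rpcrdma_recvcq_upcall_budgeted(struct ib_cq *cq, void *cq_context)
{
	struct ib_wc wc;
	int polled = 0;

	while (polled < RPCRDMA_CQ_BUDGET && ib_poll_cq(cq, 1, &wc) == 1) {
		rpcrdma_recv_event_process(&wc);
		polled++;
	}

	/* Always re-arm before bailing.  IB_CQ_REPORT_MISSED_EVENTS
	 * returns a positive value when completions are still queued;
	 * a full implementation would hand the leftover work to a
	 * tasklet or blk-iopoll rather than keep spinning in hard-irq
	 * context. */
	if (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
			     IB_CQ_REPORT_MISSED_EVENTS) > 0 &&
	    polled < RPCRDMA_CQ_BUDGET)
		while (ib_poll_cq(cq, 1, &wc) == 1)
			rpcrdma_recv_event_process(&wc);
}

     The budget value above is only a guess and would need tuning
against real workloads.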

Sagi.

> Fix suggested by Shirley Ma <shirley.ma@oracle.com>
>
> Reported-by: Rafael Reiter <rafael.reiter@ims.co.at>
> Fixes: 5c635e09cec0feeeb310968e51dad01040244851
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211
> Tested-by: Klemens Senn <klemens.senn@ims.co.at>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>
>   net/sunrpc/xprtrdma/verbs.c     |  228 +++++++++++++++++++++++----------------
>   net/sunrpc/xprtrdma/xprt_rdma.h |    1
>   2 files changed, 135 insertions(+), 94 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 0f8c22c..35f5ab6 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -142,96 +142,113 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
>   	}
>   }
>   
> -static inline
> -void rpcrdma_event_process(struct ib_wc *wc)
> +static void
> +rpcrdma_send_event_process(struct ib_wc *wc)
>   {
> -	struct rpcrdma_mw *frmr;
> -	struct rpcrdma_rep *rep =
> -			(struct rpcrdma_rep *)(unsigned long) wc->wr_id;
> +	struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>   
> -	dprintk("RPC:       %s: event rep %p status %X opcode %X length %u\n",
> -		__func__, rep, wc->status, wc->opcode, wc->byte_len);
> +	dprintk("RPC:       %s: frmr %p status %X opcode %d\n",
> +		__func__, frmr, wc->status, wc->opcode);
>   
> -	if (!rep) /* send completion that we don't care about */
> +	if (wc->wr_id == 0ULL)
>   		return;
> -
> -	if (IB_WC_SUCCESS != wc->status) {
> -		dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
> -			__func__, wc->opcode, wc->status);
> -		rep->rr_len = ~0U;
> -		if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
> -			rpcrdma_schedule_tasklet(rep);
> +	if (wc->status != IB_WC_SUCCESS)
>   		return;
> -	}
>   
> -	switch (wc->opcode) {
> -	case IB_WC_FAST_REG_MR:
> -		frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
> +	if (wc->opcode == IB_WC_FAST_REG_MR)
>   		frmr->r.frmr.state = FRMR_IS_VALID;
> -		break;
> -	case IB_WC_LOCAL_INV:
> -		frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
> +	else if (wc->opcode == IB_WC_LOCAL_INV)
>   		frmr->r.frmr.state = FRMR_IS_INVALID;
> -		break;
> -	case IB_WC_RECV:
> -		rep->rr_len = wc->byte_len;
> -		ib_dma_sync_single_for_cpu(
> -			rdmab_to_ia(rep->rr_buffer)->ri_id->device,
> -			rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
> -		/* Keep (only) the most recent credits, after check validity */
> -		if (rep->rr_len >= 16) {
> -			struct rpcrdma_msg *p =
> -					(struct rpcrdma_msg *) rep->rr_base;
> -			unsigned int credits = ntohl(p->rm_credit);
> -			if (credits == 0) {
> -				dprintk("RPC:       %s: server"
> -					" dropped credits to 0!\n", __func__);
> -				/* don't deadlock */
> -				credits = 1;
> -			} else if (credits > rep->rr_buffer->rb_max_requests) {
> -				dprintk("RPC:       %s: server"
> -					" over-crediting: %d (%d)\n",
> -					__func__, credits,
> -					rep->rr_buffer->rb_max_requests);
> -				credits = rep->rr_buffer->rb_max_requests;
> -			}
> -			atomic_set(&rep->rr_buffer->rb_credits, credits);
> -		}
> -		rpcrdma_schedule_tasklet(rep);
> -		break;
> -	default:
> -		dprintk("RPC:       %s: unexpected WC event %X\n",
> -			__func__, wc->opcode);
> -		break;
> -	}
>   }
>   
> -static inline int
> -rpcrdma_cq_poll(struct ib_cq *cq)
> +static int
> +rpcrdma_sendcq_poll(struct ib_cq *cq)
>   {
>   	struct ib_wc wc;
>   	int rc;
>   
> -	for (;;) {
> -		rc = ib_poll_cq(cq, 1, &wc);
> -		if (rc < 0) {
> -			dprintk("RPC:       %s: ib_poll_cq failed %i\n",
> -				__func__, rc);
> -			return rc;
> -		}
> -		if (rc == 0)
> -			break;
> +	while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
> +		rpcrdma_send_event_process(&wc);
> +	return rc;
> +}
>   
> -		rpcrdma_event_process(&wc);
> +/*
> + * Handle send, fast_reg_mr, and local_inv completions.
> + *
> + * Send events are typically suppressed and thus do not result
> + * in an upcall. Occasionally one is signaled, however. This
> + * prevents the provider's completion queue from wrapping and
> + * losing a completion.
> + */
> +static void
> +rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
> +{
> +	int rc;
> +
> +	rc = rpcrdma_sendcq_poll(cq);
> +	if (rc) {
> +		dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
> +			__func__, rc);
> +		return;
>   	}
> +	rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
> +	if (rc) {
> +		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
> +			__func__, rc);
> +		return;
> +	}
> +	rpcrdma_sendcq_poll(cq);
> +}
>   
> -	return 0;
> +static void
> +rpcrdma_recv_event_process(struct ib_wc *wc)
> +{
> +	struct rpcrdma_rep *rep =
> +			(struct rpcrdma_rep *)(unsigned long)wc->wr_id;
> +
> +	dprintk("RPC:       %s: rep %p status %X opcode %X length %u\n",
> +		__func__, rep, wc->status, wc->opcode, wc->byte_len);
> +
> +	if (wc->status != IB_WC_SUCCESS) {
> +		rep->rr_len = ~0U;
> +		goto out_schedule;
> +	}
> +	if (wc->opcode != IB_WC_RECV)
> +		return;
> +
> +	rep->rr_len = wc->byte_len;
> +	ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
> +			rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
> +
> +	if (rep->rr_len >= 16) {
> +		struct rpcrdma_msg *p = (struct rpcrdma_msg *)rep->rr_base;
> +		unsigned int credits = ntohl(p->rm_credit);
> +
> +		if (credits == 0)
> +			credits = 1;	/* don't deadlock */
> +		else if (credits > rep->rr_buffer->rb_max_requests)
> +			credits = rep->rr_buffer->rb_max_requests;
> +		atomic_set(&rep->rr_buffer->rb_credits, credits);
> +	}
> +
> +out_schedule:
> +	rpcrdma_schedule_tasklet(rep);
> +}
> +
> +static int
> +rpcrdma_recvcq_poll(struct ib_cq *cq)
> +{
> +	struct ib_wc wc;
> +	int rc;
> +
> +	while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
> +		rpcrdma_recv_event_process(&wc);
> +	return rc;
>   }
>   
>   /*
> - * rpcrdma_cq_event_upcall
> + * Handle receive completions.
>    *
> - * This upcall handles recv and send events.
>    * It is reentrant but processes single events in order to maintain
>    * ordering of receives to keep server credits.
>    *
> @@ -240,26 +257,25 @@ rpcrdma_cq_poll(struct ib_cq *cq)
>    * connection shutdown. That is, the structures required for
>    * the completion of the reply handler must remain intact until
>    * all memory has been reclaimed.
> - *
> - * Note that send events are suppressed and do not result in an upcall.
>    */
>   static void
> -rpcrdma_cq_event_upcall(struct ib_cq *cq, void *context)
> +rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>   {
>   	int rc;
>   
> -	rc = rpcrdma_cq_poll(cq);
> -	if (rc)
> +	rc = rpcrdma_recvcq_poll(cq);
> +	if (rc) {
> +		dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
> +			__func__, rc);
>   		return;
> -
> +	}
>   	rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>   	if (rc) {
> -		dprintk("RPC:       %s: ib_req_notify_cq failed %i\n",
> +		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>   			__func__, rc);
>   		return;
>   	}
> -
> -	rpcrdma_cq_poll(cq);
> +	rpcrdma_recvcq_poll(cq);
>   }
>   
>   #ifdef RPC_DEBUG
> @@ -613,6 +629,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>   				struct rpcrdma_create_data_internal *cdata)
>   {
>   	struct ib_device_attr devattr;
> +	struct ib_cq *sendcq, *recvcq;
>   	int rc, err;
>   
>   	rc = ib_query_device(ia->ri_id->device, &devattr);
> @@ -688,7 +705,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>   		ep->rep_attr.cap.max_recv_sge);
>   
>   	/* set trigger for requesting send completion */
> -	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
> +	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
>   	if (ep->rep_cqinit <= 2)
>   		ep->rep_cqinit = 0;
>   	INIT_CQCOUNT(ep);
> @@ -696,26 +713,43 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>   	init_waitqueue_head(&ep->rep_connect_wait);
>   	INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
>   
> -	ep->rep_cq = ib_create_cq(ia->ri_id->device, rpcrdma_cq_event_upcall,
> +	sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
>   				  rpcrdma_cq_async_error_upcall, NULL,
> -				  ep->rep_attr.cap.max_recv_wr +
>   				  ep->rep_attr.cap.max_send_wr + 1, 0);
> -	if (IS_ERR(ep->rep_cq)) {
> -		rc = PTR_ERR(ep->rep_cq);
> -		dprintk("RPC:       %s: ib_create_cq failed: %i\n",
> +	if (IS_ERR(sendcq)) {
> +		rc = PTR_ERR(sendcq);
> +		dprintk("RPC:       %s: failed to create send CQ: %i\n",
>   			__func__, rc);
>   		goto out1;
>   	}
>   
> -	rc = ib_req_notify_cq(ep->rep_cq, IB_CQ_NEXT_COMP);
> +	rc = ib_req_notify_cq(sendcq, IB_CQ_NEXT_COMP);
>   	if (rc) {
>   		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>   			__func__, rc);
>   		goto out2;
>   	}
>   
> -	ep->rep_attr.send_cq = ep->rep_cq;
> -	ep->rep_attr.recv_cq = ep->rep_cq;
> +	recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
> +				  rpcrdma_cq_async_error_upcall, NULL,
> +				  ep->rep_attr.cap.max_recv_wr + 1, 0);
> +	if (IS_ERR(recvcq)) {
> +		rc = PTR_ERR(recvcq);
> +		dprintk("RPC:       %s: failed to create recv CQ: %i\n",
> +			__func__, rc);
> +		goto out2;
> +	}
> +
> +	rc = ib_req_notify_cq(recvcq, IB_CQ_NEXT_COMP);
> +	if (rc) {
> +		dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
> +			__func__, rc);
> +		ib_destroy_cq(recvcq);
> +		goto out2;
> +	}
> +
> +	ep->rep_attr.send_cq = sendcq;
> +	ep->rep_attr.recv_cq = recvcq;
>   
>   	/* Initialize cma parameters */
>   
> @@ -737,7 +771,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>   	return 0;
>   
>   out2:
> -	err = ib_destroy_cq(ep->rep_cq);
> +	err = ib_destroy_cq(sendcq);
>   	if (err)
>   		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>   			__func__, err);
> @@ -777,8 +811,14 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>   		ep->rep_pad_mr = NULL;
>   	}
>   
> -	rpcrdma_clean_cq(ep->rep_cq);
> -	rc = ib_destroy_cq(ep->rep_cq);
> +	rpcrdma_clean_cq(ep->rep_attr.recv_cq);
> +	rc = ib_destroy_cq(ep->rep_attr.recv_cq);
> +	if (rc)
> +		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
> +			__func__, rc);
> +
> +	rpcrdma_clean_cq(ep->rep_attr.send_cq);
> +	rc = ib_destroy_cq(ep->rep_attr.send_cq);
>   	if (rc)
>   		dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>   			__func__, rc);
> @@ -801,7 +841,9 @@ retry:
>   		if (rc && rc != -ENOTCONN)
>   			dprintk("RPC:       %s: rpcrdma_ep_disconnect"
>   				" status %i\n", __func__, rc);
> -		rpcrdma_clean_cq(ep->rep_cq);
> +
> +		rpcrdma_clean_cq(ep->rep_attr.recv_cq);
> +		rpcrdma_clean_cq(ep->rep_attr.send_cq);
>   
>   		xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
>   		id = rpcrdma_create_id(xprt, ia,
> @@ -910,7 +952,8 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>   {
>   	int rc;
>   
> -	rpcrdma_clean_cq(ep->rep_cq);
> +	rpcrdma_clean_cq(ep->rep_attr.recv_cq);
> +	rpcrdma_clean_cq(ep->rep_attr.send_cq);
>   	rc = rdma_disconnect(ia->ri_id);
>   	if (!rc) {
>   		/* returns without wait if not connected */
> @@ -1793,7 +1836,6 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
>   	ib_dma_sync_single_for_cpu(ia->ri_id->device,
>   		rep->rr_iov.addr, rep->rr_iov.length, DMA_BIDIRECTIONAL);
>   
> -	DECR_CQCOUNT(ep);
>   	rc = ib_post_recv(ia->ri_id->qp, &recv_wr, &recv_wr_fail);
>   
>   	if (rc)
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 362a19d..334ab6e 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -79,7 +79,6 @@ struct rpcrdma_ep {
>   	int			rep_cqinit;
>   	int			rep_connected;
>   	struct rpcrdma_ia	*rep_ia;
> -	struct ib_cq		*rep_cq;
>   	struct ib_qp_init_attr	rep_attr;
>   	wait_queue_head_t 	rep_connect_wait;
>   	struct ib_sge		rep_pad;	/* holds zeroed pad */
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 13:30             ` Steve Wise
  0 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-16 13:30 UTC (permalink / raw)
  To: Sagi Grimberg, Chuck Lever, linux-nfs, linux-rdma

On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
> On 4/15/2014 1:23 AM, Chuck Lever wrote:
>> The current CQ handler uses the ib_wc.opcode field to distinguish
>> between event types. However, the contents of that field are not
>> reliable if the completion status is not IB_WC_SUCCESS.
>>
>> When an error completion occurs on a send event, the CQ handler
>> schedules a tasklet with something that is not a struct rpcrdma_rep.
>> This is never correct behavior, and sometimes it results in a panic.
>>
>> To resolve this issue, split the completion queue into a send CQ and
>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
>> wr_id's, and the receive CQ handler now handles only struct
>> rpcrdma_rep wr_id's.
>
> Hey Chuck,
>
> So 2 suggestions related (although not directly) to this one.
>
> 1. I recommend suppressing Fastreg completions - no one cares that 
> they succeeded.
>

Not true.  The nfsrdma client uses frmrs across re-connects for the same 
mount and needs to know at any point in time if a frmr is registered or 
invalid.  So completions of both fastreg and invalidate need to be 
signaled.  See:

commit 5c635e09cec0feeeb310968e51dad01040244851
Author: Tom Tucker <tom@ogc.us>
Date:   Wed Feb 9 19:45:34 2011 +0000

     RPCRDMA: Fix FRMR registration/invalidate handling.
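
For illustration only - the function and parameter names below are made
up, and the registration-side fields are omitted - a signaled
invalidate post would look roughly like this:

static int
rpcrdma_post_local_inv_signaled(struct ib_qp *qp, struct rpcrdma_mw *frmr,
				u32 rkey)
{
	struct ib_send_wr invalidate_wr, *bad_wr;

	memset(&invalidate_wr, 0, sizeof(invalidate_wr));
	invalidate_wr.wr_id = (u64)(unsigned long)frmr;	/* keyed by the MW */
	invalidate_wr.opcode = IB_WR_LOCAL_INV;
	invalidate_wr.send_flags = IB_SEND_SIGNALED;	/* do not suppress */
	invalidate_wr.ex.invalidate_rkey = rkey;

	/* The completion then reaches rpcrdma_sendcq_poll() in the patch
	 * above, and rpcrdma_send_event_process() marks the MW
	 * FRMR_IS_INVALID. */
	return ib_post_send(qp, &invalidate_wr, &bad_wr);
}

The fastreg side is the same idea with IB_WR_FAST_REG_MR and the
wr.fast_reg fields filled in, again posted signaled so the handler can
mark the MW FRMR_IS_VALID.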



> 2. If I understood correctly, I see that the CQs are loop-polled in an 
> interrupt context.
>     This may become problematic in stress workload where the CQ simply 
> never drains (imagine
> some busy task keeps posting WRs as soon as IOs complete). This 
> will cause CQ poll loop
>     to go on forever. This situation may cause starvation of other CQs 
> that share the same EQ (CPU
>     is hogged by the endless poller).
>     This may be solved in 2 ways:
>     - Use blk-iopoll to maintain fairness - the downside will be 
> moving from interrupt context to soft-irq (slower).
>     - Set some configurable budget to CQ poller - after finishing the 
> budget, notify the CQ and bail.
>       If there is another CQ waiting to get in - it's only fair that 
> it will be given with a chance, else another interrupt
>       on that CQ will immediately come.
>
>     I noticed that one with iSER which polls from tasklet context 
> (under heavy workload).
>
> Sagi.
>
>> Fix suggested by Shirley Ma <shirley.ma@oracle.com>
>>
>> Reported-by: Rafael Reiter <rafael.reiter@ims.co.at>
>> Fixes: 5c635e09cec0feeeb310968e51dad01040244851
>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211
>> Tested-by: Klemens Senn <klemens.senn@ims.co.at>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>
>>   net/sunrpc/xprtrdma/verbs.c     |  228 
>> +++++++++++++++++++++++----------------
>>   net/sunrpc/xprtrdma/xprt_rdma.h |    1
>>   2 files changed, 135 insertions(+), 94 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 0f8c22c..35f5ab6 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -142,96 +142,113 @@ rpcrdma_cq_async_error_upcall(struct ib_event 
>> *event, void *context)
>>       }
>>   }
>>   -static inline
>> -void rpcrdma_event_process(struct ib_wc *wc)
>> +static void
>> +rpcrdma_send_event_process(struct ib_wc *wc)
>>   {
>> -    struct rpcrdma_mw *frmr;
>> -    struct rpcrdma_rep *rep =
>> -            (struct rpcrdma_rep *)(unsigned long) wc->wr_id;
>> +    struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned 
>> long)wc->wr_id;
>>   -    dprintk("RPC:       %s: event rep %p status %X opcode %X 
>> length %u\n",
>> -        __func__, rep, wc->status, wc->opcode, wc->byte_len);
>> +    dprintk("RPC:       %s: frmr %p status %X opcode %d\n",
>> +        __func__, frmr, wc->status, wc->opcode);
>>   -    if (!rep) /* send completion that we don't care about */
>> +    if (wc->wr_id == 0ULL)
>>           return;
>> -
>> -    if (IB_WC_SUCCESS != wc->status) {
>> -        dprintk("RPC:       %s: WC opcode %d status %X, connection 
>> lost\n",
>> -            __func__, wc->opcode, wc->status);
>> -        rep->rr_len = ~0U;
>> -        if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != 
>> IB_WC_LOCAL_INV)
>> -            rpcrdma_schedule_tasklet(rep);
>> +    if (wc->status != IB_WC_SUCCESS)
>>           return;
>> -    }
>>   -    switch (wc->opcode) {
>> -    case IB_WC_FAST_REG_MR:
>> -        frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>> +    if (wc->opcode == IB_WC_FAST_REG_MR)
>>           frmr->r.frmr.state = FRMR_IS_VALID;
>> -        break;
>> -    case IB_WC_LOCAL_INV:
>> -        frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>> +    else if (wc->opcode == IB_WC_LOCAL_INV)
>>           frmr->r.frmr.state = FRMR_IS_INVALID;
>> -        break;
>> -    case IB_WC_RECV:
>> -        rep->rr_len = wc->byte_len;
>> -        ib_dma_sync_single_for_cpu(
>> - rdmab_to_ia(rep->rr_buffer)->ri_id->device,
>> -            rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
>> -        /* Keep (only) the most recent credits, after check validity */
>> -        if (rep->rr_len >= 16) {
>> -            struct rpcrdma_msg *p =
>> -                    (struct rpcrdma_msg *) rep->rr_base;
>> -            unsigned int credits = ntohl(p->rm_credit);
>> -            if (credits == 0) {
>> -                dprintk("RPC:       %s: server"
>> -                    " dropped credits to 0!\n", __func__);
>> -                /* don't deadlock */
>> -                credits = 1;
>> -            } else if (credits > rep->rr_buffer->rb_max_requests) {
>> -                dprintk("RPC:       %s: server"
>> -                    " over-crediting: %d (%d)\n",
>> -                    __func__, credits,
>> -                    rep->rr_buffer->rb_max_requests);
>> -                credits = rep->rr_buffer->rb_max_requests;
>> -            }
>> -            atomic_set(&rep->rr_buffer->rb_credits, credits);
>> -        }
>> -        rpcrdma_schedule_tasklet(rep);
>> -        break;
>> -    default:
>> -        dprintk("RPC:       %s: unexpected WC event %X\n",
>> -            __func__, wc->opcode);
>> -        break;
>> -    }
>>   }
>>   -static inline int
>> -rpcrdma_cq_poll(struct ib_cq *cq)
>> +static int
>> +rpcrdma_sendcq_poll(struct ib_cq *cq)
>>   {
>>       struct ib_wc wc;
>>       int rc;
>>   -    for (;;) {
>> -        rc = ib_poll_cq(cq, 1, &wc);
>> -        if (rc < 0) {
>> -            dprintk("RPC:       %s: ib_poll_cq failed %i\n",
>> -                __func__, rc);
>> -            return rc;
>> -        }
>> -        if (rc == 0)
>> -            break;
>> +    while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
>> +        rpcrdma_send_event_process(&wc);
>> +    return rc;
>> +}
>>   -        rpcrdma_event_process(&wc);
>> +/*
>> + * Handle send, fast_reg_mr, and local_inv completions.
>> + *
>> + * Send events are typically suppressed and thus do not result
>> + * in an upcall. Occasionally one is signaled, however. This
>> + * prevents the provider's completion queue from wrapping and
>> + * losing a completion.
>> + */
>> +static void
>> +rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>> +{
>> +    int rc;
>> +
>> +    rc = rpcrdma_sendcq_poll(cq);
>> +    if (rc) {
>> +        dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>> +            __func__, rc);
>> +        return;
>>       }
>> +    rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>> +    if (rc) {
>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>> +            __func__, rc);
>> +        return;
>> +    }
>> +    rpcrdma_sendcq_poll(cq);
>> +}
>>   -    return 0;
>> +static void
>> +rpcrdma_recv_event_process(struct ib_wc *wc)
>> +{
>> +    struct rpcrdma_rep *rep =
>> +            (struct rpcrdma_rep *)(unsigned long)wc->wr_id;
>> +
>> +    dprintk("RPC:       %s: rep %p status %X opcode %X length %u\n",
>> +        __func__, rep, wc->status, wc->opcode, wc->byte_len);
>> +
>> +    if (wc->status != IB_WC_SUCCESS) {
>> +        rep->rr_len = ~0U;
>> +        goto out_schedule;
>> +    }
>> +    if (wc->opcode != IB_WC_RECV)
>> +        return;
>> +
>> +    rep->rr_len = wc->byte_len;
>> + ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
>> +            rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
>> +
>> +    if (rep->rr_len >= 16) {
>> +        struct rpcrdma_msg *p = (struct rpcrdma_msg *)rep->rr_base;
>> +        unsigned int credits = ntohl(p->rm_credit);
>> +
>> +        if (credits == 0)
>> +            credits = 1;    /* don't deadlock */
>> +        else if (credits > rep->rr_buffer->rb_max_requests)
>> +            credits = rep->rr_buffer->rb_max_requests;
>> +        atomic_set(&rep->rr_buffer->rb_credits, credits);
>> +    }
>> +
>> +out_schedule:
>> +    rpcrdma_schedule_tasklet(rep);
>> +}
>> +
>> +static int
>> +rpcrdma_recvcq_poll(struct ib_cq *cq)
>> +{
>> +    struct ib_wc wc;
>> +    int rc;
>> +
>> +    while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
>> +        rpcrdma_recv_event_process(&wc);
>> +    return rc;
>>   }
>>     /*
>> - * rpcrdma_cq_event_upcall
>> + * Handle receive completions.
>>    *
>> - * This upcall handles recv and send events.
>>    * It is reentrant but processes single events in order to maintain
>>    * ordering of receives to keep server credits.
>>    *
>> @@ -240,26 +257,25 @@ rpcrdma_cq_poll(struct ib_cq *cq)
>>    * connection shutdown. That is, the structures required for
>>    * the completion of the reply handler must remain intact until
>>    * all memory has been reclaimed.
>> - *
>> - * Note that send events are suppressed and do not result in an upcall.
>>    */
>>   static void
>> -rpcrdma_cq_event_upcall(struct ib_cq *cq, void *context)
>> +rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>   {
>>       int rc;
>>   -    rc = rpcrdma_cq_poll(cq);
>> -    if (rc)
>> +    rc = rpcrdma_recvcq_poll(cq);
>> +    if (rc) {
>> +        dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>> +            __func__, rc);
>>           return;
>> -
>> +    }
>>       rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>>       if (rc) {
>> -        dprintk("RPC:       %s: ib_req_notify_cq failed %i\n",
>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>               __func__, rc);
>>           return;
>>       }
>> -
>> -    rpcrdma_cq_poll(cq);
>> +    rpcrdma_recvcq_poll(cq);
>>   }
>>     #ifdef RPC_DEBUG
>> @@ -613,6 +629,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
>> rpcrdma_ia *ia,
>>                   struct rpcrdma_create_data_internal *cdata)
>>   {
>>       struct ib_device_attr devattr;
>> +    struct ib_cq *sendcq, *recvcq;
>>       int rc, err;
>>         rc = ib_query_device(ia->ri_id->device, &devattr);
>> @@ -688,7 +705,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
>> rpcrdma_ia *ia,
>>           ep->rep_attr.cap.max_recv_sge);
>>         /* set trigger for requesting send completion */
>> -    ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
>> +    ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
>>       if (ep->rep_cqinit <= 2)
>>           ep->rep_cqinit = 0;
>>       INIT_CQCOUNT(ep);
>> @@ -696,26 +713,43 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
>> rpcrdma_ia *ia,
>>       init_waitqueue_head(&ep->rep_connect_wait);
>>       INIT_DELAYED_WORK(&ep->rep_connect_worker, 
>> rpcrdma_connect_worker);
>>   -    ep->rep_cq = ib_create_cq(ia->ri_id->device, 
>> rpcrdma_cq_event_upcall,
>> +    sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
>>                     rpcrdma_cq_async_error_upcall, NULL,
>> -                  ep->rep_attr.cap.max_recv_wr +
>>                     ep->rep_attr.cap.max_send_wr + 1, 0);
>> -    if (IS_ERR(ep->rep_cq)) {
>> -        rc = PTR_ERR(ep->rep_cq);
>> -        dprintk("RPC:       %s: ib_create_cq failed: %i\n",
>> +    if (IS_ERR(sendcq)) {
>> +        rc = PTR_ERR(sendcq);
>> +        dprintk("RPC:       %s: failed to create send CQ: %i\n",
>>               __func__, rc);
>>           goto out1;
>>       }
>>   -    rc = ib_req_notify_cq(ep->rep_cq, IB_CQ_NEXT_COMP);
>> +    rc = ib_req_notify_cq(sendcq, IB_CQ_NEXT_COMP);
>>       if (rc) {
>>           dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>               __func__, rc);
>>           goto out2;
>>       }
>>   -    ep->rep_attr.send_cq = ep->rep_cq;
>> -    ep->rep_attr.recv_cq = ep->rep_cq;
>> +    recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
>> +                  rpcrdma_cq_async_error_upcall, NULL,
>> +                  ep->rep_attr.cap.max_recv_wr + 1, 0);
>> +    if (IS_ERR(recvcq)) {
>> +        rc = PTR_ERR(recvcq);
>> +        dprintk("RPC:       %s: failed to create recv CQ: %i\n",
>> +            __func__, rc);
>> +        goto out2;
>> +    }
>> +
>> +    rc = ib_req_notify_cq(recvcq, IB_CQ_NEXT_COMP);
>> +    if (rc) {
>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>> +            __func__, rc);
>> +        ib_destroy_cq(recvcq);
>> +        goto out2;
>> +    }
>> +
>> +    ep->rep_attr.send_cq = sendcq;
>> +    ep->rep_attr.recv_cq = recvcq;
>>         /* Initialize cma parameters */
>>   @@ -737,7 +771,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
>> rpcrdma_ia *ia,
>>       return 0;
>>     out2:
>> -    err = ib_destroy_cq(ep->rep_cq);
>> +    err = ib_destroy_cq(sendcq);
>>       if (err)
>>           dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>>               __func__, err);
>> @@ -777,8 +811,14 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct 
>> rpcrdma_ia *ia)
>>           ep->rep_pad_mr = NULL;
>>       }
>>   -    rpcrdma_clean_cq(ep->rep_cq);
>> -    rc = ib_destroy_cq(ep->rep_cq);
>> +    rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>> +    rc = ib_destroy_cq(ep->rep_attr.recv_cq);
>> +    if (rc)
>> +        dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>> +            __func__, rc);
>> +
>> +    rpcrdma_clean_cq(ep->rep_attr.send_cq);
>> +    rc = ib_destroy_cq(ep->rep_attr.send_cq);
>>       if (rc)
>>           dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>>               __func__, rc);
>> @@ -801,7 +841,9 @@ retry:
>>           if (rc && rc != -ENOTCONN)
>>               dprintk("RPC:       %s: rpcrdma_ep_disconnect"
>>                   " status %i\n", __func__, rc);
>> -        rpcrdma_clean_cq(ep->rep_cq);
>> +
>> +        rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>> +        rpcrdma_clean_cq(ep->rep_attr.send_cq);
>>             xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
>>           id = rpcrdma_create_id(xprt, ia,
>> @@ -910,7 +952,8 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, 
>> struct rpcrdma_ia *ia)
>>   {
>>       int rc;
>>   -    rpcrdma_clean_cq(ep->rep_cq);
>> +    rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>> +    rpcrdma_clean_cq(ep->rep_attr.send_cq);
>>       rc = rdma_disconnect(ia->ri_id);
>>       if (!rc) {
>>           /* returns without wait if not connected */
>> @@ -1793,7 +1836,6 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
>>       ib_dma_sync_single_for_cpu(ia->ri_id->device,
>>           rep->rr_iov.addr, rep->rr_iov.length, DMA_BIDIRECTIONAL);
>>   -    DECR_CQCOUNT(ep);
>>       rc = ib_post_recv(ia->ri_id->qp, &recv_wr, &recv_wr_fail);
>>         if (rc)
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h 
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 362a19d..334ab6e 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -79,7 +79,6 @@ struct rpcrdma_ep {
>>       int            rep_cqinit;
>>       int            rep_connected;
>>       struct rpcrdma_ia    *rep_ia;
>> -    struct ib_cq        *rep_cq;
>>       struct ib_qp_init_attr    rep_attr;
>>       wait_queue_head_t     rep_connect_wait;
>>       struct ib_sge        rep_pad;    /* holds zeroed pad */
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
  2014-04-16 13:30             ` Steve Wise
@ 2014-04-16 14:12                 ` Sagi Grimberg
  -1 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-16 14:12 UTC (permalink / raw)
  To: Steve Wise, Chuck Lever, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 4/16/2014 4:30 PM, Steve Wise wrote:
> On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
>> On 4/15/2014 1:23 AM, Chuck Lever wrote:
>>> The current CQ handler uses the ib_wc.opcode field to distinguish
>>> between event types. However, the contents of that field are not
>>> reliable if the completion status is not IB_WC_SUCCESS.
>>>
>>> When an error completion occurs on a send event, the CQ handler
>>> schedules a tasklet with something that is not a struct rpcrdma_rep.
>>> This is never correct behavior, and sometimes it results in a panic.
>>>
>>> To resolve this issue, split the completion queue into a send CQ and
>>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
>>> wr_id's, and the receive CQ handler now handles only struct
>>> rpcrdma_rep wr_id's.
>>
>> Hey Chuck,
>>
>> So 2 suggestions related (although not directly) to this one.
>>
>> 1. I recommend suppressing Fastreg completions - no one cares that 
>> they succeeded.
>>
>
> Not true.  The nfsrdma client uses frmrs across re-connects for the 
> same mount and needs to know at any point in time if a frmr is 
> registered or invalid.  So completions of both fastreg and invalidate 
> need to be signaled.  See:
>
> commit 5c635e09cec0feeeb310968e51dad01040244851
> Author: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
> Date:   Wed Feb 9 19:45:34 2011 +0000
>
>     RPCRDMA: Fix FRMR registration/invalidate handling.
>

Hmm, but if either the FASTREG or the LINV fails, the QP will go to the
error state and you *will* get the error wc (followed by a rain of FLUSH
errors). AFAICT it is safe to assume the work request succeeded as long
as you don't get an error completion.
Moreover, FASTREG on top of FASTREG is indeed not allowed, but AFAIK
LINV on top of LINV is allowed.
So it is OK to always post a LINV+FASTREG list for each registration;
that way there is no need to account for successful completions.
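
Roughly, I mean something like this each time a buffer is registered
(just a sketch; frmr_rkey, dma_addr, page_list, npages, len and qp are
placeholder names, not the actual xprtrdma ones):

    struct ib_send_wr inv_wr, fastreg_wr, *bad_wr;
    int rc;

    /* Invalidate the rkey first, whether or not it is currently valid */
    memset(&inv_wr, 0, sizeof(inv_wr));
    inv_wr.opcode = IB_WR_LOCAL_INV;
    inv_wr.ex.invalidate_rkey = frmr_rkey;
    inv_wr.next = &fastreg_wr;              /* chain the two WRs */

    /* Then fast-register the new mapping in the same post */
    memset(&fastreg_wr, 0, sizeof(fastreg_wr));
    fastreg_wr.opcode = IB_WR_FAST_REG_MR;
    fastreg_wr.wr.fast_reg.rkey = frmr_rkey;
    fastreg_wr.wr.fast_reg.iova_start = dma_addr;
    fastreg_wr.wr.fast_reg.page_list = page_list;
    fastreg_wr.wr.fast_reg.page_list_len = npages;
    fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
    fastreg_wr.wr.fast_reg.length = len;
    fastreg_wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
                                          IB_ACCESS_REMOTE_WRITE;
    /* Leave send_flags at 0 to suppress the completion, or set
     * IB_SEND_SIGNALED here if the completion is still wanted. */

    rc = ib_post_send(qp, &inv_wr, &bad_wr);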

Cheers,
Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 14:12                 ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-16 14:12 UTC (permalink / raw)
  To: Steve Wise, Chuck Lever, linux-nfs, linux-rdma

On 4/16/2014 4:30 PM, Steve Wise wrote:
> On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
>> On 4/15/2014 1:23 AM, Chuck Lever wrote:
>>> The current CQ handler uses the ib_wc.opcode field to distinguish
>>> between event types. However, the contents of that field are not
>>> reliable if the completion status is not IB_WC_SUCCESS.
>>>
>>> When an error completion occurs on a send event, the CQ handler
>>> schedules a tasklet with something that is not a struct rpcrdma_rep.
>>> This is never correct behavior, and sometimes it results in a panic.
>>>
>>> To resolve this issue, split the completion queue into a send CQ and
>>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
>>> wr_id's, and the receive CQ handler now handles only struct
>>> rpcrdma_rep wr_id's.
>>
>> Hey Chuck,
>>
>> So 2 suggestions related (although not directly) to this one.
>>
>> 1. I recommend suppressing Fastreg completions - no one cares that 
>> they succeeded.
>>
>
> Not true.  The nfsrdma client uses frmrs across re-connects for the 
> same mount and needs to know at any point in time if a frmr is 
> registered or invalid.  So completions of both fastreg and invalidate 
> need to be signaled.  See:
>
> commit 5c635e09cec0feeeb310968e51dad01040244851
> Author: Tom Tucker <tom@ogc.us>
> Date:   Wed Feb 9 19:45:34 2011 +0000
>
>     RPCRDMA: Fix FRMR registration/invalidate handling.
>

Hmm, but if either the FASTREG or the LINV fails, the QP will go to the
error state and you *will* get the error wc (followed by a rain of FLUSH
errors). AFAICT it is safe to assume the work request succeeded as long
as you don't get an error completion.
Moreover, FASTREG on top of FASTREG is indeed not allowed, but AFAIK
LINV on top of LINV is allowed.
So it is OK to always post a LINV+FASTREG list for each registration;
that way there is no need to account for successful completions.

Cheers,
Sagi.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 7/8] xprtrdma: Split the completion queue
  2014-04-16 14:12                 ` Sagi Grimberg
@ 2014-04-16 14:25                     ` Steve Wise
  -1 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-16 14:25 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Chuck Lever',
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA



> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Wednesday, April 16, 2014 9:13 AM
> To: Steve Wise; Chuck Lever; linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
> 
> On 4/16/2014 4:30 PM, Steve Wise wrote:
> > On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
> >> On 4/15/2014 1:23 AM, Chuck Lever wrote:
> >>> The current CQ handler uses the ib_wc.opcode field to distinguish
> >>> between event types. However, the contents of that field are not
> >>> reliable if the completion status is not IB_WC_SUCCESS.
> >>>
> >>> When an error completion occurs on a send event, the CQ handler
> >>> schedules a tasklet with something that is not a struct rpcrdma_rep.
> >>> This is never correct behavior, and sometimes it results in a panic.
> >>>
> >>> To resolve this issue, split the completion queue into a send CQ and
> >>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
> >>> wr_id's, and the receive CQ handler now handles only struct
> >>> rpcrdma_rep wr_id's.
> >>
> >> Hey Chuck,
> >>
> >> So 2 suggestions related (although not directly) to this one.
> >>
> >> 1. I recommend suppressing Fastreg completions - no one cares that
> >> they succeeded.
> >>
> >
> > Not true.  The nfsrdma client uses frmrs across re-connects for the
> > same mount and needs to know at any point in time if a frmr is
> > registered or invalid.  So completions of both fastreg and invalidate
> > need to be signaled.  See:
> >
> > commit 5c635e09cec0feeeb310968e51dad01040244851
> > Author: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
> > Date:   Wed Feb 9 19:45:34 2011 +0000
> >
> >     RPCRDMA: Fix FRMR registration/invalidate handling.
> >
> 
> Hmm, But if either FASTREG or LINV failed the QP will go to error state
> and you *will* get the error wc (with a rain of FLUSH errors).
> AFAICT it is safe to assume that it succeeded as long as you don't get
> error completions.

But if an unsignaled FASTREG is posted and silently succeeds, and then the next signaled work request fails, I believe the FASTREG will be completed with FLUSH status even though the operation actually completed in the hardware.  So the driver would mark the frmr as INVALID, and a subsequent FASTREG for this frmr would fail because the frmr is actually in the VALID state. 

> Moreover, FASTREG on top of FASTREG are not allowed indeed, but AFAIK
> LINV on top of LINV are allowed.
> It is OK to just always do LINV+FASTREG post-list each registration and
> this way no need to account for successful completions.

Perhaps always posting a LINV+FASTREG would do the trick.  

Regardless, I recommend we don't muddle this particular patch, which fixes a bug by using separate SQ and RQ CQs, with tweaks to how frmr registration is managed.  IE, that should be a separate patch for review/testing/etc.  

Steve.


> 
> Cheers,
> Sagi.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 14:25                     ` Steve Wise
  0 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-16 14:25 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Chuck Lever', linux-nfs, linux-rdma



> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
> Sent: Wednesday, April 16, 2014 9:13 AM
> To: Steve Wise; Chuck Lever; linux-nfs@vger.kernel.org; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
> 
> On 4/16/2014 4:30 PM, Steve Wise wrote:
> > On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
> >> On 4/15/2014 1:23 AM, Chuck Lever wrote:
> >>> The current CQ handler uses the ib_wc.opcode field to distinguish
> >>> between event types. However, the contents of that field are not
> >>> reliable if the completion status is not IB_WC_SUCCESS.
> >>>
> >>> When an error completion occurs on a send event, the CQ handler
> >>> schedules a tasklet with something that is not a struct rpcrdma_rep.
> >>> This is never correct behavior, and sometimes it results in a panic.
> >>>
> >>> To resolve this issue, split the completion queue into a send CQ and
> >>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
> >>> wr_id's, and the receive CQ handler now handles only struct
> >>> rpcrdma_rep wr_id's.
> >>
> >> Hey Chuck,
> >>
> >> So 2 suggestions related (although not directly) to this one.
> >>
> >> 1. I recommend suppressing Fastreg completions - no one cares that
> >> they succeeded.
> >>
> >
> > Not true.  The nfsrdma client uses frmrs across re-connects for the
> > same mount and needs to know at any point in time if a frmr is
> > registered or invalid.  So completions of both fastreg and invalidate
> > need to be signaled.  See:
> >
> > commit 5c635e09cec0feeeb310968e51dad01040244851
> > Author: Tom Tucker <tom@ogc.us>
> > Date:   Wed Feb 9 19:45:34 2011 +0000
> >
> >     RPCRDMA: Fix FRMR registration/invalidate handling.
> >
> 
> Hmm, But if either FASTREG or LINV failed the QP will go to error state
> and you *will* get the error wc (with a rain of FLUSH errors).
> AFAICT it is safe to assume that it succeeded as long as you don't get
> error completions.

But if an unsignaled FASTREG is posted and silently succeeds, and then the next signaled work request fails, I believe the FASTREG will be completed with FLUSH status even though the operation actually completed in the hardware.  So the driver would mark the frmr as INVALID, and a subsequent FASTREG for this frmr would fail because the frmr is actually in the VALID state. 

> Moreover, FASTREG on top of FASTREG are not allowed indeed, but AFAIK
> LINV on top of LINV are allowed.
> It is OK to just always do LINV+FASTREG post-list each registration and
> this way no need to account for successful completions.

Perhaps always posting a LINV+FASTREG would do the trick.  

Regardless, I recommend we don't muddle this particular patch, which fixes a bug by using separate SQ and RQ CQs, with tweaks to how frmr registration is managed.  IE, that should be a separate patch for review/testing/etc.  

Steve.


> 
> Cheers,
> Sagi.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
  2014-04-16 14:25                     ` Steve Wise
@ 2014-04-16 14:35                       ` Sagi Grimberg
  -1 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-16 14:35 UTC (permalink / raw)
  To: Steve Wise, 'Chuck Lever',
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 4/16/2014 5:25 PM, Steve Wise wrote:
>
>> -----Original Message-----
>> From: Sagi Grimberg [mailto:sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
>> Sent: Wednesday, April 16, 2014 9:13 AM
>> To: Steve Wise; Chuck Lever; linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
>>
>> On 4/16/2014 4:30 PM, Steve Wise wrote:
>>> On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
>>>> On 4/15/2014 1:23 AM, Chuck Lever wrote:
>>>>> The current CQ handler uses the ib_wc.opcode field to distinguish
>>>>> between event types. However, the contents of that field are not
>>>>> reliable if the completion status is not IB_WC_SUCCESS.
>>>>>
>>>>> When an error completion occurs on a send event, the CQ handler
>>>>> schedules a tasklet with something that is not a struct rpcrdma_rep.
>>>>> This is never correct behavior, and sometimes it results in a panic.
>>>>>
>>>>> To resolve this issue, split the completion queue into a send CQ and
>>>>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
>>>>> wr_id's, and the receive CQ handler now handles only struct
>>>>> rpcrdma_rep wr_id's.
>>>> Hey Chuck,
>>>>
>>>> So 2 suggestions related (although not directly) to this one.
>>>>
>>>> 1. I recommend suppressing Fastreg completions - no one cares that
>>>> they succeeded.
>>>>
>>> Not true.  The nfsrdma client uses frmrs across re-connects for the
>>> same mount and needs to know at any point in time if a frmr is
>>> registered or invalid.  So completions of both fastreg and invalidate
>>> need to be signaled.  See:
>>>
>>> commit 5c635e09cec0feeeb310968e51dad01040244851
>>> Author: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
>>> Date:   Wed Feb 9 19:45:34 2011 +0000
>>>
>>>      RPCRDMA: Fix FRMR registration/invalidate handling.
>>>
>> Hmm, But if either FASTREG or LINV failed the QP will go to error state
>> and you *will* get the error wc (with a rain of FLUSH errors).
>> AFAICT it is safe to assume that it succeeded as long as you don't get
>> error completions.
> But if an unsignaled FASTREG is posted and silently succeeds, then the next signaled work request fails, I believe the FASTREG will be completed with FLUSH status, yet the operation actually completed in the hw.

Actually, if any WR completed successfully and SW got it back as a FLUSH
error, that seems like a bug to me.
Once the HW has processed the WQ entry it should update the consumer
index accordingly, so this should not happen.

>    So the driver would mark the frmr as INVALID, and a subsequent FASTREG for this frmr would fail because the frmr is in the VALID state.
>
>> Moreover, FASTREG on top of FASTREG are not allowed indeed, but AFAIK
>> LINV on top of LINV are allowed.
>> It is OK to just always do LINV+FASTREG post-list each registration and
>> this way no need to account for successful completions.
> Perhaps always posting a LINV+FASTREG would do the trick.
>
> Regardless, I recommend we don't muddle this particular patch which fixes a bug by using separate SQ and RQ CQs with tweaking how frmr registration is managed.  IE this should be a separate patch for review/testing/etc.

Agree, as I said it wasn't directly related to this patch.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 14:35                       ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-16 14:35 UTC (permalink / raw)
  To: Steve Wise, 'Chuck Lever', linux-nfs, linux-rdma

On 4/16/2014 5:25 PM, Steve Wise wrote:
>
>> -----Original Message-----
>> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
>> Sent: Wednesday, April 16, 2014 9:13 AM
>> To: Steve Wise; Chuck Lever; linux-nfs@vger.kernel.org; linux-rdma@vger.kernel.org
>> Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
>>
>> On 4/16/2014 4:30 PM, Steve Wise wrote:
>>> On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
>>>> On 4/15/2014 1:23 AM, Chuck Lever wrote:
>>>>> The current CQ handler uses the ib_wc.opcode field to distinguish
>>>>> between event types. However, the contents of that field are not
>>>>> reliable if the completion status is not IB_WC_SUCCESS.
>>>>>
>>>>> When an error completion occurs on a send event, the CQ handler
>>>>> schedules a tasklet with something that is not a struct rpcrdma_rep.
>>>>> This is never correct behavior, and sometimes it results in a panic.
>>>>>
>>>>> To resolve this issue, split the completion queue into a send CQ and
>>>>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
>>>>> wr_id's, and the receive CQ handler now handles only struct
>>>>> rpcrdma_rep wr_id's.
>>>> Hey Chuck,
>>>>
>>>> So 2 suggestions related (although not directly) to this one.
>>>>
>>>> 1. I recommend suppressing Fastreg completions - no one cares that
>>>> they succeeded.
>>>>
>>> Not true.  The nfsrdma client uses frmrs across re-connects for the
>>> same mount and needs to know at any point in time if a frmr is
>>> registered or invalid.  So completions of both fastreg and invalidate
>>> need to be signaled.  See:
>>>
>>> commit 5c635e09cec0feeeb310968e51dad01040244851
>>> Author: Tom Tucker <tom@ogc.us>
>>> Date:   Wed Feb 9 19:45:34 2011 +0000
>>>
>>>      RPCRDMA: Fix FRMR registration/invalidate handling.
>>>
>> Hmm, But if either FASTREG or LINV failed the QP will go to error state
>> and you *will* get the error wc (with a rain of FLUSH errors).
>> AFAICT it is safe to assume that it succeeded as long as you don't get
>> error completions.
> But if an unsignaled FASTREG is posted and silently succeeds, then the next signaled work request fails, I believe the FASTREG will be completed with FLUSH status, yet the operation actually completed in the hw.

Actually, if any WR completed successfully and SW got it back as a FLUSH
error, that seems like a bug to me.
Once the HW has processed the WQ entry it should update the consumer
index accordingly, so this should not happen.

>    So the driver would mark the frmr as INVALID, and a subsequent FASTREG for this frmr would fail because the frmr is in the VALID state.
>
>> Moreover, FASTREG on top of FASTREG are not allowed indeed, but AFAIK
>> LINV on top of LINV are allowed.
>> It is OK to just always do LINV+FASTREG post-list each registration and
>> this way no need to account for successful completions.
> Perhaps always posting a LINV+FASTREG would do the trick.
>
> Regardless, I recommend we don't muddle this particular patch which fixes a bug by using separate SQ and RQ CQs with tweaking how frmr registration is managed.  IE this should be a separate patch for review/testing/etc.

Agree, as I said it wasn't directly related to this patch.

Sagi.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 7/8] xprtrdma: Split the completion queue
  2014-04-16 14:35                       ` Sagi Grimberg
@ 2014-04-16 14:43                           ` Steve Wise
  -1 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-16 14:43 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Chuck Lever',
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

> >> Hmm, But if either FASTREG or LINV failed the QP will go to error state
> >> and you *will* get the error wc (with a rain of FLUSH errors).
> >> AFAICT it is safe to assume that it succeeded as long as you don't get
> >> error completions.
> > But if an unsignaled FASTREG is posted and silently succeeds, then the next signaled work
> request fails, I believe the FASTREG will be completed with FLUSH status, yet the operation
> actually completed in the hw.
> 
> Actually if (any) WR successfully completed and SW got it as FLUSH error
> it seems like a bug to me.
> Once the HW processed the WQ entry it should update the consumer index
> accordingly thus should not happen.

Aren't you assuming a specific hardware design/implementation?  For cxgb4, the fact that a work request was consumed by the HW from the host send queue in no way indicates it is complete.  Also, the RDMA specs specifically state that the rnic/hca implementation can only assume an unsignaled work request completes successfully (and make its slot in the SQ available for the ULP) when a subsequent signaled work request completes successfully.   So if the next signaled work request fails, I believe the completion status of prior unsignaled work requests is indeterminate. 


> 
> >    So the driver would mark the frmr as INVALID, and a subsequent FASTREG for this frmr
> would fail because the frmr is in the VALID state.
> >
> >> Moreover, FASTREG on top of FASTREG are not allowed indeed, but AFAIK
> >> LINV on top of LINV are allowed.
> >> It is OK to just always do LINV+FASTREG post-list each registration and
> >> this way no need to account for successful completions.
> > Perhaps always posting a LINV+FASTREG would do the trick.
> >
> > Regardless, I recommend we don't muddle this particular patch which fixes a bug by using
> separate SQ and RQ CQs with tweaking how frmr registration is managed.  IE this should be a
> separate patch for review/testing/etc.
> 
> Agree, as I said it wasn't directly related to this patch.
>

Cheers!

Steve.
 

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 14:43                           ` Steve Wise
  0 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-16 14:43 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Chuck Lever', linux-nfs, linux-rdma

> >> Hmm, But if either FASTREG or LINV failed the QP will go to error state
> >> and you *will* get the error wc (with a rain of FLUSH errors).
> >> AFAICT it is safe to assume that it succeeded as long as you don't get
> >> error completions.
> > But if an unsignaled FASTREG is posted and silently succeeds, then the next signaled work
> request fails, I believe the FASTREG will be completed with FLUSH status, yet the operation
> actually completed in the hw.
> 
> Actually if (any) WR successfully completed and SW got it as FLUSH error
> it seems like a bug to me.
> Once the HW processed the WQ entry it should update the consumer index
> accordingly thus should not happen.

Aren't you assuming a specific hardware design/implementation?  For cxgb4, the fact that a work request was consumed by the HW from the host send queue in no way indicates it is complete.  Also, the RDMA specs specifically state that the rnic/hca implementation can only assume an unsignaled work request completes successfully (and make its slot in the SQ available for the ULP) when a subsequent signaled work request completes successfully.   So if the next signaled work request fails, I believe the completion status of prior unsignaled work requests is indeterminate. 


> 
> >    So the driver would mark the frmr as INVALID, and a subsequent FASTREG for this frmr
> would fail because the frmr is in the VALID state.
> >
> >> Moreover, FASTREG on top of FASTREG are not allowed indeed, but AFAIK
> >> LINV on top of LINV are allowed.
> >> It is OK to just always do LINV+FASTREG post-list each registration and
> >> this way no need to account for successful completions.
> > Perhaps always posting a LINV+FASTREG would do the trick.
> >
> > Regardless, I recommend we don't muddle this particular patch which fixes a bug by using
> separate SQ and RQ CQs with tweaking how frmr registration is managed.  IE this should be a
> separate patch for review/testing/etc.
> 
> Agree, as I said it wasn't directly related to this patch.
>

Cheers!

Steve.
 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
  2014-04-16 13:30             ` Steve Wise
@ 2014-04-16 15:08                 ` Chuck Lever
  -1 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-16 15:08 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Steve Wise, Linux NFS Mailing List, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Sagi-

Thanks for the review! In-line replies below.


On Apr 16, 2014, at 9:30 AM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:

> On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
>> On 4/15/2014 1:23 AM, Chuck Lever wrote:
>>> The current CQ handler uses the ib_wc.opcode field to distinguish
>>> between event types. However, the contents of that field are not
>>> reliable if the completion status is not IB_WC_SUCCESS.
>>> 
>>> When an error completion occurs on a send event, the CQ handler
>>> schedules a tasklet with something that is not a struct rpcrdma_rep.
>>> This is never correct behavior, and sometimes it results in a panic.
>>> 
>>> To resolve this issue, split the completion queue into a send CQ and
>>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
>>> wr_id's, and the receive CQ handler now handles only struct
>>> rpcrdma_rep wr_id's.
>> 
>> Hey Chuck,
>> 
>> So 2 suggestions related (although not directly) to this one.
>> 
>> 1. I recommend suppressing Fastreg completions - no one cares that they succeeded.
>> 
> 
> Not true.  The nfsrdma client uses frmrs across re-connects for the same mount and needs to know at any point in time if a frmr is registered or invalid.  So completions of both fastreg and invalidate need to be signaled.  See:
> 
> commit 5c635e09cec0feeeb310968e51dad01040244851
> Author: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
> Date:   Wed Feb 9 19:45:34 2011 +0000
> 
>    RPCRDMA: Fix FRMR registration/invalidate handling.

Steve is correct. For the time being, fast_reg_mr completions have to stay,
as they work around a real issue.

However, the long term goal is to suppress them, as you suggested a few
weeks ago. When we can identify and address the underlying FRMR leak that
Tom has cleverly worked around in 5c635e09, then fast_reg_mr can be done
without completions.

This is possibly related to the issue we are discussing in the other
thread “[PATCH V1] NFS-REDMA: fix qp pointer validation checks”.

> 
>> 2. If I understood correctly, I see that the CQs are loop-polled in an interrupt context.
>>    This may become problematic in stress workload where the CQ simply never drains (imagine
>>    some studios task keeps posting WRs as soon as IOs complete). This will cause CQ poll loop
>>    to go on forever. This situation may cause starvation of other CQs that share the same EQ (CPU
>>    is hogged by the endless poller).
>>    This may be solved in 2 ways:
>>    - Use blk-iopoll to maintain fairness - the downside will be moving from interrupt context to soft-irq (slower).
>>    - Set some configurable budget to CQ poller - after finishing the budget, notify the CQ and bail.
>>      If there is another CQ waiting to get in - it's only fair that it will be given with a chance, else another interrupt
>>      on that CQ will immediately come.

I think it would be a reasonable change to pass an array of WC’s to
ib_poll_cq() just once in rpcrdma_{send,recv}cq_poll(), instead of
looping. Would that be enough?
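
Something along these lines, perhaps (just a sketch; the array size is an
arbitrary assumption, and the array could live in the rpcrdma_ep instead
of on the stack):

    static int
    rpcrdma_sendcq_poll(struct ib_cq *cq)
    {
        struct ib_wc wcs[8];    /* poll budget, illustrative value */
        int i, rc;

        /* One ib_poll_cq() call reaps up to the whole budget */
        rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), wcs);
        if (rc < 0)
            return rc;

        for (i = 0; i < rc; i++)
            rpcrdma_send_event_process(&wcs[i]);
        return 0;
    }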

To be clear, my patch merely cleans up the completion handler logic,
which already did loop-polling. Should we consider this improvement
for another patch?


>> 
>>    I noticed that one with iSER which polls from tasklet context (under heavy workload).
>> 
>> Sagi.
>> 
>>> Fix suggested by Shirley Ma <shirley.ma-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>>> 
>>> Reported-by: Rafael Reiter <rafael.reiter-cv18SyjCLaheoWH0uzbU5w@public.gmane.org>
>>> Fixes: 5c635e09cec0feeeb310968e51dad01040244851
>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211
>>> Tested-by: Klemens Senn <klemens.senn-cv18SyjCLaheoWH0uzbU5w@public.gmane.org>
>>> Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>>> ---
>>> 
>>>  net/sunrpc/xprtrdma/verbs.c     |  228 +++++++++++++++++++++++----------------
>>>  net/sunrpc/xprtrdma/xprt_rdma.h |    1
>>>  2 files changed, 135 insertions(+), 94 deletions(-)
>>> 
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 0f8c22c..35f5ab6 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -142,96 +142,113 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
>>>      }
>>>  }
>>>  -static inline
>>> -void rpcrdma_event_process(struct ib_wc *wc)
>>> +static void
>>> +rpcrdma_send_event_process(struct ib_wc *wc)
>>>  {
>>> -    struct rpcrdma_mw *frmr;
>>> -    struct rpcrdma_rep *rep =
>>> -            (struct rpcrdma_rep *)(unsigned long) wc->wr_id;
>>> +    struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>>>  -    dprintk("RPC:       %s: event rep %p status %X opcode %X length %u\n",
>>> -        __func__, rep, wc->status, wc->opcode, wc->byte_len);
>>> +    dprintk("RPC:       %s: frmr %p status %X opcode %d\n",
>>> +        __func__, frmr, wc->status, wc->opcode);
>>>  -    if (!rep) /* send completion that we don't care about */
>>> +    if (wc->wr_id == 0ULL)
>>>          return;
>>> -
>>> -    if (IB_WC_SUCCESS != wc->status) {
>>> -        dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
>>> -            __func__, wc->opcode, wc->status);
>>> -        rep->rr_len = ~0U;
>>> -        if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
>>> -            rpcrdma_schedule_tasklet(rep);
>>> +    if (wc->status != IB_WC_SUCCESS)
>>>          return;
>>> -    }
>>>  -    switch (wc->opcode) {
>>> -    case IB_WC_FAST_REG_MR:
>>> -        frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>>> +    if (wc->opcode == IB_WC_FAST_REG_MR)
>>>          frmr->r.frmr.state = FRMR_IS_VALID;
>>> -        break;
>>> -    case IB_WC_LOCAL_INV:
>>> -        frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>>> +    else if (wc->opcode == IB_WC_LOCAL_INV)
>>>          frmr->r.frmr.state = FRMR_IS_INVALID;
>>> -        break;
>>> -    case IB_WC_RECV:
>>> -        rep->rr_len = wc->byte_len;
>>> -        ib_dma_sync_single_for_cpu(
>>> - rdmab_to_ia(rep->rr_buffer)->ri_id->device,
>>> -            rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
>>> -        /* Keep (only) the most recent credits, after check validity */
>>> -        if (rep->rr_len >= 16) {
>>> -            struct rpcrdma_msg *p =
>>> -                    (struct rpcrdma_msg *) rep->rr_base;
>>> -            unsigned int credits = ntohl(p->rm_credit);
>>> -            if (credits == 0) {
>>> -                dprintk("RPC:       %s: server"
>>> -                    " dropped credits to 0!\n", __func__);
>>> -                /* don't deadlock */
>>> -                credits = 1;
>>> -            } else if (credits > rep->rr_buffer->rb_max_requests) {
>>> -                dprintk("RPC:       %s: server"
>>> -                    " over-crediting: %d (%d)\n",
>>> -                    __func__, credits,
>>> -                    rep->rr_buffer->rb_max_requests);
>>> -                credits = rep->rr_buffer->rb_max_requests;
>>> -            }
>>> -            atomic_set(&rep->rr_buffer->rb_credits, credits);
>>> -        }
>>> -        rpcrdma_schedule_tasklet(rep);
>>> -        break;
>>> -    default:
>>> -        dprintk("RPC:       %s: unexpected WC event %X\n",
>>> -            __func__, wc->opcode);
>>> -        break;
>>> -    }
>>>  }
>>>  -static inline int
>>> -rpcrdma_cq_poll(struct ib_cq *cq)
>>> +static int
>>> +rpcrdma_sendcq_poll(struct ib_cq *cq)
>>>  {
>>>      struct ib_wc wc;
>>>      int rc;
>>>  -    for (;;) {
>>> -        rc = ib_poll_cq(cq, 1, &wc);
>>> -        if (rc < 0) {
>>> -            dprintk("RPC:       %s: ib_poll_cq failed %i\n",
>>> -                __func__, rc);
>>> -            return rc;
>>> -        }
>>> -        if (rc == 0)
>>> -            break;
>>> +    while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
>>> +        rpcrdma_send_event_process(&wc);
>>> +    return rc;
>>> +}
>>>  -        rpcrdma_event_process(&wc);
>>> +/*
>>> + * Handle send, fast_reg_mr, and local_inv completions.
>>> + *
>>> + * Send events are typically suppressed and thus do not result
>>> + * in an upcall. Occasionally one is signaled, however. This
>>> + * prevents the provider's completion queue from wrapping and
>>> + * losing a completion.
>>> + */
>>> +static void
>>> +rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>> +{
>>> +    int rc;
>>> +
>>> +    rc = rpcrdma_sendcq_poll(cq);
>>> +    if (rc) {
>>> +        dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>>> +            __func__, rc);
>>> +        return;
>>>      }
>>> +    rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>>> +    if (rc) {
>>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>> +            __func__, rc);
>>> +        return;
>>> +    }
>>> +    rpcrdma_sendcq_poll(cq);
>>> +}
>>>  -    return 0;
>>> +static void
>>> +rpcrdma_recv_event_process(struct ib_wc *wc)
>>> +{
>>> +    struct rpcrdma_rep *rep =
>>> +            (struct rpcrdma_rep *)(unsigned long)wc->wr_id;
>>> +
>>> +    dprintk("RPC:       %s: rep %p status %X opcode %X length %u\n",
>>> +        __func__, rep, wc->status, wc->opcode, wc->byte_len);
>>> +
>>> +    if (wc->status != IB_WC_SUCCESS) {
>>> +        rep->rr_len = ~0U;
>>> +        goto out_schedule;
>>> +    }
>>> +    if (wc->opcode != IB_WC_RECV)
>>> +        return;
>>> +
>>> +    rep->rr_len = wc->byte_len;
>>> + ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
>>> +            rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
>>> +
>>> +    if (rep->rr_len >= 16) {
>>> +        struct rpcrdma_msg *p = (struct rpcrdma_msg *)rep->rr_base;
>>> +        unsigned int credits = ntohl(p->rm_credit);
>>> +
>>> +        if (credits == 0)
>>> +            credits = 1;    /* don't deadlock */
>>> +        else if (credits > rep->rr_buffer->rb_max_requests)
>>> +            credits = rep->rr_buffer->rb_max_requests;
>>> +        atomic_set(&rep->rr_buffer->rb_credits, credits);
>>> +    }
>>> +
>>> +out_schedule:
>>> +    rpcrdma_schedule_tasklet(rep);
>>> +}
>>> +
>>> +static int
>>> +rpcrdma_recvcq_poll(struct ib_cq *cq)
>>> +{
>>> +    struct ib_wc wc;
>>> +    int rc;
>>> +
>>> +    while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
>>> +        rpcrdma_recv_event_process(&wc);
>>> +    return rc;
>>>  }
>>>    /*
>>> - * rpcrdma_cq_event_upcall
>>> + * Handle receive completions.
>>>   *
>>> - * This upcall handles recv and send events.
>>>   * It is reentrant but processes single events in order to maintain
>>>   * ordering of receives to keep server credits.
>>>   *
>>> @@ -240,26 +257,25 @@ rpcrdma_cq_poll(struct ib_cq *cq)
>>>   * connection shutdown. That is, the structures required for
>>>   * the completion of the reply handler must remain intact until
>>>   * all memory has been reclaimed.
>>> - *
>>> - * Note that send events are suppressed and do not result in an upcall.
>>>   */
>>>  static void
>>> -rpcrdma_cq_event_upcall(struct ib_cq *cq, void *context)
>>> +rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>>  {
>>>      int rc;
>>>  -    rc = rpcrdma_cq_poll(cq);
>>> -    if (rc)
>>> +    rc = rpcrdma_recvcq_poll(cq);
>>> +    if (rc) {
>>> +        dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>>> +            __func__, rc);
>>>          return;
>>> -
>>> +    }
>>>      rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>>>      if (rc) {
>>> -        dprintk("RPC:       %s: ib_req_notify_cq failed %i\n",
>>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>>              __func__, rc);
>>>          return;
>>>      }
>>> -
>>> -    rpcrdma_cq_poll(cq);
>>> +    rpcrdma_recvcq_poll(cq);
>>>  }
>>>    #ifdef RPC_DEBUG
>>> @@ -613,6 +629,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>>>                  struct rpcrdma_create_data_internal *cdata)
>>>  {
>>>      struct ib_device_attr devattr;
>>> +    struct ib_cq *sendcq, *recvcq;
>>>      int rc, err;
>>>        rc = ib_query_device(ia->ri_id->device, &devattr);
>>> @@ -688,7 +705,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>>>          ep->rep_attr.cap.max_recv_sge);
>>>        /* set trigger for requesting send completion */
>>> -    ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
>>> +    ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
>>>      if (ep->rep_cqinit <= 2)
>>>          ep->rep_cqinit = 0;
>>>      INIT_CQCOUNT(ep);
>>> @@ -696,26 +713,43 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>>>      init_waitqueue_head(&ep->rep_connect_wait);
>>>      INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
>>>  -    ep->rep_cq = ib_create_cq(ia->ri_id->device, rpcrdma_cq_event_upcall,
>>> +    sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
>>>                    rpcrdma_cq_async_error_upcall, NULL,
>>> -                  ep->rep_attr.cap.max_recv_wr +
>>>                    ep->rep_attr.cap.max_send_wr + 1, 0);
>>> -    if (IS_ERR(ep->rep_cq)) {
>>> -        rc = PTR_ERR(ep->rep_cq);
>>> -        dprintk("RPC:       %s: ib_create_cq failed: %i\n",
>>> +    if (IS_ERR(sendcq)) {
>>> +        rc = PTR_ERR(sendcq);
>>> +        dprintk("RPC:       %s: failed to create send CQ: %i\n",
>>>              __func__, rc);
>>>          goto out1;
>>>      }
>>>  -    rc = ib_req_notify_cq(ep->rep_cq, IB_CQ_NEXT_COMP);
>>> +    rc = ib_req_notify_cq(sendcq, IB_CQ_NEXT_COMP);
>>>      if (rc) {
>>>          dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>>              __func__, rc);
>>>          goto out2;
>>>      }
>>>  -    ep->rep_attr.send_cq = ep->rep_cq;
>>> -    ep->rep_attr.recv_cq = ep->rep_cq;
>>> +    recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
>>> +                  rpcrdma_cq_async_error_upcall, NULL,
>>> +                  ep->rep_attr.cap.max_recv_wr + 1, 0);
>>> +    if (IS_ERR(recvcq)) {
>>> +        rc = PTR_ERR(recvcq);
>>> +        dprintk("RPC:       %s: failed to create recv CQ: %i\n",
>>> +            __func__, rc);
>>> +        goto out2;
>>> +    }
>>> +
>>> +    rc = ib_req_notify_cq(recvcq, IB_CQ_NEXT_COMP);
>>> +    if (rc) {
>>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>> +            __func__, rc);
>>> +        ib_destroy_cq(recvcq);
>>> +        goto out2;
>>> +    }
>>> +
>>> +    ep->rep_attr.send_cq = sendcq;
>>> +    ep->rep_attr.recv_cq = recvcq;
>>>        /* Initialize cma parameters */
>>>  @@ -737,7 +771,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>>>      return 0;
>>>    out2:
>>> -    err = ib_destroy_cq(ep->rep_cq);
>>> +    err = ib_destroy_cq(sendcq);
>>>      if (err)
>>>          dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>>>              __func__, err);
>>> @@ -777,8 +811,14 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>>>          ep->rep_pad_mr = NULL;
>>>      }
>>>  -    rpcrdma_clean_cq(ep->rep_cq);
>>> -    rc = ib_destroy_cq(ep->rep_cq);
>>> +    rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>>> +    rc = ib_destroy_cq(ep->rep_attr.recv_cq);
>>> +    if (rc)
>>> +        dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>>> +            __func__, rc);
>>> +
>>> +    rpcrdma_clean_cq(ep->rep_attr.send_cq);
>>> +    rc = ib_destroy_cq(ep->rep_attr.send_cq);
>>>      if (rc)
>>>          dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>>>              __func__, rc);
>>> @@ -801,7 +841,9 @@ retry:
>>>          if (rc && rc != -ENOTCONN)
>>>              dprintk("RPC:       %s: rpcrdma_ep_disconnect"
>>>                  " status %i\n", __func__, rc);
>>> -        rpcrdma_clean_cq(ep->rep_cq);
>>> +
>>> +        rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>>> +        rpcrdma_clean_cq(ep->rep_attr.send_cq);
>>>            xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
>>>          id = rpcrdma_create_id(xprt, ia,
>>> @@ -910,7 +952,8 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>>>  {
>>>      int rc;
>>>  -    rpcrdma_clean_cq(ep->rep_cq);
>>> +    rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>>> +    rpcrdma_clean_cq(ep->rep_attr.send_cq);
>>>      rc = rdma_disconnect(ia->ri_id);
>>>      if (!rc) {
>>>          /* returns without wait if not connected */
>>> @@ -1793,7 +1836,6 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
>>>      ib_dma_sync_single_for_cpu(ia->ri_id->device,
>>>          rep->rr_iov.addr, rep->rr_iov.length, DMA_BIDIRECTIONAL);
>>>  -    DECR_CQCOUNT(ep);
>>>      rc = ib_post_recv(ia->ri_id->qp, &recv_wr, &recv_wr_fail);
>>>        if (rc)
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index 362a19d..334ab6e 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -79,7 +79,6 @@ struct rpcrdma_ep {
>>>      int            rep_cqinit;
>>>      int            rep_connected;
>>>      struct rpcrdma_ia    *rep_ia;
>>> -    struct ib_cq        *rep_cq;
>>>      struct ib_qp_init_attr    rep_attr;
>>>      wait_queue_head_t     rep_connect_wait;
>>>      struct ib_sge        rep_pad;    /* holds zeroed pad */
>>> 
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> 
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 15:08                 ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-16 15:08 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Steve Wise, Linux NFS Mailing List, linux-rdma

Hi Sagi-

Thanks for the review! In-line replies below.


On Apr 16, 2014, at 9:30 AM, Steve Wise <swise@opengridcomputing.com> wrote:

> On 4/16/2014 7:48 AM, Sagi Grimberg wrote:
>> On 4/15/2014 1:23 AM, Chuck Lever wrote:
>>> The current CQ handler uses the ib_wc.opcode field to distinguish
>>> between event types. However, the contents of that field are not
>>> reliable if the completion status is not IB_WC_SUCCESS.
>>> 
>>> When an error completion occurs on a send event, the CQ handler
>>> schedules a tasklet with something that is not a struct rpcrdma_rep.
>>> This is never correct behavior, and sometimes it results in a panic.
>>> 
>>> To resolve this issue, split the completion queue into a send CQ and
>>> a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
>>> wr_id's, and the receive CQ handler now handles only struct
>>> rpcrdma_rep wr_id's.
>> 
>> Hey Chuck,
>> 
>> So 2 suggestions related (although not directly) to this one.
>> 
>> 1. I recommend suppressing Fastreg completions - no one cares that they succeeded.
>> 
> 
> Not true.  The nfsrdma client uses frmrs across re-connects for the same mount and needs to know at any point in time if a frmr is registered or invalid.  So completions of both fastreg and invalidate need to be signaled.  See:
> 
> commit 5c635e09cec0feeeb310968e51dad01040244851
> Author: Tom Tucker <tom@ogc.us>
> Date:   Wed Feb 9 19:45:34 2011 +0000
> 
>    RPCRDMA: Fix FRMR registration/invalidate handling.

Steve is correct. For the time being, fast_reg_mr completions have to stay,
as they work around a real issue.

However, the long term goal is to suppress them, as you suggested a few
weeks ago. When we can identify and address the underlying FRMR leak that
Tom has cleverly worked around in 5c635e09, then fast_reg_mr can be done
without completions.

This is possibly related to the issue we are discussing in the other
thread “[PATCH V1] NFS-REDMA: fix qp pointer validation checks”.

> 
>> 2. If I understood correctly, I see that the CQs are loop-polled in an interrupt context.
>>    This may become problematic in stress workload where the CQ simply never drains (imagine
>>    some studios task keeps posting WRs as soon as IOs complete). This will cause CQ poll loop
>>    to go on forever. This situation may cause starvation of other CQs that share the same EQ (CPU
>>    is hogged by the endless poller).
>>    This may be solved in 2 ways:
>>    - Use blk-iopoll to maintain fairness - the downside will be moving from interrupt context to soft-irq (slower).
>>    - Set some configurable budget to CQ poller - after finishing the budget, notify the CQ and bail.
>>      If there is another CQ waiting to get in - it's only fair that it will be given with a chance, else another interrupt
>>      on that CQ will immediately come.

I think it would be a reasonable change to pass an array of WC’s to
ib_poll_cq() just once in rpcrdma_{send,recv}cq_poll(), instead of
looping. Would that be enough?

To be clear, my patch merely cleans up the completion handler logic,
which already did loop-polling. Should we consider this improvement
for another patch?


>> 
>>    I noticed that one with iSER which polls from tasklet context (under heavy workload).
>> 
>> Sagi.
>> 
>>> Fix suggested by Shirley Ma <shirley.ma@oracle.com>
>>> 
>>> Reported-by: Rafael Reiter <rafael.reiter@ims.co.at>
>>> Fixes: 5c635e09cec0feeeb310968e51dad01040244851
>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211
>>> Tested-by: Klemens Senn <klemens.senn@ims.co.at>
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>> 
>>>  net/sunrpc/xprtrdma/verbs.c     |  228 +++++++++++++++++++++++----------------
>>>  net/sunrpc/xprtrdma/xprt_rdma.h |    1
>>>  2 files changed, 135 insertions(+), 94 deletions(-)
>>> 
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 0f8c22c..35f5ab6 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -142,96 +142,113 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
>>>      }
>>>  }
>>>  -static inline
>>> -void rpcrdma_event_process(struct ib_wc *wc)
>>> +static void
>>> +rpcrdma_send_event_process(struct ib_wc *wc)
>>>  {
>>> -    struct rpcrdma_mw *frmr;
>>> -    struct rpcrdma_rep *rep =
>>> -            (struct rpcrdma_rep *)(unsigned long) wc->wr_id;
>>> +    struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>>>  -    dprintk("RPC:       %s: event rep %p status %X opcode %X length %u\n",
>>> -        __func__, rep, wc->status, wc->opcode, wc->byte_len);
>>> +    dprintk("RPC:       %s: frmr %p status %X opcode %d\n",
>>> +        __func__, frmr, wc->status, wc->opcode);
>>>  -    if (!rep) /* send completion that we don't care about */
>>> +    if (wc->wr_id == 0ULL)
>>>          return;
>>> -
>>> -    if (IB_WC_SUCCESS != wc->status) {
>>> -        dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
>>> -            __func__, wc->opcode, wc->status);
>>> -        rep->rr_len = ~0U;
>>> -        if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
>>> -            rpcrdma_schedule_tasklet(rep);
>>> +    if (wc->status != IB_WC_SUCCESS)
>>>          return;
>>> -    }
>>>  -    switch (wc->opcode) {
>>> -    case IB_WC_FAST_REG_MR:
>>> -        frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>>> +    if (wc->opcode == IB_WC_FAST_REG_MR)
>>>          frmr->r.frmr.state = FRMR_IS_VALID;
>>> -        break;
>>> -    case IB_WC_LOCAL_INV:
>>> -        frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>>> +    else if (wc->opcode == IB_WC_LOCAL_INV)
>>>          frmr->r.frmr.state = FRMR_IS_INVALID;
>>> -        break;
>>> -    case IB_WC_RECV:
>>> -        rep->rr_len = wc->byte_len;
>>> -        ib_dma_sync_single_for_cpu(
>>> - rdmab_to_ia(rep->rr_buffer)->ri_id->device,
>>> -            rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
>>> -        /* Keep (only) the most recent credits, after check validity */
>>> -        if (rep->rr_len >= 16) {
>>> -            struct rpcrdma_msg *p =
>>> -                    (struct rpcrdma_msg *) rep->rr_base;
>>> -            unsigned int credits = ntohl(p->rm_credit);
>>> -            if (credits == 0) {
>>> -                dprintk("RPC:       %s: server"
>>> -                    " dropped credits to 0!\n", __func__);
>>> -                /* don't deadlock */
>>> -                credits = 1;
>>> -            } else if (credits > rep->rr_buffer->rb_max_requests) {
>>> -                dprintk("RPC:       %s: server"
>>> -                    " over-crediting: %d (%d)\n",
>>> -                    __func__, credits,
>>> -                    rep->rr_buffer->rb_max_requests);
>>> -                credits = rep->rr_buffer->rb_max_requests;
>>> -            }
>>> -            atomic_set(&rep->rr_buffer->rb_credits, credits);
>>> -        }
>>> -        rpcrdma_schedule_tasklet(rep);
>>> -        break;
>>> -    default:
>>> -        dprintk("RPC:       %s: unexpected WC event %X\n",
>>> -            __func__, wc->opcode);
>>> -        break;
>>> -    }
>>>  }
>>>  -static inline int
>>> -rpcrdma_cq_poll(struct ib_cq *cq)
>>> +static int
>>> +rpcrdma_sendcq_poll(struct ib_cq *cq)
>>>  {
>>>      struct ib_wc wc;
>>>      int rc;
>>>  -    for (;;) {
>>> -        rc = ib_poll_cq(cq, 1, &wc);
>>> -        if (rc < 0) {
>>> -            dprintk("RPC:       %s: ib_poll_cq failed %i\n",
>>> -                __func__, rc);
>>> -            return rc;
>>> -        }
>>> -        if (rc == 0)
>>> -            break;
>>> +    while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
>>> +        rpcrdma_send_event_process(&wc);
>>> +    return rc;
>>> +}
>>>  -        rpcrdma_event_process(&wc);
>>> +/*
>>> + * Handle send, fast_reg_mr, and local_inv completions.
>>> + *
>>> + * Send events are typically suppressed and thus do not result
>>> + * in an upcall. Occasionally one is signaled, however. This
>>> + * prevents the provider's completion queue from wrapping and
>>> + * losing a completion.
>>> + */
>>> +static void
>>> +rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
>>> +{
>>> +    int rc;
>>> +
>>> +    rc = rpcrdma_sendcq_poll(cq);
>>> +    if (rc) {
>>> +        dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>>> +            __func__, rc);
>>> +        return;
>>>      }
>>> +    rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>>> +    if (rc) {
>>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>> +            __func__, rc);
>>> +        return;
>>> +    }
>>> +    rpcrdma_sendcq_poll(cq);
>>> +}
>>>  -    return 0;
>>> +static void
>>> +rpcrdma_recv_event_process(struct ib_wc *wc)
>>> +{
>>> +    struct rpcrdma_rep *rep =
>>> +            (struct rpcrdma_rep *)(unsigned long)wc->wr_id;
>>> +
>>> +    dprintk("RPC:       %s: rep %p status %X opcode %X length %u\n",
>>> +        __func__, rep, wc->status, wc->opcode, wc->byte_len);
>>> +
>>> +    if (wc->status != IB_WC_SUCCESS) {
>>> +        rep->rr_len = ~0U;
>>> +        goto out_schedule;
>>> +    }
>>> +    if (wc->opcode != IB_WC_RECV)
>>> +        return;
>>> +
>>> +    rep->rr_len = wc->byte_len;
>>> + ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
>>> +            rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
>>> +
>>> +    if (rep->rr_len >= 16) {
>>> +        struct rpcrdma_msg *p = (struct rpcrdma_msg *)rep->rr_base;
>>> +        unsigned int credits = ntohl(p->rm_credit);
>>> +
>>> +        if (credits == 0)
>>> +            credits = 1;    /* don't deadlock */
>>> +        else if (credits > rep->rr_buffer->rb_max_requests)
>>> +            credits = rep->rr_buffer->rb_max_requests;
>>> +        atomic_set(&rep->rr_buffer->rb_credits, credits);
>>> +    }
>>> +
>>> +out_schedule:
>>> +    rpcrdma_schedule_tasklet(rep);
>>> +}
>>> +
>>> +static int
>>> +rpcrdma_recvcq_poll(struct ib_cq *cq)
>>> +{
>>> +    struct ib_wc wc;
>>> +    int rc;
>>> +
>>> +    while ((rc = ib_poll_cq(cq, 1, &wc)) == 1)
>>> +        rpcrdma_recv_event_process(&wc);
>>> +    return rc;
>>>  }
>>>    /*
>>> - * rpcrdma_cq_event_upcall
>>> + * Handle receive completions.
>>>   *
>>> - * This upcall handles recv and send events.
>>>   * It is reentrant but processes single events in order to maintain
>>>   * ordering of receives to keep server credits.
>>>   *
>>> @@ -240,26 +257,25 @@ rpcrdma_cq_poll(struct ib_cq *cq)
>>>   * connection shutdown. That is, the structures required for
>>>   * the completion of the reply handler must remain intact until
>>>   * all memory has been reclaimed.
>>> - *
>>> - * Note that send events are suppressed and do not result in an upcall.
>>>   */
>>>  static void
>>> -rpcrdma_cq_event_upcall(struct ib_cq *cq, void *context)
>>> +rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
>>>  {
>>>      int rc;
>>>  -    rc = rpcrdma_cq_poll(cq);
>>> -    if (rc)
>>> +    rc = rpcrdma_recvcq_poll(cq);
>>> +    if (rc) {
>>> +        dprintk("RPC:       %s: ib_poll_cq failed: %i\n",
>>> +            __func__, rc);
>>>          return;
>>> -
>>> +    }
>>>      rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>>>      if (rc) {
>>> -        dprintk("RPC:       %s: ib_req_notify_cq failed %i\n",
>>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>>              __func__, rc);
>>>          return;
>>>      }
>>> -
>>> -    rpcrdma_cq_poll(cq);
>>> +    rpcrdma_recvcq_poll(cq);
>>>  }
>>>    #ifdef RPC_DEBUG
>>> @@ -613,6 +629,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>>>                  struct rpcrdma_create_data_internal *cdata)
>>>  {
>>>      struct ib_device_attr devattr;
>>> +    struct ib_cq *sendcq, *recvcq;
>>>      int rc, err;
>>>        rc = ib_query_device(ia->ri_id->device, &devattr);
>>> @@ -688,7 +705,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>>>          ep->rep_attr.cap.max_recv_sge);
>>>        /* set trigger for requesting send completion */
>>> -    ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
>>> +    ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
>>>      if (ep->rep_cqinit <= 2)
>>>          ep->rep_cqinit = 0;
>>>      INIT_CQCOUNT(ep);
>>> @@ -696,26 +713,43 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>>>      init_waitqueue_head(&ep->rep_connect_wait);
>>>      INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
>>>  -    ep->rep_cq = ib_create_cq(ia->ri_id->device, rpcrdma_cq_event_upcall,
>>> +    sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
>>>                    rpcrdma_cq_async_error_upcall, NULL,
>>> -                  ep->rep_attr.cap.max_recv_wr +
>>>                    ep->rep_attr.cap.max_send_wr + 1, 0);
>>> -    if (IS_ERR(ep->rep_cq)) {
>>> -        rc = PTR_ERR(ep->rep_cq);
>>> -        dprintk("RPC:       %s: ib_create_cq failed: %i\n",
>>> +    if (IS_ERR(sendcq)) {
>>> +        rc = PTR_ERR(sendcq);
>>> +        dprintk("RPC:       %s: failed to create send CQ: %i\n",
>>>              __func__, rc);
>>>          goto out1;
>>>      }
>>>  -    rc = ib_req_notify_cq(ep->rep_cq, IB_CQ_NEXT_COMP);
>>> +    rc = ib_req_notify_cq(sendcq, IB_CQ_NEXT_COMP);
>>>      if (rc) {
>>>          dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>>              __func__, rc);
>>>          goto out2;
>>>      }
>>>  -    ep->rep_attr.send_cq = ep->rep_cq;
>>> -    ep->rep_attr.recv_cq = ep->rep_cq;
>>> +    recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
>>> +                  rpcrdma_cq_async_error_upcall, NULL,
>>> +                  ep->rep_attr.cap.max_recv_wr + 1, 0);
>>> +    if (IS_ERR(recvcq)) {
>>> +        rc = PTR_ERR(recvcq);
>>> +        dprintk("RPC:       %s: failed to create recv CQ: %i\n",
>>> +            __func__, rc);
>>> +        goto out2;
>>> +    }
>>> +
>>> +    rc = ib_req_notify_cq(recvcq, IB_CQ_NEXT_COMP);
>>> +    if (rc) {
>>> +        dprintk("RPC:       %s: ib_req_notify_cq failed: %i\n",
>>> +            __func__, rc);
>>> +        ib_destroy_cq(recvcq);
>>> +        goto out2;
>>> +    }
>>> +
>>> +    ep->rep_attr.send_cq = sendcq;
>>> +    ep->rep_attr.recv_cq = recvcq;
>>>        /* Initialize cma parameters */
>>>  @@ -737,7 +771,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>>>      return 0;
>>>    out2:
>>> -    err = ib_destroy_cq(ep->rep_cq);
>>> +    err = ib_destroy_cq(sendcq);
>>>      if (err)
>>>          dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>>>              __func__, err);
>>> @@ -777,8 +811,14 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>>>          ep->rep_pad_mr = NULL;
>>>      }
>>>  -    rpcrdma_clean_cq(ep->rep_cq);
>>> -    rc = ib_destroy_cq(ep->rep_cq);
>>> +    rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>>> +    rc = ib_destroy_cq(ep->rep_attr.recv_cq);
>>> +    if (rc)
>>> +        dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>>> +            __func__, rc);
>>> +
>>> +    rpcrdma_clean_cq(ep->rep_attr.send_cq);
>>> +    rc = ib_destroy_cq(ep->rep_attr.send_cq);
>>>      if (rc)
>>>          dprintk("RPC:       %s: ib_destroy_cq returned %i\n",
>>>              __func__, rc);
>>> @@ -801,7 +841,9 @@ retry:
>>>          if (rc && rc != -ENOTCONN)
>>>              dprintk("RPC:       %s: rpcrdma_ep_disconnect"
>>>                  " status %i\n", __func__, rc);
>>> -        rpcrdma_clean_cq(ep->rep_cq);
>>> +
>>> +        rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>>> +        rpcrdma_clean_cq(ep->rep_attr.send_cq);
>>>            xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
>>>          id = rpcrdma_create_id(xprt, ia,
>>> @@ -910,7 +952,8 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
>>>  {
>>>      int rc;
>>>  -    rpcrdma_clean_cq(ep->rep_cq);
>>> +    rpcrdma_clean_cq(ep->rep_attr.recv_cq);
>>> +    rpcrdma_clean_cq(ep->rep_attr.send_cq);
>>>      rc = rdma_disconnect(ia->ri_id);
>>>      if (!rc) {
>>>          /* returns without wait if not connected */
>>> @@ -1793,7 +1836,6 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
>>>      ib_dma_sync_single_for_cpu(ia->ri_id->device,
>>>          rep->rr_iov.addr, rep->rr_iov.length, DMA_BIDIRECTIONAL);
>>>  -    DECR_CQCOUNT(ep);
>>>      rc = ib_post_recv(ia->ri_id->qp, &recv_wr, &recv_wr_fail);
>>>        if (rc)
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index 362a19d..334ab6e 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -79,7 +79,6 @@ struct rpcrdma_ep {
>>>      int            rep_cqinit;
>>>      int            rep_connected;
>>>      struct rpcrdma_ia    *rep_ia;
>>> -    struct ib_cq        *rep_cq;
>>>      struct ib_qp_init_attr    rep_attr;
>>>      wait_queue_head_t     rep_connect_wait;
>>>      struct ib_sge        rep_pad;    /* holds zeroed pad */
>>> 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 15:18                             ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-16 15:18 UTC (permalink / raw)
  To: Steve Wise, 'Chuck Lever', linux-nfs, linux-rdma

On 4/16/2014 5:43 PM, Steve Wise wrote:
>>>> Hmm, But if either FASTREG or LINV failed the QP will go to error state
>>>> and you *will* get the error wc (with a rain of FLUSH errors).
>>>> AFAICT it is safe to assume that it succeeded as long as you don't get
>>>> error completions.
>>> But if an unsignaled FASTREG is posted and silently succeeds, then the next signaled work
>> request fails, I believe the FASTREG will be completed with FLUSH status, yet the operation
>> actually completed in the hw.
>>
>> Actually if (any) WR successfully completed and SW got it as FLUSH error
>> it seems like a bug to me.
>> Once the HW processed the WQ entry it should update the consumer index
>> accordingly thus should not happen.
> Aren't you assuming a specific hardware design/implementation?  For cxgb4, the fact that a work request was consumed by the HW from the host send queue in no way indicates it is complete.  Also, the RDMA specs specifically state that the rnic/hca implementation can only assume an unsignaled work request completes successfully (and make its slot in the SQ available for the ULP) when a subsequent signaled work request completes successfully.   So if the next signaled work request fails, I believe the completion status of prior unsignaled work requests is indeterminate.

Well, actually I wasn't; I just assumed that FLUSH errors will come for
all WQ entries in the range {CI, PI}.
I get it: if a suppressed WQE was consumed and the QP went to error state
before a completion was placed, HW may flush it as well.
I agree this may happen. Thanks!
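
For readers following along, a minimal sketch of the signaling rule being
discussed (the helper and WR setup are illustrative, not code from the
patch series): an unsignaled WR never reports its own completion, so its
fate is only confirmed when a later *signaled* WR on the same QP completes
successfully.

#include <rdma/ib_verbs.h>

/* Sketch only: chain an unsignaled WR to a signaled one. If the signaled
 * WR later completes with FLUSH, the status of the unsignaled predecessor
 * is indeterminate even though the HW may have executed it. */
static int post_unsignaled_then_signaled(struct ib_qp *qp,
                                         struct ib_send_wr *first,
                                         struct ib_send_wr *second)
{
        struct ib_send_wr *bad_wr;

        first->send_flags &= ~IB_SEND_SIGNALED;   /* completion suppressed */
        second->send_flags |= IB_SEND_SIGNALED;   /* will generate a WC */
        first->next = second;

        return ib_post_send(qp, first, &bad_wr);
}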

Sagi.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 15:23                     ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-16 15:23 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Steve Wise, Linux NFS Mailing List, linux-rdma

On 4/16/2014 6:08 PM, Chuck Lever wrote:
> Hi Sagi-
>
> Thanks for the review! In-line replies below.
<SNIP>
>
>>> 2. If I understood correctly, I see that the CQs are loop-polled in an interrupt context.
>>>     This may become problematic in a stress workload where the CQ simply never drains (imagine
>>>     some task that keeps posting WRs as soon as IOs complete). This will cause the CQ poll loop
>>>     to go on forever. This situation may cause starvation of other CQs that share the same EQ (CPU
>>>     is hogged by the endless poller).
>>>     This may be solved in 2 ways:
>>>     - Use blk-iopoll to maintain fairness - the downside will be moving from interrupt context to soft-irq (slower).
>>>     - Set some configurable budget to CQ poller - after finishing the budget, notify the CQ and bail.
>>>       If there is another CQ waiting to get in - it's only fair that it will be given with a chance, else another interrupt
>>>       on that CQ will immediately come.
> I think it would be a reasonable change to pass an array of WC’s to
> ib_poll_cq() just once in rpcrdma_{send,recv}cq_poll(), instead of
> looping. Would that be enough?
>
> To be clear, my patch merely cleans up the completion handler logic,
> which already did loop-polling. Should we consider this improvement
> for another patch?

Well, I wasn't suggesting passing an array; carrying it around (or on
the stack, for that matter) might be annoying...
I was suggesting a budget (poll loops or a time bound - jiffy is usually 
well behaved).
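
To illustrate the time-bound variant, a rough sketch (names are made up,
not from the patch): poll until the CQ drains or one jiffy elapses, then
re-arm and bail.

#include <linux/jiffies.h>
#include <rdma/ib_verbs.h>

static void poll_for_one_jiffy(struct ib_cq *cq)
{
        unsigned long deadline = jiffies + 1;
        struct ib_wc wc;

        while (ib_poll_cq(cq, 1, &wc) == 1) {
                /* per-WC handling would go here */
                if (time_after(jiffies, deadline))
                        break;
        }

        /* give other CQs on the same vector a turn; re-arm before leaving */
        ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
}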

Sagi.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 15:46                                 ` Steve Wise
  0 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-16 15:46 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Chuck Lever', linux-nfs, linux-rdma



> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
> Sent: Wednesday, April 16, 2014 10:18 AM
> To: Steve Wise; 'Chuck Lever'; linux-nfs@vger.kernel.org; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
> 
> On 4/16/2014 5:43 PM, Steve Wise wrote:
> >>>> Hmm, But if either FASTREG or LINV failed the QP will go to error state
> >>>> and you *will* get the error wc (with a rain of FLUSH errors).
> >>>> AFAICT it is safe to assume that it succeeded as long as you don't get
> >>>> error completions.
> >>> But if an unsignaled FASTREG is posted and silently succeeds, then the next signaled work
> >> request fails, I believe the FASTREG will be completed with FLUSH status, yet the operation
> >> actually completed in the hw.
> >>
> >> Actually if (any) WR successfully completed and SW got it as FLUSH error
> >> it seems like a bug to me.
> >> Once the HW processed the WQ entry it should update the consumer index
> >> accordingly thus should not happen.
> > Aren't you assuming a specific hardware design/implementation?  For cxgb4, the fact that a
> work request was consumed by the HW from the host send queue in no way indicates it is
> complete.  Also, the RDMA specs specifically state that the rnic/hca implementation can only
> assume an unsignaled work request completes successfully (and make its slot in the SQ
> available for the ULP) when a subsequent signaled work request completes successfully.   So if
> the next signaled work request fails, I believe the completion status of prior unsignaled work
> requests is indeterminate.
> 
> Well actually I wasn't, I just assumed that FLUSH errors will come for
> all WQ entries in the range {CI, PI}.
> I get it, if a suppressed WQe was consumed and QP went to error state
> before a completion was placed, HW may flush it as well.
> I agree this may happen. Thanks!
> 

Thank you! :)   In fact, chelsio HW doesn't do ANY flushing.  It is all done in software at the time the QP exits RTS...

Stevo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-16 18:21                         ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-16 18:21 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Steve Wise, Linux NFS Mailing List, linux-rdma


On Apr 16, 2014, at 11:23 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> On 4/16/2014 6:08 PM, Chuck Lever wrote:
>> Hi Sagi-
>> 
>> Thanks for the review! In-line replies below.
> <SNIP>
>> 
>>>> 2. If I understood correctly, I see that the CQs are loop-polled in an interrupt context.
>>>>    This may become problematic in a stress workload where the CQ simply never drains (imagine
>>>>    some task that keeps posting WRs as soon as IOs complete). This will cause the CQ poll loop
>>>>    to go on forever. This situation may cause starvation of other CQs that share the same EQ (CPU
>>>>    is hogged by the endless poller).
>>>>    This may be solved in 2 ways:
>>>>    - Use blk-iopoll to maintain fairness - the downside will be moving from interrupt context to soft-irq (slower).
>>>>    - Set some configurable budget to CQ poller - after finishing the budget, notify the CQ and bail.
>>>>      If there is another CQ waiting to get in - it's only fair that it will be given with a chance, else another interrupt
>>>>      on that CQ will immediately come.
>> I think it would be a reasonable change to pass an array of WC’s to
>> ib_poll_cq() just once in rpcrdma_{send,recv}cq_poll(), instead of
>> looping. Would that be enough?
>> 
>> To be clear, my patch merely cleans up the completion handler logic,
>> which already did loop-polling. Should we consider this improvement
>> for another patch?
> 
> Well, I wasn't suggesting passing an array, carrying it around (or on the stack for that manner) might be annoying...
> I was suggesting a budget (poll loops or a time bound - jiffy is usually well behaved).

Passing a small array to ib_poll_cq() is actually easy to do, and is
exactly equivalent to a poll budget. The struct ib_wc should be taken
off the stack anyway, IMO.

The only other example I see in 3.15 right now is IPoIB, which seems
to do exactly this.

I’m testing a patch now. I’d like to start simple and make it more
complex only if we need to.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-17  7:06                             ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-17  7:06 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Steve Wise, Linux NFS Mailing List, linux-rdma

On 4/16/2014 9:21 PM, Chuck Lever wrote:
> Passing a small array to ip_poll_cq() is actually easy to do, and is
> exactly equivalent to a poll budget. The struct ib_wc should be taken
> off the stack anyway, IMO.
>
> The only other example I see in 3.15 right now is IPoIB, which seems
> to do exactly this.
>
> I’m testing a patch now. I’d like to start simple and make it more
> complex only if we need to.

What array size are you using? Note that if you use a small array it may 
be an overkill since
a lot more interrupts are invoked (-> more latency). I found that for a 
high workload a budget
of 256/512/1024 keeps fairness and doesn't increase latency.

Regardless, doing array-polling is a nice optimization reducing CQ 
entrances.

Sagi.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-17 13:55                                 ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-17 13:55 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Steve Wise, Linux NFS Mailing List, linux-rdma


On Apr 17, 2014, at 3:06 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> On 4/16/2014 9:21 PM, Chuck Lever wrote:
>> Passing a small array to ip_poll_cq() is actually easy to do, and is
>> exactly equivalent to a poll budget. The struct ib_wc should be taken
>> off the stack anyway, IMO.
>> 
>> The only other example I see in 3.15 right now is IPoIB, which seems
>> to do exactly this.
>> 
>> I’m testing a patch now. I’d like to start simple and make it more
>> complex only if we need to.
> 
> What array size are you using? Note that if you use a small array it may be an overkill since
> a lot more interrupts are invoked (-> more latency). I found that for a high workload a budget
> of 256/512/1024 keeps fairness and doesn't increase latency.

My array size is currently 4. It’s a macro that can be changed easily.

By a very large majority, my workloads see only one WC per completion 
upcall. However, I’m using an older card with simple synthetic benchmarks.

I don’t want to make the array large because struct ib_wc is at least
64 bytes on my systems — each WC array would be enormous and hardly ever
used. But we can dial it in over time.
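
For the record, taking 64 bytes per struct ib_wc as a lower bound, the
sizes being discussed work out to roughly:

        4 WCs    x 64 bytes =   256 bytes
        16 WCs   x 64 bytes =     1 KB
        256 WCs  x 64 bytes =    16 KB
        1024 WCs x 64 bytes =    64 KB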

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-17 14:34                                     ` Steve Wise
  0 siblings, 0 replies; 58+ messages in thread
From: Steve Wise @ 2014-04-17 14:34 UTC (permalink / raw)
  To: 'Chuck Lever', 'Sagi Grimberg'
  Cc: 'Linux NFS Mailing List', linux-rdma



> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On
> Behalf Of Chuck Lever
> Sent: Thursday, April 17, 2014 8:55 AM
> To: Sagi Grimberg
> Cc: Steve Wise; Linux NFS Mailing List; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH 7/8] xprtrdma: Split the completion queue
> 
> 
> On Apr 17, 2014, at 3:06 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> 
> > On 4/16/2014 9:21 PM, Chuck Lever wrote:
> >> Passing a small array to ip_poll_cq() is actually easy to do, and is
> >> exactly equivalent to a poll budget. The struct ib_wc should be taken
> >> off the stack anyway, IMO.
> >>
> >> The only other example I see in 3.15 right now is IPoIB, which seems
> >> to do exactly this.
> >>
> >> I'm testing a patch now. I'd like to start simple and make it more
> >> complex only if we need to.
> >
> > What array size are you using? Note that if you use a small array it may be an overkill since
> > a lot more interrupts are invoked (-> more latency). I found that for a high workload a budget
> > of 256/512/1024 keeps fairness and doesn't increase latency.
> 
> My array size is currently 4. It's a macro that can be changed easily.
> 
> By a very large majority, my workloads see only one WC per completion
> upcall. However, I'm using an older card with simple synthetic benchmarks.
> 
> I don't want to make the array large because struct ib_wc is at least
> 64 bytes on my systems - each WC array would be enormous and hardly ever
> used. But we can dial it in over time.

You could use a small array combined with a loop and a budget count.  So the code would
grab, say, 4 at a time, and keep looping polling up to 4 until the CQ is empty or the
desired budget is reached...
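
Something like this, purely as a sketch (the names and numbers are
illustrative, not from the series):

#include <rdma/ib_verbs.h>

#define WC_BATCH        4       /* illustrative batch size */
#define CQ_BUDGET       64      /* illustrative per-upcall budget */

/* Grab up to WC_BATCH completions per ib_poll_cq() call; stop when the
 * CQ is empty or CQ_BUDGET completions have been handled. */
static void poll_batched_with_budget(struct ib_cq *cq)
{
        struct ib_wc wcs[WC_BATCH];
        int budget = CQ_BUDGET;
        int n, i;

        while (budget > 0 && (n = ib_poll_cq(cq, WC_BATCH, wcs)) > 0) {
                for (i = 0; i < n; i++) {
                        /* real per-WC processing would go here */
                        if (wcs[i].status != IB_WC_SUCCESS)
                                pr_err("WC error: %d\n", wcs[i].status);
                }
                budget -= n;
        }
}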

Stevo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-17 19:08                                     ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-17 19:08 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Steve Wise, Linux NFS Mailing List, linux-rdma

On 4/17/2014 4:55 PM, Chuck Lever wrote:
> On Apr 17, 2014, at 3:06 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>> On 4/16/2014 9:21 PM, Chuck Lever wrote:
>>> Passing a small array to ip_poll_cq() is actually easy to do, and is
>>> exactly equivalent to a poll budget. The struct ib_wc should be taken
>>> off the stack anyway, IMO.
>>>
>>> The only other example I see in 3.15 right now is IPoIB, which seems
>>> to do exactly this.
>>>
>>> I’m testing a patch now. I’d like to start simple and make it more
>>> complex only if we need to.
>> What array size are you using? Note that if you use a small array it may be an overkill since
>> a lot more interrupts are invoked (-> more latency). I found that for a high workload a budget
>> of 256/512/1024 keeps fairness and doesn't increase latency.
> My array size is currently 4. It’s a macro that can be changed easily.
>
> By a very large majority, my workloads see only one WC per completion
> upcall. However, I’m using an older card with simple synthetic benchmarks.

It doesn't matter until it does matter...
We noticed this phenomenon under *extreme* stress.

> I don’t want to make the array large because struct ib_wc is at least
> 64 bytes on my systems — each WC array would be enormous and hardly ever
> used.

Right, large WC array would be annoying...

Sagi.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-17 19:11                                       ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-17 19:11 UTC (permalink / raw)
  To: Steve Wise, 'Chuck Lever'
  Cc: 'Linux NFS Mailing List', linux-rdma

On 4/17/2014 5:34 PM, Steve Wise wrote:

<SNIP>
> You could use a small array combined with a loop and a budget count.  So the code would
> grab, say, 4 at a time, and keep looping polling up to 4 until the CQ is empty or the
> desired budget is reached...

Bingo... couldn't agree more.

Poll Arrays are a nice optimization, but large arrays will just burden 
the stack (and might even make things worse in high workloads...)

Sagi.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-19 16:31                                           ` Chuck Lever
  0 siblings, 0 replies; 58+ messages in thread
From: Chuck Lever @ 2014-04-19 16:31 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Steve Wise, Linux NFS Mailing List, linux-rdma

Hi Sagi-

On Apr 17, 2014, at 3:11 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> On 4/17/2014 5:34 PM, Steve Wise wrote:
> 
> <SNIP>
>> You could use a small array combined with a loop and a budget count.  So the code would
>> grab, say, 4 at a time, and keep looping polling up to 4 until the CQ is empty or the
>> desired budget is reached...
> 
> Bingo... couldn't agree more.
> 
> Poll Arrays are a nice optimization,

Typically, a provider's poll_cq implementation takes the CQ lock
using spin_lock_irqsave().  My goal of using a poll array is to
reduce the number of times the completion handler invokes
spin_lock_irqsave / spin_unlock_irqsave pairs when draining a
large queue.

> but large arrays will just burden the stack (and might even make things worse in high workloads...)


My prototype moves the poll array off the stack and into allocated
storage.  Making that array as large as a single page would be
sufficient for 50 or more ib_wc structures on a platform with 4KB
pages and 64-bit addresses.
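
A minimal sketch of what that off-stack allocation might look like (the
names are made up):

#include <linux/slab.h>
#include <rdma/ib_verbs.h>

/* One page worth of WCs, allocated once per CQ, instead of a large
 * on-stack array in the completion handler. */
#define WC_ARRAY_ENTRIES        (PAGE_SIZE / sizeof(struct ib_wc))

static struct ib_wc *alloc_wc_array(void)
{
        return kcalloc(WC_ARRAY_ENTRIES, sizeof(struct ib_wc), GFP_KERNEL);
}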

The xprtrdma completion handler polls twice:

  1.  Drain the CQ completely

  2.  Re-arm

  3.  Drain the CQ completely again

So between steps 1. and 3. a single notification could handle over
100 WCs, if we were to budget by draining just a single array's worth
during each step. (Btw, I'm not opposed to looping while polling
arrays. This is just an example for discussion).
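
In outline (drain_cq() stands in for the poll loop; it is not a real
function in the code), the handler shape described above is:

#include <rdma/ib_verbs.h>

static void drain_cq(struct ib_cq *cq);           /* hypothetical poll loop */

static void example_cq_upcall(struct ib_cq *cq, void *context)
{
        drain_cq(cq);                              /* 1. drain completely */
        ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);     /* 2. re-arm */
        drain_cq(cq);                              /* 3. drain again */
}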

As for budgeting itself, I wonder if there is a possibility of losing
notifications.  The purpose of re-arming and then draining again is to
ensure that any items queued after step 1. and before step 2. are
captured, as by themselves they would never generate an upcall
notification, IIUC.

When the handler hits its budget and returns, xprtrdma needs to be
invoked again to finish draining the completion queue. How is that
guaranteed?

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/8] xprtrdma: Split the completion queue
@ 2014-04-20 12:42                                               ` Sagi Grimberg
  0 siblings, 0 replies; 58+ messages in thread
From: Sagi Grimberg @ 2014-04-20 12:42 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Steve Wise, Linux NFS Mailing List, linux-rdma

On 4/19/2014 7:31 PM, Chuck Lever wrote:
> Hi Sagi-
>
> On Apr 17, 2014, at 3:11 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>> On 4/17/2014 5:34 PM, Steve Wise wrote:
>>
>> <SNIP>
>>> You could use a small array combined with a loop and a budget count.  So the code would
>>> grab, say, 4 at a time, and keep looping polling up to 4 until the CQ is empty or the
>>> desired budget is reached...
>> Bingo... couldn't agree more.
>>
>> Poll Arrays are a nice optimization,
> Typically, a provider's poll_cq implementation takes the CQ lock
> using spin_lock_irqsave().  My goal of using a poll array is to
> reduce the number of times the completion handler invokes
> spin_lock_irqsave / spin_unlock_irqsave pairs when draining a
> large queue.

Yes, hence the optimization.

>
>> but large arrays will just burden the stack (and might even make things worse in high workloads...)
>
> My prototype moves the poll array off the stack and into allocated
> storage.  Making that array as large as a single page would be
> sufficient for 50 or more ib_wc structures on a platform with 4KB
> pages and 64-bit addresses.

You assume here the worst-case workload. In the sparse case you are 
carrying around redundant storage...
I would recommend using, say, a 16-entry WC array or so.

> The xprtrdma completion handler polls twice:
>
>    1.  Drain the CQ completely
>
>    2.  Re-arm
>
>    3.  Drain the CQ completely again
>
> So between steps 1. and 3. a single notification could handle over
> 100 WCs, if we were to budget by draining just a single array's worth
> during each step. (Btw, I'm not opposed to looping while polling
> arrays. This is just an example for discussion).
>
> As for budgeting itself, I wonder if there is a possibility of losing
> notifications.  The purpose of re-arming and then draining again is to
> ensure that any items queued after step 1. and before step 2. are
> captured, as by themselves they would never generate an upcall
> notification, IIUC.

I don't think there is a possibility of implicit loss of completions. 
HCAs that may miss completions
should honor the ib_req_notify_cq() flag IB_CQ_REPORT_MISSED_EVENTS.
/**
  * ib_req_notify_cq - Request completion notification on a CQ.
  * @cq: The CQ to generate an event for.
  * @flags:
  *   Must contain exactly one of %IB_CQ_SOLICITED or %IB_CQ_NEXT_COMP
  *   to request an event on the next solicited event or next work
  *   completion at any type, respectively. %IB_CQ_REPORT_MISSED_EVENTS
  *   may also be |ed in to request a hint about missed events, as
  *   described below.
  *
  * Return Value:
  *    < 0 means an error occurred while requesting notification
  *   == 0 means notification was requested successfully, and if
  *        IB_CQ_REPORT_MISSED_EVENTS was passed in, then no events
  *        were missed and it is safe to wait for another event.  In
  *        this case is it guaranteed that any work completions added
  *        to the CQ since the last CQ poll will trigger a completion
  *        notification event.
  *    > 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was passed
  *        in.  It means that the consumer must poll the CQ again to
  *        make sure it is empty to avoid missing an event because of a
  *        race between requesting notification and an entry being
  *        added to the CQ.  This return value means it is possible
  *        (but not guaranteed) that a work completion has been added
  *        to the CQ since the last poll without triggering a
  *        completion notification event.
  */

Other than that, if one stops polling and requests notification, the handler 
should be invoked again starting from the
correct producer index (i.e. no missed events).
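
Concretely, a consumer might use that hint like this (a sketch; drain_cq()
is a hypothetical helper that polls until the CQ is empty):

#include <rdma/ib_verbs.h>

static void drain_cq(struct ib_cq *cq);           /* hypothetical */

/* Re-arm with the missed-events hint; a positive return means completions
 * may already be queued, so poll again instead of waiting for an upcall. */
static void rearm_and_recheck(struct ib_cq *cq)
{
        int rc;

        rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
                                  IB_CQ_REPORT_MISSED_EVENTS);
        if (rc > 0)
                drain_cq(cq);
        else if (rc < 0)
                pr_err("ib_req_notify_cq failed: %d\n", rc);
}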

Hope this helps,

Sagi.

^ permalink raw reply	[flat|nested] 58+ messages in thread

Thread overview: 58+ messages
2014-04-14 22:22 [PATCH 0/8] NFS/RDMA patches for review Chuck Lever
2014-04-14 22:22 ` Chuck Lever
     [not found] ` <20140414220041.20646.63991.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2014-04-14 22:22   ` [PATCH 1/8] xprtrdma: RPC/RDMA must invoke xprt_wake_pending_tasks() in process context Chuck Lever
2014-04-14 22:22     ` Chuck Lever
2014-04-14 22:22   ` [PATCH 2/8] xprtrdma: Remove BOUNCEBUFFERS memory registration mode Chuck Lever
2014-04-14 22:22     ` Chuck Lever
2014-04-14 22:22   ` [PATCH 3/8] xprtrdma: Disable ALLPHYSICAL mode by default Chuck Lever
2014-04-14 22:22     ` Chuck Lever
2014-04-14 22:22   ` [PATCH 4/8] xprtrdma: Remove support for MEMWINDOWS registration mode Chuck Lever
2014-04-14 22:22     ` Chuck Lever
2014-04-14 22:23   ` [PATCH 5/8] xprtrdma: Simplify rpcrdma_deregister_external() synopsis Chuck Lever
2014-04-14 22:23     ` Chuck Lever
2014-04-14 22:23   ` [PATCH 6/8] xprtrdma: Make rpcrdma_ep_destroy() return void Chuck Lever
2014-04-14 22:23     ` Chuck Lever
2014-04-14 22:23   ` [PATCH 7/8] xprtrdma: Split the completion queue Chuck Lever
2014-04-14 22:23     ` Chuck Lever
     [not found]     ` <20140414222323.20646.66946.stgit-FYjufvaPoItvLzlybtyyYzGyq/o6K9yX@public.gmane.org>
2014-04-16 12:48       ` Sagi Grimberg
2014-04-16 12:48         ` Sagi Grimberg
     [not found]         ` <534E7C1C.5070407-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-04-16 13:30           ` Steve Wise
2014-04-16 13:30             ` Steve Wise
     [not found]             ` <534E8608.8030801-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2014-04-16 14:12               ` Sagi Grimberg
2014-04-16 14:12                 ` Sagi Grimberg
     [not found]                 ` <534E8FCE.909-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-04-16 14:25                   ` Steve Wise
2014-04-16 14:25                     ` Steve Wise
2014-04-16 14:35                     ` Sagi Grimberg
2014-04-16 14:35                       ` Sagi Grimberg
     [not found]                       ` <534E9534.9020004-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-04-16 14:43                         ` Steve Wise
2014-04-16 14:43                           ` Steve Wise
2014-04-16 15:18                           ` Sagi Grimberg
2014-04-16 15:18                             ` Sagi Grimberg
     [not found]                             ` <534E9F40.8000905-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-04-16 15:46                               ` Steve Wise
2014-04-16 15:46                                 ` Steve Wise
2014-04-16 15:08               ` Chuck Lever
2014-04-16 15:08                 ` Chuck Lever
     [not found]                 ` <E9B601B5-1984-40D8-914A-DDD1380E8183-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2014-04-16 15:23                   ` Sagi Grimberg
2014-04-16 15:23                     ` Sagi Grimberg
     [not found]                     ` <534EA06A.7090200-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-04-16 18:21                       ` Chuck Lever
2014-04-16 18:21                         ` Chuck Lever
     [not found]                         ` <FC61F219-ACA2-4CE8-BF02-1B8EE2464639-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2014-04-17  7:06                           ` Sagi Grimberg
2014-04-17  7:06                             ` Sagi Grimberg
     [not found]                             ` <534F7D5F.1090908-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-04-17 13:55                               ` Chuck Lever
2014-04-17 13:55                                 ` Chuck Lever
     [not found]                                 ` <A7E4B101-BAF0-480C-95AE-26BB845EE2C3-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2014-04-17 14:34                                   ` Steve Wise
2014-04-17 14:34                                     ` Steve Wise
2014-04-17 19:11                                     ` Sagi Grimberg
2014-04-17 19:11                                       ` Sagi Grimberg
     [not found]                                       ` <5350277C.20608-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-04-19 16:31                                         ` Chuck Lever
2014-04-19 16:31                                           ` Chuck Lever
     [not found]                                           ` <593D9BFA-714E-417F-ACA0-05594290C4D1-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2014-04-20 12:42                                             ` Sagi Grimberg
2014-04-20 12:42                                               ` Sagi Grimberg
2014-04-17 19:08                                   ` Sagi Grimberg
2014-04-17 19:08                                     ` Sagi Grimberg
2014-04-14 22:23   ` [PATCH 8/8] xprtrdma: Reduce the number of hardway buffer allocations Chuck Lever
2014-04-14 22:23     ` Chuck Lever
2014-04-15 20:15   ` [PATCH 0/8] NFS/RDMA patches for review Steve Wise
2014-04-15 20:15     ` Steve Wise
     [not found]     ` <534D9373.3050406-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2014-04-15 20:20       ` Steve Wise
2014-04-15 20:20         ` Steve Wise
