* [PATCH v1 00/19] NFS/RDMA client patches for next
@ 2018-05-04 19:34 Chuck Lever
  2018-05-04 19:34 ` [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source Chuck Lever
                   ` (18 more replies)
  0 siblings, 19 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:34 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Hi Anna-

Don't know what to call the next kernel release. v4.18? v5.0?
Anyway, here is the full set I'd like to see merged in that
release.

Along with the Receive efficiency-related patches that did not
get into v4.17, there are a number of unrelated fixes,
improvements, and clean-ups in this series.

There is a three-patch series near the end that handles the
"empty sendctx queue" case a little more nicely. Instead of
waiting an arbitrary amount of time and trying again, an RPC
waits for the transport to wake it up when there are more
sendctxs available. I've found this makes the transport a
little less prone to deadlock under heavy workloads.
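The wait/wake scheme those patches introduce can be sketched, very
loosely, as a classic blocked-waiter pattern. The sketch below is a
hypothetical userspace pthreads analogy only: the struct, field, and
function names are invented for illustration, and the real kernel code
uses xprtrdma's own sendctx structures and the kernel's wait/wake
primitives rather than pthreads.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for the sendctx queue: a counted resource
 * protected by a lock. None of these names exist in xprtrdma. */
struct sendctx_queue {
	pthread_mutex_t lock;
	pthread_cond_t  more;	/* signaled when a sendctx is returned */
	int             avail;
};

/* Old behavior: fail, then sleep an arbitrary delay and retry.
 * New behavior sketched here: block until the transport wakes us. */
static void sendctx_get(struct sendctx_queue *q)
{
	pthread_mutex_lock(&q->lock);
	while (q->avail == 0)
		pthread_cond_wait(&q->more, &q->lock);
	q->avail--;
	pthread_mutex_unlock(&q->lock);
}

/* Analogous to a Send completion: return the sendctx, wake one waiter. */
static void sendctx_put(struct sendctx_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->avail++;
	pthread_cond_signal(&q->more);
	pthread_mutex_unlock(&q->lock);
}

static void *completion_thread(void *arg)
{
	sendctx_put(arg);	/* simulate a Send completion */
	return NULL;
}

int sendctx_demo(void)
{
	struct sendctx_queue q = {
		.lock  = PTHREAD_MUTEX_INITIALIZER,
		.more  = PTHREAD_COND_INITIALIZER,
		.avail = 0,
	};
	pthread_t t;

	pthread_create(&t, NULL, completion_thread, &q);
	sendctx_get(&q);	/* blocks until the "completion" arrives */
	pthread_join(t, NULL);
	return q.avail;		/* the one context was consumed again */
}
```

The point of the pattern is that the waiter is woken exactly when a
resource is known to exist, instead of polling on a timer and possibly
losing the race to another sender every time.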

As usual, the series can be found in my git repo as well:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfs-rdma-for-4.18

---

Chuck Lever (19):
      xprtrdma: Add proper SPDX tags for NetApp-contributed source
      xprtrdma: Try to fail quickly if proto=rdma
      xprtrdma: Create transport's CM ID in the correct network namespace
      xprtrdma: Fix max_send_wr computation
      SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
      SUNRPC: Add a ->free_slot transport callout
      xprtrdma: Introduce ->alloc_slot call-out for xprtrdma
      xprtrdma: Make rpc_rqst part of rpcrdma_req
      xprtrdma: Clean up Receive trace points
      xprtrdma: Move Receive posting to Receive handler
      xprtrdma: Remove rpcrdma_ep_{post_recv,post_extra_recv}
      xprtrdma: Remove rpcrdma_buffer_get_req_locked()
      xprtrdma: Remove rpcrdma_buffer_get_rep_locked()
      xprtrdma: Make rpcrdma_sendctx_put_locked() a static function
      xprtrdma: Return -ENOBUFS when no pages are available
      xprtrdma: Move common wait_for_buffer_space call to parent function
      xprtrdma: Wait on empty sendctx queue
      xprtrdma: Add trace_xprtrdma_dma_map(mr)
      xprtrdma: Remove transfertypes array


 include/linux/sunrpc/rpc_rdma.h            |    1 
 include/linux/sunrpc/xprt.h                |    6 -
 include/linux/sunrpc/xprtrdma.h            |    1 
 include/trace/events/rpcrdma.h             |   76 +++++--
 net/sunrpc/clnt.c                          |    1 
 net/sunrpc/xprt.c                          |   17 +-
 net/sunrpc/xprtrdma/backchannel.c          |  105 ++++------
 net/sunrpc/xprtrdma/fmr_ops.c              |   23 ++
 net/sunrpc/xprtrdma/frwr_ops.c             |   31 +++
 net/sunrpc/xprtrdma/module.c               |    1 
 net/sunrpc/xprtrdma/rpc_rdma.c             |   66 ++----
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    1 
 net/sunrpc/xprtrdma/transport.c            |   64 +++++-
 net/sunrpc/xprtrdma/verbs.c                |  291 +++++++++++-----------------
 net/sunrpc/xprtrdma/xprt_rdma.h            |   26 +--
 net/sunrpc/xprtsock.c                      |    4 
 16 files changed, 359 insertions(+), 355 deletions(-)

--
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
@ 2018-05-04 19:34 ` Chuck Lever
  2018-05-07 13:27   ` Anna Schumaker
  2018-05-04 19:34 ` [PATCH v1 02/19] xprtrdma: Try to fail quickly if proto=rdma Chuck Lever
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:34 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/rpc_rdma.h |    1 +
 include/linux/sunrpc/xprtrdma.h |    1 +
 net/sunrpc/xprtrdma/module.c    |    1 +
 net/sunrpc/xprtrdma/rpc_rdma.c  |    1 +
 net/sunrpc/xprtrdma/transport.c |    1 +
 net/sunrpc/xprtrdma/verbs.c     |    1 +
 net/sunrpc/xprtrdma/xprt_rdma.h |    1 +
 7 files changed, 7 insertions(+)

diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
index 8f144db..92d182f 100644
--- a/include/linux/sunrpc/rpc_rdma.h
+++ b/include/linux/sunrpc/rpc_rdma.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
 /*
  * Copyright (c) 2015-2017 Oracle. All rights reserved.
  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
index 5859563..86fc38f 100644
--- a/include/linux/sunrpc/xprtrdma.h
+++ b/include/linux/sunrpc/xprtrdma.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
 /*
  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
  *
diff --git a/net/sunrpc/xprtrdma/module.c b/net/sunrpc/xprtrdma/module.c
index a762d19..f338065 100644
--- a/net/sunrpc/xprtrdma/module.c
+++ b/net/sunrpc/xprtrdma/module.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /*
  * Copyright (c) 2015, 2017 Oracle.  All rights reserved.
  */
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index e8adad3..8f89e3f 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /*
  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index cc1aad3..4717578 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /*
  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c345d36..10f5032 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /*
  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index cb41b12..e83ba758 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
 /*
  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.



* [PATCH v1 02/19] xprtrdma: Try to fail quickly if proto=rdma
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
  2018-05-04 19:34 ` [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source Chuck Lever
@ 2018-05-04 19:34 ` Chuck Lever
  2018-05-04 19:34 ` [PATCH v1 03/19] xprtrdma: Create transport's CM ID in the correct network namespace Chuck Lever
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:34 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

rdma_resolve_addr(3) says:

> This call is used to map a given destination IP address to a
> usable RDMA address. The IP to RDMA address mapping is done
> using the local routing tables, or via ARP.

If this can't be done, there's no local device that can be used
to establish an RDMA-capable network path to the remote. In this
case, the RDMA CM very quickly posts an RDMA_CM_EVENT_ADDR_ERROR
upcall.

Currently rpcrdma_conn_upcall() converts RDMA_CM_EVENT_ADDR_ERROR
to EHOSTUNREACH. mount.nfs seems to want to retry EHOSTUNREACH
forever, thinking that this is a temporary situation. This makes
mount.nfs appear to hang if I try to mount with proto=rdma through,
say, a conventional Ethernet device.

If the admin has specified proto=rdma along with a server IP address
that requires a network path that does not support RDMA, instead
let's fail with a permanent error. -EPROTONOSUPPORT is returned when
NFSv4 or one of its minor versions is not supported.

-EPROTO is not (currently) retried by mount.nfs.

There are potentially other similar cases where -EPROTO is an
appropriate return code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
---
 net/sunrpc/xprtrdma/verbs.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 10f5032..331d372 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -232,7 +232,7 @@
 		complete(&ia->ri_done);
 		break;
 	case RDMA_CM_EVENT_ADDR_ERROR:
-		ia->ri_async_rc = -EHOSTUNREACH;
+		ia->ri_async_rc = -EPROTO;
 		complete(&ia->ri_done);
 		break;
 	case RDMA_CM_EVENT_ROUTE_ERROR:
@@ -263,7 +263,7 @@
 		connstate = -ENOTCONN;
 		goto connected;
 	case RDMA_CM_EVENT_UNREACHABLE:
-		connstate = -ENETDOWN;
+		connstate = -ENETUNREACH;
 		goto connected;
 	case RDMA_CM_EVENT_REJECTED:
 		dprintk("rpcrdma: connection to %s:%s rejected: %s\n",



* [PATCH v1 03/19] xprtrdma: Create transport's CM ID in the correct network namespace
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
  2018-05-04 19:34 ` [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source Chuck Lever
  2018-05-04 19:34 ` [PATCH v1 02/19] xprtrdma: Try to fail quickly if proto=rdma Chuck Lever
@ 2018-05-04 19:34 ` Chuck Lever
  2018-05-04 19:34 ` [PATCH v1 04/19] xprtrdma: Fix max_send_wr computation Chuck Lever
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:34 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Set up RPC/RDMA transport in mount.nfs's network namespace. This
passes the correct namespace information to the RDMA core, similar
to how RPC sockets are created (see xs_create_sock).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 331d372..817a692 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -306,8 +306,8 @@
 	init_completion(&ia->ri_done);
 	init_completion(&ia->ri_remove_done);
 
-	id = rdma_create_id(&init_net, rpcrdma_conn_upcall, xprt, RDMA_PS_TCP,
-			    IB_QPT_RC);
+	id = rdma_create_id(xprt->rx_xprt.xprt_net, rpcrdma_conn_upcall,
+			    xprt, RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(id)) {
 		rc = PTR_ERR(id);
 		dprintk("RPC:       %s: rdma_create_id() failed %i\n",



* [PATCH v1 04/19] xprtrdma: Fix max_send_wr computation
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (2 preceding siblings ...)
  2018-05-04 19:34 ` [PATCH v1 03/19] xprtrdma: Create transport's CM ID in the correct network namespace Chuck Lever
@ 2018-05-04 19:34 ` Chuck Lever
  2018-05-04 19:34 ` [PATCH v1 05/19] SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock Chuck Lever
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:34 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

For FRWR, the computation of max_send_wr is split between
frwr_op_open and rpcrdma_ep_create, which makes it difficult to tell
that the max_send_wr result is currently incorrect if frwr_op_open
has to reduce the credit limit to accommodate a small max_qp_wr.
This is a problem now that extra WRs are needed for backchannel
operations and a drain CQE.

So, refactor the computation so that it is all done in ->ro_open,
and fix the FRWR version of this computation so that it
accommodates HCAs with small max_qp_wr correctly.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   22 ++++++++++++++++++++++
 net/sunrpc/xprtrdma/frwr_ops.c |   30 ++++++++++++++++++++++++++----
 net/sunrpc/xprtrdma/verbs.c    |   24 ++++--------------------
 3 files changed, 52 insertions(+), 24 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f2f6395..0815f9e 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -156,10 +156,32 @@ enum {
 	fmr_op_release_mr(mr);
 }
 
+/* On success, sets:
+ *	ep->rep_attr.cap.max_send_wr
+ *	ep->rep_attr.cap.max_recv_wr
+ *	cdata->max_requests
+ *	ia->ri_max_segs
+ */
 static int
 fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
 	    struct rpcrdma_create_data_internal *cdata)
 {
+	int max_qp_wr;
+
+	max_qp_wr = ia->ri_device->attrs.max_qp_wr;
+	max_qp_wr -= RPCRDMA_BACKWARD_WRS;
+	max_qp_wr -= 1;
+	if (max_qp_wr < RPCRDMA_MIN_SLOT_TABLE)
+		return -ENOMEM;
+	if (cdata->max_requests > max_qp_wr)
+		cdata->max_requests = max_qp_wr;
+	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
+	ep->rep_attr.cap.max_send_wr += RPCRDMA_BACKWARD_WRS;
+	ep->rep_attr.cap.max_send_wr += 1; /* for ib_drain_sq */
+	ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
+	ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
+	ep->rep_attr.cap.max_recv_wr += 1; /* for ib_drain_rq */
+
 	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
 				RPCRDMA_MAX_FMR_SGES);
 	return 0;
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index c59c5c7..cf5095d6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -202,12 +202,22 @@
 	frwr_op_release_mr(mr);
 }
 
+/* On success, sets:
+ *	ep->rep_attr.cap.max_send_wr
+ *	ep->rep_attr.cap.max_recv_wr
+ *	cdata->max_requests
+ *	ia->ri_max_segs
+ *
+ * And these FRWR-related fields:
+ *	ia->ri_max_frwr_depth
+ *	ia->ri_mrtype
+ */
 static int
 frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
 	     struct rpcrdma_create_data_internal *cdata)
 {
 	struct ib_device_attr *attrs = &ia->ri_device->attrs;
-	int depth, delta;
+	int max_qp_wr, depth, delta;
 
 	ia->ri_mrtype = IB_MR_TYPE_MEM_REG;
 	if (attrs->device_cap_flags & IB_DEVICE_SG_GAPS_REG)
@@ -241,14 +251,26 @@
 		} while (delta > 0);
 	}
 
-	ep->rep_attr.cap.max_send_wr *= depth;
-	if (ep->rep_attr.cap.max_send_wr > attrs->max_qp_wr) {
-		cdata->max_requests = attrs->max_qp_wr / depth;
+	max_qp_wr = ia->ri_device->attrs.max_qp_wr;
+	max_qp_wr -= RPCRDMA_BACKWARD_WRS;
+	max_qp_wr -= 1;
+	if (max_qp_wr < RPCRDMA_MIN_SLOT_TABLE)
+		return -ENOMEM;
+	if (cdata->max_requests > max_qp_wr)
+		cdata->max_requests = max_qp_wr;
+	ep->rep_attr.cap.max_send_wr = cdata->max_requests * depth;
+	if (ep->rep_attr.cap.max_send_wr > max_qp_wr) {
+		cdata->max_requests = max_qp_wr / depth;
 		if (!cdata->max_requests)
 			return -EINVAL;
 		ep->rep_attr.cap.max_send_wr = cdata->max_requests *
 					       depth;
 	}
+	ep->rep_attr.cap.max_send_wr += RPCRDMA_BACKWARD_WRS;
+	ep->rep_attr.cap.max_send_wr += 1; /* for ib_drain_sq */
+	ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
+	ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
+	ep->rep_attr.cap.max_recv_wr += 1; /* for ib_drain_rq */
 
 	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
 				ia->ri_max_frwr_depth);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 817a692..581d0ae 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -501,8 +501,8 @@
 		  struct rpcrdma_create_data_internal *cdata)
 {
 	struct rpcrdma_connect_private *pmsg = &ep->rep_cm_private;
-	unsigned int max_qp_wr, max_sge;
 	struct ib_cq *sendcq, *recvcq;
+	unsigned int max_sge;
 	int rc;
 
 	max_sge = min_t(unsigned int, ia->ri_device->attrs.max_sge,
@@ -513,29 +513,13 @@
 	}
 	ia->ri_max_send_sges = max_sge;
 
-	if (ia->ri_device->attrs.max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
-		dprintk("RPC:       %s: insufficient wqe's available\n",
-			__func__);
-		return -ENOMEM;
-	}
-	max_qp_wr = ia->ri_device->attrs.max_qp_wr - RPCRDMA_BACKWARD_WRS - 1;
-
-	/* check provider's send/recv wr limits */
-	if (cdata->max_requests > max_qp_wr)
-		cdata->max_requests = max_qp_wr;
+	rc = ia->ri_ops->ro_open(ia, ep, cdata);
+	if (rc)
+		return rc;
 
 	ep->rep_attr.event_handler = rpcrdma_qp_async_error_upcall;
 	ep->rep_attr.qp_context = ep;
 	ep->rep_attr.srq = NULL;
-	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
-	ep->rep_attr.cap.max_send_wr += RPCRDMA_BACKWARD_WRS;
-	ep->rep_attr.cap.max_send_wr += 1;	/* drain cqe */
-	rc = ia->ri_ops->ro_open(ia, ep, cdata);
-	if (rc)
-		return rc;
-	ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
-	ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
-	ep->rep_attr.cap.max_recv_wr += 1;	/* drain cqe */
 	ep->rep_attr.cap.max_send_sge = max_sge;
 	ep->rep_attr.cap.max_recv_sge = 1;
 	ep->rep_attr.cap.max_inline_data = 0;



* [PATCH v1 05/19] SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (3 preceding siblings ...)
  2018-05-04 19:34 ` [PATCH v1 04/19] xprtrdma: Fix max_send_wr computation Chuck Lever
@ 2018-05-04 19:34 ` Chuck Lever
  2018-05-04 19:34 ` [PATCH v1 06/19] SUNRPC: Add a ->free_slot transport callout Chuck Lever
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:34 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

alloc_slot is a transport-specific op, but initializing an rpc_rqst
is common to all transports. In addition, the only part of
initializing an rpc_rqst that needs serialization is getting a
fresh XID.

Move rpc_rqst initialization to common code in preparation for
adding a transport-specific alloc_slot to xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/xprt.h |    1 +
 net/sunrpc/clnt.c           |    1 +
 net/sunrpc/xprt.c           |   12 +++++++-----
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 5fea0fb..9784e28 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -324,6 +324,7 @@ struct xprt_class {
 struct rpc_xprt		*xprt_create_transport(struct xprt_create *args);
 void			xprt_connect(struct rpc_task *task);
 void			xprt_reserve(struct rpc_task *task);
+void			xprt_request_init(struct rpc_task *task);
 void			xprt_retry_reserve(struct rpc_task *task);
 int			xprt_reserve_xprt(struct rpc_xprt *xprt, struct rpc_task *task);
 int			xprt_reserve_xprt_cong(struct rpc_xprt *xprt, struct rpc_task *task);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index c2266f3..d839c33 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1546,6 +1546,7 @@ void rpc_force_rebind(struct rpc_clnt *clnt)
 	task->tk_status = 0;
 	if (status >= 0) {
 		if (task->tk_rqstp) {
+			xprt_request_init(task);
 			task->tk_action = call_refresh;
 			return;
 		}
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 70f0050..2d95926 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -66,7 +66,7 @@
  * Local functions
  */
 static void	 xprt_init(struct rpc_xprt *xprt, struct net *net);
-static void	xprt_request_init(struct rpc_task *, struct rpc_xprt *);
+static __be32	xprt_alloc_xid(struct rpc_xprt *xprt);
 static void	xprt_connect_status(struct rpc_task *task);
 static int      __xprt_get_cong(struct rpc_xprt *, struct rpc_task *);
 static void     __xprt_put_cong(struct rpc_xprt *, struct rpc_rqst *);
@@ -987,6 +987,8 @@ bool xprt_prepare_transmit(struct rpc_task *task)
 		task->tk_status = -EAGAIN;
 		goto out_unlock;
 	}
+	if (!bc_prealloc(req) && !req->rq_xmit_bytes_sent)
+		req->rq_xid = xprt_alloc_xid(xprt);
 	ret = true;
 out_unlock:
 	spin_unlock_bh(&xprt->transport_lock);
@@ -1163,10 +1165,10 @@ void xprt_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
 out_init_req:
 	xprt->stat.max_slots = max_t(unsigned int, xprt->stat.max_slots,
 				     xprt->num_reqs);
+	spin_unlock(&xprt->reserve_lock);
+
 	task->tk_status = 0;
 	task->tk_rqstp = req;
-	xprt_request_init(task, xprt);
-	spin_unlock(&xprt->reserve_lock);
 }
 EXPORT_SYMBOL_GPL(xprt_alloc_slot);
 
@@ -1303,8 +1305,9 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
 	xprt->xid = prandom_u32();
 }
 
-static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
+void xprt_request_init(struct rpc_task *task)
 {
+	struct rpc_xprt *xprt = task->tk_xprt;
 	struct rpc_rqst	*req = task->tk_rqstp;
 
 	INIT_LIST_HEAD(&req->rq_list);
@@ -1312,7 +1315,6 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 	req->rq_task	= task;
 	req->rq_xprt    = xprt;
 	req->rq_buffer  = NULL;
-	req->rq_xid     = xprt_alloc_xid(xprt);
 	req->rq_connect_cookie = xprt->connect_cookie - 1;
 	req->rq_bytes_sent = 0;
 	req->rq_snd_buf.len = 0;



* [PATCH v1 06/19] SUNRPC: Add a ->free_slot transport callout
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (4 preceding siblings ...)
  2018-05-04 19:34 ` [PATCH v1 05/19] SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock Chuck Lever
@ 2018-05-04 19:34 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 07/19] xprtrdma: Introduce ->alloc_slot call-out for xprtrdma Chuck Lever
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:34 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Refactor: xprtrdma needs to have better control over when RPCs are
awoken from the backlog queue, so replace xprt_free_slot with a
transport op callout.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/xprt.h                |    4 ++++
 net/sunrpc/xprt.c                          |    5 +++--
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    1 +
 net/sunrpc/xprtrdma/transport.c            |    1 +
 net/sunrpc/xprtsock.c                      |    4 ++++
 5 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 9784e28..706eef1 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -127,6 +127,8 @@ struct rpc_xprt_ops {
 	int		(*reserve_xprt)(struct rpc_xprt *xprt, struct rpc_task *task);
 	void		(*release_xprt)(struct rpc_xprt *xprt, struct rpc_task *task);
 	void		(*alloc_slot)(struct rpc_xprt *xprt, struct rpc_task *task);
+	void		(*free_slot)(struct rpc_xprt *xprt,
+				     struct rpc_rqst *req);
 	void		(*rpcbind)(struct rpc_task *task);
 	void		(*set_port)(struct rpc_xprt *xprt, unsigned short port);
 	void		(*connect)(struct rpc_xprt *xprt, struct rpc_task *task);
@@ -329,6 +331,8 @@ struct xprt_class {
 int			xprt_reserve_xprt(struct rpc_xprt *xprt, struct rpc_task *task);
 int			xprt_reserve_xprt_cong(struct rpc_xprt *xprt, struct rpc_task *task);
 void			xprt_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task);
+void			xprt_free_slot(struct rpc_xprt *xprt,
+				       struct rpc_rqst *req);
 void			xprt_lock_and_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task);
 bool			xprt_prepare_transmit(struct rpc_task *task);
 void			xprt_transmit(struct rpc_task *task);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2d95926..3c85af0 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1186,7 +1186,7 @@ void xprt_lock_and_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
 }
 EXPORT_SYMBOL_GPL(xprt_lock_and_alloc_slot);
 
-static void xprt_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *req)
+void xprt_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *req)
 {
 	spin_lock(&xprt->reserve_lock);
 	if (!xprt_dynamic_free_slot(xprt, req)) {
@@ -1196,6 +1196,7 @@ static void xprt_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *req)
 	xprt_wake_up_backlog(xprt);
 	spin_unlock(&xprt->reserve_lock);
 }
+EXPORT_SYMBOL_GPL(xprt_free_slot);
 
 static void xprt_free_all_slots(struct rpc_xprt *xprt)
 {
@@ -1375,7 +1376,7 @@ void xprt_release(struct rpc_task *task)
 
 	dprintk("RPC: %5u release request %p\n", task->tk_pid, req);
 	if (likely(!bc_prealloc(req)))
-		xprt_free_slot(xprt, req);
+		xprt->ops->free_slot(xprt, req);
 	else
 		xprt_free_bc_request(req);
 }
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index a73632c..1035516 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -273,6 +273,7 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
 	.reserve_xprt		= xprt_reserve_xprt_cong,
 	.release_xprt		= xprt_release_xprt_cong,
 	.alloc_slot		= xprt_alloc_slot,
+	.free_slot		= xprt_free_slot,
 	.release_request	= xprt_release_rqst_cong,
 	.buf_alloc		= xprt_rdma_bc_allocate,
 	.buf_free		= xprt_rdma_bc_free,
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 4717578..cf5e866 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -781,6 +781,7 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
 	.reserve_xprt		= xprt_reserve_xprt_cong,
 	.release_xprt		= xprt_release_xprt_cong, /* sunrpc/xprt.c */
 	.alloc_slot		= xprt_alloc_slot,
+	.free_slot		= xprt_free_slot,
 	.release_request	= xprt_release_rqst_cong,       /* ditto */
 	.set_retrans_timeout	= xprt_set_retrans_timeout_def, /* ditto */
 	.timer			= xprt_rdma_timer,
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index c8902f1..9e1c502 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2763,6 +2763,7 @@ static void bc_destroy(struct rpc_xprt *xprt)
 	.reserve_xprt		= xprt_reserve_xprt,
 	.release_xprt		= xs_tcp_release_xprt,
 	.alloc_slot		= xprt_alloc_slot,
+	.free_slot		= xprt_free_slot,
 	.rpcbind		= xs_local_rpcbind,
 	.set_port		= xs_local_set_port,
 	.connect		= xs_local_connect,
@@ -2782,6 +2783,7 @@ static void bc_destroy(struct rpc_xprt *xprt)
 	.reserve_xprt		= xprt_reserve_xprt_cong,
 	.release_xprt		= xprt_release_xprt_cong,
 	.alloc_slot		= xprt_alloc_slot,
+	.free_slot		= xprt_free_slot,
 	.rpcbind		= rpcb_getport_async,
 	.set_port		= xs_set_port,
 	.connect		= xs_connect,
@@ -2803,6 +2805,7 @@ static void bc_destroy(struct rpc_xprt *xprt)
 	.reserve_xprt		= xprt_reserve_xprt,
 	.release_xprt		= xs_tcp_release_xprt,
 	.alloc_slot		= xprt_lock_and_alloc_slot,
+	.free_slot		= xprt_free_slot,
 	.rpcbind		= rpcb_getport_async,
 	.set_port		= xs_set_port,
 	.connect		= xs_connect,
@@ -2834,6 +2837,7 @@ static void bc_destroy(struct rpc_xprt *xprt)
 	.reserve_xprt		= xprt_reserve_xprt,
 	.release_xprt		= xprt_release_xprt,
 	.alloc_slot		= xprt_alloc_slot,
+	.free_slot		= xprt_free_slot,
 	.buf_alloc		= bc_malloc,
 	.buf_free		= bc_free,
 	.send_request		= bc_send_request,



* [PATCH v1 07/19] xprtrdma: Introduce ->alloc_slot call-out for xprtrdma
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (5 preceding siblings ...)
  2018-05-04 19:34 ` [PATCH v1 06/19] SUNRPC: Add a ->free_slot transport callout Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 08/19] xprtrdma: Make rpc_rqst part of rpcrdma_req Chuck Lever
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

rpcrdma_buffer_get acquires an rpcrdma_req and rep for each RPC.
Currently this is done in the call_allocate action, and sometimes it
can fail if there are many outstanding RPCs.

When call_allocate fails, the RPC task is put on the delayq. It is
awoken a few milliseconds later, but there's no guarantee it will
get a buffer at that time. The RPC task can be repeatedly put back
to sleep or even starved.

The call_allocate action should rarely fail. The delayq mechanism is
not meant to deal with transport congestion.

In the current sunrpc stack, there is a friendlier way to deal with
this situation. These objects are actually tantamount to an RPC
slot (rpc_rqst) and there is a separate FSM action, distinct from
call_allocate, for allocating slot resources. This is the
call_reserve action.

When allocation fails during this action, the RPC is placed on the
transport's backlog queue. The backlog mechanism provides a stronger
guarantee that when the RPC is awoken, a buffer will be available
for it; and backlogged RPCs are awoken one-at-a-time.

To make slot resource allocation occur in the call_reserve action,
create special ->alloc_slot and ->free_slot call-outs for xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/transport.c |   52 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index cf5e866..8f9338e 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -538,6 +538,54 @@
 	}
 }
 
+/**
+ * xprt_rdma_alloc_slot - allocate an rpc_rqst
+ * @xprt: controlling RPC transport
+ * @task: RPC task requesting a fresh rpc_rqst
+ *
+ * tk_status values:
+ *	%0 if task->tk_rqstp points to a fresh rpc_rqst
+ *	%-EAGAIN if no rpc_rqst is available; queued on backlog
+ */
+static void
+xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
+{
+	struct rpc_rqst *rqst;
+
+	spin_lock(&xprt->reserve_lock);
+	if (list_empty(&xprt->free))
+		goto out_sleep;
+	rqst = list_first_entry(&xprt->free, struct rpc_rqst, rq_list);
+	list_del(&rqst->rq_list);
+	spin_unlock(&xprt->reserve_lock);
+
+	task->tk_rqstp = rqst;
+	task->tk_status = 0;
+	return;
+
+out_sleep:
+	rpc_sleep_on(&xprt->backlog, task, NULL);
+	spin_unlock(&xprt->reserve_lock);
+	task->tk_status = -EAGAIN;
+}
+
+/**
+ * xprt_rdma_free_slot - release an rpc_rqst
+ * @xprt: controlling RPC transport
+ * @rqst: rpc_rqst to release
+ *
+ */
+static void
+xprt_rdma_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *rqst)
+{
+	memset(rqst, 0, sizeof(*rqst));
+
+	spin_lock(&xprt->reserve_lock);
+	list_add(&rqst->rq_list, &xprt->free);
+	rpc_wake_up_next(&xprt->backlog);
+	spin_unlock(&xprt->reserve_lock);
+}
+
 static bool
 rpcrdma_get_sendbuf(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
 		    size_t size, gfp_t flags)
@@ -780,8 +828,8 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
 static const struct rpc_xprt_ops xprt_rdma_procs = {
 	.reserve_xprt		= xprt_reserve_xprt_cong,
 	.release_xprt		= xprt_release_xprt_cong, /* sunrpc/xprt.c */
-	.alloc_slot		= xprt_alloc_slot,
-	.free_slot		= xprt_free_slot,
+	.alloc_slot		= xprt_rdma_alloc_slot,
+	.free_slot		= xprt_rdma_free_slot,
 	.release_request	= xprt_release_rqst_cong,       /* ditto */
 	.set_retrans_timeout	= xprt_set_retrans_timeout_def, /* ditto */
 	.timer			= xprt_rdma_timer,



* [PATCH v1 08/19] xprtrdma: Make rpc_rqst part of rpcrdma_req
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (6 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 07/19] xprtrdma: Introduce ->alloc_slot call-out for xprtrdma Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 09/19] xprtrdma: Clean up Receive trace points Chuck Lever
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

This simplifies allocation of the generic RPC slot and the
xprtrdma-specific per-RPC resources.

It also makes xprtrdma more like the socket-based transports:
->buf_alloc and ->buf_free are now responsible only for send and
receive buffers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/xprt.h       |    1 
 net/sunrpc/xprtrdma/backchannel.c |   77 +++++++++++++++++--------------------
 net/sunrpc/xprtrdma/transport.c   |   35 ++++-------------
 net/sunrpc/xprtrdma/xprt_rdma.h   |    9 +---
 4 files changed, 46 insertions(+), 76 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 706eef1..336fd1a 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -84,7 +84,6 @@ struct rpc_rqst {
 	void (*rq_release_snd_buf)(struct rpc_rqst *); /* release rq_enc_pages */
 	struct list_head	rq_list;
 
-	void			*rq_xprtdata;	/* Per-xprt private data */
 	void			*rq_buffer;	/* Call XDR encode buffer */
 	size_t			rq_callsize;
 	void			*rq_rbuffer;	/* Reply XDR decode buffer */
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 47ebac9..4034788 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -29,29 +29,41 @@ static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
 	spin_unlock(&buf->rb_reqslock);
 
 	rpcrdma_destroy_req(req);
-
-	kfree(rqst);
 }
 
-static int rpcrdma_bc_setup_rqst(struct rpcrdma_xprt *r_xprt,
-				 struct rpc_rqst *rqst)
+static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
+				 unsigned int count)
 {
-	struct rpcrdma_regbuf *rb;
-	struct rpcrdma_req *req;
-	size_t size;
+	struct rpc_xprt *xprt = &r_xprt->rx_xprt;
+	struct rpc_rqst *rqst;
+	unsigned int i;
+
+	for (i = 0; i < (count << 1); i++) {
+		struct rpcrdma_regbuf *rb;
+		struct rpcrdma_req *req;
+		size_t size;
+
+		req = rpcrdma_create_req(r_xprt);
+		if (IS_ERR(req))
+			return PTR_ERR(req);
+		rqst = &req->rl_slot;
+
+		rqst->rq_xprt = xprt;
+		INIT_LIST_HEAD(&rqst->rq_list);
+		INIT_LIST_HEAD(&rqst->rq_bc_list);
+		__set_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
+		spin_lock_bh(&xprt->bc_pa_lock);
+		list_add(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
+		spin_unlock_bh(&xprt->bc_pa_lock);
 
-	req = rpcrdma_create_req(r_xprt);
-	if (IS_ERR(req))
-		return PTR_ERR(req);
-
-	size = r_xprt->rx_data.inline_rsize;
-	rb = rpcrdma_alloc_regbuf(size, DMA_TO_DEVICE, GFP_KERNEL);
-	if (IS_ERR(rb))
-		goto out_fail;
-	req->rl_sendbuf = rb;
-	xdr_buf_init(&rqst->rq_snd_buf, rb->rg_base,
-		     min_t(size_t, size, PAGE_SIZE));
-	rpcrdma_set_xprtdata(rqst, req);
+		size = r_xprt->rx_data.inline_rsize;
+		rb = rpcrdma_alloc_regbuf(size, DMA_TO_DEVICE, GFP_KERNEL);
+		if (IS_ERR(rb))
+			goto out_fail;
+		req->rl_sendbuf = rb;
+		xdr_buf_init(&rqst->rq_snd_buf, rb->rg_base,
+			     min_t(size_t, size, PAGE_SIZE));
+	}
 	return 0;
 
 out_fail:
@@ -86,9 +98,6 @@ static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
 int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
 {
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
-	struct rpcrdma_buffer *buffer = &r_xprt->rx_buf;
-	struct rpc_rqst *rqst;
-	unsigned int i;
 	int rc;
 
 	/* The backchannel reply path returns each rpc_rqst to the
@@ -103,25 +112,9 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
 	if (reqs > RPCRDMA_BACKWARD_WRS >> 1)
 		goto out_err;
 
-	for (i = 0; i < (reqs << 1); i++) {
-		rqst = kzalloc(sizeof(*rqst), GFP_KERNEL);
-		if (!rqst)
-			goto out_free;
-
-		dprintk("RPC:       %s: new rqst %p\n", __func__, rqst);
-
-		rqst->rq_xprt = &r_xprt->rx_xprt;
-		INIT_LIST_HEAD(&rqst->rq_list);
-		INIT_LIST_HEAD(&rqst->rq_bc_list);
-		__set_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
-
-		if (rpcrdma_bc_setup_rqst(r_xprt, rqst))
-			goto out_free;
-
-		spin_lock_bh(&xprt->bc_pa_lock);
-		list_add(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
-		spin_unlock_bh(&xprt->bc_pa_lock);
-	}
+	rc = rpcrdma_bc_setup_reqs(r_xprt, reqs);
+	if (rc)
+		goto out_free;
 
 	rc = rpcrdma_bc_setup_reps(r_xprt, reqs);
 	if (rc)
@@ -131,7 +124,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
 	if (rc)
 		goto out_free;
 
-	buffer->rb_bc_srv_max_requests = reqs;
+	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
 	request_module("svcrdma");
 	trace_xprtrdma_cb_setup(r_xprt, reqs);
 	return 0;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8f9338e..79885aa 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -331,9 +331,7 @@
 		return ERR_PTR(-EBADF);
 	}
 
-	xprt = xprt_alloc(args->net, sizeof(struct rpcrdma_xprt),
-			xprt_rdma_slot_table_entries,
-			xprt_rdma_slot_table_entries);
+	xprt = xprt_alloc(args->net, sizeof(struct rpcrdma_xprt), 0, 0);
 	if (xprt == NULL) {
 		dprintk("RPC:       %s: couldn't allocate rpcrdma_xprt\n",
 			__func__);
@@ -365,7 +363,7 @@
 		xprt_set_bound(xprt);
 	xprt_rdma_format_addresses(xprt, sap);
 
-	cdata.max_requests = xprt->max_reqs;
+	cdata.max_requests = xprt_rdma_slot_table_entries;
 
 	cdata.rsize = RPCRDMA_MAX_SEGS * PAGE_SIZE; /* RDMA write max */
 	cdata.wsize = RPCRDMA_MAX_SEGS * PAGE_SIZE; /* RDMA read max */
@@ -550,22 +548,18 @@
 static void
 xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
 {
-	struct rpc_rqst *rqst;
+	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+	struct rpcrdma_req *req;
 
-	spin_lock(&xprt->reserve_lock);
-	if (list_empty(&xprt->free))
+	req = rpcrdma_buffer_get(&r_xprt->rx_buf);
+	if (!req)
 		goto out_sleep;
-	rqst = list_first_entry(&xprt->free, struct rpc_rqst, rq_list);
-	list_del(&rqst->rq_list);
-	spin_unlock(&xprt->reserve_lock);
-
-	task->tk_rqstp = rqst;
+	task->tk_rqstp = &req->rl_slot;
 	task->tk_status = 0;
 	return;
 
 out_sleep:
 	rpc_sleep_on(&xprt->backlog, task, NULL);
-	spin_unlock(&xprt->reserve_lock);
 	task->tk_status = -EAGAIN;
 }
 
@@ -579,11 +573,8 @@
 xprt_rdma_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *rqst)
 {
 	memset(rqst, 0, sizeof(*rqst));
-
-	spin_lock(&xprt->reserve_lock);
-	list_add(&rqst->rq_list, &xprt->free);
+	rpcrdma_buffer_put(rpcr_to_rdmar(rqst));
 	rpc_wake_up_next(&xprt->backlog);
-	spin_unlock(&xprt->reserve_lock);
 }
 
 static bool
@@ -656,13 +647,9 @@
 {
 	struct rpc_rqst *rqst = task->tk_rqstp;
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
-	struct rpcrdma_req *req;
+	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
 	gfp_t flags;
 
-	req = rpcrdma_buffer_get(&r_xprt->rx_buf);
-	if (req == NULL)
-		goto out_get;
-
 	flags = RPCRDMA_DEF_GFP;
 	if (RPC_IS_SWAPPER(task))
 		flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
@@ -672,15 +659,12 @@
 	if (!rpcrdma_get_recvbuf(r_xprt, req, rqst->rq_rcvsize, flags))
 		goto out_fail;
 
-	rpcrdma_set_xprtdata(rqst, req);
 	rqst->rq_buffer = req->rl_sendbuf->rg_base;
 	rqst->rq_rbuffer = req->rl_recvbuf->rg_base;
 	trace_xprtrdma_allocate(task, req);
 	return 0;
 
 out_fail:
-	rpcrdma_buffer_put(req);
-out_get:
 	trace_xprtrdma_allocate(task, NULL);
 	return -ENOMEM;
 }
@@ -701,7 +685,6 @@
 	if (test_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags))
 		rpcrdma_release_rqst(r_xprt, req);
 	trace_xprtrdma_rpc_done(task, req);
-	rpcrdma_buffer_put(req);
 }
 
 /**
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index e83ba758..765e4df 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -335,6 +335,7 @@ enum {
 struct rpcrdma_buffer;
 struct rpcrdma_req {
 	struct list_head	rl_list;
+	struct rpc_rqst		rl_slot;
 	struct rpcrdma_buffer	*rl_buffer;
 	struct rpcrdma_rep	*rl_reply;
 	struct xdr_stream	rl_stream;
@@ -357,16 +358,10 @@ enum {
 	RPCRDMA_REQ_F_TX_RESOURCES,
 };
 
-static inline void
-rpcrdma_set_xprtdata(struct rpc_rqst *rqst, struct rpcrdma_req *req)
-{
-	rqst->rq_xprtdata = req;
-}
-
 static inline struct rpcrdma_req *
 rpcr_to_rdmar(const struct rpc_rqst *rqst)
 {
-	return rqst->rq_xprtdata;
+	return container_of(rqst, struct rpcrdma_req, rl_slot);
 }
 
 static inline void



* [PATCH v1 09/19] xprtrdma: Clean up Receive trace points
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (7 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 08/19] xprtrdma: Make rpc_rqst part of rpcrdma_req Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler Chuck Lever
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

For clarity, report the posting and completion of Receive CQEs.

Also, the wc->byte_len field contains garbage if wc->status is
non-zero, and the vendor error field contains garbage if wc->status
is zero. For readability, don't save those fields in those cases.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h |   39 ++++++++++++++++++++-------------------
 net/sunrpc/xprtrdma/verbs.c    |    4 ++--
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index 50ed3f8..99c0049 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -528,24 +528,21 @@
 
 TRACE_EVENT(xprtrdma_post_recv,
 	TP_PROTO(
-		const struct rpcrdma_rep *rep,
-		int status
+		const struct ib_cqe *cqe
 	),
 
-	TP_ARGS(rep, status),
+	TP_ARGS(cqe),
 
 	TP_STRUCT__entry(
-		__field(const void *, rep)
-		__field(int, status)
+		__field(const void *, cqe)
 	),
 
 	TP_fast_assign(
-		__entry->rep = rep;
-		__entry->status = status;
+		__entry->cqe = cqe;
 	),
 
-	TP_printk("rep=%p status=%d",
-		__entry->rep, __entry->status
+	TP_printk("cqe=%p",
+		__entry->cqe
 	)
 );
 
@@ -584,28 +581,32 @@
 
 TRACE_EVENT(xprtrdma_wc_receive,
 	TP_PROTO(
-		const struct rpcrdma_rep *rep,
 		const struct ib_wc *wc
 	),
 
-	TP_ARGS(rep, wc),
+	TP_ARGS(wc),
 
 	TP_STRUCT__entry(
-		__field(const void *, rep)
-		__field(unsigned int, byte_len)
+		__field(const void *, cqe)
+		__field(u32, byte_len)
 		__field(unsigned int, status)
-		__field(unsigned int, vendor_err)
+		__field(u32, vendor_err)
 	),
 
 	TP_fast_assign(
-		__entry->rep = rep;
-		__entry->byte_len = wc->byte_len;
+		__entry->cqe = wc->wr_cqe;
 		__entry->status = wc->status;
-		__entry->vendor_err = __entry->status ? wc->vendor_err : 0;
+		if (wc->status) {
+			__entry->byte_len = 0;
+			__entry->vendor_err = wc->vendor_err;
+		} else {
+			__entry->byte_len = wc->byte_len;
+			__entry->vendor_err = 0;
+		}
 	),
 
-	TP_printk("rep=%p, %u bytes: %s (%u/0x%x)",
-		__entry->rep, __entry->byte_len,
+	TP_printk("cqe=%p %u bytes: %s (%u/0x%x)",
+		__entry->cqe, __entry->byte_len,
 		rdma_show_wc_status(__entry->status),
 		__entry->status, __entry->vendor_err
 	)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 581d0ae..f4ce7af 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -160,7 +160,7 @@
 					       rr_cqe);
 
 	/* WARNING: Only wr_id and status are reliable at this point */
-	trace_xprtrdma_wc_receive(rep, wc);
+	trace_xprtrdma_wc_receive(wc);
 	if (wc->status != IB_WC_SUCCESS)
 		goto out_fail;
 
@@ -1575,7 +1575,7 @@ struct rpcrdma_regbuf *
 	if (!rpcrdma_dma_map_regbuf(ia, rep->rr_rdmabuf))
 		goto out_map;
 	rc = ib_post_recv(ia->ri_id->qp, &rep->rr_recv_wr, &recv_wr_fail);
-	trace_xprtrdma_post_recv(rep, rc);
+	trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
 	if (rc)
 		return -ENOTCONN;
 	return 0;



* [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (8 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 09/19] xprtrdma: Clean up Receive trace points Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-08 19:40   ` Anna Schumaker
  2018-05-04 19:35 ` [PATCH v1 11/19] xprtrdma: Remove rpcrdma_ep_{post_recv, post_extra_recv} Chuck Lever
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Receive completion and Reply handling are done by a BOUND
workqueue, meaning they run on only one CPU.

Posting receives is currently done in the send_request path, which
on large systems typically runs on a different CPU than the one
handling Receive completions. This results in Receive-related
cachelines bouncing between the sending and receiving CPUs.

More importantly, it means that currently Receives are posted while
the transport's write lock is held, which is unnecessary and costly.

Finally, allocation of Receive buffers is performed on-demand in
the Receive completion handler. This helps guarantee that they are
allocated on the same NUMA node as the CPU that handles Receive
completions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h    |   40 +++++++-
 net/sunrpc/xprtrdma/backchannel.c |   32 +------
 net/sunrpc/xprtrdma/rpc_rdma.c    |   22 +----
 net/sunrpc/xprtrdma/transport.c   |    3 -
 net/sunrpc/xprtrdma/verbs.c       |  176 +++++++++++++++++++++----------------
 net/sunrpc/xprtrdma/xprt_rdma.h   |    6 +
 6 files changed, 150 insertions(+), 129 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index 99c0049..ad27e19 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -546,6 +546,39 @@
 	)
 );
 
+TRACE_EVENT(xprtrdma_post_recvs,
+	TP_PROTO(
+		const struct rpcrdma_xprt *r_xprt,
+		unsigned int count,
+		int status
+	),
+
+	TP_ARGS(r_xprt, count, status),
+
+	TP_STRUCT__entry(
+		__field(const void *, r_xprt)
+		__field(unsigned int, count)
+		__field(int, status)
+		__field(int, posted)
+		__string(addr, rpcrdma_addrstr(r_xprt))
+		__string(port, rpcrdma_portstr(r_xprt))
+	),
+
+	TP_fast_assign(
+		__entry->r_xprt = r_xprt;
+		__entry->count = count;
+		__entry->status = status;
+		__entry->posted = r_xprt->rx_buf.rb_posted_receives;
+		__assign_str(addr, rpcrdma_addrstr(r_xprt));
+		__assign_str(port, rpcrdma_portstr(r_xprt));
+	),
+
+	TP_printk("peer=[%s]:%s r_xprt=%p: %u new recvs, %d active (rc %d)",
+		__get_str(addr), __get_str(port), __entry->r_xprt,
+		__entry->count, __entry->posted, __entry->status
+	)
+);
+
 /**
  ** Completion events
  **/
@@ -800,7 +833,6 @@
 		__field(unsigned int, task_id)
 		__field(unsigned int, client_id)
 		__field(const void *, req)
-		__field(const void *, rep)
 		__field(size_t, callsize)
 		__field(size_t, rcvsize)
 	),
@@ -809,15 +841,13 @@
 		__entry->task_id = task->tk_pid;
 		__entry->client_id = task->tk_client->cl_clid;
 		__entry->req = req;
-		__entry->rep = req ? req->rl_reply : NULL;
 		__entry->callsize = task->tk_rqstp->rq_callsize;
 		__entry->rcvsize = task->tk_rqstp->rq_rcvsize;
 	),
 
-	TP_printk("task:%u@%u req=%p rep=%p (%zu, %zu)",
+	TP_printk("task:%u@%u req=%p (%zu, %zu)",
 		__entry->task_id, __entry->client_id,
-		__entry->req, __entry->rep,
-		__entry->callsize, __entry->rcvsize
+		__entry->req, __entry->callsize, __entry->rcvsize
 	)
 );
 
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 4034788..c8f1c2b 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -71,23 +71,6 @@ static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
 	return -ENOMEM;
 }
 
-/* Allocate and add receive buffers to the rpcrdma_buffer's
- * existing list of rep's. These are released when the
- * transport is destroyed.
- */
-static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
-				 unsigned int count)
-{
-	int rc = 0;
-
-	while (count--) {
-		rc = rpcrdma_create_rep(r_xprt);
-		if (rc)
-			break;
-	}
-	return rc;
-}
-
 /**
  * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel requests
  * @xprt: transport associated with these backchannel resources
@@ -116,14 +99,6 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
 	if (rc)
 		goto out_free;
 
-	rc = rpcrdma_bc_setup_reps(r_xprt, reqs);
-	if (rc)
-		goto out_free;
-
-	rc = rpcrdma_ep_post_extra_recv(r_xprt, reqs);
-	if (rc)
-		goto out_free;
-
 	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
 	request_module("svcrdma");
 	trace_xprtrdma_cb_setup(r_xprt, reqs);
@@ -228,6 +203,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
 	if (rc < 0)
 		goto failed_marshal;
 
+	rpcrdma_post_recvs(r_xprt, true);
 	if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
 		goto drop_connection;
 	return 0;
@@ -268,10 +244,14 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
  */
 void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
 {
+	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
 	struct rpc_xprt *xprt = rqst->rq_xprt;
 
 	dprintk("RPC:       %s: freeing rqst %p (req %p)\n",
-		__func__, rqst, rpcr_to_rdmar(rqst));
+		__func__, rqst, req);
+
+	rpcrdma_recv_buffer_put(req->rl_reply);
+	req->rl_reply = NULL;
 
 	spin_lock_bh(&xprt->bc_pa_lock);
 	list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 8f89e3f..d676106 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1027,8 +1027,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 
 out_short:
 	pr_warn("RPC/RDMA short backward direction call\n");
-	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
-		xprt_disconnect_done(&r_xprt->rx_xprt);
 	return true;
 }
 #else	/* CONFIG_SUNRPC_BACKCHANNEL */
@@ -1334,13 +1332,14 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 	u32 credits;
 	__be32 *p;
 
+	--buf->rb_posted_receives;
+
 	if (rep->rr_hdrbuf.head[0].iov_len == 0)
 		goto out_badstatus;
 
+	/* Fixed transport header fields */
 	xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
 			rep->rr_hdrbuf.head[0].iov_base);
-
-	/* Fixed transport header fields */
 	p = xdr_inline_decode(&rep->rr_stream, 4 * sizeof(*p));
 	if (unlikely(!p))
 		goto out_shortreply;
@@ -1379,17 +1378,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 
 	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
 
+	rpcrdma_post_recvs(r_xprt, false);
 	queue_work(rpcrdma_receive_wq, &rep->rr_work);
 	return;
 
-out_badstatus:
-	rpcrdma_recv_buffer_put(rep);
-	if (r_xprt->rx_ep.rep_connected == 1) {
-		r_xprt->rx_ep.rep_connected = -EIO;
-		rpcrdma_conn_func(&r_xprt->rx_ep);
-	}
-	return;
-
 out_badversion:
 	trace_xprtrdma_reply_vers(rep);
 	goto repost;
@@ -1409,7 +1401,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
  * receive buffer before returning.
  */
 repost:
-	r_xprt->rx_stats.bad_reply_count++;
-	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
-		rpcrdma_recv_buffer_put(rep);
+	rpcrdma_post_recvs(r_xprt, false);
+out_badstatus:
+	rpcrdma_recv_buffer_put(rep);
 }
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 79885aa..0c775f0 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -722,9 +722,6 @@
 	if (rc < 0)
 		goto failed_marshal;
 
-	if (req->rl_reply == NULL) 		/* e.g. reconnection */
-		rpcrdma_recv_buffer_get(req);
-
 	/* Must suppress retransmit to maintain credits */
 	if (rqst->rq_connect_cookie == xprt->connect_cookie)
 		goto drop_connection;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f4ce7af..2a38301 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -74,6 +74,7 @@
  */
 static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
 static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
+static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
 static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
 
 struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
@@ -726,7 +727,6 @@
 {
 	struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
 						   rx_ia);
-	unsigned int extras;
 	int rc;
 
 retry:
@@ -770,9 +770,8 @@
 	}
 
 	dprintk("RPC:       %s: connected\n", __func__);
-	extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
-	if (extras)
-		rpcrdma_ep_post_extra_recv(r_xprt, extras);
+
+	rpcrdma_post_recvs(r_xprt, true);
 
 out:
 	if (rc)
@@ -1082,14 +1081,8 @@ struct rpcrdma_req *
 	return req;
 }
 
-/**
- * rpcrdma_create_rep - Allocate an rpcrdma_rep object
- * @r_xprt: controlling transport
- *
- * Returns 0 on success or a negative errno on failure.
- */
-int
-rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
+static int
+rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp)
 {
 	struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
@@ -1117,6 +1110,7 @@ struct rpcrdma_req *
 	rep->rr_recv_wr.wr_cqe = &rep->rr_cqe;
 	rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
 	rep->rr_recv_wr.num_sge = 1;
+	rep->rr_temp = temp;
 
 	spin_lock(&buf->rb_lock);
 	list_add(&rep->rr_list, &buf->rb_recv_bufs);
@@ -1168,12 +1162,8 @@ struct rpcrdma_req *
 		list_add(&req->rl_list, &buf->rb_send_bufs);
 	}
 
+	buf->rb_posted_receives = 0;
 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
-	for (i = 0; i <= buf->rb_max_requests; i++) {
-		rc = rpcrdma_create_rep(r_xprt);
-		if (rc)
-			goto out;
-	}
 
 	rc = rpcrdma_sendctxs_create(r_xprt);
 	if (rc)
@@ -1268,7 +1258,6 @@ struct rpcrdma_req *
 		rep = rpcrdma_buffer_get_rep_locked(buf);
 		rpcrdma_destroy_rep(rep);
 	}
-	buf->rb_send_count = 0;
 
 	spin_lock(&buf->rb_reqslock);
 	while (!list_empty(&buf->rb_allreqs)) {
@@ -1283,7 +1272,6 @@ struct rpcrdma_req *
 		spin_lock(&buf->rb_reqslock);
 	}
 	spin_unlock(&buf->rb_reqslock);
-	buf->rb_recv_count = 0;
 
 	rpcrdma_mrs_destroy(buf);
 }
@@ -1356,27 +1344,11 @@ struct rpcrdma_mr *
 	__rpcrdma_mr_put(&r_xprt->rx_buf, mr);
 }
 
-static struct rpcrdma_rep *
-rpcrdma_buffer_get_rep(struct rpcrdma_buffer *buffers)
-{
-	/* If an RPC previously completed without a reply (say, a
-	 * credential problem or a soft timeout occurs) then hold off
-	 * on supplying more Receive buffers until the number of new
-	 * pending RPCs catches up to the number of posted Receives.
-	 */
-	if (unlikely(buffers->rb_send_count < buffers->rb_recv_count))
-		return NULL;
-
-	if (unlikely(list_empty(&buffers->rb_recv_bufs)))
-		return NULL;
-	buffers->rb_recv_count++;
-	return rpcrdma_buffer_get_rep_locked(buffers);
-}
-
-/*
- * Get a set of request/reply buffers.
+/**
+ * rpcrdma_buffer_get - Get a request buffer
+ * @buffers: Buffer pool from which to obtain a buffer
  *
- * Reply buffer (if available) is attached to send buffer upon return.
+ * Returns a fresh rpcrdma_req, or NULL if none are available.
  */
 struct rpcrdma_req *
 rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
@@ -1384,23 +1356,21 @@ struct rpcrdma_req *
 	struct rpcrdma_req *req;
 
 	spin_lock(&buffers->rb_lock);
-	if (list_empty(&buffers->rb_send_bufs))
-		goto out_reqbuf;
-	buffers->rb_send_count++;
+	if (unlikely(list_empty(&buffers->rb_send_bufs)))
+		goto out_noreqs;
 	req = rpcrdma_buffer_get_req_locked(buffers);
-	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
 	spin_unlock(&buffers->rb_lock);
-
 	return req;
 
-out_reqbuf:
+out_noreqs:
 	spin_unlock(&buffers->rb_lock);
 	return NULL;
 }
 
-/*
- * Put request/reply buffers back into pool.
- * Pre-decrement counter/array index.
+/**
+ * rpcrdma_buffer_put - Put request/reply buffers back into pool
+ * @req: object to return
+ *
  */
 void
 rpcrdma_buffer_put(struct rpcrdma_req *req)
@@ -1411,27 +1381,16 @@ struct rpcrdma_req *
 	req->rl_reply = NULL;
 
 	spin_lock(&buffers->rb_lock);
-	buffers->rb_send_count--;
-	list_add_tail(&req->rl_list, &buffers->rb_send_bufs);
+	list_add(&req->rl_list, &buffers->rb_send_bufs);
 	if (rep) {
-		buffers->rb_recv_count--;
-		list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
+		if (!rep->rr_temp) {
+			list_add(&rep->rr_list, &buffers->rb_recv_bufs);
+			rep = NULL;
+		}
 	}
 	spin_unlock(&buffers->rb_lock);
-}
-
-/*
- * Recover reply buffers from pool.
- * This happens when recovering from disconnect.
- */
-void
-rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
-{
-	struct rpcrdma_buffer *buffers = req->rl_buffer;
-
-	spin_lock(&buffers->rb_lock);
-	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
-	spin_unlock(&buffers->rb_lock);
+	if (rep)
+		rpcrdma_destroy_rep(rep);
 }
 
 /*
@@ -1443,10 +1402,13 @@ struct rpcrdma_req *
 {
 	struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
 
-	spin_lock(&buffers->rb_lock);
-	buffers->rb_recv_count--;
-	list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
-	spin_unlock(&buffers->rb_lock);
+	if (!rep->rr_temp) {
+		spin_lock(&buffers->rb_lock);
+		list_add(&rep->rr_list, &buffers->rb_recv_bufs);
+		spin_unlock(&buffers->rb_lock);
+	} else {
+		rpcrdma_destroy_rep(rep);
+	}
 }
 
 /**
@@ -1542,13 +1504,6 @@ struct rpcrdma_regbuf *
 	struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
 	int rc;
 
-	if (req->rl_reply) {
-		rc = rpcrdma_ep_post_recv(ia, req->rl_reply);
-		if (rc)
-			return rc;
-		req->rl_reply = NULL;
-	}
-
 	if (!ep->rep_send_count ||
 	    test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) {
 		send_wr->send_flags |= IB_SEND_SIGNALED;
@@ -1623,3 +1578,70 @@ struct rpcrdma_regbuf *
 	rpcrdma_recv_buffer_put(rep);
 	return rc;
 }
+
+/**
+ * rpcrdma_post_recvs - Maybe post some Receive buffers
+ * @r_xprt: controlling transport
+ * @temp: when true, allocate temp rpcrdma_rep objects
+ *
+ */
+void
+rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
+{
+	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+	struct ib_recv_wr *wr, *bad_wr;
+	int needed, count, rc;
+
+	needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
+	if (buf->rb_posted_receives > needed)
+		return;
+	needed -= buf->rb_posted_receives;
+
+	count = 0;
+	wr = NULL;
+	while (needed) {
+		struct rpcrdma_regbuf *rb;
+		struct rpcrdma_rep *rep;
+
+		spin_lock(&buf->rb_lock);
+		rep = list_first_entry_or_null(&buf->rb_recv_bufs,
+					       struct rpcrdma_rep, rr_list);
+		if (likely(rep))
+			list_del(&rep->rr_list);
+		spin_unlock(&buf->rb_lock);
+		if (!rep) {
+			if (rpcrdma_create_rep(r_xprt, temp))
+				break;
+			continue;
+		}
+
+		rb = rep->rr_rdmabuf;
+		if (!rpcrdma_regbuf_is_mapped(rb)) {
+			if (!__rpcrdma_dma_map_regbuf(&r_xprt->rx_ia, rb)) {
+				rpcrdma_recv_buffer_put(rep);
+				break;
+			}
+		}
+
+		trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
+		rep->rr_recv_wr.next = wr;
+		wr = &rep->rr_recv_wr;
+		++count;
+		--needed;
+	}
+	if (!count)
+		return;
+
+	rc = ib_post_recv(r_xprt->rx_ia.ri_id->qp, wr, &bad_wr);
+	if (rc) {
+		for (wr = bad_wr; wr; wr = wr->next) {
+			struct rpcrdma_rep *rep;
+
+			rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr);
+			rpcrdma_recv_buffer_put(rep);
+			--count;
+		}
+	}
+	buf->rb_posted_receives += count;
+	trace_xprtrdma_post_recvs(r_xprt, count, rc);
+}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 765e4df..a6d0d6e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -197,6 +197,7 @@ struct rpcrdma_rep {
 	__be32			rr_proc;
 	int			rr_wc_flags;
 	u32			rr_inv_rkey;
+	bool			rr_temp;
 	struct rpcrdma_regbuf	*rr_rdmabuf;
 	struct rpcrdma_xprt	*rr_rxprt;
 	struct work_struct	rr_work;
@@ -397,11 +398,11 @@ struct rpcrdma_buffer {
 	struct rpcrdma_sendctx	**rb_sc_ctxs;
 
 	spinlock_t		rb_lock;	/* protect buf lists */
-	int			rb_send_count, rb_recv_count;
 	struct list_head	rb_send_bufs;
 	struct list_head	rb_recv_bufs;
 	u32			rb_max_requests;
 	u32			rb_credits;	/* most recent credit grant */
+	int			rb_posted_receives;
 
 	u32			rb_bc_srv_max_requests;
 	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
@@ -558,13 +559,13 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
 int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
 				struct rpcrdma_req *);
 int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_rep *);
+void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
 
 /*
  * Buffer calls - xprtrdma/verbs.c
  */
 struct rpcrdma_req *rpcrdma_create_req(struct rpcrdma_xprt *);
 void rpcrdma_destroy_req(struct rpcrdma_req *);
-int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt);
 int rpcrdma_buffer_create(struct rpcrdma_xprt *);
 void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
 struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);
@@ -577,7 +578,6 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
 
 struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
 void rpcrdma_buffer_put(struct rpcrdma_req *);
-void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
 void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
 
 struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,



* [PATCH v1 11/19] xprtrdma: Remove rpcrdma_ep_{post_recv, post_extra_recv}
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (9 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 12/19] xprtrdma: Remove rpcrdma_buffer_get_req_locked() Chuck Lever
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Clean up: These functions are no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h  |    2 -
 net/sunrpc/xprtrdma/verbs.c     |   59 ---------------------------------------
 net/sunrpc/xprtrdma/xprt_rdma.h |    3 --
 3 files changed, 64 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index ad27e19..ac82849 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -879,8 +879,6 @@
 	)
 );
 
-DEFINE_RXPRT_EVENT(xprtrdma_noreps);
-
 /**
  ** Callback events
  **/
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 2a38301..9c35540 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1520,65 +1520,6 @@ struct rpcrdma_regbuf *
 	return 0;
 }
 
-int
-rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
-		     struct rpcrdma_rep *rep)
-{
-	struct ib_recv_wr *recv_wr_fail;
-	int rc;
-
-	if (!rpcrdma_dma_map_regbuf(ia, rep->rr_rdmabuf))
-		goto out_map;
-	rc = ib_post_recv(ia->ri_id->qp, &rep->rr_recv_wr, &recv_wr_fail);
-	trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
-	if (rc)
-		return -ENOTCONN;
-	return 0;
-
-out_map:
-	pr_err("rpcrdma: failed to DMA map the Receive buffer\n");
-	return -EIO;
-}
-
-/**
- * rpcrdma_ep_post_extra_recv - Post buffers for incoming backchannel requests
- * @r_xprt: transport associated with these backchannel resources
- * @count: minimum number of incoming requests expected
- *
- * Returns zero if all requested buffers were posted, or a negative errno.
- */
-int
-rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count)
-{
-	struct rpcrdma_buffer *buffers = &r_xprt->rx_buf;
-	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
-	struct rpcrdma_rep *rep;
-	int rc;
-
-	while (count--) {
-		spin_lock(&buffers->rb_lock);
-		if (list_empty(&buffers->rb_recv_bufs))
-			goto out_reqbuf;
-		rep = rpcrdma_buffer_get_rep_locked(buffers);
-		spin_unlock(&buffers->rb_lock);
-
-		rc = rpcrdma_ep_post_recv(ia, rep);
-		if (rc)
-			goto out_rc;
-	}
-
-	return 0;
-
-out_reqbuf:
-	spin_unlock(&buffers->rb_lock);
-	trace_xprtrdma_noreps(r_xprt);
-	return -ENOMEM;
-
-out_rc:
-	rpcrdma_recv_buffer_put(rep);
-	return rc;
-}
-
 /**
  * rpcrdma_post_recvs - Maybe post some Receive buffers
  * @r_xprt: controlling transport
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index a6d0d6e..507b515 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -558,7 +558,6 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
 
 int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
 				struct rpcrdma_req *);
-int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_rep *);
 void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
 
 /*
@@ -599,8 +598,6 @@ struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,
 	return __rpcrdma_dma_map_regbuf(ia, rb);
 }
 
-int rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *, unsigned int);
-
 int rpcrdma_alloc_wq(void);
 void rpcrdma_destroy_wq(void);
 


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v1 12/19] xprtrdma: Remove rpcrdma_buffer_get_req_locked()
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (10 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 11/19] xprtrdma: Remove rpcrdma_ep_{post_recv, post_extra_recv} Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 13/19] xprtrdma: Remove rpcrdma_buffer_get_rep_locked() Chuck Lever
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Clean up. There is only one call site for this helper, and it can be
simplified by using list_first_entry_or_null().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |   22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 9c35540..3edf5c4 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1175,17 +1175,6 @@ struct rpcrdma_req *
 	return rc;
 }
 
-static struct rpcrdma_req *
-rpcrdma_buffer_get_req_locked(struct rpcrdma_buffer *buf)
-{
-	struct rpcrdma_req *req;
-
-	req = list_first_entry(&buf->rb_send_bufs,
-			       struct rpcrdma_req, rl_list);
-	list_del_init(&req->rl_list);
-	return req;
-}
-
 static struct rpcrdma_rep *
 rpcrdma_buffer_get_rep_locked(struct rpcrdma_buffer *buf)
 {
@@ -1356,15 +1345,12 @@ struct rpcrdma_req *
 	struct rpcrdma_req *req;
 
 	spin_lock(&buffers->rb_lock);
-	if (unlikely(list_empty(&buffers->rb_send_bufs)))
-		goto out_noreqs;
-	req = rpcrdma_buffer_get_req_locked(buffers);
+	req = list_first_entry_or_null(&buffers->rb_send_bufs,
+				       struct rpcrdma_req, rl_list);
+	if (req)
+		list_del_init(&req->rl_list);
 	spin_unlock(&buffers->rb_lock);
 	return req;
-
-out_noreqs:
-	spin_unlock(&buffers->rb_lock);
-	return NULL;
 }
 
 /**



* [PATCH v1 13/19] xprtrdma: Remove rpcrdma_buffer_get_rep_locked()
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (11 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 12/19] xprtrdma: Remove rpcrdma_buffer_get_req_locked() Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 14/19] xprtrdma: Make rpcrdma_sendctx_put_locked() a static function Chuck Lever
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Clean up: There is only one remaining call site for this helper.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |   15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 3edf5c4..3cada42 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1175,17 +1175,6 @@ struct rpcrdma_req *
 	return rc;
 }
 
-static struct rpcrdma_rep *
-rpcrdma_buffer_get_rep_locked(struct rpcrdma_buffer *buf)
-{
-	struct rpcrdma_rep *rep;
-
-	rep = list_first_entry(&buf->rb_recv_bufs,
-			       struct rpcrdma_rep, rr_list);
-	list_del(&rep->rr_list);
-	return rep;
-}
-
 static void
 rpcrdma_destroy_rep(struct rpcrdma_rep *rep)
 {
@@ -1244,7 +1233,9 @@ struct rpcrdma_req *
 	while (!list_empty(&buf->rb_recv_bufs)) {
 		struct rpcrdma_rep *rep;
 
-		rep = rpcrdma_buffer_get_rep_locked(buf);
+		rep = list_first_entry(&buf->rb_recv_bufs,
+				       struct rpcrdma_rep, rr_list);
+		list_del(&rep->rr_list);
 		rpcrdma_destroy_rep(rep);
 	}
 



* [PATCH v1 14/19] xprtrdma: Make rpcrdma_sendctx_put_locked() a static function
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (12 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 13/19] xprtrdma: Remove rpcrdma_buffer_get_rep_locked() Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 15/19] xprtrdma: Return -ENOBUFS when no pages are available Chuck Lever
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Clean up: The only call site is in the same file as the function's
definition.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c     |    4 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h |    1 -
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 3cada42..4ccc9b2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -72,6 +72,7 @@
 /*
  * internal functions
  */
+static void rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc);
 static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
 static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
 static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
@@ -949,7 +950,8 @@ struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf)
  *
  * The caller serializes calls to this function (per rpcrdma_buffer).
  */
-void rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc)
+static void
+rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc)
 {
 	struct rpcrdma_buffer *buf = &sc->sc_xprt->rx_buf;
 	unsigned long next_tail;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 507b515..f22bcdd 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -568,7 +568,6 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
 int rpcrdma_buffer_create(struct rpcrdma_xprt *);
 void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
 struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);
-void rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc);
 
 struct rpcrdma_mr *rpcrdma_mr_get(struct rpcrdma_xprt *r_xprt);
 void rpcrdma_mr_put(struct rpcrdma_mr *mr);



* [PATCH v1 15/19] xprtrdma: Return -ENOBUFS when no pages are available
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (13 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 14/19] xprtrdma: Make rpcrdma_sendctx_put_locked() a static function Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 16/19] xprtrdma: Move common wait_for_buffer_space call to parent function Chuck Lever
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

The use of -EAGAIN in rpcrdma_convert_iovs() is a latent bug: the
transport never calls xprt_write_space() when more pages become
available. -ENOBUFS will trigger the correct "delay briefly and call
again" logic.

Fixes: 7a89f9c626e3 ("xprtrdma: Honor ->send_request API contract")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
---
 net/sunrpc/xprtrdma/rpc_rdma.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d676106..1d78579 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -231,7 +231,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 			 */
 			*ppages = alloc_page(GFP_ATOMIC);
 			if (!*ppages)
-				return -EAGAIN;
+				return -ENOBUFS;
 		}
 		seg->mr_page = *ppages;
 		seg->mr_offset = (char *)page_base;



* [PATCH v1 16/19] xprtrdma: Move common wait_for_buffer_space call to parent function
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (14 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 15/19] xprtrdma: Return -ENOBUFS when no pages are available Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:35 ` [PATCH v1 17/19] xprtrdma: Wait on empty sendctx queue Chuck Lever
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Clean up: The logic to wait for write space is common to several of
the encoding helper functions. Lift it out and put it in the tail
of rpcrdma_marshal_req().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 1d78579..b12b044 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -366,7 +366,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
 						   false, &mr);
 		if (IS_ERR(seg))
-			goto out_maperr;
+			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
 
 		if (encode_read_segment(xdr, mr, pos) < 0)
@@ -378,11 +378,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	} while (nsegs);
 
 	return 0;
-
-out_maperr:
-	if (PTR_ERR(seg) == -EAGAIN)
-		xprt_wait_for_buffer_space(rqst->rq_task, NULL);
-	return PTR_ERR(seg);
 }
 
 /* Register and XDR encode the Write list. Supports encoding a list
@@ -429,7 +424,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
 						   true, &mr);
 		if (IS_ERR(seg))
-			goto out_maperr;
+			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
 
 		if (encode_rdma_segment(xdr, mr) < 0)
@@ -446,11 +441,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	*segcount = cpu_to_be32(nchunks);
 
 	return 0;
-
-out_maperr:
-	if (PTR_ERR(seg) == -EAGAIN)
-		xprt_wait_for_buffer_space(rqst->rq_task, NULL);
-	return PTR_ERR(seg);
 }
 
 /* Register and XDR encode the Reply chunk. Supports encoding an array
@@ -492,7 +482,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
 						   true, &mr);
 		if (IS_ERR(seg))
-			goto out_maperr;
+			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
 
 		if (encode_rdma_segment(xdr, mr) < 0)
@@ -509,11 +499,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	*segcount = cpu_to_be32(nchunks);
 
 	return 0;
-
-out_maperr:
-	if (PTR_ERR(seg) == -EAGAIN)
-		xprt_wait_for_buffer_space(rqst->rq_task, NULL);
-	return PTR_ERR(seg);
 }
 
 /**
@@ -884,7 +869,15 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	return 0;
 
 out_err:
-	r_xprt->rx_stats.failed_marshal_count++;
+	switch (ret) {
+	case -EAGAIN:
+		xprt_wait_for_buffer_space(rqst->rq_task, NULL);
+		break;
+	case -ENOBUFS:
+		break;
+	default:
+		r_xprt->rx_stats.failed_marshal_count++;
+	}
 	return ret;
 }
 



* [PATCH v1 17/19] xprtrdma: Wait on empty sendctx queue
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (15 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 16/19] xprtrdma: Move common wait_for_buffer_space call to parent function Chuck Lever
@ 2018-05-04 19:35 ` Chuck Lever
  2018-05-04 19:36 ` [PATCH v1 18/19] xprtrdma: Add trace_xprtrdma_dma_map(mr) Chuck Lever
  2018-05-04 19:36 ` [PATCH v1 19/19] xprtrdma: Remove transfertypes array Chuck Lever
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:35 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Currently, when the sendctx queue is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait of HZ >> 2 before the marshal and send are retried.

With this change, the transport now places such an RPC task on the
pending queue, and wakes it just as soon as more sendctxs become
available. This typically takes less than a millisecond, and the
write_space waking mechanism is less deadlock-prone.

Moreover, the waiting RPC task is holding the transport's write
lock, which blocks the transport from sending RPCs. Therefore faster
recovery from sendctx queue exhaustion is desirable.

Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
when there are no free MRs").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |    2 +-
 net/sunrpc/xprtrdma/verbs.c     |    8 +++++++-
 net/sunrpc/xprtrdma/xprt_rdma.h |    6 ++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index b12b044..a373d03 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -695,7 +695,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 {
 	req->rl_sendctx = rpcrdma_sendctx_get_locked(&r_xprt->rx_buf);
 	if (!req->rl_sendctx)
-		return -ENOBUFS;
+		return -EAGAIN;
 	req->rl_sendctx->sc_wr.num_sge = 0;
 	req->rl_sendctx->sc_unmap_count = 0;
 	req->rl_sendctx->sc_req = req;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4ccc9b2..042bb24 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -878,6 +878,7 @@ static int rpcrdma_sendctxs_create(struct rpcrdma_xprt *r_xprt)
 		sc->sc_xprt = r_xprt;
 		buf->rb_sc_ctxs[i] = sc;
 	}
+	buf->rb_flags = 0;
 
 	return 0;
 
@@ -935,7 +936,7 @@ struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf)
 	 * completions recently. This is a sign the Send Queue is
 	 * backing up. Cause the caller to pause and try again.
 	 */
-	dprintk("RPC:       %s: empty sendctx queue\n", __func__);
+	set_bit(RPCRDMA_BUF_F_EMPTY_SCQ, &buf->rb_flags);
 	r_xprt = container_of(buf, struct rpcrdma_xprt, rx_buf);
 	r_xprt->rx_stats.empty_sendctx_q++;
 	return NULL;
@@ -970,6 +971,11 @@ struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf)
 
 	/* Paired with READ_ONCE */
 	smp_store_release(&buf->rb_sc_tail, next_tail);
+
+	if (test_and_clear_bit(RPCRDMA_BUF_F_EMPTY_SCQ, &buf->rb_flags)) {
+		smp_mb__after_atomic();
+		xprt_write_space(&sc->sc_xprt->rx_xprt);
+	}
 }
 
 static void
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index f22bcdd..38973a9 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -400,6 +400,7 @@ struct rpcrdma_buffer {
 	spinlock_t		rb_lock;	/* protect buf lists */
 	struct list_head	rb_send_bufs;
 	struct list_head	rb_recv_bufs;
+	unsigned long		rb_flags;
 	u32			rb_max_requests;
 	u32			rb_credits;	/* most recent credit grant */
 	int			rb_posted_receives;
@@ -417,6 +418,11 @@ struct rpcrdma_buffer {
 };
 #define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)
 
+/* rb_flags */
+enum {
+	RPCRDMA_BUF_F_EMPTY_SCQ = 0,
+};
+
 /*
  * Internal structure for transport instance creation. This
  * exists primarily for modularity.



* [PATCH v1 18/19] xprtrdma: Add trace_xprtrdma_dma_map(mr)
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (16 preceding siblings ...)
  2018-05-04 19:35 ` [PATCH v1 17/19] xprtrdma: Wait on empty sendctx queue Chuck Lever
@ 2018-05-04 19:36 ` Chuck Lever
  2018-05-04 19:36 ` [PATCH v1 19/19] xprtrdma: Remove transfertypes array Chuck Lever
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:36 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Matches trace_xprtrdma_dma_unmap(mr).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h |    1 +
 net/sunrpc/xprtrdma/fmr_ops.c  |    1 +
 net/sunrpc/xprtrdma/frwr_ops.c |    1 +
 3 files changed, 3 insertions(+)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index ac82849..c4494a2 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -650,6 +650,7 @@
 DEFINE_FRWR_DONE_EVENT(xprtrdma_wc_li_wake);
 
 DEFINE_MR_EVENT(xprtrdma_localinv);
+DEFINE_MR_EVENT(xprtrdma_dma_map);
 DEFINE_MR_EVENT(xprtrdma_dma_unmap);
 DEFINE_MR_EVENT(xprtrdma_remoteinv);
 DEFINE_MR_EVENT(xprtrdma_recover_mr);
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 0815f9e..58b4726 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -241,6 +241,7 @@ enum {
 				     mr->mr_sg, i, mr->mr_dir);
 	if (!mr->mr_nents)
 		goto out_dmamap_err;
+	trace_xprtrdma_dma_map(mr);
 
 	for (i = 0, dma_pages = mr->fmr.fm_physaddrs; i < mr->mr_nents; i++)
 		dma_pages[i] = sg_dma_address(&mr->mr_sg[i]);
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index cf5095d6..d46dc7e 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -415,6 +415,7 @@
 	mr->mr_nents = ib_dma_map_sg(ia->ri_device, mr->mr_sg, i, mr->mr_dir);
 	if (!mr->mr_nents)
 		goto out_dmamap_err;
+	trace_xprtrdma_dma_map(mr);
 
 	ibmr = frwr->fr_mr;
 	n = ib_map_mr_sg(ibmr, mr->mr_sg, mr->mr_nents, NULL, PAGE_SIZE);



* [PATCH v1 19/19] xprtrdma: Remove transfertypes array
  2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
                   ` (17 preceding siblings ...)
  2018-05-04 19:36 ` [PATCH v1 18/19] xprtrdma: Add trace_xprtrdma_dma_map(mr) Chuck Lever
@ 2018-05-04 19:36 ` Chuck Lever
  18 siblings, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-04 19:36 UTC (permalink / raw)
  To: anna.schumaker; +Cc: linux-rdma, linux-nfs

Clean up: This array was used in a dprintk that was replaced by a
trace point in commit ab03eff58eb5 ("xprtrdma: Add trace points in
RPC Call transmit paths").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c |    8 --------
 1 file changed, 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index a373d03..1c78516 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -55,14 +55,6 @@
 # define RPCDBG_FACILITY	RPCDBG_TRANS
 #endif
 
-static const char transfertypes[][12] = {
-	"inline",	/* no chunks */
-	"read list",	/* some argument via rdma read */
-	"*read list",	/* entire request via rdma read */
-	"write list",	/* some result via rdma write */
-	"reply chunk"	/* entire reply via rdma write */
-};
-
 /* Returns size of largest RPC-over-RDMA header in a Call message
  *
  * The largest Call header contains a full-size Read list and a



* Re: [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source
  2018-05-04 19:34 ` [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source Chuck Lever
@ 2018-05-07 13:27   ` Anna Schumaker
  2018-05-07 14:11     ` Chuck Lever
  0 siblings, 1 reply; 30+ messages in thread
From: Anna Schumaker @ 2018-05-07 13:27 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

Hi Chuck,

On 05/04/2018 03:34 PM, Chuck Lever wrote:
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/sunrpc/rpc_rdma.h |    1 +
>  include/linux/sunrpc/xprtrdma.h |    1 +
>  net/sunrpc/xprtrdma/module.c    |    1 +
>  net/sunrpc/xprtrdma/rpc_rdma.c  |    1 +
>  net/sunrpc/xprtrdma/transport.c |    1 +
>  net/sunrpc/xprtrdma/verbs.c     |    1 +
>  net/sunrpc/xprtrdma/xprt_rdma.h |    1 +
>  7 files changed, 7 insertions(+)
> 
> diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
> index 8f144db..92d182f 100644
> --- a/include/linux/sunrpc/rpc_rdma.h
> +++ b/include/linux/sunrpc/rpc_rdma.h
> @@ -1,3 +1,4 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>  /*
>   * Copyright (c) 2015-2017 Oracle. All rights reserved.
>   * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
> diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
> index 5859563..86fc38f 100644
> --- a/include/linux/sunrpc/xprtrdma.h
> +++ b/include/linux/sunrpc/xprtrdma.h
> @@ -1,3 +1,4 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>  /*
>   * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>   *
> diff --git a/net/sunrpc/xprtrdma/module.c b/net/sunrpc/xprtrdma/module.c
> index a762d19..f338065 100644
> --- a/net/sunrpc/xprtrdma/module.c
> +++ b/net/sunrpc/xprtrdma/module.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause

I'm not familiar with the SPDX-License-Identifier tag.  Is there a reason it has to exist in a separate comment block at the top of the file instead of getting rolled in with the copyright stuff right below it?

Either way, can you use the C-style ("/* ... */") comments here (and in a few other places below) for consistency?

Thanks,
Anna

>  /*
>   * Copyright (c) 2015, 2017 Oracle.  All rights reserved.
>   */
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index e8adad3..8f89e3f 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>  /*
>   * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>   * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index cc1aad3..4717578 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>  /*
>   * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>   * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index c345d36..10f5032 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>  /*
>   * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>   * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index cb41b12..e83ba758 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -1,3 +1,4 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>  /*
>   * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>   * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
> 


* Re: [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source
  2018-05-07 13:27   ` Anna Schumaker
@ 2018-05-07 14:11     ` Chuck Lever
  2018-05-07 14:28       ` Anna Schumaker
  0 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2018-05-07 14:11 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, Linux NFS Mailing List

Hi Anna-

Thanks for the review!


> On May 7, 2018, at 9:27 AM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
> 
> Hi Chuck,
> 
> On 05/04/2018 03:34 PM, Chuck Lever wrote:
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> include/linux/sunrpc/rpc_rdma.h |    1 +
>> include/linux/sunrpc/xprtrdma.h |    1 +
>> net/sunrpc/xprtrdma/module.c    |    1 +
>> net/sunrpc/xprtrdma/rpc_rdma.c  |    1 +
>> net/sunrpc/xprtrdma/transport.c |    1 +
>> net/sunrpc/xprtrdma/verbs.c     |    1 +
>> net/sunrpc/xprtrdma/xprt_rdma.h |    1 +
>> 7 files changed, 7 insertions(+)
>> 
>> diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
>> index 8f144db..92d182f 100644
>> --- a/include/linux/sunrpc/rpc_rdma.h
>> +++ b/include/linux/sunrpc/rpc_rdma.h
>> @@ -1,3 +1,4 @@
>> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>> /*
>>  * Copyright (c) 2015-2017 Oracle. All rights reserved.
>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>> diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
>> index 5859563..86fc38f 100644
>> --- a/include/linux/sunrpc/xprtrdma.h
>> +++ b/include/linux/sunrpc/xprtrdma.h
>> @@ -1,3 +1,4 @@
>> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>> /*
>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>>  *
>> diff --git a/net/sunrpc/xprtrdma/module.c b/net/sunrpc/xprtrdma/module.c
>> index a762d19..f338065 100644
>> --- a/net/sunrpc/xprtrdma/module.c
>> +++ b/net/sunrpc/xprtrdma/module.c
>> @@ -1,3 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> 
> I'm not familiar with the SPDX-License-Identifier tag.  Is there a reason it has to exist in a separate comment block at the top of the file instead of getting rolled in with the copyright stuff right below it?

I believe that each tag is meant to be parsed by a script
to produce a license manifest file for packaging. Thus the
tag is maintained on a separate line using a specific
format.

More information is at https://spdx.org


> Either way, can you use the C-style ("/* ... */") comments here (and in a few other places below) for consistency?

Here is a mechanical survey of familiar kernel source
files that already have an SPDX tag.

[cel@klimt linux]$ grep SPDX net/sunrpc/*
grep: net/sunrpc/auth_gss: Is a directory
net/sunrpc/auth_null.c:// SPDX-License-Identifier: GPL-2.0
net/sunrpc/auth_unix.c:// SPDX-License-Identifier: GPL-2.0
net/sunrpc/debugfs.c:// SPDX-License-Identifier: GPL-2.0
net/sunrpc/Makefile:# SPDX-License-Identifier: GPL-2.0
net/sunrpc/netns.h:/* SPDX-License-Identifier: GPL-2.0 */
net/sunrpc/xprtmultipath.c:// SPDX-License-Identifier: GPL-2.0
grep: net/sunrpc/xprtrdma: Is a directory
net/sunrpc/xprtsock.c:// SPDX-License-Identifier: GPL-2.0
[cel@klimt linux]$ grep SPDX fs/nfs/*
grep: fs/nfs/blocklayout: Is a directory
fs/nfs/cache_lib.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/cache_lib.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/callback.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/callback.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/callback_proc.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/callback_xdr.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/delegation.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/dns_resolve.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/dns_resolve.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/export.c:// SPDX-License-Identifier: GPL-2.0
grep: fs/nfs/filelayout: Is a directory
grep: fs/nfs/flexfilelayout: Is a directory
fs/nfs/internal.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/io.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/iostat.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/Makefile:# SPDX-License-Identifier: GPL-2.0
fs/nfs/mount_clnt.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/netns.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/nfs2xdr.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs3acl.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs3_fs.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/nfs3proc.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs3xdr.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs42.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/nfs42proc.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs42xdr.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs4file.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs4_fs.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/nfs4getroot.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs4namespace.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs4session.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/nfs4sysctl.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs4trace.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfs4trace.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/nfs.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/nfsroot.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfstrace.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/nfstrace.h:/* SPDX-License-Identifier: GPL-2.0 */
fs/nfs/proc.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/symlink.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/sysctl.c:// SPDX-License-Identifier: GPL-2.0
fs/nfs/unlink.c:// SPDX-License-Identifier: GPL-2.0
[cel@klimt linux]$

The tags I've proposed are consistent with other usage:

-> .c files use // ... comments
-> .h files use /* ... */ comments
-> Makefiles use # comments

There were no complaints from checkpatch.pl about the
comment style in my patch.


> Thanks,
> Anna
> 
>> /*
>>  * Copyright (c) 2015, 2017 Oracle.  All rights reserved.
>>  */
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index e8adad3..8f89e3f 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -1,3 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>> /*
>>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>> index cc1aad3..4717578 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -1,3 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>> /*
>>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index c345d36..10f5032 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -1,3 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>> /*
>>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index cb41b12..e83ba758 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -1,3 +1,4 @@
>> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>> /*
>>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source
  2018-05-07 14:11     ` Chuck Lever
@ 2018-05-07 14:28       ` Anna Schumaker
  2018-05-14 20:37         ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Anna Schumaker @ 2018-05-07 14:28 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List



On 05/07/2018 10:11 AM, Chuck Lever wrote:
> Hi Anna-
> 
> Thanks for the review!
> 
> 
>> On May 7, 2018, at 9:27 AM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
>>
>> Hi Chuck,
>>
>> On 05/04/2018 03:34 PM, Chuck Lever wrote:
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>> include/linux/sunrpc/rpc_rdma.h |    1 +
>>> include/linux/sunrpc/xprtrdma.h |    1 +
>>> net/sunrpc/xprtrdma/module.c    |    1 +
>>> net/sunrpc/xprtrdma/rpc_rdma.c  |    1 +
>>> net/sunrpc/xprtrdma/transport.c |    1 +
>>> net/sunrpc/xprtrdma/verbs.c     |    1 +
>>> net/sunrpc/xprtrdma/xprt_rdma.h |    1 +
>>> 7 files changed, 7 insertions(+)
>>>
>>> diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
>>> index 8f144db..92d182f 100644
>>> --- a/include/linux/sunrpc/rpc_rdma.h
>>> +++ b/include/linux/sunrpc/rpc_rdma.h
>>> @@ -1,3 +1,4 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>>> /*
>>>  * Copyright (c) 2015-2017 Oracle. All rights reserved.
>>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>>> diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
>>> index 5859563..86fc38f 100644
>>> --- a/include/linux/sunrpc/xprtrdma.h
>>> +++ b/include/linux/sunrpc/xprtrdma.h
>>> @@ -1,3 +1,4 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>>> /*
>>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>>>  *
>>> diff --git a/net/sunrpc/xprtrdma/module.c b/net/sunrpc/xprtrdma/module.c
>>> index a762d19..f338065 100644
>>> --- a/net/sunrpc/xprtrdma/module.c
>>> +++ b/net/sunrpc/xprtrdma/module.c
>>> @@ -1,3 +1,4 @@
>>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>>
>> I'm not familiar with the SPDX-License-Identifier tag.  Is there a reason it has to exist in a separate comment block at the top of the file instead of getting rolled in with the copyright stuff right below it?
> 
> I believe that each tag is meant to be parsed by a script
> to produce a license manifest file for packaging. Thus the
> tag is maintained on a separate line using a specific
> format.
> 
> More information is at https://spdx.org

I'll take a look there, thanks!
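To make the manifest point above concrete, a rough sketch of how a scanner can pull these one-line tags out of source files and build a license manifest (hypothetical Python for illustration only; the names `spdx_of` and `manifest` are not the actual SPDX tooling):

```python
# Illustrative sketch, not a kernel or SPDX project tool: extract the
# SPDX-License-Identifier tag from each file and build a manifest.
import re

# Matches the tag in //, /* ... */, or # style comment lines.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*([^*/]+?)\s*(?:\*/)?\s*$")

def spdx_of(text):
    """Return the SPDX license expression from the first tagged line."""
    for line in text.splitlines():
        m = SPDX_RE.search(line)
        if m:
            return m.group(1).strip()
    return None

def manifest(files):
    """Map each filename to its declared license expression."""
    return {name: spdx_of(body) for name, body in files.items()}

files = {
    "verbs.c": "// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause\n",
    "xprt_rdma.h": "/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */\n",
    "Makefile": "# SPDX-License-Identifier: GPL-2.0\n",
}
# Each file maps to its declared license expression.
print(manifest(files))
```

Because the tag sits alone on the first (or, for .h files, a dedicated comment) line in a fixed format, a one-pass scanner like this never has to parse the free-form copyright boilerplate below it.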

> 
> 
>> Either way, can you use the C-style ("/* ... */") comments here (and in a few other places below) for consistency?
> 
> Here is a mechanical survey of familiar kernel source
> files that already have an SPDX tag.
> 
> [cel@klimt linux]$ grep SPDX net/sunrpc/*
> grep: net/sunrpc/auth_gss: Is a directory
> net/sunrpc/auth_null.c:// SPDX-License-Identifier: GPL-2.0
> net/sunrpc/auth_unix.c:// SPDX-License-Identifier: GPL-2.0
> net/sunrpc/debugfs.c:// SPDX-License-Identifier: GPL-2.0
> net/sunrpc/Makefile:# SPDX-License-Identifier: GPL-2.0
> net/sunrpc/netns.h:/* SPDX-License-Identifier: GPL-2.0 */
> net/sunrpc/xprtmultipath.c:// SPDX-License-Identifier: GPL-2.0
> grep: net/sunrpc/xprtrdma: Is a directory
> net/sunrpc/xprtsock.c:// SPDX-License-Identifier: GPL-2.0
> [cel@klimt linux]$ grep SPDX fs/nfs/*
> grep: fs/nfs/blocklayout: Is a directory
> fs/nfs/cache_lib.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/cache_lib.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/callback.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/callback.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/callback_proc.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/callback_xdr.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/delegation.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/dns_resolve.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/dns_resolve.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/export.c:// SPDX-License-Identifier: GPL-2.0
> grep: fs/nfs/filelayout: Is a directory
> grep: fs/nfs/flexfilelayout: Is a directory
> fs/nfs/internal.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/io.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/iostat.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/Makefile:# SPDX-License-Identifier: GPL-2.0
> fs/nfs/mount_clnt.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/netns.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/nfs2xdr.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs3acl.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs3_fs.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/nfs3proc.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs3xdr.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs42.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/nfs42proc.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs42xdr.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs4file.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs4_fs.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/nfs4getroot.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs4namespace.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs4session.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/nfs4sysctl.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs4trace.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfs4trace.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/nfs.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/nfsroot.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfstrace.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/nfstrace.h:/* SPDX-License-Identifier: GPL-2.0 */
> fs/nfs/proc.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/symlink.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/sysctl.c:// SPDX-License-Identifier: GPL-2.0
> fs/nfs/unlink.c:// SPDX-License-Identifier: GPL-2.0
> [cel@klimt linux]$
> 
> The tags I've proposed are consistent with other usage:
> 
> -> .c files use // ... comments
> -> .h files use /* ... */ comments
> -> Makefiles use # comments
> 
> There were no complaints from checkpatch.pl about the
> comment style in my patch.

Ah, okay.  It's probably best to go with how everybody else uses it (although I still wonder why they use different styles for .c and .h files).  I'll take your patch the way it is now.

> 
> 
>> Thanks,
>> Anna
>>
>>> /*
>>>  * Copyright (c) 2015, 2017 Oracle.  All rights reserved.
>>>  */
>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> index e8adad3..8f89e3f 100644
>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> @@ -1,3 +1,4 @@
>>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>>> /*
>>>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>> index cc1aad3..4717578 100644
>>> --- a/net/sunrpc/xprtrdma/transport.c
>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>> @@ -1,3 +1,4 @@
>>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>>> /*
>>>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index c345d36..10f5032 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -1,3 +1,4 @@
>>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>>> /*
>>>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index cb41b12..e83ba758 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -1,3 +1,4 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
>>> /*
>>>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>>>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>>>
> 
> --
> Chuck Lever
> 
> 
> 


* Re: [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler
  2018-05-04 19:35 ` [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler Chuck Lever
@ 2018-05-08 19:40   ` Anna Schumaker
  2018-05-08 19:47     ` Chuck Lever
  0 siblings, 1 reply; 30+ messages in thread
From: Anna Schumaker @ 2018-05-08 19:40 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs, Trond Myklebust

Hi Chuck,

On 05/04/2018 03:35 PM, Chuck Lever wrote:
> Receive completion and Reply handling are done by a BOUND
> workqueue, meaning they run on only one CPU.
> 
> Posting receives is currently done in the send_request path, which
> on large systems is typically done on a different CPU than the one
> handling Receive completions. This results in movement of
> Receive-related cachelines between the sending and receiving CPUs.
> 
> More importantly, it means that currently Receives are posted while
> the transport's write lock is held, which is unnecessary and costly.
> 
> Finally, allocation of Receive buffers is performed on-demand in
> the Receive completion handler. This helps guarantee that they are
> allocated on the same NUMA node as the CPU that handles Receive
> completions.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Running this against a 4.17-rc4 server seems to work okay, but running against a 4.16 server fails the cthon special tests with:

  write/read 30 MB file
  verify failed, offset 11272192; expected 79, got
  79 79 79 79 79 79 79 79 79 79
  79 79 79 79 79 79 79 79 79 79
  79 79 79 79 79 79 79 79 79 79
  79 79 79 79 79 79 79 79 79 79
  79 79 79 79 79 79 79 79 79 79
  79 79 79 79 79 79 79 79 79 79

and it goes on for several hundred more lines after this.  How worried do we need to be about somebody running a new client against an old server?

Some of the performance issues I've had in the past seem to have gone away with the 4.17-rc4 code as well.  I'm not sure if that's related to your code or something changing in soft RoCE, but either way I'm much happier :)

Anna
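For orientation before the diff: the accounting the patch's rpcrdma_post_recvs() uses to decide how many Receives to post can be modeled roughly as below (a plain-Python illustration of the arithmetic only, not kernel code; the field names mirror those in the patch, everything else is hypothetical):

```python
# Illustrative model of the Receive replenishment accounting: keep
# roughly one posted Receive per credit, plus two per backchannel
# request, topping the pool up whenever it falls below that target.
class Buffer:
    def __init__(self, credits, bc_max_requests):
        self.rb_credits = credits
        self.rb_bc_srv_max_requests = bc_max_requests
        self.rb_posted_receives = 0

def post_recvs(buf):
    """Return how many new Receives to post right now."""
    needed = buf.rb_credits + (buf.rb_bc_srv_max_requests << 1)
    if buf.rb_posted_receives > needed:
        return 0
    count = needed - buf.rb_posted_receives
    buf.rb_posted_receives += count
    return count

buf = Buffer(credits=32, bc_max_requests=2)
print(post_recvs(buf))       # initially posts 32 + (2 << 1) = 36
buf.rb_posted_receives -= 1  # one Receive completed and was consumed
print(post_recvs(buf))       # tops the pool back up by 1
```

This is why the patch can drop the old rb_send_count/rb_recv_count pairing: the posted-Receive count is compared directly against the credit target each time the handler runs.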

> ---
>  include/trace/events/rpcrdma.h    |   40 +++++++-
>  net/sunrpc/xprtrdma/backchannel.c |   32 +------
>  net/sunrpc/xprtrdma/rpc_rdma.c    |   22 +----
>  net/sunrpc/xprtrdma/transport.c   |    3 -
>  net/sunrpc/xprtrdma/verbs.c       |  176 +++++++++++++++++++++----------------
>  net/sunrpc/xprtrdma/xprt_rdma.h   |    6 +
>  6 files changed, 150 insertions(+), 129 deletions(-)
> 
> diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
> index 99c0049..ad27e19 100644
> --- a/include/trace/events/rpcrdma.h
> +++ b/include/trace/events/rpcrdma.h
> @@ -546,6 +546,39 @@
>  	)
>  );
>  
> +TRACE_EVENT(xprtrdma_post_recvs,
> +	TP_PROTO(
> +		const struct rpcrdma_xprt *r_xprt,
> +		unsigned int count,
> +		int status
> +	),
> +
> +	TP_ARGS(r_xprt, count, status),
> +
> +	TP_STRUCT__entry(
> +		__field(const void *, r_xprt)
> +		__field(unsigned int, count)
> +		__field(int, status)
> +		__field(int, posted)
> +		__string(addr, rpcrdma_addrstr(r_xprt))
> +		__string(port, rpcrdma_portstr(r_xprt))
> +	),
> +
> +	TP_fast_assign(
> +		__entry->r_xprt = r_xprt;
> +		__entry->count = count;
> +		__entry->status = status;
> +		__entry->posted = r_xprt->rx_buf.rb_posted_receives;
> +		__assign_str(addr, rpcrdma_addrstr(r_xprt));
> +		__assign_str(port, rpcrdma_portstr(r_xprt));
> +	),
> +
> +	TP_printk("peer=[%s]:%s r_xprt=%p: %u new recvs, %d active (rc %d)",
> +		__get_str(addr), __get_str(port), __entry->r_xprt,
> +		__entry->count, __entry->posted, __entry->status
> +	)
> +);
> +
>  /**
>   ** Completion events
>   **/
> @@ -800,7 +833,6 @@
>  		__field(unsigned int, task_id)
>  		__field(unsigned int, client_id)
>  		__field(const void *, req)
> -		__field(const void *, rep)
>  		__field(size_t, callsize)
>  		__field(size_t, rcvsize)
>  	),
> @@ -809,15 +841,13 @@
>  		__entry->task_id = task->tk_pid;
>  		__entry->client_id = task->tk_client->cl_clid;
>  		__entry->req = req;
> -		__entry->rep = req ? req->rl_reply : NULL;
>  		__entry->callsize = task->tk_rqstp->rq_callsize;
>  		__entry->rcvsize = task->tk_rqstp->rq_rcvsize;
>  	),
>  
> -	TP_printk("task:%u@%u req=%p rep=%p (%zu, %zu)",
> +	TP_printk("task:%u@%u req=%p (%zu, %zu)",
>  		__entry->task_id, __entry->client_id,
> -		__entry->req, __entry->rep,
> -		__entry->callsize, __entry->rcvsize
> +		__entry->req, __entry->callsize, __entry->rcvsize
>  	)
>  );
>  
> diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
> index 4034788..c8f1c2b 100644
> --- a/net/sunrpc/xprtrdma/backchannel.c
> +++ b/net/sunrpc/xprtrdma/backchannel.c
> @@ -71,23 +71,6 @@ static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
>  	return -ENOMEM;
>  }
>  
> -/* Allocate and add receive buffers to the rpcrdma_buffer's
> - * existing list of rep's. These are released when the
> - * transport is destroyed.
> - */
> -static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
> -				 unsigned int count)
> -{
> -	int rc = 0;
> -
> -	while (count--) {
> -		rc = rpcrdma_create_rep(r_xprt);
> -		if (rc)
> -			break;
> -	}
> -	return rc;
> -}
> -
>  /**
>   * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel requests
>   * @xprt: transport associated with these backchannel resources
> @@ -116,14 +99,6 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
>  	if (rc)
>  		goto out_free;
>  
> -	rc = rpcrdma_bc_setup_reps(r_xprt, reqs);
> -	if (rc)
> -		goto out_free;
> -
> -	rc = rpcrdma_ep_post_extra_recv(r_xprt, reqs);
> -	if (rc)
> -		goto out_free;
> -
>  	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
>  	request_module("svcrdma");
>  	trace_xprtrdma_cb_setup(r_xprt, reqs);
> @@ -228,6 +203,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
>  	if (rc < 0)
>  		goto failed_marshal;
>  
> +	rpcrdma_post_recvs(r_xprt, true);
>  	if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
>  		goto drop_connection;
>  	return 0;
> @@ -268,10 +244,14 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
>   */
>  void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
>  {
> +	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
>  	struct rpc_xprt *xprt = rqst->rq_xprt;
>  
>  	dprintk("RPC:       %s: freeing rqst %p (req %p)\n",
> -		__func__, rqst, rpcr_to_rdmar(rqst));
> +		__func__, rqst, req);
> +
> +	rpcrdma_recv_buffer_put(req->rl_reply);
> +	req->rl_reply = NULL;
>  
>  	spin_lock_bh(&xprt->bc_pa_lock);
>  	list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index 8f89e3f..d676106 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -1027,8 +1027,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>  
>  out_short:
>  	pr_warn("RPC/RDMA short backward direction call\n");
> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
> -		xprt_disconnect_done(&r_xprt->rx_xprt);
>  	return true;
>  }
>  #else	/* CONFIG_SUNRPC_BACKCHANNEL */
> @@ -1334,13 +1332,14 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>  	u32 credits;
>  	__be32 *p;
>  
> +	--buf->rb_posted_receives;
> +
>  	if (rep->rr_hdrbuf.head[0].iov_len == 0)
>  		goto out_badstatus;
>  
> +	/* Fixed transport header fields */
>  	xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
>  			rep->rr_hdrbuf.head[0].iov_base);
> -
> -	/* Fixed transport header fields */
>  	p = xdr_inline_decode(&rep->rr_stream, 4 * sizeof(*p));
>  	if (unlikely(!p))
>  		goto out_shortreply;
> @@ -1379,17 +1378,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>  
>  	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
>  
> +	rpcrdma_post_recvs(r_xprt, false);
>  	queue_work(rpcrdma_receive_wq, &rep->rr_work);
>  	return;
>  
> -out_badstatus:
> -	rpcrdma_recv_buffer_put(rep);
> -	if (r_xprt->rx_ep.rep_connected == 1) {
> -		r_xprt->rx_ep.rep_connected = -EIO;
> -		rpcrdma_conn_func(&r_xprt->rx_ep);
> -	}
> -	return;
> -
>  out_badversion:
>  	trace_xprtrdma_reply_vers(rep);
>  	goto repost;
> @@ -1409,7 +1401,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>   * receive buffer before returning.
>   */
>  repost:
> -	r_xprt->rx_stats.bad_reply_count++;
> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
> -		rpcrdma_recv_buffer_put(rep);
> +	rpcrdma_post_recvs(r_xprt, false);
> +out_badstatus:
> +	rpcrdma_recv_buffer_put(rep);
>  }
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 79885aa..0c775f0 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -722,9 +722,6 @@
>  	if (rc < 0)
>  		goto failed_marshal;
>  
> -	if (req->rl_reply == NULL) 		/* e.g. reconnection */
> -		rpcrdma_recv_buffer_get(req);
> -
>  	/* Must suppress retransmit to maintain credits */
>  	if (rqst->rq_connect_cookie == xprt->connect_cookie)
>  		goto drop_connection;
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index f4ce7af..2a38301 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -74,6 +74,7 @@
>   */
>  static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
>  static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
> +static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
>  static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
>  
>  struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
> @@ -726,7 +727,6 @@
>  {
>  	struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
>  						   rx_ia);
> -	unsigned int extras;
>  	int rc;
>  
>  retry:
> @@ -770,9 +770,8 @@
>  	}
>  
>  	dprintk("RPC:       %s: connected\n", __func__);
> -	extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
> -	if (extras)
> -		rpcrdma_ep_post_extra_recv(r_xprt, extras);
> +
> +	rpcrdma_post_recvs(r_xprt, true);
>  
>  out:
>  	if (rc)
> @@ -1082,14 +1081,8 @@ struct rpcrdma_req *
>  	return req;
>  }
>  
> -/**
> - * rpcrdma_create_rep - Allocate an rpcrdma_rep object
> - * @r_xprt: controlling transport
> - *
> - * Returns 0 on success or a negative errno on failure.
> - */
> -int
> -rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
> +static int
> +rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp)
>  {
>  	struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
>  	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> @@ -1117,6 +1110,7 @@ struct rpcrdma_req *
>  	rep->rr_recv_wr.wr_cqe = &rep->rr_cqe;
>  	rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
>  	rep->rr_recv_wr.num_sge = 1;
> +	rep->rr_temp = temp;
>  
>  	spin_lock(&buf->rb_lock);
>  	list_add(&rep->rr_list, &buf->rb_recv_bufs);
> @@ -1168,12 +1162,8 @@ struct rpcrdma_req *
>  		list_add(&req->rl_list, &buf->rb_send_bufs);
>  	}
>  
> +	buf->rb_posted_receives = 0;
>  	INIT_LIST_HEAD(&buf->rb_recv_bufs);
> -	for (i = 0; i <= buf->rb_max_requests; i++) {
> -		rc = rpcrdma_create_rep(r_xprt);
> -		if (rc)
> -			goto out;
> -	}
>  
>  	rc = rpcrdma_sendctxs_create(r_xprt);
>  	if (rc)
> @@ -1268,7 +1258,6 @@ struct rpcrdma_req *
>  		rep = rpcrdma_buffer_get_rep_locked(buf);
>  		rpcrdma_destroy_rep(rep);
>  	}
> -	buf->rb_send_count = 0;
>  
>  	spin_lock(&buf->rb_reqslock);
>  	while (!list_empty(&buf->rb_allreqs)) {
> @@ -1283,7 +1272,6 @@ struct rpcrdma_req *
>  		spin_lock(&buf->rb_reqslock);
>  	}
>  	spin_unlock(&buf->rb_reqslock);
> -	buf->rb_recv_count = 0;
>  
>  	rpcrdma_mrs_destroy(buf);
>  }
> @@ -1356,27 +1344,11 @@ struct rpcrdma_mr *
>  	__rpcrdma_mr_put(&r_xprt->rx_buf, mr);
>  }
>  
> -static struct rpcrdma_rep *
> -rpcrdma_buffer_get_rep(struct rpcrdma_buffer *buffers)
> -{
> -	/* If an RPC previously completed without a reply (say, a
> -	 * credential problem or a soft timeout occurs) then hold off
> -	 * on supplying more Receive buffers until the number of new
> -	 * pending RPCs catches up to the number of posted Receives.
> -	 */
> -	if (unlikely(buffers->rb_send_count < buffers->rb_recv_count))
> -		return NULL;
> -
> -	if (unlikely(list_empty(&buffers->rb_recv_bufs)))
> -		return NULL;
> -	buffers->rb_recv_count++;
> -	return rpcrdma_buffer_get_rep_locked(buffers);
> -}
> -
> -/*
> - * Get a set of request/reply buffers.
> +/**
> + * rpcrdma_buffer_get - Get a request buffer
> + * @buffers: Buffer pool from which to obtain a buffer
>   *
> - * Reply buffer (if available) is attached to send buffer upon return.
> + * Returns a fresh rpcrdma_req, or NULL if none are available.
>   */
>  struct rpcrdma_req *
>  rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
> @@ -1384,23 +1356,21 @@ struct rpcrdma_req *
>  	struct rpcrdma_req *req;
>  
>  	spin_lock(&buffers->rb_lock);
> -	if (list_empty(&buffers->rb_send_bufs))
> -		goto out_reqbuf;
> -	buffers->rb_send_count++;
> +	if (unlikely(list_empty(&buffers->rb_send_bufs)))
> +		goto out_noreqs;
>  	req = rpcrdma_buffer_get_req_locked(buffers);
> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>  	spin_unlock(&buffers->rb_lock);
> -
>  	return req;
>  
> -out_reqbuf:
> +out_noreqs:
>  	spin_unlock(&buffers->rb_lock);
>  	return NULL;
>  }
>  
> -/*
> - * Put request/reply buffers back into pool.
> - * Pre-decrement counter/array index.
> +/**
> + * rpcrdma_buffer_put - Put request/reply buffers back into pool
> + * @req: object to return
> + *
>   */
>  void
>  rpcrdma_buffer_put(struct rpcrdma_req *req)
> @@ -1411,27 +1381,16 @@ struct rpcrdma_req *
>  	req->rl_reply = NULL;
>  
>  	spin_lock(&buffers->rb_lock);
> -	buffers->rb_send_count--;
> -	list_add_tail(&req->rl_list, &buffers->rb_send_bufs);
> +	list_add(&req->rl_list, &buffers->rb_send_bufs);
>  	if (rep) {
> -		buffers->rb_recv_count--;
> -		list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
> +		if (!rep->rr_temp) {
> +			list_add(&rep->rr_list, &buffers->rb_recv_bufs);
> +			rep = NULL;
> +		}
>  	}
>  	spin_unlock(&buffers->rb_lock);
> -}
> -
> -/*
> - * Recover reply buffers from pool.
> - * This happens when recovering from disconnect.
> - */
> -void
> -rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
> -{
> -	struct rpcrdma_buffer *buffers = req->rl_buffer;
> -
> -	spin_lock(&buffers->rb_lock);
> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
> -	spin_unlock(&buffers->rb_lock);
> +	if (rep)
> +		rpcrdma_destroy_rep(rep);
>  }
>  
>  /*
> @@ -1443,10 +1402,13 @@ struct rpcrdma_req *
>  {
>  	struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
>  
> -	spin_lock(&buffers->rb_lock);
> -	buffers->rb_recv_count--;
> -	list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
> -	spin_unlock(&buffers->rb_lock);
> +	if (!rep->rr_temp) {
> +		spin_lock(&buffers->rb_lock);
> +		list_add(&rep->rr_list, &buffers->rb_recv_bufs);
> +		spin_unlock(&buffers->rb_lock);
> +	} else {
> +		rpcrdma_destroy_rep(rep);
> +	}
>  }
>  
>  /**
> @@ -1542,13 +1504,6 @@ struct rpcrdma_regbuf *
>  	struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
>  	int rc;
>  
> -	if (req->rl_reply) {
> -		rc = rpcrdma_ep_post_recv(ia, req->rl_reply);
> -		if (rc)
> -			return rc;
> -		req->rl_reply = NULL;
> -	}
> -
>  	if (!ep->rep_send_count ||
>  	    test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) {
>  		send_wr->send_flags |= IB_SEND_SIGNALED;
> @@ -1623,3 +1578,70 @@ struct rpcrdma_regbuf *
>  	rpcrdma_recv_buffer_put(rep);
>  	return rc;
>  }
> +
> +/**
> + * rpcrdma_post_recvs - Maybe post some Receive buffers
> + * @r_xprt: controlling transport
> + * @temp: when true, allocate temp rpcrdma_rep objects
> + *
> + */
> +void
> +rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
> +{
> +	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> +	struct ib_recv_wr *wr, *bad_wr;
> +	int needed, count, rc;
> +
> +	needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
> +	if (buf->rb_posted_receives > needed)
> +		return;
> +	needed -= buf->rb_posted_receives;
> +
> +	count = 0;
> +	wr = NULL;
> +	while (needed) {
> +		struct rpcrdma_regbuf *rb;
> +		struct rpcrdma_rep *rep;
> +
> +		spin_lock(&buf->rb_lock);
> +		rep = list_first_entry_or_null(&buf->rb_recv_bufs,
> +					       struct rpcrdma_rep, rr_list);
> +		if (likely(rep))
> +			list_del(&rep->rr_list);
> +		spin_unlock(&buf->rb_lock);
> +		if (!rep) {
> +			if (rpcrdma_create_rep(r_xprt, temp))
> +				break;
> +			continue;
> +		}
> +
> +		rb = rep->rr_rdmabuf;
> +		if (!rpcrdma_regbuf_is_mapped(rb)) {
> +			if (!__rpcrdma_dma_map_regbuf(&r_xprt->rx_ia, rb)) {
> +				rpcrdma_recv_buffer_put(rep);
> +				break;
> +			}
> +		}
> +
> +		trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
> +		rep->rr_recv_wr.next = wr;
> +		wr = &rep->rr_recv_wr;
> +		++count;
> +		--needed;
> +	}
> +	if (!count)
> +		return;
> +
> +	rc = ib_post_recv(r_xprt->rx_ia.ri_id->qp, wr, &bad_wr);
> +	if (rc) {
> +		for (wr = bad_wr; wr; wr = wr->next) {
> +			struct rpcrdma_rep *rep;
> +
> +			rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr);
> +			rpcrdma_recv_buffer_put(rep);
> +			--count;
> +		}
> +	}
> +	buf->rb_posted_receives += count;
> +	trace_xprtrdma_post_recvs(r_xprt, count, rc);
> +}
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 765e4df..a6d0d6e 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -197,6 +197,7 @@ struct rpcrdma_rep {
>  	__be32			rr_proc;
>  	int			rr_wc_flags;
>  	u32			rr_inv_rkey;
> +	bool			rr_temp;
>  	struct rpcrdma_regbuf	*rr_rdmabuf;
>  	struct rpcrdma_xprt	*rr_rxprt;
>  	struct work_struct	rr_work;
> @@ -397,11 +398,11 @@ struct rpcrdma_buffer {
>  	struct rpcrdma_sendctx	**rb_sc_ctxs;
>  
>  	spinlock_t		rb_lock;	/* protect buf lists */
> -	int			rb_send_count, rb_recv_count;
>  	struct list_head	rb_send_bufs;
>  	struct list_head	rb_recv_bufs;
>  	u32			rb_max_requests;
>  	u32			rb_credits;	/* most recent credit grant */
> +	int			rb_posted_receives;
>  
>  	u32			rb_bc_srv_max_requests;
>  	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
> @@ -558,13 +559,13 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
>  int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>  				struct rpcrdma_req *);
>  int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_rep *);
> +void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
>  
>  /*
>   * Buffer calls - xprtrdma/verbs.c
>   */
>  struct rpcrdma_req *rpcrdma_create_req(struct rpcrdma_xprt *);
>  void rpcrdma_destroy_req(struct rpcrdma_req *);
> -int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt);
>  int rpcrdma_buffer_create(struct rpcrdma_xprt *);
>  void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
>  struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);
> @@ -577,7 +578,6 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>  
>  struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
>  void rpcrdma_buffer_put(struct rpcrdma_req *);
> -void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
>  void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
>  
>  struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,
> 


* Re: [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler
  2018-05-08 19:40   ` Anna Schumaker
@ 2018-05-08 19:47     ` Chuck Lever
  2018-05-08 19:52       ` Anna Schumaker
  0 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2018-05-08 19:47 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, Linux NFS Mailing List, Trond Myklebust



> On May 8, 2018, at 3:40 PM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
> 
> Hi Chuck,
> 
> On 05/04/2018 03:35 PM, Chuck Lever wrote:
>> Receive completion and Reply handling are done by a BOUND
>> workqueue, meaning they run on only one CPU.
>> 
>> Posting receives is currently done in the send_request path, which
>> on large systems is typically done on a different CPU than the one
>> handling Receive completions. This results in movement of
>> Receive-related cachelines between the sending and receiving CPUs.
>> 
>> More importantly, it means that currently Receives are posted while
>> the transport's write lock is held, which is unnecessary and costly.
>> 
>> Finally, allocation of Receive buffers is performed on-demand in
>> the Receive completion handler. This helps guarantee that they are
>> allocated on the same NUMA node as the CPU that handles Receive
>> completions.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
> Running this against a 4.17-rc4 server seems to work okay, but running against a 4.16 server fails the cthon special tests with:
> 
>  write/read 30 MB file
>  verify failed, offset 11272192; expected 79, got
>  79 79 79 79 79 79 79 79 79 79
>  79 79 79 79 79 79 79 79 79 79
>  79 79 79 79 79 79 79 79 79 79
>  79 79 79 79 79 79 79 79 79 79
>  79 79 79 79 79 79 79 79 79 79
>  79 79 79 79 79 79 79 79 79 79
> 
> and it goes on for several hundred more lines after this.  How worried do we need to be about somebody running a new client against an old server?

I'm not sure what that result means, I've never seen it
before. But I can't think of a reason there would be an
incompatibility with a v4.16 server. That behavior needs
to be chased down and explained.

Can you bisect this to a particular commit?
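[Archive editor's note: the mechanics of such a search are easily automated with `git bisect run`. The sketch below is purely illustrative — it builds a throwaway repository in which "commit 5" introduces a pretend regression and lets bisect find it. For the real investigation the endpoints would instead be the known-good v4.16 client and the tip of this series, and `check.sh` would be a reproducer that exits non-zero when the cthon verify failure appears.]

```shell
#!/bin/sh
# Toy demonstration of the "git bisect run" workflow. All names and the
# fake regression here are hypothetical stand-ins.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
# Create eight commits; pretend the regression landed in "commit 5".
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > state
    git add state
    git commit -q -m "commit $i"
done
# Stand-in reproducer: exits 0 (good) before the regression, 1 (bad) after.
cat > check.sh <<'EOF'
#!/bin/sh
test "$(cat state)" -lt 5
EOF
chmod +x check.sh
# bad = HEAD, good = root commit; bisect run does the rest.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" >/dev/null
git bisect run ./check.sh >/dev/null 2>&1
git bisect log | tail -n 1 | tee /tmp/xprtrdma-bisect-demo.txt
git bisect reset >/dev/null 2>&1
```

The final line of the bisect log names the first bad commit, which is the answer Chuck is asking for here.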


> Some of the performance issues I've had in the past seem to have gone away with the 4.17-rc4 code as well.  I'm not sure if that's related to your code or something changing in soft roce, but either way I'm much happier :)
> 
> Anna
> 
>> ---
>> include/trace/events/rpcrdma.h    |   40 +++++++-
>> net/sunrpc/xprtrdma/backchannel.c |   32 +------
>> net/sunrpc/xprtrdma/rpc_rdma.c    |   22 +----
>> net/sunrpc/xprtrdma/transport.c   |    3 -
>> net/sunrpc/xprtrdma/verbs.c       |  176 +++++++++++++++++++++----------------
>> net/sunrpc/xprtrdma/xprt_rdma.h   |    6 +
>> 6 files changed, 150 insertions(+), 129 deletions(-)
>>
>> diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
>> index 99c0049..ad27e19 100644
>> --- a/include/trace/events/rpcrdma.h
>> +++ b/include/trace/events/rpcrdma.h
>> @@ -546,6 +546,39 @@
>> 	)
>> );
>>
>> +TRACE_EVENT(xprtrdma_post_recvs,
>> +	TP_PROTO(
>> +		const struct rpcrdma_xprt *r_xprt,
>> +		unsigned int count,
>> +		int status
>> +	),
>> +
>> +	TP_ARGS(r_xprt, count, status),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(const void *, r_xprt)
>> +		__field(unsigned int, count)
>> +		__field(int, status)
>> +		__field(int, posted)
>> +		__string(addr, rpcrdma_addrstr(r_xprt))
>> +		__string(port, rpcrdma_portstr(r_xprt))
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->r_xprt = r_xprt;
>> +		__entry->count = count;
>> +		__entry->status = status;
>> +		__entry->posted = r_xprt->rx_buf.rb_posted_receives;
>> +		__assign_str(addr, rpcrdma_addrstr(r_xprt));
>> +		__assign_str(port, rpcrdma_portstr(r_xprt));
>> +	),
>> +
>> +	TP_printk("peer=[%s]:%s r_xprt=%p: %u new recvs, %d active (rc %d)",
>> +		__get_str(addr), __get_str(port), __entry->r_xprt,
>> +		__entry->count, __entry->posted, __entry->status
>> +	)
>> +);
>> +
>> /**
>>  ** Completion events
>>  **/
>> @@ -800,7 +833,6 @@
>> 		__field(unsigned int, task_id)
>> 		__field(unsigned int, client_id)
>> 		__field(const void *, req)
>> -		__field(const void *, rep)
>> 		__field(size_t, callsize)
>> 		__field(size_t, rcvsize)
>> 	),
>> @@ -809,15 +841,13 @@
>> 		__entry->task_id = task->tk_pid;
>> 		__entry->client_id = task->tk_client->cl_clid;
>> 		__entry->req = req;
>> -		__entry->rep = req ? req->rl_reply : NULL;
>> 		__entry->callsize = task->tk_rqstp->rq_callsize;
>> 		__entry->rcvsize = task->tk_rqstp->rq_rcvsize;
>> 	),
>>
>> -	TP_printk("task:%u@%u req=%p rep=%p (%zu, %zu)",
>> +	TP_printk("task:%u@%u req=%p (%zu, %zu)",
>> 		__entry->task_id, __entry->client_id,
>> -		__entry->req, __entry->rep,
>> -		__entry->callsize, __entry->rcvsize
>> +		__entry->req, __entry->callsize, __entry->rcvsize
>> 	)
>> );
>>
>> diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
>> index 4034788..c8f1c2b 100644
>> --- a/net/sunrpc/xprtrdma/backchannel.c
>> +++ b/net/sunrpc/xprtrdma/backchannel.c
>> @@ -71,23 +71,6 @@ static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
>> 	return -ENOMEM;
>> }
>>
>> -/* Allocate and add receive buffers to the rpcrdma_buffer's
>> - * existing list of rep's. These are released when the
>> - * transport is destroyed.
>> - */
>> -static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
>> -				 unsigned int count)
>> -{
>> -	int rc = 0;
>> -
>> -	while (count--) {
>> -		rc = rpcrdma_create_rep(r_xprt);
>> -		if (rc)
>> -			break;
>> -	}
>> -	return rc;
>> -}
>> -
>> /**
>>  * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel requests
>>  * @xprt: transport associated with these backchannel resources
>> @@ -116,14 +99,6 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
>> 	if (rc)
>> 		goto out_free;
>>
>> -	rc = rpcrdma_bc_setup_reps(r_xprt, reqs);
>> -	if (rc)
>> -		goto out_free;
>> -
>> -	rc = rpcrdma_ep_post_extra_recv(r_xprt, reqs);
>> -	if (rc)
>> -		goto out_free;
>> -
>> 	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
>> 	request_module("svcrdma");
>> 	trace_xprtrdma_cb_setup(r_xprt, reqs);
>> @@ -228,6 +203,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
>> 	if (rc < 0)
>> 		goto failed_marshal;
>>
>> +	rpcrdma_post_recvs(r_xprt, true);
>> 	if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
>> 		goto drop_connection;
>> 	return 0;
>> @@ -268,10 +244,14 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
>>  */
>> void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
>> {
>> +	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
>> 	struct rpc_xprt *xprt = rqst->rq_xprt;
>>
>> 	dprintk("RPC:       %s: freeing rqst %p (req %p)\n",
>> -		__func__, rqst, rpcr_to_rdmar(rqst));
>> +		__func__, rqst, req);
>> +
>> +	rpcrdma_recv_buffer_put(req->rl_reply);
>> +	req->rl_reply = NULL;
>>
>> 	spin_lock_bh(&xprt->bc_pa_lock);
>> 	list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index 8f89e3f..d676106 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -1027,8 +1027,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>>
>> out_short:
>> 	pr_warn("RPC/RDMA short backward direction call\n");
>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>> -		xprt_disconnect_done(&r_xprt->rx_xprt);
>> 	return true;
>> }
>> #else	/* CONFIG_SUNRPC_BACKCHANNEL */
>> @@ -1334,13 +1332,14 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>> 	u32 credits;
>> 	__be32 *p;
>>
>> +	--buf->rb_posted_receives;
>> +
>> 	if (rep->rr_hdrbuf.head[0].iov_len == 0)
>> 		goto out_badstatus;
>>
>> +	/* Fixed transport header fields */
>> 	xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
>> 			rep->rr_hdrbuf.head[0].iov_base);
>> -
>> -	/* Fixed transport header fields */
>> 	p = xdr_inline_decode(&rep->rr_stream, 4 * sizeof(*p));
>> 	if (unlikely(!p))
>> 		goto out_shortreply;
>> @@ -1379,17 +1378,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>
>> 	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
>>
>> +	rpcrdma_post_recvs(r_xprt, false);
>> 	queue_work(rpcrdma_receive_wq, &rep->rr_work);
>> 	return;
>>
>> -out_badstatus:
>> -	rpcrdma_recv_buffer_put(rep);
>> -	if (r_xprt->rx_ep.rep_connected == 1) {
>> -		r_xprt->rx_ep.rep_connected = -EIO;
>> -		rpcrdma_conn_func(&r_xprt->rx_ep);
>> -	}
>> -	return;
>> -
>> out_badversion:
>> 	trace_xprtrdma_reply_vers(rep);
>> 	goto repost;
>> @@ -1409,7 +1401,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>  * receive buffer before returning.
>>  */
>> repost:
>> -	r_xprt->rx_stats.bad_reply_count++;
>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>> -		rpcrdma_recv_buffer_put(rep);
>> +	rpcrdma_post_recvs(r_xprt, false);
>> +out_badstatus:
>> +	rpcrdma_recv_buffer_put(rep);
>> }
>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>> index 79885aa..0c775f0 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -722,9 +722,6 @@
>> 	if (rc < 0)
>> 		goto failed_marshal;
>>
>> -	if (req->rl_reply == NULL) 		/* e.g. reconnection */
>> -		rpcrdma_recv_buffer_get(req);
>> -
>> 	/* Must suppress retransmit to maintain credits */
>> 	if (rqst->rq_connect_cookie == xprt->connect_cookie)
>> 		goto drop_connection;
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index f4ce7af..2a38301 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -74,6 +74,7 @@
>>  */
>> static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
>> static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
>> +static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
>> static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
>>
>> struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
>> @@ -726,7 +727,6 @@
>> {
>> 	struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
>> 						   rx_ia);
>> -	unsigned int extras;
>> 	int rc;
>>
>> retry:
>> @@ -770,9 +770,8 @@
>> 	}
>>
>> 	dprintk("RPC:       %s: connected\n", __func__);
>> -	extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
>> -	if (extras)
>> -		rpcrdma_ep_post_extra_recv(r_xprt, extras);
>> +
>> +	rpcrdma_post_recvs(r_xprt, true);
>>
>> out:
>> 	if (rc)
>> @@ -1082,14 +1081,8 @@ struct rpcrdma_req *
>> 	return req;
>> }
>>
>> -/**
>> - * rpcrdma_create_rep - Allocate an rpcrdma_rep object
>> - * @r_xprt: controlling transport
>> - *
>> - * Returns 0 on success or a negative errno on failure.
>> - */
>> -int
>> -rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>> +static int
>> +rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp)
>> {
>> 	struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
>> 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> @@ -1117,6 +1110,7 @@ struct rpcrdma_req *
>> 	rep->rr_recv_wr.wr_cqe = &rep->rr_cqe;
>> 	rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
>> 	rep->rr_recv_wr.num_sge = 1;
>> +	rep->rr_temp = temp;
>>
>> 	spin_lock(&buf->rb_lock);
>> 	list_add(&rep->rr_list, &buf->rb_recv_bufs);
>> @@ -1168,12 +1162,8 @@ struct rpcrdma_req *
>> 		list_add(&req->rl_list, &buf->rb_send_bufs);
>> 	}
>>
>> +	buf->rb_posted_receives = 0;
>> 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
>> -	for (i = 0; i <= buf->rb_max_requests; i++) {
>> -		rc = rpcrdma_create_rep(r_xprt);
>> -		if (rc)
>> -			goto out;
>> -	}
>>
>> 	rc = rpcrdma_sendctxs_create(r_xprt);
>> 	if (rc)
>> @@ -1268,7 +1258,6 @@ struct rpcrdma_req *
>> 		rep = rpcrdma_buffer_get_rep_locked(buf);
>> 		rpcrdma_destroy_rep(rep);
>> 	}
>> -	buf->rb_send_count = 0;
>>
>> 	spin_lock(&buf->rb_reqslock);
>> 	while (!list_empty(&buf->rb_allreqs)) {
>> @@ -1283,7 +1272,6 @@ struct rpcrdma_req *
>> 		spin_lock(&buf->rb_reqslock);
>> 	}
>> 	spin_unlock(&buf->rb_reqslock);
>> -	buf->rb_recv_count = 0;
>>
>> 	rpcrdma_mrs_destroy(buf);
>> }
>> @@ -1356,27 +1344,11 @@ struct rpcrdma_mr *
>> 	__rpcrdma_mr_put(&r_xprt->rx_buf, mr);
>> }
>>
>> -static struct rpcrdma_rep *
>> -rpcrdma_buffer_get_rep(struct rpcrdma_buffer *buffers)
>> -{
>> -	/* If an RPC previously completed without a reply (say, a
>> -	 * credential problem or a soft timeout occurs) then hold off
>> -	 * on supplying more Receive buffers until the number of new
>> -	 * pending RPCs catches up to the number of posted Receives.
>> -	 */
>> -	if (unlikely(buffers->rb_send_count < buffers->rb_recv_count))
>> -		return NULL;
>> -
>> -	if (unlikely(list_empty(&buffers->rb_recv_bufs)))
>> -		return NULL;
>> -	buffers->rb_recv_count++;
>> -	return rpcrdma_buffer_get_rep_locked(buffers);
>> -}
>> -
>> -/*
>> - * Get a set of request/reply buffers.
>> +/**
>> + * rpcrdma_buffer_get - Get a request buffer
>> + * @buffers: Buffer pool from which to obtain a buffer
>>  *
>> - * Reply buffer (if available) is attached to send buffer upon return.
>> + * Returns a fresh rpcrdma_req, or NULL if none are available.
>>  */
>> struct rpcrdma_req *
>> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>> @@ -1384,23 +1356,21 @@ struct rpcrdma_req *
>> 	struct rpcrdma_req *req;
>>
>> 	spin_lock(&buffers->rb_lock);
>> -	if (list_empty(&buffers->rb_send_bufs))
>> -		goto out_reqbuf;
>> -	buffers->rb_send_count++;
>> +	if (unlikely(list_empty(&buffers->rb_send_bufs)))
>> +		goto out_noreqs;
>> 	req = rpcrdma_buffer_get_req_locked(buffers);
>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>> 	spin_unlock(&buffers->rb_lock);
>> -
>> 	return req;
>>
>> -out_reqbuf:
>> +out_noreqs:
>> 	spin_unlock(&buffers->rb_lock);
>> 	return NULL;
>> }
>>
>> -/*
>> - * Put request/reply buffers back into pool.
>> - * Pre-decrement counter/array index.
>> +/**
>> + * rpcrdma_buffer_put - Put request/reply buffers back into pool
>> + * @req: object to return
>> + *
>>  */
>> void
>> rpcrdma_buffer_put(struct rpcrdma_req *req)
>> @@ -1411,27 +1381,16 @@ struct rpcrdma_req *
>> 	req->rl_reply = NULL;
>>
>> 	spin_lock(&buffers->rb_lock);
>> -	buffers->rb_send_count--;
>> -	list_add_tail(&req->rl_list, &buffers->rb_send_bufs);
>> +	list_add(&req->rl_list, &buffers->rb_send_bufs);
>> 	if (rep) {
>> -		buffers->rb_recv_count--;
>> -		list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>> +		if (!rep->rr_temp) {
>> +			list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>> +			rep = NULL;
>> +		}
>> 	}
>> 	spin_unlock(&buffers->rb_lock);
>> -}
>> -
>> -/*
>> - * Recover reply buffers from pool.
>> - * This happens when recovering from disconnect.
>> - */
>> -void
>> -rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
>> -{
>> -	struct rpcrdma_buffer *buffers = req->rl_buffer;
>> -
>> -	spin_lock(&buffers->rb_lock);
>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>> -	spin_unlock(&buffers->rb_lock);
>> +	if (rep)
>> +		rpcrdma_destroy_rep(rep);
>> }
>>
>> /*
>> @@ -1443,10 +1402,13 @@ struct rpcrdma_req *
>> {
>> 	struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
>>
>> -	spin_lock(&buffers->rb_lock);
>> -	buffers->rb_recv_count--;
>> -	list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>> -	spin_unlock(&buffers->rb_lock);
>> +	if (!rep->rr_temp) {
>> +		spin_lock(&buffers->rb_lock);
>> +		list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>> +		spin_unlock(&buffers->rb_lock);
>> +	} else {
>> +		rpcrdma_destroy_rep(rep);
>> +	}
>> }
>>=20
>> /**
>> @@ -1542,13 +1504,6 @@ struct rpcrdma_regbuf *
>> 	struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
>> 	int rc;
>>
>> -	if (req->rl_reply) {
>> -		rc = rpcrdma_ep_post_recv(ia, req->rl_reply);
>> -		if (rc)
>> -			return rc;
>> -		req->rl_reply = NULL;
>> -	}
>> -
>> 	if (!ep->rep_send_count ||
>> 	    test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) {
>> 		send_wr->send_flags |= IB_SEND_SIGNALED;
>> @@ -1623,3 +1578,70 @@ struct rpcrdma_regbuf *
>> 	rpcrdma_recv_buffer_put(rep);
>> 	return rc;
>> }
>> +
>> +/**
>> + * rpcrdma_post_recvs - Maybe post some Receive buffers
>> + * @r_xprt: controlling transport
>> + * @temp: when true, allocate temp rpcrdma_rep objects
>> + *
>> + */
>> +void
>> +rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
>> +{
>> +	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> +	struct ib_recv_wr *wr, *bad_wr;
>> +	int needed, count, rc;
>> +
>> +	needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
>> +	if (buf->rb_posted_receives > needed)
>> +		return;
>> +	needed -= buf->rb_posted_receives;
>> +
>> +	count = 0;
>> +	wr = NULL;
>> +	while (needed) {
>> +		struct rpcrdma_regbuf *rb;
>> +		struct rpcrdma_rep *rep;
>> +
>> +		spin_lock(&buf->rb_lock);
>> +		rep = list_first_entry_or_null(&buf->rb_recv_bufs,
>> +					       struct rpcrdma_rep, rr_list);
>> +		if (likely(rep))
>> +			list_del(&rep->rr_list);
>> +		spin_unlock(&buf->rb_lock);
>> +		if (!rep) {
>> +			if (rpcrdma_create_rep(r_xprt, temp))
>> +				break;
>> +			continue;
>> +		}
>> +
>> +		rb = rep->rr_rdmabuf;
>> +		if (!rpcrdma_regbuf_is_mapped(rb)) {
>> +			if (!__rpcrdma_dma_map_regbuf(&r_xprt->rx_ia, rb)) {
>> +				rpcrdma_recv_buffer_put(rep);
>> +				break;
>> +			}
>> +		}
>> +
>> +		trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
>> +		rep->rr_recv_wr.next = wr;
>> +		wr = &rep->rr_recv_wr;
>> +		++count;
>> +		--needed;
>> +	}
>> +	if (!count)
>> +		return;
>> +
>> +	rc = ib_post_recv(r_xprt->rx_ia.ri_id->qp, wr, &bad_wr);
>> +	if (rc) {
>> +		for (wr = bad_wr; wr; wr = wr->next) {
>> +			struct rpcrdma_rep *rep;
>> +
>> +			rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr);
>> +			rpcrdma_recv_buffer_put(rep);
>> +			--count;
>> +		}
>> +	}
>> +	buf->rb_posted_receives += count;
>> +	trace_xprtrdma_post_recvs(r_xprt, count, rc);
>> +}
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 765e4df..a6d0d6e 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -197,6 +197,7 @@ struct rpcrdma_rep {
>> 	__be32			rr_proc;
>> 	int			rr_wc_flags;
>> 	u32			rr_inv_rkey;
>> +	bool			rr_temp;
>> 	struct rpcrdma_regbuf	*rr_rdmabuf;
>> 	struct rpcrdma_xprt	*rr_rxprt;
>> 	struct work_struct	rr_work;
>> @@ -397,11 +398,11 @@ struct rpcrdma_buffer {
>> 	struct rpcrdma_sendctx	**rb_sc_ctxs;
>>
>> 	spinlock_t		rb_lock;	/* protect buf lists */
>> -	int			rb_send_count, rb_recv_count;
>> 	struct list_head	rb_send_bufs;
>> 	struct list_head	rb_recv_bufs;
>> 	u32			rb_max_requests;
>> 	u32			rb_credits;	/* most recent credit grant */
>> +	int			rb_posted_receives;
>>
>> 	u32			rb_bc_srv_max_requests;
>> 	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
>> @@ -558,13 +559,13 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
>> int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>> 				struct rpcrdma_req *);
>> int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_rep *);
>> +void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
>>
>> /*
>>  * Buffer calls - xprtrdma/verbs.c
>>  */
>> struct rpcrdma_req *rpcrdma_create_req(struct rpcrdma_xprt *);
>> void rpcrdma_destroy_req(struct rpcrdma_req *);
>> -int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt);
>> int rpcrdma_buffer_create(struct rpcrdma_xprt *);
>> void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
>> struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);
>> @@ -577,7 +578,6 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>>
>> struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
>> void rpcrdma_buffer_put(struct rpcrdma_req *);
>> -void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
>> void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
>>
>> struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler
  2018-05-08 19:47     ` Chuck Lever
@ 2018-05-08 19:52       ` Anna Schumaker
  2018-05-08 19:56         ` Chuck Lever
  2018-05-29 18:23         ` Chuck Lever
  0 siblings, 2 replies; 30+ messages in thread
From: Anna Schumaker @ 2018-05-08 19:52 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List, Trond Myklebust



On 05/08/2018 03:47 PM, Chuck Lever wrote:
> 
> 
>> On May 8, 2018, at 3:40 PM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
>>
>> Hi Chuck,
>>
>> On 05/04/2018 03:35 PM, Chuck Lever wrote:
>>> Receive completion and Reply handling are done by a BOUND
>>> workqueue, meaning they run on only one CPU.
>>>
>>> Posting receives is currently done in the send_request path, which
>>> on large systems is typically done on a different CPU than the one
>>> handling Receive completions. This results in movement of
>>> Receive-related cachelines between the sending and receiving CPUs.
>>>
>>> More importantly, it means that currently Receives are posted while
>>> the transport's write lock is held, which is unnecessary and costly.
>>>
>>> Finally, allocation of Receive buffers is performed on-demand in
>>> the Receive completion handler. This helps guarantee that they are
>>> allocated on the same NUMA node as the CPU that handles Receive
>>> completions.
>>>
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>
>> Running this against a 4.17-rc4 server seems to work okay, but running against a 4.16 server fails the cthon special tests with:
>>
>>  write/read 30 MB file
>>  verify failed, offset 11272192; expected 79, got
>>  79 79 79 79 79 79 79 79 79 79
>>  79 79 79 79 79 79 79 79 79 79
>>  79 79 79 79 79 79 79 79 79 79
>>  79 79 79 79 79 79 79 79 79 79
>>  79 79 79 79 79 79 79 79 79 79
>>  79 79 79 79 79 79 79 79 79 79
>>
>> and it goes on for several hundred more lines after this.  How worried do we need to be about somebody running a new client against an old server?
> 
> I'm not sure what that result means, I've never seen it
> before. But I can't think of a reason there would be an
> incompatibility with a v4.16 server. That behavior needs
> to be chased down and explained.
> 
> Can you bisect this to a particular commit?

Do you mean on the server side?  I don't see the problem on the client until I apply this patch.

> 
> 
>> Some of the performance issues I've had in the past seem to have gone away with the 4.17-rc4 code as well.  I'm not sure if that's related to your code or something changing in soft roce, but either way I'm much happier :)
>>
>> Anna
>>
>>> ---
>>> include/trace/events/rpcrdma.h    |   40 +++++++-
>>> net/sunrpc/xprtrdma/backchannel.c |   32 +------
>>> net/sunrpc/xprtrdma/rpc_rdma.c    |   22 +----
>>> net/sunrpc/xprtrdma/transport.c   |    3 -
>>> net/sunrpc/xprtrdma/verbs.c       |  176 +++++++++++++++++++++----------------
>>> net/sunrpc/xprtrdma/xprt_rdma.h   |    6 +
>>> 6 files changed, 150 insertions(+), 129 deletions(-)
>>>
>>> diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
>>> index 99c0049..ad27e19 100644
>>> --- a/include/trace/events/rpcrdma.h
>>> +++ b/include/trace/events/rpcrdma.h
>>> @@ -546,6 +546,39 @@
>>> 	)
>>> );
>>>
>>> +TRACE_EVENT(xprtrdma_post_recvs,
>>> +	TP_PROTO(
>>> +		const struct rpcrdma_xprt *r_xprt,
>>> +		unsigned int count,
>>> +		int status
>>> +	),
>>> +
>>> +	TP_ARGS(r_xprt, count, status),
>>> +
>>> +	TP_STRUCT__entry(
>>> +		__field(const void *, r_xprt)
>>> +		__field(unsigned int, count)
>>> +		__field(int, status)
>>> +		__field(int, posted)
>>> +		__string(addr, rpcrdma_addrstr(r_xprt))
>>> +		__string(port, rpcrdma_portstr(r_xprt))
>>> +	),
>>> +
>>> +	TP_fast_assign(
>>> +		__entry->r_xprt = r_xprt;
>>> +		__entry->count = count;
>>> +		__entry->status = status;
>>> +		__entry->posted = r_xprt->rx_buf.rb_posted_receives;
>>> +		__assign_str(addr, rpcrdma_addrstr(r_xprt));
>>> +		__assign_str(port, rpcrdma_portstr(r_xprt));
>>> +	),
>>> +
>>> +	TP_printk("peer=[%s]:%s r_xprt=%p: %u new recvs, %d active (rc %d)",
>>> +		__get_str(addr), __get_str(port), __entry->r_xprt,
>>> +		__entry->count, __entry->posted, __entry->status
>>> +	)
>>> +);
>>> +
>>> /**
>>>  ** Completion events
>>>  **/
>>> @@ -800,7 +833,6 @@
>>> 		__field(unsigned int, task_id)
>>> 		__field(unsigned int, client_id)
>>> 		__field(const void *, req)
>>> -		__field(const void *, rep)
>>> 		__field(size_t, callsize)
>>> 		__field(size_t, rcvsize)
>>> 	),
>>> @@ -809,15 +841,13 @@
>>> 		__entry->task_id = task->tk_pid;
>>> 		__entry->client_id = task->tk_client->cl_clid;
>>> 		__entry->req = req;
>>> -		__entry->rep = req ? req->rl_reply : NULL;
>>> 		__entry->callsize = task->tk_rqstp->rq_callsize;
>>> 		__entry->rcvsize = task->tk_rqstp->rq_rcvsize;
>>> 	),
>>>
>>> -	TP_printk("task:%u@%u req=%p rep=%p (%zu, %zu)",
>>> +	TP_printk("task:%u@%u req=%p (%zu, %zu)",
>>> 		__entry->task_id, __entry->client_id,
>>> -		__entry->req, __entry->rep,
>>> -		__entry->callsize, __entry->rcvsize
>>> +		__entry->req, __entry->callsize, __entry->rcvsize
>>> 	)
>>> );
>>>
>>> diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
>>> index 4034788..c8f1c2b 100644
>>> --- a/net/sunrpc/xprtrdma/backchannel.c
>>> +++ b/net/sunrpc/xprtrdma/backchannel.c
>>> @@ -71,23 +71,6 @@ static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
>>> 	return -ENOMEM;
>>> }
>>>
>>> -/* Allocate and add receive buffers to the rpcrdma_buffer's
>>> - * existing list of rep's. These are released when the
>>> - * transport is destroyed.
>>> - */
>>> -static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
>>> -				 unsigned int count)
>>> -{
>>> -	int rc = 0;
>>> -
>>> -	while (count--) {
>>> -		rc = rpcrdma_create_rep(r_xprt);
>>> -		if (rc)
>>> -			break;
>>> -	}
>>> -	return rc;
>>> -}
>>> -
>>> /**
>>>  * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel requests
>>>  * @xprt: transport associated with these backchannel resources
>>> @@ -116,14 +99,6 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
>>> 	if (rc)
>>> 		goto out_free;
>>>
>>> -	rc = rpcrdma_bc_setup_reps(r_xprt, reqs);
>>> -	if (rc)
>>> -		goto out_free;
>>> -
>>> -	rc = rpcrdma_ep_post_extra_recv(r_xprt, reqs);
>>> -	if (rc)
>>> -		goto out_free;
>>> -
>>> 	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
>>> 	request_module("svcrdma");
>>> 	trace_xprtrdma_cb_setup(r_xprt, reqs);
>>> @@ -228,6 +203,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
>>> 	if (rc < 0)
>>> 		goto failed_marshal;
>>>
>>> +	rpcrdma_post_recvs(r_xprt, true);
>>> 	if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
>>> 		goto drop_connection;
>>> 	return 0;
>>> @@ -268,10 +244,14 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
>>>  */
>>> void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
>>> {
>>> +	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
>>> 	struct rpc_xprt *xprt = rqst->rq_xprt;
>>>
>>> 	dprintk("RPC:       %s: freeing rqst %p (req %p)\n",
>>> -		__func__, rqst, rpcr_to_rdmar(rqst));
>>> +		__func__, rqst, req);
>>> +
>>> +	rpcrdma_recv_buffer_put(req->rl_reply);
>>> +	req->rl_reply = NULL;
>>>
>>> 	spin_lock_bh(&xprt->bc_pa_lock);
>>> 	list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> index 8f89e3f..d676106 100644
>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> @@ -1027,8 +1027,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>>>
>>> out_short:
>>> 	pr_warn("RPC/RDMA short backward direction call\n");
>>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>>> -		xprt_disconnect_done(&r_xprt->rx_xprt);
>>> 	return true;
>>> }
>>> #else	/* CONFIG_SUNRPC_BACKCHANNEL */
>>> @@ -1334,13 +1332,14 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>> 	u32 credits;
>>> 	__be32 *p;
>>>
>>> +	--buf->rb_posted_receives;
>>> +
>>> 	if (rep->rr_hdrbuf.head[0].iov_len == 0)
>>> 		goto out_badstatus;
>>>
>>> +	/* Fixed transport header fields */
>>> 	xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
>>> 			rep->rr_hdrbuf.head[0].iov_base);
>>> -
>>> -	/* Fixed transport header fields */
>>> 	p = xdr_inline_decode(&rep->rr_stream, 4 * sizeof(*p));
>>> 	if (unlikely(!p))
>>> 		goto out_shortreply;
>>> @@ -1379,17 +1378,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>
>>> 	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
>>>
>>> +	rpcrdma_post_recvs(r_xprt, false);
>>> 	queue_work(rpcrdma_receive_wq, &rep->rr_work);
>>> 	return;
>>>
>>> -out_badstatus:
>>> -	rpcrdma_recv_buffer_put(rep);
>>> -	if (r_xprt->rx_ep.rep_connected == 1) {
>>> -		r_xprt->rx_ep.rep_connected = -EIO;
>>> -		rpcrdma_conn_func(&r_xprt->rx_ep);
>>> -	}
>>> -	return;
>>> -
>>> out_badversion:
>>> 	trace_xprtrdma_reply_vers(rep);
>>> 	goto repost;
>>> @@ -1409,7 +1401,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>  * receive buffer before returning.
>>>  */
>>> repost:
>>> -	r_xprt->rx_stats.bad_reply_count++;
>>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>>> -		rpcrdma_recv_buffer_put(rep);
>>> +	rpcrdma_post_recvs(r_xprt, false);
>>> +out_badstatus:
>>> +	rpcrdma_recv_buffer_put(rep);
>>> }
>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>> index 79885aa..0c775f0 100644
>>> --- a/net/sunrpc/xprtrdma/transport.c
>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>> @@ -722,9 +722,6 @@
>>> 	if (rc < 0)
>>> 		goto failed_marshal;
>>>
>>> -	if (req->rl_reply == NULL) 		/* e.g. reconnection */
>>> -		rpcrdma_recv_buffer_get(req);
>>> -
>>> 	/* Must suppress retransmit to maintain credits */
>>> 	if (rqst->rq_connect_cookie == xprt->connect_cookie)
>>> 		goto drop_connection;
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index f4ce7af..2a38301 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -74,6 +74,7 @@
>>>  */
>>> static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
>>> static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
>>> +static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
>>> static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
>>>
>>> struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
>>> @@ -726,7 +727,6 @@
>>> {
>>> 	struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
>>> 						   rx_ia);
>>> -	unsigned int extras;
>>> 	int rc;
>>>
>>> retry:
>>> @@ -770,9 +770,8 @@
>>> 	}
>>>
>>> 	dprintk("RPC:       %s: connected\n", __func__);
>>> -	extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
>>> -	if (extras)
>>> -		rpcrdma_ep_post_extra_recv(r_xprt, extras);
>>> +
>>> +	rpcrdma_post_recvs(r_xprt, true);
>>>
>>> out:
>>> 	if (rc)
>>> @@ -1082,14 +1081,8 @@ struct rpcrdma_req *
>>> 	return req;
>>> }
>>>
>>> -/**
>>> - * rpcrdma_create_rep - Allocate an rpcrdma_rep object
>>> - * @r_xprt: controlling transport
>>> - *
>>> - * Returns 0 on success or a negative errno on failure.
>>> - */
>>> -int
>>> -rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>>> +static int
>>> +rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp)
>>> {
>>> 	struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
>>> 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>> @@ -1117,6 +1110,7 @@ struct rpcrdma_req *
>>> 	rep->rr_recv_wr.wr_cqe = &rep->rr_cqe;
>>> 	rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
>>> 	rep->rr_recv_wr.num_sge = 1;
>>> +	rep->rr_temp = temp;
>>>
>>> 	spin_lock(&buf->rb_lock);
>>> 	list_add(&rep->rr_list, &buf->rb_recv_bufs);
>>> @@ -1168,12 +1162,8 @@ struct rpcrdma_req *
>>> 		list_add(&req->rl_list, &buf->rb_send_bufs);
>>> 	}
>>>
>>> +	buf->rb_posted_receives = 0;
>>> 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
>>> -	for (i = 0; i <= buf->rb_max_requests; i++) {
>>> -		rc = rpcrdma_create_rep(r_xprt);
>>> -		if (rc)
>>> -			goto out;
>>> -	}
>>>
>>> 	rc = rpcrdma_sendctxs_create(r_xprt);
>>> 	if (rc)
>>> @@ -1268,7 +1258,6 @@ struct rpcrdma_req *
>>> 		rep = rpcrdma_buffer_get_rep_locked(buf);
>>> 		rpcrdma_destroy_rep(rep);
>>> 	}
>>> -	buf->rb_send_count = 0;
>>>
>>> 	spin_lock(&buf->rb_reqslock);
>>> 	while (!list_empty(&buf->rb_allreqs)) {
>>> @@ -1283,7 +1272,6 @@ struct rpcrdma_req *
>>> 		spin_lock(&buf->rb_reqslock);
>>> 	}
>>> 	spin_unlock(&buf->rb_reqslock);
>>> -	buf->rb_recv_count = 0;
>>>
>>> 	rpcrdma_mrs_destroy(buf);
>>> }
>>> @@ -1356,27 +1344,11 @@ struct rpcrdma_mr *
>>> 	__rpcrdma_mr_put(&r_xprt->rx_buf, mr);
>>> }
>>>
>>> -static struct rpcrdma_rep *
>>> -rpcrdma_buffer_get_rep(struct rpcrdma_buffer *buffers)
>>> -{
>>> -	/* If an RPC previously completed without a reply (say, a
>>> -	 * credential problem or a soft timeout occurs) then hold off
>>> -	 * on supplying more Receive buffers until the number of new
>>> -	 * pending RPCs catches up to the number of posted Receives.
>>> -	 */
>>> -	if (unlikely(buffers->rb_send_count < buffers->rb_recv_count))
>>> -		return NULL;
>>> -
>>> -	if (unlikely(list_empty(&buffers->rb_recv_bufs)))
>>> -		return NULL;
>>> -	buffers->rb_recv_count++;
>>> -	return rpcrdma_buffer_get_rep_locked(buffers);
>>> -}
>>> -
>>> -/*
>>> - * Get a set of request/reply buffers.
>>> +/**
>>> + * rpcrdma_buffer_get - Get a request buffer
>>> + * @buffers: Buffer pool from which to obtain a buffer
>>>  *
>>> - * Reply buffer (if available) is attached to send buffer upon return.
>>> + * Returns a fresh rpcrdma_req, or NULL if none are available.
>>>  */
>>> struct rpcrdma_req *
>>> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>>> @@ -1384,23 +1356,21 @@ struct rpcrdma_req *
>>> 	struct rpcrdma_req *req;
>>>
>>> 	spin_lock(&buffers->rb_lock);
>>> -	if (list_empty(&buffers->rb_send_bufs))
>>> -		goto out_reqbuf;
>>> -	buffers->rb_send_count++;
>>> +	if (unlikely(list_empty(&buffers->rb_send_bufs)))
>>> +		goto out_noreqs;
>>> 	req = rpcrdma_buffer_get_req_locked(buffers);
>>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>>> 	spin_unlock(&buffers->rb_lock);
>>> -
>>> 	return req;
>>>
>>> -out_reqbuf:
>>> +out_noreqs:
>>> 	spin_unlock(&buffers->rb_lock);
>>> 	return NULL;
>>> }
>>>
>>> -/*
>>> - * Put request/reply buffers back into pool.
>>> - * Pre-decrement counter/array index.
>>> +/**
>>> + * rpcrdma_buffer_put - Put request/reply buffers back into pool
>>> + * @req: object to return
>>> + *
>>>  */
>>> void
>>> rpcrdma_buffer_put(struct rpcrdma_req *req)
>>> @@ -1411,27 +1381,16 @@ struct rpcrdma_req *
>>> 	req->rl_reply = NULL;
>>>
>>> 	spin_lock(&buffers->rb_lock);
>>> -	buffers->rb_send_count--;
>>> -	list_add_tail(&req->rl_list, &buffers->rb_send_bufs);
>>> +	list_add(&req->rl_list, &buffers->rb_send_bufs);
>>> 	if (rep) {
>>> -		buffers->rb_recv_count--;
>>> -		list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>>> +		if (!rep->rr_temp) {
>>> +			list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>>> +			rep = NULL;
>>> +		}
>>> 	}
>>> 	spin_unlock(&buffers->rb_lock);
>>> -}
>>> -
>>> -/*
>>> - * Recover reply buffers from pool.
>>> - * This happens when recovering from disconnect.
>>> - */
>>> -void
>>> -rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
>>> -{
>>> -	struct rpcrdma_buffer *buffers = req->rl_buffer;
>>> -
>>> -	spin_lock(&buffers->rb_lock);
>>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>>> -	spin_unlock(&buffers->rb_lock);
>>> +	if (rep)
>>> +		rpcrdma_destroy_rep(rep);
>>> }
>>>
>>> /*
>>> @@ -1443,10 +1402,13 @@ struct rpcrdma_req *
>>> {
>>> 	struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
>>>
>>> -	spin_lock(&buffers->rb_lock);
>>> -	buffers->rb_recv_count--;
>>> -	list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>>> -	spin_unlock(&buffers->rb_lock);
>>> +	if (!rep->rr_temp) {
>>> +		spin_lock(&buffers->rb_lock);
>>> +		list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>>> +		spin_unlock(&buffers->rb_lock);
>>> +	} else {
>>> +		rpcrdma_destroy_rep(rep);
>>> +	}
>>> }
>>>
>>> /**
>>> @@ -1542,13 +1504,6 @@ struct rpcrdma_regbuf *
>>> 	struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
>>> 	int rc;
>>>
>>> -	if (req->rl_reply) {
>>> -		rc = rpcrdma_ep_post_recv(ia, req->rl_reply);
>>> -		if (rc)
>>> -			return rc;
>>> -		req->rl_reply = NULL;
>>> -	}
>>> -
>>> 	if (!ep->rep_send_count ||
>>> 	    test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) {
>>> 		send_wr->send_flags |= IB_SEND_SIGNALED;
>>> @@ -1623,3 +1578,70 @@ struct rpcrdma_regbuf *
>>> 	rpcrdma_recv_buffer_put(rep);
>>> 	return rc;
>>> }
>>> +
>>> +/**
>>> + * rpcrdma_post_recvs - Maybe post some Receive buffers
>>> + * @r_xprt: controlling transport
>>> + * @temp: when true, allocate temp rpcrdma_rep objects
>>> + *
>>> + */
>>> +void
>>> +rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
>>> +{
>>> +	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>> +	struct ib_recv_wr *wr, *bad_wr;
>>> +	int needed, count, rc;
>>> +
>>> +	needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
>>> +	if (buf->rb_posted_receives > needed)
>>> +		return;
>>> +	needed -= buf->rb_posted_receives;
>>> +
>>> +	count = 0;
>>> +	wr = NULL;
>>> +	while (needed) {
>>> +		struct rpcrdma_regbuf *rb;
>>> +		struct rpcrdma_rep *rep;
>>> +
>>> +		spin_lock(&buf->rb_lock);
>>> +		rep = list_first_entry_or_null(&buf->rb_recv_bufs,
>>> +					       struct rpcrdma_rep, rr_list);
>>> +		if (likely(rep))
>>> +			list_del(&rep->rr_list);
>>> +		spin_unlock(&buf->rb_lock);
>>> +		if (!rep) {
>>> +			if (rpcrdma_create_rep(r_xprt, temp))
>>> +				break;
>>> +			continue;
>>> +		}
>>> +
>>> +		rb = rep->rr_rdmabuf;
>>> +		if (!rpcrdma_regbuf_is_mapped(rb)) {
>>> +			if (!__rpcrdma_dma_map_regbuf(&r_xprt->rx_ia, rb)) {
>>> +				rpcrdma_recv_buffer_put(rep);
>>> +				break;
>>> +			}
>>> +		}
>>> +
>>> +		trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
>>> +		rep->rr_recv_wr.next = wr;
>>> +		wr = &rep->rr_recv_wr;
>>> +		++count;
>>> +		--needed;
>>> +	}
>>> +	if (!count)
>>> +		return;
>>> +
>>> +	rc = ib_post_recv(r_xprt->rx_ia.ri_id->qp, wr, &bad_wr);
>>> +	if (rc) {
>>> +		for (wr = bad_wr; wr; wr = wr->next) {
>>> +			struct rpcrdma_rep *rep;
>>> +
>>> +			rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr);
>>> +			rpcrdma_recv_buffer_put(rep);
>>> +			--count;
>>> +		}
>>> +	}
>>> +	buf->rb_posted_receives += count;
>>> +	trace_xprtrdma_post_recvs(r_xprt, count, rc);
>>> +}
>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> index 765e4df..a6d0d6e 100644
>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>> @@ -197,6 +197,7 @@ struct rpcrdma_rep {
>>> 	__be32			rr_proc;
>>> 	int			rr_wc_flags;
>>> 	u32			rr_inv_rkey;
>>> +	bool			rr_temp;
>>> 	struct rpcrdma_regbuf	*rr_rdmabuf;
>>> 	struct rpcrdma_xprt	*rr_rxprt;
>>> 	struct work_struct	rr_work;
>>> @@ -397,11 +398,11 @@ struct rpcrdma_buffer {
>>> 	struct rpcrdma_sendctx	**rb_sc_ctxs;
>>>
>>> 	spinlock_t		rb_lock;	/* protect buf lists */
>>> -	int			rb_send_count, rb_recv_count;
>>> 	struct list_head	rb_send_bufs;
>>> 	struct list_head	rb_recv_bufs;
>>> 	u32			rb_max_requests;
>>> 	u32			rb_credits;	/* most recent credit grant */
>>> +	int			rb_posted_receives;
>>>
>>> 	u32			rb_bc_srv_max_requests;
>>> 	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
>>> @@ -558,13 +559,13 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
>>> int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>>> 				struct rpcrdma_req *);
>>> int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_rep *);
>>> +void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
>>>
>>> /*
>>>  * Buffer calls - xprtrdma/verbs.c
>>>  */
>>> struct rpcrdma_req *rpcrdma_create_req(struct rpcrdma_xprt *);
>>> void rpcrdma_destroy_req(struct rpcrdma_req *);
>>> -int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt);
>>> int rpcrdma_buffer_create(struct rpcrdma_xprt *);
>>> void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
>>> struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);
>>> @@ -577,7 +578,6 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>>>
>>> struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
>>> void rpcrdma_buffer_put(struct rpcrdma_req *);
>>> -void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
>>> void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
>>>
>>> struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Chuck Lever
> 
> 
> 


* Re: [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler
  2018-05-08 19:52       ` Anna Schumaker
@ 2018-05-08 19:56         ` Chuck Lever
  2018-05-29 18:23         ` Chuck Lever
  1 sibling, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2018-05-08 19:56 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, Linux NFS Mailing List, Trond Myklebust



> On May 8, 2018, at 3:52 PM, Anna Schumaker <Anna.Schumaker@Netapp.com> wrote:
>
>
>=20
> On 05/08/2018 03:47 PM, Chuck Lever wrote:
>>
>>
>>> On May 8, 2018, at 3:40 PM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
>>>
>>> Hi Chuck,
>>>
>>> On 05/04/2018 03:35 PM, Chuck Lever wrote:
>>>> Receive completion and Reply handling are done by a BOUND
>>>> workqueue, meaning they run on only one CPU.
>>>>
>>>> Posting receives is currently done in the send_request path, which
>>>> on large systems is typically done on a different CPU than the one
>>>> handling Receive completions. This results in movement of
>>>> Receive-related cachelines between the sending and receiving CPUs.
>>>>=20
>>>> More importantly, it means that currently Receives are posted while
>>>> the transport's write lock is held, which is unnecessary and costly.
>>>>
>>>> Finally, allocation of Receive buffers is performed on-demand in
>>>> the Receive completion handler. This helps guarantee that they are
>>>> allocated on the same NUMA node as the CPU that handles Receive
>>>> completions.
>>>>
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> Running this against a 4.17-rc4 server seems to work okay, but running against a 4.16 server fails the cthon special tests with:
>>>
>>> write/read 30 MB file
>>> verify failed, offset 11272192; expected 79, got
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>>
>>> and it goes on for several hundred more lines after this.  How worried do we need to be about somebody running a new client against an old server?
>>
>> I'm not sure what that result means, I've never seen it
>> before. But I can't think of a reason there would be an
>> incompatibility with a v4.16 server. That behavior needs
>> to be chased down and explained.
>>
>> Can you bisect this to a particular commit?
>
> Do you mean on the server side?  I don't see the problem on the client until I apply this patch

Sure, why not try bisecting the server. I'll see if I can
reproduce the behavior here with InfiniBand.


>>> Some of the performance issues I've had in the past seem to have gone away with the 4.17-rc4 code as well.  I'm not sure if that's related to your code or something changing in soft roce, but either way I'm much happier :)
>>>
>>> Anna
>>>
>>>> ---
>>>> include/trace/events/rpcrdma.h    |   40 +++++++-
>>>> net/sunrpc/xprtrdma/backchannel.c |   32 +------
>>>> net/sunrpc/xprtrdma/rpc_rdma.c    |   22 +----
>>>> net/sunrpc/xprtrdma/transport.c   |    3 -
>>>> net/sunrpc/xprtrdma/verbs.c       |  176 +++++++++++++++++++++----------------
>>>> net/sunrpc/xprtrdma/xprt_rdma.h   |    6 +
>>>> 6 files changed, 150 insertions(+), 129 deletions(-)
>>>>
>>>> diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
>>>> index 99c0049..ad27e19 100644
>>>> --- a/include/trace/events/rpcrdma.h
>>>> +++ b/include/trace/events/rpcrdma.h
>>>> @@ -546,6 +546,39 @@
>>>> 	)
>>>> );
>>>>=20
>>>> +TRACE_EVENT(xprtrdma_post_recvs,
>>>> +	TP_PROTO(
>>>> +		const struct rpcrdma_xprt *r_xprt,
>>>> +		unsigned int count,
>>>> +		int status
>>>> +	),
>>>> +
>>>> +	TP_ARGS(r_xprt, count, status),
>>>> +
>>>> +	TP_STRUCT__entry(
>>>> +		__field(const void *, r_xprt)
>>>> +		__field(unsigned int, count)
>>>> +		__field(int, status)
>>>> +		__field(int, posted)
>>>> +		__string(addr, rpcrdma_addrstr(r_xprt))
>>>> +		__string(port, rpcrdma_portstr(r_xprt))
>>>> +	),
>>>> +
>>>> +	TP_fast_assign(
>>>> +		__entry->r_xprt = r_xprt;
>>>> +		__entry->count = count;
>>>> +		__entry->status = status;
>>>> +		__entry->posted = r_xprt->rx_buf.rb_posted_receives;
>>>> +		__assign_str(addr, rpcrdma_addrstr(r_xprt));
>>>> +		__assign_str(port, rpcrdma_portstr(r_xprt));
>>>> +	),
>>>> +
>>>> +	TP_printk("peer=[%s]:%s r_xprt=%p: %u new recvs, %d active (rc %d)",
>>>> +		__get_str(addr), __get_str(port), __entry->r_xprt,
>>>> +		__entry->count, __entry->posted, __entry->status
>>>> +	)
>>>> +);
>>>> +
>>>> /**
>>>> ** Completion events
>>>> **/
>>>> @@ -800,7 +833,6 @@
>>>> 		__field(unsigned int, task_id)
>>>> 		__field(unsigned int, client_id)
>>>> 		__field(const void *, req)
>>>> -		__field(const void *, rep)
>>>> 		__field(size_t, callsize)
>>>> 		__field(size_t, rcvsize)
>>>> 	),
>>>> @@ -809,15 +841,13 @@
>>>> 		__entry->task_id = task->tk_pid;
>>>> 		__entry->client_id = task->tk_client->cl_clid;
>>>> 		__entry->req = req;
>>>> -		__entry->rep = req ? req->rl_reply : NULL;
>>>> 		__entry->callsize = task->tk_rqstp->rq_callsize;
>>>> 		__entry->rcvsize = task->tk_rqstp->rq_rcvsize;
>>>> 	),
>>>>
>>>> -	TP_printk("task:%u@%u req=%p rep=%p (%zu, %zu)",
>>>> +	TP_printk("task:%u@%u req=%p (%zu, %zu)",
>>>> 		__entry->task_id, __entry->client_id,
>>>> -		__entry->req, __entry->rep,
>>>> -		__entry->callsize, __entry->rcvsize
>>>> +		__entry->req, __entry->callsize, __entry->rcvsize
>>>> 	)
>>>> );
>>>>
>>>> diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
>>>> index 4034788..c8f1c2b 100644
>>>> --- a/net/sunrpc/xprtrdma/backchannel.c
>>>> +++ b/net/sunrpc/xprtrdma/backchannel.c
>>>> @@ -71,23 +71,6 @@ static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
>>>> 	return -ENOMEM;
>>>> }
>>>>
>>>> -/* Allocate and add receive buffers to the rpcrdma_buffer's
>>>> - * existing list of rep's. These are released when the
>>>> - * transport is destroyed.
>>>> - */
>>>> -static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
>>>> -				 unsigned int count)
>>>> -{
>>>> -	int rc = 0;
>>>> -
>>>> -	while (count--) {
>>>> -		rc = rpcrdma_create_rep(r_xprt);
>>>> -		if (rc)
>>>> -			break;
>>>> -	}
>>>> -	return rc;
>>>> -}
>>>> -
>>>> /**
>>>> * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel requests
>>>> * @xprt: transport associated with these backchannel resources
>>>> @@ -116,14 +99,6 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
>>>> 	if (rc)
>>>> 		goto out_free;
>>>>
>>>> -	rc = rpcrdma_bc_setup_reps(r_xprt, reqs);
>>>> -	if (rc)
>>>> -		goto out_free;
>>>> -
>>>> -	rc = rpcrdma_ep_post_extra_recv(r_xprt, reqs);
>>>> -	if (rc)
>>>> -		goto out_free;
>>>> -
>>>> 	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
>>>> 	request_module("svcrdma");
>>>> 	trace_xprtrdma_cb_setup(r_xprt, reqs);
>>>> @@ -228,6 +203,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
>>>> 	if (rc < 0)
>>>> 		goto failed_marshal;
>>>>
>>>> +	rpcrdma_post_recvs(r_xprt, true);
>>>> 	if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
>>>> 		goto drop_connection;
>>>> 	return 0;
>>>> @@ -268,10 +244,14 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
>>>> */
>>>> void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
>>>> {
>>>> +	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
>>>> 	struct rpc_xprt *xprt = rqst->rq_xprt;
>>>>
>>>> 	dprintk("RPC:       %s: freeing rqst %p (req %p)\n",
>>>> -		__func__, rqst, rpcr_to_rdmar(rqst));
>>>> +		__func__, rqst, req);
>>>> +
>>>> +	rpcrdma_recv_buffer_put(req->rl_reply);
>>>> +	req->rl_reply = NULL;
>>>>=20
>>>> 	spin_lock_bh(&xprt->bc_pa_lock);
>>>> 	list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> index 8f89e3f..d676106 100644
>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> @@ -1027,8 +1027,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>>>>
>>>> out_short:
>>>> 	pr_warn("RPC/RDMA short backward direction call\n");
>>>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>>>> -		xprt_disconnect_done(&r_xprt->rx_xprt);
>>>> 	return true;
>>>> }
>>>> #else	/* CONFIG_SUNRPC_BACKCHANNEL */
>>>> @@ -1334,13 +1332,14 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>> 	u32 credits;
>>>> 	__be32 *p;
>>>>
>>>> +	--buf->rb_posted_receives;
>>>> +
>>>> 	if (rep->rr_hdrbuf.head[0].iov_len == 0)
>>>> 		goto out_badstatus;
>>>>=20
>>>> +	/* Fixed transport header fields */
>>>> 	xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
>>>> 			rep->rr_hdrbuf.head[0].iov_base);
>>>> -
>>>> -	/* Fixed transport header fields */
>>>> 	p = xdr_inline_decode(&rep->rr_stream, 4 * sizeof(*p));
>>>> 	if (unlikely(!p))
>>>> 		goto out_shortreply;
>>>> @@ -1379,17 +1378,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>
>>>> 	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
>>>>=20
>>>> +	rpcrdma_post_recvs(r_xprt, false);
>>>> 	queue_work(rpcrdma_receive_wq, &rep->rr_work);
>>>> 	return;
>>>>
>>>> -out_badstatus:
>>>> -	rpcrdma_recv_buffer_put(rep);
>>>> -	if (r_xprt->rx_ep.rep_connected == 1) {
>>>> -		r_xprt->rx_ep.rep_connected = -EIO;
>>>> -		rpcrdma_conn_func(&r_xprt->rx_ep);
>>>> -	}
>>>> -	return;
>>>> -
>>>> out_badversion:
>>>> 	trace_xprtrdma_reply_vers(rep);
>>>> 	goto repost;
>>>> @@ -1409,7 +1401,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>> * receive buffer before returning.
>>>> */
>>>> repost:
>>>> -	r_xprt->rx_stats.bad_reply_count++;
>>>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>>>> -		rpcrdma_recv_buffer_put(rep);
>>>> +	rpcrdma_post_recvs(r_xprt, false);
>>>> +out_badstatus:
>>>> +	rpcrdma_recv_buffer_put(rep);
>>>> }
>>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>>> index 79885aa..0c775f0 100644
>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>> @@ -722,9 +722,6 @@
>>>> 	if (rc < 0)
>>>> 		goto failed_marshal;
>>>>
>>>> -	if (req->rl_reply == NULL) 		/* e.g. reconnection */
>>>> -		rpcrdma_recv_buffer_get(req);
>>>> -
>>>> 	/* Must suppress retransmit to maintain credits */
>>>> 	if (rqst->rq_connect_cookie == xprt->connect_cookie)
>>>> 		goto drop_connection;
>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>>> index f4ce7af..2a38301 100644
>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>> @@ -74,6 +74,7 @@
>>>> */
>>>> static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
>>>> static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
>>>> +static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
>>>> static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
>>>>
>>>> struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
>>>> @@ -726,7 +727,6 @@
>>>> {
>>>> 	struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
>>>> 						   rx_ia);
>>>> -	unsigned int extras;
>>>> 	int rc;
>>>>
>>>> retry:
>>>> @@ -770,9 +770,8 @@
>>>> 	}
>>>>
>>>> 	dprintk("RPC:       %s: connected\n", __func__);
>>>> -	extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
>>>> -	if (extras)
>>>> -		rpcrdma_ep_post_extra_recv(r_xprt, extras);
>>>> +
>>>> +	rpcrdma_post_recvs(r_xprt, true);
>>>>
>>>> out:
>>>> 	if (rc)
>>>> @@ -1082,14 +1081,8 @@ struct rpcrdma_req *
>>>> 	return req;
>>>> }
>>>>=20
>>>> -/**
>>>> - * rpcrdma_create_rep - Allocate an rpcrdma_rep object
>>>> - * @r_xprt: controlling transport
>>>> - *
>>>> - * Returns 0 on success or a negative errno on failure.
>>>> - */
>>>> -int
>>>> -rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>>>> +static int
>>>> +rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp)
>>>> {
>>>> 	struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
>>>> 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>> @@ -1117,6 +1110,7 @@ struct rpcrdma_req *
>>>> 	rep->rr_recv_wr.wr_cqe = &rep->rr_cqe;
>>>> 	rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
>>>> 	rep->rr_recv_wr.num_sge = 1;
>>>> +	rep->rr_temp = temp;
>>>>
>>>> 	spin_lock(&buf->rb_lock);
>>>> 	list_add(&rep->rr_list, &buf->rb_recv_bufs);
>>>> @@ -1168,12 +1162,8 @@ struct rpcrdma_req *
>>>> 		list_add(&req->rl_list, &buf->rb_send_bufs);
>>>> 	}
>>>>
>>>> +	buf->rb_posted_receives = 0;
>>>> 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
>>>> -	for (i = 0; i <= buf->rb_max_requests; i++) {
>>>> -		rc = rpcrdma_create_rep(r_xprt);
>>>> -		if (rc)
>>>> -			goto out;
>>>> -	}
>>>>
>>>> 	rc = rpcrdma_sendctxs_create(r_xprt);
>>>> 	if (rc)
>>>> @@ -1268,7 +1258,6 @@ struct rpcrdma_req *
>>>> 		rep = rpcrdma_buffer_get_rep_locked(buf);
>>>> 		rpcrdma_destroy_rep(rep);
>>>> 	}
>>>> -	buf->rb_send_count = 0;
>>>>
>>>> 	spin_lock(&buf->rb_reqslock);
>>>> 	while (!list_empty(&buf->rb_allreqs)) {
>>>> @@ -1283,7 +1272,6 @@ struct rpcrdma_req *
>>>> 		spin_lock(&buf->rb_reqslock);
>>>> 	}
>>>> 	spin_unlock(&buf->rb_reqslock);
>>>> -	buf->rb_recv_count = 0;
>>>>
>>>> 	rpcrdma_mrs_destroy(buf);
>>>> }
>>>> @@ -1356,27 +1344,11 @@ struct rpcrdma_mr *
>>>> 	__rpcrdma_mr_put(&r_xprt->rx_buf, mr);
>>>> }
>>>>=20
>>>> -static struct rpcrdma_rep *
>>>> -rpcrdma_buffer_get_rep(struct rpcrdma_buffer *buffers)
>>>> -{
>>>> -	/* If an RPC previously completed without a reply (say, a
>>>> -	 * credential problem or a soft timeout occurs) then hold off
>>>> -	 * on supplying more Receive buffers until the number of new
>>>> -	 * pending RPCs catches up to the number of posted Receives.
>>>> -	 */
>>>> -	if (unlikely(buffers->rb_send_count < buffers->rb_recv_count))
>>>> -		return NULL;
>>>> -
>>>> -	if (unlikely(list_empty(&buffers->rb_recv_bufs)))
>>>> -		return NULL;
>>>> -	buffers->rb_recv_count++;
>>>> -	return rpcrdma_buffer_get_rep_locked(buffers);
>>>> -}
>>>> -
>>>> -/*
>>>> - * Get a set of request/reply buffers.
>>>> +/**
>>>> + * rpcrdma_buffer_get - Get a request buffer
>>>> + * @buffers: Buffer pool from which to obtain a buffer
>>>> *
>>>> - * Reply buffer (if available) is attached to send buffer upon return.
>>>> + * Returns a fresh rpcrdma_req, or NULL if none are available.
>>>> */
>>>> struct rpcrdma_req *
>>>> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>>>> @@ -1384,23 +1356,21 @@ struct rpcrdma_req *
>>>> 	struct rpcrdma_req *req;
>>>>
>>>> 	spin_lock(&buffers->rb_lock);
>>>> -	if (list_empty(&buffers->rb_send_bufs))
>>>> -		goto out_reqbuf;
>>>> -	buffers->rb_send_count++;
>>>> +	if (unlikely(list_empty(&buffers->rb_send_bufs)))
>>>> +		goto out_noreqs;
>>>> 	req = rpcrdma_buffer_get_req_locked(buffers);
>>>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>>>> 	spin_unlock(&buffers->rb_lock);
>>>> -
>>>> 	return req;
>>>>
>>>> -out_reqbuf:
>>>> +out_noreqs:
>>>> 	spin_unlock(&buffers->rb_lock);
>>>> 	return NULL;
>>>> }
>>>>=20
>>>> -/*
>>>> - * Put request/reply buffers back into pool.
>>>> - * Pre-decrement counter/array index.
>>>> +/**
>>>> + * rpcrdma_buffer_put - Put request/reply buffers back into pool
>>>> + * @req: object to return
>>>> + *
>>>> */
>>>> void
>>>> rpcrdma_buffer_put(struct rpcrdma_req *req)
>>>> @@ -1411,27 +1381,16 @@ struct rpcrdma_req *
>>>> 	req->rl_reply = NULL;
>>>>
>>>> 	spin_lock(&buffers->rb_lock);
>>>> -	buffers->rb_send_count--;
>>>> -	list_add_tail(&req->rl_list, &buffers->rb_send_bufs);
>>>> +	list_add(&req->rl_list, &buffers->rb_send_bufs);
>>>> 	if (rep) {
>>>> -		buffers->rb_recv_count--;
>>>> -		list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>>>> +		if (!rep->rr_temp) {
>>>> +			list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>>>> +			rep = NULL;
>>>> +		}
>>>> 	}
>>>> 	spin_unlock(&buffers->rb_lock);
>>>> -}
>>>> -
>>>> -/*
>>>> - * Recover reply buffers from pool.
>>>> - * This happens when recovering from disconnect.
>>>> - */
>>>> -void
>>>> -rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
>>>> -{
>>>> -	struct rpcrdma_buffer *buffers = req->rl_buffer;
>>>> -
>>>> -	spin_lock(&buffers->rb_lock);
>>>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>>>> -	spin_unlock(&buffers->rb_lock);
>>>> +	if (rep)
>>>> +		rpcrdma_destroy_rep(rep);
>>>> }
>>>>
>>>> /*
>>>> @@ -1443,10 +1402,13 @@ struct rpcrdma_req *
>>>> {
>>>> 	struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
>>>>=20
>>>> -	spin_lock(&buffers->rb_lock);
>>>> -	buffers->rb_recv_count--;
>>>> -	list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>>>> -	spin_unlock(&buffers->rb_lock);
>>>> +	if (!rep->rr_temp) {
>>>> +		spin_lock(&buffers->rb_lock);
>>>> +		list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>>>> +		spin_unlock(&buffers->rb_lock);
>>>> +	} else {
>>>> +		rpcrdma_destroy_rep(rep);
>>>> +	}
>>>> }
>>>>
>>>> /**
>>>> @@ -1542,13 +1504,6 @@ struct rpcrdma_regbuf *
>>>> 	struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
>>>> 	int rc;
>>>>
>>>> -	if (req->rl_reply) {
>>>> -		rc = rpcrdma_ep_post_recv(ia, req->rl_reply);
>>>> -		if (rc)
>>>> -			return rc;
>>>> -		req->rl_reply = NULL;
>>>> -	}
>>>> -
>>>> 	if (!ep->rep_send_count ||
>>>> 	    test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) {
>>>> 		send_wr->send_flags |= IB_SEND_SIGNALED;
>>>> @@ -1623,3 +1578,70 @@ struct rpcrdma_regbuf *
>>>> 	rpcrdma_recv_buffer_put(rep);
>>>> 	return rc;
>>>> }
>>>> +
>>>> +/**
>>>> + * rpcrdma_post_recvs - Maybe post some Receive buffers
>>>> + * @r_xprt: controlling transport
>>>> + * @temp: when true, allocate temp rpcrdma_rep objects
>>>> + *
>>>> + */
>>>> +void
>>>> +rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
>>>> +{
>>>> +	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>> +	struct ib_recv_wr *wr, *bad_wr;
>>>> +	int needed, count, rc;
>>>> +
>>>> +	needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
>>>> +	if (buf->rb_posted_receives > needed)
>>>> +		return;
>>>> +	needed -= buf->rb_posted_receives;
>>>> +
>>>> +	count = 0;
>>>> +	wr = NULL;
>>>> +	while (needed) {
>>>> +		struct rpcrdma_regbuf *rb;
>>>> +		struct rpcrdma_rep *rep;
>>>> +
>>>> +		spin_lock(&buf->rb_lock);
>>>> +		rep = list_first_entry_or_null(&buf->rb_recv_bufs,
>>>> +					       struct rpcrdma_rep, rr_list);
>>>> +		if (likely(rep))
>>>> +			list_del(&rep->rr_list);
>>>> +		spin_unlock(&buf->rb_lock);
>>>> +		if (!rep) {
>>>> +			if (rpcrdma_create_rep(r_xprt, temp))
>>>> +				break;
>>>> +			continue;
>>>> +		}
>>>> +
>>>> +		rb = rep->rr_rdmabuf;
>>>> +		if (!rpcrdma_regbuf_is_mapped(rb)) {
>>>> +			if (!__rpcrdma_dma_map_regbuf(&r_xprt->rx_ia, rb)) {
>>>> +				rpcrdma_recv_buffer_put(rep);
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
>>>> +		rep->rr_recv_wr.next = wr;
>>>> +		wr = &rep->rr_recv_wr;
>>>> +		++count;
>>>> +		--needed;
>>>> +	}
>>>> +	if (!count)
>>>> +		return;
>>>> +
>>>> +	rc = ib_post_recv(r_xprt->rx_ia.ri_id->qp, wr, &bad_wr);
>>>> +	if (rc) {
>>>> +		for (wr = bad_wr; wr; wr = wr->next) {
>>>> +			struct rpcrdma_rep *rep;
>>>> +
>>>> +			rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr);
>>>> +			rpcrdma_recv_buffer_put(rep);
>>>> +			--count;
>>>> +		}
>>>> +	}
>>>> +	buf->rb_posted_receives += count;
>>>> +	trace_xprtrdma_post_recvs(r_xprt, count, rc);
>>>> +}
>>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> index 765e4df..a6d0d6e 100644
>>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> @@ -197,6 +197,7 @@ struct rpcrdma_rep {
>>>> 	__be32			rr_proc;
>>>> 	int			rr_wc_flags;
>>>> 	u32			rr_inv_rkey;
>>>> +	bool			rr_temp;
>>>> 	struct rpcrdma_regbuf	*rr_rdmabuf;
>>>> 	struct rpcrdma_xprt	*rr_rxprt;
>>>> 	struct work_struct	rr_work;
>>>> @@ -397,11 +398,11 @@ struct rpcrdma_buffer {
>>>> 	struct rpcrdma_sendctx	**rb_sc_ctxs;
>>>>=20
>>>> 	spinlock_t		rb_lock;	/* protect buf lists */
>>>> -	int			rb_send_count, rb_recv_count;
>>>> 	struct list_head	rb_send_bufs;
>>>> 	struct list_head	rb_recv_bufs;
>>>> 	u32			rb_max_requests;
>>>> 	u32			rb_credits;	/* most recent credit =
grant */
>>>> +	int			rb_posted_receives;
>>>>=20
>>>> 	u32			rb_bc_srv_max_requests;
>>>> 	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
>>>> @@ -558,13 +559,13 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, =
struct rpcrdma_ia *,
>>>> int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>>>> 				struct rpcrdma_req *);
>>>> int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_rep =
*);
>>>> +void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
>>>>=20
>>>> /*
>>>> * Buffer calls - xprtrdma/verbs.c
>>>> */
>>>> struct rpcrdma_req *rpcrdma_create_req(struct rpcrdma_xprt *);
>>>> void rpcrdma_destroy_req(struct rpcrdma_req *);
>>>> -int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt);
>>>> int rpcrdma_buffer_create(struct rpcrdma_xprt *);
>>>> void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
>>>> struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct =
rpcrdma_buffer *buf);
>>>> @@ -577,7 +578,6 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct =
rpcrdma_ep *,
>>>>=20
>>>> struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
>>>> void rpcrdma_buffer_put(struct rpcrdma_req *);
>>>> -void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
>>>> void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
>>>>=20
>>>> struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum =
dma_data_direction,
>>>>=20
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> Chuck Lever
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source
  2018-05-07 14:28       ` Anna Schumaker
@ 2018-05-14 20:37         ` Jason Gunthorpe
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2018-05-14 20:37 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Chuck Lever, linux-rdma, Linux NFS Mailing List

On Mon, May 07, 2018 at 10:28:13AM -0400, Anna Schumaker wrote:

> > The tags I've proposed are consistent with other usage:
> > 
> > -> .c files use // ... comments
> > -> .h files use /* ... */ comments
> > -> Makefiles use # comments
> > 
> > There were no complaints from checkpatch.pl about the
> > comment style in my patch.
> 
> Ah, okay.  It's probably best to go with how everybody else uses it
> (although I still wonder why they use different styles for .c and .h
> files).  I'll take your patch the way it is now.

It is a mysterious thing, but

 Documentation/process/license-rules.rst

explains what to do. Looks like this patch is OK.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread
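[Editorial note: for readers following the SPDX discussion above, the three comment styles look like this. These are illustrative placeholder fragments, not files from the patch; license-rules.rst is the authority.]

```
// SPDX-License-Identifier: GPL-2.0
/* example.c — C source files carry the tag in a C++-style comment */

/* SPDX-License-Identifier: GPL-2.0 */
/* example.h — headers keep the classic comment form, per
 * Documentation/process/license-rules.rst
 */

# SPDX-License-Identifier: GPL-2.0
# Makefile — build files use shell-style comments
```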

* Re: [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler
  2018-05-08 19:52       ` Anna Schumaker
  2018-05-08 19:56         ` Chuck Lever
@ 2018-05-29 18:23         ` Chuck Lever
  2018-05-31 20:55           ` Anna Schumaker
  1 sibling, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2018-05-29 18:23 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, Linux NFS Mailing List, Trond Myklebust



> On May 8, 2018, at 3:52 PM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
>
>
>
> On 05/08/2018 03:47 PM, Chuck Lever wrote:
>>
>>
>>> On May 8, 2018, at 3:40 PM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
>>>
>>> Hi Chuck,
>>>
>>> On 05/04/2018 03:35 PM, Chuck Lever wrote:
>>>> Receive completion and Reply handling are done by a BOUND
>>>> workqueue, meaning they run on only one CPU.
>>>>
>>>> Posting receives is currently done in the send_request path, which
>>>> on large systems is typically done on a different CPU than the one
>>>> handling Receive completions. This results in movement of
>>>> Receive-related cachelines between the sending and receiving CPUs.
>>>>
>>>> More importantly, it means that currently Receives are posted while
>>>> the transport's write lock is held, which is unnecessary and costly.
>>>>
>>>> Finally, allocation of Receive buffers is performed on-demand in
>>>> the Receive completion handler. This helps guarantee that they are
>>>> allocated on the same NUMA node as the CPU that handles Receive
>>>> completions.
>>>>
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> Running this against a 4.17-rc4 server seems to work okay, but running against a 4.16 server fails the cthon special tests with:
>>>
>>> write/read 30 MB file
>>> verify failed, offset 11272192; expected 79, got
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>> 79 79 79 79 79 79 79 79 79 79
>>>
>>> and it goes on for several hundred more lines after this.  How worried do we need to be about somebody running a new client against an old server?
>>
>> I'm not sure what that result means, I've never seen it
>> before. But I can't think of a reason there would be an
>> incompatibility with a v4.16 server. That behavior needs
>> to be chased down and explained.
>>
>> Can you bisect this to a particular commit?
>
> Do you mean on the server side?  I don't see the problem on the client until I apply this patch

Hi Anna-

Have you made any progress on this? What is the status of my NFS/RDMA
patch series for v4.18?


>>> Some of the performance issues I've had in the past seem to have gone away with the 4.17-rc4 code as well.  I'm not sure if that's related to your code or something changing in soft roce, but either way I'm much happier :)
>>>
>>> Anna
>>>
>>>> ---
>>>> include/trace/events/rpcrdma.h    |   40 +++++++-
>>>> net/sunrpc/xprtrdma/backchannel.c |   32 +------
>>>> net/sunrpc/xprtrdma/rpc_rdma.c    |   22 +----
>>>> net/sunrpc/xprtrdma/transport.c   |    3 -
>>>> net/sunrpc/xprtrdma/verbs.c       |  176 +++++++++++++++++++++----------------
>>>> net/sunrpc/xprtrdma/xprt_rdma.h   |    6 +
>>>> 6 files changed, 150 insertions(+), 129 deletions(-)
>>>>
>>>> diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
>>>> index 99c0049..ad27e19 100644
>>>> --- a/include/trace/events/rpcrdma.h
>>>> +++ b/include/trace/events/rpcrdma.h
>>>> @@ -546,6 +546,39 @@
>>>> 	)
>>>> );
>>>>
>>>> +TRACE_EVENT(xprtrdma_post_recvs,
>>>> +	TP_PROTO(
>>>> +		const struct rpcrdma_xprt *r_xprt,
>>>> +		unsigned int count,
>>>> +		int status
>>>> +	),
>>>> +
>>>> +	TP_ARGS(r_xprt, count, status),
>>>> +
>>>> +	TP_STRUCT__entry(
>>>> +		__field(const void *, r_xprt)
>>>> +		__field(unsigned int, count)
>>>> +		__field(int, status)
>>>> +		__field(int, posted)
>>>> +		__string(addr, rpcrdma_addrstr(r_xprt))
>>>> +		__string(port, rpcrdma_portstr(r_xprt))
>>>> +	),
>>>> +
>>>> +	TP_fast_assign(
>>>> +		__entry->r_xprt = r_xprt;
>>>> +		__entry->count = count;
>>>> +		__entry->status = status;
>>>> +		__entry->posted = r_xprt->rx_buf.rb_posted_receives;
>>>> +		__assign_str(addr, rpcrdma_addrstr(r_xprt));
>>>> +		__assign_str(port, rpcrdma_portstr(r_xprt));
>>>> +	),
>>>> +
>>>> +	TP_printk("peer=[%s]:%s r_xprt=%p: %u new recvs, %d active (rc %d)",
>>>> +		__get_str(addr), __get_str(port), __entry->r_xprt,
>>>> +		__entry->count, __entry->posted, __entry->status
>>>> +	)
>>>> +);
>>>> +
>>>> /**
>>>> ** Completion events
>>>> **/
>>>> @@ -800,7 +833,6 @@
>>>> 		__field(unsigned int, task_id)
>>>> 		__field(unsigned int, client_id)
>>>> 		__field(const void *, req)
>>>> -		__field(const void *, rep)
>>>> 		__field(size_t, callsize)
>>>> 		__field(size_t, rcvsize)
>>>> 	),
>>>> @@ -809,15 +841,13 @@
>>>> 		__entry->task_id = task->tk_pid;
>>>> 		__entry->client_id = task->tk_client->cl_clid;
>>>> 		__entry->req = req;
>>>> -		__entry->rep = req ? req->rl_reply : NULL;
>>>> 		__entry->callsize = task->tk_rqstp->rq_callsize;
>>>> 		__entry->rcvsize = task->tk_rqstp->rq_rcvsize;
>>>> 	),
>>>>
>>>> -	TP_printk("task:%u@%u req=%p rep=%p (%zu, %zu)",
>>>> +	TP_printk("task:%u@%u req=%p (%zu, %zu)",
>>>> 		__entry->task_id, __entry->client_id,
>>>> -		__entry->req, __entry->rep,
>>>> -		__entry->callsize, __entry->rcvsize
>>>> +		__entry->req, __entry->callsize, __entry->rcvsize
>>>> 	)
>>>> );
>>>>=20
>>>> diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
>>>> index 4034788..c8f1c2b 100644
>>>> --- a/net/sunrpc/xprtrdma/backchannel.c
>>>> +++ b/net/sunrpc/xprtrdma/backchannel.c
>>>> @@ -71,23 +71,6 @@ static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
>>>> 	return -ENOMEM;
>>>> }
>>>>
>>>> -/* Allocate and add receive buffers to the rpcrdma_buffer's
>>>> - * existing list of rep's. These are released when the
>>>> - * transport is destroyed.
>>>> - */
>>>> -static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
>>>> -				 unsigned int count)
>>>> -{
>>>> -	int rc = 0;
>>>> -
>>>> -	while (count--) {
>>>> -		rc = rpcrdma_create_rep(r_xprt);
>>>> -		if (rc)
>>>> -			break;
>>>> -	}
>>>> -	return rc;
>>>> -}
>>>> -
>>>> /**
>>>> * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel requests
>>>> * @xprt: transport associated with these backchannel resources
>>>> @@ -116,14 +99,6 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
>>>> 	if (rc)
>>>> 		goto out_free;
>>>>
>>>> -	rc = rpcrdma_bc_setup_reps(r_xprt, reqs);
>>>> -	if (rc)
>>>> -		goto out_free;
>>>> -
>>>> -	rc = rpcrdma_ep_post_extra_recv(r_xprt, reqs);
>>>> -	if (rc)
>>>> -		goto out_free;
>>>> -
>>>> 	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
>>>> 	request_module("svcrdma");
>>>> 	trace_xprtrdma_cb_setup(r_xprt, reqs);
>>>> @@ -228,6 +203,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
>>>> 	if (rc < 0)
>>>> 		goto failed_marshal;
>>>>
>>>> +	rpcrdma_post_recvs(r_xprt, true);
>>>> 	if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
>>>> 		goto drop_connection;
>>>> 	return 0;
>>>> @@ -268,10 +244,14 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
>>>> */
>>>> void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
>>>> {
>>>> +	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
>>>> 	struct rpc_xprt *xprt = rqst->rq_xprt;
>>>>
>>>> 	dprintk("RPC:       %s: freeing rqst %p (req %p)\n",
>>>> -		__func__, rqst, rpcr_to_rdmar(rqst));
>>>> +		__func__, rqst, req);
>>>> +
>>>> +	rpcrdma_recv_buffer_put(req->rl_reply);
>>>> +	req->rl_reply = NULL;
>>>>
>>>> 	spin_lock_bh(&xprt->bc_pa_lock);
>>>> 	list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> index 8f89e3f..d676106 100644
>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> @@ -1027,8 +1027,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>>>>
>>>> out_short:
>>>> 	pr_warn("RPC/RDMA short backward direction call\n");
>>>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>>>> -		xprt_disconnect_done(&r_xprt->rx_xprt);
>>>> 	return true;
>>>> }
>>>> #else	/* CONFIG_SUNRPC_BACKCHANNEL */
>>>> @@ -1334,13 +1332,14 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>> 	u32 credits;
>>>> 	__be32 *p;
>>>>
>>>> +	--buf->rb_posted_receives;
>>>> +
>>>> 	if (rep->rr_hdrbuf.head[0].iov_len == 0)
>>>> 		goto out_badstatus;
>>>>
>>>> +	/* Fixed transport header fields */
>>>> 	xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
>>>> 			rep->rr_hdrbuf.head[0].iov_base);
>>>> -
>>>> -	/* Fixed transport header fields */
>>>> 	p = xdr_inline_decode(&rep->rr_stream, 4 * sizeof(*p));
>>>> 	if (unlikely(!p))
>>>> 		goto out_shortreply;
>>>> @@ -1379,17 +1378,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>
>>>> 	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
>>>>
>>>> +	rpcrdma_post_recvs(r_xprt, false);
>>>> 	queue_work(rpcrdma_receive_wq, &rep->rr_work);
>>>> 	return;
>>>>
>>>> -out_badstatus:
>>>> -	rpcrdma_recv_buffer_put(rep);
>>>> -	if (r_xprt->rx_ep.rep_connected == 1) {
>>>> -		r_xprt->rx_ep.rep_connected = -EIO;
>>>> -		rpcrdma_conn_func(&r_xprt->rx_ep);
>>>> -	}
>>>> -	return;
>>>> -
>>>> out_badversion:
>>>> 	trace_xprtrdma_reply_vers(rep);
>>>> 	goto repost;
>>>> @@ -1409,7 +1401,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>> * receive buffer before returning.
>>>> */
>>>> repost:
>>>> -	r_xprt->rx_stats.bad_reply_count++;
>>>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>>>> -		rpcrdma_recv_buffer_put(rep);
>>>> +	rpcrdma_post_recvs(r_xprt, false);
>>>> +out_badstatus:
>>>> +	rpcrdma_recv_buffer_put(rep);
>>>> }
>>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>>> index 79885aa..0c775f0 100644
>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>> @@ -722,9 +722,6 @@
>>>> 	if (rc < 0)
>>>> 		goto failed_marshal;
>>>>
>>>> -	if (req->rl_reply == NULL) 		/* e.g. reconnection */
>>>> -		rpcrdma_recv_buffer_get(req);
>>>> -
>>>> 	/* Must suppress retransmit to maintain credits */
>>>> 	if (rqst->rq_connect_cookie == xprt->connect_cookie)
>>>> 		goto drop_connection;
>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>>> index f4ce7af..2a38301 100644
>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>> @@ -74,6 +74,7 @@
>>>> */
>>>> static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
>>>> static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
>>>> +static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
>>>> static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
>>>>
>>>> struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
>>>> @@ -726,7 +727,6 @@
>>>> {
>>>> 	struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
>>>> 						   rx_ia);
>>>> -	unsigned int extras;
>>>> 	int rc;
>>>>
>>>> retry:
>>>> @@ -770,9 +770,8 @@
>>>> 	}
>>>>
>>>> 	dprintk("RPC:       %s: connected\n", __func__);
>>>> -	extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
>>>> -	if (extras)
>>>> -		rpcrdma_ep_post_extra_recv(r_xprt, extras);
>>>> +
>>>> +	rpcrdma_post_recvs(r_xprt, true);
>>>>
>>>> out:
>>>> 	if (rc)
>>>> @@ -1082,14 +1081,8 @@ struct rpcrdma_req *
>>>> 	return req;
>>>> }
>>>>
>>>> -/**
>>>> - * rpcrdma_create_rep - Allocate an rpcrdma_rep object
>>>> - * @r_xprt: controlling transport
>>>> - *
>>>> - * Returns 0 on success or a negative errno on failure.
>>>> - */
>>>> -int
>>>> -rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>>>> +static int
>>>> +rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp)
>>>> {
>>>> 	struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
>>>> 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>> @@ -1117,6 +1110,7 @@ struct rpcrdma_req *
>>>> 	rep->rr_recv_wr.wr_cqe = &rep->rr_cqe;
>>>> 	rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
>>>> 	rep->rr_recv_wr.num_sge = 1;
>>>> +	rep->rr_temp = temp;
>>>>
>>>> 	spin_lock(&buf->rb_lock);
>>>> 	list_add(&rep->rr_list, &buf->rb_recv_bufs);
>>>> @@ -1168,12 +1162,8 @@ struct rpcrdma_req *
>>>> 		list_add(&req->rl_list, &buf->rb_send_bufs);
>>>> 	}
>>>>
>>>> +	buf->rb_posted_receives = 0;
>>>> 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
>>>> -	for (i = 0; i <= buf->rb_max_requests; i++) {
>>>> -		rc = rpcrdma_create_rep(r_xprt);
>>>> -		if (rc)
>>>> -			goto out;
>>>> -	}
>>>>
>>>> 	rc = rpcrdma_sendctxs_create(r_xprt);
>>>> 	if (rc)
>>>> @@ -1268,7 +1258,6 @@ struct rpcrdma_req *
>>>> 		rep = rpcrdma_buffer_get_rep_locked(buf);
>>>> 		rpcrdma_destroy_rep(rep);
>>>> 	}
>>>> -	buf->rb_send_count = 0;
>>>>
>>>> 	spin_lock(&buf->rb_reqslock);
>>>> 	while (!list_empty(&buf->rb_allreqs)) {
>>>> @@ -1283,7 +1272,6 @@ struct rpcrdma_req *
>>>> 		spin_lock(&buf->rb_reqslock);
>>>> 	}
>>>> 	spin_unlock(&buf->rb_reqslock);
>>>> -	buf->rb_recv_count = 0;
>>>>
>>>> 	rpcrdma_mrs_destroy(buf);
>>>> }
>>>> @@ -1356,27 +1344,11 @@ struct rpcrdma_mr *
>>>> 	__rpcrdma_mr_put(&r_xprt->rx_buf, mr);
>>>> }
>>>>
>>>> -static struct rpcrdma_rep *
>>>> -rpcrdma_buffer_get_rep(struct rpcrdma_buffer *buffers)
>>>> -{
>>>> -	/* If an RPC previously completed without a reply (say, a
>>>> -	 * credential problem or a soft timeout occurs) then hold off
>>>> -	 * on supplying more Receive buffers until the number of new
>>>> -	 * pending RPCs catches up to the number of posted Receives.
>>>> -	 */
>>>> -	if (unlikely(buffers->rb_send_count < buffers->rb_recv_count))
>>>> -		return NULL;
>>>> -
>>>> -	if (unlikely(list_empty(&buffers->rb_recv_bufs)))
>>>> -		return NULL;
>>>> -	buffers->rb_recv_count++;
>>>> -	return rpcrdma_buffer_get_rep_locked(buffers);
>>>> -}
>>>> -
>>>> -/*
>>>> - * Get a set of request/reply buffers.
>>>> +/**
>>>> + * rpcrdma_buffer_get - Get a request buffer
>>>> + * @buffers: Buffer pool from which to obtain a buffer
>>>> *
>>>> - * Reply buffer (if available) is attached to send buffer upon return.
>>>> + * Returns a fresh rpcrdma_req, or NULL if none are available.
>>>> */
>>>> struct rpcrdma_req *
>>>> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>>>> @@ -1384,23 +1356,21 @@ struct rpcrdma_req *
>>>> 	struct rpcrdma_req *req;
>>>>
>>>> 	spin_lock(&buffers->rb_lock);
>>>> -	if (list_empty(&buffers->rb_send_bufs))
>>>> -		goto out_reqbuf;
>>>> -	buffers->rb_send_count++;
>>>> +	if (unlikely(list_empty(&buffers->rb_send_bufs)))
>>>> +		goto out_noreqs;
>>>> 	req = rpcrdma_buffer_get_req_locked(buffers);
>>>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>>>> 	spin_unlock(&buffers->rb_lock);
>>>> -
>>>> 	return req;
>>>>
>>>> -out_reqbuf:
>>>> +out_noreqs:
>>>> 	spin_unlock(&buffers->rb_lock);
>>>> 	return NULL;
>>>> }
>>>>
>>>> -/*
>>>> - * Put request/reply buffers back into pool.
>>>> - * Pre-decrement counter/array index.
>>>> +/**
>>>> + * rpcrdma_buffer_put - Put request/reply buffers back into pool
>>>> + * @req: object to return
>>>> + *
>>>> */
>>>> void
>>>> rpcrdma_buffer_put(struct rpcrdma_req *req)
>>>> @@ -1411,27 +1381,16 @@ struct rpcrdma_req *
>>>> 	req->rl_reply = NULL;
>>>>
>>>> 	spin_lock(&buffers->rb_lock);
>>>> -	buffers->rb_send_count--;
>>>> -	list_add_tail(&req->rl_list, &buffers->rb_send_bufs);
>>>> +	list_add(&req->rl_list, &buffers->rb_send_bufs);
>>>> 	if (rep) {
>>>> -		buffers->rb_recv_count--;
>>>> -		list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>>>> +		if (!rep->rr_temp) {
>>>> +			list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>>>> +			rep = NULL;
>>>> +		}
>>>> 	}
>>>> 	spin_unlock(&buffers->rb_lock);
>>>> -}
>>>> -
>>>> -/*
>>>> - * Recover reply buffers from pool.
>>>> - * This happens when recovering from disconnect.
>>>> - */
>>>> -void
>>>> -rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
>>>> -{
>>>> -	struct rpcrdma_buffer *buffers = req->rl_buffer;
>>>> -
>>>> -	spin_lock(&buffers->rb_lock);
>>>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>>>> -	spin_unlock(&buffers->rb_lock);
>>>> +	if (rep)
>>>> +		rpcrdma_destroy_rep(rep);
>>>> }
>>>>
>>>> /*
>>>> @@ -1443,10 +1402,13 @@ struct rpcrdma_req *
>>>> {
>>>> 	struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
>>>>
>>>> -	spin_lock(&buffers->rb_lock);
>>>> -	buffers->rb_recv_count--;
>>>> -	list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>>>> -	spin_unlock(&buffers->rb_lock);
>>>> +	if (!rep->rr_temp) {
>>>> +		spin_lock(&buffers->rb_lock);
>>>> +		list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>>>> +		spin_unlock(&buffers->rb_lock);
>>>> +	} else {
>>>> +		rpcrdma_destroy_rep(rep);
>>>> +	}
>>>> }
>>>>
>>>> /**
>>>> @@ -1542,13 +1504,6 @@ struct rpcrdma_regbuf *
>>>> 	struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
>>>> 	int rc;
>>>>
>>>> -	if (req->rl_reply) {
>>>> -		rc = rpcrdma_ep_post_recv(ia, req->rl_reply);
>>>> -		if (rc)
>>>> -			return rc;
>>>> -		req->rl_reply = NULL;
>>>> -	}
>>>> -
>>>> 	if (!ep->rep_send_count ||
>>>> 	    test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) {
>>>> 		send_wr->send_flags |= IB_SEND_SIGNALED;
>>>> @@ -1623,3 +1578,70 @@ struct rpcrdma_regbuf *
>>>> 	rpcrdma_recv_buffer_put(rep);
>>>> 	return rc;
>>>> }
>>>> +
>>>> +/**
>>>> + * rpcrdma_post_recvs - Maybe post some Receive buffers
>>>> + * @r_xprt: controlling transport
>>>> + * @temp: when true, allocate temp rpcrdma_rep objects
>>>> + *
>>>> + */
>>>> +void
>>>> +rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
>>>> +{
>>>> +	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>> +	struct ib_recv_wr *wr, *bad_wr;
>>>> +	int needed, count, rc;
>>>> +
>>>> +	needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
>>>> +	if (buf->rb_posted_receives > needed)
>>>> +		return;
>>>> +	needed -= buf->rb_posted_receives;
>>>> +
>>>> +	count = 0;
>>>> +	wr = NULL;
>>>> +	while (needed) {
>>>> +		struct rpcrdma_regbuf *rb;
>>>> +		struct rpcrdma_rep *rep;
>>>> +
>>>> +		spin_lock(&buf->rb_lock);
>>>> +		rep = list_first_entry_or_null(&buf->rb_recv_bufs,
>>>> +					       struct rpcrdma_rep, rr_list);
>>>> +		if (likely(rep))
>>>> +			list_del(&rep->rr_list);
>>>> +		spin_unlock(&buf->rb_lock);
>>>> +		if (!rep) {
>>>> +			if (rpcrdma_create_rep(r_xprt, temp))
>>>> +				break;
>>>> +			continue;
>>>> +		}
>>>> +
>>>> +		rb = rep->rr_rdmabuf;
>>>> +		if (!rpcrdma_regbuf_is_mapped(rb)) {
>>>> +			if (!__rpcrdma_dma_map_regbuf(&r_xprt->rx_ia, rb)) {
>>>> +				rpcrdma_recv_buffer_put(rep);
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
>>>> +		rep->rr_recv_wr.next = wr;
>>>> +		wr = &rep->rr_recv_wr;
>>>> +		++count;
>>>> +		--needed;
>>>> +	}
>>>> +	if (!count)
>>>> +		return;
>>>> +
>>>> +	rc = ib_post_recv(r_xprt->rx_ia.ri_id->qp, wr, &bad_wr);
>>>> +	if (rc) {
>>>> +		for (wr = bad_wr; wr; wr = wr->next) {
>>>> +			struct rpcrdma_rep *rep;
>>>> +
>>>> +			rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr);
>>>> +			rpcrdma_recv_buffer_put(rep);
>>>> +			--count;
>>>> +		}
>>>> +	}
>>>> +	buf->rb_posted_receives += count;
>>>> +	trace_xprtrdma_post_recvs(r_xprt, count, rc);
>>>> +}
>>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> index 765e4df..a6d0d6e 100644
>>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>> @@ -197,6 +197,7 @@ struct rpcrdma_rep {
>>>> 	__be32			rr_proc;
>>>> 	int			rr_wc_flags;
>>>> 	u32			rr_inv_rkey;
>>>> +	bool			rr_temp;
>>>> 	struct rpcrdma_regbuf	*rr_rdmabuf;
>>>> 	struct rpcrdma_xprt	*rr_rxprt;
>>>> 	struct work_struct	rr_work;
>>>> @@ -397,11 +398,11 @@ struct rpcrdma_buffer {
>>>> 	struct rpcrdma_sendctx	**rb_sc_ctxs;
>>>>
>>>> 	spinlock_t		rb_lock;	/* protect buf lists */
>>>> -	int			rb_send_count, rb_recv_count;
>>>> 	struct list_head	rb_send_bufs;
>>>> 	struct list_head	rb_recv_bufs;
>>>> 	u32			rb_max_requests;
>>>> 	u32			rb_credits;	/* most recent credit grant */
>>>> +	int			rb_posted_receives;
>>>>
>>>> 	u32			rb_bc_srv_max_requests;
>>>> 	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
>>>> @@ -558,13 +559,13 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
>>>> int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>>>> 				struct rpcrdma_req *);
>>>> int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_rep *);
>>>> +void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
>>>>
>>>> /*
>>>> * Buffer calls - xprtrdma/verbs.c
>>>> */
>>>> struct rpcrdma_req *rpcrdma_create_req(struct rpcrdma_xprt *);
>>>> void rpcrdma_destroy_req(struct rpcrdma_req *);
>>>> -int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt);
>>>> int rpcrdma_buffer_create(struct rpcrdma_xprt *);
>>>> void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
>>>> struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);
>>>> @@ -577,7 +578,6 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>>>>
>>>> struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
>>>> void rpcrdma_buffer_put(struct rpcrdma_req *);
>>>> -void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
>>>> void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
>>>>
>>>> struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> Chuck Lever
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler
  2018-05-29 18:23         ` Chuck Lever
@ 2018-05-31 20:55           ` Anna Schumaker
  0 siblings, 0 replies; 30+ messages in thread
From: Anna Schumaker @ 2018-05-31 20:55 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List, Trond Myklebust



On 05/29/2018 02:23 PM, Chuck Lever wrote:
> 
> 
>> On May 8, 2018, at 3:52 PM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
>>
>>
>>
>> On 05/08/2018 03:47 PM, Chuck Lever wrote:
>>>
>>>
>>>> On May 8, 2018, at 3:40 PM, Anna Schumaker <anna.schumaker@netapp.com> wrote:
>>>>
>>>> Hi Chuck,
>>>>
>>>> On 05/04/2018 03:35 PM, Chuck Lever wrote:
>>>>> Receive completion and Reply handling are done by a BOUND
>>>>> workqueue, meaning they run on only one CPU.
>>>>>
>>>>> Posting receives is currently done in the send_request path, which
>>>>> on large systems is typically done on a different CPU than the one
>>>>> handling Receive completions. This results in movement of
>>>>> Receive-related cachelines between the sending and receiving CPUs.
>>>>>
>>>>> More importantly, it means that currently Receives are posted while
>>>>> the transport's write lock is held, which is unnecessary and costly.
>>>>>
>>>>> Finally, allocation of Receive buffers is performed on-demand in
>>>>> the Receive completion handler. This helps guarantee that they are
>>>>> allocated on the same NUMA node as the CPU that handles Receive
>>>>> completions.
>>>>>
>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>
>>>> Running this against a 4.17-rc4 server seems to work okay, but running against a 4.16 server fails the cthon special tests with:
>>>>
>>>> write/read 30 MB file
>>>> verify failed, offset 11272192; expected 79, got
>>>> 79 79 79 79 79 79 79 79 79 79
>>>> 79 79 79 79 79 79 79 79 79 79
>>>> 79 79 79 79 79 79 79 79 79 79
>>>> 79 79 79 79 79 79 79 79 79 79
>>>> 79 79 79 79 79 79 79 79 79 79
>>>> 79 79 79 79 79 79 79 79 79 79
>>>>
>>>> and it goes on for several hundred more lines after this.  How worried do we need to be about somebody running a new client against an old server?
>>>
>>> I'm not sure what that result means, I've never seen it
>>> before. But I can't think of a reason there would be an
>>> incompatibility with a v4.16 server. That behavior needs
>>> to be chased down and explained.
>>>
>>> Can you bisect this to a particular commit?
>>
>> Do you mean on the server side?  I don't see the problem on the client until I apply this patch
> 
> Hi Anna-
> 
> Have you made any progress on this? What is the status of my NFS/RDMA
> patch series for v4.18 ?

I haven't been able to reproduce the issue against 4.16 in the last few days, so I'm assuming that whatever fixes it has already been backported.  I'm planning to send your patches to Trond tomorrow.

Anna

> 
> 
>>>> Some of the performance issues I've had in the past seem to have gone away with the 4.17-rc4 code as well.  I'm not sure if that's related to your code or something changing in soft roce, but either way I'm much happier :)
>>>>
>>>> Anna
>>>>
>>>>> ---
>>>>> include/trace/events/rpcrdma.h    |   40 +++++++-
>>>>> net/sunrpc/xprtrdma/backchannel.c |   32 +------
>>>>> net/sunrpc/xprtrdma/rpc_rdma.c    |   22 +----
>>>>> net/sunrpc/xprtrdma/transport.c   |    3 -
>>>>> net/sunrpc/xprtrdma/verbs.c       |  176 +++++++++++++++++++++----------------
>>>>> net/sunrpc/xprtrdma/xprt_rdma.h   |    6 +
>>>>> 6 files changed, 150 insertions(+), 129 deletions(-)
>>>>>
>>>>> diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
>>>>> index 99c0049..ad27e19 100644
>>>>> --- a/include/trace/events/rpcrdma.h
>>>>> +++ b/include/trace/events/rpcrdma.h
>>>>> @@ -546,6 +546,39 @@
>>>>> 	)
>>>>> );
>>>>>
>>>>> +TRACE_EVENT(xprtrdma_post_recvs,
>>>>> +	TP_PROTO(
>>>>> +		const struct rpcrdma_xprt *r_xprt,
>>>>> +		unsigned int count,
>>>>> +		int status
>>>>> +	),
>>>>> +
>>>>> +	TP_ARGS(r_xprt, count, status),
>>>>> +
>>>>> +	TP_STRUCT__entry(
>>>>> +		__field(const void *, r_xprt)
>>>>> +		__field(unsigned int, count)
>>>>> +		__field(int, status)
>>>>> +		__field(int, posted)
>>>>> +		__string(addr, rpcrdma_addrstr(r_xprt))
>>>>> +		__string(port, rpcrdma_portstr(r_xprt))
>>>>> +	),
>>>>> +
>>>>> +	TP_fast_assign(
>>>>> +		__entry->r_xprt = r_xprt;
>>>>> +		__entry->count = count;
>>>>> +		__entry->status = status;
>>>>> +		__entry->posted = r_xprt->rx_buf.rb_posted_receives;
>>>>> +		__assign_str(addr, rpcrdma_addrstr(r_xprt));
>>>>> +		__assign_str(port, rpcrdma_portstr(r_xprt));
>>>>> +	),
>>>>> +
>>>>> +	TP_printk("peer=[%s]:%s r_xprt=%p: %u new recvs, %d active (rc %d)",
>>>>> +		__get_str(addr), __get_str(port), __entry->r_xprt,
>>>>> +		__entry->count, __entry->posted, __entry->status
>>>>> +	)
>>>>> +);
>>>>> +
>>>>> /**
>>>>> ** Completion events
>>>>> **/
>>>>> @@ -800,7 +833,6 @@
>>>>> 		__field(unsigned int, task_id)
>>>>> 		__field(unsigned int, client_id)
>>>>> 		__field(const void *, req)
>>>>> -		__field(const void *, rep)
>>>>> 		__field(size_t, callsize)
>>>>> 		__field(size_t, rcvsize)
>>>>> 	),
>>>>> @@ -809,15 +841,13 @@
>>>>> 		__entry->task_id = task->tk_pid;
>>>>> 		__entry->client_id = task->tk_client->cl_clid;
>>>>> 		__entry->req = req;
>>>>> -		__entry->rep = req ? req->rl_reply : NULL;
>>>>> 		__entry->callsize = task->tk_rqstp->rq_callsize;
>>>>> 		__entry->rcvsize = task->tk_rqstp->rq_rcvsize;
>>>>> 	),
>>>>>
>>>>> -	TP_printk("task:%u@%u req=%p rep=%p (%zu, %zu)",
>>>>> +	TP_printk("task:%u@%u req=%p (%zu, %zu)",
>>>>> 		__entry->task_id, __entry->client_id,
>>>>> -		__entry->req, __entry->rep,
>>>>> -		__entry->callsize, __entry->rcvsize
>>>>> +		__entry->req, __entry->callsize, __entry->rcvsize
>>>>> 	)
>>>>> );
>>>>>
>>>>> diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
>>>>> index 4034788..c8f1c2b 100644
>>>>> --- a/net/sunrpc/xprtrdma/backchannel.c
>>>>> +++ b/net/sunrpc/xprtrdma/backchannel.c
>>>>> @@ -71,23 +71,6 @@ static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
>>>>> 	return -ENOMEM;
>>>>> }
>>>>>
>>>>> -/* Allocate and add receive buffers to the rpcrdma_buffer's
>>>>> - * existing list of rep's. These are released when the
>>>>> - * transport is destroyed.
>>>>> - */
>>>>> -static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
>>>>> -				 unsigned int count)
>>>>> -{
>>>>> -	int rc = 0;
>>>>> -
>>>>> -	while (count--) {
>>>>> -		rc = rpcrdma_create_rep(r_xprt);
>>>>> -		if (rc)
>>>>> -			break;
>>>>> -	}
>>>>> -	return rc;
>>>>> -}
>>>>> -
>>>>> /**
>>>>> * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel requests
>>>>> * @xprt: transport associated with these backchannel resources
>>>>> @@ -116,14 +99,6 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
>>>>> 	if (rc)
>>>>> 		goto out_free;
>>>>>
>>>>> -	rc = rpcrdma_bc_setup_reps(r_xprt, reqs);
>>>>> -	if (rc)
>>>>> -		goto out_free;
>>>>> -
>>>>> -	rc = rpcrdma_ep_post_extra_recv(r_xprt, reqs);
>>>>> -	if (rc)
>>>>> -		goto out_free;
>>>>> -
>>>>> 	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
>>>>> 	request_module("svcrdma");
>>>>> 	trace_xprtrdma_cb_setup(r_xprt, reqs);
>>>>> @@ -228,6 +203,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
>>>>> 	if (rc < 0)
>>>>> 		goto failed_marshal;
>>>>>
>>>>> +	rpcrdma_post_recvs(r_xprt, true);
>>>>> 	if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
>>>>> 		goto drop_connection;
>>>>> 	return 0;
>>>>> @@ -268,10 +244,14 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
>>>>> */
>>>>> void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
>>>>> {
>>>>> +	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
>>>>> 	struct rpc_xprt *xprt = rqst->rq_xprt;
>>>>>
>>>>> 	dprintk("RPC:       %s: freeing rqst %p (req %p)\n",
>>>>> -		__func__, rqst, rpcr_to_rdmar(rqst));
>>>>> +		__func__, rqst, req);
>>>>> +
>>>>> +	rpcrdma_recv_buffer_put(req->rl_reply);
>>>>> +	req->rl_reply = NULL;
>>>>>
>>>>> 	spin_lock_bh(&xprt->bc_pa_lock);
>>>>> 	list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
>>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>>> index 8f89e3f..d676106 100644
>>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>>> @@ -1027,8 +1027,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>>>>>
>>>>> out_short:
>>>>> 	pr_warn("RPC/RDMA short backward direction call\n");
>>>>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>>>>> -		xprt_disconnect_done(&r_xprt->rx_xprt);
>>>>> 	return true;
>>>>> }
>>>>> #else	/* CONFIG_SUNRPC_BACKCHANNEL */
>>>>> @@ -1334,13 +1332,14 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>> 	u32 credits;
>>>>> 	__be32 *p;
>>>>>
>>>>> +	--buf->rb_posted_receives;
>>>>> +
>>>>> 	if (rep->rr_hdrbuf.head[0].iov_len == 0)
>>>>> 		goto out_badstatus;
>>>>>
>>>>> +	/* Fixed transport header fields */
>>>>> 	xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
>>>>> 			rep->rr_hdrbuf.head[0].iov_base);
>>>>> -
>>>>> -	/* Fixed transport header fields */
>>>>> 	p = xdr_inline_decode(&rep->rr_stream, 4 * sizeof(*p));
>>>>> 	if (unlikely(!p))
>>>>> 		goto out_shortreply;
>>>>> @@ -1379,17 +1378,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>>
>>>>> 	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
>>>>>
>>>>> +	rpcrdma_post_recvs(r_xprt, false);
>>>>> 	queue_work(rpcrdma_receive_wq, &rep->rr_work);
>>>>> 	return;
>>>>>
>>>>> -out_badstatus:
>>>>> -	rpcrdma_recv_buffer_put(rep);
>>>>> -	if (r_xprt->rx_ep.rep_connected == 1) {
>>>>> -		r_xprt->rx_ep.rep_connected = -EIO;
>>>>> -		rpcrdma_conn_func(&r_xprt->rx_ep);
>>>>> -	}
>>>>> -	return;
>>>>> -
>>>>> out_badversion:
>>>>> 	trace_xprtrdma_reply_vers(rep);
>>>>> 	goto repost;
>>>>> @@ -1409,7 +1401,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>> * receive buffer before returning.
>>>>> */
>>>>> repost:
>>>>> -	r_xprt->rx_stats.bad_reply_count++;
>>>>> -	if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, rep))
>>>>> -		rpcrdma_recv_buffer_put(rep);
>>>>> +	rpcrdma_post_recvs(r_xprt, false);
>>>>> +out_badstatus:
>>>>> +	rpcrdma_recv_buffer_put(rep);
>>>>> }
>>>>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>>>>> index 79885aa..0c775f0 100644
>>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>>> @@ -722,9 +722,6 @@
>>>>> 	if (rc < 0)
>>>>> 		goto failed_marshal;
>>>>>
>>>>> -	if (req->rl_reply == NULL) 		/* e.g. reconnection */
>>>>> -		rpcrdma_recv_buffer_get(req);
>>>>> -
>>>>> 	/* Must suppress retransmit to maintain credits */
>>>>> 	if (rqst->rq_connect_cookie == xprt->connect_cookie)
>>>>> 		goto drop_connection;
>>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>>>> index f4ce7af..2a38301 100644
>>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>>> @@ -74,6 +74,7 @@
>>>>> */
>>>>> static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt);
>>>>> static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
>>>>> +static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
>>>>> static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
>>>>>
>>>>> struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
>>>>> @@ -726,7 +727,6 @@
>>>>> {
>>>>> 	struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
>>>>> 						   rx_ia);
>>>>> -	unsigned int extras;
>>>>> 	int rc;
>>>>>
>>>>> retry:
>>>>> @@ -770,9 +770,8 @@
>>>>> 	}
>>>>>
>>>>> 	dprintk("RPC:       %s: connected\n", __func__);
>>>>> -	extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
>>>>> -	if (extras)
>>>>> -		rpcrdma_ep_post_extra_recv(r_xprt, extras);
>>>>> +
>>>>> +	rpcrdma_post_recvs(r_xprt, true);
>>>>>
>>>>> out:
>>>>> 	if (rc)
>>>>> @@ -1082,14 +1081,8 @@ struct rpcrdma_req *
>>>>> 	return req;
>>>>> }
>>>>>
>>>>> -/**
>>>>> - * rpcrdma_create_rep - Allocate an rpcrdma_rep object
>>>>> - * @r_xprt: controlling transport
>>>>> - *
>>>>> - * Returns 0 on success or a negative errno on failure.
>>>>> - */
>>>>> -int
>>>>> -rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>>>>> +static int
>>>>> +rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp)
>>>>> {
>>>>> 	struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
>>>>> 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>>> @@ -1117,6 +1110,7 @@ struct rpcrdma_req *
>>>>> 	rep->rr_recv_wr.wr_cqe = &rep->rr_cqe;
>>>>> 	rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov;
>>>>> 	rep->rr_recv_wr.num_sge = 1;
>>>>> +	rep->rr_temp = temp;
>>>>>
>>>>> 	spin_lock(&buf->rb_lock);
>>>>> 	list_add(&rep->rr_list, &buf->rb_recv_bufs);
>>>>> @@ -1168,12 +1162,8 @@ struct rpcrdma_req *
>>>>> 		list_add(&req->rl_list, &buf->rb_send_bufs);
>>>>> 	}
>>>>>
>>>>> +	buf->rb_posted_receives = 0;
>>>>> 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
>>>>> -	for (i = 0; i <= buf->rb_max_requests; i++) {
>>>>> -		rc = rpcrdma_create_rep(r_xprt);
>>>>> -		if (rc)
>>>>> -			goto out;
>>>>> -	}
>>>>>
>>>>> 	rc = rpcrdma_sendctxs_create(r_xprt);
>>>>> 	if (rc)
>>>>> @@ -1268,7 +1258,6 @@ struct rpcrdma_req *
>>>>> 		rep = rpcrdma_buffer_get_rep_locked(buf);
>>>>> 		rpcrdma_destroy_rep(rep);
>>>>> 	}
>>>>> -	buf->rb_send_count = 0;
>>>>>
>>>>> 	spin_lock(&buf->rb_reqslock);
>>>>> 	while (!list_empty(&buf->rb_allreqs)) {
>>>>> @@ -1283,7 +1272,6 @@ struct rpcrdma_req *
>>>>> 		spin_lock(&buf->rb_reqslock);
>>>>> 	}
>>>>> 	spin_unlock(&buf->rb_reqslock);
>>>>> -	buf->rb_recv_count = 0;
>>>>>
>>>>> 	rpcrdma_mrs_destroy(buf);
>>>>> }
>>>>> @@ -1356,27 +1344,11 @@ struct rpcrdma_mr *
>>>>> 	__rpcrdma_mr_put(&r_xprt->rx_buf, mr);
>>>>> }
>>>>>
>>>>> -static struct rpcrdma_rep *
>>>>> -rpcrdma_buffer_get_rep(struct rpcrdma_buffer *buffers)
>>>>> -{
>>>>> -	/* If an RPC previously completed without a reply (say, a
>>>>> -	 * credential problem or a soft timeout occurs) then hold off
>>>>> -	 * on supplying more Receive buffers until the number of new
>>>>> -	 * pending RPCs catches up to the number of posted Receives.
>>>>> -	 */
>>>>> -	if (unlikely(buffers->rb_send_count < buffers->rb_recv_count))
>>>>> -		return NULL;
>>>>> -
>>>>> -	if (unlikely(list_empty(&buffers->rb_recv_bufs)))
>>>>> -		return NULL;
>>>>> -	buffers->rb_recv_count++;
>>>>> -	return rpcrdma_buffer_get_rep_locked(buffers);
>>>>> -}
>>>>> -
>>>>> -/*
>>>>> - * Get a set of request/reply buffers.
>>>>> +/**
>>>>> + * rpcrdma_buffer_get - Get a request buffer
>>>>> + * @buffers: Buffer pool from which to obtain a buffer
>>>>> *
>>>>> - * Reply buffer (if available) is attached to send buffer upon return.
>>>>> + * Returns a fresh rpcrdma_req, or NULL if none are available.
>>>>> */
>>>>> struct rpcrdma_req *
>>>>> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>>>>> @@ -1384,23 +1356,21 @@ struct rpcrdma_req *
>>>>> 	struct rpcrdma_req *req;
>>>>>
>>>>> 	spin_lock(&buffers->rb_lock);
>>>>> -	if (list_empty(&buffers->rb_send_bufs))
>>>>> -		goto out_reqbuf;
>>>>> -	buffers->rb_send_count++;
>>>>> +	if (unlikely(list_empty(&buffers->rb_send_bufs)))
>>>>> +		goto out_noreqs;
>>>>> 	req = rpcrdma_buffer_get_req_locked(buffers);
>>>>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>>>>> 	spin_unlock(&buffers->rb_lock);
>>>>> -
>>>>> 	return req;
>>>>>
>>>>> -out_reqbuf:
>>>>> +out_noreqs:
>>>>> 	spin_unlock(&buffers->rb_lock);
>>>>> 	return NULL;
>>>>> }
>>>>>
>>>>> -/*
>>>>> - * Put request/reply buffers back into pool.
>>>>> - * Pre-decrement counter/array index.
>>>>> +/**
>>>>> + * rpcrdma_buffer_put - Put request/reply buffers back into pool
>>>>> + * @req: object to return
>>>>> + *
>>>>> */
>>>>> void
>>>>> rpcrdma_buffer_put(struct rpcrdma_req *req)
>>>>> @@ -1411,27 +1381,16 @@ struct rpcrdma_req *
>>>>> 	req->rl_reply = NULL;
>>>>>
>>>>> 	spin_lock(&buffers->rb_lock);
>>>>> -	buffers->rb_send_count--;
>>>>> -	list_add_tail(&req->rl_list, &buffers->rb_send_bufs);
>>>>> +	list_add(&req->rl_list, &buffers->rb_send_bufs);
>>>>> 	if (rep) {
>>>>> -		buffers->rb_recv_count--;
>>>>> -		list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>>>>> +		if (!rep->rr_temp) {
>>>>> +			list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>>>>> +			rep = NULL;
>>>>> +		}
>>>>> 	}
>>>>> 	spin_unlock(&buffers->rb_lock);
>>>>> -}
>>>>> -
>>>>> -/*
>>>>> - * Recover reply buffers from pool.
>>>>> - * This happens when recovering from disconnect.
>>>>> - */
>>>>> -void
>>>>> -rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
>>>>> -{
>>>>> -	struct rpcrdma_buffer *buffers = req->rl_buffer;
>>>>> -
>>>>> -	spin_lock(&buffers->rb_lock);
>>>>> -	req->rl_reply = rpcrdma_buffer_get_rep(buffers);
>>>>> -	spin_unlock(&buffers->rb_lock);
>>>>> +	if (rep)
>>>>> +		rpcrdma_destroy_rep(rep);
>>>>> }
>>>>>
>>>>> /*
>>>>> @@ -1443,10 +1402,13 @@ struct rpcrdma_req *
>>>>> {
>>>>> 	struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
>>>>>
>>>>> -	spin_lock(&buffers->rb_lock);
>>>>> -	buffers->rb_recv_count--;
>>>>> -	list_add_tail(&rep->rr_list, &buffers->rb_recv_bufs);
>>>>> -	spin_unlock(&buffers->rb_lock);
>>>>> +	if (!rep->rr_temp) {
>>>>> +		spin_lock(&buffers->rb_lock);
>>>>> +		list_add(&rep->rr_list, &buffers->rb_recv_bufs);
>>>>> +		spin_unlock(&buffers->rb_lock);
>>>>> +	} else {
>>>>> +		rpcrdma_destroy_rep(rep);
>>>>> +	}
>>>>> }
>>>>>
>>>>> /**
>>>>> @@ -1542,13 +1504,6 @@ struct rpcrdma_regbuf *
>>>>> 	struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr;
>>>>> 	int rc;
>>>>>
>>>>> -	if (req->rl_reply) {
>>>>> -		rc = rpcrdma_ep_post_recv(ia, req->rl_reply);
>>>>> -		if (rc)
>>>>> -			return rc;
>>>>> -		req->rl_reply = NULL;
>>>>> -	}
>>>>> -
>>>>> 	if (!ep->rep_send_count ||
>>>>> 	    test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) {
>>>>> 		send_wr->send_flags |= IB_SEND_SIGNALED;
>>>>> @@ -1623,3 +1578,70 @@ struct rpcrdma_regbuf *
>>>>> 	rpcrdma_recv_buffer_put(rep);
>>>>> 	return rc;
>>>>> }
>>>>> +
>>>>> +/**
>>>>> + * rpcrdma_post_recvs - Maybe post some Receive buffers
>>>>> + * @r_xprt: controlling transport
>>>>> + * @temp: when true, allocate temp rpcrdma_rep objects
>>>>> + *
>>>>> + */
>>>>> +void
>>>>> +rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
>>>>> +{
>>>>> +	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>>> +	struct ib_recv_wr *wr, *bad_wr;
>>>>> +	int needed, count, rc;
>>>>> +
>>>>> +	needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
>>>>> +	if (buf->rb_posted_receives > needed)
>>>>> +		return;
>>>>> +	needed -= buf->rb_posted_receives;
>>>>> +
>>>>> +	count = 0;
>>>>> +	wr = NULL;
>>>>> +	while (needed) {
>>>>> +		struct rpcrdma_regbuf *rb;
>>>>> +		struct rpcrdma_rep *rep;
>>>>> +
>>>>> +		spin_lock(&buf->rb_lock);
>>>>> +		rep = list_first_entry_or_null(&buf->rb_recv_bufs,
>>>>> +					       struct rpcrdma_rep, rr_list);
>>>>> +		if (likely(rep))
>>>>> +			list_del(&rep->rr_list);
>>>>> +		spin_unlock(&buf->rb_lock);
>>>>> +		if (!rep) {
>>>>> +			if (rpcrdma_create_rep(r_xprt, temp))
>>>>> +				break;
>>>>> +			continue;
>>>>> +		}
>>>>> +
>>>>> +		rb = rep->rr_rdmabuf;
>>>>> +		if (!rpcrdma_regbuf_is_mapped(rb)) {
>>>>> +			if (!__rpcrdma_dma_map_regbuf(&r_xprt->rx_ia, rb)) {
>>>>> +				rpcrdma_recv_buffer_put(rep);
>>>>> +				break;
>>>>> +			}
>>>>> +		}
>>>>> +
>>>>> +		trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe);
>>>>> +		rep->rr_recv_wr.next = wr;
>>>>> +		wr = &rep->rr_recv_wr;
>>>>> +		++count;
>>>>> +		--needed;
>>>>> +	}
>>>>> +	if (!count)
>>>>> +		return;
>>>>> +
>>>>> +	rc = ib_post_recv(r_xprt->rx_ia.ri_id->qp, wr, &bad_wr);
>>>>> +	if (rc) {
>>>>> +		for (wr = bad_wr; wr; wr = wr->next) {
>>>>> +			struct rpcrdma_rep *rep;
>>>>> +
>>>>> +			rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr);
>>>>> +			rpcrdma_recv_buffer_put(rep);
>>>>> +			--count;
>>>>> +		}
>>>>> +	}
>>>>> +	buf->rb_posted_receives += count;
>>>>> +	trace_xprtrdma_post_recvs(r_xprt, count, rc);
>>>>> +}
>>>>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>>> index 765e4df..a6d0d6e 100644
>>>>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>>>>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>>>>> @@ -197,6 +197,7 @@ struct rpcrdma_rep {
>>>>> 	__be32			rr_proc;
>>>>> 	int			rr_wc_flags;
>>>>> 	u32			rr_inv_rkey;
>>>>> +	bool			rr_temp;
>>>>> 	struct rpcrdma_regbuf	*rr_rdmabuf;
>>>>> 	struct rpcrdma_xprt	*rr_rxprt;
>>>>> 	struct work_struct	rr_work;
>>>>> @@ -397,11 +398,11 @@ struct rpcrdma_buffer {
>>>>> 	struct rpcrdma_sendctx	**rb_sc_ctxs;
>>>>>
>>>>> 	spinlock_t		rb_lock;	/* protect buf lists */
>>>>> -	int			rb_send_count, rb_recv_count;
>>>>> 	struct list_head	rb_send_bufs;
>>>>> 	struct list_head	rb_recv_bufs;
>>>>> 	u32			rb_max_requests;
>>>>> 	u32			rb_credits;	/* most recent credit grant */
>>>>> +	int			rb_posted_receives;
>>>>>
>>>>> 	u32			rb_bc_srv_max_requests;
>>>>> 	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
>>>>> @@ -558,13 +559,13 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
>>>>> int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>>>>> 				struct rpcrdma_req *);
>>>>> int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_rep *);
>>>>> +void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
>>>>>
>>>>> /*
>>>>> * Buffer calls - xprtrdma/verbs.c
>>>>> */
>>>>> struct rpcrdma_req *rpcrdma_create_req(struct rpcrdma_xprt *);
>>>>> void rpcrdma_destroy_req(struct rpcrdma_req *);
>>>>> -int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt);
>>>>> int rpcrdma_buffer_create(struct rpcrdma_xprt *);
>>>>> void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
>>>>> struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);
>>>>> @@ -577,7 +578,6 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
>>>>>
>>>>> struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
>>>>> void rpcrdma_buffer_put(struct rpcrdma_req *);
>>>>> -void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
>>>>> void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
>>>>>
>>>>> struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> --
>>> Chuck Lever
>>>
>>>
>>>
> 
> --
> Chuck Lever
> 
> 
> 


end of thread, other threads:[~2018-05-31 20:56 UTC | newest]

Thread overview: 30+ messages
2018-05-04 19:34 [PATCH v1 00/19] NFS/RDMA client patches for next Chuck Lever
2018-05-04 19:34 ` [PATCH v1 01/19] xprtrdma: Add proper SPDX tags for NetApp-contributed source Chuck Lever
2018-05-07 13:27   ` Anna Schumaker
2018-05-07 14:11     ` Chuck Lever
2018-05-07 14:28       ` Anna Schumaker
2018-05-14 20:37         ` Jason Gunthorpe
2018-05-04 19:34 ` [PATCH v1 02/19] xprtrdma: Try to fail quickly if proto=rdma Chuck Lever
2018-05-04 19:34 ` [PATCH v1 03/19] xprtrdma: Create transport's CM ID in the correct network namespace Chuck Lever
2018-05-04 19:34 ` [PATCH v1 04/19] xprtrdma: Fix max_send_wr computation Chuck Lever
2018-05-04 19:34 ` [PATCH v1 05/19] SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock Chuck Lever
2018-05-04 19:34 ` [PATCH v1 06/19] SUNRPC: Add a ->free_slot transport callout Chuck Lever
2018-05-04 19:35 ` [PATCH v1 07/19] xprtrdma: Introduce ->alloc_slot call-out for xprtrdma Chuck Lever
2018-05-04 19:35 ` [PATCH v1 08/19] xprtrdma: Make rpc_rqst part of rpcrdma_req Chuck Lever
2018-05-04 19:35 ` [PATCH v1 09/19] xprtrdma: Clean up Receive trace points Chuck Lever
2018-05-04 19:35 ` [PATCH v1 10/19] xprtrdma: Move Receive posting to Receive handler Chuck Lever
2018-05-08 19:40   ` Anna Schumaker
2018-05-08 19:47     ` Chuck Lever
2018-05-08 19:52       ` Anna Schumaker
2018-05-08 19:56         ` Chuck Lever
2018-05-29 18:23         ` Chuck Lever
2018-05-31 20:55           ` Anna Schumaker
2018-05-04 19:35 ` [PATCH v1 11/19] xprtrdma: Remove rpcrdma_ep_{post_recv, post_extra_recv} Chuck Lever
2018-05-04 19:35 ` [PATCH v1 12/19] xprtrdma: Remove rpcrdma_buffer_get_req_locked() Chuck Lever
2018-05-04 19:35 ` [PATCH v1 13/19] xprtrdma: Remove rpcrdma_buffer_get_rep_locked() Chuck Lever
2018-05-04 19:35 ` [PATCH v1 14/19] xprtrdma: Make rpcrdma_sendctx_put_locked() a static function Chuck Lever
2018-05-04 19:35 ` [PATCH v1 15/19] xprtrdma: Return -ENOBUFS when no pages are available Chuck Lever
2018-05-04 19:35 ` [PATCH v1 16/19] xprtrdma: Move common wait_for_buffer_space call to parent function Chuck Lever
2018-05-04 19:35 ` [PATCH v1 17/19] xprtrdma: Wait on empty sendctx queue Chuck Lever
2018-05-04 19:36 ` [PATCH v1 18/19] xprtrdma: Add trace_xprtrdma_dma_map(mr) Chuck Lever
2018-05-04 19:36 ` [PATCH v1 19/19] xprtrdma: Remove transfertypes array Chuck Lever
