linux-nfs.vger.kernel.org archive mirror
* [PATCH v4 00/30] NFS/RDMA client for next
@ 2018-12-17 16:39 Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 01/30] xprtrdma: Yet another double DMA-unmap Chuck Lever
                   ` (29 more replies)
  0 siblings, 30 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:39 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

I'd like to see this series merged into next.

There have been several regressions related to the ->send_request
changes merged into v4.20. This series therefore contains fixes and
clean-ups that came out of testing and close code audit while
working on those regressions.

The soft IRQ warnings and DMAR faults that I observed with krb5
flavors on NFS/RDMA are now resolved by fixes included at the top
of this series. There is still an abnormal number of disconnects
with WRITE-intensive workloads on Kerberos, but recovery now
appears to be transparent to applications.

The patch removing deprecated Kerberos encryption types has been
dropped because there are still users who need this support. I will
look into simple ways to ensure these enctypes are working properly
and find more gentle deprecation schemes for a later merge window.


Changes since v3:
- Rebased on v4.20-rc7
- Added patches that fix disconnect hangs and crashes
- Reordered series so that critical fixes are easy to backport
- Dropped patch removing deprecated encryption types
- Added patch to replace indirect memory registration calls
- Patch to detect leaked rpcrdma_reps is no longer needed
- Patch to fix rxe REG_WR was accepted by Jason, dropped here


Changes since v2:
- Rebased on v4.20-rc6 to pick up recent fixes
- Patches related to "xprtrdma: Dynamically allocate rpcrdma_reqs"
  have been dropped
- A number of revisions of documenting comments have been added
- Several new trace points are introduced


Changes since v1:
- Rebased on v4.20-rc4
- Series includes the full set, not just the RDMA-related fixes
- "Plant XID..." has been improved, based on testing with rxe
- The required rxe driver fix is included for convenience
- "Fix ri_max_segs..." replaces a bogus one-line fix in v1
- The patch description for "Remove support for FMR" was updated

---

Chuck Lever (30):
      xprtrdma: Yet another double DMA-unmap
      xprtrdma: Ensure MRs are DMA-unmapped when posting LOCAL_INV fails
      xprtrdma: Refactor Receive accounting
      xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue
      xprtrdma: No qp_event disconnect
      xprtrdma: Don't wake pending tasks until disconnect is done
      xprtrdma: Fix ri_max_segs and the result of ro_maxpages
      xprtrdma: Reduce max_frwr_depth
      xprtrdma: Remove support for FMR memory registration
      xprtrdma: Remove rpcrdma_memreg_ops
      xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
      NFS: Make "port=" mount option optional for RDMA mounts
      xprtrdma: Recognize XDRBUF_SPARSE_PAGES
      xprtrdma: Remove request_module from backchannel
      xprtrdma: Expose transport header errors
      xprtrdma: Simplify locking that protects the rl_allreqs list
      xprtrdma: Cull dprintk() call sites
      xprtrdma: Remove unused fields from rpcrdma_ia
      xprtrdma: Clean up of xprtrdma chunk trace points
      xprtrdma: Relocate the xprtrdma_mr_map trace points
      xprtrdma: Add trace points for calls to transport switch methods
      xprtrdma: Trace mapping, alloc, and dereg failures
      NFS: Fix NFSv4 symbolic trace point output
      SUNRPC: Simplify defining common RPC trace events
      SUNRPC: Fix some kernel doc complaints
      xprtrdma: Update comments in frwr_op_send
      xprtrdma: Replace outdated comment for rpcrdma_ep_post
      xprtrdma: Add documenting comment for rpcrdma_buffer_destroy
      xprtrdma: Clarify comments in rpcrdma_ia_remove
      xprtrdma: Don't leak freed MRs


 fs/nfs/nfs4trace.h                         |  456 +++++++++++++++++++---------
 fs/nfs/super.c                             |   10 -
 include/linux/sunrpc/xprt.h                |    1 
 include/trace/events/rpcrdma.h             |  190 ++++++++++--
 include/trace/events/sunrpc.h              |  172 ++++-------
 net/sunrpc/auth_gss/gss_mech_switch.c      |    2 
 net/sunrpc/backchannel_rqst.c              |    2 
 net/sunrpc/xprt.c                          |   19 +
 net/sunrpc/xprtmultipath.c                 |    4 
 net/sunrpc/xprtrdma/Makefile               |    3 
 net/sunrpc/xprtrdma/backchannel.c          |   39 +-
 net/sunrpc/xprtrdma/fmr_ops.c              |  337 ---------------------
 net/sunrpc/xprtrdma/frwr_ops.c             |  209 ++++++++-----
 net/sunrpc/xprtrdma/rpc_rdma.c             |   74 ++---
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 
 net/sunrpc/xprtrdma/transport.c            |   90 ++----
 net/sunrpc/xprtrdma/verbs.c                |  273 +++++++----------
 net/sunrpc/xprtrdma/xprt_rdma.h            |   79 +----
 net/sunrpc/xprtsock.c                      |    2 
 19 files changed, 935 insertions(+), 1035 deletions(-)
 delete mode 100644 net/sunrpc/xprtrdma/fmr_ops.c

--
Chuck Lever


* [PATCH v4 01/30] xprtrdma: Yet another double DMA-unmap
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
@ 2018-12-17 16:39 ` Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 02/30] xprtrdma: Ensure MRs are DMA-unmapped when posting LOCAL_INV fails Chuck Lever
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:39 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

While chasing yet another set of DMAR fault reports, I noticed that
the frwr recycler conflates whether or not an MR has been
DMA-unmapped with frwr->fr_state. In fact the two have only an
indirect relationship: it is impossible to reliably infer whether an
MR has been DMA-unmapped from its fr_state field, especially as the
surrounding code and its assumptions have changed over time.

A better approach is to track the DMA mapping status explicitly so
that the recycler is less brittle to unexpected situations, and
attempts to DMA-unmap a second time are prevented.
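
Illustration only (not part of the patch; the example_ helper name
is invented, while the field names are taken from the diff below).
The idiom is to let the recorded DMA data direction double as the
mapping-state flag:

	static void example_mr_unmap(struct rpcrdma_mr *mr,
				     struct ib_device *device)
	{
		if (mr->mr_dir != DMA_NONE) {
			ib_dma_unmap_sg(device, mr->mr_sg,
					mr->mr_nents, mr->mr_dir);
			/* a second call becomes a harmless no-op */
			mr->mr_dir = DMA_NONE;
		}
	}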

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.20
---
 net/sunrpc/xprtrdma/frwr_ops.c |    6 ++++--
 net/sunrpc/xprtrdma/verbs.c    |    9 ++++++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index fc6378cc..20ced24 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -117,15 +117,15 @@
 frwr_mr_recycle_worker(struct work_struct *work)
 {
 	struct rpcrdma_mr *mr = container_of(work, struct rpcrdma_mr, mr_recycle);
-	enum rpcrdma_frwr_state state = mr->frwr.fr_state;
 	struct rpcrdma_xprt *r_xprt = mr->mr_xprt;
 
 	trace_xprtrdma_mr_recycle(mr);
 
-	if (state != FRWR_FLUSHED_LI) {
+	if (mr->mr_dir != DMA_NONE) {
 		trace_xprtrdma_mr_unmap(mr);
 		ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
 				mr->mr_sg, mr->mr_nents, mr->mr_dir);
+		mr->mr_dir = DMA_NONE;
 	}
 
 	spin_lock(&r_xprt->rx_buf.rb_mrlock);
@@ -150,6 +150,8 @@
 	if (!mr->mr_sg)
 		goto out_list_err;
 
+	frwr->fr_state = FRWR_IS_INVALID;
+	mr->mr_dir = DMA_NONE;
 	INIT_LIST_HEAD(&mr->mr_list);
 	INIT_WORK(&mr->mr_recycle, frwr_mr_recycle_worker);
 	sg_init_table(mr->mr_sg, depth);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 3ddba94..b9bc7f9 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1329,9 +1329,12 @@ struct rpcrdma_mr *
 {
 	struct rpcrdma_xprt *r_xprt = mr->mr_xprt;
 
-	trace_xprtrdma_mr_unmap(mr);
-	ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
-			mr->mr_sg, mr->mr_nents, mr->mr_dir);
+	if (mr->mr_dir != DMA_NONE) {
+		trace_xprtrdma_mr_unmap(mr);
+		ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
+				mr->mr_sg, mr->mr_nents, mr->mr_dir);
+		mr->mr_dir = DMA_NONE;
+	}
 	__rpcrdma_mr_put(&r_xprt->rx_buf, mr);
 }
 



* [PATCH v4 02/30] xprtrdma: Ensure MRs are DMA-unmapped when posting LOCAL_INV fails
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 01/30] xprtrdma: Yet another double DMA-unmap Chuck Lever
@ 2018-12-17 16:39 ` Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 03/30] xprtrdma: Refactor Receive accounting Chuck Lever
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:39 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The recovery case in frwr_op_unmap_sync needs to DMA-unmap each MR.
frwr_op_release_mr does not DMA-unmap, but the recycle worker does,
so recycle these MRs instead of releasing them directly.

Fixes: 61da886bf74e ("xprtrdma: Explicitly resetting MRs is ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 20ced24..27222c0 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -563,8 +563,8 @@
 		mr = container_of(frwr, struct rpcrdma_mr, frwr);
 		bad_wr = bad_wr->next;
 
-		list_del(&mr->mr_list);
-		frwr_op_release_mr(mr);
+		list_del_init(&mr->mr_list);
+		rpcrdma_mr_recycle(mr);
 	}
 }
 



* [PATCH v4 03/30] xprtrdma: Refactor Receive accounting
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 01/30] xprtrdma: Yet another double DMA-unmap Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 02/30] xprtrdma: Ensure MRs are DMA-unmapped when posting LOCAL_INV fails Chuck Lever
@ 2018-12-17 16:39 ` Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 04/30] xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue Chuck Lever
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:39 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up: Divide the work cleanly:

- rpcrdma_wc_receive is responsible only for RDMA Receives
- rpcrdma_reply_handler is responsible only for RPC Replies
- the posted send and receive counts both belong in rpcrdma_ep

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h    |    2 +-
 net/sunrpc/xprtrdma/backchannel.c |    1 -
 net/sunrpc/xprtrdma/rpc_rdma.c    |   21 +++------------------
 net/sunrpc/xprtrdma/verbs.c       |   31 ++++++++++++++-----------------
 net/sunrpc/xprtrdma/xprt_rdma.h   |    3 +--
 5 files changed, 19 insertions(+), 39 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index b093058..2efe2d7 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -570,7 +570,7 @@
 		__entry->r_xprt = r_xprt;
 		__entry->count = count;
 		__entry->status = status;
-		__entry->posted = r_xprt->rx_buf.rb_posted_receives;
+		__entry->posted = r_xprt->rx_ep.rep_receive_count;
 		__assign_str(addr, rpcrdma_addrstr(r_xprt));
 		__assign_str(port, rpcrdma_portstr(r_xprt));
 	),
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index e5b367a..2cb07a3 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -207,7 +207,6 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
 	if (rc < 0)
 		goto failed_marshal;
 
-	rpcrdma_post_recvs(r_xprt, true);
 	if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
 		goto drop_connection;
 	return 0;
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 9f53e02..dc23977 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1312,11 +1312,6 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 	u32 credits;
 	__be32 *p;
 
-	--buf->rb_posted_receives;
-
-	if (rep->rr_hdrbuf.head[0].iov_len == 0)
-		goto out_badstatus;
-
 	/* Fixed transport header fields */
 	xdr_init_decode(&rep->rr_stream, &rep->rr_hdrbuf,
 			rep->rr_hdrbuf.head[0].iov_base);
@@ -1361,31 +1356,21 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 	clear_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags);
 
 	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
-
-	rpcrdma_post_recvs(r_xprt, false);
 	queue_work(rpcrdma_receive_wq, &rep->rr_work);
 	return;
 
 out_badversion:
 	trace_xprtrdma_reply_vers(rep);
-	goto repost;
+	goto out;
 
-/* The RPC transaction has already been terminated, or the header
- * is corrupt.
- */
 out_norqst:
 	spin_unlock(&xprt->queue_lock);
 	trace_xprtrdma_reply_rqst(rep);
-	goto repost;
+	goto out;
 
 out_shortreply:
 	trace_xprtrdma_reply_short(rep);
 
-/* If no pending RPC transaction was matched, post a replacement
- * receive buffer before returning.
- */
-repost:
-	rpcrdma_post_recvs(r_xprt, false);
-out_badstatus:
+out:
 	rpcrdma_recv_buffer_put(rep);
 }
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b9bc7f9..e4461e7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -78,6 +78,7 @@
 static void rpcrdma_mrs_destroy(struct rpcrdma_buffer *buf);
 static int rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt, bool temp);
 static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
+static void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
 
 struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
 
@@ -189,11 +190,13 @@
 	struct ib_cqe *cqe = wc->wr_cqe;
 	struct rpcrdma_rep *rep = container_of(cqe, struct rpcrdma_rep,
 					       rr_cqe);
+	struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
 
-	/* WARNING: Only wr_id and status are reliable at this point */
+	/* WARNING: Only wr_cqe and status are reliable at this point */
 	trace_xprtrdma_wc_receive(wc);
+	--r_xprt->rx_ep.rep_receive_count;
 	if (wc->status != IB_WC_SUCCESS)
-		goto out_fail;
+		goto out_flushed;
 
 	/* status == SUCCESS means all fields in wc are trustworthy */
 	rpcrdma_set_xdrlen(&rep->rr_hdrbuf, wc->byte_len);
@@ -204,17 +207,16 @@
 				   rdmab_addr(rep->rr_rdmabuf),
 				   wc->byte_len, DMA_FROM_DEVICE);
 
-out_schedule:
+	rpcrdma_post_recvs(r_xprt, false);
 	rpcrdma_reply_handler(rep);
 	return;
 
-out_fail:
+out_flushed:
 	if (wc->status != IB_WC_WR_FLUSH_ERR)
 		pr_err("rpcrdma: Recv: %s (%u/0x%x)\n",
 		       ib_wc_status_msg(wc->status),
 		       wc->status, wc->vendor_err);
-	rpcrdma_set_xdrlen(&rep->rr_hdrbuf, 0);
-	goto out_schedule;
+	rpcrdma_recv_buffer_put(rep);
 }
 
 static void
@@ -581,6 +583,7 @@
 	init_waitqueue_head(&ep->rep_connect_wait);
 	INIT_DELAYED_WORK(&ep->rep_disconnect_worker,
 			  rpcrdma_disconnect_worker);
+	ep->rep_receive_count = 0;
 
 	sendcq = ib_alloc_cq(ia->ri_device, NULL,
 			     ep->rep_attr.cap.max_send_wr + 1,
@@ -1174,7 +1177,6 @@ struct rpcrdma_req *
 	}
 
 	buf->rb_credits = 1;
-	buf->rb_posted_receives = 0;
 	INIT_LIST_HEAD(&buf->rb_recv_bufs);
 
 	rc = rpcrdma_sendctxs_create(r_xprt);
@@ -1511,25 +1513,20 @@ struct rpcrdma_regbuf *
 	return 0;
 }
 
-/**
- * rpcrdma_post_recvs - Maybe post some Receive buffers
- * @r_xprt: controlling transport
- * @temp: when true, allocate temp rpcrdma_rep objects
- *
- */
-void
+static void
 rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp)
 {
 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+	struct rpcrdma_ep *ep = &r_xprt->rx_ep;
 	struct ib_recv_wr *wr, *bad_wr;
 	int needed, count, rc;
 
 	rc = 0;
 	count = 0;
 	needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1);
-	if (buf->rb_posted_receives > needed)
+	if (ep->rep_receive_count > needed)
 		goto out;
-	needed -= buf->rb_posted_receives;
+	needed -= ep->rep_receive_count;
 
 	count = 0;
 	wr = NULL;
@@ -1577,7 +1574,7 @@ struct rpcrdma_regbuf *
 			--count;
 		}
 	}
-	buf->rb_posted_receives += count;
+	ep->rep_receive_count += count;
 out:
 	trace_xprtrdma_post_recvs(r_xprt, count, rc);
 }
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index a13ccb6..788124c 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -102,6 +102,7 @@ struct rpcrdma_ep {
 	struct rpcrdma_connect_private	rep_cm_private;
 	struct rdma_conn_param	rep_remote_cma;
 	struct delayed_work	rep_disconnect_worker;
+	int			rep_receive_count;
 };
 
 /* Pre-allocate extra Work Requests for handling backward receives
@@ -404,7 +405,6 @@ struct rpcrdma_buffer {
 	unsigned long		rb_flags;
 	u32			rb_max_requests;
 	u32			rb_credits;	/* most recent credit grant */
-	int			rb_posted_receives;
 
 	u32			rb_bc_srv_max_requests;
 	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
@@ -560,7 +560,6 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct rpcrdma_ia *,
 
 int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
 				struct rpcrdma_req *);
-void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
 
 /*
  * Buffer calls - xprtrdma/verbs.c



* [PATCH v4 04/30] xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (2 preceding siblings ...)
  2018-12-17 16:39 ` [PATCH v4 03/30] xprtrdma: Refactor Receive accounting Chuck Lever
@ 2018-12-17 16:39 ` Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 05/30] xprtrdma: No qp_event disconnect Chuck Lever
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:39 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

To address a connection-close ordering problem, we need the ability
to drain the RPC completions running on rpcrdma_receive_wq for just
one transport. Give each transport its own RPC completion workqueue,
and drain that workqueue when disconnecting the transport.
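
For illustration only (not part of the patch; this is a condensed
sketch of the drain helper added in the diff below).
drain_workqueue() waits for every item queued on a workqueue, which
is why a single global receive workqueue could not be drained on
behalf of just one transport:

	static void example_xprt_drain(struct rpcrdma_xprt *r_xprt)
	{
		/* Flush Receives, then wait for deferred Reply work */
		ib_drain_qp(r_xprt->rx_ia.ri_id->qp);
		drain_workqueue(r_xprt->rx_buf.rb_completion_wq);

		/* Deferred Reply processing may post LOCAL_INV WRs */
		ib_drain_sq(r_xprt->rx_ia.ri_id->qp);
	}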

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |    2 +
 net/sunrpc/xprtrdma/transport.c |   17 +++-------
 net/sunrpc/xprtrdma/verbs.c     |   67 +++++++++++++++++++++------------------
 net/sunrpc/xprtrdma/xprt_rdma.h |    6 +--
 4 files changed, 44 insertions(+), 48 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index dc23977..5738c9f 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1356,7 +1356,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 	clear_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags);
 
 	trace_xprtrdma_reply(rqst->rq_task, rep, req, credits);
-	queue_work(rpcrdma_receive_wq, &rep->rr_work);
+	queue_work(buf->rb_completion_wq, &rep->rr_work);
 	return;
 
 out_badversion:
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index ae2a838..91c476a 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -444,10 +444,14 @@
 	struct rpcrdma_ep *ep = &r_xprt->rx_ep;
 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
 
+	might_sleep();
+
 	dprintk("RPC:       %s: closing xprt %p\n", __func__, xprt);
 
+	/* Prevent marshaling and sending of new requests */
+	xprt_clear_connected(xprt);
+
 	if (test_and_clear_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags)) {
-		xprt_clear_connected(xprt);
 		rpcrdma_ia_remove(ia);
 		return;
 	}
@@ -858,8 +862,6 @@ void xprt_rdma_cleanup(void)
 		dprintk("RPC:       %s: xprt_unregister returned %i\n",
 			__func__, rc);
 
-	rpcrdma_destroy_wq();
-
 	rc = xprt_unregister_transport(&xprt_rdma_bc);
 	if (rc)
 		dprintk("RPC:       %s: xprt_unregister(bc) returned %i\n",
@@ -870,20 +872,13 @@ int xprt_rdma_init(void)
 {
 	int rc;
 
-	rc = rpcrdma_alloc_wq();
-	if (rc)
-		return rc;
-
 	rc = xprt_register_transport(&xprt_rdma);
-	if (rc) {
-		rpcrdma_destroy_wq();
+	if (rc)
 		return rc;
-	}
 
 	rc = xprt_register_transport(&xprt_rdma_bc);
 	if (rc) {
 		xprt_unregister_transport(&xprt_rdma);
-		rpcrdma_destroy_wq();
 		return rc;
 	}
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e4461e7..cff3a5d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -80,33 +80,23 @@
 static void rpcrdma_dma_unmap_regbuf(struct rpcrdma_regbuf *rb);
 static void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp);
 
-struct workqueue_struct *rpcrdma_receive_wq __read_mostly;
-
-int
-rpcrdma_alloc_wq(void)
+/* Wait for outstanding transport work to finish.
+ */
+static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 {
-	struct workqueue_struct *recv_wq;
-
-	recv_wq = alloc_workqueue("xprtrdma_receive",
-				  WQ_MEM_RECLAIM | WQ_HIGHPRI,
-				  0);
-	if (!recv_wq)
-		return -ENOMEM;
-
-	rpcrdma_receive_wq = recv_wq;
-	return 0;
-}
+	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
 
-void
-rpcrdma_destroy_wq(void)
-{
-	struct workqueue_struct *wq;
+	/* Flush Receives, then wait for deferred Reply work
+	 * to complete.
+	 */
+	ib_drain_qp(ia->ri_id->qp);
+	drain_workqueue(buf->rb_completion_wq);
 
-	if (rpcrdma_receive_wq) {
-		wq = rpcrdma_receive_wq;
-		rpcrdma_receive_wq = NULL;
-		destroy_workqueue(wq);
-	}
+	/* Deferred Reply processing might have scheduled
+	 * local invalidations.
+	 */
+	ib_drain_sq(ia->ri_id->qp);
 }
 
 /**
@@ -483,7 +473,7 @@
 	 *   connection is already gone.
 	 */
 	if (ia->ri_id->qp) {
-		ib_drain_qp(ia->ri_id->qp);
+		rpcrdma_xprt_drain(r_xprt);
 		rdma_destroy_qp(ia->ri_id);
 		ia->ri_id->qp = NULL;
 	}
@@ -825,8 +815,10 @@
 	return rc;
 }
 
-/*
- * rpcrdma_ep_disconnect
+/**
+ * rpcrdma_ep_disconnect - Disconnect underlying transport
+ * @ep: endpoint to disconnect
+ * @ia: associated interface adapter
  *
  * This is separate from destroy to facilitate the ability
  * to reconnect without recreating the endpoint.
@@ -837,19 +829,20 @@
 void
 rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
+	struct rpcrdma_xprt *r_xprt = container_of(ep, struct rpcrdma_xprt,
+						   rx_ep);
 	int rc;
 
+	/* returns without wait if ID is not connected */
 	rc = rdma_disconnect(ia->ri_id);
 	if (!rc)
-		/* returns without wait if not connected */
 		wait_event_interruptible(ep->rep_connect_wait,
 							ep->rep_connected != 1);
 	else
 		ep->rep_connected = rc;
-	trace_xprtrdma_disconnect(container_of(ep, struct rpcrdma_xprt,
-					       rx_ep), rc);
+	trace_xprtrdma_disconnect(r_xprt, rc);
 
-	ib_drain_qp(ia->ri_id->qp);
+	rpcrdma_xprt_drain(r_xprt);
 }
 
 /* Fixed-size circular FIFO queue. This implementation is wait-free and
@@ -1183,6 +1176,13 @@ struct rpcrdma_req *
 	if (rc)
 		goto out;
 
+	buf->rb_completion_wq = alloc_workqueue("rpcrdma-%s",
+						WQ_MEM_RECLAIM | WQ_HIGHPRI,
+						0,
+			r_xprt->rx_xprt.address_strings[RPC_DISPLAY_ADDR]);
+	if (!buf->rb_completion_wq)
+		goto out;
+
 	return 0;
 out:
 	rpcrdma_buffer_destroy(buf);
@@ -1241,6 +1241,11 @@ struct rpcrdma_req *
 {
 	cancel_delayed_work_sync(&buf->rb_refresh_worker);
 
+	if (buf->rb_completion_wq) {
+		destroy_workqueue(buf->rb_completion_wq);
+		buf->rb_completion_wq = NULL;
+	}
+
 	rpcrdma_sendctxs_destroy(buf);
 
 	while (!list_empty(&buf->rb_recv_bufs)) {
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 788124c..3f198cd 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -412,6 +412,7 @@ struct rpcrdma_buffer {
 
 	u32			rb_bc_max_requests;
 
+	struct workqueue_struct *rb_completion_wq;
 	struct delayed_work	rb_refresh_worker;
 };
 #define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)
@@ -547,8 +548,6 @@ struct rpcrdma_xprt {
 bool frwr_is_supported(struct rpcrdma_ia *);
 bool fmr_is_supported(struct rpcrdma_ia *);
 
-extern struct workqueue_struct *rpcrdma_receive_wq;
-
 /*
  * Endpoint calls - xprtrdma/verbs.c
  */
@@ -603,9 +602,6 @@ struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,
 	return __rpcrdma_dma_map_regbuf(ia, rb);
 }
 
-int rpcrdma_alloc_wq(void);
-void rpcrdma_destroy_wq(void);
-
 /*
  * Wrappers for chunk registration, shared by read/write chunk code.
  */



* [PATCH v4 05/30] xprtrdma: No qp_event disconnect
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (3 preceding siblings ...)
  2018-12-17 16:39 ` [PATCH v4 04/30] xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue Chuck Lever
@ 2018-12-17 16:39 ` Chuck Lever
  2018-12-17 16:39 ` [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done Chuck Lever
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:39 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

After thinking about this more, and auditing other kernel ULP
implementations, I believe that a DISCONNECT cm_event will occur after a
fatal QP event. If that's the case, there's no need for an explicit
disconnect in the QP event handler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c     |   32 --------------------------------
 net/sunrpc/xprtrdma/xprt_rdma.h |    1 -
 2 files changed, 33 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index cff3a5d..9a0a765 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -100,25 +100,6 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 }
 
 /**
- * rpcrdma_disconnect_worker - Force a disconnect
- * @work: endpoint to be disconnected
- *
- * Provider callbacks can possibly run in an IRQ context. This function
- * is invoked in a worker thread to guarantee that disconnect wake-up
- * calls are always done in process context.
- */
-static void
-rpcrdma_disconnect_worker(struct work_struct *work)
-{
-	struct rpcrdma_ep *ep = container_of(work, struct rpcrdma_ep,
-					     rep_disconnect_worker.work);
-	struct rpcrdma_xprt *r_xprt =
-		container_of(ep, struct rpcrdma_xprt, rx_ep);
-
-	xprt_force_disconnect(&r_xprt->rx_xprt);
-}
-
-/**
  * rpcrdma_qp_event_handler - Handle one QP event (error notification)
  * @event: details of the event
  * @context: ep that owns QP where event occurred
@@ -134,15 +115,6 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 						   rx_ep);
 
 	trace_xprtrdma_qp_event(r_xprt, event);
-	pr_err("rpcrdma: %s on device %s connected to %s:%s\n",
-	       ib_event_msg(event->event), event->device->name,
-	       rpcrdma_addrstr(r_xprt), rpcrdma_portstr(r_xprt));
-
-	if (ep->rep_connected == 1) {
-		ep->rep_connected = -EIO;
-		schedule_delayed_work(&ep->rep_disconnect_worker, 0);
-		wake_up_all(&ep->rep_connect_wait);
-	}
 }
 
 /**
@@ -571,8 +543,6 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 				   cdata->max_requests >> 2);
 	ep->rep_send_count = ep->rep_send_batch;
 	init_waitqueue_head(&ep->rep_connect_wait);
-	INIT_DELAYED_WORK(&ep->rep_disconnect_worker,
-			  rpcrdma_disconnect_worker);
 	ep->rep_receive_count = 0;
 
 	sendcq = ib_alloc_cq(ia->ri_device, NULL,
@@ -646,8 +616,6 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 void
 rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
-	cancel_delayed_work_sync(&ep->rep_disconnect_worker);
-
 	if (ia->ri_id && ia->ri_id->qp) {
 		rpcrdma_ep_disconnect(ep, ia);
 		rdma_destroy_qp(ia->ri_id);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 3f198cd..7c1b519 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -101,7 +101,6 @@ struct rpcrdma_ep {
 	wait_queue_head_t 	rep_connect_wait;
 	struct rpcrdma_connect_private	rep_cm_private;
 	struct rdma_conn_param	rep_remote_cma;
-	struct delayed_work	rep_disconnect_worker;
 	int			rep_receive_count;
 };
 



* [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (4 preceding siblings ...)
  2018-12-17 16:39 ` [PATCH v4 05/30] xprtrdma: No qp_event disconnect Chuck Lever
@ 2018-12-17 16:39 ` Chuck Lever
  2018-12-17 17:28   ` Trond Myklebust
  2018-12-17 16:39 ` [PATCH v4 07/30] xprtrdma: Fix ri_max_segs and the result of ro_maxpages Chuck Lever
                   ` (23 subsequent siblings)
  29 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:39 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Transport disconnect processing does a "wake pending tasks" at
various points.

Suppose an RPC Reply is being processed. The RPC task that the
Reply belongs to is waiting on the pending queue. If a disconnect
wake-up happens before reply processing is done, that Reply, even if
it is good, is thrown away, and the RPC has to be sent again.

This window apparently does not exist for socket transports, because
a lock is held while a reply is being received, which prevents the
wake-up call until after reply processing is done.

To resolve this, all RPC replies being processed on an RPC-over-RDMA
transport have to complete before pending tasks are awoken due to a
transport disconnect.

Callers that already hold the transport write lock may invoke
->ops->close directly. Others use a generic helper that schedules
a close when the write lock can be taken safely.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/xprt.h                |    1 +
 net/sunrpc/xprt.c                          |   19 +++++++++++++++++++
 net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++------
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
 net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++------
 net/sunrpc/xprtrdma/verbs.c                |    5 ++---
 6 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index a4ab4f8..ee94ed0 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -401,6 +401,7 @@ static inline __be32 *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
 bool			xprt_request_get_cong(struct rpc_xprt *xprt, struct rpc_rqst *req);
 void			xprt_disconnect_done(struct rpc_xprt *xprt);
 void			xprt_force_disconnect(struct rpc_xprt *xprt);
+void			xprt_disconnect_nowake(struct rpc_xprt *xprt);
 void			xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie);
 
 bool			xprt_lock_connect(struct rpc_xprt *, struct rpc_task *, void *);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index ce92700..afe412e 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -685,6 +685,25 @@ void xprt_force_disconnect(struct rpc_xprt *xprt)
 }
 EXPORT_SYMBOL_GPL(xprt_force_disconnect);
 
+/**
+ * xprt_disconnect_nowake - force a call to xprt->ops->close
+ * @xprt: transport to disconnect
+ *
+ * The caller must ensure that xprt_wake_pending_tasks() is
+ * called later.
+ */
+void xprt_disconnect_nowake(struct rpc_xprt *xprt)
+{
+       /* Don't race with the test_bit() in xprt_clear_locked() */
+       spin_lock_bh(&xprt->transport_lock);
+       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
+       /* Try to schedule an autoclose RPC call */
+       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
+               queue_work(xprtiod_workqueue, &xprt->task_cleanup);
+       spin_unlock_bh(&xprt->transport_lock);
+}
+EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
+
 static unsigned int
 xprt_connect_cookie(struct rpc_xprt *xprt)
 {
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 2cb07a3..5d462e8 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -193,14 +193,15 @@ static int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
  */
 int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
 {
-	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
+	struct rpc_xprt *xprt = rqst->rq_xprt;
+	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
 	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
 	int rc;
 
-	if (!xprt_connected(rqst->rq_xprt))
-		goto drop_connection;
+	if (!xprt_connected(xprt))
+		return -ENOTCONN;
 
-	if (!xprt_request_get_cong(rqst->rq_xprt, rqst))
+	if (!xprt_request_get_cong(xprt, rqst))
 		return -EBADSLT;
 
 	rc = rpcrdma_bc_marshal_reply(rqst);
@@ -215,7 +216,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
 	if (rc != -ENOTCONN)
 		return rc;
 drop_connection:
-	xprt_disconnect_done(rqst->rq_xprt);
+	xprt->ops->close(xprt);
 	return -ENOTCONN;
 }
 
@@ -338,7 +339,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
 
 out_overflow:
 	pr_warn("RPC/RDMA backchannel overflow\n");
-	xprt_disconnect_done(xprt);
+	xprt_disconnect_nowake(xprt);
 	/* This receive buffer gets reposted automatically
 	 * when the connection is re-established.
 	 */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index f3c147d..b908f2c 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -200,11 +200,10 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
 		svc_rdma_send_ctxt_put(rdma, ctxt);
 		goto drop_connection;
 	}
-	return rc;
+	return 0;
 
 drop_connection:
 	dprintk("svcrdma: failed to send bc call\n");
-	xprt_disconnect_done(xprt);
 	return -ENOTCONN;
 }
 
@@ -225,8 +224,11 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
 
 	ret = -ENOTCONN;
 	rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
-	if (!test_bit(XPT_DEAD, &sxprt->xpt_flags))
+	if (!test_bit(XPT_DEAD, &sxprt->xpt_flags)) {
 		ret = rpcrdma_bc_send_request(rdma, rqst);
+		if (ret == -ENOTCONN)
+			svc_close_xprt(sxprt);
+	}
 
 	mutex_unlock(&sxprt->xpt_mutex);
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 91c476a..a16296b 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -453,13 +453,13 @@
 
 	if (test_and_clear_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags)) {
 		rpcrdma_ia_remove(ia);
-		return;
+		goto out;
 	}
+
 	if (ep->rep_connected == -ENODEV)
 		return;
 	if (ep->rep_connected > 0)
 		xprt->reestablish_timeout = 0;
-	xprt_disconnect_done(xprt);
 	rpcrdma_ep_disconnect(ep, ia);
 
 	/* Prepare @xprt for the next connection by reinitializing
@@ -467,6 +467,10 @@
 	 */
 	r_xprt->rx_buf.rb_credits = 1;
 	xprt->cwnd = RPC_CWNDSHIFT;
+
+out:
+	++xprt->connect_cookie;
+	xprt_disconnect_done(xprt);
 }
 
 /**
@@ -515,7 +519,7 @@
 static void
 xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
 {
-	xprt_force_disconnect(xprt);
+	xprt_disconnect_nowake(xprt);
 }
 
 /**
@@ -717,7 +721,7 @@
 #endif	/* CONFIG_SUNRPC_BACKCHANNEL */
 
 	if (!xprt_connected(xprt))
-		goto drop_connection;
+		return -ENOTCONN;
 
 	if (!xprt_request_get_cong(xprt, rqst))
 		return -EBADSLT;
@@ -749,8 +753,8 @@
 	if (rc != -ENOTCONN)
 		return rc;
 drop_connection:
-	xprt_disconnect_done(xprt);
-	return -ENOTCONN;	/* implies disconnect */
+	xprt_rdma_close(xprt);
+	return -ENOTCONN;
 }
 
 void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 9a0a765..38a757c 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -252,7 +252,7 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 #endif
 		set_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags);
 		ep->rep_connected = -ENODEV;
-		xprt_force_disconnect(xprt);
+		xprt_disconnect_nowake(xprt);
 		wait_for_completion(&ia->ri_remove_done);
 
 		ia->ri_id = NULL;
@@ -280,10 +280,9 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 			ep->rep_connected = -EAGAIN;
 		goto disconnected;
 	case RDMA_CM_EVENT_DISCONNECTED:
-		++xprt->connect_cookie;
 		ep->rep_connected = -ECONNABORTED;
 disconnected:
-		xprt_force_disconnect(xprt);
+		xprt_disconnect_nowake(xprt);
 		wake_up_all(&ep->rep_connect_wait);
 		break;
 	default:



* [PATCH v4 07/30] xprtrdma: Fix ri_max_segs and the result of ro_maxpages
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (5 preceding siblings ...)
  2018-12-17 16:39 ` [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done Chuck Lever
@ 2018-12-17 16:39 ` Chuck Lever
  2018-12-18 19:35   ` Anna Schumaker
  2018-12-17 16:40 ` [PATCH v4 08/30] xprtrdma: Reduce max_frwr_depth Chuck Lever
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:39 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

With certain combinations of krb5i/p, MR size, and r/wsize, I/O can
fail with EMSGSIZE. This is because the calculated value of
ri_max_segs (the maximum number of MRs per RPC) can exceed
RPCRDMA_MAX_HDR_SEGS, which lets Read or Write list encoding walk
off the end of the transport header.

Once that is addressed, the ro_maxpages result has to be corrected:
a normal Read or Write chunk has 2 fewer MRs available than
ri_max_segs, because two segments are reserved for Reply chunk head
and tail buffers.
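
As a purely hypothetical worked example (the constants below are
illustrative, not the actual kernel values): with
RPCRDMA_MAX_DATA_SEGS = 256 and an FRWR depth of 4, ri_max_segs
works out to 256 / 4 + 2 = 66. If RPCRDMA_MAX_HDR_SEGS were 16, the
old code would let chunk encoding attempt 66 segments and overrun
the header, so the fix clamps ri_max_segs to 16. The corrected
ro_maxpages is then min(256, (16 - 2) * 4) = 56 pages, rather than
the previous 16 * 4 = 64.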

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/fmr_ops.c   |    7 +++++--
 net/sunrpc/xprtrdma/frwr_ops.c  |    7 +++++--
 net/sunrpc/xprtrdma/transport.c |    6 ++++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 7f5632c..78a0224 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -176,7 +176,10 @@ enum {
 
 	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
 				RPCRDMA_MAX_FMR_SGES);
-	ia->ri_max_segs += 2;	/* segments for head and tail buffers */
+	/* Reply chunks require segments for head and tail buffers */
+	ia->ri_max_segs += 2;
+	if (ia->ri_max_segs > RPCRDMA_MAX_HDR_SEGS)
+		ia->ri_max_segs = RPCRDMA_MAX_HDR_SEGS;
 	return 0;
 }
 
@@ -186,7 +189,7 @@ enum {
 fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 {
 	return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
-		     RPCRDMA_MAX_HDR_SEGS * RPCRDMA_MAX_FMR_SGES);
+		     (ia->ri_max_segs - 2) * RPCRDMA_MAX_FMR_SGES);
 }
 
 /* Use the ib_map_phys_fmr() verb to register a memory region
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 27222c0..f587e44 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -244,7 +244,10 @@
 
 	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
 				ia->ri_max_frwr_depth);
-	ia->ri_max_segs += 2;	/* segments for head and tail buffers */
+	/* Reply chunks require segments for head and tail buffers */
+	ia->ri_max_segs += 2;
+	if (ia->ri_max_segs > RPCRDMA_MAX_HDR_SEGS)
+		ia->ri_max_segs = RPCRDMA_MAX_HDR_SEGS;
 	return 0;
 }
 
@@ -257,7 +260,7 @@
 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
 
 	return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
-		     RPCRDMA_MAX_HDR_SEGS * ia->ri_max_frwr_depth);
+		     (ia->ri_max_segs - 2) * ia->ri_max_frwr_depth);
 }
 
 static void
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index a16296b..fbb14bf 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -704,8 +704,10 @@
  *	%-ENOTCONN if the caller should reconnect and call again
  *	%-EAGAIN if the caller should call again
  *	%-ENOBUFS if the caller should call again after a delay
- *	%-EIO if a permanent error occurred and the request was not
- *		sent. Do not try to send this message again.
+ *	%-EMSGSIZE if encoding ran out of buffer space. The request
+ *		was not sent. Do not try to send this message again.
+ *	%-EIO if an I/O error occurred. The request was not sent.
+ *		Do not try to send this message again.
  */
 static int
 xprt_rdma_send_request(struct rpc_rqst *rqst)



* [PATCH v4 08/30] xprtrdma: Reduce max_frwr_depth
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (6 preceding siblings ...)
  2018-12-17 16:39 ` [PATCH v4 07/30] xprtrdma: Fix ri_max_segs and the result of ro_maxpages Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 09/30] xprtrdma: Remove support for FMR memory registration Chuck Lever
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Some devices advertise a large max_fast_reg_page_list_len
capability, but perform optimally when MRs are significantly smaller
than that depth -- probably when the MR itself is no larger than a
page.

By default, the RDMA R/W core API uses max_sge_rd as the maximum
page depth for MRs. For some devices, the value of max_sge_rd is
1, which is also not optimal. Thus, when max_sge_rd is larger than
1, use that value. Otherwise use the value of the
max_fast_reg_page_list_len attribute.

I've tested this with CX-3 Pro, FastLinq, and CX-5 devices. It
reproducibly improves the throughput of large I/Os by several
percent.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index f587e44..16976b0 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -193,10 +193,17 @@
 	if (attrs->device_cap_flags & IB_DEVICE_SG_GAPS_REG)
 		ia->ri_mrtype = IB_MR_TYPE_SG_GAPS;
 
-	ia->ri_max_frwr_depth =
-			min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
-			      attrs->max_fast_reg_page_list_len);
-	dprintk("RPC:       %s: device's max FR page list len = %u\n",
+	/* Quirk: Some devices advertise a large max_fast_reg_page_list_len
+	 * capability, but perform optimally when the MRs are not larger
+	 * than a page.
+	 */
+	if (attrs->max_sge_rd > 1)
+		ia->ri_max_frwr_depth = attrs->max_sge_rd;
+	else
+		ia->ri_max_frwr_depth = attrs->max_fast_reg_page_list_len;
+	if (ia->ri_max_frwr_depth > RPCRDMA_MAX_DATA_SEGS)
+		ia->ri_max_frwr_depth = RPCRDMA_MAX_DATA_SEGS;
+	dprintk("RPC:       %s: max FR page list depth = %u\n",
 		__func__, ia->ri_max_frwr_depth);
 
 	/* Add room for frwr register and invalidate WRs.



* [PATCH v4 09/30] xprtrdma: Remove support for FMR memory registration
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (7 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 08/30] xprtrdma: Reduce max_frwr_depth Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 10/30] xprtrdma: Remove rpcrdma_memreg_ops Chuck Lever
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

FMR is not supported on most recent RDMA devices. It is also less
secure than FRWR because an FMR memory registration can expose
adjacent bytes to remote reading or writing. As discussed during the
RDMA BoF at LPC 2018, it is time to remove support for FMR in the
NFS/RDMA client stack.

Note that NFS/RDMA server-side uses either local memory registration
or FRWR. FMR is not used.

There are a few Infiniband/RoCE devices in the kernel tree that do
not appear to support MEM_MGT_EXTENSIONS (FRWR), and therefore will
not support client-side NFS/RDMA after this patch. These are:

 - mthca
 - qib
 - hns (RoCE)

Users of these devices can use NFS/TCP on IPoIB instead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/Makefile    |    3 
 net/sunrpc/xprtrdma/fmr_ops.c   |  340 ---------------------------------------
 net/sunrpc/xprtrdma/verbs.c     |    6 -
 net/sunrpc/xprtrdma/xprt_rdma.h |   12 -
 4 files changed, 2 insertions(+), 359 deletions(-)
 delete mode 100644 net/sunrpc/xprtrdma/fmr_ops.c

diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index 8bf19e1..8ed0377 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -1,8 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_SUNRPC_XPRT_RDMA) += rpcrdma.o
 
-rpcrdma-y := transport.o rpc_rdma.o verbs.o \
-	fmr_ops.o frwr_ops.o \
+rpcrdma-y := transport.o rpc_rdma.o verbs.o frwr_ops.o \
 	svc_rdma.o svc_rdma_backchannel.o svc_rdma_transport.o \
 	svc_rdma_sendto.o svc_rdma_recvfrom.o svc_rdma_rw.o \
 	module.o
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
deleted file mode 100644
index 78a0224..0000000
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ /dev/null
@@ -1,340 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (c) 2015, 2017 Oracle.  All rights reserved.
- * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
- */
-
-/* Lightweight memory registration using Fast Memory Regions (FMR).
- * Referred to sometimes as MTHCAFMR mode.
- *
- * FMR uses synchronous memory registration and deregistration.
- * FMR registration is known to be fast, but FMR deregistration
- * can take tens of usecs to complete.
- */
-
-/* Normal operation
- *
- * A Memory Region is prepared for RDMA READ or WRITE using the
- * ib_map_phys_fmr verb (fmr_op_map). When the RDMA operation is
- * finished, the Memory Region is unmapped using the ib_unmap_fmr
- * verb (fmr_op_unmap).
- */
-
-#include <linux/sunrpc/svc_rdma.h>
-
-#include "xprt_rdma.h"
-#include <trace/events/rpcrdma.h>
-
-#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
-# define RPCDBG_FACILITY	RPCDBG_TRANS
-#endif
-
-/* Maximum scatter/gather per FMR */
-#define RPCRDMA_MAX_FMR_SGES	(64)
-
-/* Access mode of externally registered pages */
-enum {
-	RPCRDMA_FMR_ACCESS_FLAGS	= IB_ACCESS_REMOTE_WRITE |
-					  IB_ACCESS_REMOTE_READ,
-};
-
-bool
-fmr_is_supported(struct rpcrdma_ia *ia)
-{
-	if (!ia->ri_device->alloc_fmr) {
-		pr_info("rpcrdma: 'fmr' mode is not supported by device %s\n",
-			ia->ri_device->name);
-		return false;
-	}
-	return true;
-}
-
-static void
-__fmr_unmap(struct rpcrdma_mr *mr)
-{
-	LIST_HEAD(l);
-	int rc;
-
-	list_add(&mr->fmr.fm_mr->list, &l);
-	rc = ib_unmap_fmr(&l);
-	list_del(&mr->fmr.fm_mr->list);
-	if (rc)
-		pr_err("rpcrdma: final ib_unmap_fmr for %p failed %i\n",
-		       mr, rc);
-}
-
-/* Release an MR.
- */
-static void
-fmr_op_release_mr(struct rpcrdma_mr *mr)
-{
-	int rc;
-
-	kfree(mr->fmr.fm_physaddrs);
-	kfree(mr->mr_sg);
-
-	/* In case this one was left mapped, try to unmap it
-	 * to prevent dealloc_fmr from failing with EBUSY
-	 */
-	__fmr_unmap(mr);
-
-	rc = ib_dealloc_fmr(mr->fmr.fm_mr);
-	if (rc)
-		pr_err("rpcrdma: final ib_dealloc_fmr for %p returned %i\n",
-		       mr, rc);
-
-	kfree(mr);
-}
-
-/* MRs are dynamically allocated, so simply clean up and release the MR.
- * A replacement MR will subsequently be allocated on demand.
- */
-static void
-fmr_mr_recycle_worker(struct work_struct *work)
-{
-	struct rpcrdma_mr *mr = container_of(work, struct rpcrdma_mr, mr_recycle);
-	struct rpcrdma_xprt *r_xprt = mr->mr_xprt;
-
-	trace_xprtrdma_mr_recycle(mr);
-
-	trace_xprtrdma_mr_unmap(mr);
-	ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
-			mr->mr_sg, mr->mr_nents, mr->mr_dir);
-
-	spin_lock(&r_xprt->rx_buf.rb_mrlock);
-	list_del(&mr->mr_all);
-	r_xprt->rx_stats.mrs_recycled++;
-	spin_unlock(&r_xprt->rx_buf.rb_mrlock);
-	fmr_op_release_mr(mr);
-}
-
-static int
-fmr_op_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr)
-{
-	static struct ib_fmr_attr fmr_attr = {
-		.max_pages	= RPCRDMA_MAX_FMR_SGES,
-		.max_maps	= 1,
-		.page_shift	= PAGE_SHIFT
-	};
-
-	mr->fmr.fm_physaddrs = kcalloc(RPCRDMA_MAX_FMR_SGES,
-				       sizeof(u64), GFP_KERNEL);
-	if (!mr->fmr.fm_physaddrs)
-		goto out_free;
-
-	mr->mr_sg = kcalloc(RPCRDMA_MAX_FMR_SGES,
-			    sizeof(*mr->mr_sg), GFP_KERNEL);
-	if (!mr->mr_sg)
-		goto out_free;
-
-	sg_init_table(mr->mr_sg, RPCRDMA_MAX_FMR_SGES);
-
-	mr->fmr.fm_mr = ib_alloc_fmr(ia->ri_pd, RPCRDMA_FMR_ACCESS_FLAGS,
-				     &fmr_attr);
-	if (IS_ERR(mr->fmr.fm_mr))
-		goto out_fmr_err;
-
-	INIT_LIST_HEAD(&mr->mr_list);
-	INIT_WORK(&mr->mr_recycle, fmr_mr_recycle_worker);
-	return 0;
-
-out_fmr_err:
-	dprintk("RPC:       %s: ib_alloc_fmr returned %ld\n", __func__,
-		PTR_ERR(mr->fmr.fm_mr));
-
-out_free:
-	kfree(mr->mr_sg);
-	kfree(mr->fmr.fm_physaddrs);
-	return -ENOMEM;
-}
-
-/* On success, sets:
- *	ep->rep_attr.cap.max_send_wr
- *	ep->rep_attr.cap.max_recv_wr
- *	cdata->max_requests
- *	ia->ri_max_segs
- */
-static int
-fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
-	    struct rpcrdma_create_data_internal *cdata)
-{
-	int max_qp_wr;
-
-	max_qp_wr = ia->ri_device->attrs.max_qp_wr;
-	max_qp_wr -= RPCRDMA_BACKWARD_WRS;
-	max_qp_wr -= 1;
-	if (max_qp_wr < RPCRDMA_MIN_SLOT_TABLE)
-		return -ENOMEM;
-	if (cdata->max_requests > max_qp_wr)
-		cdata->max_requests = max_qp_wr;
-	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
-	ep->rep_attr.cap.max_send_wr += RPCRDMA_BACKWARD_WRS;
-	ep->rep_attr.cap.max_send_wr += 1; /* for ib_drain_sq */
-	ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
-	ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
-	ep->rep_attr.cap.max_recv_wr += 1; /* for ib_drain_rq */
-
-	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
-				RPCRDMA_MAX_FMR_SGES);
-	/* Reply chunks require segments for head and tail buffers */
-	ia->ri_max_segs += 2;
-	if (ia->ri_max_segs > RPCRDMA_MAX_HDR_SEGS)
-		ia->ri_max_segs = RPCRDMA_MAX_HDR_SEGS;
-	return 0;
-}
-
-/* FMR mode conveys up to 64 pages of payload per chunk segment.
- */
-static size_t
-fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
-{
-	return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
-		     (ia->ri_max_segs - 2) * RPCRDMA_MAX_FMR_SGES);
-}
-
-/* Use the ib_map_phys_fmr() verb to register a memory region
- * for remote access via RDMA READ or RDMA WRITE.
- */
-static struct rpcrdma_mr_seg *
-fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
-	   int nsegs, bool writing, struct rpcrdma_mr **out)
-{
-	struct rpcrdma_mr_seg *seg1 = seg;
-	int len, pageoff, i, rc;
-	struct rpcrdma_mr *mr;
-	u64 *dma_pages;
-
-	mr = rpcrdma_mr_get(r_xprt);
-	if (!mr)
-		return ERR_PTR(-EAGAIN);
-
-	pageoff = offset_in_page(seg1->mr_offset);
-	seg1->mr_offset -= pageoff;	/* start of page */
-	seg1->mr_len += pageoff;
-	len = -pageoff;
-	if (nsegs > RPCRDMA_MAX_FMR_SGES)
-		nsegs = RPCRDMA_MAX_FMR_SGES;
-	for (i = 0; i < nsegs;) {
-		if (seg->mr_page)
-			sg_set_page(&mr->mr_sg[i],
-				    seg->mr_page,
-				    seg->mr_len,
-				    offset_in_page(seg->mr_offset));
-		else
-			sg_set_buf(&mr->mr_sg[i], seg->mr_offset,
-				   seg->mr_len);
-		len += seg->mr_len;
-		++seg;
-		++i;
-		/* Check for holes */
-		if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
-		    offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
-			break;
-	}
-	mr->mr_dir = rpcrdma_data_dir(writing);
-
-	mr->mr_nents = ib_dma_map_sg(r_xprt->rx_ia.ri_device,
-				     mr->mr_sg, i, mr->mr_dir);
-	if (!mr->mr_nents)
-		goto out_dmamap_err;
-	trace_xprtrdma_mr_map(mr);
-
-	for (i = 0, dma_pages = mr->fmr.fm_physaddrs; i < mr->mr_nents; i++)
-		dma_pages[i] = sg_dma_address(&mr->mr_sg[i]);
-	rc = ib_map_phys_fmr(mr->fmr.fm_mr, dma_pages, mr->mr_nents,
-			     dma_pages[0]);
-	if (rc)
-		goto out_maperr;
-
-	mr->mr_handle = mr->fmr.fm_mr->rkey;
-	mr->mr_length = len;
-	mr->mr_offset = dma_pages[0] + pageoff;
-
-	*out = mr;
-	return seg;
-
-out_dmamap_err:
-	pr_err("rpcrdma: failed to DMA map sg %p sg_nents %d\n",
-	       mr->mr_sg, i);
-	rpcrdma_mr_put(mr);
-	return ERR_PTR(-EIO);
-
-out_maperr:
-	pr_err("rpcrdma: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
-	       len, (unsigned long long)dma_pages[0],
-	       pageoff, mr->mr_nents, rc);
-	rpcrdma_mr_unmap_and_put(mr);
-	return ERR_PTR(-EIO);
-}
-
-/* Post Send WR containing the RPC Call message.
- */
-static int
-fmr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
-{
-	return ib_post_send(ia->ri_id->qp, &req->rl_sendctx->sc_wr, NULL);
-}
-
-/* Invalidate all memory regions that were registered for "req".
- *
- * Sleeps until it is safe for the host CPU to access the
- * previously mapped memory regions.
- *
- * Caller ensures that @mrs is not empty before the call. This
- * function empties the list.
- */
-static void
-fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct list_head *mrs)
-{
-	struct rpcrdma_mr *mr;
-	LIST_HEAD(unmap_list);
-	int rc;
-
-	/* ORDER: Invalidate all of the req's MRs first
-	 *
-	 * ib_unmap_fmr() is slow, so use a single call instead
-	 * of one call per mapped FMR.
-	 */
-	list_for_each_entry(mr, mrs, mr_list) {
-		dprintk("RPC:       %s: unmapping fmr %p\n",
-			__func__, &mr->fmr);
-		trace_xprtrdma_mr_localinv(mr);
-		list_add_tail(&mr->fmr.fm_mr->list, &unmap_list);
-	}
-	r_xprt->rx_stats.local_inv_needed++;
-	rc = ib_unmap_fmr(&unmap_list);
-	if (rc)
-		goto out_release;
-
-	/* ORDER: Now DMA unmap all of the req's MRs, and return
-	 * them to the free MW list.
-	 */
-	while (!list_empty(mrs)) {
-		mr = rpcrdma_mr_pop(mrs);
-		list_del(&mr->fmr.fm_mr->list);
-		rpcrdma_mr_unmap_and_put(mr);
-	}
-
-	return;
-
-out_release:
-	pr_err("rpcrdma: ib_unmap_fmr failed (%i)\n", rc);
-
-	while (!list_empty(mrs)) {
-		mr = rpcrdma_mr_pop(mrs);
-		list_del(&mr->fmr.fm_mr->list);
-		rpcrdma_mr_recycle(mr);
-	}
-}
-
-const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
-	.ro_map				= fmr_op_map,
-	.ro_send			= fmr_op_send,
-	.ro_unmap_sync			= fmr_op_unmap_sync,
-	.ro_open			= fmr_op_open,
-	.ro_maxpages			= fmr_op_maxpages,
-	.ro_init_mr			= fmr_op_init_mr,
-	.ro_release_mr			= fmr_op_release_mr,
-	.ro_displayname			= "fmr",
-	.ro_send_w_inv_ok		= 0,
-};
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 38a757c..389b617 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -397,12 +397,6 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 			break;
 		}
 		/*FALLTHROUGH*/
-	case RPCRDMA_MTHCAFMR:
-		if (fmr_is_supported(ia)) {
-			ia->ri_ops = &rpcrdma_fmr_memreg_ops;
-			break;
-		}
-		/*FALLTHROUGH*/
 	default:
 		pr_err("rpcrdma: Device %s does not support memreg mode %d\n",
 		       ia->ri_device->name, xprt_rdma_memreg_strategy);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 7c1b519..dc8e178 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -262,20 +262,12 @@ struct rpcrdma_frwr {
 	};
 };
 
-struct rpcrdma_fmr {
-	struct ib_fmr		*fm_mr;
-	u64			*fm_physaddrs;
-};
-
 struct rpcrdma_mr {
 	struct list_head	mr_list;
 	struct scatterlist	*mr_sg;
 	int			mr_nents;
 	enum dma_data_direction	mr_dir;
-	union {
-		struct rpcrdma_fmr	fmr;
-		struct rpcrdma_frwr	frwr;
-	};
+	struct rpcrdma_frwr	frwr;
 	struct rpcrdma_xprt	*mr_xprt;
 	u32			mr_handle;
 	u32			mr_length;
@@ -490,7 +482,6 @@ struct rpcrdma_memreg_ops {
 	const int	ro_send_w_inv_ok;
 };
 
-extern const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops;
 extern const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops;
 
 /*
@@ -545,7 +536,6 @@ struct rpcrdma_xprt {
 void rpcrdma_ia_remove(struct rpcrdma_ia *ia);
 void rpcrdma_ia_close(struct rpcrdma_ia *);
 bool frwr_is_supported(struct rpcrdma_ia *);
-bool fmr_is_supported(struct rpcrdma_ia *);
 
 /*
  * Endpoint calls - xprtrdma/verbs.c



* [PATCH v4 10/30] xprtrdma: Remove rpcrdma_memreg_ops
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (8 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 09/30] xprtrdma: Remove support for FMR memory registration Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 11/30] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up: Now that there is only FRWR, there is no need for a memory
registration switch. The indirect calls to the memreg operations can
be replaced with faster direct calls.
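
For illustration only (simplified, not verbatim from the patch): the
conversion turns indirect call sites roughly like

	seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
					   writing, &mr);

into direct calls such as

	seg = frwr_map(r_xprt, seg, nsegs, writing, &mr);

which the compiler can resolve at build time, avoiding the cost of
an indirect branch (especially noticeable when retpolines are
enabled).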

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c  |  131 +++++++++++++++++++++++++--------------
 net/sunrpc/xprtrdma/rpc_rdma.c  |   14 +---
 net/sunrpc/xprtrdma/transport.c |    2 -
 net/sunrpc/xprtrdma/verbs.c     |   22 +++----
 net/sunrpc/xprtrdma/xprt_rdma.h |   48 +++++---------
 5 files changed, 116 insertions(+), 101 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 16976b0..fb0944d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -15,21 +15,21 @@
 /* Normal operation
  *
  * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
- * Work Request (frwr_op_map). When the RDMA operation is finished, this
+ * Work Request (frwr_map). When the RDMA operation is finished, this
  * Memory Region is invalidated using a LOCAL_INV Work Request
- * (frwr_op_unmap_sync).
+ * (frwr_unmap_sync).
  *
  * Typically these Work Requests are not signaled, and neither are RDMA
  * SEND Work Requests (with the exception of signaling occasionally to
  * prevent provider work queue overflows). This greatly reduces HCA
  * interrupt workload.
  *
- * As an optimization, frwr_op_unmap marks MRs INVALID before the
+ * As an optimization, frwr_unmap marks MRs INVALID before the
  * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
  * rb_mrs immediately so that no work (like managing a linked list
  * under a spinlock) is needed in the completion upcall.
  *
- * But this means that frwr_op_map() can occasionally encounter an MR
+ * But this means that frwr_map() can occasionally encounter an MR
  * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
  * ordering prevents a subsequent FAST_REG WR from executing against
  * that MR while it is still being invalidated.
@@ -57,14 +57,14 @@
  * FLUSHED_LI:	The MR was being invalidated when the QP entered ERROR
  *		state, and the pending WR was flushed.
  *
- * When frwr_op_map encounters FLUSHED and VALID MRs, they are recovered
+ * When frwr_map encounters FLUSHED and VALID MRs, they are recovered
  * with ib_dereg_mr and then are re-initialized. Because MR recovery
  * allocates fresh resources, it is deferred to a workqueue, and the
  * recovered MRs are placed back on the rb_mrs list when recovery is
- * complete. frwr_op_map allocates another MR for the current RPC while
+ * complete. frwr_map allocates another MR for the current RPC while
  * the broken MR is reset.
  *
- * To ensure that frwr_op_map doesn't encounter an MR that is marked
+ * To ensure that frwr_map doesn't encounter an MR that is marked
  * INVALID but that is about to be flushed due to a previous transport
  * disconnect, the transport connect worker attempts to drain all
  * pending send queue WRs before the transport is reconnected.
@@ -80,8 +80,13 @@
 # define RPCDBG_FACILITY	RPCDBG_TRANS
 #endif
 
-bool
-frwr_is_supported(struct rpcrdma_ia *ia)
+/**
+ * frwr_is_supported - Check if device supports FRWR
+ * @ia: interface adapter to check
+ *
+ * Returns true if device supports FRWR, otherwise false
+ */
+bool frwr_is_supported(struct rpcrdma_ia *ia)
 {
 	struct ib_device_attr *attrs = &ia->ri_device->attrs;
 
@@ -97,8 +102,12 @@
 	return false;
 }
 
-static void
-frwr_op_release_mr(struct rpcrdma_mr *mr)
+/**
+ * frwr_release_mr - Destroy one MR
+ * @mr: MR allocated by frwr_init_mr
+ *
+ */
+void frwr_release_mr(struct rpcrdma_mr *mr)
 {
 	int rc;
 
@@ -132,11 +141,19 @@
 	list_del(&mr->mr_all);
 	r_xprt->rx_stats.mrs_recycled++;
 	spin_unlock(&r_xprt->rx_buf.rb_mrlock);
-	frwr_op_release_mr(mr);
+
+	frwr_release_mr(mr);
 }
 
-static int
-frwr_op_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr)
+/**
+ * frwr_init_mr - Initialize one MR
+ * @ia: interface adapter
+ * @mr: generic MR to prepare for FRWR
+ *
+ * Returns zero if successful. Otherwise a negative errno
+ * is returned.
+ */
+int frwr_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr)
 {
 	unsigned int depth = ia->ri_max_frwr_depth;
 	struct rpcrdma_frwr *frwr = &mr->frwr;
@@ -172,7 +189,13 @@
 	return rc;
 }
 
-/* On success, sets:
+/**
+ * frwr_open - Prepare an endpoint for use with FRWR
+ * @ia: interface adapter this endpoint will use
+ * @ep: endpoint to prepare
+ * @cdata: transport parameters
+ *
+ * On success, sets:
  *	ep->rep_attr.cap.max_send_wr
  *	ep->rep_attr.cap.max_recv_wr
  *	cdata->max_requests
@@ -181,10 +204,11 @@
  * And these FRWR-related fields:
  *	ia->ri_max_frwr_depth
  *	ia->ri_mrtype
+ *
+ * On failure, a negative errno is returned.
  */
-static int
-frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
-	     struct rpcrdma_create_data_internal *cdata)
+int frwr_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+	      struct rpcrdma_create_data_internal *cdata)
 {
 	struct ib_device_attr *attrs = &ia->ri_device->attrs;
 	int max_qp_wr, depth, delta;
@@ -258,11 +282,16 @@
 	return 0;
 }
 
-/* FRWR mode conveys a list of pages per chunk segment. The
+/**
+ * frwr_maxpages - Compute size of largest payload
+ * @r_xprt: transport
+ *
+ * Returns maximum size of an RPC message, in pages.
+ *
+ * FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
-static size_t
-frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
+size_t frwr_maxpages(struct rpcrdma_xprt *r_xprt)
 {
 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
 
@@ -344,12 +373,24 @@
 	trace_xprtrdma_wc_li_wake(wc, frwr);
 }
 
-/* Post a REG_MR Work Request to register a memory region
+/**
+ * frwr_map - Register a memory region
+ * @r_xprt: controlling transport
+ * @seg: memory region co-ordinates
+ * @nsegs: number of segments remaining
+ * @writing: true when RDMA Write will be used
+ * @out: initialized MR
+ *
+ * Prepare a REG_MR Work Request to register a memory region
  * for remote access via RDMA READ or RDMA WRITE.
+ *
+ * Returns the next segment or a negative errno pointer.
+ * On success, the prepared MR is planted in @out.
  */
-static struct rpcrdma_mr_seg *
-frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
-	    int nsegs, bool writing, struct rpcrdma_mr **out)
+struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
+				struct rpcrdma_mr_seg *seg,
+				int nsegs, bool writing,
+				struct rpcrdma_mr **out)
 {
 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
 	bool holes_ok = ia->ri_mrtype == IB_MR_TYPE_SG_GAPS;
@@ -434,14 +475,18 @@
 	return ERR_PTR(-EIO);
 }
 
-/* Post Send WR containing the RPC Call message.
+/**
+ * frwr_send - post Send WR containing the RPC Call message
+ * @ia: interface adapter
+ * @req: Prepared RPC Call
  *
  * For FRMR, chain any FastReg WRs to the Send WR. Only a
  * single ib_post_send call is needed to register memory
  * and then post the Send WR.
+ *
+ * Returns the result of ib_post_send.
  */
-static int
-frwr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
+int frwr_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
 {
 	struct ib_send_wr *post_wr;
 	struct rpcrdma_mr *mr;
@@ -468,10 +513,13 @@
 	return ib_post_send(ia->ri_id->qp, post_wr, NULL);
 }
 
-/* Handle a remotely invalidated mr on the @mrs list
+/**
+ * frwr_reminv - handle a remotely invalidated mr on the @mrs list
+ * @rep: Received reply
+ * @mrs: list of MRs to check
+ *
  */
-static void
-frwr_op_reminv(struct rpcrdma_rep *rep, struct list_head *mrs)
+void frwr_reminv(struct rpcrdma_rep *rep, struct list_head *mrs)
 {
 	struct rpcrdma_mr *mr;
 
@@ -485,7 +533,10 @@
 		}
 }
 
-/* Invalidate all memory regions that were registered for "req".
+/**
+ * frwr_unmap_sync - invalidate memory regions that were registered for @req
+ * @r_xprt: controlling transport
+ * @mrs: list of MRs to process
  *
  * Sleeps until it is safe for the host CPU to access the
  * previously mapped memory regions.
@@ -493,8 +544,7 @@
  * Caller ensures that @mrs is not empty before the call. This
  * function empties the list.
  */
-static void
-frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct list_head *mrs)
+void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt, struct list_head *mrs)
 {
 	struct ib_send_wr *first, **prev, *last;
 	const struct ib_send_wr *bad_wr;
@@ -577,16 +627,3 @@
 		rpcrdma_mr_recycle(mr);
 	}
 }
-
-const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
-	.ro_map				= frwr_op_map,
-	.ro_send			= frwr_op_send,
-	.ro_reminv			= frwr_op_reminv,
-	.ro_unmap_sync			= frwr_op_unmap_sync,
-	.ro_open			= frwr_op_open,
-	.ro_maxpages			= frwr_op_maxpages,
-	.ro_init_mr			= frwr_op_init_mr,
-	.ro_release_mr			= frwr_op_release_mr,
-	.ro_displayname			= "frwr",
-	.ro_send_w_inv_ok		= RPCRDMA_CMP_F_SND_W_INV_OK,
-};
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 5738c9f..2a2023d 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -356,8 +356,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		return nsegs;
 
 	do {
-		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
-						   false, &mr);
+		seg = frwr_map(r_xprt, seg, nsegs, false, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
@@ -414,8 +413,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 
 	nchunks = 0;
 	do {
-		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
-						   true, &mr);
+		seg = frwr_map(r_xprt, seg, nsegs, true, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
@@ -472,8 +470,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 
 	nchunks = 0;
 	do {
-		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
-						   true, &mr);
+		seg = frwr_map(r_xprt, seg, nsegs, true, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
@@ -1262,8 +1259,7 @@ void rpcrdma_release_rqst(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
 	 * RPC has relinquished all its Send Queue entries.
 	 */
 	if (!list_empty(&req->rl_registered))
-		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt,
-						    &req->rl_registered);
+		frwr_unmap_sync(r_xprt, &req->rl_registered);
 
 	/* Ensure that any DMA mapped pages associated with
 	 * the Send of the RPC Call have been unmapped before
@@ -1292,7 +1288,7 @@ void rpcrdma_deferred_completion(struct work_struct *work)
 
 	trace_xprtrdma_defer_cmp(rep);
 	if (rep->rr_wc_flags & IB_WC_WITH_INVALIDATE)
-		r_xprt->rx_ia.ri_ops->ro_reminv(rep, &req->rl_registered);
+		frwr_reminv(rep, &req->rl_registered);
 	rpcrdma_release_rqst(r_xprt, req);
 	rpcrdma_complete_rqst(rep);
 }
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index fbb14bf..f824892 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -399,7 +399,7 @@
 	INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
 			  xprt_rdma_connect_worker);
 
-	xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
+	xprt->max_payload = frwr_maxpages(new_xprt);
 	if (xprt->max_payload == 0)
 		goto out4;
 	xprt->max_payload <<= PAGE_SHIFT;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 389b617..d68efaf 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -289,10 +289,9 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 		break;
 	}
 
-	dprintk("RPC:       %s: %s:%s on %s/%s: %s\n", __func__,
+	dprintk("RPC:       %s: %s:%s on %s/frwr: %s\n", __func__,
 		rpcrdma_addrstr(r_xprt), rpcrdma_portstr(r_xprt),
-		ia->ri_device->name, ia->ri_ops->ro_displayname,
-		rdma_event_msg(event->event));
+		ia->ri_device->name, rdma_event_msg(event->event));
 	return 0;
 }
 
@@ -392,10 +391,8 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 
 	switch (xprt_rdma_memreg_strategy) {
 	case RPCRDMA_FRWR:
-		if (frwr_is_supported(ia)) {
-			ia->ri_ops = &rpcrdma_frwr_memreg_ops;
+		if (frwr_is_supported(ia))
 			break;
-		}
 		/*FALLTHROUGH*/
 	default:
 		pr_err("rpcrdma: Device %s does not support memreg mode %d\n",
@@ -509,7 +506,7 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 	}
 	ia->ri_max_send_sges = max_sge;
 
-	rc = ia->ri_ops->ro_open(ia, ep, cdata);
+	rc = frwr_open(ia, ep, cdata);
 	if (rc)
 		return rc;
 
@@ -567,7 +564,7 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 	/* Prepare RDMA-CM private message */
 	pmsg->cp_magic = rpcrdma_cmp_magic;
 	pmsg->cp_version = RPCRDMA_CMP_VERSION;
-	pmsg->cp_flags |= ia->ri_ops->ro_send_w_inv_ok;
+	pmsg->cp_flags |= RPCRDMA_CMP_F_SND_W_INV_OK;
 	pmsg->cp_send_size = rpcrdma_encode_buffer_size(cdata->inline_wsize);
 	pmsg->cp_recv_size = rpcrdma_encode_buffer_size(cdata->inline_rsize);
 	ep->rep_remote_cma.private_data = pmsg;
@@ -991,7 +988,7 @@ struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf)
 		if (!mr)
 			break;
 
-		rc = ia->ri_ops->ro_init_mr(ia, mr);
+		rc = frwr_init_mr(ia, mr);
 		if (rc) {
 			kfree(mr);
 			break;
@@ -1171,7 +1168,6 @@ struct rpcrdma_req *
 {
 	struct rpcrdma_xprt *r_xprt = container_of(buf, struct rpcrdma_xprt,
 						   rx_buf);
-	struct rpcrdma_ia *ia = rdmab_to_ia(buf);
 	struct rpcrdma_mr *mr;
 	unsigned int count;
 
@@ -1187,7 +1183,7 @@ struct rpcrdma_req *
 		if (!list_empty(&mr->mr_list))
 			list_del(&mr->mr_list);
 
-		ia->ri_ops->ro_release_mr(mr);
+		frwr_release_mr(mr);
 		count++;
 		spin_lock(&buf->rb_mrlock);
 	}
@@ -1381,7 +1377,7 @@ struct rpcrdma_req *
  *
  * xprtrdma uses a regbuf for posting an outgoing RDMA SEND, or for
  * receiving the payload of RDMA RECV operations. During Long Calls
- * or Replies they may be registered externally via ro_map.
+ * or Replies they may be registered externally via frwr_map.
  */
 struct rpcrdma_regbuf *
 rpcrdma_alloc_regbuf(size_t size, enum dma_data_direction direction,
@@ -1472,7 +1468,7 @@ struct rpcrdma_regbuf *
 		--ep->rep_send_count;
 	}
 
-	rc = ia->ri_ops->ro_send(ia, req);
+	rc = frwr_send(ia, req);
 	trace_xprtrdma_post_send(req, rc);
 	if (rc)
 		return -ENOTCONN;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index dc8e178..eccb930 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -66,7 +66,6 @@
  * Interface Adapter -- one per transport instance
  */
 struct rpcrdma_ia {
-	const struct rpcrdma_memreg_ops	*ri_ops;
 	struct ib_device	*ri_device;
 	struct rdma_cm_id 	*ri_id;
 	struct ib_pd		*ri_pd;
@@ -406,7 +405,6 @@ struct rpcrdma_buffer {
 	struct workqueue_struct *rb_completion_wq;
 	struct delayed_work	rb_refresh_worker;
 };
-#define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)
 
 /* rb_flags */
 enum {
@@ -457,34 +455,6 @@ struct rpcrdma_stats {
 };
 
 /*
- * Per-registration mode operations
- */
-struct rpcrdma_xprt;
-struct rpcrdma_memreg_ops {
-	struct rpcrdma_mr_seg *
-			(*ro_map)(struct rpcrdma_xprt *,
-				  struct rpcrdma_mr_seg *, int, bool,
-				  struct rpcrdma_mr **);
-	int		(*ro_send)(struct rpcrdma_ia *ia,
-				   struct rpcrdma_req *req);
-	void		(*ro_reminv)(struct rpcrdma_rep *rep,
-				     struct list_head *mrs);
-	void		(*ro_unmap_sync)(struct rpcrdma_xprt *,
-					 struct list_head *);
-	int		(*ro_open)(struct rpcrdma_ia *,
-				   struct rpcrdma_ep *,
-				   struct rpcrdma_create_data_internal *);
-	size_t		(*ro_maxpages)(struct rpcrdma_xprt *);
-	int		(*ro_init_mr)(struct rpcrdma_ia *,
-				      struct rpcrdma_mr *);
-	void		(*ro_release_mr)(struct rpcrdma_mr *mr);
-	const char	*ro_displayname;
-	const int	ro_send_w_inv_ok;
-};
-
-extern const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops;
-
-/*
  * RPCRDMA transport -- encapsulates the structures above for
  * integration with RPC.
  *
@@ -535,7 +505,6 @@ struct rpcrdma_xprt {
 int rpcrdma_ia_open(struct rpcrdma_xprt *xprt);
 void rpcrdma_ia_remove(struct rpcrdma_ia *ia);
 void rpcrdma_ia_close(struct rpcrdma_ia *);
-bool frwr_is_supported(struct rpcrdma_ia *);
 
 /*
  * Endpoint calls - xprtrdma/verbs.c
@@ -601,6 +570,23 @@ struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(size_t, enum dma_data_direction,
 	return writing ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
 }
 
+/* Memory registration calls xprtrdma/frwr_ops.c
+ */
+bool frwr_is_supported(struct rpcrdma_ia *);
+int frwr_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+	      struct rpcrdma_create_data_internal *cdata);
+int frwr_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr);
+void frwr_release_mr(struct rpcrdma_mr *mr);
+size_t frwr_maxpages(struct rpcrdma_xprt *r_xprt);
+struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
+				struct rpcrdma_mr_seg *seg,
+				int nsegs, bool writing,
+				struct rpcrdma_mr **mr);
+int frwr_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req);
+void frwr_reminv(struct rpcrdma_rep *rep, struct list_head *mrs);
+void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt,
+		     struct list_head *mrs);
+
 /*
  * RPC/RDMA protocol calls - xprtrdma/rpc_rdma.c
  */



* [PATCH v4 11/30] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (9 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 10/30] xprtrdma: Remove rpcrdma_memreg_ops Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 12/30] NFS: Make "port=" mount option optional for RDMA mounts Chuck Lever
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Place the associated RPC transaction's XID in the upper 32 bits of
each RDMA segment's rdma_offset field. There are two reasons to do
this:

- The R_key only has 8 bits that are different from registration to
  registration. The XID adds more uniqueness to each RDMA segment to
  reduce the likelihood of a software bug on the server reading from
  or writing into memory it's not supposed to.

- On-the-wire RDMA Read and Write requests do not otherwise carry
  any identifier that matches them up to an RPC. The XID in the
  upper 32 bits will act as an eye-catcher in network captures.
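
Roughly, the on-the-wire offset for each segment ends up laid out as
(illustrative pseudo-code, not the exact source):

	offset = ((u64)cpu_to_be32(rqst->rq_xid) << 32) |
		 (ibmr->iova & 0xffffffff);

so the low 32 bits still come from the registered iova while the
upper 32 bits echo the RPC's XID.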

Suggested-by: Tom Talpey <ttalpey@microsoft.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c  |    5 ++++-
 net/sunrpc/xprtrdma/rpc_rdma.c  |    6 +++---
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index fb0944d..97f88bb 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -379,6 +379,7 @@ size_t frwr_maxpages(struct rpcrdma_xprt *r_xprt)
  * @seg: memory region co-ordinates
  * @nsegs: number of segments remaining
  * @writing: true when RDMA Write will be used
+ * @xid: XID of RPC using the registered memory
  * @out: initialized MR
  *
  * Prepare a REG_MR Work Request to register a memory region
@@ -389,7 +390,7 @@ size_t frwr_maxpages(struct rpcrdma_xprt *r_xprt)
  */
 struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
 				struct rpcrdma_mr_seg *seg,
-				int nsegs, bool writing,
+				int nsegs, bool writing, u32 xid,
 				struct rpcrdma_mr **out)
 {
 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
@@ -444,6 +445,8 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
 	if (unlikely(n != mr->mr_nents))
 		goto out_mapmr_err;
 
+	ibmr->iova &= 0x00000000ffffffff;
+	ibmr->iova |= ((u64)cpu_to_be32(xid)) << 32;
 	key = (u8)(ibmr->rkey & 0x000000FF);
 	ib_update_fast_reg_key(ibmr, ++key);
 
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 2a2023d..3804fb3 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -356,7 +356,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		return nsegs;
 
 	do {
-		seg = frwr_map(r_xprt, seg, nsegs, false, &mr);
+		seg = frwr_map(r_xprt, seg, nsegs, false, rqst->rq_xid, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
@@ -413,7 +413,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 
 	nchunks = 0;
 	do {
-		seg = frwr_map(r_xprt, seg, nsegs, true, &mr);
+		seg = frwr_map(r_xprt, seg, nsegs, true, rqst->rq_xid, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
@@ -470,7 +470,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 
 	nchunks = 0;
 	do {
-		seg = frwr_map(r_xprt, seg, nsegs, true, &mr);
+		seg = frwr_map(r_xprt, seg, nsegs, true, rqst->rq_xid, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index eccb930..56b299f 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -580,7 +580,7 @@ int frwr_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
 size_t frwr_maxpages(struct rpcrdma_xprt *r_xprt);
 struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
 				struct rpcrdma_mr_seg *seg,
-				int nsegs, bool writing,
+				int nsegs, bool writing, u32 xid,
 				struct rpcrdma_mr **mr);
 int frwr_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req);
 void frwr_reminv(struct rpcrdma_rep *rep, struct list_head *mrs);



* [PATCH v4 12/30] NFS: Make "port=" mount option optional for RDMA mounts
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (10 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 11/30] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 13/30] xprtrdma: Recognize XDRBUF_SPARSE_PAGES Chuck Lever
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Having to specify "proto=rdma,port=20049" is cumbersome.

RFC 8267 Section 6.3 requires NFSv4 clients to use "the alternative
well-known port number", which is 20049. Make the use of the well-
known port number automatic, just as it is for NFS/TCP and port
2049.

For NFSv2/3, Section 4.2 allows clients to simply choose 20049 as
the default or use rpcbind. I don't know of an NFS/RDMA server
implementation that registers its NFS/RDMA service with rpcbind,
so automatically choosing 20049 seems like the better choice. The
other widely-deployed NFS/RDMA client, Solaris, also uses 20049
as the default port.
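
With this change an RDMA mount no longer needs an explicit "port="
option; for example, an invocation along the lines of

	mount -t nfs -o proto=rdma server:/export /mnt

should now default to port 20049 (the hostname and paths here are
only placeholders).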

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/super.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index ac4b2f0..22247c2 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2168,7 +2168,10 @@ static int nfs_validate_text_mount_data(void *options,
 
 	if (args->version == 4) {
 #if IS_ENABLED(CONFIG_NFS_V4)
-		port = NFS_PORT;
+		if (args->nfs_server.protocol == XPRT_TRANSPORT_RDMA)
+			port = NFS_RDMA_PORT;
+		else
+			port = NFS_PORT;
 		max_namelen = NFS4_MAXNAMLEN;
 		max_pathlen = NFS4_MAXPATHLEN;
 		nfs_validate_transport_protocol(args);
@@ -2178,8 +2181,11 @@ static int nfs_validate_text_mount_data(void *options,
 #else
 		goto out_v4_not_compiled;
 #endif /* CONFIG_NFS_V4 */
-	} else
+	} else {
 		nfs_set_mount_transport_protocol(args);
+		if (args->nfs_server.protocol == XPRT_TRANSPORT_RDMA)
+			port = NFS_RDMA_PORT;
+	}
 
 	nfs_set_port(sap, &args->nfs_server.port, port);
 



* [PATCH v4 13/30] xprtrdma: Recognize XDRBUF_SPARSE_PAGES
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (11 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 12/30] NFS: Make "port=" mount option optional for RDMA mounts Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 14/30] xprtrdma: Remove request_module from backchannel Chuck Lever
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Commit 431f6eb3570f ("SUNRPC: Add a label for RPC calls that require
allocation on receive") didn't update similar logic in rpc_rdma.c.
I don't think this is a bug per se; the commit just adds more
careful checking for broken upper-layer behavior.
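
For context, an upper layer that wants the transport to allocate
receive pages lazily tags its reply buffer with the flag tested
below; a hypothetical caller looks something like:

	/* hypothetical upper-layer usage (e.g. a large ACL reply) */
	req->rq_rcv_buf.flags |= XDRBUF_SPARSE_PAGES;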

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 3804fb3..939f84a 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -218,11 +218,12 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	ppages = xdrbuf->pages + (xdrbuf->page_base >> PAGE_SHIFT);
 	page_base = offset_in_page(xdrbuf->page_base);
 	while (len) {
-		if (unlikely(!*ppages)) {
-			/* XXX: Certain upper layer operations do
-			 *	not provide receive buffer pages.
-			 */
-			*ppages = alloc_page(GFP_ATOMIC);
+		/* ACL likes to be lazy in allocating pages - ACLs
+		 * are small by default but can get huge.
+		 */
+		if (unlikely(xdrbuf->flags & XDRBUF_SPARSE_PAGES)) {
+			if (!*ppages)
+				*ppages = alloc_page(GFP_ATOMIC);
 			if (!*ppages)
 				return -ENOBUFS;
 		}



* [PATCH v4 14/30] xprtrdma: Remove request_module from backchannel
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (12 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 13/30] xprtrdma: Recognize XDRBUF_SPARSE_PAGES Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 15/30] xprtrdma: Expose transport header errors Chuck Lever
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Since commit ffe1f0df5862 ("rpcrdma: Merge svcrdma and xprtrdma
modules into one"), the forward and backchannel components are part
of the same kernel module. A separate request_module() call in the
backchannel code is no longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/backchannel.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 5d462e8..9cb96a5 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -5,7 +5,6 @@
  * Support for backward direction RPCs on RPC/RDMA.
  */
 
-#include <linux/module.h>
 #include <linux/sunrpc/xprt.h>
 #include <linux/sunrpc/svc.h>
 #include <linux/sunrpc/svc_xprt.h>
@@ -101,7 +100,6 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
 		goto out_free;
 
 	r_xprt->rx_buf.rb_bc_srv_max_requests = reqs;
-	request_module("svcrdma");
 	trace_xprtrdma_cb_setup(r_xprt, reqs);
 	return 0;
 



* [PATCH v4 15/30] xprtrdma: Expose transport header errors
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (13 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 14/30] xprtrdma: Remove request_module from backchannel Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 16/30] xprtrdma: Simplify locking that protects the rl_allreqs list Chuck Lever
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

For better observability of parsing errors, return the error code
generated in the decoders to the upper-layer consumer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 939f84a..8de0b9f 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1246,7 +1246,6 @@ void rpcrdma_complete_rqst(struct rpcrdma_rep *rep)
 out_badheader:
 	trace_xprtrdma_reply_hdr(rep);
 	r_xprt->rx_stats.bad_reply_count++;
-	status = -EIO;
 	goto out;
 }
 



* [PATCH v4 16/30] xprtrdma: Simplify locking that protects the rl_allreqs list
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (14 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 15/30] xprtrdma: Expose transport header errors Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 17/30] xprtrdma: Cull dprintk() call sites Chuck Lever
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up: There's little chance of contention between the use of
rb_lock and rb_reqslock, so merge the two. This avoids having to
take both in some (possibly future) cases.

Transport tear-down is already serialized, thus there is no need for
locking at all when destroying rpcrdma_reqs.
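
The resulting pattern, condensed from the hunks below: creation
still serializes against other users of the buffer lists, while
destruction relies on tear-down being single-threaded:

	/* rpcrdma_create_req(): add under the merged rb_lock */
	spin_lock(&buffer->rb_lock);
	list_add(&req->rl_all, &buffer->rb_allreqs);
	spin_unlock(&buffer->rb_lock);

	/* rpcrdma_req_destroy(): tear-down is serialized, no lock */
	list_del(&req->rl_all);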

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/backchannel.c |   20 +++-----------------
 net/sunrpc/xprtrdma/verbs.c       |   31 +++++++++++++++++--------------
 net/sunrpc/xprtrdma/xprt_rdma.h   |    7 +++----
 3 files changed, 23 insertions(+), 35 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 9cb96a5..af8249b 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -19,29 +19,16 @@
 
 #undef RPCRDMA_BACKCHANNEL_DEBUG
 
-static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
-				 struct rpc_rqst *rqst)
-{
-	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
-	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
-
-	spin_lock(&buf->rb_reqslock);
-	list_del(&req->rl_all);
-	spin_unlock(&buf->rb_reqslock);
-
-	rpcrdma_destroy_req(req);
-}
-
 static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
 				 unsigned int count)
 {
 	struct rpc_xprt *xprt = &r_xprt->rx_xprt;
+	struct rpcrdma_req *req;
 	struct rpc_rqst *rqst;
 	unsigned int i;
 
 	for (i = 0; i < (count << 1); i++) {
 		struct rpcrdma_regbuf *rb;
-		struct rpcrdma_req *req;
 		size_t size;
 
 		req = rpcrdma_create_req(r_xprt);
@@ -67,7 +54,7 @@ static int rpcrdma_bc_setup_reqs(struct rpcrdma_xprt *r_xprt,
 	return 0;
 
 out_fail:
-	rpcrdma_bc_free_rqst(r_xprt, rqst);
+	rpcrdma_req_destroy(req);
 	return -ENOMEM;
 }
 
@@ -225,7 +212,6 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
  */
 void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
 {
-	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
 	struct rpc_rqst *rqst, *tmp;
 
 	spin_lock(&xprt->bc_pa_lock);
@@ -233,7 +219,7 @@ void xprt_rdma_bc_destroy(struct rpc_xprt *xprt, unsigned int reqs)
 		list_del(&rqst->rq_bc_pa_list);
 		spin_unlock(&xprt->bc_pa_lock);
 
-		rpcrdma_bc_free_rqst(r_xprt, rqst);
+		rpcrdma_req_destroy(rpcr_to_rdmar(rqst));
 
 		spin_lock(&xprt->bc_pa_lock);
 	}
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index d68efaf..a6ab216 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1043,9 +1043,9 @@ struct rpcrdma_req *
 	req->rl_buffer = buffer;
 	INIT_LIST_HEAD(&req->rl_registered);
 
-	spin_lock(&buffer->rb_reqslock);
+	spin_lock(&buffer->rb_lock);
 	list_add(&req->rl_all, &buffer->rb_allreqs);
-	spin_unlock(&buffer->rb_reqslock);
+	spin_unlock(&buffer->rb_lock);
 	return req;
 }
 
@@ -1113,7 +1113,6 @@ struct rpcrdma_req *
 
 	INIT_LIST_HEAD(&buf->rb_send_bufs);
 	INIT_LIST_HEAD(&buf->rb_allreqs);
-	spin_lock_init(&buf->rb_reqslock);
 	for (i = 0; i < buf->rb_max_requests; i++) {
 		struct rpcrdma_req *req;
 
@@ -1154,9 +1153,18 @@ struct rpcrdma_req *
 	kfree(rep);
 }
 
+/**
+ * rpcrdma_req_destroy - Destroy an rpcrdma_req object
+ * @req: unused object to be destroyed
+ *
+ * This function assumes that the caller prevents concurrent device
+ * unload and transport tear-down.
+ */
 void
-rpcrdma_destroy_req(struct rpcrdma_req *req)
+rpcrdma_req_destroy(struct rpcrdma_req *req)
 {
+	list_del(&req->rl_all);
+
 	rpcrdma_free_regbuf(req->rl_recvbuf);
 	rpcrdma_free_regbuf(req->rl_sendbuf);
 	rpcrdma_free_regbuf(req->rl_rdmabuf);
@@ -1214,19 +1222,14 @@ struct rpcrdma_req *
 		rpcrdma_destroy_rep(rep);
 	}
 
-	spin_lock(&buf->rb_reqslock);
-	while (!list_empty(&buf->rb_allreqs)) {
+	while (!list_empty(&buf->rb_send_bufs)) {
 		struct rpcrdma_req *req;
 
-		req = list_first_entry(&buf->rb_allreqs,
-				       struct rpcrdma_req, rl_all);
-		list_del(&req->rl_all);
-
-		spin_unlock(&buf->rb_reqslock);
-		rpcrdma_destroy_req(req);
-		spin_lock(&buf->rb_reqslock);
+		req = list_first_entry(&buf->rb_send_bufs,
+				       struct rpcrdma_req, rl_list);
+		list_del(&req->rl_list);
+		rpcrdma_req_destroy(req);
 	}
-	spin_unlock(&buf->rb_reqslock);
 
 	rpcrdma_mrs_destroy(buf);
 }
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 56b299f..6e104cd 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -392,14 +392,13 @@ struct rpcrdma_buffer {
 	spinlock_t		rb_lock;	/* protect buf lists */
 	struct list_head	rb_send_bufs;
 	struct list_head	rb_recv_bufs;
+	struct list_head	rb_allreqs;
+
 	unsigned long		rb_flags;
 	u32			rb_max_requests;
 	u32			rb_credits;	/* most recent credit grant */
 
 	u32			rb_bc_srv_max_requests;
-	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
-	struct list_head	rb_allreqs;
-
 	u32			rb_bc_max_requests;
 
 	struct workqueue_struct *rb_completion_wq;
@@ -522,7 +521,7 @@ int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
  * Buffer calls - xprtrdma/verbs.c
  */
 struct rpcrdma_req *rpcrdma_create_req(struct rpcrdma_xprt *);
-void rpcrdma_destroy_req(struct rpcrdma_req *);
+void rpcrdma_req_destroy(struct rpcrdma_req *req);
 int rpcrdma_buffer_create(struct rpcrdma_xprt *);
 void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
 struct rpcrdma_sendctx *rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf);



* [PATCH v4 17/30] xprtrdma: Cull dprintk() call sites
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (15 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 16/30] xprtrdma: Simplify locking that protects the rl_allreqs list Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:40 ` [PATCH v4 18/30] xprtrdma: Remove unused fields from rpcrdma_ia Chuck Lever
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up: Remove dprintk() call sites that report rare or impossible
errors. Leave a few that display high-value, low-noise status
information.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/backchannel.c |    3 ---
 net/sunrpc/xprtrdma/rpc_rdma.c    |   17 ++++++++++-------
 net/sunrpc/xprtrdma/transport.c   |   33 ++++-----------------------------
 net/sunrpc/xprtrdma/verbs.c       |   34 +++++-----------------------------
 4 files changed, 19 insertions(+), 68 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index af8249b..64e1708 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -235,9 +235,6 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
 	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
 	struct rpc_xprt *xprt = rqst->rq_xprt;
 
-	dprintk("RPC:       %s: freeing rqst %p (req %p)\n",
-		__func__, rqst, req);
-
 	rpcrdma_recv_buffer_put(req->rl_reply);
 	req->rl_reply = NULL;
 
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 8de0b9f..5a58769 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1186,17 +1186,20 @@ static int decode_reply_chunk(struct xdr_stream *xdr, u32 *length)
 		p = xdr_inline_decode(xdr, 2 * sizeof(*p));
 		if (!p)
 			break;
-		dprintk("RPC: %5u: %s: server reports version error (%u-%u)\n",
-			rqst->rq_task->tk_pid, __func__,
-			be32_to_cpup(p), be32_to_cpu(*(p + 1)));
+		dprintk("RPC:       %s: server reports "
+			"version error (%u-%u), xid %08x\n", __func__,
+			be32_to_cpup(p), be32_to_cpu(*(p + 1)),
+			be32_to_cpu(rep->rr_xid));
 		break;
 	case err_chunk:
-		dprintk("RPC: %5u: %s: server reports header decoding error\n",
-			rqst->rq_task->tk_pid, __func__);
+		dprintk("RPC:       %s: server reports "
+			"header decoding error, xid %08x\n", __func__,
+			be32_to_cpu(rep->rr_xid));
 		break;
 	default:
-		dprintk("RPC: %5u: %s: server reports unrecognized error %d\n",
-			rqst->rq_task->tk_pid, __func__, be32_to_cpup(p));
+		dprintk("RPC:       %s: server reports "
+			"unrecognized error %d, xid %08x\n", __func__,
+			be32_to_cpup(p), be32_to_cpu(rep->rr_xid));
 	}
 
 	r_xprt->rx_stats.bad_reply_count++;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index f824892..41e4347 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -318,17 +318,12 @@
 	struct sockaddr *sap;
 	int rc;
 
-	if (args->addrlen > sizeof(xprt->addr)) {
-		dprintk("RPC:       %s: address too large\n", __func__);
+	if (args->addrlen > sizeof(xprt->addr))
 		return ERR_PTR(-EBADF);
-	}
 
 	xprt = xprt_alloc(args->net, sizeof(struct rpcrdma_xprt), 0, 0);
-	if (xprt == NULL) {
-		dprintk("RPC:       %s: couldn't allocate rpcrdma_xprt\n",
-			__func__);
+	if (!xprt)
 		return ERR_PTR(-ENOMEM);
-	}
 
 	/* 60 second timeout, no retries */
 	xprt->timeout = &xprt_rdma_default_timeout;
@@ -446,8 +441,6 @@
 
 	might_sleep();
 
-	dprintk("RPC:       %s: closing xprt %p\n", __func__, xprt);
-
 	/* Prevent marshaling and sending of new requests */
 	xprt_clear_connected(xprt);
 
@@ -854,24 +847,15 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
 
 void xprt_rdma_cleanup(void)
 {
-	int rc;
-
-	dprintk("RPCRDMA Module Removed, deregister RPC RDMA transport\n");
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 	if (sunrpc_table_header) {
 		unregister_sysctl_table(sunrpc_table_header);
 		sunrpc_table_header = NULL;
 	}
 #endif
-	rc = xprt_unregister_transport(&xprt_rdma);
-	if (rc)
-		dprintk("RPC:       %s: xprt_unregister returned %i\n",
-			__func__, rc);
 
-	rc = xprt_unregister_transport(&xprt_rdma_bc);
-	if (rc)
-		dprintk("RPC:       %s: xprt_unregister(bc) returned %i\n",
-			__func__, rc);
+	xprt_unregister_transport(&xprt_rdma);
+	xprt_unregister_transport(&xprt_rdma_bc);
 }
 
 int xprt_rdma_init(void)
@@ -888,15 +872,6 @@ int xprt_rdma_init(void)
 		return rc;
 	}
 
-	dprintk("RPCRDMA Module Init, register RPC RDMA transport\n");
-
-	dprintk("Defaults:\n");
-	dprintk("\tSlots %d\n"
-		"\tMaxInlineRead %d\n\tMaxInlineWrite %d\n",
-		xprt_rdma_slot_table_entries,
-		xprt_rdma_max_inline_read, xprt_rdma_max_inline_write);
-	dprintk("\tPadding 0\n\tMemreg %d\n", xprt_rdma_memreg_strategy);
-
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 	if (!sunrpc_table_header)
 		sunrpc_table_header = register_sysctl_table(sunrpc_table);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index a6ab216..64672d0 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -309,22 +309,15 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 
 	id = rdma_create_id(xprt->rx_xprt.xprt_net, rpcrdma_cm_event_handler,
 			    xprt, RDMA_PS_TCP, IB_QPT_RC);
-	if (IS_ERR(id)) {
-		rc = PTR_ERR(id);
-		dprintk("RPC:       %s: rdma_create_id() failed %i\n",
-			__func__, rc);
+	if (IS_ERR(id))
 		return id;
-	}
 
 	ia->ri_async_rc = -ETIMEDOUT;
 	rc = rdma_resolve_addr(id, NULL,
 			       (struct sockaddr *)&xprt->rx_xprt.addr,
 			       RDMA_RESOLVE_TIMEOUT);
-	if (rc) {
-		dprintk("RPC:       %s: rdma_resolve_addr() failed %i\n",
-			__func__, rc);
+	if (rc)
 		goto out;
-	}
 	rc = wait_for_completion_interruptible_timeout(&ia->ri_done, wtimeout);
 	if (rc < 0) {
 		trace_xprtrdma_conn_tout(xprt);
@@ -337,11 +330,8 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 
 	ia->ri_async_rc = -ETIMEDOUT;
 	rc = rdma_resolve_route(id, RDMA_RESOLVE_TIMEOUT);
-	if (rc) {
-		dprintk("RPC:       %s: rdma_resolve_route() failed %i\n",
-			__func__, rc);
+	if (rc)
 		goto out;
-	}
 	rc = wait_for_completion_interruptible_timeout(&ia->ri_done, wtimeout);
 	if (rc < 0) {
 		trace_xprtrdma_conn_tout(xprt);
@@ -540,8 +530,6 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 			     1, IB_POLL_WORKQUEUE);
 	if (IS_ERR(sendcq)) {
 		rc = PTR_ERR(sendcq);
-		dprintk("RPC:       %s: failed to create send CQ: %i\n",
-			__func__, rc);
 		goto out1;
 	}
 
@@ -550,8 +538,6 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 			     0, IB_POLL_WORKQUEUE);
 	if (IS_ERR(recvcq)) {
 		rc = PTR_ERR(recvcq);
-		dprintk("RPC:       %s: failed to create recv CQ: %i\n",
-			__func__, rc);
 		goto out2;
 	}
 
@@ -691,11 +677,8 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 	}
 
 	err = rdma_create_qp(id, ia->ri_pd, &ep->rep_attr);
-	if (err) {
-		dprintk("RPC:       %s: rdma_create_qp returned %d\n",
-			__func__, err);
+	if (err)
 		goto out_destroy;
-	}
 
 	/* Atomically replace the transport's ID and QP. */
 	rc = 0;
@@ -726,8 +709,6 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 		dprintk("RPC:       %s: connecting...\n", __func__);
 		rc = rdma_create_qp(ia->ri_id, ia->ri_pd, &ep->rep_attr);
 		if (rc) {
-			dprintk("RPC:       %s: rdma_create_qp failed %i\n",
-				__func__, rc);
 			rc = -ENETUNREACH;
 			goto out_noupdate;
 		}
@@ -749,11 +730,8 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 	rpcrdma_post_recvs(r_xprt, true);
 
 	rc = rdma_connect(ia->ri_id, &ep->rep_remote_cma);
-	if (rc) {
-		dprintk("RPC:       %s: rdma_connect() failed with %i\n",
-				__func__, rc);
+	if (rc)
 		goto out;
-	}
 
 	wait_event_interruptible(ep->rep_connect_wait, ep->rep_connected != 0);
 	if (ep->rep_connected <= 0) {
@@ -1088,8 +1066,6 @@ struct rpcrdma_req *
 out_free:
 	kfree(rep);
 out:
-	dprintk("RPC:       %s: reply buffer %d alloc failed\n",
-		__func__, rc);
 	return rc;
 }
 



* [PATCH v4 18/30] xprtrdma: Remove unused fields from rpcrdma_ia
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (16 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 17/30] xprtrdma: Cull dprintk() call sites Chuck Lever
@ 2018-12-17 16:40 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 19/30] xprtrdma: Clean up of xprtrdma chunk trace points Chuck Lever
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:40 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up. The last use of these fields was in commit 173b8f49b3af
("xprtrdma: Demote "connect" log messages") .

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 6e104cd..375c7be 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -80,8 +80,6 @@ struct rpcrdma_ia {
 	bool			ri_implicit_roundup;
 	enum ib_mr_type		ri_mrtype;
 	unsigned long		ri_flags;
-	struct ib_qp_attr	ri_qp_attr;
-	struct ib_qp_init_attr	ri_qp_init_attr;
 };
 
 enum {



* [PATCH v4 19/30] xprtrdma: Clean up of xprtrdma chunk trace points
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (17 preceding siblings ...)
  2018-12-17 16:40 ` [PATCH v4 18/30] xprtrdma: Remove unused fields from rpcrdma_ia Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 20/30] xprtrdma: Relocate the xprtrdma_mr_map " Chuck Lever
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The chunk-related trace points capture nearly the same information
as the MR-related trace points, so drop the duplicated MR pointer
from their output.

Also, rename them so globbing can be used to enable or disable
these trace points more easily.
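
For example, after this rename all three chunk events should be
selectable with a single glob such as "rpcrdma:xprtrdma_chunk_*"
through the kernel's trace event interface (illustrative pattern;
exact syntax depends on the tracing front end in use).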

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h |   42 +++++++++++++++++++++++++---------------
 net/sunrpc/xprtrdma/rpc_rdma.c |    6 +++---
 2 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index 2efe2d7..e9fbf7d 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -97,7 +97,6 @@
 	TP_STRUCT__entry(
 		__field(unsigned int, task_id)
 		__field(unsigned int, client_id)
-		__field(const void *, mr)
 		__field(unsigned int, pos)
 		__field(int, nents)
 		__field(u32, handle)
@@ -109,7 +108,6 @@
 	TP_fast_assign(
 		__entry->task_id = task->tk_pid;
 		__entry->client_id = task->tk_client->cl_clid;
-		__entry->mr = mr;
 		__entry->pos = pos;
 		__entry->nents = mr->mr_nents;
 		__entry->handle = mr->mr_handle;
@@ -118,8 +116,8 @@
 		__entry->nsegs = nsegs;
 	),
 
-	TP_printk("task:%u@%u mr=%p pos=%u %u@0x%016llx:0x%08x (%s)",
-		__entry->task_id, __entry->client_id, __entry->mr,
+	TP_printk("task:%u@%u pos=%u %u@0x%016llx:0x%08x (%s)",
+		__entry->task_id, __entry->client_id,
 		__entry->pos, __entry->length,
 		(unsigned long long)__entry->offset, __entry->handle,
 		__entry->nents < __entry->nsegs ? "more" : "last"
@@ -127,7 +125,7 @@
 );
 
 #define DEFINE_RDCH_EVENT(name)						\
-		DEFINE_EVENT(xprtrdma_rdch_event, name,			\
+		DEFINE_EVENT(xprtrdma_rdch_event, xprtrdma_chunk_##name,\
 				TP_PROTO(				\
 					const struct rpc_task *task,	\
 					unsigned int pos,		\
@@ -148,7 +146,6 @@
 	TP_STRUCT__entry(
 		__field(unsigned int, task_id)
 		__field(unsigned int, client_id)
-		__field(const void *, mr)
 		__field(int, nents)
 		__field(u32, handle)
 		__field(u32, length)
@@ -159,7 +156,6 @@
 	TP_fast_assign(
 		__entry->task_id = task->tk_pid;
 		__entry->client_id = task->tk_client->cl_clid;
-		__entry->mr = mr;
 		__entry->nents = mr->mr_nents;
 		__entry->handle = mr->mr_handle;
 		__entry->length = mr->mr_length;
@@ -167,8 +163,8 @@
 		__entry->nsegs = nsegs;
 	),
 
-	TP_printk("task:%u@%u mr=%p %u@0x%016llx:0x%08x (%s)",
-		__entry->task_id, __entry->client_id, __entry->mr,
+	TP_printk("task:%u@%u %u@0x%016llx:0x%08x (%s)",
+		__entry->task_id, __entry->client_id,
 		__entry->length, (unsigned long long)__entry->offset,
 		__entry->handle,
 		__entry->nents < __entry->nsegs ? "more" : "last"
@@ -176,7 +172,7 @@
 );
 
 #define DEFINE_WRCH_EVENT(name)						\
-		DEFINE_EVENT(xprtrdma_wrch_event, name,			\
+		DEFINE_EVENT(xprtrdma_wrch_event, xprtrdma_chunk_##name,\
 				TP_PROTO(				\
 					const struct rpc_task *task,	\
 					struct rpcrdma_mr *mr,		\
@@ -234,6 +230,18 @@
 				),					\
 				TP_ARGS(wc, frwr))
 
+TRACE_DEFINE_ENUM(DMA_BIDIRECTIONAL);
+TRACE_DEFINE_ENUM(DMA_TO_DEVICE);
+TRACE_DEFINE_ENUM(DMA_FROM_DEVICE);
+TRACE_DEFINE_ENUM(DMA_NONE);
+
+#define xprtrdma_show_direction(x)					\
+		__print_symbolic(x,					\
+				{ DMA_BIDIRECTIONAL, "BIDIR" },		\
+				{ DMA_TO_DEVICE, "TO_DEVICE" },		\
+				{ DMA_FROM_DEVICE, "FROM_DEVICE" },	\
+				{ DMA_NONE, "NONE" })
+
 DECLARE_EVENT_CLASS(xprtrdma_mr,
 	TP_PROTO(
 		const struct rpcrdma_mr *mr
@@ -246,6 +254,7 @@
 		__field(u32, handle)
 		__field(u32, length)
 		__field(u64, offset)
+		__field(u32, dir)
 	),
 
 	TP_fast_assign(
@@ -253,12 +262,13 @@
 		__entry->handle = mr->mr_handle;
 		__entry->length = mr->mr_length;
 		__entry->offset = mr->mr_offset;
+		__entry->dir    = mr->mr_dir;
 	),
 
-	TP_printk("mr=%p %u@0x%016llx:0x%08x",
+	TP_printk("mr=%p %u@0x%016llx:0x%08x (%s)",
 		__entry->mr, __entry->length,
-		(unsigned long long)__entry->offset,
-		__entry->handle
+		(unsigned long long)__entry->offset, __entry->handle,
+		xprtrdma_show_direction(__entry->dir)
 	)
 );
 
@@ -437,9 +447,9 @@
 
 DEFINE_RXPRT_EVENT(xprtrdma_nomrs);
 
-DEFINE_RDCH_EVENT(xprtrdma_read_chunk);
-DEFINE_WRCH_EVENT(xprtrdma_write_chunk);
-DEFINE_WRCH_EVENT(xprtrdma_reply_chunk);
+DEFINE_RDCH_EVENT(read);
+DEFINE_WRCH_EVENT(write);
+DEFINE_WRCH_EVENT(reply);
 
 TRACE_DEFINE_ENUM(rpcrdma_noch);
 TRACE_DEFINE_ENUM(rpcrdma_readch);
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 5a58769..54fbd70 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -365,7 +365,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		if (encode_read_segment(xdr, mr, pos) < 0)
 			return -EMSGSIZE;
 
-		trace_xprtrdma_read_chunk(rqst->rq_task, pos, mr, nsegs);
+		trace_xprtrdma_chunk_read(rqst->rq_task, pos, mr, nsegs);
 		r_xprt->rx_stats.read_chunk_count++;
 		nsegs -= mr->mr_nents;
 	} while (nsegs);
@@ -422,7 +422,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		if (encode_rdma_segment(xdr, mr) < 0)
 			return -EMSGSIZE;
 
-		trace_xprtrdma_write_chunk(rqst->rq_task, mr, nsegs);
+		trace_xprtrdma_chunk_write(rqst->rq_task, mr, nsegs);
 		r_xprt->rx_stats.write_chunk_count++;
 		r_xprt->rx_stats.total_rdma_request += mr->mr_length;
 		nchunks++;
@@ -479,7 +479,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 		if (encode_rdma_segment(xdr, mr) < 0)
 			return -EMSGSIZE;
 
-		trace_xprtrdma_reply_chunk(rqst->rq_task, mr, nsegs);
+		trace_xprtrdma_chunk_reply(rqst->rq_task, mr, nsegs);
 		r_xprt->rx_stats.reply_chunk_count++;
 		r_xprt->rx_stats.total_rdma_request += mr->mr_length;
 		nchunks++;



* [PATCH v4 20/30] xprtrdma: Relocate the xprtrdma_mr_map trace points
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (18 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 19/30] xprtrdma: Clean up of xprtrdma chunk trace points Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 21/30] xprtrdma: Add trace points for calls to transport switch methods Chuck Lever
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The mr_map trace points were capturing information about the previous
use of the MR rather than about the segment that was just mapped.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 97f88bb..1f508f4 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -438,7 +438,6 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
 	mr->mr_nents = ib_dma_map_sg(ia->ri_device, mr->mr_sg, i, mr->mr_dir);
 	if (!mr->mr_nents)
 		goto out_dmamap_err;
-	trace_xprtrdma_mr_map(mr);
 
 	ibmr = frwr->fr_mr;
 	n = ib_map_mr_sg(ibmr, mr->mr_sg, mr->mr_nents, NULL, PAGE_SIZE);
@@ -460,6 +459,7 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
 	mr->mr_handle = ibmr->rkey;
 	mr->mr_length = ibmr->length;
 	mr->mr_offset = ibmr->iova;
+	trace_xprtrdma_mr_map(mr);
 
 	*out = mr;
 	return seg;



* [PATCH v4 21/30] xprtrdma: Add trace points for calls to transport switch methods
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (19 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 20/30] xprtrdma: Relocate the xprtrdma_mr_map " Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 22/30] xprtrdma: Trace mapping, alloc, and dereg failures Chuck Lever
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Name them "trace_xprtrdma_op_*" so they can be easily enabled as a
group. No trace point is added where the generic layer already has
observability.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h  |   10 ++++++----
 net/sunrpc/xprtrdma/transport.c |   18 +++++++++++-------
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index e9fbf7d..3d068bb 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -381,11 +381,13 @@
 DEFINE_RXPRT_EVENT(xprtrdma_conn_start);
 DEFINE_RXPRT_EVENT(xprtrdma_conn_tout);
 DEFINE_RXPRT_EVENT(xprtrdma_create);
-DEFINE_RXPRT_EVENT(xprtrdma_destroy);
+DEFINE_RXPRT_EVENT(xprtrdma_op_destroy);
 DEFINE_RXPRT_EVENT(xprtrdma_remove);
 DEFINE_RXPRT_EVENT(xprtrdma_reinsert);
 DEFINE_RXPRT_EVENT(xprtrdma_reconnect);
-DEFINE_RXPRT_EVENT(xprtrdma_inject_dsc);
+DEFINE_RXPRT_EVENT(xprtrdma_op_inject_dsc);
+DEFINE_RXPRT_EVENT(xprtrdma_op_close);
+DEFINE_RXPRT_EVENT(xprtrdma_op_connect);
 
 TRACE_EVENT(xprtrdma_qp_event,
 	TP_PROTO(
@@ -834,7 +836,7 @@
  ** Allocation/release of rpcrdma_reqs and rpcrdma_reps
  **/
 
-TRACE_EVENT(xprtrdma_allocate,
+TRACE_EVENT(xprtrdma_op_allocate,
 	TP_PROTO(
 		const struct rpc_task *task,
 		const struct rpcrdma_req *req
@@ -864,7 +866,7 @@
 	)
 );
 
-TRACE_EVENT(xprtrdma_rpc_done,
+TRACE_EVENT(xprtrdma_op_free,
 	TP_PROTO(
 		const struct rpc_task *task,
 		const struct rpcrdma_req *req
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 41e4347..cb855b2 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -268,7 +268,7 @@
 {
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
 
-	trace_xprtrdma_inject_dsc(r_xprt);
+	trace_xprtrdma_op_inject_dsc(r_xprt);
 	rdma_disconnect(r_xprt->rx_ia.ri_id);
 }
 
@@ -284,7 +284,7 @@
 {
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
 
-	trace_xprtrdma_destroy(r_xprt);
+	trace_xprtrdma_op_destroy(r_xprt);
 
 	cancel_delayed_work_sync(&r_xprt->rx_connect_worker);
 
@@ -418,7 +418,7 @@
 out2:
 	rpcrdma_ia_close(&new_xprt->rx_ia);
 out1:
-	trace_xprtrdma_destroy(new_xprt);
+	trace_xprtrdma_op_destroy(new_xprt);
 	xprt_rdma_free_addresses(xprt);
 	xprt_free(xprt);
 	return ERR_PTR(rc);
@@ -428,7 +428,8 @@
  * xprt_rdma_close - close a transport connection
  * @xprt: transport context
  *
- * Called during transport shutdown, reconnect, or device removal.
+ * Called during autoclose or device removal.
+ *
  * Caller holds @xprt's send lock to prevent activity on this
  * transport while the connection is torn down.
  */
@@ -441,6 +442,8 @@
 
 	might_sleep();
 
+	trace_xprtrdma_op_close(r_xprt);
+
 	/* Prevent marshaling and sending of new requests */
 	xprt_clear_connected(xprt);
 
@@ -526,6 +529,7 @@
 {
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
 
+	trace_xprtrdma_op_connect(r_xprt);
 	if (r_xprt->rx_ep.rep_connected != 0) {
 		/* Reconnect */
 		schedule_delayed_work(&r_xprt->rx_connect_worker,
@@ -660,11 +664,11 @@
 
 	rqst->rq_buffer = req->rl_sendbuf->rg_base;
 	rqst->rq_rbuffer = req->rl_recvbuf->rg_base;
-	trace_xprtrdma_allocate(task, req);
+	trace_xprtrdma_op_allocate(task, req);
 	return 0;
 
 out_fail:
-	trace_xprtrdma_allocate(task, NULL);
+	trace_xprtrdma_op_allocate(task, NULL);
 	return -ENOMEM;
 }
 
@@ -683,7 +687,7 @@
 
 	if (test_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags))
 		rpcrdma_release_rqst(r_xprt, req);
-	trace_xprtrdma_rpc_done(task, req);
+	trace_xprtrdma_op_free(task, req);
 }
 
 /**



* [PATCH v4 22/30] xprtrdma: Trace mapping, alloc, and dereg failures
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (20 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 21/30] xprtrdma: Add trace points for calls to transport switch methods Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 23/30] NFS: Fix NFSv4 symbolic trace point output Chuck Lever
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

These failures are rare, but the trace points can be helpful when
tracking down DMAR faults and other problems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/rpcrdma.h |  136 ++++++++++++++++++++++++++++++++++++++++
 net/sunrpc/xprtrdma/frwr_ops.c |   12 +---
 net/sunrpc/xprtrdma/rpc_rdma.c |    2 -
 net/sunrpc/xprtrdma/verbs.c    |    4 +
 4 files changed, 144 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index 3d068bb..ce528d5 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -10,6 +10,7 @@
 #if !defined(_TRACE_RPCRDMA_H) || defined(TRACE_HEADER_MULTI_READ)
 #define _TRACE_RPCRDMA_H
 
+#include <linux/scatterlist.h>
 #include <linux/tracepoint.h>
 #include <trace/events/rdma.h>
 
@@ -663,12 +664,147 @@
 DEFINE_FRWR_DONE_EVENT(xprtrdma_wc_li);
 DEFINE_FRWR_DONE_EVENT(xprtrdma_wc_li_wake);
 
+TRACE_EVENT(xprtrdma_frwr_alloc,
+	TP_PROTO(
+		const struct rpcrdma_mr *mr,
+		int rc
+	),
+
+	TP_ARGS(mr, rc),
+
+	TP_STRUCT__entry(
+		__field(const void *, mr)
+		__field(int, rc)
+	),
+
+	TP_fast_assign(
+		__entry->mr = mr;
+		__entry->rc	= rc;
+	),
+
+	TP_printk("mr=%p: rc=%d",
+		__entry->mr, __entry->rc
+	)
+);
+
+TRACE_EVENT(xprtrdma_frwr_dereg,
+	TP_PROTO(
+		const struct rpcrdma_mr *mr,
+		int rc
+	),
+
+	TP_ARGS(mr, rc),
+
+	TP_STRUCT__entry(
+		__field(const void *, mr)
+		__field(u32, handle)
+		__field(u32, length)
+		__field(u64, offset)
+		__field(u32, dir)
+		__field(int, rc)
+	),
+
+	TP_fast_assign(
+		__entry->mr = mr;
+		__entry->handle = mr->mr_handle;
+		__entry->length = mr->mr_length;
+		__entry->offset = mr->mr_offset;
+		__entry->dir    = mr->mr_dir;
+		__entry->rc	= rc;
+	),
+
+	TP_printk("mr=%p %u@0x%016llx:0x%08x (%s): rc=%d",
+		__entry->mr, __entry->length,
+		(unsigned long long)__entry->offset, __entry->handle,
+		xprtrdma_show_direction(__entry->dir),
+		__entry->rc
+	)
+);
+
+TRACE_EVENT(xprtrdma_frwr_sgerr,
+	TP_PROTO(
+		const struct rpcrdma_mr *mr,
+		int sg_nents
+	),
+
+	TP_ARGS(mr, sg_nents),
+
+	TP_STRUCT__entry(
+		__field(const void *, mr)
+		__field(u64, addr)
+		__field(u32, dir)
+		__field(int, nents)
+	),
+
+	TP_fast_assign(
+		__entry->mr = mr;
+		__entry->addr = mr->mr_sg->dma_address;
+		__entry->dir = mr->mr_dir;
+		__entry->nents = sg_nents;
+	),
+
+	TP_printk("mr=%p dma addr=0x%llx (%s) sg_nents=%d",
+		__entry->mr, __entry->addr,
+		xprtrdma_show_direction(__entry->dir),
+		__entry->nents
+	)
+);
+
+TRACE_EVENT(xprtrdma_frwr_maperr,
+	TP_PROTO(
+		const struct rpcrdma_mr *mr,
+		int num_mapped
+	),
+
+	TP_ARGS(mr, num_mapped),
+
+	TP_STRUCT__entry(
+		__field(const void *, mr)
+		__field(u64, addr)
+		__field(u32, dir)
+		__field(int, num_mapped)
+		__field(int, nents)
+	),
+
+	TP_fast_assign(
+		__entry->mr = mr;
+		__entry->addr = mr->mr_sg->dma_address;
+		__entry->dir = mr->mr_dir;
+		__entry->num_mapped = num_mapped;
+		__entry->nents = mr->mr_nents;
+	),
+
+	TP_printk("mr=%p dma addr=0x%llx (%s) nents=%d of %d",
+		__entry->mr, __entry->addr,
+		xprtrdma_show_direction(__entry->dir),
+		__entry->num_mapped, __entry->nents
+	)
+);
+
 DEFINE_MR_EVENT(localinv);
 DEFINE_MR_EVENT(map);
 DEFINE_MR_EVENT(unmap);
 DEFINE_MR_EVENT(remoteinv);
 DEFINE_MR_EVENT(recycle);
 
+TRACE_EVENT(xprtrdma_dma_maperr,
+	TP_PROTO(
+		u64 addr
+	),
+
+	TP_ARGS(addr),
+
+	TP_STRUCT__entry(
+		__field(u64, addr)
+	),
+
+	TP_fast_assign(
+		__entry->addr = addr;
+	),
+
+	TP_printk("dma addr=0x%llx\n", __entry->addr)
+);
+
 /**
  ** Reply events
  **/
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 1f508f4..8a0f1a6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -113,8 +113,7 @@ void frwr_release_mr(struct rpcrdma_mr *mr)
 
 	rc = ib_dereg_mr(mr->frwr.fr_mr);
 	if (rc)
-		pr_err("rpcrdma: final ib_dereg_mr for %p returned %i\n",
-		       mr, rc);
+		trace_xprtrdma_frwr_dereg(mr, rc);
 	kfree(mr->mr_sg);
 	kfree(mr);
 }
@@ -177,8 +176,7 @@ int frwr_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr)
 
 out_mr_err:
 	rc = PTR_ERR(frwr->fr_mr);
-	dprintk("RPC:       %s: ib_alloc_mr status %i\n",
-		__func__, rc);
+	trace_xprtrdma_frwr_alloc(mr, rc);
 	return rc;
 
 out_list_err:
@@ -465,15 +463,13 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
 	return seg;
 
 out_dmamap_err:
-	pr_err("rpcrdma: failed to DMA map sg %p sg_nents %d\n",
-	       mr->mr_sg, i);
 	frwr->fr_state = FRWR_IS_INVALID;
+	trace_xprtrdma_frwr_sgerr(mr, i);
 	rpcrdma_mr_put(mr);
 	return ERR_PTR(-EIO);
 
 out_mapmr_err:
-	pr_err("rpcrdma: failed to map mr %p (%d/%d)\n",
-	       frwr->fr_mr, n, mr->mr_nents);
+	trace_xprtrdma_frwr_maperr(mr, n);
 	rpcrdma_mr_recycle(mr);
 	return ERR_PTR(-EIO);
 }
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 54fbd70..062aee9 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -665,7 +665,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 
 out_mapping_err:
 	rpcrdma_unmap_sendctx(sc);
-	pr_err("rpcrdma: Send mapping error\n");
+	trace_xprtrdma_dma_maperr(sge[sge_no].addr);
 	return false;
 }
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 64672d0..b46e2f9 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1392,8 +1392,10 @@ struct rpcrdma_regbuf *
 					    (void *)rb->rg_base,
 					    rdmab_length(rb),
 					    rb->rg_direction);
-	if (ib_dma_mapping_error(device, rdmab_addr(rb)))
+	if (ib_dma_mapping_error(device, rdmab_addr(rb))) {
+		trace_xprtrdma_dma_maperr(rdmab_addr(rb));
 		return false;
+	}
 
 	rb->rg_device = device;
 	rb->rg_iov.lkey = ia->ri_pd->local_dma_lkey;



* [PATCH v4 23/30] NFS: Fix NFSv4 symbolic trace point output
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (21 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 22/30] xprtrdma: Trace mapping, alloc, and dereg failures Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 24/30] SUNRPC: Simplify defining common RPC trace events Chuck Lever
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

These symbolic values were not being displayed in string form.
TRACE_DEFINE_ENUM was missing in many cases. It also turns out that
__print_symbolic wants an unsigned long in its first field, so the
negative errno values being passed in could never match their table
entries.
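
For illustration, here is the shape of the fix reduced to a minimal
sketch (show_my_errors and its two-entry table are made up for this
example; the real table in nfs4trace.h covers every errno and NFS4ERR
code). TRACE_DEFINE_ENUM() exports each symbol's numeric value to the
event format, and the error is negated once so that __print_symbolic()
is handed a non-negative value it can match:

TRACE_DEFINE_ENUM(ENOENT);
TRACE_DEFINE_ENUM(NFS4ERR_DELAY);

/* show_my_errors is a made-up example macro, not part of this patch */
#define show_my_errors(error) \
	__print_symbolic(-(error), \
		{ ENOENT, "ENOENT" }, \
		{ NFS4ERR_DELAY, "DELAY" })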

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/nfs4trace.h |  456 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 313 insertions(+), 143 deletions(-)

diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h
index b1483b3..b4557cf 100644
--- a/fs/nfs/nfs4trace.h
+++ b/fs/nfs/nfs4trace.h
@@ -10,157 +10,302 @@
 
 #include <linux/tracepoint.h>
 
+TRACE_DEFINE_ENUM(EPERM);
+TRACE_DEFINE_ENUM(ENOENT);
+TRACE_DEFINE_ENUM(EIO);
+TRACE_DEFINE_ENUM(ENXIO);
+TRACE_DEFINE_ENUM(EACCES);
+TRACE_DEFINE_ENUM(EEXIST);
+TRACE_DEFINE_ENUM(EXDEV);
+TRACE_DEFINE_ENUM(ENOTDIR);
+TRACE_DEFINE_ENUM(EISDIR);
+TRACE_DEFINE_ENUM(EFBIG);
+TRACE_DEFINE_ENUM(ENOSPC);
+TRACE_DEFINE_ENUM(EROFS);
+TRACE_DEFINE_ENUM(EMLINK);
+TRACE_DEFINE_ENUM(ENAMETOOLONG);
+TRACE_DEFINE_ENUM(ENOTEMPTY);
+TRACE_DEFINE_ENUM(EDQUOT);
+TRACE_DEFINE_ENUM(ESTALE);
+TRACE_DEFINE_ENUM(EBADHANDLE);
+TRACE_DEFINE_ENUM(EBADCOOKIE);
+TRACE_DEFINE_ENUM(ENOTSUPP);
+TRACE_DEFINE_ENUM(ETOOSMALL);
+TRACE_DEFINE_ENUM(EREMOTEIO);
+TRACE_DEFINE_ENUM(EBADTYPE);
+TRACE_DEFINE_ENUM(EAGAIN);
+TRACE_DEFINE_ENUM(ELOOP);
+TRACE_DEFINE_ENUM(EOPNOTSUPP);
+TRACE_DEFINE_ENUM(EDEADLK);
+TRACE_DEFINE_ENUM(ENOMEM);
+TRACE_DEFINE_ENUM(EKEYEXPIRED);
+TRACE_DEFINE_ENUM(ETIMEDOUT);
+TRACE_DEFINE_ENUM(ERESTARTSYS);
+TRACE_DEFINE_ENUM(ECONNREFUSED);
+TRACE_DEFINE_ENUM(ECONNRESET);
+TRACE_DEFINE_ENUM(ENETUNREACH);
+TRACE_DEFINE_ENUM(EHOSTUNREACH);
+TRACE_DEFINE_ENUM(EHOSTDOWN);
+TRACE_DEFINE_ENUM(EPIPE);
+TRACE_DEFINE_ENUM(EPFNOSUPPORT);
+TRACE_DEFINE_ENUM(EPROTONOSUPPORT);
+
+TRACE_DEFINE_ENUM(NFS4_OK);
+TRACE_DEFINE_ENUM(NFS4ERR_ACCESS);
+TRACE_DEFINE_ENUM(NFS4ERR_ATTRNOTSUPP);
+TRACE_DEFINE_ENUM(NFS4ERR_ADMIN_REVOKED);
+TRACE_DEFINE_ENUM(NFS4ERR_BACK_CHAN_BUSY);
+TRACE_DEFINE_ENUM(NFS4ERR_BADCHAR);
+TRACE_DEFINE_ENUM(NFS4ERR_BADHANDLE);
+TRACE_DEFINE_ENUM(NFS4ERR_BADIOMODE);
+TRACE_DEFINE_ENUM(NFS4ERR_BADLAYOUT);
+TRACE_DEFINE_ENUM(NFS4ERR_BADLABEL);
+TRACE_DEFINE_ENUM(NFS4ERR_BADNAME);
+TRACE_DEFINE_ENUM(NFS4ERR_BADOWNER);
+TRACE_DEFINE_ENUM(NFS4ERR_BADSESSION);
+TRACE_DEFINE_ENUM(NFS4ERR_BADSLOT);
+TRACE_DEFINE_ENUM(NFS4ERR_BADTYPE);
+TRACE_DEFINE_ENUM(NFS4ERR_BADXDR);
+TRACE_DEFINE_ENUM(NFS4ERR_BAD_COOKIE);
+TRACE_DEFINE_ENUM(NFS4ERR_BAD_HIGH_SLOT);
+TRACE_DEFINE_ENUM(NFS4ERR_BAD_RANGE);
+TRACE_DEFINE_ENUM(NFS4ERR_BAD_SEQID);
+TRACE_DEFINE_ENUM(NFS4ERR_BAD_SESSION_DIGEST);
+TRACE_DEFINE_ENUM(NFS4ERR_BAD_STATEID);
+TRACE_DEFINE_ENUM(NFS4ERR_CB_PATH_DOWN);
+TRACE_DEFINE_ENUM(NFS4ERR_CLID_INUSE);
+TRACE_DEFINE_ENUM(NFS4ERR_CLIENTID_BUSY);
+TRACE_DEFINE_ENUM(NFS4ERR_COMPLETE_ALREADY);
+TRACE_DEFINE_ENUM(NFS4ERR_CONN_NOT_BOUND_TO_SESSION);
+TRACE_DEFINE_ENUM(NFS4ERR_DEADLOCK);
+TRACE_DEFINE_ENUM(NFS4ERR_DEADSESSION);
+TRACE_DEFINE_ENUM(NFS4ERR_DELAY);
+TRACE_DEFINE_ENUM(NFS4ERR_DELEG_ALREADY_WANTED);
+TRACE_DEFINE_ENUM(NFS4ERR_DELEG_REVOKED);
+TRACE_DEFINE_ENUM(NFS4ERR_DENIED);
+TRACE_DEFINE_ENUM(NFS4ERR_DIRDELEG_UNAVAIL);
+TRACE_DEFINE_ENUM(NFS4ERR_DQUOT);
+TRACE_DEFINE_ENUM(NFS4ERR_ENCR_ALG_UNSUPP);
+TRACE_DEFINE_ENUM(NFS4ERR_EXIST);
+TRACE_DEFINE_ENUM(NFS4ERR_EXPIRED);
+TRACE_DEFINE_ENUM(NFS4ERR_FBIG);
+TRACE_DEFINE_ENUM(NFS4ERR_FHEXPIRED);
+TRACE_DEFINE_ENUM(NFS4ERR_FILE_OPEN);
+TRACE_DEFINE_ENUM(NFS4ERR_GRACE);
+TRACE_DEFINE_ENUM(NFS4ERR_HASH_ALG_UNSUPP);
+TRACE_DEFINE_ENUM(NFS4ERR_INVAL);
+TRACE_DEFINE_ENUM(NFS4ERR_IO);
+TRACE_DEFINE_ENUM(NFS4ERR_ISDIR);
+TRACE_DEFINE_ENUM(NFS4ERR_LAYOUTTRYLATER);
+TRACE_DEFINE_ENUM(NFS4ERR_LAYOUTUNAVAILABLE);
+TRACE_DEFINE_ENUM(NFS4ERR_LEASE_MOVED);
+TRACE_DEFINE_ENUM(NFS4ERR_LOCKED);
+TRACE_DEFINE_ENUM(NFS4ERR_LOCKS_HELD);
+TRACE_DEFINE_ENUM(NFS4ERR_LOCK_RANGE);
+TRACE_DEFINE_ENUM(NFS4ERR_MINOR_VERS_MISMATCH);
+TRACE_DEFINE_ENUM(NFS4ERR_MLINK);
+TRACE_DEFINE_ENUM(NFS4ERR_MOVED);
+TRACE_DEFINE_ENUM(NFS4ERR_NAMETOOLONG);
+TRACE_DEFINE_ENUM(NFS4ERR_NOENT);
+TRACE_DEFINE_ENUM(NFS4ERR_NOFILEHANDLE);
+TRACE_DEFINE_ENUM(NFS4ERR_NOMATCHING_LAYOUT);
+TRACE_DEFINE_ENUM(NFS4ERR_NOSPC);
+TRACE_DEFINE_ENUM(NFS4ERR_NOTDIR);
+TRACE_DEFINE_ENUM(NFS4ERR_NOTEMPTY);
+TRACE_DEFINE_ENUM(NFS4ERR_NOTSUPP);
+TRACE_DEFINE_ENUM(NFS4ERR_NOT_ONLY_OP);
+TRACE_DEFINE_ENUM(NFS4ERR_NOT_SAME);
+TRACE_DEFINE_ENUM(NFS4ERR_NO_GRACE);
+TRACE_DEFINE_ENUM(NFS4ERR_NXIO);
+TRACE_DEFINE_ENUM(NFS4ERR_OLD_STATEID);
+TRACE_DEFINE_ENUM(NFS4ERR_OPENMODE);
+TRACE_DEFINE_ENUM(NFS4ERR_OP_ILLEGAL);
+TRACE_DEFINE_ENUM(NFS4ERR_OP_NOT_IN_SESSION);
+TRACE_DEFINE_ENUM(NFS4ERR_PERM);
+TRACE_DEFINE_ENUM(NFS4ERR_PNFS_IO_HOLE);
+TRACE_DEFINE_ENUM(NFS4ERR_PNFS_NO_LAYOUT);
+TRACE_DEFINE_ENUM(NFS4ERR_RECALLCONFLICT);
+TRACE_DEFINE_ENUM(NFS4ERR_RECLAIM_BAD);
+TRACE_DEFINE_ENUM(NFS4ERR_RECLAIM_CONFLICT);
+TRACE_DEFINE_ENUM(NFS4ERR_REJECT_DELEG);
+TRACE_DEFINE_ENUM(NFS4ERR_REP_TOO_BIG);
+TRACE_DEFINE_ENUM(NFS4ERR_REP_TOO_BIG_TO_CACHE);
+TRACE_DEFINE_ENUM(NFS4ERR_REQ_TOO_BIG);
+TRACE_DEFINE_ENUM(NFS4ERR_RESOURCE);
+TRACE_DEFINE_ENUM(NFS4ERR_RESTOREFH);
+TRACE_DEFINE_ENUM(NFS4ERR_RETRY_UNCACHED_REP);
+TRACE_DEFINE_ENUM(NFS4ERR_RETURNCONFLICT);
+TRACE_DEFINE_ENUM(NFS4ERR_ROFS);
+TRACE_DEFINE_ENUM(NFS4ERR_SAME);
+TRACE_DEFINE_ENUM(NFS4ERR_SHARE_DENIED);
+TRACE_DEFINE_ENUM(NFS4ERR_SEQUENCE_POS);
+TRACE_DEFINE_ENUM(NFS4ERR_SEQ_FALSE_RETRY);
+TRACE_DEFINE_ENUM(NFS4ERR_SEQ_MISORDERED);
+TRACE_DEFINE_ENUM(NFS4ERR_SERVERFAULT);
+TRACE_DEFINE_ENUM(NFS4ERR_STALE);
+TRACE_DEFINE_ENUM(NFS4ERR_STALE_CLIENTID);
+TRACE_DEFINE_ENUM(NFS4ERR_STALE_STATEID);
+TRACE_DEFINE_ENUM(NFS4ERR_SYMLINK);
+TRACE_DEFINE_ENUM(NFS4ERR_TOOSMALL);
+TRACE_DEFINE_ENUM(NFS4ERR_TOO_MANY_OPS);
+TRACE_DEFINE_ENUM(NFS4ERR_UNKNOWN_LAYOUTTYPE);
+TRACE_DEFINE_ENUM(NFS4ERR_UNSAFE_COMPOUND);
+TRACE_DEFINE_ENUM(NFS4ERR_WRONGSEC);
+TRACE_DEFINE_ENUM(NFS4ERR_WRONG_CRED);
+TRACE_DEFINE_ENUM(NFS4ERR_WRONG_TYPE);
+TRACE_DEFINE_ENUM(NFS4ERR_XDEV);
+
 #define show_nfsv4_errors(error) \
-	__print_symbolic(error, \
+	__print_symbolic(-(error), \
 		{ NFS4_OK, "OK" }, \
 		/* Mapped by nfs4_stat_to_errno() */ \
-		{ -EPERM, "EPERM" }, \
-		{ -ENOENT, "ENOENT" }, \
-		{ -EIO, "EIO" }, \
-		{ -ENXIO, "ENXIO" }, \
-		{ -EACCES, "EACCES" }, \
-		{ -EEXIST, "EEXIST" }, \
-		{ -EXDEV, "EXDEV" }, \
-		{ -ENOTDIR, "ENOTDIR" }, \
-		{ -EISDIR, "EISDIR" }, \
-		{ -EFBIG, "EFBIG" }, \
-		{ -ENOSPC, "ENOSPC" }, \
-		{ -EROFS, "EROFS" }, \
-		{ -EMLINK, "EMLINK" }, \
-		{ -ENAMETOOLONG, "ENAMETOOLONG" }, \
-		{ -ENOTEMPTY, "ENOTEMPTY" }, \
-		{ -EDQUOT, "EDQUOT" }, \
-		{ -ESTALE, "ESTALE" }, \
-		{ -EBADHANDLE, "EBADHANDLE" }, \
-		{ -EBADCOOKIE, "EBADCOOKIE" }, \
-		{ -ENOTSUPP, "ENOTSUPP" }, \
-		{ -ETOOSMALL, "ETOOSMALL" }, \
-		{ -EREMOTEIO, "EREMOTEIO" }, \
-		{ -EBADTYPE, "EBADTYPE" }, \
-		{ -EAGAIN, "EAGAIN" }, \
-		{ -ELOOP, "ELOOP" }, \
-		{ -EOPNOTSUPP, "EOPNOTSUPP" }, \
-		{ -EDEADLK, "EDEADLK" }, \
+		{ EPERM, "EPERM" }, \
+		{ ENOENT, "ENOENT" }, \
+		{ EIO, "EIO" }, \
+		{ ENXIO, "ENXIO" }, \
+		{ EACCES, "EACCES" }, \
+		{ EEXIST, "EEXIST" }, \
+		{ EXDEV, "EXDEV" }, \
+		{ ENOTDIR, "ENOTDIR" }, \
+		{ EISDIR, "EISDIR" }, \
+		{ EFBIG, "EFBIG" }, \
+		{ ENOSPC, "ENOSPC" }, \
+		{ EROFS, "EROFS" }, \
+		{ EMLINK, "EMLINK" }, \
+		{ ENAMETOOLONG, "ENAMETOOLONG" }, \
+		{ ENOTEMPTY, "ENOTEMPTY" }, \
+		{ EDQUOT, "EDQUOT" }, \
+		{ ESTALE, "ESTALE" }, \
+		{ EBADHANDLE, "EBADHANDLE" }, \
+		{ EBADCOOKIE, "EBADCOOKIE" }, \
+		{ ENOTSUPP, "ENOTSUPP" }, \
+		{ ETOOSMALL, "ETOOSMALL" }, \
+		{ EREMOTEIO, "EREMOTEIO" }, \
+		{ EBADTYPE, "EBADTYPE" }, \
+		{ EAGAIN, "EAGAIN" }, \
+		{ ELOOP, "ELOOP" }, \
+		{ EOPNOTSUPP, "EOPNOTSUPP" }, \
+		{ EDEADLK, "EDEADLK" }, \
 		/* RPC errors */ \
-		{ -ENOMEM, "ENOMEM" }, \
-		{ -EKEYEXPIRED, "EKEYEXPIRED" }, \
-		{ -ETIMEDOUT, "ETIMEDOUT" }, \
-		{ -ERESTARTSYS, "ERESTARTSYS" }, \
-		{ -ECONNREFUSED, "ECONNREFUSED" }, \
-		{ -ECONNRESET, "ECONNRESET" }, \
-		{ -ENETUNREACH, "ENETUNREACH" }, \
-		{ -EHOSTUNREACH, "EHOSTUNREACH" }, \
-		{ -EHOSTDOWN, "EHOSTDOWN" }, \
-		{ -EPIPE, "EPIPE" }, \
-		{ -EPFNOSUPPORT, "EPFNOSUPPORT" }, \
-		{ -EPROTONOSUPPORT, "EPROTONOSUPPORT" }, \
+		{ ENOMEM, "ENOMEM" }, \
+		{ EKEYEXPIRED, "EKEYEXPIRED" }, \
+		{ ETIMEDOUT, "ETIMEDOUT" }, \
+		{ ERESTARTSYS, "ERESTARTSYS" }, \
+		{ ECONNREFUSED, "ECONNREFUSED" }, \
+		{ ECONNRESET, "ECONNRESET" }, \
+		{ ENETUNREACH, "ENETUNREACH" }, \
+		{ EHOSTUNREACH, "EHOSTUNREACH" }, \
+		{ EHOSTDOWN, "EHOSTDOWN" }, \
+		{ EPIPE, "EPIPE" }, \
+		{ EPFNOSUPPORT, "EPFNOSUPPORT" }, \
+		{ EPROTONOSUPPORT, "EPROTONOSUPPORT" }, \
 		/* NFSv4 native errors */ \
-		{ -NFS4ERR_ACCESS, "ACCESS" }, \
-		{ -NFS4ERR_ATTRNOTSUPP, "ATTRNOTSUPP" }, \
-		{ -NFS4ERR_ADMIN_REVOKED, "ADMIN_REVOKED" }, \
-		{ -NFS4ERR_BACK_CHAN_BUSY, "BACK_CHAN_BUSY" }, \
-		{ -NFS4ERR_BADCHAR, "BADCHAR" }, \
-		{ -NFS4ERR_BADHANDLE, "BADHANDLE" }, \
-		{ -NFS4ERR_BADIOMODE, "BADIOMODE" }, \
-		{ -NFS4ERR_BADLAYOUT, "BADLAYOUT" }, \
-		{ -NFS4ERR_BADLABEL, "BADLABEL" }, \
-		{ -NFS4ERR_BADNAME, "BADNAME" }, \
-		{ -NFS4ERR_BADOWNER, "BADOWNER" }, \
-		{ -NFS4ERR_BADSESSION, "BADSESSION" }, \
-		{ -NFS4ERR_BADSLOT, "BADSLOT" }, \
-		{ -NFS4ERR_BADTYPE, "BADTYPE" }, \
-		{ -NFS4ERR_BADXDR, "BADXDR" }, \
-		{ -NFS4ERR_BAD_COOKIE, "BAD_COOKIE" }, \
-		{ -NFS4ERR_BAD_HIGH_SLOT, "BAD_HIGH_SLOT" }, \
-		{ -NFS4ERR_BAD_RANGE, "BAD_RANGE" }, \
-		{ -NFS4ERR_BAD_SEQID, "BAD_SEQID" }, \
-		{ -NFS4ERR_BAD_SESSION_DIGEST, "BAD_SESSION_DIGEST" }, \
-		{ -NFS4ERR_BAD_STATEID, "BAD_STATEID" }, \
-		{ -NFS4ERR_CB_PATH_DOWN, "CB_PATH_DOWN" }, \
-		{ -NFS4ERR_CLID_INUSE, "CLID_INUSE" }, \
-		{ -NFS4ERR_CLIENTID_BUSY, "CLIENTID_BUSY" }, \
-		{ -NFS4ERR_COMPLETE_ALREADY, "COMPLETE_ALREADY" }, \
-		{ -NFS4ERR_CONN_NOT_BOUND_TO_SESSION, \
+		{ NFS4ERR_ACCESS, "ACCESS" }, \
+		{ NFS4ERR_ATTRNOTSUPP, "ATTRNOTSUPP" }, \
+		{ NFS4ERR_ADMIN_REVOKED, "ADMIN_REVOKED" }, \
+		{ NFS4ERR_BACK_CHAN_BUSY, "BACK_CHAN_BUSY" }, \
+		{ NFS4ERR_BADCHAR, "BADCHAR" }, \
+		{ NFS4ERR_BADHANDLE, "BADHANDLE" }, \
+		{ NFS4ERR_BADIOMODE, "BADIOMODE" }, \
+		{ NFS4ERR_BADLAYOUT, "BADLAYOUT" }, \
+		{ NFS4ERR_BADLABEL, "BADLABEL" }, \
+		{ NFS4ERR_BADNAME, "BADNAME" }, \
+		{ NFS4ERR_BADOWNER, "BADOWNER" }, \
+		{ NFS4ERR_BADSESSION, "BADSESSION" }, \
+		{ NFS4ERR_BADSLOT, "BADSLOT" }, \
+		{ NFS4ERR_BADTYPE, "BADTYPE" }, \
+		{ NFS4ERR_BADXDR, "BADXDR" }, \
+		{ NFS4ERR_BAD_COOKIE, "BAD_COOKIE" }, \
+		{ NFS4ERR_BAD_HIGH_SLOT, "BAD_HIGH_SLOT" }, \
+		{ NFS4ERR_BAD_RANGE, "BAD_RANGE" }, \
+		{ NFS4ERR_BAD_SEQID, "BAD_SEQID" }, \
+		{ NFS4ERR_BAD_SESSION_DIGEST, "BAD_SESSION_DIGEST" }, \
+		{ NFS4ERR_BAD_STATEID, "BAD_STATEID" }, \
+		{ NFS4ERR_CB_PATH_DOWN, "CB_PATH_DOWN" }, \
+		{ NFS4ERR_CLID_INUSE, "CLID_INUSE" }, \
+		{ NFS4ERR_CLIENTID_BUSY, "CLIENTID_BUSY" }, \
+		{ NFS4ERR_COMPLETE_ALREADY, "COMPLETE_ALREADY" }, \
+		{ NFS4ERR_CONN_NOT_BOUND_TO_SESSION, \
 			"CONN_NOT_BOUND_TO_SESSION" }, \
-		{ -NFS4ERR_DEADLOCK, "DEADLOCK" }, \
-		{ -NFS4ERR_DEADSESSION, "DEAD_SESSION" }, \
-		{ -NFS4ERR_DELAY, "DELAY" }, \
-		{ -NFS4ERR_DELEG_ALREADY_WANTED, \
+		{ NFS4ERR_DEADLOCK, "DEADLOCK" }, \
+		{ NFS4ERR_DEADSESSION, "DEAD_SESSION" }, \
+		{ NFS4ERR_DELAY, "DELAY" }, \
+		{ NFS4ERR_DELEG_ALREADY_WANTED, \
 			"DELEG_ALREADY_WANTED" }, \
-		{ -NFS4ERR_DELEG_REVOKED, "DELEG_REVOKED" }, \
-		{ -NFS4ERR_DENIED, "DENIED" }, \
-		{ -NFS4ERR_DIRDELEG_UNAVAIL, "DIRDELEG_UNAVAIL" }, \
-		{ -NFS4ERR_DQUOT, "DQUOT" }, \
-		{ -NFS4ERR_ENCR_ALG_UNSUPP, "ENCR_ALG_UNSUPP" }, \
-		{ -NFS4ERR_EXIST, "EXIST" }, \
-		{ -NFS4ERR_EXPIRED, "EXPIRED" }, \
-		{ -NFS4ERR_FBIG, "FBIG" }, \
-		{ -NFS4ERR_FHEXPIRED, "FHEXPIRED" }, \
-		{ -NFS4ERR_FILE_OPEN, "FILE_OPEN" }, \
-		{ -NFS4ERR_GRACE, "GRACE" }, \
-		{ -NFS4ERR_HASH_ALG_UNSUPP, "HASH_ALG_UNSUPP" }, \
-		{ -NFS4ERR_INVAL, "INVAL" }, \
-		{ -NFS4ERR_IO, "IO" }, \
-		{ -NFS4ERR_ISDIR, "ISDIR" }, \
-		{ -NFS4ERR_LAYOUTTRYLATER, "LAYOUTTRYLATER" }, \
-		{ -NFS4ERR_LAYOUTUNAVAILABLE, "LAYOUTUNAVAILABLE" }, \
-		{ -NFS4ERR_LEASE_MOVED, "LEASE_MOVED" }, \
-		{ -NFS4ERR_LOCKED, "LOCKED" }, \
-		{ -NFS4ERR_LOCKS_HELD, "LOCKS_HELD" }, \
-		{ -NFS4ERR_LOCK_RANGE, "LOCK_RANGE" }, \
-		{ -NFS4ERR_MINOR_VERS_MISMATCH, "MINOR_VERS_MISMATCH" }, \
-		{ -NFS4ERR_MLINK, "MLINK" }, \
-		{ -NFS4ERR_MOVED, "MOVED" }, \
-		{ -NFS4ERR_NAMETOOLONG, "NAMETOOLONG" }, \
-		{ -NFS4ERR_NOENT, "NOENT" }, \
-		{ -NFS4ERR_NOFILEHANDLE, "NOFILEHANDLE" }, \
-		{ -NFS4ERR_NOMATCHING_LAYOUT, "NOMATCHING_LAYOUT" }, \
-		{ -NFS4ERR_NOSPC, "NOSPC" }, \
-		{ -NFS4ERR_NOTDIR, "NOTDIR" }, \
-		{ -NFS4ERR_NOTEMPTY, "NOTEMPTY" }, \
-		{ -NFS4ERR_NOTSUPP, "NOTSUPP" }, \
-		{ -NFS4ERR_NOT_ONLY_OP, "NOT_ONLY_OP" }, \
-		{ -NFS4ERR_NOT_SAME, "NOT_SAME" }, \
-		{ -NFS4ERR_NO_GRACE, "NO_GRACE" }, \
-		{ -NFS4ERR_NXIO, "NXIO" }, \
-		{ -NFS4ERR_OLD_STATEID, "OLD_STATEID" }, \
-		{ -NFS4ERR_OPENMODE, "OPENMODE" }, \
-		{ -NFS4ERR_OP_ILLEGAL, "OP_ILLEGAL" }, \
-		{ -NFS4ERR_OP_NOT_IN_SESSION, "OP_NOT_IN_SESSION" }, \
-		{ -NFS4ERR_PERM, "PERM" }, \
-		{ -NFS4ERR_PNFS_IO_HOLE, "PNFS_IO_HOLE" }, \
-		{ -NFS4ERR_PNFS_NO_LAYOUT, "PNFS_NO_LAYOUT" }, \
-		{ -NFS4ERR_RECALLCONFLICT, "RECALLCONFLICT" }, \
-		{ -NFS4ERR_RECLAIM_BAD, "RECLAIM_BAD" }, \
-		{ -NFS4ERR_RECLAIM_CONFLICT, "RECLAIM_CONFLICT" }, \
-		{ -NFS4ERR_REJECT_DELEG, "REJECT_DELEG" }, \
-		{ -NFS4ERR_REP_TOO_BIG, "REP_TOO_BIG" }, \
-		{ -NFS4ERR_REP_TOO_BIG_TO_CACHE, \
+		{ NFS4ERR_DELEG_REVOKED, "DELEG_REVOKED" }, \
+		{ NFS4ERR_DENIED, "DENIED" }, \
+		{ NFS4ERR_DIRDELEG_UNAVAIL, "DIRDELEG_UNAVAIL" }, \
+		{ NFS4ERR_DQUOT, "DQUOT" }, \
+		{ NFS4ERR_ENCR_ALG_UNSUPP, "ENCR_ALG_UNSUPP" }, \
+		{ NFS4ERR_EXIST, "EXIST" }, \
+		{ NFS4ERR_EXPIRED, "EXPIRED" }, \
+		{ NFS4ERR_FBIG, "FBIG" }, \
+		{ NFS4ERR_FHEXPIRED, "FHEXPIRED" }, \
+		{ NFS4ERR_FILE_OPEN, "FILE_OPEN" }, \
+		{ NFS4ERR_GRACE, "GRACE" }, \
+		{ NFS4ERR_HASH_ALG_UNSUPP, "HASH_ALG_UNSUPP" }, \
+		{ NFS4ERR_INVAL, "INVAL" }, \
+		{ NFS4ERR_IO, "IO" }, \
+		{ NFS4ERR_ISDIR, "ISDIR" }, \
+		{ NFS4ERR_LAYOUTTRYLATER, "LAYOUTTRYLATER" }, \
+		{ NFS4ERR_LAYOUTUNAVAILABLE, "LAYOUTUNAVAILABLE" }, \
+		{ NFS4ERR_LEASE_MOVED, "LEASE_MOVED" }, \
+		{ NFS4ERR_LOCKED, "LOCKED" }, \
+		{ NFS4ERR_LOCKS_HELD, "LOCKS_HELD" }, \
+		{ NFS4ERR_LOCK_RANGE, "LOCK_RANGE" }, \
+		{ NFS4ERR_MINOR_VERS_MISMATCH, "MINOR_VERS_MISMATCH" }, \
+		{ NFS4ERR_MLINK, "MLINK" }, \
+		{ NFS4ERR_MOVED, "MOVED" }, \
+		{ NFS4ERR_NAMETOOLONG, "NAMETOOLONG" }, \
+		{ NFS4ERR_NOENT, "NOENT" }, \
+		{ NFS4ERR_NOFILEHANDLE, "NOFILEHANDLE" }, \
+		{ NFS4ERR_NOMATCHING_LAYOUT, "NOMATCHING_LAYOUT" }, \
+		{ NFS4ERR_NOSPC, "NOSPC" }, \
+		{ NFS4ERR_NOTDIR, "NOTDIR" }, \
+		{ NFS4ERR_NOTEMPTY, "NOTEMPTY" }, \
+		{ NFS4ERR_NOTSUPP, "NOTSUPP" }, \
+		{ NFS4ERR_NOT_ONLY_OP, "NOT_ONLY_OP" }, \
+		{ NFS4ERR_NOT_SAME, "NOT_SAME" }, \
+		{ NFS4ERR_NO_GRACE, "NO_GRACE" }, \
+		{ NFS4ERR_NXIO, "NXIO" }, \
+		{ NFS4ERR_OLD_STATEID, "OLD_STATEID" }, \
+		{ NFS4ERR_OPENMODE, "OPENMODE" }, \
+		{ NFS4ERR_OP_ILLEGAL, "OP_ILLEGAL" }, \
+		{ NFS4ERR_OP_NOT_IN_SESSION, "OP_NOT_IN_SESSION" }, \
+		{ NFS4ERR_PERM, "PERM" }, \
+		{ NFS4ERR_PNFS_IO_HOLE, "PNFS_IO_HOLE" }, \
+		{ NFS4ERR_PNFS_NO_LAYOUT, "PNFS_NO_LAYOUT" }, \
+		{ NFS4ERR_RECALLCONFLICT, "RECALLCONFLICT" }, \
+		{ NFS4ERR_RECLAIM_BAD, "RECLAIM_BAD" }, \
+		{ NFS4ERR_RECLAIM_CONFLICT, "RECLAIM_CONFLICT" }, \
+		{ NFS4ERR_REJECT_DELEG, "REJECT_DELEG" }, \
+		{ NFS4ERR_REP_TOO_BIG, "REP_TOO_BIG" }, \
+		{ NFS4ERR_REP_TOO_BIG_TO_CACHE, \
 			"REP_TOO_BIG_TO_CACHE" }, \
-		{ -NFS4ERR_REQ_TOO_BIG, "REQ_TOO_BIG" }, \
-		{ -NFS4ERR_RESOURCE, "RESOURCE" }, \
-		{ -NFS4ERR_RESTOREFH, "RESTOREFH" }, \
-		{ -NFS4ERR_RETRY_UNCACHED_REP, "RETRY_UNCACHED_REP" }, \
-		{ -NFS4ERR_RETURNCONFLICT, "RETURNCONFLICT" }, \
-		{ -NFS4ERR_ROFS, "ROFS" }, \
-		{ -NFS4ERR_SAME, "SAME" }, \
-		{ -NFS4ERR_SHARE_DENIED, "SHARE_DENIED" }, \
-		{ -NFS4ERR_SEQUENCE_POS, "SEQUENCE_POS" }, \
-		{ -NFS4ERR_SEQ_FALSE_RETRY, "SEQ_FALSE_RETRY" }, \
-		{ -NFS4ERR_SEQ_MISORDERED, "SEQ_MISORDERED" }, \
-		{ -NFS4ERR_SERVERFAULT, "SERVERFAULT" }, \
-		{ -NFS4ERR_STALE, "STALE" }, \
-		{ -NFS4ERR_STALE_CLIENTID, "STALE_CLIENTID" }, \
-		{ -NFS4ERR_STALE_STATEID, "STALE_STATEID" }, \
-		{ -NFS4ERR_SYMLINK, "SYMLINK" }, \
-		{ -NFS4ERR_TOOSMALL, "TOOSMALL" }, \
-		{ -NFS4ERR_TOO_MANY_OPS, "TOO_MANY_OPS" }, \
-		{ -NFS4ERR_UNKNOWN_LAYOUTTYPE, "UNKNOWN_LAYOUTTYPE" }, \
-		{ -NFS4ERR_UNSAFE_COMPOUND, "UNSAFE_COMPOUND" }, \
-		{ -NFS4ERR_WRONGSEC, "WRONGSEC" }, \
-		{ -NFS4ERR_WRONG_CRED, "WRONG_CRED" }, \
-		{ -NFS4ERR_WRONG_TYPE, "WRONG_TYPE" }, \
-		{ -NFS4ERR_XDEV, "XDEV" })
+		{ NFS4ERR_REQ_TOO_BIG, "REQ_TOO_BIG" }, \
+		{ NFS4ERR_RESOURCE, "RESOURCE" }, \
+		{ NFS4ERR_RESTOREFH, "RESTOREFH" }, \
+		{ NFS4ERR_RETRY_UNCACHED_REP, "RETRY_UNCACHED_REP" }, \
+		{ NFS4ERR_RETURNCONFLICT, "RETURNCONFLICT" }, \
+		{ NFS4ERR_ROFS, "ROFS" }, \
+		{ NFS4ERR_SAME, "SAME" }, \
+		{ NFS4ERR_SHARE_DENIED, "SHARE_DENIED" }, \
+		{ NFS4ERR_SEQUENCE_POS, "SEQUENCE_POS" }, \
+		{ NFS4ERR_SEQ_FALSE_RETRY, "SEQ_FALSE_RETRY" }, \
+		{ NFS4ERR_SEQ_MISORDERED, "SEQ_MISORDERED" }, \
+		{ NFS4ERR_SERVERFAULT, "SERVERFAULT" }, \
+		{ NFS4ERR_STALE, "STALE" }, \
+		{ NFS4ERR_STALE_CLIENTID, "STALE_CLIENTID" }, \
+		{ NFS4ERR_STALE_STATEID, "STALE_STATEID" }, \
+		{ NFS4ERR_SYMLINK, "SYMLINK" }, \
+		{ NFS4ERR_TOOSMALL, "TOOSMALL" }, \
+		{ NFS4ERR_TOO_MANY_OPS, "TOO_MANY_OPS" }, \
+		{ NFS4ERR_UNKNOWN_LAYOUTTYPE, "UNKNOWN_LAYOUTTYPE" }, \
+		{ NFS4ERR_UNSAFE_COMPOUND, "UNSAFE_COMPOUND" }, \
+		{ NFS4ERR_WRONGSEC, "WRONGSEC" }, \
+		{ NFS4ERR_WRONG_CRED, "WRONG_CRED" }, \
+		{ NFS4ERR_WRONG_TYPE, "WRONG_TYPE" }, \
+		{ NFS4ERR_XDEV, "XDEV" })
 
 #define show_open_flags(flags) \
 	__print_flags(flags, "|", \
@@ -558,6 +703,13 @@
 		)
 );
 
+TRACE_DEFINE_ENUM(F_GETLK);
+TRACE_DEFINE_ENUM(F_SETLK);
+TRACE_DEFINE_ENUM(F_SETLKW);
+TRACE_DEFINE_ENUM(F_RDLCK);
+TRACE_DEFINE_ENUM(F_WRLCK);
+TRACE_DEFINE_ENUM(F_UNLCK);
+
 #define show_lock_cmd(type) \
 	__print_symbolic((int)type, \
 		{ F_GETLK, "GETLK" }, \
@@ -1451,6 +1603,10 @@
 #ifdef CONFIG_NFS_V4_1
 DEFINE_NFS4_COMMIT_EVENT(nfs4_pnfs_commit_ds);
 
+TRACE_DEFINE_ENUM(IOMODE_READ);
+TRACE_DEFINE_ENUM(IOMODE_RW);
+TRACE_DEFINE_ENUM(IOMODE_ANY);
+
 #define show_pnfs_iomode(iomode) \
 	__print_symbolic(iomode, \
 		{ IOMODE_READ, "READ" }, \
@@ -1528,6 +1684,20 @@
 DEFINE_NFS4_INODE_STATEID_EVENT(nfs4_layoutreturn);
 DEFINE_NFS4_INODE_EVENT(nfs4_layoutreturn_on_close);
 
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_UNKNOWN);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_NO_PNFS);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_RD_ZEROLEN);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_MDSTHRESH);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_NOMEM);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_BULK_RECALL);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_IO_TEST_FAIL);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_FOUND_CACHED);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_RETURN);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_BLOCKED);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_INVALID_OPEN);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_RETRY);
+TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
+
 #define show_pnfs_update_layout_reason(reason)				\
 	__print_symbolic(reason,					\
 		{ PNFS_UPDATE_LAYOUT_UNKNOWN, "unknown" },		\



* [PATCH v4 24/30] SUNRPC: Simplify defining common RPC trace events
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (22 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 23/30] NFS: Fix NFSv4 symbolic trace point output Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 25/30] SUNRPC: Fix some kernel doc complaints Chuck Lever
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up; no functional change is expected.
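
As a rough illustration of the effect (the "foo" event name below is
hypothetical, not something this patch adds), each new event becomes a
one-liner once the class-specific helper macro exists:

/* expands to DEFINE_EVENT(rpc_task_status, rpc_foo_status,
 *            TP_PROTO(const struct rpc_task *task), TP_ARGS(task))
 */
DEFINE_RPC_STATUS_EVENT(foo);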

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/sunrpc.h |  172 ++++++++++++++++-------------------------
 1 file changed, 69 insertions(+), 103 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 28e3841..88bda93 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -16,40 +16,6 @@
 
 DECLARE_EVENT_CLASS(rpc_task_status,
 
-	TP_PROTO(struct rpc_task *task),
-
-	TP_ARGS(task),
-
-	TP_STRUCT__entry(
-		__field(unsigned int, task_id)
-		__field(unsigned int, client_id)
-		__field(int, status)
-	),
-
-	TP_fast_assign(
-		__entry->task_id = task->tk_pid;
-		__entry->client_id = task->tk_client->cl_clid;
-		__entry->status = task->tk_status;
-	),
-
-	TP_printk("task:%u@%u status=%d",
-		__entry->task_id, __entry->client_id,
-		__entry->status)
-);
-
-DEFINE_EVENT(rpc_task_status, rpc_call_status,
-	TP_PROTO(struct rpc_task *task),
-
-	TP_ARGS(task)
-);
-
-DEFINE_EVENT(rpc_task_status, rpc_bind_status,
-	TP_PROTO(struct rpc_task *task),
-
-	TP_ARGS(task)
-);
-
-TRACE_EVENT(rpc_connect_status,
 	TP_PROTO(const struct rpc_task *task),
 
 	TP_ARGS(task),
@@ -70,6 +36,16 @@
 		__entry->task_id, __entry->client_id,
 		__entry->status)
 );
+#define DEFINE_RPC_STATUS_EVENT(name) \
+	DEFINE_EVENT(rpc_task_status, rpc_##name##_status, \
+			TP_PROTO( \
+				const struct rpc_task *task \
+			), \
+			TP_ARGS(task))
+
+DEFINE_RPC_STATUS_EVENT(call);
+DEFINE_RPC_STATUS_EVENT(bind);
+DEFINE_RPC_STATUS_EVENT(connect);
 
 TRACE_EVENT(rpc_request,
 	TP_PROTO(const struct rpc_task *task),
@@ -134,30 +110,17 @@
 		__entry->action
 		)
 );
+#define DEFINE_RPC_RUNNING_EVENT(name) \
+	DEFINE_EVENT(rpc_task_running, rpc_task_##name, \
+			TP_PROTO( \
+				const struct rpc_task *task, \
+				const void *action \
+			), \
+			TP_ARGS(task, action))
 
-DEFINE_EVENT(rpc_task_running, rpc_task_begin,
-
-	TP_PROTO(const struct rpc_task *task, const void *action),
-
-	TP_ARGS(task, action)
-
-);
-
-DEFINE_EVENT(rpc_task_running, rpc_task_run_action,
-
-	TP_PROTO(const struct rpc_task *task, const void *action),
-
-	TP_ARGS(task, action)
-
-);
-
-DEFINE_EVENT(rpc_task_running, rpc_task_complete,
-
-	TP_PROTO(const struct rpc_task *task, const void *action),
-
-	TP_ARGS(task, action)
-
-);
+DEFINE_RPC_RUNNING_EVENT(begin);
+DEFINE_RPC_RUNNING_EVENT(run_action);
+DEFINE_RPC_RUNNING_EVENT(complete);
 
 DECLARE_EVENT_CLASS(rpc_task_queued,
 
@@ -195,22 +158,16 @@
 		__get_str(q_name)
 		)
 );
+#define DEFINE_RPC_QUEUED_EVENT(name) \
+	DEFINE_EVENT(rpc_task_queued, rpc_task_##name, \
+			TP_PROTO( \
+				const struct rpc_task *task, \
+				const struct rpc_wait_queue *q \
+			), \
+			TP_ARGS(task, q))
 
-DEFINE_EVENT(rpc_task_queued, rpc_task_sleep,
-
-	TP_PROTO(const struct rpc_task *task, const struct rpc_wait_queue *q),
-
-	TP_ARGS(task, q)
-
-);
-
-DEFINE_EVENT(rpc_task_queued, rpc_task_wakeup,
-
-	TP_PROTO(const struct rpc_task *task, const struct rpc_wait_queue *q),
-
-	TP_ARGS(task, q)
-
-);
+DEFINE_RPC_QUEUED_EVENT(sleep);
+DEFINE_RPC_QUEUED_EVENT(wakeup);
 
 TRACE_EVENT(rpc_stats_latency,
 
@@ -410,7 +367,11 @@
 DEFINE_RPC_SOCKET_EVENT(rpc_socket_shutdown);
 
 DECLARE_EVENT_CLASS(rpc_xprt_event,
-	TP_PROTO(struct rpc_xprt *xprt, __be32 xid, int status),
+	TP_PROTO(
+		const struct rpc_xprt *xprt,
+		__be32 xid,
+		int status
+	),
 
 	TP_ARGS(xprt, xid, status),
 
@@ -432,22 +393,19 @@
 			__get_str(port), __entry->xid,
 			__entry->status)
 );
+#define DEFINE_RPC_XPRT_EVENT(name) \
+	DEFINE_EVENT(rpc_xprt_event, xprt_##name, \
+			TP_PROTO( \
+				const struct rpc_xprt *xprt, \
+				__be32 xid, \
+				int status \
+			), \
+			TP_ARGS(xprt, xid, status))
 
-DEFINE_EVENT(rpc_xprt_event, xprt_timer,
-	TP_PROTO(struct rpc_xprt *xprt, __be32 xid, int status),
-	TP_ARGS(xprt, xid, status));
-
-DEFINE_EVENT(rpc_xprt_event, xprt_lookup_rqst,
-	TP_PROTO(struct rpc_xprt *xprt, __be32 xid, int status),
-	TP_ARGS(xprt, xid, status));
-
-DEFINE_EVENT(rpc_xprt_event, xprt_transmit,
-	TP_PROTO(struct rpc_xprt *xprt, __be32 xid, int status),
-	TP_ARGS(xprt, xid, status));
-
-DEFINE_EVENT(rpc_xprt_event, xprt_complete_rqst,
-	TP_PROTO(struct rpc_xprt *xprt, __be32 xid, int status),
-	TP_ARGS(xprt, xid, status));
+DEFINE_RPC_XPRT_EVENT(timer);
+DEFINE_RPC_XPRT_EVENT(lookup_rqst);
+DEFINE_RPC_XPRT_EVENT(transmit);
+DEFINE_RPC_XPRT_EVENT(complete_rqst);
 
 TRACE_EVENT(xprt_ping,
 	TP_PROTO(const struct rpc_xprt *xprt, int status),
@@ -587,7 +545,9 @@
 
 DECLARE_EVENT_CLASS(svc_rqst_event,
 
-	TP_PROTO(struct svc_rqst *rqst),
+	TP_PROTO(
+		const struct svc_rqst *rqst
+	),
 
 	TP_ARGS(rqst),
 
@@ -607,14 +567,15 @@
 			__get_str(addr), __entry->xid,
 			show_rqstp_flags(__entry->flags))
 );
+#define DEFINE_SVC_RQST_EVENT(name) \
+	DEFINE_EVENT(svc_rqst_event, svc_##name, \
+			TP_PROTO( \
+				const struct svc_rqst *rqst \
+			), \
+			TP_ARGS(rqst))
 
-DEFINE_EVENT(svc_rqst_event, svc_defer,
-	TP_PROTO(struct svc_rqst *rqst),
-	TP_ARGS(rqst));
-
-DEFINE_EVENT(svc_rqst_event, svc_drop,
-	TP_PROTO(struct svc_rqst *rqst),
-	TP_ARGS(rqst));
+DEFINE_SVC_RQST_EVENT(defer);
+DEFINE_SVC_RQST_EVENT(drop);
 
 DECLARE_EVENT_CLASS(svc_rqst_status,
 
@@ -801,7 +762,9 @@
 );
 
 DECLARE_EVENT_CLASS(svc_deferred_event,
-	TP_PROTO(struct svc_deferred_req *dr),
+	TP_PROTO(
+		const struct svc_deferred_req *dr
+	),
 
 	TP_ARGS(dr),
 
@@ -818,13 +781,16 @@
 
 	TP_printk("addr=%s xid=0x%08x", __get_str(addr), __entry->xid)
 );
+#define DEFINE_SVC_DEFERRED_EVENT(name) \
+	DEFINE_EVENT(svc_deferred_event, svc_##name##_deferred, \
+			TP_PROTO( \
+				const struct svc_deferred_req *dr \
+			), \
+			TP_ARGS(dr))
+
+DEFINE_SVC_DEFERRED_EVENT(drop);
+DEFINE_SVC_DEFERRED_EVENT(revisit);
 
-DEFINE_EVENT(svc_deferred_event, svc_drop_deferred,
-	TP_PROTO(struct svc_deferred_req *dr),
-	TP_ARGS(dr));
-DEFINE_EVENT(svc_deferred_event, svc_revisit_deferred,
-	TP_PROTO(struct svc_deferred_req *dr),
-	TP_ARGS(dr));
 #endif /* _TRACE_SUNRPC_H */
 
 #include <trace/define_trace.h>



* [PATCH v4 25/30] SUNRPC: Fix some kernel doc complaints
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (23 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 24/30] SUNRPC: Simplify defining common RPC trace events Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 26/30] xprtrdma: Update comments in frwr_op_send Chuck Lever
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Clean up some warnings observed when building with "make W=1".
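
The warnings all come from kernel-doc comments whose parameter lines do
not match the function signature. A minimal sketch of the expected shape
(example_func is a made-up name used only to show the layout):

/**
 * example_func - one-line summary of the function
 * @xprt: controlling transport
 * @task: task that timed out
 *
 * Each parameter needs a matching "@name:" line; example_func itself
 * is hypothetical and exists only to illustrate the format.
 */
void example_func(struct rpc_xprt *xprt, struct rpc_task *task);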

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/auth_gss/gss_mech_switch.c |    2 +-
 net/sunrpc/backchannel_rqst.c         |    2 +-
 net/sunrpc/xprtmultipath.c            |    4 ++--
 net/sunrpc/xprtsock.c                 |    2 ++
 4 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/auth_gss/gss_mech_switch.c b/net/sunrpc/auth_gss/gss_mech_switch.c
index 16ac0f4..379318d 100644
--- a/net/sunrpc/auth_gss/gss_mech_switch.c
+++ b/net/sunrpc/auth_gss/gss_mech_switch.c
@@ -244,7 +244,7 @@ struct gss_api_mech *
 
 /**
  * gss_mech_list_pseudoflavors - Discover registered GSS pseudoflavors
- * @array: array to fill in
+ * @array_ptr: array to fill in
  * @size: size of "array"
  *
  * Returns the number of array items filled in, or a negative errno.
diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
index fa5ba6e..ec451b8 100644
--- a/net/sunrpc/backchannel_rqst.c
+++ b/net/sunrpc/backchannel_rqst.c
@@ -197,7 +197,7 @@ int xprt_setup_bc(struct rpc_xprt *xprt, unsigned int min_reqs)
 /**
  * xprt_destroy_backchannel - Destroys the backchannel preallocated structures.
  * @xprt:	the transport holding the preallocated strucures
- * @max_reqs	the maximum number of preallocated structures to destroy
+ * @max_reqs:	the maximum number of preallocated structures to destroy
  *
  * Since these structures may have been allocated by multiple calls
  * to xprt_setup_backchannel, we only destroy up to the maximum number
diff --git a/net/sunrpc/xprtmultipath.c b/net/sunrpc/xprtmultipath.c
index e2d64c7..8394124 100644
--- a/net/sunrpc/xprtmultipath.c
+++ b/net/sunrpc/xprtmultipath.c
@@ -383,7 +383,7 @@ void xprt_iter_init_listall(struct rpc_xprt_iter *xpi,
 /**
  * xprt_iter_xchg_switch - Atomically swap out the rpc_xprt_switch
  * @xpi: pointer to rpc_xprt_iter
- * @xps: pointer to a new rpc_xprt_switch or NULL
+ * @newswitch: pointer to a new rpc_xprt_switch or NULL
  *
  * Swaps out the existing xpi->xpi_xpswitch with a new value.
  */
@@ -401,7 +401,7 @@ struct rpc_xprt_switch *xprt_iter_xchg_switch(struct rpc_xprt_iter *xpi,
 
 /**
  * xprt_iter_destroy - Destroys the xprt iterator
- * @xpi pointer to rpc_xprt_iter
+ * @xpi: pointer to rpc_xprt_iter
  */
 void xprt_iter_destroy(struct rpc_xprt_iter *xpi)
 {
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 8a5e823..8ee9831 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1602,6 +1602,7 @@ static void xs_udp_set_buffer_size(struct rpc_xprt *xprt, size_t sndsize, size_t
 
 /**
  * xs_udp_timer - called when a retransmit timeout occurs on a UDP transport
+ * @xprt: controlling transport
  * @task: task that timed out
  *
  * Adjust the congestion window after a retransmit timeout has occurred.
@@ -2259,6 +2260,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 
 /**
  * xs_tcp_setup_socket - create a TCP socket and connect to a remote endpoint
+ * @work: queued work item
  *
  * Invoked by a work queue tasklet.
  */



* [PATCH v4 26/30] xprtrdma: Update comments in frwr_op_send
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (24 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 25/30] SUNRPC: Fix some kernel doc complaints Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 27/30] xprtrdma: Replace outdated comment for rpcrdma_ep_post Chuck Lever
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Commit f2877623082b ("xprtrdma: Chain Send to FastReg WRs") was
written before commit ce5b37178283 ("xprtrdma: Replace all usage of
"frmr" with "frwr""), but was merged afterwards. Thus it still
refers to FRMR and MWs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 8a0f1a6..35c8f62 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -479,7 +479,7 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
  * @ia: interface adapter
  * @req: Prepared RPC Call
  *
- * For FRMR, chain any FastReg WRs to the Send WR. Only a
+ * For FRWR, chain any FastReg WRs to the Send WR. Only a
  * single ib_post_send call is needed to register memory
  * and then post the Send WR.
  *
@@ -507,7 +507,7 @@ int frwr_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
 	}
 
 	/* If ib_post_send fails, the next ->send_request for
-	 * @req will queue these MWs for recovery.
+	 * @req will queue these MRs for recovery.
 	 */
 	return ib_post_send(ia->ri_id->qp, post_wr, NULL);
 }



* [PATCH v4 27/30] xprtrdma: Replace outdated comment for rpcrdma_ep_post
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (25 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 26/30] xprtrdma: Update comments in frwr_op_send Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 28/30] xprtrdma: Add documenting comment for rpcrdma_buffer_destroy Chuck Lever
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Since commit 7c8d9e7c8863 ("xprtrdma: Move Receive posting to
Receive handler"), rpcrdma_ep_post is no longer responsible for
posting Receive buffers. Update the documenting comment to reflect
this change.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b46e2f9..c69b985 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1427,10 +1427,14 @@ struct rpcrdma_regbuf *
 	kfree(rb);
 }
 
-/*
- * Prepost any receive buffer, then post send.
+/**
+ * rpcrdma_ep_post - Post WRs to a transport's Send Queue
+ * @ia: transport's device information
+ * @ep: transport's RDMA endpoint information
+ * @req: rpcrdma_req containing the Send WR to post
  *
- * Receive buffer is donated to hardware, reclaimed upon recv completion.
+ * Returns 0 if the post was successful, otherwise -ENOTCONN
+ * is returned.
  */
 int
 rpcrdma_ep_post(struct rpcrdma_ia *ia,



* [PATCH v4 28/30] xprtrdma: Add documenting comment for rpcrdma_buffer_destroy
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (26 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 27/30] xprtrdma: Replace outdated comment for rpcrdma_ep_post Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:41 ` [PATCH v4 29/30] xprtrdma: Clarify comments in rpcrdma_ia_remove Chuck Lever
  2018-12-17 16:42 ` [PATCH v4 30/30] xprtrdma: Don't leak freed MRs Chuck Lever
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Make a note of the function's dependency on an earlier ib_drain_qp.
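
A rough sketch of the ordering the new comment documents (the wrapper
function below is hypothetical; only the two callees come from this
code base):

static void example_teardown(struct rpcrdma_xprt *r_xprt)
{
	/* hypothetical wrapper, for illustration only */
	rpcrdma_xprt_drain(r_xprt);	/* quiesce Send and Receive completions */
	rpcrdma_buffer_destroy(&r_xprt->rx_buf); /* safe: MRs, reps, reqs are on free lists */
}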

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c69b985..339c40e 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1177,6 +1177,14 @@ struct rpcrdma_req *
 	dprintk("RPC:       %s: released %u MRs\n", __func__, count);
 }
 
+/**
+ * rpcrdma_buffer_destroy - Release all hw resources
+ * @buf: root control block for resources
+ *
+ * ORDERING: relies on a prior ib_drain_qp :
+ * - No more Send or Receive completions can occur
+ * - All MRs, reps, and reqs are returned to their free lists
+ */
 void
 rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
 {



* [PATCH v4 29/30] xprtrdma: Clarify comments in rpcrdma_ia_remove
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (27 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 28/30] xprtrdma: Add documenting comment for rpcrdma_buffer_destroy Chuck Lever
@ 2018-12-17 16:41 ` Chuck Lever
  2018-12-17 16:42 ` [PATCH v4 30/30] xprtrdma: Don't leak freed MRs Chuck Lever
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:41 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Comments are clarified to note how transport data structures are
protected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 339c40e..b700ade 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -402,8 +402,7 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
  * rpcrdma_ia_remove - Handle device driver unload
  * @ia: interface adapter being removed
  *
- * Divest transport H/W resources associated with this adapter,
- * but allow it to be restored later.
+ * Callers must serialize calls to this function.
  */
 void
 rpcrdma_ia_remove(struct rpcrdma_ia *ia)
@@ -434,16 +433,23 @@ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
 	ib_free_cq(ep->rep_attr.send_cq);
 	ep->rep_attr.send_cq = NULL;
 
-	/* The ULP is responsible for ensuring all DMA
-	 * mappings and MRs are gone.
+	/* The ib_drain_qp above guarantees that all posted
+	 * Receives have flushed, which returns the transport's
+	 * rpcrdma_reps to the rb_recv_bufs list.
 	 */
 	list_for_each_entry(rep, &buf->rb_recv_bufs, rr_list)
 		rpcrdma_dma_unmap_regbuf(rep->rr_rdmabuf);
+
+	/* DMA mapping happens in ->send_request with the
+	 * transport send lock held. Our caller is holding
+	 * the transport send lock.
+	 */
 	list_for_each_entry(req, &buf->rb_allreqs, rl_all) {
 		rpcrdma_dma_unmap_regbuf(req->rl_rdmabuf);
 		rpcrdma_dma_unmap_regbuf(req->rl_sendbuf);
 		rpcrdma_dma_unmap_regbuf(req->rl_recvbuf);
 	}
+
 	rpcrdma_mrs_destroy(buf);
 	ib_dealloc_pd(ia->ri_pd);
 	ia->ri_pd = NULL;



* [PATCH v4 30/30] xprtrdma: Don't leak freed MRs
  2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
                   ` (28 preceding siblings ...)
  2018-12-17 16:41 ` [PATCH v4 29/30] xprtrdma: Clarify comments in rpcrdma_ia_remove Chuck Lever
@ 2018-12-17 16:42 ` Chuck Lever
  29 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 16:42 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

Defensive clean up. Don't set frwr->fr_mr until we know that the
scatterlist allocation has succeeded.
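
Reduced to a sketch (condensed from the diff below, with the error
labels and trace points omitted), the idea is to allocate into locals
and publish to the MR only after every allocation has succeeded:

	frmr = ib_alloc_mr(ia->ri_pd, ia->ri_mrtype, depth);
	if (IS_ERR(frmr))
		return PTR_ERR(frmr);

	sg = kcalloc(depth, sizeof(*sg), GFP_KERNEL);
	if (!sg) {
		ib_dereg_mr(frmr);
		return -ENOMEM;
	}

	/* publish only now, so an error exit never leaves mr half set up */
	mr->frwr.fr_mr = frmr;
	mr->mr_sg = sg;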

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c |   27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 35c8f62..6a56105 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -155,36 +155,39 @@ void frwr_release_mr(struct rpcrdma_mr *mr)
 int frwr_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr)
 {
 	unsigned int depth = ia->ri_max_frwr_depth;
-	struct rpcrdma_frwr *frwr = &mr->frwr;
+	struct scatterlist *sg;
+	struct ib_mr *frmr;
 	int rc;
 
-	frwr->fr_mr = ib_alloc_mr(ia->ri_pd, ia->ri_mrtype, depth);
-	if (IS_ERR(frwr->fr_mr))
+	frmr = ib_alloc_mr(ia->ri_pd, ia->ri_mrtype, depth);
+	if (IS_ERR(frmr))
 		goto out_mr_err;
 
-	mr->mr_sg = kcalloc(depth, sizeof(*mr->mr_sg), GFP_KERNEL);
-	if (!mr->mr_sg)
+	sg = kcalloc(depth, sizeof(*sg), GFP_KERNEL);
+	if (!sg)
 		goto out_list_err;
 
-	frwr->fr_state = FRWR_IS_INVALID;
+	mr->frwr.fr_mr = frmr;
+	mr->frwr.fr_state = FRWR_IS_INVALID;
 	mr->mr_dir = DMA_NONE;
 	INIT_LIST_HEAD(&mr->mr_list);
 	INIT_WORK(&mr->mr_recycle, frwr_mr_recycle_worker);
-	sg_init_table(mr->mr_sg, depth);
-	init_completion(&frwr->fr_linv_done);
+	init_completion(&mr->frwr.fr_linv_done);
+
+	sg_init_table(sg, depth);
+	mr->mr_sg = sg;
 	return 0;
 
 out_mr_err:
-	rc = PTR_ERR(frwr->fr_mr);
+	rc = PTR_ERR(frmr);
 	trace_xprtrdma_frwr_alloc(mr, rc);
 	return rc;
 
 out_list_err:
-	rc = -ENOMEM;
 	dprintk("RPC:       %s: sg allocation failure\n",
 		__func__);
-	ib_dereg_mr(frwr->fr_mr);
-	return rc;
+	ib_dereg_mr(frmr);
+	return -ENOMEM;
 }
 
 /**



* Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
  2018-12-17 16:39 ` [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done Chuck Lever
@ 2018-12-17 17:28   ` Trond Myklebust
  2018-12-17 18:37     ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Trond Myklebust @ 2018-12-17 17:28 UTC (permalink / raw)
  To: linux-rdma, linux-nfs, chuck.lever

On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
> Transport disconnect processing does a "wake pending tasks" at
> various points.
> 
> Suppose an RPC Reply is being processed. The RPC task that Reply
> goes with is waiting on the pending queue. If a disconnect wake-up
> happens before reply processing is done, that reply, even if it is
> good, is thrown away, and the RPC has to be sent again.
> 
> This window apparently does not exist for socket transports because
> there is a lock held while a reply is being received which prevents
> the wake-up call until after reply processing is done.
> 
> To resolve this, all RPC replies being processed on an RPC-over-RDMA
> transport have to complete before pending tasks are awoken due to a
> transport disconnect.
> 
> Callers that already hold the transport write lock may invoke
> ->ops->close directly. Others use a generic helper that schedules
> a close when the write lock can be taken safely.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/sunrpc/xprt.h                |    1 +
>  net/sunrpc/xprt.c                          |   19
> +++++++++++++++++++
>  net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++------
>  net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
>  net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++------
>  net/sunrpc/xprtrdma/verbs.c                |    5 ++---
>  6 files changed, 44 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/sunrpc/xprt.h
> b/include/linux/sunrpc/xprt.h
> index a4ab4f8..ee94ed0 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -401,6 +401,7 @@ static inline __be32
> *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
>  bool			xprt_request_get_cong(struct rpc_xprt *xprt,
> struct rpc_rqst *req);
>  void			xprt_disconnect_done(struct rpc_xprt *xprt);
>  void			xprt_force_disconnect(struct rpc_xprt *xprt);
> +void			xprt_disconnect_nowake(struct rpc_xprt *xprt);
>  void			xprt_conditional_disconnect(struct rpc_xprt
> *xprt, unsigned int cookie);
>  
>  bool			xprt_lock_connect(struct rpc_xprt *, struct
> rpc_task *, void *);
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index ce92700..afe412e 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct rpc_xprt
> *xprt)
>  }
>  EXPORT_SYMBOL_GPL(xprt_force_disconnect);
>  
> +/**
> + * xprt_disconnect_nowake - force a call to xprt->ops->close
> + * @xprt: transport to disconnect
> + *
> + * The caller must ensure that xprt_wake_pending_tasks() is
> + * called later.
> + */
> +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
> +{
> +       /* Don't race with the test_bit() in xprt_clear_locked() */
> +       spin_lock_bh(&xprt->transport_lock);
> +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
> +       /* Try to schedule an autoclose RPC call */
> +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
> +               queue_work(xprtiod_workqueue, &xprt->task_cleanup);
> +       spin_unlock_bh(&xprt->transport_lock);
> +}
> +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
> +

We shouldn't need both xprt_disconnect_nowake() and
xprt_force_disconnect() to be exported given that you can build the
latter from the former + xprt_wake_pending_tasks() (which is also
already exported).

>  static unsigned int
>  xprt_connect_cookie(struct rpc_xprt *xprt)
>  {
> diff --git a/net/sunrpc/xprtrdma/backchannel.c
> b/net/sunrpc/xprtrdma/backchannel.c
> index 2cb07a3..5d462e8 100644
> --- a/net/sunrpc/xprtrdma/backchannel.c
> +++ b/net/sunrpc/xprtrdma/backchannel.c
> @@ -193,14 +193,15 @@ static int rpcrdma_bc_marshal_reply(struct
> rpc_rqst *rqst)
>   */
>  int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
>  {
> -	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
> +	struct rpc_xprt *xprt = rqst->rq_xprt;
> +	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
>  	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
>  	int rc;
>  
> -	if (!xprt_connected(rqst->rq_xprt))
> -		goto drop_connection;
> +	if (!xprt_connected(xprt))
> +		return -ENOTCONN;
>  
> -	if (!xprt_request_get_cong(rqst->rq_xprt, rqst))
> +	if (!xprt_request_get_cong(xprt, rqst))
>  		return -EBADSLT;
>  
>  	rc = rpcrdma_bc_marshal_reply(rqst);
> @@ -215,7 +216,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst
> *rqst)
>  	if (rc != -ENOTCONN)
>  		return rc;
>  drop_connection:
> -	xprt_disconnect_done(rqst->rq_xprt);
> +	xprt->ops->close(xprt);

Why use an indirect call here? Is this ever going to be different to
xprt_rdma_close()?

>  	return -ENOTCONN;
>  }
>  
> @@ -338,7 +339,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt
> *r_xprt,
>  
>  out_overflow:
>  	pr_warn("RPC/RDMA backchannel overflow\n");
> -	xprt_disconnect_done(xprt);
> +	xprt_disconnect_nowake(xprt);
>  	/* This receive buffer gets reposted automatically
>  	 * when the connection is re-established.
>  	 */
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> index f3c147d..b908f2c 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> @@ -200,11 +200,10 @@ static int svc_rdma_bc_sendto(struct
> svcxprt_rdma *rdma,
>  		svc_rdma_send_ctxt_put(rdma, ctxt);
>  		goto drop_connection;
>  	}
> -	return rc;
> +	return 0;
>  
>  drop_connection:
>  	dprintk("svcrdma: failed to send bc call\n");
> -	xprt_disconnect_done(xprt);
>  	return -ENOTCONN;
>  }
>  
> @@ -225,8 +224,11 @@ static int svc_rdma_bc_sendto(struct
> svcxprt_rdma *rdma,
>  
>  	ret = -ENOTCONN;
>  	rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
> -	if (!test_bit(XPT_DEAD, &sxprt->xpt_flags))
> +	if (!test_bit(XPT_DEAD, &sxprt->xpt_flags)) {
>  		ret = rpcrdma_bc_send_request(rdma, rqst);
> +		if (ret == -ENOTCONN)
> +			svc_close_xprt(sxprt);
> +	}
>  
>  	mutex_unlock(&sxprt->xpt_mutex);
>  
> diff --git a/net/sunrpc/xprtrdma/transport.c
> b/net/sunrpc/xprtrdma/transport.c
> index 91c476a..a16296b 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -453,13 +453,13 @@
>  
>  	if (test_and_clear_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags)) {
>  		rpcrdma_ia_remove(ia);
> -		return;
> +		goto out;
>  	}
> +
>  	if (ep->rep_connected == -ENODEV)
>  		return;
>  	if (ep->rep_connected > 0)
>  		xprt->reestablish_timeout = 0;
> -	xprt_disconnect_done(xprt);
>  	rpcrdma_ep_disconnect(ep, ia);
>  
>  	/* Prepare @xprt for the next connection by reinitializing
> @@ -467,6 +467,10 @@
>  	 */
>  	r_xprt->rx_buf.rb_credits = 1;
>  	xprt->cwnd = RPC_CWNDSHIFT;
> +
> +out:
> +	++xprt->connect_cookie;
> +	xprt_disconnect_done(xprt);
>  }
>  
>  /**
> @@ -515,7 +519,7 @@
>  static void
>  xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
>  {
> -	xprt_force_disconnect(xprt);
> +	xprt_disconnect_nowake(xprt);
>  }
>  
>  /**
> @@ -717,7 +721,7 @@
>  #endif	/* CONFIG_SUNRPC_BACKCHANNEL */
>  
>  	if (!xprt_connected(xprt))
> -		goto drop_connection;
> +		return -ENOTCONN;
>  
>  	if (!xprt_request_get_cong(xprt, rqst))
>  		return -EBADSLT;
> @@ -749,8 +753,8 @@
>  	if (rc != -ENOTCONN)
>  		return rc;
>  drop_connection:
> -	xprt_disconnect_done(xprt);
> -	return -ENOTCONN;	/* implies disconnect */
> +	xprt_rdma_close(xprt);
> +	return -ENOTCONN;
>  }
>  
>  void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file
> *seq)
> diff --git a/net/sunrpc/xprtrdma/verbs.c
> b/net/sunrpc/xprtrdma/verbs.c
> index 9a0a765..38a757c 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -252,7 +252,7 @@ static void rpcrdma_xprt_drain(struct
> rpcrdma_xprt *r_xprt)
>  #endif
>  		set_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags);
>  		ep->rep_connected = -ENODEV;
> -		xprt_force_disconnect(xprt);
> +		xprt_disconnect_nowake(xprt);
>  		wait_for_completion(&ia->ri_remove_done);
>  
>  		ia->ri_id = NULL;
> @@ -280,10 +280,9 @@ static void rpcrdma_xprt_drain(struct
> rpcrdma_xprt *r_xprt)
>  			ep->rep_connected = -EAGAIN;
>  		goto disconnected;
>  	case RDMA_CM_EVENT_DISCONNECTED:
> -		++xprt->connect_cookie;
>  		ep->rep_connected = -ECONNABORTED;
>  disconnected:
> -		xprt_force_disconnect(xprt);
> +		xprt_disconnect_nowake(xprt);
>  		wake_up_all(&ep->rep_connect_wait);
>  		break;
>  	default:
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
  2018-12-17 17:28   ` Trond Myklebust
@ 2018-12-17 18:37     ` Chuck Lever
  2018-12-17 18:55       ` Trond Myklebust
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 18:37 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-rdma, Linux NFS Mailing List



> On Dec 17, 2018, at 12:28 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
>> Transport disconnect processing does a "wake pending tasks" at
>> various points.
>> 
>> Suppose an RPC Reply is being processed. The RPC task that Reply
>> goes with is waiting on the pending queue. If a disconnect wake-up
>> happens before reply processing is done, that reply, even if it is
>> good, is thrown away, and the RPC has to be sent again.
>> 
>> This window apparently does not exist for socket transports because
>> there is a lock held while a reply is being received which prevents
>> the wake-up call until after reply processing is done.
>> 
>> To resolve this, all RPC replies being processed on an RPC-over-RDMA
>> transport have to complete before pending tasks are awoken due to a
>> transport disconnect.
>> 
>> Callers that already hold the transport write lock may invoke
>> ->ops->close directly. Others use a generic helper that schedules
>> a close when the write lock can be taken safely.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> include/linux/sunrpc/xprt.h                |    1 +
>> net/sunrpc/xprt.c                          |   19
>> +++++++++++++++++++
>> net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++------
>> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
>> net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++------
>> net/sunrpc/xprtrdma/verbs.c                |    5 ++---
>> 6 files changed, 44 insertions(+), 18 deletions(-)
>> 
>> diff --git a/include/linux/sunrpc/xprt.h
>> b/include/linux/sunrpc/xprt.h
>> index a4ab4f8..ee94ed0 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -401,6 +401,7 @@ static inline __be32
>> *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
>> bool			xprt_request_get_cong(struct rpc_xprt *xprt,
>> struct rpc_rqst *req);
>> void			xprt_disconnect_done(struct rpc_xprt *xprt);
>> void			xprt_force_disconnect(struct rpc_xprt *xprt);
>> +void			xprt_disconnect_nowake(struct rpc_xprt *xprt);
>> void			xprt_conditional_disconnect(struct rpc_xprt
>> *xprt, unsigned int cookie);
>> 
>> bool			xprt_lock_connect(struct rpc_xprt *, struct
>> rpc_task *, void *);
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index ce92700..afe412e 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct rpc_xprt
>> *xprt)
>> }
>> EXPORT_SYMBOL_GPL(xprt_force_disconnect);
>> 
>> +/**
>> + * xprt_disconnect_nowake - force a call to xprt->ops->close
>> + * @xprt: transport to disconnect
>> + *
>> + * The caller must ensure that xprt_wake_pending_tasks() is
>> + * called later.
>> + */
>> +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
>> +{
>> +       /* Don't race with the test_bit() in xprt_clear_locked() */
>> +       spin_lock_bh(&xprt->transport_lock);
>> +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
>> +       /* Try to schedule an autoclose RPC call */
>> +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
>> +               queue_work(xprtiod_workqueue, &xprt->task_cleanup);
>> +       spin_unlock_bh(&xprt->transport_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
>> +
> 
> We shouldn't need both xprt_disconnect_nowake() and
> xprt_force_disconnect() to be exported given that you can build the
> latter from the former + xprt_wake_pending_tasks() (which is also
> already exported).

Thanks for your review!

I can get rid of xprt_disconnect_nowake. There are some variations,
depending on why wake_pending_tasks is protected by xprt->transport_lock.

static void xs_tcp_force_close(struct rpc_xprt *xprt)
{
        xprt_force_disconnect(xprt);
        xprt_wake_pending_tasks(xprt, -EAGAIN);
}    

Or,

static void xs_tcp_force_close(struct rpc_xprt *xprt)
{
        xprt_force_disconnect(xprt);
        spin_lock_bh(&xprt->transport_lock);
        xprt_wake_pending_tasks(xprt, -EAGAIN);
        spin_unlock_bh(&xprt->transport_lock);
}    

Or,

void xprt_force_disconnect(struct rpc_xprt *xprt, bool wake)
{
        /* Don't race with the test_bit() in xprt_clear_locked() */
        spin_lock_bh(&xprt->transport_lock);
        set_bit(XPRT_CLOSE_WAIT, &xprt->state);
        /* Try to schedule an autoclose RPC call */
        if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
                queue_work(xprtiod_workqueue, &xprt->task_cleanup);
        if (wake)
                xprt_wake_pending_tasks(xprt, -EAGAIN);
        spin_unlock_bh(&xprt->transport_lock);
}
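
With the third variation, call sites would then pick the behaviour
explicitly, for example (illustrative only, not part of the patch):

        /* socket transports: wake everything, as today */
        xprt_force_disconnect(xprt, true);

        /* xprtrdma disconnect path: defer wake-ups until the deferred
         * completion queue has drained
         */
        xprt_force_disconnect(xprt, false);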

Which do you prefer?


>> static unsigned int
>> xprt_connect_cookie(struct rpc_xprt *xprt)
>> {
>> diff --git a/net/sunrpc/xprtrdma/backchannel.c
>> b/net/sunrpc/xprtrdma/backchannel.c
>> index 2cb07a3..5d462e8 100644
>> --- a/net/sunrpc/xprtrdma/backchannel.c
>> +++ b/net/sunrpc/xprtrdma/backchannel.c
>> @@ -193,14 +193,15 @@ static int rpcrdma_bc_marshal_reply(struct
>> rpc_rqst *rqst)
>>  */
>> int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst)
>> {
>> -	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
>> +	struct rpc_xprt *xprt = rqst->rq_xprt;
>> +	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
>> 	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
>> 	int rc;
>> 
>> -	if (!xprt_connected(rqst->rq_xprt))
>> -		goto drop_connection;
>> +	if (!xprt_connected(xprt))
>> +		return -ENOTCONN;
>> 
>> -	if (!xprt_request_get_cong(rqst->rq_xprt, rqst))
>> +	if (!xprt_request_get_cong(xprt, rqst))
>> 		return -EBADSLT;
>> 
>> 	rc = rpcrdma_bc_marshal_reply(rqst);
>> @@ -215,7 +216,7 @@ int xprt_rdma_bc_send_reply(struct rpc_rqst
>> *rqst)
>> 	if (rc != -ENOTCONN)
>> 		return rc;
>> drop_connection:
>> -	xprt_disconnect_done(rqst->rq_xprt);
>> +	xprt->ops->close(xprt);
> 
> Why use an indirect call here? Is this ever going to be different to
> xprt_rdma_close()?
> 
>> 	return -ENOTCONN;
>> }
>> 
>> @@ -338,7 +339,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt
>> *r_xprt,
>> 
>> out_overflow:
>> 	pr_warn("RPC/RDMA backchannel overflow\n");
>> -	xprt_disconnect_done(xprt);
>> +	xprt_disconnect_nowake(xprt);
>> 	/* This receive buffer gets reposted automatically
>> 	 * when the connection is re-established.
>> 	 */
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>> b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>> index f3c147d..b908f2c 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>> @@ -200,11 +200,10 @@ static int svc_rdma_bc_sendto(struct
>> svcxprt_rdma *rdma,
>> 		svc_rdma_send_ctxt_put(rdma, ctxt);
>> 		goto drop_connection;
>> 	}
>> -	return rc;
>> +	return 0;
>> 
>> drop_connection:
>> 	dprintk("svcrdma: failed to send bc call\n");
>> -	xprt_disconnect_done(xprt);
>> 	return -ENOTCONN;
>> }
>> 
>> @@ -225,8 +224,11 @@ static int svc_rdma_bc_sendto(struct
>> svcxprt_rdma *rdma,
>> 
>> 	ret = -ENOTCONN;
>> 	rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
>> -	if (!test_bit(XPT_DEAD, &sxprt->xpt_flags))
>> +	if (!test_bit(XPT_DEAD, &sxprt->xpt_flags)) {
>> 		ret = rpcrdma_bc_send_request(rdma, rqst);
>> +		if (ret == -ENOTCONN)
>> +			svc_close_xprt(sxprt);
>> +	}
>> 
>> 	mutex_unlock(&sxprt->xpt_mutex);
>> 
>> diff --git a/net/sunrpc/xprtrdma/transport.c
>> b/net/sunrpc/xprtrdma/transport.c
>> index 91c476a..a16296b 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -453,13 +453,13 @@
>> 
>> 	if (test_and_clear_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags)) {
>> 		rpcrdma_ia_remove(ia);
>> -		return;
>> +		goto out;
>> 	}
>> +
>> 	if (ep->rep_connected == -ENODEV)
>> 		return;
>> 	if (ep->rep_connected > 0)
>> 		xprt->reestablish_timeout = 0;
>> -	xprt_disconnect_done(xprt);
>> 	rpcrdma_ep_disconnect(ep, ia);
>> 
>> 	/* Prepare @xprt for the next connection by reinitializing
>> @@ -467,6 +467,10 @@
>> 	 */
>> 	r_xprt->rx_buf.rb_credits = 1;
>> 	xprt->cwnd = RPC_CWNDSHIFT;
>> +
>> +out:
>> +	++xprt->connect_cookie;
>> +	xprt_disconnect_done(xprt);
>> }
>> 
>> /**
>> @@ -515,7 +519,7 @@
>> static void
>> xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
>> {
>> -	xprt_force_disconnect(xprt);
>> +	xprt_disconnect_nowake(xprt);
>> }
>> 
>> /**
>> @@ -717,7 +721,7 @@
>> #endif	/* CONFIG_SUNRPC_BACKCHANNEL */
>> 
>> 	if (!xprt_connected(xprt))
>> -		goto drop_connection;
>> +		return -ENOTCONN;
>> 
>> 	if (!xprt_request_get_cong(xprt, rqst))
>> 		return -EBADSLT;
>> @@ -749,8 +753,8 @@
>> 	if (rc != -ENOTCONN)
>> 		return rc;
>> drop_connection:
>> -	xprt_disconnect_done(xprt);
>> -	return -ENOTCONN;	/* implies disconnect */
>> +	xprt_rdma_close(xprt);
>> +	return -ENOTCONN;
>> }
>> 
>> void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file
>> *seq)
>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>> b/net/sunrpc/xprtrdma/verbs.c
>> index 9a0a765..38a757c 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -252,7 +252,7 @@ static void rpcrdma_xprt_drain(struct
>> rpcrdma_xprt *r_xprt)
>> #endif
>> 		set_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags);
>> 		ep->rep_connected = -ENODEV;
>> -		xprt_force_disconnect(xprt);
>> +		xprt_disconnect_nowake(xprt);
>> 		wait_for_completion(&ia->ri_remove_done);
>> 
>> 		ia->ri_id = NULL;
>> @@ -280,10 +280,9 @@ static void rpcrdma_xprt_drain(struct
>> rpcrdma_xprt *r_xprt)
>> 			ep->rep_connected = -EAGAIN;
>> 		goto disconnected;
>> 	case RDMA_CM_EVENT_DISCONNECTED:
>> -		++xprt->connect_cookie;
>> 		ep->rep_connected = -ECONNABORTED;
>> disconnected:
>> -		xprt_force_disconnect(xprt);
>> +		xprt_disconnect_nowake(xprt);
>> 		wake_up_all(&ep->rep_connect_wait);
>> 		break;
>> 	default:
>> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

--
Chuck Lever




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
  2018-12-17 18:37     ` Chuck Lever
@ 2018-12-17 18:55       ` Trond Myklebust
  2018-12-17 19:00         ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Trond Myklebust @ 2018-12-17 18:55 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-rdma, linux-nfs

On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
> > On Dec 17, 2018, at 12:28 PM, Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
> > > Transport disconnect processing does a "wake pending tasks" at
> > > various points.
> > > 
> > > Suppose an RPC Reply is being processed. The RPC task that Reply
> > > goes with is waiting on the pending queue. If a disconnect wake-
> > > up
> > > happens before reply processing is done, that reply, even if it
> > > is
> > > good, is thrown away, and the RPC has to be sent again.
> > > 
> > > This window apparently does not exist for socket transports
> > > because
> > > there is a lock held while a reply is being received which
> > > prevents
> > > the wake-up call until after reply processing is done.
> > > 
> > > To resolve this, all RPC replies being processed on an RPC-over-
> > > RDMA
> > > transport have to complete before pending tasks are awoken due to
> > > a
> > > transport disconnect.
> > > 
> > > Callers that already hold the transport write lock may invoke
> > > ->ops->close directly. Others use a generic helper that schedules
> > > a close when the write lock can be taken safely.
> > > 
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > > include/linux/sunrpc/xprt.h                |    1 +
> > > net/sunrpc/xprt.c                          |   19
> > > +++++++++++++++++++
> > > net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++------
> > > net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
> > > net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++-----
> > > -
> > > net/sunrpc/xprtrdma/verbs.c                |    5 ++---
> > > 6 files changed, 44 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/include/linux/sunrpc/xprt.h
> > > b/include/linux/sunrpc/xprt.h
> > > index a4ab4f8..ee94ed0 100644
> > > --- a/include/linux/sunrpc/xprt.h
> > > +++ b/include/linux/sunrpc/xprt.h
> > > @@ -401,6 +401,7 @@ static inline __be32
> > > *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
> > > bool			xprt_request_get_cong(struct rpc_xprt
> > > *xprt,
> > > struct rpc_rqst *req);
> > > void			xprt_disconnect_done(struct rpc_xprt
> > > *xprt);
> > > void			xprt_force_disconnect(struct rpc_xprt
> > > *xprt);
> > > +void			xprt_disconnect_nowake(struct rpc_xprt
> > > *xprt);
> > > void			xprt_conditional_disconnect(struct
> > > rpc_xprt
> > > *xprt, unsigned int cookie);
> > > 
> > > bool			xprt_lock_connect(struct rpc_xprt *,
> > > struct
> > > rpc_task *, void *);
> > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > index ce92700..afe412e 100644
> > > --- a/net/sunrpc/xprt.c
> > > +++ b/net/sunrpc/xprt.c
> > > @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct rpc_xprt
> > > *xprt)
> > > }
> > > EXPORT_SYMBOL_GPL(xprt_force_disconnect);
> > > 
> > > +/**
> > > + * xprt_disconnect_nowake - force a call to xprt->ops->close
> > > + * @xprt: transport to disconnect
> > > + *
> > > + * The caller must ensure that xprt_wake_pending_tasks() is
> > > + * called later.
> > > + */
> > > +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
> > > +{
> > > +       /* Don't race with the test_bit() in xprt_clear_locked()
> > > */
> > > +       spin_lock_bh(&xprt->transport_lock);
> > > +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
> > > +       /* Try to schedule an autoclose RPC call */
> > > +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
> > > +               queue_work(xprtiod_workqueue, &xprt-
> > > >task_cleanup);
> > > +       spin_unlock_bh(&xprt->transport_lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
> > > +
> > 
> > We shouldn't need both xprt_disconnect_nowake() and
> > xprt_force_disconnect() to be exported given that you can build the
> > latter from the former + xprt_wake_pending_tasks() (which is also
> > already exported).
> 
> Thanks for your review!
> 
> I can get rid of xprt_disconnect_nowake. There are some variations,
> depending on why wake_pending_tasks is protected by xprt-
> >transport_lock.

I'm having some second thoughts about the patch that Scott sent out
last week to fix the issue that Dave and he were seeing. I think that
what we really need to do to fix his issue is to call
xprt_disconnect_done() after we've released the TCP socket.

Given that realisation, I think that we can fix up
xprt_force_disconnect() to only wake up the task that holds the
XPRT_LOCKED instead of doing a thundering herd wakeup like we do today.
That should (I think) obviate the need for a separate
xprt_disconnect_nowake().
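
For illustration, one way to express "wake only the XPRT_LOCKED holder"
is sketched below. This is only a sketch of the idea, not the
forthcoming patch; the else-branch and the choice of helper are
assumptions.

void xprt_force_disconnect(struct rpc_xprt *xprt)
{
        /* Don't race with the test_bit() in xprt_clear_locked() */
        spin_lock_bh(&xprt->transport_lock);
        set_bit(XPRT_CLOSE_WAIT, &xprt->state);
        /* Try to schedule an autoclose RPC call */
        if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
                queue_work(xprtiod_workqueue, &xprt->task_cleanup);
        else if (xprt->snd_task != NULL)
                /* Wake only the lock holder; it will drive the close */
                rpc_wake_up_queued_task(&xprt->pending, xprt->snd_task);
        spin_unlock_bh(&xprt->transport_lock);
}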

A patch is forthcoming later today. I'll make sure you are Cced so you
can comment.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
  2018-12-17 18:55       ` Trond Myklebust
@ 2018-12-17 19:00         ` Chuck Lever
  2018-12-17 19:09           ` Trond Myklebust
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 19:00 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-rdma, Linux NFS Mailing List



> On Dec 17, 2018, at 1:55 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
>>> On Dec 17, 2018, at 12:28 PM, Trond Myklebust <
>>> trondmy@hammerspace.com> wrote:
>>> 
>>> On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
>>>> Transport disconnect processing does a "wake pending tasks" at
>>>> various points.
>>>> 
>>>> Suppose an RPC Reply is being processed. The RPC task that Reply
>>>> goes with is waiting on the pending queue. If a disconnect wake-
>>>> up
>>>> happens before reply processing is done, that reply, even if it
>>>> is
>>>> good, is thrown away, and the RPC has to be sent again.
>>>> 
>>>> This window apparently does not exist for socket transports
>>>> because
>>>> there is a lock held while a reply is being received which
>>>> prevents
>>>> the wake-up call until after reply processing is done.
>>>> 
>>>> To resolve this, all RPC replies being processed on an RPC-over-
>>>> RDMA
>>>> transport have to complete before pending tasks are awoken due to
>>>> a
>>>> transport disconnect.
>>>> 
>>>> Callers that already hold the transport write lock may invoke
>>>> ->ops->close directly. Others use a generic helper that schedules
>>>> a close when the write lock can be taken safely.
>>>> 
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> ---
>>>> include/linux/sunrpc/xprt.h                |    1 +
>>>> net/sunrpc/xprt.c                          |   19
>>>> +++++++++++++++++++
>>>> net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++------
>>>> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
>>>> net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++-----
>>>> -
>>>> net/sunrpc/xprtrdma/verbs.c                |    5 ++---
>>>> 6 files changed, 44 insertions(+), 18 deletions(-)
>>>> 
>>>> diff --git a/include/linux/sunrpc/xprt.h
>>>> b/include/linux/sunrpc/xprt.h
>>>> index a4ab4f8..ee94ed0 100644
>>>> --- a/include/linux/sunrpc/xprt.h
>>>> +++ b/include/linux/sunrpc/xprt.h
>>>> @@ -401,6 +401,7 @@ static inline __be32
>>>> *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
>>>> bool			xprt_request_get_cong(struct rpc_xprt
>>>> *xprt,
>>>> struct rpc_rqst *req);
>>>> void			xprt_disconnect_done(struct rpc_xprt
>>>> *xprt);
>>>> void			xprt_force_disconnect(struct rpc_xprt
>>>> *xprt);
>>>> +void			xprt_disconnect_nowake(struct rpc_xprt
>>>> *xprt);
>>>> void			xprt_conditional_disconnect(struct
>>>> rpc_xprt
>>>> *xprt, unsigned int cookie);
>>>> 
>>>> bool			xprt_lock_connect(struct rpc_xprt *,
>>>> struct
>>>> rpc_task *, void *);
>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>> index ce92700..afe412e 100644
>>>> --- a/net/sunrpc/xprt.c
>>>> +++ b/net/sunrpc/xprt.c
>>>> @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct rpc_xprt
>>>> *xprt)
>>>> }
>>>> EXPORT_SYMBOL_GPL(xprt_force_disconnect);
>>>> 
>>>> +/**
>>>> + * xprt_disconnect_nowake - force a call to xprt->ops->close
>>>> + * @xprt: transport to disconnect
>>>> + *
>>>> + * The caller must ensure that xprt_wake_pending_tasks() is
>>>> + * called later.
>>>> + */
>>>> +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
>>>> +{
>>>> +       /* Don't race with the test_bit() in xprt_clear_locked()
>>>> */
>>>> +       spin_lock_bh(&xprt->transport_lock);
>>>> +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
>>>> +       /* Try to schedule an autoclose RPC call */
>>>> +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
>>>> +               queue_work(xprtiod_workqueue, &xprt-
>>>>> task_cleanup);
>>>> +       spin_unlock_bh(&xprt->transport_lock);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
>>>> +
>>> 
>>> We shouldn't need both xprt_disconnect_nowake() and
>>> xprt_force_disconnect() to be exported given that you can build the
>>> latter from the former + xprt_wake_pending_tasks() (which is also
>>> already exported).
>> 
>> Thanks for your review!
>> 
>> I can get rid of xprt_disconnect_nowake. There are some variations,
>> depending on why wake_pending_tasks is protected by xprt-
>>> transport_lock.
> 
> I'm having some second thoughts about the patch that Scott sent out
> last week to fix the issue that Dave and he were seeing. I think that
> what we really need to do to fix his issue is to call
> xprt_disconnect_done() after we've released the TCP socket.
> 
> Given that realisation, I think that we can fix up
> xprt_force_disconnect() to only wake up the task that holds the
> XPRT_LOCKED instead of doing a thundering herd wakeup like we do today.
> That should (I think) obviate the need for a separate
> xprt_disconnect_nowake().

For RPC-over-RDMA, there really is a dangerous race between the waking
task(s) and work being done by the deferred RPC completion handler. IMO
the only safe thing RPC-over-RDMA can do is not wake anything until the
deferred queue is well and truly drained.

As you observed when we last spoke, socket transports are already
protected from this race by the socket lock.... RPC-over-RDMA is going
to have to be more careful.


> A patch is forthcoming later today. I'll make sure you are Cced so you
> can comment.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

--
Chuck Lever




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
  2018-12-17 19:00         ` Chuck Lever
@ 2018-12-17 19:09           ` Trond Myklebust
  2018-12-17 19:19             ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Trond Myklebust @ 2018-12-17 19:09 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-rdma, linux-nfs

On Mon, 2018-12-17 at 14:00 -0500, Chuck Lever wrote:
> > On Dec 17, 2018, at 1:55 PM, Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
> > > > On Dec 17, 2018, at 12:28 PM, Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
> > > > > Transport disconnect processing does a "wake pending tasks"
> > > > > at
> > > > > various points.
> > > > > 
> > > > > Suppose an RPC Reply is being processed. The RPC task that
> > > > > Reply
> > > > > goes with is waiting on the pending queue. If a disconnect
> > > > > wake-
> > > > > up
> > > > > happens before reply processing is done, that reply, even if
> > > > > it
> > > > > is
> > > > > good, is thrown away, and the RPC has to be sent again.
> > > > > 
> > > > > This window apparently does not exist for socket transports
> > > > > because
> > > > > there is a lock held while a reply is being received which
> > > > > prevents
> > > > > the wake-up call until after reply processing is done.
> > > > > 
> > > > > To resolve this, all RPC replies being processed on an RPC-
> > > > > over-
> > > > > RDMA
> > > > > transport have to complete before pending tasks are awoken
> > > > > due to
> > > > > a
> > > > > transport disconnect.
> > > > > 
> > > > > Callers that already hold the transport write lock may invoke
> > > > > ->ops->close directly. Others use a generic helper that
> > > > > schedules
> > > > > a close when the write lock can be taken safely.
> > > > > 
> > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > ---
> > > > > include/linux/sunrpc/xprt.h                |    1 +
> > > > > net/sunrpc/xprt.c                          |   19
> > > > > +++++++++++++++++++
> > > > > net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++--
> > > > > ----
> > > > > net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
> > > > > net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++-
> > > > > ----
> > > > > -
> > > > > net/sunrpc/xprtrdma/verbs.c                |    5 ++---
> > > > > 6 files changed, 44 insertions(+), 18 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/sunrpc/xprt.h
> > > > > b/include/linux/sunrpc/xprt.h
> > > > > index a4ab4f8..ee94ed0 100644
> > > > > --- a/include/linux/sunrpc/xprt.h
> > > > > +++ b/include/linux/sunrpc/xprt.h
> > > > > @@ -401,6 +401,7 @@ static inline __be32
> > > > > *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
> > > > > bool			xprt_request_get_cong(struct rpc_xprt
> > > > > *xprt,
> > > > > struct rpc_rqst *req);
> > > > > void			xprt_disconnect_done(struct rpc_xprt
> > > > > *xprt);
> > > > > void			xprt_force_disconnect(struct rpc_xprt
> > > > > *xprt);
> > > > > +void			xprt_disconnect_nowake(struct rpc_xprt
> > > > > *xprt);
> > > > > void			xprt_conditional_disconnect(struct
> > > > > rpc_xprt
> > > > > *xprt, unsigned int cookie);
> > > > > 
> > > > > bool			xprt_lock_connect(struct rpc_xprt *,
> > > > > struct
> > > > > rpc_task *, void *);
> > > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > > index ce92700..afe412e 100644
> > > > > --- a/net/sunrpc/xprt.c
> > > > > +++ b/net/sunrpc/xprt.c
> > > > > @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct
> > > > > rpc_xprt
> > > > > *xprt)
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(xprt_force_disconnect);
> > > > > 
> > > > > +/**
> > > > > + * xprt_disconnect_nowake - force a call to xprt->ops->close
> > > > > + * @xprt: transport to disconnect
> > > > > + *
> > > > > + * The caller must ensure that xprt_wake_pending_tasks() is
> > > > > + * called later.
> > > > > + */
> > > > > +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
> > > > > +{
> > > > > +       /* Don't race with the test_bit() in
> > > > > xprt_clear_locked()
> > > > > */
> > > > > +       spin_lock_bh(&xprt->transport_lock);
> > > > > +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
> > > > > +       /* Try to schedule an autoclose RPC call */
> > > > > +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
> > > > > +               queue_work(xprtiod_workqueue, &xprt-
> > > > > > task_cleanup);
> > > > > +       spin_unlock_bh(&xprt->transport_lock);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
> > > > > +
> > > > 
> > > > We shouldn't need both xprt_disconnect_nowake() and
> > > > xprt_force_disconnect() to be exported given that you can build
> > > > the
> > > > latter from the former + xprt_wake_pending_tasks() (which is
> > > > also
> > > > already exported).
> > > 
> > > Thanks for your review!
> > > 
> > > I can get rid of xprt_disconnect_nowake. There are some
> > > variations,
> > > depending on why wake_pending_tasks is protected by xprt-
> > > > transport_lock.
> > 
> > I'm having some second thoughts about the patch that Scott sent out
> > last week to fix the issue that Dave and he were seeing. I think
> > that
> > what we really need to do to fix his issue is to call
> > xprt_disconnect_done() after we've released the TCP socket.
> > 
> > Given that realisation, I think that we can fix up
> > xprt_force_disconnect() to only wake up the task that holds the
> > XPRT_LOCKED instead of doing a thundering herd wakeup like we do
> > today.
> > That should (I think) obviate the need for a separate
> > xprt_disconnect_nowake().
> 
> For RPC-over-RDMA, there really is a dangerous race between the
> waking
> task(s) and work being done by the deferred RPC completion handler.
> IMO
> the only safe thing RPC-over-RDMA can do is not wake anything until
> the
> deferred queue is well and truly drained.

The deferred RPC completion handler (and hence the close) cannot
execute if another task is holding XPRT_LOCKED, so we do need to wake
up that task (and only that one).

Note that in the new code, the only reason why a task would be holding
XPRT_LOCKED while sleeping is because

   1. It is waiting for a connection attempt to complete following a call
      to xprt_connect().
   2. It is waiting for a write_space event following an attempt to
      transmit.



> As you observed when we last spoke, socket transports are already
> protected from this race by the socket lock.... RPC-over-RDMA is
> going
> to have to be more careful.
> 
> 
> > A patch is forthcoming later today. I'll make sure you are Cced so
> > you
> > can comment.
> > 
> > -- 
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@hammerspace.com
> 
> --
> Chuck Lever
> 
> 
> 
-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
  2018-12-17 19:09           ` Trond Myklebust
@ 2018-12-17 19:19             ` Chuck Lever
  2018-12-17 19:26               ` Trond Myklebust
  0 siblings, 1 reply; 40+ messages in thread
From: Chuck Lever @ 2018-12-17 19:19 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-rdma, Linux NFS Mailing List



> On Dec 17, 2018, at 2:09 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2018-12-17 at 14:00 -0500, Chuck Lever wrote:
>>> On Dec 17, 2018, at 1:55 PM, Trond Myklebust <
>>> trondmy@hammerspace.com> wrote:
>>> 
>>> On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
>>>>> On Dec 17, 2018, at 12:28 PM, Trond Myklebust <
>>>>> trondmy@hammerspace.com> wrote:
>>>>> 
>>>>> On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
>>>>>> Transport disconnect processing does a "wake pending tasks"
>>>>>> at
>>>>>> various points.
>>>>>> 
>>>>>> Suppose an RPC Reply is being processed. The RPC task that
>>>>>> Reply
>>>>>> goes with is waiting on the pending queue. If a disconnect
>>>>>> wake-
>>>>>> up
>>>>>> happens before reply processing is done, that reply, even if
>>>>>> it
>>>>>> is
>>>>>> good, is thrown away, and the RPC has to be sent again.
>>>>>> 
>>>>>> This window apparently does not exist for socket transports
>>>>>> because
>>>>>> there is a lock held while a reply is being received which
>>>>>> prevents
>>>>>> the wake-up call until after reply processing is done.
>>>>>> 
>>>>>> To resolve this, all RPC replies being processed on an RPC-
>>>>>> over-
>>>>>> RDMA
>>>>>> transport have to complete before pending tasks are awoken
>>>>>> due to
>>>>>> a
>>>>>> transport disconnect.
>>>>>> 
>>>>>> Callers that already hold the transport write lock may invoke
>>>>>> ->ops->close directly. Others use a generic helper that
>>>>>> schedules
>>>>>> a close when the write lock can be taken safely.
>>>>>> 
>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>> ---
>>>>>> include/linux/sunrpc/xprt.h                |    1 +
>>>>>> net/sunrpc/xprt.c                          |   19
>>>>>> +++++++++++++++++++
>>>>>> net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++--
>>>>>> ----
>>>>>> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
>>>>>> net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++-
>>>>>> ----
>>>>>> -
>>>>>> net/sunrpc/xprtrdma/verbs.c                |    5 ++---
>>>>>> 6 files changed, 44 insertions(+), 18 deletions(-)
>>>>>> 
>>>>>> diff --git a/include/linux/sunrpc/xprt.h
>>>>>> b/include/linux/sunrpc/xprt.h
>>>>>> index a4ab4f8..ee94ed0 100644
>>>>>> --- a/include/linux/sunrpc/xprt.h
>>>>>> +++ b/include/linux/sunrpc/xprt.h
>>>>>> @@ -401,6 +401,7 @@ static inline __be32
>>>>>> *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
>>>>>> bool			xprt_request_get_cong(struct rpc_xprt
>>>>>> *xprt,
>>>>>> struct rpc_rqst *req);
>>>>>> void			xprt_disconnect_done(struct rpc_xprt
>>>>>> *xprt);
>>>>>> void			xprt_force_disconnect(struct rpc_xprt
>>>>>> *xprt);
>>>>>> +void			xprt_disconnect_nowake(struct rpc_xprt
>>>>>> *xprt);
>>>>>> void			xprt_conditional_disconnect(struct
>>>>>> rpc_xprt
>>>>>> *xprt, unsigned int cookie);
>>>>>> 
>>>>>> bool			xprt_lock_connect(struct rpc_xprt *,
>>>>>> struct
>>>>>> rpc_task *, void *);
>>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>>> index ce92700..afe412e 100644
>>>>>> --- a/net/sunrpc/xprt.c
>>>>>> +++ b/net/sunrpc/xprt.c
>>>>>> @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct
>>>>>> rpc_xprt
>>>>>> *xprt)
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(xprt_force_disconnect);
>>>>>> 
>>>>>> +/**
>>>>>> + * xprt_disconnect_nowake - force a call to xprt->ops->close
>>>>>> + * @xprt: transport to disconnect
>>>>>> + *
>>>>>> + * The caller must ensure that xprt_wake_pending_tasks() is
>>>>>> + * called later.
>>>>>> + */
>>>>>> +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
>>>>>> +{
>>>>>> +       /* Don't race with the test_bit() in
>>>>>> xprt_clear_locked()
>>>>>> */
>>>>>> +       spin_lock_bh(&xprt->transport_lock);
>>>>>> +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
>>>>>> +       /* Try to schedule an autoclose RPC call */
>>>>>> +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
>>>>>> +               queue_work(xprtiod_workqueue, &xprt-
>>>>>>> task_cleanup);
>>>>>> +       spin_unlock_bh(&xprt->transport_lock);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
>>>>>> +
>>>>> 
>>>>> We shouldn't need both xprt_disconnect_nowake() and
>>>>> xprt_force_disconnect() to be exported given that you can build
>>>>> the
>>>>> latter from the former + xprt_wake_pending_tasks() (which is
>>>>> also
>>>>> already exported).
>>>> 
>>>> Thanks for your review!
>>>> 
>>>> I can get rid of xprt_disconnect_nowake. There are some
>>>> variations,
>>>> depending on why wake_pending_tasks is protected by xprt-
>>>>> transport_lock.
>>> 
>>> I'm having some second thoughts about the patch that Scott sent out
>>> last week to fix the issue that Dave and he were seeing. I think
>>> that
>>> what we really need to do to fix his issue is to call
>>> xprt_disconnect_done() after we've released the TCP socket.
>>> 
>>> Given that realisation, I think that we can fix up
>>> xprt_force_disconnect() to only wake up the task that holds the
>>> XPRT_LOCKED instead of doing a thundering herd wakeup like we do
>>> today.
>>> That should (I think) obviate the need for a separate
>>> xprt_disconnect_nowake().
>> 
>> For RPC-over-RDMA, there really is a dangerous race between the
>> waking
>> task(s) and work being done by the deferred RPC completion handler.
>> IMO
>> the only safe thing RPC-over-RDMA can do is not wake anything until
>> the
>> deferred queue is well and truly drained.
> 
> The deferred RPC completion handler (and hence the close) cannot
> execute if another task is holding XPRT_LOCKED,

Just to be certain we are speaking of the same thing,
rpcrdma_deferred_completion is queued by the Receive handler, and
can indeed run independently of an rpc_task. It is always running
outside the purview of XPRT_LOCKED.


> so we do need to wake up that task (and only that one).
> 
> Note that in the new code, the only reason why a task would be holding
> XPRT_LOCKED while sleeping is because
> 
>   1. It is waiting for a connection attempt to complete following a call
>      to xprt_connect().
>   2. It is waiting for a write_space event following an attempt to
>      transmit.

xprt_rdma_close can sleep in rpcrdma_ep_disconnect:

 -> ib_drain_{qp,sq,rq} can all sleep waiting for the last FLUSH

 -> drain_workqueue, added in this patch, can sleep waiting for the
    deferred RPC completion workqueue to drain
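
Schematically, the two blocking steps above look like this (a sketch
only; the completion workqueue field name is an assumption about the
per-xprt workqueue introduced earlier in this series):

static void example_close_blocking_steps(struct rpcrdma_xprt *r_xprt)
{
        struct rpcrdma_ia *ia = &r_xprt->rx_ia;

        /* Sleeps until the final FLUSH completion for the QP arrives */
        ib_drain_qp(ia->ri_id->qp);

        /* Sleeps until every queued rpcrdma_deferred_completion work
         * item has finished running (field name assumed)
         */
        drain_workqueue(r_xprt->rx_buf.rb_completion_wq);
}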


>> As you observed when we last spoke, socket transports are already
>> protected from this race by the socket lock.... RPC-over-RDMA is
>> going
>> to have to be more careful.
>> 
>> 
>>> A patch is forthcoming later today. I'll make sure you are Cced so
>>> you
>>> can comment.
>>> 
>>> -- 
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> trond.myklebust@hammerspace.com
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 
> -- 
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space

--
Chuck Lever




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
  2018-12-17 19:19             ` Chuck Lever
@ 2018-12-17 19:26               ` Trond Myklebust
  0 siblings, 0 replies; 40+ messages in thread
From: Trond Myklebust @ 2018-12-17 19:26 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-rdma, linux-nfs

On Mon, 2018-12-17 at 14:19 -0500, Chuck Lever wrote:
> > On Dec 17, 2018, at 2:09 PM, Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2018-12-17 at 14:00 -0500, Chuck Lever wrote:
> > > > On Dec 17, 2018, at 1:55 PM, Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
> > > > > > On Dec 17, 2018, at 12:28 PM, Trond Myklebust <
> > > > > > trondmy@hammerspace.com> wrote:
> > > > > > 
> > > > > > On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
> > > > > > > Transport disconnect processing does a "wake pending
> > > > > > > tasks"
> > > > > > > at
> > > > > > > various points.
> > > > > > > 
> > > > > > > Suppose an RPC Reply is being processed. The RPC task
> > > > > > > that
> > > > > > > Reply
> > > > > > > goes with is waiting on the pending queue. If a
> > > > > > > disconnect
> > > > > > > wake-
> > > > > > > up
> > > > > > > happens before reply processing is done, that reply, even
> > > > > > > if
> > > > > > > it
> > > > > > > is
> > > > > > > good, is thrown away, and the RPC has to be sent again.
> > > > > > > 
> > > > > > > This window apparently does not exist for socket
> > > > > > > transports
> > > > > > > because
> > > > > > > there is a lock held while a reply is being received
> > > > > > > which
> > > > > > > prevents
> > > > > > > the wake-up call until after reply processing is done.
> > > > > > > 
> > > > > > > To resolve this, all RPC replies being processed on an
> > > > > > > RPC-
> > > > > > > over-
> > > > > > > RDMA
> > > > > > > transport have to complete before pending tasks are
> > > > > > > awoken
> > > > > > > due to
> > > > > > > a
> > > > > > > transport disconnect.
> > > > > > > 
> > > > > > > Callers that already hold the transport write lock may
> > > > > > > invoke
> > > > > > > ->ops->close directly. Others use a generic helper that
> > > > > > > schedules
> > > > > > > a close when the write lock can be taken safely.
> > > > > > > 
> > > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > > > ---
> > > > > > > include/linux/sunrpc/xprt.h                |    1 +
> > > > > > > net/sunrpc/xprt.c                          |   19
> > > > > > > +++++++++++++++++++
> > > > > > > net/sunrpc/xprtrdma/backchannel.c          |   13
> > > > > > > +++++++--
> > > > > > > ----
> > > > > > > net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++
> > > > > > > ---
> > > > > > > net/sunrpc/xprtrdma/transport.c            |   16
> > > > > > > ++++++++++-
> > > > > > > ----
> > > > > > > -
> > > > > > > net/sunrpc/xprtrdma/verbs.c                |    5 ++---
> > > > > > > 6 files changed, 44 insertions(+), 18 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/include/linux/sunrpc/xprt.h
> > > > > > > b/include/linux/sunrpc/xprt.h
> > > > > > > index a4ab4f8..ee94ed0 100644
> > > > > > > --- a/include/linux/sunrpc/xprt.h
> > > > > > > +++ b/include/linux/sunrpc/xprt.h
> > > > > > > @@ -401,6 +401,7 @@ static inline __be32
> > > > > > > *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32
> > > > > > > *
> > > > > > > bool			xprt_request_get_cong(struct
> > > > > > > rpc_xprt
> > > > > > > *xprt,
> > > > > > > struct rpc_rqst *req);
> > > > > > > void			xprt_disconnect_done(struct
> > > > > > > rpc_xprt
> > > > > > > *xprt);
> > > > > > > void			xprt_force_disconnect(struct
> > > > > > > rpc_xprt
> > > > > > > *xprt);
> > > > > > > +void			xprt_disconnect_nowake(struct
> > > > > > > rpc_xprt
> > > > > > > *xprt);
> > > > > > > void			xprt_conditional_disconnect(str
> > > > > > > uct
> > > > > > > rpc_xprt
> > > > > > > *xprt, unsigned int cookie);
> > > > > > > 
> > > > > > > bool			xprt_lock_connect(struct
> > > > > > > rpc_xprt *,
> > > > > > > struct
> > > > > > > rpc_task *, void *);
> > > > > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > > > > index ce92700..afe412e 100644
> > > > > > > --- a/net/sunrpc/xprt.c
> > > > > > > +++ b/net/sunrpc/xprt.c
> > > > > > > @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct
> > > > > > > rpc_xprt
> > > > > > > *xprt)
> > > > > > > }
> > > > > > > EXPORT_SYMBOL_GPL(xprt_force_disconnect);
> > > > > > > 
> > > > > > > +/**
> > > > > > > + * xprt_disconnect_nowake - force a call to xprt->ops-
> > > > > > > >close
> > > > > > > + * @xprt: transport to disconnect
> > > > > > > + *
> > > > > > > + * The caller must ensure that xprt_wake_pending_tasks()
> > > > > > > is
> > > > > > > + * called later.
> > > > > > > + */
> > > > > > > +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
> > > > > > > +{
> > > > > > > +       /* Don't race with the test_bit() in
> > > > > > > xprt_clear_locked()
> > > > > > > */
> > > > > > > +       spin_lock_bh(&xprt->transport_lock);
> > > > > > > +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
> > > > > > > +       /* Try to schedule an autoclose RPC call */
> > > > > > > +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state)
> > > > > > > == 0)
> > > > > > > +               queue_work(xprtiod_workqueue, &xprt-
> > > > > > > > task_cleanup);
> > > > > > > +       spin_unlock_bh(&xprt->transport_lock);
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
> > > > > > > +
> > > > > > 
> > > > > > We shouldn't need both xprt_disconnect_nowake() and
> > > > > > xprt_force_disconnect() to be exported given that you can
> > > > > > build
> > > > > > the
> > > > > > latter from the former + xprt_wake_pending_tasks() (which
> > > > > > is
> > > > > > also
> > > > > > already exported).
> > > > > 
> > > > > Thanks for your review!
> > > > > 
> > > > > I can get rid of xprt_disconnect_nowake. There are some
> > > > > variations,
> > > > > depending on why wake_pending_tasks is protected by xprt-
> > > > > > transport_lock.
> > > > 
> > > > I'm having some second thoughts about the patch that Scott sent
> > > > out
> > > > last week to fix the issue that Dave and he were seeing. I
> > > > think
> > > > that
> > > > what we really need to do to fix his issue is to call
> > > > xprt_disconnect_done() after we've released the TCP socket.
> > > > 
> > > > Given that realisation, I think that we can fix up
> > > > xprt_force_disconnect() to only wake up the task that holds the
> > > > XPRT_LOCKED instead of doing a thundering herd wakeup like we
> > > > do
> > > > today.
> > > > That should (I think) obviate the need for a separate
> > > > xprt_disconnect_nowake().
> > > 
> > > For RPC-over-RDMA, there really is a dangerous race between the
> > > waking
> > > task(s) and work being done by the deferred RPC completion
> > > handler.
> > > IMO
> > > the only safe thing RPC-over-RDMA can do is not wake anything
> > > until
> > > the
> > > deferred queue is well and truly drained.
> > 
> > The deferred RPC completion handler (and hence the close) cannot
> > execute if another task is holding XPRT_LOCKED,
> 
> Just to be certain we are speaking of the same thing,
> rpcrdma_deferred_completion is queued by the Receive handler, and
> can indeed run independently of an rpc_task. It is always running
> outside the purview of XPRT_LOCKED.

No. I was thinking of the xprt->task_cleanup. It can't execute, and
complete your close until it holds XPRT_LOCKED.
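
For context, the autoclose worker behind xprt->task_cleanup runs
roughly as follows (a from-memory sketch of net/sunrpc/xprt.c in this
era, not part of the patch; details may differ):

static void xprt_autoclose(struct work_struct *work)
{
        struct rpc_xprt *xprt =
                container_of(work, struct rpc_xprt, task_cleanup);

        clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
        /* XPRT_LOCKED was taken on this worker's behalf before it was
         * queued, so ->close() runs under the transport write lock.
         */
        xprt->ops->close(xprt);
        xprt_release_write(xprt, NULL);         /* releases XPRT_LOCKED */
}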

> 
> 
> > so we do need to wake up that task (and only that one).
> > 
> > Note that in the new code, the only reason why a task would be
> > holding
> > XPRT_LOCKED while sleeping is because
> > 
> >   1. It is waiting for a connection attempt to complete following a
> > call
> >      to xprt_connect().
> >   2. It is waiting for a write_space event following an attempt to
> >      transmit.
> 
> xprt_rdma_close can sleep in rpcrdma_ep_disconnect:
> 
>  -> ib_drain_{qp,sq,rq} can all sleep waiting for the last FLUSH
> 
>  -> drain_workqueue, added in this patch, can sleep waiting for the
>     deferred RPC completion workqueue to drain
> 
> 
> > > As you observed when we last spoke, socket transports are already
> > > protected from this race by the socket lock.... RPC-over-RDMA is
> > > going
> > > to have to be more careful.
> > > 
> > > 
> > > > A patch is forthcoming later today. I'll make sure you are Cced
> > > > so
> > > > you
> > > > can comment.
> > > > 
> > > > -- 
> > > > Trond Myklebust
> > > > Linux NFS client maintainer, Hammerspace
> > > > trond.myklebust@hammerspace.com
> > > 
> > > --
> > > Chuck Lever
> > > 
> > > 
> > > 
> > -- 
> > Trond Myklebust
> > CTO, Hammerspace Inc
> > 4300 El Camino Real, Suite 105
> > Los Altos, CA 94022
> > www.hammer.space
> 
> --
> Chuck Lever
> 
> 
> 
-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 07/30] xprtrdma: Fix ri_max_segs and the result of ro_maxpages
  2018-12-17 16:39 ` [PATCH v4 07/30] xprtrdma: Fix ri_max_segs and the result of ro_maxpages Chuck Lever
@ 2018-12-18 19:35   ` Anna Schumaker
  2018-12-18 19:39     ` Chuck Lever
  0 siblings, 1 reply; 40+ messages in thread
From: Anna Schumaker @ 2018-12-18 19:35 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

Hi Chuck,

On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
> With certain combinations of krb5i/p, MR size, and r/wsize, I/O can
> fail with EMSGSIZE. This is because the calculated value of
> ri_max_segs (the max number of MRs per RPC) exceeded
> RPCRDMA_MAX_HDR_SEGS, which caused Read or Write list encoding to
> walk off the end of the transport header.
> 
> Once that was addressed, the ro_maxpages result has to be corrected
> to account for the number of MRs needed for Reply chunks, which is
> 2 MRs smaller than a normal Read or Write chunk.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/fmr_ops.c   |    7 +++++--
>  net/sunrpc/xprtrdma/frwr_ops.c  |    7 +++++--
>  net/sunrpc/xprtrdma/transport.c |    6 ++++--
>  3 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index 7f5632c..78a0224 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -176,7 +176,10 @@ enum {
>  
>  	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
>  				RPCRDMA_MAX_FMR_SGES);
> -	ia->ri_max_segs += 2;	/* segments for head and tail buffers */
> +	/* Reply chunks require segments for head and tail buffers */
> +	ia->ri_max_segs += 2;
> +	if (ia->ri_max_segs > RPCRDMA_MAX_HDR_SEGS)
> +		ia->ri_max_segs = RPCRDMA_MAX_HDR_SEGS;
>  	return 0;
>  }
>  
> @@ -186,7 +189,7 @@ enum {
>  fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
>  {
>  	return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
> -		     RPCRDMA_MAX_HDR_SEGS * RPCRDMA_MAX_FMR_SGES);
> +		     (ia->ri_max_segs - 2) * RPCRDMA_MAX_FMR_SGES);

ia isn't defined in this function.  Should that be r_xprt->rx_ia.ri_max_segs
instead?

Thanks,
Anna

>  }
>  
>  /* Use the ib_map_phys_fmr() verb to register a memory region
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 27222c0..f587e44 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -244,7 +244,10 @@
>  
>  	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
>  				ia->ri_max_frwr_depth);
> -	ia->ri_max_segs += 2;	/* segments for head and tail buffers */
> +	/* Reply chunks require segments for head and tail buffers */
> +	ia->ri_max_segs += 2;
> +	if (ia->ri_max_segs > RPCRDMA_MAX_HDR_SEGS)
> +		ia->ri_max_segs = RPCRDMA_MAX_HDR_SEGS;
>  	return 0;
>  }
>  
> @@ -257,7 +260,7 @@
>  	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>  
>  	return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
> -		     RPCRDMA_MAX_HDR_SEGS * ia->ri_max_frwr_depth);
> +		     (ia->ri_max_segs - 2) * ia->ri_max_frwr_depth);
>  }
>  
>  static void
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index a16296b..fbb14bf 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -704,8 +704,10 @@
>   *	%-ENOTCONN if the caller should reconnect and call again
>   *	%-EAGAIN if the caller should call again
>   *	%-ENOBUFS if the caller should call again after a delay
> - *	%-EIO if a permanent error occurred and the request was not
> - *		sent. Do not try to send this message again.
> + *	%-EMSGSIZE if encoding ran out of buffer space. The request
> + *		was not sent. Do not try to send this message again.
> + *	%-EIO if an I/O error occurred. The request was not sent.
> + *		Do not try to send this message again.
>   */
>  static int
>  xprt_rdma_send_request(struct rpc_rqst *rqst)
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 07/30] xprtrdma: Fix ri_max_segs and the result of ro_maxpages
  2018-12-18 19:35   ` Anna Schumaker
@ 2018-12-18 19:39     ` Chuck Lever
  0 siblings, 0 replies; 40+ messages in thread
From: Chuck Lever @ 2018-12-18 19:39 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, Linux NFS Mailing List



> On Dec 18, 2018, at 2:35 PM, Anna Schumaker <schumaker.anna@gmail.com> wrote:
> 
> Hi Chuck,
> 
> On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
>> With certain combinations of krb5i/p, MR size, and r/wsize, I/O can
>> fail with EMSGSIZE. This is because the calculated value of
>> ri_max_segs (the max number of MRs per RPC) exceeded
>> RPCRDMA_MAX_HDR_SEGS, which caused Read or Write list encoding to
>> walk off the end of the transport header.
>> 
>> Once that was addressed, the ro_maxpages result has to be corrected
>> to account for the number of MRs needed for Reply chunks, which is
>> 2 MRs smaller than a normal Read or Write chunk.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> net/sunrpc/xprtrdma/fmr_ops.c   |    7 +++++--
>> net/sunrpc/xprtrdma/frwr_ops.c  |    7 +++++--
>> net/sunrpc/xprtrdma/transport.c |    6 ++++--
>> 3 files changed, 14 insertions(+), 6 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 7f5632c..78a0224 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -176,7 +176,10 @@ enum {
>> 
>> 	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
>> 				RPCRDMA_MAX_FMR_SGES);
>> -	ia->ri_max_segs += 2;	/* segments for head and tail buffers */
>> +	/* Reply chunks require segments for head and tail buffers */
>> +	ia->ri_max_segs += 2;
>> +	if (ia->ri_max_segs > RPCRDMA_MAX_HDR_SEGS)
>> +		ia->ri_max_segs = RPCRDMA_MAX_HDR_SEGS;
>> 	return 0;
>> }
>> 
>> @@ -186,7 +189,7 @@ enum {
>> fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
>> {
>> 	return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
>> -		     RPCRDMA_MAX_HDR_SEGS * RPCRDMA_MAX_FMR_SGES);
>> +		     (ia->ri_max_segs - 2) * RPCRDMA_MAX_FMR_SGES);
> 
> ia isn't defined in this function.  Should that be r_xprt->rx_ia.ri_max_segs
> instead?

Yeah. I compile-tested the tip of the branch, and so missed
this.

I'm going to have to send you a v5, so I'll fix this up in
the next post of this series.
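
For reference, applying that suggestion would presumably leave the
helper looking something like this (a sketch of the expected v5
change, mirroring the frwr version quoted below; not the posted patch):

static size_t
fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
{
        struct rpcrdma_ia *ia = &r_xprt->rx_ia;

        return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
                     (ia->ri_max_segs - 2) * RPCRDMA_MAX_FMR_SGES);
}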


> Thanks,
> Anna
> 
>> }
>> 
>> /* Use the ib_map_phys_fmr() verb to register a memory region
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index 27222c0..f587e44 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -244,7 +244,10 @@
>> 
>> 	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
>> 				ia->ri_max_frwr_depth);
>> -	ia->ri_max_segs += 2;	/* segments for head and tail buffers */
>> +	/* Reply chunks require segments for head and tail buffers */
>> +	ia->ri_max_segs += 2;
>> +	if (ia->ri_max_segs > RPCRDMA_MAX_HDR_SEGS)
>> +		ia->ri_max_segs = RPCRDMA_MAX_HDR_SEGS;
>> 	return 0;
>> }
>> 
>> @@ -257,7 +260,7 @@
>> 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> 
>> 	return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
>> -		     RPCRDMA_MAX_HDR_SEGS * ia->ri_max_frwr_depth);
>> +		     (ia->ri_max_segs - 2) * ia->ri_max_frwr_depth);
>> }
>> 
>> static void
>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>> index a16296b..fbb14bf 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -704,8 +704,10 @@
>>  *	%-ENOTCONN if the caller should reconnect and call again
>>  *	%-EAGAIN if the caller should call again
>>  *	%-ENOBUFS if the caller should call again after a delay
>> - *	%-EIO if a permanent error occurred and the request was not
>> - *		sent. Do not try to send this message again.
>> + *	%-EMSGSIZE if encoding ran out of buffer space. The request
>> + *		was not sent. Do not try to send this message again.
>> + *	%-EIO if an I/O error occurred. The request was not sent.
>> + *		Do not try to send this message again.
>>  */
>> static int
>> xprt_rdma_send_request(struct rpc_rqst *rqst)

--
Chuck Lever




^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2018-12-18 19:39 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-17 16:39 [PATCH v4 00/30] NFS/RDMA client for next Chuck Lever
2018-12-17 16:39 ` [PATCH v4 01/30] xprtrdma: Yet another double DMA-unmap Chuck Lever
2018-12-17 16:39 ` [PATCH v4 02/30] xprtrdma: Ensure MRs are DMA-unmapped when posting LOCAL_INV fails Chuck Lever
2018-12-17 16:39 ` [PATCH v4 03/30] xprtrdma: Refactor Receive accounting Chuck Lever
2018-12-17 16:39 ` [PATCH v4 04/30] xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue Chuck Lever
2018-12-17 16:39 ` [PATCH v4 05/30] xprtrdma: No qp_event disconnect Chuck Lever
2018-12-17 16:39 ` [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done Chuck Lever
2018-12-17 17:28   ` Trond Myklebust
2018-12-17 18:37     ` Chuck Lever
2018-12-17 18:55       ` Trond Myklebust
2018-12-17 19:00         ` Chuck Lever
2018-12-17 19:09           ` Trond Myklebust
2018-12-17 19:19             ` Chuck Lever
2018-12-17 19:26               ` Trond Myklebust
2018-12-17 16:39 ` [PATCH v4 07/30] xprtrdma: Fix ri_max_segs and the result of ro_maxpages Chuck Lever
2018-12-18 19:35   ` Anna Schumaker
2018-12-18 19:39     ` Chuck Lever
2018-12-17 16:40 ` [PATCH v4 08/30] xprtrdma: Reduce max_frwr_depth Chuck Lever
2018-12-17 16:40 ` [PATCH v4 09/30] xprtrdma: Remove support for FMR memory registration Chuck Lever
2018-12-17 16:40 ` [PATCH v4 10/30] xprtrdma: Remove rpcrdma_memreg_ops Chuck Lever
2018-12-17 16:40 ` [PATCH v4 11/30] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
2018-12-17 16:40 ` [PATCH v4 12/30] NFS: Make "port=" mount option optional for RDMA mounts Chuck Lever
2018-12-17 16:40 ` [PATCH v4 13/30] xprtrdma: Recognize XDRBUF_SPARSE_PAGES Chuck Lever
2018-12-17 16:40 ` [PATCH v4 14/30] xprtrdma: Remove request_module from backchannel Chuck Lever
2018-12-17 16:40 ` [PATCH v4 15/30] xprtrdma: Expose transport header errors Chuck Lever
2018-12-17 16:40 ` [PATCH v4 16/30] xprtrdma: Simplify locking that protects the rl_allreqs list Chuck Lever
2018-12-17 16:40 ` [PATCH v4 17/30] xprtrdma: Cull dprintk() call sites Chuck Lever
2018-12-17 16:40 ` [PATCH v4 18/30] xprtrdma: Remove unused fields from rpcrdma_ia Chuck Lever
2018-12-17 16:41 ` [PATCH v4 19/30] xprtrdma: Clean up of xprtrdma chunk trace points Chuck Lever
2018-12-17 16:41 ` [PATCH v4 20/30] xprtrdma: Relocate the xprtrdma_mr_map " Chuck Lever
2018-12-17 16:41 ` [PATCH v4 21/30] xprtrdma: Add trace points for calls to transport switch methods Chuck Lever
2018-12-17 16:41 ` [PATCH v4 22/30] xprtrdma: Trace mapping, alloc, and dereg failures Chuck Lever
2018-12-17 16:41 ` [PATCH v4 23/30] NFS: Fix NFSv4 symbolic trace point output Chuck Lever
2018-12-17 16:41 ` [PATCH v4 24/30] SUNRPC: Simplify defining common RPC trace events Chuck Lever
2018-12-17 16:41 ` [PATCH v4 25/30] SUNRPC: Fix some kernel doc complaints Chuck Lever
2018-12-17 16:41 ` [PATCH v4 26/30] xprtrdma: Update comments in frwr_op_send Chuck Lever
2018-12-17 16:41 ` [PATCH v4 27/30] xprtrdma: Replace outdated comment for rpcrdma_ep_post Chuck Lever
2018-12-17 16:41 ` [PATCH v4 28/30] xprtrdma: Add documenting comment for rpcrdma_buffer_destroy Chuck Lever
2018-12-17 16:41 ` [PATCH v4 29/30] xprtrdma: Clarify comments in rpcrdma_ia_remove Chuck Lever
2018-12-17 16:42 ` [PATCH v4 30/30] xprtrdma: Don't leak freed MRs Chuck Lever
