* [PATCH v1 00/16] NFS/RDMA patches for 3.19
@ 2014-10-16 19:38 Chuck Lever
  2014-10-16 19:38 ` [PATCH v1 01/16] xprtrdma: Return an errno from rpcrdma_register_external() Chuck Lever
                   ` (15 more replies)
  0 siblings, 16 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:38 UTC (permalink / raw)
  To: linux-nfs

Hi-

Two groups of patches in this series. The first group is fixes
and clean-ups for xprtrdma. The second group adds client support
for NFSv4.1 on RDMA. Looking for review and testing.

Also available in the "nfs-rdma-for-3.19" topic branch at

  git://linux-nfs.org/projects/cel/cel-2.6.git

---

Chuck Lever (16):
      xprtrdma: Return an errno from rpcrdma_register_external()
      xprtrdma: Cap req_cqinit
      SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments
      xprtrdma: Re-write rpcrdma_flush_cqs()
      xprtrdma: unmap all FMRs during transport disconnect
      xprtrdma: spin CQ completion vectors
      SUNRPC: serialize iostats updates
      xprtrdma: Display async errors
      xprtrdma: Enable pad optimization
      NFS: Include transport protocol name in UCS client string
      NFS: Clean up nfs4_init_callback()
      SUNRPC: Add rpc_xprt_is_bidirectional()
      NFS: Add sidecar RPC client support
      NFS: Set BIND_CONN_TO_SESSION arguments in the proc layer
      NFS: Bind side-car connection to session
      NFS: Disable SESSION4_BACK_CHAN when a backchannel sidecar is to be used


 fs/nfs/client.c                 |    1 
 fs/nfs/nfs4client.c             |   86 +++++++++++++++++----
 fs/nfs/nfs4proc.c               |   71 ++++++++++++++---
 fs/nfs/nfs4xdr.c                |   16 ++--
 include/linux/nfs_fs_sb.h       |    2 
 include/linux/nfs_xdr.h         |    6 +
 include/linux/sunrpc/clnt.h     |    1 
 include/linux/sunrpc/metrics.h  |    3 +
 include/linux/sunrpc/sched.h    |    2 
 include/linux/sunrpc/xprt.h     |    4 +
 net/sunrpc/clnt.c               |   28 ++++++-
 net/sunrpc/sched.c              |    6 +
 net/sunrpc/stats.c              |   21 ++++-
 net/sunrpc/xprtrdma/transport.c |    6 +
 net/sunrpc/xprtrdma/verbs.c     |  159 +++++++++++++++++++++++++++++++++++----
 net/sunrpc/xprtsock.c           |    6 +
 16 files changed, 347 insertions(+), 71 deletions(-)

--
Chuck Lever


* [PATCH v1 01/16] xprtrdma: Return an errno from rpcrdma_register_external()
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
@ 2014-10-16 19:38 ` Chuck Lever
  2014-10-16 19:38 ` [PATCH v1 02/16] xprtrdma: Cap req_cqinit Chuck Lever
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:38 UTC (permalink / raw)
  To: linux-nfs

The RPC/RDMA send_request method and the chunk registration code
expect an errno from the registration function. This allows
the upper layers to distinguish between a recoverable failure
(for example, temporary memory exhaustion) and a hard failure
(for example, a bug in the registration logic).
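
As a minimal illustration (not part of the patch), a caller can now
classify a failure by its errno value; the helper below is a
hypothetical sketch:

    #include <errno.h>
    #include <stdbool.h>

    /* Hypothetical helper: -ENOMEM signals temporary resource
     * exhaustion and is worth retrying; any other negative errno
     * (such as -EIO) is a hard failure.
     */
    bool registration_retryable(int rc)
    {
            return rc == -ENOMEM;
    }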

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 61c4129..6ea2942 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1918,10 +1918,10 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
 		break;
 
 	default:
-		return -1;
+		return -EIO;
 	}
 	if (rc)
-		return -1;
+		return rc;
 
 	return nsegs;
 }



* [PATCH v1 02/16] xprtrdma: Cap req_cqinit
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
  2014-10-16 19:38 ` [PATCH v1 01/16] xprtrdma: Return an errno from rpcrdma_register_external() Chuck Lever
@ 2014-10-16 19:38 ` Chuck Lever
  2014-10-20 13:27   ` Anna Schumaker
  2014-10-16 19:38 ` [PATCH v1 03/16] SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments Chuck Lever
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:38 UTC (permalink / raw)
  To: linux-nfs

Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.

Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.

The number of Work Requests that are allowed to remain unsignaled is
determined by the value of rep_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.

For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).

For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.
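
As a stand-alone sketch of that arithmetic, using the values above
and the cap of 20 introduced by the patch below:

    #include <stdio.h>

    int main(void)
    {
            int max_send_wr = 32 * 7;               /* 224 entries */
            int rep_cqinit = max_send_wr / 2 - 1;   /* 111 unsignaled WRs */

            if (rep_cqinit > 20)                    /* new cap */
                    rep_cqinit = 20;
            if (rep_cqinit <= 2)                    /* existing floor */
                    rep_cqinit = 0;

            printf("rep_cqinit = %d\n", rep_cqinit); /* prints 20 */
            return 0;
    }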

Some providers hold back dispatching Work Requests until a CQE is
generated.  If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.

I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.

Cap the rep_cqinit setting so completions are not left turned off
for too long.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6ea2942..5c0c7a5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -733,6 +733,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
 	/* set trigger for requesting send completion */
 	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
+	if (ep->rep_cqinit > 20)
+		ep->rep_cqinit = 20;
 	if (ep->rep_cqinit <= 2)
 		ep->rep_cqinit = 0;
 	INIT_CQCOUNT(ep);



* [PATCH v1 03/16] SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
  2014-10-16 19:38 ` [PATCH v1 01/16] xprtrdma: Return an errno from rpcrdma_register_external() Chuck Lever
  2014-10-16 19:38 ` [PATCH v1 02/16] xprtrdma: Cap req_cqinit Chuck Lever
@ 2014-10-16 19:38 ` Chuck Lever
  2014-10-20 14:04   ` Anna Schumaker
  2014-10-16 19:38 ` [PATCH v1 04/16] xprtrdma: Re-write rpcrdma_flush_cqs() Chuck Lever
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:38 UTC (permalink / raw)
  To: linux-nfs

I noticed that on RDMA, NFSv4 operations were using "hardway"
allocations more often than not. A "hardway" allocation uses GFP_NOFS
during each RPC to allocate the XDR buffer, instead of using a
pre-allocated pre-registered buffer for each RPC.

The pre-allocated buffers are 2200 bytes in length.  The requested
XDR buffer sizes looked like this:

    GETATTR: 3220 bytes
    LOOKUP:  3612 bytes
    WRITE:   3256 bytes
    OPEN:    6344 bytes

But an NFSv4 GETATTR RPC request should be small. It's the reply
part of GETATTR that can grow large.

call_allocate() passes a single value as the XDR buffer size: the
sum of call and reply buffers. However, the xprtrdma transport
allocates its XDR request and reply buffers separately.

xprtrdma needs to know the maximum call size, as guidance for how
large the outgoing request is going to be and how the NFS payload
will be marshalled into chunks.

But RDMA XDR reply buffers are pre-posted, fixed-size buffers, not
allocated by xprt_rdma_allocate().

Because of the sum passed through ->buf_alloc(), xprtrdma's
->buf_alloc() always allocates more XDR buffer than it will ever
use. For NFSv4, it is unnecessarily triggering the slow "hardway"
path for almost every RPC.

Pass the call and reply buffer size values separately to the
transport's ->buf_alloc method.  The RDMA transport ->buf_alloc can
now ignore the reply size, and allocate just what it will use for
the call buffer.  The socket transport ->buf_alloc can simply add
them together, as call_allocate() did before.

With this patch, an NFSv4 GETATTR request now allocates a 476 byte
RDMA XDR buffer. I didn't see a single NFSv4 request that did not
fit into the transport's pre-allocated XDR buffer.
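
A rough sketch of the two sizing policies (illustrative only; the
2200-byte limit is the pre-registered buffer length mentioned above,
and the function names are invented):

    #include <stdbool.h>
    #include <stddef.h>

    /* Socket transports keep the old behavior: one buffer sized
     * for both directions.
     */
    size_t socket_buf_size(size_t callsize, size_t recvsize)
    {
            return callsize + recvsize;
    }

    /* xprtrdma can decide on "hardway" allocation from the call
     * size alone, since replies arrive in pre-posted buffers.
     */
    bool rdma_needs_hardway(size_t callsize)
    {
            return callsize > 2200;
    }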

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/sched.h    |    2 +-
 include/linux/sunrpc/xprt.h     |    3 ++-
 net/sunrpc/clnt.c               |    4 ++--
 net/sunrpc/sched.c              |    6 ++++--
 net/sunrpc/xprtrdma/transport.c |    2 +-
 net/sunrpc/xprtsock.c           |    3 ++-
 6 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 1a89599..68fa71d 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -232,7 +232,7 @@ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *,
 					void *);
 void		rpc_wake_up_status(struct rpc_wait_queue *, int);
 void		rpc_delay(struct rpc_task *, unsigned long);
-void *		rpc_malloc(struct rpc_task *, size_t);
+void		*rpc_malloc(struct rpc_task *, size_t, size_t);
 void		rpc_free(void *);
 int		rpciod_up(void);
 void		rpciod_down(void);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index fcbfe87..632685c 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -124,7 +124,8 @@ struct rpc_xprt_ops {
 	void		(*rpcbind)(struct rpc_task *task);
 	void		(*set_port)(struct rpc_xprt *xprt, unsigned short port);
 	void		(*connect)(struct rpc_xprt *xprt, struct rpc_task *task);
-	void *		(*buf_alloc)(struct rpc_task *task, size_t size);
+	void *		(*buf_alloc)(struct rpc_task *task,
+					size_t call, size_t reply);
 	void		(*buf_free)(void *buffer);
 	int		(*send_request)(struct rpc_task *task);
 	void		(*set_retrans_timeout)(struct rpc_task *task);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 488ddee..5e817d6 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1599,8 +1599,8 @@ call_allocate(struct rpc_task *task)
 	req->rq_rcvsize = RPC_REPHDRSIZE + slack + proc->p_replen;
 	req->rq_rcvsize <<= 2;
 
-	req->rq_buffer = xprt->ops->buf_alloc(task,
-					req->rq_callsize + req->rq_rcvsize);
+	req->rq_buffer = xprt->ops->buf_alloc(task, req->rq_callsize,
+					      req->rq_rcvsize);
 	if (req->rq_buffer != NULL)
 		return;
 
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79..fc4f939 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -829,7 +829,8 @@ static void rpc_async_schedule(struct work_struct *work)
 /**
  * rpc_malloc - allocate an RPC buffer
  * @task: RPC task that will use this buffer
- * @size: requested byte size
+ * @call: maximum size of on-the-wire RPC call, in bytes
+ * @reply: maximum size of on-the-wire RPC reply, in bytes
  *
  * To prevent rpciod from hanging, this allocator never sleeps,
  * returning NULL and suppressing warning if the request cannot be serviced
@@ -843,8 +844,9 @@ static void rpc_async_schedule(struct work_struct *work)
  * In order to avoid memory starvation triggering more writebacks of
  * NFS requests, we avoid using GFP_KERNEL.
  */
-void *rpc_malloc(struct rpc_task *task, size_t size)
+void *rpc_malloc(struct rpc_task *task, size_t call, size_t reply)
 {
+	size_t size = call + reply;
 	struct rpc_buffer *buf;
 	gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac49..6e9d0a7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -459,7 +459,7 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
  * the receive buffer portion when using reply chunks.
  */
 static void *
-xprt_rdma_allocate(struct rpc_task *task, size_t size)
+xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)
 {
 	struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;
 	struct rpcrdma_req *req, *nreq;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89e..b4aca48 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2423,8 +2423,9 @@ static void xs_tcp_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
  * we allocate pages instead doing a kmalloc like rpc_malloc is because we want
  * to use the server side send routines.
  */
-static void *bc_malloc(struct rpc_task *task, size_t size)
+static void *bc_malloc(struct rpc_task *task, size_t call, size_t reply)
 {
+	size_t size = call + reply;
 	struct page *page;
 	struct rpc_buffer *buf;
 



* [PATCH v1 04/16] xprtrdma: Re-write rpcrdma_flush_cqs()
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (2 preceding siblings ...)
  2014-10-16 19:38 ` [PATCH v1 03/16] SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments Chuck Lever
@ 2014-10-16 19:38 ` Chuck Lever
  2014-10-16 19:38 ` [PATCH v1 05/16] xprtrdma: unmap all FMRs during transport disconnect Chuck Lever
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:38 UTC (permalink / raw)
  To: linux-nfs

Currently rpcrdma_flush_cqs() attempts to avoid code duplication,
and simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall.
This has two minor issues:

1. It re-arms the CQ, which can happen even if a CQ upcall is
   running at the same time

2. The upcall functions drain only a limited number of CQEs,
   thanks to the poll budget added by commit 8301a2c047cc
   ("xprtrdma: Limit work done by completion handler").

Rewrite rpcrdma_flush_cqs() to be sure all CQEs are drained after a
transport is disconnected.

Fixes: a7bc211ac926 ("xprtrdma: On disconnect, don't ignore ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |   32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5c0c7a5..6fadb90 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -106,6 +106,17 @@ rpcrdma_run_tasklet(unsigned long data)
 static DECLARE_TASKLET(rpcrdma_tasklet_g, rpcrdma_run_tasklet, 0UL);
 
 static void
+rpcrdma_schedule_tasklet(struct list_head *sched_list)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
+	list_splice_tail(sched_list, &rpcrdma_tasklets_g);
+	spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
+	tasklet_schedule(&rpcrdma_tasklet_g);
+}
+
+static void
 rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
 {
 	struct rpcrdma_ep *ep = context;
@@ -243,7 +254,6 @@ rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
 	struct list_head sched_list;
 	struct ib_wc *wcs;
 	int budget, count, rc;
-	unsigned long flags;
 
 	INIT_LIST_HEAD(&sched_list);
 	budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
@@ -261,10 +271,7 @@ rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
 	rc = 0;
 
 out_schedule:
-	spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
-	list_splice_tail(&sched_list, &rpcrdma_tasklets_g);
-	spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
-	tasklet_schedule(&rpcrdma_tasklet_g);
+	rpcrdma_schedule_tasklet(&sched_list);
 	return rc;
 }
 
@@ -309,8 +316,17 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
 static void
 rpcrdma_flush_cqs(struct rpcrdma_ep *ep)
 {
-	rpcrdma_recvcq_upcall(ep->rep_attr.recv_cq, ep);
-	rpcrdma_sendcq_upcall(ep->rep_attr.send_cq, ep);
+	struct list_head sched_list;
+	struct ib_wc wc;
+
+	INIT_LIST_HEAD(&sched_list);
+	while (ib_poll_cq(ep->rep_attr.recv_cq, 1, &wc) > 0)
+		rpcrdma_recvcq_process_wc(&wc, &sched_list);
+	if (!list_empty(&sched_list))
+		rpcrdma_schedule_tasklet(&sched_list);
+
+	while (ib_poll_cq(ep->rep_attr.send_cq, 1, &wc) > 0)
+		rpcrdma_sendcq_process_wc(&wc);
 }
 
 #ifdef RPC_DEBUG
@@ -980,7 +996,6 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
 	int rc;
 
-	rpcrdma_flush_cqs(ep);
 	rc = rdma_disconnect(ia->ri_id);
 	if (!rc) {
 		/* returns without wait if not connected */
@@ -992,6 +1007,7 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 		dprintk("RPC:       %s: rdma_disconnect %i\n", __func__, rc);
 		ep->rep_connected = rc;
 	}
+	rpcrdma_flush_cqs(ep);
 }
 
 static int



* [PATCH v1 05/16] xprtrdma: unmap all FMRs during transport disconnect
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (3 preceding siblings ...)
  2014-10-16 19:38 ` [PATCH v1 04/16] xprtrdma: Re-write rpcrdma_flush_cqs() Chuck Lever
@ 2014-10-16 19:38 ` Chuck Lever
  2014-10-16 19:39 ` [PATCH v1 06/16] xprtrdma: spin CQ completion vectors Chuck Lever
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:38 UTC (permalink / raw)
  To: linux-nfs

When using RPCRDMA_MTHCAFMR memory registration, after a few
transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to
return EINVAL because the provider has exhausted its map pool.

Make sure that all FMRs are unmapped during transport disconnect,
and that ->send_request remarshals them during an RPC retransmit.
This resets the transport's MRs to ensure that none are leaked
during a disconnect.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/transport.c |    2 +-
 net/sunrpc/xprtrdma/verbs.c     |   40 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6e9d0a7..61be0a0 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -601,7 +601,7 @@ xprt_rdma_send_request(struct rpc_task *task)
 
 	if (req->rl_niovs == 0)
 		rc = rpcrdma_marshal_req(rqst);
-	else if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
+	else if (r_xprt->rx_ia.ri_memreg_strategy != RPCRDMA_ALLPHYSICAL)
 		rc = rpcrdma_marshal_chunks(rqst, 0);
 	if (rc < 0)
 		goto failed_marshal;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6fadb90..9105524 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -62,6 +62,7 @@
 #endif
 
 static void rpcrdma_reset_frmrs(struct rpcrdma_ia *);
+static void rpcrdma_reset_fmrs(struct rpcrdma_ia *);
 
 /*
  * internal functions
@@ -884,8 +885,17 @@ retry:
 		rpcrdma_ep_disconnect(ep, ia);
 		rpcrdma_flush_cqs(ep);
 
-		if (ia->ri_memreg_strategy == RPCRDMA_FRMR)
+		switch (ia->ri_memreg_strategy) {
+		case RPCRDMA_FRMR:
 			rpcrdma_reset_frmrs(ia);
+			break;
+		case RPCRDMA_MTHCAFMR:
+			rpcrdma_reset_fmrs(ia);
+			break;
+		default:
+			rc = -EIO;
+			goto out;
+		}
 
 		xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
 		id = rpcrdma_create_id(xprt, ia,
@@ -1305,6 +1315,34 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
 	kfree(buf->rb_pool);
 }
 
+/* After a disconnect, unmap all FMRs.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_fmr_external().
+ */
+static void
+rpcrdma_reset_fmrs(struct rpcrdma_ia *ia)
+{
+	struct rpcrdma_xprt *r_xprt =
+				container_of(ia, struct rpcrdma_xprt, rx_ia);
+	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+	struct list_head *pos;
+	struct rpcrdma_mw *r;
+	LIST_HEAD(l);
+	int rc;
+
+	list_for_each(pos, &buf->rb_all) {
+		r = list_entry(pos, struct rpcrdma_mw, mw_all);
+
+		INIT_LIST_HEAD(&l);
+		list_add(&r->r.fmr->list, &l);
+		rc = ib_unmap_fmr(&l);
+		if (rc)
+			dprintk("RPC:       %s: ib_unmap_fmr failed %i\n",
+				__func__, rc);
+	}
+}
+
 /* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
  * an unusable state. Find FRMRs in this state and dereg / reg
  * each.  FRMRs that are VALID and attached to an rpcrdma_req are



* [PATCH v1 06/16] xprtrdma: spin CQ completion vectors
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (4 preceding siblings ...)
  2014-10-16 19:38 ` [PATCH v1 05/16] xprtrdma: unmap all FMRs during transport disconnect Chuck Lever
@ 2014-10-16 19:39 ` Chuck Lever
  2014-10-16 19:39 ` [PATCH v1 07/16] SUNRPC: serialize iostats updates Chuck Lever
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:39 UTC (permalink / raw)
  To: linux-nfs

A pair of CQs is created for each xprtrdma transport. One transport
instance is created per NFS mount point.

Both Shirley Ma and Steve Wise have observed that the adapter
interrupt workload sticks to a single MSI-X vector and CPU core unless
manual steps are taken to move it to other CPUs. This tends to limit
performance once the interrupt workload consumes an entire core.

Sagi Grimberg suggested that one way to get better dispersal of
interrupts is to use the completion vector argument of the
ib_create_cq() API to assign new CQs to different adapter ingress
queues. Currently, xprtrdma sets this argument to 0 unconditionally,
which leaves all xprtrdma CQs consuming the same small pool of
resources.

Each CQ will still be nailed to one completion vector.  This won't help
a "single mount point" workload, but when multiple mount points are in
play, the RDMA provider will see to it that adapter interrupts are
better spread over available resources.

We also take a little trouble to stay off of vector 0, which is used
by many other kernel RDMA consumers such as IPoIB.
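
A stand-alone demonstration of the rotation implemented below
(locking omitted): with four completion vectors, successive
transports land on vectors 1, 2, 3, 1, 2, 3, ... and vector 0 is
skipped.

    #include <stdio.h>

    int main(void)
    {
            unsigned int counter = 0;
            int num_comp_vectors = 4;       /* example adapter */
            int i, vector;

            for (i = 0; i < 6; i++) {
                    vector = counter++ % num_comp_vectors;
                    if (vector == 0)        /* skip vector 0 */
                            vector = counter++ % num_comp_vectors;
                    printf("transport %d -> vector %d\n", i, vector);
            }
            return 0;
    }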

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |   45 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 9105524..dc4c8e3 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -49,6 +49,8 @@
 
 #include <linux/interrupt.h>
 #include <linux/slab.h>
+#include <linux/random.h>
+
 #include <asm/bitops.h>
 
 #include "xprt_rdma.h"
@@ -666,6 +668,42 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
 }
 
 /*
+ * Select a provider completion vector to assign a CQ to.
+ *
+ * This is an attempt to spread CQs across available CPUs. The counter
+ * is shared between all adapters on a system. Multi-adapter systems
+ * are rare, and this is still better for them than leaving all CQs on
+ * one completion vector.
+ *
+ * We could put the send and receive CQs for the same transport on
+ * different vectors. However, this risks assigning them to cores on
+ * different sockets in larger systems, which could have disastrous
+ * performance effects due to NUMA.
+ */
+static int
+rpcrdma_cq_comp_vec(struct rpcrdma_ia *ia)
+{
+	int num_comp_vectors = ia->ri_id->device->num_comp_vectors;
+	int vector = 0;
+
+	if (num_comp_vectors > 1) {
+		static DEFINE_SPINLOCK(rpcrdma_cv_lock);
+		static unsigned int rpcrdma_cv_counter;
+
+		spin_lock(&rpcrdma_cv_lock);
+		vector = rpcrdma_cv_counter++ % num_comp_vectors;
+		/* Skip 0, as it is commonly used by other RDMA consumers */
+		if (vector == 0)
+			vector = rpcrdma_cv_counter++ % num_comp_vectors;
+		spin_unlock(&rpcrdma_cv_lock);
+	}
+
+	dprintk("RPC:       %s: adapter has %d vectors, using vector %d\n",
+		__func__, num_comp_vectors, vector);
+	return vector;
+}
+
+/*
  * Create unconnected endpoint.
  */
 int
@@ -674,7 +712,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 {
 	struct ib_device_attr devattr;
 	struct ib_cq *sendcq, *recvcq;
-	int rc, err;
+	int rc, err, comp_vec;
 
 	rc = ib_query_device(ia->ri_id->device, &devattr);
 	if (rc) {
@@ -759,9 +797,10 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 	init_waitqueue_head(&ep->rep_connect_wait);
 	INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
 
+	comp_vec = rpcrdma_cq_comp_vec(ia);
 	sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
 				  rpcrdma_cq_async_error_upcall, ep,
-				  ep->rep_attr.cap.max_send_wr + 1, 0);
+				  ep->rep_attr.cap.max_send_wr + 1, comp_vec);
 	if (IS_ERR(sendcq)) {
 		rc = PTR_ERR(sendcq);
 		dprintk("RPC:       %s: failed to create send CQ: %i\n",
@@ -778,7 +817,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
 	recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
 				  rpcrdma_cq_async_error_upcall, ep,
-				  ep->rep_attr.cap.max_recv_wr + 1, 0);
+				  ep->rep_attr.cap.max_recv_wr + 1, comp_vec);
 	if (IS_ERR(recvcq)) {
 		rc = PTR_ERR(recvcq);
 		dprintk("RPC:       %s: failed to create recv CQ: %i\n",



* [PATCH v1 07/16] SUNRPC: serialize iostats updates
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (5 preceding siblings ...)
  2014-10-16 19:39 ` [PATCH v1 06/16] xprtrdma: spin CQ completion vectors Chuck Lever
@ 2014-10-16 19:39 ` Chuck Lever
  2014-10-16 19:39 ` [PATCH v1 08/16] xprtrdma: Display async errors Chuck Lever
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:39 UTC (permalink / raw)
  To: linux-nfs

Occasionally mountstats reports a negative retransmission rate.
Ensure that two RPCs completing concurrently don't confuse the sums
in the transport's op_metrics array.

Since pNFS filelayout can invoke rpc_count_iostats() on another
transport from xprt_release(), we can't rely on simply holding the
transport_lock in xprt_release(). There's nothing for it but hard
serialization. One spin lock per RPC operation should make this as
painless as it can be.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/metrics.h |    3 +++
 net/sunrpc/stats.c             |   21 ++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
index 1565bbe..eecb5a7 100644
--- a/include/linux/sunrpc/metrics.h
+++ b/include/linux/sunrpc/metrics.h
@@ -27,10 +27,13 @@
 
 #include <linux/seq_file.h>
 #include <linux/ktime.h>
+#include <linux/spinlock.h>
 
 #define RPC_IOSTATS_VERS	"1.0"
 
 struct rpc_iostats {
+	spinlock_t		om_lock;
+
 	/*
 	 * These counters give an idea about how many request
 	 * transmissions are required, on average, to complete that
diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index 5453049..9711a15 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -116,7 +116,15 @@ EXPORT_SYMBOL_GPL(svc_seq_show);
  */
 struct rpc_iostats *rpc_alloc_iostats(struct rpc_clnt *clnt)
 {
-	return kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
+	struct rpc_iostats *stats;
+	int i;
+
+	stats = kcalloc(clnt->cl_maxproc, sizeof(*stats), GFP_KERNEL);
+	if (stats) {
+		for (i = 0; i < clnt->cl_maxproc; i++)
+			spin_lock_init(&stats[i].om_lock);
+	}
+	return stats;
 }
 EXPORT_SYMBOL_GPL(rpc_alloc_iostats);
 
@@ -135,20 +143,21 @@ EXPORT_SYMBOL_GPL(rpc_free_iostats);
  * rpc_count_iostats - tally up per-task stats
  * @task: completed rpc_task
  * @stats: array of stat structures
- *
- * Relies on the caller for serialization.
  */
 void rpc_count_iostats(const struct rpc_task *task, struct rpc_iostats *stats)
 {
 	struct rpc_rqst *req = task->tk_rqstp;
 	struct rpc_iostats *op_metrics;
-	ktime_t delta;
+	ktime_t delta, now;
 
 	if (!stats || !req)
 		return;
 
+	now = ktime_get();
 	op_metrics = &stats[task->tk_msg.rpc_proc->p_statidx];
 
+	spin_lock(&op_metrics->om_lock);
+
 	op_metrics->om_ops++;
 	op_metrics->om_ntrans += req->rq_ntrans;
 	op_metrics->om_timeouts += task->tk_timeouts;
@@ -161,8 +170,10 @@ void rpc_count_iostats(const struct rpc_task *task, struct rpc_iostats *stats)
 
 	op_metrics->om_rtt = ktime_add(op_metrics->om_rtt, req->rq_rtt);
 
-	delta = ktime_sub(ktime_get(), task->tk_start);
+	delta = ktime_sub(now, task->tk_start);
 	op_metrics->om_execute = ktime_add(op_metrics->om_execute, delta);
+
+	spin_unlock(&op_metrics->om_lock);
 }
 EXPORT_SYMBOL_GPL(rpc_count_iostats);
 



* [PATCH v1 08/16] xprtrdma: Display async errors
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (6 preceding siblings ...)
  2014-10-16 19:39 ` [PATCH v1 07/16] SUNRPC: serialize iostats updates Chuck Lever
@ 2014-10-16 19:39 ` Chuck Lever
  2014-10-16 19:39 ` [PATCH v1 09/16] xprtrdma: Enable pad optimization Chuck Lever
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:39 UTC (permalink / raw)
  To: linux-nfs

An async error upcall is a hard error, and should be reported in
the system log.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c |   36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index dc4c8e3..73079d5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -108,6 +108,32 @@ rpcrdma_run_tasklet(unsigned long data)
 
 static DECLARE_TASKLET(rpcrdma_tasklet_g, rpcrdma_run_tasklet, 0UL);
 
+static const char * const async_event[] = {
+	"CQ error",
+	"QP fatal error",
+	"QP request error",
+	"QP access error",
+	"communication established",
+	"send queue drained",
+	"path migration successful",
+	"path mig error",
+	"device fatal error",
+	"port active",
+	"port error",
+	"LID change",
+	"P_key change",
+	"SM change",
+	"SRQ error",
+	"SRQ limit reached",
+	"last WQE reached",
+	"client reregister",
+	"GID change",
+};
+
+#define ASYNC_MSG(status)					\
+	((status) < ARRAY_SIZE(async_event) ?			\
+		async_event[(status)] : "unknown async error")
+
 static void
 rpcrdma_schedule_tasklet(struct list_head *sched_list)
 {
@@ -124,8 +150,9 @@ rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
 {
 	struct rpcrdma_ep *ep = context;
 
-	dprintk("RPC:       %s: QP error %X on device %s ep %p\n",
-		__func__, event->event, event->device->name, context);
+	pr_err("RPC:       %s: %s on device %s ep %p\n",
+	       __func__, ASYNC_MSG(event->event),
+		event->device->name, context);
 	if (ep->rep_connected == 1) {
 		ep->rep_connected = -EIO;
 		ep->rep_func(ep);
@@ -138,8 +165,9 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
 {
 	struct rpcrdma_ep *ep = context;
 
-	dprintk("RPC:       %s: CQ error %X on device %s ep %p\n",
-		__func__, event->event, event->device->name, context);
+	pr_err("RPC:       %s: %s on device %s ep %p\n",
+	       __func__, ASYNC_MSG(event->event),
+		event->device->name, context);
 	if (ep->rep_connected == 1) {
 		ep->rep_connected = -EIO;
 		ep->rep_func(ep);



* [PATCH v1 09/16] xprtrdma: Enable pad optimization
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (7 preceding siblings ...)
  2014-10-16 19:39 ` [PATCH v1 08/16] xprtrdma: Display async errors Chuck Lever
@ 2014-10-16 19:39 ` Chuck Lever
  2014-10-16 19:39 ` [PATCH v1 10/16] NFS: Include transport protocol name in UCS client string Chuck Lever
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:39 UTC (permalink / raw)
  To: linux-nfs

The Linux NFS/RDMA server was rejecting NFSv3 WRITE requests when pad
optimization was enabled. That bug is now fixed.

We can now enable pad optimization, which helps performance and is
supported by both Linux and Solaris servers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/transport.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 61be0a0..09c6443 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -73,7 +73,7 @@ static unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
 static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
 static unsigned int xprt_rdma_inline_write_padding;
 static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
-                int xprt_rdma_pad_optimize = 0;
+		int xprt_rdma_pad_optimize = 1;
 
 #ifdef RPC_DEBUG
 



* [PATCH v1 10/16] NFS: Include transport protocol name in UCS client string
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (8 preceding siblings ...)
  2014-10-16 19:39 ` [PATCH v1 09/16] xprtrdma: Enable pad optimization Chuck Lever
@ 2014-10-16 19:39 ` Chuck Lever
  2014-10-16 19:39 ` [PATCH v1 11/16] NFS: Clean up nfs4_init_callback() Chuck Lever
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:39 UTC (permalink / raw)
  To: linux-nfs

The nfs_match_client() function asserts that different nfs_client
structures are used when mounting the same server with different
transport protocols. For example, if a Linux client mounts the same
server via NFSv3 with some UDP mounts and some TCP mounts, it will
use only two transports and two nfs_client structures: one shared
with all UDP mounts, and one shared with all TCP mounts.

When a uniform client string is in use (NFSv4.1, or NFSv4.0 with the
"migration" mount option), nfs_match_client() will ensure an
nfs_client structure and separate transport is created for mounts
with unique "proto=" settings (one for TCP and one for RDMA,
currently).

But EXCHANGE_ID sends exactly the same nfs_client_id4 string on both
transports. The server then believes that the client is trunking
over disparate transports, when it clearly is not. The open and lock
state that will appear on each transport is disjoint.

Now that NFSv4.1 over RDMA is supported, a user can mount the same
server with NFSv4.1 over TCP and RDMA concurrently. The client will
send an EXCHANGE_ID with the same client ID on both transports, and
it will also send a CREATE_SESSION on both.

To ensure the Linux client represents itself correctly to servers,
add the transport protocol name to the uniform client string. Each
transport instance should get its own client ID (and session) to
prevent trunking.
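
For example (hostname invented for illustration), two mounts of the
same server would now present distinct client strings, roughly:

    Linux NFSv4.1 client.example.net tcp
    Linux NFSv4.1 client.example.net rdma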

This doesn't appear to be a problem for NFSv4.0 without migration.
It also wasn't a problem for NFSv4.1 when TCP was the only available
transport.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/nfs4proc.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 6ca0c8e..a1243e7 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -4929,16 +4929,23 @@ nfs4_init_uniform_client_string(const struct nfs_client *clp,
 				char *buf, size_t len)
 {
 	const char *nodename = clp->cl_rpcclient->cl_nodename;
+	unsigned int result;
 
+	rcu_read_lock();
 	if (nfs4_client_id_uniquifier[0] != '\0')
-		return scnprintf(buf, len, "Linux NFSv%u.%u %s/%s",
+		result = scnprintf(buf, len, "Linux NFSv%u.%u %s/%s %s",
 				clp->rpc_ops->version,
 				clp->cl_minorversion,
 				nfs4_client_id_uniquifier,
-				nodename);
-	return scnprintf(buf, len, "Linux NFSv%u.%u %s",
+				nodename, rpc_peeraddr2str(clp->cl_rpcclient,
+							RPC_DISPLAY_PROTO));
+	else
+		result = scnprintf(buf, len, "Linux NFSv%u.%u %s %s",
 				clp->rpc_ops->version, clp->cl_minorversion,
-				nodename);
+				nodename, rpc_peeraddr2str(clp->cl_rpcclient,
+							RPC_DISPLAY_PROTO));
+	rcu_read_unlock();
+	return result;
 }
 
 /*



* [PATCH v1 11/16] NFS: Clean up nfs4_init_callback()
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (9 preceding siblings ...)
  2014-10-16 19:39 ` [PATCH v1 10/16] NFS: Include transport protocol name in UCS client string Chuck Lever
@ 2014-10-16 19:39 ` Chuck Lever
  2014-10-16 19:39 ` [PATCH v1 12/16] SUNRPC: Add rpc_xprt_is_bidirectional() Chuck Lever
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:39 UTC (permalink / raw)
  To: linux-nfs

nfs4_init_callback() is never invoked for NFS versions other than 4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/nfs4client.c |   31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index ffdb28d..5f4b818 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -241,28 +241,25 @@ void nfs4_free_client(struct nfs_client *clp)
  */
 static int nfs4_init_callback(struct nfs_client *clp)
 {
+	struct rpc_xprt *xprt;
 	int error;
 
-	if (clp->rpc_ops->version == 4) {
-		struct rpc_xprt *xprt;
+	xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
 
-		xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
-
-		if (nfs4_has_session(clp)) {
-			error = xprt_setup_backchannel(xprt,
-						NFS41_BC_MIN_CALLBACKS);
-			if (error < 0)
-				return error;
-		}
-
-		error = nfs_callback_up(clp->cl_mvops->minor_version, xprt);
-		if (error < 0) {
-			dprintk("%s: failed to start callback. Error = %d\n",
-				__func__, error);
+	if (nfs4_has_session(clp)) {
+		error = xprt_setup_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
+		if (error < 0)
 			return error;
-		}
-		__set_bit(NFS_CS_CALLBACK, &clp->cl_res_state);
 	}
+
+	error = nfs_callback_up(clp->cl_mvops->minor_version, xprt);
+	if (error < 0) {
+		dprintk("%s: failed to start callback. Error = %d\n",
+			__func__, error);
+		return error;
+	}
+	__set_bit(NFS_CS_CALLBACK, &clp->cl_res_state);
+
 	return 0;
 }
 



* [PATCH v1 12/16] SUNRPC: Add rpc_xprt_is_bidirectional()
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (10 preceding siblings ...)
  2014-10-16 19:39 ` [PATCH v1 11/16] NFS: Clean up nfs4_init_callback() Chuck Lever
@ 2014-10-16 19:39 ` Chuck Lever
  2014-10-16 19:40 ` [PATCH v1 13/16] NFS: Add sidecar RPC client support Chuck Lever
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:39 UTC (permalink / raw)
  To: linux-nfs

Allow upper layers to determine if a particular rpc_clnt is prepared
to provide a backchannel RPC service.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/clnt.h |    1 +
 include/linux/sunrpc/xprt.h |    1 +
 net/sunrpc/clnt.c           |   24 ++++++++++++++++++++++++
 net/sunrpc/xprtsock.c       |    3 +++
 4 files changed, 29 insertions(+)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 70736b9..644c751 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -171,6 +171,7 @@ int		rpc_protocol(struct rpc_clnt *);
 struct net *	rpc_net_ns(struct rpc_clnt *);
 size_t		rpc_max_payload(struct rpc_clnt *);
 unsigned long	rpc_get_timeout(struct rpc_clnt *clnt);
+bool		rpc_xprt_is_bidirectional(struct rpc_clnt *);
 void		rpc_force_rebind(struct rpc_clnt *);
 size_t		rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
 const char	*rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 632685c..4dea441 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -218,6 +218,7 @@ struct rpc_xprt {
 						 * items */
 	struct list_head	bc_pa_list;	/* List of preallocated
 						 * backchannel rpc_rqst's */
+	bool			bc_supported;
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 	struct list_head	recv;
 
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 5e817d6..1793341 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1347,6 +1347,30 @@ unsigned long rpc_get_timeout(struct rpc_clnt *clnt)
 EXPORT_SYMBOL_GPL(rpc_get_timeout);
 
 /**
+ * rpc_xprt_is_bidirectional
+ * @clnt: RPC clnt to query
+ *
+ * Returns true if underlying transport supports backchannel service.
+ */
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+bool rpc_xprt_is_bidirectional(struct rpc_clnt *clnt)
+{
+	bool ret;
+
+	rcu_read_lock();
+	ret = rcu_dereference(clnt->cl_xprt)->bc_supported;
+	rcu_read_unlock();
+	return ret;
+}
+#else
+bool rpc_xprt_is_bidirectional(struct rpc_clnt *clnt)
+{
+	return false;
+}
+#endif
+EXPORT_SYMBOL_GPL(rpc_xprt_is_bidirectional);
+
+/**
  * rpc_force_rebind - force transport to check that remote port is unchanged
  * @clnt: client to rebind
  *
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index b4aca48..e2e15a9 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2864,6 +2864,9 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
 
 	xprt->ops = &xs_tcp_ops;
 	xprt->timeout = &xs_tcp_default_timeout;
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+	xprt->bc_supported = true;
+#endif
 
 	switch (addr->sa_family) {
 	case AF_INET:



* [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (11 preceding siblings ...)
  2014-10-16 19:39 ` [PATCH v1 12/16] SUNRPC: Add rpc_xprt_is_bidirectional() Chuck Lever
@ 2014-10-16 19:40 ` Chuck Lever
  2014-10-20 17:33   ` Anna Schumaker
  2014-10-16 19:40 ` [PATCH v1 14/16] NFS: Set BIND_CONN_TO_SESSION arguments in the proc layer Chuck Lever
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:40 UTC (permalink / raw)
  To: linux-nfs

So far, TCP is the only transport that supports bi-directional RPC.

When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.

This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/client.c           |    1 +
 fs/nfs/nfs4client.c       |   54 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/nfs_fs_sb.h |    2 ++
 3 files changed, 57 insertions(+)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6a4f366..19f49bf 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -78,6 +78,7 @@ const struct rpc_program nfs_program = {
 	.stats			= &nfs_rpcstat,
 	.pipe_dir_name		= NFS_PIPE_DIRNAME,
 };
+EXPORT_SYMBOL_GPL(nfs_program);
 
 struct rpc_stat nfs_rpcstat = {
 	.program		= &nfs_program
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 5f4b818..b1cc35e 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_client *clp)
 {
 	if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
 		nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
+	if (clp->cl_bc_rpcclient)
+		rpc_shutdown_client(clp->cl_bc_rpcclient);
 }
 
 static void nfs4_shutdown_client(struct nfs_client *clp)
@@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)
 
 #if defined(CONFIG_NFS_V4_1)
 
+/*
+ * Create a separate rpc_clnt using TCP that can provide a
+ * backchannel service.
+ */
+static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
+{
+	struct sockaddr_storage address;
+	struct sockaddr *sap = (struct sockaddr *)&address;
+	struct rpc_create_args args = {
+		.net		= clp->cl_net,
+		.protocol	= XPRT_TRANSPORT_TCP,
+		.address	= sap,
+		.addrsize	= clp->cl_addrlen,
+		.servername	= clp->cl_hostname,
+		.program	= &nfs_program,
+		.version	= clp->rpc_ops->version,
+		.flags		= (RPC_CLNT_CREATE_DISCRTRY |
+				  RPC_CLNT_CREATE_NOPING),
+	};
+	struct rpc_clnt *clnt;
+	struct rpc_cred *cred;
+
+	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
+		return 0;
+
+	if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
+		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
+	memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
+	rpc_set_port(sap, NFS_PORT);
+	cred = nfs4_get_clid_cred(clp);
+	if (cred) {
+		args.authflavor = cred->cr_auth->au_flavor;
+		put_rpccred(cred);
+	} else
+		args.authflavor = RPC_AUTH_UNIX;
+
+	clnt = rpc_create(&args);
+	if (IS_ERR(clnt)) {
+		dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
+			__func__, PTR_ERR(clnt));
+		return PTR_ERR(clnt);
+	}
+
+	clp->cl_bc_rpcclient = clnt;
+	return 0;
+}
+
 /**
  * nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
  * @clp - nfs_client to initialize
@@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
 int nfs41_init_client(struct nfs_client *clp)
 {
 	struct nfs4_session *session = NULL;
+	int ret;
+
+	ret = nfs41_create_sidecar_rpc_client(clp);
+	if (ret)
+		return ret;
 
 	/*
 	 * Create the session and mark it expired.
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 922be2e..159d703 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -87,6 +87,8 @@ struct nfs_client {
 
 	/* The sequence id to use for the next CREATE_SESSION */
 	u32			cl_seqid;
+	/* The optional sidecar backchannel transport */
+	struct rpc_clnt		*cl_bc_rpcclient;
 	/* The flags used for obtaining the clientid during EXCHANGE_ID */
 	u32			cl_exchange_flags;
 	struct nfs4_session	*cl_session;	/* shared session */



* [PATCH v1 14/16] NFS: Set BIND_CONN_TO_SESSION arguments in the proc layer
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (12 preceding siblings ...)
  2014-10-16 19:40 ` [PATCH v1 13/16] NFS: Add sidecar RPC client support Chuck Lever
@ 2014-10-16 19:40 ` Chuck Lever
  2014-10-16 19:40 ` [PATCH v1 15/16] NFS: Bind side-car connection to session Chuck Lever
  2014-10-16 19:40 ` [PATCH v1 16/16] NFS: Disable SESSION4_BACK_CHAN when a backchannel sidecar is to be used Chuck Lever
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:40 UTC (permalink / raw)
  To: linux-nfs

When NFS_CS_SIDECAR_BACKCHANNEL is set, the client needs to bind
both the forward and the back channel during recovery. Two separate
BC2S operations are needed, with different arguments.

Prepare nfs4_proc_bind_conn_to_session() by creating a complete arg
struct for the BIND_CONN_TO_SESSION XDR encoder.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/nfs4proc.c       |    6 +++++-
 fs/nfs/nfs4xdr.c        |   16 +++++++++-------
 include/linux/nfs_xdr.h |    6 ++++++
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index a1243e7..9a8ffb7 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6578,11 +6578,15 @@ nfs41_same_server_scope(struct nfs41_server_scope *a,
 int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred)
 {
 	int status;
+	struct nfs41_bind_conn_to_session_args args = {
+		.client = clp,
+		.dir = NFS4_CDFC4_BACK_OR_BOTH,
+	};
 	struct nfs41_bind_conn_to_session_res res;
 	struct rpc_message msg = {
 		.rpc_proc =
 			&nfs4_procedures[NFSPROC4_CLNT_BIND_CONN_TO_SESSION],
-		.rpc_argp = clp,
+		.rpc_argp = &args,
 		.rpc_resp = &res,
 		.rpc_cred = cred,
 	};
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index e13b59d..6dfdd14 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1733,17 +1733,19 @@ static void encode_secinfo(struct xdr_stream *xdr, const struct qstr *name, stru
 #if defined(CONFIG_NFS_V4_1)
 /* NFSv4.1 operations */
 static void encode_bind_conn_to_session(struct xdr_stream *xdr,
-				   struct nfs4_session *session,
-				   struct compound_hdr *hdr)
+				  struct nfs41_bind_conn_to_session_args *args,
+				  struct compound_hdr *hdr)
 {
+	struct nfs4_session *session = args->client->cl_session;
 	__be32 *p;
 
 	encode_op_hdr(xdr, OP_BIND_CONN_TO_SESSION,
 		decode_bind_conn_to_session_maxsz, hdr);
 	encode_opaque_fixed(xdr, session->sess_id.data, NFS4_MAX_SESSIONID_LEN);
+
 	p = xdr_reserve_space(xdr, 8);
-	*p++ = cpu_to_be32(NFS4_CDFC4_BACK_OR_BOTH);
-	*p = 0;	/* use_conn_in_rdma_mode = False */
+	*p++ = cpu_to_be32(args->dir);
+	*p = args->use_conn_in_rdma_mode ? xdr_one : xdr_zero;
 }
 
 static void encode_op_map(struct xdr_stream *xdr, struct nfs4_op_map *op_map)
@@ -2766,14 +2768,14 @@ static void nfs4_xdr_enc_fsid_present(struct rpc_rqst *req,
  */
 static void nfs4_xdr_enc_bind_conn_to_session(struct rpc_rqst *req,
 				struct xdr_stream *xdr,
-				struct nfs_client *clp)
+				struct nfs41_bind_conn_to_session_args *args)
 {
 	struct compound_hdr hdr = {
-		.minorversion = clp->cl_mvops->minor_version,
+		.minorversion = args->client->cl_mvops->minor_version,
 	};
 
 	encode_compound_hdr(xdr, req, &hdr);
-	encode_bind_conn_to_session(xdr, clp->cl_session, &hdr);
+	encode_bind_conn_to_session(xdr, args, &hdr);
 	encode_nops(&hdr);
 }
 
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 0040629..af17763 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1140,6 +1140,12 @@ struct nfs41_state_protection {
 	struct nfs4_op_map allow;
 };
 
+struct nfs41_bind_conn_to_session_args {
+	struct nfs_client		*client;
+	u32				dir;
+	bool				use_conn_in_rdma_mode;
+};
+
 #define NFS4_EXCHANGE_ID_LEN	(48)
 struct nfs41_exchange_id_args {
 	struct nfs_client		*client;



* [PATCH v1 15/16] NFS: Bind side-car connection to session
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (13 preceding siblings ...)
  2014-10-16 19:40 ` [PATCH v1 14/16] NFS: Set BIND_CONN_TO_SESSION arguments in the proc layer Chuck Lever
@ 2014-10-16 19:40 ` Chuck Lever
  2014-10-16 19:40 ` [PATCH v1 16/16] NFS: Disable SESSION4_BACK_CHAN when a backchannel sidecar is to be used Chuck Lever
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:40 UTC (permalink / raw)
  To: linux-nfs

When recovering from a network partition, a client must identify
both the forward and backchannel it wants bound to a session.

Usually these use the same transport, which can be re-bound to the
session with a single BIND_CONN_TO_SESSION operation.

But with a sidecar backchannel, the fore and back channels use
separate transports, so each must be bound to the session with its
own BC2S operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/nfs4client.c |    5 ++++-
 fs/nfs/nfs4proc.c   |   48 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index b1cc35e..97cc170 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -246,7 +246,10 @@ static int nfs4_init_callback(struct nfs_client *clp)
 	struct rpc_xprt *xprt;
 	int error;
 
-	xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
+	if (clp->cl_bc_rpcclient)
+		xprt = rcu_dereference_raw(clp->cl_bc_rpcclient->cl_xprt);
+	else
+		xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
 
 	if (nfs4_has_session(clp)) {
 		error = xprt_setup_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 9a8ffb7..2eaf7ec 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6569,18 +6569,16 @@ nfs41_same_server_scope(struct nfs41_server_scope *a,
 	return false;
 }
 
-/*
- * nfs4_proc_bind_conn_to_session()
- *
- * The 4.1 client currently uses the same TCP connection for the
- * fore and backchannel.
- */
-int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred)
+static int _nfs4_proc_bind_conn_to_session(struct nfs_client *clp,
+					   struct rpc_cred *cred,
+					   struct rpc_clnt *clnt,
+					   u32 dir_from_client,
+					   u32 dir_from_server)
 {
 	int status;
 	struct nfs41_bind_conn_to_session_args args = {
 		.client = clp,
-		.dir = NFS4_CDFC4_BACK_OR_BOTH,
+		.dir = dir_from_client,
 	};
 	struct nfs41_bind_conn_to_session_res res;
 	struct rpc_message msg = {
@@ -6599,7 +6597,7 @@ int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred
 		goto out;
 	}
 
-	status = rpc_call_sync(clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT);
+	status = rpc_call_sync(clnt, &msg, RPC_TASK_TIMEOUT);
 	trace_nfs4_bind_conn_to_session(clp, status);
 	if (status == 0) {
 		if (memcmp(res.session->sess_id.data,
@@ -6608,7 +6606,7 @@ int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred
 			status = -EIO;
 			goto out_session;
 		}
-		if (res.dir != NFS4_CDFS4_BOTH) {
+		if (res.dir != dir_from_server) {
 			dprintk("NFS: %s: Unexpected direction from server\n",
 				__func__);
 			status = -EIO;
@@ -6628,6 +6626,36 @@ out:
 	return status;
 }
 
+/**
+ * nfs4_proc_bind_conn_to_session - (re)bind fore/back channels to session
+ * @clp: per-server state
+ * @cred: credential for managing state
+ *
+ * Returns zero on success, or a negative errno or negative NFS4ERR.
+ */
+int nfs4_proc_bind_conn_to_session(struct nfs_client *clp,
+				   struct rpc_cred *cred)
+{
+	int ret;
+
+	if (!clp->cl_bc_rpcclient)
+		return _nfs4_proc_bind_conn_to_session(clp, cred,
+						clp->cl_rpcclient,
+						NFS4_CDFC4_BACK_OR_BOTH,
+						NFS4_CDFS4_BOTH);
+
+	ret = _nfs4_proc_bind_conn_to_session(clp, cred,
+						clp->cl_bc_rpcclient,
+						NFS4_CDFC4_BACK,
+						NFS4_CDFS4_BACK);
+	if (ret)
+		return ret;
+	return _nfs4_proc_bind_conn_to_session(clp, cred,
+						clp->cl_rpcclient,
+						NFS4_CDFC4_FORE,
+						NFS4_CDFS4_FORE);
+}
+
 /*
  * Minimum set of SP4_MACH_CRED operations from RFC 5661 in the enforce map
  * and operations we'd like to see to enable certain features in the allow map



* [PATCH v1 16/16] NFS: Disable SESSION4_BACK_CHAN when a backchannel sidecar is to be used
  2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
                   ` (14 preceding siblings ...)
  2014-10-16 19:40 ` [PATCH v1 15/16] NFS: Bind side-car connection to session Chuck Lever
@ 2014-10-16 19:40 ` Chuck Lever
  15 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-16 19:40 UTC (permalink / raw)
  To: linux-nfs

A CREATE_SESSION operation with SESSION4_BACK_CHAN cleared is sent
to force the server to report SEQ4_STATUS_CB_PATH_DOWN.  The client
recovers by setting up a sidecar backchannel connection.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfs/nfs4proc.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 2eaf7ec..e0bcc30 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7187,6 +7187,7 @@ static int _nfs4_proc_create_session(struct nfs_client *clp,
 	struct nfs41_create_session_args args = {
 		.client = clp,
 		.cb_program = NFS4_CALLBACK,
+		.flags = SESSION4_PERSIST,
 	};
 	struct nfs41_create_session_res res = {
 		.client = clp,
@@ -7200,7 +7201,8 @@ static int _nfs4_proc_create_session(struct nfs_client *clp,
 	int status;
 
 	nfs4_init_channel_attrs(&args);
-	args.flags = (SESSION4_PERSIST | SESSION4_BACK_CHAN);
+	if (!clp->cl_bc_rpcclient)
+		args.flags |= SESSION4_BACK_CHAN;
 
 	status = rpc_call_sync(session->clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT);
 	trace_nfs4_create_session(clp, status);



* Re: [PATCH v1 02/16] xprtrdma: Cap req_cqinit
  2014-10-16 19:38 ` [PATCH v1 02/16] xprtrdma: Cap req_cqinit Chuck Lever
@ 2014-10-20 13:27   ` Anna Schumaker
  0 siblings, 0 replies; 34+ messages in thread
From: Anna Schumaker @ 2014-10-20 13:27 UTC (permalink / raw)
  To: Chuck Lever, linux-nfs

Hey Chuck,

On 10/16/14 15:38, Chuck Lever wrote:
> Recent work made FRMR registration and invalidation completions
> unsignaled. This greatly reduces the adapter interrupt rate.
>
> Every so often, however, a posted send Work Request is allowed to
> signal. Otherwise, the provider's Work Queue will wrap and the
> workload will hang.
>
> The number of Work Requests that are allowed to remain unsignaled is
> determined by the value of rep_cqinit. Currently, this is set to the
> size of the send Work Queue divided by two, minus 1.
>
> For FRMR, the send Work Queue is the maximum number of concurrent
> RPCs (currently 32) times the maximum number of Work Requests an
> RPC might use (currently 7, though some adapters may need more).
>
> For mlx4, this is 224 entries. This leaves completion signaling
> disabled for 111 send Work Requests.
>
> Some providers hold back dispatching Work Requests until a CQE is
> generated.  If completions are disabled, then no CQEs are generated
> for quite some time, and that can stall the Work Queue.
>
> I've seen this occur running xfstests generic/113 over NFSv4, where
> eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
> because the Work Queue has overflowed. The connection is dropped
> and re-established.
>
> Cap the rep_cqinit setting so completions are not left turned off
> for too long.
>
> BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/verbs.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 6ea2942..5c0c7a5 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -733,6 +733,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>  
>  	/* set trigger for requesting send completion */
>  	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
> +	if (ep->rep_cqinit > 20)
> +		ep->rep_cqinit = 20;
>  	if (ep->rep_cqinit <= 2)

Can you change the ep->rep_cqinit <= 2 check into an else-if? (A
sketch of what I mean follows the quoted hunk below.)

Thanks!
Anna

>  		ep->rep_cqinit = 0;
>  	INIT_CQCOUNT(ep);
>
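
A minimal sketch of what I mean, assuming no other change to
rpcrdma_ep_create():

	/* set trigger for requesting send completion */
	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
	if (ep->rep_cqinit > 20)
		ep->rep_cqinit = 20;
	else if (ep->rep_cqinit <= 2)
		ep->rep_cqinit = 0;
	INIT_CQCOUNT(ep);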


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 03/16] SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments
  2014-10-16 19:38 ` [PATCH v1 03/16] SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments Chuck Lever
@ 2014-10-20 14:04   ` Anna Schumaker
  2014-10-20 18:21     ` Chuck Lever
  0 siblings, 1 reply; 34+ messages in thread
From: Anna Schumaker @ 2014-10-20 14:04 UTC (permalink / raw)
  To: Chuck Lever, linux-nfs

On 10/16/14 15:38, Chuck Lever wrote:
> I noticed that on RDMA, NFSv4 operations were using "hardway"
> allocations much more than not. A "hardway" allocation uses GFP_NOFS
> during each RPC to allocate the XDR buffer, instead of using a
> pre-allocated pre-registered buffer for each RPC.
>
> The pre-allocated buffers are 2200 bytes in length.  The requested
> XDR buffer sizes looked like this:
>
>     GETATTR: 3220 bytes
>     LOOKUP:  3612 bytes
>     WRITE:   3256 bytes
>     OPEN:    6344 bytes
>
> But an NFSv4 GETATTR RPC request should be small. It's the reply
> part of GETATTR that can grow large.
>
> call_allocate() passes a single value as the XDR buffer size: the
> sum of call and reply buffers. However, the xprtrdma transport
> allocates its XDR request and reply buffers separately.
>
> xprtrdma needs to know the maximum call size, as guidance for how
> large the outgoing request is going to be and how the NFS payload
> will be marshalled into chunks.
>
> But RDMA XDR reply buffers are pre-posted, fixed-size buffers, not
> allocated by xprt_rdma_allocate().
>
> Because of the sum passed through ->buf_alloc(), xprtrdma's
> ->buf_alloc() always allocates more XDR buffer than it will ever
> use. For NFSv4, it is unnecessarily triggering the slow "hardway"
> path for almost every RPC.
>
> Pass the call and reply buffer size values separately to the
> transport's ->buf_alloc method.  The RDMA transport ->buf_alloc can
> now ignore the reply size, and allocate just what it will use for
> the call buffer.  The socket transport ->buf_alloc can simply add
> them together, as call_allocate() did before.
>
> With this patch, an NFSv4 GETATTR request now allocates a 476 byte
> RDMA XDR buffer. I didn't see a single NFSv4 request that did not
> fit into the transport's pre-allocated XDR buffer.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/sunrpc/sched.h    |    2 +-
>  include/linux/sunrpc/xprt.h     |    3 ++-
>  net/sunrpc/clnt.c               |    4 ++--
>  net/sunrpc/sched.c              |    6 ++++--
>  net/sunrpc/xprtrdma/transport.c |    2 +-
>  net/sunrpc/xprtsock.c           |    3 ++-
>  6 files changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
> index 1a89599..68fa71d 100644
> --- a/include/linux/sunrpc/sched.h
> +++ b/include/linux/sunrpc/sched.h
> @@ -232,7 +232,7 @@ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *,
>  					void *);
>  void		rpc_wake_up_status(struct rpc_wait_queue *, int);
>  void		rpc_delay(struct rpc_task *, unsigned long);
> -void *		rpc_malloc(struct rpc_task *, size_t);
> +void		*rpc_malloc(struct rpc_task *, size_t, size_t);
>  void		rpc_free(void *);
>  int		rpciod_up(void);
>  void		rpciod_down(void);
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index fcbfe87..632685c 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -124,7 +124,8 @@ struct rpc_xprt_ops {
>  	void		(*rpcbind)(struct rpc_task *task);
>  	void		(*set_port)(struct rpc_xprt *xprt, unsigned short port);
>  	void		(*connect)(struct rpc_xprt *xprt, struct rpc_task *task);
> -	void *		(*buf_alloc)(struct rpc_task *task, size_t size);
> +	void *		(*buf_alloc)(struct rpc_task *task,
> +					size_t call, size_t reply);
>  	void		(*buf_free)(void *buffer);
>  	int		(*send_request)(struct rpc_task *task);
>  	void		(*set_retrans_timeout)(struct rpc_task *task);
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index 488ddee..5e817d6 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -1599,8 +1599,8 @@ call_allocate(struct rpc_task *task)
>  	req->rq_rcvsize = RPC_REPHDRSIZE + slack + proc->p_replen;
>  	req->rq_rcvsize <<= 2;
>  
> -	req->rq_buffer = xprt->ops->buf_alloc(task,
> -					req->rq_callsize + req->rq_rcvsize);
> +	req->rq_buffer = xprt->ops->buf_alloc(task, req->rq_callsize,
> +					      req->rq_rcvsize);
>  	if (req->rq_buffer != NULL)
>  		return;
>  
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 9358c79..fc4f939 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -829,7 +829,8 @@ static void rpc_async_schedule(struct work_struct *work)
>  /**
>   * rpc_malloc - allocate an RPC buffer
>   * @task: RPC task that will use this buffer
> - * @size: requested byte size
> + * @call: maximum size of on-the-wire RPC call, in bytes
> + * @reply: maximum size of on-the-wire RPC reply, in bytes
>   *
>   * To prevent rpciod from hanging, this allocator never sleeps,
>   * returning NULL and suppressing warning if the request cannot be serviced
> @@ -843,8 +844,9 @@ static void rpc_async_schedule(struct work_struct *work)
>   * In order to avoid memory starvation triggering more writebacks of
>   * NFS requests, we avoid using GFP_KERNEL.
>   */
> -void *rpc_malloc(struct rpc_task *task, size_t size)
> +void *rpc_malloc(struct rpc_task *task, size_t call, size_t reply)
>  {
> +	size_t size = call + reply;
>  	struct rpc_buffer *buf;
>  	gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
>  
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 2faac49..6e9d0a7 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -459,7 +459,7 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
>   * the receive buffer portion when using reply chunks.
>   */
>  static void *
> -xprt_rdma_allocate(struct rpc_task *task, size_t size)
> +xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)

The comment right before this function mentions that send and receive buffers are allocated in the same call.  Can you update this?

Anna

>  {
>  	struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;
>  	struct rpcrdma_req *req, *nreq;
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 43cd89e..b4aca48 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -2423,8 +2423,9 @@ static void xs_tcp_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
>   * we allocate pages instead doing a kmalloc like rpc_malloc is because we want
>   * to use the server side send routines.
>   */
> -static void *bc_malloc(struct rpc_task *task, size_t size)
> +static void *bc_malloc(struct rpc_task *task, size_t call, size_t reply)
>  {
> +	size_t size = call + reply;
>  	struct page *page;
>  	struct rpc_buffer *buf;
>  
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-16 19:40 ` [PATCH v1 13/16] NFS: Add sidecar RPC client support Chuck Lever
@ 2014-10-20 17:33   ` Anna Schumaker
  2014-10-20 18:09     ` Chuck Lever
  0 siblings, 1 reply; 34+ messages in thread
From: Anna Schumaker @ 2014-10-20 17:33 UTC (permalink / raw)
  To: Chuck Lever, linux-nfs

On 10/16/14 15:40, Chuck Lever wrote:
> So far, TCP is the only transport that supports bi-directional RPC.
>
> When mounting with NFSv4.1 using a transport that does not support
> bi-directional RPC, establish a TCP sidecar connection to handle
> backchannel traffic for a session. The sidecar transport does not
> use its forward channel except for sending BIND_CONN_TO_SESSION
> operations.
>
> This commit adds logic to create and destroy the sidecar transport.
> Subsequent commits add logic to use the transport.

I thought NFS v4.0 also uses a separate connection for the backchannel?  Can any of that code be reused here, rather than creating new sidecar structures?

Anna

>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfs/client.c           |    1 +
>  fs/nfs/nfs4client.c       |   54 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/nfs_fs_sb.h |    2 ++
>  3 files changed, 57 insertions(+)
>
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index 6a4f366..19f49bf 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -78,6 +78,7 @@ const struct rpc_program nfs_program = {
>  	.stats			= &nfs_rpcstat,
>  	.pipe_dir_name		= NFS_PIPE_DIRNAME,
>  };
> +EXPORT_SYMBOL_GPL(nfs_program);
>  
>  struct rpc_stat nfs_rpcstat = {
>  	.program		= &nfs_program
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index 5f4b818..b1cc35e 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_client *clp)
>  {
>  	if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
>  		nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
> +	if (clp->cl_bc_rpcclient)
> +		rpc_shutdown_client(clp->cl_bc_rpcclient);
>  }
>  
>  static void nfs4_shutdown_client(struct nfs_client *clp)
> @@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)
>  
>  #if defined(CONFIG_NFS_V4_1)
>  
> +/*
> + * Create a separate rpc_clnt using TCP that can provide a
> + * backchannel service.
> + */
> +static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
> +{
> +	struct sockaddr_storage address;
> +	struct sockaddr *sap = (struct sockaddr *)&address;
> +	struct rpc_create_args args = {
> +		.net		= clp->cl_net,
> +		.protocol	= XPRT_TRANSPORT_TCP,
> +		.address	= sap,
> +		.addrsize	= clp->cl_addrlen,
> +		.servername	= clp->cl_hostname,
> +		.program	= &nfs_program,
> +		.version	= clp->rpc_ops->version,
> +		.flags		= (RPC_CLNT_CREATE_DISCRTRY |
> +				  RPC_CLNT_CREATE_NOPING),
> +	};
> +	struct rpc_clnt *clnt;
> +	struct rpc_cred *cred;
> +
> +	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
> +		return 0;
> +
> +	if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
> +		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
> +	memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
> +	rpc_set_port(sap, NFS_PORT);
> +	cred = nfs4_get_clid_cred(clp);
> +	if (cred) {
> +		args.authflavor = cred->cr_auth->au_flavor;
> +		put_rpccred(cred);
> +	} else
> +		args.authflavor = RPC_AUTH_UNIX;
> +
> +	clnt = rpc_create(&args);
> +	if (IS_ERR(clnt)) {
> +		dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
> +			__func__, PTR_ERR(clnt));
> +		return PTR_ERR(clnt);
> +	}
> +
> +	clp->cl_bc_rpcclient = clnt;
> +	return 0;
> +}
> +
>  /**
>   * nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
>   * @clp - nfs_client to initialize
> @@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
>  int nfs41_init_client(struct nfs_client *clp)
>  {
>  	struct nfs4_session *session = NULL;
> +	int ret;
> +
> +	ret = nfs41_create_sidecar_rpc_client(clp);
> +	if (ret)
> +		return ret;
>  
>  	/*
>  	 * Create the session and mark it expired.
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index 922be2e..159d703 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -87,6 +87,8 @@ struct nfs_client {
>  
>  	/* The sequence id to use for the next CREATE_SESSION */
>  	u32			cl_seqid;
> +	/* The optional sidecar backchannel transport */
> +	struct rpc_clnt		*cl_bc_rpcclient;
>  	/* The flags used for obtaining the clientid during EXCHANGE_ID */
>  	u32			cl_exchange_flags;
>  	struct nfs4_session	*cl_session;	/* shared session */
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-20 17:33   ` Anna Schumaker
@ 2014-10-20 18:09     ` Chuck Lever
  2014-10-20 19:40       ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2014-10-20 18:09 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Linux NFS Mailing List


On Oct 20, 2014, at 1:33 PM, Anna Schumaker <Anna.Schumaker@netapp.com> wrote:

> On 10/16/14 15:40, Chuck Lever wrote:
>> So far, TCP is the only transport that supports bi-directional RPC.
>> 
>> When mounting with NFSv4.1 using a transport that does not support
>> bi-directional RPC, establish a TCP sidecar connection to handle
>> backchannel traffic for a session. The sidecar transport does not
>> use its forward channel except for sending BIND_CONN_TO_SESSION
>> operations.
>> 
>> This commit adds logic to create and destroy the sidecar transport.
>> Subsequent commits add logic to use the transport.
> 
> I thought NFS v4.0 also uses a separate connection for the backchannel?

For NFSv4.0, the server opens a connection to the client. For
NFSv4.1, the client opens the side car connection to the
server, and then performs BIND_CONN_TO_SESSION, an operation
not in NFSv4.0. The processes are not terribly similar.

> Can any of that code be reused here, rather than creating new sidecar structures?

Since it’s the client opening a connection in this case,
I don’t see any obvious common code paths that I haven’t
already re-used. I’m open to suggestions.

> Anna
> 
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> fs/nfs/client.c           |    1 +
>> fs/nfs/nfs4client.c       |   54 +++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/nfs_fs_sb.h |    2 ++
>> 3 files changed, 57 insertions(+)
>> 
>> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
>> index 6a4f366..19f49bf 100644
>> --- a/fs/nfs/client.c
>> +++ b/fs/nfs/client.c
>> @@ -78,6 +78,7 @@ const struct rpc_program nfs_program = {
>> 	.stats			= &nfs_rpcstat,
>> 	.pipe_dir_name		= NFS_PIPE_DIRNAME,
>> };
>> +EXPORT_SYMBOL_GPL(nfs_program);
>> 
>> struct rpc_stat nfs_rpcstat = {
>> 	.program		= &nfs_program
>> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
>> index 5f4b818..b1cc35e 100644
>> --- a/fs/nfs/nfs4client.c
>> +++ b/fs/nfs/nfs4client.c
>> @@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_client *clp)
>> {
>> 	if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
>> 		nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
>> +	if (clp->cl_bc_rpcclient)
>> +		rpc_shutdown_client(clp->cl_bc_rpcclient);
>> }
>> 
>> static void nfs4_shutdown_client(struct nfs_client *clp)
>> @@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)
>> 
>> #if defined(CONFIG_NFS_V4_1)
>> 
>> +/*
>> + * Create a separate rpc_clnt using TCP that can provide a
>> + * backchannel service.
>> + */
>> +static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
>> +{
>> +	struct sockaddr_storage address;
>> +	struct sockaddr *sap = (struct sockaddr *)&address;
>> +	struct rpc_create_args args = {
>> +		.net		= clp->cl_net,
>> +		.protocol	= XPRT_TRANSPORT_TCP,
>> +		.address	= sap,
>> +		.addrsize	= clp->cl_addrlen,
>> +		.servername	= clp->cl_hostname,
>> +		.program	= &nfs_program,
>> +		.version	= clp->rpc_ops->version,
>> +		.flags		= (RPC_CLNT_CREATE_DISCRTRY |
>> +				  RPC_CLNT_CREATE_NOPING),
>> +	};
>> +	struct rpc_clnt *clnt;
>> +	struct rpc_cred *cred;
>> +
>> +	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
>> +		return 0;
>> +
>> +	if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
>> +		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
>> +	memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
>> +	rpc_set_port(sap, NFS_PORT);
>> +	cred = nfs4_get_clid_cred(clp);
>> +	if (cred) {
>> +		args.authflavor = cred->cr_auth->au_flavor;
>> +		put_rpccred(cred);
>> +	} else
>> +		args.authflavor = RPC_AUTH_UNIX;
>> +
>> +	clnt = rpc_create(&args);
>> +	if (IS_ERR(clnt)) {
>> +		dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
>> +			__func__, PTR_ERR(clnt));
>> +		return PTR_ERR(clnt);
>> +	}
>> +
>> +	clp->cl_bc_rpcclient = clnt;
>> +	return 0;
>> +}
>> +
>> /**
>>  * nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
>>  * @clp - nfs_client to initialize
>> @@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
>> int nfs41_init_client(struct nfs_client *clp)
>> {
>> 	struct nfs4_session *session = NULL;
>> +	int ret;
>> +
>> +	ret = nfs41_create_sidecar_rpc_client(clp);
>> +	if (ret)
>> +		return ret;
>> 
>> 	/*
>> 	 * Create the session and mark it expired.
>> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
>> index 922be2e..159d703 100644
>> --- a/include/linux/nfs_fs_sb.h
>> +++ b/include/linux/nfs_fs_sb.h
>> @@ -87,6 +87,8 @@ struct nfs_client {
>> 
>> 	/* The sequence id to use for the next CREATE_SESSION */
>> 	u32			cl_seqid;
>> +	/* The optional sidecar backchannel transport */
>> +	struct rpc_clnt		*cl_bc_rpcclient;
>> 	/* The flags used for obtaining the clientid during EXCHANGE_ID */
>> 	u32			cl_exchange_flags;
>> 	struct nfs4_session	*cl_session;	/* shared session */
>> 
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 03/16] SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments
  2014-10-20 14:04   ` Anna Schumaker
@ 2014-10-20 18:21     ` Chuck Lever
  0 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-20 18:21 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Linux NFS Mailing List


On Oct 20, 2014, at 10:04 AM, Anna Schumaker <Anna.Schumaker@netapp.com> wrote:

> On 10/16/14 15:38, Chuck Lever wrote:
>> I noticed that on RDMA, NFSv4 operations were using "hardway"
>> allocations much more than not. A "hardway" allocation uses GFP_NOFS
>> during each RPC to allocate the XDR buffer, instead of using a
>> pre-allocated pre-registered buffer for each RPC.
>> 
>> The pre-allocated buffers are 2200 bytes in length.  The requested
>> XDR buffer sizes looked like this:
>> 
>>    GETATTR: 3220 bytes
>>    LOOKUP:  3612 bytes
>>    WRITE:   3256 bytes
>>    OPEN:    6344 bytes
>> 
>> But an NFSv4 GETATTR RPC request should be small. It's the reply
>> part of GETATTR that can grow large.
>> 
>> call_allocate() passes a single value as the XDR buffer size: the
>> sum of call and reply buffers. However, the xprtrdma transport
>> allocates its XDR request and reply buffers separately.
>> 
>> xprtrdma needs to know the maximum call size, as guidance for how
>> large the outgoing request is going to be and how the NFS payload
>> will be marshalled into chunks.
>> 
>> But RDMA XDR reply buffers are pre-posted, fixed-size buffers, not
>> allocated by xprt_rdma_allocate().
>> 
>> Because of the sum passed through ->buf_alloc(), xprtrdma's
>> ->buf_alloc() always allocates more XDR buffer than it will ever
>> use. For NFSv4, it is unnecessarily triggering the slow "hardway"
>> path for almost every RPC.
>> 
>> Pass the call and reply buffer size values separately to the
>> transport's ->buf_alloc method.  The RDMA transport ->buf_alloc can
>> now ignore the reply size, and allocate just what it will use for
>> the call buffer.  The socket transport ->buf_alloc can simply add
>> them together, as call_allocate() did before.
>> 
>> With this patch, an NFSv4 GETATTR request now allocates a 476 byte
>> RDMA XDR buffer. I didn't see a single NFSv4 request that did not
>> fit into the transport's pre-allocated XDR buffer.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> include/linux/sunrpc/sched.h    |    2 +-
>> include/linux/sunrpc/xprt.h     |    3 ++-
>> net/sunrpc/clnt.c               |    4 ++--
>> net/sunrpc/sched.c              |    6 ++++--
>> net/sunrpc/xprtrdma/transport.c |    2 +-
>> net/sunrpc/xprtsock.c           |    3 ++-
>> 6 files changed, 12 insertions(+), 8 deletions(-)
>> 
>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>> index 1a89599..68fa71d 100644
>> --- a/include/linux/sunrpc/sched.h
>> +++ b/include/linux/sunrpc/sched.h
>> @@ -232,7 +232,7 @@ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *,
>> 					void *);
>> void		rpc_wake_up_status(struct rpc_wait_queue *, int);
>> void		rpc_delay(struct rpc_task *, unsigned long);
>> -void *		rpc_malloc(struct rpc_task *, size_t);
>> +void		*rpc_malloc(struct rpc_task *, size_t, size_t);
>> void		rpc_free(void *);
>> int		rpciod_up(void);
>> void		rpciod_down(void);
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index fcbfe87..632685c 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -124,7 +124,8 @@ struct rpc_xprt_ops {
>> 	void		(*rpcbind)(struct rpc_task *task);
>> 	void		(*set_port)(struct rpc_xprt *xprt, unsigned short port);
>> 	void		(*connect)(struct rpc_xprt *xprt, struct rpc_task *task);
>> -	void *		(*buf_alloc)(struct rpc_task *task, size_t size);
>> +	void *		(*buf_alloc)(struct rpc_task *task,
>> +					size_t call, size_t reply);
>> 	void		(*buf_free)(void *buffer);
>> 	int		(*send_request)(struct rpc_task *task);
>> 	void		(*set_retrans_timeout)(struct rpc_task *task);
>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>> index 488ddee..5e817d6 100644
>> --- a/net/sunrpc/clnt.c
>> +++ b/net/sunrpc/clnt.c
>> @@ -1599,8 +1599,8 @@ call_allocate(struct rpc_task *task)
>> 	req->rq_rcvsize = RPC_REPHDRSIZE + slack + proc->p_replen;
>> 	req->rq_rcvsize <<= 2;
>> 
>> -	req->rq_buffer = xprt->ops->buf_alloc(task,
>> -					req->rq_callsize + req->rq_rcvsize);
>> +	req->rq_buffer = xprt->ops->buf_alloc(task, req->rq_callsize,
>> +					      req->rq_rcvsize);
>> 	if (req->rq_buffer != NULL)
>> 		return;
>> 
>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>> index 9358c79..fc4f939 100644
>> --- a/net/sunrpc/sched.c
>> +++ b/net/sunrpc/sched.c
>> @@ -829,7 +829,8 @@ static void rpc_async_schedule(struct work_struct *work)
>> /**
>>  * rpc_malloc - allocate an RPC buffer
>>  * @task: RPC task that will use this buffer
>> - * @size: requested byte size
>> + * @call: maximum size of on-the-wire RPC call, in bytes
>> + * @reply: maximum size of on-the-wire RPC reply, in bytes
>>  *
>>  * To prevent rpciod from hanging, this allocator never sleeps,
>>  * returning NULL and suppressing warning if the request cannot be serviced
>> @@ -843,8 +844,9 @@ static void rpc_async_schedule(struct work_struct *work)
>>  * In order to avoid memory starvation triggering more writebacks of
>>  * NFS requests, we avoid using GFP_KERNEL.
>>  */
>> -void *rpc_malloc(struct rpc_task *task, size_t size)
>> +void *rpc_malloc(struct rpc_task *task, size_t call, size_t reply)
>> {
>> +	size_t size = call + reply;
>> 	struct rpc_buffer *buf;
>> 	gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
>> 
>> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
>> index 2faac49..6e9d0a7 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -459,7 +459,7 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
>>  * the receive buffer portion when using reply chunks.
>>  */
>> static void *
>> -xprt_rdma_allocate(struct rpc_task *task, size_t size)
>> +xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)

The rpc_rqst actually has separate send and receive sizes
already in it, and the passed-in tk_task has a pointer to
rpc_rqst. So I don’t need to alter the buf_alloc() method
synopsis at all.
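
Roughly, something like this (a sketch only; the helper name is made
up, and the real replacement patch will look different):

/* The upper layer already records both sizes in the rpc_rqst, so
 * the transport can read them without changing the ->buf_alloc()
 * synopsis.
 */
static void
rpcrdma_xdr_buf_sizes(struct rpc_task *task,
		      size_t *callsize, size_t *rcvsize)
{
	struct rpc_rqst *rqst = task->tk_rqstp;

	*callsize = rqst->rq_callsize;	/* guides chunk marshaling */
	*rcvsize = rqst->rq_rcvsize;	/* replies land in pre-posted buffers */
}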

Though I’ve tested with the two server implementations that
I have on hand, I’ve discovered this is a less than generic
approach to the problem. I can’t just slice off the receive
part of this buffer, as it is actually used sometimes
(though apparently not with the Solaris or Linux servers).

For those two reasons, I’m going to replace this patch with
something else in the next round. JFYI so you know what to
look for next time.

> The comment right before this function mentions that send and receive buffers are allocated in the same call.  Can you update this?

If the new patch touches xprt_rdma_allocate(), I can
replace Tom’s implementer’s notes in this function with
some operational documenting comments.

> Anna
> 
>> {
>> 	struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;
>> 	struct rpcrdma_req *req, *nreq;
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 43cd89e..b4aca48 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -2423,8 +2423,9 @@ static void xs_tcp_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
>>  * we allocate pages instead doing a kmalloc like rpc_malloc is because we want
>>  * to use the server side send routines.
>>  */
>> -static void *bc_malloc(struct rpc_task *task, size_t size)
>> +static void *bc_malloc(struct rpc_task *task, size_t call, size_t reply)
>> {
>> +	size_t size = call + reply;
>> 	struct page *page;
>> 	struct rpc_buffer *buf;
>> 
>> 
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-20 18:09     ` Chuck Lever
@ 2014-10-20 19:40       ` Trond Myklebust
  2014-10-20 20:11         ` Chuck Lever
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2014-10-20 19:40 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Anna Schumaker, Linux NFS Mailing List

On Mon, Oct 20, 2014 at 9:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
> On Oct 20, 2014, at 1:33 PM, Anna Schumaker <Anna.Schumaker@netapp.com> wrote:
>
> > On 10/16/14 15:40, Chuck Lever wrote:
> >> So far, TCP is the only transport that supports bi-directional RPC.
> >>
> >> When mounting with NFSv4.1 using a transport that does not support
> >> bi-directional RPC, establish a TCP sidecar connection to handle
> >> backchannel traffic for a session. The sidecar transport does not
> >> use its forward channel except for sending BIND_CONN_TO_SESSION
> >> operations.
> >>
> >> This commit adds logic to create and destroy the sidecar transport.
> >> Subsequent commits add logic to use the transport.
> >
> > I thought NFS v4.0 also uses a separate connection for the backchannel?
>
> For NFSv4.0, the server opens a connection to the client. For
> NFSv4.1, the client opens the side car connection to the
> server, and then performs BIND_CONN_TO_SESSION, an operation
> not in NFSv4.0. The processes are not terribly similar.
>
> > Can any of that code be reused here, rather than creating new sidecar structures?
>
> Since it’s the client opening a connection in this case,
> I don’t see any obvious common code paths that I haven’t
> already re-used. I’m open to suggestions.

Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?

-- 

Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-20 19:40       ` Trond Myklebust
@ 2014-10-20 20:11         ` Chuck Lever
  2014-10-20 22:31           ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2014-10-20 20:11 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey

Hi Trond-

On Oct 20, 2014, at 3:40 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> Why aren't we doing the callbacks via RDMA as per the recommendation
> in RFC5667 section 5.1?

There’s no benefit to it. With a side car, the server requires
few or no changes. There are no CB operations that benefit
from using RDMA. It’s very quick to implement, re-using most of
the client backchannel implementation that already exists.

I’ve discussed this with an author of RFC 5667 [cc’d], and also
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.

Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK’d it.

But I don’t see an explicit recommendation in section 5.1. Which
text are you referring to?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-20 20:11         ` Chuck Lever
@ 2014-10-20 22:31           ` Trond Myklebust
  2014-10-21  1:06             ` Chuck Lever
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2014-10-20 22:31 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey

On Mon, Oct 20, 2014 at 11:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> Hi Trond-
>
> On Oct 20, 2014, at 3:40 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
>> Why aren't we doing the callbacks via RDMA as per the recommendation
>> in RFC5667 section 5.1?
>
> There’s no benefit to it. With a side car, the server requires
> few or no changes. There are no CB operations that benefit
> from using RDMA. It’s very quick to implement, re-using most of
> the client backchannel implementation that already exists.
>
> I’ve discussed this with an author of RFC 5667 [cc’d], and also
> with the implementors of an existing NFSv4.1 server that supports
> RDMA. They both agree that a side car is an acceptable, or even a
> preferable, way to approach backchannel support.
>
> Also, when I discussed this with you months ago, you also felt
> that a side car was better than adding backchannel support to the
> xprtrdma transport. I took this approach only because you OK’d it.
>
> But I don’t see an explicit recommendation in section 5.1. Which
> text are you referring to?

The very first paragraph argues that because callback messages don't
carry bulk data, there is no problem with using RPC/RDMA and, in
particular, with using RDMA_MSG provided that the buffer sizes are
negotiated correctly.

So the questions are:

1) Where is the discussion of the merits for and against adding
bi-directional support to the xprtrdma layer in Linux? What is the
showstopper preventing implementation of a design based around
RFC5667?

2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
 a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
 b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)

All I've found so far on googling these questions is a 5 1/2 year old
email exchange between Tom Tucker and Ricardo where the conclusion
appears to be that we can, in time, implement both designs. However
there is no explanation of why we would want to do so.
http://comments.gmane.org/gmane.linux.nfs/22927

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-20 22:31           ` Trond Myklebust
@ 2014-10-21  1:06             ` Chuck Lever
  2014-10-21  7:45               ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2014-10-21  1:06 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey


On Oct 20, 2014, at 6:31 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Mon, Oct 20, 2014 at 11:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> Hi Trond-
>> 
>> On Oct 20, 2014, at 3:40 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>> 
>>> Why aren't we doing the callbacks via RDMA as per the recommendation
>>> in RFC5667 section 5.1?
>> 
>> There’s no benefit to it. With a side car, the server requires
>> few or no changes. There are no CB operations that benefit
>> from using RDMA. It’s very quick to implement, re-using most of
>> the client backchannel implementation that already exists.
>> 
>> I’ve discussed this with an author of RFC 5667 [cc’d], and also
>> with the implementors of an existing NFSv4.1 server that supports
>> RDMA. They both agree that a side car is an acceptable, or even a
>> preferable, way to approach backchannel support.
>> 
>> Also, when I discussed this with you months ago, you also felt
>> that a side car was better than adding backchannel support to the
>> xprtrdma transport. I took this approach only because you OK’d it.
>> 
>> But I don’t see an explicit recommendation in section 5.1. Which
>> text are you referring to?
> 
> The very first paragraph argues that because callback messages don't
> carry bulk data, there is no problem with using RPC/RDMA and, in
> particular, with using RDMA_MSG provided that the buffer sizes are
> negotiated correctly.

The opening paragraph is advice that applies to all forms
of NFSv4 callback, including NFSv4.0, which uses a separate
transport initiated from the NFS server. Specific advice about
NFSv4.1 bi-directional RPC is left to the next two paragraphs,
but they suggest there be dragons. I rather think this is a
warning not to “go there.”

> So the questions are:
> 
> 1) Where is the discussion of the merits for and against adding
> bi-directional support to the xprtrdma layer in Linux? What is the
> showstopper preventing implementation of a design based around
> RFC5667?

There is no show-stopper (see Section 5.1, after all). It’s
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.

Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
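
For example, after CREATE_SESSION returns, the client could test the
granted flags with something like this (a sketch only; the helper
name is made up, and it assumes the granted flags are kept in the
session's flags word):

/* If the server granted SESSION4_BACK_CHAN on this transport,
 * callbacks can flow on the existing connection and no TCP
 * side-car is needed.
 */
static bool
nfs41_backchannel_granted(struct nfs4_session *session)
{
	return session->flags & SESSION4_BACK_CHAN;
}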

> 2) Why do we instead have to solve the whole backchannel problem in
> the NFSv4.1 layer, and where is the discussion of the merits for and
> against that particular solution? As far as I can tell, it imposes at
> least 2 extra requirements:
> a) NFSv4.1 client+server must have support either for session
> trunking or for clientid trunking

Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.

Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were “if that doesn’t work, it’s a bug”).

Would this restrictive form of trunking present a problem?

> b) NFSv4.1 client must be able to set up a TCP connection to the
> server (that can be session/clientid trunked with the existing RDMA
> channel)

Also very minimal changes. The changes are already done,
posted in v1 of this patch series.

> All I've found so far on googling these questions is a 5 1/2 year old
> email exchange between Tom Tucker and Ricardo where the conclusion
> appears to be that we can, in time, implement both designs.

You and I spoke about this on Feb 13, 2014 during pub night.
At the time you stated that a side-car was the only spec-
compliant way to approach this. I said I would go forward
with the idea in Linux, and you did not object.

> However
> there is no explanation of why we would want to do so.
> http://comments.gmane.org/gmane.linux.nfs/22927

I’ve implemented exactly what Ricardo proposed in this
thread, including dealing with connection loss:

> > The thinking is that NFSRDMA could initially use a TCP callback channel.
> > We'll implement BIND_CONN_TO_SESSION so that the backchannel does not
> > need to be tied to the forechannel connection.  This should address the
> > case where you have NFSRDMA for the forechannel and TCP for the
> > backchannel.  BIND_CONN_TO_SESSION is also required to reestablish
> > dropped connections effectively (to avoid losing the reply cache).


And here’s what you had to say in support of the idea:

> Given what they're hoping to achieve, I'm fine with
> doing a simple implementation of sessions first, then progressively
> refining it.

What’s the next step?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-21  1:06             ` Chuck Lever
@ 2014-10-21  7:45               ` Trond Myklebust
  2014-10-21 17:11                 ` Chuck Lever
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2014-10-21  7:45 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey

On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Oct 20, 2014, at 6:31 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
>> On Mon, Oct 20, 2014 at 11:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>> Hi Trond-
>>>
>>> On Oct 20, 2014, at 3:40 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>
>>>> Why aren't we doing the callbacks via RDMA as per the recommendation
>>>> in RFC5667 section 5.1?
>>>
>>> There’s no benefit to it. With a side car, the server requires
>>> few or no changes. There are no CB operations that benefit
>>> from using RDMA. It’s very quick to implement, re-using most of
>>> the client backchannel implementation that already exists.
>>>
>>> I’ve discussed this with an author of RFC 5667 [cc’d], and also
>>> with the implementors of an existing NFSv4.1 server that supports
>>> RDMA. They both agree that a side car is an acceptable, or even a
>>> preferable, way to approach backchannel support.
>>>
>>> Also, when I discussed this with you months ago, you also felt
>>> that a side car was better than adding backchannel support to the
>>> xprtrdma transport. I took this approach only because you OK’d it.
>>>
>>> But I don’t see an explicit recommendation in section 5.1. Which
>>> text are you referring to?
>>
>> The very first paragraph argues that because callback messages don't
>> carry bulk data, there is no problem with using RPC/RDMA and, in
>> particular, with using RDMA_MSG provided that the buffer sizes are
>> negotiated correctly.
>
> The opening paragraph is advice that applies to all forms
> of NFSv4 callback, including NFSv4.0, which uses a separate
> transport initiated from the NFS server. Specific advice about
> NFSv4.1 bi-directional RPC is left to the next two paragraphs,
> but they suggest there be dragons. I rather think this is a
> warning not to “go there.”

All I see is a warning not to rely on XIDs to distinguish between
backchannel and forward channel RPC requests. How is that particular
to RDMA?

>> So the questions are:
>>
>> 1) Where is the discussion of the merits for and against adding
>> bi-directional support to the xprtrdma layer in Linux? What is the
>> showstopper preventing implementation of a design based around
>> RFC5667?
>
> There is no show-stopper (see Section 5.1, after all). It’s
> simply a matter of development effort: a side-car is much
> less work than implementing full RDMA backchannel support for
> both a client and server, especially since TCP backchannel
> already works and can be used immediately.
>
> Also, no problem with eventually implementing RDMA backchannel
> if the complexity, and any performance overhead it introduces in
> the forward channel, can be justified. The client can use the
> CREATE_SESSION flags to detect what a server supports.

What complexity and performance overhead does it introduce in the
forward channel? I've never been presented with a complete discussion
of this alternative either from you or from anybody else.

>> 2) Why do we instead have to solve the whole backchannel problem in
>> the NFSv4.1 layer, and where is the discussion of the merits for and
>> against that particular solution? As far as I can tell, it imposes at
>> least 2 extra requirements:
>> a) NFSv4.1 client+server must have support either for session
>> trunking or for clientid trunking
>
> Very minimal trunking support. The only operation allowed on
> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>
> Bruce told me that associating multiple transports to a
> clientid/session should not be an issue for his server (his
> words were “if that doesn’t work, it’s a bug”).
>
> Would this restrictive form of trunking present a problem?
>
>> b) NFSv4.1 client must be able to set up a TCP connection to the
>> server (that can be session/clientid trunked with the existing RDMA
>> channel)
>
> Also very minimal changes. The changes are already done,
> posted in v1 of this patch series.

I'm not asking for details on the size of the changesets, but for a
justification of the design itself. If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-21  7:45               ` Trond Myklebust
@ 2014-10-21 17:11                 ` Chuck Lever
  2014-10-22  8:39                   ` Trond Myklebust
  2014-10-23 13:32                   ` J. Bruce Fields
  0 siblings, 2 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-21 17:11 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey


On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> There is no show-stopper (see Section 5.1, after all). It’s
>> simply a matter of development effort: a side-car is much
>> less work than implementing full RDMA backchannel support for
>> both a client and server, especially since TCP backchannel
>> already works and can be used immediately.
>> 
>> Also, no problem with eventually implementing RDMA backchannel
>> if the complexity, and any performance overhead it introduces in
>> the forward channel, can be justified. The client can use the
>> CREATE_SESSION flags to detect what a server supports.
> 
> What complexity and performance overhead does it introduce in the
> forward channel?

The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it’s
never been measured, to my knowledge.

The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That’s typically what
implementations use to steer RPC reply processing.

Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
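
Concretely, the peek would look something like this (a sketch only;
the helper name is made up, and it assumes the start of the RPC
message is available contiguously in the receive buffer):

/* Word 0 of an RPC message is the XID; word 1 is the msg_type
 * discriminant from RFC 5531: RPC_CALL marks a backchannel
 * request, RPC_REPLY a forward channel reply.
 */
static bool
rpcrdma_is_backchannel_call(__be32 *rpc_msg)
{
	return rpc_msg[1] == cpu_to_be32(RPC_CALL);
}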

The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.

>>> 2) Why do we instead have to solve the whole backchannel problem in
>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>> against that particular solution? As far as I can tell, it imposes at
>>> least 2 extra requirements:
>>> a) NFSv4.1 client+server must have support either for session
>>> trunking or for clientid trunking
>> 
>> Very minimal trunking support. The only operation allowed on
>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>> 
>> Bruce told me that associating multiple transports to a
>> clientid/session should not be an issue for his server (his
>> words were “if that doesn’t work, it’s a bug”).
>> 
>> Would this restrictive form of trunking present a problem?
>> 
>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>> server (that can be session/clientid trunked with the existing RDMA
>>> channel)
>> 
>> Also very minimal changes. The changes are already done,
>> posted in v1 of this patch series.
> 
> I'm not asking for details on the size of the changesets, but for a
> justification of the design itself.

The size of the changeset _is_ the justification. It’s
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.

Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.

That’s really all there is to it. It’s almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.

> If it is possible to confine all
> the changes to the RPC/RDMA layer, then why consider patches that
> change the NFSv4.1 layer at all?

The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.

And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).

The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?

There’s also been no discussion of issues with adding a
very restricted amount of transport trunking. Can you
elaborate on the problems this could introduce?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-21 17:11                 ` Chuck Lever
@ 2014-10-22  8:39                   ` Trond Myklebust
  2014-10-22 17:20                     ` Chuck Lever
  2014-10-23 13:32                   ` J. Bruce Fields
  1 sibling, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2014-10-22  8:39 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey

On Tue, Oct 21, 2014 at 8:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
>> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>> There is no show-stopper (see Section 5.1, after all). It’s
>>> simply a matter of development effort: a side-car is much
>>> less work than implementing full RDMA backchannel support for
>>> both a client and server, especially since TCP backchannel
>>> already works and can be used immediately.
>>>
>>> Also, no problem with eventually implementing RDMA backchannel
>>> if the complexity, and any performance overhead it introduces in
>>> the forward channel, can be justified. The client can use the
>>> CREATE_SESSION flags to detect what a server supports.
>>
>> What complexity and performance overhead does it introduce in the
>> forward channel?
>
> The benefit of RDMA is that there are opportunities to
> reduce host CPU interaction with incoming data.
> Bi-direction requires that the transport look at the RPC
> header to determine the direction of the message. That
> could have an impact on the forward channel, but it’s
> never been measured, to my knowledge.
>
> The reason this is more of an issue for RPC/RDMA is that
> a copy of the XID appears in the RPC/RDMA header to avoid
> the need to look at the RPC header. That’s typically what
> implementations use to steer RPC reply processing.
>
> Often the RPC/RDMA header and RPC header land in
> disparate buffers. The RPC/RDMA reply handler looks
> strictly at the RPC/RDMA header, and runs in a tasklet
> usually on a different CPU. Adding bi-direction would mean
> the transport would have to peek into the upper layer
> headers, possibly resulting in cache line bouncing.

Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?

> The complexity would be the addition of over a hundred
> new lines of code on the client, and possibly a similar
> amount of new code on the server. Small, perhaps, but
> not insignificant.

Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.

>>>> 2) Why do we instead have to solve the whole backchannel problem in
>>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>>> against that particular solution? As far as I can tell, it imposes at
>>>> least 2 extra requirements:
>>>> a) NFSv4.1 client+server must have support either for session
>>>> trunking or for clientid trunking
>>>
>>> Very minimal trunking support. The only operation allowed on
>>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>>>
>>> Bruce told me that associating multiple transports to a
>>> clientid/session should not be an issue for his server (his
>>> words were “if that doesn’t work, it’s a bug”).
>>>
>>> Would this restrictive form of trunking present a problem?
>>>
>>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>>> server (that can be session/clientid trunked with the existing RDMA
>>>> channel)
>>>
>>> Also very minimal changes. The changes are already done,
>>> posted in v1 of this patch series.
>>
>> I'm not asking for details on the size of the changesets, but for a
>> justification of the design itself.
>
> The size of the changeset _is_ the justification. It’s
> a much less invasive change to add a TCP side-car than
> it is to implement RDMA backchannel on both server and
> client.

Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".

> Most servers would require almost no change. Linux needs
> only a bug fix or two. Effectively zero-impact for
> servers that already support NFSv4.0 on RDMA to get
> NFSv4.1 and pNFS on RDMA, with working callbacks.
>
> That’s really all there is to it. It’s almost entirely a
> practical consideration: we have the infrastructure and
> can make it work in just a few lines of code.
>
>> If it is possible to confine all
>> the changes to the RPC/RDMA layer, then why consider patches that
>> change the NFSv4.1 layer at all?
>
> The fast new transport bring-up benefit is probably the
> biggest win. A TCP side-car makes bringing up any new
> transport implementation simpler.

That's an assertion that assumes:
- we actually want to implement more transports aside from RDMA.
- implementing bi-directional transports in the RPC layer is non-simple

Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.

> And, RPC/RDMA offers zero performance benefit for
> backchannel traffic, especially since CB traffic would
> never move via RDMA READ/WRITE (as per RFC 5667 section
> 5.1).
>
> The primary benefit to doing an RPC/RDMA-only solution
> is that there is no upper layer impact. Is that a design
> requirement?
>
> There’s also been no discussion of issues with adding a
> very restricted amount of transport trunking. Can you
> elaborate on the problems this could introduce?

I haven't looked into those problems and I'm not the one asking anyone
to implement trunking.

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-22  8:39                   ` Trond Myklebust
@ 2014-10-22 17:20                     ` Chuck Lever
  2014-10-22 20:53                       ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2014-10-22 17:20 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey


> On Oct 22, 2014, at 4:39 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
>> On Tue, Oct 21, 2014 at 8:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>>> On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>> 
>>>> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> There is no show-stopper (see Section 5.1, after all). It’s
>>>> simply a matter of development effort: a side-car is much
>>>> less work than implementing full RDMA backchannel support for
>>>> both a client and server, especially since TCP backchannel
>>>> already works and can be used immediately.
>>>> 
>>>> Also, no problem with eventually implementing RDMA backchannel
>>>> if the complexity, and any performance overhead it introduces in
>>>> the forward channel, can be justified. The client can use the
>>>> CREATE_SESSION flags to detect what a server supports.
>>> 
>>> What complexity and performance overhead does it introduce in the
>>> forward channel?
>> 
>> The benefit of RDMA is that there are opportunities to
>> reduce host CPU interaction with incoming data.
>> Bi-direction requires that the transport look at the RPC
>> header to determine the direction of the message. That
>> could have an impact on the forward channel, but it’s
>> never been measured, to my knowledge.
>> 
>> The reason this is more of an issue for RPC/RDMA is that
>> a copy of the XID appears in the RPC/RDMA header to avoid
>> the need to look at the RPC header. That’s typically what
>> implementations use to steer RPC reply processing.
>> 
>> Often the RPC/RDMA header and RPC header land in
>> disparate buffers. The RPC/RDMA reply handler looks
>> strictly at the RPC/RDMA header, and runs in a tasklet
>> usually on a different CPU. Adding bi-direction would mean
>> the transport would have to peek into the upper layer
>> headers, possibly resulting in cache line bouncing.
> 
> Under what circumstances would you expect to receive a valid NFSv4.1
> callback with an RDMA header that spans multiple cache lines?

The RPC header and RPC/RDMA header are separate entities, but
together can span multiple cache lines if the server has returned a
chunk list containing multiple entries.

For example, RDMA_NOMSG would send the RPC/RDMA header
via RDMA SEND with a chunk list that represents the RPC and NFS
payload. That list could make the header larger than 32 bytes.

I expect that any callback that involves more than 1024 bytes of
RPC payload will need to use RDMA_NOMSG. A long device
info list might fit that category?
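
To make the cache-line point concrete, here is a back-of-the-envelope
calculation of the encoded RPC/RDMA header size for an RDMA_NOMSG
message whose chunk list carries the RPC and NFS payload, based on my
reading of the XDR in RFC 5666. A sketch only: the macro and function
names are mine, not the kernel's.

#include <stdio.h>

/* Approximate XDR sizes, per RFC 5666 as I read it. */
#define RPCRDMA_FIXED    16  /* xid + vers + credits + proc          */
#define ITEM_PRESENT      4  /* XDR optional-data "present" flag     */
#define LIST_END          4  /* XDR optional-data "not present" flag */
#define READ_SEG         20  /* position + handle + length + offset  */

static unsigned int nomsg_header_size(unsigned int nsegs)
{
        return RPCRDMA_FIXED +
               nsegs * (ITEM_PRESENT + READ_SEG) + LIST_END + /* read list      */
               LIST_END +                                     /* no write list  */
               LIST_END;                                      /* no reply chunk */
}

int main(void)
{
        unsigned int n;

        /* A single read segment already gives a 52-byte header. */
        for (n = 1; n <= 4; n++)
                printf("%u segment(s): %u-byte RPC/RDMA header\n",
                       n, nomsg_header_size(n));
        return 0;
}

One segment already exceeds 32 bytes; a multi-segment list plus the
separately-landing RPC header is what ends up crossing cache lines.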

>> The complexity would be the addition of over a hundred
>> new lines of code on the client, and possibly a similar
>> amount of new code on the server. Small, perhaps, but
>> not insignificant.
> 
> Until there are RDMA users, I care a lot less about code changes to
> xprtrdma than to NFS.
> 
>>>>> 2) Why do we instead have to solve the whole backchannel problem in
>>>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>>>> against that particular solution? As far as I can tell, it imposes at
>>>>> least 2 extra requirements:
>>>>> a) NFSv4.1 client+server must have support either for session
>>>>> trunking or for clientid trunking
>>>> 
>>>> Very minimal trunking support. The only operation allowed on
>>>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>>>> 
>>>> Bruce told me that associating multiple transports to a
>>>> clientid/session should not be an issue for his server (his
>>>> words were “if that doesn’t work, it’s a bug”).
>>>> 
>>>> Would this restrictive form of trunking present a problem?
>>>> 
>>>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>>>> server (that can be session/clientid trunked with the existing RDMA
>>>>> channel)
>>>> 
>>>> Also very minimal changes. The changes are already done,
>>>> posted in v1 of this patch series.
>>> 
>>> I'm not asking for details on the size of the changesets, but for a
>>> justification of the design itself.
>> 
>> The size of the changeset _is_ the justification. It’s
>> a much less invasive change to add a TCP side-car than
>> it is to implement RDMA backchannel on both server and
>> client.
> 
> Please define your use of the word "invasive" in the above context. To
> me "invasive" means "will affect code that is in use by others".

The server side, then, is non-invasive. The client side makes minor
changes to state management.

> 
>> Most servers would require almost no change. Linux needs
>> only a bug fix or two. Effectively zero-impact for
>> servers that already support NFSv4.0 on RDMA to get
>> NFSv4.1 and pNFS on RDMA, with working callbacks.
>> 
>> That’s really all there is to it. It’s almost entirely a
>> practical consideration: we have the infrastructure and
>> can make it work in just a few lines of code.
>> 
>>> If it is possible to confine all
>>> the changes to the RPC/RDMA layer, then why consider patches that
>>> change the NFSv4.1 layer at all?
>> 
>> The fast new transport bring-up benefit is probably the
>> biggest win. A TCP side-car makes bringing up any new
>> transport implementation simpler.
> 
> That's an assertion that assumes:
> - we actually want to implement more transports aside from RDMA

So you no longer consider RPC/SCTP a possibility?

> - implementing bi-directional transports in the RPC layer is non-simple

I don't care to generalize about that. In the RPC/RDMA case, there
are some complications that make it non-simple, but not impossible.
So we have an example of a non-simple case, IMO.

> Right now, the benefit is only to RDMA users. Nobody else is asking
> for such a change.
> 
>> And, RPC/RDMA offers zero performance benefit for
>> backchannel traffic, especially since CB traffic would
>> never move via RDMA READ/WRITE (as per RFC 5667 section
>> 5.1).
>> 
>> The primary benefit to doing an RPC/RDMA-only solution
>> is that there is no upper layer impact. Is that a design
>> requirement?

Based on your objections, it appears that "no upper layer
impact" is a hard design requirement. I will take this as a
NACK for the side-car approach.

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-22 17:20                     ` Chuck Lever
@ 2014-10-22 20:53                       ` Trond Myklebust
  2014-10-22 22:38                         ` Chuck Lever
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2014-10-22 20:53 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey

On Wed, Oct 22, 2014 at 8:20 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On Oct 22, 2014, at 4:39 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>
>>> On Tue, Oct 21, 2014 at 8:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>>> On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>
>>>>> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>
>>>>> There is no show-stopper (see Section 5.1, after all). It’s
>>>>> simply a matter of development effort: a side-car is much
>>>>> less work than implementing full RDMA backchannel support for
>>>>> both a client and server, especially since TCP backchannel
>>>>> already works and can be used immediately.
>>>>>
>>>>> Also, no problem with eventually implementing RDMA backchannel
>>>>> if the complexity, and any performance overhead it introduces in
>>>>> the forward channel, can be justified. The client can use the
>>>>> CREATE_SESSION flags to detect what a server supports.
>>>>
>>>> What complexity and performance overhead does it introduce in the
>>>> forward channel?
>>>
>>> The benefit of RDMA is that there are opportunities to
>>> reduce host CPU interaction with incoming data.
>>> Bi-direction requires that the transport look at the RPC
>>> header to determine the direction of the message. That
>>> could have an impact on the forward channel, but it’s
>>> never been measured, to my knowledge.
>>>
>>> The reason this is more of an issue for RPC/RDMA is that
>>> a copy of the XID appears in the RPC/RDMA header to avoid
>>> the need to look at the RPC header. That’s typically what
>>> implementations use to steer RPC reply processing.
>>>
>>> Often the RPC/RDMA header and RPC header land in
>>> disparate buffers. The RPC/RDMA reply handler looks
>>> strictly at the RPC/RDMA header, and runs in a tasklet
>>> usually on a different CPU. Adding bi-direction would mean
>>> the transport would have to peek into the upper layer
>>> headers, possibly resulting in cache line bouncing.
>>
>> Under what circumstances would you expect to receive a valid NFSv4.1
>> callback with an RDMA header that spans multiple cache lines?
>
> The RPC header and RPC/RDMA header are separate entities, but
> together can span multiple cache lines if the server has returned a
> chunk list containing multiple entries.
>
> For example, RDMA_NOMSG would send the RPC/RDMA header
> via RDMA SEND with a chunk list that represents the RPC and NFS
> payload. That list could make the header larger than 32 bytes.
>
> I expect that any callback that involves more than 1024 bytes of
> RPC payload will need to use RDMA_NOMSG. A long device
> info list might fit that category?

Right, but are there any callbacks that would do that? AFAICS, most of
them are CB_SEQUENCE+(PUT_FH+CB_do_some_recall_operation_on_this_file
| some single CB_operation)

The point is that we can set finite limits on the size of callbacks in
the CREATE_SESSION. As long as those limits are reasonable (and 1K
does seem more than reasonable for existing use cases) then why
shouldn't we be able to expect the server to use RDMA_MSG?
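
For instance (a sketch only, using the channel_attrs4 field names from
RFC 5661; the structure and helper are stand-ins, not the client's
actual CREATE_SESSION argument setup):

/* Subset of channel_attrs4 (RFC 5661) relevant to sizing. */
struct backchannel_attrs_sketch {
        unsigned int ca_maxrequestsize;    /* largest CB call we will accept */
        unsigned int ca_maxresponsesize;   /* largest CB reply we will send  */
};

static void init_backchannel_attrs(struct backchannel_attrs_sketch *bc)
{
        /*
         * 1KB comfortably covers CB_SEQUENCE plus a typical single
         * recall-style operation. A server that honours these limits
         * has no need for RDMA_NOMSG on the backchannel.
         */
        bc->ca_maxrequestsize  = 1024;
        bc->ca_maxresponsesize = 1024;
}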

>>> The complexity would be the addition of over a hundred
>>> new lines of code on the client, and possibly a similar
>>> amount of new code on the server. Small, perhaps, but
>>> not insignificant.
>>
>> Until there are RDMA users, I care a lot less about code changes to
>> xprtrdma than to NFS.
>>
>>>>>> 2) Why do we instead have to solve the whole backchannel problem in
>>>>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>>>>> against that particular solution? As far as I can tell, it imposes at
>>>>>> least 2 extra requirements:
>>>>>> a) NFSv4.1 client+server must have support either for session
>>>>>> trunking or for clientid trunking
>>>>>
>>>>> Very minimal trunking support. The only operation allowed on
>>>>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>>>>>
>>>>> Bruce told me that associating multiple transports to a
>>>>> clientid/session should not be an issue for his server (his
>>>>> words were “if that doesn’t work, it’s a bug”).
>>>>>
>>>>> Would this restrictive form of trunking present a problem?
>>>>>
>>>>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>>>>> server (that can be session/clientid trunked with the existing RDMA
>>>>>> channel)
>>>>>
>>>>> Also very minimal changes. The changes are already done,
>>>>> posted in v1 of this patch series.
>>>>
>>>> I'm not asking for details on the size of the changesets, but for a
>>>> justification of the design itself.
>>>
>>> The size of the changeset _is_ the justification. It’s
>>> a much less invasive change to add a TCP side-car than
>>> it is to implement RDMA backchannel on both server and
>>> client.
>>
>> Please define your use of the word "invasive" in the above context. To
>> me "invasive" means "will affect code that is in use by others".
>
> The server side, then, is non-invasive. The client side makes minor
> changes to state management.
>
>>
>>> Most servers would require almost no change. Linux needs
>>> only a bug fix or two. Effectively zero-impact for
>>> servers that already support NFSv4.0 on RDMA to get
>>> NFSv4.1 and pNFS on RDMA, with working callbacks.
>>>
>>> That’s really all there is to it. It’s almost entirely a
>>> practical consideration: we have the infrastructure and
>>> can make it work in just a few lines of code.
>>>
>>>> If it is possible to confine all
>>>> the changes to the RPC/RDMA layer, then why consider patches that
>>>> change the NFSv4.1 layer at all?
>>>
>>> The fast new transport bring-up benefit is probably the
>>> biggest win. A TCP side-car makes bringing up any new
>>> transport implementation simpler.
>>
>> That's an assertion that assumes:
>> - we actually want to implement more transports aside from RDMA
>
> So you no longer consider RPC/SCTP a possibility?

I'd still like to consider it, but the whole point would be to _avoid_
doing trunking in the NFS layer. SCTP does trunking/multi-pathing at
the transport level, meaning that we don't have to deal with tracking
connections, state, replaying messages, etc.
Doing bi-directional RPC with SCTP is not an issue, since the
transport is fully symmetric.

>> - implementing bi-directional transports in the RPC layer is non-simple
>
> I don't care to generalize about that. In the RPC/RDMA case, there
> are some complications that make it non-simple, but not impossible.
> So we have an example of a non-simple case, IMO.
>
>> Right now, the benefit is only to RDMA users. Nobody else is asking
>> for such a change.
>>
>>> And, RPC/RDMA offers zero performance benefit for
>>> backchannel traffic, especially since CB traffic would
>>> never move via RDMA READ/WRITE (as per RFC 5667 section
>>> 5.1).
>>>
>>> The primary benefit to doing an RPC/RDMA-only solution
>>> is that there is no upper layer impact. Is that a design
>>> requirement?
>
> Based on your objections, it appears that "no upper layer
> impact" is a hard design requirement. I will take this as a
> NACK for the side-car approach.

There is not a hard NACK yet, but I am asking for stronger
justification. I do _not_ want to find myself in a situation 2 or 3
years down the road where I have to argue against someone telling me
that we additionally have to implement callbacks over IB/RDMA because
the TCP sidecar is an incomplete solution. We should do either one or
the other, but not both...

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-22 20:53                       ` Trond Myklebust
@ 2014-10-22 22:38                         ` Chuck Lever
  0 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-22 22:38 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Anna Schumaker, Linux NFS Mailing List, Tom Talpey


On Oct 22, 2014, at 4:53 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Wed, Oct 22, 2014 at 8:20 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>>> On Oct 22, 2014, at 4:39 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>> 
>>>> On Tue, Oct 21, 2014 at 8:11 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>>> On Oct 21, 2014, at 3:45 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>> 
>>>>>> On Tue, Oct 21, 2014 at 4:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> There is no show-stopper (see Section 5.1, after all). It’s
>>>>>> simply a matter of development effort: a side-car is much
>>>>>> less work than implementing full RDMA backchannel support for
>>>>>> both a client and server, especially since TCP backchannel
>>>>>> already works and can be used immediately.
>>>>>> 
>>>>>> Also, no problem with eventually implementing RDMA backchannel
>>>>>> if the complexity, and any performance overhead it introduces in
>>>>>> the forward channel, can be justified. The client can use the
>>>>>> CREATE_SESSION flags to detect what a server supports.
>>>>> 
>>>>> What complexity and performance overhead does it introduce in the
>>>>> forward channel?
>>>> 
>>>> The benefit of RDMA is that there are opportunities to
>>>> reduce host CPU interaction with incoming data.
>>>> Bi-direction requires that the transport look at the RPC
>>>> header to determine the direction of the message. That
>>>> could have an impact on the forward channel, but it’s
>>>> never been measured, to my knowledge.
>>>> 
>>>> The reason this is more of an issue for RPC/RDMA is that
>>>> a copy of the XID appears in the RPC/RDMA header to avoid
>>>> the need to look at the RPC header. That’s typically what
>>>> implementations use to steer RPC reply processing.
>>>> 
>>>> Often the RPC/RDMA header and RPC header land in
>>>> disparate buffers. The RPC/RDMA reply handler looks
>>>> strictly at the RPC/RDMA header, and runs in a tasklet
>>>> usually on a different CPU. Adding bi-direction would mean
>>>> the transport would have to peek into the upper layer
>>>> headers, possibly resulting in cache line bouncing.
>>> 
>>> Under what circumstances would you expect to receive a valid NFSv4.1
>>> callback with an RDMA header that spans multiple cache lines?
>> 
>> The RPC header and RPC/RDMA header are separate entities, but
>> together can span multiple cache lines if the server has returned a
>> chunk list containing multiple entries.
>> 
>> For example, RDMA_NOMSG would send the RPC/RDMA header
>> via RDMA SEND with a chunk list that represents the RPC and NFS
>> payload. That list could make the header larger than 32 bytes.
>> 
>> I expect that any callback that involves more than 1024 bytes of
>> RPC payload will need to use RDMA_NOMSG. A long device
>> info list might fit that category?
> 
> Right, but are there any callbacks that would do that? AFAICS, most of
> them are CB_SEQUENCE+(PUT_FH+CB_do_some_recall_operation_on_this_file
> | some single CB_operation)

That is a question only a pNFS layout developer can answer.

Allowing larger CB operations might be important. I thought
I heard Matt list a couple of examples that might move bulk
data via CB, but probably none that have implementations
currently.

I’m not familiar with block or flex-file, so I can’t make
any kind of guess about those.

> The point is that we can set finite limits on the size of callbacks in
> the CREATE_SESSION. As long as those limits are reasonable (and 1K
> does seem more than reasonable for existing use cases) then why
> shouldn't we be able to expect the server to use RDMA_MSG?

The spec allows both RDMA_MSG and RDMA_NOMSG for CB
RPC. That provides some implementation flexibility, but
it also means either end can use either MSG type. An
interoperable CB service would have to support all
scenarios.

RDMA_NOMSG can support small or large payloads. RDMA_MSG
can support only a payload that fits in the receiver’s
pre-posted buffer (and that includes the RPC and NFS
headers) because NFS CB RPCs are not allowed to use
read or write chunks.

Since there are currently no implementations, I was
hoping everyone might agree to stick with using only
RDMA_NOMSG for NFSv4.1 CB RPCs on RPC/RDMA. That would
mean there was a high limit on CB RPC payload size, and
RDMA READ would be used to move CB RPC calls and replies
in all cases.

NFS uses NOMSG so infrequently that it would be a good
way to limit churn in the transport’s hot paths.
In particular, NFS READ and WRITE are required to use
RDMA_MSG with their payload encoded in chunks. If the
incoming message is MSG, then clearly it can’t be a
reverse RPC, and there’s no need to go looking for it
(as long as we all agree on the “CBs use only NOMSG”
convention).

The receiver can tell just by looking at the RPC/RDMA
header that extra processing won’t be needed in the
common case.
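
As a rough sketch of that receive-path check (the names below are
illustrative, not today's xprtrdma symbols, and the reverse-direction
handling does not exist yet):

#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

enum { RDMA_MSG = 0, RDMA_NOMSG = 1 };   /* rdma_proc values, RFC 5666 */
enum { RPC_CALL = 0, RPC_REPLY = 1 };    /* msg_type values, RFC 5531  */

/* First four XDR words of the transport header, as received. */
struct rpcrdma_hdr_sketch {
        uint32_t xid;
        uint32_t vers;
        uint32_t credits;
        uint32_t proc;
};

/*
 * Called from the reply handler. rpc_words points at the RPC header
 * once the NOMSG payload has been pulled in via the chunk list; it is
 * NULL until then. (The kernel would use be32_to_cpu, of course.)
 */
static bool could_be_backchannel_call(const struct rpcrdma_hdr_sketch *hdr,
                                      const uint32_t *rpc_words)
{
        /*
         * Under the "CBs use only NOMSG" convention, an RDMA_MSG
         * message can only be a forward-channel reply: steer it by
         * XID as today, no peek at the RPC header required.
         */
        if (ntohl(hdr->proc) != RDMA_NOMSG)
                return false;

        /*
         * NOMSG is rare on the forward channel, so only here do we
         * look at the RPC header's direction field.
         */
        return rpc_words && ntohl(rpc_words[1]) == RPC_CALL;
}

The common case (RDMA_MSG) returns immediately without touching the
RPC header, which is the property I want to preserve.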

Clearly we need a prototype or two to understand these
issues. And probably some WG discussion is warranted.

>>>> The complexity would be the addition of over a hundred
>>>> new lines of code on the client, and possibly a similar
>>>> amount of new code on the server. Small, perhaps, but
>>>> not insignificant.
>>> 
>>> Until there are RDMA users, I care a lot less about code changes to
>>> xprtrdma than to NFS.
>>> 
>>>>>>> 2) Why do we instead have to solve the whole backchannel problem in
>>>>>>> the NFSv4.1 layer, and where is the discussion of the merits for and
>>>>>>> against that particular solution? As far as I can tell, it imposes at
>>>>>>> least 2 extra requirements:
>>>>>>> a) NFSv4.1 client+server must have support either for session
>>>>>>> trunking or for clientid trunking
>>>>>> 
>>>>>> Very minimal trunking support. The only operation allowed on
>>>>>> the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
>>>>>> 
>>>>>> Bruce told me that associating multiple transports to a
>>>>>> clientid/session should not be an issue for his server (his
>>>>>> words were “if that doesn’t work, it’s a bug”).
>>>>>> 
>>>>>> Would this restrictive form of trunking present a problem?
>>>>>> 
>>>>>>> b) NFSv4.1 client must be able to set up a TCP connection to the
>>>>>>> server (that can be session/clientid trunked with the existing RDMA
>>>>>>> channel)
>>>>>> 
>>>>>> Also very minimal changes. The changes are already done,
>>>>>> posted in v1 of this patch series.
>>>>> 
>>>>> I'm not asking for details on the size of the changesets, but for a
>>>>> justification of the design itself.
>>>> 
>>>> The size of the changeset _is_ the justification. It’s
>>>> a much less invasive change to add a TCP side-car than
>>>> it is to implement RDMA backchannel on both server and
>>>> client.
>>> 
>>> Please define your use of the word "invasive" in the above context. To
>>> me "invasive" means "will affect code that is in use by others".
>> 
>> The server side, then, is non-invasive. The client side makes minor
>> changes to state management.
>> 
>>> 
>>>> Most servers would require almost no change. Linux needs
>>>> only a bug fix or two. Effectively zero-impact for
>>>> servers that already support NFSv4.0 on RDMA to get
>>>> NFSv4.1 and pNFS on RDMA, with working callbacks.
>>>> 
>>>> That’s really all there is to it. It’s almost entirely a
>>>> practical consideration: we have the infrastructure and
>>>> can make it work in just a few lines of code.
>>>> 
>>>>> If it is possible to confine all
>>>>> the changes to the RPC/RDMA layer, then why consider patches that
>>>>> change the NFSv4.1 layer at all?
>>>> 
>>>> The fast new transport bring-up benefit is probably the
>>>> biggest win. A TCP side-car makes bringing up any new
>>>> transport implementation simpler.
>>> 
>>> That's an assertion that assumes:
>>> - we actually want to implement more transports aside from RDMA
>> 
>> So you no longer consider RPC/SCTP a possibility?
> 
> I'd still like to consider it, but the whole point would be to _avoid_
> doing trunking in the NFS layer. SCTP does trunking/multi-pathing at
> the transport level, meaning that we don't have to deal with tracking
> connections, state, replaying messages, etc.
> Doing bi-directional RPC with SCTP is not an issue, since the
> transport is fully symmetric.
> 
>>> - implementing bi-directional transports in the RPC layer is non-simple
>> 
>> I don't care to generalize about that. In the RPC/RDMA case, there
>> are some complications that make it non-simple, but not impossible.
>> So we have an example of a non-simple case, IMO.
>> 
>>> Right now, the benefit is only to RDMA users. Nobody else is asking
>>> for such a change.
>>> 
>>>> And, RPC/RDMA offers zero performance benefit for
>>>> backchannel traffic, especially since CB traffic would
>>>> never move via RDMA READ/WRITE (as per RFC 5667 section
>>>> 5.1).
>>>> 
>>>> The primary benefit to doing an RPC/RDMA-only solution
>>>> is that there is no upper layer impact. Is that a design
>>>> requirement?
>> 
>> Based on your objections, it appears that "no upper layer
>> impact" is a hard design requirement. I will take this as a
>> NACK for the side-car approach.
> 
> There is not a hard NACK yet, but I am asking for stronger
> justification. I do _not_ want to find myself in a situation 2 or 3
> years down the road where I have to argue against someone telling me
> that we additionally have to implement callbacks over IB/RDMA because
> the TCP sidecar is an incomplete solution. We should do either one or
> the other, but not both…

It is impossible to predict the future. However, I’m not
sure there’s a problem building both eventually, especially
because there is no spec guidance I’m aware of about which
bi-directional RPC mechanisms MUST be supported for NFS/RDMA.

I might be wrong, but we have enough mechanism in
CREATE_SESSION for a client with bi-directional RPC/RDMA
support to detect a server with no bi-directional RPC/RDMA,
and use side-car only in that case.
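
Roughly the decision I have in mind, assuming both mechanisms exist
someday (a sketch only: the structure and helpers are stand-ins for
real client code; the flag is the CREATE_SESSION4_FLAG_CONN_BACK_CHAN
bit from RFC 5661):

#define SESSION4_BACK_CHAN  0x2   /* CREATE_SESSION4_FLAG_CONN_BACK_CHAN */

struct cb_channel_choice {
        unsigned int session_flags;   /* flags echoed in the CREATE_SESSION reply */
        int transport_is_rdma;
};

/* Hypothetical hooks for the two possible backchannel bindings. */
extern int bind_rdma_backchannel(struct cb_channel_choice *c);
extern int bind_tcp_sidecar(struct cb_channel_choice *c);  /* BIND_CONN_TO_SESSION */

static int choose_backchannel(struct cb_channel_choice *c)
{
        if (!c->transport_is_rdma)
                return 0;   /* a TCP forward channel already carries callbacks */

        /* Server agreed to bind the backchannel to the RDMA connection. */
        if (c->session_flags & SESSION4_BACK_CHAN)
                return bind_rdma_backchannel(c);

        /* Otherwise fall back to the TCP side-car. */
        return bind_tcp_sidecar(c);
}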

If we need more discussion, I can drop the side-car patches
for 3.19 and do some more research.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-21 17:11                 ` Chuck Lever
  2014-10-22  8:39                   ` Trond Myklebust
@ 2014-10-23 13:32                   ` J. Bruce Fields
  2014-10-23 13:55                     ` Chuck Lever
  1 sibling, 1 reply; 34+ messages in thread
From: J. Bruce Fields @ 2014-10-23 13:32 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Trond Myklebust, Anna Schumaker, Linux NFS Mailing List, Tom Talpey

On Tue, Oct 21, 2014 at 01:11:26PM -0400, Chuck Lever wrote:
> The size of the changeset _is_ the justification. It’s
> a much less invasive change to add a TCP side-car than
> it is to implement RDMA backchannel on both server and
> client.

Something I'm confused about: is bidirectional RPC/RDMA optional or
mandatory for servers to implement?

Something somewhere has to be mandatory if we want to guarantee a
working backchannel between any two implementations.

--b.

* Re: [PATCH v1 13/16] NFS: Add sidecar RPC client support
  2014-10-23 13:32                   ` J. Bruce Fields
@ 2014-10-23 13:55                     ` Chuck Lever
  0 siblings, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2014-10-23 13:55 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond Myklebust, Anna Schumaker, Linux NFS Mailing List, Tom Talpey


On Oct 23, 2014, at 9:32 AM, J. Bruce Fields <bfields@fieldses.org> wrote:

> On Tue, Oct 21, 2014 at 01:11:26PM -0400, Chuck Lever wrote:
>> The size of the changeset _is_ the justification. It’s
>> a much less invasive change to add a TCP side-car than
>> it is to implement RDMA backchannel on both server and
>> client.
> 
> Something I'm confused about: is bidirectional RPC/RDMA optional or
> mandatory for servers to implement?

IMO bi-directional RPC/RDMA is not required anywhere in RFCs 5666
(the RPC/RDMA spec) or 5667 (the NFS on RPC/RDMA spec).

> Something somewhere has to be mandatory if we want to guarantee a
> working backchannel between any two implementations.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




end of thread

Thread overview: 34+ messages
2014-10-16 19:38 [PATCH v1 00/16] NFS/RDMA patches for 3.19 Chuck Lever
2014-10-16 19:38 ` [PATCH v1 01/16] xprtrdma: Return an errno from rpcrdma_register_external() Chuck Lever
2014-10-16 19:38 ` [PATCH v1 02/16] xprtrdma: Cap req_cqinit Chuck Lever
2014-10-20 13:27   ` Anna Schumaker
2014-10-16 19:38 ` [PATCH v1 03/16] SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments Chuck Lever
2014-10-20 14:04   ` Anna Schumaker
2014-10-20 18:21     ` Chuck Lever
2014-10-16 19:38 ` [PATCH v1 04/16] xprtrdma: Re-write rpcrdma_flush_cqs() Chuck Lever
2014-10-16 19:38 ` [PATCH v1 05/16] xprtrdma: unmap all FMRs during transport disconnect Chuck Lever
2014-10-16 19:39 ` [PATCH v1 06/16] xprtrdma: spin CQ completion vectors Chuck Lever
2014-10-16 19:39 ` [PATCH v1 07/16] SUNRPC: serialize iostats updates Chuck Lever
2014-10-16 19:39 ` [PATCH v1 08/16] xprtrdma: Display async errors Chuck Lever
2014-10-16 19:39 ` [PATCH v1 09/16] xprtrdma: Enable pad optimization Chuck Lever
2014-10-16 19:39 ` [PATCH v1 10/16] NFS: Include transport protocol name in UCS client string Chuck Lever
2014-10-16 19:39 ` [PATCH v1 11/16] NFS: Clean up nfs4_init_callback() Chuck Lever
2014-10-16 19:39 ` [PATCH v1 12/16] SUNRPC: Add rpc_xprt_is_bidirectional() Chuck Lever
2014-10-16 19:40 ` [PATCH v1 13/16] NFS: Add sidecar RPC client support Chuck Lever
2014-10-20 17:33   ` Anna Schumaker
2014-10-20 18:09     ` Chuck Lever
2014-10-20 19:40       ` Trond Myklebust
2014-10-20 20:11         ` Chuck Lever
2014-10-20 22:31           ` Trond Myklebust
2014-10-21  1:06             ` Chuck Lever
2014-10-21  7:45               ` Trond Myklebust
2014-10-21 17:11                 ` Chuck Lever
2014-10-22  8:39                   ` Trond Myklebust
2014-10-22 17:20                     ` Chuck Lever
2014-10-22 20:53                       ` Trond Myklebust
2014-10-22 22:38                         ` Chuck Lever
2014-10-23 13:32                   ` J. Bruce Fields
2014-10-23 13:55                     ` Chuck Lever
2014-10-16 19:40 ` [PATCH v1 14/16] NFS: Set BIND_CONN_TO_SESSION arguments in the proc layer Chuck Lever
2014-10-16 19:40 ` [PATCH v1 15/16] NFS: Bind side-car connection to session Chuck Lever
2014-10-16 19:40 ` [PATCH v1 16/16] NFS: Disable SESSION4_BACK_CHAN when a backchannel sidecar is to be used Chuck Lever
