* [PATCH v1 0/4] NFS/RDMA client for v4.21 (part 1)
@ 2018-11-19 15:45 Chuck Lever
  2018-11-19 15:45 ` [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration Chuck Lever
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 15:45 UTC (permalink / raw)
  To: linux-rdma; +Cc: linux-nfs

Hi-

These four patches change details of how the Linux RPC-over-RDMA
client implementation uses RDMA, so I'd like some scrutiny from the
good people on linux-rdma.

I've got a dozen or so additional patches for v4.21 dealing with the
higher layers which I will post next week. I will aim those at
linux-nfs.

---

Chuck Lever (4):
      xprtrdma: Remove support for FMR memory registration
      xprtrdma: mrs_create off-by-one
      xprtrdma: Reduce max_frwr_depth
      xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)


 net/sunrpc/xprtrdma/Makefile    |    3 
 net/sunrpc/xprtrdma/fmr_ops.c   |  337 ---------------------------------------
 net/sunrpc/xprtrdma/frwr_ops.c  |   20 ++
 net/sunrpc/xprtrdma/rpc_rdma.c  |    6 -
 net/sunrpc/xprtrdma/verbs.c     |    6 -
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 
 6 files changed, 19 insertions(+), 355 deletions(-)
 delete mode 100644 net/sunrpc/xprtrdma/fmr_ops.c

--
Chuck Lever


* [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 15:45 [PATCH v1 0/4] NFS/RDMA client for v4.21 (part 1) Chuck Lever
@ 2018-11-19 15:45 ` Chuck Lever
  2018-11-19 16:16   ` Bart Van Assche
  2018-11-19 15:45 ` [PATCH v1 2/4] xprtrdma: mrs_create off-by-one Chuck Lever
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 15:45 UTC (permalink / raw)
  To: linux-rdma; +Cc: linux-nfs

FMR is not supported on most recent RDMA devices. It is also slower
and less secure than FRWR. As discussed during the RDMA BoF at LPC
2018, it is time to remove support for FMR from the NFS/RDMA client
stack. The NFS/RDMA server side uses either local memory registration
or FRWR, so no change is required there to deprecate FMR.

There are a few Infiniband/RoCE devices in the kernel tree that do
not support MEM_MGT_EXTENSIONS, and therefore will no longer support
client-side NFS/RDMA. These are:

 - mthca
 - qib
 - usnic
 - hns (RoCE)
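
As an illustrative aside (not part of this patch), client-side FRWR
support is gated on the MEM_MGT_EXTENSIONS device capability; a
minimal sketch of such a check:

	if (!(ia->ri_device->attrs.device_cap_flags &
	      IB_DEVICE_MEM_MGT_EXTENSIONS))
		return false;	/* no FRWR, so no client-side NFS/RDMA */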

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/Makefile  |    3 
 net/sunrpc/xprtrdma/fmr_ops.c |  337 -----------------------------------------
 net/sunrpc/xprtrdma/verbs.c   |    6 -
 3 files changed, 1 insertion(+), 345 deletions(-)
 delete mode 100644 net/sunrpc/xprtrdma/fmr_ops.c

diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index 8bf19e1..8ed0377 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -1,8 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_SUNRPC_XPRT_RDMA) += rpcrdma.o
 
-rpcrdma-y := transport.o rpc_rdma.o verbs.o \
-	fmr_ops.o frwr_ops.o \
+rpcrdma-y := transport.o rpc_rdma.o verbs.o frwr_ops.o \
 	svc_rdma.o svc_rdma_backchannel.o svc_rdma_transport.o \
 	svc_rdma_sendto.o svc_rdma_recvfrom.o svc_rdma_rw.o \
 	module.o
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
deleted file mode 100644
index 7f5632c..0000000
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ /dev/null
@@ -1,337 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (c) 2015, 2017 Oracle.  All rights reserved.
- * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
- */
-
-/* Lightweight memory registration using Fast Memory Regions (FMR).
- * Referred to sometimes as MTHCAFMR mode.
- *
- * FMR uses synchronous memory registration and deregistration.
- * FMR registration is known to be fast, but FMR deregistration
- * can take tens of usecs to complete.
- */
-
-/* Normal operation
- *
- * A Memory Region is prepared for RDMA READ or WRITE using the
- * ib_map_phys_fmr verb (fmr_op_map). When the RDMA operation is
- * finished, the Memory Region is unmapped using the ib_unmap_fmr
- * verb (fmr_op_unmap).
- */
-
-#include <linux/sunrpc/svc_rdma.h>
-
-#include "xprt_rdma.h"
-#include <trace/events/rpcrdma.h>
-
-#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
-# define RPCDBG_FACILITY	RPCDBG_TRANS
-#endif
-
-/* Maximum scatter/gather per FMR */
-#define RPCRDMA_MAX_FMR_SGES	(64)
-
-/* Access mode of externally registered pages */
-enum {
-	RPCRDMA_FMR_ACCESS_FLAGS	= IB_ACCESS_REMOTE_WRITE |
-					  IB_ACCESS_REMOTE_READ,
-};
-
-bool
-fmr_is_supported(struct rpcrdma_ia *ia)
-{
-	if (!ia->ri_device->alloc_fmr) {
-		pr_info("rpcrdma: 'fmr' mode is not supported by device %s\n",
-			ia->ri_device->name);
-		return false;
-	}
-	return true;
-}
-
-static void
-__fmr_unmap(struct rpcrdma_mr *mr)
-{
-	LIST_HEAD(l);
-	int rc;
-
-	list_add(&mr->fmr.fm_mr->list, &l);
-	rc = ib_unmap_fmr(&l);
-	list_del(&mr->fmr.fm_mr->list);
-	if (rc)
-		pr_err("rpcrdma: final ib_unmap_fmr for %p failed %i\n",
-		       mr, rc);
-}
-
-/* Release an MR.
- */
-static void
-fmr_op_release_mr(struct rpcrdma_mr *mr)
-{
-	int rc;
-
-	kfree(mr->fmr.fm_physaddrs);
-	kfree(mr->mr_sg);
-
-	/* In case this one was left mapped, try to unmap it
-	 * to prevent dealloc_fmr from failing with EBUSY
-	 */
-	__fmr_unmap(mr);
-
-	rc = ib_dealloc_fmr(mr->fmr.fm_mr);
-	if (rc)
-		pr_err("rpcrdma: final ib_dealloc_fmr for %p returned %i\n",
-		       mr, rc);
-
-	kfree(mr);
-}
-
-/* MRs are dynamically allocated, so simply clean up and release the MR.
- * A replacement MR will subsequently be allocated on demand.
- */
-static void
-fmr_mr_recycle_worker(struct work_struct *work)
-{
-	struct rpcrdma_mr *mr = container_of(work, struct rpcrdma_mr, mr_recycle);
-	struct rpcrdma_xprt *r_xprt = mr->mr_xprt;
-
-	trace_xprtrdma_mr_recycle(mr);
-
-	trace_xprtrdma_mr_unmap(mr);
-	ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
-			mr->mr_sg, mr->mr_nents, mr->mr_dir);
-
-	spin_lock(&r_xprt->rx_buf.rb_mrlock);
-	list_del(&mr->mr_all);
-	r_xprt->rx_stats.mrs_recycled++;
-	spin_unlock(&r_xprt->rx_buf.rb_mrlock);
-	fmr_op_release_mr(mr);
-}
-
-static int
-fmr_op_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr)
-{
-	static struct ib_fmr_attr fmr_attr = {
-		.max_pages	= RPCRDMA_MAX_FMR_SGES,
-		.max_maps	= 1,
-		.page_shift	= PAGE_SHIFT
-	};
-
-	mr->fmr.fm_physaddrs = kcalloc(RPCRDMA_MAX_FMR_SGES,
-				       sizeof(u64), GFP_KERNEL);
-	if (!mr->fmr.fm_physaddrs)
-		goto out_free;
-
-	mr->mr_sg = kcalloc(RPCRDMA_MAX_FMR_SGES,
-			    sizeof(*mr->mr_sg), GFP_KERNEL);
-	if (!mr->mr_sg)
-		goto out_free;
-
-	sg_init_table(mr->mr_sg, RPCRDMA_MAX_FMR_SGES);
-
-	mr->fmr.fm_mr = ib_alloc_fmr(ia->ri_pd, RPCRDMA_FMR_ACCESS_FLAGS,
-				     &fmr_attr);
-	if (IS_ERR(mr->fmr.fm_mr))
-		goto out_fmr_err;
-
-	INIT_LIST_HEAD(&mr->mr_list);
-	INIT_WORK(&mr->mr_recycle, fmr_mr_recycle_worker);
-	return 0;
-
-out_fmr_err:
-	dprintk("RPC:       %s: ib_alloc_fmr returned %ld\n", __func__,
-		PTR_ERR(mr->fmr.fm_mr));
-
-out_free:
-	kfree(mr->mr_sg);
-	kfree(mr->fmr.fm_physaddrs);
-	return -ENOMEM;
-}
-
-/* On success, sets:
- *	ep->rep_attr.cap.max_send_wr
- *	ep->rep_attr.cap.max_recv_wr
- *	cdata->max_requests
- *	ia->ri_max_segs
- */
-static int
-fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
-	    struct rpcrdma_create_data_internal *cdata)
-{
-	int max_qp_wr;
-
-	max_qp_wr = ia->ri_device->attrs.max_qp_wr;
-	max_qp_wr -= RPCRDMA_BACKWARD_WRS;
-	max_qp_wr -= 1;
-	if (max_qp_wr < RPCRDMA_MIN_SLOT_TABLE)
-		return -ENOMEM;
-	if (cdata->max_requests > max_qp_wr)
-		cdata->max_requests = max_qp_wr;
-	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
-	ep->rep_attr.cap.max_send_wr += RPCRDMA_BACKWARD_WRS;
-	ep->rep_attr.cap.max_send_wr += 1; /* for ib_drain_sq */
-	ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
-	ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
-	ep->rep_attr.cap.max_recv_wr += 1; /* for ib_drain_rq */
-
-	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
-				RPCRDMA_MAX_FMR_SGES);
-	ia->ri_max_segs += 2;	/* segments for head and tail buffers */
-	return 0;
-}
-
-/* FMR mode conveys up to 64 pages of payload per chunk segment.
- */
-static size_t
-fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
-{
-	return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
-		     RPCRDMA_MAX_HDR_SEGS * RPCRDMA_MAX_FMR_SGES);
-}
-
-/* Use the ib_map_phys_fmr() verb to register a memory region
- * for remote access via RDMA READ or RDMA WRITE.
- */
-static struct rpcrdma_mr_seg *
-fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
-	   int nsegs, bool writing, struct rpcrdma_mr **out)
-{
-	struct rpcrdma_mr_seg *seg1 = seg;
-	int len, pageoff, i, rc;
-	struct rpcrdma_mr *mr;
-	u64 *dma_pages;
-
-	mr = rpcrdma_mr_get(r_xprt);
-	if (!mr)
-		return ERR_PTR(-EAGAIN);
-
-	pageoff = offset_in_page(seg1->mr_offset);
-	seg1->mr_offset -= pageoff;	/* start of page */
-	seg1->mr_len += pageoff;
-	len = -pageoff;
-	if (nsegs > RPCRDMA_MAX_FMR_SGES)
-		nsegs = RPCRDMA_MAX_FMR_SGES;
-	for (i = 0; i < nsegs;) {
-		if (seg->mr_page)
-			sg_set_page(&mr->mr_sg[i],
-				    seg->mr_page,
-				    seg->mr_len,
-				    offset_in_page(seg->mr_offset));
-		else
-			sg_set_buf(&mr->mr_sg[i], seg->mr_offset,
-				   seg->mr_len);
-		len += seg->mr_len;
-		++seg;
-		++i;
-		/* Check for holes */
-		if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
-		    offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
-			break;
-	}
-	mr->mr_dir = rpcrdma_data_dir(writing);
-
-	mr->mr_nents = ib_dma_map_sg(r_xprt->rx_ia.ri_device,
-				     mr->mr_sg, i, mr->mr_dir);
-	if (!mr->mr_nents)
-		goto out_dmamap_err;
-	trace_xprtrdma_mr_map(mr);
-
-	for (i = 0, dma_pages = mr->fmr.fm_physaddrs; i < mr->mr_nents; i++)
-		dma_pages[i] = sg_dma_address(&mr->mr_sg[i]);
-	rc = ib_map_phys_fmr(mr->fmr.fm_mr, dma_pages, mr->mr_nents,
-			     dma_pages[0]);
-	if (rc)
-		goto out_maperr;
-
-	mr->mr_handle = mr->fmr.fm_mr->rkey;
-	mr->mr_length = len;
-	mr->mr_offset = dma_pages[0] + pageoff;
-
-	*out = mr;
-	return seg;
-
-out_dmamap_err:
-	pr_err("rpcrdma: failed to DMA map sg %p sg_nents %d\n",
-	       mr->mr_sg, i);
-	rpcrdma_mr_put(mr);
-	return ERR_PTR(-EIO);
-
-out_maperr:
-	pr_err("rpcrdma: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
-	       len, (unsigned long long)dma_pages[0],
-	       pageoff, mr->mr_nents, rc);
-	rpcrdma_mr_unmap_and_put(mr);
-	return ERR_PTR(-EIO);
-}
-
-/* Post Send WR containing the RPC Call message.
- */
-static int
-fmr_op_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
-{
-	return ib_post_send(ia->ri_id->qp, &req->rl_sendctx->sc_wr, NULL);
-}
-
-/* Invalidate all memory regions that were registered for "req".
- *
- * Sleeps until it is safe for the host CPU to access the
- * previously mapped memory regions.
- *
- * Caller ensures that @mrs is not empty before the call. This
- * function empties the list.
- */
-static void
-fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct list_head *mrs)
-{
-	struct rpcrdma_mr *mr;
-	LIST_HEAD(unmap_list);
-	int rc;
-
-	/* ORDER: Invalidate all of the req's MRs first
-	 *
-	 * ib_unmap_fmr() is slow, so use a single call instead
-	 * of one call per mapped FMR.
-	 */
-	list_for_each_entry(mr, mrs, mr_list) {
-		dprintk("RPC:       %s: unmapping fmr %p\n",
-			__func__, &mr->fmr);
-		trace_xprtrdma_mr_localinv(mr);
-		list_add_tail(&mr->fmr.fm_mr->list, &unmap_list);
-	}
-	r_xprt->rx_stats.local_inv_needed++;
-	rc = ib_unmap_fmr(&unmap_list);
-	if (rc)
-		goto out_release;
-
-	/* ORDER: Now DMA unmap all of the req's MRs, and return
-	 * them to the free MW list.
-	 */
-	while (!list_empty(mrs)) {
-		mr = rpcrdma_mr_pop(mrs);
-		list_del(&mr->fmr.fm_mr->list);
-		rpcrdma_mr_unmap_and_put(mr);
-	}
-
-	return;
-
-out_release:
-	pr_err("rpcrdma: ib_unmap_fmr failed (%i)\n", rc);
-
-	while (!list_empty(mrs)) {
-		mr = rpcrdma_mr_pop(mrs);
-		list_del(&mr->fmr.fm_mr->list);
-		rpcrdma_mr_recycle(mr);
-	}
-}
-
-const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
-	.ro_map				= fmr_op_map,
-	.ro_send			= fmr_op_send,
-	.ro_unmap_sync			= fmr_op_unmap_sync,
-	.ro_open			= fmr_op_open,
-	.ro_maxpages			= fmr_op_maxpages,
-	.ro_init_mr			= fmr_op_init_mr,
-	.ro_release_mr			= fmr_op_release_mr,
-	.ro_displayname			= "fmr",
-	.ro_send_w_inv_ok		= 0,
-};
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 3ddba94..a8d4def 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -434,12 +434,6 @@
 			break;
 		}
 		/*FALLTHROUGH*/
-	case RPCRDMA_MTHCAFMR:
-		if (fmr_is_supported(ia)) {
-			ia->ri_ops = &rpcrdma_fmr_memreg_ops;
-			break;
-		}
-		/*FALLTHROUGH*/
 	default:
 		pr_err("rpcrdma: Device %s does not support memreg mode %d\n",
 		       ia->ri_device->name, xprt_rdma_memreg_strategy);



* [PATCH v1 2/4] xprtrdma: mrs_create off-by-one
  2018-11-19 15:45 [PATCH v1 0/4] NFS/RDMA client for v4.21 (part 1) Chuck Lever
  2018-11-19 15:45 ` [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration Chuck Lever
@ 2018-11-19 15:45 ` Chuck Lever
  2018-11-19 15:46 ` [PATCH v1 3/4] xprtrdma: Reduce max_frwr_depth Chuck Lever
  2018-11-19 15:46 ` [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
  3 siblings, 0 replies; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 15:45 UTC (permalink / raw)
  To: linux-rdma; +Cc: linux-nfs

With certain combinations of krb5i/p and MR size, one extra
segment is needed per RPC.

Fixes: c421ece68f6 ("xprtrdma: Create more MRs at a time")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index fc6378cc..448e063 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -242,7 +242,7 @@
 
 	ia->ri_max_segs = max_t(unsigned int, 1, RPCRDMA_MAX_DATA_SEGS /
 				ia->ri_max_frwr_depth);
-	ia->ri_max_segs += 2;	/* segments for head and tail buffers */
+	ia->ri_max_segs += 3;	/* segments for head and tail buffers */
 	return 0;
 }
 



* [PATCH v1 3/4] xprtrdma: Reduce max_frwr_depth
  2018-11-19 15:45 [PATCH v1 0/4] NFS/RDMA client for v4.21 (part 1) Chuck Lever
  2018-11-19 15:45 ` [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration Chuck Lever
  2018-11-19 15:45 ` [PATCH v1 2/4] xprtrdma: mrs_create off-by-one Chuck Lever
@ 2018-11-19 15:46 ` Chuck Lever
  2018-11-19 15:46 ` [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
  3 siblings, 0 replies; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 15:46 UTC (permalink / raw)
  To: linux-rdma; +Cc: linux-nfs

Some devices advertise a large max_fast_reg_page_list_len
capability, but perform optimally when MRs are significantly smaller
than that depth -- probably when the MR itself is no larger than a
page.

By default, the RDMA R/W core API uses max_sge_rd as the maximum
page depth for MRs. For some devices, the value of max_sge_rd is
1, which is also not optimal. Thus, when max_sge_rd is larger than
1, use that value. Otherwise use the value of the
max_fast_reg_page_list_len attribute.
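
Expressed as a sketch (the hunk below is the authoritative version):

	unsigned int depth;

	depth = attrs->max_sge_rd > 1 ? attrs->max_sge_rd
				      : attrs->max_fast_reg_page_list_len;
	depth = min_t(unsigned int, depth, RPCRDMA_MAX_DATA_SEGS);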

I've tested this with a couple of devices, and it reproducibly
improves the throughput of large I/Os by several percent.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 448e063..49b314d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -191,10 +191,17 @@
 	if (attrs->device_cap_flags & IB_DEVICE_SG_GAPS_REG)
 		ia->ri_mrtype = IB_MR_TYPE_SG_GAPS;
 
-	ia->ri_max_frwr_depth =
-			min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
-			      attrs->max_fast_reg_page_list_len);
-	dprintk("RPC:       %s: device's max FR page list len = %u\n",
+	/* Quirk: Some devices advertise a large max_fast_reg_page_list_len
+	 * capability, but perform optimally when the MRs are not larger
+	 * than a page.
+	 */
+	if (attrs->max_sge_rd > 1)
+		ia->ri_max_frwr_depth = attrs->max_sge_rd;
+	else
+		ia->ri_max_frwr_depth = attrs->max_fast_reg_page_list_len;
+	if (ia->ri_max_frwr_depth > RPCRDMA_MAX_DATA_SEGS)
+		ia->ri_max_frwr_depth = RPCRDMA_MAX_DATA_SEGS;
+	dprintk("RPC:       %s: max FR page list depth = %u\n",
 		__func__, ia->ri_max_frwr_depth);
 
 	/* Add room for frwr register and invalidate WRs.



* [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 15:45 [PATCH v1 0/4] NFS/RDMA client for v4.21 (part 1) Chuck Lever
                   ` (2 preceding siblings ...)
  2018-11-19 15:46 ` [PATCH v1 3/4] xprtrdma: Reduce max_frwr_depth Chuck Lever
@ 2018-11-19 15:46 ` Chuck Lever
  2018-11-19 17:47   ` Olga Kornievskaia
  2018-11-20 18:02   ` Anna Schumaker
  3 siblings, 2 replies; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 15:46 UTC (permalink / raw)
  To: linux-rdma; +Cc: linux-nfs

Place the associated RPC transaction's XID in the upper 32 bits of
each RDMA segment's rdma_offset field. These bits are currently
always zero.

There are two reasons to do this:

- The R_key only has 8 bits that are different from registration to
  registration. The XID adds more uniqueness to each RDMA segment to
  reduce the likelihood of a software bug on the server reading from
  or writing into memory it's not supposed to.

- On-the-wire RDMA Read and Write operations do not otherwise carry
  any identifier that matches them up to an RPC. The XID in the
  upper 32 bits will act as an eye-catcher in network captures.
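
To illustrate the layout (a sketch, not the patch itself): an RPC
whose wire XID is 0xa347cfa2, registering memory whose iova fits in
32 bits, say 0x49973000, would advertise the segment offset

	(0xa347cfa2ULL << 32) | 0x49973000 == 0xa347cfa249973000

so the XID is plainly visible in the RETH Virtual Address of the
resulting RDMA Read and Write requests.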

Suggested-by: Tom Talpey <ttalpey@microsoft.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c  |    3 ++-
 net/sunrpc/xprtrdma/rpc_rdma.c  |    6 +++---
 net/sunrpc/xprtrdma/xprt_rdma.h |    2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 49b314d..3b260d2 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -344,7 +344,7 @@
  */
 static struct rpcrdma_mr_seg *
 frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
-	    int nsegs, bool writing, struct rpcrdma_mr **out)
+	    int nsegs, bool writing, u32 xid, struct rpcrdma_mr **out)
 {
 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
 	bool holes_ok = ia->ri_mrtype == IB_MR_TYPE_SG_GAPS;
@@ -398,6 +398,7 @@
 	if (unlikely(n != mr->mr_nents))
 		goto out_mapmr_err;
 
+	ibmr->iova |= ((u64)cpu_to_be32(xid)) << 32;
 	key = (u8)(ibmr->rkey & 0x000000FF);
 	ib_update_fast_reg_key(ibmr, ++key);
 
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 9f53e02..89a2db2 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -357,7 +357,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 
 	do {
 		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
-						   false, &mr);
+						   false, rqst->rq_xid, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
@@ -415,7 +415,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	nchunks = 0;
 	do {
 		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
-						   true, &mr);
+						   true, rqst->rq_xid, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
@@ -473,7 +473,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
 	nchunks = 0;
 	do {
 		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
-						   true, &mr);
+						   true, rqst->rq_xid, &mr);
 		if (IS_ERR(seg))
 			return PTR_ERR(seg);
 		rpcrdma_mr_push(mr, &req->rl_registered);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index a13ccb6..2ae1ee2 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -472,7 +472,7 @@ struct rpcrdma_memreg_ops {
 	struct rpcrdma_mr_seg *
 			(*ro_map)(struct rpcrdma_xprt *,
 				  struct rpcrdma_mr_seg *, int, bool,
-				  struct rpcrdma_mr **);
+				  u32, struct rpcrdma_mr **);
 	int		(*ro_send)(struct rpcrdma_ia *ia,
 				   struct rpcrdma_req *req);
 	void		(*ro_reminv)(struct rpcrdma_rep *rep,



* Re: [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 15:45 ` [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration Chuck Lever
@ 2018-11-19 16:16   ` Bart Van Assche
  2018-11-19 19:09     ` Leon Romanovsky
  2018-11-19 22:41     ` Jason Gunthorpe
  0 siblings, 2 replies; 31+ messages in thread
From: Bart Van Assche @ 2018-11-19 16:16 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma
  Cc: linux-nfs, oulijun, umalhi, zhongjiang, venkata.s.dhanalakota

On Mon, 2018-11-19 at 10:45 -0500, Chuck Lever wrote:
> FMR is not supported on most recent RDMA devices. It is also slower
> and less secure than FRWR. As discussed during the RDMA BoF at LPC
> 2018, it is time to remove support for FMR in the NFS/RDMA client
> stack. NFS/RDMA server-side uses either local memory registration or
> FRWR.  There is no change required there to deprecate FMR.
> 
> There are a few Infiniband/RoCE devices in the kernel tree that do
> not support MEM_MGT_EXTENSIONS, and therefore will no longer support
> client-side NFS/RDMA. These are:
> 
>  - mthca
>  - qib
>  - usnic
>  - hns (RoCE)

Can someone from Intel comment on how many of their customers rely on the qib
driver?

Can someone from Cisco comment on how many of their customers rely on the usnic
driver?

Can someone from Huawei comment on how many of their customers rely on the hns
driver?

I'm wondering whether I can remove FMR support from the SRP initiator driver.

Thanks,

Bart.


* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 15:46 ` [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
@ 2018-11-19 17:47   ` Olga Kornievskaia
  2018-11-19 17:58     ` Chuck Lever
  2018-11-20 18:02   ` Anna Schumaker
  1 sibling, 1 reply; 31+ messages in thread
From: Olga Kornievskaia @ 2018-11-19 17:47 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> Place the associated RPC transaction's XID in the upper 32 bits of
> each RDMA segment's rdma_offset field. These bits are currently
> always zero.
>
> There are two reasons to do this:
>
> - The R_key only has 8 bits that are different from registration to
>   registration. The XID adds more uniqueness to each RDMA segment to
>   reduce the likelihood of a software bug on the server reading from
>   or writing into memory it's not supposed to.
>
> - On-the-wire RDMA Read and Write operations do not otherwise carry
>   any identifier that matches them up to an RPC. The XID in the
>   upper 32 bits will act as an eye-catcher in network captures.

Is this just an "eye-catcher" or do you have plans to use it in
wireshark? If the latter, then can we really do that? While a Linux
implementation may do this, other (or even future Linux)
implementations might not. Can we justify changing the wireshark
logic for it?

> Suggested-by: Tom Talpey <ttalpey@microsoft.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/frwr_ops.c  |    3 ++-
>  net/sunrpc/xprtrdma/rpc_rdma.c  |    6 +++---
>  net/sunrpc/xprtrdma/xprt_rdma.h |    2 +-
>  3 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 49b314d..3b260d2 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -344,7 +344,7 @@
>   */
>  static struct rpcrdma_mr_seg *
>  frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> -           int nsegs, bool writing, struct rpcrdma_mr **out)
> +           int nsegs, bool writing, u32 xid, struct rpcrdma_mr **out)
>  {
>         struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>         bool holes_ok = ia->ri_mrtype == IB_MR_TYPE_SG_GAPS;
> @@ -398,6 +398,7 @@
>         if (unlikely(n != mr->mr_nents))
>                 goto out_mapmr_err;
>
> +       ibmr->iova |= ((u64)cpu_to_be32(xid)) << 32;
>         key = (u8)(ibmr->rkey & 0x000000FF);
>         ib_update_fast_reg_key(ibmr, ++key);
>
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index 9f53e02..89a2db2 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -357,7 +357,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>
>         do {
>                 seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
> -                                                  false, &mr);
> +                                                  false, rqst->rq_xid, &mr);
>                 if (IS_ERR(seg))
>                         return PTR_ERR(seg);
>                 rpcrdma_mr_push(mr, &req->rl_registered);
> @@ -415,7 +415,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>         nchunks = 0;
>         do {
>                 seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
> -                                                  true, &mr);
> +                                                  true, rqst->rq_xid, &mr);
>                 if (IS_ERR(seg))
>                         return PTR_ERR(seg);
>                 rpcrdma_mr_push(mr, &req->rl_registered);
> @@ -473,7 +473,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
>         nchunks = 0;
>         do {
>                 seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
> -                                                  true, &mr);
> +                                                  true, rqst->rq_xid, &mr);
>                 if (IS_ERR(seg))
>                         return PTR_ERR(seg);
>                 rpcrdma_mr_push(mr, &req->rl_registered);
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index a13ccb6..2ae1ee2 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -472,7 +472,7 @@ struct rpcrdma_memreg_ops {
>         struct rpcrdma_mr_seg *
>                         (*ro_map)(struct rpcrdma_xprt *,
>                                   struct rpcrdma_mr_seg *, int, bool,
> -                                 struct rpcrdma_mr **);
> +                                 u32, struct rpcrdma_mr **);
>         int             (*ro_send)(struct rpcrdma_ia *ia,
>                                    struct rpcrdma_req *req);
>         void            (*ro_reminv)(struct rpcrdma_rep *rep,
>


* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 17:47   ` Olga Kornievskaia
@ 2018-11-19 17:58     ` Chuck Lever
  2018-11-19 18:08       ` Olga Kornievskaia
  0 siblings, 1 reply; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 17:58 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-rdma, Linux NFS Mailing List



> On Nov 19, 2018, at 12:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> Place the associated RPC transaction's XID in the upper 32 bits of
>> each RDMA segment's rdma_offset field. These bits are currently
>> always zero.
>> 
>> There are two reasons to do this:
>> 
>> - The R_key only has 8 bits that are different from registration to
>>  registration. The XID adds more uniqueness to each RDMA segment to
>>  reduce the likelihood of a software bug on the server reading from
>>  or writing into memory it's not supposed to.
>> 
>> - On-the-wire RDMA Read and Write operations do not otherwise carry
>>  any identifier that matches them up to an RPC. The XID in the
>>  upper 32 bits will act as an eye-catcher in network captures.
> 
> Is this just an "eye-catcher" or do you have plans to use it in
> wireshark? If the latter, then can we really do that? while a linux
> implementation may do that, other (or even possibly future linux)
> implementation might not do this. Can we justify changing the
> wireshark logic for it?

No plans to change the wireshark RPC-over-RDMA dissector.
That would only be a valid thing to do if adding the XID
were made part of the RPC-over-RDMA protocol via an RFC.


--
Chuck Lever





* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 17:58     ` Chuck Lever
@ 2018-11-19 18:08       ` Olga Kornievskaia
  2018-11-19 18:18         ` Chuck Lever
  0 siblings, 1 reply; 31+ messages in thread
From: Olga Kornievskaia @ 2018-11-19 18:08 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

On Mon, Nov 19, 2018 at 12:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Nov 19, 2018, at 12:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> Place the associated RPC transaction's XID in the upper 32 bits of
> >> each RDMA segment's rdma_offset field. These bits are currently
> >> always zero.
> >>
> >> There are two reasons to do this:
> >>
> >> - The R_key only has 8 bits that are different from registration to
> >>  registration. The XID adds more uniqueness to each RDMA segment to
> >>  reduce the likelihood of a software bug on the server reading from
> >>  or writing into memory it's not supposed to.
> >>
> >> - On-the-wire RDMA Read and Write operations do not otherwise carry
> >>  any identifier that matches them up to an RPC. The XID in the
> >>  upper 32 bits will act as an eye-catcher in network captures.
> >
> > Is this just an "eye-catcher" or do you have plans to use it in
> > wireshark? If the latter, then can we really do that? while a linux
> > implementation may do that, other (or even possibly future linux)
> > implementation might not do this. Can we justify changing the
> > wireshark logic for it?
>
> No plans to change the wireshark RPC-over-RDMA dissector.
> That would only be a valid thing to do if adding the XID
> were made part of the RPC-over-RDMA protocol via an RFC.

Agreed. Can you also help me understand the proposal (as I'm still
trying to figure out why it is useful)?

You are proposing to modify the RDMA segment's RDMA offset field (I
see the top 6 bits are indeed always 0). I don't see how adding that
helps match an RDMA read/write message, which does not have an
"offset" field in it, to a particular RPC. I don't believe we have
(had) any issues matching the initial RC Send Only that contains the
RDMA_MSG to the RPC.

>
>
> --
> Chuck Lever
>
>
>


* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 18:08       ` Olga Kornievskaia
@ 2018-11-19 18:18         ` Chuck Lever
  2018-11-19 18:47           ` Olga Kornievskaia
  0 siblings, 1 reply; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 18:18 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-rdma, Linux NFS Mailing List



> On Nov 19, 2018, at 1:08 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> On Mon, Nov 19, 2018 at 12:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Nov 19, 2018, at 12:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> 
>>> On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> Place the associated RPC transaction's XID in the upper 32 bits of
>>>> each RDMA segment's rdma_offset field. These bits are currently
>>>> always zero.
>>>> 
>>>> There are two reasons to do this:
>>>> 
>>>> - The R_key only has 8 bits that are different from registration to
>>>> registration. The XID adds more uniqueness to each RDMA segment to
>>>> reduce the likelihood of a software bug on the server reading from
>>>> or writing into memory it's not supposed to.
>>>> 
>>>> - On-the-wire RDMA Read and Write operations do not otherwise carry
>>>> any identifier that matches them up to an RPC. The XID in the
>>>> upper 32 bits will act as an eye-catcher in network captures.
>>> 
>>> Is this just an "eye-catcher" or do you have plans to use it in
>>> wireshark? If the latter, then can we really do that? while a linux
>>> implementation may do that, other (or even possibly future linux)
>>> implementation might not do this. Can we justify changing the
>>> wireshark logic for it?
>> 
>> No plans to change the wireshark RPC-over-RDMA dissector.
>> That would only be a valid thing to do if adding the XID
>> were made part of the RPC-over-RDMA protocol via an RFC.
> 
> Agreed. Can you also help me understand the proposal (as I'm still
> trying to figure why it is useful).
> 
> You are proposing to modify the RDMA segments's RDMA offset field (I
> see top 6bits are indeed always 0). I don't see how adding that helps
> an RDMA read/write message which does not have an "offset" field in it
> be matched to a particular RPC. I don't believe we have (had) any
> issues matching the initial RC Send only that contains the RDMA_MSG to
> the RPC.

The ULP has access to only the low order 8 bits of the R_key. The
upper 24 bits are fixed for each MR. So for any given MR, there are
only 256 unique R_key values. That means the same R_key will appear
again quickly on the wire.

The 64-bit offset field is set by the ULP, and can be essentially
any arbitrary value. Most kernel ULPs use the iova of the registered
memory. We only need the lower 32 bits for that.

The purpose of adding junk to the offset is to make the offset
unique to that RPC transaction, just like the R_key is. This helps
make the RDMA segment co-ordinates (handle, length, offset) more
unique and thus harder to spoof.

We could use random numbers in that upper 32 bits, but we have
something more handy: the RPC's XID.

Now when you look at an RDMA Read or Write, the top 32 bits in each
RDMA segment's offset match the XID of the RPC transaction that the
RDMA operations go with. This is really a secondary benefit to the
uniquifying effect above.
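
To make the composition concrete, a minimal sketch (the variable
names are illustrative, not lifted from the patch):

	u32 xid;	/* wire-order XID of the RPC Call */
	u64 iova;	/* iova of the registered memory */
	u64 offset = ((u64)xid << 32) | (iova & 0xffffffff);

The low 32 bits still carry the iova; the high 32 bits now vary with
every RPC, much like the low 8 bits of the R_key do.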


--
Chuck Lever





* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 18:18         ` Chuck Lever
@ 2018-11-19 18:47           ` Olga Kornievskaia
  2018-11-19 18:58             ` Chuck Lever
  0 siblings, 1 reply; 31+ messages in thread
From: Olga Kornievskaia @ 2018-11-19 18:47 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

On Mon, Nov 19, 2018 at 1:19 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Nov 19, 2018, at 1:08 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Mon, Nov 19, 2018 at 12:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Nov 19, 2018, at 12:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >>>
> >>> On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>>>
> >>>> Place the associated RPC transaction's XID in the upper 32 bits of
> >>>> each RDMA segment's rdma_offset field. These bits are currently
> >>>> always zero.
> >>>>
> >>>> There are two reasons to do this:
> >>>>
> >>>> - The R_key only has 8 bits that are different from registration to
> >>>> registration. The XID adds more uniqueness to each RDMA segment to
> >>>> reduce the likelihood of a software bug on the server reading from
> >>>> or writing into memory it's not supposed to.
> >>>>
> >>>> - On-the-wire RDMA Read and Write operations do not otherwise carry
> >>>> any identifier that matches them up to an RPC. The XID in the
> >>>> upper 32 bits will act as an eye-catcher in network captures.
> >>>
> >>> Is this just an "eye-catcher" or do you have plans to use it in
> >>> wireshark? If the latter, then can we really do that? while a linux
> >>> implementation may do that, other (or even possibly future linux)
> >>> implementation might not do this. Can we justify changing the
> >>> wireshark logic for it?
> >>
> >> No plans to change the wireshark RPC-over-RDMA dissector.
> >> That would only be a valid thing to do if adding the XID
> >> were made part of the RPC-over-RDMA protocol via an RFC.
> >
> > Agreed. Can you also help me understand the proposal (as I'm still
> > trying to figure why it is useful).
> >
> > You are proposing to modify the RDMA segments's RDMA offset field (I
> > see top 6bits are indeed always 0). I don't see how adding that helps
> > an RDMA read/write message which does not have an "offset" field in it
> > be matched to a particular RPC. I don't believe we have (had) any
> > issues matching the initial RC Send only that contains the RDMA_MSG to
> > the RPC.
>
> The ULP has access to only the low order 8 bits of the R_key. The
> upper 24 bits are fixed for each MR. So for any given MR, there are
> only 256 unique R_key values. That means the same R_key will appear
> again quickly on the wire.
>
> The 64-bit offset field is set by the ULP, and can be essentially
> any arbitrary value. Most kernel ULPs use the iova of the registered
> memory. We only need the lower 32 bits for that.
>
> The purpose of adding junk to the offset is to make the offset
> unique to that RPC transaction, just like the R_key is. This helps
> make the RDMA segment co-ordinates (handle, length, offset) more
> unique and thus harder to spoof.

Thank you for the explanation that makes sense.

> We could use random numbers in that upper 32 bits, but we have
> something more handy: the RPC's XID.
>
> Now when you look at an RDMA Read or Write, the top 32 bits in each
> RDMA segment's offset match the XID of the RPC transaction that the
> RDMA operations go with. This is really a secondary benefit to the
> uniquifying effect above.

I find the wording "on-the-wire RDMA Read or Write" misleading. Did
you really mean "RDMA Read or Write", or do you mean "RDMA_MSG", or
do you mean "NFS RDMA read or write"? Because the RDMA offset is not
part of the RDMA read/write (first/middle/last) packet. That's what
I'm hung up on.


>
>
> --
> Chuck Lever
>
>
>


* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 18:47           ` Olga Kornievskaia
@ 2018-11-19 18:58             ` Chuck Lever
  2018-11-19 21:22               ` Olga Kornievskaia
  0 siblings, 1 reply; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 18:58 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-rdma, Linux NFS Mailing List



> On Nov 19, 2018, at 1:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> On Mon, Nov 19, 2018 at 1:19 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Nov 19, 2018, at 1:08 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> 
>>> On Mon, Nov 19, 2018 at 12:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Nov 19, 2018, at 12:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>> 
>>>>> On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> Place the associated RPC transaction's XID in the upper 32 bits of
>>>>>> each RDMA segment's rdma_offset field. These bits are currently
>>>>>> always zero.
>>>>>> 
>>>>>> There are two reasons to do this:
>>>>>> 
>>>>>> - The R_key only has 8 bits that are different from registration to
>>>>>> registration. The XID adds more uniqueness to each RDMA segment to
>>>>>> reduce the likelihood of a software bug on the server reading from
>>>>>> or writing into memory it's not supposed to.
>>>>>> 
>>>>>> - On-the-wire RDMA Read and Write operations do not otherwise carry
>>>>>> any identifier that matches them up to an RPC. The XID in the
>>>>>> upper 32 bits will act as an eye-catcher in network captures.
>>>>> 
>>>>> Is this just an "eye-catcher" or do you have plans to use it in
>>>>> wireshark? If the latter, then can we really do that? while a linux
>>>>> implementation may do that, other (or even possibly future linux)
>>>>> implementation might not do this. Can we justify changing the
>>>>> wireshark logic for it?
>>>> 
>>>> No plans to change the wireshark RPC-over-RDMA dissector.
>>>> That would only be a valid thing to do if adding the XID
>>>> were made part of the RPC-over-RDMA protocol via an RFC.
>>> 
>>> Agreed. Can you also help me understand the proposal (as I'm still
>>> trying to figure why it is useful).
>>> 
>>> You are proposing to modify the RDMA segments's RDMA offset field (I
>>> see top 6bits are indeed always 0). I don't see how adding that helps
>>> an RDMA read/write message which does not have an "offset" field in it
>>> be matched to a particular RPC. I don't believe we have (had) any
>>> issues matching the initial RC Send only that contains the RDMA_MSG to
>>> the RPC.
>> 
>> The ULP has access to only the low order 8 bits of the R_key. The
>> upper 24 bits are fixed for each MR. So for any given MR, there are
>> only 256 unique R_key values. That means the same R_key will appear
>> again quickly on the wire.
>> 
>> The 64-bit offset field is set by the ULP, and can be essentially
>> any arbitrary value. Most kernel ULPs use the iova of the registered
>> memory. We only need the lower 32 bits for that.
>> 
>> The purpose of adding junk to the offset is to make the offset
>> unique to that RPC transaction, just like the R_key is. This helps
>> make the RDMA segment co-ordinates (handle, length, offset) more
>> unique and thus harder to spoof.
> 
> Thank you for the explanation that makes sense.
> 
>> We could use random numbers in that upper 32 bits, but we have
>> something more handy: the RPC's XID.
>> 
>> Now when you look at an RDMA Read or Write, the top 32 bits in each
>> RDMA segment's offset match the XID of the RPC transaction that the
>> RDMA operations go with. This is really a secondary benefit to the
>> uniquifying effect above.
> 
> I find the wording "no the wire RDMA read or write" misleading. Did
> you really mean it as "RDMA read or write" or do you mean "RDMA_MSG"
> or do you mean "NFS RDMA read or write"? Because RDMA offset is not a
> part of the RDMA read/write (first/middle/last) packet. That's what
> I'm hanged up on.

Here's an RDMA Read request in a network capture I had at hand:

No.     Time               Source                Destination           Protocol Length Info
    228 22:31:06.203637    LID: 5                LID: 11               InfiniBand 42     RC RDMA Read Request QP=0x000240 

Frame 228: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 1
Extensible Record Format
InfiniBand
    Local Route Header
    Base Transport Header
    RETH - RDMA Extended Transport Header
        Virtual Address: 11104011393315758080   <<<<<<
        Remote Key: 1879114618
        DMA Length: 4015
    Invariant CRC: 0xd492a3e1
    Variant CRC: 0x8736

The value of the Virtual Address field is what the RPC-over-RDMA
protocol calls the Offset. The Read responses are matched to this
request by their message sequence numbers, and this Read request is
matched to the RPC Call by the XID in the top 32 bits of the
Virtual Address.

Likewise for an RDMA Write Only request:

    188 22:31:06.201350    LID: 5                LID: 11               InfiniBand 162    RC RDMA Write Only QP=0x000240 

Frame 188: 162 bytes on wire (1296 bits), 162 bytes captured (1296 bits) on interface 1
Extensible Record Format
InfiniBand
    Local Route Header
    Base Transport Header
    RETH - RDMA Extended Transport Header
        Virtual Address: 10455493047213809920   <<<<<<
        Remote Key: 1879115386
        DMA Length: 120
    Invariant CRC: 0xe2e1b2cd
    Variant CRC: 0x676f
Data (120 bytes)

0000  91 19 5f 87 00 00 00 01 00 00 00 00 00 00 00 00   .._.............
0010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01   ................
0020  00 00 00 02 00 00 01 ed 00 00 00 02 00 00 04 16   ................
0030  00 00 00 64 00 00 00 00 00 00 00 28 00 00 00 00   ...d.......(....
0040  00 00 00 00 00 00 00 00 00 00 00 00 54 9e e5 9d   ............T...
0050  c6 d3 1a d2 00 00 00 00 00 00 63 1f 5b f2 2e 7a   ..........c.[..z
0060  0b ae f9 ec 5b f2 2e 7a 0b 53 6c ef 5b f2 2e 7a   ....[..z.Sl.[..z
0070  0b 53 6c ef 00 00 00 0c                           .Sl.....


I believe RDMA Write First also has an RETH. The sender does not
interleave RDMA Writes, so subsequent Middle and Last packets go
with this RDMA Write First.
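
For anyone checking a capture by hand, recovering the XID from the
decoded Virtual Address is just a shift (a sketch; "va" stands for
the 64-bit value shown above):

	u64 va;				/* RETH Virtual Address */
	u32 xid = (u32)(va >> 32);	/* upper 32 bits carry the XID */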


--
Chuck Lever





* Re: [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 16:16   ` Bart Van Assche
@ 2018-11-19 19:09     ` Leon Romanovsky
  2018-11-19 20:52       ` Bart Van Assche
  2018-11-19 22:41     ` Jason Gunthorpe
  1 sibling, 1 reply; 31+ messages in thread
From: Leon Romanovsky @ 2018-11-19 19:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Chuck Lever, linux-rdma, linux-nfs, oulijun, umalhi, zhongjiang,
	venkata.s.dhanalakota


On Mon, Nov 19, 2018 at 08:16:37AM -0800, Bart Van Assche wrote:
> On Mon, 2018-11-19 at 10:45 -0500, Chuck Lever wrote:
> > FMR is not supported on most recent RDMA devices. It is also slower
> > and less secure than FRWR. As discussed during the RDMA BoF at LPC
> > 2018, it is time to remove support for FMR in the NFS/RDMA client
> > stack. NFS/RDMA server-side uses either local memory registration or
> > FRWR.  There is no change required there to deprecate FMR.
> >
> > There are a few Infiniband/RoCE devices in the kernel tree that do
> > not support MEM_MGT_EXTENSIONS, and therefore will no longer support
> > client-side NFS/RDMA. These are:
> >
> >  - mthca
> >  - qib
> >  - usnic
> >  - hns (RoCE)
>
> Can someone from Intel comment on how many of their customers rely on the qib
> driver?
>
> Can someone from Cisco comment on how many of their customers rely on the usnic
> driver?
>
> Can someone from Huawei comment on how many of their customers rely on the hns
> driver?
>
> I'm wondering whether I can remove FMR support from the SRP initiator driver.

Bart,

The more accurate question is "how many customers ... use latest,
upstream kernel?"

Thanks

>
> Thanks,
>
> Bart.



* Re: [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 19:09     ` Leon Romanovsky
@ 2018-11-19 20:52       ` Bart Van Assche
  2018-11-20  5:37         ` Leon Romanovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2018-11-19 20:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Chuck Lever, linux-rdma, linux-nfs, oulijun, zhongjiang,
	venkata.s.dhanalakota

On Mon, 2018-11-19 at 21:09 +0200, Leon Romanovsky wrote:
> The more accurate question is "how many customers ... use latest,
> upstream kernel?"

Hi Leon,

If this is not clear: I am in favor of removing kernel drivers that
are no longer in use. Recently I supported a proposal to remove the
SCSI OSD driver from the kernel tree because that driver probably has
not been used by anyone in the past five years.

But I think that kernel drivers that are still in use should stay in the
kernel tree, no matter which version is in use. Since the Linux kernel ABI
is backwards compatible, with every Linux distro it is possible to replace
the Linux kernel that was provided by the Linux distributor with a more
recent upstream kernel. I think it would be a very unpleasant surprise for
people who use one of the RDMA adapters that only support FMR if they would
test a more recent kernel and if they would notice that its driver has been
removed.

Bart.


* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 18:58             ` Chuck Lever
@ 2018-11-19 21:22               ` Olga Kornievskaia
  2018-11-19 21:32                 ` Chuck Lever
  0 siblings, 1 reply; 31+ messages in thread
From: Olga Kornievskaia @ 2018-11-19 21:22 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

On Mon, Nov 19, 2018 at 1:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Nov 19, 2018, at 1:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Mon, Nov 19, 2018 at 1:19 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Nov 19, 2018, at 1:08 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >>>
> >>> On Mon, Nov 19, 2018 at 12:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Nov 19, 2018, at 12:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >>>>>
> >>>>> On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>>>>>
> >>>>>> Place the associated RPC transaction's XID in the upper 32 bits of
> >>>>>> each RDMA segment's rdma_offset field. These bits are currently
> >>>>>> always zero.
> >>>>>>
> >>>>>> There are two reasons to do this:
> >>>>>>
> >>>>>> - The R_key only has 8 bits that are different from registration to
> >>>>>> registration. The XID adds more uniqueness to each RDMA segment to
> >>>>>> reduce the likelihood of a software bug on the server reading from
> >>>>>> or writing into memory it's not supposed to.
> >>>>>>
> >>>>>> - On-the-wire RDMA Read and Write operations do not otherwise carry
> >>>>>> any identifier that matches them up to an RPC. The XID in the
> >>>>>> upper 32 bits will act as an eye-catcher in network captures.
> >>>>>
> >>>>> Is this just an "eye-catcher" or do you have plans to use it in
> >>>>> wireshark? If the latter, then can we really do that? while a linux
> >>>>> implementation may do that, other (or even possibly future linux)
> >>>>> implementation might not do this. Can we justify changing the
> >>>>> wireshark logic for it?
> >>>>
> >>>> No plans to change the wireshark RPC-over-RDMA dissector.
> >>>> That would only be a valid thing to do if adding the XID
> >>>> were made part of the RPC-over-RDMA protocol via an RFC.
> >>>
> >>> Agreed. Can you also help me understand the proposal (as I'm still
> >>> trying to figure why it is useful).
> >>>
> >>> You are proposing to modify the RDMA segments's RDMA offset field (I
> >>> see top 6bits are indeed always 0). I don't see how adding that helps
> >>> an RDMA read/write message which does not have an "offset" field in it
> >>> be matched to a particular RPC. I don't believe we have (had) any
> >>> issues matching the initial RC Send only that contains the RDMA_MSG to
> >>> the RPC.
> >>
> >> The ULP has access to only the low order 8 bits of the R_key. The
> >> upper 24 bits are fixed for each MR. So for any given MR, there are
> >> only 256 unique R_key values. That means the same R_key will appear
> >> again quickly on the wire.
> >>
> >> The 64-bit offset field is set by the ULP, and can be essentially
> >> any arbitrary value. Most kernel ULPs use the iova of the registered
> >> memory. We only need the lower 32 bits for that.
> >>
> >> The purpose of adding junk to the offset is to make the offset
> >> unique to that RPC transaction, just like the R_key is. This helps
> >> make the RDMA segment co-ordinates (handle, length, offset) more
> >> unique and thus harder to spoof.
> >
> > Thank you for the explanation that makes sense.
> >
> >> We could use random numbers in that upper 32 bits, but we have
> >> something more handy: the RPC's XID.
> >>
> >> Now when you look at an RDMA Read or Write, the top 32 bits in each
> >> RDMA segment's offset match the XID of the RPC transaction that the
> >> RDMA operations go with. This is really a secondary benefit to the
> >> uniquifying effect above.
> >
> > I find the wording "no the wire RDMA read or write" misleading. Did
> > you really mean it as "RDMA read or write" or do you mean "RDMA_MSG"
> > or do you mean "NFS RDMA read or write"? Because RDMA offset is not a
> > part of the RDMA read/write (first/middle/last) packet. That's what
> > I'm hanged up on.
>
> Here's an RDMA Read request in a network capture I had at hand:
>
> No.     Time               Source                Destination           Protocol Length Info
>     228 22:31:06.203637    LID: 5                LID: 11               InfiniBand 42     RC RDMA Read Request QP=0x000240
>
> Frame 228: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 1
> Extensible Record Format
> InfiniBand
>     Local Route Header
>     Base Transport Header
>     RETH - RDMA Extended Transport Header
>         Virtual Address: 11104011393315758080   <<<<<<
>         Remote Key: 1879114618
>         DMA Length: 4015
>     Invariant CRC: 0xd492a3e1
>     Variant CRC: 0x8736
>
> The value of the Virtual Address field is what the RPC-over-RDMA
> protocol calls the Offset. The Read responses are matched to this
> request by their message sequence numbers, and this Read request is
> matched to the RPC Call by the XID in the top 32 bits of the
> Virtual Address.
>
> Likewise for an RDMA Write Only request:
>
>     188 22:31:06.201350    LID: 5                LID: 11               InfiniBand 162    RC RDMA Write Only QP=0x000240
>
> Frame 188: 162 bytes on wire (1296 bits), 162 bytes captured (1296 bits) on interface 1
> Extensible Record Format
> InfiniBand
>     Local Route Header
>     Base Transport Header
>     RETH - RDMA Extended Transport Header
>         Virtual Address: 10455493047213809920   <<<<<<
>         Remote Key: 1879115386
>         DMA Length: 120
>     Invariant CRC: 0xe2e1b2cd
>     Variant CRC: 0x676f
> Data (120 bytes)
>
> 0000  91 19 5f 87 00 00 00 01 00 00 00 00 00 00 00 00   .._.............
> 0010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01   ................
> 0020  00 00 00 02 00 00 01 ed 00 00 00 02 00 00 04 16   ................
> 0030  00 00 00 64 00 00 00 00 00 00 00 28 00 00 00 00   ...d.......(....
> 0040  00 00 00 00 00 00 00 00 00 00 00 00 54 9e e5 9d   ............T...
> 0050  c6 d3 1a d2 00 00 00 00 00 00 63 1f 5b f2 2e 7a   ..........c.[..z
> 0060  0b ae f9 ec 5b f2 2e 7a 0b 53 6c ef 5b f2 2e 7a   ....[..z.Sl.[..z
> 0070  0b 53 6c ef 00 00 00 0c                           .Sl.....
>
>
> I believe RDMA Write First also has an RETH. The sender does not
> interleave RDMA Writes, so subsequent Middle and Last packets go
> with this RDMA Write First.

OK, I see now where I was confused: in RDMA_MSG, wireshark labels it
"RDMA offset", while in the RDMA Write First message it's labeled
"Virtual Address". Thank you for the explanation.

Here's the next question (coming from Jorge): is it reasonable to
assume that the top 32 bits are always zero? I have a network trace
(from a 4.18-rc2 kernel) where they are not.

RPC over RDMA
    XID: 0xa347cfa2
    Version: 1
    Flow Control: 128
    Message Type: RDMA_MSG (0)
    Read list (count: 0)
    Write list (count: 1)
        Write chunk (1 segment)
            Write chunk segment count: 1
            RDMA segment 0
                RDMA handle: 0x4000076f
                RDMA length: 65676
                RDMA offset: 0x0000001049973000

I don't see how 0xa347cfa2 can fit in there.
>
>
> --
> Chuck Lever
>
>
>


* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 21:22               ` Olga Kornievskaia
@ 2018-11-19 21:32                 ` Chuck Lever
  2018-11-19 21:42                   ` Mora, Jorge
  0 siblings, 1 reply; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 21:32 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-rdma, Linux NFS Mailing List



> On Nov 19, 2018, at 4:22 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> On Mon, Nov 19, 2018 at 1:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Nov 19, 2018, at 1:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> 
>>> On Mon, Nov 19, 2018 at 1:19 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Nov 19, 2018, at 1:08 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>> 
>>>>> On Mon, Nov 19, 2018 at 12:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Nov 19, 2018, at 12:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>> 
>>>>>>> On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>>>> 
>>>>>>>> Place the associated RPC transaction's XID in the upper 32 bits of
>>>>>>>> each RDMA segment's rdma_offset field. These bits are currently
>>>>>>>> always zero.
>>>>>>>> 
>>>>>>>> There are two reasons to do this:
>>>>>>>> 
>>>>>>>> - The R_key only has 8 bits that are different from registration to
>>>>>>>> registration. The XID adds more uniqueness to each RDMA segment to
>>>>>>>> reduce the likelihood of a software bug on the server reading from
>>>>>>>> or writing into memory it's not supposed to.
>>>>>>>> 
>>>>>>>> - On-the-wire RDMA Read and Write operations do not otherwise carry
>>>>>>>> any identifier that matches them up to an RPC. The XID in the
>>>>>>>> upper 32 bits will act as an eye-catcher in network captures.
>>>>>>> 
>>>>>>> Is this just an "eye-catcher" or do you have plans to use it in
>>>>>>> wireshark? If the latter, then can we really do that? while a linux
>>>>>>> implementation may do that, other (or even possibly future linux)
>>>>>>> implementation might not do this. Can we justify changing the
>>>>>>> wireshark logic for it?
>>>>>> 
>>>>>> No plans to change the wireshark RPC-over-RDMA dissector.
>>>>>> That would only be a valid thing to do if adding the XID
>>>>>> were made part of the RPC-over-RDMA protocol via an RFC.
>>>>> 
>>>>> Agreed. Can you also help me understand the proposal (as I'm still
>>>>> trying to figure out why it is useful).
>>>>>
>>>>> You are proposing to modify the RDMA segment's RDMA offset field (I
>>>>> see the top 6 bits are indeed always 0). I don't see how adding that helps
>>>>> an RDMA read/write message which does not have an "offset" field in it
>>>>> be matched to a particular RPC. I don't believe we have (had) any
>>>>> issues matching the initial RC Send only that contains the RDMA_MSG to
>>>>> the RPC.
>>>> 
>>>> The ULP has access to only the low order 8 bits of the R_key. The
>>>> upper 24 bits are fixed for each MR. So for any given MR, there are
>>>> only 256 unique R_key values. That means the same R_key will appear
>>>> again quickly on the wire.
>>>> 
>>>> The 64-bit offset field is set by the ULP, and can be essentially
>>>> any arbitrary value. Most kernel ULPs use the iova of the registered
>>>> memory. We only need the lower 32 bits for that.
>>>> 
>>>> The purpose of adding junk to the offset is to make the offset
>>>> unique to that RPC transaction, just like the R_key is. This helps
>>>> make the RDMA segment co-ordinates (handle, length, offset) more
>>>> unique and thus harder to spoof.
>>> 
>>> Thank you for the explanation that makes sense.
>>> 
>>>> We could use random numbers in that upper 32 bits, but we have
>>>> something more handy: the RPC's XID.
>>>> 
>>>> Now when you look at an RDMA Read or Write, the top 32 bits in each
>>>> RDMA segment's offset match the XID of the RPC transaction that the
>>>> RDMA operations go with. This is really a secondary benefit to the
>>>> uniquifying effect above.
>>> 
>>> I find the wording "on-the-wire RDMA read or write" misleading. Did
>>> you really mean it as "RDMA read or write" or do you mean "RDMA_MSG"
>>> or do you mean "NFS RDMA read or write"? Because RDMA offset is not a
>>> part of the RDMA read/write (first/middle/last) packet. That's what
>>> I'm hung up on.
>> 
>> Here's an RDMA Read request in a network capture I had at hand:
>> 
>> No.     Time               Source                Destination           Protocol Length Info
>>    228 22:31:06.203637    LID: 5                LID: 11               InfiniBand 42     RC RDMA Read Request QP=0x000240
>> 
>> Frame 228: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 1
>> Extensible Record Format
>> InfiniBand
>>    Local Route Header
>>    Base Transport Header
>>    RETH - RDMA Extended Transport Header
>>        Virtual Address: 11104011393315758080   <<<<<<
>>        Remote Key: 1879114618
>>        DMA Length: 4015
>>    Invariant CRC: 0xd492a3e1
>>    Variant CRC: 0x8736
>> 
>> The value of the Virtual Address field is what the RPC-over-RDMA
>> protocol calls the Offset. The Read responses are matched to this
>> request by their message sequence numbers, and this Read request is
>> matched to the RPC Call by the XID in the top 32 bits of the
>> Virtual Address.
>> 
>> Likewise for an RDMA Write Only request:
>> 
>>    188 22:31:06.201350    LID: 5                LID: 11               InfiniBand 162    RC RDMA Write Only QP=0x000240
>> 
>> Frame 188: 162 bytes on wire (1296 bits), 162 bytes captured (1296 bits) on interface 1
>> Extensible Record Format
>> InfiniBand
>>    Local Route Header
>>    Base Transport Header
>>    RETH - RDMA Extended Transport Header
>>        Virtual Address: 10455493047213809920   <<<<<<
>>        Remote Key: 1879115386
>>        DMA Length: 120
>>    Invariant CRC: 0xe2e1b2cd
>>    Variant CRC: 0x676f
>> Data (120 bytes)
>> 
>> 0000  91 19 5f 87 00 00 00 01 00 00 00 00 00 00 00 00   .._.............
>> 0010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01   ................
>> 0020  00 00 00 02 00 00 01 ed 00 00 00 02 00 00 04 16   ................
>> 0030  00 00 00 64 00 00 00 00 00 00 00 28 00 00 00 00   ...d.......(....
>> 0040  00 00 00 00 00 00 00 00 00 00 00 00 54 9e e5 9d   ............T...
>> 0050  c6 d3 1a d2 00 00 00 00 00 00 63 1f 5b f2 2e 7a   ..........c.[..z
>> 0060  0b ae f9 ec 5b f2 2e 7a 0b 53 6c ef 5b f2 2e 7a   ....[..z.Sl.[..z
>> 0070  0b 53 6c ef 00 00 00 0c                           .Sl.....
>> 
>> 
>> I believe RDMA Write First also has an RETH. The sender does not
>> interleave RDMA Writes, so subsequent Middle and Last packets go
>> with this RDMA Write First.
> 
> Ok, I see now where I was confused: in RDMA_MSG, wireshark labels it
> "RDMA offset", and in the RDMA Write First message it's labeled
> "Virtual address". Thank you for the explanation.
> 
> Here's the next question (coming from Jorge): Is it reasonable to
> assume that the top 32 bits are always zero? I have a network trace
> (from a 4.18-rc2 kernel) where they are not.
> 
> RPC over RDMA
>    XID: 0xa347cfa2
>    Version: 1
>    Flow Control: 128
>    Message Type: RDMA_MSG (0)
>    Read list (count: 0)
>    Write list (count: 1)
>        Write chunk (1 segment)
>            Write chunk segment count: 1
>            RDMA segment 0
>                RDMA handle: 0x4000076f
>                RDMA length: 65676
>                RDMA offset: 0x0000001049973000
> 
> I don't believe 0xa347cfa2 can fit there, can it?

I've been told by three independent RDMA experts that using the top
32 bits of the offset, even if they are not zero, should be OK to do.
If it doesn't work, it's a device driver bug that needs to be fixed.
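
For reference, the client-side change is the one-liner in frwr_op_map()
from patch 4/4, and pulling the XID back out of a capture is just a
shift; xid_from_offset() below is a hypothetical helper for
illustration, not part of the patch:

    /* From the patch: plant the (big-endian) XID in the upper 32 bits
     * of the iova that goes on the wire as the RDMA offset. */
    ibmr->iova |= ((u64)cpu_to_be32(xid)) << 32;

    /* Hypothetical helper: in a capture, the XID is simply the top
     * 32 bits of the displayed Virtual Address / RDMA offset. */
    static inline u32 xid_from_offset(u64 va)
    {
    	return va >> 32;
    }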


--
Chuck Lever




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 21:32                 ` Chuck Lever
@ 2018-11-19 21:42                   ` Mora, Jorge
  2018-11-19 22:46                     ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: Mora, Jorge @ 2018-11-19 21:42 UTC (permalink / raw)
  To: Chuck Lever, Olga Kornievskaia; +Cc: linux-rdma, Linux NFS Mailing List

Hello Chuck,

I am confused: isn't the whole purpose of RDMA to place the data directly into the memory location given by the virtual address or offset? Are you saying that this offset is not the actual memory address, and that the driver must map this offset to the actual address?


--Jorge

On 11/19/18, 2:33 PM, "linux-nfs-owner@vger.kernel.org on behalf of Chuck Lever" <linux-nfs-owner@vger.kernel.org on behalf of chuck.lever@oracle.com> wrote:

    I've been told by three independent RDMA experts that using the top
    32 bits of the offset, even if they are not zero, should be OK to do.
    If it doesn't work, it's a device driver bug that needs to be fixed.
    
    
    --
    Chuck Lever
    
    
    
    


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 16:16   ` Bart Van Assche
  2018-11-19 19:09     ` Leon Romanovsky
@ 2018-11-19 22:41     ` Jason Gunthorpe
  2018-11-19 22:56       ` Chuck Lever
  2018-11-20 15:22       ` Dennis Dalessandro
  1 sibling, 2 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2018-11-19 22:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Chuck Lever, linux-rdma, linux-nfs, oulijun, umalhi, zhongjiang,
	venkata.s.dhanalakota, Dennis Dalessandro

On Mon, Nov 19, 2018 at 08:16:37AM -0800, Bart Van Assche wrote:
> On Mon, 2018-11-19 at 10:45 -0500, Chuck Lever wrote:
> > FMR is not supported on most recent RDMA devices. It is also slower
> > and less secure than FRWR. As discussed during the RDMA BoF at LPC
> > 2018, it is time to remove support for FMR in the NFS/RDMA client
> > stack. NFS/RDMA server-side uses either local memory registration or
> > FRWR.  There is no change required there to deprecate FMR.
> > 
> > There are a few Infiniband/RoCE devices in the kernel tree that do
> > not support MEM_MGT_EXTENSIONS, and therefore will no longer support
> > client-side NFS/RDMA. These are:
> > 
> >  - mthca
> >  - qib

Ooh, qib was not in the list when we talked about this at Plumbers. That
does change things.

Dennis: Could qib be changed? HFI1 supports FRWR right?

> >  - usnic
> >  - hns (RoCE)
> 
> Can someone from Intel comment on how many of their customers rely on the qib
> driver?
> 
> Can someone from Cisco comment on how many of their customers rely on the usnic
> driver?

Ignore usnic, it is not an RDMA driver.

> Can someone from Huawei comment on how many of their customers rely on the hns
> driver?

Nor was this one in the list, and hns is an actively used driver.

Could hns do FRWR? It is very disappointing that such a new driver
does not support FRWR.

I guess it is worth stating again that FMR is deprecated. People are
going to be disappointed if they try to submit new drivers that don't
support FRWR :(

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 21:42                   ` Mora, Jorge
@ 2018-11-19 22:46                     ` Jason Gunthorpe
  2018-11-20  2:45                       ` Tom Talpey
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2018-11-19 22:46 UTC (permalink / raw)
  To: Mora, Jorge
  Cc: Chuck Lever, Olga Kornievskaia, linux-rdma, Linux NFS Mailing List

On Mon, Nov 19, 2018 at 09:42:56PM +0000, Mora, Jorge wrote:
> Hello Chuck,
> 
> I am confused: isn't the whole purpose of RDMA to place the data
> directly into the memory location given by the virtual address or
> offset? Are you saying that this offset is not the actual memory
> address, and that the driver must map this offset to the actual
> address?

All RDMA on the wire addresses are 'virtual' and always go through
translation hardware to reach a physical memory address. Even if you
have a virtual address that matches your physical address there will
still be translation hardware to enforce security.

The direct placement refers to not copying data, not to using physical
memory addresses throughout the protocol design.

So it is better to use a virtual address that doesn't disclose any
information about the host's configuration to the network, such as
physical locations of kernel memory.
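
As a concrete (minimal) sketch of the registration path, assuming pd, a
DMA-mapped scatterlist sg, and nents are set up elsewhere:

    struct ib_mr *mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, nents);
    int n = ib_map_mr_sg(mr, sg, nents, NULL, PAGE_SIZE);

    /* Error handling elided; n != nents means the mapping failed.
     * mr->iova is what goes on the wire as the "virtual address";
     * the HCA translates it through the MR's page list, so software
     * can pick upper bits that reveal nothing about kernel memory. */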

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 22:41     ` Jason Gunthorpe
@ 2018-11-19 22:56       ` Chuck Lever
  2018-11-19 23:10         ` Jason Gunthorpe
  2018-11-20 15:22       ` Dennis Dalessandro
  1 sibling, 1 reply; 31+ messages in thread
From: Chuck Lever @ 2018-11-19 22:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, linux-rdma, Linux NFS Mailing List, Lijun Ou,
	umalhi, zhongjiang, venkata.s.dhanalakota, Dennis Dalessandro



> On Nov 19, 2018, at 5:41 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, Nov 19, 2018 at 08:16:37AM -0800, Bart Van Assche wrote:
>> On Mon, 2018-11-19 at 10:45 -0500, Chuck Lever wrote:
>>> FMR is not supported on most recent RDMA devices. It is also slower
>>> and less secure than FRWR. As discussed during the RDMA BoF at LPC
>>> 2018, it is time to remove support for FMR in the NFS/RDMA client
>>> stack. NFS/RDMA server-side uses either local memory registration or
>>> FRWR.  There is no change required there to deprecate FMR.
>>> 
>>> There are a few Infiniband/RoCE devices in the kernel tree that do
>>> not support MEM_MGT_EXTENSIONS, and therefore will no longer support
>>> client-side NFS/RDMA. These are:
>>> 
>>> - mthca
>>> - qib
> 
> Ooh, qib was not in the list when we talked about this at Plumbers. That
> does change things. 
> 
> Dennis: Could qib be changed? HFI1 supports FRWR right?
> 
>>> - usnic
>>> - hns (RoCE)
>> 
>> Can someone from Intel comment on how many of their customers rely on the qib
>> driver?
>> 
>> Can someone from Cisco comment on how many of their customers rely on the usnic
>> driver?
> 
> Ignore usnic, it is not an RDMA driver.
> 
>> Can someone from Huawei comment on how many of their customers rely on the hns
>> driver?
> 
> Nor was this one in the list, and hns is an actively used driver.
> 
> Could hns do FRWR? It is very disappointing that such a new driver
> does not support FRWR.
> 
> I guess it is worth stating again that FMR is deprecated. People are
> going to be disappointed if they try to submit new drivers that don't
> support FRWR :(

My methodology was to search for whether each driver asserts
IB_DEVICE_MEM_MGT_EXTENSIONS in device_cap_flags.

Interestingly, nes and ocrdma assert it, but qib and hns do not.
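
That check boils down to something like this (a sketch;
device_supports_frwr() is an illustrative name, not the actual
frwr_ops.c code):

    static bool device_supports_frwr(struct ib_device *device)
    {
    	return device->attrs.device_cap_flags &
    	       IB_DEVICE_MEM_MGT_EXTENSIONS;
    }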


--
Chuck Lever




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 22:56       ` Chuck Lever
@ 2018-11-19 23:10         ` Jason Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2018-11-19 23:10 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Bart Van Assche, linux-rdma, Linux NFS Mailing List, Lijun Ou,
	umalhi, zhongjiang, venkata.s.dhanalakota, Dennis Dalessandro

On Mon, Nov 19, 2018 at 05:56:54PM -0500, Chuck Lever wrote:
> 
> 
> My methodology was to search for whether each driver asserts
> IB_DEVICE_MEM_MGT_EXTENSIONS in device_cap_flags.

Drivers might be broken here, or we might need more flags.

Searching for FRWR shows hits in hns, which is encouraging...

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 22:46                     ` Jason Gunthorpe
@ 2018-11-20  2:45                       ` Tom Talpey
  2018-11-20  3:09                         ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: Tom Talpey @ 2018-11-20  2:45 UTC (permalink / raw)
  To: Jason Gunthorpe, Mora, Jorge
  Cc: Chuck Lever, Olga Kornievskaia, linux-rdma, Linux NFS Mailing List

On 11/19/2018 5:46 PM, Jason Gunthorpe wrote:
> On Mon, Nov 19, 2018 at 09:42:56PM +0000, Mora, Jorge wrote:
>> Hello Chuck,
>>
>> I am confused: isn't the whole purpose of RDMA to place the data
>> directly into the memory location given by the virtual address or
>> offset? Are you saying that this offset is not the actual memory
>> address, and that the driver must map this offset to the actual
>> address?
> 
> All RDMA on the wire addresses are 'virtual' and always go through
> translation hardware to reach a physical memory address. Even if you
> have a virtual address that matches your physical address there will
> still be translation hardware to enforce security.
> 
> The direct placement refers to not copying data, not to using physical
> memory addresses throughout the protocol design.
> 
> So it is better to use a virtual address that doesn't disclose any
> information about the host's configuration to the network, such as
> physical locations of kernel memory.

Agreed. And, this is a very subtle but very powerful aspect of FRWR
registration, that the high-order bits of the "virtual address" can
basically be any value of the requestor's (client's) choice.

Note, the low-order bits, up to the maximum offset used by the upper
layer (e.g. a few MB for NFS), must be reserved in order to allow RDMA
segments to iterate over the range of bytes in the region. That's only
24 or so bits out of the available 64 here, however.
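
To put numbers on that (taking, say, a 16 MB maximum payload as an
assumption): the iterator needs the low 24 bits, since 2^24 bytes is
16 MB, leaving 40 bits that can stay invariant; reserving the top 32
for the XID still leaves 8 bits of headroom.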

For other registration methods, this is not always the case. It is
definitely best to leverage this important property for FRWR.

SMB Direct in Windows does a similar swizzling, by the way. We have
taken several different approaches to the format, over time.

Tom.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-20  2:45                       ` Tom Talpey
@ 2018-11-20  3:09                         ` Jason Gunthorpe
  2018-11-20  3:25                           ` Tom Talpey
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2018-11-20  3:09 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Mora, Jorge, Chuck Lever, Olga Kornievskaia, linux-rdma,
	Linux NFS Mailing List

On Mon, Nov 19, 2018 at 09:45:24PM -0500, Tom Talpey wrote:
> On 11/19/2018 5:46 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 19, 2018 at 09:42:56PM +0000, Mora, Jorge wrote:
> > > Hello Chuck,
> > > 
> > > I am confused: isn't the whole purpose of RDMA to place the data
> > > directly into the memory location given by the virtual address or
> > > offset? Are you saying that this offset is not the actual memory
> > > address, and that the driver must map this offset to the actual
> > > address?
> > 
> > All RDMA on the wire addresses are 'virtual' and always go through
> > translation hardware to reach a physical memory address. Even if you
> > have a virtual address that matches your physical address there will
> > still be translation hardware to enforce security.
> > 
> > The direct placement refers to not copying data, not to using physical
> > memory addresses throughout the protocol design.
> > 
> > So it is better to use a virtual address that doesn't disclose any
> > information about the host's configuration to the network, such as
> > physical locations of kernel memory.
> 
> Agreed. And, this is a very subtle but very powerful aspect of FRWR
> registration, that the high-order bits of the "virtual address" can
> basically be any value of the requestor's (client's) choice.
> 
> Note, the low-order bits, up to the maximum offset used by the upper
> layer (e.g. a few MB for NFS), must be reserved in order to allow RDMA
> segments to iterate over the range of bytes in the region. That's only
> 24 or so bits out of the available 64, here however.

Why does it depend on protocol? 

So long as the adaptor can compute VA + Offset = Physical the upper
bits can be changed. For many adaptors that means PAGE_SIZE or huge
page size level alignment..

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-20  3:09                         ` Jason Gunthorpe
@ 2018-11-20  3:25                           ` Tom Talpey
  2018-11-20  3:32                             ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: Tom Talpey @ 2018-11-20  3:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mora, Jorge, Chuck Lever, Olga Kornievskaia, linux-rdma,
	Linux NFS Mailing List

On 11/19/2018 10:09 PM, Jason Gunthorpe wrote:
> On Mon, Nov 19, 2018 at 09:45:24PM -0500, Tom Talpey wrote:
>> On 11/19/2018 5:46 PM, Jason Gunthorpe wrote:
>>> On Mon, Nov 19, 2018 at 09:42:56PM +0000, Mora, Jorge wrote:
>>>> Hello Chuck,
>>>>
>>>> I am confused: isn't the whole purpose of RDMA to place the data
>>>> directly into the memory location given by the virtual address or
>>>> offset? Are you saying that this offset is not the actual memory
>>>> address, and that the driver must map this offset to the actual
>>>> address?
>>>
>>> All RDMA on the wire addresses are 'virtual' and always go through
>>> translation hardware to reach a physical memory address. Even if you
>>> have a virtual address that matches your physical address there will
>>> still be translation hardware to enforce security.
>>>
>>> The direct placement refers to not copying data, not to using physical
>>> memory addresses throughout the protocol design.
>>>
>>> So it is better to use a virtual address that doesn't disclose any
>>> information about the host's configuration to the network, such as
>>> physical locations of kernel memory.
>>
>> Agreed. And, this is a very subtle but very powerful aspect of FRWR
>> registration, that the high-order bits of the "virtual address" can
>> basically be any value of the requestor's (client's) choice.
>>
>> Note, the low-order bits, up to the maximum offset used by the upper
>> layer (e.g. a few MB for NFS), must be reserved in order to allow RDMA
>> segments to iterate over the range of bytes in the region. That's only
>> 24 or so bits out of the available 64, here however.
> 
> Why does it depend on protocol?

Because the maximum payload being RDMA'd is an upper-layer matter. The
idea here is to provide a certain number of "invariant" bits in the
offset / virtual address. Chuck proposes reserving just 32 bits, leaving
32 bits of range, quite comfy.

> So long as the adaptor can compute VA + Offset = Physical the upper
> bits can be changed.

Absolutely, but if the offset walks into the high-order invariant bits,
they won't be invariant any longer. It would still work, but it would
defeat the wireshark trick that Chuck wants to play.

> For many adaptors that means PAGE_SIZE or huge
> page size level alignment..

I'm not sure I understand this. The FRWR hides the native page size, by
providing the page frames in the local work request. Their size isn't
known to the protocol.

Tom.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-20  3:25                           ` Tom Talpey
@ 2018-11-20  3:32                             ` Jason Gunthorpe
  2018-11-20  3:38                               ` Tom Talpey
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2018-11-20  3:32 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Mora, Jorge, Chuck Lever, Olga Kornievskaia, linux-rdma,
	Linux NFS Mailing List

On Mon, Nov 19, 2018 at 10:25:37PM -0500, Tom Talpey wrote:

> Absolutely, but if the offset walks into the high-order invariant bits,
> they won't be invariant any longer. It would still work, but it would
> defeat the wireshark trick that Chuck wants to play.

Sure, OK, if you care about the bits being invariant, and not simply
randomized :)

> > For many adaptors that means PAGE_SIZE or huge
> > page size level alignment..
> 
> I'm not sure I understand this. The FRWR hides the native page size, by
> providing the page frames in the local work request. Their size isn't
> known to the protocol.

Ah, some HW has limitations in how it can construct a VA. In many
implementations the low-order bits of the VA must match the offset into
the page. If the page is 4K that is 12 bits; if it is 2M that is 21
bits.

So there should be restrictions on what start VA can be assigned to a
FRWR.
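
Roughly, as a sketch of that constraint (va_is_usable() is illustrative;
page_size is whatever the MR was built with):

    /* Only bits at or above the page shift are free to swizzle: the
     * VA must keep the same offset within a page as the underlying
     * DMA address. */
    static bool va_is_usable(u64 va, u64 dma_addr, u64 page_size)
    {
    	return (va & (page_size - 1)) == (dma_addr & (page_size - 1));
    }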

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-20  3:32                             ` Jason Gunthorpe
@ 2018-11-20  3:38                               ` Tom Talpey
  0 siblings, 0 replies; 31+ messages in thread
From: Tom Talpey @ 2018-11-20  3:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mora, Jorge, Chuck Lever, Olga Kornievskaia, linux-rdma,
	Linux NFS Mailing List

On 11/19/2018 10:32 PM, Jason Gunthorpe wrote:
> On Mon, Nov 19, 2018 at 10:25:37PM -0500, Tom Talpey wrote:
> 
>> Absolutely, but if the offset walks into the high-order invariant bits,
>> they won't be invariant any longer. It would still work, but it would
>> defeat the wireshark trick that Chuck wants to play.
> 
> Sure, OK, if you care about the bits being invariant, and not simply
> randomized :)

Well, that was Chuck's proposal, to use the XID. SMB Direct originally
used a formula, then changed to randomized, and in the future will
likely change it again. The point is, it's the RDMA consumer's choice,
not a verbs, adapter, or protocol requirement.


>>> For many adaptors that means PAGE_SIZE or huge
>>> page size level alignment..
>>
>> I'm not sure I understand this. The FRWR hides the native page size, by
>> providing the page frames in the local work request. Their size isn't
>> known to the protocol.
> 
> Ah, some HW has limitations in how it can construct a VA. In many
> implementations the low-order bits of the VA must match the offset into
> the page. If the page is 4K that is 12 bits; if it is 2M that is 21
> bits.
> 
> So there should be restrictions on what start VA can be assigned to a
> FRWR.

I think only the FBO (First Byte Offset) is really restricted. But
indeed yes, the number of possible FBO bits is a function of the page 
size. So, if so, we agree - ok.

Tom.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 20:52       ` Bart Van Assche
@ 2018-11-20  5:37         ` Leon Romanovsky
  0 siblings, 0 replies; 31+ messages in thread
From: Leon Romanovsky @ 2018-11-20  5:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Chuck Lever, linux-rdma, linux-nfs, oulijun, zhongjiang,
	venkata.s.dhanalakota

[-- Attachment #1: Type: text/plain, Size: 1277 bytes --]

On Mon, Nov 19, 2018 at 12:52:10PM -0800, Bart Van Assche wrote:
> On Mon, 2018-11-19 at 21:09 +0200, Leon Romanovsky wrote:
> > The more accurate question is "how many customers ... use latest,
> > upstream kernel?"
>
> Hi Leon,
>
> If this would not be clear: I am in favor of removing kernel drivers that
> are no longer in use. Recently I supported a proposal to remove the SCSI
> OSD driver from the kernel tree because that driver probably has not
> been used by anyone in the past five years.
>
> But I think that kernel drivers that are still in use should stay in the
> kernel tree, no matter which version is in use. Since the Linux kernel ABI
> is backwards compatible, with every Linux distro it is possible to replace
> the Linux kernel that was provided by the Linux distributor with a more
> recent upstream kernel. I think it would be a very unpleasant surprise for
> people who use one of the RDMA adapters that only support FMR if they would
> test a more recent kernel and if they would notice that its driver has been
> removed.

Bart,

I know the theory behind promises to work on old HW, but the reality is
that no one really tests the latest kernels on those devices.

This article complains about it too:
https://lwn.net/Articles/769468/


Thanks

>
> Bart.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration
  2018-11-19 22:41     ` Jason Gunthorpe
  2018-11-19 22:56       ` Chuck Lever
@ 2018-11-20 15:22       ` Dennis Dalessandro
  1 sibling, 0 replies; 31+ messages in thread
From: Dennis Dalessandro @ 2018-11-20 15:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Bart Van Assche
  Cc: Chuck Lever, linux-rdma, linux-nfs, oulijun, umalhi, zhongjiang,
	venkata.s.dhanalakota

On 11/19/2018 5:41 PM, Jason Gunthorpe wrote:
> On Mon, Nov 19, 2018 at 08:16:37AM -0800, Bart Van Assche wrote:
>> On Mon, 2018-11-19 at 10:45 -0500, Chuck Lever wrote:
>>> FMR is not supported on most recent RDMA devices. It is also slower
>>> and less secure than FRWR. As discussed during the RDMA BoF at LPC
>>> 2018, it is time to remove support for FMR in the NFS/RDMA client
>>> stack. NFS/RDMA server-side uses either local memory registration or
>>> FRWR.  There is no change required there to deprecate FMR.
>>>
>>> There are a few Infiniband/RoCE devices in the kernel tree that do
>>> not support MEM_MGT_EXTENSIONS, and therefore will no longer support
>>> client-side NFS/RDMA. These are:
>>>
>>>   - mthca
>>>   - qib
> 
> Ooh, qib was not in the list when we talked on this in plumbers. That
> does change things.
> 
> Dennis: Could qib be changed? HFI1 supports FRWR right?

Turns out that qib cannot support FRWR because that requires HW support
for opcodes that qib just doesn't understand.

hfi1 is good to go though.

>>>   - usnic
>>>   - hns (RoCE)
>>
>> Can someone from Intel comment on how many of their customers rely on the qib
>> driver?

There is a user base. I can't say how large though. But I have had 
discussions with mgmt and we aren't quite ready to leave qib out in the 
cold just yet.

We have a lot of people out this week due to the US Thanksgiving
holiday. Once people are back in the office on Monday I'll find out
more about the impact of this change.

-Denny

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-19 15:46 ` [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
  2018-11-19 17:47   ` Olga Kornievskaia
@ 2018-11-20 18:02   ` Anna Schumaker
  2018-11-20 18:07     ` Chuck Lever
  1 sibling, 1 reply; 31+ messages in thread
From: Anna Schumaker @ 2018-11-20 18:02 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma; +Cc: linux-nfs

Hi Chuck,

On Mon, 2018-11-19 at 10:46 -0500, Chuck Lever wrote:
> Place the associated RPC transaction's XID in the upper 32 bits of
> each RDMA segment's rdma_offset field. These bits are currently
> always zero.
> 
> There are two reasons to do this:
> 
> - The R_key only has 8 bits that are different from registration to
>   registration. The XID adds more uniqueness to each RDMA segment to
>   reduce the likelihood of a software bug on the server reading from
>   or writing into memory it's not supposed to.
> 
> - On-the-wire RDMA Read and Write operations do not otherwise carry
>   any identifier that matches them up to an RPC. The XID in the
>   upper 32 bits will act as an eye-catcher in network captures.
> 
> Suggested-by: Tom Talpey <ttalpey@microsoft.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/frwr_ops.c  |    3 ++-
>  net/sunrpc/xprtrdma/rpc_rdma.c  |    6 +++---
>  net/sunrpc/xprtrdma/xprt_rdma.h |    2 +-
>  3 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 49b314d..3b260d2 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -344,7 +344,7 @@
>   */
>  static struct rpcrdma_mr_seg *
>  frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
> -	    int nsegs, bool writing, struct rpcrdma_mr **out)
> +	    int nsegs, bool writing, u32 xid, struct rpcrdma_mr **out)
>  {
>  	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>  	bool holes_ok = ia->ri_mrtype == IB_MR_TYPE_SG_GAPS;
> @@ -398,6 +398,7 @@
>  	if (unlikely(n != mr->mr_nents))
>  		goto out_mapmr_err;
>  
> +	ibmr->iova |= ((u64)cpu_to_be32(xid)) << 32;

My mount command hangs once we make this change (I got bored and killed it after
about 5 minutes).  This is with NFS over soft-RoCE in a kvm virtual machine, and
I see the behavior with all NFS versions.

I hope this helps!
Anna

>  	key = (u8)(ibmr->rkey & 0x000000FF);
>  	ib_update_fast_reg_key(ibmr, ++key);
>  
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index 9f53e02..89a2db2 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -357,7 +357,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt
> *r_xprt,
>  
>  	do {
>  		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
> -						   false, &mr);
> +						   false, rqst->rq_xid, &mr);
>  		if (IS_ERR(seg))
>  			return PTR_ERR(seg);
>  		rpcrdma_mr_push(mr, &req->rl_registered);
> @@ -415,7 +415,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt
> *r_xprt,
>  	nchunks = 0;
>  	do {
>  		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
> -						   true, &mr);
> +						   true, rqst->rq_xid, &mr);
>  		if (IS_ERR(seg))
>  			return PTR_ERR(seg);
>  		rpcrdma_mr_push(mr, &req->rl_registered);
> @@ -473,7 +473,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt
> *r_xprt,
>  	nchunks = 0;
>  	do {
>  		seg = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
> -						   true, &mr);
> +						   true, rqst->rq_xid, &mr);
>  		if (IS_ERR(seg))
>  			return PTR_ERR(seg);
>  		rpcrdma_mr_push(mr, &req->rl_registered);
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index a13ccb6..2ae1ee2 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -472,7 +472,7 @@ struct rpcrdma_memreg_ops {
>  	struct rpcrdma_mr_seg *
>  			(*ro_map)(struct rpcrdma_xprt *,
>  				  struct rpcrdma_mr_seg *, int, bool,
> -				  struct rpcrdma_mr **);
> +				  u32, struct rpcrdma_mr **);
>  	int		(*ro_send)(struct rpcrdma_ia *ia,
>  				   struct rpcrdma_req *req);
>  	void		(*ro_reminv)(struct rpcrdma_rep *rep,
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
  2018-11-20 18:02   ` Anna Schumaker
@ 2018-11-20 18:07     ` Chuck Lever
       [not found]       ` <94ff7ec712e086bfdd9c217a5f97c293a07151b9.camel@gmail.com>
  0 siblings, 1 reply; 31+ messages in thread
From: Chuck Lever @ 2018-11-20 18:07 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, Linux NFS Mailing List



> On Nov 20, 2018, at 1:02 PM, Anna Schumaker <schumaker.anna@gmail.com> wrote:
> 
> Hi Chuck,
> 
> On Mon, 2018-11-19 at 10:46 -0500, Chuck Lever wrote:
>> Place the associated RPC transaction's XID in the upper 32 bits of
>> each RDMA segment's rdma_offset field. These bits are currently
>> always zero.
>> 
>> There are two reasons to do this:
>> 
>> - The R_key only has 8 bits that are different from registration to
>>  registration. The XID adds more uniqueness to each RDMA segment to
>>  reduce the likelihood of a software bug on the server reading from
>>  or writing into memory it's not supposed to.
>> 
>> - On-the-wire RDMA Read and Write operations do not otherwise carry
>>  any identifier that matches them up to an RPC. The XID in the
>>  upper 32 bits will act as an eye-catcher in network captures.
>> 
>> Suggested-by: Tom Talpey <ttalpey@microsoft.com>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> net/sunrpc/xprtrdma/frwr_ops.c  |    3 ++-
>> net/sunrpc/xprtrdma/rpc_rdma.c  |    6 +++---
>> net/sunrpc/xprtrdma/xprt_rdma.h |    2 +-
>> 3 files changed, 6 insertions(+), 5 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index 49b314d..3b260d2 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -344,7 +344,7 @@
>>  */
>> static struct rpcrdma_mr_seg *
>> frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>> -	    int nsegs, bool writing, struct rpcrdma_mr **out)
>> +	    int nsegs, bool writing, u32 xid, struct rpcrdma_mr **out)
>> {
>> 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> 	bool holes_ok = ia->ri_mrtype == IB_MR_TYPE_SG_GAPS;
>> @@ -398,6 +398,7 @@
>> 	if (unlikely(n != mr->mr_nents))
>> 		goto out_mapmr_err;
>> 
>> +	ibmr->iova |= ((u64)cpu_to_be32(xid)) << 32;
> 
> My mount command hangs once we make this change (I got bored and killed it after
> about 5 minutes).  This is with NFS over soft-RoCE in a kvm virtual machine, and
> I see the behavior with all NFS versions.
> 
> I hope this helps!

OK. Can you capture ftrace output to show us where the hang occurs?
I use this on the NFS client:

 # trace-cmd record -e sunrpc -e rpcrdma
 ^C
 # trace-cmd report | less      (or > /tmp/trace.log)

Thanks!



--
Chuck Lever




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
       [not found]       ` <94ff7ec712e086bfdd9c217a5f97c293a07151b9.camel@gmail.com>
@ 2018-11-20 21:31         ` Chuck Lever
  0 siblings, 0 replies; 31+ messages in thread
From: Chuck Lever @ 2018-11-20 21:31 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, Linux NFS Mailing List



> On Nov 20, 2018, at 4:01 PM, Anna Schumaker <schumaker.anna@gmail.com> wrote:
> 
> On Tue, 2018-11-20 at 13:07 -0500, Chuck Lever wrote:
>>> On Nov 20, 2018, at 1:02 PM, Anna Schumaker <schumaker.anna@gmail.com>
>>> wrote:
>>> 
>>> Hi Chuck,
>>> 
>>> On Mon, 2018-11-19 at 10:46 -0500, Chuck Lever wrote:
>>>> Place the associated RPC transaction's XID in the upper 32 bits of
>>>> each RDMA segment's rdma_offset field. These bits are currently
>>>> always zero.
>>>> 
>>>> There are two reasons to do this:
>>>> 
>>>> - The R_key only has 8 bits that are different from registration to
>>>> registration. The XID adds more uniqueness to each RDMA segment to
>>>> reduce the likelihood of a software bug on the server reading from
>>>> or writing into memory it's not supposed to.
>>>> 
>>>> - On-the-wire RDMA Read and Write operations do not otherwise carry
>>>> any identifier that matches them up to an RPC. The XID in the
>>>> upper 32 bits will act as an eye-catcher in network captures.
>>>> 
>>>> Suggested-by: Tom Talpey <ttalpey@microsoft.com>
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> ---
>>>> net/sunrpc/xprtrdma/frwr_ops.c  |    3 ++-
>>>> net/sunrpc/xprtrdma/rpc_rdma.c  |    6 +++---
>>>> net/sunrpc/xprtrdma/xprt_rdma.h |    2 +-
>>>> 3 files changed, 6 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c
>>>> b/net/sunrpc/xprtrdma/frwr_ops.c
>>>> index 49b314d..3b260d2 100644
>>>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>>>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>>>> @@ -344,7 +344,7 @@
>>>> */
>>>> static struct rpcrdma_mr_seg *
>>>> frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>>> -	    int nsegs, bool writing, struct rpcrdma_mr **out)
>>>> +	    int nsegs, bool writing, u32 xid, struct rpcrdma_mr **out)
>>>> {
>>>> 	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>>> 	bool holes_ok = ia->ri_mrtype == IB_MR_TYPE_SG_GAPS;
>>>> @@ -398,6 +398,7 @@
>>>> 	if (unlikely(n != mr->mr_nents))
>>>> 		goto out_mapmr_err;
>>>> 
>>>> +	ibmr->iova |= ((u64)cpu_to_be32(xid)) << 32;
>>> 
>>> My mount command hangs once we make this change (I got bored and killed it
>>> after
>>> about 5 minutes).  This is with NFS over soft-RoCE in a kvm virtual machine,
>>> and
>>> I see the behavior with all NFS versions.
>>> 
>>> I hope this helps!
>> 
>> OK. Can you capture ftrace output to show us where the hang occurs?
>> I use this on the NFS client:
>> 
>> # trace-cmd record -e sunrpc -e rpcrdma
>> ^C
>> # trace-cmd report | less      (or > /tmp/trace.log)
> 
> The trace is attached!  I started the mount and walked away for a while.  When I
> came back, it reported "mount.nfs: Connection timed out".

The first operation is EXCHANGE_ID, and the client expects a
large Reply so it registers a Reply chunk.

       mount.nfs-1325  [001]   450.674910: xprtrdma_reply_chunk: task:16530@10 mr=0xffff93643876cb00 4380@0xffffd37d38bec100:0x00001a61 (last)
       mount.nfs-1325  [001]   450.674912: xprtrdma_marshal:     task:16530@10 xid=0x09c9d05d: hdr=48 xdr=232/0/0 inline/reply chunk
       mount.nfs-1325  [001]   450.674914: xprtrdma_post_send:   req=0xffff936438a80000, 2 SGEs, status=0
       mount.nfs-1325  [001]   450.674928: xprt_transmit:        peer=[192.168.100.215]:20049 xid=0x09c9d05d status=0
       mount.nfs-1325  [001]   450.674929: rpc_task_run_action:  task:16530@10 flags=5a80 state=0015 status=0 action=call_transmit_status
       mount.nfs-1325  [001]   450.674930: rpc_task_sleep:       task:16530@10 flags=5a80 state=0015 status=0 timeout=18000 queue=xprt_pending
    kworker/1:1H-177   [001]   450.676094: xprtrdma_wc_receive:  cqe=0xffff936438d66600 0 bytes: WR_FLUSH_ERR (5/0x0)
    kworker/1:1H-177   [001]   450.676095: xprtrdma_wc_receive:  cqe=0xffff936438d66400 0 bytes: WR_FLUSH_ERR (5/0x0)

The XID is 0x09c9d05d, but I don't see it in the top 32 bits of the
offset, which is 0xffffd37d38bec100. Are you sure the patch is applied?



--
Chuck Lever




^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-11-20 21:31 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-19 15:45 [PATCH v1 0/4] NFS/RDMA client for v4.21 (part 1) Chuck Lever
2018-11-19 15:45 ` [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration Chuck Lever
2018-11-19 16:16   ` Bart Van Assche
2018-11-19 19:09     ` Leon Romanovsky
2018-11-19 20:52       ` Bart Van Assche
2018-11-20  5:37         ` Leon Romanovsky
2018-11-19 22:41     ` Jason Gunthorpe
2018-11-19 22:56       ` Chuck Lever
2018-11-19 23:10         ` Jason Gunthorpe
2018-11-20 15:22       ` Dennis Dalessandro
2018-11-19 15:45 ` [PATCH v1 2/4] xprtrdma: mrs_create off-by-one Chuck Lever
2018-11-19 15:46 ` [PATCH v1 3/4] xprtrdma: Reduce max_frwr_depth Chuck Lever
2018-11-19 15:46 ` [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
2018-11-19 17:47   ` Olga Kornievskaia
2018-11-19 17:58     ` Chuck Lever
2018-11-19 18:08       ` Olga Kornievskaia
2018-11-19 18:18         ` Chuck Lever
2018-11-19 18:47           ` Olga Kornievskaia
2018-11-19 18:58             ` Chuck Lever
2018-11-19 21:22               ` Olga Kornievskaia
2018-11-19 21:32                 ` Chuck Lever
2018-11-19 21:42                   ` Mora, Jorge
2018-11-19 22:46                     ` Jason Gunthorpe
2018-11-20  2:45                       ` Tom Talpey
2018-11-20  3:09                         ` Jason Gunthorpe
2018-11-20  3:25                           ` Tom Talpey
2018-11-20  3:32                             ` Jason Gunthorpe
2018-11-20  3:38                               ` Tom Talpey
2018-11-20 18:02   ` Anna Schumaker
2018-11-20 18:07     ` Chuck Lever
     [not found]       ` <94ff7ec712e086bfdd9c217a5f97c293a07151b9.camel@gmail.com>
2018-11-20 21:31         ` Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).