All of lore.kernel.org
* [PATCH 0/5] Indirect memory registration feature
@ 2015-06-08 13:15 Sagi Grimberg
       [not found] ` <1433769339-949-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-08 13:15 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Sagi Grimberg

Hey All,

This patchset introduces a feature called indirect memory registration.

The RDMA stack has always imposed constraints on the layout of a buffer list
handed to memory registration: all SG elements must observe a minimum block
alignment (page_shift), and only the first SG element is allowed to start at
a byte offset.

This can make life hard for ULPs that want to support arbitrary scattered
lists that don't meet the above constraint. Two immediate examples are iser
and srp, which can be handed SG lists that are not "nicely" aligned, while
several components in the IO stack (e.g. block, scsi) support arbitrary SG
lists. Some workloads yield such SG lists quite commonly, which poses a
challenge for RDMA storage protocols.

There are a couple of possible (sub-optimal) solutions to this limitation:
- srp can use multiple memory regions to register a single SG list and send an
  indirect data buffer descriptor. The down-side of this approach is that in certain
  workloads srp may not have sufficient available resources (i.e. memory regions)
  to register a scsi SG list, which will cause the dynamic queue size to shrink and
  cause unpredictable latencies. Another (minor) down-side is that an indirect
  descriptor will cause the target to initiate multiple rdma reads/writes (one for
  each rkey).

- This is not possible for iser. The iser protocol mandates that only a single stag can
  be sent for a unidirectional IO. Two possible solutions are:
  * Allocate a well-aligned buffer list and copy the data to/from this SG list
    of buffers (with the obvious down-side of not being zero-copy, and introducing
    atomic allocations in the IO path).
  * Hold another pool of MRs with the minimal block alignment guaranteed by scsi (512B)
    and resort to this pool for unaligned SG lists. The down-sides here are:
    - This does not cover SG-IO, where there is no minimal alignment
    - It involves a heuristic approach to sizing this pool
    - Longer page lists registered in the device translation tables increase
      the impact of cache misses

Indirect memory registration solves this problem by allowing the application/ULP
to pass a list of ib_sge elements that may have arbitrary byte alignment. The
proposed API follows the well-known fast registration scheme and can easily be
adopted by any application.

Note: We ran out of capability bits in device_cap_flags, so I changed the field
      to a u64. I can alternatively introduce a second device_cap_flags2 if this
      has negative effects on user-space.

See a former discussion on the RFC version of this in
http://marc.info/?l=linux-rdma&w=2&r=1&s=indirect+fast+memory+registration&q=b

I'd appreciate the community's code review.

Adir Lev (1):
  IB/iser: Add indirect registration support

Sagi Grimberg (4):
  IB/core: Introduce Fast Indirect Memory Registration verbs API
  IB/mlx5: Implement Fast Indirect Memory Registration Feature
  IB/iser: Pass iser device to registration routines
  IB/iser: Add debug prints to the various memory registration methods

 drivers/infiniband/core/verbs.c           |   28 +++++++
 drivers/infiniband/hw/mlx5/cq.c           |    2 +
 drivers/infiniband/hw/mlx5/main.c         |    4 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h      |   19 +++++
 drivers/infiniband/hw/mlx5/mr.c           |   66 +++++++++++++++++
 drivers/infiniband/hw/mlx5/qp.c           |  106 +++++++++++++++++++++++++++
 drivers/infiniband/ulp/iser/iscsi_iser.h  |    8 ++
 drivers/infiniband/ulp/iser/iser_memory.c |  112 +++++++++++++++++++++++++++--
 drivers/infiniband/ulp/iser/iser_verbs.c  |   53 ++++++++++++--
 include/rdma/ib_verbs.h                   |   52 +++++++++++++-
 10 files changed, 434 insertions(+), 16 deletions(-)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found] ` <1433769339-949-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-08 13:15   ` Sagi Grimberg
       [not found]     ` <1433769339-949-2-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 13:15   ` [PATCH 2/5] IB/mlx5: Implement Fast Indirect Memory Registration Feature Sagi Grimberg
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-08 13:15 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Sagi Grimberg

In order to support indirect memory registration we provide the user
with an interface to pass a scattered list of buffers to the IB core
layer, called ib_indir_reg_list, and a new send work request opcode
called IB_WR_REG_INDIR_MR. We extend the wr union with a new type of
memory registration called indir_reg, where the user places the
relevant information needed to perform such a memory registration.

The verbs user is expected to perform these steps:
0. Make sure that the device supports indirect memory registration via
   the ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and that the
   ib_device_attr max_indir_reg_mr_list_len suffices for the
   expected scatterlist length

1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
   This is done via ib_create_mr() with:
   mr_init_attr.flags = IB_MR_INDIRECT_REG

2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
   pointers. This is done via new ib_alloc_indir_reg_list() verb

3. Fill the scattered buffers in ib_indir_reg_list.sg_list

4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
   provide the filled ib_indir_reg_list

5. Perform data transfer

6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)

7. Free indirect MR and ib_indir_reg_list via
   ib_dereg_mr() and ib_free_indir_reg_list()
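
A minimal usage sketch of these steps (not part of the patch; PD/QP setup,
error handling and the capability check are omitted, and nents, dma_addr,
dma_len, dma_lkey and total_len are placeholders):

    struct ib_mr_init_attr mr_attr = { .flags = IB_MR_INDIRECT_REG,
                                       .max_reg_descriptors = nents };
    struct ib_indir_reg_list *irl;
    struct ib_send_wr wr, *bad_wr;
    struct ib_mr *mr;
    int i;

    /* assumes IB_DEVICE_INDIR_REGISTRATION was found in device_cap_flags */
    mr = ib_create_mr(pd, &mr_attr);
    irl = ib_alloc_indir_reg_list(device, nents);

    /* describe the (possibly byte aligned) buffers */
    for (i = 0; i < nents; i++) {
            irl->sg_list[i].addr   = dma_addr[i];
            irl->sg_list[i].length = dma_len[i];
            irl->sg_list[i].lkey   = dma_lkey;
    }

    memset(&wr, 0, sizeof(wr));
    wr.opcode                      = IB_WR_REG_INDIR_MR;
    wr.send_flags                  = IB_SEND_SIGNALED;
    wr.wr.indir_reg.mkey           = mr->rkey;
    wr.wr.indir_reg.indir_list     = irl;
    wr.wr.indir_reg.indir_list_len = nents;
    wr.wr.indir_reg.iova_start     = irl->sg_list[0].addr;
    wr.wr.indir_reg.length         = total_len;
    wr.wr.indir_reg.access_flags   = IB_ACCESS_LOCAL_WRITE |
                                     IB_ACCESS_REMOTE_READ |
                                     IB_ACCESS_REMOTE_WRITE;
    ib_post_send(qp, &wr, &bad_wr);

    /* ... data transfer, reap IB_WC_REG_INDIR_MR, then
     * ib_dereg_mr(mr) and ib_free_indir_reg_list(irl)
     */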

Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/verbs.c |   28 +++++++++++++++++++++
 include/rdma/ib_verbs.h         |   52 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 927cf2e..ad2a4d3 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1493,3 +1493,31 @@ int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
 		mr->device->check_mr_status(mr, check_mask, mr_status) : -ENOSYS;
 }
 EXPORT_SYMBOL(ib_check_mr_status);
+
+struct ib_indir_reg_list *
+ib_alloc_indir_reg_list(struct ib_device *device,
+			unsigned int max_indir_list_len)
+{
+	struct ib_indir_reg_list *indir_list;
+
+	if (!device->alloc_indir_reg_list)
+		return ERR_PTR(-ENOSYS);
+
+	indir_list = device->alloc_indir_reg_list(device,
+						  max_indir_list_len);
+	if (!IS_ERR(indir_list)) {
+		indir_list->device = device;
+		indir_list->max_indir_list_len = max_indir_list_len;
+	}
+
+	return indir_list;
+}
+EXPORT_SYMBOL(ib_alloc_indir_reg_list);
+
+void
+ib_free_indir_reg_list(struct ib_indir_reg_list *indir_list)
+{
+	if (indir_list->device->free_indir_reg_list)
+		indir_list->device->free_indir_reg_list(indir_list);
+}
+EXPORT_SYMBOL(ib_free_indir_reg_list);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 3fedea5..51563f4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -133,6 +133,7 @@ enum ib_device_cap_flags {
 	IB_DEVICE_MANAGED_FLOW_STEERING = (1<<29),
 	IB_DEVICE_SIGNATURE_HANDOVER	= (1<<30),
 	IB_DEVICE_ON_DEMAND_PAGING	= (1<<31),
+	IB_DEVICE_INDIR_REGISTRATION	= (1ULL << 32),
 };
 
 enum ib_signature_prot_cap {
@@ -183,7 +184,7 @@ struct ib_device_attr {
 	u32			hw_ver;
 	int			max_qp;
 	int			max_qp_wr;
-	int			device_cap_flags;
+	u64			device_cap_flags;
 	int			max_sge;
 	int			max_sge_rd;
 	int			max_cq;
@@ -212,6 +213,7 @@ struct ib_device_attr {
 	int			max_srq_wr;
 	int			max_srq_sge;
 	unsigned int		max_fast_reg_page_list_len;
+	unsigned int		max_indir_reg_mr_list_len;
 	u16			max_pkeys;
 	u8			local_ca_ack_delay;
 	int			sig_prot_cap;
@@ -543,7 +545,8 @@ __attribute_const__ int ib_rate_to_mult(enum ib_rate rate);
 __attribute_const__ int ib_rate_to_mbps(enum ib_rate rate);
 
 enum ib_mr_create_flags {
-	IB_MR_SIGNATURE_EN = 1,
+	IB_MR_SIGNATURE_EN = 1 << 0,
+	IB_MR_INDIRECT_REG = 1 << 1
 };
 
 /**
@@ -720,6 +723,7 @@ enum ib_wc_opcode {
 	IB_WC_FAST_REG_MR,
 	IB_WC_MASKED_COMP_SWAP,
 	IB_WC_MASKED_FETCH_ADD,
+	IB_WC_REG_INDIR_MR,
 /*
  * Set value of IB_WC_RECV so consumers can test if a completion is a
  * receive by testing (opcode & IB_WC_RECV).
@@ -1014,6 +1018,7 @@ enum ib_wr_opcode {
 	IB_WR_MASKED_ATOMIC_FETCH_AND_ADD,
 	IB_WR_BIND_MW,
 	IB_WR_REG_SIG_MR,
+	IB_WR_REG_INDIR_MR,
 	/* reserve values for low level drivers' internal use.
 	 * These values will not be used at all in the ib core layer.
 	 */
@@ -1053,6 +1058,12 @@ struct ib_fast_reg_page_list {
 	unsigned int		max_page_list_len;
 };
 
+struct ib_indir_reg_list {
+	struct ib_device       *device;
+	struct ib_sge          *sg_list;
+	unsigned int		max_indir_list_len;
+};
+
 /**
  * struct ib_mw_bind_info - Parameters for a memory window bind operation.
  * @mr: A memory region to bind the memory window to.
@@ -1125,6 +1136,14 @@ struct ib_send_wr {
 			int			access_flags;
 			struct ib_sge	       *prot;
 		} sig_handover;
+		struct {
+			u64				iova_start;
+			struct ib_indir_reg_list       *indir_list;
+			unsigned int			indir_list_len;
+			u64				length;
+			unsigned int			access_flags;
+			u32				mkey;
+		} indir_reg;
 	} wr;
 	u32			xrc_remote_srq_num;	/* XRC TGT QPs only */
 };
@@ -1659,6 +1678,9 @@ struct ib_device {
 	struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device,
 								   int page_list_len);
 	void			   (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list);
+	struct ib_indir_reg_list * (*alloc_indir_reg_list)(struct ib_device *device,
+							   unsigned int indir_list_len);
+	void			   (*free_indir_reg_list)(struct ib_indir_reg_list *indir_list);
 	int                        (*rereg_phys_mr)(struct ib_mr *mr,
 						    int mr_rereg_mask,
 						    struct ib_pd *pd,
@@ -2793,6 +2815,32 @@ struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list(
 void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
 
 /**
+ * ib_alloc_indir_reg_list() - Allocates an indirect list array
+ * @device: ib device pointer
+ * @indir_list_len: size of the list array to be allocated
+ *
+ * Allocate a struct ib_indir_reg_list and a sg_list array
+ * that is at least indir_list_len in size. The actual size is
+ * returned in max_indir_list_len. The caller is responsible for
+ * initializing the contents of the sg_list array before posting
+ * a send work request with the IB_WR_REG_INDIR_MR opcode.
+ *
+ * The sg_list array entries are set the same way as an ib_send_wr
+ * sg_list entry: {addr, length, lkey}.
+ */
+struct ib_indir_reg_list *
+ib_alloc_indir_reg_list(struct ib_device *device,
+			unsigned int indir_list_len);
+
+/**
+ * ib_free_indir_reg_list() - Deallocates a previously allocated
+ *     indirect list array
+ * @indir_list: pointer to be deallocated
+ */
+void
+ib_free_indir_reg_list(struct ib_indir_reg_list *indir_list);
+
+/**
  * ib_update_fast_reg_key - updates the key portion of the fast_reg MR
  *   R_Key and L_Key.
  * @mr - struct ib_mr pointer to be updated.
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found] ` <1433769339-949-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 13:15   ` [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API Sagi Grimberg
@ 2015-06-08 13:15   ` Sagi Grimberg
  2015-06-08 13:15   ` [PATCH 3/5] IB/iser: Pass iser device to registration routines Sagi Grimberg
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-08 13:15 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Sagi Grimberg

This patch implements:
- ib_alloc/free_indir_reg_list() routines
- ib_create_mr() extension for IB_MR_INDIRECT_REG
- ib_post_send() extension for IB_WR_REG_INDIR_MR
  and work completion of IB_WC_REG_INDIR_MR
- Expose mlx5 indirect registration device capabilities

Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx5/cq.c      |    2 +
 drivers/infiniband/hw/mlx5/main.c    |    4 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   19 ++++++
 drivers/infiniband/hw/mlx5/mr.c      |   66 +++++++++++++++++++++
 drivers/infiniband/hw/mlx5/qp.c      |  106 ++++++++++++++++++++++++++++++++++
 5 files changed, 197 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 2ee6b10..43495c6 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -111,6 +111,8 @@ static enum ib_wc_opcode get_umr_comp(struct mlx5_ib_wq *wq, int idx)
 	case IB_WR_FAST_REG_MR:
 		return IB_WC_FAST_REG_MR;
 
+	case IB_WR_REG_INDIR_MR:
+		return IB_WC_REG_INDIR_MR;
 	default:
 		pr_warn("unknown completion status\n");
 		return 0;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 582bfd9..47a3d76 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -107,6 +107,7 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 	if (flags & MLX5_DEV_CAP_FLAG_XRC)
 		props->device_cap_flags |= IB_DEVICE_XRC;
 	props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS;
+	props->device_cap_flags |= IB_DEVICE_INDIR_REGISTRATION;
 	if (flags & MLX5_DEV_CAP_FLAG_SIG_HAND_OVER) {
 		props->device_cap_flags |= IB_DEVICE_SIGNATURE_HANDOVER;
 		/* At this stage no support for signature handover */
@@ -145,6 +146,7 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 	props->max_res_rd_atom	   = props->max_qp_rd_atom * props->max_qp;
 	props->max_srq_sge	   = max_rq_sg - 1;
 	props->max_fast_reg_page_list_len = (unsigned int)-1;
+	props->max_indir_reg_mr_list_len = 1 << gen->log_max_klm_list_size;
 	props->local_ca_ack_delay  = gen->local_ca_ack_delay;
 	props->atomic_cap	   = IB_ATOMIC_NONE;
 	props->masked_atomic_cap   = IB_ATOMIC_NONE;
@@ -1302,6 +1304,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 	dev->ib_dev.free_fast_reg_page_list  = mlx5_ib_free_fast_reg_page_list;
 	dev->ib_dev.check_mr_status	= mlx5_ib_check_mr_status;
 	dev->ib_dev.get_port_immutable  = mlx5_port_immutable;
+	dev->ib_dev.alloc_indir_reg_list = mlx5_ib_alloc_indir_reg_list;
+	dev->ib_dev.free_indir_reg_list  = mlx5_ib_free_indir_reg_list;
 
 	mlx5_ib_internal_query_odp_caps(dev);
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index d8e07c1..68d8865 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -334,6 +334,13 @@ struct mlx5_ib_fast_reg_page_list {
 	dma_addr_t			map;
 };
 
+struct mlx5_ib_indir_reg_list {
+	struct ib_indir_reg_list        ib_irl;
+	void                           *mapped_ilist;
+	struct mlx5_klm                *klms;
+	dma_addr_t                      map;
+};
+
 struct mlx5_ib_umr_context {
 	enum ib_wc_status	status;
 	struct completion	done;
@@ -508,6 +515,12 @@ static inline struct mlx5_ib_fast_reg_page_list *to_mfrpl(struct ib_fast_reg_pag
 	return container_of(ibfrpl, struct mlx5_ib_fast_reg_page_list, ibfrpl);
 }
 
+static inline struct mlx5_ib_indir_reg_list *
+to_mindir_list(struct ib_indir_reg_list *ib_irl)
+{
+	return container_of(ib_irl, struct mlx5_ib_indir_reg_list, ib_irl);
+}
+
 struct mlx5_ib_ah {
 	struct ib_ah		ibah;
 	struct mlx5_av		av;
@@ -578,6 +591,12 @@ struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd,
 struct ib_fast_reg_page_list *mlx5_ib_alloc_fast_reg_page_list(struct ib_device *ibdev,
 							       int page_list_len);
 void mlx5_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
+
+struct ib_indir_reg_list *
+mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
+			     unsigned int max_indir_list_len);
+void mlx5_ib_free_indir_reg_list(struct ib_indir_reg_list *indir_list);
+
 struct ib_fmr *mlx5_ib_fmr_alloc(struct ib_pd *pd, int acc,
 				 struct ib_fmr_attr *fmr_attr);
 int mlx5_ib_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 04b6787..25c7583 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1300,6 +1300,9 @@ struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
 		++mr->sig->sigerr_count;
 	}
 
+	if (mr_init_attr->flags & IB_MR_INDIRECT_REG)
+		access_mode = MLX5_ACCESS_MODE_KLM;
+
 	in->seg.flags = MLX5_PERM_UMR_EN | access_mode;
 	err = mlx5_core_create_mkey(dev->mdev, &mr->mmr, in, sizeof(*in),
 				    NULL, NULL, NULL);
@@ -1459,3 +1462,66 @@ int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
 done:
 	return ret;
 }
+
+struct ib_indir_reg_list *
+mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
+			     unsigned int max_indir_list_len)
+{
+	struct device *ddev = device->dma_device;
+	struct mlx5_ib_indir_reg_list *mirl = NULL;
+	int dsize;
+	int err;
+
+	mirl = kzalloc(sizeof(*mirl), GFP_KERNEL);
+	if (!mirl)
+		return ERR_PTR(-ENOMEM);
+
+	mirl->ib_irl.sg_list = kcalloc(max_indir_list_len,
+				       sizeof(*mirl->ib_irl.sg_list),
+				       GFP_KERNEL);
+	if (!mirl->ib_irl.sg_list) {
+		err = -ENOMEM;
+		goto err_sg_list;
+	}
+
+	dsize = sizeof(*mirl->klms) * max_indir_list_len;
+	dsize += max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
+	mirl->mapped_ilist = kzalloc(dsize, GFP_KERNEL);
+	if (!mirl->mapped_ilist) {
+		err = -ENOMEM;
+		goto err_mapped_list;
+	}
+
+	mirl->klms = (void *)ALIGN((uintptr_t)mirl->mapped_ilist,
+				   MLX5_UMR_ALIGN);
+	mirl->map = dma_map_single(ddev, mirl->klms,
+				   dsize, DMA_TO_DEVICE);
+	if (dma_mapping_error(ddev, mirl->map)) {
+		err = -ENOMEM;
+		goto err_dma_map;
+	}
+
+	return &mirl->ib_irl;
+err_dma_map:
+	kfree(mirl->mapped_ilist);
+err_mapped_list:
+	kfree(mirl->ib_irl.sg_list);
+err_sg_list:
+	kfree(mirl);
+
+	return ERR_PTR(err);
+}
+
+void
+mlx5_ib_free_indir_reg_list(struct ib_indir_reg_list *indir_list)
+{
+	struct mlx5_ib_indir_reg_list *mirl = to_mindir_list(indir_list);
+	struct device *ddev = indir_list->device->dma_device;
+	int dsize;
+
+	dsize = sizeof(*mirl->klms) * indir_list->max_indir_list_len;
+	dma_unmap_single(ddev, mirl->map, dsize, DMA_TO_DEVICE);
+	kfree(mirl->mapped_ilist);
+	kfree(mirl->ib_irl.sg_list);
+	kfree(mirl);
+}
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index d35f62d..64b969b 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -65,6 +65,7 @@ static const u32 mlx5_ib_opcode[] = {
 	[IB_WR_SEND_WITH_INV]			= MLX5_OPCODE_SEND_INVAL,
 	[IB_WR_LOCAL_INV]			= MLX5_OPCODE_UMR,
 	[IB_WR_FAST_REG_MR]			= MLX5_OPCODE_UMR,
+	[IB_WR_REG_INDIR_MR]			= MLX5_OPCODE_UMR,
 	[IB_WR_MASKED_ATOMIC_CMP_AND_SWP]	= MLX5_OPCODE_ATOMIC_MASKED_CS,
 	[IB_WR_MASKED_ATOMIC_FETCH_AND_ADD]	= MLX5_OPCODE_ATOMIC_MASKED_FA,
 	[MLX5_IB_WR_UMR]			= MLX5_OPCODE_UMR,
@@ -2477,6 +2478,98 @@ static int set_frwr_li_wr(void **seg, struct ib_send_wr *wr, int *size,
 	return 0;
 }
 
+static void set_indir_mkey_segment(struct mlx5_mkey_seg *seg,
+				   struct ib_send_wr *wr, u32 pdn)
+{
+	u32 list_len = wr->wr.indir_reg.indir_list_len;
+
+	memset(seg, 0, sizeof(*seg));
+
+	seg->flags = get_umr_flags(wr->wr.indir_reg.access_flags) |
+				   MLX5_ACCESS_MODE_KLM;
+	seg->qpn_mkey7_0 = cpu_to_be32(0xffffff00 |
+			   mlx5_mkey_variant(wr->wr.indir_reg.mkey));
+	seg->flags_pd = cpu_to_be32(MLX5_MKEY_REMOTE_INVAL | pdn);
+	seg->len = cpu_to_be64(wr->wr.indir_reg.length);
+	seg->start_addr = cpu_to_be64(wr->wr.indir_reg.iova_start);
+	seg->xlt_oct_size =
+		cpu_to_be32(be16_to_cpu(get_klm_octo(list_len * 2)));
+}
+
+static void set_indir_data_seg(struct ib_send_wr *wr, struct mlx5_ib_qp *qp,
+			       u32 pa_key, void **seg, int *size)
+{
+	struct mlx5_wqe_data_seg *data = *seg;
+	struct mlx5_ib_indir_reg_list *mirl;
+	struct ib_sge *sg_list = wr->wr.indir_reg.indir_list->sg_list;
+	u32 list_len = wr->wr.indir_reg.indir_list_len;
+	int i;
+
+	mirl = to_mindir_list(wr->wr.indir_reg.indir_list);
+	for (i = 0; i < list_len; i++) {
+		mirl->klms[i].va = cpu_to_be64(sg_list[i].addr);
+		mirl->klms[i].key = cpu_to_be32(sg_list[i].lkey);
+		mirl->klms[i].bcount = cpu_to_be32(sg_list[i].length);
+	}
+
+	data->byte_count = cpu_to_be32(ALIGN(sizeof(struct mlx5_klm) *
+				       list_len, 64));
+	data->lkey = cpu_to_be32(pa_key);
+	data->addr = cpu_to_be64(mirl->map);
+	*seg += sizeof(*data);
+	*size += sizeof(*data) / 16;
+}
+
+static void set_indir_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
+				  struct ib_send_wr *wr)
+{
+	u64 mask;
+	u32 list_len = wr->wr.indir_reg.indir_list_len;
+
+	memset(umr, 0, sizeof(*umr));
+
+	umr->flags = MLX5_UMR_CHECK_NOT_FREE;
+	umr->klm_octowords = get_klm_octo(list_len * 2);
+	mask = MLX5_MKEY_MASK_LEN		|
+		MLX5_MKEY_MASK_PAGE_SIZE	|
+		MLX5_MKEY_MASK_START_ADDR	|
+		MLX5_MKEY_MASK_EN_RINVAL	|
+		MLX5_MKEY_MASK_KEY		|
+		MLX5_MKEY_MASK_LR		|
+		MLX5_MKEY_MASK_LW		|
+		MLX5_MKEY_MASK_RR		|
+		MLX5_MKEY_MASK_RW		|
+		MLX5_MKEY_MASK_A		|
+		MLX5_MKEY_MASK_FREE;
+
+	umr->mkey_mask = cpu_to_be64(mask);
+}
+
+static int set_indir_reg_wr(struct ib_send_wr *wr, struct mlx5_ib_qp *qp,
+			    void **seg, int *size)
+{
+	struct mlx5_ib_pd *pd = get_pd(qp);
+
+	if (unlikely(wr->send_flags & IB_SEND_INLINE))
+		return -EINVAL;
+
+	set_indir_umr_segment(*seg, wr);
+	*seg += sizeof(struct mlx5_wqe_umr_ctrl_seg);
+	*size += sizeof(struct mlx5_wqe_umr_ctrl_seg) / 16;
+	if (unlikely(*seg == qp->sq.qend))
+		*seg = mlx5_get_send_wqe(qp, 0);
+
+	set_indir_mkey_segment(*seg, wr, pd->pdn);
+	*seg += sizeof(struct mlx5_mkey_seg);
+	*size += sizeof(struct mlx5_mkey_seg) / 16;
+	if (unlikely(*seg == qp->sq.qend))
+		*seg = mlx5_get_send_wqe(qp, 0);
+
+	set_indir_data_seg(wr, qp, pd->pa_lkey, seg, size);
+
+	return 0;
+}
+
 static void dump_wqe(struct mlx5_ib_qp *qp, int idx, int size_16)
 {
 	__be32 *p = NULL;
@@ -2688,6 +2781,19 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 				num_sge = 0;
 				break;
 
+			case IB_WR_REG_INDIR_MR:
+				next_fence = MLX5_FENCE_MODE_INITIATOR_SMALL;
+				qp->sq.wr_data[idx] = IB_WR_REG_INDIR_MR;
+				ctrl->imm = cpu_to_be32(wr->wr.indir_reg.mkey);
+				err = set_indir_reg_wr(wr, qp, &seg, &size);
+				if (err) {
+					mlx5_ib_warn(dev, "Failed to set indir_reg wqe\n");
+					*bad_wr = wr;
+					goto out;
+				}
+				num_sge = 0;
+				break;
+
 			case IB_WR_REG_SIG_MR:
 				qp->sq.wr_data[idx] = IB_WR_REG_SIG_MR;
 				mr = to_mmr(wr->wr.sig_handover.sig_mr);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] IB/iser: Pass iser device to registration routines
       [not found] ` <1433769339-949-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 13:15   ` [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API Sagi Grimberg
  2015-06-08 13:15   ` [PATCH 2/5] IB/mlx5: Implement Fast Indirect Memory Registration Feature Sagi Grimberg
@ 2015-06-08 13:15   ` Sagi Grimberg
  2015-06-08 13:15   ` [PATCH 4/5] IB/iser: Add indirect registration support Sagi Grimberg
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-08 13:15 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Sagi Grimberg

This will be needed when the device capability flags are checked
for indirect registration support.

This patch does not change any functionality.

Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Adir Lev <adirl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/iser/iser_verbs.c |   18 +++++++++++-------
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 6a407a3..3267a9c 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -279,9 +279,11 @@ void iser_free_fmr_pool(struct ib_conn *ib_conn)
 }
 
 static int
-iser_alloc_reg_res(struct ib_device *ib_device, struct ib_pd *pd,
+iser_alloc_reg_res(struct iser_device *device,
+		   struct ib_pd *pd,
 		   struct iser_reg_resources *res)
 {
+	struct ib_device *ib_device = device->ib_device;
 	int ret;
 
 	res->frpl = ib_alloc_fast_reg_page_list(ib_device,
@@ -317,7 +319,8 @@ iser_free_reg_res(struct iser_reg_resources *rsc)
 }
 
 static int
-iser_alloc_pi_ctx(struct ib_device *ib_device, struct ib_pd *pd,
+iser_alloc_pi_ctx(struct iser_device *device,
+		  struct ib_pd *pd,
 		  struct iser_fr_desc *desc)
 {
 	struct iser_pi_context *pi_ctx = NULL;
@@ -331,7 +334,7 @@ iser_alloc_pi_ctx(struct ib_device *ib_device, struct ib_pd *pd,
 
 	pi_ctx = desc->pi_ctx;
 
-	ret = iser_alloc_reg_res(ib_device, pd, &pi_ctx->rsc);
+	ret = iser_alloc_reg_res(device, pd, &pi_ctx->rsc);
 	if (ret) {
 		iser_err("failed to allocate reg_resources\n");
 		goto alloc_reg_res_err;
@@ -364,7 +367,8 @@ iser_free_pi_ctx(struct iser_pi_context *pi_ctx)
 }
 
 static struct iser_fr_desc *
-iser_create_fastreg_desc(struct ib_device *ib_device, struct ib_pd *pd,
+iser_create_fastreg_desc(struct iser_device *device,
+			 struct ib_pd *pd,
 			 bool pi_enable)
 {
 	struct iser_fr_desc *desc;
@@ -374,12 +378,12 @@ iser_create_fastreg_desc(struct ib_device *ib_device, struct ib_pd *pd,
 	if (!desc)
 		return ERR_PTR(-ENOMEM);
 
-	ret = iser_alloc_reg_res(ib_device, pd, &desc->rsc);
+	ret = iser_alloc_reg_res(device, pd, &desc->rsc);
 	if (ret)
 		goto reg_res_alloc_failure;
 
 	if (pi_enable) {
-		ret = iser_alloc_pi_ctx(ib_device, pd, desc);
+		ret = iser_alloc_pi_ctx(device, pd, desc);
 		if (ret)
 			goto pi_ctx_alloc_failure;
 	}
@@ -410,7 +414,7 @@ int iser_alloc_fastreg_pool(struct ib_conn *ib_conn, unsigned cmds_max)
 	spin_lock_init(&fr_pool->lock);
 	fr_pool->size = 0;
 	for (i = 0; i < cmds_max; i++) {
-		desc = iser_create_fastreg_desc(device->ib_device, device->pd,
+		desc = iser_create_fastreg_desc(device, device->pd,
 						ib_conn->pi_support);
 		if (IS_ERR(desc))
 			goto err;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] IB/iser: Add indirect registration support
       [not found] ` <1433769339-949-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-06-08 13:15   ` [PATCH 3/5] IB/iser: Pass iser device to registration routines Sagi Grimberg
@ 2015-06-08 13:15   ` Sagi Grimberg
  2015-06-08 13:15   ` [PATCH 5/5] IB/iser: Add debug prints to the various memory registration methods Sagi Grimberg
  2015-06-08 13:22   ` [PATCH 0/5] Indirect memory registration feature Christoph Hellwig
  5 siblings, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-08 13:15 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Sagi Grimberg, Adir Lev

From: Adir Lev <adirl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

In case the SG list handed to us is not nicely page
aligned, we can now use indirect registration instead
of using a bounce buffer. This saves significant CPU
load for large unaligned IOs.
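
For reference, "nicely page aligned" here means that only the first
element may start at an offset and only the last may end short of a
page boundary. A conceptual sketch of such a check (an illustration,
not the actual iser_data_buf_aligned_len() implementation, which works
on the DMA-mapped addresses) could look like:

    static bool sg_list_is_page_aligned(struct scatterlist *sgl, int nents)
    {
            struct scatterlist *sg;
            int i;

            for_each_sg(sgl, sg, nents, i) {
                    /* only the first element may start at an offset */
                    if (i > 0 && sg->offset)
                            return false;
                    /* only the last element may end off a page boundary */
                    if (i < nents - 1 &&
                        ((sg->offset + sg->length) & ~PAGE_MASK))
                            return false;
            }
            return true;
    }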

Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Adir Lev <adirl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/iser/iscsi_iser.h  |    8 +++
 drivers/infiniband/ulp/iser/iser_memory.c |   98 +++++++++++++++++++++++++++--
 drivers/infiniband/ulp/iser/iser_verbs.c  |   35 ++++++++++-
 3 files changed, 135 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 9365343..3cabccd 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -225,6 +225,7 @@ enum iser_data_dir {
  * @orig_sg:      pointer to the original sg list (in case
  *                we used a copy)
  * @orig_size:    num entris of orig sg list
+ * @aligned:      indicate if the data buffer is block aligned
  */
 struct iser_data_buf {
 	struct scatterlist *sg;
@@ -233,6 +234,7 @@ struct iser_data_buf {
 	unsigned int       dma_nents;
 	struct scatterlist *orig_sg;
 	unsigned int       orig_size;
+	bool               aligned;
   };
 
 /* fwd declarations */
@@ -389,7 +391,10 @@ struct iser_device {
  * @fmr_pool:   pool of fmrs
  * @frpl:       fast reg page list used by frwrs
  * @page_vec:   fast reg page list used by fmr pool
+ * @indir_mr:   indirect memory region
+ * @indir_rl:   indirect registration list
  * @mr_valid:   is mr valid indicator
+ * @indir_mr_valid: is indirect mr valid indicator
  */
 struct iser_reg_resources {
 	union {
@@ -400,7 +405,10 @@ struct iser_reg_resources {
 		struct ib_fast_reg_page_list     *frpl;
 		struct iser_page_vec             *page_vec;
 	};
+	struct ib_mr                     *indir_mr;
+	struct ib_indir_reg_list         *indir_rl;
 	u8				  mr_valid:1;
+	u8                                indir_mr_valid:1;
 };
 
 /**
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index b1261d5..de5c7da 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -782,6 +782,79 @@ static int iser_fast_reg_mr(struct iscsi_iser_task *iser_task,
 }
 
 static int
+iser_sg_to_ivec(struct iser_data_buf *mem,
+		struct iser_device *device,
+		struct ib_sge *sg_list)
+{
+	struct scatterlist *sg;
+	struct ib_sge *sge;
+	int i, total_len = 0;
+
+	for_each_sg(mem->sg, sg, mem->dma_nents, i) {
+		sge = &sg_list[i];
+		sge->addr = ib_sg_dma_address(device->ib_device, sg);
+		sge->length = ib_sg_dma_len(device->ib_device, sg);
+		sge->lkey = device->mr->lkey;
+		total_len += sge->length;
+	}
+
+	return total_len;
+}
+
+static int
+iser_reg_indir_mem(struct iscsi_iser_task *iser_task,
+		   struct iser_data_buf *mem,
+		   struct iser_reg_resources *rsc,
+		   struct iser_mem_reg *reg)
+{
+	struct ib_conn *ib_conn = &iser_task->iser_conn->ib_conn;
+	struct iser_device *device = ib_conn->device;
+	struct ib_send_wr indir_wr, inv_wr;
+	struct ib_send_wr *bad_wr, *wr = NULL;
+	int total_len;
+	int ret;
+
+	iser_task->iser_conn->iscsi_conn->fmr_unalign_cnt++;
+
+	total_len = iser_sg_to_ivec(mem, device, rsc->indir_rl->sg_list);
+
+	if (!rsc->indir_mr_valid) {
+		iser_inv_rkey(&inv_wr, rsc->indir_mr);
+		wr = &inv_wr;
+	}
+
+	memset(&indir_wr, 0, sizeof(indir_wr));
+	indir_wr.opcode = IB_WR_REG_INDIR_MR;
+	indir_wr.wr_id = ISER_FASTREG_LI_WRID;
+	indir_wr.wr.indir_reg.mkey = rsc->indir_mr->rkey;
+	indir_wr.wr.indir_reg.iova_start = rsc->indir_rl->sg_list[0].addr;
+	indir_wr.wr.indir_reg.indir_list = rsc->indir_rl;
+	indir_wr.wr.indir_reg.indir_list_len = mem->size;
+	indir_wr.wr.indir_reg.length = (u64)total_len;
+	indir_wr.wr.indir_reg.access_flags = IB_ACCESS_REMOTE_READ  |
+					     IB_ACCESS_REMOTE_WRITE |
+					     IB_ACCESS_LOCAL_WRITE;
+	if (!wr)
+		wr = &indir_wr;
+	else
+		wr->next = &indir_wr;
+
+	ret = ib_post_send(ib_conn->qp, wr, &bad_wr);
+	if (ret) {
+		iser_err("indirect_reg failed, ret:%d\n", ret);
+		return ret;
+	}
+	rsc->indir_mr_valid = 0;
+
+	reg->sge.lkey = rsc->indir_mr->lkey;
+	reg->rkey = rsc->indir_mr->rkey;
+	reg->sge.addr = indir_wr.wr.indir_reg.iova_start;
+	reg->sge.length = indir_wr.wr.indir_reg.length;
+
+	return 0;
+}
+
+static int
 iser_handle_unaligned_buf(struct iscsi_iser_task *task,
 			  struct iser_data_buf *mem,
 			  enum iser_data_dir dir)
@@ -792,11 +865,20 @@ iser_handle_unaligned_buf(struct iscsi_iser_task *task,
 
 	aligned_len = iser_data_buf_aligned_len(mem, device->ib_device);
 	if (aligned_len != mem->dma_nents) {
-		err = fall_to_bounce_buf(task, mem, dir);
-		if (err)
-			return err;
+		if (device->dev_attr.device_cap_flags &
+		    IB_DEVICE_INDIR_REGISTRATION) {
+			mem->aligned = false;
+			goto done;
+		} else {
+			err = fall_to_bounce_buf(task, mem, dir);
+			if (err)
+				return err;
+		}
 	}
 
+	mem->aligned = true;
+
+done:
 	return 0;
 }
 
@@ -810,8 +892,11 @@ iser_reg_prot_sg(struct iscsi_iser_task *task,
 
 	if (mem->dma_nents == 1)
 		return iser_reg_dma(device, mem, reg);
+	else if (mem->aligned)
+		return device->reg_ops->reg_mem(task, mem,
+						&desc->pi_ctx->rsc, reg);
 
-	return device->reg_ops->reg_mem(task, mem, &desc->pi_ctx->rsc, reg);
+	return iser_reg_indir_mem(task, mem, &desc->rsc, reg);
 }
 
 static int
@@ -824,8 +909,11 @@ iser_reg_data_sg(struct iscsi_iser_task *task,
 
 	if (mem->dma_nents == 1)
 		return iser_reg_dma(device, mem, reg);
+	else if (mem->aligned)
+		return device->reg_ops->reg_mem(task, mem,
+						&desc->rsc, reg);
 
-	return device->reg_ops->reg_mem(task, mem, &desc->rsc, reg);
+	return iser_reg_indir_mem(task, mem, &desc->rsc, reg);
 }
 
 int iser_reg_rdma_mem(struct iscsi_iser_task *task,
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 3267a9c..713f3a9 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -286,7 +286,7 @@ iser_alloc_reg_res(struct iser_device *device,
 	struct ib_device *ib_device = device->ib_device;
 	int ret;
 
-	res->frpl = ib_alloc_fast_reg_page_list(ib_device,
+	res->frpl = ib_alloc_fast_reg_page_list(device->ib_device,
 						ISCSI_ISER_SG_TABLESIZE + 1);
 	if (IS_ERR(res->frpl)) {
 		ret = PTR_ERR(res->frpl);
@@ -303,8 +303,37 @@ iser_alloc_reg_res(struct iser_device *device,
 	}
 	res->mr_valid = 1;
 
+	if (device->dev_attr.device_cap_flags & IB_DEVICE_INDIR_REGISTRATION) {
+		struct ib_mr_init_attr mr_attr;
+
+		res->indir_rl = ib_alloc_indir_reg_list(ib_device,
+						ISCSI_ISER_SG_TABLESIZE);
+		if (IS_ERR(res->indir_rl)) {
+			ret = PTR_ERR(res->indir_rl);
+			iser_err("Failed to allocate ib_indir_reg_list err=%d\n",
+				 ret);
+			goto indir_reg_list_failure;
+		}
+
+		memset(&mr_attr, 0, sizeof(mr_attr));
+		mr_attr.flags = IB_MR_INDIRECT_REG;
+		mr_attr.max_reg_descriptors = ISCSI_ISER_SG_TABLESIZE;
+		res->indir_mr = ib_create_mr(pd, &mr_attr);
+		if (IS_ERR(res->indir_mr)) {
+			ret = PTR_ERR(res->indir_mr);
+			iser_err("Failed to allocate indir mr err=%d\n",
+				 ret);
+			goto indir_mr_failure;
+		}
+		res->indir_mr_valid = 1;
+	}
+
 	return 0;
 
+indir_mr_failure:
+	ib_free_indir_reg_list(res->indir_rl);
+indir_reg_list_failure:
+	ib_dereg_mr(res->mr);
 fast_reg_mr_failure:
 	ib_free_fast_reg_page_list(res->frpl);
 
@@ -316,6 +345,10 @@ iser_free_reg_res(struct iser_reg_resources *rsc)
 {
 	ib_dereg_mr(rsc->mr);
 	ib_free_fast_reg_page_list(rsc->frpl);
+	if (rsc->indir_mr)
+		ib_dereg_mr(rsc->indir_mr);
+	if (rsc->indir_rl)
+		ib_free_indir_reg_list(rsc->indir_rl);
 }
 
 static int
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] IB/iser: Add debug prints to the various memory registration methods
       [not found] ` <1433769339-949-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-06-08 13:15   ` [PATCH 4/5] IB/iser: Add indirect registration support Sagi Grimberg
@ 2015-06-08 13:15   ` Sagi Grimberg
  2015-06-08 13:22   ` [PATCH 0/5] Indirect memory registration feature Christoph Hellwig
  5 siblings, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-08 13:15 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Sagi Grimberg

Easier to debug when we have the registration details.

This patch does not change any functionality.

Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Adir Lev <adirl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/iser/iser_memory.c |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index de5c7da..354eedd 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -545,6 +545,10 @@ int iser_fast_reg_fmr(struct iscsi_iser_task *iser_task,
 	reg->sge.length = page_vec->data_size;
 	reg->mem_h = fmr;
 
+	iser_dbg("fmr reg: lkey=0x%x, rkey=0x%x, addr=0x%llx,"
+		 " length=0x%x\n", reg->sge.lkey, reg->rkey,
+		 reg->sge.addr, reg->sge.length);
+
 	return 0;
 }
 
@@ -715,7 +719,7 @@ iser_reg_sig_mr(struct iscsi_iser_task *iser_task,
 	sig_reg->sge.addr = 0;
 	sig_reg->sge.length = scsi_transfer_length(iser_task->sc);
 
-	iser_dbg("sig_sge: lkey: 0x%x, rkey: 0x%x, addr: 0x%llx, length: %u\n",
+	iser_dbg("sig reg: lkey: 0x%x, rkey: 0x%x, addr: 0x%llx, length: %u\n",
 		 sig_reg->sge.lkey, sig_reg->rkey, sig_reg->sge.addr,
 		 sig_reg->sge.length);
 err:
@@ -778,6 +782,10 @@ static int iser_fast_reg_mr(struct iscsi_iser_task *iser_task,
 	reg->sge.addr = frpl->page_list[0] + offset;
 	reg->sge.length = size;
 
+	iser_dbg("fast reg: lkey=0x%x, rkey=0x%x, addr=0x%llx,"
+		 " length=0x%x\n", reg->sge.lkey, reg->rkey,
+		 reg->sge.addr, reg->sge.length);
+
 	return ret;
 }
 
@@ -851,6 +859,10 @@ iser_reg_indir_mem(struct iscsi_iser_task *iser_task,
 	reg->sge.addr = indir_wr.wr.indir_reg.iova_start;
 	reg->sge.length = indir_wr.wr.indir_reg.length;
 
+	iser_dbg("indir reg: lkey=0x%x, rkey=0x%x, addr=0x%llx,"
+		 " length=0x%x\n", reg->sge.lkey, reg->rkey,
+		 reg->sge.addr, reg->sge.length);
+
 	return 0;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found] ` <1433769339-949-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-06-08 13:15   ` [PATCH 5/5] IB/iser: Add debug prints to the various memory registration methods Sagi Grimberg
@ 2015-06-08 13:22   ` Christoph Hellwig
       [not found]     ` <20150608132254.GA14773-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  5 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2015-06-08 13:22 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz,
	Eli Cohen, Oren Duer, Sagi Grimberg

On Mon, Jun 08, 2015 at 04:15:34PM +0300, Sagi Grimberg wrote:
> There are couple of possible (sub-optimal) solutions to handle this limitation:

There is another one:  Set the block level SG_GAPS flags to ensure
the block layer never generates these SG lists.

Right now drivers that set these would always fail SG_IO ioctls using
iovecs, but that can be fixed much more easily by doing high level
bounce buffering compared to these horrible workarounds deep down in the
stack.
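
For reference, a minimal sketch of what setting that flag looks like in an
LLD on kernels of this vintage (the function name is a placeholder; a SCSI
LLD could do this from its slave_configure hook):

    static int foo_slave_configure(struct scsi_device *sdev)
    {
            /* tell the block layer never to merge bios/requests in a way
             * that creates gaps between SG elements */
            queue_flag_set_unlocked(QUEUE_FLAG_SG_GAPS, sdev->request_queue);
            return 0;
    }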


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found]     ` <20150608132254.GA14773-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-06-08 13:39       ` Sagi Grimberg
       [not found]         ` <55759B0B.8050805-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-08 13:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz,
	Eli Cohen, Oren Duer, Sagi Grimberg

On 6/8/2015 4:22 PM, Christoph Hellwig wrote:
> On Mon, Jun 08, 2015 at 04:15:34PM +0300, Sagi Grimberg wrote:
>> There are couple of possible (sub-optimal) solutions to handle this limitation:
>
> There is another one:  Set the block level SG_GAPS flags to ensure
> the block layer never generates these SG lists.
>

Hi Christoph,

I have learned about QUEUE_FLAG_SG_GAPS from your recent comments on
the list. Won't setting this flag interfere with bio merges? Ideally
I wouldn't want to restrict merges if my device supports this feature.

> Right now drivers that set these would always fail SG_IO ioctls using
> iovecs, but that can be fixed much more easily by doing high level
> bounce buffering compared to these horrible workarounds deep down in the
> stack.
>

My tests have shown significant CPU savings with this compared to SW
bounce buffering, especially for large transfers. I've seen applications
where a significant part of the IO workload consists of this type of
memory layout.

I must say that I've always considered SW bounce buffering as a
work-around for an RDMA driver.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found]         ` <55759B0B.8050805-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-08 13:51           ` Christoph Hellwig
       [not found]             ` <20150608135151.GA14021-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2015-06-08 13:51 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Sagi Grimberg

On Mon, Jun 08, 2015 at 04:39:23PM +0300, Sagi Grimberg wrote:
> I have learned about QUEUE_FLAG_SG_GAPS from your recent comments on
> the list. Won't setting this flag interfere with bio merges?

It would if any in-kernel consumer actually generated such SG
lists.  But none does - the only source is vectored SG_IO support.

> Ideally
> I wouldn't want to restrict merges if my device supports this feature.
> 
> >Right now drivers that set these would always fail SG_IO ioctls using
> >iovecs, but that can be fixed much more easily by doing high level
> >bounce buffering compared to these horrible workarounds deep down in the
> >stack.
> >
> 
> My tests have shown significant cpu savings with this against SW bounce
> buffering, especially for large transfers. I've seen applications that
> a significant part of their IO workload consists of this type of memory
> layout.
> 
> I must say that I've always considered SW bounce buffering as a
> work-around for an RDMA driver.

bounce buffering is always a workaround, and offloading it to silicon
just makes it a worse hack.  So please fix it in the proper layers
first, and if you really care about the vectored SG_IO case use
your indirect registrations just for that.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found]             ` <20150608135151.GA14021-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-06-08 14:42               ` Sagi Grimberg
       [not found]                 ` <5575A9C7.7000409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-08 14:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz,
	Eli Cohen, Oren Duer

On 6/8/2015 4:51 PM, Christoph Hellwig wrote:
> On Mon, Jun 08, 2015 at 04:39:23PM +0300, Sagi Grimberg wrote:
>> I have learned about QUEUE_FLAG_SG_GAPS from your recent comments on
>> the list. Won't setting this flag interfere with bio merges?
>
> It would if any in-kernel consumer actually generated such SG
> lists.  But none does - the only source is vectored SG_IO support.

I think O_DIRECT allows user-space to generate vectored SG lists that
are not aligned as well.
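
A small userspace illustration of that (a sketch; fd is assumed to be a
file opened with O_DIRECT on a device with a 512B logical block size,
boilerplate and error handling omitted):

    void *a, *b;
    struct iovec iov[2];

    posix_memalign(&a, 4096, 4096);
    posix_memalign(&b, 4096, 4096);

    /* both iovecs are 512B aligned as O_DIRECT requires, but iov[0]
     * ends in the middle of a page while more data follows, so the
     * resulting bio/SG list has a gap */
    iov[0] = (struct iovec){ .iov_base = (char *)a + 512, .iov_len = 1024 };
    iov[1] = (struct iovec){ .iov_base = b,               .iov_len = 512 };

    preadv(fd, iov, 2, 0);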

>>
>> I must say that I've always considered SW bounce buffering as a
>> work-around for an RDMA driver.
>
> bounce buffering is always a workaround, and offloading it to silicon
> just makes it a worse hack.

I wouldn't say this is about offloading bounce buffering to silicon.
The RDMA stack has always imposed the alignment limitation because we
can only give page lists to the devices. Other drivers (qlogic/emulex FC
drivers, for example) use _arbitrary_ SG lists where each element can
point to any {addr, len}.

This puts the RDMA stack in line with those devices, effectively
allowing registration of arbitrary SG lists.

> So please fix it in the proper layers
> first,

I agree that we can take care of bounce buffering in the block layer
(or scsi for SG_IO) if the driver doesn't want to see any type of
unaligned SG lists.

But do you think that it should come before the stack can support this?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]     ` <1433769339-949-2-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-08 20:49       ` Hefty, Sean
       [not found]         ` <1828884A29C6694DAF28B7E6B8A82373A8FE5C7C-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Hefty, Sean @ 2015-06-08 20:49 UTC (permalink / raw)
  To: Sagi Grimberg, Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Sagi Grimberg

> In order to support that we provide the user with an interface
> to pass a scattered list of buffers to the IB core layer called
> ib_indir_reg_list and provide a new send work request opcode
> called IB_WR_REG_INDIR_MR. We extend wr union with a new type of
> memory registration called indir_reg where the user can place the
> relevant information to perform such a memory registration.
> 
> The verbs user is expected to perform these steps:
> 0. Make sure that the device supports Indirect memory registration via
>    ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and make sure
>    that ib_device_attr max_indir_reg_mr_list_len suffice for the
>    expected scatterlist length
> 
> 1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
>    This is done via ib_create_mr() with:
>    mr_init_attr.flags = IB_MR_INDIRECT_REG
> 
> 2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
>    pointers. This is done via new ib_alloc_indir_reg_list() verb
> 
> 3. Fill the scattered buffers in ib_indir_reg_list.sg_list
> 
> 4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
>    provide the filled ib_indir_reg_list
> 
> 5. Perform data transfer
> 
> 6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)
> 
> 7. Free indirect MR and ib_indir_reg_list via
>    ib_dereg_mr() and ib_free_indir_reg_list()

IMO, we need to introduce vendor specific header files and interfaces.  It is unmaintainable to drive an API from the bottom up and expose the 'bare metal' implementation of a bunch of disjoint pieces of hardware.  (Yeah, because we need yet another way of registering memory... just reading the process for yet another set of hoops that an app must jump through in order to register memory makes my head hurt.)  Everything about this continual approach needs to end.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found]                 ` <5575A9C7.7000409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-09  6:20                   ` Christoph Hellwig
       [not found]                     ` <20150609062054.GA13011-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2015-06-09  7:41                   ` Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2015-06-09  6:20 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer

On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
> I wouldn't say this is about offloading bounce buffering to silicon.
> The RDMA stack always imposed the alignment limitation as we can only
> give a page lists to the devices. Other drivers (qlogic/emulex FC
> drivers for example), use an _arbitrary_ SG lists where each element can
> point to any {addr, len}.

Those are drivers for protocols that support real SG lists.   It seems
only Infiniband and NVMe expose this silly limit.

> >So please fix it in the proper layers
> >first,
> 
> I agree that we can take care of bounce buffering in the block layer
> (or scsi for SG_IO) if the driver doesn't want to see any type of
> unaligned SG lists.
> 
> But do you think that it should come before the stack can support this?

Yes, absolutely.  The other thing that needs to come first is a proper
abstraction for MRs instead of hacking another type into all drivers.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found]                 ` <5575A9C7.7000409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-06-09  6:20                   ` Christoph Hellwig
@ 2015-06-09  7:41                   ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2015-06-09  7:41 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer

On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
> I think O_DIRECT allows user-space to generate vectored SG lists that
> are not aligned as well.

Indeed if you do sub-page size direct I/O.  Although SG_GAPS helps with
that.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found]                     ` <20150609062054.GA13011-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-06-09  8:44                       ` Sagi Grimberg
       [not found]                         ` <5576A760.4090004-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-09  8:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz,
	Eli Cohen, Oren Duer, Bart Van Assche, Chuck Lever

On 6/9/2015 9:20 AM, Christoph Hellwig wrote:
> On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
>> I wouldn't say this is about offloading bounce buffering to silicon.
>> The RDMA stack always imposed the alignment limitation as we can only
>> give a page lists to the devices. Other drivers (qlogic/emulex FC
>> drivers for example), use an _arbitrary_ SG lists where each element can
>> point to any {addr, len}.
>
> Those are drivers for protocols that support real SG lists.   It seems
> only Infiniband and NVMe expose this silly limit.

I agree this is indeed a limitation and that's why SG_GAPS was added
in the first place. I think the next gen of nvme devices will support 
real SG lists. This feature enables existing Infiniband devices that can 
handle SG lists to receive them via the RDMA stack (ib_core).

If the memory registration process wasn't such a big fiasco in the
first place, wouldn't this way make the most sense?

>
>>> So please fix it in the proper layers
>>> first,
>>
>> I agree that we can take care of bounce buffering in the block layer
>> (or scsi for SG_IO) if the driver doesn't want to see any type of
>> unaligned SG lists.
>>
>> But do you think that it should come before the stack can support this?
>
> Yes, absolutely.  The other thing that needs to come first is a proper
> abstraction for MRs instead of hacking another type into all drivers.
>

I'm very much open to the idea of consolidating the memory registration
code behind a general memory registration API instead of duplicating it
in every ULP (srp, iser, xprtrdma, svcrdma, rds, more to come...). The
main challenge is abstracting the different methods (and considerations)
of memory registration behind one API. Do we completely mask out how the
registration is done? I'm worried that we might end up either
compromising on performance or trying to understand too much about what
the caller is trying to achieve.

For example:
- frwr requires a queue-pair for the post (and it must be the ULP
   queue-pair to ensure the registration is done before the data
   transfer begins), while fmr does not need a queue-pair.

- the ULP would probably always initiate data transfer after the
   registration (send a request or do the rdma r/w). It is useful to
   link the frwr post with the next wr in a single post_send call
   (see the sketch after this list). I wonder how an API would allow
   such a thing (while other registration methods don't use the work
   request interface).

- There is the fmr_pool API which tries to tackle the disadvantages of
   fmrs (very slow unmap) by delaying the fmr unmap until some dirty
   watermark of remapping is met. I'm not sure how this can be done.

- How would the API choose the method to register memory?

- If there is an alignment issue, do we fail? do we bounce?

- There is the whole T10-DIF support...

...
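
A sketch of the chaining meant above, with today's fast registration
interface (frpl, npages, iova, len, mr, sge, remote_addr, remote_rkey
and qp are placeholders):

    struct ib_send_wr fastreg_wr, rdma_wr, *bad_wr;

    memset(&fastreg_wr, 0, sizeof(fastreg_wr));
    fastreg_wr.opcode                    = IB_WR_FAST_REG_MR;
    fastreg_wr.wr.fast_reg.iova_start    = iova;
    fastreg_wr.wr.fast_reg.page_list     = frpl;
    fastreg_wr.wr.fast_reg.page_list_len = npages;
    fastreg_wr.wr.fast_reg.page_shift    = PAGE_SHIFT;
    fastreg_wr.wr.fast_reg.length        = len;
    fastreg_wr.wr.fast_reg.rkey          = mr->rkey;
    fastreg_wr.wr.fast_reg.access_flags  = IB_ACCESS_LOCAL_WRITE |
                                           IB_ACCESS_REMOTE_WRITE;
    /* registration and data transfer posted in one call */
    fastreg_wr.next = &rdma_wr;

    memset(&rdma_wr, 0, sizeof(rdma_wr));
    rdma_wr.opcode              = IB_WR_RDMA_WRITE;
    rdma_wr.send_flags          = IB_SEND_SIGNALED;
    rdma_wr.sg_list             = &sge;   /* local buffer covered by mr */
    rdma_wr.num_sge             = 1;
    rdma_wr.wr.rdma.remote_addr = remote_addr;
    rdma_wr.wr.rdma.rkey        = remote_rkey;

    ib_post_send(qp, &fastreg_wr, &bad_wr);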

CC'ing Bart & Chuck who share the suffer of memory registration.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found]                         ` <5576A760.4090004-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-09 11:14                           ` Christoph Hellwig
  2015-06-09 15:06                           ` Chuck Lever
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2015-06-09 11:14 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Bart Van Assche, Chuck Lever

On Tue, Jun 09, 2015 at 11:44:16AM +0300, Sagi Grimberg wrote:
> For example:
> - frwr requires a queue-pair for the post (and it must be the ULP
>   queue-pair to ensure the registration is done before the data transfer
>   begins), while fmrs do not need a queue-pair at all.

How do you transfer data without a queue pair?

> - the ULP would probably always initiate data transfer after the
>   registration (send a request or do the rdma r/w). It is useful to
>   link the frwr post with the next wr in a single post_send call.
>   I wonder how an API would allow such a thing (while other registration
>   methods don't use the work request interface).
> 
> - There is the fmr_pool API which tries to tackle the disadvantages of
>   fmrs (very slow unmap) by delaying the fmr unmap until some dirty
>   watermark of remapping is met. I'm not sure how this can be done.
> 
> - How would the API choose the method to register memory?

Seems like the general preference in existing users is FRWR -> FMR -> FR
with an exception to use physical for single segment I/O requests.
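
Roughly, as a pure illustration (the enum and helper below are made up,
this is not something in the tree):

/* pick a registration method from the device caps: FRWR -> FMR -> physical */
enum ulp_reg_method { ULP_REG_FRWR, ULP_REG_FMR, ULP_REG_PHYS };

static enum ulp_reg_method choose_reg_method(struct ib_device *dev,
					     struct ib_device_attr *attr,
					     int sg_nents)
{
	/* attr as returned by ib_query_device() */
	if (sg_nents == 1)
		return ULP_REG_PHYS;	/* single segment: no registration needed */
	if (attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS)
		return ULP_REG_FRWR;
	if (dev->alloc_fmr)
		return ULP_REG_FMR;
	return ULP_REG_PHYS;		/* caller has to split or bounce */
}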

> - If there is an alignment issue, do we fail? do we bounce?

Leave it to the high-level driver for now.  NFS can simply split
requests when the offset mismatches; for block drivers the enhanced
SG_GAPS can take care of it.  If indirect is supported the driver
does not need to set SG_GAPS.

> - There is the whole T10-DIF support...

Just tell the driver if the registration model supports it.  Do the
indirect registrations support it, btw?

As far as I can tell the NFS abstraction fits SRP fairly well.  iSER
is a bit of a mess with its bounce buffering so I'm not too sure about
that one.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] Indirect memory registration feature
       [not found]                         ` <5576A760.4090004-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-06-09 11:14                           ` Christoph Hellwig
@ 2015-06-09 15:06                           ` Chuck Lever
  1 sibling, 0 replies; 20+ messages in thread
From: Chuck Lever @ 2015-06-09 15:06 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Doug Ledford, linux-rdma, Or Gerlitz,
	Eli Cohen, Oren Duer, Bart Van Assche


On Jun 9, 2015, at 4:44 AM, Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On 6/9/2015 9:20 AM, Christoph Hellwig wrote:
>> On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
>>> I wouldn't say this is about offloading bounce buffering to silicon.
>>> The RDMA stack always imposed the alignment limitation as we can only
>>> give page lists to the devices. Other drivers (qlogic/emulex FC
>>> drivers, for example) use _arbitrary_ SG lists where each element can
>>> point to any {addr, len}.
>> 
>> Those are drivers for protocols that support real SG lists.   It seems
>> only Infiniband and NVMe expose this silly limit.
> 
> I agree this is indeed a limitation and that's why SG_GAPS was added
> in the first place. I think the next gen of nvme devices will support real SG lists. This feature enables existing Infiniband devices that can handle SG lists to receive them via the RDMA stack (ib_core).
> 
> If the memory registration process wasn't such a big fiasco in the
> first place, wouldn't this approach make the most sense?
> 
>> 
>>>> So please fix it in the proper layers
>>>> first,
>>> 
>>> I agree that we can take care of bounce buffering in the block layer
>>> (or scsi for SG_IO) if the driver doesn't want to see any type of
>>> unaligned SG lists.
>>> 
>>> But do you think that it should come before the stack can support this?
>> 
>> Yes, absolutely.  The other thing that needs to come first is a proper
>> abstraction for MRs instead of hacking another type into all drivers.
>> 
> 
> I'm very much open to the idea of consolidating the memory registration
> code instead of doing it in every ULP (srp, iser, xprtrdma, svcrdma,
> rds, more to come...) using a general memory registration API. The main
> challenge is to abstract the different methods (and considerations) of
> memory registration behind an API. Do we completely mask out the way we
> are doing it? I'm worried that we might end up either compromising on
> performance or trying to understand too much what the caller is trying
> to achieve.

The point of an API like this is to flatten the developer’s learning
curve at the cost of adding another layer of abstraction. For
in-kernel storage ULPs, I’m not convinced that’s a good trade-off.

The other major issue is dealing with multiple registration methods.
Having new ULPs stick with just one or two seems like it could take
care of that without fuss. HCA vendors seem to be settling on FRMR.

But FRMR has some limitations, some of which I discuss below. IMO it
would be better to think about improving existing limitations with
FRMR before building a shim over it. That could help both the
learning curve and the complexity issues.

On with the pre-caffeine musings:

> For example:
> - frwr requires a queue-pair for the post (and it must be the ULP
>  queue-pair to ensure the registration is done before the data-transfer
>  begins).

The QP does guarantee ordering between registration and use in an RDMA
transfer, but it comes at a cost.

And not just a QP is required, but a QP in RTS (ie, connected). Without
a connected QP, AIUI you can’t invalidate registered FRMRs, you have to
destroy them.

So, for RPC, if the server is down and there are pending RPCs, any FRMRs
associated with that transport are pinned (can’t be re-used or invalidated)
until the server comes back. If an RPC is killed or times out while the
server is down, the associated FRMRs are in limbo.

Registration via WR also means the ULP has to handle the registration and
invalidation completions (or decide to leave them unsignaled, but then
it has to worry about send queue wrapping).
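
For reference, the invalidate itself is just a LOCAL_INV work request;
roughly (the helper name is made up):

static int post_local_inv(struct ib_qp *qp, struct ib_mr *mr, bool signaled)
{
	struct ib_send_wr inv_wr, *bad_wr;

	memset(&inv_wr, 0, sizeof(inv_wr));
	inv_wr.opcode = IB_WR_LOCAL_INV;
	inv_wr.ex.invalidate_rkey = mr->rkey;
	/* unsignaled saves a completion, but the ULP must track SQ depth */
	inv_wr.send_flags = signaled ? IB_SEND_SIGNALED : 0;

	return ib_post_send(qp, &inv_wr, &bad_wr);
}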

All transport connect operations must be serialized with posting to ensure
that ib_post_send() has a valid (not NULL) QP and ID handle. And, if a QP
or ID handle is used during completion handling, you have to be careful
there too.

(Not to say any of this is impossible to deal with. Obviously RPC/RDMA
works today.)

> while fmrs do not need a queue-pair at all.

> - the ULP would probably always initiate data transfer after the
>  registration (send a request or do the rdma r/w). It is useful to
>  link the frwr post with the next wr in a single post_send call.
>  I wonder how an API would allow such a thing (while other registration
>  methods don't use the work request interface).

rkey management is also important. Registration is done before use so
the ULP can send the rkey for that FRMR to the remote to initiate the
remote DMA operation.

However, after a transport disconnect, the rkey in the FRMR may not be
the same one the hardware knows about. Recovery in this case means the
FRMR has to be destroyed. Invalidating the FRMR also requires the ULP
to know the hardware rkey, so it doesn’t work in this case.
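
(For context, the divergence comes from the usual practice of advancing
the 8-bit key portion of the rkey before each registration, roughly as
below; the helper name is made up, ib_update_fast_reg_key() is the
existing verb. If the registration WR never executes, the software copy
of the rkey moves ahead of what the HCA knows.)

static void frmr_refresh_key(struct ib_mr *mr)
{
	u8 key = mr->rkey & 0xff;	/* low byte is the consumer-owned key */

	ib_update_fast_reg_key(mr, key + 1);
}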

What all this means is that significant extra complexity is required to
deal with transport disconnection, which is very rare compared to normal
data transfer operations.

FMR, for instance, has much less of an issue here because the map and
unmap verbs are synchronous, do not rely on having a QP, and do not
generate send completions. But FMR has other problems.

> - There is the fmr_pool API which tries to tackle the disadvantages of
>  fmrs (very slow unmap) by delaying the fmr unmap until some dirty
>  watermark of remapping is met. I'm not sure how this can be done.

I wonder if FMR should be considered at all for a simplified API. Sure,
there are some older cards that do not support FRMR, but seems like FMR
is going to be abandoned sooner or later.

> - How would the API choose the method to register memory?

Based on the capabilities of the HCA, I would think.

> - If there is an alignment issue, do we fail? do we bounce?
> 
> - There is the whole T10-DIF support…

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]         ` <1828884A29C6694DAF28B7E6B8A82373A8FE5C7C-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-30 11:47           ` Sagi Grimberg
       [not found]             ` <559281B4.6010807-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-30 11:47 UTC (permalink / raw)
  To: Hefty, Sean, Sagi Grimberg, Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Boaz Harrosh, Liran Liss

On 6/8/2015 11:49 PM, Hefty, Sean wrote:

Sean,

>
> IMO, we need to introduce vendor specific header files and interfaces.
> It is unmaintainable to drive an API from the bottom up and expose the 'bare metal'
> implementation of a bunch of disjoint pieces of hardware.  (Yeah, because we need yet another way of registering memory...
> just reading the process for yet another set of hoops that an app must jump through in order to register memory makes my head hurt.)
> Everything about this continual approach needs to end.
>

I think that the related thread on this patchset makes it clear that
there is a fundamental limitation in the RDMA stack, in that it only
allows registering page lists and not generic SG lists. I strongly
disagree that this is an exposure of some bare-metal implementation.

Kernel 4.1 introduced the new pmem driver for byte addressable storage
(https://lwn.net/Articles/640115/). It won't be long before we see HA
models where secondary persistent memory devices will sit across an
RDMA fabric. 
(http://www.snia.org/sites/default/files/DougVoigt_RDMA_Requirements_for_HA.pdf)

We cannot afford to work around the stack's block alignment limitation
and still meet the latency requirements of such fast devices. This
means no bounce-buffering whatsoever, and we need efficient remote
access to multiple byte ranges.

Moreover, this feature also opens fast registration to user-space (i.e.
"user-space FRWR"). User-space cannot use FRWR as it has no access to
the physical memory mapping. Indirect registration allows users to
fast-register with ibv_sges. HPC applications need efficient access
to remote scatters.
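
Just to make the byte-range point concrete, this is the kind of list we
want to hand to the device directly (addresses and lengths are made up;
the actual indirect registration work request is defined in patch 1/5
and not repeated here):

static void fill_unaligned_sgl(struct ib_sge *sgl, u32 lkey)
{
	sgl[0].addr   = 0x10000200ULL;	/* starts 0x200 into a page */
	sgl[0].length = 0x0e00;
	sgl[0].lkey   = lkey;

	sgl[1].addr   = 0x20000080ULL;	/* 128 bytes in the middle of a page */
	sgl[1].length = 0x0080;
	sgl[1].lkey   = lkey;

	sgl[2].addr   = 0x30000000ULL;	/* page aligned, whole page */
	sgl[2].length = 0x1000;
	sgl[2].lkey   = lkey;
}

A classic fast-reg page list cannot describe the interior offsets above
without bouncing or splitting the I/O.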

Do we want to live with this limitation forever?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]             ` <559281B4.6010807-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-06-30 12:10               ` Christoph Hellwig
       [not found]                 ` <20150630121002.GA24169-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2015-06-30 12:10 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hefty, Sean, Sagi Grimberg, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Boaz Harrosh, Liran Liss

On Tue, Jun 30, 2015 at 02:47:00PM +0300, Sagi Grimberg wrote:
> Kernel 4.1 introduced the new pmem driver for byte addressable storage
> (https://lwn.net/Articles/640115/). It won't be long before we see HA
> models where secondary persistent memory devices will sit across an
> RDMA fabric. (http://www.snia.org/sites/default/files/DougVoigt_RDMA_Requirements_for_HA.pdf)

And what does this bullshitting slide have to do with our memory models?
In its current form the pmem driver can't even be used as a (R)DMA
target, so it's totally irrelevant for now.

Let's get a proper in-kernel interface for generic memory registrations
in place as a first step. While this will be a significant amount of work,
it'll help greatly with moving nasty implementation details out of
the drivers.  Adding additional registration methods will be easy
behind the back of a proper abstraction.

> Do we want to live with this limitation forever?

I don't think anyone has refused additional support.  Just the way
it's done in your patches is a giant nightmare.  The whole RDMA stack
needs a healthy dose of software engineering instead of blindly
following a working group of a marketing organisation.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]                 ` <20150630121002.GA24169-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-06-30 12:59                   ` Sagi Grimberg
       [not found]                     ` <559292CE.9010303-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2015-06-30 12:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hefty, Sean, Sagi Grimberg, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Boaz Harrosh, Liran Liss

On 6/30/2015 3:10 PM, Christoph Hellwig wrote:
> On Tue, Jun 30, 2015 at 02:47:00PM +0300, Sagi Grimberg wrote:
>> Kernel 4.1 introduced the new pmem driver for byte addressable storage
>> (https://lwn.net/Articles/640115/). It won't be long before we see HA
>> models where secondary persistent memory devices will sit across an
>> RDMA fabric. (http://www.snia.org/sites/default/files/DougVoigt_RDMA_Requirements_for_HA.pdf)
>

Christoph,

> And what does this bullshitting slide have to do with our memory models?

It was just a slide-deck I found that talks about persistent memory in
combination with RDMA so people can learn more on this if they want to.

> In its current form the pmem driver can't even be used as a (R)DMA
> target, so it's totally irrelevant for now.

I was referring to initiator mode. My understanding is that with
persistent memory the model changes from the traditional block storage
model.

>
> Let's get a proper in-kernel interface for generic memory registrations
> in place as a first step. While this will be a significant amount of work,
> it'll help greatly with moving nasty implementation details out of
> the drivers.  Adding additional registration methods will be easy
> behind the back of a proper abstraction.

This response is directed to Sean's comment on this being a vendor
specific knob rather than a real limitation in the stack.

As I said before, I'm willing to try and address the existing issues we
have in this area of the stack. However, this is a different discussion.

>
>> Do we want to live with this limitation forever?
>
> I don't think anyone has refused additional support.  Just the way how
> it's done in your patches is a giant nightmare.

Putting the generic API discussion aside for a moment: indirect
registration just generalizes FRWR to SG-lists. The API I proposed is
completely symmetric to FRWR, so applications that implement FRWR can
very easily use indirect registration. So I don't think it is "a giant
nightmare".

A generic memory registration API would sit as an abstraction layer
that just makes the registration operation easier for ULPs. It would be
implemented above the core verbs layer (it wouldn't make sense to do
this abstraction below the verbs), so it would need this API in the
core verbs layer just as well.

Do you have another suggestion on how to expose this feature in the core 
layer?

Sagi.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]                     ` <559292CE.9010303-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-01  7:23                       ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2015-07-01  7:23 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Hefty, Sean, Sagi Grimberg, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Eli Cohen,
	Oren Duer, Boaz Harrosh, Liran Liss

On Tue, Jun 30, 2015 at 03:59:58PM +0300, Sagi Grimberg wrote:
> Putting the generic API discussion aside for a moment: indirect
> registration just generalizes FRWR to SG-lists. The API I proposed is
> completely symmetric to FRWR, so applications that implement FRWR can
> very easily use indirect registration. So I don't think it is "a giant
> nightmare".
> 
> A generic memory registration API would sit as an abstraction layer
> that just makes the registration operation easier for ULPs. It would be
> implemented above the core verbs layer (it wouldn't make sense to do
> this abstraction below the verbs), so it would need this API in the
> core verbs layer just as well.
> 
> Do you have another suggestion on how to expose this feature in the core
> layer?

Convert the FRWR API to one that takes SGLs, and expose a feature to tell
the client if it can take discontig SGLs.  Next, try to expand the FRWR
API to also cover other memory registration schemes.

> 
> Sagi.
---end quoted text---

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-07-01  7:23 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-08 13:15 [PATCH 0/5] Indirect memory registration feature Sagi Grimberg
     [not found] ` <1433769339-949-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-08 13:15   ` [PATCH 1/5] IB/core: Introduce Fast Indirect Memory Registration verbs API Sagi Grimberg
     [not found]     ` <1433769339-949-2-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-08 20:49       ` Hefty, Sean
     [not found]         ` <1828884A29C6694DAF28B7E6B8A82373A8FE5C7C-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-30 11:47           ` Sagi Grimberg
     [not found]             ` <559281B4.6010807-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-30 12:10               ` Christoph Hellwig
     [not found]                 ` <20150630121002.GA24169-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-06-30 12:59                   ` Sagi Grimberg
     [not found]                     ` <559292CE.9010303-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-07-01  7:23                       ` Christoph Hellwig
2015-06-08 13:15   ` [PATCH 2/5] IB/mlx5: Implement Fast Indirect Memory Registration Feature Sagi Grimberg
2015-06-08 13:15   ` [PATCH 3/5] IB/iser: Pass iser device to registration routines Sagi Grimberg
2015-06-08 13:15   ` [PATCH 4/5] IB/iser: Add indirect registration support Sagi Grimberg
2015-06-08 13:15   ` [PATCH 5/5] IB/iser: Add debug prints to the various memory registration methods Sagi Grimberg
2015-06-08 13:22   ` [PATCH 0/5] Indirect memory registration feature Christoph Hellwig
     [not found]     ` <20150608132254.GA14773-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-06-08 13:39       ` Sagi Grimberg
     [not found]         ` <55759B0B.8050805-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-08 13:51           ` Christoph Hellwig
     [not found]             ` <20150608135151.GA14021-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-06-08 14:42               ` Sagi Grimberg
     [not found]                 ` <5575A9C7.7000409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-09  6:20                   ` Christoph Hellwig
     [not found]                     ` <20150609062054.GA13011-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-06-09  8:44                       ` Sagi Grimberg
     [not found]                         ` <5576A760.4090004-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-06-09 11:14                           ` Christoph Hellwig
2015-06-09 15:06                           ` Chuck Lever
2015-06-09  7:41                   ` Christoph Hellwig
