All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/2] Indirect Fast Memory registration support
@ 2014-10-07 14:47 Sagi Grimberg
       [not found] ` <1412693281-6161-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-07 14:47 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

This patch set introduces support for registering a scattered
memory area in an indirect manner.

Current supported fast registration support has a known limitation
where the memory must be page aligned, meaning memory scatters must
be in chunks of page size except the first which may be in some offset
from the start of a page and the last which may end before the page
boundary.

This can make life hard for ULPs which may serve a scattered list without
the above limitations. Two immediate examples are iSER and SRP that have
some extra logic or work-arounds to handle an arbitrary scatter list of
buffers (which is supported in the entire stack above them).

The proposed API attempts to follow the well-known fast registration scheme
while allowing the ULP to pass an sg vector rather than a page list (u64 vector).
I expect ULPs to make use of the global DMA key to populate the lkey of the
sg vector. for example for a scatter list sg of 3 elements the indirect ib_sge
vector will look like:
ib_sge[0]: {dma_key, sg[0]->dma_addr, sg[0]->length}
ib_sge[1]: {dma_key, sg[1]->dma_addr, sg[1]->length}
ib_sge[2]: {dma_key, sg[2]->dma_addr, sg[2]->length}

Following this patch set I'll send out a usage for this feature in iSER.
In the meantime, I have a working example of krping utility practicing indirect
registration feature at: git://flatbed.openfabrics.org/~sgrimberg/krping.git
(branch: indir_registration)

Sagi Grimberg (2):
  IB/core: Introduce Fast Indirect Memory Registration verbs API
  IB/mlx5: Implement Fast Indirect Memory Registration Feature

 drivers/infiniband/core/verbs.c      |   29 +++++++++
 drivers/infiniband/hw/mlx5/cq.c      |    2 +
 drivers/infiniband/hw/mlx5/main.c    |    4 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   20 +++++++
 drivers/infiniband/hw/mlx5/mr.c      |   70 ++++++++++++++++++++++-
 drivers/infiniband/hw/mlx5/qp.c      |  104 ++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h              |   55 +++++++++++++++++-
 7 files changed, 280 insertions(+), 4 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found] ` <1412693281-6161-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-07 14:48   ` Sagi Grimberg
       [not found]     ` <1412693281-6161-2-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-07 14:48   ` [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature Sagi Grimberg
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-07 14:48 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

In order to support that we provide the user with an interface
to pass a scattered list of buffers to the IB core layer called
ib_indir_reg_list and provide the a new send work request opcode
called IB_WR_REG_INDIR_MR. We extend wr union with a new type of
memory registration called indir_reg where the user can place the
relevant information to perform such a memory registration.

The verbs user is expected to perform these steps:
0. Make sure that the device supports Indirect memory registration via
   ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and make sure
   that ib_device_attr max_indir_reg_mr_list_len suffice for the
   expected scatterlist length

1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
   This is done via ib_create_mr() with mr_init_attr.flags = IB_MR_INDIRECT_REG

2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
   pointers. This is done via new ib_alloc_indir_reg_list() verb

3. Populate the scattered buffers in ib_indir_reg_list.sg_list

4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
   provide the populated ib_indir_reg_list

5. Perform data transfer

6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)

7. Free indirect MR and ib_indir_reg_list via
   ib_destroy_mr() and ib_free_indir_reg_list()

Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/verbs.c |   29 ++++++++++++++++++++
 include/rdma/ib_verbs.h         |   55 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index c2b89cc..0364551 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1445,3 +1445,32 @@ int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
 		mr->device->check_mr_status(mr, check_mask, mr_status) : -ENOSYS;
 }
 EXPORT_SYMBOL(ib_check_mr_status);
+
+struct ib_indir_reg_list *
+ib_alloc_indir_reg_list(struct ib_device *device,
+			unsigned int max_indir_list_len)
+{
+	struct ib_indir_reg_list *indir_list;
+
+	if (!device->alloc_indir_reg_list)
+		return ERR_PTR(-ENOSYS);
+
+	indir_list = device->alloc_indir_reg_list(device,
+						  max_indir_list_len);
+	if (!IS_ERR(indir_list)) {
+		indir_list->device = device;
+		indir_list->max_indir_list_len = max_indir_list_len;
+	}
+
+	return indir_list;
+}
+EXPORT_SYMBOL(ib_alloc_indir_reg_list);
+
+void
+ib_free_indir_reg_list(struct ib_device *device,
+		       struct ib_indir_reg_list *indir_list)
+{
+	if (device->free_indir_reg_list)
+		device->free_indir_reg_list(device, indir_list);
+}
+EXPORT_SYMBOL(ib_free_indir_reg_list);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 470a011..f5fe53c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -123,7 +123,8 @@ enum ib_device_cap_flags {
 	IB_DEVICE_MEM_WINDOW_TYPE_2A	= (1<<23),
 	IB_DEVICE_MEM_WINDOW_TYPE_2B	= (1<<24),
 	IB_DEVICE_MANAGED_FLOW_STEERING = (1<<29),
-	IB_DEVICE_SIGNATURE_HANDOVER	= (1<<30)
+	IB_DEVICE_SIGNATURE_HANDOVER	= (1<<30),
+	IB_DEVICE_INDIR_REGISTRATION	= (1<<31)
 };
 
 enum ib_signature_prot_cap {
@@ -182,6 +183,7 @@ struct ib_device_attr {
 	int			max_srq_wr;
 	int			max_srq_sge;
 	unsigned int		max_fast_reg_page_list_len;
+	unsigned int		max_indir_reg_mr_list_len;
 	u16			max_pkeys;
 	u8			local_ca_ack_delay;
 	int			sig_prot_cap;
@@ -476,7 +478,8 @@ __attribute_const__ int ib_rate_to_mult(enum ib_rate rate);
 __attribute_const__ int ib_rate_to_mbps(enum ib_rate rate);
 
 enum ib_mr_create_flags {
-	IB_MR_SIGNATURE_EN = 1,
+	IB_MR_SIGNATURE_EN = 1 << 0,
+	IB_MR_INDIRECT_REG = 1 << 1
 };
 
 /**
@@ -651,6 +654,7 @@ enum ib_wc_opcode {
 	IB_WC_FAST_REG_MR,
 	IB_WC_MASKED_COMP_SWAP,
 	IB_WC_MASKED_FETCH_ADD,
+	IB_WC_REG_INDIR_MR,
 /*
  * Set value of IB_WC_RECV so consumers can test if a completion is a
  * receive by testing (opcode & IB_WC_RECV).
@@ -945,6 +949,7 @@ enum ib_wr_opcode {
 	IB_WR_MASKED_ATOMIC_FETCH_AND_ADD,
 	IB_WR_BIND_MW,
 	IB_WR_REG_SIG_MR,
+	IB_WR_REG_INDIR_MR,
 	/* reserve values for low level drivers' internal use.
 	 * These values will not be used at all in the ib core layer.
 	 */
@@ -984,6 +989,12 @@ struct ib_fast_reg_page_list {
 	unsigned int		max_page_list_len;
 };
 
+struct ib_indir_reg_list {
+	struct ib_device       *device;
+	struct ib_sge          *sg_list;
+	unsigned int		max_indir_list_len;
+};
+
 /**
  * struct ib_mw_bind_info - Parameters for a memory window bind operation.
  * @mr: A memory region to bind the memory window to.
@@ -1056,6 +1067,14 @@ struct ib_send_wr {
 			int			access_flags;
 			struct ib_sge	       *prot;
 		} sig_handover;
+		struct {
+			u64				iova_start;
+			struct ib_indir_reg_list       *indir_list;
+			unsigned int			indir_list_len;
+			u64				length;
+			unsigned int			access_flags;
+			u32				mkey;
+		} indir_reg;
 	} wr;
 	u32			xrc_remote_srq_num;	/* XRC TGT QPs only */
 };
@@ -1562,6 +1581,10 @@ struct ib_device {
 	struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device,
 								   int page_list_len);
 	void			   (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list);
+	struct ib_indir_reg_list * (*alloc_indir_reg_list)(struct ib_device *device,
+							   unsigned int indir_list_len);
+	void			   (*free_indir_reg_list)(struct ib_device *device,
+							  struct ib_indir_reg_list *indir_list);
 	int                        (*rereg_phys_mr)(struct ib_mr *mr,
 						    int mr_rereg_mask,
 						    struct ib_pd *pd,
@@ -2460,6 +2483,34 @@ struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list(
 void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
 
 /**
+ * ib_alloc_indir_reg_list() - Allocates an indirect list array
+ * @device: ib device pointer
+ * @indir_list_len: size of the list array to be allocated
+ *
+ * Allocate a struct ib_indir_reg_list and a sg_list array
+ * that is at least indir_list_len in size. The actual size is
+ * returned in max_indir_list_len. The caller is responsible for
+ * initializing the contents of the sg_list array before posting
+ * a send work request with the IB_WC_INDIR_REG_MR opcode.
+ *
+ * The sg_list array entries should be set exactly the same way
+ * the ib_send_wr sg_list {lkey, addr, length}.
+ */
+struct ib_indir_reg_list *
+ib_alloc_indir_reg_list(struct ib_device *device,
+			unsigned int indir_list_len);
+
+/**
+ * ib_free_indir_reg_list() - Deallocates a previously allocated
+ *     indirect list array
+ * @device: ib device pointer
+ * @indir_list: pointer to be deallocated
+ */
+void
+ib_free_indir_reg_list(struct ib_device *device,
+		       struct ib_indir_reg_list *indir_list);
+
+/**
  * ib_update_fast_reg_key - updates the key portion of the fast_reg MR
  *   R_Key and L_Key.
  * @mr - struct ib_mr pointer to be updated.
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found] ` <1412693281-6161-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-07 14:48   ` [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API Sagi Grimberg
@ 2014-10-07 14:48   ` Sagi Grimberg
       [not found]     ` <1412693281-6161-3-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-08 11:06   ` [PATCH RFC 0/2] Indirect Fast Memory registration support Devesh Sharma
  2014-10-12 19:43   ` Or Gerlitz
  3 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-07 14:48 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

This patch implements:
- ib_alloc/free_indir_reg_list() routines
- ib_create_mr() extension for IB_MR_INDIRECT_REG
- ib_post_send() extension for IB_WR_REG_INDIR_MR
  and work completion of IB_WC_REG_INDIR_MR
- Expose mlx5 indirect registration device capabilities

* Nit change in mr_align() static routine to handle void*
instead of __be64.

Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx5/cq.c      |    2 +
 drivers/infiniband/hw/mlx5/main.c    |    4 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   20 +++++++
 drivers/infiniband/hw/mlx5/mr.c      |   70 ++++++++++++++++++++++-
 drivers/infiniband/hw/mlx5/qp.c      |  104 ++++++++++++++++++++++++++++++++++
 5 files changed, 198 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index e405627..7ca730c 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -111,6 +111,8 @@ static enum ib_wc_opcode get_umr_comp(struct mlx5_ib_wq *wq, int idx)
 	case IB_WR_FAST_REG_MR:
 		return IB_WC_FAST_REG_MR;
 
+	case IB_WR_REG_INDIR_MR:
+		return IB_WC_REG_INDIR_MR;
 	default:
 		pr_warn("unknown completion status\n");
 		return 0;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index d8907b2..d834b77 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -194,6 +194,7 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 	if (flags & MLX5_DEV_CAP_FLAG_XRC)
 		props->device_cap_flags |= IB_DEVICE_XRC;
 	props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS;
+	props->device_cap_flags |= IB_DEVICE_INDIR_REGISTRATION;
 	if (flags & MLX5_DEV_CAP_FLAG_SIG_HAND_OVER) {
 		props->device_cap_flags |= IB_DEVICE_SIGNATURE_HANDOVER;
 		/* At this stage no support for signature handover */
@@ -231,6 +232,7 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 	props->max_srq_wr	   = dev->mdev->caps.max_srq_wqes - 1;
 	props->max_srq_sge	   = max_rq_sg - 1;
 	props->max_fast_reg_page_list_len = (unsigned int)-1;
+	props->max_indir_reg_mr_list_len = (unsigned int)-1;
 	props->local_ca_ack_delay  = dev->mdev->caps.local_ca_ack_delay;
 	props->atomic_cap	   = IB_ATOMIC_NONE;
 	props->masked_atomic_cap   = IB_ATOMIC_NONE;
@@ -1354,6 +1356,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 	dev->ib_dev.alloc_fast_reg_page_list = mlx5_ib_alloc_fast_reg_page_list;
 	dev->ib_dev.free_fast_reg_page_list  = mlx5_ib_free_fast_reg_page_list;
 	dev->ib_dev.check_mr_status	= mlx5_ib_check_mr_status;
+	dev->ib_dev.alloc_indir_reg_list = mlx5_ib_alloc_indir_reg_list;
+	dev->ib_dev.free_indir_reg_list  = mlx5_ib_free_indir_reg_list;
 
 	if (mdev->caps.flags & MLX5_DEV_CAP_FLAG_XRC) {
 		dev->ib_dev.alloc_xrcd = mlx5_ib_alloc_xrcd;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 386780f..3b6ed0f 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -275,6 +275,13 @@ struct mlx5_ib_fast_reg_page_list {
 	dma_addr_t			map;
 };
 
+struct mlx5_ib_indir_reg_list {
+	struct ib_indir_reg_list        ib_irl;
+	void                           *mapped_ilist;
+	struct mlx5_klm                *klms;
+	dma_addr_t                      map;
+};
+
 struct mlx5_ib_umr_context {
 	enum ib_wc_status	status;
 	struct completion	done;
@@ -444,6 +451,12 @@ static inline struct mlx5_ib_fast_reg_page_list *to_mfrpl(struct ib_fast_reg_pag
 	return container_of(ibfrpl, struct mlx5_ib_fast_reg_page_list, ibfrpl);
 }
 
+static inline struct mlx5_ib_indir_reg_list *
+to_mindir_list(struct ib_indir_reg_list *ib_irl)
+{
+	return container_of(ib_irl, struct mlx5_ib_indir_reg_list, ib_irl);
+}
+
 struct mlx5_ib_ah {
 	struct ib_ah		ibah;
 	struct mlx5_av		av;
@@ -511,6 +524,13 @@ struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd,
 struct ib_fast_reg_page_list *mlx5_ib_alloc_fast_reg_page_list(struct ib_device *ibdev,
 							       int page_list_len);
 void mlx5_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
+
+struct ib_indir_reg_list *
+mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
+			     unsigned int max_indir_list_len);
+void mlx5_ib_free_indir_reg_list(struct ib_device *device,
+				 struct ib_indir_reg_list *indir_list);
+
 struct ib_fmr *mlx5_ib_fmr_alloc(struct ib_pd *pd, int acc,
 				 struct ib_fmr_attr *fmr_attr);
 int mlx5_ib_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 80b3c63..6fb7cc3 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -47,11 +47,11 @@ enum {
 	MLX5_UMR_ALIGN	= 2048
 };
 
-static __be64 *mr_align(__be64 *ptr, int align)
+static void *mr_align(void *ptr, int align)
 {
 	unsigned long mask = align - 1;
 
-	return (__be64 *)(((unsigned long)ptr + mask) & ~mask);
+	return (void *)(((unsigned long)ptr + mask) & ~mask);
 }
 
 static int order2idx(struct mlx5_ib_dev *dev, int order)
@@ -1059,6 +1059,9 @@ struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
 		++mr->sig->sigerr_count;
 	}
 
+	if (mr_init_attr->flags & IB_MR_INDIRECT_REG)
+		access_mode = MLX5_ACCESS_MODE_KLM;
+
 	in->seg.flags = MLX5_PERM_UMR_EN | access_mode;
 	err = mlx5_core_create_mkey(dev->mdev, &mr->mmr, in, sizeof(*in),
 				    NULL, NULL, NULL);
@@ -1248,3 +1251,66 @@ int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
 done:
 	return ret;
 }
+
+struct ib_indir_reg_list *
+mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
+			     unsigned int max_indir_list_len)
+{
+	struct device *ddev = device->dma_device;
+	struct mlx5_ib_indir_reg_list *mirl = NULL;
+	int dsize;
+	int err;
+
+	mirl = kzalloc(sizeof(*mirl), GFP_KERNEL);
+	if (!mirl)
+		return ERR_PTR(-ENOMEM);
+
+	mirl->ib_irl.sg_list = kcalloc(max_indir_list_len,
+				       sizeof(*mirl->ib_irl.sg_list),
+				       GFP_KERNEL);
+	if (!mirl->ib_irl.sg_list) {
+		err = -ENOMEM;
+		goto err_sg_list;
+	}
+
+	dsize = sizeof(*mirl->klms) * max_indir_list_len;
+	mirl->mapped_ilist = kzalloc(dsize + MLX5_UMR_ALIGN - 1,
+				      GFP_KERNEL);
+	if (!mirl->mapped_ilist) {
+		err = -ENOMEM;
+		goto err_mapped_list;
+	}
+
+	mirl->klms = mr_align(mirl->mapped_ilist, MLX5_UMR_ALIGN);
+	mirl->map = dma_map_single(ddev, mirl->klms,
+				   dsize, DMA_TO_DEVICE);
+	if (dma_mapping_error(ddev, mirl->map)) {
+		err = -ENOMEM;
+		goto err_dma_map;
+	}
+
+	return &mirl->ib_irl;
+err_dma_map:
+	kfree(mirl->mapped_ilist);
+err_mapped_list:
+	kfree(mirl->ib_irl.sg_list);
+err_sg_list:
+	kfree(mirl);
+
+	return ERR_PTR(err);
+}
+
+void
+mlx5_ib_free_indir_reg_list(struct ib_device *device,
+			    struct ib_indir_reg_list *indir_list)
+{
+	struct mlx5_ib_indir_reg_list *mirl = to_mindir_list(indir_list);
+	struct device *ddev = device->dma_device;
+	int dsize;
+
+	dsize = sizeof(*mirl->klms) * indir_list->max_indir_list_len;
+	dma_unmap_single(ddev, mirl->map, dsize, DMA_TO_DEVICE);
+	kfree(mirl->mapped_ilist);
+	kfree(mirl->ib_irl.sg_list);
+	kfree(mirl);
+}
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index d7f35e9..a9c74e6 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -65,6 +65,7 @@ static const u32 mlx5_ib_opcode[] = {
 	[IB_WR_SEND_WITH_INV]			= MLX5_OPCODE_SEND_INVAL,
 	[IB_WR_LOCAL_INV]			= MLX5_OPCODE_UMR,
 	[IB_WR_FAST_REG_MR]			= MLX5_OPCODE_UMR,
+	[IB_WR_REG_INDIR_MR]			= MLX5_OPCODE_UMR,
 	[IB_WR_MASKED_ATOMIC_CMP_AND_SWP]	= MLX5_OPCODE_ATOMIC_MASKED_CS,
 	[IB_WR_MASKED_ATOMIC_FETCH_AND_ADD]	= MLX5_OPCODE_ATOMIC_MASKED_FA,
 	[MLX5_IB_WR_UMR]			= MLX5_OPCODE_UMR,
@@ -2346,6 +2347,96 @@ static int set_frwr_li_wr(void **seg, struct ib_send_wr *wr, int *size,
 	return 0;
 }
 
+static void set_indir_mkey_segment(struct mlx5_mkey_seg *seg,
+				   struct ib_send_wr *wr, u32 pdn)
+{
+	u32 list_len = wr->wr.indir_reg.indir_list_len;
+
+	memset(seg, 0, sizeof(*seg));
+
+	seg->flags = get_umr_flags(wr->wr.indir_reg.access_flags) |
+				   MLX5_ACCESS_MODE_KLM;
+	seg->qpn_mkey7_0 = cpu_to_be32(0xffffff00 |
+			   mlx5_mkey_variant(wr->wr.indir_reg.mkey));
+	seg->flags_pd = cpu_to_be32(MLX5_MKEY_REMOTE_INVAL | pdn);
+	seg->len = cpu_to_be64(wr->wr.indir_reg.length);
+	seg->start_addr = cpu_to_be64(wr->wr.indir_reg.iova_start);
+	seg->xlt_oct_size = cpu_to_be32(be16_to_cpu(get_klm_octo(list_len * 2)));
+}
+
+static void set_indir_data_seg(struct ib_send_wr *wr, struct mlx5_ib_qp *qp,
+			       u32 pa_key, void **seg, int *size)
+{
+	struct mlx5_wqe_data_seg *data = *seg;
+	struct mlx5_ib_indir_reg_list *mirl;
+	struct ib_sge *sg_list = wr->wr.indir_reg.indir_list->sg_list;
+	u32 list_len = wr->wr.indir_reg.indir_list_len;
+	int i;
+
+	mirl = to_mindir_list(wr->wr.indir_reg.indir_list);
+	for (i = 0; i < list_len; i++) {
+		mirl->klms[i].va = cpu_to_be64(sg_list[i].addr);
+		mirl->klms[i].key = cpu_to_be32(sg_list[i].lkey);
+		mirl->klms[i].bcount = cpu_to_be32(sg_list[i].length);
+	}
+
+	data->byte_count = cpu_to_be32(ALIGN(sizeof(struct mlx5_klm) *
+				       list_len, 64));
+	data->lkey = cpu_to_be32(pa_key);
+	data->addr = cpu_to_be64(mirl->map);
+	*seg += sizeof(*data);
+	*size += sizeof(*data) / 16;
+}
+
+static void set_indir_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
+				  struct ib_send_wr *wr)
+{
+	u64 mask;
+	u32 list_len = wr->wr.indir_reg.indir_list_len;
+
+	memset(umr, 0, sizeof(*umr));
+
+	umr->klm_octowords = get_klm_octo(list_len * 2);
+	mask = MLX5_MKEY_MASK_LEN		|
+		MLX5_MKEY_MASK_PAGE_SIZE	|
+		MLX5_MKEY_MASK_START_ADDR	|
+		MLX5_MKEY_MASK_EN_RINVAL	|
+		MLX5_MKEY_MASK_KEY		|
+		MLX5_MKEY_MASK_LR		|
+		MLX5_MKEY_MASK_LW		|
+		MLX5_MKEY_MASK_RR		|
+		MLX5_MKEY_MASK_RW		|
+		MLX5_MKEY_MASK_A		|
+		MLX5_MKEY_MASK_FREE;
+
+	umr->mkey_mask = cpu_to_be64(mask);
+}
+
+static int set_indir_reg_wr(struct ib_send_wr *wr, struct mlx5_ib_qp *qp,
+			    void **seg, int *size)
+{
+	struct mlx5_ib_pd *pd = get_pd(qp);
+
+	if (unlikely(wr->send_flags & IB_SEND_INLINE))
+		return -EINVAL;
+
+	set_indir_umr_segment(*seg, wr);
+	*seg += sizeof(struct mlx5_wqe_umr_ctrl_seg);
+	*size += sizeof(struct mlx5_wqe_umr_ctrl_seg) / 16;
+	if (unlikely((*seg == qp->sq.qend)))
+		*seg = mlx5_get_send_wqe(qp, 0);
+
+	set_indir_mkey_segment(*seg, wr, pd->pdn);
+	*seg += sizeof(struct mlx5_mkey_seg);
+	*size += sizeof(struct mlx5_mkey_seg) / 16;
+	if (unlikely((*seg == qp->sq.qend)))
+		*seg = mlx5_get_send_wqe(qp, 0);
+
+	set_indir_data_seg(wr, qp, pd->pa_lkey, seg, size);
+
+	return 0;
+}
+
 static void dump_wqe(struct mlx5_ib_qp *qp, int idx, int size_16)
 {
 	__be32 *p = NULL;
@@ -2557,6 +2648,19 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 				num_sge = 0;
 				break;
 
+			case IB_WR_REG_INDIR_MR:
+				next_fence = MLX5_FENCE_MODE_INITIATOR_SMALL;
+				qp->sq.wr_data[idx] = IB_WR_REG_INDIR_MR;
+				ctrl->imm = cpu_to_be32(wr->wr.indir_reg.mkey);
+				err = set_indir_reg_wr(wr, qp, &seg, &size);
+				if (err) {
+					mlx5_ib_warn(dev, "\n");
+					*bad_wr = wr;
+					goto out;
+				}
+				num_sge = 0;
+				break;
+
 			case IB_WR_REG_SIG_MR:
 				qp->sq.wr_data[idx] = IB_WR_REG_SIG_MR;
 				mr = to_mmr(wr->wr.sig_handover.sig_mr);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]     ` <1412693281-6161-2-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-07 18:12       ` Steve Wise
       [not found]         ` <54342D0C.6050103-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  2014-10-14  5:40       ` Bart Van Assche
  1 sibling, 1 reply; 34+ messages in thread
From: Steve Wise @ 2014-10-07 18:12 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/7/2014 9:48 AM, Sagi Grimberg wrote:
> In order to support that we provide the user with an interface
> to pass a scattered list of buffers to the IB core layer called
> ib_indir_reg_list and provide the a new send work request opcode
> called IB_WR_REG_INDIR_MR. We extend wr union with a new type of
> memory registration called indir_reg where the user can place the
> relevant information to perform such a memory registration.
>
> The verbs user is expected to perform these steps:
> 0. Make sure that the device supports Indirect memory registration via
>     ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and make sure
>     that ib_device_attr max_indir_reg_mr_list_len suffice for the
>     expected scatterlist length
>
> 1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
>     This is done via ib_create_mr() with mr_init_attr.flags = IB_MR_INDIRECT_REG
>
> 2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
>     pointers. This is done via new ib_alloc_indir_reg_list() verb
>
> 3. Populate the scattered buffers in ib_indir_reg_list.sg_list
>
> 4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
>     provide the populated ib_indir_reg_list
>
> 5. Perform data transfer
>
> 6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)
>
> 7. Free indirect MR and ib_indir_reg_list via
>     ib_destroy_mr() and ib_free_indir_reg_list()
>
> Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>   drivers/infiniband/core/verbs.c |   29 ++++++++++++++++++++
>   include/rdma/ib_verbs.h         |   55 +++++++++++++++++++++++++++++++++++++-
>   2 files changed, 82 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index c2b89cc..0364551 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -1445,3 +1445,32 @@ int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
>   		mr->device->check_mr_status(mr, check_mask, mr_status) : -ENOSYS;
>   }
>   EXPORT_SYMBOL(ib_check_mr_status);
> +
> +struct ib_indir_reg_list *
> +ib_alloc_indir_reg_list(struct ib_device *device,
> +			unsigned int max_indir_list_len)
> +{
> +	struct ib_indir_reg_list *indir_list;
> +
> +	if (!device->alloc_indir_reg_list)
> +		return ERR_PTR(-ENOSYS);
> +
> +	indir_list = device->alloc_indir_reg_list(device,
> +						  max_indir_list_len);
> +	if (!IS_ERR(indir_list)) {
> +		indir_list->device = device;
> +		indir_list->max_indir_list_len = max_indir_list_len;
> +	}
> +
> +	return indir_list;
> +}
> +EXPORT_SYMBOL(ib_alloc_indir_reg_list);
> +
> +void
> +ib_free_indir_reg_list(struct ib_device *device,
> +		       struct ib_indir_reg_list *indir_list)
> +{
> +	if (device->free_indir_reg_list)
> +		device->free_indir_reg_list(device, indir_list);
> +}
> +EXPORT_SYMBOL(ib_free_indir_reg_list);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 470a011..f5fe53c 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -123,7 +123,8 @@ enum ib_device_cap_flags {
>   	IB_DEVICE_MEM_WINDOW_TYPE_2A	= (1<<23),
>   	IB_DEVICE_MEM_WINDOW_TYPE_2B	= (1<<24),
>   	IB_DEVICE_MANAGED_FLOW_STEERING = (1<<29),
> -	IB_DEVICE_SIGNATURE_HANDOVER	= (1<<30)
> +	IB_DEVICE_SIGNATURE_HANDOVER	= (1<<30),
> +	IB_DEVICE_INDIR_REGISTRATION	= (1<<31)
>   };
>   
>   enum ib_signature_prot_cap {
> @@ -182,6 +183,7 @@ struct ib_device_attr {
>   	int			max_srq_wr;
>   	int			max_srq_sge;
>   	unsigned int		max_fast_reg_page_list_len;
> +	unsigned int		max_indir_reg_mr_list_len;
>   	u16			max_pkeys;
>   	u8			local_ca_ack_delay;
>   	int			sig_prot_cap;
> @@ -476,7 +478,8 @@ __attribute_const__ int ib_rate_to_mult(enum ib_rate rate);
>   __attribute_const__ int ib_rate_to_mbps(enum ib_rate rate);
>   
>   enum ib_mr_create_flags {
> -	IB_MR_SIGNATURE_EN = 1,
> +	IB_MR_SIGNATURE_EN = 1 << 0,
> +	IB_MR_INDIRECT_REG = 1 << 1
>   };
>   
>   /**
> @@ -651,6 +654,7 @@ enum ib_wc_opcode {
>   	IB_WC_FAST_REG_MR,
>   	IB_WC_MASKED_COMP_SWAP,
>   	IB_WC_MASKED_FETCH_ADD,
> +	IB_WC_REG_INDIR_MR,
>   /*
>    * Set value of IB_WC_RECV so consumers can test if a completion is a
>    * receive by testing (opcode & IB_WC_RECV).
> @@ -945,6 +949,7 @@ enum ib_wr_opcode {
>   	IB_WR_MASKED_ATOMIC_FETCH_AND_ADD,
>   	IB_WR_BIND_MW,
>   	IB_WR_REG_SIG_MR,
> +	IB_WR_REG_INDIR_MR,
>   	/* reserve values for low level drivers' internal use.
>   	 * These values will not be used at all in the ib core layer.
>   	 */
> @@ -984,6 +989,12 @@ struct ib_fast_reg_page_list {
>   	unsigned int		max_page_list_len;
>   };
>   
> +struct ib_indir_reg_list {
> +	struct ib_device       *device;
> +	struct ib_sge          *sg_list;
> +	unsigned int		max_indir_list_len;
> +};
> +
>   /**
>    * struct ib_mw_bind_info - Parameters for a memory window bind operation.
>    * @mr: A memory region to bind the memory window to.
> @@ -1056,6 +1067,14 @@ struct ib_send_wr {
>   			int			access_flags;
>   			struct ib_sge	       *prot;
>   		} sig_handover;
> +		struct {
> +			u64				iova_start;
> +			struct ib_indir_reg_list       *indir_list;
> +			unsigned int			indir_list_len;
> +			u64				length;
> +			unsigned int			access_flags;
> +			u32				mkey;
> +		} indir_reg;

What is mkey?  Shouldn't this be an rkey?

>   	} wr;
>   	u32			xrc_remote_srq_num;	/* XRC TGT QPs only */
>   };
> @@ -1562,6 +1581,10 @@ struct ib_device {
>   	struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device,
>   								   int page_list_len);
>   	void			   (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list);
> +	struct ib_indir_reg_list * (*alloc_indir_reg_list)(struct ib_device *device,
> +							   unsigned int indir_list_len);
> +	void			   (*free_indir_reg_list)(struct ib_device *device,
> +							  struct ib_indir_reg_list *indir_list);
>   	int                        (*rereg_phys_mr)(struct ib_mr *mr,
>   						    int mr_rereg_mask,
>   						    struct ib_pd *pd,
> @@ -2460,6 +2483,34 @@ struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list(
>   void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
>   
>   /**
> + * ib_alloc_indir_reg_list() - Allocates an indirect list array
> + * @device: ib device pointer
> + * @indir_list_len: size of the list array to be allocated
> + *
> + * Allocate a struct ib_indir_reg_list and a sg_list array
> + * that is at least indir_list_len in size. The actual size is
> + * returned in max_indir_list_len. The caller is responsible for
> + * initializing the contents of the sg_list array before posting
> + * a send work request with the IB_WC_INDIR_REG_MR opcode.
> + *
> + * The sg_list array entries should be set exactly the same way
> + * the ib_send_wr sg_list {lkey, addr, length}.
> + */
> +struct ib_indir_reg_list *
> +ib_alloc_indir_reg_list(struct ib_device *device,
> +			unsigned int indir_list_len);
> +
> +/**
> + * ib_free_indir_reg_list() - Deallocates a previously allocated
> + *     indirect list array
> + * @device: ib device pointer
> + * @indir_list: pointer to be deallocated
> + */
> +void
> +ib_free_indir_reg_list(struct ib_device *device,
> +		       struct ib_indir_reg_list *indir_list);
> +
> +/**
>    * ib_update_fast_reg_key - updates the key portion of the fast_reg MR
>    *   R_Key and L_Key.
>    * @mr - struct ib_mr pointer to be updated.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]         ` <54342D0C.6050103-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2014-10-08  5:48           ` Sagi Grimberg
       [not found]             ` <5434D037.4040208-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-08  5:48 UTC (permalink / raw)
  To: Steve Wise, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/7/2014 9:12 PM, Steve Wise wrote:
> On 10/7/2014 9:48 AM, Sagi Grimberg wrote:
>> In order to support that we provide the user with an interface
>> to pass a scattered list of buffers to the IB core layer called
>> ib_indir_reg_list and provide the a new send work request opcode
>> called IB_WR_REG_INDIR_MR. We extend wr union with a new type of
>> memory registration called indir_reg where the user can place the
>> relevant information to perform such a memory registration.
>>
>> The verbs user is expected to perform these steps:
>> 0. Make sure that the device supports Indirect memory registration via
>>     ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and make sure
>>     that ib_device_attr max_indir_reg_mr_list_len suffice for the
>>     expected scatterlist length
>>
>> 1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
>>     This is done via ib_create_mr() with mr_init_attr.flags =
>> IB_MR_INDIRECT_REG
>>
>> 2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
>>     pointers. This is done via new ib_alloc_indir_reg_list() verb
>>
>> 3. Populate the scattered buffers in ib_indir_reg_list.sg_list
>>
>> 4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
>>     provide the populated ib_indir_reg_list
>>
>> 5. Perform data transfer
>>
>> 6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)
>>
>> 7. Free indirect MR and ib_indir_reg_list via
>>     ib_destroy_mr() and ib_free_indir_reg_list()
>>
>> Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> ---
>>   drivers/infiniband/core/verbs.c |   29 ++++++++++++++++++++
>>   include/rdma/ib_verbs.h         |   55
>> +++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 82 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/verbs.c
>> b/drivers/infiniband/core/verbs.c
>> index c2b89cc..0364551 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -1445,3 +1445,32 @@ int ib_check_mr_status(struct ib_mr *mr, u32
>> check_mask,
>>           mr->device->check_mr_status(mr, check_mask, mr_status) :
>> -ENOSYS;
>>   }
>>   EXPORT_SYMBOL(ib_check_mr_status);
>> +
>> +struct ib_indir_reg_list *
>> +ib_alloc_indir_reg_list(struct ib_device *device,
>> +            unsigned int max_indir_list_len)
>> +{
>> +    struct ib_indir_reg_list *indir_list;
>> +
>> +    if (!device->alloc_indir_reg_list)
>> +        return ERR_PTR(-ENOSYS);
>> +
>> +    indir_list = device->alloc_indir_reg_list(device,
>> +                          max_indir_list_len);
>> +    if (!IS_ERR(indir_list)) {
>> +        indir_list->device = device;
>> +        indir_list->max_indir_list_len = max_indir_list_len;
>> +    }
>> +
>> +    return indir_list;
>> +}
>> +EXPORT_SYMBOL(ib_alloc_indir_reg_list);
>> +
>> +void
>> +ib_free_indir_reg_list(struct ib_device *device,
>> +               struct ib_indir_reg_list *indir_list)
>> +{
>> +    if (device->free_indir_reg_list)
>> +        device->free_indir_reg_list(device, indir_list);
>> +}
>> +EXPORT_SYMBOL(ib_free_indir_reg_list);
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 470a011..f5fe53c 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -123,7 +123,8 @@ enum ib_device_cap_flags {
>>       IB_DEVICE_MEM_WINDOW_TYPE_2A    = (1<<23),
>>       IB_DEVICE_MEM_WINDOW_TYPE_2B    = (1<<24),
>>       IB_DEVICE_MANAGED_FLOW_STEERING = (1<<29),
>> -    IB_DEVICE_SIGNATURE_HANDOVER    = (1<<30)
>> +    IB_DEVICE_SIGNATURE_HANDOVER    = (1<<30),
>> +    IB_DEVICE_INDIR_REGISTRATION    = (1<<31)
>>   };
>>   enum ib_signature_prot_cap {
>> @@ -182,6 +183,7 @@ struct ib_device_attr {
>>       int            max_srq_wr;
>>       int            max_srq_sge;
>>       unsigned int        max_fast_reg_page_list_len;
>> +    unsigned int        max_indir_reg_mr_list_len;
>>       u16            max_pkeys;
>>       u8            local_ca_ack_delay;
>>       int            sig_prot_cap;
>> @@ -476,7 +478,8 @@ __attribute_const__ int ib_rate_to_mult(enum
>> ib_rate rate);
>>   __attribute_const__ int ib_rate_to_mbps(enum ib_rate rate);
>>   enum ib_mr_create_flags {
>> -    IB_MR_SIGNATURE_EN = 1,
>> +    IB_MR_SIGNATURE_EN = 1 << 0,
>> +    IB_MR_INDIRECT_REG = 1 << 1
>>   };
>>   /**
>> @@ -651,6 +654,7 @@ enum ib_wc_opcode {
>>       IB_WC_FAST_REG_MR,
>>       IB_WC_MASKED_COMP_SWAP,
>>       IB_WC_MASKED_FETCH_ADD,
>> +    IB_WC_REG_INDIR_MR,
>>   /*
>>    * Set value of IB_WC_RECV so consumers can test if a completion is a
>>    * receive by testing (opcode & IB_WC_RECV).
>> @@ -945,6 +949,7 @@ enum ib_wr_opcode {
>>       IB_WR_MASKED_ATOMIC_FETCH_AND_ADD,
>>       IB_WR_BIND_MW,
>>       IB_WR_REG_SIG_MR,
>> +    IB_WR_REG_INDIR_MR,
>>       /* reserve values for low level drivers' internal use.
>>        * These values will not be used at all in the ib core layer.
>>        */
>> @@ -984,6 +989,12 @@ struct ib_fast_reg_page_list {
>>       unsigned int        max_page_list_len;
>>   };
>> +struct ib_indir_reg_list {
>> +    struct ib_device       *device;
>> +    struct ib_sge          *sg_list;
>> +    unsigned int        max_indir_list_len;
>> +};
>> +
>>   /**
>>    * struct ib_mw_bind_info - Parameters for a memory window bind
>> operation.
>>    * @mr: A memory region to bind the memory window to.
>> @@ -1056,6 +1067,14 @@ struct ib_send_wr {
>>               int            access_flags;
>>               struct ib_sge           *prot;
>>           } sig_handover;
>> +        struct {
>> +            u64                iova_start;
>> +            struct ib_indir_reg_list       *indir_list;
>> +            unsigned int            indir_list_len;
>> +            u64                length;
>> +            unsigned int            access_flags;
>> +            u32                mkey;
>> +        } indir_reg;
>
> What is mkey?  Shouldn't this be an rkey?

mkey means memory key. I can change it to rkey if that
is clearer.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH RFC 0/2] Indirect Fast Memory registration support
       [not found] ` <1412693281-6161-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-07 14:48   ` [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API Sagi Grimberg
  2014-10-07 14:48   ` [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature Sagi Grimberg
@ 2014-10-08 11:06   ` Devesh Sharma
       [not found]     ` <EE7902D3F51F404C82415C4803930ACD40C4114B-DWYeeINJQrxExQ8dmkPuX0M9+F4ksjoh@public.gmane.org>
  2014-10-12 19:43   ` Or Gerlitz
  3 siblings, 1 reply; 34+ messages in thread
From: Devesh Sharma @ 2014-10-08 11:06 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Sagi Grimberg
> Sent: Tuesday, October 07, 2014 8:18 PM
> To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: bvanassche-HInyCGIudOg@public.gmane.org; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org;
> ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; oren-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org
> Subject: [PATCH RFC 0/2] Indirect Fast Memory registration support
> 
> This patch set introduces support for registering a scattered memory area in
> an indirect manner.
> 
> Current supported fast registration support has a known limitation where the
> memory must be page aligned, meaning memory scatters must be in chunks
> of page size except the first which may be in some offset from the start of a
> page and the last which may end before the page boundary.
> 
> This can make life hard for ULPs which may serve a scattered list without the
> above limitations. Two immediate examples are iSER and SRP that have some
> extra logic or work-arounds to handle an arbitrary scatter list of buffers
> (which is supported in the entire stack above them).
> 
> The proposed API attempts to follow the well-known fast registration
> scheme while allowing the ULP to pass an sg vector rather than a page list
> (u64 vector).
> I expect ULPs to make use of the global DMA key to populate the lkey of the
> sg vector. for example for a scatter list sg of 3 elements the indirect ib_sge
> vector will look like:
> ib_sge[0]: {dma_key, sg[0]->dma_addr, sg[0]->length}

Here each sge can be of max 4G like regular SGEs? 
And can lkey in each SGE be different?

> ib_sge[1]: {dma_key, sg[1]->dma_addr, sg[1]->length}
> ib_sge[2]: {dma_key, sg[2]->dma_addr, sg[2]->length}
> 
> Following this patch set I'll send out a usage for this feature in iSER.
> In the meantime, I have a working example of krping utility practicing indirect
> registration feature at: git://flatbed.openfabrics.org/~sgrimberg/krping.git
> (branch: indir_registration)
> 
> Sagi Grimberg (2):
>   IB/core: Introduce Fast Indirect Memory Registration verbs API
>   IB/mlx5: Implement Fast Indirect Memory Registration Feature
> 
>  drivers/infiniband/core/verbs.c      |   29 +++++++++
>  drivers/infiniband/hw/mlx5/cq.c      |    2 +
>  drivers/infiniband/hw/mlx5/main.c    |    4 +
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |   20 +++++++
>  drivers/infiniband/hw/mlx5/mr.c      |   70 ++++++++++++++++++++++-
>  drivers/infiniband/hw/mlx5/qp.c      |  104
> ++++++++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h              |   55 +++++++++++++++++-
>  7 files changed, 280 insertions(+), 4 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the
> body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]             ` <5434D037.4040208-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-10-08 13:54               ` Steve Wise
  2014-10-13  7:57                 ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Steve Wise @ 2014-10-08 13:54 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Sagi Grimberg',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

> >> @@ -1056,6 +1067,14 @@ struct ib_send_wr {
> >>               int            access_flags;
> >>               struct ib_sge           *prot;
> >>           } sig_handover;
> >> +        struct {
> >> +            u64                iova_start;
> >> +            struct ib_indir_reg_list       *indir_list;
> >> +            unsigned int            indir_list_len;
> >> +            u64                length;
> >> +            unsigned int            access_flags;
> >> +            u32                mkey;
> >> +        } indir_reg;
> >
> > What is mkey?  Shouldn't this be an rkey?
> 
> mkey means memory key. I can change it to rkey if that
> is clearer.

Is it valid to use an lkey here?  Or is an rkey required?  If an rkey is required, then I'd say it is clearer to name it rkey (and
that aligns with the fastreg struct).

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]     ` <1412693281-6161-3-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-12 19:39       ` Or Gerlitz
       [not found]         ` <543AD8DE.5060404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-14  5:41       ` Bart Van Assche
  1 sibling, 1 reply; 34+ messages in thread
From: Or Gerlitz @ 2014-10-12 19:39 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w


On 10/7/2014 4:48 PM, Sagi Grimberg wrote:
>
> * Nit change in mr_align() static routine to handle void*
> instead of __be64.

nit comment... any reason not to put in different and unrelated to this 
series patch?

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c


> +static void set_indir_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
> +				  struct ib_send_wr *wr)
> +{
> +	u64 mask;
> +	u32 list_len = wr->wr.indir_reg.indir_list_len;
> +
> +	memset(umr, 0, sizeof(*umr));
> +
> +	umr->klm_octowords = get_klm_octo(list_len * 2);
> +	mask = MLX5_MKEY_MASK_LEN		|
> +		MLX5_MKEY_MASK_PAGE_SIZE	|
> +		MLX5_MKEY_MASK_START_ADDR	|
> +		MLX5_MKEY_MASK_EN_RINVAL	|
> +		MLX5_MKEY_MASK_KEY		|
> +		MLX5_MKEY_MASK_LR		|
> +		MLX5_MKEY_MASK_LW		|
> +		MLX5_MKEY_MASK_RR		|
> +		MLX5_MKEY_MASK_RW		|
> +		MLX5_MKEY_MASK_A		|
> +		MLX5_MKEY_MASK_FREE;
> +
> +	umr->mkey_mask = cpu_to_be64(mask);
> +}

here you basically replicate the majority of the code from 
set_reg_umr_segment - share the common part...


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/2] Indirect Fast Memory registration support
       [not found] ` <1412693281-6161-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2014-10-08 11:06   ` [PATCH RFC 0/2] Indirect Fast Memory registration support Devesh Sharma
@ 2014-10-12 19:43   ` Or Gerlitz
       [not found]     ` <543AD9CD.80803-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  3 siblings, 1 reply; 34+ messages in thread
From: Or Gerlitz @ 2014-10-12 19:43 UTC (permalink / raw)
  To: Sagi Grimberg, bvanassche-HInyCGIudOg, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w


On 10/7/2014 4:47 PM, Sagi Grimberg wrote:
> This patch set introduces support for registering a scattered
> memory area in an indirect manner.
>
> Current supported fast registration support has a known limitation
> where the memory must be page aligned, meaning memory scatters must
> be in chunks of page size except the first which may be in some offset
> from the start of a page and the last which may end before the page
> boundary.
>
> This can make life hard for ULPs which may serve a scattered list without
> the above limitations. Two immediate examples are iSER and SRP that have
> some extra logic or work-arounds to handle an arbitrary scatter list of
> buffers (which is supported in the entire stack above them).

Sagi,

The proposed API looks OK to me -- it  will solve us in iSER the long 
standing real life issue
with SGs delivered by the SCSI MID layer which are unaligned for FMRs 
and FastReg MRs.

Sean, Bart - any comment on the API before Sagi sits down to code the 
iSER changes?

Or.

>
> The proposed API attempts to follow the well-known fast registration scheme
> while allowing the ULP to pass an sg vector rather than a page list (u64 vector).
> I expect ULPs to make use of the global DMA key to populate the lkey of the
> sg vector. for example for a scatter list sg of 3 elements the indirect ib_sge
> vector will look like:
> ib_sge[0]: {dma_key, sg[0]->dma_addr, sg[0]->length}
> ib_sge[1]: {dma_key, sg[1]->dma_addr, sg[1]->length}
> ib_sge[2]: {dma_key, sg[2]->dma_addr, sg[2]->length}
>
> Following this patch set I'll send out a usage for this feature in iSER.
> In the meantime, I have a working example of krping utility practicing indirect
> registration feature at: git://flatbed.openfabrics.org/~sgrimberg/krping.git
> (branch: indir_registration)
>
> Sagi Grimberg (2):
>    IB/core: Introduce Fast Indirect Memory Registration verbs API
>    IB/mlx5: Implement Fast Indirect Memory Registration Feature
>
>   drivers/infiniband/core/verbs.c      |   29 +++++++++
>   drivers/infiniband/hw/mlx5/cq.c      |    2 +
>   drivers/infiniband/hw/mlx5/main.c    |    4 +
>   drivers/infiniband/hw/mlx5/mlx5_ib.h |   20 +++++++
>   drivers/infiniband/hw/mlx5/mr.c      |   70 ++++++++++++++++++++++-
>   drivers/infiniband/hw/mlx5/qp.c      |  104 ++++++++++++++++++++++++++++++++++
>   include/rdma/ib_verbs.h              |   55 +++++++++++++++++-
>   7 files changed, 280 insertions(+), 4 deletions(-)
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
  2014-10-08 13:54               ` Steve Wise
@ 2014-10-13  7:57                 ` Sagi Grimberg
       [not found]                   ` <543B85F7.1060000-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-13  7:57 UTC (permalink / raw)
  To: Steve Wise, 'Sagi Grimberg', linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/8/2014 4:54 PM, Steve Wise wrote:
>>>> @@ -1056,6 +1067,14 @@ struct ib_send_wr {
>>>>                int            access_flags;
>>>>                struct ib_sge           *prot;
>>>>            } sig_handover;
>>>> +        struct {
>>>> +            u64                iova_start;
>>>> +            struct ib_indir_reg_list       *indir_list;
>>>> +            unsigned int            indir_list_len;
>>>> +            u64                length;
>>>> +            unsigned int            access_flags;
>>>> +            u32                mkey;
>>>> +        } indir_reg;
>>>
>>> What is mkey?  Shouldn't this be an rkey?
>>
>> mkey means memory key. I can change it to rkey if that
>> is clearer.
>
> Is it valid to use an lkey here?  Or is an rkey required?  If an rkey is required, then I'd say it is clearer to name it rkey (and
> that aligns with the fastreg struct).
>

It is valid. The memory key depends on the use case.
In case a client want to send an rkey to a peer, it will register using
rkey. In case a server wants to transfer data from it's local buffer
it will register using lkey.

So I didn't impose a specific key here - this is why I chose mkey.

I can modify it to rkey to mimic the well known fastreg, but its not
a must.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/2] Indirect Fast Memory registration support
       [not found]     ` <EE7902D3F51F404C82415C4803930ACD40C4114B-DWYeeINJQrxExQ8dmkPuX0M9+F4ksjoh@public.gmane.org>
@ 2014-10-13  8:01       ` Sagi Grimberg
  0 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-13  8:01 UTC (permalink / raw)
  To: Devesh Sharma, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/8/2014 2:06 PM, Devesh Sharma wrote:
>> -----Original Message-----
>> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
>> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Sagi Grimberg
>> Sent: Tuesday, October 07, 2014 8:18 PM
>> To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Cc: bvanassche-HInyCGIudOg@public.gmane.org; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org;
>> ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; oren-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org
>> Subject: [PATCH RFC 0/2] Indirect Fast Memory registration support
>>
>> This patch set introduces support for registering a scattered memory area in
>> an indirect manner.
>>
>> Current supported fast registration support has a known limitation where the
>> memory must be page aligned, meaning memory scatters must be in chunks
>> of page size except the first which may be in some offset from the start of a
>> page and the last which may end before the page boundary.
>>
>> This can make life hard for ULPs which may serve a scattered list without the
>> above limitations. Two immediate examples are iSER and SRP that have some
>> extra logic or work-arounds to handle an arbitrary scatter list of buffers
>> (which is supported in the entire stack above them).
>>
>> The proposed API attempts to follow the well-known fast registration
>> scheme while allowing the ULP to pass an sg vector rather than a page list
>> (u64 vector).
>> I expect ULPs to make use of the global DMA key to populate the lkey of the
>> sg vector. for example for a scatter list sg of 3 elements the indirect ib_sge
>> vector will look like:
>> ib_sge[0]: {dma_key, sg[0]->dma_addr, sg[0]->length}
>
> Here each sge can be of max 4G like regular SGEs?

Correct. This is a normal ib_sge.

> And can lkey in each SGE be different?

Yes of course. Generally each lkey can be a different pre-registered
memory. I specifically used this example to show how a ULP would utilize
the global DMA lkey to register an arbitrary scatterlist.

>
>> ib_sge[1]: {dma_key, sg[1]->dma_addr, sg[1]->length}
>> ib_sge[2]: {dma_key, sg[2]->dma_addr, sg[2]->length}
>>

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]         ` <543AD8DE.5060404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-13  8:32           ` Sagi Grimberg
       [not found]             ` <543B8E09.6090606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-13  8:32 UTC (permalink / raw)
  To: Or Gerlitz, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/12/2014 10:39 PM, Or Gerlitz wrote:
>
> On 10/7/2014 4:48 PM, Sagi Grimberg wrote:
>>
>> * Nit change in mr_align() static routine to handle void*
>> instead of __be64.
>
> nit comment... any reason not to put in different and unrelated to this
> series patch?

I can, But there is no use for this out of this patch context.

>
> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h
> b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
>
>
>> +static void set_indir_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
>> +                  struct ib_send_wr *wr)
>> +{
>> +    u64 mask;
>> +    u32 list_len = wr->wr.indir_reg.indir_list_len;
>> +
>> +    memset(umr, 0, sizeof(*umr));
>> +
>> +    umr->klm_octowords = get_klm_octo(list_len * 2);
>> +    mask = MLX5_MKEY_MASK_LEN        |
>> +        MLX5_MKEY_MASK_PAGE_SIZE    |
>> +        MLX5_MKEY_MASK_START_ADDR    |
>> +        MLX5_MKEY_MASK_EN_RINVAL    |
>> +        MLX5_MKEY_MASK_KEY        |
>> +        MLX5_MKEY_MASK_LR        |
>> +        MLX5_MKEY_MASK_LW        |
>> +        MLX5_MKEY_MASK_RR        |
>> +        MLX5_MKEY_MASK_RW        |
>> +        MLX5_MKEY_MASK_A        |
>> +        MLX5_MKEY_MASK_FREE;
>> +
>> +    umr->mkey_mask = cpu_to_be64(mask);
>> +}
>
> here you basically replicate the majority of the code from
> set_reg_umr_segment - share the common part...

I've thought about it, but I know Eli prefers not to centralize the
umr control segment (and mkey context) setting routines at this point.
I am not against doing it at all, but let's let Eli comment here.

Eli?

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/2] Indirect Fast Memory registration support
       [not found]     ` <543AD9CD.80803-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-10-13  8:48       ` Bart Van Assche
       [not found]         ` <543B91E2.70206-HInyCGIudOg@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2014-10-13  8:48 UTC (permalink / raw)
  To: Or Gerlitz, Sagi Grimberg, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w

On 10/12/14 21:43, Or Gerlitz wrote:
> Sean, Bart - any comment on the API before Sagi sits down to code the
> iSER changes?

Hello Or,

Will this API ever be supported by other Mellanox RDMA drivers than 
mlx5, e.g. mlx4 ? Will this API ever be supported by a Mellanox HCA that 
supports RoCE ? Additionally, do you perhaps know whether any other RDMA 
vendors have plans to implement this API for their RDMA HCA's ? Drivers 
like the iSER and SRP initiator and target drivers have been implemented 
on top of the RDMA API and support all hardware that implements the RDMA 
API. Since this patch series adds support for the indirect fast memory 
registration API for mlx5 only this means that adding support for this 
API in block storage drivers also means adding a fall-back path for RDMA 
drivers that do not have support for this API. That will make these 
block storage drivers harder to maintain, which is something I would 
like to avoid.

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/2] Indirect Fast Memory registration support
       [not found]         ` <543B91E2.70206-HInyCGIudOg@public.gmane.org>
@ 2014-10-13 11:18           ` Sagi Grimberg
       [not found]             ` <543BB4F4.8090203-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-13 11:18 UTC (permalink / raw)
  To: Bart Van Assche, Or Gerlitz, Sagi Grimberg,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w

On 10/13/2014 11:48 AM, Bart Van Assche wrote:
> On 10/12/14 21:43, Or Gerlitz wrote:
>> Sean, Bart - any comment on the API before Sagi sits down to code the
>> iSER changes?
>
> Hello Or,

Hey Bart,

>
> Will this API ever be supported by other Mellanox RDMA drivers than
> mlx5, e.g. mlx4 ?

Well, unfortunately mlx4 micro-architecture does not support
indirect memory registration. IMHO, simulating it in SW/FW is
not feasible.

> Will this API ever be supported by a Mellanox HCA that
> supports RoCE ?

Yes, mlx5 will support RoCE in the future. Moreover AFAIK future
Mellanox HCAs will keep supporting this technology as it is useful
for other use-cases as well.

> Additionally, do you perhaps know whether any other RDMA
> vendors have plans to implement this API for their RDMA HCA's ?

I can't comment on other vendors plans...
Maybe once this feature is included they will consider it... ;)

> Drivers like the iSER and SRP initiator and target drivers have been implemented
> on top of the RDMA API and support all hardware that implements the RDMA
> API. Since this patch series adds support for the indirect fast memory
> registration API for mlx5 only this means that adding support for this
> API in block storage drivers also means adding a fall-back path for RDMA
> drivers that do not have support for this API. That will make these
> block storage drivers harder to maintain, which is something I would
> like to avoid.

I agree with you on that,

But state-of-the-art devices can introduce new functionality which
sometimes isn't supported in older/other devices. So while keeping
a fall-back logic is a pain, we should strive to adopt new features.

As you know, the problem this feature is solving is even more painful
in iSER than in SRP (only a single rkey per-cmd is allowed) and while
the BB memcopy work-around will not be removed from the code completely,
at least it can hide under some abstraction and be avoided when this
feature is supported.

Cheers,
Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]             ` <543B8E09.6090606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-10-13 12:57               ` Or Gerlitz
  2014-10-13 13:00               ` Or Gerlitz
  2014-10-21 13:12               ` Eli Cohen
  2 siblings, 0 replies; 34+ messages in thread
From: Or Gerlitz @ 2014-10-13 12:57 UTC (permalink / raw)
  To: Sagi Grimberg, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w


On 10/13/2014 11:32 AM, Sagi Grimberg wrote:
> mask = MLX5_MKEY_MASK_LEN        |
> +        MLX5_MKEY_MASK_PAGE_SIZE    |
> +        MLX5_MKEY_MASK_START_ADDR    |
> +        MLX5_MKEY_MASK_EN_RINVAL    |
> +        MLX5_MKEY_MASK_KEY        |
> +        MLX5_MKEY_MASK_LR        |
> +        MLX5_MKEY_MASK_LW        |
> +        MLX5_MKEY_MASK_RR        |
> +        MLX5_MKEY_MASK_RW        |
> +        MLX5_MKEY_MASK_A        |
> +        MLX5_MKEY_MASK_FREE; 
@ least let's not repeat this mask = MLX5_MKEY_MASK_YYY | [...] exactly 
twice, BTW I think Eli is OOO this week and I'm not sure we want to stop 
this train just as of this small suggestion I made here



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]             ` <543B8E09.6090606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2014-10-13 12:57               ` Or Gerlitz
@ 2014-10-13 13:00               ` Or Gerlitz
  2014-10-21 13:12               ` Eli Cohen
  2 siblings, 0 replies; 34+ messages in thread
From: Or Gerlitz @ 2014-10-13 13:00 UTC (permalink / raw)
  To: Sagi Grimberg, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w


On 10/13/2014 11:32 AM, Sagi Grimberg wrote:
> n 10/12/2014 10:39 PM, Or Gerlitz wrote:
>>
>> On 10/7/2014 4:48 PM, Sagi Grimberg wrote:
>>>
>>> * Nit change in mr_align() static routine to handle void*
>>> instead of __be64.
>>
>> nit comment... any reason not to put in different and unrelated to this
>> series patch?
>
> I can, But there is no use for this out of this patch context.
if you really want it to be this way, let it be
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]                   ` <543B85F7.1060000-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-10-13 14:41                     ` Steve Wise
  0 siblings, 0 replies; 34+ messages in thread
From: Steve Wise @ 2014-10-13 14:41 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Sagi Grimberg',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

> On 10/8/2014 4:54 PM, Steve Wise wrote:
> >>>> @@ -1056,6 +1067,14 @@ struct ib_send_wr {
> >>>>                int            access_flags;
> >>>>                struct ib_sge           *prot;
> >>>>            } sig_handover;
> >>>> +        struct {
> >>>> +            u64                iova_start;
> >>>> +            struct ib_indir_reg_list       *indir_list;
> >>>> +            unsigned int            indir_list_len;
> >>>> +            u64                length;
> >>>> +            unsigned int            access_flags;
> >>>> +            u32                mkey;
> >>>> +        } indir_reg;
> >>>
> >>> What is mkey?  Shouldn't this be an rkey?
> >>
> >> mkey means memory key. I can change it to rkey if that
> >> is clearer.
> >
> > Is it valid to use an lkey here?  Or is an rkey required?  If an rkey is required, then I'd say it is clearer to name it rkey
(and
> > that aligns with the fastreg struct).
> >
> 
> It is valid. The memory key depends on the use case.
> In case a client want to send an rkey to a peer, it will register using
> rkey. In case a server wants to transfer data from it's local buffer
> it will register using lkey.
> 
> So I didn't impose a specific key here - this is why I chose mkey.
> 
> I can modify it to rkey to mimic the well known fastreg, but its not
> a must.
> 

If both local-only and local/remote are valid, then I agree mkey is good.   I was thinking an application wouldn't need this API for
local-only registrations; it could just use the local dma lkey and a bus address in the sge.   But perhaps the indirect fastreg
allows a deeper sgl than is supported by providers via the SEND/READ/WRITE/RECV work requests...

Steve.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/2] Indirect Fast Memory registration support
       [not found]             ` <543BB4F4.8090203-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-10-13 14:51               ` Steve Wise
  0 siblings, 0 replies; 34+ messages in thread
From: Steve Wise @ 2014-10-13 14:51 UTC (permalink / raw)
  To: Sagi Grimberg, Bart Van Assche, Or Gerlitz, Sagi Grimberg,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w

On 10/13/2014 6:18 AM, Sagi Grimberg wrote:
> On 10/13/2014 11:48 AM, Bart Van Assche wrote:
>> On 10/12/14 21:43, Or Gerlitz wrote:
>>> Sean, Bart - any comment on the API before Sagi sits down to code the
>>> iSER changes?
>>
>> Hello Or,
>
> Hey Bart,
>
>>
>> Will this API ever be supported by other Mellanox RDMA drivers than
>> mlx5, e.g. mlx4 ?
>
> Well, unfortunately mlx4 micro-architecture does not support
> indirect memory registration. IMHO, simulating it in SW/FW is
> not feasible.
>
>> Will this API ever be supported by a Mellanox HCA that
>> supports RoCE ?
>
> Yes, mlx5 will support RoCE in the future. Moreover AFAIK future
> Mellanox HCAs will keep supporting this technology as it is useful
> for other use-cases as well.
>
>> Additionally, do you perhaps know whether any other RDMA
>> vendors have plans to implement this API for their RDMA HCA's ?
>
> I can't comment on other vendors plans...
> Maybe once this feature is included they will consider it... ;)
>

cxgb4 currently does not support this.  I'm not sure if it could be done 
given the hw architecture.

Steve.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]     ` <1412693281-6161-2-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-07 18:12       ` Steve Wise
@ 2014-10-14  5:40       ` Bart Van Assche
       [not found]         ` <543CB76B.7020208-HInyCGIudOg@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2014-10-14  5:40 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/07/14 16:48, Sagi Grimberg wrote:
> In order to support that we provide the user with an interface
> to pass a scattered list of buffers to the IB core layer called
> ib_indir_reg_list and provide the a new send work request opcode
> called IB_WR_REG_INDIR_MR. We extend wr union with a new type of
> memory registration called indir_reg where the user can place the
> relevant information to perform such a memory registration.
>
> The verbs user is expected to perform these steps:
> 0. Make sure that the device supports Indirect memory registration via
>     ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and make sure
>     that ib_device_attr max_indir_reg_mr_list_len suffice for the
>     expected scatterlist length
>
> 1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
>     This is done via ib_create_mr() with mr_init_attr.flags = IB_MR_INDIRECT_REG
>
> 2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
>     pointers. This is done via new ib_alloc_indir_reg_list() verb
>
> 3. Populate the scattered buffers in ib_indir_reg_list.sg_list
>
> 4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
>     provide the populated ib_indir_reg_list
>
> 5. Perform data transfer
>
> 6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)
>
> 7. Free indirect MR and ib_indir_reg_list via
>     ib_destroy_mr() and ib_free_indir_reg_list()

Hello Sagi,

Is there documentation available somewhere about the order in which an 
HCA must execute an indirect memory registration request relative to 
other work requests, similar to the "Work Request Operation Ordering" 
table in the InfiniBand specification ? I think such documentation is 
needed to ensure consistent behavior across HCA models.

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]     ` <1412693281-6161-3-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-10-12 19:39       ` Or Gerlitz
@ 2014-10-14  5:41       ` Bart Van Assche
       [not found]         ` <543CB79B.6050400-HInyCGIudOg@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2014-10-14  5:41 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/07/14 16:48, Sagi Grimberg wrote:
> -static __be64 *mr_align(__be64 *ptr, int align)
> +static void *mr_align(void *ptr, int align)
>   {
>   	unsigned long mask = align - 1;
>
> -	return (__be64 *)(((unsigned long)ptr + mask) & ~mask);
> +	return (void *)(((unsigned long)ptr + mask) & ~mask);
>   }

Maybe I'm missing something, but why doesn't this function use the 
ALIGN() macro defined in <linux/kernel.h> ? Are you aware that a type 
with name uintptr_t has been defined in <linux/types.h> that is intended 
for casting a pointer to an int ?

 > +struct ib_indir_reg_list *
 > +mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
 > +			     unsigned int max_indir_list_len)

There are three k[zc]alloc() calls in this function. Can these be 
combined into one without increasing the chance too much that this will 
increase the order of the allocation ?

 > +	dsize = sizeof(*mirl->klms) * max_indir_list_len;
 > +	mirl->mapped_ilist = kzalloc(dsize + MLX5_UMR_ALIGN - 1,
 > +				      GFP_KERNEL);

The value of the constant MLX5_UMR_ALIGN is 2048. Does that mean this 
code will always allocate 2 KB more than needed ? Is this really the 
minimum alignment required for this data structure ? How likely is it 
that this code will trigger a higher order allocation ?

 > +	if (unlikely((*seg == qp->sq.qend)))

A minor comment: one pair of parentheses can be left out from the above 
code.

 > +					mlx5_ib_warn(dev, "\n");

Will the warning generated by this code allow a user to find out which 
code triggered that warning ?

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]         ` <543CB79B.6050400-HInyCGIudOg@public.gmane.org>
@ 2014-10-14 10:50           ` Sagi Grimberg
  2014-10-19 19:34           ` Sagi Grimberg
  1 sibling, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-14 10:50 UTC (permalink / raw)
  To: Bart Van Assche, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/14/2014 8:41 AM, Bart Van Assche wrote:
> On 10/07/14 16:48, Sagi Grimberg wrote:
>> -static __be64 *mr_align(__be64 *ptr, int align)
>> +static void *mr_align(void *ptr, int align)
>>   {
>>       unsigned long mask = align - 1;
>>
>> -    return (__be64 *)(((unsigned long)ptr + mask) & ~mask);
>> +    return (void *)(((unsigned long)ptr + mask) & ~mask);
>>   }
>
> Maybe I'm missing something, but why doesn't this function use the
> ALIGN() macro defined in <linux/kernel.h> ? Are you aware that a type
> with name uintptr_t has been defined in <linux/types.h> that is intended
> for casting a pointer to an int ?
>
>  > +struct ib_indir_reg_list *
>  > +mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
>  > +                 unsigned int max_indir_list_len)
>
> There are three k[zc]alloc() calls in this function. Can these be
> combined into one without increasing the chance too much that this will
> increase the order of the allocation ?
>
>  > +    dsize = sizeof(*mirl->klms) * max_indir_list_len;
>  > +    mirl->mapped_ilist = kzalloc(dsize + MLX5_UMR_ALIGN - 1,
>  > +                      GFP_KERNEL);
>
> The value of the constant MLX5_UMR_ALIGN is 2048. Does that mean this
> code will always allocate 2 KB more than needed ? Is this really the
> minimum alignment required for this data structure ? How likely is it
> that this code will trigger a higher order allocation ?
>
>  > +    if (unlikely((*seg == qp->sq.qend)))
>
> A minor comment: one pair of parentheses can be left out from the above
> code.
>
>  > +                    mlx5_ib_warn(dev, "\n");
>
> Will the warning generated by this code allow a user to find out which
> code triggered that warning ?
>
> Bart.
>

Hey Bart,

Thanks for the feedback!
Unfortunately I'm caught up in other things at the moment, but I'll
address the comments soon enough... And also I still need to review
your v2 of SRP multichannel. I promise to do it ASAP.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]         ` <543CB76B.7020208-HInyCGIudOg@public.gmane.org>
@ 2014-10-19 19:01           ` Sagi Grimberg
       [not found]             ` <54440A7E.200-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-19 19:01 UTC (permalink / raw)
  To: Bart Van Assche, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w, Chuck Lever, Steve Wise

On 10/14/2014 8:40 AM, Bart Van Assche wrote:
> On 10/07/14 16:48, Sagi Grimberg wrote:
>> In order to support that we provide the user with an interface
>> to pass a scattered list of buffers to the IB core layer called
>> ib_indir_reg_list and provide the a new send work request opcode
>> called IB_WR_REG_INDIR_MR. We extend wr union with a new type of
>> memory registration called indir_reg where the user can place the
>> relevant information to perform such a memory registration.
>>
>> The verbs user is expected to perform these steps:
>> 0. Make sure that the device supports Indirect memory registration via
>>     ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and make sure
>>     that ib_device_attr max_indir_reg_mr_list_len suffice for the
>>     expected scatterlist length
>>
>> 1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
>>     This is done via ib_create_mr() with mr_init_attr.flags =
>> IB_MR_INDIRECT_REG
>>
>> 2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
>>     pointers. This is done via new ib_alloc_indir_reg_list() verb
>>
>> 3. Populate the scattered buffers in ib_indir_reg_list.sg_list
>>
>> 4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
>>     provide the populated ib_indir_reg_list
>>
>> 5. Perform data transfer
>>
>> 6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)
>>
>> 7. Free indirect MR and ib_indir_reg_list via
>>     ib_destroy_mr() and ib_free_indir_reg_list()
>
> Hello Sagi,
>
> Is there documentation available somewhere about the order in which an
> HCA must execute an indirect memory registration request relative to
> other work requests, similar to the "Work Request Operation Ordering"
> table in the InfiniBand specification ? I think such documentation is
> needed to ensure consistent behavior across HCA models.
>

So basically Indirect registration request generalizes fast registration
work request, so it naturally it complies to the same operation ordering
specification as fast memory registration operations.

Does it make sense to add some form of
"Documentation/infiniband/registration_ordering_rules.txt"? This should
probably include bind_mw, fastreg, indirect_reg, local_inv..

I'd like to hear more opinions here before I add it...
Roland, Sean, Steve, Chuck, Or?

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]         ` <543CB79B.6050400-HInyCGIudOg@public.gmane.org>
  2014-10-14 10:50           ` Sagi Grimberg
@ 2014-10-19 19:34           ` Sagi Grimberg
       [not found]             ` <5444122C.6070804-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-19 19:34 UTC (permalink / raw)
  To: Bart Van Assche, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/14/2014 8:41 AM, Bart Van Assche wrote:
> On 10/07/14 16:48, Sagi Grimberg wrote:
>> -static __be64 *mr_align(__be64 *ptr, int align)
>> +static void *mr_align(void *ptr, int align)
>>   {
>>       unsigned long mask = align - 1;
>>
>> -    return (__be64 *)(((unsigned long)ptr + mask) & ~mask);
>> +    return (void *)(((unsigned long)ptr + mask) & ~mask);
>>   }
>
> Maybe I'm missing something, but why doesn't this function use the
> ALIGN() macro defined in <linux/kernel.h> ? Are you aware that a type
> with name uintptr_t has been defined in <linux/types.h> that is intended
> for casting a pointer to an int ?

Heh, you are right obviously.

I'll fix that to use ALIGN(ptr, mask)

>
>  > +struct ib_indir_reg_list *
>  > +mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
>  > +                 unsigned int max_indir_list_len)
>
> There are three k[zc]alloc() calls in this function. Can these be
> combined into one without increasing the chance too much that this will
> increase the order of the allocation ?

Well, I have 3 allocations:
1. mlx5_indir_reg_list container
2. sg_list: buffer of sg elements for the user to pass
3. mapped_ilist: buffer for HW representation of sg elements.

mapped_ilist buffer has some constraints on it (see below) so I can't
combine it with the other two. I assume I can rearrange the structure to
combine the fist 2 allocations together, but it can trigger a higher
order allocation (unless I use vzalloc, but I'm not sure what is the
benefit here...).

>
>  > +    dsize = sizeof(*mirl->klms) * max_indir_list_len;
>  > +    mirl->mapped_ilist = kzalloc(dsize + MLX5_UMR_ALIGN - 1,
>  > +                      GFP_KERNEL);
>
> The value of the constant MLX5_UMR_ALIGN is 2048. Does that mean this
> code will always allocate 2 KB more than needed ? Is this really the
> minimum alignment required for this data structure ? How likely is it
> that this code will trigger a higher order allocation ?

Well, here I need a variable size physically contiguous allocation with
a start address aligned to 2K (HW requirement). So yes, with this code
given the user allocate more than PAGE_SIZE-2K ib_sges, a higher order
allocation will be triggered.

I allocate 2K more to ensure I can align to 2K, then I set mirl->klms to
point at the aligned address. Any idea on how can I do this without
allocating 2K more?

>
>  > +    if (unlikely((*seg == qp->sq.qend)))
>
> A minor comment: one pair of parentheses can be left out from the above
> code.

Thanks, will fix.

>
>  > +                    mlx5_ib_warn(dev, "\n");
>
> Will the warning generated by this code allow a user to find out which
> code triggered that warning ?

Not at all - and it should be fixed. will do.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]             ` <5444122C.6070804-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-10-20  7:46               ` Bart Van Assche
       [not found]                 ` <5444BDDC.1060507-HInyCGIudOg@public.gmane.org>
  2014-10-21 14:20               ` Eli Cohen
  1 sibling, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2014-10-20  7:46 UTC (permalink / raw)
  To: Sagi Grimberg, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/19/14 21:34, Sagi Grimberg wrote:
> On 10/14/2014 8:41 AM, Bart Van Assche wrote:
>>  > +struct ib_indir_reg_list *
>>  > +mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
>>  > +                 unsigned int max_indir_list_len)
>>
>> There are three k[zc]alloc() calls in this function. Can these be
>> combined into one without increasing the chance too much that this will
>> increase the order of the allocation ?
>
> Well, I have 3 allocations:
> 1. mlx5_indir_reg_list container
> 2. sg_list: buffer of sg elements for the user to pass
> 3. mapped_ilist: buffer for HW representation of sg elements.
>
> mapped_ilist buffer has some constraints on it (see below) so I can't
> combine it with the other two. I assume I can rearrange the structure to
> combine the fist 2 allocations together, but it can trigger a higher
> order allocation (unless I use vzalloc, but I'm not sure what is the
> benefit here...).

Since vmalloc() is about 1000 times slower than kmalloc() I think we 
should try to avoid vmalloc() and its variants. Combining multiple 
kmalloc() calls into a single kmalloc() call helps performance and 
reduces the number of error paths. But since the time needed for a 
single kmalloc() call is only a few microseconds combining multiple 
kmalloc() calls reduces performance if combining kmalloc() calls 
increases the allocation order from one page to multiple pages.

>>  > +    dsize = sizeof(*mirl->klms) * max_indir_list_len;
>>  > +    mirl->mapped_ilist = kzalloc(dsize + MLX5_UMR_ALIGN - 1,
>>  > +                      GFP_KERNEL);
>>
>> The value of the constant MLX5_UMR_ALIGN is 2048. Does that mean this
>> code will always allocate 2 KB more than needed ? Is this really the
>> minimum alignment required for this data structure ? How likely is it
>> that this code will trigger a higher order allocation ?
>
> Well, here I need a variable size physically contiguous allocation with
> a start address aligned to 2K (HW requirement). So yes, with this code
> given the user allocate more than PAGE_SIZE-2K ib_sges, a higher order
> allocation will be triggered.
>
> I allocate 2K more to ensure I can align to 2K, then I set mirl->klms to
> point at the aligned address. Any idea on how can I do this without
> allocating 2K more?

Specifying a minimum memory alignment requirement is possible with 
kmem_cache_create(). However, kmem_caches only support allocation of 
fixed size objects.

In mm/slab_common.c one can see that for SLAB and SLUB allocations of >= 
256 bytes are guaranteed to be aligned on a boundary equal to the 
smallest power of two that is larger than or equal to the allocation 
size. Unfortunately this does not hold for the SLOB allocator. So I 
think the only change that can be made in this code without complicating 
it too much is changing the "MLX5_UMR_ALIGN - 1" into 
"max(MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0)".

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found]             ` <54440A7E.200-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-10-20 14:54               ` Steve Wise
  0 siblings, 0 replies; 34+ messages in thread
From: Steve Wise @ 2014-10-20 14:54 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Bart Van Assche',
	'Sagi Grimberg',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w, 'Chuck Lever'



> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Sagi Grimberg
> Sent: Sunday, October 19, 2014 2:01 PM
> To: Bart Van Assche; Sagi Grimberg; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; oren-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org; Chuck Lever; Steve
> Wise
> Subject: Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
> 
> On 10/14/2014 8:40 AM, Bart Van Assche wrote:
> > On 10/07/14 16:48, Sagi Grimberg wrote:
> >> In order to support that we provide the user with an interface
> >> to pass a scattered list of buffers to the IB core layer called
> >> ib_indir_reg_list and provide the a new send work request opcode
> >> called IB_WR_REG_INDIR_MR. We extend wr union with a new type of
> >> memory registration called indir_reg where the user can place the
> >> relevant information to perform such a memory registration.
> >>
> >> The verbs user is expected to perform these steps:
> >> 0. Make sure that the device supports Indirect memory registration via
> >>     ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and make sure
> >>     that ib_device_attr max_indir_reg_mr_list_len suffice for the
> >>     expected scatterlist length
> >>
> >> 1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
> >>     This is done via ib_create_mr() with mr_init_attr.flags =
> >> IB_MR_INDIRECT_REG
> >>
> >> 2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
> >>     pointers. This is done via new ib_alloc_indir_reg_list() verb
> >>
> >> 3. Populate the scattered buffers in ib_indir_reg_list.sg_list
> >>
> >> 4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
> >>     provide the populated ib_indir_reg_list
> >>
> >> 5. Perform data transfer
> >>
> >> 6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)
> >>
> >> 7. Free indirect MR and ib_indir_reg_list via
> >>     ib_destroy_mr() and ib_free_indir_reg_list()
> >
> > Hello Sagi,
> >
> > Is there documentation available somewhere about the order in which an
> > HCA must execute an indirect memory registration request relative to
> > other work requests, similar to the "Work Request Operation Ordering"
> > table in the InfiniBand specification ? I think such documentation is
> > needed to ensure consistent behavior across HCA models.
> >
> 
> So basically Indirect registration request generalizes fast registration
> work request, so it naturally it complies to the same operation ordering
> specification as fast memory registration operations.
> 
> Does it make sense to add some form of
> "Documentation/infiniband/registration_ordering_rules.txt"? This should
> probably include bind_mw, fastreg, indirect_reg, local_inv..
> 
> I'd like to hear more opinions here before I add it...
> Roland, Sean, Steve, Chuck, Or?
> 
> Sagi.

I wouldn't replicate the IB and IW specs in Documentation/infiniband/.   Perhaps just something referencing the specs and then
saying the indirect registration adheres exactly to the fast registration rules?

Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]                 ` <5444BDDC.1060507-HInyCGIudOg@public.gmane.org>
@ 2014-10-21  9:32                   ` Sagi Grimberg
       [not found]                     ` <54462842.8080701-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-21  9:32 UTC (permalink / raw)
  To: Bart Van Assche, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/20/2014 10:46 AM, Bart Van Assche wrote:
> On 10/19/14 21:34, Sagi Grimberg wrote:
>> On 10/14/2014 8:41 AM, Bart Van Assche wrote:
>>>  > +struct ib_indir_reg_list *
>>>  > +mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
>>>  > +                 unsigned int max_indir_list_len)
>>>
>>> There are three k[zc]alloc() calls in this function. Can these be
>>> combined into one without increasing the chance too much that this will
>>> increase the order of the allocation ?
>>
>> Well, I have 3 allocations:
>> 1. mlx5_indir_reg_list container
>> 2. sg_list: buffer of sg elements for the user to pass
>> 3. mapped_ilist: buffer for HW representation of sg elements.
>>
>> mapped_ilist buffer has some constraints on it (see below) so I can't
>> combine it with the other two. I assume I can rearrange the structure to
>> combine the fist 2 allocations together, but it can trigger a higher
>> order allocation (unless I use vzalloc, but I'm not sure what is the
>> benefit here...).
>
> Since vmalloc() is about 1000 times slower than kmalloc() I think we
> should try to avoid vmalloc() and its variants. Combining multiple
> kmalloc() calls into a single kmalloc() call helps performance and
> reduces the number of error paths. But since the time needed for a
> single kmalloc() call is only a few microseconds combining multiple
> kmalloc() calls reduces performance if combining kmalloc() calls
> increases the allocation order from one page to multiple pages.

Right, so I think keeping these separated is OK agreed?

>
>>>  > +    dsize = sizeof(*mirl->klms) * max_indir_list_len;
>>>  > +    mirl->mapped_ilist = kzalloc(dsize + MLX5_UMR_ALIGN - 1,
>>>  > +                      GFP_KERNEL);
>>>
>>> The value of the constant MLX5_UMR_ALIGN is 2048. Does that mean this
>>> code will always allocate 2 KB more than needed ? Is this really the
>>> minimum alignment required for this data structure ? How likely is it
>>> that this code will trigger a higher order allocation ?
>>
>> Well, here I need a variable size physically contiguous allocation with
>> a start address aligned to 2K (HW requirement). So yes, with this code
>> given the user allocate more than PAGE_SIZE-2K ib_sges, a higher order
>> allocation will be triggered.
>>
>> I allocate 2K more to ensure I can align to 2K, then I set mirl->klms to
>> point at the aligned address. Any idea on how can I do this without
>> allocating 2K more?
>
> Specifying a minimum memory alignment requirement is possible with
> kmem_cache_create(). However, kmem_caches only support allocation of
> fixed size objects.
>
> In mm/slab_common.c one can see that for SLAB and SLUB allocations of >=
> 256 bytes are guaranteed to be aligned on a boundary equal to the
> smallest power of two that is larger than or equal to the allocation
> size. Unfortunately this does not hold for the SLOB allocator. So I
> think the only change that can be made in this code without complicating
> it too much is changing the "MLX5_UMR_ALIGN - 1" into
> "max(MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0)".

I can give that a go...

Thanks!

P.S.
Since you didn't give any comments on the API itself should I conclude
that you find it clear and you agree with it?

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]                     ` <54462842.8080701-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-10-21 10:54                       ` Bart Van Assche
       [not found]                         ` <54463B4A.1070308-HInyCGIudOg@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2014-10-21 10:54 UTC (permalink / raw)
  To: Sagi Grimberg, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/21/14 11:32, Sagi Grimberg wrote:
> Since you didn't give any comments on the API itself should I conclude
> that you find it clear and you agree with it?

Mostly. The only aspect of which I think that it is missing in the API 
is the alignment requirement for the addresses in 
ib_indir_reg_list.sg_list. Is byte alignment sufficient ? If so, will 
all HCA vendors be able to support this or do we perhaps need to add a 
member in the device attributes structure that declares what the 
alignment requirements are ?

Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]                         ` <54463B4A.1070308-HInyCGIudOg@public.gmane.org>
@ 2014-10-21 10:59                           ` Sagi Grimberg
  0 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-21 10:59 UTC (permalink / raw)
  To: Bart Van Assche, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, eli-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/21/2014 1:54 PM, Bart Van Assche wrote:
> On 10/21/14 11:32, Sagi Grimberg wrote:
>> Since you didn't give any comments on the API itself should I conclude
>> that you find it clear and you agree with it?
>
> Mostly. The only aspect of which I think that it is missing in the API
> is the alignment requirement for the addresses in
> ib_indir_reg_list.sg_list. Is byte alignment sufficient ?

Yes.

> If so, will
> all HCA vendors be able to support this or do we perhaps need to add a
> member in the device attributes structure that declares what the
> alignment requirements are ?
>

This feature is designed to remove any alignment constraints on memory
registration. I don't see any point in allowing the API to relax the
constraints instead.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]             ` <543B8E09.6090606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2014-10-13 12:57               ` Or Gerlitz
  2014-10-13 13:00               ` Or Gerlitz
@ 2014-10-21 13:12               ` Eli Cohen
  2 siblings, 0 replies; 34+ messages in thread
From: Eli Cohen @ 2014-10-21 13:12 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Or Gerlitz, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	bvanassche-HInyCGIudOg, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, oren-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On Mon, Oct 13, 2014 at 11:32:09AM +0300, Sagi Grimberg wrote:
> >
> >>+static void set_indir_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
> >>+                  struct ib_send_wr *wr)
> >>+{
> >>+    u64 mask;
> >>+    u32 list_len = wr->wr.indir_reg.indir_list_len;
> >>+
> >>+    memset(umr, 0, sizeof(*umr));
> >>+
> >>+    umr->klm_octowords = get_klm_octo(list_len * 2);
> >>+    mask = MLX5_MKEY_MASK_LEN        |
> >>+        MLX5_MKEY_MASK_PAGE_SIZE    |
> >>+        MLX5_MKEY_MASK_START_ADDR    |
> >>+        MLX5_MKEY_MASK_EN_RINVAL    |
> >>+        MLX5_MKEY_MASK_KEY        |
> >>+        MLX5_MKEY_MASK_LR        |
> >>+        MLX5_MKEY_MASK_LW        |
> >>+        MLX5_MKEY_MASK_RR        |
> >>+        MLX5_MKEY_MASK_RW        |
> >>+        MLX5_MKEY_MASK_A        |
> >>+        MLX5_MKEY_MASK_FREE;
> >>+
> >>+    umr->mkey_mask = cpu_to_be64(mask);
> >>+}
> >
> >here you basically replicate the majority of the code from
> >set_reg_umr_segment - share the common part...
> 
> I've thought about it, but I know Eli prefers not to centralize the
> umr control segment (and mkey context) setting routines at this point.
> I am not against doing it at all, but let's let Eli comment here.
> 

We have some versions of this mask building but they are not
identical. Note that we build the mask at compile time so it does not
affect code size and it will make future changes easier.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
       [not found]             ` <5444122C.6070804-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2014-10-20  7:46               ` Bart Van Assche
@ 2014-10-21 14:20               ` Eli Cohen
  2014-10-21 14:30                 ` Sagi Grimberg
  1 sibling, 1 reply; 34+ messages in thread
From: Eli Cohen @ 2014-10-21 14:20 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, Sagi Grimberg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On Sun, Oct 19, 2014 at 10:34:04PM +0300, Sagi Grimberg wrote:
> >
> > > +                    mlx5_ib_warn(dev, "\n");
> >
> >Will the warning generated by this code allow a user to find out which
> >code triggered that warning ?
> 
> Not at all - and it should be fixed. will do.
> 

This macro expands to function name and line number using __func__ and
__LINE__ so better leave it as is if you care to know what failed.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature
  2014-10-21 14:20               ` Eli Cohen
@ 2014-10-21 14:30                 ` Sagi Grimberg
  0 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-21 14:30 UTC (permalink / raw)
  To: Eli Cohen
  Cc: Bart Van Assche, Sagi Grimberg,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	eli-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/21/2014 5:20 PM, Eli Cohen wrote:
> On Sun, Oct 19, 2014 at 10:34:04PM +0300, Sagi Grimberg wrote:
>>>
>>>> +                    mlx5_ib_warn(dev, "\n");
>>>
>>> Will the warning generated by this code allow a user to find out which
>>> code triggered that warning ?
>>
>> Not at all - and it should be fixed. will do.
>>
>
> This macro expands to function name and line number using __func__ and
> __LINE__ so better leave it as is if you care to know what failed.
>

Yes, but we can also give a sentence of what failed in the function.
the sources are not always available...
I don't mind modifying it.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found] ` <CAJ3xEMjdnNNbhRC0T_=hmRedwJFvSR9r-JccLZ2m0zaece5OQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-10-13  8:10   ` Sagi Grimberg
@ 2014-10-13  8:11   ` Sagi Grimberg
  1 sibling, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-13  8:11 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	Roland Dreier, Eli Cohen, Or Gerlitz,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/9/2014 11:13 PM, Or Gerlitz wrote:
> On Tue, Oct 7, 2014, Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> [...]
>>   enum ib_signature_prot_cap {
>> @@ -182,6 +183,7 @@ struct ib_device_attr {
>>          int                     max_srq_wr;
>>          int                     max_srq_sge;
>>          unsigned int            max_fast_reg_page_list_len;
>> +       unsigned int            max_indir_reg_mr_list_len;
>
> The indirection registration list is basically made of  struct ib_sge
> objects which are posted on a send-like work-request, any reason  to
> have a dedicated dev cap attribute for that and not use the already
> existing one (max_sge)?
>

Hi Or,

max_send_sge capability is how many ib_sge's the device can handle in a
single send work request which takes into consideration element such as
wqe control segment size and sq reservations. This is different from how
many ib_sge's the device can register to a single indirect memory key
which is free of such limitations.

So given these are different capabilities I prefer to expose them in
different attributes.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
       [not found] ` <CAJ3xEMjdnNNbhRC0T_=hmRedwJFvSR9r-JccLZ2m0zaece5OQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-10-13  8:10   ` Sagi Grimberg
  2014-10-13  8:11   ` Sagi Grimberg
  1 sibling, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2014-10-13  8:10 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	Roland Dreier, Eli Cohen, Or Gerlitz,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On 10/9/2014 11:13 PM, Or Gerlitz wrote:
> On Tue, Oct 7, 2014, Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> [...]
>>   enum ib_signature_prot_cap {
>> @@ -182,6 +183,7 @@ struct ib_device_attr {
>>          int                     max_srq_wr;
>>          int                     max_srq_sge;
>>          unsigned int            max_fast_reg_page_list_len;
>> +       unsigned int            max_indir_reg_mr_list_len;
>
> The indirection registration list is basically made of  struct ib_sge
> objects which are posted on a send-like work-request, any reason  to
> have a dedicated dev cap attribute for that and not use the already
> existing one (max_sge)?
>

Hi Or,

max_send_sge capability is how many ib_sge's the device can handle in a
single send work request which takes into consideration element such as
wqe control segment size and sq reservations. This is different from how
many ib_sge's the device can register to a single indirect memory key
which is free of such limitations.

So given these are different capabilities I prefer to expose them in
different attributes.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API
@ 2014-10-09 20:13 Or Gerlitz
       [not found] ` <CAJ3xEMjdnNNbhRC0T_=hmRedwJFvSR9r-JccLZ2m0zaece5OQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Or Gerlitz @ 2014-10-09 20:13 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bart Van Assche,
	Roland Dreier, Eli Cohen, Or Gerlitz,
	oren-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w

On Tue, Oct 7, 2014, Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
[...]
>  enum ib_signature_prot_cap {
> @@ -182,6 +183,7 @@ struct ib_device_attr {
>         int                     max_srq_wr;
>         int                     max_srq_sge;
>         unsigned int            max_fast_reg_page_list_len;
> +       unsigned int            max_indir_reg_mr_list_len;

The indirection registration list is basically made of  struct ib_sge
objects which are posted on a send-like work-request, any reason  to
have a dedicated dev cap attribute for that and not use the already
existing one (max_sge)?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2014-10-21 14:30 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-07 14:47 [PATCH RFC 0/2] Indirect Fast Memory registration support Sagi Grimberg
     [not found] ` <1412693281-6161-1-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-07 14:48   ` [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API Sagi Grimberg
     [not found]     ` <1412693281-6161-2-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-07 18:12       ` Steve Wise
     [not found]         ` <54342D0C.6050103-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2014-10-08  5:48           ` Sagi Grimberg
     [not found]             ` <5434D037.4040208-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-10-08 13:54               ` Steve Wise
2014-10-13  7:57                 ` Sagi Grimberg
     [not found]                   ` <543B85F7.1060000-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-10-13 14:41                     ` Steve Wise
2014-10-14  5:40       ` Bart Van Assche
     [not found]         ` <543CB76B.7020208-HInyCGIudOg@public.gmane.org>
2014-10-19 19:01           ` Sagi Grimberg
     [not found]             ` <54440A7E.200-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-10-20 14:54               ` Steve Wise
2014-10-07 14:48   ` [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature Sagi Grimberg
     [not found]     ` <1412693281-6161-3-git-send-email-sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-12 19:39       ` Or Gerlitz
     [not found]         ` <543AD8DE.5060404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-13  8:32           ` Sagi Grimberg
     [not found]             ` <543B8E09.6090606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-10-13 12:57               ` Or Gerlitz
2014-10-13 13:00               ` Or Gerlitz
2014-10-21 13:12               ` Eli Cohen
2014-10-14  5:41       ` Bart Van Assche
     [not found]         ` <543CB79B.6050400-HInyCGIudOg@public.gmane.org>
2014-10-14 10:50           ` Sagi Grimberg
2014-10-19 19:34           ` Sagi Grimberg
     [not found]             ` <5444122C.6070804-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-10-20  7:46               ` Bart Van Assche
     [not found]                 ` <5444BDDC.1060507-HInyCGIudOg@public.gmane.org>
2014-10-21  9:32                   ` Sagi Grimberg
     [not found]                     ` <54462842.8080701-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-10-21 10:54                       ` Bart Van Assche
     [not found]                         ` <54463B4A.1070308-HInyCGIudOg@public.gmane.org>
2014-10-21 10:59                           ` Sagi Grimberg
2014-10-21 14:20               ` Eli Cohen
2014-10-21 14:30                 ` Sagi Grimberg
2014-10-08 11:06   ` [PATCH RFC 0/2] Indirect Fast Memory registration support Devesh Sharma
     [not found]     ` <EE7902D3F51F404C82415C4803930ACD40C4114B-DWYeeINJQrxExQ8dmkPuX0M9+F4ksjoh@public.gmane.org>
2014-10-13  8:01       ` Sagi Grimberg
2014-10-12 19:43   ` Or Gerlitz
     [not found]     ` <543AD9CD.80803-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-10-13  8:48       ` Bart Van Assche
     [not found]         ` <543B91E2.70206-HInyCGIudOg@public.gmane.org>
2014-10-13 11:18           ` Sagi Grimberg
     [not found]             ` <543BB4F4.8090203-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-10-13 14:51               ` Steve Wise
2014-10-09 20:13 [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API Or Gerlitz
     [not found] ` <CAJ3xEMjdnNNbhRC0T_=hmRedwJFvSR9r-JccLZ2m0zaece5OQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-10-13  8:10   ` Sagi Grimberg
2014-10-13  8:11   ` Sagi Grimberg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.