* generic RDMA READ/WRITE API V4
@ 2016-03-08 18:16 Christoph Hellwig
  2016-03-08 18:16 ` [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp Christoph Hellwig
                   ` (6 more replies)
  0 siblings, 7 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford, bart.vanassche; +Cc: sagig, swise, linux-rdma, target-devel

This series contains patches that implement a first version of a generic
API to handle RDMA READ/WRITE operations as commonly used on the target
(or server) side for storage protocols.

I think it's basically ready to merge now, but I'd really like to have
a positive review from Bart for the SRP target conversion.

This has been developed for the upcoming NVMe over Fabrics target, and
extensively tested as part of that, although this upstream version has
additional updates over the one we're currently using.

It hides details such as the use of MRs for iWarp devices, and will make it
easy to accommodate other HCA-specific behavior in the future.
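
To give a rough idea of the shape of the API (an illustrative sketch only,
not part of the series; qp, port_num, sg, sg_cnt, remote_addr, rkey and cqe
are assumed to come from the target driver's own command setup), a consumer
of the rdma_rw_ctx helpers added in patch 7/9 would do something like:

	struct rdma_rw_ctx ctx;
	int ret;

	/* map the scatterlist and build the READ/WRITE (and MR) WRs */
	ret = rdma_rw_ctx_init(&ctx, qp, port_num, sg, sg_cnt, 0,
			remote_addr, rkey, DMA_FROM_DEVICE);
	if (ret < 0)
		return ret;

	/* post the chain; cqe fires when the last WR completes */
	ret = rdma_rw_ctx_post(&ctx, qp, port_num, cqe, NULL);

	/* once the transfer has completed: */
	rdma_rw_ctx_destroy(&ctx, qp, port_num, sg, sg_cnt, DMA_FROM_DEVICE);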

This series contains a conversion of the SRP target, and the git tree
below also has an RFC conversion of the iSER target (a little hacky due
to the signature MR support, which I can't test).

I also have a git tree available at:

	git://git.infradead.org/users/hch/rdma.git rdma-rw-api

Gitweb:

	http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-rw-api

These two also include the RFC iSER target conversion.

Changes since V3:
 - really fold the list_del in mr_pool_get into the right patch

Changes since V2:
 - fold the list_del in mr_pool_get into the right patch
 - clamp the max FR page size length
 - minor srpt style fix
 - spelling fixes
 - renamed rdma_has_read_invalidate to rdma_cap_read_inv

Changes since V1:
 - fixed offset handling in ib_sg_to_pages
 - uses proper SG iterators to handle larger than PAGE_SIZE segments
 - adjusted parameters for some functions to reduce size of the context
 - SRP target support

* [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
  2016-03-08 18:16 generic RDMA READ/WRITE API V4 Christoph Hellwig
@ 2016-03-08 18:16 ` Christoph Hellwig
  2016-03-08 19:25   ` Bart Van Assche
  2016-03-08 19:27   ` Leon Romanovsky
  2016-03-08 18:16 ` [PATCH v4 2/9] IB/core: allow passing mapping an offset into the SG in ib_map_mr_sg Christoph Hellwig
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford, bart.vanassche; +Cc: sagig, swise, linux-rdma, target-devel

The new RW API will need this.
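
As a rough illustration (not part of the patch; send_cq, recv_cq and cm_id
are placeholders from the caller's own setup), an RDMA CM consumer simply
fills in its QP attributes and rdma_create_qp() now propagates the bound
port before ib_create_qp() sees them:

	struct ib_qp_init_attr attr = { };
	int ret;

	attr.qp_type = IB_QPT_RC;
	attr.send_cq = send_cq;
	attr.recv_cq = recv_cq;
	/* attr.port_num left at 0 by the caller */

	ret = rdma_create_qp(cm_id, pd, &attr);
	/* rdma_create_qp() copies cm_id->port_num into attr.port_num */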

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/core/cma.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9729639..a791522 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -800,6 +800,7 @@ int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
 	if (id->device != pd->device)
 		return -EINVAL;
 
+	qp_init_attr->port_num = id->port_num;
 	qp = ib_create_qp(pd, qp_init_attr);
 	if (IS_ERR(qp))
 		return PTR_ERR(qp);
-- 
2.1.4

* [PATCH v4 2/9] IB/core: allow passing mapping an offset into the SG in ib_map_mr_sg
  2016-03-08 18:16 generic RDMA READ/WRITE API V4 Christoph Hellwig
  2016-03-08 18:16 ` [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp Christoph Hellwig
@ 2016-03-08 18:16 ` Christoph Hellwig
  2016-03-08 19:27   ` Bart Van Assche
  2016-03-08 18:16 ` [PATCH v4 3/9] IB/core: add a helper to check for READ WITH INVALIDATE support Christoph Hellwig
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford, bart.vanassche; +Cc: sagig, swise, linux-rdma, target-devel
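
In short, ib_map_mr_sg() and ib_sg_to_pages() gain an sg_offset parameter so
a registration can start at a byte offset into the first S/G element.  A
hedged sketch of the new calling convention (mr, sg, sg_nents and sg_offset
stand in for the caller's state):

	int n;

	n = ib_map_mr_sg(mr, sg, sg_nents, sg_offset, PAGE_SIZE);
	if (unlikely(n < sg_nents))
		/* not all entries fit into the MR's page vector */
		return n < 0 ? n : -EINVAL;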

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/core/verbs.c             | 24 ++++++++++++------------
 drivers/infiniband/hw/cxgb3/iwch_provider.c |  7 +++----
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h      |  5 ++---
 drivers/infiniband/hw/cxgb4/mem.c           |  7 +++----
 drivers/infiniband/hw/mlx4/mlx4_ib.h        |  5 ++---
 drivers/infiniband/hw/mlx4/mr.c             |  7 +++----
 drivers/infiniband/hw/mlx5/mlx5_ib.h        |  5 ++---
 drivers/infiniband/hw/mlx5/mr.c             |  7 +++----
 drivers/infiniband/hw/nes/nes_verbs.c       |  7 +++----
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |  7 +++----
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |  5 ++---
 drivers/infiniband/hw/qib/qib_mr.c          |  7 +++----
 drivers/infiniband/hw/qib/qib_verbs.h       |  5 ++---
 drivers/infiniband/ulp/iser/iser_memory.c   |  4 ++--
 drivers/infiniband/ulp/isert/ib_isert.c     |  2 +-
 drivers/infiniband/ulp/srp/ib_srp.c         |  2 +-
 include/rdma/ib_verbs.h                     | 23 +++++++++--------------
 net/rds/iw_rdma.c                           |  2 +-
 net/rds/iw_send.c                           |  2 +-
 net/sunrpc/xprtrdma/frwr_ops.c              |  2 +-
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c     |  2 +-
 21 files changed, 60 insertions(+), 77 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 48dc43c..eaed99b 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1557,6 +1557,7 @@ EXPORT_SYMBOL(ib_check_mr_status);
  * @mr:            memory region
  * @sg:            dma mapped scatterlist
  * @sg_nents:      number of entries in sg
+ * @sg_offset:     offset in bytes into sg
  * @page_size:     page vector desired page size
  *
  * Constraints:
@@ -1573,17 +1574,15 @@ EXPORT_SYMBOL(ib_check_mr_status);
  * After this completes successfully, the  memory region
  * is ready for registration.
  */
-int ib_map_mr_sg(struct ib_mr *mr,
-		 struct scatterlist *sg,
-		 int sg_nents,
-		 unsigned int page_size)
+int ib_map_mr_sg(struct ib_mr *mr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset, unsigned int page_size)
 {
 	if (unlikely(!mr->device->map_mr_sg))
 		return -ENOSYS;
 
 	mr->page_size = page_size;
 
-	return mr->device->map_mr_sg(mr, sg, sg_nents);
+	return mr->device->map_mr_sg(mr, sg, sg_nents, sg_offset);
 }
 EXPORT_SYMBOL(ib_map_mr_sg);
 
@@ -1593,6 +1592,7 @@ EXPORT_SYMBOL(ib_map_mr_sg);
  * @mr:            memory region
  * @sgl:           dma mapped scatterlist
  * @sg_nents:      number of entries in sg
+ * @sg_offset:     offset in bytes into sg
  * @set_page:      driver page assignment function pointer
  *
  * Core service helper for drivers to convert the largest
@@ -1603,10 +1603,8 @@ EXPORT_SYMBOL(ib_map_mr_sg);
  * Returns the number of sg elements that were assigned to
  * a page vector.
  */
-int ib_sg_to_pages(struct ib_mr *mr,
-		   struct scatterlist *sgl,
-		   int sg_nents,
-		   int (*set_page)(struct ib_mr *, u64))
+int ib_sg_to_pages(struct ib_mr *mr, struct scatterlist *sgl, int sg_nents,
+		unsigned int sg_offset, int (*set_page)(struct ib_mr *, u64))
 {
 	struct scatterlist *sg;
 	u64 last_end_dma_addr = 0;
@@ -1614,12 +1612,12 @@ int ib_sg_to_pages(struct ib_mr *mr,
 	u64 page_mask = ~((u64)mr->page_size - 1);
 	int i, ret;
 
-	mr->iova = sg_dma_address(&sgl[0]);
+	mr->iova = sg_dma_address(&sgl[0]) + sg_offset;
 	mr->length = 0;
 
 	for_each_sg(sgl, sg, sg_nents, i) {
-		u64 dma_addr = sg_dma_address(sg);
-		unsigned int dma_len = sg_dma_len(sg);
+		u64 dma_addr = sg_dma_address(sg) + sg_offset;
+		unsigned int dma_len = sg_dma_len(sg) - sg_offset;
 		u64 end_dma_addr = dma_addr + dma_len;
 		u64 page_addr = dma_addr & page_mask;
 
@@ -1652,6 +1650,8 @@ next_page:
 		mr->length += dma_len;
 		last_end_dma_addr = end_dma_addr;
 		last_page_off = end_dma_addr & ~page_mask;
+
+		sg_offset = 0;
 	}
 
 	return i;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 2734820..a9b8ed5 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -782,15 +782,14 @@ static int iwch_set_page(struct ib_mr *ibmr, u64 addr)
 	return 0;
 }
 
-static int iwch_map_mr_sg(struct ib_mr *ibmr,
-			  struct scatterlist *sg,
-			  int sg_nents)
+static int iwch_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+		int sg_nents, unsigned sg_offset)
 {
 	struct iwch_mr *mhp = to_iwch_mr(ibmr);
 
 	mhp->npages = 0;
 
-	return ib_sg_to_pages(ibmr, sg, sg_nents, iwch_set_page);
+	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, iwch_set_page);
 }
 
 static int iwch_destroy_qp(struct ib_qp *ib_qp)
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 7c6a6e1..5db3c30 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -959,9 +959,8 @@ void c4iw_qp_rem_ref(struct ib_qp *qp);
 struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd,
 			    enum ib_mr_type mr_type,
 			    u32 max_num_sg);
-int c4iw_map_mr_sg(struct ib_mr *ibmr,
-		   struct scatterlist *sg,
-		   int sg_nents);
+int c4iw_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset);
 int c4iw_dealloc_mw(struct ib_mw *mw);
 struct ib_mw *c4iw_alloc_mw(struct ib_pd *pd, enum ib_mw_type type);
 struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start,
diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index 9274c90..97b8aae 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -688,15 +688,14 @@ static int c4iw_set_page(struct ib_mr *ibmr, u64 addr)
 	return 0;
 }
 
-int c4iw_map_mr_sg(struct ib_mr *ibmr,
-		   struct scatterlist *sg,
-		   int sg_nents)
+int c4iw_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset)
 {
 	struct c4iw_mr *mhp = to_c4iw_mr(ibmr);
 
 	mhp->mpl_len = 0;
 
-	return ib_sg_to_pages(ibmr, sg, sg_nents, c4iw_set_page);
+	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, c4iw_set_page);
 }
 
 int c4iw_dereg_mr(struct ib_mr *ib_mr)
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 52ce7b0..e38cc44 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -716,9 +716,8 @@ int mlx4_ib_dealloc_mw(struct ib_mw *mw);
 struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd,
 			       enum ib_mr_type mr_type,
 			       u32 max_num_sg);
-int mlx4_ib_map_mr_sg(struct ib_mr *ibmr,
-		      struct scatterlist *sg,
-		      int sg_nents);
+int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset);
 int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
 int mlx4_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata);
 struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev,
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 242b94e..32c387d 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -526,9 +526,8 @@ static int mlx4_set_page(struct ib_mr *ibmr, u64 addr)
 	return 0;
 }
 
-int mlx4_ib_map_mr_sg(struct ib_mr *ibmr,
-		      struct scatterlist *sg,
-		      int sg_nents)
+int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset)
 {
 	struct mlx4_ib_mr *mr = to_mmr(ibmr);
 	int rc;
@@ -539,7 +538,7 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr,
 				   sizeof(u64) * mr->max_pages,
 				   DMA_TO_DEVICE);
 
-	rc = ib_sg_to_pages(ibmr, sg, sg_nents, mlx4_set_page);
+	rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
 
 	ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
 				      sizeof(u64) * mr->max_pages,
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index d2b9737..2d05bb5 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -654,9 +654,8 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
 struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
 			       enum ib_mr_type mr_type,
 			       u32 max_num_sg);
-int mlx5_ib_map_mr_sg(struct ib_mr *ibmr,
-		      struct scatterlist *sg,
-		      int sg_nents);
+int mlx5_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset);
 int mlx5_ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
 			const struct ib_wc *in_wc, const struct ib_grh *in_grh,
 			const struct ib_mad_hdr *in, size_t in_mad_size,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 6000f7a..2afcb61 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1450,9 +1450,8 @@ static int mlx5_set_page(struct ib_mr *ibmr, u64 addr)
 	return 0;
 }
 
-int mlx5_ib_map_mr_sg(struct ib_mr *ibmr,
-		      struct scatterlist *sg,
-		      int sg_nents)
+int mlx5_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset)
 {
 	struct mlx5_ib_mr *mr = to_mmr(ibmr);
 	int n;
@@ -1463,7 +1462,7 @@ int mlx5_ib_map_mr_sg(struct ib_mr *ibmr,
 				   mr->desc_size * mr->max_descs,
 				   DMA_TO_DEVICE);
 
-	n = ib_sg_to_pages(ibmr, sg, sg_nents, mlx5_set_page);
+	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx5_set_page);
 
 	ib_dma_sync_single_for_device(ibmr->device, mr->desc_map,
 				      mr->desc_size * mr->max_descs,
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 8c4daf7..8c85b84 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -401,15 +401,14 @@ static int nes_set_page(struct ib_mr *ibmr, u64 addr)
 	return 0;
 }
 
-static int nes_map_mr_sg(struct ib_mr *ibmr,
-			 struct scatterlist *sg,
-			 int sg_nents)
+static int nes_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+		int sg_nents, unsigned int sg_offset)
 {
 	struct nes_mr *nesmr = to_nesmr(ibmr);
 
 	nesmr->npages = 0;
 
-	return ib_sg_to_pages(ibmr, sg, sg_nents, nes_set_page);
+	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, nes_set_page);
 }
 
 /**
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 4df3f13..5bf06b9 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -3083,13 +3083,12 @@ static int ocrdma_set_page(struct ib_mr *ibmr, u64 addr)
 	return 0;
 }
 
-int ocrdma_map_mr_sg(struct ib_mr *ibmr,
-		     struct scatterlist *sg,
-		     int sg_nents)
+int ocrdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset)
 {
 	struct ocrdma_mr *mr = get_ocrdma_mr(ibmr);
 
 	mr->npages = 0;
 
-	return ib_sg_to_pages(ibmr, sg, sg_nents, ocrdma_set_page);
+	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, ocrdma_set_page);
 }
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
index 8b517fd..b290e5d 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
@@ -122,8 +122,7 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *, u64 start, u64 length,
 struct ib_mr *ocrdma_alloc_mr(struct ib_pd *pd,
 			      enum ib_mr_type mr_type,
 			      u32 max_num_sg);
-int ocrdma_map_mr_sg(struct ib_mr *ibmr,
-		     struct scatterlist *sg,
-		     int sg_nents);
+int ocrdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned sg_offset);
 
 #endif				/* __OCRDMA_VERBS_H__ */
diff --git a/drivers/infiniband/hw/qib/qib_mr.c b/drivers/infiniband/hw/qib/qib_mr.c
index 5f53304..c5fb4dd 100644
--- a/drivers/infiniband/hw/qib/qib_mr.c
+++ b/drivers/infiniband/hw/qib/qib_mr.c
@@ -315,15 +315,14 @@ static int qib_set_page(struct ib_mr *ibmr, u64 addr)
 	return 0;
 }
 
-int qib_map_mr_sg(struct ib_mr *ibmr,
-		  struct scatterlist *sg,
-		  int sg_nents)
+int qib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset)
 {
 	struct qib_mr *mr = to_imr(ibmr);
 
 	mr->npages = 0;
 
-	return ib_sg_to_pages(ibmr, sg, sg_nents, qib_set_page);
+	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, qib_set_page);
 }
 
 /**
diff --git a/drivers/infiniband/hw/qib/qib_verbs.h b/drivers/infiniband/hw/qib/qib_verbs.h
index 6c5e777..5067cac 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.h
+++ b/drivers/infiniband/hw/qib/qib_verbs.h
@@ -1042,9 +1042,8 @@ struct ib_mr *qib_alloc_mr(struct ib_pd *pd,
 			   enum ib_mr_type mr_type,
 			   u32 max_entries);
 
-int qib_map_mr_sg(struct ib_mr *ibmr,
-		  struct scatterlist *sg,
-		  int sg_nents);
+int qib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset);
 
 int qib_reg_mr(struct qib_qp *qp, struct ib_reg_wr *wr);
 
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index 9a391cc..44cc85f 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -236,7 +236,7 @@ int iser_fast_reg_fmr(struct iscsi_iser_task *iser_task,
 	page_vec->npages = 0;
 	page_vec->fake_mr.page_size = SIZE_4K;
 	plen = ib_sg_to_pages(&page_vec->fake_mr, mem->sg,
-			      mem->size, iser_set_page);
+			      mem->size, 0, iser_set_page);
 	if (unlikely(plen < mem->size)) {
 		iser_err("page vec too short to hold this SG\n");
 		iser_data_buf_dump(mem, device->ib_device);
@@ -446,7 +446,7 @@ static int iser_fast_reg_mr(struct iscsi_iser_task *iser_task,
 
 	ib_update_fast_reg_key(mr, ib_inc_rkey(mr->rkey));
 
-	n = ib_map_mr_sg(mr, mem->sg, mem->size, SIZE_4K);
+	n = ib_map_mr_sg(mr, mem->sg, mem->size, 0, SIZE_4K);
 	if (unlikely(n != mem->size)) {
 		iser_err("failed to map sg (%d/%d)\n",
 			 n, mem->size);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index f121e61..7c7ad3a 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -2561,7 +2561,7 @@ isert_fast_reg_mr(struct isert_conn *isert_conn,
 		wr = &inv_wr;
 	}
 
-	n = ib_map_mr_sg(mr, mem->sg, mem->nents, PAGE_SIZE);
+	n = ib_map_mr_sg(mr, mem->sg, mem->nents, 0, PAGE_SIZE);
 	if (unlikely(n != mem->nents)) {
 		isert_err("failed to map mr sg (%d/%d)\n",
 			 n, mem->nents);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index b6bf204..ca425f2 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1333,7 +1333,7 @@ static int srp_map_finish_fr(struct srp_map_state *state,
 	rkey = ib_inc_rkey(desc->mr->rkey);
 	ib_update_fast_reg_key(desc->mr, rkey);
 
-	n = ib_map_mr_sg(desc->mr, state->sg, sg_nents, dev->mr_page_size);
+	n = ib_map_mr_sg(desc->mr, state->sg, sg_nents, 0, dev->mr_page_size);
 	if (unlikely(n < 0))
 		return n;
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 082a67a..1e21fb8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1811,7 +1811,8 @@ struct ib_device {
 					       u32 max_num_sg);
 	int                        (*map_mr_sg)(struct ib_mr *mr,
 						struct scatterlist *sg,
-						int sg_nents);
+						int sg_nents,
+						unsigned sg_offset);
 	struct ib_mw *             (*alloc_mw)(struct ib_pd *pd,
 					       enum ib_mw_type type);
 	int                        (*dealloc_mw)(struct ib_mw *mw);
@@ -3077,29 +3078,23 @@ struct net_device *ib_get_net_dev_by_params(struct ib_device *dev, u8 port,
 					    u16 pkey, const union ib_gid *gid,
 					    const struct sockaddr *addr);
 
-int ib_map_mr_sg(struct ib_mr *mr,
-		 struct scatterlist *sg,
-		 int sg_nents,
-		 unsigned int page_size);
+int ib_map_mr_sg(struct ib_mr *mr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset, unsigned int page_size);
 
 static inline int
-ib_map_mr_sg_zbva(struct ib_mr *mr,
-		  struct scatterlist *sg,
-		  int sg_nents,
-		  unsigned int page_size)
+ib_map_mr_sg_zbva(struct ib_mr *mr, struct scatterlist *sg, int sg_nents,
+		unsigned int sg_offset, unsigned int page_size)
 {
 	int n;
 
-	n = ib_map_mr_sg(mr, sg, sg_nents, page_size);
+	n = ib_map_mr_sg(mr, sg, sg_nents, sg_offset, page_size);
 	mr->iova = 0;
 
 	return n;
 }
 
-int ib_sg_to_pages(struct ib_mr *mr,
-		   struct scatterlist *sgl,
-		   int sg_nents,
-		   int (*set_page)(struct ib_mr *, u64));
+int ib_sg_to_pages(struct ib_mr *mr, struct scatterlist *sgl, int sg_nents,
+		unsigned int sg_offset, int (*set_page)(struct ib_mr *, u64));
 
 void ib_drain_rq(struct ib_qp *qp);
 void ib_drain_sq(struct ib_qp *qp);
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
index b09a40c..7ce9a92 100644
--- a/net/rds/iw_rdma.c
+++ b/net/rds/iw_rdma.c
@@ -666,7 +666,7 @@ static int rds_iw_rdma_reg_mr(struct rds_iw_mapping *mapping)
 	struct ib_send_wr *failed_wr;
 	int ret, n;
 
-	n = ib_map_mr_sg_zbva(ibmr->mr, m_sg->list, m_sg->len, PAGE_SIZE);
+	n = ib_map_mr_sg_zbva(ibmr->mr, m_sg->list, m_sg->len, 0, PAGE_SIZE);
 	if (unlikely(n != m_sg->len))
 		return n < 0 ? n : -EINVAL;
 
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
index e20bd50..1ba3c57 100644
--- a/net/rds/iw_send.c
+++ b/net/rds/iw_send.c
@@ -767,7 +767,7 @@ static int rds_iw_build_send_reg(struct rds_iw_send_work *send,
 {
 	int n;
 
-	n = ib_map_mr_sg(send->s_mr, sg, sg_nents, PAGE_SIZE);
+	n = ib_map_mr_sg(send->s_mr, sg, sg_nents, 0, PAGE_SIZE);
 	if (unlikely(n != sg_nents))
 		return n < 0 ? n : -EINVAL;
 
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index e165673..4ffdcb2 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -384,7 +384,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
 		return -ENOMEM;
 	}
 
-	n = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, PAGE_SIZE);
+	n = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, 0, PAGE_SIZE);
 	if (unlikely(n != frmr->sg_nents)) {
 		pr_err("RPC:       %s: failed to map mr %p (%u/%u)\n",
 		       __func__, frmr->fr_mr, n, frmr->sg_nents);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index c8b8a8b..1244a62 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -281,7 +281,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
 	}
 	atomic_inc(&xprt->sc_dma_used);
 
-	n = ib_map_mr_sg(frmr->mr, frmr->sg, frmr->sg_nents, PAGE_SIZE);
+	n = ib_map_mr_sg(frmr->mr, frmr->sg, frmr->sg_nents, 0, PAGE_SIZE);
 	if (unlikely(n != frmr->sg_nents)) {
 		pr_err("svcrdma: failed to map mr %p (%d/%d elements)\n",
 		       frmr->mr, n, frmr->sg_nents);
-- 
2.1.4

* [PATCH v4 3/9] IB/core: add a helper to check for READ WITH INVALIDATE support
  2016-03-08 18:16 generic RDMA READ/WRITE API V4 Christoph Hellwig
  2016-03-08 18:16 ` [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp Christoph Hellwig
  2016-03-08 18:16 ` [PATCH v4 2/9] IB/core: allow passing mapping an offset into the SG in ib_map_mr_sg Christoph Hellwig
@ 2016-03-08 18:16 ` Christoph Hellwig
  2016-03-08 18:16 ` [PATCH v4 4/9] IB/core: refactor ib_create_qp Christoph Hellwig
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford, bart.vanassche; +Cc: sagig, swise, linux-rdma, target-devel
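
For illustration only (mirroring how the RW API in patch 7/9 uses it; wr, mr,
qp and port_num are placeholders), a consumer picks the READ opcode based on
the new helper:

	if (rdma_cap_read_inv(qp->device, port_num)) {
		wr->opcode = IB_WR_RDMA_READ_WITH_INV;
		wr->ex.invalidate_rkey = mr->lkey;
	} else {
		wr->opcode = IB_WR_RDMA_READ;
	}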

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
---
 include/rdma/ib_verbs.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1e21fb8..cdb9159 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2293,6 +2293,18 @@ static inline bool rdma_cap_roce_gid_table(const struct ib_device *device,
 		device->add_gid && device->del_gid;
 }
 
+/*
+ * Check if the device supports READ W/ INVALIDATE.
+ */
+static inline bool rdma_cap_read_inv(struct ib_device *dev, u32 port_num)
+{
+	/*
+	 * iWarp drivers must support READ W/ INVALIDATE.  No other protocol
+	 * has support for it yet.
+	 */
+	return rdma_protocol_iwarp(dev, port_num);
+}
+
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid,
 		 struct ib_gid_attr *attr);
-- 
2.1.4

* [PATCH v4 4/9] IB/core: refactor ib_create_qp
  2016-03-08 18:16 generic RDMA READ/WRITE API V4 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2016-03-08 18:16 ` [PATCH v4 3/9] IB/core: add a helper to check for READ WITH INVALIDATE support Christoph Hellwig
@ 2016-03-08 18:16 ` Christoph Hellwig
  2016-03-08 19:33   ` Bart Van Assche
  2016-03-08 18:16 ` [PATCH v4 5/9] IB/core: add a simple MR pool Christoph Hellwig
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford, bart.vanassche; +Cc: sagig, swise, linux-rdma, target-devel

Split the XRC magic into a separate function, and return early on failure
to make the initialization code readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/core/verbs.c | 103 +++++++++++++++++++++-------------------
 1 file changed, 54 insertions(+), 49 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index eaed99b..7ea0c9c 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -723,62 +723,67 @@ struct ib_qp *ib_open_qp(struct ib_xrcd *xrcd,
 }
 EXPORT_SYMBOL(ib_open_qp);
 
+static struct ib_qp *ib_create_xrc_qp(struct ib_qp *qp,
+		struct ib_qp_init_attr *qp_init_attr)
+{
+	struct ib_qp *real_qp = qp;
+
+	qp->event_handler = __ib_shared_qp_event_handler;
+	qp->qp_context = qp;
+	qp->pd = NULL;
+	qp->send_cq = qp->recv_cq = NULL;
+	qp->srq = NULL;
+	qp->xrcd = qp_init_attr->xrcd;
+	atomic_inc(&qp_init_attr->xrcd->usecnt);
+	INIT_LIST_HEAD(&qp->open_list);
+
+	qp = __ib_open_qp(real_qp, qp_init_attr->event_handler,
+			  qp_init_attr->qp_context);
+	if (IS_ERR(qp))
+		real_qp->device->destroy_qp(real_qp);
+	else
+		__ib_insert_xrcd_qp(qp_init_attr->xrcd, real_qp);
+	return qp;
+}
+
 struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			   struct ib_qp_init_attr *qp_init_attr)
 {
-	struct ib_qp *qp, *real_qp;
-	struct ib_device *device;
+	struct ib_device *device = pd ? pd->device : qp_init_attr->xrcd->device;
+	struct ib_qp *qp;
 
-	device = pd ? pd->device : qp_init_attr->xrcd->device;
 	qp = device->create_qp(pd, qp_init_attr, NULL);
-
-	if (!IS_ERR(qp)) {
-		qp->device     = device;
-		qp->real_qp    = qp;
-		qp->uobject    = NULL;
-		qp->qp_type    = qp_init_attr->qp_type;
-
-		atomic_set(&qp->usecnt, 0);
-		if (qp_init_attr->qp_type == IB_QPT_XRC_TGT) {
-			qp->event_handler = __ib_shared_qp_event_handler;
-			qp->qp_context = qp;
-			qp->pd = NULL;
-			qp->send_cq = qp->recv_cq = NULL;
-			qp->srq = NULL;
-			qp->xrcd = qp_init_attr->xrcd;
-			atomic_inc(&qp_init_attr->xrcd->usecnt);
-			INIT_LIST_HEAD(&qp->open_list);
-
-			real_qp = qp;
-			qp = __ib_open_qp(real_qp, qp_init_attr->event_handler,
-					  qp_init_attr->qp_context);
-			if (!IS_ERR(qp))
-				__ib_insert_xrcd_qp(qp_init_attr->xrcd, real_qp);
-			else
-				real_qp->device->destroy_qp(real_qp);
-		} else {
-			qp->event_handler = qp_init_attr->event_handler;
-			qp->qp_context = qp_init_attr->qp_context;
-			if (qp_init_attr->qp_type == IB_QPT_XRC_INI) {
-				qp->recv_cq = NULL;
-				qp->srq = NULL;
-			} else {
-				qp->recv_cq = qp_init_attr->recv_cq;
-				atomic_inc(&qp_init_attr->recv_cq->usecnt);
-				qp->srq = qp_init_attr->srq;
-				if (qp->srq)
-					atomic_inc(&qp_init_attr->srq->usecnt);
-			}
-
-			qp->pd	    = pd;
-			qp->send_cq = qp_init_attr->send_cq;
-			qp->xrcd    = NULL;
-
-			atomic_inc(&pd->usecnt);
-			atomic_inc(&qp_init_attr->send_cq->usecnt);
-		}
+	if (IS_ERR(qp))
+		return qp;
+
+	qp->device     = device;
+	qp->real_qp    = qp;
+	qp->uobject    = NULL;
+	qp->qp_type    = qp_init_attr->qp_type;
+
+	atomic_set(&qp->usecnt, 0);
+	if (qp_init_attr->qp_type == IB_QPT_XRC_TGT)
+		return ib_create_xrc_qp(qp, qp_init_attr);
+
+	qp->event_handler = qp_init_attr->event_handler;
+	qp->qp_context = qp_init_attr->qp_context;
+	if (qp_init_attr->qp_type == IB_QPT_XRC_INI) {
+		qp->recv_cq = NULL;
+		qp->srq = NULL;
+	} else {
+		qp->recv_cq = qp_init_attr->recv_cq;
+		atomic_inc(&qp_init_attr->recv_cq->usecnt);
+		qp->srq = qp_init_attr->srq;
+		if (qp->srq)
+			atomic_inc(&qp_init_attr->srq->usecnt);
 	}
 
+	qp->pd	    = pd;
+	qp->send_cq = qp_init_attr->send_cq;
+	qp->xrcd    = NULL;
+
+	atomic_inc(&pd->usecnt);
+	atomic_inc(&qp_init_attr->send_cq->usecnt);
 	return qp;
 }
 EXPORT_SYMBOL(ib_create_qp);
-- 
2.1.4

* [PATCH v4 5/9] IB/core: add a simple MR pool
  2016-03-08 18:16 generic RDMA READ/WRITE API V4 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2016-03-08 18:16 ` [PATCH v4 4/9] IB/core: refactor ib_create_qp Christoph Hellwig
@ 2016-03-08 18:16 ` Christoph Hellwig
       [not found]   ` <1457461000-24088-6-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
       [not found] ` <1457461000-24088-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  2016-03-08 18:16 ` [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API Christoph Hellwig
  6 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford, bart.vanassche; +Cc: sagig, swise, linux-rdma, target-devel
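
For illustration only (the free list name, pool size of 32 and page list
length of 256 are made-up values), the typical life cycle of the new pool
from a ULP's point of view looks like:

	struct list_head free_mrs;
	struct ib_mr *mr;
	int ret;

	INIT_LIST_HEAD(&free_mrs);
	ret = ib_mr_pool_init(qp, &free_mrs, 32, IB_MR_TYPE_MEM_REG, 256);
	if (ret)
		return ret;

	mr = ib_mr_pool_get(qp, &free_mrs);	/* NULL if the pool ran dry */
	/* use mr for a registration / RDMA operation here */
	ib_mr_pool_put(qp, &free_mrs, mr);

	/* before destroying the QP: */
	ib_mr_pool_destroy(qp, &free_mrs);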

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/core/Makefile  |  2 +-
 drivers/infiniband/core/mr_pool.c | 86 +++++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/verbs.c   |  5 +++
 include/rdma/ib_verbs.h           |  8 +++-
 include/rdma/mr_pool.h            | 25 ++++++++++++
 5 files changed, 124 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/core/mr_pool.c
 create mode 100644 include/rdma/mr_pool.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index f818538..48bd9d8 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 
 ib_core-y :=			packer.o ud_header.o verbs.o cq.o sysfs.o \
 				device.o fmr_pool.o cache.o netlink.o \
-				roce_gid_mgmt.o
+				roce_gid_mgmt.o mr_pool.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
 
diff --git a/drivers/infiniband/core/mr_pool.c b/drivers/infiniband/core/mr_pool.c
new file mode 100644
index 0000000..9751bb1
--- /dev/null
+++ b/drivers/infiniband/core/mr_pool.c
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) 2016 HGST, a Western Digital Company.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <rdma/ib_verbs.h>
+#include <rdma/mr_pool.h>
+
+struct ib_mr *ib_mr_pool_get(struct ib_qp *qp, struct list_head *list)
+{
+	struct ib_mr *mr = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&qp->mr_lock, flags);
+	mr = list_first_entry_or_null(list, struct ib_mr, qp_entry);
+	if (mr) {
+		list_del(&mr->qp_entry);
+		qp->mrs_used++;
+	}
+	spin_unlock_irqrestore(&qp->mr_lock, flags);
+
+	return mr;
+}
+EXPORT_SYMBOL(ib_mr_pool_get);
+
+void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, struct ib_mr *mr)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&qp->mr_lock, flags);
+	list_add(&mr->qp_entry, list);
+	qp->mrs_used--;
+	spin_unlock_irqrestore(&qp->mr_lock, flags);
+}
+EXPORT_SYMBOL(ib_mr_pool_put);
+
+int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
+		enum ib_mr_type type, u32 max_num_sg)
+{
+	struct ib_mr *mr;
+	unsigned long flags;
+	int ret, i;
+
+	for (i = 0; i < nr; i++) {
+		mr = ib_alloc_mr(qp->pd, type, max_num_sg);
+		if (IS_ERR(mr)) {
+			ret = PTR_ERR(mr);
+			goto out;
+		}
+
+		spin_lock_irqsave(&qp->mr_lock, flags);
+		list_add_tail(&mr->qp_entry, list);
+		spin_unlock_irqrestore(&qp->mr_lock, flags);
+	}
+
+	return 0;
+out:
+	ib_mr_pool_destroy(qp, list);
+	return ret;
+}
+EXPORT_SYMBOL(ib_mr_pool_init);
+
+void ib_mr_pool_destroy(struct ib_qp *qp, struct list_head *list)
+{
+	struct ib_mr *mr;
+	unsigned long flags;
+
+	spin_lock_irqsave(&qp->mr_lock, flags);
+	while (!list_empty(list)) {
+		mr = list_first_entry(list, struct ib_mr, qp_entry);
+		list_del(&mr->qp_entry);
+
+		spin_unlock_irqrestore(&qp->mr_lock, flags);
+		ib_dereg_mr(mr);
+		spin_lock_irqsave(&qp->mr_lock, flags);
+	}
+	spin_unlock_irqrestore(&qp->mr_lock, flags);
+}
+EXPORT_SYMBOL(ib_mr_pool_destroy);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 7ea0c9c..b099760 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -762,6 +762,9 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 	qp->qp_type    = qp_init_attr->qp_type;
 
 	atomic_set(&qp->usecnt, 0);
+	qp->mrs_used = 0;
+	spin_lock_init(&qp->mr_lock);
+
 	if (qp_init_attr->qp_type == IB_QPT_XRC_TGT)
 		return ib_create_xrc_qp(qp, qp_init_attr);
 
@@ -1255,6 +1258,8 @@ int ib_destroy_qp(struct ib_qp *qp)
 	struct ib_srq *srq;
 	int ret;
 
+	WARN_ON_ONCE(qp->mrs_used > 0);
+
 	if (atomic_read(&qp->usecnt))
 		return -EBUSY;
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index cdb9159..355f5b2 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1405,9 +1405,12 @@ struct ib_qp {
 	struct ib_pd	       *pd;
 	struct ib_cq	       *send_cq;
 	struct ib_cq	       *recv_cq;
+	spinlock_t		mr_lock;
+	int			mrs_used;
 	struct ib_srq	       *srq;
 	struct ib_xrcd	       *xrcd; /* XRC TGT QPs only */
 	struct list_head	xrcd_list;
+
 	/* count times opened, mcast attaches, flow attaches */
 	atomic_t		usecnt;
 	struct list_head	open_list;
@@ -1422,12 +1425,15 @@ struct ib_qp {
 struct ib_mr {
 	struct ib_device  *device;
 	struct ib_pd	  *pd;
-	struct ib_uobject *uobject;
 	u32		   lkey;
 	u32		   rkey;
 	u64		   iova;
 	u32		   length;
 	unsigned int	   page_size;
+	union {
+		struct ib_uobject	*uobject;	/* user */
+		struct list_head	qp_entry;	/* FR */
+	};
 };
 
 struct ib_mw {
diff --git a/include/rdma/mr_pool.h b/include/rdma/mr_pool.h
new file mode 100644
index 0000000..986010b
--- /dev/null
+++ b/include/rdma/mr_pool.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (c) 2016 HGST, a Western Digital Company.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#ifndef _RDMA_MR_POOL_H
+#define _RDMA_MR_POOL_H 1
+
+#include <rdma/ib_verbs.h>
+
+struct ib_mr *ib_mr_pool_get(struct ib_qp *qp, struct list_head *list);
+void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, struct ib_mr *mr);
+
+int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
+		enum ib_mr_type type, u32 max_num_sg);
+void ib_mr_pool_destroy(struct ib_qp *qp, struct list_head *list);
+
+#endif /* _RDMA_MR_POOL_H */
-- 
2.1.4

* [PATCH v4 6/9] IB/core: add a need_inval flag to struct ib_mr
       [not found] ` <1457461000-24088-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
@ 2016-03-08 18:16   ` Christoph Hellwig
  2016-03-08 19:41     ` Bart Van Assche
  2016-03-08 18:16   ` [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API Christoph Hellwig
  2016-03-08 18:16   ` [PATCH v4 8/9] target: enhance and export target_alloc_sgl/target_free_sgl Christoph Hellwig
  2 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA, bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ
  Cc: sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA, Steve Wise

From: Steve Wise <swise-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

This is the first step toward moving MR invalidation decisions
to the core.  It will be needed by the upcoming RW API.
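
For illustration (this mirrors what the RW API in the next patch does; reg_wr
and mr are placeholders), the flag is set once a registration WR has been
queued and cleared when an invalidation for the rkey is issued:

	reg_wr->wr.opcode = IB_WR_REG_MR;
	reg_wr->mr = mr;
	mr->need_inval = true;	/* rkey must be invalidated before reuse */

	/* later, once a LOCAL_INV or READ_WITH_INV for this rkey is issued: */
	mr->need_inval = false;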

Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/infiniband/core/verbs.c | 2 ++
 include/rdma/ib_verbs.h         | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index b099760..69689866 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1353,6 +1353,7 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
 		mr->pd      = pd;
 		mr->uobject = NULL;
 		atomic_inc(&pd->usecnt);
+		mr->need_inval = false;
 	}
 
 	return mr;
@@ -1399,6 +1400,7 @@ struct ib_mr *ib_alloc_mr(struct ib_pd *pd,
 		mr->pd      = pd;
 		mr->uobject = NULL;
 		atomic_inc(&pd->usecnt);
+		mr->need_inval = false;
 	}
 
 	return mr;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 355f5b2..068d540 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1430,6 +1430,7 @@ struct ib_mr {
 	u64		   iova;
 	u32		   length;
 	unsigned int	   page_size;
+	bool		   need_inval;
 	union {
 		struct ib_uobject	*uobject;	/* user */
 		struct list_head	qp_entry;	/* FR */
-- 
2.1.4

* [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found] ` <1457461000-24088-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  2016-03-08 18:16   ` [PATCH v4 6/9] IB/core: add a need_inval flag to struct ib_mr Christoph Hellwig
@ 2016-03-08 18:16   ` Christoph Hellwig
  2016-03-08 21:32     ` Bart Van Assche
  2016-03-08 18:16   ` [PATCH v4 8/9] target: enhance and export target_alloc_sgl/target_free_sgl Christoph Hellwig
  2 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA, bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ
  Cc: sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

This supports both manual mapping of lots of SGEs and using MRs from the
QP's MR pool, for iWARP or other cases where that is more optimal.  For
now, MRs are only used for iWARP transports.  The user of the RDMA-RW API
must allocate the QP MR pool as well as size the SQ accordingly.
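
A hedged sketch of the sizing side of that contract (queue_depth, the CQs and
port_num are placeholders): the caller announces how many rdma_rw_ctx
structures it wants in flight, and ib_create_qp() grows the send queue and,
on iWARP, allocates the per-QP MR pool accordingly:

	struct ib_qp_init_attr attr = { };
	struct ib_qp *qp;

	attr.qp_type = IB_QPT_RC;
	attr.send_cq = send_cq;
	attr.recv_cq = recv_cq;
	attr.port_num = port_num;		/* required by the RW API */
	attr.cap.max_rdma_ctxs = queue_depth;	/* in-flight R/W contexts */
	attr.cap.max_send_wr = queue_depth;	/* the driver's own sends */

	qp = ib_create_qp(pd, &attr);
	if (IS_ERR(qp))
		return PTR_ERR(qp);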

Thanks to Steve Wise for testing, fixing and rewriting the iWarp support,
and to Sagi Grimberg for ideas, reviews and fixes.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Tested-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/core/Makefile |   2 +-
 drivers/infiniband/core/rw.c     | 419 +++++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/verbs.c  |  25 +++
 include/rdma/ib_verbs.h          |  14 +-
 include/rdma/rw.h                |  69 +++++++
 5 files changed, 527 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/core/rw.c
 create mode 100644 include/rdma/rw.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 48bd9d8..26987d9 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_MAD) +=	ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 
-ib_core-y :=			packer.o ud_header.o verbs.o cq.o sysfs.o \
+ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
 				device.o fmr_pool.o cache.o netlink.o \
 				roce_gid_mgmt.o mr_pool.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
new file mode 100644
index 0000000..7311b95
--- /dev/null
+++ b/drivers/infiniband/core/rw.c
@@ -0,0 +1,419 @@
+/*
+ * Copyright (c) 2016 HGST, a Western Digital Company.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/slab.h>
+#include <rdma/mr_pool.h>
+#include <rdma/rw.h>
+
+/*
+ * Check if the device needs a memory registration.  We currently always use
+ * memory registrations for iWarp, and never for IB and RoCE.  In the future
+ * we can hopefully fine tune this based on HCA driver input.
+ */
+static inline bool rdma_rw_use_mr(struct ib_device *dev, u8 port_num)
+{
+	return rdma_protocol_iwarp(dev, port_num);
+}
+
+static inline u32 rdma_rw_max_sge(struct ib_device *dev,
+		enum dma_data_direction dir)
+{
+	return dir == DMA_TO_DEVICE ?
+		dev->attrs.max_sge : dev->attrs.max_sge_rd;
+}
+
+static inline u32 rdma_rw_fr_page_list_len(struct ib_device *dev)
+{
+	/* arbitrary limit to avoid allocating gigantic resources */
+	return min_t(u32, dev->attrs.max_fast_reg_page_list_len, 256);
+}
+
+static int rdma_rw_init_mr_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u8 port_num, struct scatterlist *sg, u32 offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+{
+	int pages_per_mr = rdma_rw_fr_page_list_len(qp->pd->device);
+	int pages_left = ctx->dma_nents;
+	u32 va_offset = 0;
+	int i, ret = 0, count = 0;
+
+	ctx->nr_ops = (ctx->dma_nents + pages_per_mr - 1) / pages_per_mr;
+	ctx->reg = kcalloc(ctx->nr_ops, sizeof(*ctx->reg), GFP_KERNEL);
+	if (!ctx->reg) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < ctx->nr_ops; i++) {
+		struct rdma_rw_reg_ctx *prev = i ? &ctx->reg[i - 1] : NULL;
+		struct rdma_rw_reg_ctx *reg = &ctx->reg[i];
+		int nents = min(pages_left, pages_per_mr);
+
+		reg->mr = ib_mr_pool_get(qp, &qp->rdma_mrs);
+		if (!reg->mr) {
+			pr_info("failed to allocate MR from pool\n");
+			ret = -EAGAIN;
+			goto out_free;
+		}
+
+		if (reg->mr->need_inval) {
+			reg->inv_wr.opcode = IB_WR_LOCAL_INV;
+			reg->inv_wr.ex.invalidate_rkey = reg->mr->lkey;
+			reg->inv_wr.next = &reg->reg_wr.wr;
+			if (prev)
+				prev->wr.wr.next = &reg->inv_wr;
+
+			count++;
+		} else if (prev) {
+			prev->wr.wr.next = &reg->reg_wr.wr;
+		}
+
+		ib_update_fast_reg_key(reg->mr, ib_inc_rkey(reg->mr->lkey));
+
+		ret = ib_map_mr_sg(reg->mr, sg, nents, offset,
+				PAGE_SIZE);
+		if (ret < nents) {
+			pr_info("failed to map MR\n");
+			ib_mr_pool_put(qp, &qp->rdma_mrs, reg->mr);
+			ret = -EINVAL;
+			goto out_free;
+		}
+
+		reg->reg_wr.wr.opcode = IB_WR_REG_MR;
+		reg->reg_wr.mr = reg->mr;
+		reg->reg_wr.key = reg->mr->lkey;
+		reg->reg_wr.wr.next = &reg->wr.wr;
+		count++;
+
+		reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+		if (rdma_protocol_iwarp(qp->device, port_num))
+			reg->reg_wr.access |= IB_ACCESS_REMOTE_WRITE;
+
+		reg->sge.lkey = reg->mr->lkey;
+		reg->sge.addr = reg->mr->iova;
+		reg->sge.length = reg->mr->length;
+
+		reg->wr.wr.sg_list = &reg->sge;
+		reg->wr.wr.num_sge = 1;
+		reg->wr.remote_addr = remote_addr + va_offset;
+		reg->wr.rkey = rkey;
+		count++;
+
+		if (dir == DMA_FROM_DEVICE) {
+			if (rdma_cap_read_inv(qp->device, port_num)) {
+				reg->wr.wr.opcode = IB_WR_RDMA_READ_WITH_INV;
+				reg->wr.wr.ex.invalidate_rkey = reg->mr->lkey;
+				reg->mr->need_inval = false;
+			}  else {
+				reg->wr.wr.opcode = IB_WR_RDMA_READ;
+				reg->mr->need_inval = true;
+			}
+		} else {
+			reg->wr.wr.opcode = IB_WR_RDMA_WRITE;
+			reg->mr->need_inval = true;
+		}
+
+		va_offset += reg->sge.length;
+		pages_left -= nents;
+		sg = sg_next(sg);
+		offset = 0;
+	}
+
+	return count;
+
+out_free:
+	while (--i >= 0)
+		ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg[i].mr);
+	kfree(ctx->reg);
+out:
+	return ret;
+}
+
+static int rdma_rw_init_map_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		struct scatterlist *sg, u32 offset, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	u32 max_sge = rdma_rw_max_sge(dev, dir);
+	u32 sge_left = ctx->dma_nents;
+	struct ib_sge *sge;
+	u32 total_len = 0, i, j;
+
+	ctx->nr_ops = DIV_ROUND_UP(ctx->dma_nents, max_sge);
+
+	ctx->map.sges = sge = kcalloc(ctx->dma_nents, sizeof(*sge), GFP_KERNEL);
+	if (!ctx->map.sges)
+		goto out;
+
+	ctx->map.wrs = kcalloc(ctx->nr_ops, sizeof(*ctx->map.wrs), GFP_KERNEL);
+	if (!ctx->map.wrs)
+		goto out_free_sges;
+
+	for (i = 0; i < ctx->nr_ops; i++) {
+		struct ib_rdma_wr *rdma_wr = &ctx->map.wrs[i];
+		u32 nr_sge = min(sge_left, max_sge);
+
+		if (dir == DMA_TO_DEVICE)
+			rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+		else
+			rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+		rdma_wr->remote_addr = remote_addr + total_len;
+		rdma_wr->rkey = rkey;
+		rdma_wr->wr.sg_list = sge;
+
+		for (j = 0; j < nr_sge; j++, sg = sg_next(sg)) {
+			rdma_wr->wr.num_sge++;
+
+			sge->addr = ib_sg_dma_address(dev, sg) + offset;
+			sge->length = ib_sg_dma_len(dev, sg) - offset;
+			sge->lkey = qp->pd->local_dma_lkey;
+
+			total_len += sge->length;
+			sge++;
+			sge_left--;
+			offset = 0;
+		}
+
+		if (i + 1 != ctx->nr_ops)
+			rdma_wr->wr.next = &ctx->map.wrs[i + 1].wr;
+	}
+
+	return ctx->nr_ops;
+
+out_free_sges:
+	kfree(ctx->map.sges);
+out:
+	return -ENOMEM;
+}
+
+static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		struct scatterlist *sg, u32 offset, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
+
+	ctx->nr_ops = 1;
+
+	ctx->single.sge.lkey = qp->pd->local_dma_lkey;
+	ctx->single.sge.addr = ib_sg_dma_address(dev, sg) + offset;
+	ctx->single.sge.length = ib_sg_dma_len(dev, sg) - offset;
+
+	memset(rdma_wr, 0, sizeof(*rdma_wr));
+	if (dir == DMA_TO_DEVICE)
+		rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+	else
+		rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+	rdma_wr->wr.sg_list = &ctx->single.sge;
+	rdma_wr->wr.num_sge = 1;
+	rdma_wr->remote_addr = remote_addr;
+	rdma_wr->rkey = rkey;
+
+	return 1;
+}
+
+/**
+ * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
+ * @ctx:	context to initialize
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @sg:		scatterlist to READ/WRITE from/to
+ * @sg_cnt:	number of entries in @sg
+ * @sg_offset:	current byte offset into @sg
+ * @length:	total length of @sg in bytes
+ * @remote_addr:remote address to read/write (relative to @rkey)
+ * @rkey:	remote key to operate on
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ *
+ * If we're going to use a FR to map this context @max_nents should be smaller
+ * or equal to the MR size.
+ *
+ * Returns the number of WQEs that will be needed on the workqueue if
+ * successful, or a negative error code.
+ */
+int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	int ret;
+
+	ctx->dma_nents = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+	if (!ctx->dma_nents)
+		return -ENOMEM;
+
+	/*
+	 * Skip to the S/G entry that sg_offset falls into:
+	 */
+	for (; sg; sg = sg_next(sg)) {
+		u32 len = ib_sg_dma_len(dev, sg);
+
+		if (sg_offset < len)
+			break;
+
+		sg_offset -= len;
+		ctx->dma_nents--;
+	}
+
+	if (rdma_rw_use_mr(qp->device, port_num)) {
+		ret = rdma_rw_init_mr_wrs(ctx, qp, port_num, sg, sg_offset,
+				remote_addr, rkey, dir);
+	} else if (ctx->dma_nents > 1) {
+		ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_offset,
+				remote_addr, rkey, dir);
+	} else {
+		ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset,
+				remote_addr, rkey, dir);
+	}
+
+	if (ret < 0)
+		goto out_unmap_sg;
+	return ret;
+
+out_unmap_sg:
+	ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_rw_ctx_init);
+
+/**
+ * rdma_rw_ctx_destroy - release all resources allocated by rdma_rw_ctx_init
+ * @ctx:	context to release
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @sg:		scatterlist that was used for the READ/WRITE
+ * @sg_cnt:	number of entries in @sg
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ */
+void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+{
+	if (rdma_rw_use_mr(qp->device, port_num)) {
+		int i;
+
+		for (i = 0; i < ctx->nr_ops; i++)
+			ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg[i].mr);
+		kfree(ctx->reg);
+	} else if (ctx->dma_nents > 1) {
+		kfree(ctx->map.wrs);
+		kfree(ctx->map.sges);
+	}
+
+	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+}
+EXPORT_SYMBOL(rdma_rw_ctx_destroy);
+
+/**
+ * rdma_rw_ctx_wrs - return chain of WRs for a RDMA READ or WRITE operation
+ * @ctx:	context to operate on
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @cqe:	completion queue entry for the last WR
+ * @chain_wr:	WR to append to the posted chain
+ *
+ * Return the WR chain for the set of RDMA READ/WRITE operations described by
+ * @ctx, as well as any memory registration operations needed.  If @chain_wr
+ * is non-NULL the WR it points to will be appended to the chain of WRs posted.
+ * If @chain_wr is not set @cqe must be set so that the caller gets a
+ * completion notification.
+ */
+struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u8 port_num, struct ib_cqe *cqe, struct ib_send_wr *chain_wr)
+{
+	struct ib_send_wr *first_wr, *last_wr;
+
+	if (rdma_rw_use_mr(qp->device, port_num)) {
+		if (ctx->reg[0].inv_wr.next)
+			first_wr = &ctx->reg[0].inv_wr;
+		else
+			first_wr = &ctx->reg[0].reg_wr.wr;
+		last_wr = &ctx->reg[ctx->nr_ops - 1].wr.wr;
+	} else if (ctx->dma_nents > 1) {
+		first_wr = &ctx->map.wrs[0].wr;
+		last_wr = &ctx->map.wrs[ctx->nr_ops - 1].wr;
+	} else {
+		first_wr = &ctx->single.wr.wr;
+		last_wr = &ctx->single.wr.wr;
+	}
+
+	if (chain_wr) {
+		last_wr->next = chain_wr;
+	} else {
+		last_wr->wr_cqe = cqe;
+		last_wr->send_flags |= IB_SEND_SIGNALED;
+	}
+
+	return first_wr;
+}
+EXPORT_SYMBOL(rdma_rw_ctx_wrs);
+
+/**
+ * rdma_rw_ctx_post - post a RDMA READ or RDMA WRITE operation
+ * @ctx:	context to operate on
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @cqe:	completion queue entry for the last WR
+ * @chain_wr:	WR to append to the posted chain
+ *
+ * Post the set of RDMA READ/WRITE operations described by @ctx, as well as
+ * any memory registration operations needed.  If @chain_wr is non-NULL the
+ * WR it points to will be appended to the chain of WRs posted.  If @chain_wr
+ * is not set @cqe must be set so that the caller gets a completion
+ * notification.
+ */
+int rdma_rw_ctx_post(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct ib_cqe *cqe, struct ib_send_wr *chain_wr)
+{
+	struct ib_send_wr *first_wr, *bad_wr;
+
+	first_wr = rdma_rw_ctx_wrs(ctx, qp, port_num, cqe, chain_wr);
+	return ib_post_send(qp, first_wr, &bad_wr);
+}
+EXPORT_SYMBOL(rdma_rw_ctx_post);
+
+void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr)
+{
+	/*
+	 * Each context needs at least one RDMA READ or WRITE WR.
+	 *
+	 * For some hardware we might need more, eventually we should ask the
+	 * HCA driver for a multiplier here.
+	 */
+	attr->cap.max_send_wr += attr->cap.max_rdma_ctxs;
+
+	/*
+	 * If the device needs MRs to perform RDMA READ or WRITE operations,
+	 * we'll need two additional WRs per context for the registration
+	 * and the invalidation.
+	 */
+	if (rdma_rw_use_mr(dev, attr->port_num))
+		attr->cap.max_send_wr += 2 * attr->cap.max_rdma_ctxs;
+}
+
+int rdma_rw_init_mrs(struct ib_qp *qp, struct ib_qp_init_attr *attr)
+{
+	struct ib_device *dev = qp->pd->device;
+	int ret = 0;
+
+	if (rdma_rw_use_mr(dev, attr->port_num)) {
+		ret = ib_mr_pool_init(qp, &qp->rdma_mrs,
+				attr->cap.max_rdma_ctxs, IB_MR_TYPE_MEM_REG,
+				rdma_rw_fr_page_list_len(dev));
+	}
+
+	return ret;
+}
+
+void rdma_rw_cleanup_mrs(struct ib_qp *qp)
+{
+	ib_mr_pool_destroy(qp, &qp->rdma_mrs);
+}
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 69689866..72f4fbd 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -48,6 +48,7 @@
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_cache.h>
 #include <rdma/ib_addr.h>
+#include <rdma/rw.h>
 
 #include "core_priv.h"
 
@@ -751,6 +752,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 {
 	struct ib_device *device = pd ? pd->device : qp_init_attr->xrcd->device;
 	struct ib_qp *qp;
+	int ret;
+
+	/*
+	 * If the caller is using the RDMA API, calculate the resources
+	 * needed for the RDMA READ/WRITE operations.
+	 *
+	 * Note that these callers need to pass in a port number.
+	 */
+	if (qp_init_attr->cap.max_rdma_ctxs)
+		rdma_rw_init_qp(device, qp_init_attr);
 
 	qp = device->create_qp(pd, qp_init_attr, NULL);
 	if (IS_ERR(qp))
@@ -764,6 +775,7 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 	atomic_set(&qp->usecnt, 0);
 	qp->mrs_used = 0;
 	spin_lock_init(&qp->mr_lock);
+	INIT_LIST_HEAD(&qp->rdma_mrs);
 
 	if (qp_init_attr->qp_type == IB_QPT_XRC_TGT)
 		return ib_create_xrc_qp(qp, qp_init_attr);
@@ -787,6 +799,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 
 	atomic_inc(&pd->usecnt);
 	atomic_inc(&qp_init_attr->send_cq->usecnt);
+
+	if (qp_init_attr->cap.max_rdma_ctxs) {
+		ret = rdma_rw_init_mrs(qp, qp_init_attr);
+		if (ret) {
+			pr_err("failed to init MR pool ret= %d\n", ret);
+			ib_destroy_qp(qp);
+			qp = ERR_PTR(ret);
+		}
+	}
+
 	return qp;
 }
 EXPORT_SYMBOL(ib_create_qp);
@@ -1271,6 +1293,9 @@ int ib_destroy_qp(struct ib_qp *qp)
 	rcq  = qp->recv_cq;
 	srq  = qp->srq;
 
+	if (!qp->uobject)
+		rdma_rw_cleanup_mrs(qp);
+
 	ret = qp->device->destroy_qp(qp);
 	if (!ret) {
 		if (pd)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 068d540..074854d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -915,6 +915,13 @@ struct ib_qp_cap {
 	u32	max_send_sge;
 	u32	max_recv_sge;
 	u32	max_inline_data;
+
+	/*
+	 * Maximum number of rdma_rw_ctx structures in flight at a time.
+	 * ib_create_qp() will calculate the right amount of needed WRs
+	 * and MRs based on this.
+	 */
+	u32	max_rdma_ctxs;
 };
 
 enum ib_sig_type {
@@ -986,7 +993,11 @@ struct ib_qp_init_attr {
 	enum ib_sig_type	sq_sig_type;
 	enum ib_qp_type		qp_type;
 	enum ib_qp_create_flags	create_flags;
-	u8			port_num; /* special QP types only */
+
+	/*
+	 * Only needed for special QP types, or when using the RW API.
+	 */
+	u8			port_num;
 };
 
 struct ib_qp_open_attr {
@@ -1407,6 +1418,7 @@ struct ib_qp {
 	struct ib_cq	       *recv_cq;
 	spinlock_t		mr_lock;
 	int			mrs_used;
+	struct list_head	rdma_mrs;
 	struct ib_srq	       *srq;
 	struct ib_xrcd	       *xrcd; /* XRC TGT QPs only */
 	struct list_head	xrcd_list;
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
new file mode 100644
index 0000000..57ea304
--- /dev/null
+++ b/include/rdma/rw.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2016 HGST, a Western Digital Company.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#ifndef _RDMA_RW_H
+#define _RDMA_RW_H
+
+#include <linux/dma-mapping.h>
+#include <linux/scatterlist.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/mr_pool.h>
+
+struct rdma_rw_ctx {
+	/* number of SGL entries returned by dma_map_sg */
+	u32			dma_nents;
+
+	/* number of RDMA READ/WRITE WRs (not counting MR WRs) */
+	u32			nr_ops;
+
+	union {
+		/* for mapping a single SGE: */
+		struct {
+			struct ib_sge		sge;
+			struct ib_rdma_wr	wr;
+		} single;
+
+		/* for mapping of multiple SGEs: */
+		struct {
+			struct ib_sge		*sges;
+			struct ib_rdma_wr	*wrs;
+		} map;
+
+		/* for registering multiple WRs: */
+		struct rdma_rw_reg_ctx {
+			struct ib_sge		sge;
+			struct ib_rdma_wr	wr;
+			struct ib_reg_wr	reg_wr;
+			struct ib_send_wr	inv_wr;
+			struct ib_mr		*mr;
+		} *reg;
+	};
+};
+
+int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir);
+void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt,
+		enum dma_data_direction dir);
+
+struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u8 port_num, struct ib_cqe *cqe, struct ib_send_wr *chain_wr);
+int rdma_rw_ctx_post(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct ib_cqe *cqe, struct ib_send_wr *chain_wr);
+
+void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr);
+int rdma_rw_init_mrs(struct ib_qp *qp, struct ib_qp_init_attr *attr);
+void rdma_rw_cleanup_mrs(struct ib_qp *qp);
+
+#endif /* _RDMA_RW_H */
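
As a rough sketch of the intended caller flow (not part of this patch;
qp, port_num, sg, sg_cnt, remote_addr, rkey, dir, cqe and send_wr stand
in for the caller's own state, and error handling is omitted):

	struct rdma_rw_ctx ctx;
	struct ib_send_wr *first_wr, *bad_wr;
	int nr_wrs, ret;

	/* map the SG list and build the RDMA READ/WRITE (and MR) WRs */
	nr_wrs = rdma_rw_ctx_init(&ctx, qp, port_num, sg, sg_cnt, 0,
			remote_addr, rkey, dir);
	if (nr_wrs < 0)
		return nr_wrs;

	/*
	 * Either post the WRs directly with rdma_rw_ctx_post(), or chain
	 * them in front of another WR (e.g. a response SEND) and post the
	 * whole chain with a single ib_post_send() call:
	 */
	first_wr = rdma_rw_ctx_wrs(&ctx, qp, port_num, cqe, &send_wr);
	ret = ib_post_send(qp, first_wr, &bad_wr);

	/* once the transfer has completed, tear the context down again */
	rdma_rw_ctx_destroy(&ctx, qp, port_num, sg, sg_cnt, dir);
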
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 8/9] target: enhance and export target_alloc_sgl/target_free_sgl
       [not found] ` <1457461000-24088-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  2016-03-08 18:16   ` [PATCH v4 6/9] IB/core: add a need_inval flag to struct ib_mr Christoph Hellwig
  2016-03-08 18:16   ` [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API Christoph Hellwig
@ 2016-03-08 18:16   ` Christoph Hellwig
  2016-03-08 21:37     ` Bart Van Assche
  2 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA, bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ
  Cc: sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

The SRP target driver will need to allocate and chain its own SGLs soon.
For this export target_alloc_sgl, and add a new argument to it so that it
can allocate an additional chain entry that doesn't point to a page.  Also
export transport_free_sgl after renaming it to target_free_sgl to free
these SGLs again.
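
To sketch the intended usage (not part of this patch; len1/len2 and the
sg/nents variables are placeholders, and error unwinding is omitted),
chaining two buffers looks roughly like this:

	struct scatterlist *sg1, *sg2;
	unsigned int nents1, nents2;
	int ret;

	/* every SGL except the last one gets an extra chain entry */
	ret = target_alloc_sgl(&sg1, &nents1, len1, false, true);
	if (ret)
		return ret;
	ret = target_alloc_sgl(&sg2, &nents2, len2, false, false);
	if (ret)
		return ret;

	/* link the two lists before handing them to the target core */
	sg_unmark_end(&sg1[nents1 - 1]);
	sg_chain(sg1, nents1 + 1, sg2);

	/* and free them individually once the command has finished */
	target_free_sgl(sg2, nents2);
	target_free_sgl(sg1, nents1);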

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/target/target_core_transport.c | 32 ++++++++++++++++++--------------
 drivers/target/target_core_xcopy.c     |  2 +-
 include/target/target_core_backend.h   |  1 -
 include/target/target_core_fabric.h    |  4 ++++
 4 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 867bc6d..bb32da7c 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2131,7 +2131,7 @@ queue_full:
 	transport_handle_queue_full(cmd, cmd->se_dev);
 }
 
-static inline void transport_free_sgl(struct scatterlist *sgl, int nents)
+void target_free_sgl(struct scatterlist *sgl, int nents)
 {
 	struct scatterlist *sg;
 	int count;
@@ -2141,6 +2141,7 @@ static inline void transport_free_sgl(struct scatterlist *sgl, int nents)
 
 	kfree(sgl);
 }
+EXPORT_SYMBOL(target_free_sgl);
 
 static inline void transport_reset_sgl_orig(struct se_cmd *cmd)
 {
@@ -2161,7 +2162,7 @@ static inline void transport_reset_sgl_orig(struct se_cmd *cmd)
 static inline void transport_free_pages(struct se_cmd *cmd)
 {
 	if (!(cmd->se_cmd_flags & SCF_PASSTHROUGH_PROT_SG_TO_MEM_NOALLOC)) {
-		transport_free_sgl(cmd->t_prot_sg, cmd->t_prot_nents);
+		target_free_sgl(cmd->t_prot_sg, cmd->t_prot_nents);
 		cmd->t_prot_sg = NULL;
 		cmd->t_prot_nents = 0;
 	}
@@ -2172,7 +2173,7 @@ static inline void transport_free_pages(struct se_cmd *cmd)
 		 * SG_TO_MEM_NOALLOC to function with COMPARE_AND_WRITE
 		 */
 		if (cmd->se_cmd_flags & SCF_COMPARE_AND_WRITE) {
-			transport_free_sgl(cmd->t_bidi_data_sg,
+			target_free_sgl(cmd->t_bidi_data_sg,
 					   cmd->t_bidi_data_nents);
 			cmd->t_bidi_data_sg = NULL;
 			cmd->t_bidi_data_nents = 0;
@@ -2182,11 +2183,11 @@ static inline void transport_free_pages(struct se_cmd *cmd)
 	}
 	transport_reset_sgl_orig(cmd);
 
-	transport_free_sgl(cmd->t_data_sg, cmd->t_data_nents);
+	target_free_sgl(cmd->t_data_sg, cmd->t_data_nents);
 	cmd->t_data_sg = NULL;
 	cmd->t_data_nents = 0;
 
-	transport_free_sgl(cmd->t_bidi_data_sg, cmd->t_bidi_data_nents);
+	target_free_sgl(cmd->t_bidi_data_sg, cmd->t_bidi_data_nents);
 	cmd->t_bidi_data_sg = NULL;
 	cmd->t_bidi_data_nents = 0;
 }
@@ -2260,20 +2261,22 @@ EXPORT_SYMBOL(transport_kunmap_data_sg);
 
 int
 target_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, u32 length,
-		 bool zero_page)
+		 bool zero_page, bool chainable)
 {
 	struct scatterlist *sg;
 	struct page *page;
 	gfp_t zero_flag = (zero_page) ? __GFP_ZERO : 0;
-	unsigned int nent;
+	unsigned int nalloc, nent;
 	int i = 0;
 
-	nent = DIV_ROUND_UP(length, PAGE_SIZE);
-	sg = kmalloc(sizeof(struct scatterlist) * nent, GFP_KERNEL);
+	nalloc = nent = DIV_ROUND_UP(length, PAGE_SIZE);
+	if (chainable)
+		nalloc++;
+	sg = kmalloc_array(nalloc, sizeof(struct scatterlist), GFP_KERNEL);
 	if (!sg)
 		return -ENOMEM;
 
-	sg_init_table(sg, nent);
+	sg_init_table(sg, nalloc);
 
 	while (length) {
 		u32 page_len = min_t(u32, length, PAGE_SIZE);
@@ -2297,6 +2300,7 @@ out:
 	kfree(sg);
 	return -ENOMEM;
 }
+EXPORT_SYMBOL(target_alloc_sgl);
 
 /*
  * Allocate any required resources to execute the command.  For writes we
@@ -2312,7 +2316,7 @@ transport_generic_new_cmd(struct se_cmd *cmd)
 	if (cmd->prot_op != TARGET_PROT_NORMAL &&
 	    !(cmd->se_cmd_flags & SCF_PASSTHROUGH_PROT_SG_TO_MEM_NOALLOC)) {
 		ret = target_alloc_sgl(&cmd->t_prot_sg, &cmd->t_prot_nents,
-				       cmd->prot_length, true);
+				       cmd->prot_length, true, false);
 		if (ret < 0)
 			return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 	}
@@ -2337,13 +2341,13 @@ transport_generic_new_cmd(struct se_cmd *cmd)
 
 			ret = target_alloc_sgl(&cmd->t_bidi_data_sg,
 					       &cmd->t_bidi_data_nents,
-					       bidi_length, zero_flag);
+					       bidi_length, zero_flag, false);
 			if (ret < 0)
 				return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 		}
 
 		ret = target_alloc_sgl(&cmd->t_data_sg, &cmd->t_data_nents,
-				       cmd->data_length, zero_flag);
+				       cmd->data_length, zero_flag, false);
 		if (ret < 0)
 			return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 	} else if ((cmd->se_cmd_flags & SCF_COMPARE_AND_WRITE) &&
@@ -2357,7 +2361,7 @@ transport_generic_new_cmd(struct se_cmd *cmd)
 
 		ret = target_alloc_sgl(&cmd->t_bidi_data_sg,
 				       &cmd->t_bidi_data_nents,
-				       caw_length, zero_flag);
+				       caw_length, zero_flag, false);
 		if (ret < 0)
 			return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 	}
diff --git a/drivers/target/target_core_xcopy.c b/drivers/target/target_core_xcopy.c
index 47fe94e..75cd854 100644
--- a/drivers/target/target_core_xcopy.c
+++ b/drivers/target/target_core_xcopy.c
@@ -563,7 +563,7 @@ static int target_xcopy_setup_pt_cmd(
 
 	if (alloc_mem) {
 		rc = target_alloc_sgl(&cmd->t_data_sg, &cmd->t_data_nents,
-				      cmd->data_length, false);
+				      cmd->data_length, false, false);
 		if (rc < 0) {
 			ret = rc;
 			goto out;
diff --git a/include/target/target_core_backend.h b/include/target/target_core_backend.h
index 28ee5c2..d8ab510 100644
--- a/include/target/target_core_backend.h
+++ b/include/target/target_core_backend.h
@@ -85,7 +85,6 @@ extern struct configfs_attribute *passthrough_attrib_attrs[];
 void	*transport_kmap_data_sg(struct se_cmd *);
 void	transport_kunmap_data_sg(struct se_cmd *);
 /* core helpers also used by xcopy during internal command setup */
-int	target_alloc_sgl(struct scatterlist **, unsigned int *, u32, bool);
 sense_reason_t	transport_generic_map_mem_to_cmd(struct se_cmd *,
 		struct scatterlist *, u32, struct scatterlist *, u32);
 
diff --git a/include/target/target_core_fabric.h b/include/target/target_core_fabric.h
index 5665340..fd6bbd2 100644
--- a/include/target/target_core_fabric.h
+++ b/include/target/target_core_fabric.h
@@ -179,6 +179,10 @@ int	core_tpg_set_initiator_node_tag(struct se_portal_group *,
 int	core_tpg_register(struct se_wwn *, struct se_portal_group *, int);
 int	core_tpg_deregister(struct se_portal_group *);
 
+int	target_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
+		u32 length, bool zero_page, bool chainable);
+void	target_free_sgl(struct scatterlist *sgl, int nents);
+
 /*
  * The LIO target core uses DMA_TO_DEVICE to mean that data is going
  * to the target (eg handling a WRITE) and DMA_FROM_DEVICE to mean
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API
  2016-03-08 18:16 generic RDMA READ/WRITE API V4 Christoph Hellwig
                   ` (5 preceding siblings ...)
       [not found] ` <1457461000-24088-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
@ 2016-03-08 18:16 ` Christoph Hellwig
  2016-03-08 20:13   ` Leon Romanovsky
       [not found]   ` <1457461000-24088-10-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  6 siblings, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-08 18:16 UTC (permalink / raw)
  To: dledford, bart.vanassche; +Cc: sagig, swise, linux-rdma, target-devel

Replace the homegrown RDMA READ/WRITE code in srpt with the generic API.
The only real twist here is that we need to allocate one Linux scatterlist
per direct buffer in the SRP command, and chain them before handing them
off to the target core.

As a side-effect of the conversion the driver will also chain the SEND
of the SRP response to the RDMA WRITE WRs for a DATA OUT command, and
properly account for RDMA WRITE WRs instead of just for RDMA READ WRs
like the driver previously did.

We now allocate half of the SQ size to RDMA READ/WRITE contexts, assuming
by default one RDMA READ or WRITE operation per command.  If a command
has multiple operations it will eat into the budget, but will still succeed,
possibly after waiting for WQEs to become available.

Also ensure the QPs request the maximum allowed SGEs so that RDMA R/W API
works correctly.
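
To sketch what the chaining of the response SEND to the RDMA WRITE WRs
boils down to (a simplified excerpt of the srpt_queue_response() changes
below, with the send queue accounting and error handling left out):

	struct ib_send_wr send_wr, *first_wr = &send_wr, *bad_wr;
	int i, ret;

	/* walk the contexts backwards so that first_wr ends up first */
	for (i = ioctx->n_rw_ctx - 1; i >= 0; i--) {
		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];

		first_wr = rdma_rw_ctx_wrs(&ctx->rw, ch->qp, ch->sport->port,
				NULL, first_wr);
	}

	/* send_wr carries the SRP_RSP; one ib_post_send covers it all */
	ret = ib_post_send(ch->qp, first_wr, &bad_wr);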

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/infiniband/ulp/srpt/ib_srpt.c | 693 +++++++++++-----------------------
 drivers/infiniband/ulp/srpt/ib_srpt.h |  30 +-
 2 files changed, 242 insertions(+), 481 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 25bdaee..05bdec8 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -765,52 +765,6 @@ static int srpt_post_recv(struct srpt_device *sdev,
 }
 
 /**
- * srpt_post_send() - Post an IB send request.
- *
- * Returns zero upon success and a non-zero value upon failure.
- */
-static int srpt_post_send(struct srpt_rdma_ch *ch,
-			  struct srpt_send_ioctx *ioctx, int len)
-{
-	struct ib_sge list;
-	struct ib_send_wr wr, *bad_wr;
-	struct srpt_device *sdev = ch->sport->sdev;
-	int ret;
-
-	atomic_inc(&ch->req_lim);
-
-	ret = -ENOMEM;
-	if (unlikely(atomic_dec_return(&ch->sq_wr_avail) < 0)) {
-		pr_warn("IB send queue full (needed 1)\n");
-		goto out;
-	}
-
-	ib_dma_sync_single_for_device(sdev->device, ioctx->ioctx.dma, len,
-				      DMA_TO_DEVICE);
-
-	list.addr = ioctx->ioctx.dma;
-	list.length = len;
-	list.lkey = sdev->pd->local_dma_lkey;
-
-	ioctx->ioctx.cqe.done = srpt_send_done;
-	wr.next = NULL;
-	wr.wr_cqe = &ioctx->ioctx.cqe;
-	wr.sg_list = &list;
-	wr.num_sge = 1;
-	wr.opcode = IB_WR_SEND;
-	wr.send_flags = IB_SEND_SIGNALED;
-
-	ret = ib_post_send(ch->qp, &wr, &bad_wr);
-
-out:
-	if (ret < 0) {
-		atomic_inc(&ch->sq_wr_avail);
-		atomic_dec(&ch->req_lim);
-	}
-	return ret;
-}
-
-/**
  * srpt_zerolength_write() - Perform a zero-length RDMA write.
  *
  * A quote from the InfiniBand specification: C9-88: For an HCA responder
@@ -843,6 +797,110 @@ static void srpt_zerolength_write_done(struct ib_cq *cq, struct ib_wc *wc)
 	}
 }
 
+static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
+		struct srp_direct_buf *db, int nbufs, struct scatterlist **sg,
+		unsigned *sg_cnt)
+{
+	enum dma_data_direction dir = target_reverse_dma_direction(&ioctx->cmd);
+	struct srpt_rdma_ch *ch = ioctx->ch;
+	struct scatterlist *prev = NULL;
+	unsigned prev_nents;
+	int ret, i;
+
+	if (nbufs == 1) {
+		ioctx->rw_ctxs = &ioctx->s_rw_ctx;
+	} else {
+		ioctx->rw_ctxs = kmalloc_array(nbufs, sizeof(*ioctx->rw_ctxs),
+				GFP_KERNEL);
+		if (!ioctx->rw_ctxs)
+			return -ENOMEM;
+	}
+
+	for (i = 0; i < nbufs; i++, db++) {
+		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
+		u64 remote_addr = be64_to_cpu(db->va);
+		u32 size = be32_to_cpu(db->len);
+		u32 rkey = be32_to_cpu(db->key);
+
+		ret = target_alloc_sgl(&ctx->sg, &ctx->nents, size, false,
+				i < nbufs - 1);
+		if (ret)
+			goto unwind;
+
+		ret = rdma_rw_ctx_init(&ctx->rw, ch->qp, ch->sport->port,
+				ctx->sg, ctx->nents, 0, remote_addr, rkey, dir);
+		if (ret < 0) {
+			target_free_sgl(ctx->sg, ctx->nents);
+			goto unwind;
+		}
+
+		ioctx->n_rdma += ret;
+
+		if (prev) {
+			sg_unmark_end(&prev[prev_nents - 1]);
+			sg_chain(prev, prev_nents + 1, ctx->sg);
+		} else {
+			*sg = ctx->sg;
+		}
+
+		prev = ctx->sg;
+		prev_nents = ctx->nents;
+
+		*sg_cnt += ctx->nents;
+	}
+
+	ioctx->n_rw_ctx = nbufs;
+	return 0;
+
+unwind:
+	while (--i >= 0) {
+		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
+
+		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
+				ctx->sg, ctx->nents, dir);
+		target_free_sgl(ctx->sg, ctx->nents);
+	}
+	if (ioctx->rw_ctxs && ioctx->rw_ctxs != &ioctx->s_rw_ctx)
+		kfree(ioctx->rw_ctxs);
+	return ret;
+}
+
+static void srpt_free_rw_ctxs(struct srpt_rdma_ch *ch,
+				    struct srpt_send_ioctx *ioctx)
+{
+	enum dma_data_direction dir = target_reverse_dma_direction(&ioctx->cmd);
+	int i;
+
+	for (i = 0; i < ioctx->n_rw_ctx; i++) {
+		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
+
+		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
+				ctx->sg, ctx->nents, dir);
+		target_free_sgl(ctx->sg, ctx->nents);
+	}
+
+	if (ioctx->rw_ctxs && ioctx->rw_ctxs != &ioctx->s_rw_ctx)
+		kfree(ioctx->rw_ctxs);
+}
+
+static inline void *srpt_get_desc_buf(struct srp_cmd *srp_cmd)
+{
+	/*
+	 * The pointer computations below will only be compiled correctly
+	 * if srp_cmd::add_data is declared as s8*, u8*, s8[] or u8[], so check
+	 * whether srp_cmd::add_data has been declared as a byte pointer.
+	 */
+	BUILD_BUG_ON(!__same_type(srp_cmd->add_data[0], (s8)0)
+		     && !__same_type(srp_cmd->add_data[0], (u8)0));
+
+	/*
+	 * According to the SRP spec, the lower two bits of the 'ADDITIONAL
+	 * CDB LENGTH' field are reserved and the size in bytes of this field
+	 * is four times the value specified in bits 3..7. Hence the "& ~3".
+	 */
+	return srp_cmd->add_data + (srp_cmd->add_cdb_len & ~3);
+}
+
 /**
  * srpt_get_desc_tbl() - Parse the data descriptors of an SRP_CMD request.
  * @ioctx: Pointer to the I/O context associated with the request.
@@ -858,94 +916,59 @@ static void srpt_zerolength_write_done(struct ib_cq *cq, struct ib_wc *wc)
  * -ENOMEM when memory allocation fails and zero upon success.
  */
 static int srpt_get_desc_tbl(struct srpt_send_ioctx *ioctx,
-			     struct srp_cmd *srp_cmd,
-			     enum dma_data_direction *dir, u64 *data_len)
+		struct srp_cmd *srp_cmd, enum dma_data_direction *dir,
+		struct scatterlist **sg, unsigned *sg_cnt, u64 *data_len)
 {
-	struct srp_indirect_buf *idb;
-	struct srp_direct_buf *db;
-	unsigned add_cdb_offset;
-	int ret;
-
-	/*
-	 * The pointer computations below will only be compiled correctly
-	 * if srp_cmd::add_data is declared as s8*, u8*, s8[] or u8[], so check
-	 * whether srp_cmd::add_data has been declared as a byte pointer.
-	 */
-	BUILD_BUG_ON(!__same_type(srp_cmd->add_data[0], (s8)0)
-		     && !__same_type(srp_cmd->add_data[0], (u8)0));
-
 	BUG_ON(!dir);
 	BUG_ON(!data_len);
 
-	ret = 0;
-	*data_len = 0;
-
 	/*
 	 * The lower four bits of the buffer format field contain the DATA-IN
 	 * buffer descriptor format, and the highest four bits contain the
 	 * DATA-OUT buffer descriptor format.
 	 */
-	*dir = DMA_NONE;
 	if (srp_cmd->buf_fmt & 0xf)
 		/* DATA-IN: transfer data from target to initiator (read). */
 		*dir = DMA_FROM_DEVICE;
 	else if (srp_cmd->buf_fmt >> 4)
 		/* DATA-OUT: transfer data from initiator to target (write). */
 		*dir = DMA_TO_DEVICE;
+	else
+		*dir = DMA_NONE;
+
+	/* initialize data_direction early as srpt_alloc_rw_ctxs needs it */
+	ioctx->cmd.data_direction = *dir;
 
-	/*
-	 * According to the SRP spec, the lower two bits of the 'ADDITIONAL
-	 * CDB LENGTH' field are reserved and the size in bytes of this field
-	 * is four times the value specified in bits 3..7. Hence the "& ~3".
-	 */
-	add_cdb_offset = srp_cmd->add_cdb_len & ~3;
 	if (((srp_cmd->buf_fmt & 0xf) == SRP_DATA_DESC_DIRECT) ||
 	    ((srp_cmd->buf_fmt >> 4) == SRP_DATA_DESC_DIRECT)) {
-		ioctx->n_rbuf = 1;
-		ioctx->rbufs = &ioctx->single_rbuf;
+	    	struct srp_direct_buf *db = srpt_get_desc_buf(srp_cmd);
 
-		db = (struct srp_direct_buf *)(srp_cmd->add_data
-					       + add_cdb_offset);
-		memcpy(ioctx->rbufs, db, sizeof(*db));
 		*data_len = be32_to_cpu(db->len);
+		return srpt_alloc_rw_ctxs(ioctx, db, 1, sg, sg_cnt);
 	} else if (((srp_cmd->buf_fmt & 0xf) == SRP_DATA_DESC_INDIRECT) ||
 		   ((srp_cmd->buf_fmt >> 4) == SRP_DATA_DESC_INDIRECT)) {
-		idb = (struct srp_indirect_buf *)(srp_cmd->add_data
-						  + add_cdb_offset);
-
-		ioctx->n_rbuf = be32_to_cpu(idb->table_desc.len) / sizeof(*db);
+		struct srp_indirect_buf *idb = srpt_get_desc_buf(srp_cmd);
+		int nbufs = be32_to_cpu(idb->table_desc.len) /
+				sizeof(struct srp_direct_buf);
 
-		if (ioctx->n_rbuf >
+		if (nbufs >
 		    (srp_cmd->data_out_desc_cnt + srp_cmd->data_in_desc_cnt)) {
 			pr_err("received unsupported SRP_CMD request"
 			       " type (%u out + %u in != %u / %zu)\n",
 			       srp_cmd->data_out_desc_cnt,
 			       srp_cmd->data_in_desc_cnt,
 			       be32_to_cpu(idb->table_desc.len),
-			       sizeof(*db));
-			ioctx->n_rbuf = 0;
-			ret = -EINVAL;
-			goto out;
-		}
-
-		if (ioctx->n_rbuf == 1)
-			ioctx->rbufs = &ioctx->single_rbuf;
-		else {
-			ioctx->rbufs =
-				kmalloc(ioctx->n_rbuf * sizeof(*db), GFP_ATOMIC);
-			if (!ioctx->rbufs) {
-				ioctx->n_rbuf = 0;
-				ret = -ENOMEM;
-				goto out;
-			}
+			       sizeof(struct srp_direct_buf));
+			return -EINVAL;
 		}
 
-		db = idb->desc_list;
-		memcpy(ioctx->rbufs, db, ioctx->n_rbuf * sizeof(*db));
 		*data_len = be32_to_cpu(idb->len);
+		return srpt_alloc_rw_ctxs(ioctx, idb->desc_list, nbufs,
+				sg, sg_cnt);
+	} else {
+		*data_len = 0;
+		return 0;
 	}
-out:
-	return ret;
 }
 
 /**
@@ -1049,217 +1072,6 @@ static int srpt_ch_qp_err(struct srpt_rdma_ch *ch)
 }
 
 /**
- * srpt_unmap_sg_to_ib_sge() - Unmap an IB SGE list.
- */
-static void srpt_unmap_sg_to_ib_sge(struct srpt_rdma_ch *ch,
-				    struct srpt_send_ioctx *ioctx)
-{
-	struct scatterlist *sg;
-	enum dma_data_direction dir;
-
-	BUG_ON(!ch);
-	BUG_ON(!ioctx);
-	BUG_ON(ioctx->n_rdma && !ioctx->rdma_wrs);
-
-	while (ioctx->n_rdma)
-		kfree(ioctx->rdma_wrs[--ioctx->n_rdma].wr.sg_list);
-
-	kfree(ioctx->rdma_wrs);
-	ioctx->rdma_wrs = NULL;
-
-	if (ioctx->mapped_sg_count) {
-		sg = ioctx->sg;
-		WARN_ON(!sg);
-		dir = ioctx->cmd.data_direction;
-		BUG_ON(dir == DMA_NONE);
-		ib_dma_unmap_sg(ch->sport->sdev->device, sg, ioctx->sg_cnt,
-				target_reverse_dma_direction(&ioctx->cmd));
-		ioctx->mapped_sg_count = 0;
-	}
-}
-
-/**
- * srpt_map_sg_to_ib_sge() - Map an SG list to an IB SGE list.
- */
-static int srpt_map_sg_to_ib_sge(struct srpt_rdma_ch *ch,
-				 struct srpt_send_ioctx *ioctx)
-{
-	struct ib_device *dev = ch->sport->sdev->device;
-	struct se_cmd *cmd;
-	struct scatterlist *sg, *sg_orig;
-	int sg_cnt;
-	enum dma_data_direction dir;
-	struct ib_rdma_wr *riu;
-	struct srp_direct_buf *db;
-	dma_addr_t dma_addr;
-	struct ib_sge *sge;
-	u64 raddr;
-	u32 rsize;
-	u32 tsize;
-	u32 dma_len;
-	int count, nrdma;
-	int i, j, k;
-
-	BUG_ON(!ch);
-	BUG_ON(!ioctx);
-	cmd = &ioctx->cmd;
-	dir = cmd->data_direction;
-	BUG_ON(dir == DMA_NONE);
-
-	ioctx->sg = sg = sg_orig = cmd->t_data_sg;
-	ioctx->sg_cnt = sg_cnt = cmd->t_data_nents;
-
-	count = ib_dma_map_sg(ch->sport->sdev->device, sg, sg_cnt,
-			      target_reverse_dma_direction(cmd));
-	if (unlikely(!count))
-		return -EAGAIN;
-
-	ioctx->mapped_sg_count = count;
-
-	if (ioctx->rdma_wrs && ioctx->n_rdma_wrs)
-		nrdma = ioctx->n_rdma_wrs;
-	else {
-		nrdma = (count + SRPT_DEF_SG_PER_WQE - 1) / SRPT_DEF_SG_PER_WQE
-			+ ioctx->n_rbuf;
-
-		ioctx->rdma_wrs = kcalloc(nrdma, sizeof(*ioctx->rdma_wrs),
-				GFP_KERNEL);
-		if (!ioctx->rdma_wrs)
-			goto free_mem;
-
-		ioctx->n_rdma_wrs = nrdma;
-	}
-
-	db = ioctx->rbufs;
-	tsize = cmd->data_length;
-	dma_len = ib_sg_dma_len(dev, &sg[0]);
-	riu = ioctx->rdma_wrs;
-
-	/*
-	 * For each remote desc - calculate the #ib_sge.
-	 * If #ib_sge < SRPT_DEF_SG_PER_WQE per rdma operation then
-	 *      each remote desc rdma_iu is required a rdma wr;
-	 * else
-	 *      we need to allocate extra rdma_iu to carry extra #ib_sge in
-	 *      another rdma wr
-	 */
-	for (i = 0, j = 0;
-	     j < count && i < ioctx->n_rbuf && tsize > 0; ++i, ++riu, ++db) {
-		rsize = be32_to_cpu(db->len);
-		raddr = be64_to_cpu(db->va);
-		riu->remote_addr = raddr;
-		riu->rkey = be32_to_cpu(db->key);
-		riu->wr.num_sge = 0;
-
-		/* calculate how many sge required for this remote_buf */
-		while (rsize > 0 && tsize > 0) {
-
-			if (rsize >= dma_len) {
-				tsize -= dma_len;
-				rsize -= dma_len;
-				raddr += dma_len;
-
-				if (tsize > 0) {
-					++j;
-					if (j < count) {
-						sg = sg_next(sg);
-						dma_len = ib_sg_dma_len(
-								dev, sg);
-					}
-				}
-			} else {
-				tsize -= rsize;
-				dma_len -= rsize;
-				rsize = 0;
-			}
-
-			++riu->wr.num_sge;
-
-			if (rsize > 0 &&
-			    riu->wr.num_sge == SRPT_DEF_SG_PER_WQE) {
-				++ioctx->n_rdma;
-				riu->wr.sg_list = kmalloc_array(riu->wr.num_sge,
-						sizeof(*riu->wr.sg_list),
-						GFP_KERNEL);
-				if (!riu->wr.sg_list)
-					goto free_mem;
-
-				++riu;
-				riu->wr.num_sge = 0;
-				riu->remote_addr = raddr;
-				riu->rkey = be32_to_cpu(db->key);
-			}
-		}
-
-		++ioctx->n_rdma;
-		riu->wr.sg_list = kmalloc_array(riu->wr.num_sge,
-					sizeof(*riu->wr.sg_list),
-					GFP_KERNEL);
-		if (!riu->wr.sg_list)
-			goto free_mem;
-	}
-
-	db = ioctx->rbufs;
-	tsize = cmd->data_length;
-	riu = ioctx->rdma_wrs;
-	sg = sg_orig;
-	dma_len = ib_sg_dma_len(dev, &sg[0]);
-	dma_addr = ib_sg_dma_address(dev, &sg[0]);
-
-	/* this second loop is really mapped sg_addres to rdma_iu->ib_sge */
-	for (i = 0, j = 0;
-	     j < count && i < ioctx->n_rbuf && tsize > 0; ++i, ++riu, ++db) {
-		rsize = be32_to_cpu(db->len);
-		sge = riu->wr.sg_list;
-		k = 0;
-
-		while (rsize > 0 && tsize > 0) {
-			sge->addr = dma_addr;
-			sge->lkey = ch->sport->sdev->pd->local_dma_lkey;
-
-			if (rsize >= dma_len) {
-				sge->length =
-					(tsize < dma_len) ? tsize : dma_len;
-				tsize -= dma_len;
-				rsize -= dma_len;
-
-				if (tsize > 0) {
-					++j;
-					if (j < count) {
-						sg = sg_next(sg);
-						dma_len = ib_sg_dma_len(
-								dev, sg);
-						dma_addr = ib_sg_dma_address(
-								dev, sg);
-					}
-				}
-			} else {
-				sge->length = (tsize < rsize) ? tsize : rsize;
-				tsize -= rsize;
-				dma_len -= rsize;
-				dma_addr += rsize;
-				rsize = 0;
-			}
-
-			++k;
-			if (k == riu->wr.num_sge && rsize > 0 && tsize > 0) {
-				++riu;
-				sge = riu->wr.sg_list;
-				k = 0;
-			} else if (rsize > 0 && tsize > 0)
-				++sge;
-		}
-	}
-
-	return 0;
-
-free_mem:
-	srpt_unmap_sg_to_ib_sge(ch, ioctx);
-
-	return -ENOMEM;
-}
-
-/**
  * srpt_get_send_ioctx() - Obtain an I/O context for sending to the initiator.
  */
 static struct srpt_send_ioctx *srpt_get_send_ioctx(struct srpt_rdma_ch *ch)
@@ -1284,12 +1096,8 @@ static struct srpt_send_ioctx *srpt_get_send_ioctx(struct srpt_rdma_ch *ch)
 	BUG_ON(ioctx->ch != ch);
 	spin_lock_init(&ioctx->spinlock);
 	ioctx->state = SRPT_STATE_NEW;
-	ioctx->n_rbuf = 0;
-	ioctx->rbufs = NULL;
+	ioctx->n_rw_ctx = 0;
 	ioctx->n_rdma = 0;
-	ioctx->n_rdma_wrs = 0;
-	ioctx->rdma_wrs = NULL;
-	ioctx->mapped_sg_count = 0;
 	init_completion(&ioctx->tx_done);
 	ioctx->queue_status_only = false;
 	/*
@@ -1359,7 +1167,6 @@ static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
 		 * SRP_RSP sending failed or the SRP_RSP send completion has
 		 * not been received in time.
 		 */
-		srpt_unmap_sg_to_ib_sge(ioctx->ch, ioctx);
 		transport_generic_free_cmd(&ioctx->cmd, 0);
 		break;
 	case SRPT_STATE_MGMT_RSP_SENT:
@@ -1387,6 +1194,7 @@ static void srpt_rdma_read_done(struct ib_cq *cq, struct ib_wc *wc)
 
 	WARN_ON(ioctx->n_rdma <= 0);
 	atomic_add(ioctx->n_rdma, &ch->sq_wr_avail);
+	ioctx->n_rdma = 0;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		pr_info("RDMA_READ for ioctx 0x%p failed with status %d\n",
@@ -1403,23 +1211,6 @@ static void srpt_rdma_read_done(struct ib_cq *cq, struct ib_wc *wc)
 		       __LINE__, srpt_get_cmd_state(ioctx));
 }
 
-static void srpt_rdma_write_done(struct ib_cq *cq, struct ib_wc *wc)
-{
-	struct srpt_send_ioctx *ioctx =
-		container_of(wc->wr_cqe, struct srpt_send_ioctx, rdma_cqe);
-
-	if (unlikely(wc->status != IB_WC_SUCCESS)) {
-		/*
-		 * Note: if an RDMA write error completion is received that
-		 * means that a SEND also has been posted. Defer further
-		 * processing of the associated command until the send error
-		 * completion has been received.
-		 */
-		pr_info("RDMA_WRITE for ioctx 0x%p failed with status %d\n",
-			ioctx, wc->status);
-	}
-}
-
 /**
  * srpt_build_cmd_rsp() - Build an SRP_RSP response.
  * @ch: RDMA channel through which the request has been received.
@@ -1537,6 +1328,8 @@ static void srpt_handle_cmd(struct srpt_rdma_ch *ch,
 {
 	struct se_cmd *cmd;
 	struct srp_cmd *srp_cmd;
+	struct scatterlist *sg = NULL;
+	unsigned sg_cnt = 0;
 	u64 data_len;
 	enum dma_data_direction dir;
 	int rc;
@@ -1563,16 +1356,18 @@ static void srpt_handle_cmd(struct srpt_rdma_ch *ch,
 		break;
 	}
 
-	if (srpt_get_desc_tbl(send_ioctx, srp_cmd, &dir, &data_len)) {
+	if (srpt_get_desc_tbl(send_ioctx, srp_cmd, &dir, &sg, &sg_cnt,
+			&data_len)) {
 		pr_err("0x%llx: parsing SRP descriptor table failed.\n",
 		       srp_cmd->tag);
 		goto release_ioctx;
 	}
 
-	rc = target_submit_cmd(cmd, ch->sess, srp_cmd->cdb,
+	rc = target_submit_cmd_map_sgls(cmd, ch->sess, srp_cmd->cdb,
 			       &send_ioctx->sense_data[0],
 			       scsilun_to_int(&srp_cmd->lun), data_len,
-			       TCM_SIMPLE_TAG, dir, TARGET_SCF_ACK_KREF);
+			       TCM_SIMPLE_TAG, dir, TARGET_SCF_ACK_KREF,
+			       sg, sg_cnt, NULL, 0, NULL, 0);
 	if (rc != 0) {
 		pr_debug("target_submit_cmd() returned %d for tag %#llx\n", rc,
 			 srp_cmd->tag);
@@ -1779,14 +1574,13 @@ static void srpt_send_done(struct ib_cq *cq, struct ib_wc *wc)
 	WARN_ON(state != SRPT_STATE_CMD_RSP_SENT &&
 		state != SRPT_STATE_MGMT_RSP_SENT);
 
-	atomic_inc(&ch->sq_wr_avail);
+	atomic_add(1 + ioctx->n_rdma, &ch->sq_wr_avail);
 
 	if (wc->status != IB_WC_SUCCESS)
 		pr_info("sending response for ioctx 0x%p failed"
 			" with status %d\n", ioctx, wc->status);
 
 	if (state != SRPT_STATE_DONE) {
-		srpt_unmap_sg_to_ib_sge(ch, ioctx);
 		transport_generic_free_cmd(&ioctx->cmd, 0);
 	} else {
 		pr_err("IB completion has been received too late for"
@@ -1832,8 +1626,11 @@ retry:
 	qp_init->srq = sdev->srq;
 	qp_init->sq_sig_type = IB_SIGNAL_REQ_WR;
 	qp_init->qp_type = IB_QPT_RC;
-	qp_init->cap.max_send_wr = srp_sq_size;
-	qp_init->cap.max_send_sge = SRPT_DEF_SG_PER_WQE;
+	qp_init->cap.max_send_wr = srp_sq_size / 2;
+	qp_init->cap.max_rdma_ctxs = srp_sq_size / 2;
+	qp_init->cap.max_send_sge = max(sdev->device->attrs.max_sge_rd,
+					sdev->device->attrs.max_sge);
+	qp_init->port_num = ch->sport->port;
 
 	ch->qp = ib_create_qp(sdev->pd, qp_init);
 	if (IS_ERR(ch->qp)) {
@@ -2397,95 +2194,6 @@ static int srpt_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
 	return ret;
 }
 
-/**
- * srpt_perform_rdmas() - Perform IB RDMA.
- *
- * Returns zero upon success or a negative number upon failure.
- */
-static int srpt_perform_rdmas(struct srpt_rdma_ch *ch,
-			      struct srpt_send_ioctx *ioctx)
-{
-	struct ib_send_wr *bad_wr;
-	int sq_wr_avail, ret, i;
-	enum dma_data_direction dir;
-	const int n_rdma = ioctx->n_rdma;
-
-	dir = ioctx->cmd.data_direction;
-	if (dir == DMA_TO_DEVICE) {
-		/* write */
-		ret = -ENOMEM;
-		sq_wr_avail = atomic_sub_return(n_rdma, &ch->sq_wr_avail);
-		if (sq_wr_avail < 0) {
-			pr_warn("IB send queue full (needed %d)\n",
-				n_rdma);
-			goto out;
-		}
-	}
-
-	for (i = 0; i < n_rdma; i++) {
-		struct ib_send_wr *wr = &ioctx->rdma_wrs[i].wr;
-
-		wr->opcode = (dir == DMA_FROM_DEVICE) ?
-				IB_WR_RDMA_WRITE : IB_WR_RDMA_READ;
-
-		if (i == n_rdma - 1) {
-			/* only get completion event for the last rdma read */
-			if (dir == DMA_TO_DEVICE) {
-				wr->send_flags = IB_SEND_SIGNALED;
-				ioctx->rdma_cqe.done = srpt_rdma_read_done;
-			} else {
-				ioctx->rdma_cqe.done = srpt_rdma_write_done;
-			}
-			wr->wr_cqe = &ioctx->rdma_cqe;
-			wr->next = NULL;
-		} else {
-			wr->wr_cqe = NULL;
-			wr->next = &ioctx->rdma_wrs[i + 1].wr;
-		}
-	}
-
-	ret = ib_post_send(ch->qp, &ioctx->rdma_wrs->wr, &bad_wr);
-	if (ret)
-		pr_err("%s[%d]: ib_post_send() returned %d for %d/%d\n",
-				 __func__, __LINE__, ret, i, n_rdma);
-out:
-	if (unlikely(dir == DMA_TO_DEVICE && ret < 0))
-		atomic_add(n_rdma, &ch->sq_wr_avail);
-	return ret;
-}
-
-/**
- * srpt_xfer_data() - Start data transfer from initiator to target.
- */
-static int srpt_xfer_data(struct srpt_rdma_ch *ch,
-			  struct srpt_send_ioctx *ioctx)
-{
-	int ret;
-
-	ret = srpt_map_sg_to_ib_sge(ch, ioctx);
-	if (ret) {
-		pr_err("%s[%d] ret=%d\n", __func__, __LINE__, ret);
-		goto out;
-	}
-
-	ret = srpt_perform_rdmas(ch, ioctx);
-	if (ret) {
-		if (ret == -EAGAIN || ret == -ENOMEM)
-			pr_info("%s[%d] queue full -- ret=%d\n",
-				__func__, __LINE__, ret);
-		else
-			pr_err("%s[%d] fatal error -- ret=%d\n",
-			       __func__, __LINE__, ret);
-		goto out_unmap;
-	}
-
-out:
-	return ret;
-out_unmap:
-	srpt_unmap_sg_to_ib_sge(ch, ioctx);
-	goto out;
-}
-
 static int srpt_write_pending_status(struct se_cmd *se_cmd)
 {
 	struct srpt_send_ioctx *ioctx;
@@ -2502,11 +2210,42 @@ static int srpt_write_pending(struct se_cmd *se_cmd)
 	struct srpt_send_ioctx *ioctx =
 		container_of(se_cmd, struct srpt_send_ioctx, cmd);
 	struct srpt_rdma_ch *ch = ioctx->ch;
+	struct ib_send_wr *first_wr = NULL, *bad_wr;
+	struct ib_cqe *cqe = &ioctx->rdma_cqe;
 	enum srpt_command_state new_state;
+	int ret, i;
 
 	new_state = srpt_set_cmd_state(ioctx, SRPT_STATE_NEED_DATA);
 	WARN_ON(new_state == SRPT_STATE_DONE);
-	return srpt_xfer_data(ch, ioctx);
+
+	if (atomic_sub_return(ioctx->n_rdma, &ch->sq_wr_avail) < 0) {
+		pr_warn("%s: IB send queue full (needed %d)\n",
+				__func__, ioctx->n_rdma);
+		ret = -ENOMEM;
+		goto out_undo;
+	}
+
+	cqe->done = srpt_rdma_read_done;
+	for (i = ioctx->n_rw_ctx - 1; i >= 0; i--) {
+		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
+
+		first_wr = rdma_rw_ctx_wrs(&ctx->rw, ch->qp, ch->sport->port,
+				cqe, first_wr);
+		cqe = NULL;
+	}
+
+	ret = ib_post_send(ch->qp, first_wr, &bad_wr);
+	if (ret) {
+		pr_err("%s: ib_post_send() returned %d for %d (avail: %d)\n",
+			 __func__, ret, ioctx->n_rdma,
+			 atomic_read(&ch->sq_wr_avail));
+		goto out_undo;
+	}
+
+	return 0;
+out_undo:
+	atomic_add(ioctx->n_rdma, &ch->sq_wr_avail);
+	return ret;
 }
 
 static u8 tcm_to_srp_tsk_mgmt_status(const int tcm_mgmt_status)
@@ -2528,17 +2267,17 @@ static u8 tcm_to_srp_tsk_mgmt_status(const int tcm_mgmt_status)
  */
 static void srpt_queue_response(struct se_cmd *cmd)
 {
-	struct srpt_rdma_ch *ch;
-	struct srpt_send_ioctx *ioctx;
+	struct srpt_send_ioctx *ioctx =
+		container_of(cmd, struct srpt_send_ioctx, cmd);
+	struct srpt_rdma_ch *ch = ioctx->ch;
+	struct srpt_device *sdev = ch->sport->sdev;
+	struct ib_send_wr send_wr, *first_wr = NULL, *bad_wr;
+	struct ib_sge sge;
 	enum srpt_command_state state;
 	unsigned long flags;
-	int ret;
-	enum dma_data_direction dir;
-	int resp_len;
+	int resp_len, ret, i;
 	u8 srp_tm_status;
 
-	ioctx = container_of(cmd, struct srpt_send_ioctx, cmd);
-	ch = ioctx->ch;
 	BUG_ON(!ch);
 
 	spin_lock_irqsave(&ioctx->spinlock, flags);
@@ -2565,17 +2304,19 @@ static void srpt_queue_response(struct se_cmd *cmd)
 		return;
 	}
 
-	dir = ioctx->cmd.data_direction;
-
 	/* For read commands, transfer the data to the initiator. */
-	if (dir == DMA_FROM_DEVICE && ioctx->cmd.data_length &&
+	if (ioctx->cmd.data_direction == DMA_FROM_DEVICE &&
+	    ioctx->cmd.data_length &&
 	    !ioctx->queue_status_only) {
-		ret = srpt_xfer_data(ch, ioctx);
-		if (ret) {
-			pr_err("xfer_data failed for tag %llu\n",
-			       ioctx->cmd.tag);
-			return;
+		for (i = ioctx->n_rw_ctx - 1; i >= 0; i--) {
+			struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
+
+			first_wr = rdma_rw_ctx_wrs(&ctx->rw, ch->qp,
+					ch->sport->port, NULL,
+					first_wr ? first_wr : &send_wr);
 		}
+	} else {
+		first_wr = &send_wr;
 	}
 
 	if (state != SRPT_STATE_MGMT)
@@ -2587,14 +2328,46 @@ static void srpt_queue_response(struct se_cmd *cmd)
 		resp_len = srpt_build_tskmgmt_rsp(ch, ioctx, srp_tm_status,
 						 ioctx->cmd.tag);
 	}
-	ret = srpt_post_send(ch, ioctx, resp_len);
-	if (ret) {
-		pr_err("sending cmd response failed for tag %llu\n",
-		       ioctx->cmd.tag);
-		srpt_unmap_sg_to_ib_sge(ch, ioctx);
-		srpt_set_cmd_state(ioctx, SRPT_STATE_DONE);
-		target_put_sess_cmd(&ioctx->cmd);
+
+	atomic_inc(&ch->req_lim);
+
+	if (unlikely(atomic_sub_return(1 + ioctx->n_rdma,
+			&ch->sq_wr_avail) < 0)) {
+		pr_warn("%s: IB send queue full (needed %d)\n",
+				__func__, ioctx->n_rdma);
+		ret = -ENOMEM;
+		goto out;
 	}
+
+	ib_dma_sync_single_for_device(sdev->device, ioctx->ioctx.dma, resp_len,
+				      DMA_TO_DEVICE);
+
+	sge.addr = ioctx->ioctx.dma;
+	sge.length = resp_len;
+	sge.lkey = sdev->pd->local_dma_lkey;
+
+	ioctx->ioctx.cqe.done = srpt_send_done;
+	send_wr.next = NULL;
+	send_wr.wr_cqe = &ioctx->ioctx.cqe;
+	send_wr.sg_list = &sge;
+	send_wr.num_sge = 1;
+	send_wr.opcode = IB_WR_SEND;
+	send_wr.send_flags = IB_SEND_SIGNALED;
+
+	ret = ib_post_send(ch->qp, first_wr, &bad_wr);
+	if (ret < 0) {
+		pr_err("%s: sending cmd response failed for tag %llu (%d)\n",
+			__func__, ioctx->cmd.tag, ret);
+		goto out;
+	}
+
+	return;
+
+out:
+	atomic_add(1 + ioctx->n_rdma, &ch->sq_wr_avail);
+	atomic_dec(&ch->req_lim);
+	srpt_set_cmd_state(ioctx, SRPT_STATE_DONE);
+	target_put_sess_cmd(&ioctx->cmd);
 }
 
 static int srpt_queue_data_in(struct se_cmd *cmd)
@@ -2610,10 +2383,6 @@ static void srpt_queue_tm_rsp(struct se_cmd *cmd)
 
 static void srpt_aborted_task(struct se_cmd *cmd)
 {
-	struct srpt_send_ioctx *ioctx = container_of(cmd,
-				struct srpt_send_ioctx, cmd);
-
-	srpt_unmap_sg_to_ib_sge(ioctx->ch, ioctx);
 }
 
 static int srpt_queue_status(struct se_cmd *cmd)
@@ -2914,12 +2683,10 @@ static void srpt_release_cmd(struct se_cmd *se_cmd)
 	unsigned long flags;
 
 	WARN_ON(ioctx->state != SRPT_STATE_DONE);
-	WARN_ON(ioctx->mapped_sg_count != 0);
 
-	if (ioctx->n_rbuf > 1) {
-		kfree(ioctx->rbufs);
-		ioctx->rbufs = NULL;
-		ioctx->n_rbuf = 0;
+	if (ioctx->n_rw_ctx > 1) {
+		srpt_free_rw_ctxs(ch, ioctx);
+		ioctx->n_rw_ctx = 0;
 	}
 
 	spin_lock_irqsave(&ch->spinlock, flags);
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h b/drivers/infiniband/ulp/srpt/ib_srpt.h
index af9b8b5..7a8ebc7 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.h
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
@@ -42,6 +42,7 @@
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_sa.h>
 #include <rdma/ib_cm.h>
+#include <rdma/rw.h>
 
 #include <scsi/srp.h>
 
@@ -105,7 +106,6 @@ enum {
 	SRP_LOGIN_RSP_MULTICHAN_MAINTAINED = 0x2,
 
 	SRPT_DEF_SG_TABLESIZE = 128,
-	SRPT_DEF_SG_PER_WQE = 16,
 
 	MIN_SRPT_SQ_SIZE = 16,
 	DEF_SRPT_SQ_SIZE = 4096,
@@ -174,21 +174,18 @@ struct srpt_recv_ioctx {
 	struct srpt_ioctx	ioctx;
 	struct list_head	wait_list;
 };
+
+struct srpt_rw_ctx {
+	struct rdma_rw_ctx	rw;
+	struct scatterlist	*sg;
+	unsigned int		nents;
+};
 
 /**
  * struct srpt_send_ioctx - SRPT send I/O context.
  * @ioctx:       See above.
  * @ch:          Channel pointer.
  * @free_list:   Node in srpt_rdma_ch.free_list.
- * @n_rbuf:      Number of data buffers in the received SRP command.
- * @rbufs:       Pointer to SRP data buffer array.
- * @single_rbuf: SRP data buffer if the command has only a single buffer.
- * @sg:          Pointer to sg-list associated with this I/O context.
- * @sg_cnt:      SG-list size.
- * @mapped_sg_count: ib_dma_map_sg() return value.
- * @n_rdma_wrs:  Number of elements in the rdma_wrs array.
- * @rdma_wrs:    Array with information about the RDMA mapping.
- * @tag:         Tag of the received SRP information unit.
  * @spinlock:    Protects 'state'.
  * @state:       I/O context state.
  * @cmd:         Target core command data structure.
@@ -197,21 +194,18 @@ struct srpt_recv_ioctx {
 struct srpt_send_ioctx {
 	struct srpt_ioctx	ioctx;
 	struct srpt_rdma_ch	*ch;
-	struct ib_rdma_wr	*rdma_wrs;
+
+	struct srpt_rw_ctx	s_rw_ctx;
+	struct srpt_rw_ctx	*rw_ctxs;
+
 	struct ib_cqe		rdma_cqe;
-	struct srp_direct_buf	*rbufs;
-	struct srp_direct_buf	single_rbuf;
-	struct scatterlist	*sg;
 	struct list_head	free_list;
 	spinlock_t		spinlock;
 	enum srpt_command_state	state;
 	struct se_cmd		cmd;
 	struct completion	tx_done;
-	int			sg_cnt;
-	int			mapped_sg_count;
-	u16			n_rdma_wrs;
 	u8			n_rdma;
-	u8			n_rbuf;
+	u8			n_rw_ctx;
 	bool			queue_status_only;
 	u8			sense_data[TRANSPORT_SENSE_BUFFER];
 };
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
  2016-03-08 18:16 ` [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp Christoph Hellwig
@ 2016-03-08 19:25   ` Bart Van Assche
  2016-03-08 19:27   ` Leon Romanovsky
  1 sibling, 0 replies; 46+ messages in thread
From: Bart Van Assche @ 2016-03-08 19:25 UTC (permalink / raw)
  To: Christoph Hellwig, dledford; +Cc: sagig, swise, linux-rdma, target-devel

On 03/08/2016 10:16 AM, Christoph Hellwig wrote:
> The new RW API will need this.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
  2016-03-08 18:16 ` [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp Christoph Hellwig
  2016-03-08 19:25   ` Bart Van Assche
@ 2016-03-08 19:27   ` Leon Romanovsky
       [not found]     ` <20160308192709.GJ13396-2ukJVAZIZ/Y@public.gmane.org>
  1 sibling, 1 reply; 46+ messages in thread
From: Leon Romanovsky @ 2016-03-08 19:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dledford, bart.vanassche, sagig, swise, linux-rdma, target-devel

On Tue, Mar 08, 2016 at 07:16:32PM +0100, Christoph Hellwig wrote:
> The new RW API will need this.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Steve Wise <swise@opengridcomputing.com>
> ---
>  drivers/infiniband/core/cma.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 9729639..a791522 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -800,6 +800,7 @@ int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
>  	if (id->device != pd->device)
>  		return -EINVAL;
>  
> +	qp_init_attr->port_num = id->port_num;

I doubt whether this is the right place to update qp_init_attr structure.
Maybe the better solution will be to update the callers of rdma_create_qp?
It will place all qp_init_attr assignments in one place.

>  	qp = ib_create_qp(pd, qp_init_attr);
>  	if (IS_ERR(qp))
>  		return PTR_ERR(qp);
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 2/9] IB/core: allow passing mapping an offset into the SG in ib_map_mr_sg
  2016-03-08 18:16 ` [PATCH v4 2/9] IB/core: allow passing mapping an offset into the SG in ib_map_mr_sg Christoph Hellwig
@ 2016-03-08 19:27   ` Bart Van Assche
  0 siblings, 0 replies; 46+ messages in thread
From: Bart Van Assche @ 2016-03-08 19:27 UTC (permalink / raw)
  To: Christoph Hellwig, dledford; +Cc: sagig, swise, linux-rdma, target-devel

On 03/08/2016 10:16 AM, Christoph Hellwig wrote:
> [ ... ]

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/9] IB/core: refactor ib_create_qp
  2016-03-08 18:16 ` [PATCH v4 4/9] IB/core: refactor ib_create_qp Christoph Hellwig
@ 2016-03-08 19:33   ` Bart Van Assche
       [not found]     ` <56DF2917.309-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-08 19:33 UTC (permalink / raw)
  To: Christoph Hellwig, dledford; +Cc: sagig, swise, linux-rdma, target-devel

On 03/08/2016 10:16 AM, Christoph Hellwig wrote:
> +	if (IS_ERR(qp))
> +		real_qp->device->destroy_qp(real_qp);
> +	else
> +		__ib_insert_xrcd_qp(qp_init_attr->xrcd, real_qp);
> +	return qp;
> +}

My preference is to handle the "success" case first as in the original code:

-			if (!IS_ERR(qp))
-				__ib_insert_xrcd_qp(qp_init_attr->xrcd, real_qp);
-			else
-				real_qp->device->destroy_qp(real_qp);

Even if this does not get addressed:

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 5/9] IB/core: add a simple MR pool
       [not found]   ` <1457461000-24088-6-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
@ 2016-03-08 19:37     ` Bart Van Assche
  2016-03-09 12:56       ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-08 19:37 UTC (permalink / raw)
  To: Christoph Hellwig, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On 03/08/2016 10:16 AM, Christoph Hellwig wrote:
> +struct ib_mr *ib_mr_pool_get(struct ib_qp *qp, struct list_head *list)
> +{
> +	struct ib_mr *mr = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&qp->mr_lock, flags);
> +	mr = list_first_entry_or_null(list, struct ib_mr, qp_entry);
> +	if (mr) {
> +		list_del(&mr->qp_entry);
> +		qp->mrs_used++;
> +	}
> +	spin_unlock_irqrestore(&qp->mr_lock, flags);
> +
> +	return mr;
> +}

Hmm ... why has 'mr' been initialized to NULL in this function? I think 
that initialization can be left out. Anyway:

Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
       [not found]     ` <20160308192709.GJ13396-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-03-08 19:38       ` Steve Wise
  2016-03-08 19:55         ` Leon Romanovsky
  0 siblings, 1 reply; 46+ messages in thread
From: Steve Wise @ 2016-03-08 19:38 UTC (permalink / raw)
  To: leon-2ukJVAZIZ/Y, 'Christoph Hellwig'
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ,
	sagig-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

> On Tue, Mar 08, 2016 at 07:16:32PM +0100, Christoph Hellwig wrote:
> > The new RW API will need this.
> >
> > Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> > Tested-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
> > ---
> >  drivers/infiniband/core/cma.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> > index 9729639..a791522 100644
> > --- a/drivers/infiniband/core/cma.c
> > +++ b/drivers/infiniband/core/cma.c
> > @@ -800,6 +800,7 @@ int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
> >  	if (id->device != pd->device)
> >  		return -EINVAL;
> >
> > +	qp_init_attr->port_num = id->port_num;
> 
> I doubt whether this is the right place to update qp_init_attr structure.
> Maybe the better solution will be to update the callers of rdma_create_qp?
> It will place all qp_init_attr assignments in one place.

At the expense of replicating this code and forcing all users to remember to set this.




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 6/9] IB/core: add a need_inval flag to struct ib_mr
  2016-03-08 18:16   ` [PATCH v4 6/9] IB/core: add a need_inval flag to struct ib_mr Christoph Hellwig
@ 2016-03-08 19:41     ` Bart Van Assche
  0 siblings, 0 replies; 46+ messages in thread
From: Bart Van Assche @ 2016-03-08 19:41 UTC (permalink / raw)
  To: Christoph Hellwig, dledford
  Cc: sagig, swise, linux-rdma, target-devel, Steve Wise

On 03/08/2016 10:16 AM, Christoph Hellwig wrote:
> From: Steve Wise <swise@chelsio.com>
>
> This is the first step toward moving MR invalidation decisions
> to the core.  It will be needed by the upcoming RW API.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
  2016-03-08 19:38       ` Steve Wise
@ 2016-03-08 19:55         ` Leon Romanovsky
  2016-03-09 13:58           ` 'Christoph Hellwig'
  0 siblings, 1 reply; 46+ messages in thread
From: Leon Romanovsky @ 2016-03-08 19:55 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Christoph Hellwig',
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ,
	sagig-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On Tue, Mar 08, 2016 at 01:38:20PM -0600, Steve Wise wrote:
> > On Tue, Mar 08, 2016 at 07:16:32PM +0100, Christoph Hellwig wrote:
> > > The new RW API will need this.
> > >
> > > Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> > > Tested-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
> > > ---
> > >  drivers/infiniband/core/cma.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> > > index 9729639..a791522 100644
> > > --- a/drivers/infiniband/core/cma.c
> > > +++ b/drivers/infiniband/core/cma.c
> > > @@ -800,6 +800,7 @@ int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
> > >  	if (id->device != pd->device)
> > >  		return -EINVAL;
> > >
> > > +	qp_init_attr->port_num = id->port_num;
> > 
> > I doubt whether this is the right place to update qp_init_attr structure.
> > Maybe the better solution will be to update the callers of rdma_create_qp?
> > It will place all qp_init_attr assignments in one place.
> 
> At the expense of replicating this code and forcing all users to remember to set this.

There are not that many such users to update, and these users need to set
up the qp_init_attr structure anyway.

> 
> 
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API
  2016-03-08 18:16 ` [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API Christoph Hellwig
@ 2016-03-08 20:13   ` Leon Romanovsky
  2016-03-09 12:59     ` Christoph Hellwig
       [not found]   ` <1457461000-24088-10-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  1 sibling, 1 reply; 46+ messages in thread
From: Leon Romanovsky @ 2016-03-08 20:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dledford, bart.vanassche, sagig, swise, linux-rdma, target-devel

On Tue, Mar 08, 2016 at 07:16:40PM +0100, Christoph Hellwig wrote:
> Replace the homegrown RDMA READ/WRITE code in srpt with the generic API.
> The only real twist here is that we need to allocate one Linux scatterlist
> per direct buffer in the SRP command, and chain them before handing them
> off to the target core.
> 
> As a side-effect of the conversion the driver will also chain the SEND
> of the SRP response to the RDMA WRITE WRs for a DATA OUT command, and
> properly account for RDMA WRITE WRs instead of just for RDMA READ WRs
> like the driver previously did.
> 
> We now allocate half of the SQ size to RDMA READ/WRITE contexts, assuming
> by default one RDMA READ or WRITE operation per command.  If a command
> has multiple operations it will eat into the budget but will still succeed,
> possible after waiting for WQEs to be available.
> 
> Also ensure the QPs request the maximum allowed SGEs so that RDMA R/W API
> works correctly.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/infiniband/ulp/srpt/ib_srpt.c | 693 +++++++++++-----------------------
>  drivers/infiniband/ulp/srpt/ib_srpt.h |  30 +-
>  2 files changed, 242 insertions(+), 481 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
> index 25bdaee..05bdec8 100644
> --- a/drivers/infiniband/ulp/srpt/ib_srpt.c
> +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> @@ -765,52 +765,6 @@ static int srpt_post_recv(struct srpt_device *sdev,
>  }
>  
>  /**
> - * srpt_post_send() - Post an IB send request.
> - *
> - * Returns zero upon success and a non-zero value upon failure.
> - */
> -static int srpt_post_send(struct srpt_rdma_ch *ch,
> -			  struct srpt_send_ioctx *ioctx, int len)
> -{
> -	struct ib_sge list;
> -	struct ib_send_wr wr, *bad_wr;
> -	struct srpt_device *sdev = ch->sport->sdev;
> -	int ret;
> -
> -	atomic_inc(&ch->req_lim);
> -
> -	ret = -ENOMEM;
> -	if (unlikely(atomic_dec_return(&ch->sq_wr_avail) < 0)) {
> -		pr_warn("IB send queue full (needed 1)\n");
> -		goto out;
> -	}
> -
> -	ib_dma_sync_single_for_device(sdev->device, ioctx->ioctx.dma, len,
> -				      DMA_TO_DEVICE);
> -
> -	list.addr = ioctx->ioctx.dma;
> -	list.length = len;
> -	list.lkey = sdev->pd->local_dma_lkey;
> -
> -	ioctx->ioctx.cqe.done = srpt_send_done;
> -	wr.next = NULL;
> -	wr.wr_cqe = &ioctx->ioctx.cqe;
> -	wr.sg_list = &list;
> -	wr.num_sge = 1;
> -	wr.opcode = IB_WR_SEND;
> -	wr.send_flags = IB_SEND_SIGNALED;
> -
> -	ret = ib_post_send(ch->qp, &wr, &bad_wr);
> -
> -out:
> -	if (ret < 0) {
> -		atomic_inc(&ch->sq_wr_avail);
> -		atomic_dec(&ch->req_lim);
> -	}
> -	return ret;
> -}
> -
> -/**
>   * srpt_zerolength_write() - Perform a zero-length RDMA write.
>   *
>   * A quote from the InfiniBand specification: C9-88: For an HCA responder
> @@ -843,6 +797,110 @@ static void srpt_zerolength_write_done(struct ib_cq *cq, struct ib_wc *wc)
>  	}
>  }
>  
> +static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
> +		struct srp_direct_buf *db, int nbufs, struct scatterlist **sg,
> +		unsigned *sg_cnt)
> +{
> +	enum dma_data_direction dir = target_reverse_dma_direction(&ioctx->cmd);
> +	struct srpt_rdma_ch *ch = ioctx->ch;
> +	struct scatterlist *prev = NULL;
> +	unsigned prev_nents;
> +	int ret, i;
> +
> +	if (nbufs == 1) {
> +		ioctx->rw_ctxs = &ioctx->s_rw_ctx;
> +	} else {
> +		ioctx->rw_ctxs = kmalloc_array(nbufs, sizeof(*ioctx->rw_ctxs),
> +				GFP_KERNEL);
> +		if (!ioctx->rw_ctxs)
> +			return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < nbufs; i++, db++) {
> +		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
> +		u64 remote_addr = be64_to_cpu(db->va);
> +		u32 size = be32_to_cpu(db->len);
> +		u32 rkey = be32_to_cpu(db->key);
> +
> +		ret = target_alloc_sgl(&ctx->sg, &ctx->nents, size, false,
> +				i < nbufs - 1);
> +		if (ret)
> +			goto unwind;
> +
> +		ret = rdma_rw_ctx_init(&ctx->rw, ch->qp, ch->sport->port,
> +				ctx->sg, ctx->nents, 0, remote_addr, rkey, dir);
> +		if (ret < 0) {
> +			target_free_sgl(ctx->sg, ctx->nents);
> +			goto unwind;
> +		}
> +
> +		ioctx->n_rdma += ret;
> +
> +		if (prev) {
> +			sg_unmark_end(&prev[prev_nents - 1]);
> +			sg_chain(prev, prev_nents + 1, ctx->sg);
> +		} else {
> +			*sg = ctx->sg;
> +		}
> +
> +		prev = ctx->sg;
> +		prev_nents = ctx->nents;
> +
> +		*sg_cnt += ctx->nents;
> +	}
> +
> +	ioctx->n_rw_ctx = nbufs;
> +	return 0;
> +
> +unwind:
> +	while (--i >= 0) {
> +		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
> +
> +		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
> +				ctx->sg, ctx->nents, dir);
> +		target_free_sgl(ctx->sg, ctx->nents);
> +	}
> +	if (ioctx->rw_ctxs && ioctx->rw_ctxs != &ioctx->s_rw_ctx)

You can jump to the unwind label from one place only. At that stage
"ioctx->rw_ctxs != NULL" will always be true.

> +		kfree(ioctx->rw_ctxs);
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
  2016-03-08 18:16   ` [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API Christoph Hellwig
@ 2016-03-08 21:32     ` Bart Van Assche
  2016-03-09 13:07       ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-08 21:32 UTC (permalink / raw)
  To: Christoph Hellwig, dledford; +Cc: sagig, swise, linux-rdma, target-devel

On 03/08/2016 10:16 AM, Christoph Hellwig wrote:
> +		reg->mr = ib_mr_pool_get(qp, &qp->rdma_mrs);
> +		if (!reg->mr) {
> +			pr_info("failed to allocate MR from pool\n");
> +			ret = -EAGAIN;
> +			goto out_free;
> +		}

The above pr_info() message does not provide enough context information 
to allow a user to figure out why that message was reported. Please 
leave that message out and make the callers of this function report MR 
allocation failure wherever useful.

> +		ret = ib_map_mr_sg(reg->mr, sg, nents, offset,
> +				PAGE_SIZE);
[ ... ]
> +		sg = sg_next(sg);

ib_map_mr_sg() processed 'ret' elements of scatterlist 'sg'. So why is 
'sg' only advanced by one element at the end of the loop?

> +		for (j = 0; j < nr_sge; j++, sg = sg_next(sg)) {
> +			rdma_wr->wr.num_sge++;
> +
> +			sge->addr = ib_sg_dma_address(dev, sg) + offset;
> +			sge->length = ib_sg_dma_len(dev, sg) - offset;
> +			sge->lkey = qp->pd->local_dma_lkey;
> +
> +			total_len += sge->length;
> +			sge++;
> +			sge_left--;
> +			offset = 0;
> +		}

Shouldn't there be a break statement in the above loop that stops this 
loop once the end of the sg list has been reached? Fewer than nr_sge 
iterations will be needed to reach the end of the sg list if the length 
of the sg list is not a multiple of nr_sge.

Bart.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 8/9] target: enhance and export target_alloc_sgl/target_free_sgl
  2016-03-08 18:16   ` [PATCH v4 8/9] target: enhance and export target_alloc_sgl/target_free_sgl Christoph Hellwig
@ 2016-03-08 21:37     ` Bart Van Assche
  0 siblings, 0 replies; 46+ messages in thread
From: Bart Van Assche @ 2016-03-08 21:37 UTC (permalink / raw)
  To: Christoph Hellwig, dledford; +Cc: sagig, swise, linux-rdma, target-devel

On 03/08/2016 10:16 AM, Christoph Hellwig wrote:
> The SRP target driver will need to allocate and chain its own SGLs soon.
> For this export target_alloc_sgl, and add a new argument to it so that it
> can allocate an additional chain entry that doesn't point to a page.  Also
> export transport_free_sgl after renaming it to target_free_sgl to free
> these SGLs again.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 4/9] IB/core: refactor ib_create_qp
       [not found]     ` <56DF2917.309-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-03-09 12:55       ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-09 12:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On Tue, Mar 08, 2016 at 11:33:43AM -0800, Bart Van Assche wrote:
> My preference is to handle the "success" case first as in the original code:
> 
> -			if (!IS_ERR(qp))
> -				__ib_insert_xrcd_qp(qp_init_attr->xrcd, real_qp);
> -			else
> -				real_qp->device->destroy_qp(real_qp);
> 
> Even if this does not get addressed:
> 
> Reviewed-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

I don't really care either way, so I've changed this.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 5/9] IB/core: add a simple MR pool
  2016-03-08 19:37     ` Bart Van Assche
@ 2016-03-09 12:56       ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-09 12:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford, sagig, swise, linux-rdma, target-devel

On Tue, Mar 08, 2016 at 11:37:15AM -0800, Bart Van Assche wrote:
> Hmm ... why has 'mr' been initialized to NULL in this function? I think 
> that initialization can be left out. Anyway:

There is no need for it in the current version, and I'll remove it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API
  2016-03-08 20:13   ` Leon Romanovsky
@ 2016-03-09 12:59     ` Christoph Hellwig
  2016-03-10 22:08       ` Bart Van Assche
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-09 12:59 UTC (permalink / raw)
  To: Christoph Hellwig, dledford, bart.vanassche, sagig, swise,
	linux-rdma, target-devel

> You can jump to unwind label from one place only. In that stage
> "ioctx->rw_ctxs != NULL" will always be true.

There are two places actually, but it can't be NULL for both, so I
will remove that condition.
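
The tail of srpt_alloc_rw_ctxs then becomes something like this (sketch
only, keeping the &ioctx->s_rw_ctx check which is still needed):

unwind:
	while (--i >= 0) {
		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];

		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
				ctx->sg, ctx->nents, dir);
		target_free_sgl(ctx->sg, ctx->nents);
	}
	if (ioctx->rw_ctxs != &ioctx->s_rw_ctx)
		kfree(ioctx->rw_ctxs);
	return ret;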

I'll update the git tree with the current review feedback, but I'll
wait for you to finish the review before resending it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
  2016-03-08 21:32     ` Bart Van Assche
@ 2016-03-09 13:07       ` Christoph Hellwig
  2016-03-09 15:54         ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-09 13:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford, sagig, swise, linux-rdma, target-devel

On Tue, Mar 08, 2016 at 01:32:44PM -0800, Bart Van Assche wrote:
> The above pr_info() message does not provide enough context information to 
> allow a user to figure out why that message was reported. Please leave that 
> message out and make the callers of this function report MR allocation 
> failure wherever useful.

Ok, I'll remove it.

>> +		ret = ib_map_mr_sg(reg->mr, sg, nents, offset,
>> +				PAGE_SIZE);
> [ ... ]
>> +		sg = sg_next(sg);
>
> ib_map_mr_sg() processed 'ret' elements of scatterlist 'sg'. So why is 'sg' 
> only advanced by one element at the end of the loop?

I think the multiple MR case here is simply broken, as we never hit it
for the iSER or NVMe over Fabrics I/O sizes.  I think the safest option
is to simply not allow it.  I'll update the patch to disable the
loop, and add a helper for the driver to query the max I/O size
supported by the device.

>> +		for (j = 0; j < nr_sge; j++, sg = sg_next(sg)) {
>> +			rdma_wr->wr.num_sge++;
>> +
>> +			sge->addr = ib_sg_dma_address(dev, sg) + offset;
>> +			sge->length = ib_sg_dma_len(dev, sg) - offset;
>> +			sge->lkey = qp->pd->local_dma_lkey;
>> +
>> +			total_len += sge->length;
>> +			sge++;
>> +			sge_left--;
>> +			offset = 0;
>> +		}
>
> Shouldn't there be a break statement in the above loop that stops this loop 
> if the end of the sg list has been reached? Less than nr_sge iterations 
> will be needed to reach the end of the sg list if the length of the sg list 
> is not a multiple of nr_sge.

nr_sge is calculated as:

	u32 nr_sge = min(sge_left, max_sge);

so the loop will never run for more iterations than there are SG entries left.
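For example with sge_left = 10 and max_sge = 4 we get nr_ops = 3, and the
three inner loops consume 4 + 4 + 2 entries; sge_left reaches zero exactly
when the last WR has been filled, so we can't walk past the end of the list.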

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
  2016-03-08 19:55         ` Leon Romanovsky
@ 2016-03-09 13:58           ` 'Christoph Hellwig'
       [not found]             ` <20160309135804.GA32315-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 46+ messages in thread
From: 'Christoph Hellwig' @ 2016-03-09 13:58 UTC (permalink / raw)
  To: Steve Wise, 'Christoph Hellwig',
	dledford, bart.vanassche, sagig, linux-rdma, target-devel

On Tue, Mar 08, 2016 at 09:55:14PM +0200, Leon Romanovsky wrote:
> > At the expense of replicating this code and forcing all users to remember to set this.
> 
> There are no much such users to update, these users need to set
> qp_init_attr structure anyway.

But they need to extract the port_num first, while we already get
the cm_id that has the right port_num passed to this function.  Not
setting it in the qp_init_attr changes the interface from one that
just works to one that is arcane, and prone to generate hard to
detect errors (passing a 0 port_num will just work for all current
drivers, but if someone at some point actually introduces different
capabilities for different ports it will break for just that case!)
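
I.e. without this every ULP that wants to use the new RW API would have
to open-code something like the following before creating the QP (sketch
only, n_ctxs being whatever the ULP sizes its queue for):

	qp_init_attr.cap.max_rdma_ctxs = n_ctxs;
	qp_init_attr.port_num = id->port_num;	/* easy to forget */
	ret = rdma_create_qp(id, pd, &qp_init_attr);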

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
       [not found]             ` <20160309135804.GA32315-jcswGhMUV9g@public.gmane.org>
@ 2016-03-09 14:43               ` Leon Romanovsky
  2016-03-09 15:51               ` Steve Wise
  1 sibling, 0 replies; 46+ messages in thread
From: Leon Romanovsky @ 2016-03-09 14:43 UTC (permalink / raw)
  To: 'Christoph Hellwig'
  Cc: Steve Wise, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ,
	sagig-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 09, 2016 at 02:58:04PM +0100, 'Christoph Hellwig' wrote:
> On Tue, Mar 08, 2016 at 09:55:14PM +0200, Leon Romanovsky wrote:
> > > At the expense of replicating this code and forcing all users to remember to set this.
> > 
> > There are no much such users to update, these users need to set
> > qp_init_attr structure anyway.
> 
> But they need to extract the port_num first, while we already get
> the cm_id that has the right port_id passed to this function.  Not
> setting it in the qp_init_attr changes the interface from one that
> just works to one that is arcane, and prone to generate hard to
> detect errors (passing a 0 port_num will just work for all current
> drivers, but if someone at some point actually introduces different
> capabilities for differnet ports it will break for just that case!)

I understand your points and my claim is similar to yours, but from a
different side. I think that setting the qp_init_attr field separately
from the place where all the other variables are set is prone to errors.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
       [not found]             ` <20160309135804.GA32315-jcswGhMUV9g@public.gmane.org>
  2016-03-09 14:43               ` Leon Romanovsky
@ 2016-03-09 15:51               ` Steve Wise
  2016-03-09 15:52                 ` 'Christoph Hellwig'
  1 sibling, 1 reply; 46+ messages in thread
From: Steve Wise @ 2016-03-09 15:51 UTC (permalink / raw)
  To: 'Christoph Hellwig',
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ,
	sagig-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

> On Tue, Mar 08, 2016 at 09:55:14PM +0200, Leon Romanovsky wrote:
> > > At the expense of replicating this code and forcing all users to
> remember to set this.
> >
> > There are no much such users to update, these users need to set
> > qp_init_attr structure anyway.
> 
> But they need to extract the port_num first, while we already get
> the cm_id that has the right port_id passed to this function.  Not
> setting it in the qp_init_attr changes the interface from one that
> just works to one that is arcane, and prone to generate hard to
> detect errors (passing a 0 port_num will just work for all current
> drivers, but if someone at some point actually introduces different
> capabilities for differnet ports it will break for just that case!)

While I'm for keeping your change as-is, it turns out the port numbers
are 1-based, so 0 could be treated as an error. From
read_port_immutable():


        /**
         * device->port_immutable is indexed directly by the port number to make
         * access to this data as efficient as possible.
         *
         * Therefore port_immutable is declared as a 1 based array with
         * potential empty slots at the beginning.
         */
        device->port_immutable = kzalloc(sizeof(*device->port_immutable)
                                         * (end_port + 1),
                                         GFP_KERNEL);


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp
  2016-03-09 15:51               ` Steve Wise
@ 2016-03-09 15:52                 ` 'Christoph Hellwig'
  0 siblings, 0 replies; 46+ messages in thread
From: 'Christoph Hellwig' @ 2016-03-09 15:52 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Christoph Hellwig',
	dledford-H+wXaHxf7aLQT0dZR+AlfA,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ,
	sagig-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 09, 2016 at 09:51:25AM -0600, Steve Wise wrote:
> While I'm for keeping your change as-is, it turns out the port numbers
> are 1 relative, so 0 could be treated as an error. From
> read_port_immutable():

I'll add a check for 0, but will keep things as-is otherwise.
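
E.g. something like this in ib_create_qp() (sketch only, exact placement
still to be decided):

	if (qp_init_attr->cap.max_rdma_ctxs && !qp_init_attr->port_num)
		return ERR_PTR(-EINVAL);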

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
  2016-03-09 13:07       ` Christoph Hellwig
@ 2016-03-09 15:54         ` Christoph Hellwig
  2016-03-09 18:55           ` Bart Van Assche
       [not found]           ` <20160309155457.GA1898-jcswGhMUV9g@public.gmane.org>
  0 siblings, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-09 15:54 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford, sagig, swise, linux-rdma, target-devel

On Wed, Mar 09, 2016 at 02:07:23PM +0100, Christoph Hellwig wrote:
> I think the multiple MR case here is simply broken, as we never hit
> for the iSER or NVMe over Fabrics I/O sizes.  I think the safest
> is to simply not allow for it.  I'll update the patch to disable the
> loop, and add a helper for the driver to query the max I/O size
> supported by the device.

Turns out it was easily fixable by doing a loop of sg_next() calls
until we reach the number of segments.  Verified by hacking the
code to enable MRs for IB, and only accepting 4 pages per MR as an
arbitrarily low limit.
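
Roughly along these lines (test hack only, not meant for the series):

static inline bool rdma_rw_use_mr(struct ib_device *dev, u8 port_num)
{
	return true;	/* test hack: force MRs even for IB/RoCE */
}

static inline u32 rdma_rw_fr_page_list_len(struct ib_device *dev)
{
	return 4;	/* test hack: arbitrarily low per-MR page limit */
}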

I've pushed out the updated tree with this to git, but here is the
updated version of this patch for reference:

---
>From a8abce6bb11eb030e956a8591d5a5b5191a3ceeb Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 3 Mar 2016 14:17:13 +0100
Subject: IB/core: generic RDMA READ/WRITE API

This supports both manual mapping of lots of SGEs and using MRs from the
QP's MR pool, for iWarp or other cases where it's more optimal.
For now, MRs are only used for iWARP transports.  The user of the RDMA-RW
API must allocate the QP MR pool as well as size the SQ accordingly.

Thanks to Steve Wise for testing, fixing and rewriting the iWarp support,
and to Sagi Grimberg for ideas, reviews and fixes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/core/Makefile |   2 +-
 drivers/infiniband/core/rw.c     | 418 +++++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/verbs.c  |  25 +++
 include/rdma/ib_verbs.h          |  14 +-
 include/rdma/rw.h                |  69 +++++++
 5 files changed, 526 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/core/rw.c
 create mode 100644 include/rdma/rw.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 48bd9d8..26987d9 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_MAD) +=	ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 
-ib_core-y :=			packer.o ud_header.o verbs.o cq.o sysfs.o \
+ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
 				device.o fmr_pool.o cache.o netlink.o \
 				roce_gid_mgmt.o mr_pool.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
new file mode 100644
index 0000000..788903d
--- /dev/null
+++ b/drivers/infiniband/core/rw.c
@@ -0,0 +1,418 @@
+/*
+ * Copyright (c) 2016 HGST, a Western Digital Company.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/slab.h>
+#include <rdma/mr_pool.h>
+#include <rdma/rw.h>
+
+/*
+ * Check if the device needs a memory registration.  We currently always use
+ * memory registrations for iWarp, and never for IB and RoCE.  In the future
+ * we can hopefully fine tune this based on HCA driver input.
+ */
+static inline bool rdma_rw_use_mr(struct ib_device *dev, u8 port_num)
+{
+	return rdma_protocol_iwarp(dev, port_num);
+}
+
+static inline u32 rdma_rw_max_sge(struct ib_device *dev,
+		enum dma_data_direction dir)
+{
+	return dir == DMA_TO_DEVICE ?
+		dev->attrs.max_sge : dev->attrs.max_sge_rd;
+}
+
+static inline u32 rdma_rw_fr_page_list_len(struct ib_device *dev)
+{
+	/* arbitrary limit to avoid allocating gigantic resources */
+	return min_t(u32, dev->attrs.max_fast_reg_page_list_len, 256);
+}
+
+static int rdma_rw_init_mr_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u8 port_num, struct scatterlist *sg, u32 offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+{
+	int pages_per_mr = rdma_rw_fr_page_list_len(qp->pd->device);
+	int pages_left = ctx->dma_nents;
+	u32 va_offset = 0;
+	int i, j, ret = 0, count = 0;
+
+	ctx->nr_ops = (ctx->dma_nents + pages_per_mr - 1) / pages_per_mr;
+	ctx->reg = kcalloc(ctx->nr_ops, sizeof(*ctx->reg), GFP_KERNEL);
+	if (!ctx->reg) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < ctx->nr_ops; i++) {
+		struct rdma_rw_reg_ctx *prev = i ? &ctx->reg[i - 1] : NULL;
+		struct rdma_rw_reg_ctx *reg = &ctx->reg[i];
+		int nents = min(pages_left, pages_per_mr);
+
+		reg->mr = ib_mr_pool_get(qp, &qp->rdma_mrs);
+		if (!reg->mr) {
+			ret = -EAGAIN;
+			goto out_free;
+		}
+
+		if (reg->mr->need_inval) {
+			reg->inv_wr.opcode = IB_WR_LOCAL_INV;
+			reg->inv_wr.ex.invalidate_rkey = reg->mr->lkey;
+			reg->inv_wr.next = &reg->reg_wr.wr;
+			if (prev)
+				prev->wr.wr.next = &reg->inv_wr;
+
+			count++;
+		} else if (prev) {
+			prev->wr.wr.next = &reg->reg_wr.wr;
+		}
+
+		ib_update_fast_reg_key(reg->mr, ib_inc_rkey(reg->mr->lkey));
+
+		ret = ib_map_mr_sg(reg->mr, sg, nents, offset,
+				PAGE_SIZE);
+		if (ret < nents) {
+			ib_mr_pool_put(qp, &qp->rdma_mrs, reg->mr);
+			ret = -EINVAL;
+			goto out_free;
+		}
+
+		reg->reg_wr.wr.opcode = IB_WR_REG_MR;
+		reg->reg_wr.mr = reg->mr;
+		reg->reg_wr.key = reg->mr->lkey;
+		reg->reg_wr.wr.next = &reg->wr.wr;
+		count++;
+
+		reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+		if (rdma_protocol_iwarp(qp->device, port_num))
+			reg->reg_wr.access |= IB_ACCESS_REMOTE_WRITE;
+
+		reg->sge.lkey = reg->mr->lkey;
+		reg->sge.addr = reg->mr->iova;
+		reg->sge.length = reg->mr->length;
+
+		reg->wr.wr.sg_list = &reg->sge;
+		reg->wr.wr.num_sge = 1;
+		reg->wr.remote_addr = remote_addr + va_offset;
+		reg->wr.rkey = rkey;
+		count++;
+
+		if (dir == DMA_FROM_DEVICE) {
+			if (rdma_cap_read_inv(qp->device, port_num)) {
+				reg->wr.wr.opcode = IB_WR_RDMA_READ_WITH_INV;
+				reg->wr.wr.ex.invalidate_rkey = reg->mr->lkey;
+				reg->mr->need_inval = false;
+			}  else {
+				reg->wr.wr.opcode = IB_WR_RDMA_READ;
+				reg->mr->need_inval = true;
+			}
+		} else {
+			reg->wr.wr.opcode = IB_WR_RDMA_WRITE;
+			reg->mr->need_inval = true;
+		}
+
+		va_offset += reg->sge.length;
+		pages_left -= nents;
+		for (j = 0; j < nents; j++)
+			sg = sg_next(sg);
+		offset = 0;
+	}
+
+	return count;
+
+out_free:
+	while (--i >= 0)
+		ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg[i].mr);
+	kfree(ctx->reg);
+out:
+	return ret;
+}
+
+static int rdma_rw_init_map_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		struct scatterlist *sg, u32 offset, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	u32 max_sge = rdma_rw_max_sge(dev, dir);
+	u32 sge_left = ctx->dma_nents;
+	struct ib_sge *sge;
+	u32 total_len = 0, i, j;
+
+	ctx->nr_ops = DIV_ROUND_UP(ctx->dma_nents, max_sge);
+
+	ctx->map.sges = sge = kcalloc(ctx->dma_nents, sizeof(*sge), GFP_KERNEL);
+	if (!ctx->map.sges)
+		goto out;
+
+	ctx->map.wrs = kcalloc(ctx->nr_ops, sizeof(*ctx->map.wrs), GFP_KERNEL);
+	if (!ctx->map.wrs)
+		goto out_free_sges;
+
+	for (i = 0; i < ctx->nr_ops; i++) {
+		struct ib_rdma_wr *rdma_wr = &ctx->map.wrs[i];
+		u32 nr_sge = min(sge_left, max_sge);
+
+		if (dir == DMA_TO_DEVICE)
+			rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+		else
+			rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+		rdma_wr->remote_addr = remote_addr + total_len;
+		rdma_wr->rkey = rkey;
+		rdma_wr->wr.sg_list = sge;
+
+		for (j = 0; j < nr_sge; j++, sg = sg_next(sg)) {
+			rdma_wr->wr.num_sge++;
+
+			sge->addr = ib_sg_dma_address(dev, sg) + offset;
+			sge->length = ib_sg_dma_len(dev, sg) - offset;
+			sge->lkey = qp->pd->local_dma_lkey;
+
+			total_len += sge->length;
+			sge++;
+			sge_left--;
+			offset = 0;
+		}
+
+		if (i + 1 != ctx->nr_ops)
+			rdma_wr->wr.next = &ctx->map.wrs[i + 1].wr;
+	}
+
+	return ctx->nr_ops;
+
+out_free_sges:
+	kfree(ctx->map.sges);
+out:
+	return -ENOMEM;
+}
+
+static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		struct scatterlist *sg, u32 offset, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
+
+	ctx->nr_ops = 1;
+
+	ctx->single.sge.lkey = qp->pd->local_dma_lkey;
+	ctx->single.sge.addr = ib_sg_dma_address(dev, sg) + offset;
+	ctx->single.sge.length = ib_sg_dma_len(dev, sg) - offset;
+
+	memset(rdma_wr, 0, sizeof(*rdma_wr));
+	if (dir == DMA_TO_DEVICE)
+		rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+	else
+		rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+	rdma_wr->wr.sg_list = &ctx->single.sge;
+	rdma_wr->wr.num_sge = 1;
+	rdma_wr->remote_addr = remote_addr;
+	rdma_wr->rkey = rkey;
+
+	return 1;
+}
+
+/**
+ * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
+ * @ctx:	context to initialize
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @sg:		scatterlist to READ/WRITE from/to
+ * @sg_cnt:	number of entries in @sg
+ * @sg_offset:	current byte offset into @sg
+ * @remote_addr:remote address to read/write (relative to @rkey)
+ * @rkey:	remote key to operate on
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ *
+ * If the device needs memory registrations the scatterlist is split over
+ * multiple MRs taken from the QP's MR pool, transparently to the caller.
+ *
+ * Returns the number of WQEs that will be needed on the send queue if
+ * successful, or a negative error code.
+ */
+int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	int ret;
+
+	ctx->dma_nents = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+	if (!ctx->dma_nents)
+		return -ENOMEM;
+
+	/*
+	 * Skip to the S/G entry that sg_offset falls into:
+	 */
+	for (; sg; sg = sg_next(sg)) {
+		u32 len = ib_sg_dma_len(dev, sg);
+
+		if (sg_offset < len)
+			break;
+
+		sg_offset -= len;
+		ctx->dma_nents--;
+	}
+
+	if (rdma_rw_use_mr(qp->device, port_num)) {
+		ret = rdma_rw_init_mr_wrs(ctx, qp, port_num, sg, sg_offset,
+				remote_addr, rkey, dir);
+	} else if (ctx->dma_nents > 1) {
+		ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_offset,
+				remote_addr, rkey, dir);
+	} else {
+		ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset,
+				remote_addr, rkey, dir);
+	}
+
+	if (ret < 0)
+		goto out_unmap_sg;
+	return ret;
+
+out_unmap_sg:
+	ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_rw_ctx_init);
+
+/**
+ * rdma_rw_ctx_destroy - release all resources allocated by rdma_rw_ctx_init
+ * @ctx:	context to release
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @sg:		scatterlist that was used for the READ/WRITE
+ * @sg_cnt:	number of entries in @sg
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ */
+void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+{
+	if (rdma_rw_use_mr(qp->device, port_num)) {
+		int i;
+
+		for (i = 0; i < ctx->nr_ops; i++)
+			ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg[i].mr);
+		kfree(ctx->reg);
+	} else if (ctx->dma_nents > 1) {
+		kfree(ctx->map.wrs);
+		kfree(ctx->map.sges);
+	}
+
+	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+}
+EXPORT_SYMBOL(rdma_rw_ctx_destroy);
+
+/**
+ * rdma_rw_ctx_wrs - return chain of WRs for a RDMA READ or WRITE operation
+ * @ctx:	context to operate on
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @cqe:	completion queue entry for the last WR
+ * @chain_wr:	WR to append to the posted chain
+ *
+ * Return the WR chain for the set of RDMA READ/WRITE operations described by
+ * @ctx, as well as any memory registration operations needed.  If @chain_wr
+ * is non-NULL the WR it points to will be appended to the chain of WRs posted.
+ * If @chain_wr is not set @cqe must be set so that the caller gets a
+ * completion notification.
+ */
+struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u8 port_num, struct ib_cqe *cqe, struct ib_send_wr *chain_wr)
+{
+	struct ib_send_wr *first_wr, *last_wr;
+
+	if (rdma_rw_use_mr(qp->device, port_num)) {
+		if (ctx->reg[0].inv_wr.next)
+			first_wr = &ctx->reg[0].inv_wr;
+		else
+			first_wr = &ctx->reg[0].reg_wr.wr;
+		last_wr = &ctx->reg[ctx->nr_ops - 1].wr.wr;
+	} else if (ctx->dma_nents > 1) {
+		first_wr = &ctx->map.wrs[0].wr;
+		last_wr = &ctx->map.wrs[ctx->nr_ops - 1].wr;
+	} else {
+		first_wr = &ctx->single.wr.wr;
+		last_wr = &ctx->single.wr.wr;
+	}
+
+	if (chain_wr) {
+		last_wr->next = chain_wr;
+	} else {
+		last_wr->wr_cqe = cqe;
+		last_wr->send_flags |= IB_SEND_SIGNALED;
+	}
+
+	return first_wr;
+}
+EXPORT_SYMBOL(rdma_rw_ctx_wrs);
+
+/**
+ * rdma_rw_ctx_post - post a RDMA READ or RDMA WRITE operation
+ * @ctx:	context to operate on
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @cqe:	completion queue entry for the last WR
+ * @chain_wr:	WR to append to the posted chain
+ *
+ * Post the set of RDMA READ/WRITE operations described by @ctx, as well as
+ * any memory registration operations needed.  If @chain_wr is non-NULL the
+ * WR it points to will be appended to the chain of WRs posted.  If @chain_wr
+ * is not set @cqe must be set so that the caller gets a completion
+ * notification.
+ */
+int rdma_rw_ctx_post(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct ib_cqe *cqe, struct ib_send_wr *chain_wr)
+{
+	struct ib_send_wr *first_wr, *bad_wr;
+
+	first_wr = rdma_rw_ctx_wrs(ctx, qp, port_num, cqe, chain_wr);
+	return ib_post_send(qp, first_wr, &bad_wr);
+}
+EXPORT_SYMBOL(rdma_rw_ctx_post);
+
+void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr)
+{
+	/*
+	 * Each context needs at least one RDMA READ or WRITE WR.
+	 *
+	 * For some hardware we might need more, eventually we should ask the
+	 * HCA driver for a multiplier here.
+	 */
+	attr->cap.max_send_wr += attr->cap.max_rdma_ctxs;
+
+	/*
+	 * If the device needs MRs to perform RDMA READ or WRITE operations,
+	 * we'll need two additional send WRs per context for the MR
+	 * registration and invalidation.
+	 */
+	if (rdma_rw_use_mr(dev, attr->port_num))
+		attr->cap.max_send_wr += 2 * attr->cap.max_rdma_ctxs;
+}
+
+int rdma_rw_init_mrs(struct ib_qp *qp, struct ib_qp_init_attr *attr)
+{
+	struct ib_device *dev = qp->pd->device;
+	int ret = 0;
+
+	if (rdma_rw_use_mr(dev, attr->port_num)) {
+		ret = ib_mr_pool_init(qp, &qp->rdma_mrs,
+				attr->cap.max_rdma_ctxs, IB_MR_TYPE_MEM_REG,
+				rdma_rw_fr_page_list_len(dev));
+	}
+
+	return ret;
+}
+
+void rdma_rw_cleanup_mrs(struct ib_qp *qp)
+{
+	ib_mr_pool_destroy(qp, &qp->rdma_mrs);
+}
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 5955d14..af99f2b 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -48,6 +48,7 @@
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_cache.h>
 #include <rdma/ib_addr.h>
+#include <rdma/rw.h>
 
 #include "core_priv.h"
 
@@ -751,6 +752,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 {
 	struct ib_device *device = pd ? pd->device : qp_init_attr->xrcd->device;
 	struct ib_qp *qp;
+	int ret;
+
+	/*
+	 * If the caller is using the RDMA R/W API, calculate the resources
+	 * needed for the RDMA READ/WRITE operations.
+	 *
+	 * Note that these callers need to pass in a port number.
+	 */
+	if (qp_init_attr->cap.max_rdma_ctxs)
+		rdma_rw_init_qp(device, qp_init_attr);
 
 	qp = device->create_qp(pd, qp_init_attr, NULL);
 	if (IS_ERR(qp))
@@ -764,6 +775,7 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 	atomic_set(&qp->usecnt, 0);
 	qp->mrs_used = 0;
 	spin_lock_init(&qp->mr_lock);
+	INIT_LIST_HEAD(&qp->rdma_mrs);
 
 	if (qp_init_attr->qp_type == IB_QPT_XRC_TGT)
 		return ib_create_xrc_qp(qp, qp_init_attr);
@@ -787,6 +799,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 
 	atomic_inc(&pd->usecnt);
 	atomic_inc(&qp_init_attr->send_cq->usecnt);
+
+	if (qp_init_attr->cap.max_rdma_ctxs) {
+		ret = rdma_rw_init_mrs(qp, qp_init_attr);
+		if (ret) {
+			pr_err("failed to init MR pool ret= %d\n", ret);
+			ib_destroy_qp(qp);
+			qp = ERR_PTR(ret);
+		}
+	}
+
 	return qp;
 }
 EXPORT_SYMBOL(ib_create_qp);
@@ -1271,6 +1293,9 @@ int ib_destroy_qp(struct ib_qp *qp)
 	rcq  = qp->recv_cq;
 	srq  = qp->srq;
 
+	if (!qp->uobject)
+		rdma_rw_cleanup_mrs(qp);
+
 	ret = qp->device->destroy_qp(qp);
 	if (!ret) {
 		if (pd)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 068d540..074854d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -915,6 +915,13 @@ struct ib_qp_cap {
 	u32	max_send_sge;
 	u32	max_recv_sge;
 	u32	max_inline_data;
+
+	/*
+	 * Maximum number of rdma_rw_ctx structures in flight at a time.
+	 * ib_create_qp() will calculate the right amount of needed WRs
+	 * and MRs based on this.
+	 */
+	u32	max_rdma_ctxs;
 };
 
 enum ib_sig_type {
@@ -986,7 +993,11 @@ struct ib_qp_init_attr {
 	enum ib_sig_type	sq_sig_type;
 	enum ib_qp_type		qp_type;
 	enum ib_qp_create_flags	create_flags;
-	u8			port_num; /* special QP types only */
+
+	/*
+	 * Only needed for special QP types, or when using the RW API.
+	 */
+	u8			port_num;
 };
 
 struct ib_qp_open_attr {
@@ -1407,6 +1418,7 @@ struct ib_qp {
 	struct ib_cq	       *recv_cq;
 	spinlock_t		mr_lock;
 	int			mrs_used;
+	struct list_head	rdma_mrs;
 	struct ib_srq	       *srq;
 	struct ib_xrcd	       *xrcd; /* XRC TGT QPs only */
 	struct list_head	xrcd_list;
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
new file mode 100644
index 0000000..57ea304
--- /dev/null
+++ b/include/rdma/rw.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2016 HGST, a Western Digital Company.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#ifndef _RDMA_RW_H
+#define _RDMA_RW_H
+
+#include <linux/dma-mapping.h>
+#include <linux/scatterlist.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/mr_pool.h>
+
+struct rdma_rw_ctx {
+	/* number of SGL entries returned by dma_map_sg */
+	u32			dma_nents;
+
+	/* number of RDMA READ/WRITE WRs (not counting MR WRs) */
+	u32			nr_ops;
+
+	union {
+		/* for mapping a single SGE: */
+		struct {
+			struct ib_sge		sge;
+			struct ib_rdma_wr	wr;
+		} single;
+
+		/* for mapping of multiple SGEs: */
+		struct {
+			struct ib_sge		*sges;
+			struct ib_rdma_wr	*wrs;
+		} map;
+
+		/* for mapping with memory registrations: */
+		struct rdma_rw_reg_ctx {
+			struct ib_sge		sge;
+			struct ib_rdma_wr	wr;
+			struct ib_reg_wr	reg_wr;
+			struct ib_send_wr	inv_wr;
+			struct ib_mr		*mr;
+		} *reg;
+	};
+};
+
+int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir);
+void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct scatterlist *sg, u32 sg_cnt,
+		enum dma_data_direction dir);
+
+struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u8 port_num, struct ib_cqe *cqe, struct ib_send_wr *chain_wr);
+int rdma_rw_ctx_post(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
+		struct ib_cqe *cqe, struct ib_send_wr *chain_wr);
+
+void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr);
+int rdma_rw_init_mrs(struct ib_qp *qp, struct ib_qp_init_attr *attr);
+void rdma_rw_cleanup_mrs(struct ib_qp *qp);
+
+#endif /* _RDMA_RW_H */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
  2016-03-09 15:54         ` Christoph Hellwig
@ 2016-03-09 18:55           ` Bart Van Assche
  2016-03-10  7:57             ` Christoph Hellwig
       [not found]           ` <20160309155457.GA1898-jcswGhMUV9g@public.gmane.org>
  1 sibling, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-09 18:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dledford, sagig, swise, linux-rdma, target-devel

On 03/09/2016 07:55 AM, Christoph Hellwig wrote:
> On Wed, Mar 09, 2016 at 02:07:23PM +0100, Christoph Hellwig wrote:
>> I think the multiple MR case here is simply broken, as we never hit
>> for the iSER or NVMe over Fabrics I/O sizes.  I think the safest
>> is to simply not allow for it.  I'll update the patch to disable the
>> loop, and add a helper for the driver to query the max I/O size
>> supported by the device.
>
> Turns out if was easily fixable by doing a loop of sg_next() calls
> until we reach the number of segments.  Verified by hacking the
> code to enable MRs for IB, and only accepting 4 pages per MR as arbitrarily
> low limit.

Hello Christoph,

Regarding the iSERt patches that are present in your rdma-rw-api branch: 
what is the impact of these patches on memory registration for the iSERt 
driver in combination with an IB HCA? Is memory registration still 
enabled for this combination? If not: how about changing this patch 
series such that memory registration is performed either if the 
transport protocol is iWARP or the ULP driver requests memory 
registration explicitly, e.g. by setting a flag in struct rdma_rw_ctx 
and/or struct ib_qp_init_attr?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
  2016-03-09 18:55           ` Bart Van Assche
@ 2016-03-10  7:57             ` Christoph Hellwig
  2016-03-10 15:30               ` Bart Van Assche
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-10  7:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford, sagig, swise, linux-rdma, target-devel

On Wed, Mar 09, 2016 at 10:55:50AM -0800, Bart Van Assche wrote:
> Regarding the iSERt patches that are present in your rdma-rw-api branch: 

I haven't submitted those yet, as the signature MR support hasn't been
tested and is most likely broken.  (and even if it's not broken it's too
ugly to live :))

> what is the impact of these patches on memory registration for the iSERt 
> driver in combination with an IB HCA? Is memory registration still enabled 
> for this combination? If not: how about changing this patch series such 
> that memory registration is performed either if the transport protocol is 
> iWARP or the ULP driver requests memory registration explicitly, e.g. by 
> setting a flag in struct rdma_rw_ctx and/or struct ib_qp_init_attr?

I don't want the ULP to ask for memory registrations, and it's not really
the ULP's business.  It should be done either if needed (iWarp, signature
MRs), or if the HCA driver prefers it for large enough transfers.

Once the basic API is in I'd like to add an interface for the HCA driver,
as Mellanox has indicated they'd like to register for larger transfers.

Any chance to get a review of the remaining patches so we can get it
in for this merge window?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
  2016-03-10  7:57             ` Christoph Hellwig
@ 2016-03-10 15:30               ` Bart Van Assche
       [not found]                 ` <56E1931C.1040206-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-10 15:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dledford, sagig, swise, linux-rdma, target-devel

On 03/09/16 23:57, Christoph Hellwig wrote:
> On Wed, Mar 09, 2016 at 10:55:50AM -0800, Bart Van Assche wrote:
>> Regarding the iSERt patches that are present in your rdma-rw-api branch:
>
> I haven't submitted those yet, as the signature MR support hasn't been
> tested and is most likely broken.  (and even if it's not broken it's too
> ugly to live :))
>
>> what is the impact of these patches on memory registration for the iSERt
>> driver in combination with an IB HCA? Is memory registration still enabled
>> for this combination? If not: how about changing this patch series such
>> that memory registration is performed either if the transport protocol is
>> iWARP or the ULP driver requests memory registration explicitly, e.g. by
>> setting a flag in struct rdma_rw_ctx and/or struct ib_qp_init_attr?
>
> I don't want the ULP to ask for memory registrations, and it's not really
> the ULPs business.  It should be done either if needed (iWarp, signature
> MRs), or if the HCA driver prefers it for a larger enough transfers.
>
> Once the basic API is in I'd like to add an interface for the HCA driver,
> as Mellanox has indicated they'd like to register for larger transfers.
>
> Any chance to get a review of the remaining patches so we can get it
> in for this merge window?

Hello Christoph,

I will continue with reviewing these patches. But I disagree that ULPs 
should not have control over whether or not memory registration is 
enabled. My experience is that it is essential to always enable memory 
registration while debugging an RDMA driver. Without memory registration 
it can be extremely hard to determine whether memory corruption is 
caused by code that is running locally or by a misplaced RDMA write from 
another host. With memory registration enabled misplaced writes result 
in an access violation error. Additionally, certain users prefer to 
enable memory registration for all RDMA transfers because of the 
additional safety it provides.

Bart.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found]                 ` <56E1931C.1040206-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-03-10 16:21                   ` Christoph Hellwig
       [not found]                     ` <20160310162150.GA9544-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-10 16:21 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

Hi Bart,

I see the use case for enabling MRs for debugging, but I still don't
think the ULP is the right place.  I also don't see where you
can do this currently: for srpt, the only driver converted in this
series, there is no support for registrations at all currently,
and with the series you can force it to register with a one
line code change in rdma_rw_use_mr.  For iSERt, which I plan to
convert for 4.7 (and which has some WIP code in git), MR support
exists, but is only enabled if the device supports the vendor
specific signature handover extension (aka if running on mlx5 hardware).
With the future conversion you'll still have a simple compile time
switch to always enable it, and I'd be fine with adding a module
parameter to force it as well.
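
E.g. something like this in rw.c (sketch only, the parameter name is made
up):

static bool rdma_rw_force_mr;
module_param_named(force_mr, rdma_rw_force_mr, bool, 0);
MODULE_PARM_DESC(force_mr, "Force usage of MRs for RDMA READ/WRITE operations");

static inline bool rdma_rw_use_mr(struct ib_device *dev, u8 port_num)
{
	return rdma_rw_force_mr || rdma_protocol_iwarp(dev, port_num);
}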

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found]                     ` <20160310162150.GA9544-jcswGhMUV9g@public.gmane.org>
@ 2016-03-10 17:00                       ` Bart Van Assche
       [not found]                         ` <56E1A810.1080001-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2016-03-11 22:39                       ` Bart Van Assche
  1 sibling, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-10 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On 03/10/2016 08:21 AM, Christoph Hellwig wrote:
> I see the use case for enabling MRs for debugging, but I still don't
> think the ULP is the right place.  I also don't see where you
> cna do this currently: for srpt, the only driver converted in this
> series there is no support for registrations at all currently,
> and with the the series you can force it to register with a one
> line code change in rdma_rw_use_mr.  For iSERt, which I plan to
> convert for 4.7 (and which has some WIP code in git) MR support
> exists, but is only enabled if the device supports the vendor
> specific signature handover extension (aka if running on mlx5 hardware).
> With the future conversion you'll still have a simply compile time
> switch to always enabled it, and I'd be fine with adding a module
> parameter to force it as well.

Hello Christoph,

Had you already noticed the ib_srp register_always kernel module parameter?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found]                         ` <56E1A810.1080001-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-03-10 17:01                           ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-10 17:01 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On Thu, Mar 10, 2016 at 09:00:00AM -0800, Bart Van Assche wrote:
> Hello Christoph,
>
> Had you already noticed the ib_srp register_always kernel module parameter?

Yes.  But this whole patch series is about the target drivers; there
are no changes to the initiator side at all.  The various register_always
and similar parameters are untouched by it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API
  2016-03-09 12:59     ` Christoph Hellwig
@ 2016-03-10 22:08       ` Bart Van Assche
  2016-03-11  9:09         ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-10 22:08 UTC (permalink / raw)
  To: Christoph Hellwig, dledford, sagig, swise, linux-rdma, target-devel

On 03/09/2016 04:59 AM, Christoph Hellwig wrote:
>> You can jump to unwind label from one place only. In that stage
>> "ioctx->rw_ctxs != NULL" will always be true.
>
> There's two places actually, but it can't be NULL for both, so I
> will removed that condition.
>
> I'll update the git tree with the current review feedback, but I'll
> wait for you to finish the review before resending it.

Since it is safe to pass NULL to kfree(), I think a similar comment 
applies to srpt_free_rw_ctxs(), and that the "ioctx->rw_ctxs && " check 
can be left out of that function as well.

Bart.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API
       [not found]   ` <1457461000-24088-10-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
@ 2016-03-10 22:40     ` Bart Van Assche
       [not found]       ` <56E1F7C8.9000101-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-10 22:40 UTC (permalink / raw)
  To: Christoph Hellwig, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On 03/08/2016 10:16 AM, Christoph Hellwig wrote:
> -/**
>    * srpt_get_send_ioctx() - Obtain an I/O context for sending to the initiator.
>    */
>   static struct srpt_send_ioctx *srpt_get_send_ioctx(struct srpt_rdma_ch *ch)
> @@ -1284,12 +1096,8 @@ static struct srpt_send_ioctx *srpt_get_send_ioctx(struct srpt_rdma_ch *ch)
>   	BUG_ON(ioctx->ch != ch);
>   	spin_lock_init(&ioctx->spinlock);
>   	ioctx->state = SRPT_STATE_NEW;
> -	ioctx->n_rbuf = 0;
> -	ioctx->rbufs = NULL;
> +	ioctx->n_rw_ctx = 0;
>   	ioctx->n_rdma = 0;
> -	ioctx->n_rdma_wrs = 0;
> -	ioctx->rdma_wrs = NULL;
> -	ioctx->mapped_sg_count = 0;
>   	init_completion(&ioctx->tx_done);
>   	ioctx->queue_status_only = false;
>   	/*
> @@ -1359,7 +1167,6 @@ static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
>   		 * SRP_RSP sending failed or the SRP_RSP send completion has
>   		 * not been received in time.
>   		 */
> -		srpt_unmap_sg_to_ib_sge(ioctx->ch, ioctx);
>   		transport_generic_free_cmd(&ioctx->cmd, 0);
>   		break;
>   	case SRPT_STATE_MGMT_RSP_SENT:
> @@ -1387,6 +1194,7 @@ static void srpt_rdma_read_done(struct ib_cq *cq, struct ib_wc *wc)
>
>   	WARN_ON(ioctx->n_rdma <= 0);
>   	atomic_add(ioctx->n_rdma, &ch->sq_wr_avail);
> +	ioctx->n_rdma = 0;
>
>   	if (unlikely(wc->status != IB_WC_SUCCESS)) {
>   		pr_info("RDMA_READ for ioctx 0x%p failed with status %d\n",

ioctx->n_rdma is reset by srpt_get_send_ioctx(). Do we really need to 
reset that member variable in srpt_rdma_read_done()?

> @@ -1832,8 +1626,11 @@ retry:
>   	qp_init->srq = sdev->srq;
>   	qp_init->sq_sig_type = IB_SIGNAL_REQ_WR;
>   	qp_init->qp_type = IB_QPT_RC;
> -	qp_init->cap.max_send_wr = srp_sq_size;
> -	qp_init->cap.max_send_sge = SRPT_DEF_SG_PER_WQE;
> +	qp_init->cap.max_send_wr = srp_sq_size / 2;
> +	qp_init->cap.max_rdma_ctxs = srp_sq_size / 2;
> +	qp_init->cap.max_send_sge = max(sdev->device->attrs.max_sge_rd,
> +					sdev->device->attrs.max_sge);
> +	qp_init->port_num = ch->sport->port;
>
>   	ch->qp = ib_create_qp(sdev->pd, qp_init);
>   	if (IS_ERR(ch->qp)) {

The QP max_send_wr parameter has been changed but not the ib_alloc_cq() 
argument. Is that perhaps an oversight?

Bart.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found]           ` <20160309155457.GA1898-jcswGhMUV9g@public.gmane.org>
@ 2016-03-11  1:26             ` Bart Van Assche
  2016-03-11  9:22               ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-11  1:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On 03/09/2016 07:54 AM, Christoph Hellwig wrote:
> +		if (i + 1 != ctx->nr_ops)
> +			rdma_wr->wr.next = &ctx->map.wrs[i + 1].wr;

Not that this really matters, but my own preference is to write such a 
comparison as i + 1 < ctx->nr_ops.

> @@ -787,6 +799,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
>
>   	atomic_inc(&pd->usecnt);
>   	atomic_inc(&qp_init_attr->send_cq->usecnt);
> +
> +	if (qp_init_attr->cap.max_rdma_ctxs) {
> +		ret = rdma_rw_init_mrs(qp, qp_init_attr);
> +		if (ret) {
> +			pr_err("failed to init MR pool ret= %d\n", ret);
> +			ib_destroy_qp(qp);
> +			qp = ERR_PTR(ret);
> +		}
> +	}

ib_destroy_qp() can fail and if it fails it does not decrement 
pd->usecnt. Does this have to be handled or reported in some way?

Otherwise this patch looks fine to me.

Bart.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API
  2016-03-10 22:08       ` Bart Van Assche
@ 2016-03-11  9:09         ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-11  9:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford, sagig, swise, linux-rdma, target-devel

On Thu, Mar 10, 2016 at 02:08:39PM -0800, Bart Van Assche wrote:
> Since it is safe to pass NULL to kfree() I think a similar comment applies 
> to srpt_free_rw_ctxs() and that also from that function "ioctx->rw_ctxs && 
> " can be left out.

Ok, I'll update it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API
       [not found]       ` <56E1F7C8.9000101-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-03-11  9:14         ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-11  9:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

>>   	WARN_ON(ioctx->n_rdma <= 0);
>>   	atomic_add(ioctx->n_rdma, &ch->sq_wr_avail);
>> +	ioctx->n_rdma = 0;
>>
>>   	if (unlikely(wc->status != IB_WC_SUCCESS)) {
>>   		pr_info("RDMA_READ for ioctx 0x%p failed with status %d\n",
>
> ioctx->n_rdma is reset by srpt_get_send_ioctx(). Do we really need to reset 
> that member variable in srpt_rdma_read_done()?

Yes, otherwise the atomic_add in srpt_send_done will do the wrong thing.
Alternatively we could return the WQE reservation in srpt_send_done always,
but given that we're done with the RDMA READ commands it seems useful
to not hang onto them longer than necessary.

>> @@ -1832,8 +1626,11 @@ retry:
>>   	qp_init->srq = sdev->srq;
>>   	qp_init->sq_sig_type = IB_SIGNAL_REQ_WR;
>>   	qp_init->qp_type = IB_QPT_RC;
>> -	qp_init->cap.max_send_wr = srp_sq_size;
>> -	qp_init->cap.max_send_sge = SRPT_DEF_SG_PER_WQE;
>> +	qp_init->cap.max_send_wr = srp_sq_size / 2;
>> +	qp_init->cap.max_rdma_ctxs = srp_sq_size / 2;
>> +	qp_init->cap.max_send_sge = max(sdev->device->attrs.max_sge_rd,
>> +					sdev->device->attrs.max_sge);
>> +	qp_init->port_num = ch->sport->port;
>>
>>   	ch->qp = ib_create_qp(sdev->pd, qp_init);
>>   	if (IS_ERR(ch->qp)) {
>
> The QP max_send_wr parameter has been changed but not the ib_alloc_cq() 
> argument. Is that perhaps an oversight?

The rdma_ctxs add at least one to the send_wr count.  In case they
are all RDMA READs we will actually need one CQE per rdma_ctx, although
we'll usually need less.  So I think the actual count here is correct,
but I'll add a comment explaining things.
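
Something like this next to the qp_init setup, as a sketch of what I have
in mind:

	/*
	 * We divide the send queue in half: one half for SEND WRs that
	 * carry the responses, and one half for R/W contexts that do the
	 * actual RDMA READ/WRITE transfers.  Each R/W context consumes at
	 * least one send WR, and in the worst case (all RDMA READs) also
	 * one CQE per context, so the CQ still has to be sized for the
	 * full srp_sq_size.
	 */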

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
  2016-03-11  1:26             ` Bart Van Assche
@ 2016-03-11  9:22               ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-11  9:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford, sagig, swise, linux-rdma, target-devel

On Thu, Mar 10, 2016 at 05:26:19PM -0800, Bart Van Assche wrote:
> On 03/09/2016 07:54 AM, Christoph Hellwig wrote:
>> +		if (i + 1 != ctx->nr_ops)
>> +			rdma_wr->wr.next = &ctx->map.wrs[i + 1].wr;
>
> Not that this really matters, but my own preference is to write such a 
> comparison as i + 1 < ctx->nr_ops.

Sure, not problem.

>> @@ -787,6 +799,16 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
>>
>>   	atomic_inc(&pd->usecnt);
>>   	atomic_inc(&qp_init_attr->send_cq->usecnt);
>> +
>> +	if (qp_init_attr->cap.max_rdma_ctxs) {
>> +		ret = rdma_rw_init_mrs(qp, qp_init_attr);
>> +		if (ret) {
>> +			pr_err("failed to init MR pool ret= %d\n", ret);
>> +			ib_destroy_qp(qp);
>> +			qp = ERR_PTR(ret);
>> +		}
>> +	}
>
> ib_destroy_qp() can fail and if it fails it does not decrement pd->usecnt. 
> Does this have to be handled or reported in some way?

There is no way to handle this properly, and we really should remove
the return value from the destroy_qp API sooner rather than later [1].

[1] and from the various other verbs APIs that return errors from their
    destroy calls; whoever created these was on crack apparently.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found]                     ` <20160310162150.GA9544-jcswGhMUV9g@public.gmane.org>
  2016-03-10 17:00                       ` Bart Van Assche
@ 2016-03-11 22:39                       ` Bart Van Assche
       [not found]                         ` <56E34914.2020108-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  1 sibling, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-11 22:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On 03/10/2016 08:21 AM, Christoph Hellwig wrote:
> I see the use case for enabling MRs for debugging, but I still don't
> think the ULP is the right place.  I also don't see where you
> cna do this currently: for srpt, the only driver converted in this
> series there is no support for registrations at all currently,
> and with the the series you can force it to register with a one
> line code change in rdma_rw_use_mr.  For iSERt, which I plan to
> convert for 4.7 (and which has some WIP code in git) MR support
> exists, but is only enabled if the device supports the vendor
> specific signature handover extension (aka if running on mlx5 hardware).
> With the future conversion you'll still have a simply compile time
> switch to always enabled it, and I'd be fine with adding a module
> parameter to force it as well.

Hello Christoph,

The above is fine with me. But when I ran a test with rdma_rw_use_mr() 
changed into "return true" the following error messages appeared in the 
kernel log:

[  364.460709] ib_srpt 0x1: parsing SRP descriptor table failed.
[  383.604809] ib_srpt 0x0: parsing SRP descriptor table failed.
[  383.605627] ib_srpt 0x2: parsing SRP descriptor table failed.
[  386.702905] ib_srpt 0x3: parsing SRP descriptor table failed.
[  386.703092] ib_srpt 0x4: parsing SRP descriptor table failed.
[  386.703242] ib_srpt 0x5: parsing SRP descriptor table failed.
[  386.703411] ib_srpt 0x6: parsing SRP descriptor table failed.

Is this expected? I ran this test on a server equipped with two mlx4 
HCAs with the latest firmware (2.36.5000). I installed git commit 
c4c65482b56a433a82bc5b63db8ba125727e9f80 of the rdma-rw-api branch merged with 
v4.5-rc7. Initiator and target drivers were running on the same server 
and were communicating with each other via loopback. Before I modified 
rdma_rw_use_mr() the same test passed on the same setup.

Do you perhaps want me to look further into this?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found]                         ` <56E34914.2020108-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-03-12  6:12                           ` Christoph Hellwig
       [not found]                             ` <20160312061232.GA15236-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-12  6:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On Fri, Mar 11, 2016 at 02:39:16PM -0800, Bart Van Assche wrote:
> The above is fine with me. But when I ran a test with rdma_rw_use_mr() 
> changed into "return true" the following error messages appeared in the 
> kernel log:
>
> [  364.460709] ib_srpt 0x1: parsing SRP descriptor table failed.
> [  383.604809] ib_srpt 0x0: parsing SRP descriptor table failed.
> [  383.605627] ib_srpt 0x2: parsing SRP descriptor table failed.
> [  386.702905] ib_srpt 0x3: parsing SRP descriptor table failed.
> [  386.703092] ib_srpt 0x4: parsing SRP descriptor table failed.
> [  386.703242] ib_srpt 0x5: parsing SRP descriptor table failed.
> [  386.703411] ib_srpt 0x6: parsing SRP descriptor table failed.
>
> Is this expected? I ran this test on a server equipped with two mlx4 HCAs 
> with latest firmware (2.36.5000). I installed git commit 
> c4c65482b56a433a82bc5b63db8ba125727e9f80 of the rdma-rw-api merged with 
> v4.5-rc7. Initiator and target drivers were running on the same server and 
> were communicating with each other via loopback. Before I modified 
> rdma_rw_use_mr() the same test passed on the same setup.

I think this might be the case when SRP gets multiple SGL entries.
In this case the number of MRs allocated is limited and srpt should
handle rdma_rw_ctx_init failures due to the lack of MRs.  If you add
back the ib_mr_pool_get failure printk that you asked me to remove,
I bet it's going to trigger.
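In other words, the ULP is expected to treat a rdma_rw_ctx_init() failure caused
by MR pool exhaustion as back-pressure rather than a fatal error.  A rough sketch
of that contract from the caller's side (the argument list is abbreviated from my
reading of the series, and the srpt field names are placeholders):

	ret = rdma_rw_ctx_init(&ioctx->rw_ctx, ch->qp, ch->sport->port,
			       sg, sg_cnt, 0, remote_addr, rkey, dir);
	if (ret < 0) {
		if (ret == -EAGAIN)	/* MR pool empty: defer the command, retry later */
			return ret;
		goto out_err;		/* genuine failure: reject the SRP command */
	}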

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found]                             ` <20160312061232.GA15236-jcswGhMUV9g@public.gmane.org>
@ 2016-03-14  1:04                               ` Bart Van Assche
       [not found]                                 ` <56E60E0F.7060900-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 46+ messages in thread
From: Bart Van Assche @ 2016-03-14  1:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

On 03/11/16 22:12, Christoph Hellwig wrote:
> On Fri, Mar 11, 2016 at 02:39:16PM -0800, Bart Van Assche wrote:
>> The above is fine with me. But when I ran a test with rdma_rw_use_mr()
>> changed into "return true" the following error messages appeared in the
>> kernel log:
>>
>> [  364.460709] ib_srpt 0x1: parsing SRP descriptor table failed.
>> [  383.604809] ib_srpt 0x0: parsing SRP descriptor table failed.
>> [  383.605627] ib_srpt 0x2: parsing SRP descriptor table failed.
>> [  386.702905] ib_srpt 0x3: parsing SRP descriptor table failed.
>> [  386.703092] ib_srpt 0x4: parsing SRP descriptor table failed.
>> [  386.703242] ib_srpt 0x5: parsing SRP descriptor table failed.
>> [  386.703411] ib_srpt 0x6: parsing SRP descriptor table failed.
>>
>> Is this expected? I ran this test on a server equipped with two mlx4 HCAs
>> with latest firmware (2.36.5000). I installed git commit
>> c4c65482b56a433a82bc5b63db8ba125727e9f80 of the rdma-rw-api merged with
>> v4.5-rc7. Initiator and target drivers were running on the same server and
>> were communicating with each other via loopback. Before I modified
>> rdma_rw_use_mr() the same test passed on the same setup.
> 
> I think this might be the case when SRP gets multiple SGL entries.
> In this case the number of MRs allocated is limited and srpt should
> handle rdma_rw_ctx_init failures due to the lack of MRs.  If you add
> back the ib_mr_pool_get failure printk that you asked me to remove,
> I bet it's going to trigger.

Hello Christoph,

After having applied the following patch:

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c6e8483..940dee8 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -64,6 +64,8 @@ static int rdma_rw_init_mr_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 
 		reg->mr = ib_mr_pool_get(qp, &qp->rdma_mrs);
 		if (!reg->mr) {
+			pr_debug("failed to allocate MR %d/%d from pool (in use: %d)\n",
+				 i, ctx->nr_ops, qp->mrs_used);
 			ret = -EAGAIN;
 			goto out_free;
 		}

and after having run:

echo 'module ib_core +pmf' > /sys/kernel/debug/dynamic_debug/control

the following output appeared:

[ 1104.391493] ib_core:rdma_rw_init_mr_wrs: failed to allocate MR 0/1 from pool (in use: 2048)
[ 1104.391621] ib_srpt 0x0: parsing SRP descriptor table failed.
[ 1104.391762] ib_core:rdma_rw_init_mr_wrs: failed to allocate MR 0/1 from pool (in use: 2048)
[ 1104.391864] ib_srpt 0x1: parsing SRP descriptor table failed.
[ 1104.391987] ib_core:rdma_rw_init_mr_wrs: failed to allocate MR 0/1 from pool (in use: 2048)
[ 1104.392085] ib_srpt 0x2: parsing SRP descriptor table failed.
[ 1104.392208] ib_core:rdma_rw_init_mr_wrs: failed to allocate MR 0/1 from pool (in use: 2048)
[ 1104.392306] ib_srpt 0x3: parsing SRP descriptor table failed.
[ 1104.392427] ib_core:rdma_rw_init_mr_wrs: failed to allocate MR 0/1 from pool (in use: 2048)
[ 1104.392525] ib_srpt 0x4: parsing SRP descriptor table failed.
[ 1104.392647] ib_core:rdma_rw_init_mr_wrs: failed to allocate MR 0/1 from pool (in use: 2048)
[ 1104.392745] ib_srpt 0x5: parsing SRP descriptor table failed.
[ 1104.392867] ib_core:rdma_rw_init_mr_wrs: failed to allocate MR 0/1 from pool (in use: 2048)
[ 1104.392965] ib_srpt 0x6: parsing SRP descriptor table failed.
[ 1104.393089] ib_core:rdma_rw_init_mr_wrs: failed to allocate MR 0/1 from pool (in use: 2048)
[ 1104.393189] ib_srpt 0x7: parsing SRP descriptor table failed.

Bart.

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API
       [not found]                                 ` <56E60E0F.7060900-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-03-15  8:45                                   ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-03-15  8:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sagig-VPRAkNaXOzVWk0Htik3J/w,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	target-devel-u79uwXL29TY76Z2rM5mHXA

Yes, that's exactly what I expected. So if rdma_rw_init_mr_wrs
fails with -EAGAIN, we'll need to propagate this all the way up
to srpt_handle_new_iu and then add the command to the wait list.

I'll prepare a patch for that.
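A minimal sketch of that plan, with placeholder names (the helper that ends up
calling rdma_rw_ctx_init() and the wait-list fields are assumptions, not the
eventual patch):

	ret = srpt_map_cmd_data(ch, ioctx);	/* hypothetical helper around rdma_rw_ctx_init() */
	if (ret == -EAGAIN) {
		/* MR pool exhausted: park the command until an earlier I/O frees its MRs */
		list_add_tail(&ioctx->wait_list, &ch->cmd_wait_list);
		return;
	}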

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2016-03-15  8:45 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-08 18:16 generic RDMA READ/WRITE API V4 Christoph Hellwig
2016-03-08 18:16 ` [PATCH v4 1/9] IB/cma: pass the port number to ib_create_qp Christoph Hellwig
2016-03-08 19:25   ` Bart Van Assche
2016-03-08 19:27   ` Leon Romanovsky
     [not found]     ` <20160308192709.GJ13396-2ukJVAZIZ/Y@public.gmane.org>
2016-03-08 19:38       ` Steve Wise
2016-03-08 19:55         ` Leon Romanovsky
2016-03-09 13:58           ` 'Christoph Hellwig'
     [not found]             ` <20160309135804.GA32315-jcswGhMUV9g@public.gmane.org>
2016-03-09 14:43               ` Leon Romanovsky
2016-03-09 15:51               ` Steve Wise
2016-03-09 15:52                 ` 'Christoph Hellwig'
2016-03-08 18:16 ` [PATCH v4 2/9] IB/core: allow passing mapping an offset into the SG in ib_map_mr_sg Christoph Hellwig
2016-03-08 19:27   ` Bart Van Assche
2016-03-08 18:16 ` [PATCH v4 3/9] IB/core: add a helper to check for READ WITH INVALIDATE support Christoph Hellwig
2016-03-08 18:16 ` [PATCH v4 4/9] IB/core: refactor ib_create_qp Christoph Hellwig
2016-03-08 19:33   ` Bart Van Assche
     [not found]     ` <56DF2917.309-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-03-09 12:55       ` Christoph Hellwig
2016-03-08 18:16 ` [PATCH v4 5/9] IB/core: add a simple MR pool Christoph Hellwig
     [not found]   ` <1457461000-24088-6-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
2016-03-08 19:37     ` Bart Van Assche
2016-03-09 12:56       ` Christoph Hellwig
     [not found] ` <1457461000-24088-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
2016-03-08 18:16   ` [PATCH v4 6/9] IB/core: add a need_inval flag to struct ib_mr Christoph Hellwig
2016-03-08 19:41     ` Bart Van Assche
2016-03-08 18:16   ` [PATCH v4 7/9] IB/core: generic RDMA READ/WRITE API Christoph Hellwig
2016-03-08 21:32     ` Bart Van Assche
2016-03-09 13:07       ` Christoph Hellwig
2016-03-09 15:54         ` Christoph Hellwig
2016-03-09 18:55           ` Bart Van Assche
2016-03-10  7:57             ` Christoph Hellwig
2016-03-10 15:30               ` Bart Van Assche
     [not found]                 ` <56E1931C.1040206-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-03-10 16:21                   ` Christoph Hellwig
     [not found]                     ` <20160310162150.GA9544-jcswGhMUV9g@public.gmane.org>
2016-03-10 17:00                       ` Bart Van Assche
     [not found]                         ` <56E1A810.1080001-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-03-10 17:01                           ` Christoph Hellwig
2016-03-11 22:39                       ` Bart Van Assche
     [not found]                         ` <56E34914.2020108-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-03-12  6:12                           ` Christoph Hellwig
     [not found]                             ` <20160312061232.GA15236-jcswGhMUV9g@public.gmane.org>
2016-03-14  1:04                               ` Bart Van Assche
     [not found]                                 ` <56E60E0F.7060900-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-03-15  8:45                                   ` Christoph Hellwig
     [not found]           ` <20160309155457.GA1898-jcswGhMUV9g@public.gmane.org>
2016-03-11  1:26             ` Bart Van Assche
2016-03-11  9:22               ` Christoph Hellwig
2016-03-08 18:16   ` [PATCH v4 8/9] target: enhance and export target_alloc_sgl/target_free_sgl Christoph Hellwig
2016-03-08 21:37     ` Bart Van Assche
2016-03-08 18:16 ` [PATCH v4 9/9] IB/srpt: convert to the generic RDMA READ/WRITE API Christoph Hellwig
2016-03-08 20:13   ` Leon Romanovsky
2016-03-09 12:59     ` Christoph Hellwig
2016-03-10 22:08       ` Bart Van Assche
2016-03-11  9:09         ` Christoph Hellwig
     [not found]   ` <1457461000-24088-10-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
2016-03-10 22:40     ` Bart Van Assche
     [not found]       ` <56E1F7C8.9000101-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-03-11  9:14         ` Christoph Hellwig
