* [PATCH 00/14] RDMA: Improve use of umem in DMA drivers
@ 2020-09-02  0:43 Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary Jason Gunthorpe
                   ` (14 more replies)
  0 siblings, 15 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Adit Ranadive, Ariel Elior, Potnuri Bharat Teja,
	Dennis Dalessandro, Devesh Sharma, Doug Ledford, Faisal Latif,
	Gal Pressman, Wei Hu(Xavier),
	Leon Romanovsky, linux-rdma, Weihang Li, Miguel Ojeda,
	Mike Marciniszyn, Michal Kalderon, Naresh Kumar PBS, Lijun Ou,
	VMware PV-Drivers, Selvin Xavier, Shiraz Saleem, Yossi Leybovich,
	Somnath Kotur, Sriharsha Basavapatna, Zhu Yanjun, Yishai Hadas

Most RDMA drivers rely on a linear table of DMA addresses organized in
some device-specific page size.

For a while now the core code has had the rdma_for_each_block() SG
iterator to help break a umem into DMA blocks for use in the device lists.

Improve on this by adding rdma_umem_for_each_dma_block(),
ib_umem_dma_offset() and ib_umem_num_dma_blocks().

Replace open-coded loops, or calls to fixed PAGE_SIZE APIs, in most of the
drivers with one of the above APIs.
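
As a concrete illustration, here is a minimal sketch of the intended
calling pattern (not taken from any driver in this series;
fill_pbl_from_umem(), hw_pgsz_bitmap and pbl are made-up names, and the
caller is assumed to have sized pbl with ib_umem_num_dma_blocks()):

  static int fill_pbl_from_umem(struct ib_umem *umem, u64 virt_addr,
                                unsigned long hw_pgsz_bitmap, u64 *pbl)
  {
          struct ib_block_iter biter;
          unsigned long page_size;

          /* Largest HW supported page size that can map this umem */
          page_size = ib_umem_find_best_pgsz(umem, hw_pgsz_bitmap, virt_addr);
          if (!page_size)
                  return -EINVAL;

          /* One DMA address per aligned block of page_size bytes */
          rdma_umem_for_each_dma_block(umem, &biter, page_size)
                  *pbl++ = rdma_block_iter_dma_address(&biter);

          return ib_umem_num_dma_blocks(umem, page_size);
  }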

Get rid of the really weird and duplicative ib_umem_page_count().

Fix two problems with ib_umem_find_best_pgsz().

At this point many of the drivers have a clear path to call
ib_umem_find_best_pgsz() and replace hardcoded PAGE_SIZE or PAGE_SHIFT
values when constructing their DMA lists.

This is the first series in an effort to modernize the umem usage in all
the DMA drivers.

Jason Gunthorpe (14):
  RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page
    boundary
  RDMA/umem: Prevent small pages from being returned by
    ib_umem_find_best_pgsz()
  RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz()
  RDMA/umem: Add rdma_umem_for_each_dma_block()
  RDMA/umem: Replace for_each_sg_dma_page with
    rdma_umem_for_each_dma_block
  RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks()
  RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding
  RDMA/qedr: Use ib_umem_num_dma_blocks() instead of
    ib_umem_page_count()
  RDMA/bnxt: Do not use ib_umem_page_count() or ib_umem_num_pages()
  RDMA/hns: Use ib_umem_num_dma_blocks() instead of opencoding
  RDMA/ocrdma: Use ib_umem_num_dma_blocks() instead of
    ib_umem_page_count()
  RDMA/pvrdma: Use ib_umem_num_dma_blocks() instead of
    ib_umem_page_count()
  RDMA/mlx4: Use ib_umem_num_dma_blocks()
  RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset()

 .clang-format                                 |  1 +
 drivers/infiniband/core/umem.c                | 41 ++++++-----
 drivers/infiniband/hw/bnxt_re/ib_verbs.c      | 72 +++++++------------
 drivers/infiniband/hw/cxgb4/mem.c             |  8 +--
 drivers/infiniband/hw/efa/efa_verbs.c         |  3 +-
 drivers/infiniband/hw/hns/hns_roce_alloc.c    |  3 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c       | 49 +++++--------
 drivers/infiniband/hw/i40iw/i40iw_verbs.c     |  3 +-
 drivers/infiniband/hw/mlx4/cq.c               |  8 +--
 drivers/infiniband/hw/mlx4/mlx4_ib.h          |  3 +-
 drivers/infiniband/hw/mlx4/mr.c               | 14 ++--
 drivers/infiniband/hw/mlx4/qp.c               | 17 ++---
 drivers/infiniband/hw/mlx4/srq.c              |  5 +-
 drivers/infiniband/hw/mlx5/mem.c              |  5 +-
 drivers/infiniband/hw/mthca/mthca_provider.c  |  8 +--
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   | 26 +++----
 drivers/infiniband/hw/qedr/verbs.c            | 50 ++++++-------
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c  |  2 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_misc.c    |  9 ++-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c  |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c  |  6 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c |  2 +-
 drivers/infiniband/sw/rdmavt/mr.c             |  2 +-
 drivers/infiniband/sw/rxe/rxe_mr.c            |  2 +-
 include/rdma/ib_umem.h                        | 45 +++++++++---
 include/rdma/ib_verbs.h                       | 24 -------
 26 files changed, 184 insertions(+), 226 deletions(-)

-- 
2.28.0



* [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  9:24   ` Leon Romanovsky
                     ` (2 more replies)
  2020-09-02  0:43 ` [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz() Jason Gunthorpe
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma; +Cc: Shiraz Saleem

It is possible for a single SGL to span an aligned boundary, e.g. if the
SGL is

  61440 -> 90112

Then the length is 28672, which currently limits the block size to
32k. With a 32k page size the two covering blocks will be:

  32768->65536 and 65536->98304

However, the correct answer is a 128K block size which will span the whole
28672 bytes in a single block.

Instead of limiting based on length, figure out which high IOVA bits don't
change between the start and end addresses. That determines the highest
useful page size.
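
Worked through in plain userspace C with the numbers above (this is just
a re-creation of the arithmetic, not kernel code; bits_per() is
reimplemented locally):

  #include <stdio.h>

  /* Number of bits needed to represent v, like the kernel's bits_per() */
  static unsigned int bits_per(unsigned long v)
  {
          unsigned int bits = 0;

          while (v) {
                  bits++;
                  v >>= 1;
          }
          return bits;
  }

  int main(void)
  {
          unsigned long address = 61440, length = 28672;
          unsigned long changing = (address + length - 1) ^ address;

          /* Old cap: roundup_pow_of_two(length) -> 32768 (32k) */
          printf("old cap: %lu\n", 1UL << bits_per(length - 1));
          /* New cap: smallest page size whose aligned block spans the
           * whole range -> 131072 (128k)
           */
          printf("new cap: %lu\n", 1UL << bits_per(changing));
          return 0;
  }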

Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/umem.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 831bff8d52e547..120e98403c345d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -156,8 +156,14 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 		return 0;
 
 	va = virt;
-	/* max page size not to exceed MR length */
-	mask = roundup_pow_of_two(umem->length);
+	/* The best result is the smallest page size that results in the minimum
+	 * number of required pages. Compute the largest page size that could
+	 * work based on VA address bits that don't change.
+	 */
+	mask = pgsz_bitmap &
+	       GENMASK(BITS_PER_LONG - 1,
+		       bits_per((umem->length - 1 + umem->address) ^
+				umem->address));
 	/* offset into first SGL */
 	pgoff = umem->address & ~PAGE_MASK;
 
-- 
2.28.0



* [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02 11:51   ` Leon Romanovsky
  2020-09-03 14:11   ` Saleem, Shiraz
  2020-09-02  0:43 ` [PATCH 03/14] RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz() Jason Gunthorpe
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma; +Cc: Shiraz Saleem

rdma_for_each_block() makes assumptions about how the SGL is constructed
that don't work if the block size is below the page size used to build
the SGL.

The rules for umem SGL construction require that the SGs all be PAGE_SIZE
aligned, and the actual byte offset of the VA range is not encoded inside
the SGL using offset and length. So rdma_for_each_block() has no idea
where the actual starting/ending point is when computing the first/last
block boundary if the starting address falls within an SGL.

Fixing the SGL construction turns out to be really hard, and will be the
subject of other patches. For now block smaller pages.
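
A userspace illustration of the clamp with assumed values (PAGE_SHIFT
taken as 12, and a hypothetical HW bitmap offering 1K, 4K, 64K and 2M
page sizes; only the sub-PAGE_SIZE 1K bit is dropped):

  #include <stdio.h>

  #define BITS_PER_LONG  (8 * sizeof(unsigned long))
  #define GENMASK(h, l)  (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
  #define PAGE_SHIFT     12

  int main(void)
  {
          unsigned long pgsz_bitmap = (1UL << 10) | (1UL << 12) |
                                      (1UL << 16) | (1UL << 21);

          pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
          /* Prints 0x211000, i.e. 4K|64K|2M; the 1K bit is gone */
          printf("clamped bitmap: 0x%lx\n", pgsz_bitmap);
          return 0;
  }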

Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/umem.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 120e98403c345d..7b5bc969e55630 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 	dma_addr_t mask;
 	int i;
 
+	/* rdma_for_each_block() has a bug if the page size is smaller than the
+	 * page size used to build the umem. For now prevent smaller page sizes
+	 * from being returned.
+	 */
+	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
+
 	/* At minimum, drivers must support PAGE_SIZE or smaller */
 	if (WARN_ON(!(pgsz_bitmap & GENMASK(PAGE_SHIFT, 0))))
 		return 0;
-- 
2.28.0



* [PATCH 03/14] RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz() Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block() Jason Gunthorpe
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma

The calculation in rdma_find_pg_bit() is fairly complicated, and the
function is never called anywhere else. Inline a simpler version into
ib_umem_find_best_pgsz().
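
The effect of the simpler tail can be reproduced in userspace with
made-up numbers (GCC/Clang builtins stand in for the kernel helpers): a
VA of 0x20000 backed by a DMA address of 0x46000 agrees with the VA only
in its low 13 bits, so with a hypothetical 4K|64K|2M bitmap the best
remaining choice is 4K:

  #include <stdio.h>

  #define BITS_PER_LONG  (8 * sizeof(unsigned long))
  #define GENMASK(h, l)  (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

  static unsigned int count_trailing_zeros(unsigned long x)
  {
          return __builtin_ctzl(x);
  }

  static unsigned long rounddown_pow_of_two(unsigned long x)
  {
          return 1UL << (BITS_PER_LONG - 1 - __builtin_clzl(x));
  }

  int main(void)
  {
          unsigned long pgsz_bitmap = (1UL << 12) | (1UL << 16) | (1UL << 21);
          unsigned long mask = 0x20000 ^ 0x46000; /* accumulated va ^ dma bits */

          if (mask)
                  pgsz_bitmap &= GENMASK(count_trailing_zeros(mask), 0);
          printf("best page size: %lu\n", rounddown_pow_of_two(pgsz_bitmap));
          return 0;
  }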

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/umem.c | 11 ++++++++---
 include/rdma/ib_verbs.h        | 24 ------------------------
 2 files changed, 8 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 7b5bc969e55630..f02e34cac59581 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -39,6 +39,7 @@
 #include <linux/export.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
+#include <linux/count_zeros.h>
 #include <rdma/ib_umem_odp.h>
 
 #include "uverbs.h"
@@ -146,7 +147,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 				     unsigned long virt)
 {
 	struct scatterlist *sg;
-	unsigned int best_pg_bit;
 	unsigned long va, pgoff;
 	dma_addr_t mask;
 	int i;
@@ -187,9 +187,14 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 			mask |= va;
 		pgoff = 0;
 	}
-	best_pg_bit = rdma_find_pg_bit(mask, pgsz_bitmap);
 
-	return BIT_ULL(best_pg_bit);
+	/* The mask accumulates 1's in each position where the VA and physical
+	 * address differ, thus the length of trailing 0 is the largest page
+	 * size that can pass the VA through to the physical.
+	 */
+	if (mask)
+		pgsz_bitmap &= GENMASK(count_trailing_zeros(mask), 0);
+	return rounddown_pow_of_two(pgsz_bitmap);
 }
 EXPORT_SYMBOL(ib_umem_find_best_pgsz);
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c868609a4ffaed..5dcbbb77cadb4f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3340,30 +3340,6 @@ static inline bool rdma_cap_read_inv(struct ib_device *dev, u32 port_num)
 	return rdma_protocol_iwarp(dev, port_num);
 }
 
-/**
- * rdma_find_pg_bit - Find page bit given address and HW supported page sizes
- *
- * @addr: address
- * @pgsz_bitmap: bitmap of HW supported page sizes
- */
-static inline unsigned int rdma_find_pg_bit(unsigned long addr,
-					    unsigned long pgsz_bitmap)
-{
-	unsigned long align;
-	unsigned long pgsz;
-
-	align = addr & -addr;
-
-	/* Find page bit such that addr is aligned to the highest supported
-	 * HW page size
-	 */
-	pgsz = pgsz_bitmap & ~(-align << 1);
-	if (!pgsz)
-		return __ffs(pgsz_bitmap);
-
-	return __fls(pgsz);
-}
-
 /**
  * rdma_core_cap_opa_port - Return whether the RDMA Port is OPA or not.
  * @device: Device
-- 
2.28.0



* [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 03/14] RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz() Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  3:10   ` Miguel Ojeda
  2020-09-03 14:12   ` Saleem, Shiraz
  2020-09-02  0:43 ` [PATCH 05/14] RDMA/umem: Replace for_each_sg_dma_page with rdma_umem_for_each_dma_block Jason Gunthorpe
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Devesh Sharma, Doug Ledford, Faisal Latif, Gal Pressman,
	Wei Hu(Xavier),
	linux-rdma, Weihang Li, Miguel Ojeda, Naresh Kumar PBS, Lijun Ou,
	Selvin Xavier, Shiraz Saleem, Yossi Leybovich, Somnath Kotur,
	Sriharsha Basavapatna

This helper does the same as rdma_for_each_block(), except it works on a
umem. This simplifies most of the call sites.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .clang-format                              |  1 +
 drivers/infiniband/hw/bnxt_re/ib_verbs.c   |  2 +-
 drivers/infiniband/hw/efa/efa_verbs.c      |  3 +--
 drivers/infiniband/hw/hns/hns_roce_alloc.c |  3 +--
 drivers/infiniband/hw/i40iw/i40iw_verbs.c  |  3 +--
 include/rdma/ib_umem.h                     | 20 ++++++++++++++++++++
 6 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/.clang-format b/.clang-format
index a0a96088c74f49..311ef2c61a1bdf 100644
--- a/.clang-format
+++ b/.clang-format
@@ -415,6 +415,7 @@ ForEachMacros:
   - 'rbtree_postorder_for_each_entry_safe'
   - 'rdma_for_each_block'
   - 'rdma_for_each_port'
+  - 'rdma_umem_for_each_dma_block'
   - 'resource_list_for_each_entry'
   - 'resource_list_for_each_entry_safe'
   - 'rhl_for_each_entry_rcu'
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 5ee272d27aaade..9e26e651730cb3 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3783,7 +3783,7 @@ static int fill_umem_pbl_tbl(struct ib_umem *umem, u64 *pbl_tbl_orig,
 	u64 page_size =  BIT_ULL(page_shift);
 	struct ib_block_iter biter;
 
-	rdma_for_each_block(umem->sg_head.sgl, &biter, umem->nmap, page_size)
+	rdma_umem_for_each_dma_block(umem, &biter, page_size)
 		*pbl_tbl++ = rdma_block_iter_dma_address(&biter);
 
 	return pbl_tbl - pbl_tbl_orig;
diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index de9a22f0fcc218..d85c63a5021a70 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -1142,8 +1142,7 @@ static int umem_to_page_list(struct efa_dev *dev,
 	ibdev_dbg(&dev->ibdev, "hp_cnt[%u], pages_in_hp[%u]\n",
 		  hp_cnt, pages_in_hp);
 
-	rdma_for_each_block(umem->sg_head.sgl, &biter, umem->nmap,
-			    BIT(hp_shift))
+	rdma_umem_for_each_dma_block(umem, &biter, BIT(hp_shift))
 		page_list[hp_idx++] = rdma_block_iter_dma_address(&biter);
 
 	return 0;
diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c b/drivers/infiniband/hw/hns/hns_roce_alloc.c
index a522cb2d29eabc..a6b23dec1adcf6 100644
--- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
+++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
@@ -268,8 +268,7 @@ int hns_roce_get_umem_bufs(struct hns_roce_dev *hr_dev, dma_addr_t *bufs,
 	}
 
 	/* convert system page cnt to hw page cnt */
-	rdma_for_each_block(umem->sg_head.sgl, &biter, umem->nmap,
-			    1 << page_shift) {
+	rdma_umem_for_each_dma_block(umem, &biter, 1 << page_shift) {
 		addr = rdma_block_iter_dma_address(&biter);
 		if (idx >= start) {
 			bufs[total++] = addr;
diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
index b51339328a51ef..beb611b157bc8d 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
@@ -1320,8 +1320,7 @@ static void i40iw_copy_user_pgaddrs(struct i40iw_mr *iwmr,
 	if (iwmr->type == IW_MEMREG_TYPE_QP)
 		iwpbl->qp_mr.sq_page = sg_page(region->sg_head.sgl);
 
-	rdma_for_each_block(region->sg_head.sgl, &biter, region->nmap,
-			    iwmr->page_size) {
+	rdma_umem_for_each_dma_block(region, &biter, iwmr->page_size) {
 		*pbl = rdma_block_iter_dma_address(&biter);
 		pbl = i40iw_next_pbl_addr(pbl, &pinfo, &idx);
 	}
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 07a764eb692eed..b880512ba95f16 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -40,6 +40,26 @@ static inline size_t ib_umem_num_pages(struct ib_umem *umem)
 	       PAGE_SHIFT;
 }
 
+static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
+						struct ib_umem *umem,
+						unsigned long pgsz)
+{
+	__rdma_block_iter_start(biter, umem->sg_head.sgl, umem->nmap, pgsz);
+}
+
+/**
+ * rdma_umem_for_each_dma_block - iterate over contiguous DMA blocks of the umem
+ * @umem: umem to iterate over
+ * @pgsz: Page size to split the list into
+ *
+ * pgsz must be <= PAGE_SIZE or computed by ib_umem_find_best_pgsz(). The
+ * returned DMA blocks will be aligned to pgsz and span the range:
+ * ALIGN_DOWN(umem->address, pgsz) to ALIGN(umem->address + umem->length, pgsz)
+ */
+#define rdma_umem_for_each_dma_block(umem, biter, pgsz)                        \
+	for (__rdma_umem_block_iter_start(biter, umem, pgsz);                  \
+	     __rdma_block_iter_next(biter);)
+
 #ifdef CONFIG_INFINIBAND_USER_MEM
 
 struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
-- 
2.28.0



* [PATCH 05/14] RDMA/umem: Replace for_each_sg_dma_page with rdma_umem_for_each_dma_block
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block() Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks() Jason Gunthorpe
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Adit Ranadive, Potnuri Bharat Teja, Devesh Sharma, Doug Ledford,
	linux-rdma, VMware PV-Drivers, Selvin Xavier

Generally drivers should be using this core helper to split up the umem
into DMA pages.

These drivers are all probably wrong in some way to pass PAGE_SIZE in as
the HW page size. Either the driver doesn't support other page sizes and
it should use 4096, or the driver does support other page sizes and should
use ib_umem_find_best_pgsz() to select the best HW page size from the
HW-supported set.

The only case it could be correct is if the HW has a global setting for
PAGE_SIZE set at driver initialization time.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/cxgb4/mem.c              | 6 +++---
 drivers/infiniband/hw/mthca/mthca_provider.c   | 6 +++---
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c    | 7 +++----
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_misc.c | 9 ++++-----
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index 73936c3341b77c..82afdb1987eff6 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -510,7 +510,7 @@ struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	__be64 *pages;
 	int shift, n, i;
 	int err = -ENOMEM;
-	struct sg_dma_page_iter sg_iter;
+	struct ib_block_iter biter;
 	struct c4iw_dev *rhp;
 	struct c4iw_pd *php;
 	struct c4iw_mr *mhp;
@@ -561,8 +561,8 @@ struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 	i = n = 0;
 
-	for_each_sg_dma_page(mhp->umem->sg_head.sgl, &sg_iter, mhp->umem->nmap, 0) {
-		pages[i++] = cpu_to_be64(sg_page_iter_dma_address(&sg_iter));
+	rdma_umem_for_each_dma_block(mhp->umem, &biter, 1 << shift) {
+		pages[i++] = cpu_to_be64(rdma_block_iter_dma_address(&biter));
 		if (i == PAGE_SIZE / sizeof(*pages)) {
 			err = write_pbl(&mhp->rhp->rdev, pages,
 					mhp->attr.pbl_addr + (n << 3), i,
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 9fa2f9164a47b6..317e67ad915fe8 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -846,7 +846,7 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				       u64 virt, int acc, struct ib_udata *udata)
 {
 	struct mthca_dev *dev = to_mdev(pd->device);
-	struct sg_dma_page_iter sg_iter;
+	struct ib_block_iter biter;
 	struct mthca_ucontext *context = rdma_udata_to_drv_context(
 		udata, struct mthca_ucontext, ibucontext);
 	struct mthca_mr *mr;
@@ -895,8 +895,8 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 	write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages));
 
-	for_each_sg_dma_page(mr->umem->sg_head.sgl, &sg_iter, mr->umem->nmap, 0) {
-		pages[i++] = sg_page_iter_dma_address(&sg_iter);
+	rdma_umem_for_each_dma_block(mr->umem, &biter, PAGE_SIZE) {
+		pages[i++] = rdma_block_iter_dma_address(&biter);
 
 		/*
 		 * Be friendly to write_mtt and pass it chunks
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index c1751c9a0f625c..933b297de2ba86 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -814,9 +814,8 @@ static void build_user_pbes(struct ocrdma_dev *dev, struct ocrdma_mr *mr,
 			    u32 num_pbes)
 {
 	struct ocrdma_pbe *pbe;
-	struct sg_dma_page_iter sg_iter;
+	struct ib_block_iter biter;
 	struct ocrdma_pbl *pbl_tbl = mr->hwmr.pbl_table;
-	struct ib_umem *umem = mr->umem;
 	int pbe_cnt, total_num_pbes = 0;
 	u64 pg_addr;
 
@@ -826,9 +825,9 @@ static void build_user_pbes(struct ocrdma_dev *dev, struct ocrdma_mr *mr,
 	pbe = (struct ocrdma_pbe *)pbl_tbl->va;
 	pbe_cnt = 0;
 
-	for_each_sg_dma_page (umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
+	rdma_umem_for_each_dma_block (mr->umem, &biter, PAGE_SIZE) {
 		/* store the page address in pbe */
-		pg_addr = sg_page_iter_dma_address(&sg_iter);
+		pg_addr = rdma_block_iter_dma_address(&biter);
 		pbe->pa_lo = cpu_to_le32(pg_addr);
 		pbe->pa_hi = cpu_to_le32(upper_32_bits(pg_addr));
 		pbe_cnt += 1;
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_misc.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_misc.c
index 7944c58ded0e59..ba43ad07898c2b 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_misc.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_misc.c
@@ -182,17 +182,16 @@ int pvrdma_page_dir_insert_dma(struct pvrdma_page_dir *pdir, u64 idx,
 int pvrdma_page_dir_insert_umem(struct pvrdma_page_dir *pdir,
 				struct ib_umem *umem, u64 offset)
 {
+	struct ib_block_iter biter;
 	u64 i = offset;
 	int ret = 0;
-	struct sg_dma_page_iter sg_iter;
 
 	if (offset >= pdir->npages)
 		return -EINVAL;
 
-	for_each_sg_dma_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
-		dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
-
-		ret = pvrdma_page_dir_insert_dma(pdir, i, addr);
+	rdma_umem_for_each_dma_block (umem, &biter, PAGE_SIZE) {
+		ret = pvrdma_page_dir_insert_dma(
+			pdir, i, rdma_block_iter_dma_address(&biter));
 		if (ret)
 			goto exit;
 
-- 
2.28.0



* [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 05/14] RDMA/umem: Replace for_each_sg_dma_page with rdma_umem_for_each_dma_block Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-03 14:12   ` Saleem, Shiraz
  2020-09-04 22:32   ` Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 07/14] RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding Jason Gunthorpe
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Adit Ranadive, Potnuri Bharat Teja, Doug Ledford,
	Leon Romanovsky, linux-rdma, VMware PV-Drivers

ib_umem_num_pages() should only be used by code working with the SGL in
CPU pages directly.

Drivers building DMA lists should use the new ib_umem_num_dma_blocks(),
which returns the number of blocks rdma_umem_for_each_dma_block() will
iterate over.
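
The difference is easy to see with made-up numbers in userspace: a 16K
range starting at 0x11000 occupies four 4K blocks but only one 64K
block, because the count is taken over the pgsz-aligned extent of the
range rather than over its raw length:

  #include <stdio.h>

  #define ALIGN(x, a)       (((x) + (a) - 1) & ~((a) - 1))
  #define ALIGN_DOWN(x, a)  ((x) & ~((a) - 1))

  static unsigned long num_dma_blocks(unsigned long address,
                                      unsigned long length,
                                      unsigned long pgsz)
  {
          return (ALIGN(address + length, pgsz) -
                  ALIGN_DOWN(address, pgsz)) / pgsz;
  }

  int main(void)
  {
          printf("%lu\n", num_dma_blocks(0x11000, 0x4000, 0x1000));  /* 4 */
          printf("%lu\n", num_dma_blocks(0x11000, 0x4000, 0x10000)); /* 1 */
          return 0;
  }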

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/cxgb4/mem.c            |  2 +-
 drivers/infiniband/hw/mlx5/mem.c             |  5 +++--
 drivers/infiniband/hw/mthca/mthca_provider.c |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c |  2 +-
 include/rdma/ib_umem.h                       | 14 +++++++++++---
 5 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index 82afdb1987eff6..22c8f5745047db 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -548,7 +548,7 @@ struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 	shift = PAGE_SHIFT;
 
-	n = ib_umem_num_pages(mhp->umem);
+	n = ib_umem_num_dma_blocks(mhp->umem, 1 << shift);
 	err = alloc_pbl(mhp, n);
 	if (err)
 		goto err_umem_release;
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index c19ec9fd8a63c3..5641dbda72ff66 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -169,8 +169,9 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 			  int page_shift, __be64 *pas, int access_flags)
 {
 	return __mlx5_ib_populate_pas(dev, umem, page_shift, 0,
-				      ib_umem_num_pages(umem), pas,
-				      access_flags);
+				      ib_umem_num_dma_blocks(umem,
+							     1 << page_shift),
+				      pas, access_flags);
 }
 int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset)
 {
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 317e67ad915fe8..b785fb9a2634ff 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -877,7 +877,7 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		goto err;
 	}
 
-	n = ib_umem_num_pages(mr->umem);
+	n = ib_umem_num_dma_blocks(mr->umem, PAGE_SIZE);
 
 	mr->mtt = mthca_alloc_mtt(dev, n);
 	if (IS_ERR(mr->mtt)) {
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
index 91f0957e611543..e80848bfb3bdbf 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
@@ -133,7 +133,7 @@ struct ib_mr *pvrdma_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		return ERR_CAST(umem);
 	}
 
-	npages = ib_umem_num_pages(umem);
+	npages = ib_umem_num_dma_blocks(umem, PAGE_SIZE);
 	if (npages < 0 || npages > PVRDMA_PAGE_DIR_MAX_PAGES) {
 		dev_warn(&dev->pdev->dev, "overflow %d pages in mem region\n",
 			 npages);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index b880512ba95f16..ba3b9be0d8c56a 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -33,11 +33,17 @@ static inline int ib_umem_offset(struct ib_umem *umem)
 	return umem->address & ~PAGE_MASK;
 }
 
+static inline size_t ib_umem_num_dma_blocks(struct ib_umem *umem,
+					    unsigned long pgsz)
+{
+	return (ALIGN(umem->address + umem->length, pgsz) -
+		ALIGN_DOWN(umem->address, pgsz)) /
+	       pgsz;
+}
+
 static inline size_t ib_umem_num_pages(struct ib_umem *umem)
 {
-	return (ALIGN(umem->address + umem->length, PAGE_SIZE) -
-		ALIGN_DOWN(umem->address, PAGE_SIZE)) >>
-	       PAGE_SHIFT;
+	return ib_umem_num_dma_blocks(umem, PAGE_SIZE);
 }
 
 static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
@@ -55,6 +61,8 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
  * pgsz must be <= PAGE_SIZE or computed by ib_umem_find_best_pgsz(). The
  * returned DMA blocks will be aligned to pgsz and span the range:
  * ALIGN_DOWN(umem->address, pgsz) to ALIGN(umem->address + umem->length, pgsz)
+ *
+ * Performs exactly ib_umem_num_dma_blocks() iterations.
  */
 #define rdma_umem_for_each_dma_block(umem, biter, pgsz)                        \
 	for (__rdma_umem_block_iter_start(biter, umem, pgsz);                  \
-- 
2.28.0



* [PATCH 07/14] RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks() Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02 15:36   ` [EXT] " Michal Kalderon
  2020-09-02  0:43 ` [PATCH 08/14] RDMA/qedr: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() Jason Gunthorpe
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Ariel Elior, Doug Ledford, linux-rdma, Michal Kalderon

This loop is splitting the DMA SGL into pg_shift-sized pages; use the core
code for this directly.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/qedr/verbs.c | 41 ++++++++++++------------------
 1 file changed, 16 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index b49bef94637e50..cbb49168d9f7ed 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -600,11 +600,9 @@ static void qedr_populate_pbls(struct qedr_dev *dev, struct ib_umem *umem,
 			       struct qedr_pbl_info *pbl_info, u32 pg_shift)
 {
 	int pbe_cnt, total_num_pbes = 0;
-	u32 fw_pg_cnt, fw_pg_per_umem_pg;
 	struct qedr_pbl *pbl_tbl;
-	struct sg_dma_page_iter sg_iter;
+	struct ib_block_iter biter;
 	struct regpair *pbe;
-	u64 pg_addr;
 
 	if (!pbl_info->num_pbes)
 		return;
@@ -625,32 +623,25 @@ static void qedr_populate_pbls(struct qedr_dev *dev, struct ib_umem *umem,
 
 	pbe_cnt = 0;
 
-	fw_pg_per_umem_pg = BIT(PAGE_SHIFT - pg_shift);
+	rdma_umem_for_each_dma_block (umem, &biter, BIT(pg_shift)) {
+		u64 pg_addr = rdma_block_iter_dma_address(&biter);
 
-	for_each_sg_dma_page (umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
-		pg_addr = sg_page_iter_dma_address(&sg_iter);
-		for (fw_pg_cnt = 0; fw_pg_cnt < fw_pg_per_umem_pg;) {
-			pbe->lo = cpu_to_le32(pg_addr);
-			pbe->hi = cpu_to_le32(upper_32_bits(pg_addr));
+		pbe->lo = cpu_to_le32(pg_addr);
+		pbe->hi = cpu_to_le32(upper_32_bits(pg_addr));
 
-			pg_addr += BIT(pg_shift);
-			pbe_cnt++;
-			total_num_pbes++;
-			pbe++;
+		pbe_cnt++;
+		total_num_pbes++;
+		pbe++;
 
-			if (total_num_pbes == pbl_info->num_pbes)
-				return;
+		if (total_num_pbes == pbl_info->num_pbes)
+			return;
 
-			/* If the given pbl is full storing the pbes,
-			 * move to next pbl.
-			 */
-			if (pbe_cnt == (pbl_info->pbl_size / sizeof(u64))) {
-				pbl_tbl++;
-				pbe = (struct regpair *)pbl_tbl->va;
-				pbe_cnt = 0;
-			}
-
-			fw_pg_cnt++;
+		/* If the given pbl is full storing the pbes, move to next pbl.
+		 */
+		if (pbe_cnt == (pbl_info->pbl_size / sizeof(u64))) {
+			pbl_tbl++;
+			pbe = (struct regpair *)pbl_tbl->va;
+			pbe_cnt = 0;
 		}
 	}
 }
-- 
2.28.0



* [PATCH 08/14] RDMA/qedr: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 07/14] RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02 15:35   ` [EXT] " Michal Kalderon
  2020-09-02  0:43 ` [PATCH 09/14] RDMA/bnxt: Do not use ib_umem_page_count() or ib_umem_num_pages() Jason Gunthorpe
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Ariel Elior, Doug Ledford, linux-rdma, Michal Kalderon

The length of the list populated by qedr_populate_pbls() should be
calculated using ib_umem_num_dma_blocks() with the same size/shift passed
to qedr_populate_pbls().

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/qedr/verbs.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index cbb49168d9f7ed..278b48443aedba 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -783,9 +783,7 @@ static inline int qedr_init_user_queue(struct ib_udata *udata,
 		return PTR_ERR(q->umem);
 	}
 
-	fw_pages = ib_umem_page_count(q->umem) <<
-	    (PAGE_SHIFT - FW_PAGE_SHIFT);
-
+	fw_pages = ib_umem_num_dma_blocks(q->umem, 1 << FW_PAGE_SHIFT);
 	rc = qedr_prepare_pbl_tbl(dev, &q->pbl_info, fw_pages, 0);
 	if (rc)
 		goto err0;
@@ -2852,7 +2850,8 @@ struct ib_mr *qedr_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 		goto err0;
 	}
 
-	rc = init_mr_info(dev, &mr->info, ib_umem_page_count(mr->umem), 1);
+	rc = init_mr_info(dev, &mr->info,
+			  ib_umem_num_dma_blocks(mr->umem, PAGE_SIZE), 1);
 	if (rc)
 		goto err1;
 
-- 
2.28.0



* [PATCH 09/14] RDMA/bnxt: Do not use ib_umem_page_count() or ib_umem_num_pages()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 08/14] RDMA/qedr: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-03  4:41   ` Selvin Xavier
  2020-09-02  0:43 ` [PATCH 10/14] RDMA/hns: Use ib_umem_num_dma_blocks() instead of opencoding Jason Gunthorpe
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Devesh Sharma, Doug Ledford, linux-rdma, Naresh Kumar PBS,
	Somnath Kotur, Sriharsha Basavapatna
  Cc: Selvin Xavier, Shiraz Saleem

ib_umem_page_count() returns the number of 4k entries required for a DMA
map, but bnxt_re already computes a variable page size. The correct API to
determine the size of the page table array is ib_umem_num_dma_blocks().

Fix the overallocation of the page array in fill_umem_pbl_tbl() when
working with larger page sizes by using the right function. Lightly
re-organize this function to make it clearer.

Replace the other calls to ib_umem_num_pages().
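
For illustration with assumed numbers (a 2M-aligned, 2M-long umem that
the HW maps with a single 2M page, and a 4K PAGE_SIZE), the old and new
sizing of pbl_tbl differ by a factor of 512:

  #include <stdio.h>

  int main(void)
  {
          unsigned long length = 2UL << 20;  /* 2M umem, 2M aligned */

          /* ib_umem_page_count(): counts PAGE_SIZE (4K) entries */
          printf("old pbl_tbl entries: %lu\n", length >> 12);  /* 512 */
          /* ib_umem_num_dma_blocks(umem, 2M): counts 2M blocks */
          printf("new pbl_tbl entries: %lu\n", length >> 21);  /* 1 */
          return 0;
  }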

Fixes: d85582517e91 ("RDMA/bnxt_re: Use core helpers to get aligned DMA address")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/bnxt_re/ib_verbs.c | 70 ++++++++----------------
 1 file changed, 24 insertions(+), 46 deletions(-)

diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 9e26e651730cb3..9dbf9ab5a4c8db 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -939,7 +939,7 @@ static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd,
 
 	qp->sumem = umem;
 	qplib_qp->sq.sg_info.sghead = umem->sg_head.sgl;
-	qplib_qp->sq.sg_info.npages = ib_umem_num_pages(umem);
+	qplib_qp->sq.sg_info.npages = ib_umem_num_dma_blocks(umem, PAGE_SIZE);
 	qplib_qp->sq.sg_info.nmap = umem->nmap;
 	qplib_qp->sq.sg_info.pgsize = PAGE_SIZE;
 	qplib_qp->sq.sg_info.pgshft = PAGE_SHIFT;
@@ -954,7 +954,8 @@ static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd,
 			goto rqfail;
 		qp->rumem = umem;
 		qplib_qp->rq.sg_info.sghead = umem->sg_head.sgl;
-		qplib_qp->rq.sg_info.npages = ib_umem_num_pages(umem);
+		qplib_qp->rq.sg_info.npages =
+			ib_umem_num_dma_blocks(umem, PAGE_SIZE);
 		qplib_qp->rq.sg_info.nmap = umem->nmap;
 		qplib_qp->rq.sg_info.pgsize = PAGE_SIZE;
 		qplib_qp->rq.sg_info.pgshft = PAGE_SHIFT;
@@ -1609,7 +1610,7 @@ static int bnxt_re_init_user_srq(struct bnxt_re_dev *rdev,
 
 	srq->umem = umem;
 	qplib_srq->sg_info.sghead = umem->sg_head.sgl;
-	qplib_srq->sg_info.npages = ib_umem_num_pages(umem);
+	qplib_srq->sg_info.npages = ib_umem_num_dma_blocks(umem, PAGE_SIZE);
 	qplib_srq->sg_info.nmap = umem->nmap;
 	qplib_srq->sg_info.pgsize = PAGE_SIZE;
 	qplib_srq->sg_info.pgshft = PAGE_SHIFT;
@@ -2861,7 +2862,8 @@ int bnxt_re_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 			goto fail;
 		}
 		cq->qplib_cq.sg_info.sghead = cq->umem->sg_head.sgl;
-		cq->qplib_cq.sg_info.npages = ib_umem_num_pages(cq->umem);
+		cq->qplib_cq.sg_info.npages =
+			ib_umem_num_dma_blocks(cq->umem, PAGE_SIZE);
 		cq->qplib_cq.sg_info.nmap = cq->umem->nmap;
 		cq->qplib_cq.dpi = &uctx->dpi;
 	} else {
@@ -3759,23 +3761,6 @@ int bnxt_re_dealloc_mw(struct ib_mw *ib_mw)
 	return rc;
 }
 
-static int bnxt_re_page_size_ok(int page_shift)
-{
-	switch (page_shift) {
-	case CMDQ_REGISTER_MR_LOG2_PBL_PG_SIZE_PG_4K:
-	case CMDQ_REGISTER_MR_LOG2_PBL_PG_SIZE_PG_8K:
-	case CMDQ_REGISTER_MR_LOG2_PBL_PG_SIZE_PG_64K:
-	case CMDQ_REGISTER_MR_LOG2_PBL_PG_SIZE_PG_2M:
-	case CMDQ_REGISTER_MR_LOG2_PBL_PG_SIZE_PG_256K:
-	case CMDQ_REGISTER_MR_LOG2_PBL_PG_SIZE_PG_1M:
-	case CMDQ_REGISTER_MR_LOG2_PBL_PG_SIZE_PG_4M:
-	case CMDQ_REGISTER_MR_LOG2_PBL_PG_SIZE_PG_1G:
-		return 1;
-	default:
-		return 0;
-	}
-}
-
 static int fill_umem_pbl_tbl(struct ib_umem *umem, u64 *pbl_tbl_orig,
 			     int page_shift)
 {
@@ -3799,7 +3784,8 @@ struct ib_mr *bnxt_re_reg_user_mr(struct ib_pd *ib_pd, u64 start, u64 length,
 	struct bnxt_re_mr *mr;
 	struct ib_umem *umem;
 	u64 *pbl_tbl = NULL;
-	int umem_pgs, page_shift, rc;
+	unsigned long page_size;
+	int umem_pgs, rc;
 
 	if (length > BNXT_RE_MAX_MR_SIZE) {
 		ibdev_err(&rdev->ibdev, "MR Size: %lld > Max supported:%lld\n",
@@ -3833,42 +3819,34 @@ struct ib_mr *bnxt_re_reg_user_mr(struct ib_pd *ib_pd, u64 start, u64 length,
 	mr->ib_umem = umem;
 
 	mr->qplib_mr.va = virt_addr;
-	umem_pgs = ib_umem_page_count(umem);
-	if (!umem_pgs) {
-		ibdev_err(&rdev->ibdev, "umem is invalid!");
-		rc = -EINVAL;
-		goto free_umem;
-	}
-	mr->qplib_mr.total_size = length;
-
-	pbl_tbl = kcalloc(umem_pgs, sizeof(u64 *), GFP_KERNEL);
-	if (!pbl_tbl) {
-		rc = -ENOMEM;
-		goto free_umem;
-	}
-
-	page_shift = __ffs(ib_umem_find_best_pgsz(umem,
-				BNXT_RE_PAGE_SIZE_4K | BNXT_RE_PAGE_SIZE_2M,
-				virt_addr));
-
-	if (!bnxt_re_page_size_ok(page_shift)) {
+	page_size = ib_umem_find_best_pgsz(
+		umem, BNXT_RE_PAGE_SIZE_4K | BNXT_RE_PAGE_SIZE_2M, virt_addr);
+	if (!page_size) {
 		ibdev_err(&rdev->ibdev, "umem page size unsupported!");
 		rc = -EFAULT;
-		goto fail;
+		goto free_umem;
 	}
+	mr->qplib_mr.total_size = length;
 
-	if (page_shift == BNXT_RE_PAGE_SHIFT_4K &&
+	if (page_size == BNXT_RE_PAGE_SIZE_4K &&
 	    length > BNXT_RE_MAX_MR_SIZE_LOW) {
 		ibdev_err(&rdev->ibdev, "Requested MR Sz:%llu Max sup:%llu",
 			  length, (u64)BNXT_RE_MAX_MR_SIZE_LOW);
 		rc = -EINVAL;
-		goto fail;
+		goto free_umem;
+	}
+
+	umem_pgs = ib_umem_num_dma_blocks(umem, page_size);
+	pbl_tbl = kcalloc(umem_pgs, sizeof(u64 *), GFP_KERNEL);
+	if (!pbl_tbl) {
+		rc = -ENOMEM;
+		goto free_umem;
 	}
 
 	/* Map umem buf ptrs to the PBL */
-	umem_pgs = fill_umem_pbl_tbl(umem, pbl_tbl, page_shift);
+	umem_pgs = fill_umem_pbl_tbl(umem, pbl_tbl, order_base_2(page_size));
 	rc = bnxt_qplib_reg_mr(&rdev->qplib_res, &mr->qplib_mr, pbl_tbl,
-			       umem_pgs, false, 1 << page_shift);
+			       umem_pgs, false, page_size);
 	if (rc) {
 		ibdev_err(&rdev->ibdev, "Failed to register user MR");
 		goto fail;
-- 
2.28.0



* [PATCH 10/14] RDMA/hns: Use ib_umem_num_dma_blocks() instead of opencoding
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 09/14] RDMA/bnxt: Do not use ib_umem_page_count() or ib_umem_num_pages() Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 11/14] RDMA/ocrdma: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() Jason Gunthorpe
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Doug Ledford, Wei Hu(Xavier), linux-rdma, Weihang Li, Lijun Ou

mtr_umem_page_count() does the same thing; replace it with the core code.

Also, ib_umem_find_best_pgsz() should always be called to check that the
umem meets the page_size requirement. If there is a limited set of
page_sizes that work, the pgsz_bitmap should be set to that set. A return
of 0 is a failure and the umem cannot be used.

Lightly tidy the control flow to implement this flow properly.
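
A userspace illustration of the two bitmap shapes the new code hands to
ib_umem_find_best_pgsz() (PAGE_SHIFT assumed to be 12 and
buf_attr->page_shift assumed to be 16):

  #include <stdio.h>

  #define BITS_PER_LONG  (8 * sizeof(unsigned long))
  #define GENMASK(h, l)  (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
  #define PAGE_SHIFT     12

  int main(void)
  {
          unsigned int page_shift = 16;

          /* fixed_page: exactly one acceptable size */
          printf("fixed:    0x%lx\n", 1UL << page_shift);               /* 0x10000 */
          /* otherwise: every size from PAGE_SIZE up to the buffer page size */
          printf("flexible: 0x%lx\n", GENMASK(page_shift, PAGE_SHIFT)); /* 0x1f000 */
          return 0;
  }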

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/hns/hns_roce_mr.c | 49 ++++++++++---------------
 1 file changed, 19 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
index e5df3884b41dda..16699f6bb03a51 100644
--- a/drivers/infiniband/hw/hns/hns_roce_mr.c
+++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
@@ -707,19 +707,6 @@ static inline size_t mtr_bufs_size(struct hns_roce_buf_attr *attr)
 	return size;
 }
 
-static inline int mtr_umem_page_count(struct ib_umem *umem,
-				      unsigned int page_shift)
-{
-	int count = ib_umem_page_count(umem);
-
-	if (page_shift >= PAGE_SHIFT)
-		count >>= page_shift - PAGE_SHIFT;
-	else
-		count <<= PAGE_SHIFT - page_shift;
-
-	return count;
-}
-
 static inline size_t mtr_kmem_direct_size(bool is_direct, size_t alloc_size,
 					  unsigned int page_shift)
 {
@@ -767,12 +754,10 @@ static int mtr_alloc_bufs(struct hns_roce_dev *hr_dev, struct hns_roce_mtr *mtr,
 			  struct ib_udata *udata, unsigned long user_addr)
 {
 	struct ib_device *ibdev = &hr_dev->ib_dev;
-	unsigned int max_pg_shift = buf_attr->page_shift;
-	unsigned int best_pg_shift = 0;
+	unsigned int best_pg_shift;
 	int all_pg_count = 0;
 	size_t direct_size;
 	size_t total_size;
-	unsigned long tmp;
 	int ret = 0;
 
 	total_size = mtr_bufs_size(buf_attr);
@@ -782,6 +767,9 @@ static int mtr_alloc_bufs(struct hns_roce_dev *hr_dev, struct hns_roce_mtr *mtr,
 	}
 
 	if (udata) {
+		unsigned long pgsz_bitmap;
+		unsigned long page_size;
+
 		mtr->kmem = NULL;
 		mtr->umem = ib_umem_get(ibdev, user_addr, total_size,
 					buf_attr->user_access);
@@ -790,15 +778,17 @@ static int mtr_alloc_bufs(struct hns_roce_dev *hr_dev, struct hns_roce_mtr *mtr,
 				  PTR_ERR(mtr->umem));
 			return -ENOMEM;
 		}
-		if (buf_attr->fixed_page) {
-			best_pg_shift = max_pg_shift;
-		} else {
-			tmp = GENMASK(max_pg_shift, 0);
-			ret = ib_umem_find_best_pgsz(mtr->umem, tmp, user_addr);
-			best_pg_shift = (ret <= PAGE_SIZE) ?
-					PAGE_SHIFT : ilog2(ret);
-		}
-		all_pg_count = mtr_umem_page_count(mtr->umem, best_pg_shift);
+		if (buf_attr->fixed_page)
+			pgsz_bitmap = 1 << buf_attr->page_shift;
+		else
+			pgsz_bitmap = GENMASK(buf_attr->page_shift, PAGE_SHIFT);
+
+		page_size = ib_umem_find_best_pgsz(mtr->umem, pgsz_bitmap,
+						   user_addr);
+		if (!page_size)
+			return -EINVAL;
+		best_pg_shift = order_base_2(page_size);
+		all_pg_count = ib_umem_num_dma_blocks(mtr->umem, page_size);
 		ret = 0;
 	} else {
 		mtr->umem = NULL;
@@ -808,16 +798,15 @@ static int mtr_alloc_bufs(struct hns_roce_dev *hr_dev, struct hns_roce_mtr *mtr,
 			return -ENOMEM;
 		}
 		direct_size = mtr_kmem_direct_size(is_direct, total_size,
-						   max_pg_shift);
+						   buf_attr->page_shift);
 		ret = hns_roce_buf_alloc(hr_dev, total_size, direct_size,
-					 mtr->kmem, max_pg_shift);
+					 mtr->kmem, buf_attr->page_shift);
 		if (ret) {
 			ibdev_err(ibdev, "Failed to alloc kmem, ret %d\n", ret);
 			goto err_alloc_mem;
-		} else {
-			best_pg_shift = max_pg_shift;
-			all_pg_count = mtr->kmem->npages;
 		}
+		best_pg_shift = buf_attr->page_shift;
+		all_pg_count = mtr->kmem->npages;
 	}
 
 	/* must bigger than minimum hardware page shift */
-- 
2.28.0



* [PATCH 11/14] RDMA/ocrdma: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (9 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 10/14] RDMA/hns: Use ib_umem_num_dma_blocks() instead of opencoding Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 12/14] RDMA/pvrdma: " Jason Gunthorpe
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Devesh Sharma, Doug Ledford, linux-rdma, Selvin Xavier

This driver always uses a DMA array made up of PAGE_SIZE elements, so just
use ib_umem_num_dma_blocks().

Since rdma_umem_for_each_dma_block() always iterates exactly
ib_umem_num_dma_blocks() times, there is no need for the early exit check
in build_user_pbes(); delete it.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 933b297de2ba86..1fb8da6d613674 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -810,13 +810,12 @@ static int ocrdma_build_pbl_tbl(struct ocrdma_dev *dev, struct ocrdma_hw_mr *mr)
 	return status;
 }
 
-static void build_user_pbes(struct ocrdma_dev *dev, struct ocrdma_mr *mr,
-			    u32 num_pbes)
+static void build_user_pbes(struct ocrdma_dev *dev, struct ocrdma_mr *mr)
 {
 	struct ocrdma_pbe *pbe;
 	struct ib_block_iter biter;
 	struct ocrdma_pbl *pbl_tbl = mr->hwmr.pbl_table;
-	int pbe_cnt, total_num_pbes = 0;
+	int pbe_cnt;
 	u64 pg_addr;
 
 	if (!mr->hwmr.num_pbes)
@@ -831,13 +830,8 @@ static void build_user_pbes(struct ocrdma_dev *dev, struct ocrdma_mr *mr,
 		pbe->pa_lo = cpu_to_le32(pg_addr);
 		pbe->pa_hi = cpu_to_le32(upper_32_bits(pg_addr));
 		pbe_cnt += 1;
-		total_num_pbes += 1;
 		pbe++;
 
-		/* if done building pbes, issue the mbx cmd. */
-		if (total_num_pbes == num_pbes)
-			return;
-
 		/* if the given pbl is full storing the pbes,
 		 * move to next pbl.
 		 */
@@ -856,7 +850,6 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 	struct ocrdma_dev *dev = get_ocrdma_dev(ibpd->device);
 	struct ocrdma_mr *mr;
 	struct ocrdma_pd *pd;
-	u32 num_pbes;
 
 	pd = get_ocrdma_pd(ibpd);
 
@@ -871,8 +864,8 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 		status = -EFAULT;
 		goto umem_err;
 	}
-	num_pbes = ib_umem_page_count(mr->umem);
-	status = ocrdma_get_pbl_info(dev, mr, num_pbes);
+	status = ocrdma_get_pbl_info(
+		dev, mr, ib_umem_num_dma_blocks(mr->umem, PAGE_SIZE));
 	if (status)
 		goto umem_err;
 
@@ -888,7 +881,7 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 	status = ocrdma_build_pbl_tbl(dev, &mr->hwmr);
 	if (status)
 		goto umem_err;
-	build_user_pbes(dev, mr, num_pbes);
+	build_user_pbes(dev, mr);
 	status = ocrdma_reg_mr(dev, &mr->hwmr, pd->id, acc);
 	if (status)
 		goto mbx_err;
-- 
2.28.0



* [PATCH 12/14] RDMA/pvrdma: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (10 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 11/14] RDMA/ocrdma: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  0:43 ` [PATCH 13/14] RDMA/mlx4: Use ib_umem_num_dma_blocks() Jason Gunthorpe
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Adit Ranadive, Doug Ledford, linux-rdma, VMware PV-Drivers

This driver always uses PAGE_SIZE.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c  | 2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c  | 6 ++++--
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c
index 01cd122a8b692f..ad7dfababf1fe1 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c
@@ -142,7 +142,7 @@ int pvrdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 			goto err_cq;
 		}
 
-		npages = ib_umem_page_count(cq->umem);
+		npages = ib_umem_num_dma_blocks(cq->umem, PAGE_SIZE);
 	} else {
 		/* One extra page for shared ring state */
 		npages = 1 + (entries * sizeof(struct pvrdma_cqe) +
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
index 9a8f2a9507be07..8a385acf6f0c42 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
@@ -298,9 +298,11 @@ struct ib_qp *pvrdma_create_qp(struct ib_pd *pd,
 				goto err_qp;
 			}
 
-			qp->npages_send = ib_umem_page_count(qp->sumem);
+			qp->npages_send =
+				ib_umem_num_dma_blocks(qp->sumem, PAGE_SIZE);
 			if (!is_srq)
-				qp->npages_recv = ib_umem_page_count(qp->rumem);
+				qp->npages_recv = ib_umem_num_dma_blocks(
+					qp->rumem, PAGE_SIZE);
 			else
 				qp->npages_recv = 0;
 			qp->npages = qp->npages_send + qp->npages_recv;
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c
index f60a8e81bddddb..6fd843cff21e70 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c
@@ -152,7 +152,7 @@ int pvrdma_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init_attr,
 		goto err_srq;
 	}
 
-	srq->npages = ib_umem_page_count(srq->umem);
+	srq->npages = ib_umem_num_dma_blocks(srq->umem, PAGE_SIZE);
 
 	if (srq->npages < 0 || srq->npages > PVRDMA_PAGE_DIR_MAX_PAGES) {
 		dev_warn(&dev->pdev->dev,
-- 
2.28.0



* [PATCH 13/14] RDMA/mlx4: Use ib_umem_num_dma_blocks()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (11 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 12/14] RDMA/pvrdma: " Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  9:07   ` Gal Pressman
  2020-09-03 15:14   ` Saleem, Shiraz
  2020-09-02  0:43 ` [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset() Jason Gunthorpe
  2020-09-02  9:09 ` [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Gal Pressman
  14 siblings, 2 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma, Yishai Hadas

For the calls linked to mlx4_ib_umem_calc_optimal_mtt_size(), compute the
number of DMA pages directly based on the returned page_shift; the
ib_umem_page_count() value was only being used as a weird default if the
algorithm hit its lower limit.

All other places are just using it with PAGE_SIZE.

As this is the last call site, remove ib_umem_num_pages().

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/umem.c       | 12 ------------
 drivers/infiniband/hw/mlx4/cq.c      |  8 ++++----
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  3 +--
 drivers/infiniband/hw/mlx4/mr.c      | 14 ++++++--------
 drivers/infiniband/hw/mlx4/qp.c      | 17 +++++++++--------
 drivers/infiniband/hw/mlx4/srq.c     |  5 +++--
 include/rdma/ib_umem.h               |  2 --
 7 files changed, 23 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index f02e34cac59581..49d6ddc37b6fde 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -346,18 +346,6 @@ void ib_umem_release(struct ib_umem *umem)
 }
 EXPORT_SYMBOL(ib_umem_release);
 
-int ib_umem_page_count(struct ib_umem *umem)
-{
-	int i, n = 0;
-	struct scatterlist *sg;
-
-	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
-		n += sg_dma_len(sg) >> PAGE_SHIFT;
-
-	return n;
-}
-EXPORT_SYMBOL(ib_umem_page_count);
-
 /*
  * Copy from the given ib_umem's pages to the given buffer.
  *
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 8a3436994f8097..3de97f428dc63b 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -142,16 +142,16 @@ static int mlx4_ib_get_cq_umem(struct mlx4_ib_dev *dev, struct ib_udata *udata,
 	int err;
 	int cqe_size = dev->dev->caps.cqe_size;
 	int shift;
-	int n;
 
 	*umem = ib_umem_get(&dev->ib_dev, buf_addr, cqe * cqe_size,
 			    IB_ACCESS_LOCAL_WRITE);
 	if (IS_ERR(*umem))
 		return PTR_ERR(*umem);
 
-	n = ib_umem_page_count(*umem);
-	shift = mlx4_ib_umem_calc_optimal_mtt_size(*umem, 0, &n);
-	err = mlx4_mtt_init(dev->dev, n, shift, &buf->mtt);
+	shift = mlx4_ib_umem_calc_optimal_mtt_size(*umem, 0);
+	err = mlx4_mtt_init(dev->dev,
+			    ib_umem_num_dma_blocks(*umem, 1UL << shift), shift,
+			    &buf->mtt);
 
 	if (err)
 		goto err_buf;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index bcac8fc5031766..660955a11914e7 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -905,7 +905,6 @@ struct ib_rwq_ind_table
 			      struct ib_rwq_ind_table_init_attr *init_attr,
 			      struct ib_udata *udata);
 int mlx4_ib_destroy_rwq_ind_table(struct ib_rwq_ind_table *wq_ind_table);
-int mlx4_ib_umem_calc_optimal_mtt_size(struct ib_umem *umem, u64 start_va,
-				       int *num_of_mtts);
+int mlx4_ib_umem_calc_optimal_mtt_size(struct ib_umem *umem, u64 start_va);
 
 #endif /* MLX4_IB_H */
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 1d5ef0de12c950..19b2d3fbe81e03 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -254,8 +254,7 @@ int mlx4_ib_umem_write_mtt(struct mlx4_ib_dev *dev, struct mlx4_mtt *mtt,
  * middle already handled as part of mtt shift calculation for both their start
  * & end addresses.
  */
-int mlx4_ib_umem_calc_optimal_mtt_size(struct ib_umem *umem, u64 start_va,
-				       int *num_of_mtts)
+int mlx4_ib_umem_calc_optimal_mtt_size(struct ib_umem *umem, u64 start_va)
 {
 	u64 block_shift = MLX4_MAX_MTT_SHIFT;
 	u64 min_shift = PAGE_SHIFT;
@@ -353,7 +352,6 @@ int mlx4_ib_umem_calc_optimal_mtt_size(struct ib_umem *umem, u64 start_va,
 		pr_warn("misaligned total length detected (%llu, %llu)!",
 			total_len, block_shift);
 
-	*num_of_mtts = total_len >> block_shift;
 end:
 	if (block_shift < min_shift) {
 		/*
@@ -409,7 +407,6 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	struct mlx4_ib_mr *mr;
 	int shift;
 	int err;
-	int n;
 
 	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
 	if (!mr)
@@ -421,11 +418,12 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		goto err_free;
 	}
 
-	n = ib_umem_page_count(mr->umem);
-	shift = mlx4_ib_umem_calc_optimal_mtt_size(mr->umem, start, &n);
+	shift = mlx4_ib_umem_calc_optimal_mtt_size(mr->umem, start);
 
 	err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length,
-			    convert_access(access_flags), n, shift, &mr->mmr);
+			    convert_access(access_flags),
+			    ib_umem_num_dma_blocks(mr->umem, 1UL << shift),
+			    shift, &mr->mmr);
 	if (err)
 		goto err_umem;
 
@@ -511,7 +509,7 @@ int mlx4_ib_rereg_user_mr(struct ib_mr *mr, int flags,
 			mmr->umem = NULL;
 			goto release_mpt_entry;
 		}
-		n = ib_umem_page_count(mmr->umem);
+		n = ib_umem_num_dma_blocks(mmr->umem, PAGE_SIZE);
 		shift = PAGE_SHIFT;
 
 		err = mlx4_mr_rereg_mem_write(dev->dev, &mmr->mmr,
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 2975f350b9fd10..3113d9ca112771 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -869,7 +869,6 @@ static int create_rq(struct ib_pd *pd, struct ib_qp_init_attr *init_attr,
 	struct mlx4_ib_create_wq wq;
 	size_t copy_len;
 	int shift;
-	int n;
 
 	qp->mlx4_ib_qp_type = MLX4_IB_QPT_RAW_PACKET;
 
@@ -922,9 +921,10 @@ static int create_rq(struct ib_pd *pd, struct ib_qp_init_attr *init_attr,
 		goto err;
 	}
 
-	n = ib_umem_page_count(qp->umem);
-	shift = mlx4_ib_umem_calc_optimal_mtt_size(qp->umem, 0, &n);
-	err = mlx4_mtt_init(dev->dev, n, shift, &qp->mtt);
+	shift = mlx4_ib_umem_calc_optimal_mtt_size(qp->umem, 0);
+	err = mlx4_mtt_init(dev->dev,
+			    ib_umem_num_dma_blocks(qp->umem, 1UL << shift),
+			    shift, &qp->mtt);
 
 	if (err)
 		goto err_buf;
@@ -1077,7 +1077,6 @@ static int create_qp_common(struct ib_pd *pd, struct ib_qp_init_attr *init_attr,
 		struct mlx4_ib_create_qp ucmd;
 		size_t copy_len;
 		int shift;
-		int n;
 
 		copy_len = sizeof(struct mlx4_ib_create_qp);
 
@@ -1117,9 +1116,11 @@ static int create_qp_common(struct ib_pd *pd, struct ib_qp_init_attr *init_attr,
 			goto err;
 		}
 
-		n = ib_umem_page_count(qp->umem);
-		shift = mlx4_ib_umem_calc_optimal_mtt_size(qp->umem, 0, &n);
-		err = mlx4_mtt_init(dev->dev, n, shift, &qp->mtt);
+		shift = mlx4_ib_umem_calc_optimal_mtt_size(qp->umem, 0);
+		err = mlx4_mtt_init(dev->dev,
+				    ib_umem_num_dma_blocks(qp->umem,
+							   1UL << shift),
+				    shift, &qp->mtt);
 
 		if (err)
 			goto err_buf;
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 8f9d5035142d33..108b2d0118d064 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -115,8 +115,9 @@ int mlx4_ib_create_srq(struct ib_srq *ib_srq,
 		if (IS_ERR(srq->umem))
 			return PTR_ERR(srq->umem);
 
-		err = mlx4_mtt_init(dev->dev, ib_umem_page_count(srq->umem),
-				    PAGE_SHIFT, &srq->mtt);
+		err = mlx4_mtt_init(
+			dev->dev, ib_umem_num_dma_blocks(srq->umem, PAGE_SIZE),
+			PAGE_SHIFT, &srq->mtt);
 		if (err)
 			goto err_buf;
 
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index ba3b9be0d8c56a..4bac6e29f030c2 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -73,7 +73,6 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
 struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 			    size_t size, int access);
 void ib_umem_release(struct ib_umem *umem);
-int ib_umem_page_count(struct ib_umem *umem);
 int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
 		      size_t length);
 unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
@@ -91,7 +90,6 @@ static inline struct ib_umem *ib_umem_get(struct ib_device *device,
 	return ERR_PTR(-EINVAL);
 }
 static inline void ib_umem_release(struct ib_umem *umem) { }
-static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; }
 static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
 		      		    size_t length) {
 	return -EINVAL;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset()
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (12 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 13/14] RDMA/mlx5: Use ib_umem_num_dma_blocks() Jason Gunthorpe
@ 2020-09-02  0:43 ` Jason Gunthorpe
  2020-09-02  0:51   ` Zhu Yanjun
                     ` (2 more replies)
  2020-09-02  9:09 ` [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Gal Pressman
  14 siblings, 3 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02  0:43 UTC (permalink / raw)
  To: Ariel Elior, Dennis Dalessandro, Devesh Sharma, Doug Ledford,
	linux-rdma, Mike Marciniszyn, Michal Kalderon, Selvin Xavier,
	Zhu Yanjun

This function should be used to get the offset from the first DMA block.

The few places using this without a DMA iterator are calling it to work
around the lack of setting sgl->offset when the umem is created.
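
As a rough sketch of the intended pairing with the DMA block iterator
(illustrative only; fill_hw_page_entry() and program_hw_first_byte_offset()
are made-up placeholders, not real driver code):

static void example_fill_page_list(struct ib_umem *umem, unsigned long pg_sz)
{
	struct ib_block_iter biter;
	/* byte offset of the user VA inside the first pg_sz DMA block */
	int fbo = ib_umem_dma_offset(umem, pg_sz);

	rdma_umem_for_each_dma_block(umem, &biter, pg_sz)
		fill_hw_page_entry(rdma_block_iter_dma_address(&biter));

	program_hw_first_byte_offset(fbo);
}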

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/umem.c              | 2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
 drivers/infiniband/hw/qedr/verbs.c          | 2 +-
 drivers/infiniband/sw/rdmavt/mr.c           | 2 +-
 drivers/infiniband/sw/rxe/rxe_mr.c          | 2 +-
 include/rdma/ib_umem.h                      | 9 ++++++---
 6 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 49d6ddc37b6fde..c840115b8c0945 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -369,7 +369,7 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
 	}
 
 	ret = sg_pcopy_to_buffer(umem->sg_head.sgl, umem->sg_nents, dst, length,
-				 offset + ib_umem_offset(umem));
+				 offset + ib_umem_dma_offset(umem, PAGE_SIZE));
 
 	if (ret < 0)
 		return ret;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 1fb8da6d613674..f22532fbc364fe 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -870,7 +870,7 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 		goto umem_err;
 
 	mr->hwmr.pbe_size = PAGE_SIZE;
-	mr->hwmr.fbo = ib_umem_offset(mr->umem);
+	mr->hwmr.fbo = ib_umem_dma_offset(mr->umem, PAGE_SIZE);
 	mr->hwmr.va = usr_addr;
 	mr->hwmr.len = len;
 	mr->hwmr.remote_wr = (acc & IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 278b48443aedba..daac742e71044d 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -2878,7 +2878,7 @@ struct ib_mr *qedr_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 	mr->hw_mr.pbl_two_level = mr->info.pbl_info.two_layered;
 	mr->hw_mr.pbl_page_size_log = ilog2(mr->info.pbl_info.pbl_size);
 	mr->hw_mr.page_size_log = PAGE_SHIFT;
-	mr->hw_mr.fbo = ib_umem_offset(mr->umem);
+	mr->hw_mr.fbo = ib_umem_dma_offset(mr->umem, PAGE_SIZE);
 	mr->hw_mr.length = len;
 	mr->hw_mr.vaddr = usr_addr;
 	mr->hw_mr.zbva = false;
diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
index 2f7c25fea44a9d..04f7dc0ce9e44d 100644
--- a/drivers/infiniband/sw/rdmavt/mr.c
+++ b/drivers/infiniband/sw/rdmavt/mr.c
@@ -404,7 +404,7 @@ struct ib_mr *rvt_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	mr->mr.user_base = start;
 	mr->mr.iova = virt_addr;
 	mr->mr.length = length;
-	mr->mr.offset = ib_umem_offset(umem);
+	mr->mr.offset = ib_umem_dma_offset(umem, PAGE_SIZE);
 	mr->mr.access_flags = mr_access_flags;
 	mr->umem = umem;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 708e2dff5eaa70..8f60dc9dee8658 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -196,7 +196,7 @@ int rxe_mem_init_user(struct rxe_pd *pd, u64 start,
 	mem->length		= length;
 	mem->iova		= iova;
 	mem->va			= start;
-	mem->offset		= ib_umem_offset(umem);
+	mem->offset		= ib_umem_dma_offset(umem, PAGE_SIZE);
 	mem->state		= RXE_MEM_STATE_VALID;
 	mem->type		= RXE_MEM_TYPE_MR;
 
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 4bac6e29f030c2..5e709b2c251644 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -27,10 +27,13 @@ struct ib_umem {
 	unsigned int    sg_nents;
 };
 
-/* Returns the offset of the umem start relative to the first page. */
-static inline int ib_umem_offset(struct ib_umem *umem)
+/*
+ * Returns the offset of the umem start relative to the first DMA block returned
+ * by rdma_umem_for_each_dma_block().
+ */
+static inline int ib_umem_dma_offset(struct ib_umem *umem, unsigned long pgsz)
 {
-	return umem->address & ~PAGE_MASK;
+	return umem->address & (pgsz - 1);
 }
 
 static inline size_t ib_umem_num_dma_blocks(struct ib_umem *umem,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset()
  2020-09-02  0:43 ` [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset() Jason Gunthorpe
@ 2020-09-02  0:51   ` Zhu Yanjun
  2020-09-02 15:36   ` [EXT] " Michal Kalderon
  2020-09-03 18:48   ` Jason Gunthorpe
  2 siblings, 0 replies; 42+ messages in thread
From: Zhu Yanjun @ 2020-09-02  0:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ariel Elior, Dennis Dalessandro, Devesh Sharma, Doug Ledford,
	linux-rdma, Mike Marciniszyn, Michal Kalderon, Selvin Xavier,
	Zhu Yanjun

On Wed, Sep 2, 2020 at 8:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> This function should be used to get the offset from the first DMA block.
>
> The few places using this without a DMA iterator are calling it to work
> around the lack of setting sgl->offset when the umem is created.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c              | 2 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
>  drivers/infiniband/hw/qedr/verbs.c          | 2 +-
>  drivers/infiniband/sw/rdmavt/mr.c           | 2 +-
>  drivers/infiniband/sw/rxe/rxe_mr.c          | 2 +-
>  include/rdma/ib_umem.h                      | 9 ++++++---
>  6 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 49d6ddc37b6fde..c840115b8c0945 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -369,7 +369,7 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>         }
>
>         ret = sg_pcopy_to_buffer(umem->sg_head.sgl, umem->sg_nents, dst, length,
> -                                offset + ib_umem_offset(umem));
> +                                offset + ib_umem_dma_offset(umem, PAGE_SIZE));
>
>         if (ret < 0)
>                 return ret;
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> index 1fb8da6d613674..f22532fbc364fe 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> @@ -870,7 +870,7 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
>                 goto umem_err;
>
>         mr->hwmr.pbe_size = PAGE_SIZE;
> -       mr->hwmr.fbo = ib_umem_offset(mr->umem);
> +       mr->hwmr.fbo = ib_umem_dma_offset(mr->umem, PAGE_SIZE);
>         mr->hwmr.va = usr_addr;
>         mr->hwmr.len = len;
>         mr->hwmr.remote_wr = (acc & IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
> diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
> index 278b48443aedba..daac742e71044d 100644
> --- a/drivers/infiniband/hw/qedr/verbs.c
> +++ b/drivers/infiniband/hw/qedr/verbs.c
> @@ -2878,7 +2878,7 @@ struct ib_mr *qedr_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
>         mr->hw_mr.pbl_two_level = mr->info.pbl_info.two_layered;
>         mr->hw_mr.pbl_page_size_log = ilog2(mr->info.pbl_info.pbl_size);
>         mr->hw_mr.page_size_log = PAGE_SHIFT;
> -       mr->hw_mr.fbo = ib_umem_offset(mr->umem);
> +       mr->hw_mr.fbo = ib_umem_dma_offset(mr->umem, PAGE_SIZE);
>         mr->hw_mr.length = len;
>         mr->hw_mr.vaddr = usr_addr;
>         mr->hw_mr.zbva = false;
> diff --git a/drivers/infiniband/sw/rdmavt/mr.c b/drivers/infiniband/sw/rdmavt/mr.c
> index 2f7c25fea44a9d..04f7dc0ce9e44d 100644
> --- a/drivers/infiniband/sw/rdmavt/mr.c
> +++ b/drivers/infiniband/sw/rdmavt/mr.c
> @@ -404,7 +404,7 @@ struct ib_mr *rvt_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>         mr->mr.user_base = start;
>         mr->mr.iova = virt_addr;
>         mr->mr.length = length;
> -       mr->mr.offset = ib_umem_offset(umem);
> +       mr->mr.offset = ib_umem_dma_offset(umem, PAGE_SIZE);
>         mr->mr.access_flags = mr_access_flags;
>         mr->umem = umem;
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 708e2dff5eaa70..8f60dc9dee8658 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -196,7 +196,7 @@ int rxe_mem_init_user(struct rxe_pd *pd, u64 start,
>         mem->length             = length;
>         mem->iova               = iova;
>         mem->va                 = start;
> -       mem->offset             = ib_umem_offset(umem);
> +       mem->offset             = ib_umem_dma_offset(umem, PAGE_SIZE);

Thanks,
Zhu Yanjun

>         mem->state              = RXE_MEM_STATE_VALID;
>         mem->type               = RXE_MEM_TYPE_MR;
>
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 4bac6e29f030c2..5e709b2c251644 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -27,10 +27,13 @@ struct ib_umem {
>         unsigned int    sg_nents;
>  };
>
> -/* Returns the offset of the umem start relative to the first page. */
> -static inline int ib_umem_offset(struct ib_umem *umem)
> +/*
> + * Returns the offset of the umem start relative to the first DMA block returned
> + * by rdma_umem_for_each_dma_block().
> + */
> +static inline int ib_umem_dma_offset(struct ib_umem *umem, unsigned long pgsz)
>  {
> -       return umem->address & ~PAGE_MASK;
> +       return umem->address & (pgsz - 1);
>  }
>
>  static inline size_t ib_umem_num_dma_blocks(struct ib_umem *umem,
> --
> 2.28.0
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block()
  2020-09-02  0:43 ` [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block() Jason Gunthorpe
@ 2020-09-02  3:10   ` Miguel Ojeda
  2020-09-03 14:12   ` Saleem, Shiraz
  1 sibling, 0 replies; 42+ messages in thread
From: Miguel Ojeda @ 2020-09-02  3:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Devesh Sharma, Doug Ledford, Faisal Latif, Gal Pressman,
	Wei Hu(Xavier),
	linux-rdma, Weihang Li, Naresh Kumar PBS, Lijun Ou,
	Selvin Xavier, Shiraz Saleem, Yossi Leybovich, Somnath Kotur,
	Sriharsha Basavapatna

On Wed, Sep 2, 2020 at 2:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>  .clang-format                              |  1 +

Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 13/14] RDMA/mlx5: Use ib_umem_num_dma_blocks()
  2020-09-02  0:43 ` [PATCH 13/14] RDMA/mlx5: Use ib_umem_num_dma_blocks() Jason Gunthorpe
@ 2020-09-02  9:07   ` Gal Pressman
  2020-09-03 15:14   ` Saleem, Shiraz
  1 sibling, 0 replies; 42+ messages in thread
From: Gal Pressman @ 2020-09-02  9:07 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford, linux-rdma, Yishai Hadas

On 02/09/2020 3:43, Jason Gunthorpe wrote:
> For the calls linked to mlx4_ib_umem_calc_optimal_mtt_size() compute the
> number of DMA pages directly based on the returned page_shift. This was
> just being used as a weird default if the algorithm hits the lower limit.
> 
> All other places are just using it with PAGE_SIZE.
> 
> As this is the last call site, remove ib_umem_num_pages().
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

The subject line should be fixed to mlx4.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/14] RDMA: Improve use of umem in DMA drivers
  2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
                   ` (13 preceding siblings ...)
  2020-09-02  0:43 ` [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset() Jason Gunthorpe
@ 2020-09-02  9:09 ` Gal Pressman
  2020-09-02 12:00   ` Jason Gunthorpe
  14 siblings, 1 reply; 42+ messages in thread
From: Gal Pressman @ 2020-09-02  9:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Adit Ranadive, Ariel Elior, Potnuri Bharat Teja,
	Dennis Dalessandro, Devesh Sharma, Doug Ledford, Faisal Latif,
	Wei Hu(Xavier),
	Leon Romanovsky, linux-rdma, Weihang Li, Miguel Ojeda,
	Mike Marciniszyn, Michal Kalderon, Naresh Kumar PBS, Lijun Ou,
	VMware PV-Drivers, Selvin Xavier, Shiraz Saleem, Yossi Leybovich,
	Somnath Kotur, Sriharsha Basavapatna, Zhu Yanjun, Yishai Hadas

On 02/09/2020 3:43, Jason Gunthorpe wrote:
> This is the first series in an effort to modernize the umem usage in all
> the DMA drivers.

Can you please explain what are the next steps? What's the final goal?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary
  2020-09-02  0:43 ` [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary Jason Gunthorpe
@ 2020-09-02  9:24   ` Leon Romanovsky
  2020-09-03 14:11   ` Saleem, Shiraz
  2020-09-04 22:30   ` Jason Gunthorpe
  2 siblings, 0 replies; 42+ messages in thread
From: Leon Romanovsky @ 2020-09-02  9:24 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, linux-rdma, Shiraz Saleem

On Tue, Sep 01, 2020 at 09:43:29PM -0300, Jason Gunthorpe wrote:
> It is possible for a single SGL to span an aligned boundary, eg if the SGL
> is
>
>   61440 -> 90112
>
> Then the length is 28672, which currently limits the block size to
> 32k. With a 32k page size the two covering blocks will be:
>
>   32768->65536 and 65536->98304
>
> However, the correct answer is a 128K block size which will span the whole
> 28672 bytes in a single block.
>
> Instead of limiting based on length figure out which high IOVA bits don't
> change between the start and end addresses. That is the highest useful
> page size.
>
> Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
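
For reference, a small user-space sketch of the bit arithmetic in the quoted
hunk, using the 61440 -> 90112 example from the commit message (bits_per() is
mimicked here with the include/linux/log2.h meaning, i.e. the number of bits
needed to represent a non-zero value):

#include <stdio.h>
#include <stdint.h>

/* mimic of bits_per() for non-zero v: number of bits needed to represent v */
static unsigned int bits_per(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v >>= 1;
		n++;
	}
	return n;
}

int main(void)
{
	uint64_t addr = 61440, length = 28672;	/* the SGL spans 61440..90112 */
	uint64_t changing = (addr + length - 1) ^ addr;
	unsigned int shift = bits_per(changing);

	/* 1 << shift is the smallest block size that covers the whole range
	 * in a single block, i.e. the highest useful page size: 131072 (128K)
	 */
	printf("%llu\n", 1ULL << shift);
	return 0;
}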

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
  2020-09-02  0:43 ` [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz() Jason Gunthorpe
@ 2020-09-02 11:51   ` Leon Romanovsky
  2020-09-02 11:59     ` Jason Gunthorpe
  2020-09-03 14:11   ` Saleem, Shiraz
  1 sibling, 1 reply; 42+ messages in thread
From: Leon Romanovsky @ 2020-09-02 11:51 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, linux-rdma, Shiraz Saleem

On Tue, Sep 01, 2020 at 09:43:30PM -0300, Jason Gunthorpe wrote:
> rdma_for_each_block() makes assumptions about how the SGL is constructed
> that don't work if the block size is below the page size used to to build
> the SGL.
>
> The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> aligned and we don't encode the actual byte offset of the VA range inside
> the SGL using offset and length. So rdma_for_each_block() has no idea
> where the actual starting/ending point is to compute the first/last block
> boundary if the starting address should be within a SGL.
>
> Fixing the SGL construction turns out to be really hard, and will be the
> subject of other patches. For now block smaller pages.
>
> Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 120e98403c345d..7b5bc969e55630 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  	dma_addr_t mask;
>  	int i;
>
> +	/* rdma_for_each_block() has a bug if the page size is smaller than the
> +	 * page size used to build the umem. For now prevent smaller page sizes
> +	 * from being returned.
> +	 */
> +	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> +

Why do we care about such a case? Why can't we leave this check forever?

Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
  2020-09-02 11:51   ` Leon Romanovsky
@ 2020-09-02 11:59     ` Jason Gunthorpe
  2020-09-02 12:05       ` Leon Romanovsky
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02 11:59 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, linux-rdma, Shiraz Saleem

On Wed, Sep 02, 2020 at 02:51:19PM +0300, Leon Romanovsky wrote:
> On Tue, Sep 01, 2020 at 09:43:30PM -0300, Jason Gunthorpe wrote:
> > rdma_for_each_block() makes assumptions about how the SGL is constructed
> > that don't work if the block size is below the page size used to build
> > the SGL.
> >
> > The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> > aligned and we don't encode the actual byte offset of the VA range inside
> > the SGL using offset and length. So rdma_for_each_block() has no idea
> > where the actual starting/ending point is to compute the first/last block
> > boundary if the starting address should be within a SGL.
> >
> > Fixing the SGL construction turns out to be really hard, and will be the
> > subject of other patches. For now block smaller pages.
> >
> > Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >  drivers/infiniband/core/umem.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > index 120e98403c345d..7b5bc969e55630 100644
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> >  	dma_addr_t mask;
> >  	int i;
> >
> > +	/* rdma_for_each_block() has a bug if the page size is smaller than the
> > +	 * page size used to build the umem. For now prevent smaller page sizes
> > +	 * from being returned.
> > +	 */
> > +	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> > +
> 
> Why do we care about such a case? Why can't we leave this check forever?

If HW supports only, say 4k page size, and runs on a 64k page size
architecture it should be able to fragment into the native HW page
size.

The whole point of these APIs is to decouple the system and HW page
sizes.

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/14] RDMA: Improve use of umem in DMA drivers
  2020-09-02  9:09 ` [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Gal Pressman
@ 2020-09-02 12:00   ` Jason Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02 12:00 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Adit Ranadive, Ariel Elior, Potnuri Bharat Teja,
	Dennis Dalessandro, Devesh Sharma, Doug Ledford, Faisal Latif,
	Wei Hu(Xavier),
	Leon Romanovsky, linux-rdma, Weihang Li, Miguel Ojeda,
	Mike Marciniszyn, Michal Kalderon, Naresh Kumar PBS, Lijun Ou,
	VMware PV-Drivers, Selvin Xavier, Shiraz Saleem, Yossi Leybovich,
	Somnath Kotur, Sriharsha Basavapatna, Zhu Yanjun, Yishai Hadas

On Wed, Sep 02, 2020 at 12:09:41PM +0300, Gal Pressman wrote:
> On 02/09/2020 3:43, Jason Gunthorpe wrote:
> > This is the first series in an effort to modernize the umem usage in all
> > the DMA drivers.
> 
> Can you please explain what are the next steps? What's the final goal?

DMA drivers should not touch the umem sgl directly

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
  2020-09-02 11:59     ` Jason Gunthorpe
@ 2020-09-02 12:05       ` Leon Romanovsky
  2020-09-02 16:34         ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Leon Romanovsky @ 2020-09-02 12:05 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, linux-rdma, Shiraz Saleem

On Wed, Sep 02, 2020 at 08:59:12AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 02, 2020 at 02:51:19PM +0300, Leon Romanovsky wrote:
> > On Tue, Sep 01, 2020 at 09:43:30PM -0300, Jason Gunthorpe wrote:
> > > rdma_for_each_block() makes assumptions about how the SGL is constructed
> > > that don't work if the block size is below the page size used to build
> > > the SGL.
> > >
> > > The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> > > aligned and we don't encode the actual byte offset of the VA range inside
> > > the SGL using offset and length. So rdma_for_each_block() has no idea
> > > where the actual starting/ending point is to compute the first/last block
> > > boundary if the starting address should be within a SGL.
> > >
> > > Fixing the SGL construction turns out to be really hard, and will be the
> > > subject of other patches. For now block smaller pages.
> > >
> > > Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > >  drivers/infiniband/core/umem.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > > index 120e98403c345d..7b5bc969e55630 100644
> > > +++ b/drivers/infiniband/core/umem.c
> > > @@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > >  	dma_addr_t mask;
> > >  	int i;
> > >
> > > +	/* rdma_for_each_block() has a bug if the page size is smaller than the
> > > +	 * page size used to build the umem. For now prevent smaller page sizes
> > > +	 * from being returned.
> > > +	 */
> > > +	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> > > +
> >
> > > Why do we care about such a case? Why can't we leave this check forever?
>
> If HW supports only, say 4k page size, and runs on a 64k page size
> architecture it should be able to fragment into the native HW page
> size.
>
> The whole point of these APIs is to decouple the system and HW page
> sizes.

Right now you are preventing such combinations, but is this a real concern
for existing drivers?

Thanks

>
> Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [EXT] [PATCH 08/14] RDMA/qedr: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count()
  2020-09-02  0:43 ` [PATCH 08/14] RDMA/qedr: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() Jason Gunthorpe
@ 2020-09-02 15:35   ` Michal Kalderon
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Kalderon @ 2020-09-02 15:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Ariel Elior, Doug Ledford, linux-rdma

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 2, 2020 3:44 AM
> To: Ariel Elior <aelior@marvell.com>; Doug Ledford <dledford@redhat.com>;
> linux-rdma@vger.kernel.org; Michal Kalderon <mkalderon@marvell.com>
> Subject: [EXT] [PATCH 08/14] RDMA/qedr: Use
> ib_umem_num_dma_blocks() instead of ib_umem_page_count()
> 
> The length of the list populated by qedr_populate_pbls() should be
> calculated using ib_umem_num_blocks() with the same size/shift passed to
> qedr_populate_pbls().
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/hw/qedr/verbs.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/qedr/verbs.c
> b/drivers/infiniband/hw/qedr/verbs.c
> index cbb49168d9f7ed..278b48443aedba 100644
> --- a/drivers/infiniband/hw/qedr/verbs.c
> +++ b/drivers/infiniband/hw/qedr/verbs.c
> @@ -783,9 +783,7 @@ static inline int qedr_init_user_queue(struct ib_udata
> *udata,
>  		return PTR_ERR(q->umem);
>  	}
> 
> -	fw_pages = ib_umem_page_count(q->umem) <<
> -	    (PAGE_SHIFT - FW_PAGE_SHIFT);
> -
> +	fw_pages = ib_umem_num_dma_blocks(q->umem, 1 <<
> FW_PAGE_SHIFT);
>  	rc = qedr_prepare_pbl_tbl(dev, &q->pbl_info, fw_pages, 0);
>  	if (rc)
>  		goto err0;
> @@ -2852,7 +2850,8 @@ struct ib_mr *qedr_reg_user_mr(struct ib_pd
> *ibpd, u64 start, u64 len,
>  		goto err0;
>  	}
> 
> -	rc = init_mr_info(dev, &mr->info, ib_umem_page_count(mr-
> >umem), 1);
> +	rc = init_mr_info(dev, &mr->info,
> +			  ib_umem_num_dma_blocks(mr->umem,
> PAGE_SIZE), 1);
>  	if (rc)
>  		goto err1;
> 
> --
> 2.28.0

Thanks, 

Acked-by: Michal Kalderon <michal.kalderon@marvell.com>



^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [EXT] [PATCH 07/14] RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding
  2020-09-02  0:43 ` [PATCH 07/14] RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding Jason Gunthorpe
@ 2020-09-02 15:36   ` Michal Kalderon
  2020-09-02 18:44     ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Kalderon @ 2020-09-02 15:36 UTC (permalink / raw)
  To: Jason Gunthorpe, Ariel Elior, Doug Ledford, linux-rdma

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 2, 2020 3:44 AM
> This loop is splitting the DMA SGL into pg_shift sized pages, use the core code
> for this directly.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/hw/qedr/verbs.c | 41 ++++++++++++------------------
>  1 file changed, 16 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/qedr/verbs.c
> b/drivers/infiniband/hw/qedr/verbs.c
> index b49bef94637e50..cbb49168d9f7ed 100644
> --- a/drivers/infiniband/hw/qedr/verbs.c
> +++ b/drivers/infiniband/hw/qedr/verbs.c
> @@ -600,11 +600,9 @@ static void qedr_populate_pbls(struct qedr_dev
> *dev, struct ib_umem *umem,
>  			       struct qedr_pbl_info *pbl_info, u32 pg_shift)  {
>  	int pbe_cnt, total_num_pbes = 0;
> -	u32 fw_pg_cnt, fw_pg_per_umem_pg;
>  	struct qedr_pbl *pbl_tbl;
> -	struct sg_dma_page_iter sg_iter;
> +	struct ib_block_iter biter;
>  	struct regpair *pbe;
> -	u64 pg_addr;
> 
>  	if (!pbl_info->num_pbes)
>  		return;
> @@ -625,32 +623,25 @@ static void qedr_populate_pbls(struct qedr_dev
> *dev, struct ib_umem *umem,
> 
>  	pbe_cnt = 0;
> 
> -	fw_pg_per_umem_pg = BIT(PAGE_SHIFT - pg_shift);
> +	rdma_umem_for_each_dma_block (umem, &biter, BIT(pg_shift)) {
> +		u64 pg_addr = rdma_block_iter_dma_address(&biter);
> 
> -	for_each_sg_dma_page (umem->sg_head.sgl, &sg_iter, umem-
> >nmap, 0) {
> -		pg_addr = sg_page_iter_dma_address(&sg_iter);
> -		for (fw_pg_cnt = 0; fw_pg_cnt < fw_pg_per_umem_pg;) {
> -			pbe->lo = cpu_to_le32(pg_addr);
> -			pbe->hi = cpu_to_le32(upper_32_bits(pg_addr));
> +		pbe->lo = cpu_to_le32(pg_addr);
> +		pbe->hi = cpu_to_le32(upper_32_bits(pg_addr));
> 
> -			pg_addr += BIT(pg_shift);
> -			pbe_cnt++;
> -			total_num_pbes++;
> -			pbe++;
> +		pbe_cnt++;
> +		total_num_pbes++;
> +		pbe++;
> 
> -			if (total_num_pbes == pbl_info->num_pbes)
> -				return;
> +		if (total_num_pbes == pbl_info->num_pbes)
> +			return;
> 
> -			/* If the given pbl is full storing the pbes,
> -			 * move to next pbl.
> -			 */
> -			if (pbe_cnt == (pbl_info->pbl_size / sizeof(u64))) {
> -				pbl_tbl++;
> -				pbe = (struct regpair *)pbl_tbl->va;
> -				pbe_cnt = 0;
> -			}
> -
> -			fw_pg_cnt++;
> +		/* If the given pbl is full storing the pbes, move to next pbl.
> +		 */
> +		if (pbe_cnt == (pbl_info->pbl_size / sizeof(u64))) {
> +			pbl_tbl++;
> +			pbe = (struct regpair *)pbl_tbl->va;
> +			pbe_cnt = 0;
>  		}
>  	}
>  }
> --
> 2.28.0

Thanks,  looks good!

Acked-by: Michal Kalderon <michal.kalderon@marvell.com>



^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [EXT] [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset()
  2020-09-02  0:43 ` [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset() Jason Gunthorpe
  2020-09-02  0:51   ` Zhu Yanjun
@ 2020-09-02 15:36   ` Michal Kalderon
  2020-09-03 18:48   ` Jason Gunthorpe
  2 siblings, 0 replies; 42+ messages in thread
From: Michal Kalderon @ 2020-09-02 15:36 UTC (permalink / raw)
  To: Jason Gunthorpe, Ariel Elior, Dennis Dalessandro, Devesh Sharma,
	Doug Ledford, linux-rdma, Mike Marciniszyn, Selvin Xavier,
	Zhu Yanjun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 2, 2020 3:44 AM
> This function should be used to get the offset from the first DMA block.
> 
> The few places using this without a DMA iterator are calling it to work around
> the lack of setting sgl->offset when the umem is created.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c              | 2 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
>  drivers/infiniband/hw/qedr/verbs.c          | 2 +-
>  drivers/infiniband/sw/rdmavt/mr.c           | 2 +-
>  drivers/infiniband/sw/rxe/rxe_mr.c          | 2 +-
>  include/rdma/ib_umem.h                      | 9 ++++++---
>  6 files changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c
> b/drivers/infiniband/core/umem.c index 49d6ddc37b6fde..c840115b8c0945
> 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -369,7 +369,7 @@ int ib_umem_copy_from(void *dst, struct ib_umem
> *umem, size_t offset,
>  	}
> 
>  	ret = sg_pcopy_to_buffer(umem->sg_head.sgl, umem->sg_nents,
> dst, length,
> -				 offset + ib_umem_offset(umem));
> +				 offset + ib_umem_dma_offset(umem,
> PAGE_SIZE));
> 
>  	if (ret < 0)
>  		return ret;
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> index 1fb8da6d613674..f22532fbc364fe 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> @@ -870,7 +870,7 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd
> *ibpd, u64 start, u64 len,
>  		goto umem_err;
> 
>  	mr->hwmr.pbe_size = PAGE_SIZE;
> -	mr->hwmr.fbo = ib_umem_offset(mr->umem);
> +	mr->hwmr.fbo = ib_umem_dma_offset(mr->umem, PAGE_SIZE);
>  	mr->hwmr.va = usr_addr;
>  	mr->hwmr.len = len;
>  	mr->hwmr.remote_wr = (acc & IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
> diff --git a/drivers/infiniband/hw/qedr/verbs.c
> b/drivers/infiniband/hw/qedr/verbs.c
> index 278b48443aedba..daac742e71044d 100644
> --- a/drivers/infiniband/hw/qedr/verbs.c
> +++ b/drivers/infiniband/hw/qedr/verbs.c
> @@ -2878,7 +2878,7 @@ struct ib_mr *qedr_reg_user_mr(struct ib_pd
> *ibpd, u64 start, u64 len,
>  	mr->hw_mr.pbl_two_level = mr->info.pbl_info.two_layered;
>  	mr->hw_mr.pbl_page_size_log = ilog2(mr->info.pbl_info.pbl_size);
>  	mr->hw_mr.page_size_log = PAGE_SHIFT;
> -	mr->hw_mr.fbo = ib_umem_offset(mr->umem);
> +	mr->hw_mr.fbo = ib_umem_dma_offset(mr->umem, PAGE_SIZE);
>  	mr->hw_mr.length = len;
>  	mr->hw_mr.vaddr = usr_addr;
>  	mr->hw_mr.zbva = false;

Thanks, 

Acked-by: Michal Kalderon <michal.kalderon@marvell.com>


> diff --git a/drivers/infiniband/sw/rdmavt/mr.c
> b/drivers/infiniband/sw/rdmavt/mr.c
> index 2f7c25fea44a9d..04f7dc0ce9e44d 100644
> --- a/drivers/infiniband/sw/rdmavt/mr.c
> +++ b/drivers/infiniband/sw/rdmavt/mr.c
> @@ -404,7 +404,7 @@ struct ib_mr *rvt_reg_user_mr(struct ib_pd *pd, u64
> start, u64 length,
>  	mr->mr.user_base = start;
>  	mr->mr.iova = virt_addr;
>  	mr->mr.length = length;
> -	mr->mr.offset = ib_umem_offset(umem);
> +	mr->mr.offset = ib_umem_dma_offset(umem, PAGE_SIZE);
>  	mr->mr.access_flags = mr_access_flags;
>  	mr->umem = umem;
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c
> b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 708e2dff5eaa70..8f60dc9dee8658 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -196,7 +196,7 @@ int rxe_mem_init_user(struct rxe_pd *pd, u64 start,
>  	mem->length		= length;
>  	mem->iova		= iova;
>  	mem->va			= start;
> -	mem->offset		= ib_umem_offset(umem);
> +	mem->offset		= ib_umem_dma_offset(umem, PAGE_SIZE);
>  	mem->state		= RXE_MEM_STATE_VALID;
>  	mem->type		= RXE_MEM_TYPE_MR;
> 
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index
> 4bac6e29f030c2..5e709b2c251644 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -27,10 +27,13 @@ struct ib_umem {
>  	unsigned int    sg_nents;
>  };
> 
> -/* Returns the offset of the umem start relative to the first page. */ -static
> inline int ib_umem_offset(struct ib_umem *umem)
> +/*
> + * Returns the offset of the umem start relative to the first DMA block
> +returned
> + * by rdma_umem_for_each_dma_block().
> + */
> +static inline int ib_umem_dma_offset(struct ib_umem *umem, unsigned
> +long pgsz)
>  {
> -	return umem->address & ~PAGE_MASK;
> +	return umem->address & (pgsz - 1);
>  }
> 
>  static inline size_t ib_umem_num_dma_blocks(struct ib_umem *umem,
> --
> 2.28.0


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
  2020-09-02 12:05       ` Leon Romanovsky
@ 2020-09-02 16:34         ` Jason Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02 16:34 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, linux-rdma, Shiraz Saleem

On Wed, Sep 02, 2020 at 03:05:40PM +0300, Leon Romanovsky wrote:
> On Wed, Sep 02, 2020 at 08:59:12AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 02, 2020 at 02:51:19PM +0300, Leon Romanovsky wrote:
> > > On Tue, Sep 01, 2020 at 09:43:30PM -0300, Jason Gunthorpe wrote:
> > > > rdma_for_each_block() makes assumptions about how the SGL is constructed
> > > > that don't work if the block size is below the page size used to build
> > > > the SGL.
> > > >
> > > > The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> > > > aligned and we don't encode the actual byte offset of the VA range inside
> > > > the SGL using offset and length. So rdma_for_each_block() has no idea
> > > > where the actual starting/ending point is to compute the first/last block
> > > > boundary if the starting address should be within a SGL.
> > > >
> > > > Fixing the SGL construction turns out to be really hard, and will be the
> > > > subject of other patches. For now block smaller pages.
> > > >
> > > > Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > >  drivers/infiniband/core/umem.c | 6 ++++++
> > > >  1 file changed, 6 insertions(+)
> > > >
> > > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > > > index 120e98403c345d..7b5bc969e55630 100644
> > > > +++ b/drivers/infiniband/core/umem.c
> > > > @@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > > >  	dma_addr_t mask;
> > > >  	int i;
> > > >
> > > > +	/* rdma_for_each_block() has a bug if the page size is smaller than the
> > > > +	 * page size used to build the umem. For now prevent smaller page sizes
> > > > +	 * from being returned.
> > > > +	 */
> > > > +	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> > > > +
> > >
> > > Why do we care about such a case? Why can't we leave this check forever?
> >
> > If HW supports only, say 4k page size, and runs on a 64k page size
> > architecture it should be able to fragment into the native HW page
> > size.
> >
> > The whole point of these APIs is to decouple the system and HW page
> > sizes.
> 
> Right now you are preventing such combinations, but is this a real concern
> for existing drivers?

No, I didn't prevent anything, I've left those drivers just hardwired
to use PAGE_SHIFT/PAGE_SIZE.

Maybe they are broken and malfunction on 64k page size systems, maybe
the HW supports other pages sizes and they should call
ib_umem_find_best_pgsz(), I don't really know.

The fix is fairly trivial, but it can't be done until the drivers stop
touching umem->sgl - as it requires changing how the sgl is
constructed to match standard kernel expectations, which also breaks
all the drivers.

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [EXT] [PATCH 07/14] RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding
  2020-09-02 15:36   ` [EXT] " Michal Kalderon
@ 2020-09-02 18:44     ` Jason Gunthorpe
  2020-09-02 19:53       ` Michal Kalderon
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-02 18:44 UTC (permalink / raw)
  To: Michal Kalderon; +Cc: Ariel Elior, Doug Ledford, linux-rdma

On Wed, Sep 02, 2020 at 03:36:00PM +0000, Michal Kalderon wrote:
> > +		/* If the given pbl is full storing the pbes, move to next pbl.
> > +		 */
> > +		if (pbe_cnt == (pbl_info->pbl_size / sizeof(u64))) {
> > +			pbl_tbl++;
> > +			pbe = (struct regpair *)pbl_tbl->va;
> > +			pbe_cnt = 0;
> >  		}
> >  	}
> >  }
> 
> Thanks,  looks good!

After this series you should try adding ib_umem_find_best_pgsz() to
qedr, it looks pretty simple now..

Jason
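
A rough sketch of what that follow-up could look like in qedr_reg_user_mr()
(hypothetical and untested; the SZ_4K | SZ_2M | SZ_1G bitmap is a placeholder
for whatever page sizes the qedr HW/FW really advertises):

	unsigned long pg_sz;

	pg_sz = ib_umem_find_best_pgsz(mr->umem, SZ_4K | SZ_2M | SZ_1G, start);
	if (!pg_sz) {
		rc = -EINVAL;
		goto err1;
	}

	rc = init_mr_info(dev, &mr->info,
			  ib_umem_num_dma_blocks(mr->umem, pg_sz), 1);
	if (rc)
		goto err1;

	/* ... populate the PBLs with ilog2(pg_sz) instead of PAGE_SHIFT and
	 * report it via mr->hw_mr.page_size_log ...
	 */
	mr->hw_mr.page_size_log = ilog2(pg_sz);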

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [EXT] [PATCH 07/14] RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding
  2020-09-02 18:44     ` Jason Gunthorpe
@ 2020-09-02 19:53       ` Michal Kalderon
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Kalderon @ 2020-09-02 19:53 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Ariel Elior, Doug Ledford, linux-rdma

> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> owner@vger.kernel.org> On Behalf Of Jason Gunthorpe
> 
> On Wed, Sep 02, 2020 at 03:36:00PM +0000, Michal Kalderon wrote:
> > > +		/* If the given pbl is full storing the pbes, move to next pbl.
> > > +		 */
> > > +		if (pbe_cnt == (pbl_info->pbl_size / sizeof(u64))) {
> > > +			pbl_tbl++;
> > > +			pbe = (struct regpair *)pbl_tbl->va;
> > > +			pbe_cnt = 0;
> > >  		}
> > >  	}
> > >  }
> >
> > Thanks,  looks good!
> 
> After this series you should try adding ib_umem_find_best_pgsz() to qedr, it
> looks pretty simple now..
Sure, will take a look.
Thanks,
Michal

> 
> Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/14] RDMA/bnxt: Do not use ib_umem_page_count() or ib_umem_num_pages()
  2020-09-02  0:43 ` [PATCH 09/14] RDMA/bnxt: Do not use ib_umem_page_count() or ib_umem_num_pages() Jason Gunthorpe
@ 2020-09-03  4:41   ` Selvin Xavier
  0 siblings, 0 replies; 42+ messages in thread
From: Selvin Xavier @ 2020-09-03  4:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Devesh Sharma, Doug Ledford, linux-rdma, Naresh Kumar PBS,
	Somnath Kotur, Sriharsha Basavapatna, Shiraz Saleem

On Wed, Sep 2, 2020 at 6:14 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> ib_umem_page_count() returns the number of 4k entries required for a DMA
> map, but bnxt_re already computes a variable page size. The correct API to
> determine the size of the page table array is ib_umem_num_dma_blocks().
>
> Fix the overallocation of the page array in fill_umem_pbl_tbl() when
> working with larger page sizes by using the right function. Lightly
> re-organize this function to make it clearer.
>
> Replace the other calls to ib_umem_num_pages().
>
> Fixes: d85582517e91 ("RDMA/bnxt_re: Use core helpers to get aligned DMA address")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c | 70 ++++++++----------------
>  1 file changed, 24 insertions(+), 46 deletions(-)
>

Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>

Thanks
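
To make the overallocation mentioned in the quoted commit message concrete, a
tiny user-space sketch contrasting the two counts for a 2M-aligned, 2M-sized
MR (the helper below mimics the kernel arithmetic, it is not the kernel code):

#include <stdio.h>
#include <stdint.h>

#define SZ_4K	0x1000ULL
#define SZ_2M	0x200000ULL

/* blocks of size pgsz needed to cover [addr, addr + len) */
static uint64_t num_dma_blocks(uint64_t addr, uint64_t len, uint64_t pgsz)
{
	uint64_t first = addr & ~(pgsz - 1);			/* round start down */
	uint64_t last = (addr + len + pgsz - 1) & ~(pgsz - 1);	/* round end up */

	return (last - first) / pgsz;
}

int main(void)
{
	uint64_t addr = 4 * SZ_2M, len = SZ_2M;

	/* ib_umem_page_count() style: always counts 4k (PAGE_SIZE) units */
	printf("%llu\n", (unsigned long long)num_dma_blocks(addr, len, SZ_4K)); /* 512 */
	/* ib_umem_num_dma_blocks() with the page size the HW will really use */
	printf("%llu\n", (unsigned long long)num_dma_blocks(addr, len, SZ_2M)); /* 1 */
	return 0;
}

With 2M HW pages the page array only needs one entry, not the 512 that a
PAGE_SIZE-based count would allocate.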

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary
  2020-09-02  0:43 ` [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary Jason Gunthorpe
  2020-09-02  9:24   ` Leon Romanovsky
@ 2020-09-03 14:11   ` Saleem, Shiraz
  2020-09-04 22:30   ` Jason Gunthorpe
  2 siblings, 0 replies; 42+ messages in thread
From: Saleem, Shiraz @ 2020-09-03 14:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford, linux-rdma

> Subject: [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for
> mappings that cross a page boundary
> 
> It is possible for a single SGL to span an aligned boundary, eg if the SGL is
> 
>   61440 -> 90112
> 
> Then the length is 28672, which currently limits the block size to 32k. With a 32k
> page size the two covering blocks will be:
> 
>   32768->65536 and 65536->98304
> 
> However, the correct answer is a 128K block size which will span the whole
> 28672 bytes in a single block.
> 
> Instead of limiting based on length figure out which high IOVA bits don't change
> between the start and end addresses. That is the highest useful page size.
> 
> Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page
> size in an MR")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index
> 831bff8d52e547..120e98403c345d 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -156,8 +156,14 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem
> *umem,
>  		return 0;
> 
>  	va = virt;
> -	/* max page size not to exceed MR length */
> -	mask = roundup_pow_of_two(umem->length);
> +	/* The best result is the smallest page size that results in the minimum
> +	 * number of required pages. Compute the largest page size that could
> +	 * work based on VA address bits that don't change.
> +	 */
> +	mask = pgsz_bitmap &
> +	       GENMASK(BITS_PER_LONG - 1,
> +		       bits_per((umem->length - 1 + umem->address) ^
> +				umem->address));
>  	/* offset into first SGL */
>  	pgoff = umem->address & ~PAGE_MASK;
> 
> --

Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
  2020-09-02  0:43 ` [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz() Jason Gunthorpe
  2020-09-02 11:51   ` Leon Romanovsky
@ 2020-09-03 14:11   ` Saleem, Shiraz
  2020-09-03 14:17     ` Jason Gunthorpe
  1 sibling, 1 reply; 42+ messages in thread
From: Saleem, Shiraz @ 2020-09-03 14:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford, linux-rdma

> Subject: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by
> ib_umem_find_best_pgsz()
> 
> rdma_for_each_block() makes assumptions about how the SGL is constructed that
> don't work if the block size is below the page size used to build the SGL.
> 
> The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> aligned and we don't encode the actual byte offset of the VA range inside the SGL
> using offset and length. So rdma_for_each_block() has no idea where the actual
> starting/ending point is to compute the first/last block boundary if the starting
> address should be within a SGL.

Not sure if we are exposed today, i.e. rdma drivers working with block sizes smaller than the system page size?

Nevertheless it's a good find and looks like the right thing to do for now.

> 
> Fixing the SGL construction turns out to be really hard, and will be the subject of
> other patches. For now block smaller pages.
> 
> Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page
> size in an MR")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---

Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block()
  2020-09-02  0:43 ` [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block() Jason Gunthorpe
  2020-09-02  3:10   ` Miguel Ojeda
@ 2020-09-03 14:12   ` Saleem, Shiraz
  1 sibling, 0 replies; 42+ messages in thread
From: Saleem, Shiraz @ 2020-09-03 14:12 UTC (permalink / raw)
  To: Jason Gunthorpe, Devesh Sharma, Doug Ledford, Latif, Faisal,
	Gal Pressman, Wei Hu(Xavier),
	linux-rdma, Weihang Li, Miguel Ojeda, Naresh Kumar PBS, Lijun Ou,
	Selvin Xavier, Yossi Leybovich, Somnath Kotur,
	Sriharsha Basavapatna

> Subject: [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block()
> 
> This helper does the same as rdma_for_each_block(), except it works on a umem.
> This simplifies most of the call sites.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---

[...]

> diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> index b51339328a51ef..beb611b157bc8d 100644
> --- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> +++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> @@ -1320,8 +1320,7 @@ static void i40iw_copy_user_pgaddrs(struct i40iw_mr
> *iwmr,
>  	if (iwmr->type == IW_MEMREG_TYPE_QP)
>  		iwpbl->qp_mr.sq_page = sg_page(region->sg_head.sgl);
> 
> -	rdma_for_each_block(region->sg_head.sgl, &biter, region->nmap,
> -			    iwmr->page_size) {
> +	rdma_umem_for_each_dma_block(region, &biter, iwmr->page_size) {
>  		*pbl = rdma_block_iter_dma_address(&biter);
>  		pbl = i40iw_next_pbl_addr(pbl, &pinfo, &idx);
>  	}

Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>


[....]
> +static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
> +						struct ib_umem *umem,
> +						unsigned long pgsz)
> +{
> +	__rdma_block_iter_start(biter, umem->sg_head.sgl, umem->nmap, pgsz); }
> +
> +/**
> + * rdma_umem_for_each_dma_block - iterate over contiguous DMA blocks of
> +the umem
> + * @umem: umem to iterate over
> + * @pgsz: Page size to split the list into
> + *
> + * pgsz must be <= PAGE_SIZE or computed by ib_umem_find_best_pgsz().

>= ?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks()
  2020-09-02  0:43 ` [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks() Jason Gunthorpe
@ 2020-09-03 14:12   ` Saleem, Shiraz
  2020-09-03 14:14     ` Jason Gunthorpe
  2020-09-04 22:32   ` Jason Gunthorpe
  1 sibling, 1 reply; 42+ messages in thread
From: Saleem, Shiraz @ 2020-09-03 14:12 UTC (permalink / raw)
  To: Jason Gunthorpe, Adit Ranadive, Potnuri Bharat Teja,
	Doug Ledford, Leon Romanovsky, linux-rdma, VMware PV-Drivers

> Subject: [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into
> ib_umem_num_dma_blocks()
> 
> ib_num_pages() should only be used by things working with the SGL in CPU
> pages directly.
> 
> Drivers building DMA lists should use the new ib_num_dma_blocks() which returns
> the number of blocks rdma_umem_for_each_block() will return.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/hw/cxgb4/mem.c            |  2 +-
>  drivers/infiniband/hw/mlx5/mem.c             |  5 +++--
>  drivers/infiniband/hw/mthca/mthca_provider.c |  2 +-
> drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c |  2 +-
>  include/rdma/ib_umem.h                       | 14 +++++++++++---
>  5 files changed, 17 insertions(+), 8 deletions(-)
> 

Perhaps the one in the bnxt_re too?

https://elixir.bootlin.com/linux/v5.9-rc3/source/drivers/infiniband/hw/bnxt_re/qplib_res.c#L94


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks()
  2020-09-03 14:12   ` Saleem, Shiraz
@ 2020-09-03 14:14     ` Jason Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-03 14:14 UTC (permalink / raw)
  To: Saleem, Shiraz
  Cc: Adit Ranadive, Potnuri Bharat Teja, Doug Ledford,
	Leon Romanovsky, linux-rdma, VMware PV-Drivers

On Thu, Sep 03, 2020 at 02:12:10PM +0000, Saleem, Shiraz wrote:
> > Subject: [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into
> > ib_umem_num_dma_blocks()
> > 
> > ib_num_pages() should only be used by things working with the SGL in CPU
> > pages directly.
> > 
> > Drivers building DMA lists should use the new ib_num_dma_blocks() which returns
> > the number of blocks rdma_umem_for_each_block() will return.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >  drivers/infiniband/hw/cxgb4/mem.c            |  2 +-
> >  drivers/infiniband/hw/mlx5/mem.c             |  5 +++--
> >  drivers/infiniband/hw/mthca/mthca_provider.c |  2 +-
> > drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c |  2 +-
> >  include/rdma/ib_umem.h                       | 14 +++++++++++---
> >  5 files changed, 17 insertions(+), 8 deletions(-)
> > 
> 
> Perhaps the one in the bnxt_re too?
> 
> https://elixir.bootlin.com/linux/v5.9-rc3/source/drivers/infiniband/hw/bnxt_re/qplib_res.c#L94

Yes, that needs fixing, but it is not so easy. This patch is just for
the easy drivers..

Planning to do it separately as it is the 'easiest' of the remaining
'hard' conversions

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
  2020-09-03 14:11   ` Saleem, Shiraz
@ 2020-09-03 14:17     ` Jason Gunthorpe
  2020-09-03 14:18       ` Saleem, Shiraz
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-03 14:17 UTC (permalink / raw)
  To: Saleem, Shiraz; +Cc: Doug Ledford, linux-rdma

On Thu, Sep 03, 2020 at 02:11:37PM +0000, Saleem, Shiraz wrote:
> > Subject: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by
> > ib_umem_find_best_pgsz()
> > 
> > rdma_for_each_block() makes assumptions about how the SGL is constructed that
> > don't work if the block size is below the page size used to build the SGL.
> > 
> > The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> > aligned and we don't encode the actual byte offset of the VA range inside the SGL
> > using offset and length. So rdma_for_each_block() has no idea where the actual
> > starting/ending point is to compute the first/last block boundary if the starting
> > address should be within a SGL.
> 
> > Not sure if we are exposed today, i.e. rdma drivers working with
> > block sizes smaller than the system page size?

Yes, it could happen, here are some examples:

drivers/infiniband/hw/i40iw/i40iw_verbs.c:
              iwmr->page_size = ib_umem_find_best_pgsz(region, SZ_4K | SZ_2M,

drivers/infiniband/hw/bnxt_re/ib_verbs.c:
        page_shift = __ffs(ib_umem_find_best_pgsz(umem,
                                BNXT_RE_PAGE_SIZE_4K | BNXT_RE_PAGE_SIZE_2M,
                                virt_addr));

Eg that breaks on ARM with 16k or 64k page sizes.

Jason
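
A user-space sketch of what the clamp added in patch 02 does to such a
driver-supplied bitmap on a 64k-page kernel (PAGE_SHIFT == 16 is assumed; the
GENMASK mimic follows include/linux/bits.h):

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_LONG	64
#define GENMASK_ULL(h, l) \
	((~0ULL << (l)) & (~0ULL >> (BITS_PER_LONG - 1 - (h))))

int main(void)
{
	uint64_t pgsz_bitmap = (1ULL << 12) | (1ULL << 21);	/* SZ_4K | SZ_2M */
	unsigned int page_shift = 16;				/* 64k CPU pages */

	/* the clamp: drop candidates smaller than the system page size */
	pgsz_bitmap &= GENMASK_ULL(BITS_PER_LONG - 1, page_shift);

	printf("%#llx\n", (unsigned long long)pgsz_bitmap);	/* 0x200000 */
	return 0;
}

So on such a system these drivers would either get 2M (when the MR allows it)
or a zero result from ib_umem_find_best_pgsz(), instead of a 4k size that
rdma_for_each_block() cannot yet iterate correctly.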

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()
  2020-09-03 14:17     ` Jason Gunthorpe
@ 2020-09-03 14:18       ` Saleem, Shiraz
  0 siblings, 0 replies; 42+ messages in thread
From: Saleem, Shiraz @ 2020-09-03 14:18 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, linux-rdma

> Subject: Re: [PATCH 02/14] RDMA/umem: Prevent small pages from being
> returned by ib_umem_find_best_pgsz()
> 
> On Thu, Sep 03, 2020 at 02:11:37PM +0000, Saleem, Shiraz wrote:
> > > Subject: [PATCH 02/14] RDMA/umem: Prevent small pages from being
> > > returned by
> > > ib_umem_find_best_pgsz()
> > >
> > > rdma_for_each_block() makes assumptions about how the SGL is
> > > constructed that don't work if the block size is below the page size used to
> build the SGL.
> > >
> > > The rules for umem SGL construction require that the SG's all be
> > > PAGE_SIZE aligned and we don't encode the actual byte offset of the
> > > VA range inside the SGL using offset and length. So
> > > rdma_for_each_block() has no idea where the actual starting/ending
> > > point is to compute the first/last block boundary if the starting address should
> be within a SGL.
> >
> > Not sure if we are exposed today, i.e. rdma drivers working with
> > block sizes smaller than the system page size?
> 
> Yes, it could happen, here are some examples:
> 
> drivers/infiniband/hw/i40iw/i40iw_verbs.c:
>               iwmr->page_size = ib_umem_find_best_pgsz(region, SZ_4K | SZ_2M,
> 
> drivers/infiniband/hw/bnxt_re/ib_verbs.c:
>         page_shift = __ffs(ib_umem_find_best_pgsz(umem,
>                                 BNXT_RE_PAGE_SIZE_4K | BNXT_RE_PAGE_SIZE_2M,
>                                 virt_addr));
> 
> Eg that breaks on a ARM with 16k or 64k page sizes.
> 

Yes. Makes sense. Thanks for the patch!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 13/14] RDMA/mlx5: Use ib_umem_num_dma_blocks()
  2020-09-02  0:43 ` [PATCH 13/14] RDMA/mlx5: Use ib_umem_num_dma_blocks() Jason Gunthorpe
  2020-09-02  9:07   ` Gal Pressman
@ 2020-09-03 15:14   ` Saleem, Shiraz
  1 sibling, 0 replies; 42+ messages in thread
From: Saleem, Shiraz @ 2020-09-03 15:14 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford, linux-rdma, Yishai Hadas

> Subject: [PATCH 13/14] RDMA/mlx5: Use ib_umem_num_dma_blocks()
> 
> For the calls linked to mlx4_ib_umem_calc_optimal_mtt_size() compute the number
> of DMA pages directly based on the returned page_shift. This was just being used
> as a weird default if the algorithm hits the lower limit.
> 
> All other places are just using it with PAGE_SIZE.
> 
> As this is the last call site, remove ib_umem_num_pages().

Typo: that should be "remove ib_umem_page_count()".



* Re: [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset()
  2020-09-02  0:43 ` [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset() Jason Gunthorpe
  2020-09-02  0:51   ` Zhu Yanjun
  2020-09-02 15:36   ` [EXT] " Michal Kalderon
@ 2020-09-03 18:48   ` Jason Gunthorpe
  2 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-03 18:48 UTC (permalink / raw)
  To: Ariel Elior, Dennis Dalessandro, Devesh Sharma, Doug Ledford,
	linux-rdma, Mike Marciniszyn, Michal Kalderon, Selvin Xavier,
	Zhu Yanjun

On Tue, Sep 01, 2020 at 09:43:42PM -0300, Jason Gunthorpe wrote:
> +/*
> + * Returns the offset of the umem start relative to the first DMA block returned
> + * by rdma_umem_for_each_dma_block().
> + */
> +static inline int ib_umem_dma_offset(struct ib_umem *umem, unsigned long pgsz)
>  {
> -	return umem->address & ~PAGE_MASK;
> +	return umem->address & (pgsz - 1);
>  }

Actually, on further reflection this is wrong if pgsz > PAGE_SIZE, as
umem->address has nothing to do with the SGL, and certainly doesn't
correspond to anything in the DMA drivers. It should be the IOVA.
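
A userspace sketch, with hypothetical addresses, of how the VA-based
offset quoted above can disagree with an IOVA-based one once pgsz exceeds
PAGE_SIZE, e.g. for a zero-based MR where the device sees IOVA 0 while the
process VA is arbitrary; the DMA block layout follows the IOVA, so that is
the value the offset would need to come from:

#include <stdio.h>

int main(void)
{
	unsigned long pgsz = 1UL << 21;			/* 2M DMA blocks */
	unsigned long address = 0x7f2345601000UL;	/* process VA of the umem */
	unsigned long iova = 0;				/* zero-based MR */

	/* prints 0x1000: offset of the VA within its 2M block */
	printf("VA-based offset:   %#lx\n", address & (pgsz - 1));
	/* prints 0: the device's first DMA block starts exactly at the MR */
	printf("IOVA-based offset: %#lx\n", iova & (pgsz - 1));
	return 0;
}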

I'm going to drop this patch. It looks like the only two drivers that
use this should be using IOVA anyhow:

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h b/drivers/infiniband/hw/ocrdma/ocrdma.h
index fcfe0e82197a24..5eb61c1100900d 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -185,7 +185,6 @@ struct ocrdma_hw_mr {
 	u32 num_pbes;
 	u32 pbl_size;
 	u32 pbe_size;
-	u64 fbo;
 	u64 va;
 };
 
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index e07bf0b2209a4c..18ed658f8dba10 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -1962,6 +1962,7 @@ static int ocrdma_mbx_reg_mr(struct ocrdma_dev *dev, struct ocrdma_hw_mr *hwmr,
 	int i;
 	struct ocrdma_reg_nsmr *cmd;
 	struct ocrdma_reg_nsmr_rsp *rsp;
+	u64 fbo = hwmr->va & (hwmr->pbe_size - 1);
 
 	cmd = ocrdma_init_emb_mqe(OCRDMA_CMD_REGISTER_NSMR, sizeof(*cmd));
 	if (!cmd)
@@ -1987,8 +1988,8 @@ static int ocrdma_mbx_reg_mr(struct ocrdma_dev *dev, struct ocrdma_hw_mr *hwmr,
 					OCRDMA_REG_NSMR_HPAGE_SIZE_SHIFT;
 	cmd->totlen_low = hwmr->len;
 	cmd->totlen_high = upper_32_bits(hwmr->len);
-	cmd->fbo_low = (u32) (hwmr->fbo & 0xffffffff);
-	cmd->fbo_high = (u32) upper_32_bits(hwmr->fbo);
+	cmd->fbo_low = (u32) (fbo & 0xffffffff);
+	cmd->fbo_high = (u32) upper_32_bits(fbo);
 	cmd->va_loaddr = (u32) hwmr->va;
 	cmd->va_hiaddr = (u32) upper_32_bits(hwmr->va);
 
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index c1751c9a0f625c..354814b2978d51 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -878,7 +878,6 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 		goto umem_err;
 
 	mr->hwmr.pbe_size = PAGE_SIZE;
-	mr->hwmr.fbo = ib_umem_offset(mr->umem);
 	mr->hwmr.va = usr_addr;
 	mr->hwmr.len = len;
 	mr->hwmr.remote_wr = (acc & IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 0bdfa300865d57..b7ef05618993c4 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -2888,10 +2888,8 @@ struct ib_mr *qedr_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
 	mr->hw_mr.pbl_two_level = mr->info.pbl_info.two_layered;
 	mr->hw_mr.pbl_page_size_log = ilog2(mr->info.pbl_info.pbl_size);
 	mr->hw_mr.page_size_log = PAGE_SHIFT;
-	mr->hw_mr.fbo = ib_umem_offset(mr->umem);
 	mr->hw_mr.length = len;
 	mr->hw_mr.vaddr = usr_addr;
-	mr->hw_mr.zbva = false;
 	mr->hw_mr.phy_mr = false;
 	mr->hw_mr.dma_mr = false;
 
@@ -2984,10 +2982,8 @@ static struct qedr_mr *__qedr_alloc_mr(struct ib_pd *ibpd,
 	mr->hw_mr.pbl_ptr = 0;
 	mr->hw_mr.pbl_two_level = mr->info.pbl_info.two_layered;
 	mr->hw_mr.pbl_page_size_log = ilog2(mr->info.pbl_info.pbl_size);
-	mr->hw_mr.fbo = 0;
 	mr->hw_mr.length = 0;
 	mr->hw_mr.vaddr = 0;
-	mr->hw_mr.zbva = false;
 	mr->hw_mr.phy_mr = true;
 	mr->hw_mr.dma_mr = false;
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_rdma.c b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
index 03894584415df7..0df6e058775292 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
@@ -1521,7 +1521,7 @@ qed_rdma_register_tid(void *rdma_cxt,
 		  params->pbl_two_level);
 
 	SET_FIELD(flags, RDMA_REGISTER_TID_RAMROD_DATA_ZERO_BASED,
-		  params->zbva);
+		  false);
 
 	SET_FIELD(flags, RDMA_REGISTER_TID_RAMROD_DATA_PHY_MR, params->phy_mr);
 
@@ -1583,15 +1583,7 @@ qed_rdma_register_tid(void *rdma_cxt,
 	p_ramrod->pd = cpu_to_le16(params->pd);
 	p_ramrod->length_hi = (u8)(params->length >> 32);
 	p_ramrod->length_lo = DMA_LO_LE(params->length);
-	if (params->zbva) {
-		/* Lower 32 bits of the registered MR address.
-		 * In case of zero based MR, will hold FBO
-		 */
-		p_ramrod->va.hi = 0;
-		p_ramrod->va.lo = cpu_to_le32(params->fbo);
-	} else {
-		DMA_REGPAIR_LE(p_ramrod->va, params->vaddr);
-	}
+	DMA_REGPAIR_LE(p_ramrod->va, params->vaddr);
 	DMA_REGPAIR_LE(p_ramrod->pbl_base, params->pbl_ptr);
 
 	/* DIF */
diff --git a/include/linux/qed/qed_rdma_if.h b/include/linux/qed/qed_rdma_if.h
index f464d85e88a410..aeb242cefebfa8 100644
--- a/include/linux/qed/qed_rdma_if.h
+++ b/include/linux/qed/qed_rdma_if.h
@@ -242,10 +242,8 @@ struct qed_rdma_register_tid_in_params {
 	bool pbl_two_level;
 	u8 pbl_page_size_log;
 	u8 page_size_log;
-	u32 fbo;
 	u64 length;
 	u64 vaddr;
-	bool zbva;
 	bool phy_mr;
 	bool dma_mr;
 

* Re: [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary
  2020-09-02  0:43 ` [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary Jason Gunthorpe
  2020-09-02  9:24   ` Leon Romanovsky
  2020-09-03 14:11   ` Saleem, Shiraz
@ 2020-09-04 22:30   ` Jason Gunthorpe
  2 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 22:30 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma; +Cc: Shiraz Saleem

On Tue, Sep 01, 2020 at 09:43:29PM -0300, Jason Gunthorpe wrote:
> It is possible for a single SGL to span an aligned boundary, e.g. if the SGL
> is
> 
>   61440 -> 90112
> 
> Then the length is 28672, which currently limits the block size to
> 32k. With a 32k page size the two covering blocks will be:
> 
>   32768->65536 and 65536->98304
> 
> However, the correct answer is a 128K block size which will span the whole
> 28672 bytes in a single block.
> 
> Instead of limiting based on length, figure out which high IOVA bits don't
> change between the start and end addresses. Those determine the highest
> useful page size.
> 
> Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 831bff8d52e547..120e98403c345d 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -156,8 +156,14 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  		return 0;
>  
>  	va = virt;
> -	/* max page size not to exceed MR length */
> -	mask = roundup_pow_of_two(umem->length);
> +	/* The best result is the smallest page size that results in the minimum
> +	 * number of required pages. Compute the largest page size that could
> +	 * work based on VA address bits that don't change.
> +	 */
> +	mask = pgsz_bitmap &
> +	       GENMASK(BITS_PER_LONG - 1,
> +		       bits_per((umem->length - 1 + umem->address) ^
> +				umem->address));

The use of umem->address is incorrect here as well; it should be virt.

All places on the DMA side that touch umem->address are wrong.

Jason
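
A quick userspace check of the arithmetic in the quoted hunk, using the
61440 -> 90112 range from the commit message; bits_per() below is a local
stand-in for the kernel helper of the same name:

#include <stdio.h>

static unsigned int bits_per(unsigned long n)
{
	unsigned int bits = 0;

	while (n) {
		bits++;
		n >>= 1;
	}
	return bits;
}

int main(void)
{
	unsigned long virt = 61440, length = 28672;
	unsigned long last = virt + length - 1;
	/* smallest power-of-two block size that keeps the whole range in one block */
	unsigned long pgsz = 1UL << bits_per(last ^ virt);
	unsigned long old_limit = 32768;	/* roundup_pow_of_two(length), the old cap */

	/* prints 131072: start and end land in the same (first) block */
	printf("%lu-byte blocks: start in block %lu, end in block %lu\n",
	       pgsz, virt / pgsz, last / pgsz);
	/* prints blocks 1 and 2: the old 32k limit needs two covering blocks */
	printf("%lu-byte blocks: start in block %lu, end in block %lu\n",
	       old_limit, virt / old_limit, last / old_limit);
	return 0;
}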

* Re: [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks()
  2020-09-02  0:43 ` [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks() Jason Gunthorpe
  2020-09-03 14:12   ` Saleem, Shiraz
@ 2020-09-04 22:32   ` Jason Gunthorpe
  1 sibling, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 22:32 UTC (permalink / raw)
  To: Adit Ranadive, Potnuri Bharat Teja, Doug Ledford,
	Leon Romanovsky, linux-rdma, VMware PV-Drivers

On Tue, Sep 01, 2020 at 09:43:34PM -0300, Jason Gunthorpe wrote:
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index b880512ba95f16..ba3b9be0d8c56a 100644
> +++ b/include/rdma/ib_umem.h
> @@ -33,11 +33,17 @@ static inline int ib_umem_offset(struct ib_umem *umem)
>  	return umem->address & ~PAGE_MASK;
>  }
>  
> +static inline size_t ib_umem_num_dma_blocks(struct ib_umem *umem,
> +					    unsigned long pgsz)
> +{
> +	return (ALIGN(umem->address + umem->length, pgsz) -
> +		ALIGN_DOWN(umem->address, pgsz)) /
> +	       pgsz;
> +}

Like the other places, this is wrong as well. The IOVA needs to be
used here or the result can be +/-1 page away from what
rdma_umem_for_each_dma_block() returns.

However, it does work if pgsz is PAGE_SIZE, or for the very common
case where umem->address == IOVA...

Which, I suppose, is why nobody noticed this until now, as I found
several drivers open-coding the above.

Jason
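
A userspace sketch, with hypothetical numbers, of the +/-1 discrepancy
described above: counting DMA blocks from umem->address versus from the
IOVA the device will actually use, once pgsz is larger than PAGE_SIZE;
the ALIGN() helpers are local stand-ins for the kernel macros:

#include <stdio.h>

#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

static unsigned long num_blocks(unsigned long start, unsigned long len,
				unsigned long pgsz)
{
	return (ALIGN(start + len, pgsz) - ALIGN_DOWN(start, pgsz)) / pgsz;
}

int main(void)
{
	unsigned long pgsz = 1UL << 16;		/* 64K DMA block size */
	unsigned long len = 1UL << 16;		/* 64K long MR */
	unsigned long address = 0x1000;		/* process VA of the umem start */
	unsigned long iova = 0;			/* e.g. a zero-based MR */

	/* prints 2: [0x1000, 0x11000) spans two 64K-aligned blocks */
	printf("from umem->address: %lu blocks\n", num_blocks(address, len, pgsz));
	/* prints 1: the block layout the iterator produces follows the IOVA */
	printf("from the IOVA:      %lu blocks\n", num_blocks(iova, len, pgsz));
	return 0;
}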

Thread overview: 42+ messages
2020-09-02  0:43 [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Jason Gunthorpe
2020-09-02  0:43 ` [PATCH 01/14] RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary Jason Gunthorpe
2020-09-02  9:24   ` Leon Romanovsky
2020-09-03 14:11   ` Saleem, Shiraz
2020-09-04 22:30   ` Jason Gunthorpe
2020-09-02  0:43 ` [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz() Jason Gunthorpe
2020-09-02 11:51   ` Leon Romanovsky
2020-09-02 11:59     ` Jason Gunthorpe
2020-09-02 12:05       ` Leon Romanovsky
2020-09-02 16:34         ` Jason Gunthorpe
2020-09-03 14:11   ` Saleem, Shiraz
2020-09-03 14:17     ` Jason Gunthorpe
2020-09-03 14:18       ` Saleem, Shiraz
2020-09-02  0:43 ` [PATCH 03/14] RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz() Jason Gunthorpe
2020-09-02  0:43 ` [PATCH 04/14] RDMA/umem: Add rdma_umem_for_each_dma_block() Jason Gunthorpe
2020-09-02  3:10   ` Miguel Ojeda
2020-09-03 14:12   ` Saleem, Shiraz
2020-09-02  0:43 ` [PATCH 05/14] RDMA/umem: Replace for_each_sg_dma_page with rdma_umem_for_each_dma_block Jason Gunthorpe
2020-09-02  0:43 ` [PATCH 06/14] RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks() Jason Gunthorpe
2020-09-03 14:12   ` Saleem, Shiraz
2020-09-03 14:14     ` Jason Gunthorpe
2020-09-04 22:32   ` Jason Gunthorpe
2020-09-02  0:43 ` [PATCH 07/14] RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding Jason Gunthorpe
2020-09-02 15:36   ` [EXT] " Michal Kalderon
2020-09-02 18:44     ` Jason Gunthorpe
2020-09-02 19:53       ` Michal Kalderon
2020-09-02  0:43 ` [PATCH 08/14] RDMA/qedr: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() Jason Gunthorpe
2020-09-02 15:35   ` [EXT] " Michal Kalderon
2020-09-02  0:43 ` [PATCH 09/14] RDMA/bnxt: Do not use ib_umem_page_count() or ib_umem_num_pages() Jason Gunthorpe
2020-09-03  4:41   ` Selvin Xavier
2020-09-02  0:43 ` [PATCH 10/14] RDMA/hns: Use ib_umem_num_dma_blocks() instead of opencoding Jason Gunthorpe
2020-09-02  0:43 ` [PATCH 11/14] RDMA/ocrdma: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() Jason Gunthorpe
2020-09-02  0:43 ` [PATCH 12/14] RDMA/pvrdma: " Jason Gunthorpe
2020-09-02  0:43 ` [PATCH 13/14] RDMA/mlx5: Use ib_umem_num_dma_blocks() Jason Gunthorpe
2020-09-02  9:07   ` Gal Pressman
2020-09-03 15:14   ` Saleem, Shiraz
2020-09-02  0:43 ` [PATCH 14/14] RDMA/umem: Rename ib_umem_offset() to ib_umem_dma_offset() Jason Gunthorpe
2020-09-02  0:51   ` Zhu Yanjun
2020-09-02 15:36   ` [EXT] " Michal Kalderon
2020-09-03 18:48   ` Jason Gunthorpe
2020-09-02  9:09 ` [PATCH 00/14] RDMA: Improve use of umem in DMA drivers Gal Pressman
2020-09-02 12:00   ` Jason Gunthorpe
