* [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function
@ 2020-09-03 12:18 Leon Romanovsky
  2020-09-03 12:18 ` [PATCH rdma-next 2/4] lib/scatterlist: Add support in dynamically allocation of SG entries Leon Romanovsky
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Leon Romanovsky @ 2020-09-03 12:18 UTC (permalink / raw)
  To: Christoph Hellwig, Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, linux-kernel, linux-rdma, Maor Gottlieb

From: Leon Romanovsky <leonro@nvidia.com>

From Maor:

This series adds a new constructor for a scatter gather table. Like the
sg_alloc_table_from_pages function, this function merges all contiguous
chunks of pages into a single scatter gather entry.

In contrast to sg_alloc_table_from_pages, the new API allows chaining of
new pages to an already initialized SG table.

This allows drivers to utilize the optimization of merging contiguous
pages without needing to pre-allocate all the pages and hold them in a
very large temporary buffer prior to SG table initialization.

The first two patches refactor the code of sg_alloc_table_from_pages
in order to share code, and add the sg_alloc_next function to allow
dynamic allocation of more entries in the SG table.

The third patch introduces the new API.

The last patch changes the Infiniband driver to use the new API. It
removes duplicated functionality from the code and benefits from the
optimization of dynamically allocating the SG table from pages.

On a huge-page system with a 2MB page size, without this change the SG
table would contain 512x more SG entries.
E.g. for a 100GB memory registration:

             Number of entries      Size
    Before        26214400          600.0MB
    After            51200            1.2MB
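
For reference, the numbers above follow directly from the page and chunk
counts (a quick sanity check, assuming roughly 24 bytes per scatterlist
entry on this configuration):

    Before: 100GB / 4KB pages  = 26214400 entries * ~24B ~= 600MB
    After:  100GB / 2MB chunks =    51200 entries * ~24B ~=  1.2MB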

Thanks

Maor Gottlieb (4):
  lib/scatterlist: Refactor sg_alloc_table_from_pages
  lib/scatterlist: Add support in dynamically allocation of SG entries
  lib/scatterlist: Add support in dynamic allocation of SG table from
    pages
  RDMA/umem: Move to allocate SG table from pages

 drivers/infiniband/core/umem.c |  93 ++--------
 include/linux/scatterlist.h    |  39 +++--
 lib/scatterlist.c              | 302 +++++++++++++++++++++++++--------
 3 files changed, 271 insertions(+), 163 deletions(-)

--
2.26.2



* [PATCH rdma-next 2/4] lib/scatterlist: Add support in dynamically allocation of SG entries
  2020-09-03 12:18 [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function Leon Romanovsky
@ 2020-09-03 12:18 ` Leon Romanovsky
  2020-09-07  7:29   ` Christoph Hellwig
  2020-09-03 12:18 ` [PATCH rdma-next 3/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages Leon Romanovsky
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Leon Romanovsky @ 2020-09-03 12:18 UTC (permalink / raw)
  To: Christoph Hellwig, Doug Ledford, Jason Gunthorpe
  Cc: Maor Gottlieb, linux-kernel, linux-rdma

From: Maor Gottlieb <maorg@nvidia.com>

In order to support dynamic allocation of the SG table, this patch
introduces sg_alloc_next. This function should be called to add more
entries to an already allocated table. In order to share code, do the
following:
 * Extract the allocation code from __sg_alloc_table into sg_alloc.
 * Add a function to chain an SGE to the next page of entries (see the
   illustrative sketch below).
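
A purely illustrative sketch (not part of the patch) of the chaining
semantics this helper relies on: the entry that carries the link stops
being a data entry, which is why chaining through an already used slot
costs one entry.

	struct scatterlist a[4], b[4];

	sg_init_table(a, 4);		/* a[3] carries the termination bit */
	sg_init_table(b, 4);
	_sg_chain(&a[3], b);		/* a[3] now links to b, END bit cleared */

	/* sg_is_chain(&a[3]) is true; sg_next(&a[2]) returns &b[0] */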

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/scatterlist.h |  29 ++++++----
 lib/scatterlist.c           | 110 ++++++++++++++++++++++++------------
 2 files changed, 91 insertions(+), 48 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 45cf7b69d852..877d6e160b06 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -165,6 +165,22 @@ static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
 #define for_each_sgtable_dma_sg(sgt, sg, i)	\
 	for_each_sg((sgt)->sgl, sg, (sgt)->nents, i)

+static inline void _sg_chain(struct scatterlist *chain_sg,
+			     struct scatterlist *sgl)
+{
+	/*
+	 * offset and length are unused for chain entry. Clear them.
+	 */
+	chain_sg->offset = 0;
+	chain_sg->length = 0;
+
+	/*
+	 * Set lowest bit to indicate a link pointer, and make sure to clear
+	 * the termination bit if it happens to be set.
+	 */
+	chain_sg->page_link = ((unsigned long) sgl | SG_CHAIN) & ~SG_END;
+}
+
 /**
  * sg_chain - Chain two sglists together
  * @prv:	First scatterlist
@@ -178,18 +194,7 @@ static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
 static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
 			    struct scatterlist *sgl)
 {
-	/*
-	 * offset and length are unused for chain entry.  Clear them.
-	 */
-	prv[prv_nents - 1].offset = 0;
-	prv[prv_nents - 1].length = 0;
-
-	/*
-	 * Set lowest bit to indicate a link pointer, and make sure to clear
-	 * the termination bit if it happens to be set.
-	 */
-	prv[prv_nents - 1].page_link = ((unsigned long) sgl | SG_CHAIN)
-					& ~SG_END;
+	_sg_chain(&prv[prv_nents - 1], sgl);
 }

 /**
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 292e785d21ee..669bd6e6d16a 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -242,38 +242,15 @@ void sg_free_table(struct sg_table *table)
 }
 EXPORT_SYMBOL(sg_free_table);

-/**
- * __sg_alloc_table - Allocate and initialize an sg table with given allocator
- * @table:	The sg table header to use
- * @nents:	Number of entries in sg list
- * @max_ents:	The maximum number of entries the allocator returns per call
- * @nents_first_chunk: Number of entries int the (preallocated) first
- * 	scatterlist chunk, 0 means no such preallocated chunk provided by user
- * @gfp_mask:	GFP allocation mask
- * @alloc_fn:	Allocator to use
- *
- * Description:
- *   This function returns a @table @nents long. The allocator is
- *   defined to return scatterlist chunks of maximum size @max_ents.
- *   Thus if @nents is bigger than @max_ents, the scatterlists will be
- *   chained in units of @max_ents.
- *
- * Notes:
- *   If this function returns non-0 (eg failure), the caller must call
- *   __sg_free_table() to cleanup any leftover allocations.
- *
- **/
-int __sg_alloc_table(struct sg_table *table, unsigned int nents,
-		     unsigned int max_ents, struct scatterlist *first_chunk,
-		     unsigned int nents_first_chunk, gfp_t gfp_mask,
-		     sg_alloc_fn *alloc_fn)
+static int sg_alloc(struct sg_table *table, struct scatterlist *prv,
+		    unsigned int nents, unsigned int max_ents,
+		    struct scatterlist *first_chunk,
+		    unsigned int nents_first_chunk,
+		    gfp_t gfp_mask, sg_alloc_fn *alloc_fn)
 {
-	struct scatterlist *sg, *prv;
-	unsigned int left;
-	unsigned curr_max_ents = nents_first_chunk ?: max_ents;
-	unsigned prv_max_ents;
-
-	memset(table, 0, sizeof(*table));
+	unsigned int curr_max_ents = nents_first_chunk ?: max_ents;
+	unsigned int left, prv_max_ents = 0;
+	struct scatterlist *sg;

 	if (nents == 0)
 		return -EINVAL;
@@ -283,7 +260,6 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 #endif

 	left = nents;
-	prv = NULL;
 	do {
 		unsigned int sg_size, alloc_size = left;

@@ -308,7 +284,7 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 			 * linkage.  Without this, sg_kfree() may get
 			 * confused.
 			 */
-			if (prv)
+			if (prv_max_ents)
 				table->nents = ++table->orig_nents;

 			return -ENOMEM;
@@ -321,10 +297,17 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 		 * If this is the first mapping, assign the sg table header.
 		 * If this is not the first mapping, chain previous part.
 		 */
-		if (prv)
-			sg_chain(prv, prv_max_ents, sg);
-		else
+		if (!prv)
 			table->sgl = sg;
+		else if (prv_max_ents)
+			sg_chain(prv, prv_max_ents, sg);
+		else {
+			_sg_chain(prv, sg);
+			/* Decrease by one since the previous last sge is used
+			 * for chaining.
+			 */
+			table->nents = table->orig_nents -= 1;
+		}

 		/*
 		 * If no more entries after this one, mark the end
@@ -339,6 +322,61 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,

 	return 0;
 }
+
+/**
+ * sg_alloc_next - Allocate and initialize new entries in the sg table
+ * @table:	The sg table header to use
+ * @last:	The last scatter list entry in the table
+ * @nents:	Number of entries in sg list
+ * @max_ents:	The maximum number of entries the allocator returns per call
+ * @gfp_mask:	GFP allocation mask
+ * @alloc_fn:	Allocator to use
+ *
+ * Description:
+ *   This function extends @table by @nents entries. The allocator is
+ *   defined to return scatterlist chunks of maximum size @max_ents.
+ *   Thus if @nents is bigger than @max_ents, the scatterlists will be
+ *   chained in units of @max_ents.
+ *
+ **/
+static int sg_alloc_next(struct sg_table *table, struct scatterlist *last,
+			 unsigned int nents, unsigned int max_ents,
+			 gfp_t gfp_mask)
+{
+	return sg_alloc(table, last, nents, max_ents, NULL, 0, gfp_mask,
+			sg_kmalloc);
+}
+
+/**
+ * __sg_alloc_table - Allocate and initialize an sg table with given allocator
+ * @table:	The sg table header to use
+ * @nents:	Number of entries in sg list
+ * @max_ents:	The maximum number of entries the allocator returns per call
+ * @nents_first_chunk: Number of entries in the (preallocated) first
+ * scatterlist chunk, 0 means no such preallocated chunk provided by user
+ * @gfp_mask:	GFP allocation mask
+ * @alloc_fn:	Allocator to use
+ *
+ * Description:
+ *   This function returns a @table @nents long. The allocator is
+ *   defined to return scatterlist chunks of maximum size @max_ents.
+ *   Thus if @nents is bigger than @max_ents, the scatterlists will be
+ *   chained in units of @max_ents.
+ *
+ * Notes:
+ *   If this function returns non-0 (eg failure), the caller must call
+ *   __sg_free_table() to cleanup any leftover allocations.
+ *
+ **/
+int __sg_alloc_table(struct sg_table *table, unsigned int nents,
+		     unsigned int max_ents, struct scatterlist *first_chunk,
+		     unsigned int nents_first_chunk, gfp_t gfp_mask,
+		     sg_alloc_fn *alloc_fn)
+{
+	memset(table, 0, sizeof(*table));
+	return sg_alloc(table, NULL, nents, max_ents, first_chunk,
+			nents_first_chunk, gfp_mask, alloc_fn);
+}
 EXPORT_SYMBOL(__sg_alloc_table);

 /**
--
2.26.2



* [PATCH rdma-next 3/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages
  2020-09-03 12:18 [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function Leon Romanovsky
  2020-09-03 12:18 ` [PATCH rdma-next 2/4] lib/scatterlist: Add support in dynamically allocation of SG entries Leon Romanovsky
@ 2020-09-03 12:18 ` Leon Romanovsky
  2020-09-07  7:29   ` Christoph Hellwig
  2020-09-03 12:18 ` [PATCH rdma-next 4/4] RDMA/umem: Move to allocate " Leon Romanovsky
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Leon Romanovsky @ 2020-09-03 12:18 UTC (permalink / raw)
  To: Christoph Hellwig, Doug Ledford, Jason Gunthorpe
  Cc: Maor Gottlieb, linux-kernel, linux-rdma

From: Maor Gottlieb <maorg@nvidia.com>

Add an API that supports dynamic allocation of an SG table from pages;
such a function should be used by drivers that cannot supply all the
pages at one time.

This function returns the last populated SGE in the table. Users should
pass it back in as an argument on the second and subsequent calls (see
the sketch below). As with sg_alloc_table_from_pages, nents will be
equal to the number of populated SGEs (chunks).

With this new API, drivers can benefit from the optimization of merging
contiguous pages without needing to allocate all pages in advance and
hold them in a large buffer.

E.g. the Infiniband driver allocates a single page to hold the page
pointers. For a 1TB memory registration, the temporary buffer then
consumes only 4KB instead of 2GB.
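
A minimal sketch of the intended calling pattern (illustrative fragment
only; pages, total_pages and max_segment are assumed to be set up by the
caller, and get_next_page_batch() is a hypothetical stand-in for however
the caller pins its pages):

	struct sg_table sgt;
	struct sg_append append = {};	/* prv stays NULL for the first call */
	struct scatterlist *sg = NULL;
	unsigned long npages = total_pages;
	unsigned int batch;

	while (npages) {
		batch = get_next_page_batch(pages, npages); /* hypothetical */
		npages -= batch;
		append.prv = sg;
		append.left_pages = npages;
		sg = sg_alloc_table_append(&sgt, pages, batch, 0,
					   (unsigned long)batch << PAGE_SHIFT,
					   max_segment, GFP_KERNEL, &append);
		if (IS_ERR(sg))
			goto err;	/* caller must sg_free_table(&sgt) */
	}
	sg_mark_end(sg);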

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/scatterlist.h |  10 +++
 lib/scatterlist.c           | 131 ++++++++++++++++++++++++++++++++----
 2 files changed, 128 insertions(+), 13 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 877d6e160b06..b4450e3c3f88 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -45,6 +45,11 @@ struct sg_table {
 	unsigned int orig_nents;	/* original size of list */
 };

+struct sg_append {
+	struct scatterlist *prv; /* Previous entry to append */
+	unsigned int left_pages; /* Pages left to add to the table */
+};
+
 /*
  * Notes on SG table design.
  *
@@ -291,6 +296,11 @@ void sg_free_table(struct sg_table *);
 int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int,
 		     struct scatterlist *, unsigned int, gfp_t, sg_alloc_fn *);
 int sg_alloc_table(struct sg_table *, unsigned int, gfp_t);
+struct scatterlist *
+sg_alloc_table_append(struct sg_table *sgt, struct page **pages,
+		      unsigned int n_pages, unsigned int offset,
+		      unsigned long size, unsigned int max_segment,
+		      gfp_t gfp_mask, struct sg_append *append);
 int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
 				unsigned int n_pages, unsigned int offset,
 				unsigned long size, unsigned int max_segment,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 669bd6e6d16a..c16a4eebaa0b 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -403,19 +403,56 @@ int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(sg_alloc_table);

+static struct scatterlist *get_next_sg(struct sg_table *table,
+				       struct scatterlist *prv,
+				       unsigned long left_npages,
+				       gfp_t gfp_mask)
+{
+	struct scatterlist *next_sg;
+	int ret;
+
+	/* If table was just allocated */
+	if (!prv)
+		return table->sgl;
+
+	/* Check if the last entry should be kept for chaining */
+	next_sg = sg_next(prv);
+	if (!sg_is_last(next_sg) || left_npages == 1)
+		return next_sg;
+
+	ret = sg_alloc_next(table, next_sg,
+			    min_t(unsigned long, left_npages,
+				  SG_MAX_SINGLE_ALLOC),
+			    SG_MAX_SINGLE_ALLOC, gfp_mask);
+	if (ret)
+		return ERR_PTR(ret);
+	return sg_next(prv);
+}
+
 static struct scatterlist *
 alloc_from_pages_common(struct sg_table *sgt, struct page **pages,
 			unsigned int n_pages, unsigned int offset,
 			unsigned long size, unsigned int max_segment,
-			gfp_t gfp_mask)
+			gfp_t gfp_mask, struct sg_append *append)
 {
-	unsigned int chunks, cur_page, seg_len, i;
-	struct scatterlist *prv, *s = NULL;
+	unsigned int chunks, cur_page, seg_len, i, prv_len = 0;
+	unsigned int tmp_nents = sgt->nents;
+	struct scatterlist *s, *prv = NULL;
+	unsigned int table_size, left = 0;
 	int ret;

 	if (WARN_ON(!max_segment || offset_in_page(max_segment)))
 		return ERR_PTR(-EINVAL);

+	if (append) {
+		prv = append->prv;
+		left = append->left_pages;
+		if (prv &&
+		    page_to_pfn(sg_page(prv)) + (prv->length >> PAGE_SHIFT) ==
+			    page_to_pfn(pages[0]))
+			prv_len = prv->length;
+	}
+
 	/* compute number of contiguous chunks */
 	chunks = 1;
 	seg_len = 0;
@@ -428,13 +465,16 @@ alloc_from_pages_common(struct sg_table *sgt, struct page **pages,
 		}
 	}

-	ret = sg_alloc_table(sgt, chunks, gfp_mask);
-	if (unlikely(ret))
-		return ERR_PTR(ret);
+	if (!prv) {
+		/* Only the last allocation could be less than the maximum */
+		table_size = left ? SG_MAX_SINGLE_ALLOC : chunks;
+		ret = sg_alloc_table(sgt, table_size, gfp_mask);
+		if (unlikely(ret))
+			return ERR_PTR(ret);
+	}

 	/* merging chunks and putting them into the scatterlist */
 	cur_page = 0;
-	s = sgt->sgl;
 	for (i = 0; i < chunks; i++) {
 		unsigned int j, chunk_size;

@@ -444,21 +484,86 @@ alloc_from_pages_common(struct sg_table *sgt, struct page **pages,
 			seg_len += PAGE_SIZE;
 			if (seg_len >= max_segment ||
 			    page_to_pfn(pages[j]) !=
-			    page_to_pfn(pages[j - 1]) + 1)
+				    page_to_pfn(pages[j - 1]) + 1)
 				break;
 		}

 		chunk_size = ((j - cur_page) << PAGE_SHIFT) - offset;
-		sg_set_page(s, pages[cur_page],
-			    min_t(unsigned long, size, chunk_size), offset);
+		chunk_size = min_t(unsigned long, size, chunk_size);
+		if (!i && prv_len) {
+			if (max_segment - prv->length >= chunk_size) {
+				s = prv;
+				sg_set_page(s, sg_page(s),
+					    s->length + chunk_size, s->offset);
+				goto next;
+			}
+		}
+
+		/* Pass how many chunks might be left */
+		s = get_next_sg(sgt, prv, chunks - i + left, gfp_mask);
+		if (IS_ERR(s)) {
+			/* Adjust the entry length back to what it was before
+			 * this function was called.
+			 */
+			if (prv_len)
+				append->prv->length = prv_len;
+			goto out;
+		}
+		sg_set_page(s, pages[cur_page], chunk_size, offset);
+		tmp_nents++;
+next:
 		size -= chunk_size;
 		offset = 0;
 		cur_page = j;
 		prv = s;
-		s = sg_next(s);
 	}
-	return prv;
+	sgt->nents = tmp_nents;
+out:
+	return s;
+}
+
+/**
+ * sg_alloc_table_append - Allocate and initialize an sg table from
+ *                         an array of pages
+ * @sgt:	 The sg table header to use
+ * @pages:	 Pointer to an array of page pointers
+ * @n_pages:	 Number of pages in the pages array
+ * @offset:      Offset from start of the first page to the start of a buffer
+ * @size:        Number of valid bytes in the buffer (after offset)
+ * @max_segment: Maximum size of a scatterlist node in bytes (page aligned)
+ * @gfp_mask:	 GFP allocation mask
+ * @append:	 Used to append pages to last entry in sgt
+ *
+ *  Description:
+ *    If the prv field in @append is NULL, it allocates and initializes an
+ *    sg table from a list of pages. Contiguous ranges of the pages are
+ *    squashed into a single scatterlist node up to the maximum size
+ *    specified in @max_segment. A user may provide an offset at the start
+ *    and a size of valid data in a buffer specified by the page array. A
+ *    user may provide @append to chain pages to the last entry in sgt.
+ *    The returned sg table is released by sg_free_table.
+ *
+ * Returns:
+ *   Last SGE in sgt on success, negative error on failure.
+ *
+ * Notes:
+ *   If this function returns an error, the caller must call
+ *   sg_free_table() to clean up any leftover allocations.
+ */
+struct scatterlist *
+sg_alloc_table_append(struct sg_table *sgt, struct page **pages,
+		      unsigned int n_pages, unsigned int offset,
+		      unsigned long size, unsigned int max_segment,
+		      gfp_t gfp_mask, struct sg_append *append)
+{
+#ifdef CONFIG_ARCH_NO_SG_CHAIN
+	if (append->left_pages)
+		return ERR_PTR(-EOPNOTSUPP);
+#endif
+	return alloc_from_pages_common(sgt, pages, n_pages, offset, size,
+				       max_segment, gfp_mask, append);
 }
+EXPORT_SYMBOL(sg_alloc_table_append);

 /**
  * __sg_alloc_table_from_pages - Allocate and initialize an sg table from
@@ -489,7 +594,7 @@ int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
 	struct scatterlist *sg;

 	sg = alloc_from_pages_common(sgt, pages, n_pages, offset, size,
-				     max_segment, gfp_mask);
+				     max_segment, gfp_mask, NULL);
 	return PTR_ERR_OR_ZERO(sg);
 }
 EXPORT_SYMBOL(__sg_alloc_table_from_pages);
--
2.26.2



* [PATCH rdma-next 4/4] RDMA/umem: Move to allocate SG table from pages
  2020-09-03 12:18 [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function Leon Romanovsky
  2020-09-03 12:18 ` [PATCH rdma-next 2/4] lib/scatterlist: Add support in dynamically allocation of SG entries Leon Romanovsky
  2020-09-03 12:18 ` [PATCH rdma-next 3/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages Leon Romanovsky
@ 2020-09-03 12:18 ` Leon Romanovsky
  2020-09-07  7:29   ` Christoph Hellwig
  2020-09-03 15:32 ` [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function Christoph Hellwig
  2020-09-03 15:54 ` [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages Leon Romanovsky
  4 siblings, 1 reply; 18+ messages in thread
From: Leon Romanovsky @ 2020-09-03 12:18 UTC (permalink / raw)
  To: Christoph Hellwig, Doug Ledford, Jason Gunthorpe
  Cc: Maor Gottlieb, linux-kernel, linux-rdma

From: Maor Gottlieb <maorg@nvidia.com>

Remove the implementation of ib_umem_add_sg_table and instead call
sg_alloc_table_append, which already has the logic to merge contiguous
pages.

Besides removing duplicated functionality, this reduces the memory
consumption of the SG table significantly. Prior to this patch, the SG
table was allocated in advance without taking contiguous pages into
consideration.

On a huge-page system with a 2MB page size, without this change the SG
table would contain 512x more SG entries.
E.g. for a 100GB memory registration:

             Number of entries      Size
    Before        26214400          600.0MB
    After            51200            1.2MB

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/umem.c | 93 +++++-----------------------------
 1 file changed, 14 insertions(+), 79 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index be889e99cfac..9eb946f665ec 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -62,73 +62,6 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	sg_free_table(&umem->sg_head);
 }

-/* ib_umem_add_sg_table - Add N contiguous pages to scatter table
- *
- * sg: current scatterlist entry
- * page_list: array of npage struct page pointers
- * npages: number of pages in page_list
- * max_seg_sz: maximum segment size in bytes
- * nents: [out] number of entries in the scatterlist
- *
- * Return new end of scatterlist
- */
-static struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
-						struct page **page_list,
-						unsigned long npages,
-						unsigned int max_seg_sz,
-						int *nents)
-{
-	unsigned long first_pfn;
-	unsigned long i = 0;
-	bool update_cur_sg = false;
-	bool first = !sg_page(sg);
-
-	/* Check if new page_list is contiguous with end of previous page_list.
-	 * sg->length here is a multiple of PAGE_SIZE and sg->offset is 0.
-	 */
-	if (!first && (page_to_pfn(sg_page(sg)) + (sg->length >> PAGE_SHIFT) ==
-		       page_to_pfn(page_list[0])))
-		update_cur_sg = true;
-
-	while (i != npages) {
-		unsigned long len;
-		struct page *first_page = page_list[i];
-
-		first_pfn = page_to_pfn(first_page);
-
-		/* Compute the number of contiguous pages we have starting
-		 * at i
-		 */
-		for (len = 0; i != npages &&
-			      first_pfn + len == page_to_pfn(page_list[i]) &&
-			      len < (max_seg_sz >> PAGE_SHIFT);
-		     len++)
-			i++;
-
-		/* Squash N contiguous pages from page_list into current sge */
-		if (update_cur_sg) {
-			if ((max_seg_sz - sg->length) >= (len << PAGE_SHIFT)) {
-				sg_set_page(sg, sg_page(sg),
-					    sg->length + (len << PAGE_SHIFT),
-					    0);
-				update_cur_sg = false;
-				continue;
-			}
-			update_cur_sg = false;
-		}
-
-		/* Squash N contiguous pages into next sge or first sge */
-		if (!first)
-			sg = sg_next(sg);
-
-		(*nents)++;
-		sg_set_page(sg, first_page, len << PAGE_SHIFT, 0);
-		first = false;
-	}
-
-	return sg;
-}
-
 /**
  * ib_umem_find_best_pgsz - Find best HW page size to use for this MR
  *
@@ -205,7 +138,8 @@ static struct ib_umem *__ib_umem_get(struct ib_device *device,
 	struct mm_struct *mm;
 	unsigned long npages;
 	int ret;
-	struct scatterlist *sg;
+	struct scatterlist *sg = NULL;
+	struct sg_append append = {};
 	unsigned int gup_flags = FOLL_WRITE;

 	/*
@@ -255,15 +189,9 @@ static struct ib_umem *__ib_umem_get(struct ib_device *device,

 	cur_base = addr & PAGE_MASK;

-	ret = sg_alloc_table(&umem->sg_head, npages, GFP_KERNEL);
-	if (ret)
-		goto vma;
-
 	if (!umem->writable)
 		gup_flags |= FOLL_FORCE;

-	sg = umem->sg_head.sgl;
-
 	while (npages) {
 		cond_resched();
 		ret = pin_user_pages_fast(cur_base,
@@ -276,10 +204,18 @@ static struct ib_umem *__ib_umem_get(struct ib_device *device,

 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
-
-		sg = ib_umem_add_sg_table(sg, page_list, ret,
-			dma_get_max_seg_size(device->dma_device),
-			&umem->sg_nents);
+		append.left_pages = npages;
+		append.prv = sg;
+		sg = sg_alloc_table_append(&umem->sg_head, page_list, ret, 0,
+					   ret << PAGE_SHIFT,
+					   dma_get_max_seg_size(device->dma_device),
+					   GFP_KERNEL, &append);
+		umem->sg_nents = umem->sg_head.nents;
+		if (IS_ERR(sg)) {
+			unpin_user_pages_dirty_lock(page_list, ret, 0);
+			ret = PTR_ERR(sg);
+			goto umem_release;
+		}
 	}

 	sg_mark_end(sg);
@@ -301,7 +237,6 @@ static struct ib_umem *__ib_umem_get(struct ib_device *device,

 umem_release:
 	__ib_umem_release(device, umem, 0);
-vma:
 	atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
 out:
 	free_page((unsigned long) page_list);
--
2.26.2



* Re: [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function
  2020-09-03 12:18 [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function Leon Romanovsky
                   ` (2 preceding siblings ...)
  2020-09-03 12:18 ` [PATCH rdma-next 4/4] RDMA/umem: Move to allocate " Leon Romanovsky
@ 2020-09-03 15:32 ` Christoph Hellwig
  2020-09-03 15:55   ` Leon Romanovsky
  2020-09-03 15:54 ` [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages Leon Romanovsky
  4 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2020-09-03 15:32 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Doug Ledford, Jason Gunthorpe,
	Leon Romanovsky, linux-kernel, linux-rdma, Maor Gottlieb

Patch 1 never made it through.


* [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages
  2020-09-03 12:18 [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function Leon Romanovsky
                   ` (3 preceding siblings ...)
  2020-09-03 15:32 ` [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function Christoph Hellwig
@ 2020-09-03 15:54 ` Leon Romanovsky
  2020-09-07  7:29   ` Christoph Hellwig
  4 siblings, 1 reply; 18+ messages in thread
From: Leon Romanovsky @ 2020-09-03 15:54 UTC (permalink / raw)
  To: Christoph Hellwig, Doug Ledford, Jason Gunthorpe
  Cc: Maor Gottlieb, linux-kernel, linux-rdma

From: Maor Gottlieb <maorg@nvidia.com>

Currently, sg_alloc_table_from_pages doesn't support dynamic chaining of
SG entries. Therefore it requires the user to allocate all the pages in
advance and hold them in a large buffer. Such a buffer consumes a lot of
temporary memory on HPC systems that do very large memory registrations.

The next patches introduce an API for dynamic allocation from pages, and
it requires us to do the following:
 * Extract the code to alloc_from_pages_common.
 * Change the build of the table to iterate over the chunks and not over
   the SGEs. This will allow dynamic allocation of more SGEs.

Since sg_alloc_table_from_pages allocates exactly the number of chunks,
the number of chunks is equal to the number of SG entries.
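
As a purely illustrative example of the chunk/SGE equivalence (made-up
PFNs, max_segment assumed large enough):

	pages by PFN:  100 101 102 | 200 | 300 301
	chunks:        [100..102]    [200]  [300..301]  -> 3 chunks
	result:        sgt with 3 SG entries, one merged entry per chunk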

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 lib/scatterlist.c | 75 ++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 30 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 5d63a8857f36..292e785d21ee 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -365,38 +365,18 @@ int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(sg_alloc_table);

-/**
- * __sg_alloc_table_from_pages - Allocate and initialize an sg table from
- *			         an array of pages
- * @sgt:	 The sg table header to use
- * @pages:	 Pointer to an array of page pointers
- * @n_pages:	 Number of pages in the pages array
- * @offset:      Offset from start of the first page to the start of a buffer
- * @size:        Number of valid bytes in the buffer (after offset)
- * @max_segment: Maximum size of a scatterlist node in bytes (page aligned)
- * @gfp_mask:	 GFP allocation mask
- *
- *  Description:
- *    Allocate and initialize an sg table from a list of pages. Contiguous
- *    ranges of the pages are squashed into a single scatterlist node up to the
- *    maximum size specified in @max_segment. An user may provide an offset at a
- *    start and a size of valid data in a buffer specified by the page array.
- *    The returned sg table is released by sg_free_table.
- *
- * Returns:
- *   0 on success, negative error on failure
- */
-int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
-				unsigned int n_pages, unsigned int offset,
-				unsigned long size, unsigned int max_segment,
-				gfp_t gfp_mask)
+static struct scatterlist *
+alloc_from_pages_common(struct sg_table *sgt, struct page **pages,
+			unsigned int n_pages, unsigned int offset,
+			unsigned long size, unsigned int max_segment,
+			gfp_t gfp_mask)
 {
 	unsigned int chunks, cur_page, seg_len, i;
+	struct scatterlist *prv, *s = NULL;
 	int ret;
-	struct scatterlist *s;

 	if (WARN_ON(!max_segment || offset_in_page(max_segment)))
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);

 	/* compute number of contiguous chunks */
 	chunks = 1;
@@ -412,11 +392,12 @@ int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,

 	ret = sg_alloc_table(sgt, chunks, gfp_mask);
 	if (unlikely(ret))
-		return ret;
+		return ERR_PTR(ret);

 	/* merging chunks and putting them into the scatterlist */
 	cur_page = 0;
-	for_each_sg(sgt->sgl, s, sgt->orig_nents, i) {
+	s = sgt->sgl;
+	for (i = 0; i < chunks; i++) {
 		unsigned int j, chunk_size;

 		/* look for the end of the current chunk */
@@ -435,9 +416,43 @@ int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
 		size -= chunk_size;
 		offset = 0;
 		cur_page = j;
+		prv = s;
+		s = sg_next(s);
 	}
+	return prv;
+}

-	return 0;
+/**
+ * __sg_alloc_table_from_pages - Allocate and initialize an sg table from
+ *			         an array of pages
+ * @sgt:	 The sg table header to use
+ * @pages:	 Pointer to an array of page pointers
+ * @n_pages:	 Number of pages in the pages array
+ * @offset:      Offset from start of the first page to the start of a buffer
+ * @size:        Number of valid bytes in the buffer (after offset)
+ * @max_segment: Maximum size of a scatterlist node in bytes (page aligned)
+ * @gfp_mask:	 GFP allocation mask
+ *
+ *  Description:
+ *    Allocate and initialize an sg table from a list of pages. Contiguous
+ *    ranges of the pages are squashed into a single scatterlist node up to the
+ *    maximum size specified in @max_segment. A user may provide an offset at a
+ *    start and a size of valid data in a buffer specified by the page array.
+ *    The returned sg table is released by sg_free_table.
+ *
+ * Returns:
+ *   0 on success, negative error on failure
+ */
+int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
+				unsigned int n_pages, unsigned int offset,
+				unsigned long size, unsigned int max_segment,
+				gfp_t gfp_mask)
+{
+	struct scatterlist *sg;
+
+	sg = alloc_from_pages_common(sgt, pages, n_pages, offset, size,
+				     max_segment, gfp_mask);
+	return PTR_ERR_OR_ZERO(sg);
 }
 EXPORT_SYMBOL(__sg_alloc_table_from_pages);

--
2.26.2



* Re: [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function
  2020-09-03 15:32 ` [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function Christoph Hellwig
@ 2020-09-03 15:55   ` Leon Romanovsky
  0 siblings, 0 replies; 18+ messages in thread
From: Leon Romanovsky @ 2020-09-03 15:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Doug Ledford, Jason Gunthorpe, linux-kernel, linux-rdma, Maor Gottlieb

On Thu, Sep 03, 2020 at 05:32:17PM +0200, Christoph Hellwig wrote:
> Patch 1 never made it through.

Thanks, I sent it now and the patch can be seen on the mailing list:

https://lore.kernel.org/linux-rdma/20200903153217.GA21689@lst.de/T/#t


* Re: [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages
  2020-09-03 15:54 ` [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages Leon Romanovsky
@ 2020-09-07  7:29   ` Christoph Hellwig
  2020-09-07 12:32     ` Maor Gottlieb
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2020-09-07  7:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Doug Ledford, Jason Gunthorpe, Maor Gottlieb,
	linux-kernel, linux-rdma

On Thu, Sep 03, 2020 at 06:54:34PM +0300, Leon Romanovsky wrote:
> From: Maor Gottlieb <maorg@nvidia.com>
> 
> Currently, sg_alloc_table_from_pages doesn't support dynamic chaining of
> SG entries. Therefore it requires from user to allocate all the pages in
> advance and hold them in a large buffer. Such a buffer consumes a lot of
> temporary memory in HPC systems which do a very large memory registration.
> 
> The next patches introduce API for dynamically allocation from pages and
> it requires us to do the following:
>  * Extract the code to alloc_from_pages_common.
>  * Change the build of the table to iterate on the chunks and not on the
>    SGEs. It will allow dynamic allocation of more SGEs.
> 
> Since sg_alloc_table_from_pages allocate exactly the number of chunks,
> therefore chunks are equal to the number of SG entries.

Given how few users __sg_alloc_table_from_pages has, what about just
switching it to your desired calling conventions without another helper?


* Re: [PATCH rdma-next 2/4] lib/scatterlist: Add support in dynamically allocation of SG entries
  2020-09-03 12:18 ` [PATCH rdma-next 2/4] lib/scatterlist: Add support in dynamically allocation of SG entries Leon Romanovsky
@ 2020-09-07  7:29   ` Christoph Hellwig
  2020-09-07 12:34     ` Maor Gottlieb
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2020-09-07  7:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Doug Ledford, Jason Gunthorpe, Maor Gottlieb,
	linux-kernel, linux-rdma

> +static inline void _sg_chain(struct scatterlist *chain_sg,
> +			     struct scatterlist *sgl)
> +{
> +	/*
> +	 * offset and length are unused for chain entry. Clear them.
> +	 */
> +	chain_sg->offset = 0;
> +	chain_sg->length = 0;
> +
> +	/*
> +	 * Set lowest bit to indicate a link pointer, and make sure to clear
> +	 * the termination bit if it happens to be set.
> +	 */
> +	chain_sg->page_link = ((unsigned long) sgl | SG_CHAIN) & ~SG_END;
> +}

Please call this __sg_chain to stick with our normal kernel naming
convention.


* Re: [PATCH rdma-next 3/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages
  2020-09-03 12:18 ` [PATCH rdma-next 3/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages Leon Romanovsky
@ 2020-09-07  7:29   ` Christoph Hellwig
  2020-09-07 12:44     ` Maor Gottlieb
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2020-09-07  7:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Doug Ledford, Jason Gunthorpe, Maor Gottlieb,
	linux-kernel, linux-rdma

On Thu, Sep 03, 2020 at 03:18:52PM +0300, Leon Romanovsky wrote:
> +struct sg_append {
> +	struct scatterlist *prv; /* Previous entry to append */
> +	unsigned int left_pages; /* Left pages to add to table */
> +};

I don't really see the point in this structure.   Either pass it as
two separate arguments, or switch sg_alloc_table_append and the
internal helper to pass all arguments as a struct.
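
Something like this (a rough sketch of the struct-based idea only; the
struct and field names are made up, not an existing API):

	struct sg_append_args {
		struct page **pages;
		unsigned int n_pages;
		unsigned int offset;
		unsigned long size;
		unsigned int max_segment;
		gfp_t gfp_mask;
		struct scatterlist *prv;	/* previous entry to append to */
		unsigned int left_pages;	/* pages still to be added */
	};

	struct scatterlist *
	sg_alloc_table_append(struct sg_table *sgt,
			      const struct sg_append_args *args);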

> + *    A user may provide an offset at a start and a size of valid data in a buffer
> + *    specified by the page array. A user may provide @append to chain pages to

This adds a few pointless > 80 char lines.

> +struct scatterlist *
> +sg_alloc_table_append(struct sg_table *sgt, struct page **pages,
> +		      unsigned int n_pages, unsigned int offset,
> +		      unsigned long size, unsigned int max_segment,
> +		      gfp_t gfp_mask, struct sg_append *append)
> +{
> +#ifdef CONFIG_ARCH_NO_SG_CHAIN
> +	if (append->left_pages)
> +		return ERR_PTR(-EOPNOTSUPP);
> +#endif

Which makes this API entirely useless for !CONFIG_ARCH_NO_SG_CHAIN,
doesn't it?  Wouldn't it make more sense to not provide it for that
case and add an explicit dependency in the callers?

> +	return alloc_from_pages_common(sgt, pages, n_pages, offset, size,
> +				       max_segment, gfp_mask, append);

And if we somehow manage to sort that out we can merge
sg_alloc_table_append and alloc_from_pages_common, reducing the number
of wrappers that just make it too hard to follow the code.

> +EXPORT_SYMBOL(sg_alloc_table_append);

EXPORT_SYMBOL_GPL, please.


* Re: [PATCH rdma-next 4/4] RDMA/umem: Move to allocate SG table from pages
  2020-09-03 12:18 ` [PATCH rdma-next 4/4] RDMA/umem: Move to allocate " Leon Romanovsky
@ 2020-09-07  7:29   ` Christoph Hellwig
  2020-09-08 14:10     ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2020-09-07  7:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Doug Ledford, Jason Gunthorpe, Maor Gottlieb,
	linux-kernel, linux-rdma

On Thu, Sep 03, 2020 at 03:18:53PM +0300, Leon Romanovsky wrote:
> From: Maor Gottlieb <maorg@nvidia.com>
> 
> Remove the implementation of ib_umem_add_sg_table and instead
> call to sg_alloc_table_append which already has the logic to
> merge contiguous pages.
> 
> Besides that it removes duplicated functionality, it reduces the
> memory consumption of the SG table significantly. Prior to this
> patch, the SG table was allocated in advance regardless consideration
> of contiguous pages.
> 
> In huge pages system of 2MB page size, without this change, the SG table
> would contain x512 SG entries.
> E.g. for 100GB memory registration:
> 
> 	 Number of entries	Size
> Before 	      26214400          600.0MB
> After            51200		  1.2MB
> 
> Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

Looks sensible for now, but the real fix is of course to avoid
the scatterlist here entirely, and provide a bvec based
pin_user_pages_fast.  I'll need to finally get that done.


* Re: [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages
  2020-09-07  7:29   ` Christoph Hellwig
@ 2020-09-07 12:32     ` Maor Gottlieb
  2020-09-08 15:52       ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Maor Gottlieb @ 2020-09-07 12:32 UTC (permalink / raw)
  To: Christoph Hellwig, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, linux-kernel, linux-rdma


On 9/7/2020 10:29 AM, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 06:54:34PM +0300, Leon Romanovsky wrote:
>> From: Maor Gottlieb <maorg@nvidia.com>
>>
>> Currently, sg_alloc_table_from_pages doesn't support dynamic chaining of
>> SG entries. Therefore it requires from user to allocate all the pages in
>> advance and hold them in a large buffer. Such a buffer consumes a lot of
>> temporary memory in HPC systems which do a very large memory registration.
>>
>> The next patches introduce API for dynamically allocation from pages and
>> it requires us to do the following:
>>   * Extract the code to alloc_from_pages_common.
>>   * Change the build of the table to iterate on the chunks and not on the
>>     SGEs. It will allow dynamic allocation of more SGEs.
>>
>> Since sg_alloc_table_from_pages allocate exactly the number of chunks,
>> therefore chunks are equal to the number of SG entries.
> Given how few users __sg_alloc_table_from_pages has, what about just
> switching it to your desired calling conventions without another helper?

I tried it now; it didn't save a lot. Please let me know your decision
and, if needed, I will update accordingly.


* Re: [PATCH rdma-next 2/4] lib/scatterlist: Add support in dynamically allocation of SG entries
  2020-09-07  7:29   ` Christoph Hellwig
@ 2020-09-07 12:34     ` Maor Gottlieb
  0 siblings, 0 replies; 18+ messages in thread
From: Maor Gottlieb @ 2020-09-07 12:34 UTC (permalink / raw)
  To: Christoph Hellwig, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, linux-kernel, linux-rdma


On 9/7/2020 10:29 AM, Christoph Hellwig wrote:
>> +static inline void _sg_chain(struct scatterlist *chain_sg,
>> +			     struct scatterlist *sgl)
>> +{
>> +	/*
>> +	 * offset and length are unused for chain entry. Clear them.
>> +	 */
>> +	chain_sg->offset = 0;
>> +	chain_sg->length = 0;
>> +
>> +	/*
>> +	 * Set lowest bit to indicate a link pointer, and make sure to clear
>> +	 * the termination bit if it happens to be set.
>> +	 */
>> +	chain_sg->page_link = ((unsigned long) sgl | SG_CHAIN) & ~SG_END;
>> +}
> Please call this __sg_chain to stick with our normal kernel naming
> convention.

Will do.


* Re: [PATCH rdma-next 3/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages
  2020-09-07  7:29   ` Christoph Hellwig
@ 2020-09-07 12:44     ` Maor Gottlieb
  2020-09-08 15:54       ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Maor Gottlieb @ 2020-09-07 12:44 UTC (permalink / raw)
  To: Christoph Hellwig, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, linux-kernel, linux-rdma


On 9/7/2020 10:29 AM, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 03:18:52PM +0300, Leon Romanovsky wrote:
>> +struct sg_append {
>> +	struct scatterlist *prv; /* Previous entry to append */
>> +	unsigned int left_pages; /* Left pages to add to table */
>> +};
> I don't really see the point in this structure.   Either pass it as
> two separate arguments, or switch sg_alloc_table_append and the
> internal helper to pass all arguments as a struct.

I did it to avoid having more than 8 arguments to this function; I will
change it to 9 if that's fine with you.
>
>> + *    A user may provide an offset at a start and a size of valid data in a buffer
>> + *    specified by the page array. A user may provide @append to chain pages to
> This adds a few pointles > 80 char lines.

Will fix.
>
>> +struct scatterlist *
>> +sg_alloc_table_append(struct sg_table *sgt, struct page **pages,
>> +		      unsigned int n_pages, unsigned int offset,
>> +		      unsigned long size, unsigned int max_segment,
>> +		      gfp_t gfp_mask, struct sg_append *append)
>> +{
>> +#ifdef CONFIG_ARCH_NO_SG_CHAIN
>> +	if (append->left_pages)
>> +		return ERR_PTR(-EOPNOTSUPP);
>> +#endif
> Which makes this API entirely useless for !CONFIG_ARCH_NO_SG_CHAIN,
> doesn't it?  Wouldn't it make more sense to not provide it for that
> case and add an explicitl dependency in the callers?

The current implementation allows us to support small memory
registrations which do not require chaining. I am not aware of which
archs have SG_CHAIN support and I don't want to break them, so I can't
add it as a dependency to the Kconfig. Another option is to do the
logic in the caller, but it isn't clean.

>
>> +	return alloc_from_pages_common(sgt, pages, n_pages, offset, size,
>> +				       max_segment, gfp_mask, append);
> And if we somehow manage to sort that out we can merge
> sg_alloc_table_append and alloc_from_pages_common, reducing the amount
> of wrappers that just make it too hard to follow the code.
>
>> +EXPORT_SYMBOL(sg_alloc_table_append);
> EXPORT_SYMBOL_GPL, please.

Sure


* Re: [PATCH rdma-next 4/4] RDMA/umem: Move to allocate SG table from pages
  2020-09-07  7:29   ` Christoph Hellwig
@ 2020-09-08 14:10     ` Jason Gunthorpe
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2020-09-08 14:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Doug Ledford, Maor Gottlieb, linux-kernel, linux-rdma

On Mon, Sep 07, 2020 at 09:29:26AM +0200, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 03:18:53PM +0300, Leon Romanovsky wrote:
> > From: Maor Gottlieb <maorg@nvidia.com>
> > 
> > Remove the implementation of ib_umem_add_sg_table and instead
> > call to sg_alloc_table_append which already has the logic to
> > merge contiguous pages.
> > 
> > Besides that it removes duplicated functionality, it reduces the
> > memory consumption of the SG table significantly. Prior to this
> > patch, the SG table was allocated in advance regardless consideration
> > of contiguous pages.
> > 
> > In huge pages system of 2MB page size, without this change, the SG table
> > would contain x512 SG entries.
> > E.g. for 100GB memory registration:
> > 
> > 	 Number of entries	Size
> > Before 	      26214400          600.0MB
> > After            51200		  1.2MB
> > 
> > Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> 
> Looks sensible for now, but the real fix is of course to avoid
> the scatterlist here entirely, and provide a bvec based
> pin_user_pages_fast.  I'll need to finally get that done..

I'm working on cleaning all the DMA RDMA drivers using ib_umem to the
point where doing something like this would become fairly simple.

pin_user_pages_fast_bvec/whatever would be a huge improvement here,
calling in a loop like this just to get a partial page list to copy to
an SGL is horrifically slow due to all the extra overhead. Going
directly to the bvec/sgl/etc inside all the locks will be a lot faster.

Jason


* Re: [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages
  2020-09-07 12:32     ` Maor Gottlieb
@ 2020-09-08 15:52       ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-09-08 15:52 UTC (permalink / raw)
  To: Maor Gottlieb
  Cc: Christoph Hellwig, Leon Romanovsky, Doug Ledford,
	Jason Gunthorpe, linux-kernel, linux-rdma

On Mon, Sep 07, 2020 at 03:32:31PM +0300, Maor Gottlieb wrote:
>
> On 9/7/2020 10:29 AM, Christoph Hellwig wrote:
>> On Thu, Sep 03, 2020 at 06:54:34PM +0300, Leon Romanovsky wrote:
>>> From: Maor Gottlieb <maorg@nvidia.com>
>>>
>>> Currently, sg_alloc_table_from_pages doesn't support dynamic chaining of
>>> SG entries. Therefore it requires from user to allocate all the pages in
>>> advance and hold them in a large buffer. Such a buffer consumes a lot of
>>> temporary memory in HPC systems which do a very large memory registration.
>>>
>>> The next patches introduce API for dynamically allocation from pages and
>>> it requires us to do the following:
>>>   * Extract the code to alloc_from_pages_common.
>>>   * Change the build of the table to iterate on the chunks and not on the
>>>     SGEs. It will allow dynamic allocation of more SGEs.
>>>
>>> Since sg_alloc_table_from_pages allocate exactly the number of chunks,
>>> therefore chunks are equal to the number of SG entries.
>> Given how few users __sg_alloc_table_from_pages has, what about just
>> switching it to your desired calling conventions without another helper?
>
> I tried it now. It didn't save a lot.  Please give me your decision and if 
> needed I will update accordingly.

Feel free to keep it for now, we can sort this out later.


* Re: [PATCH rdma-next 3/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages
  2020-09-07 12:44     ` Maor Gottlieb
@ 2020-09-08 15:54       ` Christoph Hellwig
  2020-09-08 16:13         ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2020-09-08 15:54 UTC (permalink / raw)
  To: Maor Gottlieb
  Cc: Christoph Hellwig, Leon Romanovsky, Doug Ledford,
	Jason Gunthorpe, linux-kernel, linux-rdma

On Mon, Sep 07, 2020 at 03:44:08PM +0300, Maor Gottlieb wrote:
>>> +{
>>> +#ifdef CONFIG_ARCH_NO_SG_CHAIN
>>> +	if (append->left_pages)
>>> +		return ERR_PTR(-EOPNOTSUPP);
>>> +#endif
>> Which makes this API entirely useless for !CONFIG_ARCH_NO_SG_CHAIN,
>> doesn't it?  Wouldn't it make more sense to not provide it for that
>> case and add an explicitl dependency in the callers?
>
> Current implementation allow us to support small memory registration which 
> not require chaining. I am not aware which archs has the SG_CHAIN support 
> and I don't want to break it so I can't add it to as dependency to the 
> Kconfig. Another option is to do the logic in the caller, but it isn't 
> clean.

But does the caller handle the -EOPNOTSUPP properly?  I think right now
it will just fail the large registration that worked before this patchset.

Given that ARCH_NO_SG_CHAIN is only true for alpha, parisc and a few
arm subarchitectures, I think just not supporting umem there is probably
cleaner.  And eventually we'll need to drop ARCH_NO_SG_CHAIN entirely.


* Re: [PATCH rdma-next 3/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages
  2020-09-08 15:54       ` Christoph Hellwig
@ 2020-09-08 16:13         ` Jason Gunthorpe
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2020-09-08 16:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Maor Gottlieb, Leon Romanovsky, Doug Ledford, linux-kernel, linux-rdma

On Tue, Sep 08, 2020 at 05:54:09PM +0200, Christoph Hellwig wrote:
> Given that ARCH_NO_SG_CHAIN is only true for alpha, parisc and a few
> arm subarchitectures I think just not supporting umem is probably
> cleared.  And eventually we'll need to drop ARCH_NO_SG_CHAIN entirely.

It would be fine to make INFINIBAND_USER_MEM depend on
!ARCH_NO_SG_CHAIN. alpha and parisc are not supported in rdma-core,
and the non-multiplatform ARM sub-arches are probably also the kind
that don't work with the userspace DMA model anyhow.

Jason

