* [PATCH v3 0/4] mm/gup: page unpinning improvements
@ 2021-02-05 20:41 Joao Martins
  2021-02-05 20:41 ` [PATCH v3 1/4] mm/gup: add compound page list iterator Joao Martins
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Joao Martins @ 2021-02-05 20:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-rdma, Andrew Morton, Jason Gunthorpe,
	Doug Ledford, John Hubbard, Matthew Wilcox, Joao Martins

Hey,

This series improves page unpinning, with an eye on improving MR
deregistration for big swaths of memory (which is bound by the page
unpinning), particularly:

 1) Decrement the head page refcount by @ntails, thus greatly reducing the number
of atomic operations per compound page. This is done by comparing the compound
head of each tail page, counting the number of consecutive tails that share the
same head, and updating the head page refcount once for that whole run. This
should give a visible improvement to all page (un)pinners which use compound pages.

 2) Introduce a new API for unpinning page ranges (avoiding the per-page head
comparison of the previous item in favour of plain arithmetic), and use it in the
RDMA __ib_umem_release() path (used for MR deregistration).

Performance improvements: unpin_user_pages() for hugetlbfs and THP improves by ~3x
(measured through gup_test) and RDMA MR deregistration improves by ~4.5x with the
new API. See patches 2 and 4 for the numbers.
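
As a rough caller-side illustration of the two approaches (a sketch only, not
code taken from the patches; @pages, @npages and @first_page are illustrative
caller variables):

	/* Array-based path: the caller still passes a page array, but the
	 * series now batches the head page refcount update internally, once
	 * per run of pages sharing the same compound head. */
	unpin_user_pages_dirty_lock(pages, npages, true);

	/* Range-based path (patch 3): no page array is needed when the pages
	 * are known to be physically consecutive, e.g. one sg entry. */
	unpin_user_page_range_dirty_lock(first_page, npages, true);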

These patches used to be in this RFC:

https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/,
"[PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps"

But they were split out at Jason's suggestion, given that the improvements are
applicable to page unpinning in general. Thanks for all the comments so far.

These patches apply on top of linux-next tag next-20210202.

Suggestions and comments are welcome, as usual.

	Joao

Changelog:

v2 -> v3:
 * Handle compound_order = 1 as well and move subtraction to min_t()
   on patch 3.
 * Remove stale paragraph on patch 3 commit description (John)
 * Rename range_next to compound_range_next() (John)
 * Add John's Reviewed-by on patch 1 (John)
 * Clean and rework compound_next() on patch 1 (John)

v1 -> v2:
 * Prefix macro arguments with __ to avoid collisions with other defines (John)
 * Remove count_tails() and have the logic for the two iterators split into
   range_next() and compound_next() (John)
 * Remove the @range boolean from the iterator helpers (John)
 * Add docs on unpin_user_page_range_dirty_lock() on patch 3 (John)
 * Use unsigned for @i on patch 4 (John)
 * Fix subject line of patch 4 (John)
 * Add John's Reviewed-by on the second patch
 * Fix incorrect use of @nmap and use @sg_nents instead (Jason)

RFC -> v1:
 * Introduce a head/ntails iterator and change unpin_*_pages() to use that,
   inspired by folio iterators (Jason)
 * Introduce an alternative unpin_user_page_range_dirty_lock() to unpin based
   on a consecutive page range without having to walk page arrays (Jason)
 * Use unsigned for number of tails (Jason)

Joao Martins (4):
  mm/gup: add compound page list iterator
  mm/gup: decrement head page once for group of subpages
  mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  RDMA/umem: batch page unpin in __ib_umem_release()

 drivers/infiniband/core/umem.c |  12 ++--
 include/linux/mm.h             |   2 +
 mm/gup.c                       | 117 ++++++++++++++++++++++++++++-----
 3 files changed, 107 insertions(+), 24 deletions(-)

-- 
2.17.1



* [PATCH v3 1/4] mm/gup: add compound page list iterator
  2021-02-05 20:41 [PATCH v3 0/4] mm/gup: page unpinning improvements Joao Martins
@ 2021-02-05 20:41 ` Joao Martins
  2021-02-10 23:20   ` Jason Gunthorpe
  2021-02-05 20:41 ` [PATCH v3 2/4] mm/gup: decrement head page once for group of subpages Joao Martins
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Joao Martins @ 2021-02-05 20:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-rdma, Andrew Morton, Jason Gunthorpe,
	Doug Ledford, John Hubbard, Matthew Wilcox, Joao Martins

Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head than the current one. This is going to be used by the
unpin_user_pages() family of functions, to batch the head page refcount
update once for all the consecutive tail pages passed in.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..8defe4f670d5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline void compound_next(unsigned long i, unsigned long npages,
+				 struct page **list, struct page **head,
+				 unsigned int *ntails)
+{
+	struct page *page;
+	unsigned int nr;
+
+	if (i >= npages)
+		return;
+
+	page = compound_head(list[i]);
+	for (nr = i + 1; nr < npages; nr++) {
+		if (compound_head(list[nr]) != page)
+			break;
+	}
+
+	*head = page;
+	*ntails = nr - i;
+}
+
+#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
+	for (__i = 0, \
+	     compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
+	     __i < __npages; __i += __ntails, \
+	     compound_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
-- 
2.17.1



* [PATCH v3 2/4] mm/gup: decrement head page once for group of subpages
  2021-02-05 20:41 [PATCH v3 0/4] mm/gup: page unpinning improvements Joao Martins
  2021-02-05 20:41 ` [PATCH v3 1/4] mm/gup: add compound page list iterator Joao Martins
@ 2021-02-05 20:41 ` Joao Martins
  2021-02-10 21:02   ` Jason Gunthorpe
  2021-02-05 20:41 ` [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock() Joao Martins
  2021-02-05 20:41 ` [PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release() Joao Martins
  3 siblings, 1 reply; 15+ messages in thread
From: Joao Martins @ 2021-02-05 20:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-rdma, Andrew Morton, Jason Gunthorpe,
	Doug Ledford, John Hubbard, Matthew Wilcox, Joao Martins

Rather than decrementing the head page refcount one by one, we
walk the page array and check which pages belong to the same
compound_head. Later on we decrement the calculated amount
of references in a single write to the head page. To that end,
switch to for_each_compound_head(), which does most of the work.

set_page_dirty() needs no adjustment as it's a nop for
non-dirty head pages and it doesn't operate on tail pages.

This considerably improves unpinning of pages with THP and
hugetlbfs:

- THP
gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us

- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8defe4f670d5..467a11df216d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -267,20 +267,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty)
 {
 	unsigned long index;
-
-	/*
-	 * TODO: this can be optimized for huge pages: if a series of pages is
-	 * physically contiguous and part of the same compound page, then a
-	 * single operation to the head page should suffice.
-	 */
+	struct page *head;
+	unsigned int ntails;
 
 	if (!make_dirty) {
 		unpin_user_pages(pages, npages);
 		return;
 	}
 
-	for (index = 0; index < npages; index++) {
-		struct page *page = compound_head(pages[index]);
+	for_each_compound_head(index, pages, npages, head, ntails) {
 		/*
 		 * Checking PageDirty at this point may race with
 		 * clear_page_dirty_for_io(), but that's OK. Two key
@@ -301,9 +296,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 		 * written back, so it gets written back again in the
 		 * next writeback cycle. This is harmless.
 		 */
-		if (!PageDirty(page))
-			set_page_dirty_lock(page);
-		unpin_user_page(page);
+		if (!PageDirty(head))
+			set_page_dirty_lock(head);
+		put_compound_head(head, ntails, FOLL_PIN);
 	}
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -320,6 +315,8 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 void unpin_user_pages(struct page **pages, unsigned long npages)
 {
 	unsigned long index;
+	struct page *head;
+	unsigned int ntails;
 
 	/*
 	 * If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -328,13 +325,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages)
 	 */
 	if (WARN_ON(IS_ERR_VALUE(npages)))
 		return;
-	/*
-	 * TODO: this can be optimized for huge pages: if a series of pages is
-	 * physically contiguous and part of the same compound page, then a
-	 * single operation to the head page should suffice.
-	 */
-	for (index = 0; index < npages; index++)
-		unpin_user_page(pages[index]);
+
+	for_each_compound_head(index, pages, npages, head, ntails)
+		put_compound_head(head, ntails, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_pages);
 
-- 
2.17.1



* [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  2021-02-05 20:41 [PATCH v3 0/4] mm/gup: page unpinning improvements Joao Martins
  2021-02-05 20:41 ` [PATCH v3 1/4] mm/gup: add compound page list iterator Joao Martins
  2021-02-05 20:41 ` [PATCH v3 2/4] mm/gup: decrement head page once for group of subpages Joao Martins
@ 2021-02-05 20:41 ` Joao Martins
  2021-02-10 23:15   ` Jason Gunthorpe
  2021-02-10 23:19   ` John Hubbard
  2021-02-05 20:41 ` [PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release() Joao Martins
  3 siblings, 2 replies; 15+ messages in thread
From: Joao Martins @ 2021-02-05 20:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-rdma, Andrew Morton, Jason Gunthorpe,
	Doug Ledford, John Hubbard, Matthew Wilcox, Joao Martins

Add an unpin_user_page_range_dirty_lock() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.

To that end, define another iterator, for_each_compound_range(),
which operates on page ranges as opposed to page arrays.

For users (like RDMA mr_dereg) where each sg represents a
contiguous set of pages, we're able to more efficiently unpin
pages without having to supply an array of pages, which is much
of what happens today with unpin_user_pages().
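
For example, an sg-list based caller (this is what patch 4 does for
__ib_umem_release()) can unpin each contiguous chunk in a single call:

	for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
		unpin_user_page_range_dirty_lock(sg_page(sg),
			DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);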

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h |  2 ++
 mm/gup.c           | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a608feb0d42e..b76063f7f18a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty);
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+				      bool make_dirty);
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
 /**
diff --git a/mm/gup.c b/mm/gup.c
index 467a11df216d..938964d31494 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline void compound_range_next(unsigned long i, unsigned long npages,
+				       struct page **list, struct page **head,
+				       unsigned int *ntails)
+{
+	struct page *next, *page;
+	unsigned int nr = 1;
+
+	if (i >= npages)
+		return;
+
+	next = *list + i;
+	page = compound_head(next);
+	if (PageCompound(page) && compound_order(page) >= 1)
+		nr = min_t(unsigned int,
+			   page + compound_nr(page) - next, npages - i);
+
+	*head = page;
+	*ntails = nr;
+}
+
+#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
+	for (__i = 0, \
+	     compound_range_next(__i, __npages, __list, &(__head), &(__ntails)); \
+	     __i < __npages; __i += __ntails, \
+	     compound_range_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 static inline void compound_next(unsigned long i, unsigned long npages,
 				 struct page **list, struct page **head,
 				 unsigned int *ntails)
@@ -303,6 +329,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 
+/**
+ * unpin_user_page_range_dirty_lock() - release and optionally dirty
+ * gup-pinned page range
+ *
+ * @page:  the starting page of a range maybe marked dirty, and definitely released.
+ * @npages: number of consecutive pages to release.
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * "gup-pinned page range" refers to a range of pages that has had one of the
+ * get_user_pages() variants called on that page.
+ *
+ * For the page ranges defined by [page .. page+npages], make that range (or
+ * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
+ * page range was previously listed as clean.
+ *
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), unpin_user_page().
+ *
+ */
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+				      bool make_dirty)
+{
+	unsigned long index;
+	struct page *head;
+	unsigned int ntails;
+
+	for_each_compound_range(index, &page, npages, head, ntails) {
+		if (make_dirty && !PageDirty(head))
+			set_page_dirty_lock(head);
+		put_compound_head(head, ntails, FOLL_PIN);
+	}
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+
 /**
  * unpin_user_pages() - release an array of gup-pinned pages.
  * @pages:  array of pages to be marked dirty and released.
-- 
2.17.1



* [PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release()
  2021-02-05 20:41 [PATCH v3 0/4] mm/gup: page unpinning improvements Joao Martins
                   ` (2 preceding siblings ...)
  2021-02-05 20:41 ` [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock() Joao Martins
@ 2021-02-05 20:41 ` Joao Martins
  2021-02-10 23:17   ` Jason Gunthorpe
  3 siblings, 1 reply; 15+ messages in thread
From: Joao Martins @ 2021-02-05 20:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-rdma, Andrew Morton, Jason Gunthorpe,
	Doug Ledford, John Hubbard, Matthew Wilcox, Joao Martins

Use the newly added unpin_user_page_range_dirty_lock()
to more quickly unpin a consecutive range of pages
represented as compound pages. This also calculates the
number of pages to unpin (i.e. the tail pages sharing the
same head page) and thus batches the refcount update.

Running a test program which calls MR reg/unreg on a 1G-sized region
and measures the cost of both operations together (in a guest using rxe),
with THP and hugetlbfs:

Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round

After:
2688 rounds in 5.002 sec: 1860.786 usec / round
32517 rounds in 60.001 sec: 1845.225 usec / round

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/infiniband/core/umem.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2dde99a9ba07..9b607013e2a2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -47,17 +47,17 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
-	struct sg_page_iter sg_iter;
-	struct page *page;
+	bool make_dirty = umem->writable && dirty;
+	struct scatterlist *sg;
+	unsigned int i;
 
 	if (umem->nmap > 0)
 		ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
 				DMA_BIDIRECTIONAL);
 
-	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
-		page = sg_page_iter_page(&sg_iter);
-		unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
-	}
+	for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
+		unpin_user_page_range_dirty_lock(sg_page(sg),
+			DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
 	sg_free_table(&umem->sg_head);
 }
-- 
2.17.1



* Re: [PATCH v3 2/4] mm/gup: decrement head page once for group of subpages
  2021-02-05 20:41 ` [PATCH v3 2/4] mm/gup: decrement head page once for group of subpages Joao Martins
@ 2021-02-10 21:02   ` Jason Gunthorpe
  2021-02-11 10:14     ` Joao Martins
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2021-02-10 21:02 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, linux-kernel, linux-rdma, Andrew Morton, Doug Ledford,
	John Hubbard, Matthew Wilcox

On Fri, Feb 05, 2021 at 08:41:25PM +0000, Joao Martins wrote:
> Rather than decrementing the head page refcount one by one, we
> walk the page array and checking which belong to the same
> compound_head. Later on we decrement the calculated amount
> of references in a single write to the head page. To that
> end switch to for_each_compound_head() does most of the work.
> 
> set_page_dirty() needs no adjustment as it's a nop for
> non-dirty head pages and it doesn't operate on tail pages.
> 
> This considerably improves unpinning of pages with THP and
> hugetlbfs:
> 
> - THP
> gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
> PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us
> 
> - 16G with 1G huge page size
> gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
> PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 29 +++++++++++------------------
>  1 file changed, 11 insertions(+), 18 deletions(-)

Looks fine

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

I was wondering why this only touches the FOLL_PIN path; it would make
sense to also use this same logic for release_pages():

        for (i = 0; i < nr; i++) {
                struct page *page = pages[i];
                page = compound_head(page);
                if (is_huge_zero_page(page))
                        continue; 
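
Roughly (just a sketch, ignoring the devmap/LRU handling that
release_pages() also does, and the iterator would have to move out of
mm/gup.c), it could look something like:

        for_each_compound_head(i, pages, nr, head, ntails) {
                if (is_huge_zero_page(head))
                        continue;
                if (!page_ref_sub_and_test(head, ntails))
                        continue;
                /* ... then free the page, as release_pages() does today */
        }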

Jason


* Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  2021-02-05 20:41 ` [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock() Joao Martins
@ 2021-02-10 23:15   ` Jason Gunthorpe
  2021-02-11 10:24     ` Joao Martins
  2021-02-10 23:19   ` John Hubbard
  1 sibling, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2021-02-10 23:15 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, linux-kernel, linux-rdma, Andrew Morton, Doug Ledford,
	John Hubbard, Matthew Wilcox

On Fri, Feb 05, 2021 at 08:41:26PM +0000, Joao Martins wrote:
> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
> and how many consecutive pages we want to unpin and optionally dirty.
> 
> To that end, define another iterator for_each_compound_range()
> that operates in page ranges as opposed to page array.
> 
> For users (like RDMA mr_dereg) where each sg represents a
> contiguous set of pages, we're able to more efficiently unpin
> pages without having to supply an array of pages much of what
> happens today with unpin_user_pages().
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/mm.h |  2 ++
>  mm/gup.c           | 62 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

> +/**
> + * unpin_user_page_range_dirty_lock() - release and optionally dirty
> + * gup-pinned page range
> + *
> + * @page:  the starting page of a range maybe marked dirty, and definitely released.
> + * @npages: number of consecutive pages to release.
> + * @make_dirty: whether to mark the pages dirty
> + *
> + * "gup-pinned page range" refers to a range of pages that has had one of the
> + * get_user_pages() variants called on that page.

Tidy this language though; this only works with the pin_user_pages()
variants because it hardwires FOLL_PIN.

Jason


* Re: [PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release()
  2021-02-05 20:41 ` [PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release() Joao Martins
@ 2021-02-10 23:17   ` Jason Gunthorpe
  2021-02-11 10:35     ` Joao Martins
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2021-02-10 23:17 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, linux-kernel, linux-rdma, Andrew Morton, Doug Ledford,
	John Hubbard, Matthew Wilcox

On Fri, Feb 05, 2021 at 08:41:27PM +0000, Joao Martins wrote:
> Use the newly added unpin_user_page_range_dirty_lock()
> for more quickly unpinning a consecutive range of pages
> represented as compound pages. This will also calculate
> number of pages to unpin (for the tail pages which matching
> head page) and thus batch the refcount update.
> 
> Running a test program which calls mr reg/unreg on a 1G in size
> and measures cost of both operations together (in a guest using rxe)
> with THP and hugetlbfs:
> 
> Before:
> 590 rounds in 5.003 sec: 8480.335 usec / round
> 6898 rounds in 60.001 sec: 8698.367 usec / round
> 
> After:
> 2688 rounds in 5.002 sec: 1860.786 usec / round
> 32517 rounds in 60.001 sec: 1845.225 usec / round
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/infiniband/core/umem.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)

It would be best for this to go through Andrew's tree.

Acked-by: Jason Gunthorpe <jgg@nvidia.com>

4x improvement is pretty good!

Jason


* Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  2021-02-05 20:41 ` [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock() Joao Martins
  2021-02-10 23:15   ` Jason Gunthorpe
@ 2021-02-10 23:19   ` John Hubbard
  2021-02-11 10:52     ` Joao Martins
  1 sibling, 1 reply; 15+ messages in thread
From: John Hubbard @ 2021-02-10 23:19 UTC (permalink / raw)
  To: Joao Martins, linux-mm
  Cc: linux-kernel, linux-rdma, Andrew Morton, Jason Gunthorpe,
	Doug Ledford, Matthew Wilcox

On 2/5/21 12:41 PM, Joao Martins wrote:
> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
> and how many consecutive pages we want to unpin and optionally dirty.
> 
> To that end, define another iterator for_each_compound_range()
> that operates in page ranges as opposed to page array.
> 
> For users (like RDMA mr_dereg) where each sg represents a
> contiguous set of pages, we're able to more efficiently unpin
> pages without having to supply an array of pages much of what
> happens today with unpin_user_pages().
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   include/linux/mm.h |  2 ++
>   mm/gup.c           | 62 ++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 64 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a608feb0d42e..b76063f7f18a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
>   void unpin_user_page(struct page *page);
>   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>   				 bool make_dirty);
> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
> +				      bool make_dirty);
>   void unpin_user_pages(struct page **pages, unsigned long npages);
>   
>   /**
> diff --git a/mm/gup.c b/mm/gup.c
> index 467a11df216d..938964d31494 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
>   }
>   EXPORT_SYMBOL(unpin_user_page);
>   
> +static inline void compound_range_next(unsigned long i, unsigned long npages,
> +				       struct page **list, struct page **head,
> +				       unsigned int *ntails)

Yes, the new names look good, and I have failed to find any logic errors, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

> +{
> +	struct page *next, *page;
> +	unsigned int nr = 1;
> +
> +	if (i >= npages)
> +		return;
> +
> +	next = *list + i;
> +	page = compound_head(next);
> +	if (PageCompound(page) && compound_order(page) >= 1)
> +		nr = min_t(unsigned int,
> +			   page + compound_nr(page) - next, npages - i);
> +
> +	*head = page;
> +	*ntails = nr;
> +}
> +
> +#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
> +	for (__i = 0, \
> +	     compound_range_next(__i, __npages, __list, &(__head), &(__ntails)); \
> +	     __i < __npages; __i += __ntails, \
> +	     compound_range_next(__i, __npages, __list, &(__head), &(__ntails)))
> +
>   static inline void compound_next(unsigned long i, unsigned long npages,
>   				 struct page **list, struct page **head,
>   				 unsigned int *ntails)
> @@ -303,6 +329,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>   }
>   EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
>   
> +/**
> + * unpin_user_page_range_dirty_lock() - release and optionally dirty
> + * gup-pinned page range
> + *
> + * @page:  the starting page of a range maybe marked dirty, and definitely released.
> + * @npages: number of consecutive pages to release.
> + * @make_dirty: whether to mark the pages dirty
> + *
> + * "gup-pinned page range" refers to a range of pages that has had one of the
> + * get_user_pages() variants called on that page.
> + *
> + * For the page ranges defined by [page .. page+npages], make that range (or
> + * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
> + * page range was previously listed as clean.
> + *
> + * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
> + * required, then the caller should a) verify that this is really correct,
> + * because _lock() is usually required, and b) hand code it:
> + * set_page_dirty_lock(), unpin_user_page().
> + *
> + */
> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
> +				      bool make_dirty)
> +{
> +	unsigned long index;
> +	struct page *head;
> +	unsigned int ntails;
> +
> +	for_each_compound_range(index, &page, npages, head, ntails) {
> +		if (make_dirty && !PageDirty(head))
> +			set_page_dirty_lock(head);
> +		put_compound_head(head, ntails, FOLL_PIN);
> +	}
> +}
> +EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
> +
>   /**
>    * unpin_user_pages() - release an array of gup-pinned pages.
>    * @pages:  array of pages to be marked dirty and released.
> 



* Re: [PATCH v3 1/4] mm/gup: add compound page list iterator
  2021-02-05 20:41 ` [PATCH v3 1/4] mm/gup: add compound page list iterator Joao Martins
@ 2021-02-10 23:20   ` Jason Gunthorpe
  2021-02-11 10:45     ` Joao Martins
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2021-02-10 23:20 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, linux-kernel, linux-rdma, Andrew Morton, Doug Ledford,
	John Hubbard, Matthew Wilcox

On Fri, Feb 05, 2021 at 08:41:24PM +0000, Joao Martins wrote:
> Add an helper that iterates over head pages in a list of pages. It
> essentially counts the tails until the next page to process has a
> different head that the current. This is going to be used by
> unpin_user_pages() family of functions, to batch the head page refcount
> updates once for all passed consecutive tail pages.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

This can be used for check_and_migrate_cma_pages() too (there is a
series around to change this logic though, not sure if it is landed
yet)

Jason


* Re: [PATCH v3 2/4] mm/gup: decrement head page once for group of subpages
  2021-02-10 21:02   ` Jason Gunthorpe
@ 2021-02-11 10:14     ` Joao Martins
  0 siblings, 0 replies; 15+ messages in thread
From: Joao Martins @ 2021-02-11 10:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-kernel, linux-rdma, Andrew Morton, Doug Ledford,
	John Hubbard, Matthew Wilcox



On 2/10/21 9:02 PM, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 08:41:25PM +0000, Joao Martins wrote:
>> Rather than decrementing the head page refcount one by one, we
>> walk the page array and checking which belong to the same
>> compound_head. Later on we decrement the calculated amount
>> of references in a single write to the head page. To that
>> end switch to for_each_compound_head() does most of the work.
>>
>> set_page_dirty() needs no adjustment as it's a nop for
>> non-dirty head pages and it doesn't operate on tail pages.
>>
>> This considerably improves unpinning of pages with THP and
>> hugetlbfs:
>>
>> - THP
>> gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
>> PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us
>>
>> - 16G with 1G huge page size
>> gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
>> PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
>> ---
>>  mm/gup.c | 29 +++++++++++------------------
>>  1 file changed, 11 insertions(+), 18 deletions(-)
> 
> Looks fine
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
Thanks!

> I was wondering why this only touches the FOLL_PIN path, 

That's just because I was looking at pinning mostly.

> it would make
> sense to also use this same logic for release_pages()

Yeah, indeed -- any place tearing down potentially consecutive sets of pages
is a candidate.

> 
>         for (i = 0; i < nr; i++) {
>                 struct page *page = pages[i];
>                 page = compound_head(page);
>                 if (is_huge_zero_page(page))
>                         continue; 
> 
> Jason
> 


* Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  2021-02-10 23:15   ` Jason Gunthorpe
@ 2021-02-11 10:24     ` Joao Martins
  0 siblings, 0 replies; 15+ messages in thread
From: Joao Martins @ 2021-02-11 10:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-kernel, linux-rdma, Andrew Morton, Doug Ledford,
	John Hubbard, Matthew Wilcox



On 2/10/21 11:15 PM, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 08:41:26PM +0000, Joao Martins wrote:
>> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
>> and how many consecutive pages we want to unpin and optionally dirty.
>>
>> To that end, define another iterator for_each_compound_range()
>> that operates in page ranges as opposed to page array.
>>
>> For users (like RDMA mr_dereg) where each sg represents a
>> contiguous set of pages, we're able to more efficiently unpin
>> pages without having to supply an array of pages much of what
>> happens today with unpin_user_pages().
>>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/mm.h |  2 ++
>>  mm/gup.c           | 62 ++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 64 insertions(+)
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
Thanks!

>> +/**
>> + * unpin_user_page_range_dirty_lock() - release and optionally dirty
>> + * gup-pinned page range
>> + *
>> + * @page:  the starting page of a range maybe marked dirty, and definitely released.
>> + * @npages: number of consecutive pages to release.
>> + * @make_dirty: whether to mark the pages dirty
>> + *
>> + * "gup-pinned page range" refers to a range of pages that has had one of the
>> + * get_user_pages() variants called on that page.
> 
> Tidy this language though, this only works with the pin_user_pages
> variants because it hardwires FOLL_PIN
> 
Yes, I can respin a v4 with that adjustment.


* Re: [PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release()
  2021-02-10 23:17   ` Jason Gunthorpe
@ 2021-02-11 10:35     ` Joao Martins
  0 siblings, 0 replies; 15+ messages in thread
From: Joao Martins @ 2021-02-11 10:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-kernel, linux-rdma, Andrew Morton, Doug Ledford,
	John Hubbard, Matthew Wilcox

On 2/10/21 11:17 PM, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 08:41:27PM +0000, Joao Martins wrote:
>> Use the newly added unpin_user_page_range_dirty_lock()
>> for more quickly unpinning a consecutive range of pages
>> represented as compound pages. This will also calculate
>> number of pages to unpin (for the tail pages which matching
>> head page) and thus batch the refcount update.
>>
>> Running a test program which calls mr reg/unreg on a 1G in size
>> and measures cost of both operations together (in a guest using rxe)
>> with THP and hugetlbfs:
>>
>> Before:
>> 590 rounds in 5.003 sec: 8480.335 usec / round
>> 6898 rounds in 60.001 sec: 8698.367 usec / round
>>
>> After:
>> 2688 rounds in 5.002 sec: 1860.786 usec / round
>> 32517 rounds in 60.001 sec: 1845.225 usec / round
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/infiniband/core/umem.c | 12 ++++++------
>>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> Would best for this to go through Andrew's tree
> 
> Acked-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> 4x improvement is pretty good!
> 

It would only be half of that improvement if it wasn't for your
unpin_user_page_range_dirty_lock() suggestion, so thanks for all
the input :)

	Joao


* Re: [PATCH v3 1/4] mm/gup: add compound page list iterator
  2021-02-10 23:20   ` Jason Gunthorpe
@ 2021-02-11 10:45     ` Joao Martins
  0 siblings, 0 replies; 15+ messages in thread
From: Joao Martins @ 2021-02-11 10:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-kernel, linux-rdma, Andrew Morton, Doug Ledford,
	John Hubbard, Matthew Wilcox

On 2/10/21 11:20 PM, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 08:41:24PM +0000, Joao Martins wrote:
>> Add an helper that iterates over head pages in a list of pages. It
>> essentially counts the tails until the next page to process has a
>> different head that the current. This is going to be used by
>> unpin_user_pages() family of functions, to batch the head page refcount
>> updates once for all passed consecutive tail pages.
>>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
>> ---
>>  mm/gup.c | 26 ++++++++++++++++++++++++++
>>  1 file changed, 26 insertions(+)
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
Thanks!

> This can be used for check_and_migrate_cma_pages() too (there is a
> series around to change this logic though, not sure if it is landed
> yet)

It got unqueued AFAIUI.

It makes sense for most users today, except for hugetlb pages, which are also
the fastest page pinner today. Unilaterally using this iterator would make
all page types pay the added cost. So it's either keeping the current loop with
an exception for PageHuge() head pages, or doing it properly with the split
logic we were talking about on the other thread.

	Joao


* Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  2021-02-10 23:19   ` John Hubbard
@ 2021-02-11 10:52     ` Joao Martins
  0 siblings, 0 replies; 15+ messages in thread
From: Joao Martins @ 2021-02-11 10:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-kernel, linux-rdma, Andrew Morton, Jason Gunthorpe,
	Doug Ledford, Matthew Wilcox, linux-mm



On 2/10/21 11:19 PM, John Hubbard wrote:
> On 2/5/21 12:41 PM, Joao Martins wrote:
>> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
>> and how many consecutive pages we want to unpin and optionally dirty.
>>
>> To that end, define another iterator for_each_compound_range()
>> that operates in page ranges as opposed to page array.
>>
>> For users (like RDMA mr_dereg) where each sg represents a
>> contiguous set of pages, we're able to more efficiently unpin
>> pages without having to supply an array of pages much of what
>> happens today with unpin_user_pages().
>>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   include/linux/mm.h |  2 ++
>>   mm/gup.c           | 62 ++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 64 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a608feb0d42e..b76063f7f18a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
>>   void unpin_user_page(struct page *page);
>>   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   				 bool make_dirty);
>> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>> +				      bool make_dirty);
>>   void unpin_user_pages(struct page **pages, unsigned long npages);
>>   
>>   /**
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 467a11df216d..938964d31494 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
>>   }
>>   EXPORT_SYMBOL(unpin_user_page);
>>   
>> +static inline void compound_range_next(unsigned long i, unsigned long npages,
>> +				       struct page **list, struct page **head,
>> +				       unsigned int *ntails)
> 
> Yes, the new names look good, and I have failed to find any logic errors, so:
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 

Thanks again for all the input!

