* [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp
@ 2023-06-19 23:10 Peter Xu
  2023-06-19 23:10 ` [PATCH v2 1/8] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

v2:
- Added R-bs
- Merged patch 2 & 4 into a single patch [David, Mike]
- Remove redundant compound_head() [Matthew]
- Still handle failure cases of try_get_folio/page() [Lorenzo]
- Modified more comments in the last removal patch [Lorenzo]
- Fixed hugetlb npages>1 longterm pin issue
- Added two testcase patches at the end

Hugetlb has a special path for slow gup in which follow_page_mask() is
skipped completely, along with faultin_page().  It's not only confusing,
but it also duplicates a lot of logic that generic gup already has, making
hugetlb slightly special.

This patchset tries to dedup that logic: first touch up the slow gup code
so it can handle hugetlb pages correctly with the current follow-page and
fault-in routines (we're mostly there already; a thp optimization was
attempted 10 years ago but left half done, more below), then drop the
special path in patch 6, after which hugetlb gup always goes through the
generic routine too, via faultin_page().

Note that hugetlb is still special for gup, mostly due to the pgtable
walking (hugetlb_walk()) that we rely on, which is currently per-arch.  But
this is still one small step forward, and the diffstat may be proof enough
that it is worthwhile.

Then, for the "speed up thp" side: as a side effect of looking at this
chunk of code, I found that thp support is actually only partially done.
That doesn't mean thp won't work for gup, but as long as a **pages pointer
is passed in, the optimization is skipped.  Patch 5 should address that, so
for thp we now get full-speed gup.
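To make that concrete, the old __get_user_pages() loop did roughly the
following whenever a **pages array was supplied (paraphrased from the hunk
removed in patch 5, not exact code):

  if (pages) {
          pages[i] = page;
          flush_anon_page(vma, page, start);
          flush_dcache_page(page);
          ctx.page_mask = 0;   /* drops the THP hint when **pages is used */
  }
  /* ... so page_increm below always ends up being 1 in that case */
  page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);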

For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10" gives
me 13992.50us -> 378.50us.  gup_test is an extreme case, but it shows how
much this affects thp gups.

Patches 1-5: prepare for the switch
Patch 6:     switch over to the new code and remove the old
Patches 7-8: add a gup test matrix to run_vmtests.sh

Please review, thanks.

Peter Xu (8):
  mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
  mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  mm/gup: Cleanup next_page handling
  mm/gup: Accelerate thp gup even for "pages != NULL"
  mm/gup: Retire follow_hugetlb_page()
  selftests/mm: Add -a to run_vmtests.sh
  selftests/mm: Add gup test matrix in run_vmtests.sh

 fs/userfaultfd.c                          |   2 +-
 include/linux/hugetlb.h                   |  20 +-
 mm/gup.c                                  |  83 ++++---
 mm/hugetlb.c                              | 256 +++-------------------
 tools/testing/selftests/mm/run_vmtests.sh |  48 +++-
 5 files changed, 119 insertions(+), 290 deletions(-)

-- 
2.40.1



* [PATCH v2 1/8] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
  2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
@ 2023-06-19 23:10 ` Peter Xu
  2023-06-19 23:10 ` [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

Firstly, the no_page_table() call is meaningless for hugetlb: it is a no-op
there, because a hugetlb vma always satisfies:

  - vma_is_anonymous() == false
  - vma->vm_ops->fault != NULL

So we can already safely remove it from the hugetlb path of
follow_page_mask(), along with the page* variable.
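
For reference, the FOLL_DUMP handling in no_page_table() is roughly the
following (a paraphrased sketch, not an exact copy of the code):

  if ((flags & FOLL_DUMP) &&
      (vma_is_anonymous(vma) || !vma->vm_ops->fault))
          return ERR_PTR(-EFAULT);   /* leave a hole in the dump */
  return NULL;

Since a hugetlb vma is never anonymous and always provides vm_ops->fault,
the branch can never trigger there, so the call only ever returns NULL.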

Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a
dump: we try to fault in the page only if the page cache is already
allocated.  Let's do the same here for follow_page_mask() on hugetlb.

So far this should have zero effect on real dumps, because those still go
through follow_hugetlb_page().  But it may start to influence follow_page()
users who mimic a "dump page" scenario, hopefully in a good way.  This also
paves the way for unifying the hugetlb gup-slow path.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 9 ++-------
 mm/hugetlb.c | 9 +++++++++
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ce14d4d28503..abcd841d94b7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -767,7 +767,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 			      struct follow_page_context *ctx)
 {
 	pgd_t *pgd;
-	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 
 	ctx->page_mask = 0;
@@ -780,12 +779,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	 * hugetlb_follow_page_mask is only for follow_page() handling here.
 	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
 	 */
-	if (is_vm_hugetlb_page(vma)) {
-		page = hugetlb_follow_page_mask(vma, address, flags);
-		if (!page)
-			page = no_page_table(vma, flags);
-		return page;
-	}
+	if (is_vm_hugetlb_page(vma))
+		return hugetlb_follow_page_mask(vma, address, flags);
 
 	pgd = pgd_offset(mm, address);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d76574425da3..f75f5e78ff0b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6498,6 +6498,15 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	spin_unlock(ptl);
 out_unlock:
 	hugetlb_vma_unlock_read(vma);
+
+	/*
+	 * Fixup retval for dump requests: if pagecache doesn't exist,
+	 * don't try to allocate a new page but just skip it.
+	 */
+	if (!page && (flags & FOLL_DUMP) &&
+	    !hugetlbfs_pagecache_present(h, vma, address))
+		page = ERR_PTR(-EFAULT);
+
 	return page;
 }
 
-- 
2.40.1



* [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
  2023-06-19 23:10 ` [PATCH v2 1/8] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
@ 2023-06-19 23:10 ` Peter Xu
  2023-06-20 15:22   ` David Hildenbrand
  2023-06-20 15:28   ` David Hildenbrand
  2023-06-19 23:10 ` [PATCH v2 3/8] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

follow_page() doesn't use FOLL_PIN, and hugetlb doesn't seem to be a target
of FOLL_WRITE either.  However, add the checks anyway.

Namely, check either the need to CoW due to a missing write bit, or the
need for proper CoR on !AnonExclusive pages over R/O pins, rejecting the
follow page in both cases.  That brings this function closer to
follow_hugetlb_page().

So we didn't care before, and we still don't strictly need to now.  But
we'll care once slow-gup switches over to hugetlb_follow_page_mask().
We'll also care about returning -EMLINK properly then, as that's the gup
internal api meaning "we should do CoR".  It's not really needed for the
follow_page() path, though.

While at it, switch the try_grab_page() to use WARN_ON_ONCE(), to make
clear that it should just never fail.
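
For context, the -EMLINK mentioned above is consumed by the slow-gup caller
roughly as below (a paraphrased sketch of the __get_user_pages() flow,
assuming the current faultin_page() signature; not exact code):

  page = follow_page_mask(vma, start, foll_flags, &ctx);
  if (!page || PTR_ERR(page) == -EMLINK) {
          /* unshare == true makes faultin_page() use FAULT_FLAG_UNSHARE */
          ret = faultin_page(vma, start, &foll_flags,
                             PTR_ERR(page) == -EMLINK, locked);
          ...
  }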

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f75f5e78ff0b..9a6918c4250a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6463,13 +6463,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t *pte, entry;
 
-	/*
-	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
-	 * follow_hugetlb_page().
-	 */
-	if (WARN_ON_ONCE(flags & FOLL_PIN))
-		return NULL;
-
 	hugetlb_vma_lock_read(vma);
 	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
 	if (!pte)
@@ -6478,8 +6471,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	ptl = huge_pte_lock(h, mm, pte);
 	entry = huge_ptep_get(pte);
 	if (pte_present(entry)) {
-		page = pte_page(entry) +
-				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+		page = pte_page(entry);
+
+		if (gup_must_unshare(vma, flags, page)) {
+			/* Tell the caller to do Copy-On-Read */
+			page = ERR_PTR(-EMLINK);
+			goto out;
+		}
+
+		if ((flags & FOLL_WRITE) && !pte_write(entry)) {
+			page = NULL;
+			goto out;
+		}
+
+		page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+
 		/*
 		 * Note that page may be a sub-page, and with vmemmap
 		 * optimizations the page struct may be read only.
-- 
2.40.1



* [PATCH v2 3/8] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
  2023-06-19 23:10 ` [PATCH v2 1/8] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
  2023-06-19 23:10 ` [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
@ 2023-06-19 23:10 ` Peter Xu
  2023-06-20 15:23   ` David Hildenbrand
  2023-06-19 23:10 ` [PATCH v2 4/8] mm/gup: Cleanup next_page handling Peter Xu
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

follow_page() doesn't need it, but we'll start to need it when unifying gup
for hugetlb.
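
For reference, the value stored into *page_mask below works out as follows
(a worked example assuming a 2M hugepage and 4K base pages):

  /*
   * huge_page_mask(h)                 == ~(2M - 1)
   * ~huge_page_mask(h)                ==   2M - 1  == 0x1fffff
   * ~huge_page_mask(h) >> PAGE_SHIFT  ==   511     == pages_per_huge_page(h) - 1
   */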

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h | 8 +++++---
 mm/gup.c                | 3 ++-
 mm/hugetlb.c            | 5 ++++-
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index beb7c63d2871..2e2d89e79d6c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
 			    struct vm_area_struct *, struct vm_area_struct *);
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				unsigned long address, unsigned int flags);
+				      unsigned long address, unsigned int flags,
+				      unsigned int *page_mask);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 struct page **, unsigned long *, unsigned long *,
 			 long, unsigned int, int *);
@@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
 {
 }
 
-static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				unsigned long address, unsigned int flags)
+static inline struct page *hugetlb_follow_page_mask(
+    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
+    unsigned int *page_mask)
 {
 	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
 }
diff --git a/mm/gup.c b/mm/gup.c
index abcd841d94b7..9fc9271cba8d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -780,7 +780,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
 	 */
 	if (is_vm_hugetlb_page(vma))
-		return hugetlb_follow_page_mask(vma, address, flags);
+		return hugetlb_follow_page_mask(vma, address, flags,
+						&ctx->page_mask);
 
 	pgd = pgd_offset(mm, address);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a6918c4250a..fbf6a09c0ec4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6454,7 +6454,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
 }
 
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				unsigned long address, unsigned int flags)
+				      unsigned long address, unsigned int flags,
+				      unsigned int *page_mask)
 {
 	struct hstate *h = hstate_vma(vma);
 	struct mm_struct *mm = vma->vm_mm;
@@ -6499,6 +6500,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 			page = NULL;
 			goto out;
 		}
+
+		*page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
 	}
 out:
 	spin_unlock(ptl);
-- 
2.40.1



* [PATCH v2 4/8] mm/gup: Cleanup next_page handling
  2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (2 preceding siblings ...)
  2023-06-19 23:10 ` [PATCH v2 3/8] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
@ 2023-06-19 23:10 ` Peter Xu
  2023-06-20 15:23   ` David Hildenbrand
  2023-06-19 23:10 ` [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

The only path that doesn't use the generic "**pages" handling is the gate
vma.  Make it use the same path, and move the next_page label up so it also
covers the "**pages" handling.  This prepares for THP handling of "**pages".

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 9fc9271cba8d..4a00d609033e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1124,7 +1124,7 @@ static long __get_user_pages(struct mm_struct *mm,
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
-						pages ? &pages[i] : NULL);
+						pages ? &page : NULL);
 				if (ret)
 					goto out;
 				ctx.page_mask = 0;
@@ -1194,19 +1194,18 @@ static long __get_user_pages(struct mm_struct *mm,
 				ret = PTR_ERR(page);
 				goto out;
 			}
-
-			goto next_page;
 		} else if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			goto out;
 		}
+next_page:
 		if (pages) {
 			pages[i] = page;
 			flush_anon_page(vma, page, start);
 			flush_dcache_page(page);
 			ctx.page_mask = 0;
 		}
-next_page:
+
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
 		if (page_increm > nr_pages)
 			page_increm = nr_pages;
-- 
2.40.1



* [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (3 preceding siblings ...)
  2023-06-19 23:10 ` [PATCH v2 4/8] mm/gup: Cleanup next_page handling Peter Xu
@ 2023-06-19 23:10 ` Peter Xu
  2023-06-20 15:43   ` David Hildenbrand
  2023-06-20 21:43   ` Lorenzo Stoakes
  2023-06-19 23:10 ` [PATCH v2 6/8] mm/gup: Retire follow_hugetlb_page() Peter Xu
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

The acceleration of THP was done with ctx.page_mask; however, it is
ignored if **pages is non-NULL.

The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
accelerate mm_populate() treatment of THP pages").  It didn't explain why
the **pages non-NULL case can't be optimized.  It's possible that at that
time the major goal was mm_populate(), for which this was enough.

Optimize thp for all cases by properly looping over each subpage, doing
cache flushes, and boosting refcounts / pincounts where needed, in one go.

This can be verified using gup_test below:

  # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10

Before:    13992.50 ( +-8.75%)
After:       378.50 (+-69.62%)
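
For intuition on where the speedup comes from, here is a worked example of
the page_increm computation (assuming a 2M THP and 4K base pages, so
ctx.page_mask == 511):

  /*
   * If "start" points at subpage 5 of the THP:
   *   page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask)
   *               = 1 + (~5 & 511) = 1 + 506 = 507
   * i.e. the remaining 507 subpages are filled into **pages in a single
   * iteration, with one extra try_grab_folio(page, 506, ...) call taking
   * the refs/pins for them in one go.
   */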

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 4a00d609033e..b50272012e49 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
 			goto out;
 		}
 next_page:
-		if (pages) {
-			pages[i] = page;
-			flush_anon_page(vma, page, start);
-			flush_dcache_page(page);
-			ctx.page_mask = 0;
-		}
-
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
 		if (page_increm > nr_pages)
 			page_increm = nr_pages;
+
+		if (pages) {
+			struct page *subpage;
+			unsigned int j;
+
+			/*
+			 * This must be a large folio (and doesn't need to
+			 * be the whole folio; it can be part of it), do
+			 * the refcount work for all the subpages too.
+			 *
+			 * NOTE: here the page may not be the head page
+			 * e.g. when start addr is not thp-size aligned.
+			 * try_grab_folio() should have taken care of tail
+			 * pages.
+			 */
+			if (page_increm > 1) {
+				struct folio *folio;
+
+				/*
+				 * Since we already hold refcount on the
+				 * large folio, this should never fail.
+				 */
+				folio = try_grab_folio(page, page_increm - 1,
+						       foll_flags);
+				if (WARN_ON_ONCE(!folio)) {
+					/*
+					 * Release the 1st page ref if the
+					 * folio is problematic, fail hard.
+					 */
+					gup_put_folio(page_folio(page), 1,
+						      foll_flags);
+					ret = -EFAULT;
+					goto out;
+				}
+			}
+
+			for (j = 0; j < page_increm; j++) {
+				subpage = nth_page(page, j);
+				pages[i+j] = subpage;
+				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
+				flush_dcache_page(subpage);
+			}
+		}
+
 		i += page_increm;
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
-- 
2.40.1



* [PATCH v2 6/8] mm/gup: Retire follow_hugetlb_page()
  2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (4 preceding siblings ...)
  2023-06-19 23:10 ` [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
@ 2023-06-19 23:10 ` Peter Xu
  2023-06-19 23:10 ` [PATCH v2 7/8] selftests/mm: Add -a to run_vmtests.sh Peter Xu
  2023-06-19 23:10 ` [PATCH v2 8/8] selftests/mm: Add gup test matrix in run_vmtests.sh Peter Xu
  7 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

Now __get_user_pages() should be well prepared to handle thp completely,
as well as hugetlb gup requests, even without hugetlb's special path.

Time to retire follow_hugetlb_page().

Tweak misc comments to reflect the removal of follow_hugetlb_page().

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c        |   2 +-
 include/linux/hugetlb.h |  12 ---
 mm/gup.c                |  19 ----
 mm/hugetlb.c            | 224 ----------------------------------------
 4 files changed, 1 insertion(+), 256 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..ae711f1d7a83 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -427,7 +427,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 *
 	 * We also don't do userfault handling during
 	 * coredumping. hugetlbfs has the special
-	 * follow_hugetlb_page() to skip missing pages in the
+	 * hugetlb_follow_page_mask() to skip missing pages in the
 	 * FOLL_DUMP case, anon memory also checks for FOLL_DUMP with
 	 * the no_page_table() helper in follow_page_mask(), but the
 	 * shmem_vm_ops->fault method is invoked even during
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2e2d89e79d6c..bb5024718fc1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -133,9 +133,6 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 				      unsigned long address, unsigned int flags,
 				      unsigned int *page_mask);
-long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
-			 struct page **, unsigned long *, unsigned long *,
-			 long, unsigned int, int *);
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
@@ -305,15 +302,6 @@ static inline struct page *hugetlb_follow_page_mask(
 	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
 }
 
-static inline long follow_hugetlb_page(struct mm_struct *mm,
-			struct vm_area_struct *vma, struct page **pages,
-			unsigned long *position, unsigned long *nr_pages,
-			long i, unsigned int flags, int *nonblocking)
-{
-	BUG();
-	return 0;
-}
-
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
 					  struct mm_struct *src,
 					  struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index b50272012e49..e6c1e524bd6b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -775,9 +775,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
 	 * special hugetlb page table walking code.  This eliminates the
 	 * need to check for hugetlb entries in the general walking code.
-	 *
-	 * hugetlb_follow_page_mask is only for follow_page() handling here.
-	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
 	 */
 	if (is_vm_hugetlb_page(vma))
 		return hugetlb_follow_page_mask(vma, address, flags,
@@ -1138,22 +1135,6 @@ static long __get_user_pages(struct mm_struct *mm,
 			ret = check_vma_flags(vma, gup_flags);
 			if (ret)
 				goto out;
-
-			if (is_vm_hugetlb_page(vma)) {
-				i = follow_hugetlb_page(mm, vma, pages,
-							&start, &nr_pages, i,
-							gup_flags, locked);
-				if (!*locked) {
-					/*
-					 * We've got a VM_FAULT_RETRY
-					 * and we've lost mmap_lock.
-					 * We must stop here.
-					 */
-					BUG_ON(gup_flags & FOLL_NOWAIT);
-					goto out;
-				}
-				continue;
-			}
 		}
 retry:
 		/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fbf6a09c0ec4..da4c76bee01f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5721,7 +5721,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 
 /*
  * Return whether there is a pagecache page to back given address within VMA.
- * Caller follow_hugetlb_page() holds page_table_lock so we cannot lock_page.
  */
 static bool hugetlbfs_pagecache_present(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long address)
@@ -6422,37 +6421,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */
 
-static void record_subpages(struct page *page, struct vm_area_struct *vma,
-			    int refs, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; nr < refs; nr++) {
-		if (likely(pages))
-			pages[nr] = nth_page(page, nr);
-	}
-}
-
-static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
-					       unsigned int flags, pte_t *pte,
-					       bool *unshare)
-{
-	pte_t pteval = huge_ptep_get(pte);
-
-	*unshare = false;
-	if (is_swap_pte(pteval))
-		return true;
-	if (huge_pte_write(pteval))
-		return false;
-	if (flags & FOLL_WRITE)
-		return true;
-	if (gup_must_unshare(vma, flags, pte_page(pteval))) {
-		*unshare = true;
-		return true;
-	}
-	return false;
-}
-
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 				      unsigned long address, unsigned int flags,
 				      unsigned int *page_mask)
@@ -6519,198 +6487,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	return page;
 }
 
-long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			 struct page **pages, unsigned long *position,
-			 unsigned long *nr_pages, long i, unsigned int flags,
-			 int *locked)
-{
-	unsigned long pfn_offset;
-	unsigned long vaddr = *position;
-	unsigned long remainder = *nr_pages;
-	struct hstate *h = hstate_vma(vma);
-	int err = -EFAULT, refs;
-
-	while (vaddr < vma->vm_end && remainder) {
-		pte_t *pte;
-		spinlock_t *ptl = NULL;
-		bool unshare = false;
-		int absent;
-		struct page *page;
-
-		/*
-		 * If we have a pending SIGKILL, don't keep faulting pages and
-		 * potentially allocating memory.
-		 */
-		if (fatal_signal_pending(current)) {
-			remainder = 0;
-			break;
-		}
-
-		hugetlb_vma_lock_read(vma);
-		/*
-		 * Some archs (sparc64, sh*) have multiple pte_ts to
-		 * each hugepage.  We have to make sure we get the
-		 * first, for the page indexing below to work.
-		 *
-		 * Note that page table lock is not held when pte is null.
-		 */
-		pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
-				   huge_page_size(h));
-		if (pte)
-			ptl = huge_pte_lock(h, mm, pte);
-		absent = !pte || huge_pte_none(huge_ptep_get(pte));
-
-		/*
-		 * When coredumping, it suits get_dump_page if we just return
-		 * an error where there's an empty slot with no huge pagecache
-		 * to back it.  This way, we avoid allocating a hugepage, and
-		 * the sparse dumpfile avoids allocating disk blocks, but its
-		 * huge holes still show up with zeroes where they need to be.
-		 */
-		if (absent && (flags & FOLL_DUMP) &&
-		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
-			if (pte)
-				spin_unlock(ptl);
-			hugetlb_vma_unlock_read(vma);
-			remainder = 0;
-			break;
-		}
-
-		/*
-		 * We need call hugetlb_fault for both hugepages under migration
-		 * (in which case hugetlb_fault waits for the migration,) and
-		 * hwpoisoned hugepages (in which case we need to prevent the
-		 * caller from accessing to them.) In order to do this, we use
-		 * here is_swap_pte instead of is_hugetlb_entry_migration and
-		 * is_hugetlb_entry_hwpoisoned. This is because it simply covers
-		 * both cases, and because we can't follow correct pages
-		 * directly from any kind of swap entries.
-		 */
-		if (absent ||
-		    __follow_hugetlb_must_fault(vma, flags, pte, &unshare)) {
-			vm_fault_t ret;
-			unsigned int fault_flags = 0;
-
-			if (pte)
-				spin_unlock(ptl);
-			hugetlb_vma_unlock_read(vma);
-
-			if (flags & FOLL_WRITE)
-				fault_flags |= FAULT_FLAG_WRITE;
-			else if (unshare)
-				fault_flags |= FAULT_FLAG_UNSHARE;
-			if (locked) {
-				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
-					FAULT_FLAG_KILLABLE;
-				if (flags & FOLL_INTERRUPTIBLE)
-					fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
-			}
-			if (flags & FOLL_NOWAIT)
-				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
-					FAULT_FLAG_RETRY_NOWAIT;
-			if (flags & FOLL_TRIED) {
-				/*
-				 * Note: FAULT_FLAG_ALLOW_RETRY and
-				 * FAULT_FLAG_TRIED can co-exist
-				 */
-				fault_flags |= FAULT_FLAG_TRIED;
-			}
-			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
-			if (ret & VM_FAULT_ERROR) {
-				err = vm_fault_to_errno(ret, flags);
-				remainder = 0;
-				break;
-			}
-			if (ret & VM_FAULT_RETRY) {
-				if (locked &&
-				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-					*locked = 0;
-				*nr_pages = 0;
-				/*
-				 * VM_FAULT_RETRY must not return an
-				 * error, it will return zero
-				 * instead.
-				 *
-				 * No need to update "position" as the
-				 * caller will not check it after
-				 * *nr_pages is set to 0.
-				 */
-				return i;
-			}
-			continue;
-		}
-
-		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
-		page = pte_page(huge_ptep_get(pte));
-
-		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
-			       !PageAnonExclusive(page), page);
-
-		/*
-		 * If subpage information not requested, update counters
-		 * and skip the same_page loop below.
-		 */
-		if (!pages && !pfn_offset &&
-		    (vaddr + huge_page_size(h) < vma->vm_end) &&
-		    (remainder >= pages_per_huge_page(h))) {
-			vaddr += huge_page_size(h);
-			remainder -= pages_per_huge_page(h);
-			i += pages_per_huge_page(h);
-			spin_unlock(ptl);
-			hugetlb_vma_unlock_read(vma);
-			continue;
-		}
-
-		/* vaddr may not be aligned to PAGE_SIZE */
-		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
-		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
-
-		if (pages)
-			record_subpages(nth_page(page, pfn_offset),
-					vma, refs,
-					likely(pages) ? pages + i : NULL);
-
-		if (pages) {
-			/*
-			 * try_grab_folio() should always succeed here,
-			 * because: a) we hold the ptl lock, and b) we've just
-			 * checked that the huge page is present in the page
-			 * tables. If the huge page is present, then the tail
-			 * pages must also be present. The ptl prevents the
-			 * head page and tail pages from being rearranged in
-			 * any way. As this is hugetlb, the pages will never
-			 * be p2pdma or not longterm pinable. So this page
-			 * must be available at this point, unless the page
-			 * refcount overflowed:
-			 */
-			if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
-							 flags))) {
-				spin_unlock(ptl);
-				hugetlb_vma_unlock_read(vma);
-				remainder = 0;
-				err = -ENOMEM;
-				break;
-			}
-		}
-
-		vaddr += (refs << PAGE_SHIFT);
-		remainder -= refs;
-		i += refs;
-
-		spin_unlock(ptl);
-		hugetlb_vma_unlock_read(vma);
-	}
-	*nr_pages = remainder;
-	/*
-	 * setting position is actually required only if remainder is
-	 * not zero but it's faster not to add a "if (remainder)"
-	 * branch.
-	 */
-	*position = vaddr;
-
-	return i ? i : err;
-}
-
 long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
-- 
2.40.1



* [PATCH v2 7/8] selftests/mm: Add -a to run_vmtests.sh
  2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (5 preceding siblings ...)
  2023-06-19 23:10 ` [PATCH v2 6/8] mm/gup: Retire follow_hugetlb_page() Peter Xu
@ 2023-06-19 23:10 ` Peter Xu
  2023-06-19 23:10 ` [PATCH v2 8/8] selftests/mm: Add gup test matrix in run_vmtests.sh Peter Xu
  7 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

Allow specifying optional tests in run_vmtests.sh, so that the
time-consuming test matrix only runs when the user passes "-a".
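
For example (an illustrative usage sketch based on the options added in
this and the next patch):

  # ./run_vmtests.sh              # default: required tests only
  # ./run_vmtests.sh -a           # also run the extra (slow) test matrix
  # ./run_vmtests.sh -t gup_test  # only run the gup_test category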

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/mm/run_vmtests.sh | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 3f26f6e15b2a..824e651f62f4 100644
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -12,11 +12,14 @@ exitcode=0
 
 usage() {
 	cat <<EOF
-usage: ${BASH_SOURCE[0]:-$0} [ -h | -t "<categories>"]
+usage: ${BASH_SOURCE[0]:-$0} [ options ]
+
+  -a: run all tests, including extra ones
   -t: specify specific categories to tests to run
   -h: display this message
 
-The default behavior is to run all tests.
+The default behavior is to run required tests only.  If -a is specified,
+will run all tests.
 
 Alternatively, specific groups tests can be run by passing a string
 to the -t argument containing one or more of the following categories
@@ -60,9 +63,11 @@ EOF
 	exit 0
 }
 
+RUN_ALL=false
 
-while getopts "ht:" OPT; do
+while getopts "aht:" OPT; do
 	case ${OPT} in
+		"a") RUN_ALL=true ;;
 		"h") usage ;;
 		"t") VM_SELFTEST_ITEMS=${OPTARG} ;;
 	esac
-- 
2.40.1



* [PATCH v2 8/8] selftests/mm: Add gup test matrix in run_vmtests.sh
  2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (6 preceding siblings ...)
  2023-06-19 23:10 ` [PATCH v2 7/8] selftests/mm: Add -a to run_vmtests.sh Peter Xu
@ 2023-06-19 23:10 ` Peter Xu
  7 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-19 23:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, peterx,
	Jason Gunthorpe

Add a matrix for testing gup based on the current gup_test.  Only run the
matrix when -a is specified because it's a bit slow.

It covers:

  - Different types of huge pages: thp, hugetlb, or no huge page
  - Permissions: Write / Read-only
  - Fast-gup, with/without
  - Types of the GUP: pin / gup / longterm pins
  - Shared / Private memories
  - GUP size: 1 / 512 / random page sizes
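
As an illustration, a single cell of this matrix expands to an invocation
like the following (the hugetlb size argument comes from the hugetlb_mb
value computed in the script):

  # ./gup_test -H -m $hugetlb_mb -L -w -S -n 512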

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/mm/run_vmtests.sh | 37 ++++++++++++++++++++---
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 824e651f62f4..9666c0c171ab 100644
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -90,6 +90,30 @@ test_selected() {
 	fi
 }
 
+run_gup_matrix() {
+    # -t: thp=on, -T: thp=off, -H: hugetlb=on
+    local hugetlb_mb=$(( needmem_KB / 1024 ))
+
+    for huge in -t -T "-H -m $hugetlb_mb"; do
+        # -u: gup-fast, -U: gup-basic, -a: pin-fast, -b: pin-basic, -L: pin-longterm
+        for test_cmd in -u -U -a -b -L; do
+            # -w: write=1, -W: write=0
+            for write in -w -W; do
+                # -S: shared
+                for share in -S " "; do
+                    # -n: How many pages to fetch together?  512 is special
+                    # because it's default thp size (or 2M on x86), 123 to
+                    # just test partial gup when hit a huge in whatever form
+                    for num in "-n 1" "-n 512" "-n 123"; do
+                        CATEGORY="gup_test" run_test ./gup_test \
+                                $huge $test_cmd $write $share $num
+                    done
+                done
+            done
+        done
+    done
+}
+
 # get huge pagesize and freepages from /proc/meminfo
 while read -r name size unit; do
 	if [ "$name" = "HugePages_Free:" ]; then
@@ -194,13 +218,16 @@ fi
 
 CATEGORY="mmap" run_test ./map_fixed_noreplace
 
-# get_user_pages_fast() benchmark
-CATEGORY="gup_test" run_test ./gup_test -u
-# pin_user_pages_fast() benchmark
-CATEGORY="gup_test" run_test ./gup_test -a
+if $RUN_ALL; then
+    run_gup_matrix
+else
+    # get_user_pages_fast() benchmark
+    CATEGORY="gup_test" run_test ./gup_test -u
+    # pin_user_pages_fast() benchmark
+    CATEGORY="gup_test" run_test ./gup_test -a
+fi
 # Dump pages 0, 19, and 4096, using pin_user_pages:
 CATEGORY="gup_test" run_test ./gup_test -ct -F 0x1 0 19 0x1000
-
 CATEGORY="gup_test" run_test ./gup_longterm
 
 CATEGORY="userfaultfd" run_test ./uffd-unit-tests
-- 
2.40.1



* Re: [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-19 23:10 ` [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
@ 2023-06-20 15:22   ` David Hildenbrand
  2023-06-20 16:03     ` Peter Xu
  2023-06-20 15:28   ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2023-06-20 15:22 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, Matthew Wilcox, Vlastimil Babka,
	John Hubbard, Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On 20.06.23 01:10, Peter Xu wrote:
> follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
> target of FOLL_WRITE either.  However add the checks.
> 
> Namely, either the need to CoW due to missing write bit, or proper CoR on

s/CoR/unsharing/

> !AnonExclusive pages over R/O pins to reject the follow page.  That brings
> this function closer to follow_hugetlb_page().
> 
> So we don't care before, and also for now.  But we'll care if we switch
> over slow-gup to use hugetlb_follow_page_mask().  We'll also care when to
> return -EMLINK properly, as that's the gup internal api to mean "we should
> do CoR".  Not really needed for follow page path, though.

"we should unshare".

> 
> When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> clear that it just should never fail.
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/hugetlb.c | 24 +++++++++++++++---------
>   1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f75f5e78ff0b..9a6918c4250a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6463,13 +6463,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>   	spinlock_t *ptl;
>   	pte_t *pte, entry;
>   
> -	/*
> -	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> -	 * follow_hugetlb_page().
> -	 */
> -	if (WARN_ON_ONCE(flags & FOLL_PIN))
> -		return NULL;
> -
>   	hugetlb_vma_lock_read(vma);
>   	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
>   	if (!pte)
> @@ -6478,8 +6471,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>   	ptl = huge_pte_lock(h, mm, pte);
>   	entry = huge_ptep_get(pte);
>   	if (pte_present(entry)) {
> -		page = pte_page(entry) +
> -				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> +		page = pte_page(entry);
> +
> +		if (gup_must_unshare(vma, flags, page)) {

All other callers (like follow_page_pte(), including 
__follow_hugetlb_must_fault())

(a) check for write permissions first.

(b) check for gup_must_unshare() only if !pte_write(entry)

I'd vote to keep these checks as similar as possible to the other GUP code.
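(For illustration, a sketch of that ordering reusing the helpers from the
hunk above; an assumption about the preferred shape, not a definitive
version:)

  if ((flags & FOLL_WRITE) && !pte_write(entry)) {
          page = NULL;                    /* caller falls back to faulting */
          goto out;
  }
  if (!pte_write(entry) && gup_must_unshare(vma, flags, page)) {
          page = ERR_PTR(-EMLINK);        /* tell the caller to unshare */
          goto out;
  }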

> +			/* Tell the caller to do Copy-On-Read */

"Tell the caller to unshare".

> +			page = ERR_PTR(-EMLINK);
> +			goto out;
> +		}
> +
> +		if ((flags & FOLL_WRITE) && !pte_write(entry)) {
> +			page = NULL;
> +			goto out;
> +		}


I'm confused about pte_write() vs. huge_pte_write(), and I don't know 
what's right or wrong here.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 3/8] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-19 23:10 ` [PATCH v2 3/8] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
@ 2023-06-20 15:23   ` David Hildenbrand
  2023-06-20 16:28     ` Peter Xu
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2023-06-20 15:23 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, Matthew Wilcox, Vlastimil Babka,
	John Hubbard, Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On 20.06.23 01:10, Peter Xu wrote:
> follow_page() doesn't need it, but we'll start to need it when unifying gup
> for hugetlb.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/linux/hugetlb.h | 8 +++++---
>   mm/gup.c                | 3 ++-
>   mm/hugetlb.c            | 5 ++++-
>   3 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index beb7c63d2871..2e2d89e79d6c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>   int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
>   			    struct vm_area_struct *, struct vm_area_struct *);
>   struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags);
> +				      unsigned long address, unsigned int flags,
> +				      unsigned int *page_mask);
>   long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>   			 struct page **, unsigned long *, unsigned long *,
>   			 long, unsigned int, int *);
> @@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
>   {
>   }
>   
> -static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags)
> +static inline struct page *hugetlb_follow_page_mask(
> +    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
> +    unsigned int *page_mask)
>   {
>   	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
>   }
> diff --git a/mm/gup.c b/mm/gup.c
> index abcd841d94b7..9fc9271cba8d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -780,7 +780,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>   	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
>   	 */
>   	if (is_vm_hugetlb_page(vma))
> -		return hugetlb_follow_page_mask(vma, address, flags);
> +		return hugetlb_follow_page_mask(vma, address, flags,
> +						&ctx->page_mask);
>   
>   	pgd = pgd_offset(mm, address);
>   
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9a6918c4250a..fbf6a09c0ec4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6454,7 +6454,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
>   }
>   
>   struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags)
> +				      unsigned long address, unsigned int flags,
> +				      unsigned int *page_mask)
>   {
>   	struct hstate *h = hstate_vma(vma);
>   	struct mm_struct *mm = vma->vm_mm;
> @@ -6499,6 +6500,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>   			page = NULL;
>   			goto out;
>   		}
> +
> +		*page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;

As discussed, can be simplified. But can be done on top (or not at all, 
but it is confusing code).

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 4/8] mm/gup: Cleanup next_page handling
  2023-06-19 23:10 ` [PATCH v2 4/8] mm/gup: Cleanup next_page handling Peter Xu
@ 2023-06-20 15:23   ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2023-06-20 15:23 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, Matthew Wilcox, Vlastimil Babka,
	John Hubbard, Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On 20.06.23 01:10, Peter Xu wrote:
> The only path that doesn't use generic "**pages" handling is the gate vma.
> Make it use the same path, meanwhile tune the next_page label upper to
> cover "**pages" handling.  This prepares for THP handling for "**pages".
> 
> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/gup.c | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 9fc9271cba8d..4a00d609033e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1124,7 +1124,7 @@ static long __get_user_pages(struct mm_struct *mm,
>   			if (!vma && in_gate_area(mm, start)) {
>   				ret = get_gate_page(mm, start & PAGE_MASK,
>   						gup_flags, &vma,
> -						pages ? &pages[i] : NULL);
> +						pages ? &page : NULL);
>   				if (ret)
>   					goto out;
>   				ctx.page_mask = 0;
> @@ -1194,19 +1194,18 @@ static long __get_user_pages(struct mm_struct *mm,
>   				ret = PTR_ERR(page);
>   				goto out;
>   			}
> -
> -			goto next_page;
>   		} else if (IS_ERR(page)) {
>   			ret = PTR_ERR(page);
>   			goto out;
>   		}
> +next_page:
>   		if (pages) {
>   			pages[i] = page;
>   			flush_anon_page(vma, page, start);
>   			flush_dcache_page(page);
>   			ctx.page_mask = 0;
>   		}
> -next_page:
> +
>   		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>   		if (page_increm > nr_pages)
>   			page_increm = nr_pages;

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-19 23:10 ` [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
  2023-06-20 15:22   ` David Hildenbrand
@ 2023-06-20 15:28   ` David Hildenbrand
  2023-06-20 16:06     ` Peter Xu
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2023-06-20 15:28 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, Matthew Wilcox, Vlastimil Babka,
	John Hubbard, Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On 20.06.23 01:10, Peter Xu wrote:
> follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
> target of FOLL_WRITE either.  However add the checks.
> 
> Namely, either the need to CoW due to missing write bit, or proper CoR on
> !AnonExclusive pages over R/O pins to reject the follow page.  That brings
> this function closer to follow_hugetlb_page().
> 
> So we don't care before, and also for now.  But we'll care if we switch
> over slow-gup to use hugetlb_follow_page_mask().  We'll also care when to
> return -EMLINK properly, as that's the gup internal api to mean "we should
> do CoR".  Not really needed for follow page path, though.
> 
> When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> clear that it just should never fail.

Oh, and does this comment really belong in this patch, or am I confused?

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-19 23:10 ` [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
@ 2023-06-20 15:43   ` David Hildenbrand
  2023-06-20 16:23     ` Peter Xu
  2023-06-20 21:43   ` Lorenzo Stoakes
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2023-06-20 15:43 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Mike Rapoport, Matthew Wilcox, Vlastimil Babka,
	John Hubbard, Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On 20.06.23 01:10, Peter Xu wrote:
> The acceleration of THP was done with ctx.page_mask, however it'll be
> ignored if **pages is non-NULL.
> 
> The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> accelerate mm_populate() treatment of THP pages").  It didn't explain why
> we can't optimize the **pages non-NULL case.  It's possible that at that
> time the major goal was for mm_populate() which should be enough back then.

In the past we had these sub-page refcounts for THP. My best guess (and 
I didn't check if that was still the case in 2013) would be that it was 
simpler regarding refcount handling to do it one subpage at a time.

But I might be just wrong.

> 
> Optimize thp for all cases, by properly looping over each subpage, doing
> cache flushes, and boost refcounts / pincounts where needed in one go.
> 
> This can be verified using gup_test below:
> 
>    # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
> 
> Before:    13992.50 ( +-8.75%)
> After:       378.50 (+-69.62%)
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
>   1 file changed, 44 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 4a00d609033e..b50272012e49 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
>   			goto out;
>   		}
>   next_page:
> -		if (pages) {
> -			pages[i] = page;
> -			flush_anon_page(vma, page, start);
> -			flush_dcache_page(page);
> -			ctx.page_mask = 0;
> -		}
> -
>   		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>   		if (page_increm > nr_pages)
>   			page_increm = nr_pages;
> +
> +		if (pages) {
> +			struct page *subpage;
> +			unsigned int j;
> +
> +			/*
> +			 * This must be a large folio (and doesn't need to
> +			 * be the whole folio; it can be part of it), do
> +			 * the refcount work for all the subpages too.
> +			 *
> +			 * NOTE: here the page may not be the head page
> +			 * e.g. when start addr is not thp-size aligned.
> +			 * try_grab_folio() should have taken care of tail
> +			 * pages.
> +			 */
> +			if (page_increm > 1) {
> +				struct folio *folio;
> +
> +				/*
> +				 * Since we already hold refcount on the
> +				 * large folio, this should never fail.
> +				 */
> +				folio = try_grab_folio(page, page_increm - 1,
> +						       foll_flags);
> +				if (WARN_ON_ONCE(!folio)) {
> +					/*
> +					 * Release the 1st page ref if the
> +					 * folio is problematic, fail hard.
> +					 */
> +					gup_put_folio(page_folio(page), 1,
> +						      foll_flags);
> +					ret = -EFAULT;
> +					goto out;
> +				}
> +			}
> +
> +			for (j = 0; j < page_increm; j++) {
> +				subpage = nth_page(page, j);
> +				pages[i+j] = subpage;

Does checkpatch like pages[i+j]? I'd have used spaces around the +.

> +				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> +				flush_dcache_page(subpage);
> +			}
> +		}
> +
>   		i += page_increm;
>   		start += page_increm * PAGE_SIZE;
>   		nr_pages -= page_increm;


So, we did the first try_grab_folio() while our page was PMD-mapped 
under the PT lock and we had sufficient permissions (e.g., mapped
writable, no unsharing required). With FOLL_PIN, we incremented the 
pincount.


I was wondering if something could have happened ever since we unlocked 
the PT table lock and possibly PTE-mapped the THP. ... but as it's 
already pinned, it cannot get shared during fork() [will stay exclusive].

So we can just take additional pins on that folio.


LGTM, although I do like the GUP-fast way of recording+ref'ing it at a 
central place (see gup_huge_pmd() with record_subpages() and friends), 
not after the effects.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-20 15:22   ` David Hildenbrand
@ 2023-06-20 16:03     ` Peter Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-20 16:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On Tue, Jun 20, 2023 at 05:22:02PM +0200, David Hildenbrand wrote:
> On 20.06.23 01:10, Peter Xu wrote:
> > follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
> > target of FOLL_WRITE either.  However add the checks.
> > 
> > Namely, either the need to CoW due to missing write bit, or proper CoR on
> 
> s/CoR/unsharing/
> 
> > !AnonExclusive pages over R/O pins to reject the follow page.  That brings
> > this function closer to follow_hugetlb_page().
> > 
> > So we don't care before, and also for now.  But we'll care if we switch
> > over slow-gup to use hugetlb_follow_page_mask().  We'll also care when to
> > return -EMLINK properly, as that's the gup internal api to mean "we should
> > do CoR".  Not really needed for follow page path, though.
> 
> "we should unshare".
> 
> > 
> > When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> > clear that it just should never fail.
> > 
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/hugetlb.c | 24 +++++++++++++++---------
> >   1 file changed, 15 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f75f5e78ff0b..9a6918c4250a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6463,13 +6463,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> >   	spinlock_t *ptl;
> >   	pte_t *pte, entry;
> > -	/*
> > -	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> > -	 * follow_hugetlb_page().
> > -	 */
> > -	if (WARN_ON_ONCE(flags & FOLL_PIN))
> > -		return NULL;
> > -
> >   	hugetlb_vma_lock_read(vma);
> >   	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
> >   	if (!pte)
> > @@ -6478,8 +6471,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> >   	ptl = huge_pte_lock(h, mm, pte);
> >   	entry = huge_ptep_get(pte);
> >   	if (pte_present(entry)) {
> > -		page = pte_page(entry) +
> > -				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> > +		page = pte_page(entry);
> > +
> > +		if (gup_must_unshare(vma, flags, page)) {
> 
> All other callers (like follow_page_pte(), including
> __follow_hugetlb_must_fault())
> 
> (a) check for write permissions first.
> 
> (b) check for gup_must_unshare() only if !pte_write(entry)
> 
> I'd vote to keep these checks as similar as possible to the other GUP code.

I'm pretty sure the order doesn't matter here, since one is for read and
one is for write.. but sure, I can switch the order.

> 
> > +			/* Tell the caller to do Copy-On-Read */
> 
> "Tell the caller to unshare".
> 
> > +			page = ERR_PTR(-EMLINK);
> > +			goto out;
> > +		}
> > +
> > +		if ((flags & FOLL_WRITE) && !pte_write(entry)) {
> > +			page = NULL;
> > +			goto out;
> > +		}
> 
> 
> I'm confused about pte_write() vs. huge_pte_write(), and I don't know what's
> right or wrong here.

AFAICT, they should always be identical in code. But yeah.. I should just
use the huge_ version.

Thanks,

-- 
Peter Xu



* Re: [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-20 15:28   ` David Hildenbrand
@ 2023-06-20 16:06     ` Peter Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-20 16:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On Tue, Jun 20, 2023 at 05:28:28PM +0200, David Hildenbrand wrote:
> On 20.06.23 01:10, Peter Xu wrote:
> > follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the
> > target of FOLL_WRITE either.  However add the checks.
> > 
> > Namely, either the need to CoW due to missing write bit, or proper CoR on
> > !AnonExclusive pages over R/O pins to reject the follow page.  That brings
> > this function closer to follow_hugetlb_page().
> > 
> > So we don't care before, and also for now.  But we'll care if we switch
> > over slow-gup to use hugetlb_follow_page_mask().  We'll also care when to
> > return -EMLINK properly, as that's the gup internal api to mean "we should
> > do CoR".  Not really needed for follow page path, though.
> > 
> > When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> > clear that it just should never fail.
> 
> Oh, and does this comment really belong into this patch or am I confused?

Ah yeh, thanks for spotting.  I used to have it in v1 but I kept the old
failure path here to partly address Lorenzo's worry; but then I forgot to
add the WARN_ON_ONCE back as a guard.  I'll remember to add that in v3.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-20 15:43   ` David Hildenbrand
@ 2023-06-20 16:23     ` Peter Xu
  2023-06-20 18:02       ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2023-06-20 16:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On Tue, Jun 20, 2023 at 05:43:35PM +0200, David Hildenbrand wrote:
> On 20.06.23 01:10, Peter Xu wrote:
> > The acceleration of THP was done with ctx.page_mask, however it'll be
> > ignored if **pages is non-NULL.
> > 
> > The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> > accelerate mm_populate() treatment of THP pages").  It didn't explain why
> > we can't optimize the **pages non-NULL case.  It's possible that at that
> > time the major goal was for mm_populate() which should be enough back then.
> 
> In the past we had these sub-page refcounts for THP. My best guess (and I
> didn't check if that was still the case in 2013) would be that it was
> simpler regarding refcount handling to do it one subpage at a time.
> 
> But I might be just wrong.
> 
> > 
> > Optimize thp for all cases, by properly looping over each subpage, doing
> > cache flushes, and boost refcounts / pincounts where needed in one go.
> > 
> > This can be verified using gup_test below:
> > 
> >    # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
> > 
> > Before:    13992.50 ( +-8.75%)
> > After:       378.50 (+-69.62%)
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
> >   1 file changed, 44 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 4a00d609033e..b50272012e49 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
> >   			goto out;
> >   		}
> >   next_page:
> > -		if (pages) {
> > -			pages[i] = page;
> > -			flush_anon_page(vma, page, start);
> > -			flush_dcache_page(page);
> > -			ctx.page_mask = 0;
> > -		}
> > -
> >   		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> >   		if (page_increm > nr_pages)
> >   			page_increm = nr_pages;
> > +
> > +		if (pages) {
> > +			struct page *subpage;
> > +			unsigned int j;
> > +
> > +			/*
> > +			 * This must be a large folio (and doesn't need to
> > +			 * be the whole folio; it can be part of it), do
> > +			 * the refcount work for all the subpages too.
> > +			 *
> > +			 * NOTE: here the page may not be the head page
> > +			 * e.g. when start addr is not thp-size aligned.
> > +			 * try_grab_folio() should have taken care of tail
> > +			 * pages.
> > +			 */
> > +			if (page_increm > 1) {
> > +				struct folio *folio;
> > +
> > +				/*
> > +				 * Since we already hold refcount on the
> > +				 * large folio, this should never fail.
> > +				 */
> > +				folio = try_grab_folio(page, page_increm - 1,
> > +						       foll_flags);
> > +				if (WARN_ON_ONCE(!folio)) {
> > +					/*
> > +					 * Release the 1st page ref if the
> > +					 * folio is problematic, fail hard.
> > +					 */
> > +					gup_put_folio(page_folio(page), 1,
> > +						      foll_flags);
> > +					ret = -EFAULT;
> > +					goto out;
> > +				}
> > +			}
> > +
> > +			for (j = 0; j < page_increm; j++) {
> > +				subpage = nth_page(page, j);
> > +				pages[i+j] = subpage;
> 
> Does checkpatch like pages[i+j]? I'd have used spaces around the +.

Can do.

> 
> > +				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> > +				flush_dcache_page(subpage);
> > +			}
> > +		}
> > +
> >   		i += page_increm;
> >   		start += page_increm * PAGE_SIZE;
> >   		nr_pages -= page_increm;
> 
> 
> So, we did the first try_grab_folio() while our page was PMD-mapped under
> the PT lock and we had sufficient permissions (e.g., mapped writable, no
> unsharing required). With FOLL_PIN, we incremented the pincount.
> 
> 
> I was wondering if something could have happened ever since we unlocked the
> PT table lock and possibly PTE-mapped the THP. ... but as it's already
> pinned, it cannot get shared during fork() [will stay exclusive].
> 
> So we can just take additional pins on that folio.
> 
> 
> LGTM, although I do like the GUP-fast way of recording+ref'ing it at a
> central place (see gup_huge_pmd() with record_subpages() and friends), not
> after the fact.

My read on this is that follow_page_mask() is also used by follow_page(),
which does not need page*.

No strong opinion here.  Maybe we leave this as a follow-up even if it can
be justified?  This patch is probably still the smallest (and still clean)
change to speed this whole thing up for either thp or hugetlb.
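
For reference, the GUP-fast pattern mentioned above goes roughly like this
(paraphrased from gup_huge_pmd() / record_subpages(), trimmed; nothing that
this series needs to touch):

	page = nth_page(pmd_page(orig), (addr & ~PMD_MASK) >> PAGE_SHIFT);
	/* Record all the subpages covered by [addr, end) first... */
	refs = record_subpages(page, addr, end, pages + *nr);

	/* ...then take all the references on the folio in one go */
	folio = try_grab_folio(page, refs, flags);
	if (!folio)
		return 0;

	/* GUP-fast needs to recheck that the pmd didn't change under us */
	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
		gup_put_folio(folio, refs, flags);
		return 0;
	}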

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/8] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-20 15:23   ` David Hildenbrand
@ 2023-06-20 16:28     ` Peter Xu
  2023-06-20 17:54       ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2023-06-20 16:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On Tue, Jun 20, 2023 at 05:23:09PM +0200, David Hildenbrand wrote:
> On 20.06.23 01:10, Peter Xu wrote:
> > follow_page() doesn't need it, but we'll start to need it when unifying gup
> > for hugetlb.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   include/linux/hugetlb.h | 8 +++++---
> >   mm/gup.c                | 3 ++-
> >   mm/hugetlb.c            | 5 ++++-
> >   3 files changed, 11 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index beb7c63d2871..2e2d89e79d6c 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
> >   int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
> >   			    struct vm_area_struct *, struct vm_area_struct *);
> >   struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > -				unsigned long address, unsigned int flags);
> > +				      unsigned long address, unsigned int flags,
> > +				      unsigned int *page_mask);
> >   long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> >   			 struct page **, unsigned long *, unsigned long *,
> >   			 long, unsigned int, int *);
> > @@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
> >   {
> >   }
> > -static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > -				unsigned long address, unsigned int flags)
> > +static inline struct page *hugetlb_follow_page_mask(
> > +    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
> > +    unsigned int *page_mask)
> >   {
> >   	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
> >   }
> > diff --git a/mm/gup.c b/mm/gup.c
> > index abcd841d94b7..9fc9271cba8d 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -780,7 +780,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
> >   	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
> >   	 */
> >   	if (is_vm_hugetlb_page(vma))
> > -		return hugetlb_follow_page_mask(vma, address, flags);
> > +		return hugetlb_follow_page_mask(vma, address, flags,
> > +						&ctx->page_mask);
> >   	pgd = pgd_offset(mm, address);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 9a6918c4250a..fbf6a09c0ec4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6454,7 +6454,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
> >   }
> >   struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > -				unsigned long address, unsigned int flags)
> > +				      unsigned long address, unsigned int flags,
> > +				      unsigned int *page_mask)
> >   {
> >   	struct hstate *h = hstate_vma(vma);
> >   	struct mm_struct *mm = vma->vm_mm;
> > @@ -6499,6 +6500,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> >   			page = NULL;
> >   			goto out;
> >   		}
> > +
> > +		*page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
> 
> As discussed, can be simplified. But can be done on top (or not at all, but
> it is confusing code).

Since we decided to make this prettier.. in the end I decided to go with this:

		*page_mask = (1U << huge_page_order(h)) - 1;

The previous suggestion of PHYS_PFN() would do two shifts related to the
page size (the other one inside huge_page_size()), which might be
unnecessary; also, PHYS_ can be slightly misleading as a prefix here.
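
Just to spell out the equivalence (not part of the patch): with
huge_page_size(h) == PAGE_SIZE << huge_page_order(h) and PHYS_PFN(x) ==
(x) >> PAGE_SHIFT, all three forms compute the same value (e.g. 511 for 2M
hugepages with 4K PAGE_SIZE):

	*page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;	/* what v2 has */
	*page_mask = PHYS_PFN(huge_page_size(h)) - 1;	/* earlier suggestion */
	*page_mask = (1U << huge_page_order(h)) - 1;	/* v3: a single shift */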

> 
> Reviewed-by: David Hildenbrand <david@redhat.com>

I'll take this with above change, please shoot if not applicable.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/8] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-20 16:28     ` Peter Xu
@ 2023-06-20 17:54       ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2023-06-20 17:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On 20.06.23 18:28, Peter Xu wrote:
> On Tue, Jun 20, 2023 at 05:23:09PM +0200, David Hildenbrand wrote:
>> On 20.06.23 01:10, Peter Xu wrote:
>>> follow_page() doesn't need it, but we'll start to need it when unifying gup
>>> for hugetlb.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    include/linux/hugetlb.h | 8 +++++---
>>>    mm/gup.c                | 3 ++-
>>>    mm/hugetlb.c            | 5 ++++-
>>>    3 files changed, 11 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index beb7c63d2871..2e2d89e79d6c 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>>>    int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
>>>    			    struct vm_area_struct *, struct vm_area_struct *);
>>>    struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>> -				unsigned long address, unsigned int flags);
>>> +				      unsigned long address, unsigned int flags,
>>> +				      unsigned int *page_mask);
>>>    long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>>>    			 struct page **, unsigned long *, unsigned long *,
>>>    			 long, unsigned int, int *);
>>> @@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
>>>    {
>>>    }
>>> -static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>> -				unsigned long address, unsigned int flags)
>>> +static inline struct page *hugetlb_follow_page_mask(
>>> +    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
>>> +    unsigned int *page_mask)
>>>    {
>>>    	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
>>>    }
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index abcd841d94b7..9fc9271cba8d 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -780,7 +780,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>>>    	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
>>>    	 */
>>>    	if (is_vm_hugetlb_page(vma))
>>> -		return hugetlb_follow_page_mask(vma, address, flags);
>>> +		return hugetlb_follow_page_mask(vma, address, flags,
>>> +						&ctx->page_mask);
>>>    	pgd = pgd_offset(mm, address);
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9a6918c4250a..fbf6a09c0ec4 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -6454,7 +6454,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
>>>    }
>>>    struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>> -				unsigned long address, unsigned int flags)
>>> +				      unsigned long address, unsigned int flags,
>>> +				      unsigned int *page_mask)
>>>    {
>>>    	struct hstate *h = hstate_vma(vma);
>>>    	struct mm_struct *mm = vma->vm_mm;
>>> @@ -6499,6 +6500,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>>    			page = NULL;
>>>    			goto out;
>>>    		}
>>> +
>>> +		*page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
>>
>> As discussed, can be simplified. But can be done on top (or not at all, but
>> it is confusing code).
> 
> Since we decided to make this prettier.. in the end I decided to go with this:
> 
> 		*page_mask = (1U << huge_page_order(h)) - 1;
> 
> The previous suggestion of PHYS_PFN() would do two shifts related to the
> page size (the other one inside huge_page_size()), which might be
> unnecessary; also, PHYS_ can be slightly misleading as a prefix here.
> 
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> I'll take this with above change, please shoot if not applicable.  Thanks,
> 

Perfectly fine :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-20 16:23     ` Peter Xu
@ 2023-06-20 18:02       ` David Hildenbrand
  2023-06-20 20:12         ` Peter Xu
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2023-06-20 18:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On 20.06.23 18:23, Peter Xu wrote:
> On Tue, Jun 20, 2023 at 05:43:35PM +0200, David Hildenbrand wrote:
>> On 20.06.23 01:10, Peter Xu wrote:
>>> The acceleration of THP was done with ctx.page_mask, however it'll be
>>> ignored if **pages is non-NULL.
>>>
>>> The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
>>> accelerate mm_populate() treatment of THP pages").  It didn't explain why
>>> we can't optimize the **pages non-NULL case.  It's possible that at that
>>> time the major goal was for mm_populate() which should be enough back then.
>>
>> In the past we had these sub-page refcounts for THP. My best guess (and I
>> didn't check if that was still the case in 2013) would be that it was
>> simpler regarding refcount handling to do it one subpage at a time.
>>
>> But I might be just wrong.
>>
>>>
>>> Optimize thp for all cases, by properly looping over each subpage, doing
>>> cache flushes, and boost refcounts / pincounts where needed in one go.
>>>
>>> This can be verified using gup_test below:
>>>
>>>     # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
>>>
>>> Before:    13992.50 ( +-8.75%)
>>> After:       378.50 (+-69.62%)
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
>>>    1 file changed, 44 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 4a00d609033e..b50272012e49 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
>>>    			goto out;
>>>    		}
>>>    next_page:
>>> -		if (pages) {
>>> -			pages[i] = page;
>>> -			flush_anon_page(vma, page, start);
>>> -			flush_dcache_page(page);
>>> -			ctx.page_mask = 0;
>>> -		}
>>> -
>>>    		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>>>    		if (page_increm > nr_pages)
>>>    			page_increm = nr_pages;
>>> +
>>> +		if (pages) {
>>> +			struct page *subpage;
>>> +			unsigned int j;
>>> +
>>> +			/*
>>> +			 * This must be a large folio (and doesn't need to
>>> +			 * be the whole folio; it can be part of it), do
>>> +			 * the refcount work for all the subpages too.
>>> +			 *
>>> +			 * NOTE: here the page may not be the head page
>>> +			 * e.g. when start addr is not thp-size aligned.
>>> +			 * try_grab_folio() should have taken care of tail
>>> +			 * pages.
>>> +			 */
>>> +			if (page_increm > 1) {
>>> +				struct folio *folio;
>>> +
>>> +				/*
>>> +				 * Since we already hold refcount on the
>>> +				 * large folio, this should never fail.
>>> +				 */
>>> +				folio = try_grab_folio(page, page_increm - 1,
>>> +						       foll_flags);
>>> +				if (WARN_ON_ONCE(!folio)) {
>>> +					/*
>>> +					 * Release the 1st page ref if the
>>> +					 * folio is problematic, fail hard.
>>> +					 */
>>> +					gup_put_folio(page_folio(page), 1,
>>> +						      foll_flags);
>>> +					ret = -EFAULT;
>>> +					goto out;
>>> +				}
>>> +			}
>>> +
>>> +			for (j = 0; j < page_increm; j++) {
>>> +				subpage = nth_page(page, j);
>>> +				pages[i+j] = subpage;
>>
>> Does checkpatch like pages[i+j]? I'd have used spaces around the +.
> 
> Can do.
> 
>>
>>> +				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
>>> +				flush_dcache_page(subpage);
>>> +			}
>>> +		}
>>> +
>>>    		i += page_increm;
>>>    		start += page_increm * PAGE_SIZE;
>>>    		nr_pages -= page_increm;
>>
>>
>> So, we did the first try_grab_folio() while our page was PMD-mapped under
>> the PT lock and we had sufficient permissions (e.g., mapped writable, no
>> unsharing required). With FOLL_PIN, we incremented the pincount.
>>
>>
>> I was wondering if something could have happened ever since we unlocked the
>> PT table lock and possibly PTE-mapped the THP. ... but as it's already
>> pinned, it cannot get shared during fork() [will stay exclusive].
>>
>> So we can just take additional pins on that folio.
>>
>>
>> LGTM, although I do like the GUP-fast way of recording+ref'ing it at a
>> central place (see gup_huge_pmd() with record_subpages() and friends), not
>> after the fact.
> 
> My read on this is that follow_page_mask() is also used by follow_page(),
> which does not need page*.

Right ... maybe one day we can do that "better".

> 
> No strong opinion here.  Maybe we leave this as a follow-up even if it can
> be justified?  This patch is probably still the smallest (and still clean)
> change to speed this whole thing up for either thp or hugetlb.

Sure, we can leave that as a follow-up.


Thinking about why we have the flush_anon_page/flush_dcache_page stuff 
here and not in GUP-fast ... I suspect that none of the GUP-fast archs 
need that stuff.

I was wondering if there are some possible races with the 
flush_anon_page() / flush_dcache_page() on a page that might have been 
unmapped in the meantime (as we dropped the PT lock ...).

Some flush_dcache_page() implementations do some IMHO confusing 
page_mapcount() things (like in arch/arc/mm/cache.c). But maybe the 
unmap code handles that as well ... and most likely these archs don't 
support THP.

Anyhow, just a note that the flush_anon_page/flush_dcache_page left me 
confused.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-20 18:02       ` David Hildenbrand
@ 2023-06-20 20:12         ` Peter Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2023-06-20 20:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport,
	Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton,
	Lorenzo Stoakes, Hugh Dickins, Mike Kravetz, Jason Gunthorpe

On Tue, Jun 20, 2023 at 08:02:41PM +0200, David Hildenbrand wrote:
> Thinking about why we have the flush_anon_page/flush_dcache_page stuff here
> and not in GUP-fast ... I suspect that none of the GUP-fast archs need that
> stuff.

Yeah that's a bit confusing, and I sincerely don't know the answer.  Though
here I had the other side of the feeling - I feel like gup-fast should also
do it.. but maybe it just got missed.

AFAIU the idea was that the data cache can be out of sync between the user
and kernel mappings, and if it's needed in slow gup I don't see why it's
not needed in fast..

There're still a few archs that implement flush_dcache_page() but meanwhile
have HAVE_FAST_GUP selected, like arm/arm64/powerpc.

It's just getting out of scope of what this series wanted to achieve.

> I was wondering if there are some possible races with the flush_anon_page()
> / flush_dcache_page() on a page that might have been unmapped in the
> meantime (as we dropped the PT lock ...).
> 
> Some flush_dcache_page() implementations do some IMHO confusing
> page_mapcount() things (like in arch/arc/mm/cache.c). But maybe the unmap
> code handles that as well ... and most likely these archs don't support THP.

Maybe true.

It seems that page_mapcount() was mostly used to identify whether a page is
mapped in a userspace address space.  If so, I'd worry less, because the
only race possible here, iiuc, is the user unmapping the page concurrently
(and since we got it from gup it must have been mapped once).

Then I would assume the caller should be prepared for that, and
flush_dcache_page() shouldn't matter too much in this case if userspace
dropped all the data anyway - the whole page* can already be invalid for
that VA once the unmap completes.

> 
> Anyhow, just a note that the flush_anon_page/flush_dcache_page left me
> confused.

I share the same confusion.  Hopefully what this series does here doesn't
change that, or at least doesn't make it worse.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-19 23:10 ` [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
  2023-06-20 15:43   ` David Hildenbrand
@ 2023-06-20 21:43   ` Lorenzo Stoakes
  1 sibling, 0 replies; 22+ messages in thread
From: Lorenzo Stoakes @ 2023-06-20 21:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mike Rapoport,
	David Hildenbrand, Matthew Wilcox, Vlastimil Babka, John Hubbard,
	Kirill A . Shutemov, James Houghton, Andrew Morton, Hugh Dickins,
	Mike Kravetz, Jason Gunthorpe

On Mon, Jun 19, 2023 at 07:10:41PM -0400, Peter Xu wrote:
> The acceleration of THP was done with ctx.page_mask, however it'll be
> ignored if **pages is non-NULL.
>
> The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> accelerate mm_populate() treatment of THP pages").  It didn't explain why
> we can't optimize the **pages non-NULL case.  It's possible that at that
> time the major goal was for mm_populate() which should be enough back then.
>
> Optimize thp for all cases, by properly looping over each subpage, doing
> cache flushes, and boost refcounts / pincounts where needed in one go.
>
> This can be verified using gup_test below:
>
>   # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
>
> Before:    13992.50 ( +-8.75%)
> After:       378.50 (+-69.62%)
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 4a00d609033e..b50272012e49 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1199,16 +1199,53 @@ static long __get_user_pages(struct mm_struct *mm,
>  			goto out;
>  		}
>  next_page:
> -		if (pages) {
> -			pages[i] = page;
> -			flush_anon_page(vma, page, start);
> -			flush_dcache_page(page);
> -			ctx.page_mask = 0;
> -		}
> -
>  		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>  		if (page_increm > nr_pages)
>  			page_increm = nr_pages;
> +
> +		if (pages) {
> +			struct page *subpage;
> +			unsigned int j;
> +
> +			/*
> +			 * This must be a large folio (and doesn't need to
> +			 * be the whole folio; it can be part of it), do
> +			 * the refcount work for all the subpages too.
> +			 *
> +			 * NOTE: here the page may not be the head page
> +			 * e.g. when start addr is not thp-size aligned.
> +			 * try_grab_folio() should have taken care of tail
> +			 * pages.
> +			 */
> +			if (page_increm > 1) {
> +				struct folio *folio;
> +
> +				/*
> +				 * Since we already hold refcount on the
> +				 * large folio, this should never fail.
> +				 */
> +				folio = try_grab_folio(page, page_increm - 1,
> +						       foll_flags);
> +				if (WARN_ON_ONCE(!folio)) {
> +					/*
> +					 * Release the 1st page ref if the
> +					 * folio is problematic, fail hard.
> +					 */
> +					gup_put_folio(page_folio(page), 1,
> +						      foll_flags);
> +					ret = -EFAULT;
> +					goto out;
> +				}

Thanks, this looks good to me.  I agree it'd be quite surprising for us not
to retrieve the folio here, and something has probably gone wrong if so, so
it's not actually unreasonable to warn, as long as we error out.

> +			}
> +
> +			for (j = 0; j < page_increm; j++) {
> +				subpage = nth_page(page, j);
> +				pages[i+j] = subpage;
> +				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> +				flush_dcache_page(subpage);
> +			}
> +		}
> +
>  		i += page_increm;
>  		start += page_increm * PAGE_SIZE;
>  		nr_pages -= page_increm;
> --
> 2.40.1
>

Looks good to me overall,

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-06-20 21:43 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-19 23:10 [PATCH v2 0/8] mm/gup: Unify hugetlb, speed up thp Peter Xu
2023-06-19 23:10 ` [PATCH v2 1/8] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
2023-06-19 23:10 ` [PATCH v2 2/8] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
2023-06-20 15:22   ` David Hildenbrand
2023-06-20 16:03     ` Peter Xu
2023-06-20 15:28   ` David Hildenbrand
2023-06-20 16:06     ` Peter Xu
2023-06-19 23:10 ` [PATCH v2 3/8] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
2023-06-20 15:23   ` David Hildenbrand
2023-06-20 16:28     ` Peter Xu
2023-06-20 17:54       ` David Hildenbrand
2023-06-19 23:10 ` [PATCH v2 4/8] mm/gup: Cleanup next_page handling Peter Xu
2023-06-20 15:23   ` David Hildenbrand
2023-06-19 23:10 ` [PATCH v2 5/8] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
2023-06-20 15:43   ` David Hildenbrand
2023-06-20 16:23     ` Peter Xu
2023-06-20 18:02       ` David Hildenbrand
2023-06-20 20:12         ` Peter Xu
2023-06-20 21:43   ` Lorenzo Stoakes
2023-06-19 23:10 ` [PATCH v2 6/8] mm/gup: Retire follow_hugetlb_page() Peter Xu
2023-06-19 23:10 ` [PATCH v2 7/8] selftests/mm: Add -a to run_vmtests.sh Peter Xu
2023-06-19 23:10 ` [PATCH v2 8/8] selftests/mm: Add gup test matrix in run_vmtests.sh Peter Xu
