linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp
@ 2023-06-13 21:53 Peter Xu
  2023-06-13 21:53 ` [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
                   ` (6 more replies)
  0 siblings, 7 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-13 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, peterx, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

Hugetlb has a special path for slow gup in which follow_page_mask() is
skipped completely, along with faultin_page().  It's not only confusing,
but it also duplicates a lot of logic that generic gup already has, making
hugetlb slightly special.

This patchset tries to deduplicate that logic: first touch up the slow gup
code so that it can handle hugetlb pages correctly with the current
follow-page and faultin routines (we're mostly there already; ten years ago
we did try to optimize thp, but only got halfway, more below), then drop
the special path in the last patch, after which hugetlb gup always goes
through the generic routine too, via faultin_page().
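
For readers less familiar with the slow gup internals, the generic loop
that hugetlb will share after this series looks roughly like below (a
heavily simplified, paraphrased sketch of __get_user_pages() in mm/gup.c,
not the literal code):

	/* per page within the range, simplified */
	page = follow_page_mask(vma, start, foll_flags, &ctx);
	if (!page || PTR_ERR(page) == -EMLINK) {
		/* fault it in (or unshare it), then retry the lookup */
		ret = faultin_page(vma, start, &foll_flags,
				   PTR_ERR(page) == -EMLINK, locked);
		if (!ret)
			goto retry;
	}

Today a hugetlb vma instead branches into follow_hugetlb_page() inside
that loop, bypassing follow_page_mask() and faultin_page() completely,
which is exactly the special path that patch 7 removes.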

Note that hugetlb is still special for gup, mostly because of the pgtable
walking (hugetlb_walk()) that we rely on, which is currently per-arch.  But
this is still one small step forward, and the diffstat may itself be proof
that it is worthwhile.

Then for the "speed up thp" side: as a side effect, while looking at this
chunk of code I found that thp support is only partially done.  That
doesn't mean thp won't work for gup, but as long as a **pages pointer is
passed in, the optimization is skipped.  Patch 6 should address that, so
for thp we now get full-speed gup.

For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10" gives
me 13992.50us -> 378.50us.  gup_test is an extreme case, but it shows how
much this can affect thp gups.

James: I hope this won't affect your HGM series too much, even though it
may conflict with yours.  Logically it should even make your series
smaller, since you can drop the patch that adds HGM support to
follow_hugetlb_page() after this one; but that only matters if this series
is proven useful and merged first.

Patches 1-6 prepare for the switch, while patch 7 does the switchover.

I hope I didn't miss anything; if I did, I'll be happy to hear about it.
Please have a look, thanks.

Peter Xu (7):
  mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
  mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks
  mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  mm/gup: Cleanup next_page handling
  mm/gup: Accelerate thp gup even for "pages != NULL"
  mm/gup: Retire follow_hugetlb_page()

 include/linux/hugetlb.h |  20 +---
 mm/gup.c                |  68 +++++------
 mm/hugetlb.c            | 259 ++++------------------------------------
 3 files changed, 63 insertions(+), 284 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
  2023-06-13 21:53 [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp Peter Xu
@ 2023-06-13 21:53 ` Peter Xu
  2023-06-14 23:24   ` Mike Kravetz
  2023-06-16  8:08   ` David Hildenbrand
  2023-06-13 21:53 ` [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks Peter Xu
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-13 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, peterx, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

Firstly, no_page_table() is meaningless for hugetlb; it is a no-op there,
because a hugetlb vma always satisfies:

  - vma_is_anonymous() == false
  - vma->vm_ops->fault != NULL

So we can already safely remove it in hugetlb_follow_page_mask(), along
with the page* variable.
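
For reference, the relevant part of no_page_table() is roughly the below
(paraphrased from mm/gup.c of this era; treat it as a sketch rather than
the literal code):

	static struct page *no_page_table(struct vm_area_struct *vma,
					  unsigned int flags)
	{
		/*
		 * Only report -EFAULT for FOLL_DUMP, and only when the vma
		 * could never fault the page in later, so that
		 * get_dump_page() leaves a hole in the dump instead.
		 */
		if ((flags & FOLL_DUMP) &&
		    (vma_is_anonymous(vma) || !vma->vm_ops->fault))
			return ERR_PTR(-EFAULT);
		return NULL;
	}

A hugetlb vma fails both sub-conditions, so this can only ever return NULL
here, which is what the caller already had.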

Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a
dump: we try to fault in the page only if the page cache is already
allocated.  Let's do the same here for follow_page_mask() on hugetlb.

So far this should have zero effect on real dumps, because those still go
into follow_hugetlb_page().  But it may start to influence follow_page()
users who mimic a "dump page" scenario a bit, hopefully in a good way.
This also paves the way for unifying the hugetlb gup-slow path.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 9 ++-------
 mm/hugetlb.c | 9 +++++++++
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index dbe96d266670..aa0668505d61 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -781,7 +781,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 			      struct follow_page_context *ctx)
 {
 	pgd_t *pgd;
-	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 
 	ctx->page_mask = 0;
@@ -794,12 +793,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	 * hugetlb_follow_page_mask is only for follow_page() handling here.
 	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
 	 */
-	if (is_vm_hugetlb_page(vma)) {
-		page = hugetlb_follow_page_mask(vma, address, flags);
-		if (!page)
-			page = no_page_table(vma, flags);
-		return page;
-	}
+	if (is_vm_hugetlb_page(vma))
+		return hugetlb_follow_page_mask(vma, address, flags);
 
 	pgd = pgd_offset(mm, address);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 270ec0ecd5a1..82dfdd96db4c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6501,6 +6501,15 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	spin_unlock(ptl);
 out_unlock:
 	hugetlb_vma_unlock_read(vma);
+
+	/*
+	 * Fixup retval for dump requests: if pagecache doesn't exist,
+	 * don't try to allocate a new page but just skip it.
+	 */
+	if (!page && (flags & FOLL_DUMP) &&
+	    !hugetlbfs_pagecache_present(h, vma, address))
+		page = ERR_PTR(-EFAULT);
+
 	return page;
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks
  2023-06-13 21:53 [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp Peter Xu
  2023-06-13 21:53 ` [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
@ 2023-06-13 21:53 ` Peter Xu
  2023-06-14 15:31   ` David Hildenbrand
  2023-06-13 21:53 ` [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-13 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, peterx, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

It seems hugetlb_follow_page_mask() was missing permission checks.  For
example, follow_page() can get a hugetlb page with FOLL_WRITE even if the
page is mapped read-only.

The checks weren't there even in the old follow_page_mask(), as can be
seen from before commit 57a196a58421 ("hugetlb: simplify hugetlb handling
in follow_page_mask").

Let's add them: namely, either the need to CoW due to a missing write bit,
or proper CoR on !AnonExclusive pages over R/O pins, rejecting the followed
page in either case.  That brings this function closer to
follow_hugetlb_page().

I doubt how many of us care about that today, since FOLL_PIN via
follow_page() doesn't really happen at all.  But we'll care, and care more,
if we switch slow-gup over to use hugetlb_follow_page_mask().  We'll also
care about when to return -EMLINK then, as that's the gup-internal api
meaning "we should do CoR".
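
For context, once the gup slow path sees -EMLINK from a follow-page
attempt, it retries via faultin_page() with unshare set, which roughly
does the below (paraphrased from mm/gup.c of this era; a sketch rather
than the literal code):

	/* faultin_page(), when called with unshare == true */
	fault_flags |= FAULT_FLAG_UNSHARE;
	/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
	VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);

So once slow-gup switches over to hugetlb_follow_page_mask(), returning
-EMLINK here is what turns an R/O pin on a shared anon page into an
unshare fault instead of a normal write fault.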

While at it, switch try_grab_page() to use WARN_ON_ONCE(), to make clear
that it should just never fail.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 82dfdd96db4c..9c261921b2cf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6481,8 +6481,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	ptl = huge_pte_lock(h, mm, pte);
 	entry = huge_ptep_get(pte);
 	if (pte_present(entry)) {
-		page = pte_page(entry) +
-				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+		page = pte_page(entry);
+
+		if (gup_must_unshare(vma, flags, page)) {
+			/* Tell the caller to do Copy-On-Read */
+			page = ERR_PTR(-EMLINK);
+			goto out;
+		}
+
+		if ((flags & FOLL_WRITE) && !pte_write(entry)) {
+			page = NULL;
+			goto out;
+		}
+
+		page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+
 		/*
 		 * Note that page may be a sub-page, and with vmemmap
 		 * optimizations the page struct may be read only.
@@ -6492,10 +6505,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 		 * try_grab_page() should always be able to get the page here,
 		 * because we hold the ptl lock and have verified pte_present().
 		 */
-		if (try_grab_page(page, flags)) {
-			page = NULL;
-			goto out;
-		}
+		WARN_ON_ONCE(try_grab_page(page, flags));
 	}
 out:
 	spin_unlock(ptl);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-13 21:53 [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp Peter Xu
  2023-06-13 21:53 ` [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
  2023-06-13 21:53 ` [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks Peter Xu
@ 2023-06-13 21:53 ` Peter Xu
  2023-06-15  0:17   ` Mike Kravetz
                     ` (2 more replies)
  2023-06-13 21:53 ` [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-13 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, peterx, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

follow_page() doesn't need the page mask, but we'll start to need it when
unifying gup for hugetlb.
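
For reference, the generic paths fill ctx->page_mask with "number of
subpages minus one"; e.g. the pmd-mapped THP path in follow_page_mask()
does (quoting from memory, so take it as approximate):

	ctx->page_mask = HPAGE_PMD_NR - 1;

__get_user_pages() then uses that mask to work out how many pages a single
lookup can cover, so the hugetlb side needs to report its size in the same
unit.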

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h | 8 +++++---
 mm/gup.c                | 3 ++-
 mm/hugetlb.c            | 4 +++-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 21f942025fec..0d6f389d98de 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
 			    struct vm_area_struct *, struct vm_area_struct *);
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				unsigned long address, unsigned int flags);
+				      unsigned long address, unsigned int flags,
+				      unsigned int *page_mask);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 struct page **, unsigned long *, unsigned long *,
 			 long, unsigned int, int *);
@@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
 {
 }
 
-static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				unsigned long address, unsigned int flags)
+static inline struct page *hugetlb_follow_page_mask(
+    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
+    unsigned int *page_mask)
 {
 	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
 }
diff --git a/mm/gup.c b/mm/gup.c
index aa0668505d61..8d59ae4554e7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -794,7 +794,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
 	 */
 	if (is_vm_hugetlb_page(vma))
-		return hugetlb_follow_page_mask(vma, address, flags);
+		return hugetlb_follow_page_mask(vma, address, flags,
+						&ctx->page_mask);
 
 	pgd = pgd_offset(mm, address);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9c261921b2cf..f037eaf9d819 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6457,7 +6457,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
 }
 
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				unsigned long address, unsigned int flags)
+				      unsigned long address, unsigned int flags,
+				      unsigned int *page_mask)
 {
 	struct hstate *h = hstate_vma(vma);
 	struct mm_struct *mm = vma->vm_mm;
@@ -6506,6 +6507,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 		 * because we hold the ptl lock and have verified pte_present().
 		 */
 		WARN_ON_ONCE(try_grab_page(page, flags));
+		*page_mask = (1U << huge_page_order(h)) - 1;
 	}
 out:
 	spin_unlock(ptl);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-13 21:53 [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (2 preceding siblings ...)
  2023-06-13 21:53 ` [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
@ 2023-06-13 21:53 ` Peter Xu
  2023-06-14 14:57   ` David Hildenbrand
  2023-06-13 21:53 ` [PATCH 5/7] mm/gup: Cleanup next_page handling Peter Xu
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-13 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, peterx, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

It's coming, not yet, but soon.  Loosen the restriction.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f037eaf9d819..31d8f18bc2e4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6467,13 +6467,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t *pte, entry;
 
-	/*
-	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
-	 * follow_hugetlb_page().
-	 */
-	if (WARN_ON_ONCE(flags & FOLL_PIN))
-		return NULL;
-
 	hugetlb_vma_lock_read(vma);
 	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
 	if (!pte)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/7] mm/gup: Cleanup next_page handling
  2023-06-13 21:53 [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (3 preceding siblings ...)
  2023-06-13 21:53 ` [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
@ 2023-06-13 21:53 ` Peter Xu
  2023-06-17 19:48   ` Lorenzo Stoakes
  2023-06-13 21:53 ` [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
  2023-06-13 21:53 ` [PATCH 7/7] mm/gup: Retire follow_hugetlb_page() Peter Xu
  6 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-13 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, peterx, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

The only path that doesn't use the generic "**pages" handling is the gate
vma.  Make it use the same path, and meanwhile move the next_page label up
so that it also covers the "**pages" handling.  This prepares for THP
handling of "**pages".

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8d59ae4554e7..a2d1b3c4b104 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1135,7 +1135,7 @@ static long __get_user_pages(struct mm_struct *mm,
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
-						pages ? &pages[i] : NULL);
+						pages ? &page : NULL);
 				if (ret)
 					goto out;
 				ctx.page_mask = 0;
@@ -1205,19 +1205,18 @@ static long __get_user_pages(struct mm_struct *mm,
 				ret = PTR_ERR(page);
 				goto out;
 			}
-
-			goto next_page;
 		} else if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			goto out;
 		}
+next_page:
 		if (pages) {
 			pages[i] = page;
 			flush_anon_page(vma, page, start);
 			flush_dcache_page(page);
 			ctx.page_mask = 0;
 		}
-next_page:
+
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
 		if (page_increm > nr_pages)
 			page_increm = nr_pages;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-13 21:53 [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (4 preceding siblings ...)
  2023-06-13 21:53 ` [PATCH 5/7] mm/gup: Cleanup next_page handling Peter Xu
@ 2023-06-13 21:53 ` Peter Xu
  2023-06-14 14:58   ` Matthew Wilcox
  2023-06-17 20:27   ` Lorenzo Stoakes
  2023-06-13 21:53 ` [PATCH 7/7] mm/gup: Retire follow_hugetlb_page() Peter Xu
  6 siblings, 2 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-13 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, peterx, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

The acceleration of THP was done with ctx.page_mask; however, it'll be
ignored if the **pages pointer is non-NULL.

The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
accelerate mm_populate() treatment of THP pages").  It didn't explain why
the **pages non-NULL case can't be optimized too.  It's possible that at
that time the major goal was mm_populate(), for which this was enough.

Optimize thp for all cases by properly looping over each subpage, doing
the cache flushes, and boosting refcounts / pincounts where needed in one
go.
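
As a concrete example of the page_increm computation below (assuming 4K
base pages and a 2M THP, so ctx.page_mask == 511): if the current address
points at subpage index 5 of the folio, then

	page_increm = 1 + (~(start >> PAGE_SHIFT) & 511)
	            = 1 + (511 - 5)
	            = 507

so one iteration now records and ref-counts all 507 remaining subpages at
once, instead of going through follow_page_mask() once per subpage.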

This can be verified using gup_test below:

  # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10

Before:    13992.50 ( +-8.75%)
After:       378.50 (+-69.62%)

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a2d1b3c4b104..cdabc8ea783b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
 			goto out;
 		}
 next_page:
-		if (pages) {
-			pages[i] = page;
-			flush_anon_page(vma, page, start);
-			flush_dcache_page(page);
-			ctx.page_mask = 0;
-		}
-
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
 		if (page_increm > nr_pages)
 			page_increm = nr_pages;
+
+		if (pages) {
+			struct page *subpage;
+			unsigned int j;
+
+			/*
+			 * This must be a large folio (and doesn't need to
+			 * be the whole folio; it can be part of it), do
+			 * the refcount work for all the subpages too.
+			 * Since we already hold refcount on the head page,
+			 * it should never fail.
+			 *
+			 * NOTE: here the page may not be the head page
+			 * e.g. when start addr is not thp-size aligned.
+			 */
+			if (page_increm > 1)
+				WARN_ON_ONCE(
+				    try_grab_folio(compound_head(page),
+						   page_increm - 1,
+						   foll_flags) == NULL);
+
+			for (j = 0; j < page_increm; j++) {
+				subpage = nth_page(page, j);
+				pages[i+j] = subpage;
+				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
+				flush_dcache_page(subpage);
+			}
+		}
+
 		i += page_increm;
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 7/7] mm/gup: Retire follow_hugetlb_page()
  2023-06-13 21:53 [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp Peter Xu
                   ` (5 preceding siblings ...)
  2023-06-13 21:53 ` [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
@ 2023-06-13 21:53 ` Peter Xu
  2023-06-14 14:37   ` Jason Gunthorpe
  2023-06-17 20:40   ` Lorenzo Stoakes
  6 siblings, 2 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-13 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, peterx, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

Now __get_user_pages() should be well prepared to handle thp completely,
as well as hugetlb gup requests, even without hugetlb's special path.

Time to retire follow_hugetlb_page().

Tweak the comments in follow_page_mask() to reflect reality, by dropping
the "follow_page()" description.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  12 ---
 mm/gup.c                |  19 ----
 mm/hugetlb.c            | 223 ----------------------------------------
 3 files changed, 254 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0d6f389d98de..44e5836eed15 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -133,9 +133,6 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 				      unsigned long address, unsigned int flags,
 				      unsigned int *page_mask);
-long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
-			 struct page **, unsigned long *, unsigned long *,
-			 long, unsigned int, int *);
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
@@ -305,15 +302,6 @@ static inline struct page *hugetlb_follow_page_mask(
 	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
 }
 
-static inline long follow_hugetlb_page(struct mm_struct *mm,
-			struct vm_area_struct *vma, struct page **pages,
-			unsigned long *position, unsigned long *nr_pages,
-			long i, unsigned int flags, int *nonblocking)
-{
-	BUG();
-	return 0;
-}
-
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
 					  struct mm_struct *src,
 					  struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index cdabc8ea783b..a65b80953b7a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -789,9 +789,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
 	 * special hugetlb page table walking code.  This eliminates the
 	 * need to check for hugetlb entries in the general walking code.
-	 *
-	 * hugetlb_follow_page_mask is only for follow_page() handling here.
-	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
 	 */
 	if (is_vm_hugetlb_page(vma))
 		return hugetlb_follow_page_mask(vma, address, flags,
@@ -1149,22 +1146,6 @@ static long __get_user_pages(struct mm_struct *mm,
 			ret = check_vma_flags(vma, gup_flags);
 			if (ret)
 				goto out;
-
-			if (is_vm_hugetlb_page(vma)) {
-				i = follow_hugetlb_page(mm, vma, pages,
-							&start, &nr_pages, i,
-							gup_flags, locked);
-				if (!*locked) {
-					/*
-					 * We've got a VM_FAULT_RETRY
-					 * and we've lost mmap_lock.
-					 * We must stop here.
-					 */
-					BUG_ON(gup_flags & FOLL_NOWAIT);
-					goto out;
-				}
-				continue;
-			}
 		}
 retry:
 		/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31d8f18bc2e4..b7ff413ff68b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6425,37 +6425,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */
 
-static void record_subpages(struct page *page, struct vm_area_struct *vma,
-			    int refs, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; nr < refs; nr++) {
-		if (likely(pages))
-			pages[nr] = nth_page(page, nr);
-	}
-}
-
-static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
-					       unsigned int flags, pte_t *pte,
-					       bool *unshare)
-{
-	pte_t pteval = huge_ptep_get(pte);
-
-	*unshare = false;
-	if (is_swap_pte(pteval))
-		return true;
-	if (huge_pte_write(pteval))
-		return false;
-	if (flags & FOLL_WRITE)
-		return true;
-	if (gup_must_unshare(vma, flags, pte_page(pteval))) {
-		*unshare = true;
-		return true;
-	}
-	return false;
-}
-
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 				      unsigned long address, unsigned int flags,
 				      unsigned int *page_mask)
@@ -6518,198 +6487,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	return page;
 }
 
-long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			 struct page **pages, unsigned long *position,
-			 unsigned long *nr_pages, long i, unsigned int flags,
-			 int *locked)
-{
-	unsigned long pfn_offset;
-	unsigned long vaddr = *position;
-	unsigned long remainder = *nr_pages;
-	struct hstate *h = hstate_vma(vma);
-	int err = -EFAULT, refs;
-
-	while (vaddr < vma->vm_end && remainder) {
-		pte_t *pte;
-		spinlock_t *ptl = NULL;
-		bool unshare = false;
-		int absent;
-		struct page *page;
-
-		/*
-		 * If we have a pending SIGKILL, don't keep faulting pages and
-		 * potentially allocating memory.
-		 */
-		if (fatal_signal_pending(current)) {
-			remainder = 0;
-			break;
-		}
-
-		hugetlb_vma_lock_read(vma);
-		/*
-		 * Some archs (sparc64, sh*) have multiple pte_ts to
-		 * each hugepage.  We have to make sure we get the
-		 * first, for the page indexing below to work.
-		 *
-		 * Note that page table lock is not held when pte is null.
-		 */
-		pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
-				   huge_page_size(h));
-		if (pte)
-			ptl = huge_pte_lock(h, mm, pte);
-		absent = !pte || huge_pte_none(huge_ptep_get(pte));
-
-		/*
-		 * When coredumping, it suits get_dump_page if we just return
-		 * an error where there's an empty slot with no huge pagecache
-		 * to back it.  This way, we avoid allocating a hugepage, and
-		 * the sparse dumpfile avoids allocating disk blocks, but its
-		 * huge holes still show up with zeroes where they need to be.
-		 */
-		if (absent && (flags & FOLL_DUMP) &&
-		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
-			if (pte)
-				spin_unlock(ptl);
-			hugetlb_vma_unlock_read(vma);
-			remainder = 0;
-			break;
-		}
-
-		/*
-		 * We need call hugetlb_fault for both hugepages under migration
-		 * (in which case hugetlb_fault waits for the migration,) and
-		 * hwpoisoned hugepages (in which case we need to prevent the
-		 * caller from accessing to them.) In order to do this, we use
-		 * here is_swap_pte instead of is_hugetlb_entry_migration and
-		 * is_hugetlb_entry_hwpoisoned. This is because it simply covers
-		 * both cases, and because we can't follow correct pages
-		 * directly from any kind of swap entries.
-		 */
-		if (absent ||
-		    __follow_hugetlb_must_fault(vma, flags, pte, &unshare)) {
-			vm_fault_t ret;
-			unsigned int fault_flags = 0;
-
-			if (pte)
-				spin_unlock(ptl);
-			hugetlb_vma_unlock_read(vma);
-
-			if (flags & FOLL_WRITE)
-				fault_flags |= FAULT_FLAG_WRITE;
-			else if (unshare)
-				fault_flags |= FAULT_FLAG_UNSHARE;
-			if (locked) {
-				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
-					FAULT_FLAG_KILLABLE;
-				if (flags & FOLL_INTERRUPTIBLE)
-					fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
-			}
-			if (flags & FOLL_NOWAIT)
-				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
-					FAULT_FLAG_RETRY_NOWAIT;
-			if (flags & FOLL_TRIED) {
-				/*
-				 * Note: FAULT_FLAG_ALLOW_RETRY and
-				 * FAULT_FLAG_TRIED can co-exist
-				 */
-				fault_flags |= FAULT_FLAG_TRIED;
-			}
-			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
-			if (ret & VM_FAULT_ERROR) {
-				err = vm_fault_to_errno(ret, flags);
-				remainder = 0;
-				break;
-			}
-			if (ret & VM_FAULT_RETRY) {
-				if (locked &&
-				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-					*locked = 0;
-				*nr_pages = 0;
-				/*
-				 * VM_FAULT_RETRY must not return an
-				 * error, it will return zero
-				 * instead.
-				 *
-				 * No need to update "position" as the
-				 * caller will not check it after
-				 * *nr_pages is set to 0.
-				 */
-				return i;
-			}
-			continue;
-		}
-
-		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
-		page = pte_page(huge_ptep_get(pte));
-
-		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
-			       !PageAnonExclusive(page), page);
-
-		/*
-		 * If subpage information not requested, update counters
-		 * and skip the same_page loop below.
-		 */
-		if (!pages && !pfn_offset &&
-		    (vaddr + huge_page_size(h) < vma->vm_end) &&
-		    (remainder >= pages_per_huge_page(h))) {
-			vaddr += huge_page_size(h);
-			remainder -= pages_per_huge_page(h);
-			i += pages_per_huge_page(h);
-			spin_unlock(ptl);
-			hugetlb_vma_unlock_read(vma);
-			continue;
-		}
-
-		/* vaddr may not be aligned to PAGE_SIZE */
-		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
-		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
-
-		if (pages)
-			record_subpages(nth_page(page, pfn_offset),
-					vma, refs,
-					likely(pages) ? pages + i : NULL);
-
-		if (pages) {
-			/*
-			 * try_grab_folio() should always succeed here,
-			 * because: a) we hold the ptl lock, and b) we've just
-			 * checked that the huge page is present in the page
-			 * tables. If the huge page is present, then the tail
-			 * pages must also be present. The ptl prevents the
-			 * head page and tail pages from being rearranged in
-			 * any way. As this is hugetlb, the pages will never
-			 * be p2pdma or not longterm pinable. So this page
-			 * must be available at this point, unless the page
-			 * refcount overflowed:
-			 */
-			if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
-							 flags))) {
-				spin_unlock(ptl);
-				hugetlb_vma_unlock_read(vma);
-				remainder = 0;
-				err = -ENOMEM;
-				break;
-			}
-		}
-
-		vaddr += (refs << PAGE_SHIFT);
-		remainder -= refs;
-		i += refs;
-
-		spin_unlock(ptl);
-		hugetlb_vma_unlock_read(vma);
-	}
-	*nr_pages = remainder;
-	/*
-	 * setting position is actually required only if remainder is
-	 * not zero but it's faster not to add a "if (remainder)"
-	 * branch.
-	 */
-	*position = vaddr;
-
-	return i ? i : err;
-}
-
 long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/7] mm/gup: Retire follow_hugetlb_page()
  2023-06-13 21:53 ` [PATCH 7/7] mm/gup: Retire follow_hugetlb_page() Peter Xu
@ 2023-06-14 14:37   ` Jason Gunthorpe
  2023-06-17 20:40   ` Lorenzo Stoakes
  1 sibling, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2023-06-14 14:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Tue, Jun 13, 2023 at 05:53:46PM -0400, Peter Xu wrote:
> Now __get_user_pages() should be well prepared to handle thp completely,
> as long as hugetlb gup requests even without the hugetlb's special path.
> 
> Time to retire follow_hugetlb_page().
> 
> Tweak the comments in follow_page_mask() to reflect reality, by dropping
> the "follow_page()" description.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/hugetlb.h |  12 ---
>  mm/gup.c                |  19 ----
>  mm/hugetlb.c            | 223 ----------------------------------------
>  3 files changed, 254 deletions(-)

It is like a dream come true :)

I don't know enough about the hugetlb details, but this series broadly
made sense to me

Thanks,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-13 21:53 ` [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
@ 2023-06-14 14:57   ` David Hildenbrand
  2023-06-14 15:11     ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2023-06-14 14:57 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	Vlastimil Babka, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, James Houghton, Hugh Dickins

On 13.06.23 23:53, Peter Xu wrote:
> It's coming, not yet, but soon.  Loose the restriction.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/hugetlb.c | 7 -------
>   1 file changed, 7 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f037eaf9d819..31d8f18bc2e4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6467,13 +6467,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>   	spinlock_t *ptl;
>   	pte_t *pte, entry;
>   
> -	/*
> -	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> -	 * follow_hugetlb_page().
> -	 */
> -	if (WARN_ON_ONCE(flags & FOLL_PIN))
> -		return NULL;
> -
>   	hugetlb_vma_lock_read(vma);
>   	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
>   	if (!pte)
Did you fix why the warning was placed there in the first place? (IIRC, 
at least unsharing support needs to be added, maybe more)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-13 21:53 ` [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
@ 2023-06-14 14:58   ` Matthew Wilcox
  2023-06-14 15:19     ` Peter Xu
  2023-06-17 20:27   ` Lorenzo Stoakes
  1 sibling, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-14 14:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Andrea Arcangeli, John Hubbard,
	Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> +			if (page_increm > 1)
> +				WARN_ON_ONCE(
> +				    try_grab_folio(compound_head(page),

You don't need to call compound_head() here; try_grab_folio() works
on tail pages just fine.

> +						   page_increm - 1,
> +						   foll_flags) == NULL);
> +
> +			for (j = 0; j < page_increm; j++) {
> +				subpage = nth_page(page, j);
> +				pages[i+j] = subpage;
> +				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> +				flush_dcache_page(subpage);

You're better off calling flush_dcache_folio() right at the end.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-14 14:57   ` David Hildenbrand
@ 2023-06-14 15:11     ` Peter Xu
  2023-06-14 15:17       ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-14 15:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Wed, Jun 14, 2023 at 04:57:37PM +0200, David Hildenbrand wrote:
> On 13.06.23 23:53, Peter Xu wrote:
> > It's coming, not yet, but soon.  Loose the restriction.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/hugetlb.c | 7 -------
> >   1 file changed, 7 deletions(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f037eaf9d819..31d8f18bc2e4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6467,13 +6467,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> >   	spinlock_t *ptl;
> >   	pte_t *pte, entry;
> > -	/*
> > -	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> > -	 * follow_hugetlb_page().
> > -	 */
> > -	if (WARN_ON_ONCE(flags & FOLL_PIN))
> > -		return NULL;
> > -
> >   	hugetlb_vma_lock_read(vma);
> >   	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
> >   	if (!pte)
> Did you fix why the warning was placed there in the first place? (IIRC, at
> least unsharing support needs to be added, maybe more)

Feel free to have a look at patch 2 - it should be done there, hopefully in
the right way.  And IIUC it could be a bug to not do that before (besides
CoR there was also the pgtable permission checks that was missing).  More
details in patch 2's commit message.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-14 15:11     ` Peter Xu
@ 2023-06-14 15:17       ` David Hildenbrand
  2023-06-14 15:31         ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2023-06-14 15:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On 14.06.23 17:11, Peter Xu wrote:
> On Wed, Jun 14, 2023 at 04:57:37PM +0200, David Hildenbrand wrote:
>> On 13.06.23 23:53, Peter Xu wrote:
>>> It's coming, not yet, but soon.  Loose the restriction.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    mm/hugetlb.c | 7 -------
>>>    1 file changed, 7 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index f037eaf9d819..31d8f18bc2e4 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -6467,13 +6467,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>>    	spinlock_t *ptl;
>>>    	pte_t *pte, entry;
>>> -	/*
>>> -	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
>>> -	 * follow_hugetlb_page().
>>> -	 */
>>> -	if (WARN_ON_ONCE(flags & FOLL_PIN))
>>> -		return NULL;
>>> -
>>>    	hugetlb_vma_lock_read(vma);
>>>    	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
>>>    	if (!pte)
>> Did you fix why the warning was placed there in the first place? (IIRC, at
>> least unsharing support needs to be added, maybe more)
> 
> Feel free to have a look at patch 2 - it should be done there, hopefully in
> the right way.  And IIUC it could be a bug to not do that before (besides
> CoR there was also the pgtable permission checks that was missing).  More
> details in patch 2's commit message.  Thanks,

Oh, that slipped my eyes (unsharing is not really a permission check) -- 
and the patch description could have been more explicit about why we can 
now lift the restrictions.

For the records: we don't use CoR terminology upstream. As suggested by 
John, we use "GUP-triggered unsharing".

As unsharing only applies to FOLL_PIN, it doesn't quite fit into patch 
#2. Either move that to this patch or squash both.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-14 14:58   ` Matthew Wilcox
@ 2023-06-14 15:19     ` Peter Xu
  2023-06-14 15:35       ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-14 15:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, Andrea Arcangeli, John Hubbard,
	Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Wed, Jun 14, 2023 at 03:58:34PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> > +			if (page_increm > 1)
> > +				WARN_ON_ONCE(
> > +				    try_grab_folio(compound_head(page),
> 
> You don't need to call compound_head() here; try_grab_folio() works
> on tail pages just fine.

I did it with caution because of two things I'm not sure about:
try_grab_folio() may call either is_pci_p2pdma_page() or
is_longterm_pinnable_page() internally, and both of those call
is_zone_device_page() on the page*.

But I just noticed try_grab_folio() is also used in gup_pte_range(), where
a thp can be pte-mapped, so I assume it at least needs to handle tail
pages well already.

Do we perhaps need a compound_head() inside try_grab_folio(), as a separate
patch?  Or maybe I'm wrong about is_zone_device_page()?

> 
> > +						   page_increm - 1,
> > +						   foll_flags) == NULL);
> > +
> > +			for (j = 0; j < page_increm; j++) {
> > +				subpage = nth_page(page, j);
> > +				pages[i+j] = subpage;
> > +				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> > +				flush_dcache_page(subpage);
> 
> You're better off calling flush_dcache_folio() right at the end.

Will do.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-14 15:17       ` David Hildenbrand
@ 2023-06-14 15:31         ` Peter Xu
  2023-06-14 15:47           ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-14 15:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Wed, Jun 14, 2023 at 05:17:13PM +0200, David Hildenbrand wrote:
> On 14.06.23 17:11, Peter Xu wrote:
> > On Wed, Jun 14, 2023 at 04:57:37PM +0200, David Hildenbrand wrote:
> > > On 13.06.23 23:53, Peter Xu wrote:
> > > > It's coming, not yet, but soon.  Loose the restriction.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >    mm/hugetlb.c | 7 -------
> > > >    1 file changed, 7 deletions(-)
> > > > 
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index f037eaf9d819..31d8f18bc2e4 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -6467,13 +6467,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > > >    	spinlock_t *ptl;
> > > >    	pte_t *pte, entry;
> > > > -	/*
> > > > -	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> > > > -	 * follow_hugetlb_page().
> > > > -	 */
> > > > -	if (WARN_ON_ONCE(flags & FOLL_PIN))
> > > > -		return NULL;
> > > > -
> > > >    	hugetlb_vma_lock_read(vma);
> > > >    	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
> > > >    	if (!pte)
> > > Did you fix why the warning was placed there in the first place? (IIRC, at
> > > least unsharing support needs to be added, maybe more)
> > 
> > Feel free to have a look at patch 2 - it should be done there, hopefully in
> > the right way.  And IIUC it could be a bug to not do that before (besides
> > CoR there was also the pgtable permission checks that was missing).  More
> > details in patch 2's commit message.  Thanks,
> 
> Oh, that slipped my eyes (unsharing is not really a permission check) -- and

I think it is still a "permission check"?  It means, we forbid anyone R/O
taking the page if it's not exclusively owned, just like we forbid anyone
RW taking the page if it's not writable?

It's just that the permission check only applies to PIN which follow_page()
doesn't yet care, so it won't ever trigger.

> the patch description could have been more explicit about why we can now
> lift the restrictions.
> 
> For the records: we don't use CoR terminology upstream. As suggested by
> John, we use "GUP-triggered unsharing".

Sure.

> 
> As unsharing only applies to FOLL_PIN, it doesn't quite fit into patch #2.
> Either move that to this patch or squash both.

Sure, no strong opinions here.

The plan is _if_ someone wants to backport patch 2, this patch should not
be part of it.  But then maybe it makes more sense to move the CoR change
there into this one, not because "it's not permission check", but because
CoR is not relevant in follow_page(), so not relevant to a backport.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks
  2023-06-13 21:53 ` [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks Peter Xu
@ 2023-06-14 15:31   ` David Hildenbrand
  2023-06-14 15:46     ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2023-06-14 15:31 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	Vlastimil Babka, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, James Houghton, Hugh Dickins

On 13.06.23 23:53, Peter Xu wrote:
> It seems hugetlb_follow_page_mask() was missing permission checks.  For
> example, one follow_page() can get the hugetlb page with FOLL_WRITE even if
> the page is read-only.

I'm curious if there even is a follow_page() user that operates on 
hugetlb ...

s390x secure storage does not apply to hugetlb IIRC.

ksm.c? no.

huge_memory.c ? no

So what remains is most probably mm/migrate.c, which never sets FOLL_WRITE.


Or am I missing something a user?

>  > And it wasn't there even in the old follow_page_mask(), where we can
> reference from before commit 57a196a58421 ("hugetlb: simplify hugetlb
> handling in follow_page_mask").
> 
> Let's add them, namely, either the need to CoW due to missing write bit, or
> proper CoR on !AnonExclusive pages over R/O pins to reject the follow page.
> That brings this function closer to follow_hugetlb_page().
> 
> I just doubt how many of us care for that, for FOLL_PIN follow_page doesn't
> really happen at all.  But we'll care, and care more if we switch over
> slow-gup to use hugetlb_follow_page_mask().  We'll also care when to return
> -EMLINK then, as that's the gup internal api to mean "we should do CoR".
> 
> When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> clear that it just should never fail.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/hugetlb.c | 22 ++++++++++++++++------
>   1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 82dfdd96db4c..9c261921b2cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6481,8 +6481,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>   	ptl = huge_pte_lock(h, mm, pte);
>   	entry = huge_ptep_get(pte);
>   	if (pte_present(entry)) {
> -		page = pte_page(entry) +
> -				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> +		page = pte_page(entry);
> +
> +		if (gup_must_unshare(vma, flags, page)) {
> +			/* Tell the caller to do Copy-On-Read */
> +			page = ERR_PTR(-EMLINK);
> +			goto out;
> +		}
> +
> +		if ((flags & FOLL_WRITE) && !pte_write(entry)) {
> +			page = NULL;
> +			goto out;
> +		}
> +
> +		page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> +
>   		/*
>   		 * Note that page may be a sub-page, and with vmemmap
>   		 * optimizations the page struct may be read only.
> @@ -6492,10 +6505,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>   		 * try_grab_page() should always be able to get the page here,
>   		 * because we hold the ptl lock and have verified pte_present().
>   		 */
> -		if (try_grab_page(page, flags)) {
> -			page = NULL;
> -			goto out;
> -		}
> +		WARN_ON_ONCE(try_grab_page(page, flags));
>   	}
>   out:
>   	spin_unlock(ptl);

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-14 15:19     ` Peter Xu
@ 2023-06-14 15:35       ` Peter Xu
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-14 15:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, Andrea Arcangeli, John Hubbard,
	Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Wed, Jun 14, 2023 at 11:19:48AM -0400, Peter Xu wrote:
> > > +			for (j = 0; j < page_increm; j++) {
> > > +				subpage = nth_page(page, j);
> > > +				pages[i+j] = subpage;
> > > +				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> > > +				flush_dcache_page(subpage);
> > 
> > You're better off calling flush_dcache_folio() right at the end.
> 
> Will do.

Ah, when I started to modify it I noticed it's a double-edged sword: we'd
then also flush the dcache over the whole folio even if we gup a single
page.

We'd only start to benefit once some arch actually implements
flush_dcache_folio() (which seems to be none, right now..), while we'd
already start to lose by amplifying the flush whenever gup covers only
part of a folio.

Perhaps I should keep it as-is, which is still accurate, always faster than
the old code, and definitely not a regression in any form?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks
  2023-06-14 15:31   ` David Hildenbrand
@ 2023-06-14 15:46     ` Peter Xu
  2023-06-14 15:57       ` David Hildenbrand
  2023-06-15  0:11       ` Mike Kravetz
  0 siblings, 2 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-14 15:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Wed, Jun 14, 2023 at 05:31:36PM +0200, David Hildenbrand wrote:
> On 13.06.23 23:53, Peter Xu wrote:
> > It seems hugetlb_follow_page_mask() was missing permission checks.  For
> > example, one follow_page() can get the hugetlb page with FOLL_WRITE even if
> > the page is read-only.
> 
> I'm curious if there even is a follow_page() user that operates on hugetlb
> ...
> 
> s390x secure storage does not apply to hugetlb IIRC.

You're the expert, so I'll rely on you. :)

> 
> ksm.c? no.
> 
> huge_memory.c ? no
> 
> So what remains is most probably mm/migrate.c, which never sets FOLL_WRITE.
> 
> Or am I missing something a user?

Yes, none of the rest use WRITE.

Then I assume no fixes / backports are needed at all (which is what this
patch already assumes).  It's purely preparation.  I'll mention that in the
new version.

Thanks,

> 
> >  > And it wasn't there even in the old follow_page_mask(), where we can
> > reference from before commit 57a196a58421 ("hugetlb: simplify hugetlb
> > handling in follow_page_mask").
> > 
> > Let's add them, namely, either the need to CoW due to missing write bit, or
> > proper CoR on !AnonExclusive pages over R/O pins to reject the follow page.
> > That brings this function closer to follow_hugetlb_page().
> > 
> > I just doubt how many of us care for that, for FOLL_PIN follow_page doesn't
> > really happen at all.  But we'll care, and care more if we switch over
> > slow-gup to use hugetlb_follow_page_mask().  We'll also care when to return
> > -EMLINK then, as that's the gup internal api to mean "we should do CoR".
> > 
> > When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> > clear that it just should never fail.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/hugetlb.c | 22 ++++++++++++++++------
> >   1 file changed, 16 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 82dfdd96db4c..9c261921b2cf 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6481,8 +6481,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> >   	ptl = huge_pte_lock(h, mm, pte);
> >   	entry = huge_ptep_get(pte);
> >   	if (pte_present(entry)) {
> > -		page = pte_page(entry) +
> > -				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> > +		page = pte_page(entry);
> > +
> > +		if (gup_must_unshare(vma, flags, page)) {
> > +			/* Tell the caller to do Copy-On-Read */
> > +			page = ERR_PTR(-EMLINK);
> > +			goto out;
> > +		}
> > +
> > +		if ((flags & FOLL_WRITE) && !pte_write(entry)) {
> > +			page = NULL;
> > +			goto out;
> > +		}
> > +
> > +		page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> > +
> >   		/*
> >   		 * Note that page may be a sub-page, and with vmemmap
> >   		 * optimizations the page struct may be read only.
> > @@ -6492,10 +6505,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> >   		 * try_grab_page() should always be able to get the page here,
> >   		 * because we hold the ptl lock and have verified pte_present().
> >   		 */
> > -		if (try_grab_page(page, flags)) {
> > -			page = NULL;
> > -			goto out;
> > -		}
> > +		WARN_ON_ONCE(try_grab_page(page, flags));
> >   	}
> >   out:
> >   	spin_unlock(ptl);
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-14 15:31         ` Peter Xu
@ 2023-06-14 15:47           ` David Hildenbrand
  2023-06-14 15:51             ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2023-06-14 15:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On 14.06.23 17:31, Peter Xu wrote:
> On Wed, Jun 14, 2023 at 05:17:13PM +0200, David Hildenbrand wrote:
>> On 14.06.23 17:11, Peter Xu wrote:
>>> On Wed, Jun 14, 2023 at 04:57:37PM +0200, David Hildenbrand wrote:
>>>> On 13.06.23 23:53, Peter Xu wrote:
>>>>> It's coming, not yet, but soon.  Loose the restriction.
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>>     mm/hugetlb.c | 7 -------
>>>>>     1 file changed, 7 deletions(-)
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index f037eaf9d819..31d8f18bc2e4 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -6467,13 +6467,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>>>>     	spinlock_t *ptl;
>>>>>     	pte_t *pte, entry;
>>>>> -	/*
>>>>> -	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
>>>>> -	 * follow_hugetlb_page().
>>>>> -	 */
>>>>> -	if (WARN_ON_ONCE(flags & FOLL_PIN))
>>>>> -		return NULL;
>>>>> -
>>>>>     	hugetlb_vma_lock_read(vma);
>>>>>     	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
>>>>>     	if (!pte)
>>>> Did you fix why the warning was placed there in the first place? (IIRC, at
>>>> least unsharing support needs to be added, maybe more)
>>>
>>> Feel free to have a look at patch 2 - it should be done there, hopefully in
>>> the right way.  And IIUC it could be a bug to not do that before (besides
>>> CoR there was also the pgtable permission checks that was missing).  More
>>> details in patch 2's commit message.  Thanks,
>>
>> Oh, that slipped my eyes (unsharing is not really a permission check) -- and
> 
> I think it is still a "permission check"?  It means, we forbid anyone R/O
> taking the page if it's not exclusively owned, just like we forbid anyone
> RW taking the page if it's not writable?

Agreed, just not in the traditional PTE-protection case.

> 
> It's just that the permission check only applies to PIN which follow_page()
> doesn't yet care, so it won't ever trigger.
> 
>> the patch description could have been more explicit about why we can now
>> lift the restrictions.
>>
>> For the records: we don't use CoR terminology upstream. As suggested by
>> John, we use "GUP-triggered unsharing".
> 
> Sure.
> 
>>
>> As unsharing only applies to FOLL_PIN, it doesn't quite fit into patch #2.
>> Either move that to this patch or squash both.
> 
> Sure, no strong opinions here.
> 
> The plan is _if_ someone wants to backport patch 2, this patch should not
> be part of it.  But then maybe it makes more sense to move the CoR change
> there into this one, not because "it's not permission check", but because
> CoR is not relevant in follow_page(), so not relevant to a backport.

Right. Then just call patch #2 "Add missing write-permission check" and 
this patch "Support FOLL_PIN in hugetlb_follow_page_mask()" or sth. like 
that.

Regarding the backport, I really wonder if patch #2 is required at all, 
because I didn't spot any applicable FOLL_WRITE users. Maybe there were 
some? Hm. If it's not applicable, a single "Support FOLL_PIN in 
hugetlb_follow_page_mask()" patch might be cleanest.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-14 15:47           ` David Hildenbrand
@ 2023-06-14 15:51             ` Peter Xu
  2023-06-15  0:25               ` Mike Kravetz
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-14 15:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Wed, Jun 14, 2023 at 05:47:31PM +0200, David Hildenbrand wrote:
> Right. Then just call patch #2 "Add missing write-permission check" and this
> patch "Support FOLL_PIN in hugetlb_follow_page_mask()" or sth. like that.
> 
> Regarding the backport, I really wonder if patch #2 is required at all,
> because I didn't spot any applicable FOLL_WRITE users. Maybe there were
> some? Hm. If it's not applicable, a single "Support FOLL_PIN in
> hugetlb_follow_page_mask()" patch might be cleanest.

Yeah, I agree.  The code is definitely needed, but the split of patches
isn't if there's no need for a backport.  Let me merge them then.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks
  2023-06-14 15:46     ` Peter Xu
@ 2023-06-14 15:57       ` David Hildenbrand
  2023-06-15  0:11       ` Mike Kravetz
  1 sibling, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2023-06-14 15:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On 14.06.23 17:46, Peter Xu wrote:
> On Wed, Jun 14, 2023 at 05:31:36PM +0200, David Hildenbrand wrote:
>> On 13.06.23 23:53, Peter Xu wrote:
>>> It seems hugetlb_follow_page_mask() was missing permission checks.  For
>>> example, one follow_page() can get the hugetlb page with FOLL_WRITE even if
>>> the page is read-only.
>>
>> I'm curious if there even is a follow_page() user that operates on hugetlb
>> ...
>>
>> s390x secure storage does not apply to hugetlb IIRC.
> 
> You're the expert, so I'll rely on you. :)
> 

Hehe, there is a comment in gmap_destroy_page(), above one of the 
follow_page() users:

	/*
	 * Huge pages should not be able to become secure
	 */
	if (is_vm_hugetlb_page(vma))
		goto out;


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
  2023-06-13 21:53 ` [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
@ 2023-06-14 23:24   ` Mike Kravetz
  2023-06-16  8:08   ` David Hildenbrand
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Kravetz @ 2023-06-14 23:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, James Houghton, Hugh Dickins

On 06/13/23 17:53, Peter Xu wrote:
> Firstly, no_page_table() is meaningless for hugetlb, where it is a no-op,
> because a hugetlb page always satisfies:
> 
>   - vma_is_anonymous() == false
>   - vma->vm_ops->fault != NULL
> 
> So we can already safely remove it in hugetlb_follow_page_mask(), alongside
> with the page* variable.
> 
> Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a
> dump: we try to fault in the page only if the page cache is already
> allocated.  Let's do the same here for follow_page_mask() on hugetlb.
> 
> It should so far have zero effect on real dumps, because that still goes
> into follow_hugetlb_page().  But it may start to slightly influence
> follow_page() users who mimic a "dump page" scenario, hopefully in a
> good way.  This also paves the way for unifying the hugetlb gup-slow.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c     | 9 ++-------
>  mm/hugetlb.c | 9 +++++++++
>  2 files changed, 11 insertions(+), 7 deletions(-)

Thanks Peter!

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks
  2023-06-14 15:46     ` Peter Xu
  2023-06-14 15:57       ` David Hildenbrand
@ 2023-06-15  0:11       ` Mike Kravetz
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Kravetz @ 2023-06-15  0:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-kernel, linux-mm, Matthew Wilcox,
	Andrea Arcangeli, John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, James Houghton, Hugh Dickins

On 06/14/23 11:46, Peter Xu wrote:
> On Wed, Jun 14, 2023 at 05:31:36PM +0200, David Hildenbrand wrote:
> > On 13.06.23 23:53, Peter Xu wrote:
> 
> Then I assume no fixes /backport needed at all (which is what this patch
> already does).  It's purely to be prepared only.  I'll mention that in the
> new version.

Code looks fine to me.  Feel free to add,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> > > 
> > > When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be
> > > clear that it just should never fail.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >   mm/hugetlb.c | 22 ++++++++++++++++------
> > >   1 file changed, 16 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 82dfdd96db4c..9c261921b2cf 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -6481,8 +6481,21 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > >   	ptl = huge_pte_lock(h, mm, pte);
> > >   	entry = huge_ptep_get(pte);
> > >   	if (pte_present(entry)) {
> > > -		page = pte_page(entry) +
> > > -				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> > > +		page = pte_page(entry);
> > > +
> > > +		if (gup_must_unshare(vma, flags, page)) {
> > > +			/* Tell the caller to do Copy-On-Read */
> > > +			page = ERR_PTR(-EMLINK);
> > > +			goto out;
> > > +		}
> > > +
> > > +		if ((flags & FOLL_WRITE) && !pte_write(entry)) {
> > > +			page = NULL;
> > > +			goto out;
> > > +		}
> > > +
> > > +		page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> > > +
> > >   		/*
> > >   		 * Note that page may be a sub-page, and with vmemmap
> > >   		 * optimizations the page struct may be read only.
> > > @@ -6492,10 +6505,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > >   		 * try_grab_page() should always be able to get the page here,
> > >   		 * because we hold the ptl lock and have verified pte_present().
> > >   		 */
> > > -		if (try_grab_page(page, flags)) {
> > > -			page = NULL;
> > > -			goto out;
> > > -		}
> > > +		WARN_ON_ONCE(try_grab_page(page, flags));
> > >   	}
> > >   out:
> > >   	spin_unlock(ptl);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-13 21:53 ` [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
@ 2023-06-15  0:17   ` Mike Kravetz
  2023-06-16  8:11   ` David Hildenbrand
  2023-06-19 21:43   ` Peter Xu
  2 siblings, 0 replies; 39+ messages in thread
From: Mike Kravetz @ 2023-06-15  0:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, James Houghton, Hugh Dickins

On 06/13/23 17:53, Peter Xu wrote:
> follow_page() doesn't need it, but we'll start to need it when unifying gup
> for hugetlb.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/hugetlb.h | 8 +++++---
>  mm/gup.c                | 3 ++-
>  mm/hugetlb.c            | 4 +++-
>  3 files changed, 10 insertions(+), 5 deletions(-)

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 21f942025fec..0d6f389d98de 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
>  			    struct vm_area_struct *, struct vm_area_struct *);
>  struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags);
> +				      unsigned long address, unsigned int flags,
> +				      unsigned int *page_mask);
>  long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  			 struct page **, unsigned long *, unsigned long *,
>  			 long, unsigned int, int *);
> @@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
>  {
>  }
>  
> -static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags)
> +static inline struct page *hugetlb_follow_page_mask(
> +    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
> +    unsigned int *page_mask)
>  {
>  	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
>  }
> diff --git a/mm/gup.c b/mm/gup.c
> index aa0668505d61..8d59ae4554e7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -794,7 +794,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>  	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
>  	 */
>  	if (is_vm_hugetlb_page(vma))
> -		return hugetlb_follow_page_mask(vma, address, flags);
> +		return hugetlb_follow_page_mask(vma, address, flags,
> +						&ctx->page_mask);
>  
>  	pgd = pgd_offset(mm, address);
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9c261921b2cf..f037eaf9d819 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6457,7 +6457,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
>  }
>  
>  struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags)
> +				      unsigned long address, unsigned int flags,
> +				      unsigned int *page_mask)
>  {
>  	struct hstate *h = hstate_vma(vma);
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -6506,6 +6507,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>  		 * because we hold the ptl lock and have verified pte_present().
>  		 */
>  		WARN_ON_ONCE(try_grab_page(page, flags));
> +		*page_mask = huge_page_mask(h);
>  	}
>  out:
>  	spin_unlock(ptl);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-14 15:51             ` Peter Xu
@ 2023-06-15  0:25               ` Mike Kravetz
  2023-06-15 19:42                 ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Kravetz @ 2023-06-15  0:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-kernel, linux-mm, Matthew Wilcox,
	Andrea Arcangeli, John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, James Houghton, Hugh Dickins

On 06/14/23 11:51, Peter Xu wrote:
> On Wed, Jun 14, 2023 at 05:47:31PM +0200, David Hildenbrand wrote:
> > Right. Then just call patch #2 "Add missing write-permission check" and this
> > patch "Support FOLL_PIN in hugetlb_follow_page_mask()" or sth. like that.
> > 
> > Regarding the backport, I really wonder if patch #2 is required at all,
> > because I didn't spot any applicable FOLL_WRITE users. Maybe there were
> > some? Hm. If it's not applicable, a single "Support FOLL_PIN in
> > hugetlb_follow_page_mask()" patch might be cleanest.
> 
> Yeah, I agree.  The code is definitely needed, but the split of patches
> isn't if there's no need for a backport.  Let me merge them then.
> 

Should have read this before adding my RB to patch 2.  I assumed no
backport.  Agree, then merging the gup_must_unshare here makes more sense.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN
  2023-06-15  0:25               ` Mike Kravetz
@ 2023-06-15 19:42                 ` Peter Xu
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-15 19:42 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, linux-kernel, linux-mm, Matthew Wilcox,
	Andrea Arcangeli, John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, James Houghton, Hugh Dickins

On Wed, Jun 14, 2023 at 05:25:25PM -0700, Mike Kravetz wrote:
> On 06/14/23 11:51, Peter Xu wrote:
> > On Wed, Jun 14, 2023 at 05:47:31PM +0200, David Hildenbrand wrote:
> > > Right. Then just call patch #2 "Add missing write-permission check" and this
> > > patch "Support FOLL_PIN in hugetlb_follow_page_mask()" or sth. like that.
> > > 
> > > Regarding the backport, I really wonder if patch #2 is required at all,
> > > because I didn't spot any applicable FOLL_WRITE users. Maybe there were
> > > some? Hm. If it's not applicable, a single "Support FOLL_PIN in
> > > hugetlb_follow_page_mask()" patch might be cleanest.
> > 
> > Yeah, I agree.  The code is definitely needed, but the split of patches
> > isn't if there's no need for a backport.  Let me merge them then.
> > 
> 
> Should have read this before adding my RB to patch 2.  I assumed no
> backport.  Agree, then merging the gup_must_unshare here makes more sense.

Thanks for taking a look!

No worries, I'll be so bold as to carry your R-b over to the merged patch
when I repost, based on your R-b on patch 2 and the comment here.  I hope
that's fine with you.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask()
  2023-06-13 21:53 ` [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
  2023-06-14 23:24   ` Mike Kravetz
@ 2023-06-16  8:08   ` David Hildenbrand
  1 sibling, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2023-06-16  8:08 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	Vlastimil Babka, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, James Houghton, Hugh Dickins

On 13.06.23 23:53, Peter Xu wrote:
> Firstly, no_page_table() is meaningless for hugetlb, where it is a no-op,
> because a hugetlb page always satisfies:
> 
>    - vma_is_anonymous() == false
>    - vma->vm_ops->fault != NULL
> 
> So we can already safely remove it in hugetlb_follow_page_mask(), alongside
> with the page* variable.
> 
> Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a
> dump: we try to fault in the page only if the page cache is already
> allocated.  Let's do the same here for follow_page_mask() on hugetlb.
> 
> It should so far have zero effect on real dumps, because that still goes
> into follow_hugetlb_page().  But it may start to slightly influence
> follow_page() users who mimic a "dump page" scenario, hopefully in a
> good way.  This also paves the way for unifying the hugetlb gup-slow.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/gup.c     | 9 ++-------
>   mm/hugetlb.c | 9 +++++++++
>   2 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index dbe96d266670..aa0668505d61 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -781,7 +781,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>   			      struct follow_page_context *ctx)
>   {
>   	pgd_t *pgd;
> -	struct page *page;
>   	struct mm_struct *mm = vma->vm_mm;
>   
>   	ctx->page_mask = 0;
> @@ -794,12 +793,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>   	 * hugetlb_follow_page_mask is only for follow_page() handling here.
>   	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
>   	 */
> -	if (is_vm_hugetlb_page(vma)) {
> -		page = hugetlb_follow_page_mask(vma, address, flags);
> -		if (!page)
> -			page = no_page_table(vma, flags);
> -		return page;
> -	}
> +	if (is_vm_hugetlb_page(vma))
> +		return hugetlb_follow_page_mask(vma, address, flags);
>   
>   	pgd = pgd_offset(mm, address);
>   
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 270ec0ecd5a1..82dfdd96db4c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6501,6 +6501,15 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>   	spin_unlock(ptl);
>   out_unlock:
>   	hugetlb_vma_unlock_read(vma);
> +
> +	/*
> +	 * Fixup retval for dump requests: if pagecache doesn't exist,
> +	 * don't try to allocate a new page but just skip it.
> +	 */
> +	if (!page && (flags & FOLL_DUMP) &&
> +	    !hugetlbfs_pagecache_present(h, vma, address))
> +		page = ERR_PTR(-EFAULT);
> +
>   	return page;
>   }
>   

Makes sense to me:

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-13 21:53 ` [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
  2023-06-15  0:17   ` Mike Kravetz
@ 2023-06-16  8:11   ` David Hildenbrand
  2023-06-19 21:43   ` Peter Xu
  2 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2023-06-16  8:11 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	Vlastimil Babka, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, James Houghton, Hugh Dickins

On 13.06.23 23:53, Peter Xu wrote:
> follow_page() doesn't need it, but we'll start to need it when unifying gup
> for hugetlb.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/linux/hugetlb.h | 8 +++++---
>   mm/gup.c                | 3 ++-
>   mm/hugetlb.c            | 4 +++-
>   3 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 21f942025fec..0d6f389d98de 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -131,7 +131,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>   int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
>   			    struct vm_area_struct *, struct vm_area_struct *);
>   struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags);
> +				      unsigned long address, unsigned int flags,
> +				      unsigned int *page_mask);
>   long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>   			 struct page **, unsigned long *, unsigned long *,
>   			 long, unsigned int, int *);
> @@ -297,8 +298,9 @@ static inline void adjust_range_if_pmd_sharing_possible(
>   {
>   }
>   
> -static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags)
> +static inline struct page *hugetlb_follow_page_mask(
> +    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
> +    unsigned int *page_mask)
>   {
>   	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
>   }
> diff --git a/mm/gup.c b/mm/gup.c
> index aa0668505d61..8d59ae4554e7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -794,7 +794,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>   	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
>   	 */
>   	if (is_vm_hugetlb_page(vma))
> -		return hugetlb_follow_page_mask(vma, address, flags);
> +		return hugetlb_follow_page_mask(vma, address, flags,
> +						&ctx->page_mask);
>   
>   	pgd = pgd_offset(mm, address);
>   
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9c261921b2cf..f037eaf9d819 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6457,7 +6457,8 @@ static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
>   }
>   
>   struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				unsigned long address, unsigned int flags)
> +				      unsigned long address, unsigned int flags,
> +				      unsigned int *page_mask)
>   {
>   	struct hstate *h = hstate_vma(vma);
>   	struct mm_struct *mm = vma->vm_mm;
> @@ -6506,6 +6507,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>   		 * because we hold the ptl lock and have verified pte_present().
>   		 */
>   		WARN_ON_ONCE(try_grab_page(page, flags));
> +		*page_mask = huge_page_mask(h);
>   	}
>   out:
>   	spin_unlock(ptl);

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] mm/gup: Cleanup next_page handling
  2023-06-13 21:53 ` [PATCH 5/7] mm/gup: Cleanup next_page handling Peter Xu
@ 2023-06-17 19:48   ` Lorenzo Stoakes
  2023-06-17 20:00     ` Lorenzo Stoakes
  0 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2023-06-17 19:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Tue, Jun 13, 2023 at 05:53:44PM -0400, Peter Xu wrote:
> The only path that doesn't use generic "**pages" handling is the gate vma.
> Make it use the same path, meanwhile tune the next_page label upper to
> cover "**pages" handling.  This prepares for THP handling for "**pages".
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 8d59ae4554e7..a2d1b3c4b104 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1135,7 +1135,7 @@ static long __get_user_pages(struct mm_struct *mm,
>  			if (!vma && in_gate_area(mm, start)) {
>  				ret = get_gate_page(mm, start & PAGE_MASK,
>  						gup_flags, &vma,
> -						pages ? &pages[i] : NULL);
> +						pages ? &page : NULL);

Good spot... ugh that we handled this differently.

>  				if (ret)
>  					goto out;
>  				ctx.page_mask = 0;

We can drop this line now right? As the new next_page block will duplicate
this.

> @@ -1205,19 +1205,18 @@ static long __get_user_pages(struct mm_struct *mm,
>  				ret = PTR_ERR(page);
>  				goto out;
>  			}
> -
> -			goto next_page;

This is neat, we've already checked if pages != NULL so the if (pages)
block at the new next_page label will not be run.

>  		} else if (IS_ERR(page)) {
>  			ret = PTR_ERR(page);
>  			goto out;
>  		}
> +next_page:
>  		if (pages) {
>  			pages[i] = page;
>  			flush_anon_page(vma, page, start);
>  			flush_dcache_page(page);

I guess there's no harm that we now flush here, though it seems to me to be
superfluous, it's not a big deal I don't think.

>  			ctx.page_mask = 0;
>  		}
> -next_page:
> +
>  		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>  		if (page_increm > nr_pages)
>  			page_increm = nr_pages;
> --
> 2.40.1
>

Other than that, LGTM,

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] mm/gup: Cleanup next_page handling
  2023-06-17 19:48   ` Lorenzo Stoakes
@ 2023-06-17 20:00     ` Lorenzo Stoakes
  2023-06-19 19:18       ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2023-06-17 20:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Sat, Jun 17, 2023 at 08:48:38PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 13, 2023 at 05:53:44PM -0400, Peter Xu wrote:
> > The only path that doesn't use generic "**pages" handling is the gate vma.
> > Make it use the same path, meanwhile tune the next_page label upper to
> > cover "**pages" handling.  This prepares for THP handling for "**pages".
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/gup.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 8d59ae4554e7..a2d1b3c4b104 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1135,7 +1135,7 @@ static long __get_user_pages(struct mm_struct *mm,
> >  			if (!vma && in_gate_area(mm, start)) {
> >  				ret = get_gate_page(mm, start & PAGE_MASK,
> >  						gup_flags, &vma,
> > -						pages ? &pages[i] : NULL);
> > +						pages ? &page : NULL);
>
> Good spot... ugh that we handled this differently.
>
> >  				if (ret)
> >  					goto out;
> >  				ctx.page_mask = 0;
>
> We can drop this line now right? As the new next_page block will duplicate
> this.

OK I can see why you left this in given the last patch in the series :)
Please disregard.

>
> > @@ -1205,19 +1205,18 @@ static long __get_user_pages(struct mm_struct *mm,
> >  				ret = PTR_ERR(page);
> >  				goto out;
> >  			}
> > -
> > -			goto next_page;
>
> This is neat, we've already checked if pages != NULL so the if (pages)
> block at the new next_page label will not be run.
>
> >  		} else if (IS_ERR(page)) {
> >  			ret = PTR_ERR(page);
> >  			goto out;
> >  		}
> > +next_page:
> >  		if (pages) {
> >  			pages[i] = page;
> >  			flush_anon_page(vma, page, start);
> >  			flush_dcache_page(page);
>
> I guess there's no harm that we now flush here, though it seems to me to be
> superfluous, it's not a big deal I don't think.
>
> >  			ctx.page_mask = 0;
> >  		}
> > -next_page:
> > +
> >  		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> >  		if (page_increm > nr_pages)
> >  			page_increm = nr_pages;
> > --
> > 2.40.1
> >
>
> Other than that, LGTM,
>
> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-13 21:53 ` [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
  2023-06-14 14:58   ` Matthew Wilcox
@ 2023-06-17 20:27   ` Lorenzo Stoakes
  2023-06-19 19:37     ` Peter Xu
  1 sibling, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2023-06-17 20:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> The acceleration of THP was done with ctx.page_mask, however it'll be
> ignored if **pages is non-NULL.
>
> The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> accelerate mm_populate() treatment of THP pages").  It didn't explain why
> we can't optimize the **pages non-NULL case.  It's possible that at that
> time the major goal was for mm_populate() which should be enough back then.
>
> Optimize thp for all cases, by properly looping over each subpage, doing
> cache flushes, and boost refcounts / pincounts where needed in one go.
>
> This can be verified using gup_test below:
>
>   # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
>
> Before:    13992.50 ( +-8.75%)
> After:       378.50 (+-69.62%)
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c | 36 +++++++++++++++++++++++++++++-------
>  1 file changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index a2d1b3c4b104..cdabc8ea783b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
>  			goto out;
>  		}
>  next_page:
> -		if (pages) {
> -			pages[i] = page;
> -			flush_anon_page(vma, page, start);
> -			flush_dcache_page(page);
> -			ctx.page_mask = 0;
> -		}
> -
>  		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
>  		if (page_increm > nr_pages)
>  			page_increm = nr_pages;
> +
> +		if (pages) {
> +			struct page *subpage;
> +			unsigned int j;
> +
> +			/*
> +			 * This must be a large folio (and doesn't need to
> +			 * be the whole folio; it can be part of it), do
> +			 * the refcount work for all the subpages too.
> +			 * Since we already hold refcount on the head page,
> +			 * it should never fail.
> +			 *
> +			 * NOTE: here the page may not be the head page
> +			 * e.g. when start addr is not thp-size aligned.
> +			 */
> +			if (page_increm > 1)
> +				WARN_ON_ONCE(
> +				    try_grab_folio(compound_head(page),
> +						   page_increm - 1,
> +						   foll_flags) == NULL);

I'm not sure this should just warn but otherwise ignore this returning
NULL?  This feels like a case that could come up in reality,
e.g. folio_ref_try_add_rcu() fails, or !folio_is_longterm_pinnable().

Side-note: I _hate_ the semantics of GUP such that try_grab_folio()
(invoked, other than for huge page cases, by the GUP-fast logic) will
explicitly fail if neither FOLL_GET or FOLL_PIN are specified,
differentiating it from try_grab_page() in this respect.

This is a side-note and not relevant here, as all callers to
__get_user_pages() either explicitly set FOLL_GET if not set by user (in
__get_user_pages_locked()) or don't set pages (e.g. in
faultin_vma_page_range())

> +
> +			for (j = 0; j < page_increm; j++) {
> +				subpage = nth_page(page, j);
> +				pages[i+j] = subpage;
> +				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
> +				flush_dcache_page(subpage);
> +			}
> +		}
> +
>  		i += page_increm;
>  		start += page_increm * PAGE_SIZE;
>  		nr_pages -= page_increm;
> --
> 2.40.1
>
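
As an aside, a minimal userspace sketch of the stride computation that this
patch now applies to the "pages != NULL" case as well (my own illustration,
not from the patch; it assumes 4K base pages and a PMD-sized THP, so
HPAGE_PMD_NR == 512):

#include <stdio.h>

#define PAGE_SHIFT	12
#define HPAGE_PMD_NR	512	/* 2M THP with 4K base pages */

int main(void)
{
	/* ctx.page_mask as set for a PMD-mapped THP */
	unsigned long page_mask = HPAGE_PMD_NR - 1;
	/* start points 5 subpages into a THP mapped at 2M */
	unsigned long start = 0x200000UL + 5 * 4096;
	unsigned long page_increm;

	/* Same formula as __get_user_pages(): subpages covered per iteration */
	page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
	printf("page_increm = %lu\n", page_increm);	/* prints 507 */
	return 0;
}

So a single iteration of the loop now records the remaining 507 subpages
(and takes their references) in one go, instead of going back through
follow_page_mask() for each of them.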

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/7] mm/gup: Retire follow_hugetlb_page()
  2023-06-13 21:53 ` [PATCH 7/7] mm/gup: Retire follow_hugetlb_page() Peter Xu
  2023-06-14 14:37   ` Jason Gunthorpe
@ 2023-06-17 20:40   ` Lorenzo Stoakes
  2023-06-19 19:41     ` Peter Xu
  1 sibling, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2023-06-17 20:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Tue, Jun 13, 2023 at 05:53:46PM -0400, Peter Xu wrote:
> Now __get_user_pages() should be well prepared to handle thp completely,
> as long as hugetlb gup requests even without the hugetlb's special path.
>
> Time to retire follow_hugetlb_page().

Nit, but there are a couple of leftover references to this function in comments
in fs/userfaultfd.c and mm/hugetlb.c.

>
> Tweak the comments in follow_page_mask() to reflect reality, by dropping
> the "follow_page()" description.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/hugetlb.h |  12 ---
>  mm/gup.c                |  19 ----
>  mm/hugetlb.c            | 223 ----------------------------------------
>  3 files changed, 254 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 0d6f389d98de..44e5836eed15 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -133,9 +133,6 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
>  struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>  				      unsigned long address, unsigned int flags,
>  				      unsigned int *page_mask);
> -long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> -			 struct page **, unsigned long *, unsigned long *,
> -			 long, unsigned int, int *);
>  void unmap_hugepage_range(struct vm_area_struct *,
>  			  unsigned long, unsigned long, struct page *,
>  			  zap_flags_t);
> @@ -305,15 +302,6 @@ static inline struct page *hugetlb_follow_page_mask(
>  	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
>  }
>
> -static inline long follow_hugetlb_page(struct mm_struct *mm,
> -			struct vm_area_struct *vma, struct page **pages,
> -			unsigned long *position, unsigned long *nr_pages,
> -			long i, unsigned int flags, int *nonblocking)
> -{
> -	BUG();
> -	return 0;
> -}
> -
>  static inline int copy_hugetlb_page_range(struct mm_struct *dst,
>  					  struct mm_struct *src,
>  					  struct vm_area_struct *dst_vma,
> diff --git a/mm/gup.c b/mm/gup.c
> index cdabc8ea783b..a65b80953b7a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -789,9 +789,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>  	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
>  	 * special hugetlb page table walking code.  This eliminates the
>  	 * need to check for hugetlb entries in the general walking code.
> -	 *
> -	 * hugetlb_follow_page_mask is only for follow_page() handling here.
> -	 * Ordinary GUP uses follow_hugetlb_page for hugetlb processing.
>  	 */
>  	if (is_vm_hugetlb_page(vma))
>  		return hugetlb_follow_page_mask(vma, address, flags,
> @@ -1149,22 +1146,6 @@ static long __get_user_pages(struct mm_struct *mm,
>  			ret = check_vma_flags(vma, gup_flags);
>  			if (ret)
>  				goto out;
> -
> -			if (is_vm_hugetlb_page(vma)) {
> -				i = follow_hugetlb_page(mm, vma, pages,
> -							&start, &nr_pages, i,
> -							gup_flags, locked);
> -				if (!*locked) {
> -					/*
> -					 * We've got a VM_FAULT_RETRY
> -					 * and we've lost mmap_lock.
> -					 * We must stop here.
> -					 */
> -					BUG_ON(gup_flags & FOLL_NOWAIT);
> -					goto out;
> -				}
> -				continue;
> -			}
>  		}
>  retry:
>  		/*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 31d8f18bc2e4..b7ff413ff68b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6425,37 +6425,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  }
>  #endif /* CONFIG_USERFAULTFD */
>
> -static void record_subpages(struct page *page, struct vm_area_struct *vma,
> -			    int refs, struct page **pages)
> -{
> -	int nr;
> -
> -	for (nr = 0; nr < refs; nr++) {
> -		if (likely(pages))
> -			pages[nr] = nth_page(page, nr);
> -	}
> -}
> -
> -static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
> -					       unsigned int flags, pte_t *pte,
> -					       bool *unshare)
> -{
> -	pte_t pteval = huge_ptep_get(pte);
> -
> -	*unshare = false;
> -	if (is_swap_pte(pteval))
> -		return true;
> -	if (huge_pte_write(pteval))
> -		return false;
> -	if (flags & FOLL_WRITE)
> -		return true;
> -	if (gup_must_unshare(vma, flags, pte_page(pteval))) {
> -		*unshare = true;
> -		return true;
> -	}
> -	return false;
> -}
> -
>  struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>  				      unsigned long address, unsigned int flags,
>  				      unsigned int *page_mask)
> @@ -6518,198 +6487,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>  	return page;
>  }
>
> -long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> -			 struct page **pages, unsigned long *position,
> -			 unsigned long *nr_pages, long i, unsigned int flags,
> -			 int *locked)
> -{
> -	unsigned long pfn_offset;
> -	unsigned long vaddr = *position;
> -	unsigned long remainder = *nr_pages;
> -	struct hstate *h = hstate_vma(vma);
> -	int err = -EFAULT, refs;
> -
> -	while (vaddr < vma->vm_end && remainder) {
> -		pte_t *pte;
> -		spinlock_t *ptl = NULL;
> -		bool unshare = false;
> -		int absent;
> -		struct page *page;
> -
> -		/*
> -		 * If we have a pending SIGKILL, don't keep faulting pages and
> -		 * potentially allocating memory.
> -		 */
> -		if (fatal_signal_pending(current)) {
> -			remainder = 0;
> -			break;
> -		}
> -
> -		hugetlb_vma_lock_read(vma);
> -		/*
> -		 * Some archs (sparc64, sh*) have multiple pte_ts to
> -		 * each hugepage.  We have to make sure we get the
> -		 * first, for the page indexing below to work.
> -		 *
> -		 * Note that page table lock is not held when pte is null.
> -		 */
> -		pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
> -				   huge_page_size(h));
> -		if (pte)
> -			ptl = huge_pte_lock(h, mm, pte);
> -		absent = !pte || huge_pte_none(huge_ptep_get(pte));
> -
> -		/*
> -		 * When coredumping, it suits get_dump_page if we just return
> -		 * an error where there's an empty slot with no huge pagecache
> -		 * to back it.  This way, we avoid allocating a hugepage, and
> -		 * the sparse dumpfile avoids allocating disk blocks, but its
> -		 * huge holes still show up with zeroes where they need to be.
> -		 */
> -		if (absent && (flags & FOLL_DUMP) &&
> -		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
> -			if (pte)
> -				spin_unlock(ptl);
> -			hugetlb_vma_unlock_read(vma);
> -			remainder = 0;
> -			break;
> -		}
> -
> -		/*
> -		 * We need call hugetlb_fault for both hugepages under migration
> -		 * (in which case hugetlb_fault waits for the migration,) and
> -		 * hwpoisoned hugepages (in which case we need to prevent the
> -		 * caller from accessing to them.) In order to do this, we use
> -		 * here is_swap_pte instead of is_hugetlb_entry_migration and
> -		 * is_hugetlb_entry_hwpoisoned. This is because it simply covers
> -		 * both cases, and because we can't follow correct pages
> -		 * directly from any kind of swap entries.
> -		 */
> -		if (absent ||
> -		    __follow_hugetlb_must_fault(vma, flags, pte, &unshare)) {
> -			vm_fault_t ret;
> -			unsigned int fault_flags = 0;
> -
> -			if (pte)
> -				spin_unlock(ptl);
> -			hugetlb_vma_unlock_read(vma);
> -
> -			if (flags & FOLL_WRITE)
> -				fault_flags |= FAULT_FLAG_WRITE;
> -			else if (unshare)
> -				fault_flags |= FAULT_FLAG_UNSHARE;
> -			if (locked) {
> -				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> -					FAULT_FLAG_KILLABLE;
> -				if (flags & FOLL_INTERRUPTIBLE)
> -					fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
> -			}
> -			if (flags & FOLL_NOWAIT)
> -				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> -					FAULT_FLAG_RETRY_NOWAIT;
> -			if (flags & FOLL_TRIED) {
> -				/*
> -				 * Note: FAULT_FLAG_ALLOW_RETRY and
> -				 * FAULT_FLAG_TRIED can co-exist
> -				 */
> -				fault_flags |= FAULT_FLAG_TRIED;
> -			}
> -			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> -			if (ret & VM_FAULT_ERROR) {
> -				err = vm_fault_to_errno(ret, flags);
> -				remainder = 0;
> -				break;
> -			}
> -			if (ret & VM_FAULT_RETRY) {
> -				if (locked &&
> -				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
> -					*locked = 0;
> -				*nr_pages = 0;
> -				/*
> -				 * VM_FAULT_RETRY must not return an
> -				 * error, it will return zero
> -				 * instead.
> -				 *
> -				 * No need to update "position" as the
> -				 * caller will not check it after
> -				 * *nr_pages is set to 0.
> -				 */
> -				return i;
> -			}
> -			continue;
> -		}
> -
> -		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> -		page = pte_page(huge_ptep_get(pte));
> -
> -		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> -			       !PageAnonExclusive(page), page);
> -
> -		/*
> -		 * If subpage information not requested, update counters
> -		 * and skip the same_page loop below.
> -		 */
> -		if (!pages && !pfn_offset &&
> -		    (vaddr + huge_page_size(h) < vma->vm_end) &&
> -		    (remainder >= pages_per_huge_page(h))) {
> -			vaddr += huge_page_size(h);
> -			remainder -= pages_per_huge_page(h);
> -			i += pages_per_huge_page(h);
> -			spin_unlock(ptl);
> -			hugetlb_vma_unlock_read(vma);
> -			continue;
> -		}
> -
> -		/* vaddr may not be aligned to PAGE_SIZE */
> -		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
> -		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
> -
> -		if (pages)
> -			record_subpages(nth_page(page, pfn_offset),
> -					vma, refs,
> -					likely(pages) ? pages + i : NULL);
> -
> -		if (pages) {
> -			/*
> -			 * try_grab_folio() should always succeed here,
> -			 * because: a) we hold the ptl lock, and b) we've just
> -			 * checked that the huge page is present in the page
> -			 * tables. If the huge page is present, then the tail
> -			 * pages must also be present. The ptl prevents the
> -			 * head page and tail pages from being rearranged in
> -			 * any way. As this is hugetlb, the pages will never
> -			 * be p2pdma or not longterm pinable. So this page
> -			 * must be available at this point, unless the page
> -			 * refcount overflowed:
> -			 */
> -			if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
> -							 flags))) {
> -				spin_unlock(ptl);
> -				hugetlb_vma_unlock_read(vma);
> -				remainder = 0;
> -				err = -ENOMEM;
> -				break;
> -			}
> -		}
> -
> -		vaddr += (refs << PAGE_SHIFT);
> -		remainder -= refs;
> -		i += refs;
> -
> -		spin_unlock(ptl);
> -		hugetlb_vma_unlock_read(vma);
> -	}
> -	*nr_pages = remainder;
> -	/*
> -	 * setting position is actually required only if remainder is
> -	 * not zero but it's faster not to add a "if (remainder)"
> -	 * branch.
> -	 */
> -	*position = vaddr;
> -
> -	return i ? i : err;
> -}
> -
>  long hugetlb_change_protection(struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end,
>  		pgprot_t newprot, unsigned long cp_flags)
> --
> 2.40.1
>

Absolutely wonderful to see such delightful code deletion :)

I haven't really dug into the huge page side of this in great detail, so I
can't give a meaningful tag, but I can certainly appreciate the code you're
removing here!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] mm/gup: Cleanup next_page handling
  2023-06-17 20:00     ` Lorenzo Stoakes
@ 2023-06-19 19:18       ` Peter Xu
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-19 19:18 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Sat, Jun 17, 2023 at 09:00:34PM +0100, Lorenzo Stoakes wrote:
> On Sat, Jun 17, 2023 at 08:48:38PM +0100, Lorenzo Stoakes wrote:
> > On Tue, Jun 13, 2023 at 05:53:44PM -0400, Peter Xu wrote:
> > > The only path that doesn't use generic "**pages" handling is the gate vma.
> > > Make it use the same path, meanwhile tune the next_page label upper to
> > > cover "**pages" handling.  This prepares for THP handling for "**pages".
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  mm/gup.c | 7 +++----
> > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 8d59ae4554e7..a2d1b3c4b104 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1135,7 +1135,7 @@ static long __get_user_pages(struct mm_struct *mm,
> > >  			if (!vma && in_gate_area(mm, start)) {
> > >  				ret = get_gate_page(mm, start & PAGE_MASK,
> > >  						gup_flags, &vma,
> > > -						pages ? &pages[i] : NULL);
> > > +						pages ? &page : NULL);
> >
> > Good spot... ugh that we handled this differently.
> >
> > >  				if (ret)
> > >  					goto out;
> > >  				ctx.page_mask = 0;
> >
> > We can drop this line now right? As the new next_page block will duplicate
> > this.
> 
> OK I can see why you left this in given the last patch in the series :)
> Please disregard.

Yes the other "page_mask=0" will be removed in the next (not last) patch.

> 
> >
> > > @@ -1205,19 +1205,18 @@ static long __get_user_pages(struct mm_struct *mm,
> > >  				ret = PTR_ERR(page);
> > >  				goto out;
> > >  			}
> > > -
> > > -			goto next_page;
> >
> > This is neat, we've already checked if pages != NULL so the if (pages)
> > block at the new next_page label will not be run.

Yes.

> >
> > >  		} else if (IS_ERR(page)) {
> > >  			ret = PTR_ERR(page);
> > >  			goto out;
> > >  		}
> > > +next_page:
> > >  		if (pages) {
> > >  			pages[i] = page;
> > >  			flush_anon_page(vma, page, start);
> > >  			flush_dcache_page(page);
> >
> > I guess there's no harm that we now flush here, though it seems to me to be
> > superfluous, it's not a big deal I don't think.

I'd say GUP on a gate vma page should be rare enough that it shouldn't
be a big deal.  IIUC vsyscall=xonly should be the default anyway, so gup may
have already failed on a gate vma page even for a read-only access.

> >
> > >  			ctx.page_mask = 0;
> > >  		}
> > > -next_page:
> > > +
> > >  		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> > >  		if (page_increm > nr_pages)
> > >  			page_increm = nr_pages;
> > > --
> > > 2.40.1
> > >
> >
> > Other than that, LGTM,
> >
> > Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

Thanks for looking!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-17 20:27   ` Lorenzo Stoakes
@ 2023-06-19 19:37     ` Peter Xu
  2023-06-19 20:24       ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-19 19:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Sat, Jun 17, 2023 at 09:27:22PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 13, 2023 at 05:53:45PM -0400, Peter Xu wrote:
> > The acceleration of THP was done with ctx.page_mask, however it'll be
> > ignored if **pages is non-NULL.
> >
> > The old optimization was introduced in 2013 in 240aadeedc4a ("mm:
> > accelerate mm_populate() treatment of THP pages").  It didn't explain why
> > we can't optimize the **pages non-NULL case.  It's possible that at that
> > time the major goal was for mm_populate() which should be enough back then.
> >
> > Optimize thp for all cases, by properly looping over each subpage, doing
> > cache flushes, and boost refcounts / pincounts where needed in one go.
> >
> > This can be verified using gup_test below:
> >
> >   # chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10
> >
> > Before:    13992.50 ( +-8.75%)
> > After:       378.50 (+-69.62%)
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/gup.c | 36 +++++++++++++++++++++++++++++-------
> >  1 file changed, 29 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index a2d1b3c4b104..cdabc8ea783b 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1210,16 +1210,38 @@ static long __get_user_pages(struct mm_struct *mm,
> >  			goto out;
> >  		}
> >  next_page:
> > -		if (pages) {
> > -			pages[i] = page;
> > -			flush_anon_page(vma, page, start);
> > -			flush_dcache_page(page);
> > -			ctx.page_mask = 0;
> > -		}
> > -
> >  		page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
> >  		if (page_increm > nr_pages)
> >  			page_increm = nr_pages;
> > +
> > +		if (pages) {
> > +			struct page *subpage;
> > +			unsigned int j;
> > +
> > +			/*
> > +			 * This must be a large folio (and doesn't need to
> > +			 * be the whole folio; it can be part of it), do
> > +			 * the refcount work for all the subpages too.
> > +			 * Since we already hold refcount on the head page,
> > +			 * it should never fail.
> > +			 *
> > +			 * NOTE: here the page may not be the head page
> > +			 * e.g. when start addr is not thp-size aligned.
> > +			 */
> > +			if (page_increm > 1)
> > +				WARN_ON_ONCE(
> > +				    try_grab_folio(compound_head(page),
> > +						   page_increm - 1,
> > +						   foll_flags) == NULL);
> 
> I'm not sure this should just warn but otherwise ignore this returning
> NULL?  This feels like a case that could come up in reality,
> e.g. folio_ref_try_add_rcu() fails, or !folio_is_longterm_pinnable().

Note that we already hold at least 1 refcount on the folio (also mentioned
in the comment above this chunk of code), so both folio_ref_try_add_rcu()
and folio_is_longterm_pinnable() should already have been called on the
same folio and passed.  If it were going to fail, it should have failed
already, afaict.

I still don't see how that would trigger if the refcount won't overflow.

What I can do here is still guard this try_grab_folio() and fail the GUP
if it fails for any reason.  Perhaps that also means I'll keep the
equivalent one in hugetlb_follow_page_mask() untouched too.  But I suppose
keeping the WARN_ON_ONCE() still seems proper.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/7] mm/gup: Retire follow_hugetlb_page()
  2023-06-17 20:40   ` Lorenzo Stoakes
@ 2023-06-19 19:41     ` Peter Xu
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-19 19:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Sat, Jun 17, 2023 at 09:40:41PM +0100, Lorenzo Stoakes wrote:
> On Tue, Jun 13, 2023 at 05:53:46PM -0400, Peter Xu wrote:
> > Now __get_user_pages() should be well prepared to handle thp completely,
> > as long as hugetlb gup requests even without the hugetlb's special path.
> >
> > Time to retire follow_hugetlb_page().
> 
> Nit, but there are a couple of leftover references to this function in comments
> in fs/userfaultfd.c and mm/hugetlb.c.

Indeed, let me touch those too when I respin.

> Absolutely wonderful to see such delightful code deletion :)
> 
> I haven't really dug into the huge page side of this in great detail, so I
> can't give a meaningful tag, but I can certainly appreciate the code you're
> removing here!

Yeah, hopefully it'll also make the code cleaner.  Thanks a lot,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL"
  2023-06-19 19:37     ` Peter Xu
@ 2023-06-19 20:24       ` Peter Xu
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-19 20:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, David Hildenbrand, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Mon, Jun 19, 2023 at 03:37:30PM -0400, Peter Xu wrote:
> What I can do here is still guard this try_grab_folio() and fail the GUP
> if it fails for any reason.  Perhaps that also means I'll keep the
> equivalent one in hugetlb_follow_page_mask() untouched too.  But I suppose
> keeping the WARN_ON_ONCE() still seems proper.

Here's the outcome that I plan to post in the new version, taking care of
try_grab_folio() failures if they ever happen, and meanwhile removing the
compound_head() redundancy on the page.

__get_user_pages():
...
===8<===
			/*
			 * This must be a large folio (and doesn't need to
			 * be the whole folio; it can be part of it), do
			 * the refcount work for all the subpages too.
			 *
			 * NOTE: here the page may not be the head page
			 * e.g. when start addr is not thp-size aligned.
			 * try_grab_folio() should have taken care of tail
			 * pages.
			 */
			if (page_increm > 1) {
				struct folio *folio;

				/*
				 * Since we already hold refcount on the
				 * large folio, this should never fail.
				 */
				folio = try_grab_folio(page, page_increm - 1,
						       foll_flags);
				if (WARN_ON_ONCE(!folio)) {
					/*
					 * Release the 1st page ref if the
					 * folio is problematic, fail hard.
					 */
					gup_put_folio(page_folio(page), 1,
						      foll_flags);
					ret = -EFAULT;
					goto out;
				}
			}
===8<===

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-13 21:53 ` [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
  2023-06-15  0:17   ` Mike Kravetz
  2023-06-16  8:11   ` David Hildenbrand
@ 2023-06-19 21:43   ` Peter Xu
  2023-06-20  7:01     ` David Hildenbrand
  2 siblings, 1 reply; 39+ messages in thread
From: Peter Xu @ 2023-06-19 21:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	David Hildenbrand, Vlastimil Babka, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz, James Houghton, Hugh Dickins

On Tue, Jun 13, 2023 at 05:53:42PM -0400, Peter Xu wrote:
> @@ -6506,6 +6507,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>  		 * because we hold the ptl lock and have verified pte_present().
>  		 */
>  		WARN_ON_ONCE(try_grab_page(page, flags));
> +		*page_mask = huge_page_mask(h);

Sorry, I was wrong about this line.  It should be:

		*page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;

This can be exposed if we bypass fast-gup and also specify pinning with
npages > 1.  I probably overlooked it in my initial round, and I'd guess
fast-gup just succeeded there.

I'll temporarily drop the R-bs for this patch, so reviewers can have
another closer look at this one.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-19 21:43   ` Peter Xu
@ 2023-06-20  7:01     ` David Hildenbrand
  2023-06-20 14:40       ` Peter Xu
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2023-06-20  7:01 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Matthew Wilcox, Andrea Arcangeli, John Hubbard, Mike Rapoport,
	Vlastimil Babka, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, James Houghton, Hugh Dickins

On 19.06.23 23:43, Peter Xu wrote:
> On Tue, Jun 13, 2023 at 05:53:42PM -0400, Peter Xu wrote:
>> @@ -6506,6 +6507,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>>   		 * because we hold the ptl lock and have verified pte_present().
>>   		 */
>>   		WARN_ON_ONCE(try_grab_page(page, flags));
>> +		*page_mask = huge_page_mask(h);
> 
> Sorry, I was wrong about this line.  It should be:
> 
> 		*page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
> 

That's ... surprising. It feels like either page_mask or 
huge_page_mask() has a misleading name ....

h->mask = ~(huge_page_size(h) - 1);


For PMDs, we do

ctx->page_mask = HPAGE_PMD_NR - 1;


Maybe

*page_mask = PHYS_PFN(huge_page_size(h)) - 1;

Would be clearer.

I guess "page_mask" should actually be "pfn_mask" ... but the meaning 
regarding PAGE_MASK is still inverted ...
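
A minimal sketch of the arithmetic (my own check, assuming a 2M hugetlb
page and 4K base pages, i.e. PAGE_SHIFT == 12), just to show the two
spellings agree:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PHYS_PFN(x)	((unsigned long)(x) >> PAGE_SHIFT)

int main(void)
{
	unsigned long huge_page_size = 2UL << 20;		/* 2M hugetlb page */
	unsigned long huge_page_mask = ~(huge_page_size - 1);	/* h->mask */

	/* Peter's corrected line vs. the PHYS_PFN() form: both print 511 */
	printf("%lu\n", ~huge_page_mask >> PAGE_SHIFT);
	printf("%lu\n", PHYS_PFN(huge_page_size) - 1);
	return 0;
}

Both print 511, matching the "HPAGE_PMD_NR - 1" convention used for PMDs.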

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask()
  2023-06-20  7:01     ` David Hildenbrand
@ 2023-06-20 14:40       ` Peter Xu
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Xu @ 2023-06-20 14:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Matthew Wilcox, Andrea Arcangeli,
	John Hubbard, Mike Rapoport, Vlastimil Babka,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz, James Houghton,
	Hugh Dickins

On Tue, Jun 20, 2023 at 09:01:23AM +0200, David Hildenbrand wrote:
> On 19.06.23 23:43, Peter Xu wrote:
> > On Tue, Jun 13, 2023 at 05:53:42PM -0400, Peter Xu wrote:
> > > @@ -6506,6 +6507,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > >   		 * because we hold the ptl lock and have verified pte_present().
> > >   		 */
> > >   		WARN_ON_ONCE(try_grab_page(page, flags));
> > > +		*page_mask = huge_page_mask(h);
> > 
> > Sorry, I was wrong about this line.  It should be:
> > 
> > 		*page_mask = ~huge_page_mask(h) >> PAGE_SHIFT;
> > 
> 
> That's ... surprising. It feels like either page_mask or huge_page_mask()
> has a misleading name ....
> 
> h->mask = ~(huge_page_size(h) - 1);
> 
> 
> For PMDs, we do
> 
> ctx->page_mask = HPAGE_PMD_NR - 1;
> 
> 
> Maybe
> 
> *page_mask = PHYS_PFN(huge_page_size(h)) - 1;
> 
> Would be clearer.

Since I just posted a new version.. I'll see whether I should get that
cleaned up in a new one.

> 
> I guess "page_mask" should actually be "pfn_mask" ... but the meaning
> regarding PAGE_MASK is still inverted ...

Yes, pfn_mask can be at least slightly better.  I can add a patch to rename
it, or we can also do it on top as cleanups.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2023-06-20 14:42 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-13 21:53 [PATCH 0/7] mm/gup: Unify hugetlb, speed up thp Peter Xu
2023-06-13 21:53 ` [PATCH 1/7] mm/hugetlb: Handle FOLL_DUMP well in follow_page_mask() Peter Xu
2023-06-14 23:24   ` Mike Kravetz
2023-06-16  8:08   ` David Hildenbrand
2023-06-13 21:53 ` [PATCH 2/7] mm/hugetlb: Fix hugetlb_follow_page_mask() on permission checks Peter Xu
2023-06-14 15:31   ` David Hildenbrand
2023-06-14 15:46     ` Peter Xu
2023-06-14 15:57       ` David Hildenbrand
2023-06-15  0:11       ` Mike Kravetz
2023-06-13 21:53 ` [PATCH 3/7] mm/hugetlb: Add page_mask for hugetlb_follow_page_mask() Peter Xu
2023-06-15  0:17   ` Mike Kravetz
2023-06-16  8:11   ` David Hildenbrand
2023-06-19 21:43   ` Peter Xu
2023-06-20  7:01     ` David Hildenbrand
2023-06-20 14:40       ` Peter Xu
2023-06-13 21:53 ` [PATCH 4/7] mm/hugetlb: Prepare hugetlb_follow_page_mask() for FOLL_PIN Peter Xu
2023-06-14 14:57   ` David Hildenbrand
2023-06-14 15:11     ` Peter Xu
2023-06-14 15:17       ` David Hildenbrand
2023-06-14 15:31         ` Peter Xu
2023-06-14 15:47           ` David Hildenbrand
2023-06-14 15:51             ` Peter Xu
2023-06-15  0:25               ` Mike Kravetz
2023-06-15 19:42                 ` Peter Xu
2023-06-13 21:53 ` [PATCH 5/7] mm/gup: Cleanup next_page handling Peter Xu
2023-06-17 19:48   ` Lorenzo Stoakes
2023-06-17 20:00     ` Lorenzo Stoakes
2023-06-19 19:18       ` Peter Xu
2023-06-13 21:53 ` [PATCH 6/7] mm/gup: Accelerate thp gup even for "pages != NULL" Peter Xu
2023-06-14 14:58   ` Matthew Wilcox
2023-06-14 15:19     ` Peter Xu
2023-06-14 15:35       ` Peter Xu
2023-06-17 20:27   ` Lorenzo Stoakes
2023-06-19 19:37     ` Peter Xu
2023-06-19 20:24       ` Peter Xu
2023-06-13 21:53 ` [PATCH 7/7] mm/gup: Retire follow_hugetlb_page() Peter Xu
2023-06-14 14:37   ` Jason Gunthorpe
2023-06-17 20:40   ` Lorenzo Stoakes
2023-06-19 19:41     ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).