linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
@ 2022-06-24 17:36 James Houghton
  2022-06-24 17:36 ` [RFC PATCH 01/26] hugetlb: make hstate accessor functions const James Houghton
                   ` (28 more replies)
  0 siblings, 29 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This RFC introduces the concept of HugeTLB high-granularity mapping
(HGM)[1].  In broad terms, this series teaches HugeTLB how to map HugeTLB
pages at different granularities, and, more importantly, to partially map
a HugeTLB page.  This cover letter will go over
 - the motivation for these changes
 - userspace API
 - some of the changes to HugeTLB to make this work
 - limitations & future enhancements

High-granularity mapping does *not* involve dissolving the hugepages
themselves; it only affects how they are mapped.

---- Motivation ----

Being able to map HugeTLB memory with PAGE_SIZE PTEs has important use
cases in post-copy live migration and memory failure handling.

- Live Migration (userfaultfd)
For post-copy live migration, using userfaultfd, currently we have to
install an entire hugepage before we can allow a guest to access that page.
This is because, right now, either the WHOLE hugepage is mapped or NONE of
it is.  So either the guest can access the WHOLE hugepage or NONE of it.
This makes post-copy live migration for 1G HugeTLB-backed VMs completely
infeasible.

With high-granularity mapping, we can map PAGE_SIZE pieces of a hugepage,
thereby allowing the guest to access only PAGE_SIZE chunks, and getting
page faults on the rest (and triggering another demand-fetch). This gives
userspace the flexibility to install PAGE_SIZE chunks of memory into a
hugepage, making migration of 1G-backed VMs perfectly feasible, and it
vastly reduces the vCPU stall time during post-copy for 2M-backed VMs.

At Google, for a 48 vCPU VM in post-copy, we can expect these approximate
per-page median fetch latencies:
     4K: <100us
     2M: >10ms
Being able to unpause a vCPU 100x quicker is helpful for guest stability,
and being able to use 1G pages at all can significant improve steady-state
guest performance.

After fully copying a hugepage over the network, we will want to collapse
the mapping down to what it would normally be (e.g., one PUD for a 1G
page). Rather than having the kernel do this automatically, we leave it up
to userspace to tell us to collapse a range (via MADV_COLLAPSE, co-opting
the API that is being introduced for THPs[2]).

- Memory Failure
When a memory error is found within a HugeTLB page, it would be ideal if we
could unmap only the PAGE_SIZE section that contained the error. This is
what THPs are able to do. Using high-granularity mapping, we could do this,
but this isn't tackled in this patch series.

---- Userspace API ----

This patch series introduces a single way to take advantage of
high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
userspace to resolve MINOR page faults on shared VMAs.

To collapse a HugeTLB address range that has been mapped with several
UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
userspace to know when all pages (that they care about) have been fetched.

---- HugeTLB Changes ----

- Mapcount
The way mapcount is handled is different from the way that it was handled
before. If the PUD for a hugepage is not none, a hugepage's mapcount will
be increased. This scheme means that, for hugepages that aren't mapped at
high granularity, their mapcounts will remain the same as what they would
have been pre-HGM.

- Page table walking and manipulation
A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
high-granularity mappings. Eventually, it's possible to merge
hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.

We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
This is because we generally need to know the "size" of a PTE (previously
always just huge_page_size(hstate)).

For every page table manipulation function that has a huge version (e.g.
huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
PTE really is "huge".

- Synchronization
For existing bits of HugeTLB, synchronization is unchanged. For splitting
and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
writing, and for doing high-granularity page table walks, we require it to
be held for reading.

---- Limitations & Future Changes ----

This patch series only implements high-granularity mapping for VM_SHARED
VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
failure recovery for both shared and private mappings.

The memory failure use case poses its own challenges that can be
addressed, but I will do so in a separate RFC.

Performance has not been heavily scrutinized with this patch series. There
are places where lock contention can significantly reduce performance. This
will be addressed later.

The patch series, as it stands right now, is compatible with the VMEMMAP
page struct optimization[3], as we do not need to modify data contained
in the subpage page structs.

Other omissions:
 - Compatibility with userfaultfd write-protect (will be included in v1).
 - Support for mremap() (will be included in v1). This looks a lot like
   the support we have for fork().
 - Documentation changes (will be included in v1).
 - Completely ignores PMD sharing and hugepage migration (will be included
   in v1).
 - Implementations for architectures that don't use GENERAL_HUGETLB other
   than arm64.

---- Patch Breakdown ----

Patch 1     - Preliminary changes
Patch 2-10  - HugeTLB HGM core changes
Patch 11-13 - HugeTLB HGM page table walking functionality
Patch 14-19 - HugeTLB HGM compatibility with other bits
Patch 20-23 - Userfaultfd and collapse changes
Patch 24-26 - arm64 support and selftests

[1] This used to be called HugeTLB double mapping, a bad and confusing
    name. "High-granularity mapping" is not a great name either. I am open
    to better names.
[2] https://lore.kernel.org/linux-mm/20220604004004.954674-10-zokeefe@google.com/
[3] commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")

James Houghton (26):
  hugetlb: make hstate accessor functions const
  hugetlb: sort hstates in hugetlb_init_hstates
  hugetlb: add make_huge_pte_with_shift
  hugetlb: make huge_pte_lockptr take an explicit shift argument.
  hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  mm: make free_p?d_range functions public
  hugetlb: add hugetlb_pte to track HugeTLB page table entries
  hugetlb: add hugetlb_free_range to free PT structures
  hugetlb: add hugetlb_hgm_enabled
  hugetlb: add for_each_hgm_shift
  hugetlb: add hugetlb_walk_to to do PT walks
  hugetlb: add HugeTLB splitting functionality
  hugetlb: add huge_pte_alloc_high_granularity
  hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
  hugetlb: make unmapping compatible with high-granularity mappings
  hugetlb: make hugetlb_change_protection compatible with HGM
  hugetlb: update follow_hugetlb_page to support HGM
  hugetlb: use struct hugetlb_pte for walk_hugetlb_range
  hugetlb: add HGM support for copy_hugetlb_page_range
  hugetlb: add support for high-granularity UFFDIO_CONTINUE
  hugetlb: add hugetlb_collapse
  madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE
  userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  arm64/hugetlb: add support for high-granularity mappings
  selftests: add HugeTLB HGM to userfaultfd selftest
  selftests: add HugeTLB HGM to KVM demand paging selftest

 arch/arm64/Kconfig                            |   1 +
 arch/arm64/mm/hugetlbpage.c                   |  63 ++
 arch/powerpc/mm/pgtable.c                     |   3 +-
 arch/s390/mm/gmap.c                           |   8 +-
 fs/Kconfig                                    |   7 +
 fs/proc/task_mmu.c                            |  35 +-
 fs/userfaultfd.c                              |  10 +-
 include/asm-generic/tlb.h                     |   6 +-
 include/linux/hugetlb.h                       | 177 +++-
 include/linux/mm.h                            |   7 +
 include/linux/pagewalk.h                      |   3 +-
 include/uapi/asm-generic/mman-common.h        |   2 +
 include/uapi/linux/userfaultfd.h              |   2 +
 mm/damon/vaddr.c                              |  34 +-
 mm/hmm.c                                      |   7 +-
 mm/hugetlb.c                                  | 987 +++++++++++++++---
 mm/madvise.c                                  |  23 +
 mm/memory.c                                   |   8 +-
 mm/mempolicy.c                                |  11 +-
 mm/migrate.c                                  |   3 +-
 mm/mincore.c                                  |   4 +-
 mm/mprotect.c                                 |   6 +-
 mm/page_vma_mapped.c                          |   3 +-
 mm/pagewalk.c                                 |  18 +-
 mm/userfaultfd.c                              |  57 +-
 .../testing/selftests/kvm/include/test_util.h |   2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   2 +-
 tools/testing/selftests/kvm/lib/test_util.c   |  14 +
 tools/testing/selftests/vm/userfaultfd.c      |  61 +-
 29 files changed, 1314 insertions(+), 250 deletions(-)

-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH 01/26] hugetlb: make hstate accessor functions const
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 18:43   ` Mina Almasry
                     ` (2 more replies)
  2022-06-24 17:36 ` [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
                   ` (27 subsequent siblings)
  28 siblings, 3 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This is just a const-correctness change so that the new hugetlb_pte
changes can be const-correct too.

Acked-by: David Rientjes <rientjes@google.com>

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e4cff27d1198..498a4ae3d462 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -715,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
 	return hstate_file(vma->vm_file);
 }
 
-static inline unsigned long huge_page_size(struct hstate *h)
+static inline unsigned long huge_page_size(const struct hstate *h)
 {
 	return (unsigned long)PAGE_SIZE << h->order;
 }
@@ -729,27 +729,27 @@ static inline unsigned long huge_page_mask(struct hstate *h)
 	return h->mask;
 }
 
-static inline unsigned int huge_page_order(struct hstate *h)
+static inline unsigned int huge_page_order(const struct hstate *h)
 {
 	return h->order;
 }
 
-static inline unsigned huge_page_shift(struct hstate *h)
+static inline unsigned huge_page_shift(const struct hstate *h)
 {
 	return h->order + PAGE_SHIFT;
 }
 
-static inline bool hstate_is_gigantic(struct hstate *h)
+static inline bool hstate_is_gigantic(const struct hstate *h)
 {
 	return huge_page_order(h) >= MAX_ORDER;
 }
 
-static inline unsigned int pages_per_huge_page(struct hstate *h)
+static inline unsigned int pages_per_huge_page(const struct hstate *h)
 {
 	return 1 << h->order;
 }
 
-static inline unsigned int blocks_per_huge_page(struct hstate *h)
+static inline unsigned int blocks_per_huge_page(const struct hstate *h)
 {
 	return huge_page_size(h) / 512;
 }
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
  2022-06-24 17:36 ` [RFC PATCH 01/26] hugetlb: make hstate accessor functions const James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 18:51   ` Mina Almasry
                     ` (2 more replies)
  2022-06-24 17:36 ` [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift James Houghton
                   ` (26 subsequent siblings)
  28 siblings, 3 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

When using HugeTLB high-granularity mapping, we need to go through the
supported hugepage sizes in decreasing order so that we pick the largest
size that works. Consider the case where we're faulting in a 1G hugepage
for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
a PUD. By going through the sizes in decreasing order, we will find that
PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a57e1be41401..5df838d86f32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -33,6 +33,7 @@
 #include <linux/migrate.h>
 #include <linux/nospec.h>
 #include <linux/delayacct.h>
+#include <linux/sort.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -48,6 +49,10 @@
 
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
+/*
+ * After hugetlb_init_hstates is called, hstates will be sorted from largest
+ * to smallest.
+ */
 struct hstate hstates[HUGE_MAX_HSTATE];
 
 #ifdef CONFIG_CMA
@@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	kfree(node_alloc_noretry);
 }
 
+static int compare_hstates_decreasing(const void *a, const void *b)
+{
+	const int shift_a = huge_page_shift((const struct hstate *)a);
+	const int shift_b = huge_page_shift((const struct hstate *)b);
+
+	if (shift_a < shift_b)
+		return 1;
+	if (shift_a > shift_b)
+		return -1;
+	return 0;
+}
+
+static void sort_hstates(void)
+{
+	unsigned long default_hstate_sz = huge_page_size(&default_hstate);
+
+	/* Sort from largest to smallest. */
+	sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
+	     compare_hstates_decreasing, NULL);
+
+	/*
+	 * We may have changed the location of the default hstate, so we need to
+	 * update it.
+	 */
+	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
+}
+
 static void __init hugetlb_init_hstates(void)
 {
 	struct hstate *h, *h2;
 
-	for_each_hstate(h) {
-		if (minimum_order > huge_page_order(h))
-			minimum_order = huge_page_order(h);
+	sort_hstates();
 
+	/* The last hstate is now the smallest. */
+	minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
+
+	for_each_hstate(h) {
 		/* oversize hugepages were init'ed in early boot */
 		if (!hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
  2022-06-24 17:36 ` [RFC PATCH 01/26] hugetlb: make hstate accessor functions const James Houghton
  2022-06-24 17:36 ` [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 19:01   ` Mina Almasry
  2022-06-27 12:13   ` manish.mishra
  2022-06-24 17:36 ` [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
                   ` (25 subsequent siblings)
  28 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This allows us to make huge PTEs at shifts other than the hstate shift,
which will be necessary for high-granularity mappings.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5df838d86f32..0eec34edf3b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4686,23 +4686,30 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.pagesize = hugetlb_vm_op_pagesize,
 };
 
+static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
+				      struct page *page, int writable,
+				      int shift)
+{
+	bool huge = shift > PAGE_SHIFT;
+	pte_t entry = huge ? mk_huge_pte(page, vma->vm_page_prot)
+			   : mk_pte(page, vma->vm_page_prot);
+
+	if (writable)
+		entry = huge ? huge_pte_mkwrite(entry) : pte_mkwrite(entry);
+	else
+		entry = huge ? huge_pte_wrprotect(entry) : pte_wrprotect(entry);
+	pte_mkyoung(entry);
+	if (huge)
+		entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
+	return entry;
+}
+
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
-				int writable)
+			   int writable)
 {
-	pte_t entry;
 	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
-	if (writable) {
-		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
-					 vma->vm_page_prot)));
-	} else {
-		entry = huge_pte_wrprotect(mk_huge_pte(page,
-					   vma->vm_page_prot));
-	}
-	entry = pte_mkyoung(entry);
-	entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
-
-	return entry;
+	return make_huge_pte_with_shift(vma, page, writable, shift);
 }
 
 static void set_huge_ptep_writable(struct vm_area_struct *vma,
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (2 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-27 12:26   ` manish.mishra
  2022-06-27 20:51   ` Mike Kravetz
  2022-06-24 17:36 ` [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
                   ` (24 subsequent siblings)
  28 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This is needed to handle PTL locking with high-granularity mapping. We
won't always be using the PMD-level PTL even if we're using the 2M
hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
case, we need to lock the PTL for the 4K PTE.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/powerpc/mm/pgtable.c |  3 ++-
 include/linux/hugetlb.h   | 19 ++++++++++++++-----
 mm/hugetlb.c              |  9 +++++----
 mm/migrate.c              |  3 ++-
 mm/page_vma_mapped.c      |  3 ++-
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e6166b71d36d..663d591a8f08 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -261,7 +261,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 
 		psize = hstate_get_psize(h);
 #ifdef CONFIG_DEBUG_VM
-		assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep));
+		assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),
+						    vma->vm_mm, ptep));
 #endif
 
 #else
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 498a4ae3d462..5fe1db46d8c9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -868,12 +868,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return modified_mask;
 }
 
-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
 					   struct mm_struct *mm, pte_t *pte)
 {
-	if (huge_page_size(h) == PMD_SIZE)
+	if (shift == PMD_SHIFT)
 		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
 	return &mm->page_table_lock;
 }
 
@@ -1076,7 +1075,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
 					   struct mm_struct *mm, pte_t *pte)
 {
 	return &mm->page_table_lock;
@@ -1116,7 +1115,17 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 {
 	spinlock_t *ptl;
 
-	ptl = huge_pte_lockptr(h, mm, pte);
+	ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
+	spin_lock(ptl);
+	return ptl;
+}
+
+static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
+					      struct mm_struct *mm, pte_t *pte)
+{
+	spinlock_t *ptl;
+
+	ptl = huge_pte_lockptr(shift, mm, pte);
 	spin_lock(ptl);
 	return ptl;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0eec34edf3b2..d6d0d4c03def 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4817,7 +4817,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			continue;
 
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
-		src_ptl = huge_pte_lockptr(h, src, src_pte);
+		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
@@ -4894,7 +4894,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 				/* Install the new huge page if src pte stable */
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
-				src_ptl = huge_pte_lockptr(h, src, src_pte);
+				src_ptl = huge_pte_lockptr(huge_page_shift(h),
+							   src, src_pte);
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
@@ -4948,7 +4949,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 	pte_t pte;
 
 	dst_ptl = huge_pte_lock(h, mm, dst_pte);
-	src_ptl = huge_pte_lockptr(h, mm, src_pte);
+	src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst ptlocks
@@ -6024,7 +6025,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		page_in_pagecache = true;
 	}
 
-	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
+	ptl = huge_pte_lockptr(huge_page_shift(h), dst_mm, dst_pte);
 	spin_lock(ptl);
 
 	/*
diff --git a/mm/migrate.c b/mm/migrate.c
index e51588e95f57..a8a960992373 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -318,7 +318,8 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 void migration_entry_wait_huge(struct vm_area_struct *vma,
 		struct mm_struct *mm, pte_t *pte)
 {
-	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, pte);
+	spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
+					   mm, pte);
 	__migration_entry_wait(mm, pte, ptl);
 }
 
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index c10f839fc410..8921dd4e41b1 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -174,7 +174,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (!pvmw->pte)
 			return false;
 
-		pvmw->ptl = huge_pte_lockptr(hstate, mm, pvmw->pte);
+		pvmw->ptl = huge_pte_lockptr(huge_page_shift(hstate),
+					     mm, pvmw->pte);
 		spin_lock(pvmw->ptl);
 		if (!check_pte(pvmw))
 			return not_found(pvmw);
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (3 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-27 12:28   ` manish.mishra
  2022-06-24 17:36 ` [RFC PATCH 06/26] mm: make free_p?d_range functions public James Houghton
                   ` (23 subsequent siblings)
  28 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This adds the Kconfig to enable or disable high-granularity mapping. It
is enabled by default for architectures that use
ARCH_WANT_GENERAL_HUGETLB.

There is also an arch-specific config ARCH_HAS_SPECIAL_HUGETLB_HGM which
controls whether or not the architecture has been updated to support
HGM if it doesn't use general HugeTLB.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/Kconfig | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 5976eb33535f..d76c7d812656 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -268,6 +268,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
 	  to enable optimizing vmemmap pages of HugeTLB by default. It can then
 	  be disabled on the command line via hugetlb_free_vmemmap=off.
 
+config ARCH_HAS_SPECIAL_HUGETLB_HGM
+	bool
+
+config HUGETLB_HIGH_GRANULARITY_MAPPING
+	def_bool ARCH_WANT_GENERAL_HUGETLB || ARCH_HAS_SPECIAL_HUGETLB_HGM
+	depends on HUGETLB_PAGE
+
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
 
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 06/26] mm: make free_p?d_range functions public
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (4 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-27 12:31   ` manish.mishra
  2022-06-28 20:35   ` Mike Kravetz
  2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
                   ` (22 subsequent siblings)
  28 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This makes them usable for HugeTLB page table freeing operations.
After HugeTLB high-granularity mapping, the page table for a HugeTLB VMA
can get more complex, and these functions handle freeing page tables
generally.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/mm.h | 7 +++++++
 mm/memory.c        | 8 ++++----
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..07f5da512147 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1847,6 +1847,13 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 
 struct mmu_notifier_range;
 
+void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr);
+void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, unsigned long addr,
+		unsigned long end, unsigned long floor, unsigned long ceiling);
+void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, unsigned long addr,
+		unsigned long end, unsigned long floor, unsigned long ceiling);
+void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long addr,
+		unsigned long end, unsigned long floor, unsigned long ceiling);
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int
diff --git a/mm/memory.c b/mm/memory.c
index 7a089145cad4..bb3b9b5b94fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -227,7 +227,7 @@ static void check_sync_rss_stat(struct task_struct *task)
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
 			   unsigned long addr)
 {
 	pgtable_t token = pmd_pgtable(*pmd);
@@ -236,7 +236,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
 	mm_dec_nr_ptes(tlb->mm);
 }
 
-static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				unsigned long floor, unsigned long ceiling)
 {
@@ -270,7 +270,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	mm_dec_nr_pmds(tlb->mm);
 }
 
-static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
+inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 				unsigned long addr, unsigned long end,
 				unsigned long floor, unsigned long ceiling)
 {
@@ -304,7 +304,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 	mm_dec_nr_puds(tlb->mm);
 }
 
-static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
+inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
 				unsigned long floor, unsigned long ceiling)
 {
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (5 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 06/26] mm: make free_p?d_range functions public James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-27 12:47   ` manish.mishra
                     ` (4 more replies)
  2022-06-24 17:36 ` [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures James Houghton
                   ` (21 subsequent siblings)
  28 siblings, 5 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

After high-granularity mapping, page table entries for HugeTLB pages can
be of any size/type. (For example, we can have a 1G page mapped with a
mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
PTE after we have done a page table walk.

Without this, we'd have to pass around the "size" of the PTE everywhere.
We effectively did this before; it could be fetched from the hstate,
which we pass around pretty much everywhere.

This commit includes definitions for some basic helper functions that
are used later. These helper functions wrap existing PTE
inspection/modification functions, where the correct version is picked
depending on if the HugeTLB PTE is actually "huge" or not. (Previously,
all HugeTLB PTEs were "huge").

For example, hugetlb_ptep_get wraps huge_ptep_get and ptep_get, where
ptep_get is used when the HugeTLB PTE is PAGE_SIZE, and huge_ptep_get is
used in all other cases.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 84 +++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb.c            | 57 ++++++++++++++++++++++++++++
 2 files changed, 141 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5fe1db46d8c9..1d4ec9dfdebf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,68 @@ enum {
 	__NR_USED_SUBPAGE,
 };
 
+struct hugetlb_pte {
+	pte_t *ptep;
+	unsigned int shift;
+};
+
+static inline
+void hugetlb_pte_init(struct hugetlb_pte *hpte)
+{
+	hpte->ptep = NULL;
+}
+
+static inline
+void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
+			  unsigned int shift)
+{
+	BUG_ON(!ptep);
+	hpte->ptep = ptep;
+	hpte->shift = shift;
+}
+
+static inline
+unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
+{
+	BUG_ON(!hpte->ptep);
+	return 1UL << hpte->shift;
+}
+
+static inline
+unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
+{
+	BUG_ON(!hpte->ptep);
+	return ~(hugetlb_pte_size(hpte) - 1);
+}
+
+static inline
+unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
+{
+	BUG_ON(!hpte->ptep);
+	return hpte->shift;
+}
+
+static inline
+bool hugetlb_pte_huge(const struct hugetlb_pte *hpte)
+{
+	return !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) ||
+		hugetlb_pte_shift(hpte) > PAGE_SHIFT;
+}
+
+static inline
+void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
+{
+	dest->ptep = src->ptep;
+	dest->shift = src->shift;
+}
+
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte);
+bool hugetlb_pte_none(const struct hugetlb_pte *hpte);
+bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
+pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
+void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
+		       unsigned long address);
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
@@ -1130,6 +1192,28 @@ static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
 	return ptl;
 }
 
+static inline
+spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+
+	BUG_ON(!hpte->ptep);
+	// Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
+	// the regular page table lock.
+	if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
+		return huge_pte_lockptr(hugetlb_pte_shift(hpte),
+				mm, hpte->ptep);
+	return &mm->page_table_lock;
+}
+
+static inline
+spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
 extern void __init hugetlb_cma_reserve(int order);
 extern void __init hugetlb_cma_check(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d6d0d4c03def..1a1434e29740 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1120,6 +1120,63 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
+{
+	pgd_t pgd;
+	p4d_t p4d;
+	pud_t pud;
+	pmd_t pmd;
+
+	BUG_ON(!hpte->ptep);
+	if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
+		pgd = *(pgd_t *)hpte->ptep;
+		return pgd_present(pgd) && pgd_leaf(pgd);
+	} else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
+		p4d = *(p4d_t *)hpte->ptep;
+		return p4d_present(p4d) && p4d_leaf(p4d);
+	} else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
+		pud = *(pud_t *)hpte->ptep;
+		return pud_present(pud) && pud_leaf(pud);
+	} else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
+		pmd = *(pmd_t *)hpte->ptep;
+		return pmd_present(pmd) && pmd_leaf(pmd);
+	} else if (hugetlb_pte_size(hpte) >= PAGE_SIZE)
+		return pte_present(*hpte->ptep);
+	BUG();
+}
+
+bool hugetlb_pte_none(const struct hugetlb_pte *hpte)
+{
+	if (hugetlb_pte_huge(hpte))
+		return huge_pte_none(huge_ptep_get(hpte->ptep));
+	return pte_none(ptep_get(hpte->ptep));
+}
+
+bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte)
+{
+	if (hugetlb_pte_huge(hpte))
+		return huge_pte_none_mostly(huge_ptep_get(hpte->ptep));
+	return pte_none_mostly(ptep_get(hpte->ptep));
+}
+
+pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte)
+{
+	if (hugetlb_pte_huge(hpte))
+		return huge_ptep_get(hpte->ptep);
+	return ptep_get(hpte->ptep);
+}
+
+void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
+		       unsigned long address)
+{
+	BUG_ON(!hpte->ptep);
+	unsigned long sz = hugetlb_pte_size(hpte);
+
+	if (sz > PAGE_SIZE)
+		return huge_pte_clear(mm, address, hpte->ptep, sz);
+	return pte_clear(mm, address, hpte->ptep);
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (6 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-27 12:52   ` manish.mishra
  2022-06-28 20:27   ` Mina Almasry
  2022-06-24 17:36 ` [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled James Houghton
                   ` (20 subsequent siblings)
  28 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This is a helper function for freeing the bits of the page table that
map a particular HugeTLB PTE.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1d4ec9dfdebf..33ba48fac551 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -107,6 +107,8 @@ bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
 pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
 void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
 		       unsigned long address);
+void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
+			unsigned long start, unsigned long end);
 
 struct hugepage_subpool {
 	spinlock_t lock;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1a1434e29740..a2d2ffa76173 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1120,6 +1120,23 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
+			unsigned long start, unsigned long end)
+{
+	unsigned long floor = start & hugetlb_pte_mask(hpte);
+	unsigned long ceiling = floor + hugetlb_pte_size(hpte);
+
+	if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
+		free_p4d_range(tlb, (pgd_t *)hpte->ptep, start, end, floor, ceiling);
+	} else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
+		free_pud_range(tlb, (p4d_t *)hpte->ptep, start, end, floor, ceiling);
+	} else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
+		free_pmd_range(tlb, (pud_t *)hpte->ptep, start, end, floor, ceiling);
+	} else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
+		free_pte_range(tlb, (pmd_t *)hpte->ptep, start);
+	}
+}
+
 bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
 {
 	pgd_t pgd;
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (7 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-28 20:33   ` Mina Almasry
  2022-09-08 18:07   ` Peter Xu
  2022-06-24 17:36 ` [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift James Houghton
                   ` (19 subsequent siblings)
  28 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

Currently, this is always true if the VMA is shared. In the future, it's
possible that private mappings will get some or all HGM functionality.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 10 ++++++++++
 mm/hugetlb.c            |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 33ba48fac551..e7a6b944d0cc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1174,6 +1174,16 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 }
 #endif	/* CONFIG_HUGETLB_PAGE */
 
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+/* If HugeTLB high-granularity mappings are enabled for this VMA. */
+bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
+#else
+static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
+{
+	return false;
+}
+#endif
+
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
 					struct mm_struct *mm, pte_t *pte)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a2d2ffa76173..8b10b941458d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6983,6 +6983,14 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
 
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
+{
+	/* All shared VMAs have HGM enabled. */
+	return vma->vm_flags & VM_SHARED;
+}
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
 /*
  * These functions are overwritable if your architecture needs its own
  * behavior.
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (8 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-27 13:01   ` manish.mishra
  2022-06-28 21:58   ` Mina Almasry
  2022-06-24 17:36 ` [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks James Houghton
                   ` (18 subsequent siblings)
  28 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This is a helper macro to loop through all the usable page sizes for a
high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
loop, in descending order, through the page sizes that HugeTLB supports
for this architecture; it always includes PAGE_SIZE.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8b10b941458d..557b0afdb503 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 	/* All shared VMAs have HGM enabled. */
 	return vma->vm_flags & VM_SHARED;
 }
+static unsigned int __shift_for_hstate(struct hstate *h)
+{
+	if (h >= &hstates[hugetlb_max_hstate])
+		return PAGE_SHIFT;
+	return huge_page_shift(h);
+}
+#define for_each_hgm_shift(hstate, tmp_h, shift) \
+	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
+			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
+			       (tmp_h)++)
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (9 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-27 13:07   ` manish.mishra
  2022-09-08 18:20   ` Peter Xu
  2022-06-24 17:36 ` [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality James Houghton
                   ` (17 subsequent siblings)
  28 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This adds it for architectures that use GENERAL_HUGETLB, including x86.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 45 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e7a6b944d0cc..605aa19d8572 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -258,6 +258,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
+int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		    unsigned long addr, unsigned long sz, bool stop_at_none);
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long *addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 557b0afdb503..3ec2a921ee6f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6981,6 +6981,51 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	return (pte_t *)pmd;
 }
 
+int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		    unsigned long addr, unsigned long sz, bool stop_at_none)
+{
+	pte_t *ptep;
+
+	if (!hpte->ptep) {
+		pgd_t *pgd = pgd_offset(mm, addr);
+
+		if (!pgd)
+			return -ENOMEM;
+		ptep = (pte_t *)p4d_alloc(mm, pgd, addr);
+		if (!ptep)
+			return -ENOMEM;
+		hugetlb_pte_populate(hpte, ptep, P4D_SHIFT);
+	}
+
+	while (hugetlb_pte_size(hpte) > sz &&
+			!hugetlb_pte_present_leaf(hpte) &&
+			!(stop_at_none && hugetlb_pte_none(hpte))) {
+		if (hpte->shift == PMD_SHIFT) {
+			ptep = pte_alloc_map(mm, (pmd_t *)hpte->ptep, addr);
+			if (!ptep)
+				return -ENOMEM;
+			hpte->shift = PAGE_SHIFT;
+			hpte->ptep = ptep;
+		} else if (hpte->shift == PUD_SHIFT) {
+			ptep = (pte_t *)pmd_alloc(mm, (pud_t *)hpte->ptep,
+						  addr);
+			if (!ptep)
+				return -ENOMEM;
+			hpte->shift = PMD_SHIFT;
+			hpte->ptep = ptep;
+		} else if (hpte->shift == P4D_SHIFT) {
+			ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep,
+						  addr);
+			if (!ptep)
+				return -ENOMEM;
+			hpte->shift = PUD_SHIFT;
+			hpte->ptep = ptep;
+		} else
+			BUG();
+	}
+	return 0;
+}
+
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
 
 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (10 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-27 13:50   ` manish.mishra
  2022-06-29 14:33   ` manish.mishra
  2022-06-24 17:36 ` [RFC PATCH 13/26] hugetlb: add huge_pte_alloc_high_granularity James Houghton
                   ` (16 subsequent siblings)
  28 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

The new function, hugetlb_split_to_shift, will optimally split the page
table to map a particular address at a particular granularity.

This is useful for punching a hole in the mapping and for mapping small
sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 122 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3ec2a921ee6f..eaffe7b4f67c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
+/*
+ * Find the subpage that corresponds to `addr` in `hpage`.
+ */
+static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
+				 unsigned long addr)
+{
+	size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
+
+	BUG_ON(idx >= pages_per_huge_page(h));
+	return &hpage[idx];
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
 	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
 			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
 			       (tmp_h)++)
+
+/*
+ * Given a particular address, split the HugeTLB PTE that currently maps it
+ * so that, for the given address, the PTE that maps it is `desired_shift`.
+ * This function will always split the HugeTLB PTE optimally.
+ *
+ * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
+ * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
+ * these changes to the page table:
+ * 1. The PUD will be split into 2M PMDs.
+ * 2. The first PMD will be split again into 4K PTEs.
+ */
+static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+			   const struct hugetlb_pte *hpte,
+			   unsigned long addr, unsigned long desired_shift)
+{
+	unsigned long start, end, curr;
+	unsigned long desired_sz = 1UL << desired_shift;
+	struct hstate *h = hstate_vma(vma);
+	int ret;
+	struct hugetlb_pte new_hpte;
+	struct mmu_notifier_range range;
+	struct page *hpage = NULL;
+	struct page *subpage;
+	pte_t old_entry;
+	struct mmu_gather tlb;
+
+	BUG_ON(!hpte->ptep);
+	BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
+
+	start = addr & hugetlb_pte_mask(hpte);
+	end = start + hugetlb_pte_size(hpte);
+
+	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
+	BUG_ON(!hpte->ptep);
+	/* This function only works if we are looking at a leaf-level PTE. */
+	BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
+
+	/*
+	 * Clear the PTE so that we will allocate the PT structures when
+	 * walking the page table.
+	 */
+	old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
+
+	if (!huge_pte_none(old_entry))
+		hpage = pte_page(old_entry);
+
+	BUG_ON(!IS_ALIGNED(start, desired_sz));
+	BUG_ON(!IS_ALIGNED(end, desired_sz));
+
+	for (curr = start; curr < end;) {
+		struct hstate *tmp_h;
+		unsigned int shift;
+
+		for_each_hgm_shift(h, tmp_h, shift) {
+			unsigned long sz = 1UL << shift;
+
+			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+				continue;
+			/*
+			 * If we are including `addr`, we need to make sure
+			 * splitting down to the correct size. Go to a smaller
+			 * size if we are not.
+			 */
+			if (curr <= addr && curr + sz > addr &&
+					shift > desired_shift)
+				continue;
+
+			/*
+			 * Continue the page table walk to the level we want,
+			 * allocate PT structures as we go.
+			 */
+			hugetlb_pte_copy(&new_hpte, hpte);
+			ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
+					      /*stop_at_none=*/false);
+			if (ret)
+				goto err;
+			BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
+			if (hpage) {
+				pte_t new_entry;
+
+				subpage = hugetlb_find_subpage(h, hpage, curr);
+				new_entry = make_huge_pte_with_shift(vma, subpage,
+								     huge_pte_write(old_entry),
+								     shift);
+				set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
+			}
+			curr += sz;
+			goto next;
+		}
+		/* We couldn't find a size that worked. */
+		BUG();
+next:
+		continue;
+	}
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
+	mmu_notifier_invalidate_range_start(&range);
+	return 0;
+err:
+	tlb_gather_mmu(&tlb, mm);
+	/* Free any newly allocated page table entries. */
+	hugetlb_free_range(&tlb, hpte, start, curr);
+	/* Restore the old entry. */
+	set_huge_pte_at(mm, start, hpte->ptep, old_entry);
+	tlb_finish_mmu(&tlb);
+	return ret;
+}
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 13/26] hugetlb: add huge_pte_alloc_high_granularity
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (11 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-29 14:11   ` manish.mishra
  2022-06-24 17:36 ` [RFC PATCH 14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
                   ` (15 subsequent siblings)
  28 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This function is to be used to do a HugeTLB page table walk where we may
need to split a leaf-level huge PTE into a new page table level.

Consider the case where we want to install 4K inside an empty 1G page:
1. We walk to the PUD and notice that it is pte_none.
2. We split the PUD by calling `hugetlb_split_to_shift`, creating a
   standard PUD that points to PMDs that are all pte_none.
3. We continue the PT walk to find the PMD. We split it just like we
   split the PUD.
4. We find the PTE and give it back to the caller.

To avoid concurrent splitting operations on the same page table entry,
we require that the mapping rwsem is held for writing while collapsing
and for reading when doing a high-granularity PT walk.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 23 ++++++++++++++
 mm/hugetlb.c            | 67 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 605aa19d8572..321f5745d87f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1176,14 +1176,37 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 }
 #endif	/* CONFIG_HUGETLB_PAGE */
 
+enum split_mode {
+	HUGETLB_SPLIT_NEVER   = 0,
+	HUGETLB_SPLIT_NONE    = 1 << 0,
+	HUGETLB_SPLIT_PRESENT = 1 << 1,
+	HUGETLB_SPLIT_ALWAYS  = HUGETLB_SPLIT_NONE | HUGETLB_SPLIT_PRESENT,
+};
 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
 /* If HugeTLB high-granularity mappings are enabled for this VMA. */
 bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
+int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
+				    struct mm_struct *mm,
+				    struct vm_area_struct *vma,
+				    unsigned long addr,
+				    unsigned int desired_sz,
+				    enum split_mode mode,
+				    bool write_locked);
 #else
 static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
 	return false;
 }
+static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
+					   struct mm_struct *mm,
+					   struct vm_area_struct *vma,
+					   unsigned long addr,
+					   unsigned int desired_sz,
+					   enum split_mode mode,
+					   bool write_locked)
+{
+	return -EINVAL;
+}
 #endif
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eaffe7b4f67c..6e0c5fbfe32c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7166,6 +7166,73 @@ static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *v
 	tlb_finish_mmu(&tlb);
 	return ret;
 }
+
+/*
+ * Similar to huge_pte_alloc except that this can be used to create or walk
+ * high-granularity mappings. It will automatically split existing HugeTLB PTEs
+ * if required by @mode. The resulting HugeTLB PTE will be returned in @hpte.
+ *
+ * There are three options for @mode:
+ *  - HUGETLB_SPLIT_NEVER   - Never split.
+ *  - HUGETLB_SPLIT_NONE    - Split empty PTEs.
+ *  - HUGETLB_SPLIT_PRESENT - Split present PTEs.
+ *  - HUGETLB_SPLIT_ALWAYS  - Split both empty and present PTEs.
+ */
+int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
+				    struct mm_struct *mm,
+				    struct vm_area_struct *vma,
+				    unsigned long addr,
+				    unsigned int desired_shift,
+				    enum split_mode mode,
+				    bool write_locked)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	bool has_write_lock = write_locked;
+	unsigned long desired_sz = 1UL << desired_shift;
+	int ret;
+
+	BUG_ON(!hpte);
+
+	if (has_write_lock)
+		i_mmap_assert_write_locked(mapping);
+	else
+		i_mmap_assert_locked(mapping);
+
+retry:
+	ret = 0;
+	hugetlb_pte_init(hpte);
+
+	ret = hugetlb_walk_to(mm, hpte, addr, desired_sz,
+			      !(mode & HUGETLB_SPLIT_NONE));
+	if (ret || hugetlb_pte_size(hpte) == desired_sz)
+		goto out;
+
+	if (
+		((mode & HUGETLB_SPLIT_NONE) && hugetlb_pte_none(hpte)) ||
+		((mode & HUGETLB_SPLIT_PRESENT) &&
+		  hugetlb_pte_present_leaf(hpte))
+	   ) {
+		if (!has_write_lock) {
+			i_mmap_unlock_read(mapping);
+			i_mmap_lock_write(mapping);
+			has_write_lock = true;
+			goto retry;
+		}
+		ret = hugetlb_split_to_shift(mm, vma, hpte, addr,
+					     desired_shift);
+	}
+
+out:
+	if (has_write_lock && !write_locked) {
+		/* Drop the write lock. */
+		i_mmap_unlock_write(mapping);
+		i_mmap_lock_read(mapping);
+		has_write_lock = false;
+		goto retry;
+	}
+
+	return ret;
+}
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (12 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 13/26] hugetlb: add huge_pte_alloc_high_granularity James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-29 14:40   ` manish.mishra
  2022-06-24 17:36 ` [RFC PATCH 15/26] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
                   ` (14 subsequent siblings)
  28 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This CL is the first main functional HugeTLB change. Together, these
changes allow the HugeTLB fault path to handle faults on HGM-enabled
VMAs. The two main behaviors that can be done now:
  1. Faults can be passed to handle_userfault. (Userspace will want to
     use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
     region they should be call UFFDIO_CONTINUE on later.)
  2. Faults on pages that have been partially mapped (and userfaultfd is
     not being used) will get mapped at the largest possible size.
     For example, if a 1G page has been partially mapped at 2M, and we
     fault on an unmapped 2M section, hugetlb_no_page will create a 2M
     PMD to map the faulting address.

This commit does not handle hugetlb_wp right now, and it doesn't handle
HugeTLB page migration and swap entries.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  12 ++++
 mm/hugetlb.c            | 121 +++++++++++++++++++++++++++++++---------
 2 files changed, 106 insertions(+), 27 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 321f5745d87f..ac4ac8fbd901 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1185,6 +1185,9 @@ enum split_mode {
 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
 /* If HugeTLB high-granularity mappings are enabled for this VMA. */
 bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end);
 int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
 				    struct mm_struct *mm,
 				    struct vm_area_struct *vma,
@@ -1197,6 +1200,15 @@ static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
 	return false;
 }
+
+static inline
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end)
+{
+		BUG();
+}
+
 static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
 					   struct mm_struct *mm,
 					   struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6e0c5fbfe32c..da30621656b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5605,18 +5605,24 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
-			unsigned long address, pte_t *ptep,
+			unsigned long address, struct hugetlb_pte *hpte,
 			pte_t old_pte, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 	int anon_rmap = 0;
 	unsigned long size;
-	struct page *page;
+	struct page *page, *subpage;
 	pte_t new_pte;
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
+	unsigned long haddr_hgm = address & hugetlb_pte_mask(hpte);
 	bool new_page, new_pagecache_page = false;
+	/*
+	 * This page is getting mapped for the first time, in which case we
+	 * want to increment its mapcount.
+	 */
+	bool new_mapping = hpte->shift == huge_page_shift(h);
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5665,9 +5671,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * here.  Before returning error, get ptl and make
 			 * sure there really is no pte entry.
 			 */
-			ptl = huge_pte_lock(h, mm, ptep);
+			ptl = hugetlb_pte_lock(mm, hpte);
 			ret = 0;
-			if (huge_pte_none(huge_ptep_get(ptep)))
+			if (hugetlb_pte_none(hpte))
 				ret = vmf_error(PTR_ERR(page));
 			spin_unlock(ptl);
 			goto out;
@@ -5731,18 +5737,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		vma_end_reservation(h, vma, haddr);
 	}
 
-	ptl = huge_pte_lock(h, mm, ptep);
+	ptl = hugetlb_pte_lock(mm, hpte);
 	ret = 0;
 	/* If pte changed from under us, retry */
-	if (!pte_same(huge_ptep_get(ptep), old_pte))
+	if (!pte_same(hugetlb_ptep_get(hpte), old_pte))
 		goto backout;
 
-	if (anon_rmap) {
-		ClearHPageRestoreReserve(page);
-		hugepage_add_new_anon_rmap(page, vma, haddr);
-	} else
-		page_dup_file_rmap(page, true);
-	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
+	if (new_mapping) {
+		/* Only increment this page's mapcount if we are mapping it
+		 * for the first time.
+		 */
+		if (anon_rmap) {
+			ClearHPageRestoreReserve(page);
+			hugepage_add_new_anon_rmap(page, vma, haddr);
+		} else
+			page_dup_file_rmap(page, true);
+	}
+
+	subpage = hugetlb_find_subpage(h, page, haddr_hgm);
+	new_pte = make_huge_pte(vma, subpage, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	/*
 	 * If this pte was previously wr-protected, keep it wr-protected even
@@ -5750,12 +5763,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	 */
 	if (unlikely(pte_marker_uffd_wp(old_pte)))
 		new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
-	set_huge_pte_at(mm, haddr, ptep, new_pte);
+	set_huge_pte_at(mm, haddr_hgm, hpte->ptep, new_pte);
 
-	hugetlb_count_add(pages_per_huge_page(h), mm);
+	hugetlb_count_add(hugetlb_pte_size(hpte) / PAGE_SIZE, mm);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+		BUG_ON(hugetlb_pte_size(hpte) != huge_page_size(h));
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(mm, vma, address, ptep, flags, page, ptl);
+		ret = hugetlb_wp(mm, vma, address, hpte->ptep, flags, page, ptl);
 	}
 
 	spin_unlock(ptl);
@@ -5816,11 +5830,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	u32 hash;
 	pgoff_t idx;
 	struct page *page = NULL;
+	struct page *subpage = NULL;
 	struct page *pagecache_page = NULL;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
+	unsigned long haddr_hgm;
+	bool hgm_enabled = hugetlb_hgm_enabled(vma);
+	struct hugetlb_pte hpte;
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
@@ -5866,11 +5884,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
-	entry = huge_ptep_get(ptep);
+	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
+
+	if (hgm_enabled) {
+		ret = hugetlb_walk_to(mm, &hpte, address,
+				      PAGE_SIZE, /*stop_at_none=*/true);
+		if (ret) {
+			ret = vmf_error(ret);
+			goto out_mutex;
+		}
+	}
+
+	entry = hugetlb_ptep_get(&hpte);
 	/* PTE markers should be handled the same way as none pte */
-	if (huge_pte_none_mostly(entry)) {
-		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
-				      entry, flags);
+	if (hugetlb_pte_none_mostly(&hpte)) {
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, &hpte,
+				entry, flags);
 		goto out_mutex;
 	}
 
@@ -5908,14 +5937,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 								vma, haddr);
 	}
 
-	ptl = huge_pte_lock(h, mm, ptep);
+	ptl = hugetlb_pte_lock(mm, &hpte);
 
 	/* Check for a racing update before calling hugetlb_wp() */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+	if (unlikely(!pte_same(entry, hugetlb_ptep_get(&hpte))))
 		goto out_ptl;
 
+	/* haddr_hgm is the base address of the region that hpte maps. */
+	haddr_hgm = address & hugetlb_pte_mask(&hpte);
+
 	/* Handle userfault-wp first, before trying to lock more pages */
-	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(hugetlb_ptep_get(&hpte)) &&
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
 		struct vm_fault vmf = {
 			.vma = vma,
@@ -5939,7 +5971,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * pagecache_page, so here we need take the former one
 	 * when page != pagecache_page or !pagecache_page.
 	 */
-	page = pte_page(entry);
+	subpage = pte_page(entry);
+	page = compound_head(subpage);
 	if (page != pagecache_page)
 		if (!trylock_page(page)) {
 			need_wait_lock = 1;
@@ -5950,7 +5983,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(entry)) {
-			ret = hugetlb_wp(mm, vma, address, ptep, flags,
+			BUG_ON(hugetlb_pte_size(&hpte) != huge_page_size(h));
+			ret = hugetlb_wp(mm, vma, address, hpte.ptep, flags,
 					 pagecache_page, ptl);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
@@ -5958,9 +5992,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	}
 	entry = pte_mkyoung(entry);
-	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
+	if (huge_ptep_set_access_flags(vma, haddr_hgm, hpte.ptep, entry,
 						flags & FAULT_FLAG_WRITE))
-		update_mmu_cache(vma, haddr, ptep);
+		update_mmu_cache(vma, haddr_hgm, hpte.ptep);
 out_put_page:
 	if (page != pagecache_page)
 		unlock_page(page);
@@ -6951,7 +6985,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 				pte = (pte_t *)pmd_alloc(mm, pud, addr);
 		}
 	}
-	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
+	if (!hugetlb_hgm_enabled(vma))
+		BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
 
 	return pte;
 }
@@ -7057,6 +7092,38 @@ static unsigned int __shift_for_hstate(struct hstate *h)
 			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
 			       (tmp_h)++)
 
+/*
+ * Allocate a HugeTLB PTE that maps as much of [start, end) as possible with a
+ * single page table entry. The allocated HugeTLB PTE is returned in hpte.
+ */
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma), *tmp_h;
+	unsigned int shift;
+	int ret;
+
+	for_each_hgm_shift(h, tmp_h, shift) {
+		unsigned long sz = 1UL << shift;
+
+		if (!IS_ALIGNED(start, sz) || start + sz > end)
+			continue;
+		ret = huge_pte_alloc_high_granularity(hpte, mm, vma, start,
+						      shift, HUGETLB_SPLIT_NONE,
+						      /*write_locked=*/false);
+		if (ret)
+			return ret;
+
+		if (hpte->shift > shift)
+			return -EEXIST;
+
+		BUG_ON(hpte->shift != shift);
+		return 0;
+	}
+	return -EINVAL;
+}
+
 /*
  * Given a particular address, split the HugeTLB PTE that currently maps it
  * so that, for the given address, the PTE that maps it is `desired_shift`.
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 15/26] hugetlb: make unmapping compatible with high-granularity mappings
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (13 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-07-19 10:19   ` manish.mishra
  2022-06-24 17:36 ` [RFC PATCH 16/26] hugetlb: make hugetlb_change_protection compatible with HGM James Houghton
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This enlightens __unmap_hugepage_range to deal with high-granularity
mappings. This doesn't change its API; it still must be called with
hugepage alignment, but it will correctly unmap hugepages that have been
mapped at high granularity.

Analogous to the mapcount rules introduced by hugetlb_no_page, we only
drop mapcount in this case if we are unmapping an entire hugepage in one
operation. This is the case when a VMA is destroyed.

Eventually, functionality here can be expanded to allow users to call
MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is
not done here.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/asm-generic/tlb.h |  6 +--
 mm/hugetlb.c              | 85 ++++++++++++++++++++++++++-------------
 2 files changed, 59 insertions(+), 32 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index ff3e82553a76..8daa3ae460d9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -562,9 +562,9 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
+#define tlb_remove_huge_tlb_entry(tlb, hpte, address)	\
 	do {							\
-		unsigned long _sz = huge_page_size(h);		\
+		unsigned long _sz = hugetlb_pte_size(&hpte);	\
 		if (_sz >= P4D_SIZE)				\
 			tlb_flush_p4d_range(tlb, address, _sz);	\
 		else if (_sz >= PUD_SIZE)			\
@@ -573,7 +573,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 			tlb_flush_pmd_range(tlb, address, _sz);	\
 		else						\
 			tlb_flush_pte_range(tlb, address, _sz);	\
-		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+		__tlb_remove_tlb_entry(tlb, hpte.ptep, address);\
 	} while (0)
 
 /**
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index da30621656b8..51fc1d3f122f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5120,24 +5120,20 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *ptep;
+	struct hugetlb_pte hpte;
 	pte_t pte;
 	spinlock_t *ptl;
-	struct page *page;
+	struct page *hpage, *subpage;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	struct mmu_notifier_range range;
 	bool force_flush = false;
+	bool hgm_enabled = hugetlb_hgm_enabled(vma);
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
-	/*
-	 * This is a hugetlb vma, all the pte entries should point
-	 * to huge page.
-	 */
-	tlb_change_page_size(tlb, sz);
 	tlb_start_vma(tlb, vma);
 
 	/*
@@ -5148,25 +5144,43 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
 	mmu_notifier_invalidate_range_start(&range);
 	address = start;
-	for (; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address, sz);
-		if (!ptep)
+
+	while (address < end) {
+		pte_t *ptep = huge_pte_offset(mm, address, sz);
+
+		if (!ptep) {
+			address += sz;
 			continue;
+		}
+		hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
+		if (hgm_enabled) {
+			int ret = huge_pte_alloc_high_granularity(
+					&hpte, mm, vma, address, PAGE_SHIFT,
+					HUGETLB_SPLIT_NEVER,
+					/*write_locked=*/true);
+			/*
+			 * We will never split anything, so this should always
+			 * succeed.
+			 */
+			BUG_ON(ret);
+		}
 
-		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+		ptl = hugetlb_pte_lock(mm, &hpte);
+		if (!hgm_enabled && huge_pmd_unshare(
+					mm, vma, &address, hpte.ptep)) {
 			spin_unlock(ptl);
 			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
 			force_flush = true;
-			continue;
+			goto next_hpte;
 		}
 
-		pte = huge_ptep_get(ptep);
-		if (huge_pte_none(pte)) {
+		if (hugetlb_pte_none(&hpte)) {
 			spin_unlock(ptl);
-			continue;
+			goto next_hpte;
 		}
 
+		pte = hugetlb_ptep_get(&hpte);
+
 		/*
 		 * Migrating hugepage or HWPoisoned hugepage is already
 		 * unmapped and its refcount is dropped, so just clear pte here.
@@ -5180,24 +5194,27 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 			 */
 			if (pte_swp_uffd_wp_any(pte) &&
 			    !(zap_flags & ZAP_FLAG_DROP_MARKER))
-				set_huge_pte_at(mm, address, ptep,
+				set_huge_pte_at(mm, address, hpte.ptep,
 						make_pte_marker(PTE_MARKER_UFFD_WP));
 			else
-				huge_pte_clear(mm, address, ptep, sz);
+				huge_pte_clear(mm, address, hpte.ptep,
+						hugetlb_pte_size(&hpte));
 			spin_unlock(ptl);
-			continue;
+			goto next_hpte;
 		}
 
-		page = pte_page(pte);
+		subpage = pte_page(pte);
+		BUG_ON(!subpage);
+		hpage = compound_head(subpage);
 		/*
 		 * If a reference page is supplied, it is because a specific
 		 * page is being unmapped, not a range. Ensure the page we
 		 * are about to unmap is the actual page of interest.
 		 */
 		if (ref_page) {
-			if (page != ref_page) {
+			if (hpage != ref_page) {
 				spin_unlock(ptl);
-				continue;
+				goto next_hpte;
 			}
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -5207,25 +5224,35 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 			set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
 		}
 
-		pte = huge_ptep_get_and_clear(mm, address, ptep);
-		tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
+		pte = huge_ptep_get_and_clear(mm, address, hpte.ptep);
+		tlb_change_page_size(tlb, hugetlb_pte_size(&hpte));
+		tlb_remove_huge_tlb_entry(tlb, hpte, address);
 		if (huge_pte_dirty(pte))
-			set_page_dirty(page);
+			set_page_dirty(hpage);
 		/* Leave a uffd-wp pte marker if needed */
 		if (huge_pte_uffd_wp(pte) &&
 		    !(zap_flags & ZAP_FLAG_DROP_MARKER))
-			set_huge_pte_at(mm, address, ptep,
+			set_huge_pte_at(mm, address, hpte.ptep,
 					make_pte_marker(PTE_MARKER_UFFD_WP));
-		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, vma, true);
+
+		hugetlb_count_sub(hugetlb_pte_size(&hpte)/PAGE_SIZE, mm);
+
+		/*
+		 * If we are unmapping the entire page, remove it from the
+		 * rmap.
+		 */
+		if (IS_ALIGNED(address, sz) && address + sz <= end)
+			page_remove_rmap(hpage, vma, true);
 
 		spin_unlock(ptl);
-		tlb_remove_page_size(tlb, page, huge_page_size(h));
+		tlb_remove_page_size(tlb, subpage, hugetlb_pte_size(&hpte));
 		/*
 		 * Bail out after unmapping reference page if supplied
 		 */
 		if (ref_page)
 			break;
+next_hpte:
+		address += hugetlb_pte_size(&hpte);
 	}
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_end_vma(tlb, vma);
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 16/26] hugetlb: make hugetlb_change_protection compatible with HGM
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (14 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 15/26] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 17:36 ` [RFC PATCH 17/26] hugetlb: update follow_hugetlb_page to support HGM James Houghton
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

HugeTLB is now able to change the protection of hugepages that are
mapped at high granularity.

I need to add more of the HugeTLB PTE wrapper functions to clean up this
patch. I'll do this in the next version.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 91 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 62 insertions(+), 29 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51fc1d3f122f..f9c7daa6c090 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6476,14 +6476,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
-	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
-	unsigned long pages = 0, psize = huge_page_size(h);
+	unsigned long base_pages = 0, psize = huge_page_size(h);
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	struct hugetlb_pte hpte;
+	bool hgm_enabled = hugetlb_hgm_enabled(vma);
 
 	/*
 	 * In the case of shared PMDs, the area to flush could be beyond
@@ -6499,28 +6500,38 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 	mmu_notifier_invalidate_range_start(&range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
-	for (; address < end; address += psize) {
+	while (address < end) {
 		spinlock_t *ptl;
-		ptep = huge_pte_offset(mm, address, psize);
-		if (!ptep)
+		pte_t *ptep = huge_pte_offset(mm, address, huge_page_size(h));
+
+		if (!ptep) {
+			address += huge_page_size(h);
 			continue;
-		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+		}
+		hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
+		if (hgm_enabled) {
+			int ret = hugetlb_walk_to(mm, &hpte, address, PAGE_SIZE,
+						  /*stop_at_none=*/true);
+			BUG_ON(ret);
+		}
+
+		ptl = hugetlb_pte_lock(mm, &hpte);
+		if (huge_pmd_unshare(mm, vma, &address, hpte.ptep)) {
 			/*
 			 * When uffd-wp is enabled on the vma, unshare
 			 * shouldn't happen at all.  Warn about it if it
 			 * happened due to some reason.
 			 */
 			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
-			pages++;
+			base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
 			spin_unlock(ptl);
 			shared_pmd = true;
-			continue;
+			goto next_hpte;
 		}
-		pte = huge_ptep_get(ptep);
+		pte = hugetlb_ptep_get(&hpte);
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			spin_unlock(ptl);
-			continue;
+			goto next_hpte;
 		}
 		if (unlikely(is_hugetlb_entry_migration(pte))) {
 			swp_entry_t entry = pte_to_swp_entry(pte);
@@ -6540,12 +6551,13 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 					newpte = pte_swp_mkuffd_wp(newpte);
 				else if (uffd_wp_resolve)
 					newpte = pte_swp_clear_uffd_wp(newpte);
-				set_huge_swap_pte_at(mm, address, ptep,
-						     newpte, psize);
-				pages++;
+				set_huge_swap_pte_at(mm, address, hpte.ptep,
+						     newpte,
+						     hugetlb_pte_size(&hpte));
+				base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
 			}
 			spin_unlock(ptl);
-			continue;
+			goto next_hpte;
 		}
 		if (unlikely(pte_marker_uffd_wp(pte))) {
 			/*
@@ -6553,21 +6565,40 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			 * no need for huge_ptep_modify_prot_start/commit().
 			 */
 			if (uffd_wp_resolve)
-				huge_pte_clear(mm, address, ptep, psize);
+				huge_pte_clear(mm, address, hpte.ptep, psize);
 		}
-		if (!huge_pte_none(pte)) {
+		if (!hugetlb_pte_none(&hpte)) {
 			pte_t old_pte;
-			unsigned int shift = huge_page_shift(hstate_vma(vma));
-
-			old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
-			pte = huge_pte_modify(old_pte, newprot);
-			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
-			if (uffd_wp)
-				pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
-			else if (uffd_wp_resolve)
-				pte = huge_pte_clear_uffd_wp(pte);
-			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
-			pages++;
+			unsigned int shift = hpte.shift;
+			/*
+			 * This is ugly. This will be cleaned up in a future
+			 * version of this series.
+			 */
+			if (shift > PAGE_SHIFT) {
+				old_pte = huge_ptep_modify_prot_start(
+						vma, address, hpte.ptep);
+				pte = huge_pte_modify(old_pte, newprot);
+				pte = arch_make_huge_pte(
+						pte, shift, vma->vm_flags);
+				if (uffd_wp)
+					pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+				else if (uffd_wp_resolve)
+					pte = huge_pte_clear_uffd_wp(pte);
+				huge_ptep_modify_prot_commit(
+						vma, address, hpte.ptep,
+						old_pte, pte);
+			} else {
+				old_pte = ptep_modify_prot_start(
+						vma, address, hpte.ptep);
+				pte = pte_modify(old_pte, newprot);
+				if (uffd_wp)
+					pte = pte_mkuffd_wp(pte_wrprotect(pte));
+				else if (uffd_wp_resolve)
+					pte = pte_clear_uffd_wp(pte);
+				ptep_modify_prot_commit(
+						vma, address, hpte.ptep, old_pte, pte);
+			}
+			base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
 		} else {
 			/* None pte */
 			if (unlikely(uffd_wp))
@@ -6576,6 +6607,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 						make_pte_marker(PTE_MARKER_UFFD_WP));
 		}
 		spin_unlock(ptl);
+next_hpte:
+		address += hugetlb_pte_size(&hpte);
 	}
 	/*
 	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
@@ -6597,7 +6630,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	mmu_notifier_invalidate_range_end(&range);
 
-	return pages << h->order;
+	return base_pages;
 }
 
 /* Return true if reservation was successful, false otherwise.  */
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 17/26] hugetlb: update follow_hugetlb_page to support HGM
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (15 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 16/26] hugetlb: make hugetlb_change_protection compatible with HGM James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-07-19 10:48   ` manish.mishra
  2022-06-24 17:36 ` [RFC PATCH 18/26] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
                   ` (11 subsequent siblings)
  28 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This enables support for GUP, and it is needed for the KVM demand paging
self-test to work.

One important change here is that, before, we never needed to grab the
i_mmap_sem, but now, to prevent someone from collapsing the page tables
out from under us, we grab it for reading when doing high-granularity PT
walks.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 70 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 57 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f9c7daa6c090..aadfcee947cf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6298,14 +6298,18 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping = vma->vm_file->f_mapping;
 	int err = -EFAULT, refs;
+	bool has_i_mmap_sem = false;
 
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
 		spinlock_t *ptl = NULL;
 		bool unshare = false;
 		int absent;
+		unsigned long pages_per_hpte;
 		struct page *page;
+		struct hugetlb_pte hpte;
 
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -6325,9 +6329,23 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
 				      huge_page_size(h));
-		if (pte)
-			ptl = huge_pte_lock(h, mm, pte);
-		absent = !pte || huge_pte_none(huge_ptep_get(pte));
+		if (pte) {
+			hugetlb_pte_populate(&hpte, pte, huge_page_shift(h));
+			if (hugetlb_hgm_enabled(vma)) {
+				BUG_ON(has_i_mmap_sem);
+				i_mmap_lock_read(mapping);
+				/*
+				 * Need to hold the mapping semaphore for
+				 * reading to do a HGM walk.
+				 */
+				has_i_mmap_sem = true;
+				hugetlb_walk_to(mm, &hpte, vaddr, PAGE_SIZE,
+						/*stop_at_none=*/true);
+			}
+			ptl = hugetlb_pte_lock(mm, &hpte);
+		}
+
+		absent = !pte || hugetlb_pte_none(&hpte);
 
 		/*
 		 * When coredumping, it suits get_dump_page if we just return
@@ -6338,8 +6356,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
-			if (pte)
+			if (pte) {
+				if (has_i_mmap_sem) {
+					i_mmap_unlock_read(mapping);
+					has_i_mmap_sem = false;
+				}
 				spin_unlock(ptl);
+			}
 			remainder = 0;
 			break;
 		}
@@ -6359,8 +6382,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			vm_fault_t ret;
 			unsigned int fault_flags = 0;
 
-			if (pte)
+			if (pte) {
+				if (has_i_mmap_sem) {
+					i_mmap_unlock_read(mapping);
+					has_i_mmap_sem = false;
+				}
 				spin_unlock(ptl);
+			}
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
 			else if (unshare)
@@ -6403,8 +6431,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 		}
 
-		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
-		page = pte_page(huge_ptep_get(pte));
+		pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
+		page = pte_page(hugetlb_ptep_get(&hpte));
+		pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
+		if (hugetlb_hgm_enabled(vma))
+			page = compound_head(page);
 
 		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 			       !PageAnonExclusive(page), page);
@@ -6414,17 +6445,21 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * and skip the same_page loop below.
 		 */
 		if (!pages && !vmas && !pfn_offset &&
-		    (vaddr + huge_page_size(h) < vma->vm_end) &&
-		    (remainder >= pages_per_huge_page(h))) {
-			vaddr += huge_page_size(h);
-			remainder -= pages_per_huge_page(h);
-			i += pages_per_huge_page(h);
+		    (vaddr + pages_per_hpte < vma->vm_end) &&
+		    (remainder >= pages_per_hpte)) {
+			vaddr += pages_per_hpte;
+			remainder -= pages_per_hpte;
+			i += pages_per_hpte;
 			spin_unlock(ptl);
+			if (has_i_mmap_sem) {
+				has_i_mmap_sem = false;
+				i_mmap_unlock_read(mapping);
+			}
 			continue;
 		}
 
 		/* vaddr may not be aligned to PAGE_SIZE */
-		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
+		refs = min3(pages_per_hpte - pfn_offset, remainder,
 		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
 
 		if (pages || vmas)
@@ -6447,6 +6482,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
 							 flags))) {
 				spin_unlock(ptl);
+				if (has_i_mmap_sem) {
+					has_i_mmap_sem = false;
+					i_mmap_unlock_read(mapping);
+				}
 				remainder = 0;
 				err = -ENOMEM;
 				break;
@@ -6458,8 +6497,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		i += refs;
 
 		spin_unlock(ptl);
+		if (has_i_mmap_sem) {
+			has_i_mmap_sem = false;
+			i_mmap_unlock_read(mapping);
+		}
 	}
 	*nr_pages = remainder;
+	BUG_ON(has_i_mmap_sem);
 	/*
 	 * setting position is actually required only if remainder is
 	 * not zero but it's faster not to add a "if (remainder)"
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 18/26] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (16 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 17/26] hugetlb: update follow_hugetlb_page to support HGM James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 17:36 ` [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

Although this change is large, it is somewhat straightforward. Before,
all users of walk_hugetlb_range could get the size of the PTE just be
checking the hmask or the mm_walk struct. With HGM, that information is
held in the hugetlb_pte struct, so we provide that instead of the raw
pte_t*.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/s390/mm/gmap.c      |  8 ++++++--
 fs/proc/task_mmu.c       | 35 +++++++++++++++++++----------------
 include/linux/pagewalk.h |  3 ++-
 mm/damon/vaddr.c         | 34 ++++++++++++++++++----------------
 mm/hmm.c                 |  7 ++++---
 mm/mempolicy.c           | 11 ++++++++---
 mm/mincore.c             |  4 ++--
 mm/mprotect.c            |  6 +++---
 mm/pagewalk.c            | 18 ++++++++++++++++--
 9 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index b8ae4a4aa2ba..518cebfd72cd 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2620,10 +2620,14 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
-				      unsigned long hmask, unsigned long next,
+static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte,
+				      unsigned long addr, unsigned long next,
 				      struct mm_walk *walk)
 {
+	if (!hugetlb_pte_present_leaf(hpte) ||
+			hugetlb_pte_size(hpte) != PMD_SIZE)
+		return -EINVAL;
+
 	pmd_t *pmd = (pmd_t *)pte;
 	unsigned long start, end;
 	struct page *page = pmd_page(*pmd);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2d04e3470d4c..b2d683f99fa9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -714,18 +714,19 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
 				 unsigned long addr, unsigned long end,
 				 struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	struct page *page = NULL;
+	pte_t pte = hugetlb_ptep_get(hpte);
 
-	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
-	} else if (is_swap_pte(*pte)) {
-		swp_entry_t swpent = pte_to_swp_entry(*pte);
+	if (hugetlb_pte_present_leaf(hpte)) {
+		page = vm_normal_page(vma, addr, pte);
+	} else if (is_swap_pte(pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(pte);
 
 		if (is_pfn_swap_entry(swpent))
 			page = pfn_swap_entry_to_page(swpent);
@@ -734,9 +735,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 		int mapcount = page_mapcount(page);
 
 		if (mapcount >= 2)
-			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->shared_hugetlb += hugetlb_pte_size(hpte);
 		else
-			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->private_hugetlb += hugetlb_pte_size(hpte);
 	}
 	return 0;
 }
@@ -1535,7 +1536,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
+static int pagemap_hugetlb_range(struct hugetlb_pte *hpte,
 				 unsigned long addr, unsigned long end,
 				 struct mm_walk *walk)
 {
@@ -1543,13 +1544,13 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 	struct vm_area_struct *vma = walk->vma;
 	u64 flags = 0, frame = 0;
 	int err = 0;
-	pte_t pte;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
 
 	if (vma->vm_flags & VM_SOFTDIRTY)
 		flags |= PM_SOFT_DIRTY;
 
-	pte = huge_ptep_get(ptep);
-	if (pte_present(pte)) {
+	if (hugetlb_pte_present_leaf(hpte)) {
+		pte_t pte = hugetlb_ptep_get(hpte);
 		struct page *page = pte_page(pte);
 
 		if (!PageAnon(page))
@@ -1565,7 +1566,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 		if (pm->show_pfn)
 			frame = pte_pfn(pte) +
 				((addr & ~hmask) >> PAGE_SHIFT);
-	} else if (pte_swp_uffd_wp_any(pte)) {
+	} else if (pte_swp_uffd_wp_any(hugetlb_ptep_get(hpte))) {
 		flags |= PM_UFFD_WP;
 	}
 
@@ -1869,17 +1870,19 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long addr,
+		unsigned long end, struct mm_walk *walk)
 {
-	pte_t huge_pte = huge_ptep_get(pte);
+	pte_t huge_pte = hugetlb_ptep_get(hpte);
 	struct numa_maps *md;
 	struct page *page;
 
-	if (!pte_present(huge_pte))
+	if (!hugetlb_pte_present_leaf(hpte))
 		return 0;
 
 	page = pte_page(huge_pte);
+	if (page != compound_head(page))
+		return 0;
 
 	md = walk->private;
 	gather_stats(page, md, pte_dirty(huge_pte), 1);
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index ac7b38ad5903..0d21e25df37f 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -3,6 +3,7 @@
 #define _LINUX_PAGEWALK_H
 
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
 
 struct mm_walk;
 
@@ -47,7 +48,7 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_hole)(unsigned long addr, unsigned long next,
 			int depth, struct mm_walk *walk);
-	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
+	int (*hugetlb_entry)(struct hugetlb_pte *hpte,
 			     unsigned long addr, unsigned long next,
 			     struct mm_walk *walk);
 	int (*test_walk)(unsigned long addr, unsigned long next,
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 59e1653799f8..ce50b937dcf2 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -324,14 +324,15 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
+static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, struct mm_struct *mm,
 				struct vm_area_struct *vma, unsigned long addr)
 {
 	bool referenced = false;
 	pte_t entry = huge_ptep_get(pte);
 	struct page *page = pte_page(entry);
+	struct page *hpage = compound_head(page);
 
-	get_page(page);
+	get_page(hpage);
 
 	if (pte_young(entry)) {
 		referenced = true;
@@ -342,18 +343,18 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 
 #ifdef CONFIG_MMU_NOTIFIER
 	if (mmu_notifier_clear_young(mm, addr,
-				     addr + huge_page_size(hstate_vma(vma))))
+				     addr + hugetlb_pte_size(hpte));
 		referenced = true;
 #endif /* CONFIG_MMU_NOTIFIER */
 
 	if (referenced)
-		set_page_young(page);
+		set_page_young(hpage);
 
-	set_page_idle(page);
-	put_page(page);
+	set_page_idle(hpage);
+	put_page(hpage);
 }
 
-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
+static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte,
 				     unsigned long addr, unsigned long end,
 				     struct mm_walk *walk)
 {
@@ -361,12 +362,12 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = huge_pte_lock_shift(hpte->shift, walk->mm, hpte->ptep);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
-	damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
+	damon_hugetlb_mkold(hpte, walk->mm, walk->vma, addr);
 
 out:
 	spin_unlock(ptl);
@@ -474,31 +475,32 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
+static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte,
 				     unsigned long addr, unsigned long end,
 				     struct mm_walk *walk)
 {
 	struct damon_young_walk_private *priv = walk->private;
 	struct hstate *h = hstate_vma(walk->vma);
-	struct page *page;
+	struct page *page, *hpage;
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
+	ptl = huge_pte_lock_shift(hpte->shift, walk->mm, hpte->ptep);
 	entry = huge_ptep_get(pte);
 	if (!pte_present(entry))
 		goto out;
 
 	page = pte_page(entry);
-	get_page(page);
+	hpage = compound_head(page);
+	get_page(hpage);
 
-	if (pte_young(entry) || !page_is_idle(page) ||
+	if (pte_young(entry) || !page_is_idle(hpage) ||
 	    mmu_notifier_test_young(walk->mm, addr)) {
 		*priv->page_sz = huge_page_size(h);
 		priv->young = true;
 	}
 
-	put_page(page);
+	put_page(hpage);
 
 out:
 	spin_unlock(ptl);
diff --git a/mm/hmm.c b/mm/hmm.c
index 3fd3242c5e50..1ad5d76fa8be 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -472,7 +472,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
+static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte,
 				      unsigned long start, unsigned long end,
 				      struct mm_walk *walk)
 {
@@ -483,11 +483,12 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	unsigned int required_fault;
 	unsigned long pfn_req_flags;
 	unsigned long cpu_flags;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = huge_pte_lock_shift(hpte->shift, walk->mm, hpte->ptep);
+	entry = huge_ptep_get(hpte->ptep);
 
 	i = (start - range->start) >> PAGE_SHIFT;
 	pfn_req_flags = range->hmm_pfns[i];
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d39b01fd52fe..a1d82db7c19f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -559,7 +559,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	return addr != end ? -EIO : 0;
 }
 
-static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
+static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
 			       unsigned long addr, unsigned long end,
 			       struct mm_walk *walk)
 {
@@ -571,8 +571,13 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	/* We don't migrate high-granularity HugeTLB mappings for now. */
+	if (hugetlb_pte_size(hpte) !=
+			huge_page_size(hstate_vma(walk->vma)))
+		return -EINVAL;
+
+	ptl = hugetlb_pte_lock(walk->mm, hpte);
+	entry = hugetlb_ptep_get(hpte);
 	if (!pte_present(entry))
 		goto unlock;
 	page = pte_page(entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index fa200c14185f..dc1717dc6a2c 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -22,7 +22,7 @@
 #include <linux/uaccess.h>
 #include "swap.h"
 
-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
+static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
@@ -33,7 +33,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
-	present = pte && !huge_pte_none(huge_ptep_get(pte));
+	present = hpte->ptep && !hugetlb_pte_none(hpte);
 	for (; addr != end; vec++, addr += PAGE_SIZE)
 		*vec = present;
 	walk->private = vec;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..9c5a35a1c0eb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -476,12 +476,12 @@ static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
 		0 : -EACCES;
 }
 
-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
+static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte,
 				   unsigned long addr, unsigned long next,
 				   struct mm_walk *walk)
 {
-	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
-		0 : -EACCES;
+	return pfn_modify_allowed(pte_pfn(*hpte->ptep),
+			*(pgprot_t *)(walk->private)) ? 0 : -EACCES;
 }
 
 static int prot_none_test(unsigned long addr, unsigned long next,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9b3db11a4d1d..f8e24a0a0179 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/minmax.h>
 
 /*
  * We want to know the real level where a entry is located ignoring any
@@ -301,13 +302,26 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	pte_t *pte;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
+	struct hugetlb_pte hpte;
 
 	do {
-		next = hugetlb_entry_end(h, addr, end);
 		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
+		if (!pte) {
+			next = hugetlb_entry_end(h, addr, end);
+		} else {
+			hugetlb_pte_populate(&hpte, pte, huge_page_shift(h));
+			if (hugetlb_hgm_enabled(vma)) {
+				err = hugetlb_walk_to(walk->mm, &hpte, addr,
+						      PAGE_SIZE,
+						      /*stop_at_none=*/true);
+				if (err)
+					break;
+			}
+			next = min(addr + hugetlb_pte_size(&hpte), end);
+		}
 
 		if (pte)
-			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
+			err = ops->hugetlb_entry(&hpte, addr, next, walk);
 		else if (ops->pte_hole)
 			err = ops->pte_hole(addr, next, -1, walk);
 
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (17 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 18/26] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-07-11 23:41   ` Mike Kravetz
  2022-06-24 17:36 ` [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE James Houghton
                   ` (9 subsequent siblings)
  28 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This allows fork() to work with high-granularity mappings. The page
table structure is copied such that partially mapped regions will remain
partially mapped in the same way for the new process.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index aadfcee947cf..0ec2f231524e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4851,7 +4851,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *src_vma)
 {
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
-	struct page *ptepage;
+	struct hugetlb_pte src_hpte, dst_hpte;
+	struct page *ptepage, *hpage;
 	unsigned long addr;
 	bool cow = is_cow_mapping(src_vma->vm_flags);
 	struct hstate *h = hstate_vma(src_vma);
@@ -4878,17 +4879,44 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		i_mmap_lock_read(mapping);
 	}
 
-	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
+	addr = src_vma->vm_start;
+	while (addr < src_vma->vm_end) {
 		spinlock_t *src_ptl, *dst_ptl;
+		unsigned long hpte_sz;
 		src_pte = huge_pte_offset(src, addr, sz);
-		if (!src_pte)
+		if (!src_pte) {
+			addr += sz;
 			continue;
+		}
 		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
 		if (!dst_pte) {
 			ret = -ENOMEM;
 			break;
 		}
 
+		hugetlb_pte_populate(&src_hpte, src_pte, huge_page_shift(h));
+		hugetlb_pte_populate(&dst_hpte, dst_pte, huge_page_shift(h));
+
+		if (hugetlb_hgm_enabled(src_vma)) {
+			BUG_ON(!hugetlb_hgm_enabled(dst_vma));
+			ret = hugetlb_walk_to(src, &src_hpte, addr,
+					      PAGE_SIZE, /*stop_at_none=*/true);
+			if (ret)
+				break;
+			ret = huge_pte_alloc_high_granularity(
+					&dst_hpte, dst, dst_vma, addr,
+					hugetlb_pte_shift(&src_hpte),
+					HUGETLB_SPLIT_NONE,
+					/*write_locked=*/false);
+			if (ret)
+				break;
+
+			src_pte = src_hpte.ptep;
+			dst_pte = dst_hpte.ptep;
+		}
+
+		hpte_sz = hugetlb_pte_size(&src_hpte);
+
 		/*
 		 * If the pagetables are shared don't copy or take references.
 		 * dst_pte == src_pte is the common case of src/dest sharing.
@@ -4899,16 +4927,19 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		 * after taking the lock below.
 		 */
 		dst_entry = huge_ptep_get(dst_pte);
-		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
+		if ((dst_pte == src_pte) || !hugetlb_pte_none(&dst_hpte)) {
+			addr += hugetlb_pte_size(&src_hpte);
 			continue;
+		}
 
-		dst_ptl = huge_pte_lock(h, dst, dst_pte);
-		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
+		dst_ptl = hugetlb_pte_lock(dst, &dst_hpte);
+		src_ptl = hugetlb_pte_lockptr(src, &src_hpte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
 again:
-		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
+		if (hugetlb_pte_none(&src_hpte) ||
+		    !hugetlb_pte_none(&dst_hpte)) {
 			/*
 			 * Skip if src entry none.  Also, skip in the
 			 * unlikely case dst entry !none as this implies
@@ -4931,11 +4962,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				if (userfaultfd_wp(src_vma) && uffd_wp)
 					entry = huge_pte_mkuffd_wp(entry);
 				set_huge_swap_pte_at(src, addr, src_pte,
-						     entry, sz);
+						     entry, hpte_sz);
 			}
 			if (!userfaultfd_wp(dst_vma) && uffd_wp)
 				entry = huge_pte_clear_uffd_wp(entry);
-			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
+			set_huge_swap_pte_at(dst, addr, dst_pte, entry,
+					     hpte_sz);
 		} else if (unlikely(is_pte_marker(entry))) {
 			/*
 			 * We copy the pte marker only if the dst vma has
@@ -4946,7 +4978,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else {
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
-			get_page(ptepage);
+			hpage = compound_head(ptepage);
+			get_page(hpage);
 
 			/*
 			 * Failing to duplicate the anon rmap is a rare case
@@ -4959,9 +4992,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 * sleep during the process.
 			 */
 			if (!PageAnon(ptepage)) {
-				page_dup_file_rmap(ptepage, true);
+				/* Only dup_rmap once for a page */
+				if (IS_ALIGNED(addr, sz))
+					page_dup_file_rmap(hpage, true);
 			} else if (page_try_dup_anon_rmap(ptepage, true,
 							  src_vma)) {
+				if (hugetlb_hgm_enabled(src_vma)) {
+					ret = -EINVAL;
+					break;
+				}
+				BUG_ON(!IS_ALIGNED(addr, hugetlb_pte_size(&src_hpte)));
 				pte_t src_pte_old = entry;
 				struct page *new;
 
@@ -4970,13 +5010,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				/* Do not use reserve as it's private owned */
 				new = alloc_huge_page(dst_vma, addr, 1);
 				if (IS_ERR(new)) {
-					put_page(ptepage);
+					put_page(hpage);
 					ret = PTR_ERR(new);
 					break;
 				}
-				copy_user_huge_page(new, ptepage, addr, dst_vma,
+				copy_user_huge_page(new, hpage, addr, dst_vma,
 						    npages);
-				put_page(ptepage);
+				put_page(hpage);
 
 				/* Install the new huge page if src pte stable */
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
@@ -4994,6 +5034,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				hugetlb_install_page(dst_vma, dst_pte, addr, new);
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
+				addr += hugetlb_pte_size(&src_hpte);
 				continue;
 			}
 
@@ -5010,10 +5051,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 
 			set_huge_pte_at(dst, addr, dst_pte, entry);
-			hugetlb_count_add(npages, dst);
+			hugetlb_count_add(
+					hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
+					dst);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
+		addr += hugetlb_pte_size(&src_hpte);
 	}
 
 	if (cow) {
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (18 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-07-15 16:21   ` Peter Xu
  2022-06-24 17:36 ` [RFC PATCH 21/26] hugetlb: add hugetlb_collapse James Houghton
                   ` (8 subsequent siblings)
  28 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

The changes here are very similar to the changes made to
hugetlb_no_page, where we do a high-granularity page table walk and
do accounting slightly differently because we are mapping only a piece
of a page.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/userfaultfd.c        |  3 +++
 include/linux/hugetlb.h |  6 +++--
 mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
 mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
 4 files changed, 82 insertions(+), 38 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index e943370107d0..77c1b8a7d0b9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 	if (!ptep)
 		goto out;
 
+	if (hugetlb_hgm_enabled(vma))
+		goto out;
+
 	ret = false;
 	pte = huge_ptep_get(ptep);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ac4ac8fbd901..c207b1ac6195 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -221,13 +221,15 @@ unsigned long hugetlb_total_pages(void);
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
 #ifdef CONFIG_USERFAULTFD
-int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
+int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
+				struct hugetlb_pte *dst_hpte,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
 				unsigned long src_addr,
 				enum mcopy_atomic_mode mode,
 				struct page **pagep,
-				bool wp_copy);
+				bool wp_copy,
+				bool new_mapping);
 #endif /* CONFIG_USERFAULTFD */
 bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0ec2f231524e..09fa57599233 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5808,6 +5808,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		vma_end_reservation(h, vma, haddr);
 	}
 
+	/* This lock will get pretty expensive at 4K. */
 	ptl = hugetlb_pte_lock(mm, hpte);
 	ret = 0;
 	/* If pte changed from under us, retry */
@@ -6098,24 +6099,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  * modifications for huge pages.
  */
 int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
-			    pte_t *dst_pte,
+			    struct hugetlb_pte *dst_hpte,
 			    struct vm_area_struct *dst_vma,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
 			    enum mcopy_atomic_mode mode,
 			    struct page **pagep,
-			    bool wp_copy)
+			    bool wp_copy,
+			    bool new_mapping)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
 	struct hstate *h = hstate_vma(dst_vma);
 	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	unsigned long haddr = dst_addr & huge_page_mask(h);
 	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	pte_t _dst_pte;
 	spinlock_t *ptl;
 	int ret = -ENOMEM;
-	struct page *page;
+	struct page *page, *subpage;
 	int writable;
 	bool page_in_pagecache = false;
 
@@ -6130,12 +6133,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		 * a non-missing case. Return -EEXIST.
 		 */
 		if (vm_shared &&
-		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
 			ret = -EEXIST;
 			goto out;
 		}
 
-		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		page = alloc_huge_page(dst_vma, haddr, 0);
 		if (IS_ERR(page)) {
 			ret = -ENOMEM;
 			goto out;
@@ -6151,13 +6154,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			/* Free the allocated page which may have
 			 * consumed a reservation.
 			 */
-			restore_reserve_on_error(h, dst_vma, dst_addr, page);
+			restore_reserve_on_error(h, dst_vma, haddr, page);
 			put_page(page);
 
 			/* Allocate a temporary page to hold the copied
 			 * contents.
 			 */
-			page = alloc_huge_page_vma(h, dst_vma, dst_addr);
+			page = alloc_huge_page_vma(h, dst_vma, haddr);
 			if (!page) {
 				ret = -ENOMEM;
 				goto out;
@@ -6171,14 +6174,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		}
 	} else {
 		if (vm_shared &&
-		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
 			put_page(*pagep);
 			ret = -EEXIST;
 			*pagep = NULL;
 			goto out;
 		}
 
-		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		page = alloc_huge_page(dst_vma, haddr, 0);
 		if (IS_ERR(page)) {
 			ret = -ENOMEM;
 			*pagep = NULL;
@@ -6216,8 +6219,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		page_in_pagecache = true;
 	}
 
-	ptl = huge_pte_lockptr(huge_page_shift(h), dst_mm, dst_pte);
-	spin_lock(ptl);
+	ptl = hugetlb_pte_lock(dst_mm, dst_hpte);
 
 	/*
 	 * Recheck the i_size after holding PT lock to make sure not
@@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * registered, we firstly wr-protect a none pte which has no page cache
 	 * page backing it, then access the page.
 	 */
-	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
+	if (!hugetlb_pte_none_mostly(dst_hpte))
 		goto out_release_unlock;
 
-	if (vm_shared) {
-		page_dup_file_rmap(page, true);
-	} else {
-		ClearHPageRestoreReserve(page);
-		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
+	if (new_mapping) {
+		if (vm_shared) {
+			page_dup_file_rmap(page, true);
+		} else {
+			ClearHPageRestoreReserve(page);
+			hugepage_add_new_anon_rmap(page, dst_vma, haddr);
+		}
 	}
 
 	/*
@@ -6258,7 +6262,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
 
-	_dst_pte = make_huge_pte(dst_vma, page, writable);
+	subpage = hugetlb_find_subpage(h, page, dst_addr);
+	if (subpage != page)
+		BUG_ON(!hugetlb_hgm_enabled(dst_vma));
+
+	_dst_pte = make_huge_pte(dst_vma, subpage, writable);
 	/*
 	 * Always mark UFFDIO_COPY page dirty; note that this may not be
 	 * extremely important for hugetlbfs for now since swapping is not
@@ -6271,14 +6279,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (wp_copy)
 		_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
 
-	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);
 
-	(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
-					dst_vma->vm_flags & VM_WRITE);
-	hugetlb_count_add(pages_per_huge_page(h), dst_mm);
+	(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_hpte->ptep,
+			_dst_pte, dst_vma->vm_flags & VM_WRITE);
+	hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);
 
 	spin_unlock(ptl);
 	if (!is_continue)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4f4892a5f767..ee40d98068bf 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -310,14 +310,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 {
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
-	pte_t *dst_pte;
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct page *page;
-	unsigned long vma_hpagesize;
+	unsigned long vma_hpagesize, vma_altpagesize;
 	pgoff_t idx;
 	u32 hash;
 	struct address_space *mapping;
+	bool use_hgm = hugetlb_hgm_enabled(dst_vma) &&
+		mode == MCOPY_ATOMIC_CONTINUE;
+	struct hstate *h = hstate_vma(dst_vma);
 
 	/*
 	 * There is no default zero huge page for all huge page sizes as
@@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	copied = 0;
 	page = NULL;
 	vma_hpagesize = vma_kernel_pagesize(dst_vma);
+	if (use_hgm)
+		vma_altpagesize = PAGE_SIZE;
+	else
+		vma_altpagesize = vma_hpagesize;
 
 	/*
 	 * Validate alignment based on huge page size
 	 */
 	err = -EINVAL;
-	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+	if (dst_start & (vma_altpagesize - 1) || len & (vma_altpagesize - 1))
 		goto out_unlock;
 
 retry:
@@ -361,6 +367,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		vm_shared = dst_vma->vm_flags & VM_SHARED;
 	}
 
+	BUG_ON(!vm_shared && use_hgm);
+
 	/*
 	 * If not shared, ensure the dst_vma has a anon_vma.
 	 */
@@ -371,11 +379,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	}
 
 	while (src_addr < src_start + len) {
+		struct hugetlb_pte hpte;
+		bool new_mapping;
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
 		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
-		 * i_mmap_rwsem ensures the dst_pte remains valid even
+		 * i_mmap_rwsem ensures the hpte.ptep remains valid even
 		 * in the case of shared pmds.  fault mutex prevents
 		 * races with other faulting threads.
 		 */
@@ -383,27 +393,47 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		i_mmap_lock_read(mapping);
 		idx = linear_page_index(dst_vma, dst_addr);
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
+		/* This lock will get expensive at 4K. */
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
-		err = -ENOMEM;
-		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
-		if (!dst_pte) {
+		err = 0;
+
+		pte_t *ptep = huge_pte_alloc(dst_mm, dst_vma, dst_addr,
+					     vma_hpagesize);
+		if (!ptep)
+			err = -ENOMEM;
+		else {
+			hugetlb_pte_populate(&hpte, ptep,
+					huge_page_shift(h));
+			/*
+			 * If the hstate-level PTE is not none, then a mapping
+			 * was previously established.
+			 * The per-hpage mutex prevents double-counting.
+			 */
+			new_mapping = hugetlb_pte_none(&hpte);
+			if (use_hgm)
+				err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
+								dst_addr,
+								dst_start + len);
+		}
+
+		if (err) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
 		if (mode != MCOPY_ATOMIC_CONTINUE &&
-		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
+		    !hugetlb_pte_none_mostly(&hpte)) {
 			err = -EEXIST;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
-		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
+		err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
 					       dst_addr, src_addr, mode, &page,
-					       wp_copy);
+					       wp_copy, new_mapping);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		i_mmap_unlock_read(mapping);
@@ -413,6 +443,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		if (unlikely(err == -ENOENT)) {
 			mmap_read_unlock(dst_mm);
 			BUG_ON(!page);
+			BUG_ON(hpte.shift != huge_page_shift(h));
 
 			err = copy_huge_page_from_user(page,
 						(const void __user *)src_addr,
@@ -430,9 +461,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 			BUG_ON(page);
 
 		if (!err) {
-			dst_addr += vma_hpagesize;
-			src_addr += vma_hpagesize;
-			copied += vma_hpagesize;
+			dst_addr += hugetlb_pte_size(&hpte);
+			src_addr += hugetlb_pte_size(&hpte);
+			copied += hugetlb_pte_size(&hpte);
 
 			if (fatal_signal_pending(current))
 				err = -EINTR;
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 21/26] hugetlb: add hugetlb_collapse
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (19 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 17:36 ` [RFC PATCH 22/26] madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE James Houghton
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This is what implements MADV_COLLAPSE for HugeTLB pages. This is a
necessary extension to the UFFDIO_CONTINUE changes. When userspace
finishes mapping an entire hugepage with UFFDIO_CONTINUE, the kernel has
no mechanism to automatically collapse the page table to map the whole
hugepage normally. We require userspace to inform us that they would
like the hugepages to be collapsed; they do this with MADV_COLLAPSE.

If userspace has not mapped all of a hugepage with UFFDIO_CONTINUE, but
only some, hugetlb_collapse will cause the requested range to be mapped
as if it were UFFDIO_CONTINUE'd already.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  7 ++++
 mm/hugetlb.c            | 88 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c207b1ac6195..438057dc3b75 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1197,6 +1197,8 @@ int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
 				    unsigned int desired_sz,
 				    enum split_mode mode,
 				    bool write_locked);
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end);
 #else
 static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
@@ -1221,6 +1223,11 @@ static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
 {
 	return -EINVAL;
 }
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end)
+{
+	return -EINVAL;
+}
 #endif
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 09fa57599233..70bb3a1342d9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7280,6 +7280,94 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 	return -EINVAL;
 }
 
+/*
+ * Collapse the address range from @start to @end to be mapped optimally.
+ *
+ * This is only valid for shared mappings. The main use case for this function
+ * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
+ * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
+ * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
+ * it up to userspace to tell us to do so, via MADV_COLLAPSE.
+ *
+ * Any holes in the mapping will be filled. If there is no page in the
+ * pagecache for a region we're collapsing, the PTEs will be cleared.
+ */
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long start, unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+	struct hstate *tmp_h;
+	unsigned int shift;
+	unsigned long curr = start;
+	int ret = 0;
+	struct page *hpage, *subpage;
+	pgoff_t idx;
+	bool writable = vma->vm_flags & VM_WRITE;
+	bool shared = vma->vm_flags & VM_SHARED;
+	pte_t entry;
+
+	/*
+	 * This is only supported for shared VMAs, because we need to look up
+	 * the page to use for any PTEs we end up creating.
+	 */
+	if (!shared)
+		return -EINVAL;
+
+	i_mmap_assert_write_locked(mapping);
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
+				start, end);
+	mmu_notifier_invalidate_range_start(&range);
+	tlb_gather_mmu(&tlb, mm);
+
+	while (curr < end) {
+		for_each_hgm_shift(h, tmp_h, shift) {
+			unsigned long sz = 1UL << shift;
+			struct hugetlb_pte hpte;
+
+			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+				continue;
+
+			hugetlb_pte_init(&hpte);
+			ret = hugetlb_walk_to(mm, &hpte, curr, sz,
+					      /*stop_at_none=*/false);
+			if (ret)
+				goto out;
+			if (hugetlb_pte_size(&hpte) >= sz)
+				goto hpte_finished;
+
+			idx = vma_hugecache_offset(h, vma, curr);
+			hpage = find_lock_page(mapping, idx);
+			hugetlb_free_range(&tlb, &hpte, curr,
+					   curr + hugetlb_pte_size(&hpte));
+			if (!hpage) {
+				hugetlb_pte_clear(mm, &hpte, curr);
+				goto hpte_finished;
+			}
+
+			subpage = hugetlb_find_subpage(h, hpage, curr);
+			entry = make_huge_pte_with_shift(vma, subpage,
+							 writable, shift);
+			set_huge_pte_at(mm, curr, hpte.ptep, entry);
+			unlock_page(hpage);
+hpte_finished:
+			curr += hugetlb_pte_size(&hpte);
+			goto next;
+		}
+		ret = -EINVAL;
+		goto out;
+next:
+		continue;
+	}
+out:
+	tlb_finish_mmu(&tlb);
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 /*
  * Given a particular address, split the HugeTLB PTE that currently maps it
  * so that, for the given address, the PTE that maps it is `desired_shift`.
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 22/26] madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (20 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 21/26] hugetlb: add hugetlb_collapse James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 17:36 ` [RFC PATCH 23/26] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM James Houghton
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This commit is co-opting the same madvise mode that is being introduced
by zokeefe@google.com to manually collapse THPs[1].

As with the rest of the high-granularity mapping support, MADV_COLLAPSE
is only supported for shared VMAs right now.

[1] https://lore.kernel.org/linux-mm/20220604004004.954674-10-zokeefe@google.com/

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/uapi/asm-generic/mman-common.h |  2 ++
 mm/madvise.c                           | 23 +++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..b686920ca731 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* collapse an address range into hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index d7b4f2602949..c624c0f02276 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -981,6 +982,20 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+static int madvise_collapse(struct vm_area_struct *vma,
+			    struct vm_area_struct **prev,
+			    unsigned long start, unsigned long end)
+{
+	bool shared = vma->vm_flags & VM_SHARED;
+	*prev = vma;
+
+	/* Only allow collapsing for HGM-enabled, shared mappings. */
+	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_enabled(vma) || !shared)
+		return -EINVAL;
+
+	return hugetlb_collapse(vma->vm_mm, vma, start, end);
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1011,6 +1026,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 		return madvise_populate(vma, prev, start, end, behavior);
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	case MADV_NORMAL:
 		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
 		break;
@@ -1158,6 +1175,9 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	case MADV_COLLAPSE:
 #endif
 		return true;
 
@@ -1351,6 +1371,9 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *		triggering read faults if required
  *  MADV_POPULATE_WRITE - populate (prefault) page tables writable by
  *		triggering write faults if required
+ *  MADV_COLLAPSE - collapse a high-granularity HugeTLB mapping into huge
+ *		mappings. This is useful after an entire hugepage has been
+ *		mapped with individual small UFFDIO_CONTINUE operations.
  *
  * return values:
  *  zero    - success
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 23/26] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (21 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 22/26] madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 17:36 ` [RFC PATCH 24/26] arm64/hugetlb: add support for high-granularity mappings James Houghton
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This is so that userspace is aware that their kernel was compiled with
HugeTLB high-granularity mapping and that UFFDIO_CONTINUE down to
PAGE_SIZE-aligned chunks are valid.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/userfaultfd.c                 | 7 ++++++-
 include/uapi/linux/userfaultfd.h | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 77c1b8a7d0b9..59bfdb7a67e0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1935,10 +1935,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 		goto err_out;
 	/* report all available features and ioctls to userland */
 	uffdio_api.features = UFFD_API_FEATURES;
+
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 	uffdio_api.features &=
 		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
-#endif
+#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+#endif  /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+#endif  /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
 #endif
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 7d32b1e797fb..50fbcb0bcba0 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -32,6 +32,7 @@
 			   UFFD_FEATURE_SIGBUS |		\
 			   UFFD_FEATURE_THREAD_ID |		\
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
+			   UFFD_FEATURE_MINOR_HUGETLBFS_HGM |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
@@ -213,6 +214,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_MINOR_HUGETLBFS_HGM	(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 24/26] arm64/hugetlb: add support for high-granularity mappings
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (22 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 23/26] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 17:36 ` [RFC PATCH 25/26] selftests: add HugeTLB HGM to userfaultfd selftest James Houghton
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This is included in this RFC to demonstrate how an architecture that
doesn't use ARCH_WANT_GENERAL_HUGETLB can be updated to support HugeTLB
high-granularity mappings: an architecture just needs to implement
hugetlb_walk_to.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/arm64/Kconfig          |  1 +
 arch/arm64/mm/hugetlbpage.c | 63 +++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1652a9800ebe..74108713a99a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -99,6 +99,7 @@ config ARM64
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
 	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+	select ARCH_HAS_SPECIAL_HUGETLB_HGM
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index e2a5ec9fdc0d..1901818bed9d 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -281,6 +281,69 @@ void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
 		set_pte(ptep, pte);
 }
 
+int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		    unsigned long addr, unsigned long sz, bool stop_at_none)
+{
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pte_t *ptep;
+
+	if (!hpte->ptep) {
+		pgdp = pgd_offset(mm, addr);
+		p4dp = p4d_offset(pgdp, addr);
+		if (!p4dp)
+			return -ENOMEM;
+		hugetlb_pte_populate(hpte, (pte_t *)p4dp, P4D_SHIFT);
+	}
+
+	while (hugetlb_pte_size(hpte) > sz &&
+			!hugetlb_pte_present_leaf(hpte) &&
+			!(stop_at_none && hugetlb_pte_none(hpte))) {
+		if (hpte->shift == PMD_SHIFT) {
+			unsigned long rounded_addr = sz == CONT_PTE_SIZE
+						     ? addr & CONT_PTE_MASK
+						     : addr;
+
+			ptep = pte_offset_kernel((pmd_t *)hpte->ptep,
+						 rounded_addr);
+			if (!ptep)
+				return -ENOMEM;
+			if (sz == CONT_PTE_SIZE)
+				hpte->shift = CONT_PTE_SHIFT;
+			else
+				hpte->shift = pte_cont(*ptep) ? CONT_PTE_SHIFT
+							      : PAGE_SHIFT;
+			hpte->ptep = ptep;
+		} else if (hpte->shift == PUD_SHIFT) {
+			pud_t *pudp = (pud_t *)hpte->ptep;
+
+			ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
+
+			if (!ptep)
+				return -ENOMEM;
+			if (sz == CONT_PMD_SIZE)
+				hpte->shift = CONT_PMD_SHIFT;
+			else
+				hpte->shift = pte_cont(*ptep) ? CONT_PMD_SHIFT
+							      : PMD_SHIFT;
+			hpte->ptep = ptep;
+		} else if (hpte->shift == P4D_SHIFT) {
+			ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep, addr);
+			if (!ptep)
+				return -ENOMEM;
+			hpte->shift = PUD_SHIFT;
+			hpte->ptep = ptep;
+		} else
+			/*
+			 * This also catches the cases of CONT_PMD_SHIFT and
+			 * CONT_PTE_SHIFT. Those PTEs should always be leaves.
+			 */
+			BUG();
+	}
+
+	return 0;
+}
+
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, unsigned long sz)
 {
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 25/26] selftests: add HugeTLB HGM to userfaultfd selftest
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (23 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 24/26] arm64/hugetlb: add support for high-granularity mappings James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 17:36 ` [RFC PATCH 26/26] selftests: add HugeTLB HGM to KVM demand paging selftest James Houghton
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

It behaves just like the regular shared HugeTLB configuration, except
that it uses 4K instead of hugepages.

This doesn't test collapsing yet. I'll add a test for that for v1.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 61 ++++++++++++++++++++----
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 0bdfc1955229..9cbb959519a6 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -64,7 +64,7 @@
 
 #ifdef __NR_userfaultfd
 
-static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
+static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size, hpage_size;
 
 #define BOUNCE_RANDOM		(1<<0)
 #define BOUNCE_RACINGFAULTS	(1<<1)
@@ -72,9 +72,10 @@ static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
 #define BOUNCE_POLL		(1<<3)
 static int bounces;
 
-#define TEST_ANON	1
-#define TEST_HUGETLB	2
-#define TEST_SHMEM	3
+#define TEST_ANON		1
+#define TEST_HUGETLB		2
+#define TEST_HUGETLB_HGM	3
+#define TEST_SHMEM		4
 static int test_type;
 
 /* exercise the test_uffdio_*_eexist every ALARM_INTERVAL_SECS */
@@ -85,6 +86,7 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
+static bool test_uffdio_copy = true;
 
 static bool map_shared;
 static int shm_fd;
@@ -140,12 +142,17 @@ static void usage(void)
 	fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
 		"[hugetlbfs_file]\n\n");
 	fprintf(stderr, "Supported <test type>: anon, hugetlb, "
-		"hugetlb_shared, shmem\n\n");
+		"hugetlb_shared, hugetlb_shared_hgm, shmem\n\n");
 	fprintf(stderr, "Examples:\n\n");
 	fprintf(stderr, "%s", examples);
 	exit(1);
 }
 
+static bool test_is_hugetlb(void)
+{
+	return test_type == TEST_HUGETLB || test_type == TEST_HUGETLB_HGM;
+}
+
 #define _err(fmt, ...)						\
 	do {							\
 		int ret = errno;				\
@@ -348,7 +355,7 @@ static struct uffd_test_ops *uffd_test_ops;
 
 static inline uint64_t uffd_minor_feature(void)
 {
-	if (test_type == TEST_HUGETLB && map_shared)
+	if (test_is_hugetlb() && map_shared)
 		return UFFD_FEATURE_MINOR_HUGETLBFS;
 	else if (test_type == TEST_SHMEM)
 		return UFFD_FEATURE_MINOR_SHMEM;
@@ -360,7 +367,7 @@ static uint64_t get_expected_ioctls(uint64_t mode)
 {
 	uint64_t ioctls = UFFD_API_RANGE_IOCTLS;
 
-	if (test_type == TEST_HUGETLB)
+	if (test_is_hugetlb())
 		ioctls &= ~(1 << _UFFDIO_ZEROPAGE);
 
 	if (!((mode & UFFDIO_REGISTER_MODE_WP) && test_uffdio_wp))
@@ -1116,6 +1123,12 @@ static int userfaultfd_events_test(void)
 	char c;
 	struct uffd_stats stats = { 0 };
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd events test "
+			"(test_uffdio_copy=false)\n");
+		return 0;
+	}
+
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
 
@@ -1169,6 +1182,12 @@ static int userfaultfd_sig_test(void)
 	char c;
 	struct uffd_stats stats = { 0 };
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd signal test "
+			"(test_uffdio_copy=false)\n");
+		return 0;
+	}
+
 	printf("testing signal delivery: ");
 	fflush(stdout);
 
@@ -1438,6 +1457,12 @@ static int userfaultfd_stress(void)
 	pthread_attr_init(&attr);
 	pthread_attr_setstacksize(&attr, 16*1024*1024);
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd stress test "
+			"(test_uffdio_copy=false)\n");
+		bounces = 0;
+	}
+
 	while (bounces--) {
 		printf("bounces: %d, mode:", bounces);
 		if (bounces & BOUNCE_RANDOM)
@@ -1598,6 +1623,13 @@ static void set_test_type(const char *type)
 		uffd_test_ops = &hugetlb_uffd_test_ops;
 		/* Minor faults require shared hugetlb; only enable here. */
 		test_uffdio_minor = true;
+	} else if (!strcmp(type, "hugetlb_shared_hgm")) {
+		map_shared = true;
+		test_type = TEST_HUGETLB_HGM;
+		uffd_test_ops = &hugetlb_uffd_test_ops;
+		/* Minor faults require shared hugetlb; only enable here. */
+		test_uffdio_minor = true;
+		test_uffdio_copy = false;
 	} else if (!strcmp(type, "shmem")) {
 		map_shared = true;
 		test_type = TEST_SHMEM;
@@ -1607,8 +1639,10 @@ static void set_test_type(const char *type)
 		err("Unknown test type: %s", type);
 	}
 
+	hpage_size = default_huge_page_size();
 	if (test_type == TEST_HUGETLB)
-		page_size = default_huge_page_size();
+		// TEST_HUGETLB_HGM gets small pages.
+		page_size = hpage_size;
 	else
 		page_size = sysconf(_SC_PAGE_SIZE);
 
@@ -1658,19 +1692,26 @@ int main(int argc, char **argv)
 	nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
 	nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size /
 		nr_cpus;
+	if (test_type == TEST_HUGETLB_HGM)
+		/*
+		 * `page_size` refers to the page_size we can use in
+		 * UFFDIO_CONTINUE. We still need nr_pages to be appropriately
+		 * aligned, so align it here.
+		 */
+		nr_pages_per_cpu -= nr_pages_per_cpu % (hpage_size / page_size);
 	if (!nr_pages_per_cpu) {
 		_err("invalid MiB");
 		usage();
 	}
+	nr_pages = nr_pages_per_cpu * nr_cpus;
 
 	bounces = atoi(argv[3]);
 	if (bounces <= 0) {
 		_err("invalid bounces");
 		usage();
 	}
-	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_HUGETLB && map_shared) {
+	if (test_is_hugetlb() && map_shared) {
 		if (argc < 5)
 			usage();
 		huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [RFC PATCH 26/26] selftests: add HugeTLB HGM to KVM demand paging selftest
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (24 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 25/26] selftests: add HugeTLB HGM to userfaultfd selftest James Houghton
@ 2022-06-24 17:36 ` James Houghton
  2022-06-24 18:29 ` [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping Matthew Wilcox
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-24 17:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, Dr . David Alan Gilbert, linux-mm,
	linux-kernel, James Houghton

This doesn't address collapsing yet, and it only works with the MINOR
mode (UFFDIO_CONTINUE).

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/kvm/include/test_util.h |  2 ++
 tools/testing/selftests/kvm/lib/kvm_util.c      |  2 +-
 tools/testing/selftests/kvm/lib/test_util.c     | 14 ++++++++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 99e0dcdc923f..6209e44981a7 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -87,6 +87,7 @@ enum vm_mem_backing_src_type {
 	VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
 	VM_MEM_SRC_SHMEM,
 	VM_MEM_SRC_SHARED_HUGETLB,
+	VM_MEM_SRC_SHARED_HUGETLB_HGM,
 	NUM_SRC_TYPES,
 };
 
@@ -105,6 +106,7 @@ size_t get_def_hugetlb_pagesz(void);
 const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
 size_t get_backing_src_pagesz(uint32_t i);
 bool is_backing_src_hugetlb(uint32_t i);
+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type);
 void backing_src_help(const char *flag);
 enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
 long get_run_delay(void);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 1665a220abcb..382f8fb75b7f 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -993,7 +993,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	region->fd = -1;
 	if (backing_src_is_shared(src_type))
 		region->fd = kvm_memfd_alloc(region->mmap_size,
-					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
+				is_backing_src_shared_hugetlb(src_type));
 
 	region->mmap_start = mmap(NULL, region->mmap_size,
 				  PROT_READ | PROT_WRITE,
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 6d23878bbfe1..710dc42077fe 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -254,6 +254,13 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
 			 */
 			.flag = MAP_SHARED,
 		},
+		[VM_MEM_SRC_SHARED_HUGETLB_HGM] = {
+			/*
+			 * Identical to shared_hugetlb except for the name.
+			 */
+			.name = "shared_hugetlb_hgm",
+			.flag = MAP_SHARED,
+		},
 	};
 	_Static_assert(ARRAY_SIZE(aliases) == NUM_SRC_TYPES,
 		       "Missing new backing src types?");
@@ -272,6 +279,7 @@ size_t get_backing_src_pagesz(uint32_t i)
 	switch (i) {
 	case VM_MEM_SRC_ANONYMOUS:
 	case VM_MEM_SRC_SHMEM:
+	case VM_MEM_SRC_SHARED_HUGETLB_HGM:
 		return getpagesize();
 	case VM_MEM_SRC_ANONYMOUS_THP:
 		return get_trans_hugepagesz();
@@ -288,6 +296,12 @@ bool is_backing_src_hugetlb(uint32_t i)
 	return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
 }
 
+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type)
+{
+	return src_type == VM_MEM_SRC_SHARED_HUGETLB ||
+		src_type == VM_MEM_SRC_SHARED_HUGETLB_HGM;
+}
+
 static void print_available_backing_src_types(const char *prefix)
 {
 	int i;
-- 
2.37.0.rc0.161.g10f37bed90-goog


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (25 preceding siblings ...)
  2022-06-24 17:36 ` [RFC PATCH 26/26] selftests: add HugeTLB HGM to KVM demand paging selftest James Houghton
@ 2022-06-24 18:29 ` Matthew Wilcox
  2022-06-27 16:36   ` James Houghton
  2022-06-24 18:41 ` Mina Almasry
  2022-06-24 18:47 ` Matthew Wilcox
  28 siblings, 1 reply; 123+ messages in thread
From: Matthew Wilcox @ 2022-06-24 18:29 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Manish Mishra, Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> [1] This used to be called HugeTLB double mapping, a bad and confusing
>     name. "High-granularity mapping" is not a great name either. I am open
>     to better names.

Oh good, I was grinding my teeth every time I read it ;-)

How does "Fine granularity" work for you?
"sub-page mapping" might work too.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (26 preceding siblings ...)
  2022-06-24 18:29 ` [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping Matthew Wilcox
@ 2022-06-24 18:41 ` Mina Almasry
  2022-06-27 16:27   ` James Houghton
  2022-06-24 18:47 ` Matthew Wilcox
  28 siblings, 1 reply; 123+ messages in thread
From: Mina Almasry @ 2022-06-24 18:41 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> This RFC introduces the concept of HugeTLB high-granularity mapping
> (HGM)[1].  In broad terms, this series teaches HugeTLB how to map HugeTLB
> pages at different granularities, and, more importantly, to partially map
> a HugeTLB page.  This cover letter will go over
>  - the motivation for these changes
>  - userspace API
>  - some of the changes to HugeTLB to make this work
>  - limitations & future enhancements
>
> High-granularity mapping does *not* involve dissolving the hugepages
> themselves; it only affects how they are mapped.
>
> ---- Motivation ----
>
> Being able to map HugeTLB memory with PAGE_SIZE PTEs has important use
> cases in post-copy live migration and memory failure handling.
>
> - Live Migration (userfaultfd)
> For post-copy live migration, using userfaultfd, currently we have to
> install an entire hugepage before we can allow a guest to access that page.
> This is because, right now, either the WHOLE hugepage is mapped or NONE of
> it is.  So either the guest can access the WHOLE hugepage or NONE of it.
> This makes post-copy live migration for 1G HugeTLB-backed VMs completely
> infeasible.
>
> With high-granularity mapping, we can map PAGE_SIZE pieces of a hugepage,
> thereby allowing the guest to access only PAGE_SIZE chunks, and getting
> page faults on the rest (and triggering another demand-fetch). This gives
> userspace the flexibility to install PAGE_SIZE chunks of memory into a
> hugepage, making migration of 1G-backed VMs perfectly feasible, and it
> vastly reduces the vCPU stall time during post-copy for 2M-backed VMs.
>
> At Google, for a 48 vCPU VM in post-copy, we can expect these approximate
> per-page median fetch latencies:
>      4K: <100us
>      2M: >10ms
> Being able to unpause a vCPU 100x quicker is helpful for guest stability,
> and being able to use 1G pages at all can significant improve steady-state
> guest performance.
>
> After fully copying a hugepage over the network, we will want to collapse
> the mapping down to what it would normally be (e.g., one PUD for a 1G
> page). Rather than having the kernel do this automatically, we leave it up
> to userspace to tell us to collapse a range (via MADV_COLLAPSE, co-opting
> the API that is being introduced for THPs[2]).
>
> - Memory Failure
> When a memory error is found within a HugeTLB page, it would be ideal if we
> could unmap only the PAGE_SIZE section that contained the error. This is
> what THPs are able to do. Using high-granularity mapping, we could do this,
> but this isn't tackled in this patch series.
>
> ---- Userspace API ----
>
> This patch series introduces a single way to take advantage of
> high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> userspace to resolve MINOR page faults on shared VMAs.
>
> To collapse a HugeTLB address range that has been mapped with several
> UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> userspace to know when all pages (that they care about) have been fetched.
>

Thanks James! Cover letter looks good. A few questions:

Why not have the kernel collapse the hugepage once all the 4K pages
have been fetched automatically? It would remove the need for a new
userspace API, and AFACT there aren't really any cases where it is
beneficial to have a hugepage sharded into 4K mappings when those
mappings can be collapsed.

> ---- HugeTLB Changes ----
>
> - Mapcount
> The way mapcount is handled is different from the way that it was handled
> before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> be increased. This scheme means that, for hugepages that aren't mapped at
> high granularity, their mapcounts will remain the same as what they would
> have been pre-HGM.
>

Sorry, I didn't quite follow this. It says mapcount is handled
differently, but the same if the page is not mapped at high
granularity. Can you elaborate on how the mapcount handling will be
different when the page is mapped at high granularity?

> - Page table walking and manipulation
> A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> high-granularity mappings. Eventually, it's possible to merge
> hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
>
> We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> This is because we generally need to know the "size" of a PTE (previously
> always just huge_page_size(hstate)).
>
> For every page table manipulation function that has a huge version (e.g.
> huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> PTE really is "huge".
>
> - Synchronization
> For existing bits of HugeTLB, synchronization is unchanged. For splitting
> and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> writing, and for doing high-granularity page table walks, we require it to
> be held for reading.
>
> ---- Limitations & Future Changes ----
>
> This patch series only implements high-granularity mapping for VM_SHARED
> VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> failure recovery for both shared and private mappings.
>
> The memory failure use case poses its own challenges that can be
> addressed, but I will do so in a separate RFC.
>
> Performance has not been heavily scrutinized with this patch series. There
> are places where lock contention can significantly reduce performance. This
> will be addressed later.
>
> The patch series, as it stands right now, is compatible with the VMEMMAP
> page struct optimization[3], as we do not need to modify data contained
> in the subpage page structs.
>
> Other omissions:
>  - Compatibility with userfaultfd write-protect (will be included in v1).
>  - Support for mremap() (will be included in v1). This looks a lot like
>    the support we have for fork().
>  - Documentation changes (will be included in v1).
>  - Completely ignores PMD sharing and hugepage migration (will be included
>    in v1).
>  - Implementations for architectures that don't use GENERAL_HUGETLB other
>    than arm64.
>
> ---- Patch Breakdown ----
>
> Patch 1     - Preliminary changes
> Patch 2-10  - HugeTLB HGM core changes
> Patch 11-13 - HugeTLB HGM page table walking functionality
> Patch 14-19 - HugeTLB HGM compatibility with other bits
> Patch 20-23 - Userfaultfd and collapse changes
> Patch 24-26 - arm64 support and selftests
>
> [1] This used to be called HugeTLB double mapping, a bad and confusing
>     name. "High-granularity mapping" is not a great name either. I am open
>     to better names.

I would drop 1 extra word and do "granular mapping", as in the mapping
is more granular than what it normally is (2MB/1G, etc).

> [2] https://lore.kernel.org/linux-mm/20220604004004.954674-10-zokeefe@google.com/
> [3] commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>
> James Houghton (26):
>   hugetlb: make hstate accessor functions const
>   hugetlb: sort hstates in hugetlb_init_hstates
>   hugetlb: add make_huge_pte_with_shift
>   hugetlb: make huge_pte_lockptr take an explicit shift argument.
>   hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
>   mm: make free_p?d_range functions public
>   hugetlb: add hugetlb_pte to track HugeTLB page table entries
>   hugetlb: add hugetlb_free_range to free PT structures
>   hugetlb: add hugetlb_hgm_enabled
>   hugetlb: add for_each_hgm_shift
>   hugetlb: add hugetlb_walk_to to do PT walks
>   hugetlb: add HugeTLB splitting functionality
>   hugetlb: add huge_pte_alloc_high_granularity
>   hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
>   hugetlb: make unmapping compatible with high-granularity mappings
>   hugetlb: make hugetlb_change_protection compatible with HGM
>   hugetlb: update follow_hugetlb_page to support HGM
>   hugetlb: use struct hugetlb_pte for walk_hugetlb_range
>   hugetlb: add HGM support for copy_hugetlb_page_range
>   hugetlb: add support for high-granularity UFFDIO_CONTINUE
>   hugetlb: add hugetlb_collapse
>   madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE
>   userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
>   arm64/hugetlb: add support for high-granularity mappings
>   selftests: add HugeTLB HGM to userfaultfd selftest
>   selftests: add HugeTLB HGM to KVM demand paging selftest
>
>  arch/arm64/Kconfig                            |   1 +
>  arch/arm64/mm/hugetlbpage.c                   |  63 ++
>  arch/powerpc/mm/pgtable.c                     |   3 +-
>  arch/s390/mm/gmap.c                           |   8 +-
>  fs/Kconfig                                    |   7 +
>  fs/proc/task_mmu.c                            |  35 +-
>  fs/userfaultfd.c                              |  10 +-
>  include/asm-generic/tlb.h                     |   6 +-
>  include/linux/hugetlb.h                       | 177 +++-
>  include/linux/mm.h                            |   7 +
>  include/linux/pagewalk.h                      |   3 +-
>  include/uapi/asm-generic/mman-common.h        |   2 +
>  include/uapi/linux/userfaultfd.h              |   2 +
>  mm/damon/vaddr.c                              |  34 +-
>  mm/hmm.c                                      |   7 +-
>  mm/hugetlb.c                                  | 987 +++++++++++++++---
>  mm/madvise.c                                  |  23 +
>  mm/memory.c                                   |   8 +-
>  mm/mempolicy.c                                |  11 +-
>  mm/migrate.c                                  |   3 +-
>  mm/mincore.c                                  |   4 +-
>  mm/mprotect.c                                 |   6 +-
>  mm/page_vma_mapped.c                          |   3 +-
>  mm/pagewalk.c                                 |  18 +-
>  mm/userfaultfd.c                              |  57 +-
>  .../testing/selftests/kvm/include/test_util.h |   2 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |   2 +-
>  tools/testing/selftests/kvm/lib/test_util.c   |  14 +
>  tools/testing/selftests/vm/userfaultfd.c      |  61 +-
>  29 files changed, 1314 insertions(+), 250 deletions(-)
>
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 01/26] hugetlb: make hstate accessor functions const
  2022-06-24 17:36 ` [RFC PATCH 01/26] hugetlb: make hstate accessor functions const James Houghton
@ 2022-06-24 18:43   ` Mina Almasry
       [not found]   ` <e55f90f5-ba14-5d6e-8f8f-abf731b9095e@nutanix.com>
  2022-06-29  6:18   ` Muchun Song
  2 siblings, 0 replies; 123+ messages in thread
From: Mina Almasry @ 2022-06-24 18:43 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> This is just a const-correctness change so that the new hugetlb_pte
> changes can be const-correct too.
>
> Acked-by: David Rientjes <rientjes@google.com>
>

Reviewed-By: Mina Almasry <almasrymina@google.com>

> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e4cff27d1198..498a4ae3d462 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -715,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
>         return hstate_file(vma->vm_file);
>  }
>
> -static inline unsigned long huge_page_size(struct hstate *h)
> +static inline unsigned long huge_page_size(const struct hstate *h)
>  {
>         return (unsigned long)PAGE_SIZE << h->order;
>  }
> @@ -729,27 +729,27 @@ static inline unsigned long huge_page_mask(struct hstate *h)
>         return h->mask;
>  }
>
> -static inline unsigned int huge_page_order(struct hstate *h)
> +static inline unsigned int huge_page_order(const struct hstate *h)
>  {
>         return h->order;
>  }
>
> -static inline unsigned huge_page_shift(struct hstate *h)
> +static inline unsigned huge_page_shift(const struct hstate *h)
>  {
>         return h->order + PAGE_SHIFT;
>  }
>
> -static inline bool hstate_is_gigantic(struct hstate *h)
> +static inline bool hstate_is_gigantic(const struct hstate *h)
>  {
>         return huge_page_order(h) >= MAX_ORDER;
>  }
>
> -static inline unsigned int pages_per_huge_page(struct hstate *h)
> +static inline unsigned int pages_per_huge_page(const struct hstate *h)
>  {
>         return 1 << h->order;
>  }
>
> -static inline unsigned int blocks_per_huge_page(struct hstate *h)
> +static inline unsigned int blocks_per_huge_page(const struct hstate *h)
>  {
>         return huge_page_size(h) / 512;
>  }
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
                   ` (27 preceding siblings ...)
  2022-06-24 18:41 ` Mina Almasry
@ 2022-06-24 18:47 ` Matthew Wilcox
  2022-06-27 16:48   ` James Houghton
  28 siblings, 1 reply; 123+ messages in thread
From: Matthew Wilcox @ 2022-06-24 18:47 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Manish Mishra, Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> - Page table walking and manipulation
> A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> high-granularity mappings. Eventually, it's possible to merge
> hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> 
> We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> This is because we generally need to know the "size" of a PTE (previously
> always just huge_page_size(hstate)).
> 
> For every page table manipulation function that has a huge version (e.g.
> huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> PTE really is "huge".

I'm disappointed to hear that page table walking is going to become even
more special.  I'd much prefer it if hugetlb walking were exactly the
same as THP walking.  This seems like a good time to do at least some
of that work.

Was there a reason you chose the "more complexity" direction?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-24 17:36 ` [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
@ 2022-06-24 18:51   ` Mina Almasry
  2022-06-27 12:08   ` manish.mishra
  2022-06-27 18:42   ` Mike Kravetz
  2 siblings, 0 replies; 123+ messages in thread
From: Mina Almasry @ 2022-06-24 18:51 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> When using HugeTLB high-granularity mapping, we need to go through the
> supported hugepage sizes in decreasing order so that we pick the largest
> size that works. Consider the case where we're faulting in a 1G hugepage
> for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> a PUD. By going through the sizes in decreasing order, we will find that
> PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
>

Mostly nits:

Reviewed-by: Mina Almasry <almasrymina@google.com>

> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a57e1be41401..5df838d86f32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -33,6 +33,7 @@
>  #include <linux/migrate.h>
>  #include <linux/nospec.h>
>  #include <linux/delayacct.h>
> +#include <linux/sort.h>
>
>  #include <asm/page.h>
>  #include <asm/pgalloc.h>
> @@ -48,6 +49,10 @@
>
>  int hugetlb_max_hstate __read_mostly;
>  unsigned int default_hstate_idx;
> +/*
> + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> + * to smallest.
> + */
>  struct hstate hstates[HUGE_MAX_HSTATE];
>
>  #ifdef CONFIG_CMA
> @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>         kfree(node_alloc_noretry);
>  }
>
> +static int compare_hstates_decreasing(const void *a, const void *b)
> +{
> +       const int shift_a = huge_page_shift((const struct hstate *)a);
> +       const int shift_b = huge_page_shift((const struct hstate *)b);
> +
> +       if (shift_a < shift_b)
> +               return 1;
> +       if (shift_a > shift_b)
> +               return -1;
> +       return 0;
> +}
> +
> +static void sort_hstates(void)

Maybe sort_hstates_descending(void) for extra clarity.

> +{
> +       unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> +
> +       /* Sort from largest to smallest. */

I'd remove this redundant comment; it's somewhat obvious what the next
line does.

> +       sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> +            compare_hstates_decreasing, NULL);
> +
> +       /*
> +        * We may have changed the location of the default hstate, so we need to
> +        * update it.
> +        */
> +       default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> +}
> +
>  static void __init hugetlb_init_hstates(void)
>  {
>         struct hstate *h, *h2;
>
> -       for_each_hstate(h) {
> -               if (minimum_order > huge_page_order(h))
> -                       minimum_order = huge_page_order(h);
> +       sort_hstates();
>
> +       /* The last hstate is now the smallest. */

Same, given that above is sort_hstates().

> +       minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> +
> +       for_each_hstate(h) {
>                 /* oversize hugepages were init'ed in early boot */
>                 if (!hstate_is_gigantic(h))
>                         hugetlb_hstate_alloc_pages(h);
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift
  2022-06-24 17:36 ` [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift James Houghton
@ 2022-06-24 19:01   ` Mina Almasry
  2022-06-27 12:13   ` manish.mishra
  1 sibling, 0 replies; 123+ messages in thread
From: Mina Almasry @ 2022-06-24 19:01 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> This allows us to make huge PTEs at shifts other than the hstate shift,
> which will be necessary for high-granularity mappings.
>

Can you elaborate on why?

> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 33 ++++++++++++++++++++-------------
>  1 file changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5df838d86f32..0eec34edf3b2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4686,23 +4686,30 @@ const struct vm_operations_struct hugetlb_vm_ops = {
>         .pagesize = hugetlb_vm_op_pagesize,
>  };
>
> +static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
> +                                     struct page *page, int writable,
> +                                     int shift)
> +{
> +       bool huge = shift > PAGE_SHIFT;
> +       pte_t entry = huge ? mk_huge_pte(page, vma->vm_page_prot)
> +                          : mk_pte(page, vma->vm_page_prot);
> +
> +       if (writable)
> +               entry = huge ? huge_pte_mkwrite(entry) : pte_mkwrite(entry);
> +       else
> +               entry = huge ? huge_pte_wrprotect(entry) : pte_wrprotect(entry);
> +       pte_mkyoung(entry);
> +       if (huge)
> +               entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> +       return entry;
> +}
> +
>  static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> -                               int writable)
> +                          int writable)

Looks like an unnecessary diff?

>  {
> -       pte_t entry;
>         unsigned int shift = huge_page_shift(hstate_vma(vma));
>
> -       if (writable) {
> -               entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
> -                                        vma->vm_page_prot)));

In this case there is an intermediate call to huge_pte_mkdirty() that
is not done in make_huge_pte_with_shift(). Why was this removed?

> -       } else {
> -               entry = huge_pte_wrprotect(mk_huge_pte(page,
> -                                          vma->vm_page_prot));
> -       }
> -       entry = pte_mkyoung(entry);
> -       entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> -
> -       return entry;
> +       return make_huge_pte_with_shift(vma, page, writable, shift);

I think this is marginally cleaner to calculate the shift inline:

  return make_huge_pte_with_shift(vma, page, writable,
huge_page_shift(hstate_vma(vma)));

>  }
>
>  static void set_huge_ptep_writable(struct vm_area_struct *vma,
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-24 17:36 ` [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
  2022-06-24 18:51   ` Mina Almasry
@ 2022-06-27 12:08   ` manish.mishra
  2022-06-28 15:35     ` James Houghton
  2022-06-27 18:42   ` Mike Kravetz
  2 siblings, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-06-27 12:08 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> When using HugeTLB high-granularity mapping, we need to go through the
> supported hugepage sizes in decreasing order so that we pick the largest
> size that works. Consider the case where we're faulting in a 1G hugepage
> for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> a PUD. By going through the sizes in decreasing order, we will find that
> PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
>   1 file changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a57e1be41401..5df838d86f32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -33,6 +33,7 @@
>   #include <linux/migrate.h>
>   #include <linux/nospec.h>
>   #include <linux/delayacct.h>
> +#include <linux/sort.h>
>   
>   #include <asm/page.h>
>   #include <asm/pgalloc.h>
> @@ -48,6 +49,10 @@
>   
>   int hugetlb_max_hstate __read_mostly;
>   unsigned int default_hstate_idx;
> +/*
> + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> + * to smallest.
> + */
>   struct hstate hstates[HUGE_MAX_HSTATE];
>   
>   #ifdef CONFIG_CMA
> @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>   	kfree(node_alloc_noretry);
>   }
>   
> +static int compare_hstates_decreasing(const void *a, const void *b)
> +{
> +	const int shift_a = huge_page_shift((const struct hstate *)a);
> +	const int shift_b = huge_page_shift((const struct hstate *)b);
> +
> +	if (shift_a < shift_b)
> +		return 1;
> +	if (shift_a > shift_b)
> +		return -1;
> +	return 0;
> +}
> +
> +static void sort_hstates(void)
> +{
> +	unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> +
> +	/* Sort from largest to smallest. */
> +	sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> +	     compare_hstates_decreasing, NULL);
> +
> +	/*
> +	 * We may have changed the location of the default hstate, so we need to
> +	 * update it.
> +	 */
> +	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> +}
> +
>   static void __init hugetlb_init_hstates(void)
>   {
>   	struct hstate *h, *h2;
>   
> -	for_each_hstate(h) {
> -		if (minimum_order > huge_page_order(h))
> -			minimum_order = huge_page_order(h);
> +	sort_hstates();
>   
> +	/* The last hstate is now the smallest. */
> +	minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> +
> +	for_each_hstate(h) {
>   		/* oversize hugepages were init'ed in early boot */
>   		if (!hstate_is_gigantic(h))
>   			hugetlb_hstate_alloc_pages(h);

As now hstates are ordered can code which does calculation of demot_order

can too be optimised, i mean it can be value of order of hstate at next index?



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift
  2022-06-24 17:36 ` [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift James Houghton
  2022-06-24 19:01   ` Mina Almasry
@ 2022-06-27 12:13   ` manish.mishra
  1 sibling, 0 replies; 123+ messages in thread
From: manish.mishra @ 2022-06-27 12:13 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This allows us to make huge PTEs at shifts other than the hstate shift,
> which will be necessary for high-granularity mappings.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   mm/hugetlb.c | 33 ++++++++++++++++++++-------------
>   1 file changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5df838d86f32..0eec34edf3b2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4686,23 +4686,30 @@ const struct vm_operations_struct hugetlb_vm_ops = {
>   	.pagesize = hugetlb_vm_op_pagesize,
>   };
reviewed-by: manish.mishra@nutanix.com
>   
> +static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
> +				      struct page *page, int writable,
> +				      int shift)
> +{
> +	bool huge = shift > PAGE_SHIFT;
> +	pte_t entry = huge ? mk_huge_pte(page, vma->vm_page_prot)
> +			   : mk_pte(page, vma->vm_page_prot);
> +
> +	if (writable)
> +		entry = huge ? huge_pte_mkwrite(entry) : pte_mkwrite(entry);
> +	else
> +		entry = huge ? huge_pte_wrprotect(entry) : pte_wrprotect(entry);
> +	pte_mkyoung(entry);
> +	if (huge)
> +		entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> +	return entry;
> +}
> +
>   static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> -				int writable)
> +			   int writable)
>   {
> -	pte_t entry;
>   	unsigned int shift = huge_page_shift(hstate_vma(vma));
>   
> -	if (writable) {
> -		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
> -					 vma->vm_page_prot)));
> -	} else {
> -		entry = huge_pte_wrprotect(mk_huge_pte(page,
> -					   vma->vm_page_prot));
> -	}
> -	entry = pte_mkyoung(entry);
> -	entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> -
> -	return entry;
> +	return make_huge_pte_with_shift(vma, page, writable, shift);
>   }
>   
>   static void set_huge_ptep_writable(struct vm_area_struct *vma,

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-24 17:36 ` [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
@ 2022-06-27 12:26   ` manish.mishra
  2022-06-27 20:51   ` Mike Kravetz
  1 sibling, 0 replies; 123+ messages in thread
From: manish.mishra @ 2022-06-27 12:26 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This is needed to handle PTL locking with high-granularity mapping. We
> won't always be using the PMD-level PTL even if we're using the 2M
> hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> case, we need to lock the PTL for the 4K PTE.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   arch/powerpc/mm/pgtable.c |  3 ++-
>   include/linux/hugetlb.h   | 19 ++++++++++++++-----
>   mm/hugetlb.c              |  9 +++++----
>   mm/migrate.c              |  3 ++-
>   mm/page_vma_mapped.c      |  3 ++-
>   5 files changed, 25 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index e6166b71d36d..663d591a8f08 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -261,7 +261,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>   
>   		psize = hstate_get_psize(h);
>   #ifdef CONFIG_DEBUG_VM
> -		assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep));
> +		assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),
> +						    vma->vm_mm, ptep));
>   #endif
>   
>   #else
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 498a4ae3d462..5fe1db46d8c9 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -868,12 +868,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>   	return modified_mask;
>   }
>   
> -static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> +static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
>   					   struct mm_struct *mm, pte_t *pte)
>   {
> -	if (huge_page_size(h) == PMD_SIZE)
> +	if (shift == PMD_SHIFT)
>   		return pmd_lockptr(mm, (pmd_t *) pte);
> -	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);

I may have wrong understanding here, is per pmd lock for reducing

contention, if that is the case should be take per pmd lock of

PAGE_SIZE too and page_table_lock for anything higer than PMD.

>   	return &mm->page_table_lock;
>   }
>   
> @@ -1076,7 +1075,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>   	return 0;
>   }
>   
> -static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> +static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
>   					   struct mm_struct *mm, pte_t *pte)
>   {
>   	return &mm->page_table_lock;
> @@ -1116,7 +1115,17 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
>   {
>   	spinlock_t *ptl;
>   
> -	ptl = huge_pte_lockptr(h, mm, pte);
> +	ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
> +	spin_lock(ptl);
> +	return ptl;
> +}
> +
> +static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
> +					      struct mm_struct *mm, pte_t *pte)
> +{
> +	spinlock_t *ptl;
> +
> +	ptl = huge_pte_lockptr(shift, mm, pte);
>   	spin_lock(ptl);
>   	return ptl;
>   }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0eec34edf3b2..d6d0d4c03def 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4817,7 +4817,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   			continue;
>   
>   		dst_ptl = huge_pte_lock(h, dst, dst_pte);
> -		src_ptl = huge_pte_lockptr(h, src, src_pte);
> +		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
>   		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>   		entry = huge_ptep_get(src_pte);
>   		dst_entry = huge_ptep_get(dst_pte);
> @@ -4894,7 +4894,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   
>   				/* Install the new huge page if src pte stable */
>   				dst_ptl = huge_pte_lock(h, dst, dst_pte);
> -				src_ptl = huge_pte_lockptr(h, src, src_pte);
> +				src_ptl = huge_pte_lockptr(huge_page_shift(h),
> +							   src, src_pte);
>   				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>   				entry = huge_ptep_get(src_pte);
>   				if (!pte_same(src_pte_old, entry)) {
> @@ -4948,7 +4949,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
>   	pte_t pte;
>   
>   	dst_ptl = huge_pte_lock(h, mm, dst_pte);
> -	src_ptl = huge_pte_lockptr(h, mm, src_pte);
> +	src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
>   
>   	/*
>   	 * We don't have to worry about the ordering of src and dst ptlocks
> @@ -6024,7 +6025,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>   		page_in_pagecache = true;
>   	}
>   
> -	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
> +	ptl = huge_pte_lockptr(huge_page_shift(h), dst_mm, dst_pte);
>   	spin_lock(ptl);
>   
>   	/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e51588e95f57..a8a960992373 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -318,7 +318,8 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>   void migration_entry_wait_huge(struct vm_area_struct *vma,
>   		struct mm_struct *mm, pte_t *pte)
>   {
> -	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, pte);
> +	spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
> +					   mm, pte);
>   	__migration_entry_wait(mm, pte, ptl);
>   }
>   
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index c10f839fc410..8921dd4e41b1 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -174,7 +174,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   		if (!pvmw->pte)
>   			return false;
>   
> -		pvmw->ptl = huge_pte_lockptr(hstate, mm, pvmw->pte);
> +		pvmw->ptl = huge_pte_lockptr(huge_page_shift(hstate),
> +					     mm, pvmw->pte);
>   		spin_lock(pvmw->ptl);
>   		if (!check_pte(pvmw))
>   			return not_found(pvmw);

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  2022-06-24 17:36 ` [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
@ 2022-06-27 12:28   ` manish.mishra
  2022-06-28 20:03     ` Mina Almasry
  0 siblings, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-06-27 12:28 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This adds the Kconfig to enable or disable high-granularity mapping. It
> is enabled by default for architectures that use
> ARCH_WANT_GENERAL_HUGETLB.
>
> There is also an arch-specific config ARCH_HAS_SPECIAL_HUGETLB_HGM which
> controls whether or not the architecture has been updated to support
> HGM if it doesn't use general HugeTLB.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
reviewed-by:manish.mishra@nutanix.com
> ---
>   fs/Kconfig | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 5976eb33535f..d76c7d812656 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -268,6 +268,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
>   	  to enable optimizing vmemmap pages of HugeTLB by default. It can then
>   	  be disabled on the command line via hugetlb_free_vmemmap=off.
>   
> +config ARCH_HAS_SPECIAL_HUGETLB_HGM
> +	bool
> +
> +config HUGETLB_HIGH_GRANULARITY_MAPPING
> +	def_bool ARCH_WANT_GENERAL_HUGETLB || ARCH_HAS_SPECIAL_HUGETLB_HGM
> +	depends on HUGETLB_PAGE
> +
>   config MEMFD_CREATE
>   	def_bool TMPFS || HUGETLBFS
>   

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 06/26] mm: make free_p?d_range functions public
  2022-06-24 17:36 ` [RFC PATCH 06/26] mm: make free_p?d_range functions public James Houghton
@ 2022-06-27 12:31   ` manish.mishra
  2022-06-28 20:35   ` Mike Kravetz
  1 sibling, 0 replies; 123+ messages in thread
From: manish.mishra @ 2022-06-27 12:31 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This makes them usable for HugeTLB page table freeing operations.
> After HugeTLB high-granularity mapping, the page table for a HugeTLB VMA
> can get more complex, and these functions handle freeing page tables
> generally.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
reviewed-by: manish.mishra@nutanix.com
> ---
>   include/linux/mm.h | 7 +++++++
>   mm/memory.c        | 8 ++++----
>   2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc8f326be0ce..07f5da512147 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1847,6 +1847,13 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>   
>   struct mmu_notifier_range;
>   
> +void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr);
> +void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, unsigned long addr,
> +		unsigned long end, unsigned long floor, unsigned long ceiling);
> +void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, unsigned long addr,
> +		unsigned long end, unsigned long floor, unsigned long ceiling);
> +void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long addr,
> +		unsigned long end, unsigned long floor, unsigned long ceiling);
>   void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>   		unsigned long end, unsigned long floor, unsigned long ceiling);
>   int
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a089145cad4..bb3b9b5b94fb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -227,7 +227,7 @@ static void check_sync_rss_stat(struct task_struct *task)
>    * Note: this doesn't free the actual pages themselves. That
>    * has been handled earlier when unmapping all the memory regions.
>    */
> -static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
> +void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
>   			   unsigned long addr)
>   {
>   	pgtable_t token = pmd_pgtable(*pmd);
> @@ -236,7 +236,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
>   	mm_dec_nr_ptes(tlb->mm);
>   }
>   
> -static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> +inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
>   				unsigned long addr, unsigned long end,
>   				unsigned long floor, unsigned long ceiling)
>   {
> @@ -270,7 +270,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
>   	mm_dec_nr_pmds(tlb->mm);
>   }
>   
> -static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> +inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>   				unsigned long addr, unsigned long end,
>   				unsigned long floor, unsigned long ceiling)
>   {
> @@ -304,7 +304,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>   	mm_dec_nr_puds(tlb->mm);
>   }
>   
> -static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
> +inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
>   				unsigned long addr, unsigned long end,
>   				unsigned long floor, unsigned long ceiling)
>   {

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
@ 2022-06-27 12:47   ` manish.mishra
  2022-06-29 16:28     ` James Houghton
  2022-06-28 20:25   ` Mina Almasry
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-06-27 12:47 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.
>
> Without this, we'd have to pass around the "size" of the PTE everywhere.
> We effectively did this before; it could be fetched from the hstate,
> which we pass around pretty much everywhere.
>
> This commit includes definitions for some basic helper functions that
> are used later. These helper functions wrap existing PTE
> inspection/modification functions, where the correct version is picked
> depending on if the HugeTLB PTE is actually "huge" or not. (Previously,
> all HugeTLB PTEs were "huge").
>
> For example, hugetlb_ptep_get wraps huge_ptep_get and ptep_get, where
> ptep_get is used when the HugeTLB PTE is PAGE_SIZE, and huge_ptep_get is
> used in all other cases.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   include/linux/hugetlb.h | 84 +++++++++++++++++++++++++++++++++++++++++
>   mm/hugetlb.c            | 57 ++++++++++++++++++++++++++++
>   2 files changed, 141 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 5fe1db46d8c9..1d4ec9dfdebf 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -46,6 +46,68 @@ enum {
>   	__NR_USED_SUBPAGE,
>   };
>   
> +struct hugetlb_pte {
> +	pte_t *ptep;
> +	unsigned int shift;
> +};
> +
> +static inline
> +void hugetlb_pte_init(struct hugetlb_pte *hpte)
> +{
> +	hpte->ptep = NULL;
I agree it does not matter but still will hpte->shift = 0 too be better?
> +}
> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> +			  unsigned int shift)
> +{
> +	BUG_ON(!ptep);
> +	hpte->ptep = ptep;
> +	hpte->shift = shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +	BUG_ON(!hpte->ptep);
> +	return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +	BUG_ON(!hpte->ptep);
> +	return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> +	BUG_ON(!hpte->ptep);
> +	return hpte->shift;
> +}
> +
> +static inline
> +bool hugetlb_pte_huge(const struct hugetlb_pte *hpte)
> +{
> +	return !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) ||
> +		hugetlb_pte_shift(hpte) > PAGE_SHIFT;
> +}
> +
> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> +	dest->ptep = src->ptep;
> +	dest->shift = src->shift;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte);
> +bool hugetlb_pte_none(const struct hugetlb_pte *hpte);
> +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
> +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
> +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> +		       unsigned long address);
> +
>   struct hugepage_subpool {
>   	spinlock_t lock;
>   	long count;
> @@ -1130,6 +1192,28 @@ static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
>   	return ptl;
>   }
>   
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +	BUG_ON(!hpte->ptep);
> +	// Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
> +	// the regular page table lock.
> +	if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
> +		return huge_pte_lockptr(hugetlb_pte_shift(hpte),
> +				mm, hpte->ptep);
> +	return &mm->page_table_lock;
> +}
> +
> +static inline
> +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +
> +	spin_lock(ptl);
> +	return ptl;
> +}
> +
>   #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
>   extern void __init hugetlb_cma_reserve(int order);
>   extern void __init hugetlb_cma_check(void);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d6d0d4c03def..1a1434e29740 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1120,6 +1120,63 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>   	return false;
>   }
>   
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
> +{
> +	pgd_t pgd;
> +	p4d_t p4d;
> +	pud_t pud;
> +	pmd_t pmd;
> +
> +	BUG_ON(!hpte->ptep);
> +	if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
> +		pgd = *(pgd_t *)hpte->ptep;

sorry did not understand in these conditions why

hugetlb_pte_size(hpte) >= PGDIR_SIZE. I mean why >= check

and not just == check?

> +		return pgd_present(pgd) && pgd_leaf(pgd);
> +	} else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> +		p4d = *(p4d_t *)hpte->ptep;
> +		return p4d_present(p4d) && p4d_leaf(p4d);
> +	} else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> +		pud = *(pud_t *)hpte->ptep;
> +		return pud_present(pud) && pud_leaf(pud);
> +	} else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> +		pmd = *(pmd_t *)hpte->ptep;
> +		return pmd_present(pmd) && pmd_leaf(pmd);
> +	} else if (hugetlb_pte_size(hpte) >= PAGE_SIZE)
> +		return pte_present(*hpte->ptep);
> +	BUG();
> +}
> +
> +bool hugetlb_pte_none(const struct hugetlb_pte *hpte)
> +{
> +	if (hugetlb_pte_huge(hpte))
> +		return huge_pte_none(huge_ptep_get(hpte->ptep));
> +	return pte_none(ptep_get(hpte->ptep));
> +}
> +
> +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte)
> +{
> +	if (hugetlb_pte_huge(hpte))
> +		return huge_pte_none_mostly(huge_ptep_get(hpte->ptep));
> +	return pte_none_mostly(ptep_get(hpte->ptep));
> +}
> +
> +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte)
> +{
> +	if (hugetlb_pte_huge(hpte))
> +		return huge_ptep_get(hpte->ptep);
> +	return ptep_get(hpte->ptep);
> +}
> +
> +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> +		       unsigned long address)
> +{
> +	BUG_ON(!hpte->ptep);
> +	unsigned long sz = hugetlb_pte_size(hpte);
> +
> +	if (sz > PAGE_SIZE)
> +		return huge_pte_clear(mm, address, hpte->ptep, sz);

just for cosistency something like above?

if (hugetlb_pte_huge(hpte))
+		return huge_pte_clear
;

> +	return pte_clear(mm, address, hpte->ptep);
> +}
> +
>   static void enqueue_huge_page(struct hstate *h, struct page *page)
>   {
>   	int nid = page_to_nid(page);

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures
  2022-06-24 17:36 ` [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures James Houghton
@ 2022-06-27 12:52   ` manish.mishra
  2022-06-28 20:27   ` Mina Almasry
  1 sibling, 0 replies; 123+ messages in thread
From: manish.mishra @ 2022-06-27 12:52 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This is a helper function for freeing the bits of the page table that
> map a particular HugeTLB PTE.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   include/linux/hugetlb.h |  2 ++
>   mm/hugetlb.c            | 17 +++++++++++++++++
>   2 files changed, 19 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1d4ec9dfdebf..33ba48fac551 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -107,6 +107,8 @@ bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
>   pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
>   void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
>   		       unsigned long address);
> +void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
> +			unsigned long start, unsigned long end);
>   
>   struct hugepage_subpool {
>   	spinlock_t lock;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1a1434e29740..a2d2ffa76173 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1120,6 +1120,23 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>   	return false;
>   }
>   
> +void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
> +			unsigned long start, unsigned long end)
> +{
> +	unsigned long floor = start & hugetlb_pte_mask(hpte);
> +	unsigned long ceiling = floor + hugetlb_pte_size(hpte);
> +
> +	if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {

sorry again did not understand why it is >= check and not just ==, does it help

in non-x86 arches.

> +		free_p4d_range(tlb, (pgd_t *)hpte->ptep, start, end, floor, ceiling);
> +	} else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> +		free_pud_range(tlb, (p4d_t *)hpte->ptep, start, end, floor, ceiling);
> +	} else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> +		free_pmd_range(tlb, (pud_t *)hpte->ptep, start, end, floor, ceiling);
> +	} else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> +		free_pte_range(tlb, (pmd_t *)hpte->ptep, start);
> +	}
> +}
> +
>   bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
>   {
>   	pgd_t pgd;

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift
  2022-06-24 17:36 ` [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift James Houghton
@ 2022-06-27 13:01   ` manish.mishra
  2022-06-28 21:58   ` Mina Almasry
  1 sibling, 0 replies; 123+ messages in thread
From: manish.mishra @ 2022-06-27 13:01 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This is a helper macro to loop through all the usable page sizes for a
> high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> loop, in descending order, through the page sizes that HugeTLB supports
> for this architecture; it always includes PAGE_SIZE.
reviewed-by:manish.mishra@nutanix.com
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   mm/hugetlb.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8b10b941458d..557b0afdb503 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
>   	/* All shared VMAs have HGM enabled. */
>   	return vma->vm_flags & VM_SHARED;
>   }
> +static unsigned int __shift_for_hstate(struct hstate *h)
> +{
> +	if (h >= &hstates[hugetlb_max_hstate])
> +		return PAGE_SHIFT;
> +	return huge_page_shift(h);
> +}
> +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> +	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> +			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> +			       (tmp_h)++)
>   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>   
>   /*

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks
  2022-06-24 17:36 ` [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks James Houghton
@ 2022-06-27 13:07   ` manish.mishra
  2022-07-07 23:03     ` Mike Kravetz
  2022-09-08 18:20   ` Peter Xu
  1 sibling, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-06-27 13:07 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This adds it for architectures that use GENERAL_HUGETLB, including x86.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   include/linux/hugetlb.h |  2 ++
>   mm/hugetlb.c            | 45 +++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 47 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e7a6b944d0cc..605aa19d8572 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -258,6 +258,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>   			unsigned long addr, unsigned long sz);
>   pte_t *huge_pte_offset(struct mm_struct *mm,
>   		       unsigned long addr, unsigned long sz);
> +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		    unsigned long addr, unsigned long sz, bool stop_at_none);
>   int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>   				unsigned long *addr, pte_t *ptep);
>   void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 557b0afdb503..3ec2a921ee6f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6981,6 +6981,51 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>   	return (pte_t *)pmd;
>   }


not strong feeling but this name looks confusing to me as it does

not only walk over page-tables but can also alloc.

> +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		    unsigned long addr, unsigned long sz, bool stop_at_none)
> +{
> +	pte_t *ptep;
> +
> +	if (!hpte->ptep) {
> +		pgd_t *pgd = pgd_offset(mm, addr);
> +
> +		if (!pgd)
> +			return -ENOMEM;
> +		ptep = (pte_t *)p4d_alloc(mm, pgd, addr);
> +		if (!ptep)
> +			return -ENOMEM;
> +		hugetlb_pte_populate(hpte, ptep, P4D_SHIFT);
> +	}
> +
> +	while (hugetlb_pte_size(hpte) > sz &&
> +			!hugetlb_pte_present_leaf(hpte) &&
> +			!(stop_at_none && hugetlb_pte_none(hpte))) {

Should this ordering of if-else condition be in reverse, i mean it will look

more natural and possibly less condition checks as we go from top to bottom.

> +		if (hpte->shift == PMD_SHIFT) {
> +			ptep = pte_alloc_map(mm, (pmd_t *)hpte->ptep, addr);
> +			if (!ptep)
> +				return -ENOMEM;
> +			hpte->shift = PAGE_SHIFT;
> +			hpte->ptep = ptep;
> +		} else if (hpte->shift == PUD_SHIFT) {
> +			ptep = (pte_t *)pmd_alloc(mm, (pud_t *)hpte->ptep,
> +						  addr);
> +			if (!ptep)
> +				return -ENOMEM;
> +			hpte->shift = PMD_SHIFT;
> +			hpte->ptep = ptep;
> +		} else if (hpte->shift == P4D_SHIFT) {
> +			ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep,
> +						  addr);
> +			if (!ptep)
> +				return -ENOMEM;
> +			hpte->shift = PUD_SHIFT;
> +			hpte->ptep = ptep;
> +		} else
> +			BUG();
> +	}
> +	return 0;
> +}
> +
>   #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>   
>   #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality
  2022-06-24 17:36 ` [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality James Houghton
@ 2022-06-27 13:50   ` manish.mishra
  2022-06-29 16:10     ` James Houghton
  2022-06-29 14:33   ` manish.mishra
  1 sibling, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-06-27 13:50 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> The new function, hugetlb_split_to_shift, will optimally split the page
> table to map a particular address at a particular granularity.
>
> This is useful for punching a hole in the mapping and for mapping small
> sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 122 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3ec2a921ee6f..eaffe7b4f67c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
>   /* Forward declaration */
>   static int hugetlb_acct_memory(struct hstate *h, long delta);
>   
> +/*
> + * Find the subpage that corresponds to `addr` in `hpage`.
> + */
> +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> +				 unsigned long addr)
> +{
> +	size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> +
> +	BUG_ON(idx >= pages_per_huge_page(h));
> +	return &hpage[idx];
> +}
> +
>   static inline bool subpool_is_free(struct hugepage_subpool *spool)
>   {
>   	if (spool->count)
> @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
>   	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
>   			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
>   			       (tmp_h)++)
> +
> +/*
> + * Given a particular address, split the HugeTLB PTE that currently maps it
> + * so that, for the given address, the PTE that maps it is `desired_shift`.
> + * This function will always split the HugeTLB PTE optimally.
> + *
> + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
> + * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
> + * these changes to the page table:
> + * 1. The PUD will be split into 2M PMDs.
> + * 2. The first PMD will be split again into 4K PTEs.
> + */
> +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> +			   const struct hugetlb_pte *hpte,
> +			   unsigned long addr, unsigned long desired_shift)
> +{
> +	unsigned long start, end, curr;
> +	unsigned long desired_sz = 1UL << desired_shift;
> +	struct hstate *h = hstate_vma(vma);
> +	int ret;
> +	struct hugetlb_pte new_hpte;
> +	struct mmu_notifier_range range;
> +	struct page *hpage = NULL;
> +	struct page *subpage;
> +	pte_t old_entry;
> +	struct mmu_gather tlb;
> +
> +	BUG_ON(!hpte->ptep);
> +	BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
can it be BUG_ON(hugetlb_pte_size(hpte) <= desired_sz)
> +
> +	start = addr & hugetlb_pte_mask(hpte);
> +	end = start + hugetlb_pte_size(hpte);
> +
> +	i_mmap_assert_write_locked(vma->vm_file->f_mapping);

As it is just changing mappings is holding f_mapping required? I mean in future

is it any paln or way to use some per process level sub-lock?

> +
> +	BUG_ON(!hpte->ptep);
> +	/* This function only works if we are looking at a leaf-level PTE. */
> +	BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> +
> +	/*
> +	 * Clear the PTE so that we will allocate the PT structures when
> +	 * walking the page table.
> +	 */
> +	old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
> +
> +	if (!huge_pte_none(old_entry))
> +		hpage = pte_page(old_entry);
> +
> +	BUG_ON(!IS_ALIGNED(start, desired_sz));
> +	BUG_ON(!IS_ALIGNED(end, desired_sz));
> +
> +	for (curr = start; curr < end;) {
> +		struct hstate *tmp_h;
> +		unsigned int shift;
> +
> +		for_each_hgm_shift(h, tmp_h, shift) {
> +			unsigned long sz = 1UL << shift;
> +
> +			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> +				continue;
> +			/*
> +			 * If we are including `addr`, we need to make sure
> +			 * splitting down to the correct size. Go to a smaller
> +			 * size if we are not.
> +			 */
> +			if (curr <= addr && curr + sz > addr &&
> +					shift > desired_shift)
> +				continue;
> +
> +			/*
> +			 * Continue the page table walk to the level we want,
> +			 * allocate PT structures as we go.
> +			 */

As i understand this for_each_hgm_shift loop is just to find right size of shift,

then code below this line can be put out of loop, no strong feeling but it looks

more proper may make code easier to understand.

> +			hugetlb_pte_copy(&new_hpte, hpte);
> +			ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> +					      /*stop_at_none=*/false);
> +			if (ret)
> +				goto err;
> +			BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> +			if (hpage) {
> +				pte_t new_entry;
> +
> +				subpage = hugetlb_find_subpage(h, hpage, curr);
> +				new_entry = make_huge_pte_with_shift(vma, subpage,
> +								     huge_pte_write(old_entry),
> +								     shift);
> +				set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> +			}
> +			curr += sz;
> +			goto next;
> +		}
> +		/* We couldn't find a size that worked. */
> +		BUG();
> +next:
> +		continue;
> +	}
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				start, end);
> +	mmu_notifier_invalidate_range_start(&range);

sorry did not understand where tlb flush will be taken care in case of success?

I see set_huge_pte_at does not do it internally by self.

> +	return 0;
> +err:
> +	tlb_gather_mmu(&tlb, mm);
> +	/* Free any newly allocated page table entries. */
> +	hugetlb_free_range(&tlb, hpte, start, curr);
> +	/* Restore the old entry. */
> +	set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> +	tlb_finish_mmu(&tlb);
> +	return ret;
> +}
>   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>   
>   /*

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-24 18:41 ` Mina Almasry
@ 2022-06-27 16:27   ` James Houghton
  2022-06-28 14:17     ` Muchun Song
  2022-06-28 17:26     ` Mina Almasry
  0 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-27 16:27 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> >
> > [trimmed...]
> > ---- Userspace API ----
> >
> > This patch series introduces a single way to take advantage of
> > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > userspace to resolve MINOR page faults on shared VMAs.
> >
> > To collapse a HugeTLB address range that has been mapped with several
> > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > userspace to know when all pages (that they care about) have been fetched.
> >
>
> Thanks James! Cover letter looks good. A few questions:
>
> Why not have the kernel collapse the hugepage once all the 4K pages
> have been fetched automatically? It would remove the need for a new
> userspace API, and AFACT there aren't really any cases where it is
> beneficial to have a hugepage sharded into 4K mappings when those
> mappings can be collapsed.

The reason that we don't automatically collapse mappings is because it
would take additional complexity, and it is less flexible. Consider
the case of 1G pages on x86: currently, userspace can collapse the
whole page when it's all ready, but they can also choose to collapse a
2M piece of it. On architectures with more supported hugepage sizes
(e.g., arm64), userspace has even more possibilities for when to
collapse. This likely further complicates a potential
automatic-collapse solution. Userspace may also want to collapse the
mapping for an entire hugepage without completely mapping the hugepage
first (this would also be possible by issuing UFFDIO_CONTINUE on all
the holes, though).

>
> > ---- HugeTLB Changes ----
> >
> > - Mapcount
> > The way mapcount is handled is different from the way that it was handled
> > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > be increased. This scheme means that, for hugepages that aren't mapped at
> > high granularity, their mapcounts will remain the same as what they would
> > have been pre-HGM.
> >
>
> Sorry, I didn't quite follow this. It says mapcount is handled
> differently, but the same if the page is not mapped at high
> granularity. Can you elaborate on how the mapcount handling will be
> different when the page is mapped at high granularity?

I guess I didn't phrase this very well. For the sake of simplicity,
consider 1G pages on x86, typically mapped with leaf-level PUDs.
Previously, there were two possibilities for how a hugepage was
mapped, either it was (1) completely mapped (PUD is present and a
leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
case, where the PUD is not none but also not a leaf (this usually
means that the page is partially mapped). We handle this case as if
the whole page was mapped. That is, if we partially map a hugepage
that was previously unmapped (making the PUD point to PMDs), we
increment its mapcount, and if we completely unmap a partially mapped
hugepage (making the PUD none), we decrement its mapcount. If we
collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.

It is possible for a PUD to be present and not a leaf (mapcount has
been incremented) but for the page to still be unmapped: if the PMDs
(or PTEs) underneath are all none. This case is atypical, and as of
this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
think it would be very difficult to get this to happen.

>
> > - Page table walking and manipulation
> > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > high-granularity mappings. Eventually, it's possible to merge
> > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> >
> > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > This is because we generally need to know the "size" of a PTE (previously
> > always just huge_page_size(hstate)).
> >
> > For every page table manipulation function that has a huge version (e.g.
> > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > PTE really is "huge".
> >
> > - Synchronization
> > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > writing, and for doing high-granularity page table walks, we require it to
> > be held for reading.
> >
> > ---- Limitations & Future Changes ----
> >
> > This patch series only implements high-granularity mapping for VM_SHARED
> > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > failure recovery for both shared and private mappings.
> >
> > The memory failure use case poses its own challenges that can be
> > addressed, but I will do so in a separate RFC.
> >
> > Performance has not been heavily scrutinized with this patch series. There
> > are places where lock contention can significantly reduce performance. This
> > will be addressed later.
> >
> > The patch series, as it stands right now, is compatible with the VMEMMAP
> > page struct optimization[3], as we do not need to modify data contained
> > in the subpage page structs.
> >
> > Other omissions:
> >  - Compatibility with userfaultfd write-protect (will be included in v1).
> >  - Support for mremap() (will be included in v1). This looks a lot like
> >    the support we have for fork().
> >  - Documentation changes (will be included in v1).
> >  - Completely ignores PMD sharing and hugepage migration (will be included
> >    in v1).
> >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> >    than arm64.
> >
> > ---- Patch Breakdown ----
> >
> > Patch 1     - Preliminary changes
> > Patch 2-10  - HugeTLB HGM core changes
> > Patch 11-13 - HugeTLB HGM page table walking functionality
> > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > Patch 20-23 - Userfaultfd and collapse changes
> > Patch 24-26 - arm64 support and selftests
> >
> > [1] This used to be called HugeTLB double mapping, a bad and confusing
> >     name. "High-granularity mapping" is not a great name either. I am open
> >     to better names.
>
> I would drop 1 extra word and do "granular mapping", as in the mapping
> is more granular than what it normally is (2MB/1G, etc).

Noted. :)

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-24 18:29 ` [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping Matthew Wilcox
@ 2022-06-27 16:36   ` James Houghton
  2022-06-27 17:56     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-27 16:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Manish Mishra, Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > [1] This used to be called HugeTLB double mapping, a bad and confusing
> >     name. "High-granularity mapping" is not a great name either. I am open
> >     to better names.
>
> Oh good, I was grinding my teeth every time I read it ;-)
>
> How does "Fine granularity" work for you?
> "sub-page mapping" might work too.

"Granularity", as I've come to realize, is hard to say, so I think I
prefer sub-page mapping. :) So to recap the suggestions I have so far:

1. Sub-page mapping
2. Granular mapping
3. Flexible mapping

I'll pick one of these (or maybe some other one that works better) for
the next version of this series.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-24 18:47 ` Matthew Wilcox
@ 2022-06-27 16:48   ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-27 16:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Manish Mishra, Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 11:47 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > - Page table walking and manipulation
> > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > high-granularity mappings. Eventually, it's possible to merge
> > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> >
> > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > This is because we generally need to know the "size" of a PTE (previously
> > always just huge_page_size(hstate)).
> >
> > For every page table manipulation function that has a huge version (e.g.
> > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > PTE really is "huge".
>
> I'm disappointed to hear that page table walking is going to become even
> more special.  I'd much prefer it if hugetlb walking were exactly the
> same as THP walking.  This seems like a good time to do at least some
> of that work.
>
> Was there a reason you chose the "more complexity" direction?

I chose this direction because it seemed to be the most
straightforward to get to a working prototype and then to an RFC. I
agree with your sentiment -- I'll see what I can do to reconcile THP
walking with HugeTLB(+HGM) walking.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-27 16:36   ` James Houghton
@ 2022-06-27 17:56     ` Dr. David Alan Gilbert
  2022-06-27 20:31       ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: Dr. David Alan Gilbert @ 2022-06-27 17:56 UTC (permalink / raw)
  To: James Houghton
  Cc: Matthew Wilcox, Mike Kravetz, Muchun Song, Peter Xu,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, linux-mm, linux-kernel

* James Houghton (jthoughton@google.com) wrote:
> On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > >     name. "High-granularity mapping" is not a great name either. I am open
> > >     to better names.
> >
> > Oh good, I was grinding my teeth every time I read it ;-)
> >
> > How does "Fine granularity" work for you?
> > "sub-page mapping" might work too.
> 
> "Granularity", as I've come to realize, is hard to say, so I think I
> prefer sub-page mapping. :) So to recap the suggestions I have so far:
> 
> 1. Sub-page mapping
> 2. Granular mapping
> 3. Flexible mapping
> 
> I'll pick one of these (or maybe some other one that works better) for
> the next version of this series.

<shrug> Just a name; SPM might work (although may confuse those
architectures which had subprotection for normal pages), and at least
we can mispronounce it.

In 14/26 your commit message says:

  1. Faults can be passed to handle_userfault. (Userspace will want to
     use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
     region they should be call UFFDIO_CONTINUE on later.)

can you explain what that new UFFD_FEATURE does?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-24 17:36 ` [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
  2022-06-24 18:51   ` Mina Almasry
  2022-06-27 12:08   ` manish.mishra
@ 2022-06-27 18:42   ` Mike Kravetz
  2022-06-28 15:40     ` James Houghton
  2 siblings, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-06-27 18:42 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/24/22 17:36, James Houghton wrote:
> When using HugeTLB high-granularity mapping, we need to go through the
> supported hugepage sizes in decreasing order so that we pick the largest
> size that works. Consider the case where we're faulting in a 1G hugepage
> for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> a PUD. By going through the sizes in decreasing order, we will find that
> PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a57e1be41401..5df838d86f32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -33,6 +33,7 @@
>  #include <linux/migrate.h>
>  #include <linux/nospec.h>
>  #include <linux/delayacct.h>
> +#include <linux/sort.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgalloc.h>
> @@ -48,6 +49,10 @@
>  
>  int hugetlb_max_hstate __read_mostly;
>  unsigned int default_hstate_idx;
> +/*
> + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> + * to smallest.
> + */
>  struct hstate hstates[HUGE_MAX_HSTATE];
>  
>  #ifdef CONFIG_CMA
> @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  	kfree(node_alloc_noretry);
>  }
>  
> +static int compare_hstates_decreasing(const void *a, const void *b)
> +{
> +	const int shift_a = huge_page_shift((const struct hstate *)a);
> +	const int shift_b = huge_page_shift((const struct hstate *)b);
> +
> +	if (shift_a < shift_b)
> +		return 1;
> +	if (shift_a > shift_b)
> +		return -1;
> +	return 0;
> +}
> +
> +static void sort_hstates(void)
> +{
> +	unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> +
> +	/* Sort from largest to smallest. */
> +	sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> +	     compare_hstates_decreasing, NULL);
> +
> +	/*
> +	 * We may have changed the location of the default hstate, so we need to
> +	 * update it.
> +	 */
> +	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> +}
> +
>  static void __init hugetlb_init_hstates(void)
>  {
>  	struct hstate *h, *h2;
>  
> -	for_each_hstate(h) {
> -		if (minimum_order > huge_page_order(h))
> -			minimum_order = huge_page_order(h);
> +	sort_hstates();
>  
> +	/* The last hstate is now the smallest. */
> +	minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> +
> +	for_each_hstate(h) {
>  		/* oversize hugepages were init'ed in early boot */
>  		if (!hstate_is_gigantic(h))
>  			hugetlb_hstate_alloc_pages(h);

This may/will cause problems for gigantic hugetlb pages allocated at boot
time.  See alloc_bootmem_huge_page() where a pointer to the associated hstate
is encoded within the allocated hugetlb page.  These pages are added to
hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
hstate to add prep the gigantic page and add to the correct pool.  Currently,
gather_bootmem_prealloc is called after hugetlb_init_hstates.  So, changing
hstate order will cause errors.

I do not see any reason why we could not call gather_bootmem_prealloc before
hugetlb_init_hstates to avoid this issue.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-27 17:56     ` Dr. David Alan Gilbert
@ 2022-06-27 20:31       ` James Houghton
  2022-06-28  0:04         ` Nadav Amit
  2022-06-28  8:20         ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-27 20:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Matthew Wilcox, Mike Kravetz, Muchun Song, Peter Xu,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, linux-mm, linux-kernel, Nadav Amit

On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * James Houghton (jthoughton@google.com) wrote:
> > On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > >     to better names.
> > >
> > > Oh good, I was grinding my teeth every time I read it ;-)
> > >
> > > How does "Fine granularity" work for you?
> > > "sub-page mapping" might work too.
> >
> > "Granularity", as I've come to realize, is hard to say, so I think I
> > prefer sub-page mapping. :) So to recap the suggestions I have so far:
> >
> > 1. Sub-page mapping
> > 2. Granular mapping
> > 3. Flexible mapping
> >
> > I'll pick one of these (or maybe some other one that works better) for
> > the next version of this series.
>
> <shrug> Just a name; SPM might work (although may confuse those
> architectures which had subprotection for normal pages), and at least
> we can mispronounce it.
>
> In 14/26 your commit message says:
>
>   1. Faults can be passed to handle_userfault. (Userspace will want to
>      use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
>      region they should be call UFFDIO_CONTINUE on later.)
>
> can you explain what that new UFFD_FEATURE does?

+cc Nadav Amit <namit@vmware.com> to check me here.

Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
feature, and it actually isn't needed (I will correct the commit
message). Why it isn't needed is a little bit complicated, though. Let
me explain:

Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
userfaultfd gave userspace for HugeTLB pages was rounded down to be
hstate-size-aligned. This would have had to change, because userspace,
to take advantage of HGM, needs to know which 4K piece to install.

However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
address was rounded down to be PAGE_SIZE-aligned instead, even if the
flag wasn't used. I think this was an unintended change. If the flag
is used, then the address isn't rounded at all -- that was the
intended purpose of this flag. Hope that makes sense.

The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
userspace that high-granularity CONTINUEs are available.

[1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")


>
> Dave
>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-24 17:36 ` [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
  2022-06-27 12:26   ` manish.mishra
@ 2022-06-27 20:51   ` Mike Kravetz
  2022-06-28 15:29     ` James Houghton
  2022-06-29  6:09     ` Muchun Song
  1 sibling, 2 replies; 123+ messages in thread
From: Mike Kravetz @ 2022-06-27 20:51 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/24/22 17:36, James Houghton wrote:
> This is needed to handle PTL locking with high-granularity mapping. We
> won't always be using the PMD-level PTL even if we're using the 2M
> hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> case, we need to lock the PTL for the 4K PTE.

I'm not really sure why this would be required.
Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
with less contention than using the more coarse mm lock.  

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-27 20:31       ` James Houghton
@ 2022-06-28  0:04         ` Nadav Amit
  2022-06-30 19:21           ` Peter Xu
  2022-06-28  8:20         ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 123+ messages in thread
From: Nadav Amit @ 2022-06-28  0:04 UTC (permalink / raw)
  To: James Houghton
  Cc: Dr. David Alan Gilbert, Matthew Wilcox, Mike Kravetz,
	Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra, linux-mm,
	linux-kernel



> On Jun 27, 2022, at 1:31 PM, James Houghton <jthoughton@google.com> wrote:
> 
> ⚠ External Email
> 
> On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
>> 
>> * James Houghton (jthoughton@google.com) wrote:
>>> On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
>>>> 
>>>> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
>>>>> [1] This used to be called HugeTLB double mapping, a bad and confusing
>>>>> name. "High-granularity mapping" is not a great name either. I am open
>>>>> to better names.
>>>> 
>>>> Oh good, I was grinding my teeth every time I read it ;-)
>>>> 
>>>> How does "Fine granularity" work for you?
>>>> "sub-page mapping" might work too.
>>> 
>>> "Granularity", as I've come to realize, is hard to say, so I think I
>>> prefer sub-page mapping. :) So to recap the suggestions I have so far:
>>> 
>>> 1. Sub-page mapping
>>> 2. Granular mapping
>>> 3. Flexible mapping
>>> 
>>> I'll pick one of these (or maybe some other one that works better) for
>>> the next version of this series.
>> 
>> <shrug> Just a name; SPM might work (although may confuse those
>> architectures which had subprotection for normal pages), and at least
>> we can mispronounce it.
>> 
>> In 14/26 your commit message says:
>> 
>> 1. Faults can be passed to handle_userfault. (Userspace will want to
>> use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
>> region they should be call UFFDIO_CONTINUE on later.)
>> 
>> can you explain what that new UFFD_FEATURE does?
> 
> +cc Nadav Amit <namit@vmware.com> to check me here.
> 
> Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
> feature, and it actually isn't needed (I will correct the commit
> message). Why it isn't needed is a little bit complicated, though. Let
> me explain:
> 
> Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
> userfaultfd gave userspace for HugeTLB pages was rounded down to be
> hstate-size-aligned. This would have had to change, because userspace,
> to take advantage of HGM, needs to know which 4K piece to install.
> 
> However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
> address was rounded down to be PAGE_SIZE-aligned instead, even if the
> flag wasn't used. I think this was an unintended change. If the flag
> is used, then the address isn't rounded at all -- that was the
> intended purpose of this flag. Hope that makes sense.
> 
> The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
> userspace that high-granularity CONTINUEs are available.
> 
> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")

Indeed this change of behavior (not aligning to huge-pages when flags is
not set) was unintentional. If you want to fix it in a separate patch so
it would be backported, that may be a good idea.

For the record, there was a short period of time in 2016 when the exact
fault address was delivered even when UFFD_FEATURE_EXACT_ADDRESS was not
provided. We had some arguments whether this was a regression...

BTW: I should have thought on the use-case of knowing the exact address
in huge-pages. It would have shorten my discussions with Andrea on whether
this feature (UFFD_FEATURE_EXACT_ADDRESS) is needed. :)


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-27 20:31       ` James Houghton
  2022-06-28  0:04         ` Nadav Amit
@ 2022-06-28  8:20         ` Dr. David Alan Gilbert
  2022-06-30 16:09           ` Peter Xu
  1 sibling, 1 reply; 123+ messages in thread
From: Dr. David Alan Gilbert @ 2022-06-28  8:20 UTC (permalink / raw)
  To: James Houghton
  Cc: Matthew Wilcox, Mike Kravetz, Muchun Song, Peter Xu,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, linux-mm, linux-kernel, Nadav Amit

* James Houghton (jthoughton@google.com) wrote:
> On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * James Houghton (jthoughton@google.com) wrote:
> > > On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > > >     to better names.
> > > >
> > > > Oh good, I was grinding my teeth every time I read it ;-)
> > > >
> > > > How does "Fine granularity" work for you?
> > > > "sub-page mapping" might work too.
> > >
> > > "Granularity", as I've come to realize, is hard to say, so I think I
> > > prefer sub-page mapping. :) So to recap the suggestions I have so far:
> > >
> > > 1. Sub-page mapping
> > > 2. Granular mapping
> > > 3. Flexible mapping
> > >
> > > I'll pick one of these (or maybe some other one that works better) for
> > > the next version of this series.
> >
> > <shrug> Just a name; SPM might work (although may confuse those
> > architectures which had subprotection for normal pages), and at least
> > we can mispronounce it.
> >
> > In 14/26 your commit message says:
> >
> >   1. Faults can be passed to handle_userfault. (Userspace will want to
> >      use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
> >      region they should be call UFFDIO_CONTINUE on later.)
> >
> > can you explain what that new UFFD_FEATURE does?
> 
> +cc Nadav Amit <namit@vmware.com> to check me here.
> 
> Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
> feature, and it actually isn't needed (I will correct the commit
> message). Why it isn't needed is a little bit complicated, though. Let
> me explain:
> 
> Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
> userfaultfd gave userspace for HugeTLB pages was rounded down to be
> hstate-size-aligned. This would have had to change, because userspace,
> to take advantage of HGM, needs to know which 4K piece to install.
> 
> However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
> address was rounded down to be PAGE_SIZE-aligned instead, even if the
> flag wasn't used. I think this was an unintended change. If the flag
> is used, then the address isn't rounded at all -- that was the
> intended purpose of this flag. Hope that makes sense.

Oh that's 'fun'; right but the need for the less-rounded address makes
sense.

One other thing I thought of; you provide the modified 'CONTINUE'
behaviour, which works for postcopy as long as you use two mappings in
userspace; one protected by userfault, and one which you do the writes
to, and then issue the CONTINUE into the protected mapping; that's fine,
but it's not currently how we have our postcopy code wired up in qemu,
we have one mapping and use UFFDIO_COPY to place the page.
Requiring the two mappings is fine, but it's probably worth pointing out
the need for it somewhere.

Dave

> The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
> userspace that high-granularity CONTINUEs are available.
> 
> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
> 
> 
> >
> > Dave
> >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-27 16:27   ` James Houghton
@ 2022-06-28 14:17     ` Muchun Song
  2022-06-28 17:26     ` Mina Almasry
  1 sibling, 0 replies; 123+ messages in thread
From: Muchun Song @ 2022-06-28 14:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Mina Almasry, Mike Kravetz, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 09:27:38AM -0700, James Houghton wrote:
> On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > [trimmed...]
> > > ---- Userspace API ----
> > >
> > > This patch series introduces a single way to take advantage of
> > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > userspace to resolve MINOR page faults on shared VMAs.
> > >
> > > To collapse a HugeTLB address range that has been mapped with several
> > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > userspace to know when all pages (that they care about) have been fetched.
> > >
> >
> > Thanks James! Cover letter looks good. A few questions:
> >
> > Why not have the kernel collapse the hugepage once all the 4K pages
> > have been fetched automatically? It would remove the need for a new
> > userspace API, and AFACT there aren't really any cases where it is
> > beneficial to have a hugepage sharded into 4K mappings when those
> > mappings can be collapsed.
> 
> The reason that we don't automatically collapse mappings is because it
> would take additional complexity, and it is less flexible. Consider
> the case of 1G pages on x86: currently, userspace can collapse the
> whole page when it's all ready, but they can also choose to collapse a
> 2M piece of it. On architectures with more supported hugepage sizes
> (e.g., arm64), userspace has even more possibilities for when to
> collapse. This likely further complicates a potential
> automatic-collapse solution. Userspace may also want to collapse the
> mapping for an entire hugepage without completely mapping the hugepage
> first (this would also be possible by issuing UFFDIO_CONTINUE on all
> the holes, though).
> 
> >
> > > ---- HugeTLB Changes ----
> > >
> > > - Mapcount
> > > The way mapcount is handled is different from the way that it was handled
> > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > high granularity, their mapcounts will remain the same as what they would
> > > have been pre-HGM.
> > >
> >
> > Sorry, I didn't quite follow this. It says mapcount is handled

+1

> > differently, but the same if the page is not mapped at high
> > granularity. Can you elaborate on how the mapcount handling will be
> > different when the page is mapped at high granularity?
> 
> I guess I didn't phrase this very well. For the sake of simplicity,
> consider 1G pages on x86, typically mapped with leaf-level PUDs.
> Previously, there were two possibilities for how a hugepage was
> mapped, either it was (1) completely mapped (PUD is present and a
> leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> case, where the PUD is not none but also not a leaf (this usually
> means that the page is partially mapped). We handle this case as if
> the whole page was mapped. That is, if we partially map a hugepage
> that was previously unmapped (making the PUD point to PMDs), we
> increment its mapcount, and if we completely unmap a partially mapped
> hugepage (making the PUD none), we decrement its mapcount. If we
> collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> 
> It is possible for a PUD to be present and not a leaf (mapcount has
> been incremented) but for the page to still be unmapped: if the PMDs
> (or PTEs) underneath are all none. This case is atypical, and as of
> this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> think it would be very difficult to get this to happen.
> 

It is a good explanation. I think it is better to go to cover letter.

Thanks.

> >
> > > - Page table walking and manipulation
> > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > high-granularity mappings. Eventually, it's possible to merge
> > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > >
> > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > This is because we generally need to know the "size" of a PTE (previously
> > > always just huge_page_size(hstate)).
> > >
> > > For every page table manipulation function that has a huge version (e.g.
> > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > PTE really is "huge".
> > >
> > > - Synchronization
> > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > writing, and for doing high-granularity page table walks, we require it to
> > > be held for reading.
> > >
> > > ---- Limitations & Future Changes ----
> > >
> > > This patch series only implements high-granularity mapping for VM_SHARED
> > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > failure recovery for both shared and private mappings.
> > >
> > > The memory failure use case poses its own challenges that can be
> > > addressed, but I will do so in a separate RFC.
> > >
> > > Performance has not been heavily scrutinized with this patch series. There
> > > are places where lock contention can significantly reduce performance. This
> > > will be addressed later.
> > >
> > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > page struct optimization[3], as we do not need to modify data contained
> > > in the subpage page structs.
> > >
> > > Other omissions:
> > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > >  - Support for mremap() (will be included in v1). This looks a lot like
> > >    the support we have for fork().
> > >  - Documentation changes (will be included in v1).
> > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > >    in v1).
> > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > >    than arm64.
> > >
> > > ---- Patch Breakdown ----
> > >
> > > Patch 1     - Preliminary changes
> > > Patch 2-10  - HugeTLB HGM core changes
> > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > Patch 20-23 - Userfaultfd and collapse changes
> > > Patch 24-26 - arm64 support and selftests
> > >
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > >     name. "High-granularity mapping" is not a great name either. I am open
> > >     to better names.
> >
> > I would drop 1 extra word and do "granular mapping", as in the mapping
> > is more granular than what it normally is (2MB/1G, etc).
> 
> Noted. :)
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-27 20:51   ` Mike Kravetz
@ 2022-06-28 15:29     ` James Houghton
  2022-06-29  6:09     ` Muchun Song
  1 sibling, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-28 15:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 1:52 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 06/24/22 17:36, James Houghton wrote:
> > This is needed to handle PTL locking with high-granularity mapping. We
> > won't always be using the PMD-level PTL even if we're using the 2M
> > hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> > case, we need to lock the PTL for the 4K PTE.
>
> I'm not really sure why this would be required.
> Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
> with less contention than using the more coarse mm lock.

I should be using the PMD level lock for 4K PTEs, yeah. I'll work this
into the next version of the series. Thanks both.

>
> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-27 12:08   ` manish.mishra
@ 2022-06-28 15:35     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-28 15:35 UTC (permalink / raw)
  To: manish.mishra
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 5:09 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > When using HugeTLB high-granularity mapping, we need to go through the
> > supported hugepage sizes in decreasing order so that we pick the largest
> > size that works. Consider the case where we're faulting in a 1G hugepage
> > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > a PUD. By going through the sizes in decreasing order, we will find that
> > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> >   1 file changed, 37 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a57e1be41401..5df838d86f32 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -33,6 +33,7 @@
> >   #include <linux/migrate.h>
> >   #include <linux/nospec.h>
> >   #include <linux/delayacct.h>
> > +#include <linux/sort.h>
> >
> >   #include <asm/page.h>
> >   #include <asm/pgalloc.h>
> > @@ -48,6 +49,10 @@
> >
> >   int hugetlb_max_hstate __read_mostly;
> >   unsigned int default_hstate_idx;
> > +/*
> > + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> > + * to smallest.
> > + */
> >   struct hstate hstates[HUGE_MAX_HSTATE];
> >
> >   #ifdef CONFIG_CMA
> > @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >       kfree(node_alloc_noretry);
> >   }
> >
> > +static int compare_hstates_decreasing(const void *a, const void *b)
> > +{
> > +     const int shift_a = huge_page_shift((const struct hstate *)a);
> > +     const int shift_b = huge_page_shift((const struct hstate *)b);
> > +
> > +     if (shift_a < shift_b)
> > +             return 1;
> > +     if (shift_a > shift_b)
> > +             return -1;
> > +     return 0;
> > +}
> > +
> > +static void sort_hstates(void)
> > +{
> > +     unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> > +
> > +     /* Sort from largest to smallest. */
> > +     sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> > +          compare_hstates_decreasing, NULL);
> > +
> > +     /*
> > +      * We may have changed the location of the default hstate, so we need to
> > +      * update it.
> > +      */
> > +     default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> > +}
> > +
> >   static void __init hugetlb_init_hstates(void)
> >   {
> >       struct hstate *h, *h2;
> >
> > -     for_each_hstate(h) {
> > -             if (minimum_order > huge_page_order(h))
> > -                     minimum_order = huge_page_order(h);
> > +     sort_hstates();
> >
> > +     /* The last hstate is now the smallest. */
> > +     minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> > +
> > +     for_each_hstate(h) {
> >               /* oversize hugepages were init'ed in early boot */
> >               if (!hstate_is_gigantic(h))
> >                       hugetlb_hstate_alloc_pages(h);
>
> As now hstates are ordered can code which does calculation of demot_order
>
> can too be optimised, i mean it can be value of order of hstate at next index?
>

Indeed -- thanks for catching that. I'll make this optimization for
the next version of this series.

>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-27 18:42   ` Mike Kravetz
@ 2022-06-28 15:40     ` James Houghton
  2022-06-29  6:39       ` Muchun Song
  0 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-28 15:40 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 11:42 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 06/24/22 17:36, James Houghton wrote:
> > When using HugeTLB high-granularity mapping, we need to go through the
> > supported hugepage sizes in decreasing order so that we pick the largest
> > size that works. Consider the case where we're faulting in a 1G hugepage
> > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > a PUD. By going through the sizes in decreasing order, we will find that
> > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 37 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a57e1be41401..5df838d86f32 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/migrate.h>
> >  #include <linux/nospec.h>
> >  #include <linux/delayacct.h>
> > +#include <linux/sort.h>
> >
> >  #include <asm/page.h>
> >  #include <asm/pgalloc.h>
> > @@ -48,6 +49,10 @@
> >
> >  int hugetlb_max_hstate __read_mostly;
> >  unsigned int default_hstate_idx;
> > +/*
> > + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> > + * to smallest.
> > + */
> >  struct hstate hstates[HUGE_MAX_HSTATE];
> >
> >  #ifdef CONFIG_CMA
> > @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >       kfree(node_alloc_noretry);
> >  }
> >
> > +static int compare_hstates_decreasing(const void *a, const void *b)
> > +{
> > +     const int shift_a = huge_page_shift((const struct hstate *)a);
> > +     const int shift_b = huge_page_shift((const struct hstate *)b);
> > +
> > +     if (shift_a < shift_b)
> > +             return 1;
> > +     if (shift_a > shift_b)
> > +             return -1;
> > +     return 0;
> > +}
> > +
> > +static void sort_hstates(void)
> > +{
> > +     unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> > +
> > +     /* Sort from largest to smallest. */
> > +     sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> > +          compare_hstates_decreasing, NULL);
> > +
> > +     /*
> > +      * We may have changed the location of the default hstate, so we need to
> > +      * update it.
> > +      */
> > +     default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> > +}
> > +
> >  static void __init hugetlb_init_hstates(void)
> >  {
> >       struct hstate *h, *h2;
> >
> > -     for_each_hstate(h) {
> > -             if (minimum_order > huge_page_order(h))
> > -                     minimum_order = huge_page_order(h);
> > +     sort_hstates();
> >
> > +     /* The last hstate is now the smallest. */
> > +     minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> > +
> > +     for_each_hstate(h) {
> >               /* oversize hugepages were init'ed in early boot */
> >               if (!hstate_is_gigantic(h))
> >                       hugetlb_hstate_alloc_pages(h);
>
> This may/will cause problems for gigantic hugetlb pages allocated at boot
> time.  See alloc_bootmem_huge_page() where a pointer to the associated hstate
> is encoded within the allocated hugetlb page.  These pages are added to
> hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
> hstate to add prep the gigantic page and add to the correct pool.  Currently,
> gather_bootmem_prealloc is called after hugetlb_init_hstates.  So, changing
> hstate order will cause errors.
>
> I do not see any reason why we could not call gather_bootmem_prealloc before
> hugetlb_init_hstates to avoid this issue.

Thanks for catching this, Mike. Your suggestion certainly seems to
work, but it also seems kind of error prone. I'll have to look at the
code more closely, but maybe it would be better if I just maintained a
separate `struct hstate *sorted_hstate_ptrs[]`, where the original
locations of the hstates remain unchanged, as to not break
gather_bootmem_prealloc/other things.

> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 01/26] hugetlb: make hstate accessor functions const
       [not found]     ` <bb903be9-546d-04a7-e9e4-f5ba313319de@nutanix.com>
@ 2022-06-28 17:08       ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-28 17:08 UTC (permalink / raw)
  To: manish.mishra
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 5:09 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 27/06/22 5:06 pm, manish.mishra wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
>
> This is just a const-correctness change so that the new hugetlb_pte
> changes can be const-correct too.
>
> Acked-by: David Rientjes <rientjes@google.com>
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e4cff27d1198..498a4ae3d462 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -715,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
>   return hstate_file(vma->vm_file);
>  }
>
> -static inline unsigned long huge_page_size(struct hstate *h)
> +static inline unsigned long huge_page_size(const struct hstate *h)
>  {
>   return (unsigned long)PAGE_SIZE << h->order;
>  }
> @@ -729,27 +729,27 @@ static inline unsigned long huge_page_mask(struct hstate *h)
>   return h->mask;
>  }
>
> -static inline unsigned int huge_page_order(struct hstate *h)
> +static inline unsigned int huge_page_order(const struct hstate *h)
>  {
>   return h->order;
>  }
>
> -static inline unsigned huge_page_shift(struct hstate *h)
> +static inline unsigned huge_page_shift(const struct hstate *h)
>  {
>   return h->order + PAGE_SHIFT;
>  }
>
> -static inline bool hstate_is_gigantic(struct hstate *h)
> +static inline bool hstate_is_gigantic(const struct hstate *h)
>  {
>   return huge_page_order(h) >= MAX_ORDER;
>  }
>
> -static inline unsigned int pages_per_huge_page(struct hstate *h)
> +static inline unsigned int pages_per_huge_page(const struct hstate *h)
>  {
>   return 1 << h->order;
>  }
>
> -static inline unsigned int blocks_per_huge_page(struct hstate *h)
> +static inline unsigned int blocks_per_huge_page(const struct hstate *h)
>  {
>   return huge_page_size(h) / 512;
>  }
>
> James, Just wanted to check why you did it selectively only for these functions
>
> why not for something like hstate_index which too i see used in your code.

I'll look into which other functions can be made const. We need
huge_page_shift() to take `const struct hstate *h` so that the hstates
can be sorted, and it then followed to make the surrounding, related
functions const as well. I could also just leave it at
huge_page_shift().

The commit message here is wrong -- the hugetlb_pte const-correctness
is a separate issue that doesn't depend the constness of hstates. I'll
fix that -- sorry about that.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-27 16:27   ` James Houghton
  2022-06-28 14:17     ` Muchun Song
@ 2022-06-28 17:26     ` Mina Almasry
  2022-06-28 17:56       ` Dr. David Alan Gilbert
  2022-06-29 20:39       ` Axel Rasmussen
  1 sibling, 2 replies; 123+ messages in thread
From: Mina Almasry @ 2022-06-28 17:26 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > [trimmed...]
> > > ---- Userspace API ----
> > >
> > > This patch series introduces a single way to take advantage of
> > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > userspace to resolve MINOR page faults on shared VMAs.
> > >
> > > To collapse a HugeTLB address range that has been mapped with several
> > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > userspace to know when all pages (that they care about) have been fetched.
> > >
> >
> > Thanks James! Cover letter looks good. A few questions:
> >
> > Why not have the kernel collapse the hugepage once all the 4K pages
> > have been fetched automatically? It would remove the need for a new
> > userspace API, and AFACT there aren't really any cases where it is
> > beneficial to have a hugepage sharded into 4K mappings when those
> > mappings can be collapsed.
>
> The reason that we don't automatically collapse mappings is because it
> would take additional complexity, and it is less flexible. Consider
> the case of 1G pages on x86: currently, userspace can collapse the
> whole page when it's all ready, but they can also choose to collapse a
> 2M piece of it. On architectures with more supported hugepage sizes
> (e.g., arm64), userspace has even more possibilities for when to
> collapse. This likely further complicates a potential
> automatic-collapse solution. Userspace may also want to collapse the
> mapping for an entire hugepage without completely mapping the hugepage
> first (this would also be possible by issuing UFFDIO_CONTINUE on all
> the holes, though).
>

To be honest I'm don't think I'm a fan of this. I don't think this
saves complexity, but rather pushes it to the userspace. I.e. the
userspace now must track which regions are faulted in and which are
not to call MADV_COLLAPSE at the right time. Also, if the userspace
gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
any hugepages) or call MADV_COLLAPSE too early and have to deal with a
storm of maybe hundreds of minor faults at once which may take too
long to resolve and may impact guest stability, yes?

For these reasons I think automatic collapsing is something that will
eventually be implemented by us or someone else, and at that point
MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
is adding a userspace API that will probably need to be maintained for
perpetuity but actually is likely going to be going obsolete "soon".
For this reason I had hoped that automatic collapsing would come with
V1.

I wonder if we can have a very simple first try at automatic
collapsing for V1? I.e., can we support collapsing to the hstate size
and only that? So 4K pages can only be either collapsed to 2MB or 1G
on x86 depending on the hstate size. I think this may be not too
difficult to implement: we can have a counter similar to mapcount that
tracks how many of the subpages are mapped (subpage_mapcount). Once
all the subpages are mapped (the counter reaches a certain value),
trigger collapsing similar to hstate size MADV_COLLAPSE.

I gather that no one else reviewing this has raised this issue thus
far so it might not be a big deal and I will continue to review the
RFC, but I had hoped for automatic collapsing myself for the reasons
above.

> >
> > > ---- HugeTLB Changes ----
> > >
> > > - Mapcount
> > > The way mapcount is handled is different from the way that it was handled
> > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > high granularity, their mapcounts will remain the same as what they would
> > > have been pre-HGM.
> > >
> >
> > Sorry, I didn't quite follow this. It says mapcount is handled
> > differently, but the same if the page is not mapped at high
> > granularity. Can you elaborate on how the mapcount handling will be
> > different when the page is mapped at high granularity?
>
> I guess I didn't phrase this very well. For the sake of simplicity,
> consider 1G pages on x86, typically mapped with leaf-level PUDs.
> Previously, there were two possibilities for how a hugepage was
> mapped, either it was (1) completely mapped (PUD is present and a
> leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> case, where the PUD is not none but also not a leaf (this usually
> means that the page is partially mapped). We handle this case as if
> the whole page was mapped. That is, if we partially map a hugepage
> that was previously unmapped (making the PUD point to PMDs), we
> increment its mapcount, and if we completely unmap a partially mapped
> hugepage (making the PUD none), we decrement its mapcount. If we
> collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
>
> It is possible for a PUD to be present and not a leaf (mapcount has
> been incremented) but for the page to still be unmapped: if the PMDs
> (or PTEs) underneath are all none. This case is atypical, and as of
> this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> think it would be very difficult to get this to happen.
>

Thank you for the detailed explanation. Please add it to the cover letter.

I wonder the case "PUD present but all the PMD are none": is that a
bug? I don't understand the usefulness of that. Not a comment on this
patch but rather a curiosity.

> >
> > > - Page table walking and manipulation
> > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > high-granularity mappings. Eventually, it's possible to merge
> > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > >
> > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > This is because we generally need to know the "size" of a PTE (previously
> > > always just huge_page_size(hstate)).
> > >
> > > For every page table manipulation function that has a huge version (e.g.
> > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > PTE really is "huge".
> > >
> > > - Synchronization
> > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > writing, and for doing high-granularity page table walks, we require it to
> > > be held for reading.
> > >
> > > ---- Limitations & Future Changes ----
> > >
> > > This patch series only implements high-granularity mapping for VM_SHARED
> > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > failure recovery for both shared and private mappings.
> > >
> > > The memory failure use case poses its own challenges that can be
> > > addressed, but I will do so in a separate RFC.
> > >
> > > Performance has not been heavily scrutinized with this patch series. There
> > > are places where lock contention can significantly reduce performance. This
> > > will be addressed later.
> > >
> > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > page struct optimization[3], as we do not need to modify data contained
> > > in the subpage page structs.
> > >
> > > Other omissions:
> > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > >  - Support for mremap() (will be included in v1). This looks a lot like
> > >    the support we have for fork().
> > >  - Documentation changes (will be included in v1).
> > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > >    in v1).
> > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > >    than arm64.
> > >
> > > ---- Patch Breakdown ----
> > >
> > > Patch 1     - Preliminary changes
> > > Patch 2-10  - HugeTLB HGM core changes
> > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > Patch 20-23 - Userfaultfd and collapse changes
> > > Patch 24-26 - arm64 support and selftests
> > >
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > >     name. "High-granularity mapping" is not a great name either. I am open
> > >     to better names.
> >
> > I would drop 1 extra word and do "granular mapping", as in the mapping
> > is more granular than what it normally is (2MB/1G, etc).
>
> Noted. :)

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-28 17:26     ` Mina Almasry
@ 2022-06-28 17:56       ` Dr. David Alan Gilbert
  2022-06-29 18:31         ` James Houghton
  2022-06-29 20:39       ` Axel Rasmussen
  1 sibling, 1 reply; 123+ messages in thread
From: Dr. David Alan Gilbert @ 2022-06-28 17:56 UTC (permalink / raw)
  To: Mina Almasry
  Cc: James Houghton, Mike Kravetz, Muchun Song, Peter Xu,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Jue Wang,
	Manish Mishra, linux-mm, linux-kernel

* Mina Almasry (almasrymina@google.com) wrote:
> On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > > >
> > > > [trimmed...]
> > > > ---- Userspace API ----
> > > >
> > > > This patch series introduces a single way to take advantage of
> > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > userspace to resolve MINOR page faults on shared VMAs.
> > > >
> > > > To collapse a HugeTLB address range that has been mapped with several
> > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > userspace to know when all pages (that they care about) have been fetched.
> > > >
> > >
> > > Thanks James! Cover letter looks good. A few questions:
> > >
> > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > have been fetched automatically? It would remove the need for a new
> > > userspace API, and AFACT there aren't really any cases where it is
> > > beneficial to have a hugepage sharded into 4K mappings when those
> > > mappings can be collapsed.
> >
> > The reason that we don't automatically collapse mappings is because it
> > would take additional complexity, and it is less flexible. Consider
> > the case of 1G pages on x86: currently, userspace can collapse the
> > whole page when it's all ready, but they can also choose to collapse a
> > 2M piece of it. On architectures with more supported hugepage sizes
> > (e.g., arm64), userspace has even more possibilities for when to
> > collapse. This likely further complicates a potential
> > automatic-collapse solution. Userspace may also want to collapse the
> > mapping for an entire hugepage without completely mapping the hugepage
> > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > the holes, though).
> >
> 
> To be honest I'm don't think I'm a fan of this. I don't think this
> saves complexity, but rather pushes it to the userspace. I.e. the
> userspace now must track which regions are faulted in and which are
> not to call MADV_COLLAPSE at the right time. Also, if the userspace
> gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> storm of maybe hundreds of minor faults at once which may take too
> long to resolve and may impact guest stability, yes?

I think it depends on whether the userspace is already holding bitmaps
and data structures to let it know when the right time to call collapse
is; if it already has to do all that book keeping for it's own postcopy
or whatever process, then getting userspace to call it is easy.
(I don't know the answer to whether it does have!)

Dave

> For these reasons I think automatic collapsing is something that will
> eventually be implemented by us or someone else, and at that point
> MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> is adding a userspace API that will probably need to be maintained for
> perpetuity but actually is likely going to be going obsolete "soon".
> For this reason I had hoped that automatic collapsing would come with
> V1.
> 
> I wonder if we can have a very simple first try at automatic
> collapsing for V1? I.e., can we support collapsing to the hstate size
> and only that? So 4K pages can only be either collapsed to 2MB or 1G
> on x86 depending on the hstate size. I think this may be not too
> difficult to implement: we can have a counter similar to mapcount that
> tracks how many of the subpages are mapped (subpage_mapcount). Once
> all the subpages are mapped (the counter reaches a certain value),
> trigger collapsing similar to hstate size MADV_COLLAPSE.
> 
> I gather that no one else reviewing this has raised this issue thus
> far so it might not be a big deal and I will continue to review the
> RFC, but I had hoped for automatic collapsing myself for the reasons
> above.
> 
> > >
> > > > ---- HugeTLB Changes ----
> > > >
> > > > - Mapcount
> > > > The way mapcount is handled is different from the way that it was handled
> > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > high granularity, their mapcounts will remain the same as what they would
> > > > have been pre-HGM.
> > > >
> > >
> > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > differently, but the same if the page is not mapped at high
> > > granularity. Can you elaborate on how the mapcount handling will be
> > > different when the page is mapped at high granularity?
> >
> > I guess I didn't phrase this very well. For the sake of simplicity,
> > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > Previously, there were two possibilities for how a hugepage was
> > mapped, either it was (1) completely mapped (PUD is present and a
> > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > case, where the PUD is not none but also not a leaf (this usually
> > means that the page is partially mapped). We handle this case as if
> > the whole page was mapped. That is, if we partially map a hugepage
> > that was previously unmapped (making the PUD point to PMDs), we
> > increment its mapcount, and if we completely unmap a partially mapped
> > hugepage (making the PUD none), we decrement its mapcount. If we
> > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> >
> > It is possible for a PUD to be present and not a leaf (mapcount has
> > been incremented) but for the page to still be unmapped: if the PMDs
> > (or PTEs) underneath are all none. This case is atypical, and as of
> > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > think it would be very difficult to get this to happen.
> >
> 
> Thank you for the detailed explanation. Please add it to the cover letter.
> 
> I wonder the case "PUD present but all the PMD are none": is that a
> bug? I don't understand the usefulness of that. Not a comment on this
> patch but rather a curiosity.
> 
> > >
> > > > - Page table walking and manipulation
> > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > high-granularity mappings. Eventually, it's possible to merge
> > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > >
> > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > This is because we generally need to know the "size" of a PTE (previously
> > > > always just huge_page_size(hstate)).
> > > >
> > > > For every page table manipulation function that has a huge version (e.g.
> > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > > PTE really is "huge".
> > > >
> > > > - Synchronization
> > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > writing, and for doing high-granularity page table walks, we require it to
> > > > be held for reading.
> > > >
> > > > ---- Limitations & Future Changes ----
> > > >
> > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > > failure recovery for both shared and private mappings.
> > > >
> > > > The memory failure use case poses its own challenges that can be
> > > > addressed, but I will do so in a separate RFC.
> > > >
> > > > Performance has not been heavily scrutinized with this patch series. There
> > > > are places where lock contention can significantly reduce performance. This
> > > > will be addressed later.
> > > >
> > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > page struct optimization[3], as we do not need to modify data contained
> > > > in the subpage page structs.
> > > >
> > > > Other omissions:
> > > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > > >  - Support for mremap() (will be included in v1). This looks a lot like
> > > >    the support we have for fork().
> > > >  - Documentation changes (will be included in v1).
> > > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > > >    in v1).
> > > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > >    than arm64.
> > > >
> > > > ---- Patch Breakdown ----
> > > >
> > > > Patch 1     - Preliminary changes
> > > > Patch 2-10  - HugeTLB HGM core changes
> > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > Patch 24-26 - arm64 support and selftests
> > > >
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > >     to better names.
> > >
> > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > is more granular than what it normally is (2MB/1G, etc).
> >
> > Noted. :)
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  2022-06-27 12:28   ` manish.mishra
@ 2022-06-28 20:03     ` Mina Almasry
  0 siblings, 0 replies; 123+ messages in thread
From: Mina Almasry @ 2022-06-28 20:03 UTC (permalink / raw)
  To: manish.mishra
  Cc: James Houghton, Mike Kravetz, Muchun Song, Peter Xu,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 5:29 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > This adds the Kconfig to enable or disable high-granularity mapping. It
> > is enabled by default for architectures that use
> > ARCH_WANT_GENERAL_HUGETLB.
> >
> > There is also an arch-specific config ARCH_HAS_SPECIAL_HUGETLB_HGM which
> > controls whether or not the architecture has been updated to support
> > HGM if it doesn't use general HugeTLB.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> reviewed-by:manish.mishra@nutanix.com

Mostly minor nits,

Reviewed-by: Mina Almasry <almasrymina@google.com>

> > ---
> >   fs/Kconfig | 7 +++++++
> >   1 file changed, 7 insertions(+)
> >
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 5976eb33535f..d76c7d812656 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -268,6 +268,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
> >         to enable optimizing vmemmap pages of HugeTLB by default. It can then
> >         be disabled on the command line via hugetlb_free_vmemmap=off.
> >
> > +config ARCH_HAS_SPECIAL_HUGETLB_HGM

Nit: would have preferred just ARCH_HAS_HUGETLB_HGM, as ARCH implies
arch-specific.

> > +     bool
> > +
> > +config HUGETLB_HIGH_GRANULARITY_MAPPING
> > +     def_bool ARCH_WANT_GENERAL_HUGETLB || ARCH_HAS_SPECIAL_HUGETLB_HGM

Nit: would have preferred to go with either HGM _or_
HIGH_GRANULARITY_MAPPING (or whatever new name comes up), rather than
both, for consistency's sake.

> > +     depends on HUGETLB_PAGE
> > +
> >   config MEMFD_CREATE
> >       def_bool TMPFS || HUGETLBFS
> >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
  2022-06-27 12:47   ` manish.mishra
@ 2022-06-28 20:25   ` Mina Almasry
  2022-06-29 16:42     ` James Houghton
  2022-06-28 20:44   ` Mike Kravetz
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 123+ messages in thread
From: Mina Almasry @ 2022-06-28 20:25 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.
>
> Without this, we'd have to pass around the "size" of the PTE everywhere.
> We effectively did this before; it could be fetched from the hstate,
> which we pass around pretty much everywhere.
>
> This commit includes definitions for some basic helper functions that
> are used later. These helper functions wrap existing PTE
> inspection/modification functions, where the correct version is picked
> depending on if the HugeTLB PTE is actually "huge" or not. (Previously,
> all HugeTLB PTEs were "huge").
>
> For example, hugetlb_ptep_get wraps huge_ptep_get and ptep_get, where
> ptep_get is used when the HugeTLB PTE is PAGE_SIZE, and huge_ptep_get is
> used in all other cases.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 84 +++++++++++++++++++++++++++++++++++++++++
>  mm/hugetlb.c            | 57 ++++++++++++++++++++++++++++
>  2 files changed, 141 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 5fe1db46d8c9..1d4ec9dfdebf 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -46,6 +46,68 @@ enum {
>         __NR_USED_SUBPAGE,
>  };
>
> +struct hugetlb_pte {
> +       pte_t *ptep;
> +       unsigned int shift;
> +};
> +
> +static inline
> +void hugetlb_pte_init(struct hugetlb_pte *hpte)
> +{
> +       hpte->ptep = NULL;

shift = 0; ?

> +}
> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> +                         unsigned int shift)
> +{
> +       BUG_ON(!ptep);
> +       hpte->ptep = ptep;
> +       hpte->shift = shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +       BUG_ON(!hpte->ptep);
> +       return 1UL << hpte->shift;
> +}
> +

This helper is quite redundant in my opinion.

> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +       BUG_ON(!hpte->ptep);
> +       return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> +       BUG_ON(!hpte->ptep);
> +       return hpte->shift;
> +}
> +

This one jumps as quite redundant too.

> +static inline
> +bool hugetlb_pte_huge(const struct hugetlb_pte *hpte)
> +{
> +       return !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) ||
> +               hugetlb_pte_shift(hpte) > PAGE_SHIFT;
> +}
> +

I'm guessing the !IS_ENABLED() check is because only the HGM code
would store a non-huge pte in a hugetlb_pte struct. I think it's a bit
fragile because anyone can add code in the future that uses
hugetlb_pte in unexpected ways, but I will concede that it is correct
as written.

> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> +       dest->ptep = src->ptep;
> +       dest->shift = src->shift;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte);
> +bool hugetlb_pte_none(const struct hugetlb_pte *hpte);
> +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
> +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
> +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> +                      unsigned long address);
> +
>  struct hugepage_subpool {
>         spinlock_t lock;
>         long count;
> @@ -1130,6 +1192,28 @@ static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
>         return ptl;
>  }
>
> +static inline

Maybe for organization, move all the static functions you're adding
above the hugetlb_pte_* declarations you're adding?

> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +       BUG_ON(!hpte->ptep);
> +       // Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
> +       // the regular page table lock.

Does checkpatch.pl not complain about // style comments? I think those
are not allowed, no?

> +       if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
> +               return huge_pte_lockptr(hugetlb_pte_shift(hpte),
> +                               mm, hpte->ptep);
> +       return &mm->page_table_lock;
> +}
> +
> +static inline
> +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +       spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +
> +       spin_lock(ptl);
> +       return ptl;
> +}
> +
>  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
>  extern void __init hugetlb_cma_reserve(int order);
>  extern void __init hugetlb_cma_check(void);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d6d0d4c03def..1a1434e29740 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1120,6 +1120,63 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>         return false;
>  }
>
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
> +{
> +       pgd_t pgd;
> +       p4d_t p4d;
> +       pud_t pud;
> +       pmd_t pmd;
> +
> +       BUG_ON(!hpte->ptep);
> +       if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
> +               pgd = *(pgd_t *)hpte->ptep;
> +               return pgd_present(pgd) && pgd_leaf(pgd);
> +       } else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> +               p4d = *(p4d_t *)hpte->ptep;
> +               return p4d_present(p4d) && p4d_leaf(p4d);
> +       } else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> +               pud = *(pud_t *)hpte->ptep;
> +               return pud_present(pud) && pud_leaf(pud);
> +       } else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> +               pmd = *(pmd_t *)hpte->ptep;
> +               return pmd_present(pmd) && pmd_leaf(pmd);
> +       } else if (hugetlb_pte_size(hpte) >= PAGE_SIZE)
> +               return pte_present(*hpte->ptep);

The use of >= is a bit curious to me. Shouldn't these be ==?

Also probably doesn't matter but I was thinking to use *_SHIFTs
instead of *_SIZE so you don't have to calculate the size 5 times in
this routine, or calculate hugetlb_pte_size() once for some less code
duplication and re-use?

> +       BUG();
> +}
> +
> +bool hugetlb_pte_none(const struct hugetlb_pte *hpte)
> +{
> +       if (hugetlb_pte_huge(hpte))
> +               return huge_pte_none(huge_ptep_get(hpte->ptep));
> +       return pte_none(ptep_get(hpte->ptep));
> +}
> +
> +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte)
> +{
> +       if (hugetlb_pte_huge(hpte))
> +               return huge_pte_none_mostly(huge_ptep_get(hpte->ptep));
> +       return pte_none_mostly(ptep_get(hpte->ptep));
> +}
> +
> +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte)
> +{
> +       if (hugetlb_pte_huge(hpte))
> +               return huge_ptep_get(hpte->ptep);
> +       return ptep_get(hpte->ptep);
> +}
> +
> +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> +                      unsigned long address)
> +{
> +       BUG_ON(!hpte->ptep);
> +       unsigned long sz = hugetlb_pte_size(hpte);
> +
> +       if (sz > PAGE_SIZE)
> +               return huge_pte_clear(mm, address, hpte->ptep, sz);
> +       return pte_clear(mm, address, hpte->ptep);
> +}
> +
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>         int nid = page_to_nid(page);
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures
  2022-06-24 17:36 ` [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures James Houghton
  2022-06-27 12:52   ` manish.mishra
@ 2022-06-28 20:27   ` Mina Almasry
  1 sibling, 0 replies; 123+ messages in thread
From: Mina Almasry @ 2022-06-28 20:27 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> This is a helper function for freeing the bits of the page table that
> map a particular HugeTLB PTE.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h |  2 ++
>  mm/hugetlb.c            | 17 +++++++++++++++++
>  2 files changed, 19 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1d4ec9dfdebf..33ba48fac551 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -107,6 +107,8 @@ bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
>  pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
>  void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
>                        unsigned long address);
> +void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
> +                       unsigned long start, unsigned long end);
>
>  struct hugepage_subpool {
>         spinlock_t lock;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1a1434e29740..a2d2ffa76173 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1120,6 +1120,23 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>         return false;
>  }
>
> +void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
> +                       unsigned long start, unsigned long end)
> +{
> +       unsigned long floor = start & hugetlb_pte_mask(hpte);
> +       unsigned long ceiling = floor + hugetlb_pte_size(hpte);
> +
> +       if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
> +               free_p4d_range(tlb, (pgd_t *)hpte->ptep, start, end, floor, ceiling);
> +       } else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> +               free_pud_range(tlb, (p4d_t *)hpte->ptep, start, end, floor, ceiling);
> +       } else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> +               free_pmd_range(tlb, (pud_t *)hpte->ptep, start, end, floor, ceiling);
> +       } else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> +               free_pte_range(tlb, (pmd_t *)hpte->ptep, start);
> +       }

Same as the previous patch: I wonder about >=, and if possible
calculate hugetlb_pte_size() once, or use *_SHIFT comparison.

> +}
> +
>  bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
>  {
>         pgd_t pgd;
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled
  2022-06-24 17:36 ` [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled James Houghton
@ 2022-06-28 20:33   ` Mina Almasry
  2022-09-08 18:07   ` Peter Xu
  1 sibling, 0 replies; 123+ messages in thread
From: Mina Almasry @ 2022-06-28 20:33 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> Currently, this is always true if the VMA is shared. In the future, it's
> possible that private mappings will get some or all HGM functionality.
>
> Signed-off-by: James Houghton <jthoughton@google.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>

> ---
>  include/linux/hugetlb.h | 10 ++++++++++
>  mm/hugetlb.c            |  8 ++++++++
>  2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 33ba48fac551..e7a6b944d0cc 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1174,6 +1174,16 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif /* CONFIG_HUGETLB_PAGE */
>
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +/* If HugeTLB high-granularity mappings are enabled for this VMA. */
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> +#else
> +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +       return false;
> +}
> +#endif
> +
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
>                                         struct mm_struct *mm, pte_t *pte)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a2d2ffa76173..8b10b941458d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6983,6 +6983,14 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +       /* All shared VMAs have HGM enabled. */

Personally I find the comment redundant; the next line does just that.

What about VM_MAYSHARE? Should those also have HGM enabled?

Is it possible to get some docs with this series for V1? This would be
something to highlight in the docs.

> +       return vma->vm_flags & VM_SHARED;
> +}
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> +
>  /*
>   * These functions are overwritable if your architecture needs its own
>   * behavior.
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 06/26] mm: make free_p?d_range functions public
  2022-06-24 17:36 ` [RFC PATCH 06/26] mm: make free_p?d_range functions public James Houghton
  2022-06-27 12:31   ` manish.mishra
@ 2022-06-28 20:35   ` Mike Kravetz
  2022-07-12 20:52     ` James Houghton
  1 sibling, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-06-28 20:35 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/24/22 17:36, James Houghton wrote:
> This makes them usable for HugeTLB page table freeing operations.
> After HugeTLB high-granularity mapping, the page table for a HugeTLB VMA
> can get more complex, and these functions handle freeing page tables
> generally.
> 

Hmmmm?

free_pgd_range is not generally called directly for hugetlb mappings.
There is a wrapper hugetlb_free_pgd_range which can have architecture
specific implementations.  It makes me wonder if these lower level
routines can be directly used on hugetlb mappings.  My 'guess' is that any
such details will be hidden in the callers.  Suspect this will become clear
in later patches.
-- 
Mike Kravetz

> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/mm.h | 7 +++++++
>  mm/memory.c        | 8 ++++----
>  2 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc8f326be0ce..07f5da512147 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1847,6 +1847,13 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  
>  struct mmu_notifier_range;
>  
> +void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr);
> +void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, unsigned long addr,
> +		unsigned long end, unsigned long floor, unsigned long ceiling);
> +void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, unsigned long addr,
> +		unsigned long end, unsigned long floor, unsigned long ceiling);
> +void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long addr,
> +		unsigned long end, unsigned long floor, unsigned long ceiling);
>  void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>  		unsigned long end, unsigned long floor, unsigned long ceiling);
>  int
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a089145cad4..bb3b9b5b94fb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -227,7 +227,7 @@ static void check_sync_rss_stat(struct task_struct *task)
>   * Note: this doesn't free the actual pages themselves. That
>   * has been handled earlier when unmapping all the memory regions.
>   */
> -static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
> +void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
>  			   unsigned long addr)
>  {
>  	pgtable_t token = pmd_pgtable(*pmd);
> @@ -236,7 +236,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
>  	mm_dec_nr_ptes(tlb->mm);
>  }
>  
> -static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> +inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
>  				unsigned long addr, unsigned long end,
>  				unsigned long floor, unsigned long ceiling)
>  {
> @@ -270,7 +270,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
>  	mm_dec_nr_pmds(tlb->mm);
>  }
>  
> -static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> +inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>  				unsigned long addr, unsigned long end,
>  				unsigned long floor, unsigned long ceiling)
>  {
> @@ -304,7 +304,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>  	mm_dec_nr_puds(tlb->mm);
>  }
>  
> -static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
> +inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
>  				unsigned long addr, unsigned long end,
>  				unsigned long floor, unsigned long ceiling)
>  {
> -- 
> 2.37.0.rc0.161.g10f37bed90-goog
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
  2022-06-27 12:47   ` manish.mishra
  2022-06-28 20:25   ` Mina Almasry
@ 2022-06-28 20:44   ` Mike Kravetz
  2022-06-29 16:24     ` James Houghton
  2022-07-11 23:32   ` Mike Kravetz
  2022-09-08 17:38   ` Peter Xu
  4 siblings, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-06-28 20:44 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/24/22 17:36, James Houghton wrote:
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.
> 
> Without this, we'd have to pass around the "size" of the PTE everywhere.
> We effectively did this before; it could be fetched from the hstate,
> which we pass around pretty much everywhere.
> 
> This commit includes definitions for some basic helper functions that
> are used later. These helper functions wrap existing PTE
> inspection/modification functions, where the correct version is picked
> depending on if the HugeTLB PTE is actually "huge" or not. (Previously,
> all HugeTLB PTEs were "huge").
> 
> For example, hugetlb_ptep_get wraps huge_ptep_get and ptep_get, where
> ptep_get is used when the HugeTLB PTE is PAGE_SIZE, and huge_ptep_get is
> used in all other cases.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 84 +++++++++++++++++++++++++++++++++++++++++
>  mm/hugetlb.c            | 57 ++++++++++++++++++++++++++++
>  2 files changed, 141 insertions(+)

There is nothing 'wrong' with this patch, but it does make me wonder.
After introducing hugetlb_pte, is all code dealing with hugetlb mappings
going to be using hugetlb_ptes?  It would be quite confusing if there is
a mix of hugetlb_ptes and non-hugetlb_ptes.  This will be revealed later
in the series, but a comment about suture direction would be helpful
here.
-- 
Mike Kravetz

> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 5fe1db46d8c9..1d4ec9dfdebf 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -46,6 +46,68 @@ enum {
>  	__NR_USED_SUBPAGE,
>  };
>  
> +struct hugetlb_pte {
> +	pte_t *ptep;
> +	unsigned int shift;
> +};
> +
> +static inline
> +void hugetlb_pte_init(struct hugetlb_pte *hpte)
> +{
> +	hpte->ptep = NULL;
> +}
> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> +			  unsigned int shift)
> +{
> +	BUG_ON(!ptep);
> +	hpte->ptep = ptep;
> +	hpte->shift = shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +	BUG_ON(!hpte->ptep);
> +	return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +	BUG_ON(!hpte->ptep);
> +	return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> +	BUG_ON(!hpte->ptep);
> +	return hpte->shift;
> +}
> +
> +static inline
> +bool hugetlb_pte_huge(const struct hugetlb_pte *hpte)
> +{
> +	return !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) ||
> +		hugetlb_pte_shift(hpte) > PAGE_SHIFT;
> +}
> +
> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> +	dest->ptep = src->ptep;
> +	dest->shift = src->shift;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte);
> +bool hugetlb_pte_none(const struct hugetlb_pte *hpte);
> +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
> +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
> +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> +		       unsigned long address);
> +
>  struct hugepage_subpool {
>  	spinlock_t lock;
>  	long count;
> @@ -1130,6 +1192,28 @@ static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
>  	return ptl;
>  }
>  
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +	BUG_ON(!hpte->ptep);
> +	// Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
> +	// the regular page table lock.
> +	if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
> +		return huge_pte_lockptr(hugetlb_pte_shift(hpte),
> +				mm, hpte->ptep);
> +	return &mm->page_table_lock;
> +}
> +
> +static inline
> +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +
> +	spin_lock(ptl);
> +	return ptl;
> +}
> +
>  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
>  extern void __init hugetlb_cma_reserve(int order);
>  extern void __init hugetlb_cma_check(void);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d6d0d4c03def..1a1434e29740 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1120,6 +1120,63 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>  	return false;
>  }
>  
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
> +{
> +	pgd_t pgd;
> +	p4d_t p4d;
> +	pud_t pud;
> +	pmd_t pmd;
> +
> +	BUG_ON(!hpte->ptep);
> +	if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
> +		pgd = *(pgd_t *)hpte->ptep;
> +		return pgd_present(pgd) && pgd_leaf(pgd);
> +	} else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> +		p4d = *(p4d_t *)hpte->ptep;
> +		return p4d_present(p4d) && p4d_leaf(p4d);
> +	} else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> +		pud = *(pud_t *)hpte->ptep;
> +		return pud_present(pud) && pud_leaf(pud);
> +	} else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> +		pmd = *(pmd_t *)hpte->ptep;
> +		return pmd_present(pmd) && pmd_leaf(pmd);
> +	} else if (hugetlb_pte_size(hpte) >= PAGE_SIZE)
> +		return pte_present(*hpte->ptep);
> +	BUG();
> +}
> +
> +bool hugetlb_pte_none(const struct hugetlb_pte *hpte)
> +{
> +	if (hugetlb_pte_huge(hpte))
> +		return huge_pte_none(huge_ptep_get(hpte->ptep));
> +	return pte_none(ptep_get(hpte->ptep));
> +}
> +
> +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte)
> +{
> +	if (hugetlb_pte_huge(hpte))
> +		return huge_pte_none_mostly(huge_ptep_get(hpte->ptep));
> +	return pte_none_mostly(ptep_get(hpte->ptep));
> +}
> +
> +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte)
> +{
> +	if (hugetlb_pte_huge(hpte))
> +		return huge_ptep_get(hpte->ptep);
> +	return ptep_get(hpte->ptep);
> +}
> +
> +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> +		       unsigned long address)
> +{
> +	BUG_ON(!hpte->ptep);
> +	unsigned long sz = hugetlb_pte_size(hpte);
> +
> +	if (sz > PAGE_SIZE)
> +		return huge_pte_clear(mm, address, hpte->ptep, sz);
> +	return pte_clear(mm, address, hpte->ptep);
> +}
> +
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>  	int nid = page_to_nid(page);
> -- 
> 2.37.0.rc0.161.g10f37bed90-goog
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift
  2022-06-24 17:36 ` [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift James Houghton
  2022-06-27 13:01   ` manish.mishra
@ 2022-06-28 21:58   ` Mina Almasry
  2022-07-07 21:39     ` Mike Kravetz
  2022-07-08 15:52     ` James Houghton
  1 sibling, 2 replies; 123+ messages in thread
From: Mina Almasry @ 2022-06-28 21:58 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
>
> This is a helper macro to loop through all the usable page sizes for a
> high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> loop, in descending order, through the page sizes that HugeTLB supports
> for this architecture; it always includes PAGE_SIZE.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8b10b941458d..557b0afdb503 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
>         /* All shared VMAs have HGM enabled. */
>         return vma->vm_flags & VM_SHARED;
>  }
> +static unsigned int __shift_for_hstate(struct hstate *h)
> +{
> +       if (h >= &hstates[hugetlb_max_hstate])
> +               return PAGE_SHIFT;

h > &hstates[hugetlb_max_hstate] means that h is out of bounds, no? am
I missing something here?

So is this intending to do:

if (h == hstates[hugetlb_max_hstate]
    return PAGE_SHIFT;

? If so, could we write it as so?

I'm also wondering why __shift_for_hstate(hstate[hugetlb_max_hstate])
== PAGE_SHIFT? Isn't the last hstate the smallest hstate which should
be 2MB on x86? Shouldn't this return PMD_SHIFT in that case?

> +       return huge_page_shift(h);
> +}
> +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> +       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> +                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> +                              (tmp_h)++)
>  #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>
>  /*
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-27 20:51   ` Mike Kravetz
  2022-06-28 15:29     ` James Houghton
@ 2022-06-29  6:09     ` Muchun Song
  2022-06-29 21:03       ` Mike Kravetz
  1 sibling, 1 reply; 123+ messages in thread
From: Muchun Song @ 2022-06-29  6:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: James Houghton, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 01:51:53PM -0700, Mike Kravetz wrote:
> On 06/24/22 17:36, James Houghton wrote:
> > This is needed to handle PTL locking with high-granularity mapping. We
> > won't always be using the PMD-level PTL even if we're using the 2M
> > hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> > case, we need to lock the PTL for the 4K PTE.
> 
> I'm not really sure why this would be required.
> Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
> with less contention than using the more coarse mm lock.  
>

Your words make me thing of another question unrelated to this patch.
We __know__ that arm64 supports continues PTE HugeTLB. huge_pte_lockptr()
did not consider this case, in this case, those HugeTLB pages are contended
with mm lock. Seems we should optimize this case. Something like:

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0d790fa3f297..68a1e071bfc0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -893,7 +893,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                           struct mm_struct *mm, pte_t *pte)
 {
-       if (huge_page_size(h) == PMD_SIZE)
+       if (huge_page_size(h) <= PMD_SIZE)
                return pmd_lockptr(mm, (pmd_t *) pte);
        VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
        return &mm->page_table_lock;

I did not check if elsewhere needs to be changed as well. Just a primary
thought.

Thanks.
 
> -- 
> Mike Kravetz
> 

^ permalink raw reply related	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 01/26] hugetlb: make hstate accessor functions const
  2022-06-24 17:36 ` [RFC PATCH 01/26] hugetlb: make hstate accessor functions const James Houghton
  2022-06-24 18:43   ` Mina Almasry
       [not found]   ` <e55f90f5-ba14-5d6e-8f8f-abf731b9095e@nutanix.com>
@ 2022-06-29  6:18   ` Muchun Song
  2 siblings, 0 replies; 123+ messages in thread
From: Muchun Song @ 2022-06-29  6:18 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 05:36:31PM +0000, James Houghton wrote:
> This is just a const-correctness change so that the new hugetlb_pte
> changes can be const-correct too.
> 
> Acked-by: David Rientjes <rientjes@google.com>
> 
> Signed-off-by: James Houghton <jthoughton@google.com>

This is a good start. I also want to make those helpers take
const type parameter. Seems you have forgotten to update them
when !CONFIG_HUGETLB_PAGE.

Thanks.

> ---
>  include/linux/hugetlb.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e4cff27d1198..498a4ae3d462 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -715,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
>  	return hstate_file(vma->vm_file);
>  }
>  
> -static inline unsigned long huge_page_size(struct hstate *h)
> +static inline unsigned long huge_page_size(const struct hstate *h)
>  {
>  	return (unsigned long)PAGE_SIZE << h->order;
>  }
> @@ -729,27 +729,27 @@ static inline unsigned long huge_page_mask(struct hstate *h)
>  	return h->mask;
>  }
>  
> -static inline unsigned int huge_page_order(struct hstate *h)
> +static inline unsigned int huge_page_order(const struct hstate *h)
>  {
>  	return h->order;
>  }
>  
> -static inline unsigned huge_page_shift(struct hstate *h)
> +static inline unsigned huge_page_shift(const struct hstate *h)
>  {
>  	return h->order + PAGE_SHIFT;
>  }
>  
> -static inline bool hstate_is_gigantic(struct hstate *h)
> +static inline bool hstate_is_gigantic(const struct hstate *h)
>  {
>  	return huge_page_order(h) >= MAX_ORDER;
>  }
>  
> -static inline unsigned int pages_per_huge_page(struct hstate *h)
> +static inline unsigned int pages_per_huge_page(const struct hstate *h)
>  {
>  	return 1 << h->order;
>  }
>  
> -static inline unsigned int blocks_per_huge_page(struct hstate *h)
> +static inline unsigned int blocks_per_huge_page(const struct hstate *h)
>  {
>  	return huge_page_size(h) / 512;
>  }
> -- 
> 2.37.0.rc0.161.g10f37bed90-goog
> 
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-28 15:40     ` James Houghton
@ 2022-06-29  6:39       ` Muchun Song
  2022-06-29 21:06         ` Mike Kravetz
  0 siblings, 1 reply; 123+ messages in thread
From: Muchun Song @ 2022-06-29  6:39 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Tue, Jun 28, 2022 at 08:40:27AM -0700, James Houghton wrote:
> On Mon, Jun 27, 2022 at 11:42 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 06/24/22 17:36, James Houghton wrote:
> > > When using HugeTLB high-granularity mapping, we need to go through the
> > > supported hugepage sizes in decreasing order so that we pick the largest
> > > size that works. Consider the case where we're faulting in a 1G hugepage
> > > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > > a PUD. By going through the sizes in decreasing order, we will find that
> > > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index a57e1be41401..5df838d86f32 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -33,6 +33,7 @@
> > >  #include <linux/migrate.h>
> > >  #include <linux/nospec.h>
> > >  #include <linux/delayacct.h>
> > > +#include <linux/sort.h>
> > >
> > >  #include <asm/page.h>
> > >  #include <asm/pgalloc.h>
> > > @@ -48,6 +49,10 @@
> > >
> > >  int hugetlb_max_hstate __read_mostly;
> > >  unsigned int default_hstate_idx;
> > > +/*
> > > + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> > > + * to smallest.
> > > + */
> > >  struct hstate hstates[HUGE_MAX_HSTATE];
> > >
> > >  #ifdef CONFIG_CMA
> > > @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> > >       kfree(node_alloc_noretry);
> > >  }
> > >
> > > +static int compare_hstates_decreasing(const void *a, const void *b)
> > > +{
> > > +     const int shift_a = huge_page_shift((const struct hstate *)a);
> > > +     const int shift_b = huge_page_shift((const struct hstate *)b);
> > > +
> > > +     if (shift_a < shift_b)
> > > +             return 1;
> > > +     if (shift_a > shift_b)
> > > +             return -1;
> > > +     return 0;
> > > +}
> > > +
> > > +static void sort_hstates(void)
> > > +{
> > > +     unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> > > +
> > > +     /* Sort from largest to smallest. */
> > > +     sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> > > +          compare_hstates_decreasing, NULL);
> > > +
> > > +     /*
> > > +      * We may have changed the location of the default hstate, so we need to
> > > +      * update it.
> > > +      */
> > > +     default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> > > +}
> > > +
> > >  static void __init hugetlb_init_hstates(void)
> > >  {
> > >       struct hstate *h, *h2;
> > >
> > > -     for_each_hstate(h) {
> > > -             if (minimum_order > huge_page_order(h))
> > > -                     minimum_order = huge_page_order(h);
> > > +     sort_hstates();
> > >
> > > +     /* The last hstate is now the smallest. */
> > > +     minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> > > +
> > > +     for_each_hstate(h) {
> > >               /* oversize hugepages were init'ed in early boot */
> > >               if (!hstate_is_gigantic(h))
> > >                       hugetlb_hstate_alloc_pages(h);
> >
> > This may/will cause problems for gigantic hugetlb pages allocated at boot
> > time.  See alloc_bootmem_huge_page() where a pointer to the associated hstate
> > is encoded within the allocated hugetlb page.  These pages are added to
> > hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
> > hstate to add prep the gigantic page and add to the correct pool.  Currently,
> > gather_bootmem_prealloc is called after hugetlb_init_hstates.  So, changing
> > hstate order will cause errors.
> >
> > I do not see any reason why we could not call gather_bootmem_prealloc before
> > hugetlb_init_hstates to avoid this issue.
> 
> Thanks for catching this, Mike. Your suggestion certainly seems to
> work, but it also seems kind of error prone. I'll have to look at the
> code more closely, but maybe it would be better if I just maintained a
> separate `struct hstate *sorted_hstate_ptrs[]`, where the original

I don't think this is a good idea.  If you really rely on the order of
the initialization in this patch.  The easier solution is changing
huge_bootmem_page->hstate to huge_bootmem_page->hugepagesz. Then we
can use size_to_hstate(huge_bootmem_page->hugepagesz) in
gather_bootmem_prealloc().

Thanks.

> locations of the hstates remain unchanged, as to not break
> gather_bootmem_prealloc/other things.
> 
> > --
> > Mike Kravetz
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 13/26] hugetlb: add huge_pte_alloc_high_granularity
  2022-06-24 17:36 ` [RFC PATCH 13/26] hugetlb: add huge_pte_alloc_high_granularity James Houghton
@ 2022-06-29 14:11   ` manish.mishra
  0 siblings, 0 replies; 123+ messages in thread
From: manish.mishra @ 2022-06-29 14:11 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This function is to be used to do a HugeTLB page table walk where we may
> need to split a leaf-level huge PTE into a new page table level.
>
> Consider the case where we want to install 4K inside an empty 1G page:
> 1. We walk to the PUD and notice that it is pte_none.
> 2. We split the PUD by calling `hugetlb_split_to_shift`, creating a
>     standard PUD that points to PMDs that are all pte_none.
> 3. We continue the PT walk to find the PMD. We split it just like we
>     split the PUD.
> 4. We find the PTE and give it back to the caller.
>
> To avoid concurrent splitting operations on the same page table entry,
> we require that the mapping rwsem is held for writing while collapsing
> and for reading when doing a high-granularity PT walk.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   include/linux/hugetlb.h | 23 ++++++++++++++
>   mm/hugetlb.c            | 67 +++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 90 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 605aa19d8572..321f5745d87f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1176,14 +1176,37 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
>   }
>   #endif	/* CONFIG_HUGETLB_PAGE */
>   
> +enum split_mode {
> +	HUGETLB_SPLIT_NEVER   = 0,
> +	HUGETLB_SPLIT_NONE    = 1 << 0,
> +	HUGETLB_SPLIT_PRESENT = 1 << 1,
huge
> +	HUGETLB_SPLIT_ALWAYS  = HUGETLB_SPLIT_NONE | HUGETLB_SPLIT_PRESENT,
> +};
>   #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
>   /* If HugeTLB high-granularity mappings are enabled for this VMA. */
>   bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> +int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
> +				    struct mm_struct *mm,
> +				    struct vm_area_struct *vma,
> +				    unsigned long addr,
> +				    unsigned int desired_sz,
> +				    enum split_mode mode,
> +				    bool write_locked);
>   #else
>   static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
>   {
>   	return false;
>   }
> +static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
> +					   struct mm_struct *mm,
> +					   struct vm_area_struct *vma,
> +					   unsigned long addr,
> +					   unsigned int desired_sz,
> +					   enum split_mode mode,
> +					   bool write_locked)
> +{
> +	return -EINVAL;
> +}
>   #endif
>   
>   static inline spinlock_t *huge_pte_lock(struct hstate *h,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index eaffe7b4f67c..6e0c5fbfe32c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7166,6 +7166,73 @@ static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *v
>   	tlb_finish_mmu(&tlb);
>   	return ret;
>   }
> +
> +/*
> + * Similar to huge_pte_alloc except that this can be used to create or walk
> + * high-granularity mappings. It will automatically split existing HugeTLB PTEs
> + * if required by @mode. The resulting HugeTLB PTE will be returned in @hpte.
> + *
> + * There are three options for @mode:
> + *  - HUGETLB_SPLIT_NEVER   - Never split.
> + *  - HUGETLB_SPLIT_NONE    - Split empty PTEs.
> + *  - HUGETLB_SPLIT_PRESENT - Split present PTEs.
> + *  - HUGETLB_SPLIT_ALWAYS  - Split both empty and present PTEs.
> + */
> +int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
> +				    struct mm_struct *mm,
> +				    struct vm_area_struct *vma,
> +				    unsigned long addr,
> +				    unsigned int desired_shift,
> +				    enum split_mode mode,
> +				    bool write_locked)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	bool has_write_lock = write_locked;
> +	unsigned long desired_sz = 1UL << desired_shift;
> +	int ret;
> +
> +	BUG_ON(!hpte);
> +
> +	if (has_write_lock)
> +		i_mmap_assert_write_locked(mapping);
> +	else
> +		i_mmap_assert_locked(mapping);

> +
> +retry:
> +	ret = 0;
> +	hugetlb_pte_init(hpte);
> +
> +	ret = hugetlb_walk_to(mm, hpte, addr, desired_sz,
> +			      !(mode & HUGETLB_SPLIT_NONE));

hugetlb_walk_to when called with split_non mode can change mappings?

If so should be ensure we are holding write-lock here.

> +	if (ret || hugetlb_pte_size(hpte) == desired_sz)
> +		goto out;
> +
> +	if (
> +		((mode & HUGETLB_SPLIT_NONE) && hugetlb_pte_none(hpte)) ||
> +		((mode & HUGETLB_SPLIT_PRESENT) &&
> +		  hugetlb_pte_present_leaf(hpte))
> +	   ) {
> +		if (!has_write_lock) {
> +			i_mmap_unlock_read(mapping);
Should lock upgrade be used here?
> +			i_mmap_lock_write(mapping);
> +			has_write_lock = true;
> +			goto retry;
> +		}
> +		ret = hugetlb_split_to_shift(mm, vma, hpte, addr,
> +					     desired_shift);
> +	}
> +
> +out:
> +	if (has_write_lock && !write_locked) {
> +		/* Drop the write lock. */
> +		i_mmap_unlock_write(mapping);
> +		i_mmap_lock_read(mapping);
same here lock downgrade?
> +		has_write_lock = false;
> +		goto retry;
> +	}
> +
> +	return ret;
> +}
>   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>   
>   /*

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality
  2022-06-24 17:36 ` [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality James Houghton
  2022-06-27 13:50   ` manish.mishra
@ 2022-06-29 14:33   ` manish.mishra
  2022-06-29 16:20     ` James Houghton
  1 sibling, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-06-29 14:33 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> The new function, hugetlb_split_to_shift, will optimally split the page
> table to map a particular address at a particular granularity.
>
> This is useful for punching a hole in the mapping and for mapping small
> sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 122 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3ec2a921ee6f..eaffe7b4f67c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
>   /* Forward declaration */
>   static int hugetlb_acct_memory(struct hstate *h, long delta);
>   
> +/*
> + * Find the subpage that corresponds to `addr` in `hpage`.
> + */
> +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> +				 unsigned long addr)
> +{
> +	size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> +
> +	BUG_ON(idx >= pages_per_huge_page(h));
> +	return &hpage[idx];
> +}
> +
>   static inline bool subpool_is_free(struct hugepage_subpool *spool)
>   {
>   	if (spool->count)
> @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
>   	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
>   			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
>   			       (tmp_h)++)
> +
> +/*
> + * Given a particular address, split the HugeTLB PTE that currently maps it
> + * so that, for the given address, the PTE that maps it is `desired_shift`.
> + * This function will always split the HugeTLB PTE optimally.
> + *
> + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
> + * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
> + * these changes to the page table:
> + * 1. The PUD will be split into 2M PMDs.
> + * 2. The first PMD will be split again into 4K PTEs.
> + */
> +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> +			   const struct hugetlb_pte *hpte,
> +			   unsigned long addr, unsigned long desired_shift)
> +{
> +	unsigned long start, end, curr;
> +	unsigned long desired_sz = 1UL << desired_shift;
> +	struct hstate *h = hstate_vma(vma);
> +	int ret;
> +	struct hugetlb_pte new_hpte;
> +	struct mmu_notifier_range range;
> +	struct page *hpage = NULL;
> +	struct page *subpage;
> +	pte_t old_entry;
> +	struct mmu_gather tlb;
> +
> +	BUG_ON(!hpte->ptep);
> +	BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> +
> +	start = addr & hugetlb_pte_mask(hpte);
> +	end = start + hugetlb_pte_size(hpte);
> +
> +	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +
> +	BUG_ON(!hpte->ptep);
> +	/* This function only works if we are looking at a leaf-level PTE. */
> +	BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> +
> +	/*
> +	 * Clear the PTE so that we will allocate the PT structures when
> +	 * walking the page table.
> +	 */
> +	old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);

Sorry missed it last time, what if hgm mapping present here and current hpte is

at higher level. Where we will clear and free child page-table pages.

I see it does not happen in huge_ptep_get_and_clear.

> +
> +	if (!huge_pte_none(old_entry))
> +		hpage = pte_page(old_entry);
> +
> +	BUG_ON(!IS_ALIGNED(start, desired_sz));
> +	BUG_ON(!IS_ALIGNED(end, desired_sz));
> +
> +	for (curr = start; curr < end;) {
> +		struct hstate *tmp_h;
> +		unsigned int shift;
> +
> +		for_each_hgm_shift(h, tmp_h, shift) {
> +			unsigned long sz = 1UL << shift;
> +
> +			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> +				continue;
> +			/*
> +			 * If we are including `addr`, we need to make sure
> +			 * splitting down to the correct size. Go to a smaller
> +			 * size if we are not.
> +			 */
> +			if (curr <= addr && curr + sz > addr &&
> +					shift > desired_shift)
> +				continue;
> +
> +			/*
> +			 * Continue the page table walk to the level we want,
> +			 * allocate PT structures as we go.
> +			 */
> +			hugetlb_pte_copy(&new_hpte, hpte);
> +			ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> +					      /*stop_at_none=*/false);
> +			if (ret)
> +				goto err;
> +			BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> +			if (hpage) {
> +				pte_t new_entry;
> +
> +				subpage = hugetlb_find_subpage(h, hpage, curr);
> +				new_entry = make_huge_pte_with_shift(vma, subpage,
> +								     huge_pte_write(old_entry),
> +								     shift);
> +				set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> +			}
> +			curr += sz;
> +			goto next;
> +		}
> +		/* We couldn't find a size that worked. */
> +		BUG();
> +next:
> +		continue;
> +	}
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				start, end);
> +	mmu_notifier_invalidate_range_start(&range);
> +	return 0;
> +err:
> +	tlb_gather_mmu(&tlb, mm);
> +	/* Free any newly allocated page table entries. */
> +	hugetlb_free_range(&tlb, hpte, start, curr);
> +	/* Restore the old entry. */
> +	set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> +	tlb_finish_mmu(&tlb);
> +	return ret;
> +}
>   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>   
>   /*

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
  2022-06-24 17:36 ` [RFC PATCH 14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
@ 2022-06-29 14:40   ` manish.mishra
  2022-06-29 15:56     ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-06-29 14:40 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This CL is the first main functional HugeTLB change. Together, these
> changes allow the HugeTLB fault path to handle faults on HGM-enabled
> VMAs. The two main behaviors that can be done now:
>    1. Faults can be passed to handle_userfault. (Userspace will want to
>       use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
>       region they should be call UFFDIO_CONTINUE on later.)
>    2. Faults on pages that have been partially mapped (and userfaultfd is
>       not being used) will get mapped at the largest possible size.
>       For example, if a 1G page has been partially mapped at 2M, and we
>       fault on an unmapped 2M section, hugetlb_no_page will create a 2M
>       PMD to map the faulting address.
>
> This commit does not handle hugetlb_wp right now, and it doesn't handle
> HugeTLB page migration and swap entries.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   include/linux/hugetlb.h |  12 ++++
>   mm/hugetlb.c            | 121 +++++++++++++++++++++++++++++++---------
>   2 files changed, 106 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 321f5745d87f..ac4ac8fbd901 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1185,6 +1185,9 @@ enum split_mode {
>   #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
>   /* If HugeTLB high-granularity mappings are enabled for this VMA. */
>   bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> +			      struct vm_area_struct *vma, unsigned long start,
> +			      unsigned long end);
>   int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
>   				    struct mm_struct *mm,
>   				    struct vm_area_struct *vma,
> @@ -1197,6 +1200,15 @@ static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
>   {
>   	return false;
>   }
> +
> +static inline
> +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> +			      struct vm_area_struct *vma, unsigned long start,
> +			      unsigned long end)
> +{
> +		BUG();
> +}
> +
>   static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
>   					   struct mm_struct *mm,
>   					   struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6e0c5fbfe32c..da30621656b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5605,18 +5605,24 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
>   static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   			struct vm_area_struct *vma,
>   			struct address_space *mapping, pgoff_t idx,
> -			unsigned long address, pte_t *ptep,
> +			unsigned long address, struct hugetlb_pte *hpte,
>   			pte_t old_pte, unsigned int flags)
>   {
>   	struct hstate *h = hstate_vma(vma);
>   	vm_fault_t ret = VM_FAULT_SIGBUS;
>   	int anon_rmap = 0;
>   	unsigned long size;
> -	struct page *page;
> +	struct page *page, *subpage;
>   	pte_t new_pte;
>   	spinlock_t *ptl;
>   	unsigned long haddr = address & huge_page_mask(h);
> +	unsigned long haddr_hgm = address & hugetlb_pte_mask(hpte);
>   	bool new_page, new_pagecache_page = false;
> +	/*
> +	 * This page is getting mapped for the first time, in which case we
> +	 * want to increment its mapcount.
> +	 */
> +	bool new_mapping = hpte->shift == huge_page_shift(h);
>   
>   	/*
>   	 * Currently, we are forced to kill the process in the event the
> @@ -5665,9 +5671,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   			 * here.  Before returning error, get ptl and make
>   			 * sure there really is no pte entry.
>   			 */
> -			ptl = huge_pte_lock(h, mm, ptep);
> +			ptl = hugetlb_pte_lock(mm, hpte);
>   			ret = 0;
> -			if (huge_pte_none(huge_ptep_get(ptep)))
> +			if (hugetlb_pte_none(hpte))
>   				ret = vmf_error(PTR_ERR(page));
>   			spin_unlock(ptl);
>   			goto out;
> @@ -5731,18 +5737,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   		vma_end_reservation(h, vma, haddr);
>   	}
>   
> -	ptl = huge_pte_lock(h, mm, ptep);
> +	ptl = hugetlb_pte_lock(mm, hpte);
>   	ret = 0;
>   	/* If pte changed from under us, retry */
> -	if (!pte_same(huge_ptep_get(ptep), old_pte))
> +	if (!pte_same(hugetlb_ptep_get(hpte), old_pte))
>   		goto backout;
>   
> -	if (anon_rmap) {
> -		ClearHPageRestoreReserve(page);
> -		hugepage_add_new_anon_rmap(page, vma, haddr);
> -	} else
> -		page_dup_file_rmap(page, true);
> -	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
> +	if (new_mapping) {
> +		/* Only increment this page's mapcount if we are mapping it
> +		 * for the first time.
> +		 */
> +		if (anon_rmap) {
> +			ClearHPageRestoreReserve(page);
> +			hugepage_add_new_anon_rmap(page, vma, haddr);
> +		} else
> +			page_dup_file_rmap(page, true);
> +	}
> +
> +	subpage = hugetlb_find_subpage(h, page, haddr_hgm);

               sorry did not understand why make_huge_pte we may be mapping just PAGE_SIZE

               too here.

> +	new_pte = make_huge_pte(vma, subpage, ((vma->vm_flags & VM_WRITE)
>   				&& (vma->vm_flags & VM_SHARED)));
>   	/*
>   	 * If this pte was previously wr-protected, keep it wr-protected even
> @@ -5750,12 +5763,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   	 */
>   	if (unlikely(pte_marker_uffd_wp(old_pte)))
>   		new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
> -	set_huge_pte_at(mm, haddr, ptep, new_pte);
> +	set_huge_pte_at(mm, haddr_hgm, hpte->ptep, new_pte);
>   
> -	hugetlb_count_add(pages_per_huge_page(h), mm);
> +	hugetlb_count_add(hugetlb_pte_size(hpte) / PAGE_SIZE, mm);
>   	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> +		BUG_ON(hugetlb_pte_size(hpte) != huge_page_size(h));
>   		/* Optimization, do the COW without a second fault */
> -		ret = hugetlb_wp(mm, vma, address, ptep, flags, page, ptl);
> +		ret = hugetlb_wp(mm, vma, address, hpte->ptep, flags, page, ptl);
>   	}
>   
>   	spin_unlock(ptl);
> @@ -5816,11 +5830,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	u32 hash;
>   	pgoff_t idx;
>   	struct page *page = NULL;
> +	struct page *subpage = NULL;
>   	struct page *pagecache_page = NULL;
>   	struct hstate *h = hstate_vma(vma);
>   	struct address_space *mapping;
>   	int need_wait_lock = 0;
>   	unsigned long haddr = address & huge_page_mask(h);
> +	unsigned long haddr_hgm;
> +	bool hgm_enabled = hugetlb_hgm_enabled(vma);
> +	struct hugetlb_pte hpte;
>   
>   	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>   	if (ptep) {
> @@ -5866,11 +5884,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	hash = hugetlb_fault_mutex_hash(mapping, idx);
>   	mutex_lock(&hugetlb_fault_mutex_table[hash]);
>   
> -	entry = huge_ptep_get(ptep);
> +	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
> +
> +	if (hgm_enabled) {
> +		ret = hugetlb_walk_to(mm, &hpte, address,
> +				      PAGE_SIZE, /*stop_at_none=*/true);
> +		if (ret) {
> +			ret = vmf_error(ret);
> +			goto out_mutex;
> +		}
> +	}
> +
> +	entry = hugetlb_ptep_get(&hpte);
>   	/* PTE markers should be handled the same way as none pte */
> -	if (huge_pte_none_mostly(entry)) {
> -		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> -				      entry, flags);
> +	if (hugetlb_pte_none_mostly(&hpte)) {
> +		ret = hugetlb_no_page(mm, vma, mapping, idx, address, &hpte,
> +				entry, flags);
>   		goto out_mutex;
>   	}
>   
> @@ -5908,14 +5937,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   								vma, haddr);
>   	}
>   
> -	ptl = huge_pte_lock(h, mm, ptep);
> +	ptl = hugetlb_pte_lock(mm, &hpte);
>   
>   	/* Check for a racing update before calling hugetlb_wp() */
> -	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
> +	if (unlikely(!pte_same(entry, hugetlb_ptep_get(&hpte))))
>   		goto out_ptl;
>   
> +	/* haddr_hgm is the base address of the region that hpte maps. */
> +	haddr_hgm = address & hugetlb_pte_mask(&hpte);
> +
>   	/* Handle userfault-wp first, before trying to lock more pages */
> -	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
> +	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(hugetlb_ptep_get(&hpte)) &&
>   	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
>   		struct vm_fault vmf = {
>   			.vma = vma,
> @@ -5939,7 +5971,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	 * pagecache_page, so here we need take the former one
>   	 * when page != pagecache_page or !pagecache_page.
>   	 */
> -	page = pte_page(entry);
> +	subpage = pte_page(entry);
> +	page = compound_head(subpage);
>   	if (page != pagecache_page)
>   		if (!trylock_page(page)) {
>   			need_wait_lock = 1;
> @@ -5950,7 +5983,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   
>   	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>   		if (!huge_pte_write(entry)) {
> -			ret = hugetlb_wp(mm, vma, address, ptep, flags,
> +			BUG_ON(hugetlb_pte_size(&hpte) != huge_page_size(h));

is it in respect to fact that userfault_wp is not support with HGM mapping currently? Not

sure yet though how it is controlled may be next patches will have more details.

> +			ret = hugetlb_wp(mm, vma, address, hpte.ptep, flags,
>   					 pagecache_page, ptl);
>   			goto out_put_page;
>   		} else if (likely(flags & FAULT_FLAG_WRITE)) {
> @@ -5958,9 +5992,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   		}
>   	}
>   	entry = pte_mkyoung(entry);
> -	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
> +	if (huge_ptep_set_access_flags(vma, haddr_hgm, hpte.ptep, entry,
>   						flags & FAULT_FLAG_WRITE))
> -		update_mmu_cache(vma, haddr, ptep);
> +		update_mmu_cache(vma, haddr_hgm, hpte.ptep);
>   out_put_page:
>   	if (page != pagecache_page)
>   		unlock_page(page);
> @@ -6951,7 +6985,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>   				pte = (pte_t *)pmd_alloc(mm, pud, addr);
>   		}
>   	}
> -	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
> +	if (!hugetlb_hgm_enabled(vma))
> +		BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
>   
>   	return pte;
>   }
> @@ -7057,6 +7092,38 @@ static unsigned int __shift_for_hstate(struct hstate *h)
>   			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
>   			       (tmp_h)++)
>   
> +/*
> + * Allocate a HugeTLB PTE that maps as much of [start, end) as possible with a
> + * single page table entry. The allocated HugeTLB PTE is returned in hpte.
> + */

Will it be used for madvise_collapase? If so will it make sense to keep it in different patch

as this one title says just for handle_page_fault routines.

> +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> +			      struct vm_area_struct *vma, unsigned long start,
> +			      unsigned long end)
> +{
> +	struct hstate *h = hstate_vma(vma), *tmp_h;
> +	unsigned int shift;
> +	int ret;
> +
> +	for_each_hgm_shift(h, tmp_h, shift) {
> +		unsigned long sz = 1UL << shift;
> +
> +		if (!IS_ALIGNED(start, sz) || start + sz > end)
> +			continue;
> +		ret = huge_pte_alloc_high_granularity(hpte, mm, vma, start,
> +						      shift, HUGETLB_SPLIT_NONE,
> +						      /*write_locked=*/false);
> +		if (ret)
> +			return ret;
> +
> +		if (hpte->shift > shift)
> +			return -EEXIST;
> +
> +		BUG_ON(hpte->shift != shift);
> +		return 0;
> +	}
> +	return -EINVAL;
> +}
> +
>   /*
>    * Given a particular address, split the HugeTLB PTE that currently maps it
>    * so that, for the given address, the PTE that maps it is `desired_shift`.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
  2022-06-29 14:40   ` manish.mishra
@ 2022-06-29 15:56     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-29 15:56 UTC (permalink / raw)
  To: manish.mishra
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Wed, Jun 29, 2022 at 7:41 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > This CL is the first main functional HugeTLB change. Together, these
> > changes allow the HugeTLB fault path to handle faults on HGM-enabled
> > VMAs. The two main behaviors that can be done now:
> >    1. Faults can be passed to handle_userfault. (Userspace will want to
> >       use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
> >       region they should be call UFFDIO_CONTINUE on later.)
> >    2. Faults on pages that have been partially mapped (and userfaultfd is
> >       not being used) will get mapped at the largest possible size.
> >       For example, if a 1G page has been partially mapped at 2M, and we
> >       fault on an unmapped 2M section, hugetlb_no_page will create a 2M
> >       PMD to map the faulting address.
> >
> > This commit does not handle hugetlb_wp right now, and it doesn't handle
> > HugeTLB page migration and swap entries.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   include/linux/hugetlb.h |  12 ++++
> >   mm/hugetlb.c            | 121 +++++++++++++++++++++++++++++++---------
> >   2 files changed, 106 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 321f5745d87f..ac4ac8fbd901 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -1185,6 +1185,9 @@ enum split_mode {
> >   #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> >   /* If HugeTLB high-granularity mappings are enabled for this VMA. */
> >   bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> > +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> > +                           struct vm_area_struct *vma, unsigned long start,
> > +                           unsigned long end);
> >   int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
> >                                   struct mm_struct *mm,
> >                                   struct vm_area_struct *vma,
> > @@ -1197,6 +1200,15 @@ static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> >   {
> >       return false;
> >   }
> > +
> > +static inline
> > +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> > +                           struct vm_area_struct *vma, unsigned long start,
> > +                           unsigned long end)
> > +{
> > +             BUG();
> > +}
> > +
> >   static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
> >                                          struct mm_struct *mm,
> >                                          struct vm_area_struct *vma,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6e0c5fbfe32c..da30621656b8 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5605,18 +5605,24 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
> >   static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >                       struct vm_area_struct *vma,
> >                       struct address_space *mapping, pgoff_t idx,
> > -                     unsigned long address, pte_t *ptep,
> > +                     unsigned long address, struct hugetlb_pte *hpte,
> >                       pte_t old_pte, unsigned int flags)
> >   {
> >       struct hstate *h = hstate_vma(vma);
> >       vm_fault_t ret = VM_FAULT_SIGBUS;
> >       int anon_rmap = 0;
> >       unsigned long size;
> > -     struct page *page;
> > +     struct page *page, *subpage;
> >       pte_t new_pte;
> >       spinlock_t *ptl;
> >       unsigned long haddr = address & huge_page_mask(h);
> > +     unsigned long haddr_hgm = address & hugetlb_pte_mask(hpte);
> >       bool new_page, new_pagecache_page = false;
> > +     /*
> > +      * This page is getting mapped for the first time, in which case we
> > +      * want to increment its mapcount.
> > +      */
> > +     bool new_mapping = hpte->shift == huge_page_shift(h);
> >
> >       /*
> >        * Currently, we are forced to kill the process in the event the
> > @@ -5665,9 +5671,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >                        * here.  Before returning error, get ptl and make
> >                        * sure there really is no pte entry.
> >                        */
> > -                     ptl = huge_pte_lock(h, mm, ptep);
> > +                     ptl = hugetlb_pte_lock(mm, hpte);
> >                       ret = 0;
> > -                     if (huge_pte_none(huge_ptep_get(ptep)))
> > +                     if (hugetlb_pte_none(hpte))
> >                               ret = vmf_error(PTR_ERR(page));
> >                       spin_unlock(ptl);
> >                       goto out;
> > @@ -5731,18 +5737,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >               vma_end_reservation(h, vma, haddr);
> >       }
> >
> > -     ptl = huge_pte_lock(h, mm, ptep);
> > +     ptl = hugetlb_pte_lock(mm, hpte);
> >       ret = 0;
> >       /* If pte changed from under us, retry */
> > -     if (!pte_same(huge_ptep_get(ptep), old_pte))
> > +     if (!pte_same(hugetlb_ptep_get(hpte), old_pte))
> >               goto backout;
> >
> > -     if (anon_rmap) {
> > -             ClearHPageRestoreReserve(page);
> > -             hugepage_add_new_anon_rmap(page, vma, haddr);
> > -     } else
> > -             page_dup_file_rmap(page, true);
> > -     new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
> > +     if (new_mapping) {
> > +             /* Only increment this page's mapcount if we are mapping it
> > +              * for the first time.
> > +              */
> > +             if (anon_rmap) {
> > +                     ClearHPageRestoreReserve(page);
> > +                     hugepage_add_new_anon_rmap(page, vma, haddr);
> > +             } else
> > +                     page_dup_file_rmap(page, true);
> > +     }
> > +
> > +     subpage = hugetlb_find_subpage(h, page, haddr_hgm);
>
>                sorry did not understand why make_huge_pte we may be mapping just PAGE_SIZE
>
>                too here.
>

This should be make_huge_pte_with_shift(), with shift =
hugetlb_pte_shift(hpte). Thanks.

> > +     new_pte = make_huge_pte(vma, subpage, ((vma->vm_flags & VM_WRITE)
> >                               && (vma->vm_flags & VM_SHARED)));
> >       /*
> >        * If this pte was previously wr-protected, keep it wr-protected even
> > @@ -5750,12 +5763,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >        */
> >       if (unlikely(pte_marker_uffd_wp(old_pte)))
> >               new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
> > -     set_huge_pte_at(mm, haddr, ptep, new_pte);
> > +     set_huge_pte_at(mm, haddr_hgm, hpte->ptep, new_pte);
> >
> > -     hugetlb_count_add(pages_per_huge_page(h), mm);
> > +     hugetlb_count_add(hugetlb_pte_size(hpte) / PAGE_SIZE, mm);
> >       if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> > +             BUG_ON(hugetlb_pte_size(hpte) != huge_page_size(h));
> >               /* Optimization, do the COW without a second fault */
> > -             ret = hugetlb_wp(mm, vma, address, ptep, flags, page, ptl);
> > +             ret = hugetlb_wp(mm, vma, address, hpte->ptep, flags, page, ptl);
> >       }
> >
> >       spin_unlock(ptl);
> > @@ -5816,11 +5830,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >       u32 hash;
> >       pgoff_t idx;
> >       struct page *page = NULL;
> > +     struct page *subpage = NULL;
> >       struct page *pagecache_page = NULL;
> >       struct hstate *h = hstate_vma(vma);
> >       struct address_space *mapping;
> >       int need_wait_lock = 0;
> >       unsigned long haddr = address & huge_page_mask(h);
> > +     unsigned long haddr_hgm;
> > +     bool hgm_enabled = hugetlb_hgm_enabled(vma);
> > +     struct hugetlb_pte hpte;
> >
> >       ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> >       if (ptep) {
> > @@ -5866,11 +5884,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >       hash = hugetlb_fault_mutex_hash(mapping, idx);
> >       mutex_lock(&hugetlb_fault_mutex_table[hash]);
> >
> > -     entry = huge_ptep_get(ptep);
> > +     hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
> > +
> > +     if (hgm_enabled) {
> > +             ret = hugetlb_walk_to(mm, &hpte, address,
> > +                                   PAGE_SIZE, /*stop_at_none=*/true);
> > +             if (ret) {
> > +                     ret = vmf_error(ret);
> > +                     goto out_mutex;
> > +             }
> > +     }
> > +
> > +     entry = hugetlb_ptep_get(&hpte);
> >       /* PTE markers should be handled the same way as none pte */
> > -     if (huge_pte_none_mostly(entry)) {
> > -             ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> > -                                   entry, flags);
> > +     if (hugetlb_pte_none_mostly(&hpte)) {
> > +             ret = hugetlb_no_page(mm, vma, mapping, idx, address, &hpte,
> > +                             entry, flags);
> >               goto out_mutex;
> >       }
> >
> > @@ -5908,14 +5937,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >                                                               vma, haddr);
> >       }
> >
> > -     ptl = huge_pte_lock(h, mm, ptep);
> > +     ptl = hugetlb_pte_lock(mm, &hpte);
> >
> >       /* Check for a racing update before calling hugetlb_wp() */
> > -     if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
> > +     if (unlikely(!pte_same(entry, hugetlb_ptep_get(&hpte))))
> >               goto out_ptl;
> >
> > +     /* haddr_hgm is the base address of the region that hpte maps. */
> > +     haddr_hgm = address & hugetlb_pte_mask(&hpte);
> > +
> >       /* Handle userfault-wp first, before trying to lock more pages */
> > -     if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
> > +     if (userfaultfd_wp(vma) && huge_pte_uffd_wp(hugetlb_ptep_get(&hpte)) &&
> >           (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
> >               struct vm_fault vmf = {
> >                       .vma = vma,
> > @@ -5939,7 +5971,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >        * pagecache_page, so here we need take the former one
> >        * when page != pagecache_page or !pagecache_page.
> >        */
> > -     page = pte_page(entry);
> > +     subpage = pte_page(entry);
> > +     page = compound_head(subpage);
> >       if (page != pagecache_page)
> >               if (!trylock_page(page)) {
> >                       need_wait_lock = 1;
> > @@ -5950,7 +5983,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> >       if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
> >               if (!huge_pte_write(entry)) {
> > -                     ret = hugetlb_wp(mm, vma, address, ptep, flags,
> > +                     BUG_ON(hugetlb_pte_size(&hpte) != huge_page_size(h));
>
> is it in respect to fact that userfault_wp is not support with HGM mapping currently? Not
>
> sure yet though how it is controlled may be next patches will have more details.

Yeah this BUG_ON is just because I haven't implemented support for
userfaultfd_wp yet (userfaultfd_wp for HugeTLB was added pretty
recently, while I was working on this patch series). I'll improve WP
support for the next version.

>
> > +                     ret = hugetlb_wp(mm, vma, address, hpte.ptep, flags,
> >                                        pagecache_page, ptl);
> >                       goto out_put_page;
> >               } else if (likely(flags & FAULT_FLAG_WRITE)) {
> > @@ -5958,9 +5992,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >               }
> >       }
> >       entry = pte_mkyoung(entry);
> > -     if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
> > +     if (huge_ptep_set_access_flags(vma, haddr_hgm, hpte.ptep, entry,
> >                                               flags & FAULT_FLAG_WRITE))
> > -             update_mmu_cache(vma, haddr, ptep);
> > +             update_mmu_cache(vma, haddr_hgm, hpte.ptep);
> >   out_put_page:
> >       if (page != pagecache_page)
> >               unlock_page(page);
> > @@ -6951,7 +6985,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> >                               pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >               }
> >       }
> > -     BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
> > +     if (!hugetlb_hgm_enabled(vma))
> > +             BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
> >
> >       return pte;
> >   }
> > @@ -7057,6 +7092,38 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> >                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> >                              (tmp_h)++)
> >
> > +/*
> > + * Allocate a HugeTLB PTE that maps as much of [start, end) as possible with a
> > + * single page table entry. The allocated HugeTLB PTE is returned in hpte.
> > + */
>
> Will it be used for madvise_collapase? If so will it make sense to keep it in different patch
>
> as this one title says just for handle_page_fault routines.

This is used by userfaultfd/UFFDIO_CONTINUE -- I will move this diff
to the patch that uses it (certainly shouldn't be in this patch).

>
> > +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> > +                           struct vm_area_struct *vma, unsigned long start,
> > +                           unsigned long end)
> > +{
> > +     struct hstate *h = hstate_vma(vma), *tmp_h;
> > +     unsigned int shift;
> > +     int ret;
> > +
> > +     for_each_hgm_shift(h, tmp_h, shift) {
> > +             unsigned long sz = 1UL << shift;
> > +
> > +             if (!IS_ALIGNED(start, sz) || start + sz > end)
> > +                     continue;
> > +             ret = huge_pte_alloc_high_granularity(hpte, mm, vma, start,
> > +                                                   shift, HUGETLB_SPLIT_NONE,
> > +                                                   /*write_locked=*/false);
> > +             if (ret)
> > +                     return ret;
> > +
> > +             if (hpte->shift > shift)
> > +                     return -EEXIST;
> > +
> > +             BUG_ON(hpte->shift != shift);
> > +             return 0;
> > +     }
> > +     return -EINVAL;
> > +}
> > +
> >   /*
> >    * Given a particular address, split the HugeTLB PTE that currently maps it
> >    * so that, for the given address, the PTE that maps it is `desired_shift`.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality
  2022-06-27 13:50   ` manish.mishra
@ 2022-06-29 16:10     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-29 16:10 UTC (permalink / raw)
  To: manish.mishra
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 6:51 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > The new function, hugetlb_split_to_shift, will optimally split the page
> > table to map a particular address at a particular granularity.
> >
> > This is useful for punching a hole in the mapping and for mapping small
> > sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 122 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3ec2a921ee6f..eaffe7b4f67c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
> >   /* Forward declaration */
> >   static int hugetlb_acct_memory(struct hstate *h, long delta);
> >
> > +/*
> > + * Find the subpage that corresponds to `addr` in `hpage`.
> > + */
> > +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> > +                              unsigned long addr)
> > +{
> > +     size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> > +
> > +     BUG_ON(idx >= pages_per_huge_page(h));
> > +     return &hpage[idx];
> > +}
> > +
> >   static inline bool subpool_is_free(struct hugepage_subpool *spool)
> >   {
> >       if (spool->count)
> > @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> >       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> >                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> >                              (tmp_h)++)
> > +
> > +/*
> > + * Given a particular address, split the HugeTLB PTE that currently maps it
> > + * so that, for the given address, the PTE that maps it is `desired_shift`.
> > + * This function will always split the HugeTLB PTE optimally.
> > + *
> > + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
> > + * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
> > + * these changes to the page table:
> > + * 1. The PUD will be split into 2M PMDs.
> > + * 2. The first PMD will be split again into 4K PTEs.
> > + */
> > +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> > +                        const struct hugetlb_pte *hpte,
> > +                        unsigned long addr, unsigned long desired_shift)
> > +{
> > +     unsigned long start, end, curr;
> > +     unsigned long desired_sz = 1UL << desired_shift;
> > +     struct hstate *h = hstate_vma(vma);
> > +     int ret;
> > +     struct hugetlb_pte new_hpte;
> > +     struct mmu_notifier_range range;
> > +     struct page *hpage = NULL;
> > +     struct page *subpage;
> > +     pte_t old_entry;
> > +     struct mmu_gather tlb;
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> can it be BUG_ON(hugetlb_pte_size(hpte) <= desired_sz)

Sure -- I think that's better.

> > +
> > +     start = addr & hugetlb_pte_mask(hpte);
> > +     end = start + hugetlb_pte_size(hpte);
> > +
> > +     i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>
> As it is just changing mappings is holding f_mapping required? I mean in future
>
> is it any paln or way to use some per process level sub-lock?

We don't need to hold a per-mapping lock here, you're right; a per-VMA
lock will do just fine. I'll replace this with a per-VMA lock for the
next version.

>
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     /* This function only works if we are looking at a leaf-level PTE. */
> > +     BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> > +
> > +     /*
> > +      * Clear the PTE so that we will allocate the PT structures when
> > +      * walking the page table.
> > +      */
> > +     old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
> > +
> > +     if (!huge_pte_none(old_entry))
> > +             hpage = pte_page(old_entry);
> > +
> > +     BUG_ON(!IS_ALIGNED(start, desired_sz));
> > +     BUG_ON(!IS_ALIGNED(end, desired_sz));
> > +
> > +     for (curr = start; curr < end;) {
> > +             struct hstate *tmp_h;
> > +             unsigned int shift;
> > +
> > +             for_each_hgm_shift(h, tmp_h, shift) {
> > +                     unsigned long sz = 1UL << shift;
> > +
> > +                     if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> > +                             continue;
> > +                     /*
> > +                      * If we are including `addr`, we need to make sure
> > +                      * splitting down to the correct size. Go to a smaller
> > +                      * size if we are not.
> > +                      */
> > +                     if (curr <= addr && curr + sz > addr &&
> > +                                     shift > desired_shift)
> > +                             continue;
> > +
> > +                     /*
> > +                      * Continue the page table walk to the level we want,
> > +                      * allocate PT structures as we go.
> > +                      */
>
> As i understand this for_each_hgm_shift loop is just to find right size of shift,
>
> then code below this line can be put out of loop, no strong feeling but it looks
>
> more proper may make code easier to understand.

Agreed. I'll clean this up.

>
> > +                     hugetlb_pte_copy(&new_hpte, hpte);
> > +                     ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> > +                                           /*stop_at_none=*/false);
> > +                     if (ret)
> > +                             goto err;
> > +                     BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> > +                     if (hpage) {
> > +                             pte_t new_entry;
> > +
> > +                             subpage = hugetlb_find_subpage(h, hpage, curr);
> > +                             new_entry = make_huge_pte_with_shift(vma, subpage,
> > +                                                                  huge_pte_write(old_entry),
> > +                                                                  shift);
> > +                             set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> > +                     }
> > +                     curr += sz;
> > +                     goto next;
> > +             }
> > +             /* We couldn't find a size that worked. */
> > +             BUG();
> > +next:
> > +             continue;
> > +     }
> > +
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > +                             start, end);
> > +     mmu_notifier_invalidate_range_start(&range);
>
> sorry did not understand where tlb flush will be taken care in case of success?
>
> I see set_huge_pte_at does not do it internally by self.

A TLB flush isn't necessary in the success case -- pages that were
mapped before will continue to be mapped the same way, so the TLB
entries will still be valid. If we're splitting a none P*D, then
there's nothing to flush. If we're splitting a present P*D, then the
flush will come if/when we clear any of the page table entries below
the P*D.

>
> > +     return 0;
> > +err:
> > +     tlb_gather_mmu(&tlb, mm);
> > +     /* Free any newly allocated page table entries. */
> > +     hugetlb_free_range(&tlb, hpte, start, curr);
> > +     /* Restore the old entry. */
> > +     set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> > +     tlb_finish_mmu(&tlb);
> > +     return ret;
> > +}
> >   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> >   /*

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality
  2022-06-29 14:33   ` manish.mishra
@ 2022-06-29 16:20     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-29 16:20 UTC (permalink / raw)
  To: manish.mishra
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Wed, Jun 29, 2022 at 7:33 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > The new function, hugetlb_split_to_shift, will optimally split the page
> > table to map a particular address at a particular granularity.
> >
> > This is useful for punching a hole in the mapping and for mapping small
> > sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 122 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3ec2a921ee6f..eaffe7b4f67c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
> >   /* Forward declaration */
> >   static int hugetlb_acct_memory(struct hstate *h, long delta);
> >
> > +/*
> > + * Find the subpage that corresponds to `addr` in `hpage`.
> > + */
> > +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> > +                              unsigned long addr)
> > +{
> > +     size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> > +
> > +     BUG_ON(idx >= pages_per_huge_page(h));
> > +     return &hpage[idx];
> > +}
> > +
> >   static inline bool subpool_is_free(struct hugepage_subpool *spool)
> >   {
> >       if (spool->count)
> > @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> >       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> >                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> >                              (tmp_h)++)
> > +
> > +/*
> > + * Given a particular address, split the HugeTLB PTE that currently maps it
> > + * so that, for the given address, the PTE that maps it is `desired_shift`.
> > + * This function will always split the HugeTLB PTE optimally.
> > + *
> > + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
> > + * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
> > + * these changes to the page table:
> > + * 1. The PUD will be split into 2M PMDs.
> > + * 2. The first PMD will be split again into 4K PTEs.
> > + */
> > +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> > +                        const struct hugetlb_pte *hpte,
> > +                        unsigned long addr, unsigned long desired_shift)
> > +{
> > +     unsigned long start, end, curr;
> > +     unsigned long desired_sz = 1UL << desired_shift;
> > +     struct hstate *h = hstate_vma(vma);
> > +     int ret;
> > +     struct hugetlb_pte new_hpte;
> > +     struct mmu_notifier_range range;
> > +     struct page *hpage = NULL;
> > +     struct page *subpage;
> > +     pte_t old_entry;
> > +     struct mmu_gather tlb;
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> > +
> > +     start = addr & hugetlb_pte_mask(hpte);
> > +     end = start + hugetlb_pte_size(hpte);
> > +
> > +     i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     /* This function only works if we are looking at a leaf-level PTE. */
> > +     BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> > +
> > +     /*
> > +      * Clear the PTE so that we will allocate the PT structures when
> > +      * walking the page table.
> > +      */
> > +     old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
>
> Sorry missed it last time, what if hgm mapping present here and current hpte is
>
> at higher level. Where we will clear and free child page-table pages.
>
> I see it does not happen in huge_ptep_get_and_clear.

This shouldn't happen because earlier we have
BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));

i.e., hpte must either be none or present and leaf-level.

>
> > +
> > +     if (!huge_pte_none(old_entry))
> > +             hpage = pte_page(old_entry);
> > +
> > +     BUG_ON(!IS_ALIGNED(start, desired_sz));
> > +     BUG_ON(!IS_ALIGNED(end, desired_sz));
> > +
> > +     for (curr = start; curr < end;) {
> > +             struct hstate *tmp_h;
> > +             unsigned int shift;
> > +
> > +             for_each_hgm_shift(h, tmp_h, shift) {
> > +                     unsigned long sz = 1UL << shift;
> > +
> > +                     if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> > +                             continue;
> > +                     /*
> > +                      * If we are including `addr`, we need to make sure
> > +                      * splitting down to the correct size. Go to a smaller
> > +                      * size if we are not.
> > +                      */
> > +                     if (curr <= addr && curr + sz > addr &&
> > +                                     shift > desired_shift)
> > +                             continue;
> > +
> > +                     /*
> > +                      * Continue the page table walk to the level we want,
> > +                      * allocate PT structures as we go.
> > +                      */
> > +                     hugetlb_pte_copy(&new_hpte, hpte);
> > +                     ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> > +                                           /*stop_at_none=*/false);
> > +                     if (ret)
> > +                             goto err;
> > +                     BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> > +                     if (hpage) {
> > +                             pte_t new_entry;
> > +
> > +                             subpage = hugetlb_find_subpage(h, hpage, curr);
> > +                             new_entry = make_huge_pte_with_shift(vma, subpage,
> > +                                                                  huge_pte_write(old_entry),
> > +                                                                  shift);
> > +                             set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> > +                     }
> > +                     curr += sz;
> > +                     goto next;
> > +             }
> > +             /* We couldn't find a size that worked. */
> > +             BUG();
> > +next:
> > +             continue;
> > +     }
> > +
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > +                             start, end);
> > +     mmu_notifier_invalidate_range_start(&range);
> > +     return 0;
> > +err:
> > +     tlb_gather_mmu(&tlb, mm);
> > +     /* Free any newly allocated page table entries. */
> > +     hugetlb_free_range(&tlb, hpte, start, curr);
> > +     /* Restore the old entry. */
> > +     set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> > +     tlb_finish_mmu(&tlb);
> > +     return ret;
> > +}
> >   #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> >   /*

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-28 20:44   ` Mike Kravetz
@ 2022-06-29 16:24     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-29 16:24 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Tue, Jun 28, 2022 at 1:45 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 06/24/22 17:36, James Houghton wrote:
> > After high-granularity mapping, page table entries for HugeTLB pages can
> > be of any size/type. (For example, we can have a 1G page mapped with a
> > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > PTE after we have done a page table walk.
> >
> > Without this, we'd have to pass around the "size" of the PTE everywhere.
> > We effectively did this before; it could be fetched from the hstate,
> > which we pass around pretty much everywhere.
> >
> > This commit includes definitions for some basic helper functions that
> > are used later. These helper functions wrap existing PTE
> > inspection/modification functions, where the correct version is picked
> > depending on if the HugeTLB PTE is actually "huge" or not. (Previously,
> > all HugeTLB PTEs were "huge").
> >
> > For example, hugetlb_ptep_get wraps huge_ptep_get and ptep_get, where
> > ptep_get is used when the HugeTLB PTE is PAGE_SIZE, and huge_ptep_get is
> > used in all other cases.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h | 84 +++++++++++++++++++++++++++++++++++++++++
> >  mm/hugetlb.c            | 57 ++++++++++++++++++++++++++++
> >  2 files changed, 141 insertions(+)
>
> There is nothing 'wrong' with this patch, but it does make me wonder.
> After introducing hugetlb_pte, is all code dealing with hugetlb mappings
> going to be using hugetlb_ptes?  It would be quite confusing if there is
> a mix of hugetlb_ptes and non-hugetlb_ptes.  This will be revealed later
> in the series, but a comment about suture direction would be helpful
> here.

That is indeed the direction I am trying to go -- I'll make sure to
comment on this in this patch. I am planning to replace all other
non-hugetlb_pte uses with hugetlb_pte in the next version of this
series (I see it as necessary to get HGM merged).

> --
> Mike Kravetz
>
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 5fe1db46d8c9..1d4ec9dfdebf 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -46,6 +46,68 @@ enum {
> >       __NR_USED_SUBPAGE,
> >  };
> >
> > +struct hugetlb_pte {
> > +     pte_t *ptep;
> > +     unsigned int shift;
> > +};
> > +
> > +static inline
> > +void hugetlb_pte_init(struct hugetlb_pte *hpte)
> > +{
> > +     hpte->ptep = NULL;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > +                       unsigned int shift)
> > +{
> > +     BUG_ON(!ptep);
> > +     hpte->ptep = ptep;
> > +     hpte->shift = shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > +     BUG_ON(!hpte->ptep);
> > +     return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > +     BUG_ON(!hpte->ptep);
> > +     return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > +     BUG_ON(!hpte->ptep);
> > +     return hpte->shift;
> > +}
> > +
> > +static inline
> > +bool hugetlb_pte_huge(const struct hugetlb_pte *hpte)
> > +{
> > +     return !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) ||
> > +             hugetlb_pte_shift(hpte) > PAGE_SHIFT;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > +     dest->ptep = src->ptep;
> > +     dest->shift = src->shift;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte);
> > +bool hugetlb_pte_none(const struct hugetlb_pte *hpte);
> > +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
> > +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
> > +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> > +                    unsigned long address);
> > +
> >  struct hugepage_subpool {
> >       spinlock_t lock;
> >       long count;
> > @@ -1130,6 +1192,28 @@ static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
> >       return ptl;
> >  }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     // Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
> > +     // the regular page table lock.
> > +     if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
> > +             return huge_pte_lockptr(hugetlb_pte_shift(hpte),
> > +                             mm, hpte->ptep);
> > +     return &mm->page_table_lock;
> > +}
> > +
> > +static inline
> > +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +     spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > +
> > +     spin_lock(ptl);
> > +     return ptl;
> > +}
> > +
> >  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
> >  extern void __init hugetlb_cma_reserve(int order);
> >  extern void __init hugetlb_cma_check(void);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index d6d0d4c03def..1a1434e29740 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1120,6 +1120,63 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> >       return false;
> >  }
> >
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
> > +{
> > +     pgd_t pgd;
> > +     p4d_t p4d;
> > +     pud_t pud;
> > +     pmd_t pmd;
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
> > +             pgd = *(pgd_t *)hpte->ptep;
> > +             return pgd_present(pgd) && pgd_leaf(pgd);
> > +     } else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> > +             p4d = *(p4d_t *)hpte->ptep;
> > +             return p4d_present(p4d) && p4d_leaf(p4d);
> > +     } else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> > +             pud = *(pud_t *)hpte->ptep;
> > +             return pud_present(pud) && pud_leaf(pud);
> > +     } else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> > +             pmd = *(pmd_t *)hpte->ptep;
> > +             return pmd_present(pmd) && pmd_leaf(pmd);
> > +     } else if (hugetlb_pte_size(hpte) >= PAGE_SIZE)
> > +             return pte_present(*hpte->ptep);
> > +     BUG();
> > +}
> > +
> > +bool hugetlb_pte_none(const struct hugetlb_pte *hpte)
> > +{
> > +     if (hugetlb_pte_huge(hpte))
> > +             return huge_pte_none(huge_ptep_get(hpte->ptep));
> > +     return pte_none(ptep_get(hpte->ptep));
> > +}
> > +
> > +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte)
> > +{
> > +     if (hugetlb_pte_huge(hpte))
> > +             return huge_pte_none_mostly(huge_ptep_get(hpte->ptep));
> > +     return pte_none_mostly(ptep_get(hpte->ptep));
> > +}
> > +
> > +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte)
> > +{
> > +     if (hugetlb_pte_huge(hpte))
> > +             return huge_ptep_get(hpte->ptep);
> > +     return ptep_get(hpte->ptep);
> > +}
> > +
> > +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> > +                    unsigned long address)
> > +{
> > +     BUG_ON(!hpte->ptep);
> > +     unsigned long sz = hugetlb_pte_size(hpte);
> > +
> > +     if (sz > PAGE_SIZE)
> > +             return huge_pte_clear(mm, address, hpte->ptep, sz);
> > +     return pte_clear(mm, address, hpte->ptep);
> > +}
> > +
> >  static void enqueue_huge_page(struct hstate *h, struct page *page)
> >  {
> >       int nid = page_to_nid(page);
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-27 12:47   ` manish.mishra
@ 2022-06-29 16:28     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-29 16:28 UTC (permalink / raw)
  To: manish.mishra
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jun 27, 2022 at 5:47 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > After high-granularity mapping, page table entries for HugeTLB pages can
> > be of any size/type. (For example, we can have a 1G page mapped with a
> > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > PTE after we have done a page table walk.
> >
> > Without this, we'd have to pass around the "size" of the PTE everywhere.
> > We effectively did this before; it could be fetched from the hstate,
> > which we pass around pretty much everywhere.
> >
> > This commit includes definitions for some basic helper functions that
> > are used later. These helper functions wrap existing PTE
> > inspection/modification functions, where the correct version is picked
> > depending on if the HugeTLB PTE is actually "huge" or not. (Previously,
> > all HugeTLB PTEs were "huge").
> >
> > For example, hugetlb_ptep_get wraps huge_ptep_get and ptep_get, where
> > ptep_get is used when the HugeTLB PTE is PAGE_SIZE, and huge_ptep_get is
> > used in all other cases.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   include/linux/hugetlb.h | 84 +++++++++++++++++++++++++++++++++++++++++
> >   mm/hugetlb.c            | 57 ++++++++++++++++++++++++++++
> >   2 files changed, 141 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 5fe1db46d8c9..1d4ec9dfdebf 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -46,6 +46,68 @@ enum {
> >       __NR_USED_SUBPAGE,
> >   };
> >
> > +struct hugetlb_pte {
> > +     pte_t *ptep;
> > +     unsigned int shift;
> > +};
> > +
> > +static inline
> > +void hugetlb_pte_init(struct hugetlb_pte *hpte)
> > +{
> > +     hpte->ptep = NULL;
> I agree it does not matter but still will hpte->shift = 0 too be better?
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > +                       unsigned int shift)
> > +{
> > +     BUG_ON(!ptep);
> > +     hpte->ptep = ptep;
> > +     hpte->shift = shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > +     BUG_ON(!hpte->ptep);
> > +     return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > +     BUG_ON(!hpte->ptep);
> > +     return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > +     BUG_ON(!hpte->ptep);
> > +     return hpte->shift;
> > +}
> > +
> > +static inline
> > +bool hugetlb_pte_huge(const struct hugetlb_pte *hpte)
> > +{
> > +     return !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) ||
> > +             hugetlb_pte_shift(hpte) > PAGE_SHIFT;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > +     dest->ptep = src->ptep;
> > +     dest->shift = src->shift;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte);
> > +bool hugetlb_pte_none(const struct hugetlb_pte *hpte);
> > +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
> > +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
> > +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> > +                    unsigned long address);
> > +
> >   struct hugepage_subpool {
> >       spinlock_t lock;
> >       long count;
> > @@ -1130,6 +1192,28 @@ static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
> >       return ptl;
> >   }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     // Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
> > +     // the regular page table lock.
> > +     if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
> > +             return huge_pte_lockptr(hugetlb_pte_shift(hpte),
> > +                             mm, hpte->ptep);
> > +     return &mm->page_table_lock;
> > +}
> > +
> > +static inline
> > +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +     spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > +
> > +     spin_lock(ptl);
> > +     return ptl;
> > +}
> > +
> >   #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
> >   extern void __init hugetlb_cma_reserve(int order);
> >   extern void __init hugetlb_cma_check(void);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index d6d0d4c03def..1a1434e29740 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1120,6 +1120,63 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> >       return false;
> >   }
> >
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
> > +{
> > +     pgd_t pgd;
> > +     p4d_t p4d;
> > +     pud_t pud;
> > +     pmd_t pmd;
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
> > +             pgd = *(pgd_t *)hpte->ptep;
>
> sorry did not understand in these conditions why
>
> hugetlb_pte_size(hpte) >= PGDIR_SIZE. I mean why >= check
>
> and not just == check?

I did >= PGDIR_SIZE just because it was consistent with the rest of
the sizes, but, indeed, > PGDIR_SIZE makes little sense, so I'll
replace it with ==.

>
> > +             return pgd_present(pgd) && pgd_leaf(pgd);
> > +     } else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> > +             p4d = *(p4d_t *)hpte->ptep;
> > +             return p4d_present(p4d) && p4d_leaf(p4d);
> > +     } else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> > +             pud = *(pud_t *)hpte->ptep;
> > +             return pud_present(pud) && pud_leaf(pud);
> > +     } else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> > +             pmd = *(pmd_t *)hpte->ptep;
> > +             return pmd_present(pmd) && pmd_leaf(pmd);
> > +     } else if (hugetlb_pte_size(hpte) >= PAGE_SIZE)
> > +             return pte_present(*hpte->ptep);
> > +     BUG();
> > +}
> > +
> > +bool hugetlb_pte_none(const struct hugetlb_pte *hpte)
> > +{
> > +     if (hugetlb_pte_huge(hpte))
> > +             return huge_pte_none(huge_ptep_get(hpte->ptep));
> > +     return pte_none(ptep_get(hpte->ptep));
> > +}
> > +
> > +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte)
> > +{
> > +     if (hugetlb_pte_huge(hpte))
> > +             return huge_pte_none_mostly(huge_ptep_get(hpte->ptep));
> > +     return pte_none_mostly(ptep_get(hpte->ptep));
> > +}
> > +
> > +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte)
> > +{
> > +     if (hugetlb_pte_huge(hpte))
> > +             return huge_ptep_get(hpte->ptep);
> > +     return ptep_get(hpte->ptep);
> > +}
> > +
> > +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> > +                    unsigned long address)
> > +{
> > +     BUG_ON(!hpte->ptep);
> > +     unsigned long sz = hugetlb_pte_size(hpte);
> > +
> > +     if (sz > PAGE_SIZE)
> > +             return huge_pte_clear(mm, address, hpte->ptep, sz);
>
> just for cosistency something like above?
>
> if (hugetlb_pte_huge(hpte))
> +               return huge_pte_clear
> ;

Will do, yes. (I added hugetlb_pte_huge quite late, and I guess I
missed updating this spot. :))

>
> > +     return pte_clear(mm, address, hpte->ptep);
> > +}
> > +
> >   static void enqueue_huge_page(struct hstate *h, struct page *page)
> >   {
> >       int nid = page_to_nid(page);

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-28 20:25   ` Mina Almasry
@ 2022-06-29 16:42     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-29 16:42 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Tue, Jun 28, 2022 at 1:25 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> >
> > After high-granularity mapping, page table entries for HugeTLB pages can
> > be of any size/type. (For example, we can have a 1G page mapped with a
> > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > PTE after we have done a page table walk.
> >
> > Without this, we'd have to pass around the "size" of the PTE everywhere.
> > We effectively did this before; it could be fetched from the hstate,
> > which we pass around pretty much everywhere.
> >
> > This commit includes definitions for some basic helper functions that
> > are used later. These helper functions wrap existing PTE
> > inspection/modification functions, where the correct version is picked
> > depending on if the HugeTLB PTE is actually "huge" or not. (Previously,
> > all HugeTLB PTEs were "huge").
> >
> > For example, hugetlb_ptep_get wraps huge_ptep_get and ptep_get, where
> > ptep_get is used when the HugeTLB PTE is PAGE_SIZE, and huge_ptep_get is
> > used in all other cases.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h | 84 +++++++++++++++++++++++++++++++++++++++++
> >  mm/hugetlb.c            | 57 ++++++++++++++++++++++++++++
> >  2 files changed, 141 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 5fe1db46d8c9..1d4ec9dfdebf 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -46,6 +46,68 @@ enum {
> >         __NR_USED_SUBPAGE,
> >  };
> >
> > +struct hugetlb_pte {
> > +       pte_t *ptep;
> > +       unsigned int shift;
> > +};
> > +
> > +static inline
> > +void hugetlb_pte_init(struct hugetlb_pte *hpte)
> > +{
> > +       hpte->ptep = NULL;
>
> shift = 0; ?

I don't think this is necessary (but, admittedly, it is quite harmless
to add). ptep = NULL means that the hugetlb_pte isn't valid, and shift
could be anything. Originally I had a separate `bool valid`, but
ptep=NULL was exactly the same as valid=false.

>
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > +                         unsigned int shift)
> > +{
> > +       BUG_ON(!ptep);
> > +       hpte->ptep = ptep;
> > +       hpte->shift = shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > +       BUG_ON(!hpte->ptep);
> > +       return 1UL << hpte->shift;
> > +}
> > +
>
> This helper is quite redundant in my opinion.

Putting 1UL << hugetlb_pte_shit(hpte) everywhere is kind of annoying. :)

>
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > +       BUG_ON(!hpte->ptep);
> > +       return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > +       BUG_ON(!hpte->ptep);
> > +       return hpte->shift;
> > +}
> > +
>
> This one jumps as quite redundant too.

To make sure we aren't using an invalid hugetlb_pte, I want to remove
all places where I directly access hpte->shift -- really they should
all go through hugetlb_pte_shift.

>
> > +static inline
> > +bool hugetlb_pte_huge(const struct hugetlb_pte *hpte)
> > +{
> > +       return !IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) ||
> > +               hugetlb_pte_shift(hpte) > PAGE_SHIFT;
> > +}
> > +
>
> I'm guessing the !IS_ENABLED() check is because only the HGM code
> would store a non-huge pte in a hugetlb_pte struct. I think it's a bit
> fragile because anyone can add code in the future that uses
> hugetlb_pte in unexpected ways, but I will concede that it is correct
> as written.

I added this so that, if HGM isn't enabled, the compiler would have an
easier time optimizing things. I don't really have strong feelings
about keeping/removing it.

>
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > +       dest->ptep = src->ptep;
> > +       dest->shift = src->shift;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte);
> > +bool hugetlb_pte_none(const struct hugetlb_pte *hpte);
> > +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
> > +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
> > +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> > +                      unsigned long address);
> > +
> >  struct hugepage_subpool {
> >         spinlock_t lock;
> >         long count;
> > @@ -1130,6 +1192,28 @@ static inline spinlock_t *huge_pte_lock_shift(unsigned int shift,
> >         return ptl;
> >  }
> >
> > +static inline
>
> Maybe for organization, move all the static functions you're adding
> above the hugetlb_pte_* declarations you're adding?

Will do.

>
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > +       BUG_ON(!hpte->ptep);
> > +       // Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
> > +       // the regular page table lock.
>
> Does checkpatch.pl not complain about // style comments? I think those
> are not allowed, no?

It didn't :( I thought I went through and removed them all -- I guess
I missed some.

>
> > +       if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
> > +               return huge_pte_lockptr(hugetlb_pte_shift(hpte),
> > +                               mm, hpte->ptep);
> > +       return &mm->page_table_lock;
> > +}
> > +
> > +static inline
> > +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +       spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > +
> > +       spin_lock(ptl);
> > +       return ptl;
> > +}
> > +
> >  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
> >  extern void __init hugetlb_cma_reserve(int order);
> >  extern void __init hugetlb_cma_check(void);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index d6d0d4c03def..1a1434e29740 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1120,6 +1120,63 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> >         return false;
> >  }
> >
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
> > +{
> > +       pgd_t pgd;
> > +       p4d_t p4d;
> > +       pud_t pud;
> > +       pmd_t pmd;
> > +
> > +       BUG_ON(!hpte->ptep);
> > +       if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
> > +               pgd = *(pgd_t *)hpte->ptep;
> > +               return pgd_present(pgd) && pgd_leaf(pgd);
> > +       } else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> > +               p4d = *(p4d_t *)hpte->ptep;
> > +               return p4d_present(p4d) && p4d_leaf(p4d);
> > +       } else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> > +               pud = *(pud_t *)hpte->ptep;
> > +               return pud_present(pud) && pud_leaf(pud);
> > +       } else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> > +               pmd = *(pmd_t *)hpte->ptep;
> > +               return pmd_present(pmd) && pmd_leaf(pmd);
> > +       } else if (hugetlb_pte_size(hpte) >= PAGE_SIZE)
> > +               return pte_present(*hpte->ptep);
>
> The use of >= is a bit curious to me. Shouldn't these be ==?

These (except PGDIR_SIZE) should be >=. This is because some
architectures support multiple huge PTE sizes at the same page table
level. For example, on arm64, you can have 2M PMDs, and you can also
have 32M PMDs[1].

[1]: https://www.kernel.org/doc/html/latest/arm64/hugetlbpage.html

>
> Also probably doesn't matter but I was thinking to use *_SHIFTs
> instead of *_SIZE so you don't have to calculate the size 5 times in
> this routine, or calculate hugetlb_pte_size() once for some less code
> duplication and re-use?

I'll change this to use the shift, and I'll move the computation so
it's only done once (it is probably helpful for readability too). (I
imagine the compiler only actually computes the size once here.)

>
> > +       BUG();
> > +}
> > +
> > +bool hugetlb_pte_none(const struct hugetlb_pte *hpte)
> > +{
> > +       if (hugetlb_pte_huge(hpte))
> > +               return huge_pte_none(huge_ptep_get(hpte->ptep));
> > +       return pte_none(ptep_get(hpte->ptep));
> > +}
> > +
> > +bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte)
> > +{
> > +       if (hugetlb_pte_huge(hpte))
> > +               return huge_pte_none_mostly(huge_ptep_get(hpte->ptep));
> > +       return pte_none_mostly(ptep_get(hpte->ptep));
> > +}
> > +
> > +pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte)
> > +{
> > +       if (hugetlb_pte_huge(hpte))
> > +               return huge_ptep_get(hpte->ptep);
> > +       return ptep_get(hpte->ptep);
> > +}
> > +
> > +void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> > +                      unsigned long address)
> > +{
> > +       BUG_ON(!hpte->ptep);
> > +       unsigned long sz = hugetlb_pte_size(hpte);
> > +
> > +       if (sz > PAGE_SIZE)
> > +               return huge_pte_clear(mm, address, hpte->ptep, sz);
> > +       return pte_clear(mm, address, hpte->ptep);
> > +}
> > +
> >  static void enqueue_huge_page(struct hstate *h, struct page *page)
> >  {
> >         int nid = page_to_nid(page);
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-28 17:56       ` Dr. David Alan Gilbert
@ 2022-06-29 18:31         ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-29 18:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Mina Almasry, Mike Kravetz, Muchun Song, Peter Xu,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Jue Wang,
	Manish Mishra, linux-mm, linux-kernel

On Tue, Jun 28, 2022 at 10:56 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Mina Almasry (almasrymina@google.com) wrote:
> > On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > > > >
> > > > > [trimmed...]
> > > > > ---- Userspace API ----
> > > > >
> > > > > This patch series introduces a single way to take advantage of
> > > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > > userspace to resolve MINOR page faults on shared VMAs.
> > > > >
> > > > > To collapse a HugeTLB address range that has been mapped with several
> > > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > > userspace to know when all pages (that they care about) have been fetched.
> > > > >
> > > >
> > > > Thanks James! Cover letter looks good. A few questions:
> > > >
> > > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > > have been fetched automatically? It would remove the need for a new
> > > > userspace API, and AFACT there aren't really any cases where it is
> > > > beneficial to have a hugepage sharded into 4K mappings when those
> > > > mappings can be collapsed.
> > >
> > > The reason that we don't automatically collapse mappings is because it
> > > would take additional complexity, and it is less flexible. Consider
> > > the case of 1G pages on x86: currently, userspace can collapse the
> > > whole page when it's all ready, but they can also choose to collapse a
> > > 2M piece of it. On architectures with more supported hugepage sizes
> > > (e.g., arm64), userspace has even more possibilities for when to
> > > collapse. This likely further complicates a potential
> > > automatic-collapse solution. Userspace may also want to collapse the
> > > mapping for an entire hugepage without completely mapping the hugepage
> > > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > > the holes, though).
> > >
> >
> > To be honest I'm don't think I'm a fan of this. I don't think this
> > saves complexity, but rather pushes it to the userspace. I.e. the
> > userspace now must track which regions are faulted in and which are
> > not to call MADV_COLLAPSE at the right time. Also, if the userspace
> > gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> > any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> > storm of maybe hundreds of minor faults at once which may take too
> > long to resolve and may impact guest stability, yes?
>
> I think it depends on whether the userspace is already holding bitmaps
> and data structures to let it know when the right time to call collapse
> is; if it already has to do all that book keeping for it's own postcopy
> or whatever process, then getting userspace to call it is easy.
> (I don't know the answer to whether it does have!)

Userspace generally has a lot of information about which pages have
been UFFDIO_CONTINUE'd, but they may not have the information (say,
some atomic count per hpage) to tell them exactly when to collapse.

I think it's worth discussing the tmpfs/THP case right now, too. Right
now, after userfaultfd post-copy, all THPs we have will all be
PTE-mapped. To deal with this, we need to use Zach's MADV_COLLAPSE to
collapse the mappings to PMD mappings (we don't want to wait for
khugepaged to happen upon them -- we want good performance ASAP :)).
In fact, IIUC, khugepaged actually won't collapse these *ever* right
now. I suppose we could enlighten tmpfs's UFFDIO_CONTINUE to
automatically collapse too (thus avoiding the need for MADV_COLLAPSE),
but that could be complicated/unwanted (if that is something we might
want, maybe we should have a separate discussion).

So, as it stands today, we intend to use MADV_COLLAPSE explicitly in
the tmpfs case as soon as it is supported, and so it follows that it's
ok to require userspace to do the same thing for HugeTLBFS-backed
memory.

>
> Dave
>
> > For these reasons I think automatic collapsing is something that will
> > eventually be implemented by us or someone else, and at that point
> > MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> > is adding a userspace API that will probably need to be maintained for
> > perpetuity but actually is likely going to be going obsolete "soon".
> > For this reason I had hoped that automatic collapsing would come with
> > V1.

Small, unimportant clarification: the API, as described here, won't be
*completely* meaningless if we end up implementing automatic
collapsing :) It still has the effect of not requiring other
UFFDIO_CONTINUE operations to be done for the collapsed region.

> >
> > I wonder if we can have a very simple first try at automatic
> > collapsing for V1? I.e., can we support collapsing to the hstate size
> > and only that? So 4K pages can only be either collapsed to 2MB or 1G
> > on x86 depending on the hstate size. I think this may be not too
> > difficult to implement: we can have a counter similar to mapcount that
> > tracks how many of the subpages are mapped (subpage_mapcount). Once
> > all the subpages are mapped (the counter reaches a certain value),
> > trigger collapsing similar to hstate size MADV_COLLAPSE.
> >

In my estimation, to implement automatic collapsing, for one VMA, we
will need a per-hstate count, where when the count reaches the maximum
number, we collapse automatically to the next most optimal size. So if
we finish filling in enough PTEs for a CONT_PTE, we will collapse to a
CONT_PTE. If we finish filling up CONT_PTEs to a PMD, then collapse to
a PMD.

If you are suggesting to only collapse to the hstate size at the end,
then we lose flexibility.

> > I gather that no one else reviewing this has raised this issue thus
> > far so it might not be a big deal and I will continue to review the
> > RFC, but I had hoped for automatic collapsing myself for the reasons
> > above.

Thanks for the thorough review, Mina. :)

> >
> > > >
> > > > > ---- HugeTLB Changes ----
> > > > >
> > > > > - Mapcount
> > > > > The way mapcount is handled is different from the way that it was handled
> > > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > > high granularity, their mapcounts will remain the same as what they would
> > > > > have been pre-HGM.
> > > > >
> > > >
> > > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > > differently, but the same if the page is not mapped at high
> > > > granularity. Can you elaborate on how the mapcount handling will be
> > > > different when the page is mapped at high granularity?
> > >
> > > I guess I didn't phrase this very well. For the sake of simplicity,
> > > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > > Previously, there were two possibilities for how a hugepage was
> > > mapped, either it was (1) completely mapped (PUD is present and a
> > > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > > case, where the PUD is not none but also not a leaf (this usually
> > > means that the page is partially mapped). We handle this case as if
> > > the whole page was mapped. That is, if we partially map a hugepage
> > > that was previously unmapped (making the PUD point to PMDs), we
> > > increment its mapcount, and if we completely unmap a partially mapped
> > > hugepage (making the PUD none), we decrement its mapcount. If we
> > > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> > >
> > > It is possible for a PUD to be present and not a leaf (mapcount has
> > > been incremented) but for the page to still be unmapped: if the PMDs
> > > (or PTEs) underneath are all none. This case is atypical, and as of
> > > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > > think it would be very difficult to get this to happen.
> > >
> >
> > Thank you for the detailed explanation. Please add it to the cover letter.
> >
> > I wonder the case "PUD present but all the PMD are none": is that a
> > bug? I don't understand the usefulness of that. Not a comment on this
> > patch but rather a curiosity.
> >
> > > >
> > > > > - Page table walking and manipulation
> > > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > > high-granularity mappings. Eventually, it's possible to merge
> > > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > > >
> > > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > > This is because we generally need to know the "size" of a PTE (previously
> > > > > always just huge_page_size(hstate)).
> > > > >
> > > > > For every page table manipulation function that has a huge version (e.g.
> > > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > > > PTE really is "huge".
> > > > >
> > > > > - Synchronization
> > > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > > writing, and for doing high-granularity page table walks, we require it to
> > > > > be held for reading.
> > > > >
> > > > > ---- Limitations & Future Changes ----
> > > > >
> > > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > > > failure recovery for both shared and private mappings.
> > > > >
> > > > > The memory failure use case poses its own challenges that can be
> > > > > addressed, but I will do so in a separate RFC.
> > > > >
> > > > > Performance has not been heavily scrutinized with this patch series. There
> > > > > are places where lock contention can significantly reduce performance. This
> > > > > will be addressed later.
> > > > >
> > > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > > page struct optimization[3], as we do not need to modify data contained
> > > > > in the subpage page structs.
> > > > >
> > > > > Other omissions:
> > > > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > > > >  - Support for mremap() (will be included in v1). This looks a lot like
> > > > >    the support we have for fork().
> > > > >  - Documentation changes (will be included in v1).
> > > > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > > > >    in v1).
> > > > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > > >    than arm64.
> > > > >
> > > > > ---- Patch Breakdown ----
> > > > >
> > > > > Patch 1     - Preliminary changes
> > > > > Patch 2-10  - HugeTLB HGM core changes
> > > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > > Patch 24-26 - arm64 support and selftests
> > > > >
> > > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > > >     to better names.
> > > >
> > > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > > is more granular than what it normally is (2MB/1G, etc).
> > >
> > > Noted. :)
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-28 17:26     ` Mina Almasry
  2022-06-28 17:56       ` Dr. David Alan Gilbert
@ 2022-06-29 20:39       ` Axel Rasmussen
  1 sibling, 0 replies; 123+ messages in thread
From: Axel Rasmussen @ 2022-06-29 20:39 UTC (permalink / raw)
  To: Mina Almasry
  Cc: James Houghton, Mike Kravetz, Muchun Song, Peter Xu,
	David Hildenbrand, David Rientjes, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, Linux MM, LKML

On Tue, Jun 28, 2022 at 10:27 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > > >
> > > > [trimmed...]
> > > > ---- Userspace API ----
> > > >
> > > > This patch series introduces a single way to take advantage of
> > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > userspace to resolve MINOR page faults on shared VMAs.
> > > >
> > > > To collapse a HugeTLB address range that has been mapped with several
> > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > userspace to know when all pages (that they care about) have been fetched.
> > > >
> > >
> > > Thanks James! Cover letter looks good. A few questions:
> > >
> > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > have been fetched automatically? It would remove the need for a new
> > > userspace API, and AFACT there aren't really any cases where it is
> > > beneficial to have a hugepage sharded into 4K mappings when those
> > > mappings can be collapsed.
> >
> > The reason that we don't automatically collapse mappings is because it
> > would take additional complexity, and it is less flexible. Consider
> > the case of 1G pages on x86: currently, userspace can collapse the
> > whole page when it's all ready, but they can also choose to collapse a
> > 2M piece of it. On architectures with more supported hugepage sizes
> > (e.g., arm64), userspace has even more possibilities for when to
> > collapse. This likely further complicates a potential
> > automatic-collapse solution. Userspace may also want to collapse the
> > mapping for an entire hugepage without completely mapping the hugepage
> > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > the holes, though).
> >
>
> To be honest I'm don't think I'm a fan of this. I don't think this
> saves complexity, but rather pushes it to the userspace. I.e. the
> userspace now must track which regions are faulted in and which are
> not to call MADV_COLLAPSE at the right time. Also, if the userspace
> gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> storm of maybe hundreds of minor faults at once which may take too
> long to resolve and may impact guest stability, yes?

I disagree, I think this is state userspace needs to maintain anyway,
even if we ignore the use case James' series is about.

One example: today, you can't UFFDIO_CONTINUE a region which is
already mapped - you'll get -EEXIST. So, userspace needs to be sure
not to double-continue an area. We could think about relaxing this,
but there's a tradeoff - being more permissive means it's "easier to
use", but, it also means we're less strict about catching potentially
buggy userspaces.

There's another case that I don't see any way to get rid of. The way
live migration at least for GCE works is, we have two things
installing new pages: the on-demand fetcher, which reacts to UFFD
events and resolves them. And then we have the background fetcher,
which goes along and fetches pages which haven't been touched /
requested yet (and which may never be, it's not uncommon for a guest
to have at least *some* pages which are very infrequently / never
touched). In order for the background fetcher to know what pages to
transfer over the network, or not, userspace has to remember which
ones it's already installed.

Another point is, consider the use case of UFFDIO_CONTINUE over
UFFDIO_COPY. When userspace gets a UFFD event for a page, the
assumption is that it's somewhat likely the page is already up to
date, because we already copied it over from the source machine before
we stopped the guest and restarted it running on the target machine
("precopy"). So, we want to maintain a dirty bitmap, which tells us
which pages are clean or not - when we get a UFFD event, we check the
bitmap, and only if the page is dirty do we actually go fetch it over
the network - otherwise we just UFFDIO_CONTINUE and we're done.

>
> For these reasons I think automatic collapsing is something that will
> eventually be implemented by us or someone else, and at that point
> MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> is adding a userspace API that will probably need to be maintained for
> perpetuity but actually is likely going to be going obsolete "soon".
> For this reason I had hoped that automatic collapsing would come with
> V1.
>
> I wonder if we can have a very simple first try at automatic
> collapsing for V1? I.e., can we support collapsing to the hstate size
> and only that? So 4K pages can only be either collapsed to 2MB or 1G
> on x86 depending on the hstate size. I think this may be not too
> difficult to implement: we can have a counter similar to mapcount that
> tracks how many of the subpages are mapped (subpage_mapcount). Once
> all the subpages are mapped (the counter reaches a certain value),
> trigger collapsing similar to hstate size MADV_COLLAPSE.

I'm not sure I agree this is likely.

Two problems:

One is, say you UFFDIO_CONTINUE a 4k PTE. If we wanted collapsing to
happen automatically, we'd need to answer the question: is this the
last 4k PTE in a 2M region, so now it can be collapsed? Today the only
way to know is to go check - walk the PTEs. This is expensive, and
it's something we'd have to do on each and every UFFDIO_CONTINUE
operation -- this sucks because we're incurring the cost on every
operation, even though most of them (only 1 / 512, say) the answer
will be "no it wasn't the last one, we can't collapse yet". For
on-demand paging, it's really critical installing the page is as fast
as possible -- in an ideal world it would be exactly as fast as a
"normal" minor fault and the guest would not even be able to tell at
all that it was in the process of being migrated.

Now, as you pointed out, we can just store a mapcount somewhere which
keeps track of how many PTEs in each 2M region are installed or not.
So, then we can more quickly check in UFFDIO_CONTINUE. But, we have
the memory overhead and CPU time overhead of maintaining this
metadata. And, it's not like having the kernel do this means userspace
doesn't have to - like I described above, I think userspace would
*also* need to keep track of this same thing anyway, so now we're
doing it 2x.

Another problem I see is, it seems like collapsing automatically would
involve letting UFFD know a bit too much for my liking about hugetlbfs
internals. It seems to me more ideal to have it know as little as
possible about how hugetlbfs works internally.



Also, there are some benefits to letting userspace decide when / if to
collapse.

For example, userspace might decide it prefers to MADV_COLLAPSE
immediately, in the demand paging thread. Or, it might decide it's
okay to let it be collapsed a bit later, and leave that up to some
other background thread. It might MADV_COLLAPSE as soon as it sees a
complete 2M region, or maybe it wants to batch things up and waits
until it has a full 1G region to collapse. It might also do different
things for different regions, e.g. depending on if they were hot or
cold (demand paged vs. background fetched). I don't see any single
"right way" to do things here, I just see tradeoffs, which userspace
is in a good position to decide on.

>
> I gather that no one else reviewing this has raised this issue thus
> far so it might not be a big deal and I will continue to review the
> RFC, but I had hoped for automatic collapsing myself for the reasons
> above.
>
> > >
> > > > ---- HugeTLB Changes ----
> > > >
> > > > - Mapcount
> > > > The way mapcount is handled is different from the way that it was handled
> > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > high granularity, their mapcounts will remain the same as what they would
> > > > have been pre-HGM.
> > > >
> > >
> > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > differently, but the same if the page is not mapped at high
> > > granularity. Can you elaborate on how the mapcount handling will be
> > > different when the page is mapped at high granularity?
> >
> > I guess I didn't phrase this very well. For the sake of simplicity,
> > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > Previously, there were two possibilities for how a hugepage was
> > mapped, either it was (1) completely mapped (PUD is present and a
> > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > case, where the PUD is not none but also not a leaf (this usually
> > means that the page is partially mapped). We handle this case as if
> > the whole page was mapped. That is, if we partially map a hugepage
> > that was previously unmapped (making the PUD point to PMDs), we
> > increment its mapcount, and if we completely unmap a partially mapped
> > hugepage (making the PUD none), we decrement its mapcount. If we
> > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> >
> > It is possible for a PUD to be present and not a leaf (mapcount has
> > been incremented) but for the page to still be unmapped: if the PMDs
> > (or PTEs) underneath are all none. This case is atypical, and as of
> > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > think it would be very difficult to get this to happen.
> >
>
> Thank you for the detailed explanation. Please add it to the cover letter.
>
> I wonder the case "PUD present but all the PMD are none": is that a
> bug? I don't understand the usefulness of that. Not a comment on this
> patch but rather a curiosity.
>
> > >
> > > > - Page table walking and manipulation
> > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > high-granularity mappings. Eventually, it's possible to merge
> > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > >
> > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > This is because we generally need to know the "size" of a PTE (previously
> > > > always just huge_page_size(hstate)).
> > > >
> > > > For every page table manipulation function that has a huge version (e.g.
> > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > > PTE really is "huge".
> > > >
> > > > - Synchronization
> > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > writing, and for doing high-granularity page table walks, we require it to
> > > > be held for reading.
> > > >
> > > > ---- Limitations & Future Changes ----
> > > >
> > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > > failure recovery for both shared and private mappings.
> > > >
> > > > The memory failure use case poses its own challenges that can be
> > > > addressed, but I will do so in a separate RFC.
> > > >
> > > > Performance has not been heavily scrutinized with this patch series. There
> > > > are places where lock contention can significantly reduce performance. This
> > > > will be addressed later.
> > > >
> > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > page struct optimization[3], as we do not need to modify data contained
> > > > in the subpage page structs.
> > > >
> > > > Other omissions:
> > > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > > >  - Support for mremap() (will be included in v1). This looks a lot like
> > > >    the support we have for fork().
> > > >  - Documentation changes (will be included in v1).
> > > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > > >    in v1).
> > > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > >    than arm64.
> > > >
> > > > ---- Patch Breakdown ----
> > > >
> > > > Patch 1     - Preliminary changes
> > > > Patch 2-10  - HugeTLB HGM core changes
> > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > Patch 24-26 - arm64 support and selftests
> > > >
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > >     to better names.
> > >
> > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > is more granular than what it normally is (2MB/1G, etc).
> >
> > Noted. :)

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-29  6:09     ` Muchun Song
@ 2022-06-29 21:03       ` Mike Kravetz
  2022-06-29 21:39         ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-06-29 21:03 UTC (permalink / raw)
  To: Muchun Song
  Cc: James Houghton, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/29/22 14:09, Muchun Song wrote:
> On Mon, Jun 27, 2022 at 01:51:53PM -0700, Mike Kravetz wrote:
> > On 06/24/22 17:36, James Houghton wrote:
> > > This is needed to handle PTL locking with high-granularity mapping. We
> > > won't always be using the PMD-level PTL even if we're using the 2M
> > > hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> > > case, we need to lock the PTL for the 4K PTE.
> > 
> > I'm not really sure why this would be required.
> > Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
> > with less contention than using the more coarse mm lock.  
> >
> 
> Your words make me thing of another question unrelated to this patch.
> We __know__ that arm64 supports continues PTE HugeTLB. huge_pte_lockptr()
> did not consider this case, in this case, those HugeTLB pages are contended
> with mm lock. Seems we should optimize this case. Something like:
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 0d790fa3f297..68a1e071bfc0 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -893,7 +893,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>                                            struct mm_struct *mm, pte_t *pte)
>  {
> -       if (huge_page_size(h) == PMD_SIZE)
> +       if (huge_page_size(h) <= PMD_SIZE)
>                 return pmd_lockptr(mm, (pmd_t *) pte);
>         VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
>         return &mm->page_table_lock;
> 
> I did not check if elsewhere needs to be changed as well. Just a primary
> thought.

That seems perfectly reasonable to me.

Also unrelated, but using the pmd lock is REQUIRED for pmd sharing.  The
mm lock is process specific and does not synchronize shared access.  I
found that out the hard way. :)

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-29  6:39       ` Muchun Song
@ 2022-06-29 21:06         ` Mike Kravetz
  2022-06-29 21:13           ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-06-29 21:06 UTC (permalink / raw)
  To: Muchun Song
  Cc: James Houghton, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/29/22 14:39, Muchun Song wrote:
> On Tue, Jun 28, 2022 at 08:40:27AM -0700, James Houghton wrote:
> > On Mon, Jun 27, 2022 at 11:42 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 06/24/22 17:36, James Houghton wrote:
> > > > When using HugeTLB high-granularity mapping, we need to go through the
> > > > supported hugepage sizes in decreasing order so that we pick the largest
> > > > size that works. Consider the case where we're faulting in a 1G hugepage
> > > > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > > > a PUD. By going through the sizes in decreasing order, we will find that
> > > > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> > > >
> > >
> > > This may/will cause problems for gigantic hugetlb pages allocated at boot
> > > time.  See alloc_bootmem_huge_page() where a pointer to the associated hstate
> > > is encoded within the allocated hugetlb page.  These pages are added to
> > > hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
> > > hstate to add prep the gigantic page and add to the correct pool.  Currently,
> > > gather_bootmem_prealloc is called after hugetlb_init_hstates.  So, changing
> > > hstate order will cause errors.
> > >
> > > I do not see any reason why we could not call gather_bootmem_prealloc before
> > > hugetlb_init_hstates to avoid this issue.
> > 
> > Thanks for catching this, Mike. Your suggestion certainly seems to
> > work, but it also seems kind of error prone. I'll have to look at the
> > code more closely, but maybe it would be better if I just maintained a
> > separate `struct hstate *sorted_hstate_ptrs[]`, where the original
> 
> I don't think this is a good idea.  If you really rely on the order of
> the initialization in this patch.  The easier solution is changing
> huge_bootmem_page->hstate to huge_bootmem_page->hugepagesz. Then we
> can use size_to_hstate(huge_bootmem_page->hugepagesz) in
> gather_bootmem_prealloc().
> 

That is a much better solution.  Thanks Muchun!

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates
  2022-06-29 21:06         ` Mike Kravetz
@ 2022-06-29 21:13           ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-06-29 21:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Wed, Jun 29, 2022 at 2:06 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 06/29/22 14:39, Muchun Song wrote:
> > On Tue, Jun 28, 2022 at 08:40:27AM -0700, James Houghton wrote:
> > > On Mon, Jun 27, 2022 at 11:42 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > >
> > > > On 06/24/22 17:36, James Houghton wrote:
> > > > > When using HugeTLB high-granularity mapping, we need to go through the
> > > > > supported hugepage sizes in decreasing order so that we pick the largest
> > > > > size that works. Consider the case where we're faulting in a 1G hugepage
> > > > > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > > > > a PUD. By going through the sizes in decreasing order, we will find that
> > > > > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> > > > >
> > > >
> > > > This may/will cause problems for gigantic hugetlb pages allocated at boot
> > > > time.  See alloc_bootmem_huge_page() where a pointer to the associated hstate
> > > > is encoded within the allocated hugetlb page.  These pages are added to
> > > > hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
> > > > hstate to add prep the gigantic page and add to the correct pool.  Currently,
> > > > gather_bootmem_prealloc is called after hugetlb_init_hstates.  So, changing
> > > > hstate order will cause errors.
> > > >
> > > > I do not see any reason why we could not call gather_bootmem_prealloc before
> > > > hugetlb_init_hstates to avoid this issue.
> > >
> > > Thanks for catching this, Mike. Your suggestion certainly seems to
> > > work, but it also seems kind of error prone. I'll have to look at the
> > > code more closely, but maybe it would be better if I just maintained a
> > > separate `struct hstate *sorted_hstate_ptrs[]`, where the original
> >
> > I don't think this is a good idea.  If you really rely on the order of
> > the initialization in this patch.  The easier solution is changing
> > huge_bootmem_page->hstate to huge_bootmem_page->hugepagesz. Then we
> > can use size_to_hstate(huge_bootmem_page->hugepagesz) in
> > gather_bootmem_prealloc().
> >
>
> That is a much better solution.  Thanks Muchun!

Indeed. Thank you, Muchun. :)

>
> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-29 21:03       ` Mike Kravetz
@ 2022-06-29 21:39         ` James Houghton
  2022-06-29 22:24           ` Mike Kravetz
  0 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-06-29 21:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Wed, Jun 29, 2022 at 2:04 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 06/29/22 14:09, Muchun Song wrote:
> > On Mon, Jun 27, 2022 at 01:51:53PM -0700, Mike Kravetz wrote:
> > > On 06/24/22 17:36, James Houghton wrote:
> > > > This is needed to handle PTL locking with high-granularity mapping. We
> > > > won't always be using the PMD-level PTL even if we're using the 2M
> > > > hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> > > > case, we need to lock the PTL for the 4K PTE.
> > >
> > > I'm not really sure why this would be required.
> > > Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
> > > with less contention than using the more coarse mm lock.
> > >
> >
> > Your words make me thing of another question unrelated to this patch.
> > We __know__ that arm64 supports continues PTE HugeTLB. huge_pte_lockptr()
> > did not consider this case, in this case, those HugeTLB pages are contended
> > with mm lock. Seems we should optimize this case. Something like:
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 0d790fa3f297..68a1e071bfc0 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -893,7 +893,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> >  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> >                                            struct mm_struct *mm, pte_t *pte)
> >  {
> > -       if (huge_page_size(h) == PMD_SIZE)
> > +       if (huge_page_size(h) <= PMD_SIZE)
> >                 return pmd_lockptr(mm, (pmd_t *) pte);
> >         VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> >         return &mm->page_table_lock;
> >
> > I did not check if elsewhere needs to be changed as well. Just a primary
> > thought.

I'm not sure if this works. If hugetlb_pte_size(hpte) is PAGE_SIZE,
then `hpte.ptep` will be a pte_t, not a pmd_t -- I assume that breaks
things. So I think, when doing a HugeTLB PT walk down to PAGE_SIZE, we
need to separately keep track of the location of the PMD so that we
can use it to get the PMD lock.

>
> That seems perfectly reasonable to me.
>
> Also unrelated, but using the pmd lock is REQUIRED for pmd sharing.  The
> mm lock is process specific and does not synchronize shared access.  I
> found that out the hard way. :)
>
> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-29 21:39         ` James Houghton
@ 2022-06-29 22:24           ` Mike Kravetz
  2022-06-30  9:35             ` Muchun Song
  0 siblings, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-06-29 22:24 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/29/22 14:39, James Houghton wrote:
> On Wed, Jun 29, 2022 at 2:04 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 06/29/22 14:09, Muchun Song wrote:
> > > On Mon, Jun 27, 2022 at 01:51:53PM -0700, Mike Kravetz wrote:
> > > > On 06/24/22 17:36, James Houghton wrote:
> > > > > This is needed to handle PTL locking with high-granularity mapping. We
> > > > > won't always be using the PMD-level PTL even if we're using the 2M
> > > > > hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> > > > > case, we need to lock the PTL for the 4K PTE.
> > > >
> > > > I'm not really sure why this would be required.
> > > > Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
> > > > with less contention than using the more coarse mm lock.
> > > >
> > >
> > > Your words make me thing of another question unrelated to this patch.
> > > We __know__ that arm64 supports continues PTE HugeTLB. huge_pte_lockptr()
> > > did not consider this case, in this case, those HugeTLB pages are contended
> > > with mm lock. Seems we should optimize this case. Something like:
> > >
> > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > index 0d790fa3f297..68a1e071bfc0 100644
> > > --- a/include/linux/hugetlb.h
> > > +++ b/include/linux/hugetlb.h
> > > @@ -893,7 +893,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> > >  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> > >                                            struct mm_struct *mm, pte_t *pte)
> > >  {
> > > -       if (huge_page_size(h) == PMD_SIZE)
> > > +       if (huge_page_size(h) <= PMD_SIZE)
> > >                 return pmd_lockptr(mm, (pmd_t *) pte);
> > >         VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> > >         return &mm->page_table_lock;
> > >
> > > I did not check if elsewhere needs to be changed as well. Just a primary
> > > thought.
> 
> I'm not sure if this works. If hugetlb_pte_size(hpte) is PAGE_SIZE,
> then `hpte.ptep` will be a pte_t, not a pmd_t -- I assume that breaks
> things. So I think, when doing a HugeTLB PT walk down to PAGE_SIZE, we
> need to separately keep track of the location of the PMD so that we
> can use it to get the PMD lock.

I assume Muchun was talking about changing this in current code (before
your changes) where huge_page_size(h) can not be PAGE_SIZE.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-29 22:24           ` Mike Kravetz
@ 2022-06-30  9:35             ` Muchun Song
  2022-06-30 16:23               ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: Muchun Song @ 2022-06-30  9:35 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: James Houghton, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Wed, Jun 29, 2022 at 03:24:45PM -0700, Mike Kravetz wrote:
> On 06/29/22 14:39, James Houghton wrote:
> > On Wed, Jun 29, 2022 at 2:04 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 06/29/22 14:09, Muchun Song wrote:
> > > > On Mon, Jun 27, 2022 at 01:51:53PM -0700, Mike Kravetz wrote:
> > > > > On 06/24/22 17:36, James Houghton wrote:
> > > > > > This is needed to handle PTL locking with high-granularity mapping. We
> > > > > > won't always be using the PMD-level PTL even if we're using the 2M
> > > > > > hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> > > > > > case, we need to lock the PTL for the 4K PTE.
> > > > >
> > > > > I'm not really sure why this would be required.
> > > > > Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
> > > > > with less contention than using the more coarse mm lock.
> > > > >
> > > >
> > > > Your words make me thing of another question unrelated to this patch.
> > > > We __know__ that arm64 supports continues PTE HugeTLB. huge_pte_lockptr()
> > > > did not consider this case, in this case, those HugeTLB pages are contended
> > > > with mm lock. Seems we should optimize this case. Something like:
> > > >
> > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > > index 0d790fa3f297..68a1e071bfc0 100644
> > > > --- a/include/linux/hugetlb.h
> > > > +++ b/include/linux/hugetlb.h
> > > > @@ -893,7 +893,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> > > >  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> > > >                                            struct mm_struct *mm, pte_t *pte)
> > > >  {
> > > > -       if (huge_page_size(h) == PMD_SIZE)
> > > > +       if (huge_page_size(h) <= PMD_SIZE)
> > > >                 return pmd_lockptr(mm, (pmd_t *) pte);
> > > >         VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> > > >         return &mm->page_table_lock;
> > > >
> > > > I did not check if elsewhere needs to be changed as well. Just a primary
> > > > thought.
> > 
> > I'm not sure if this works. If hugetlb_pte_size(hpte) is PAGE_SIZE,
> > then `hpte.ptep` will be a pte_t, not a pmd_t -- I assume that breaks
> > things. So I think, when doing a HugeTLB PT walk down to PAGE_SIZE, we
> > need to separately keep track of the location of the PMD so that we
> > can use it to get the PMD lock.
> 
> I assume Muchun was talking about changing this in current code (before
> your changes) where huge_page_size(h) can not be PAGE_SIZE.
>

Yes, that's what I meant.

Thanks. 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-28  8:20         ` Dr. David Alan Gilbert
@ 2022-06-30 16:09           ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2022-06-30 16:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: James Houghton, Matthew Wilcox, Mike Kravetz, Muchun Song,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Manish Mishra, linux-mm, linux-kernel, Nadav Amit

On Tue, Jun 28, 2022 at 09:20:41AM +0100, Dr. David Alan Gilbert wrote:
> One other thing I thought of; you provide the modified 'CONTINUE'
> behaviour, which works for postcopy as long as you use two mappings in
> userspace; one protected by userfault, and one which you do the writes
> to, and then issue the CONTINUE into the protected mapping; that's fine,
> but it's not currently how we have our postcopy code wired up in qemu,
> we have one mapping and use UFFDIO_COPY to place the page.
> Requiring the two mappings is fine, but it's probably worth pointing out
> the need for it somewhere.

It'll be about CONTINUE, maybe not directly related to sub-page mapping,
but indeed that's something we may need to do.  It's also in my poc [1]
previously (I never got time to get back to it yet though..).

It's just that two mappings are not required.  E.g., one could use a fd on
the file and lseek()/write() to the file to update content rather than
using another mapping.  It might be just slower.

Or, IMHO an app can legally just delay faulting of some mapping using minor
mode and maybe the app doesn't even need to modify the page content before
CONTINUE for some reason, then it's even not needed to have either the
other mapping or the fd.  Fundamentally, MINOR mode and CONTINUE provides
another way to trap page fault when page cache existed.  It doesn't really
define whether or how the data will be modified.

It's just that for QEMU unfortunately we may need to have that two mappings
just for this use case indeed..

[1] https://github.com/xzpeter/qemu/commit/41538a9a8ff5c981af879afe48e4ecca9a1aabc8

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-30  9:35             ` Muchun Song
@ 2022-06-30 16:23               ` James Houghton
  2022-06-30 17:40                 ` Mike Kravetz
  2022-07-01  3:32                 ` Muchun Song
  0 siblings, 2 replies; 123+ messages in thread
From: James Houghton @ 2022-06-30 16:23 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mike Kravetz, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Thu, Jun 30, 2022 at 2:35 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Wed, Jun 29, 2022 at 03:24:45PM -0700, Mike Kravetz wrote:
> > On 06/29/22 14:39, James Houghton wrote:
> > > On Wed, Jun 29, 2022 at 2:04 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > >
> > > > On 06/29/22 14:09, Muchun Song wrote:
> > > > > On Mon, Jun 27, 2022 at 01:51:53PM -0700, Mike Kravetz wrote:
> > > > > > On 06/24/22 17:36, James Houghton wrote:
> > > > > > > This is needed to handle PTL locking with high-granularity mapping. We
> > > > > > > won't always be using the PMD-level PTL even if we're using the 2M
> > > > > > > hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> > > > > > > case, we need to lock the PTL for the 4K PTE.
> > > > > >
> > > > > > I'm not really sure why this would be required.
> > > > > > Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
> > > > > > with less contention than using the more coarse mm lock.
> > > > > >
> > > > >
> > > > > Your words make me thing of another question unrelated to this patch.
> > > > > We __know__ that arm64 supports continues PTE HugeTLB. huge_pte_lockptr()
> > > > > did not consider this case, in this case, those HugeTLB pages are contended
> > > > > with mm lock. Seems we should optimize this case. Something like:
> > > > >
> > > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > > > index 0d790fa3f297..68a1e071bfc0 100644
> > > > > --- a/include/linux/hugetlb.h
> > > > > +++ b/include/linux/hugetlb.h
> > > > > @@ -893,7 +893,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> > > > >  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> > > > >                                            struct mm_struct *mm, pte_t *pte)
> > > > >  {
> > > > > -       if (huge_page_size(h) == PMD_SIZE)
> > > > > +       if (huge_page_size(h) <= PMD_SIZE)
> > > > >                 return pmd_lockptr(mm, (pmd_t *) pte);
> > > > >         VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> > > > >         return &mm->page_table_lock;
> > > > >
> > > > > I did not check if elsewhere needs to be changed as well. Just a primary
> > > > > thought.
> > >
> > > I'm not sure if this works. If hugetlb_pte_size(hpte) is PAGE_SIZE,
> > > then `hpte.ptep` will be a pte_t, not a pmd_t -- I assume that breaks
> > > things. So I think, when doing a HugeTLB PT walk down to PAGE_SIZE, we
> > > need to separately keep track of the location of the PMD so that we
> > > can use it to get the PMD lock.
> >
> > I assume Muchun was talking about changing this in current code (before
> > your changes) where huge_page_size(h) can not be PAGE_SIZE.
> >
>
> Yes, that's what I meant.

Right -- but I think my point still stands. If `huge_page_size(h)` is
CONT_PTE_SIZE, then the `pte_t *` passed to `huge_pte_lockptr` will
*actually* point to a `pte_t` and not a `pmd_t` (I'm pretty sure the
distinction is important). So it seems like we need to separately keep
track of the real pmd_t that is being used in the CONT_PTE_SIZE case
(and therefore, when considering HGM, the PAGE_SIZE case).

However, we *can* make this optimization for CONT_PMD_SIZE (maybe this
is what you originally meant, Muchun?), so instead of
`huge_page_size(h) == PMD_SIZE`, we could do `huge_page_size(h) >=
PMD_SIZE && huge_page_size(h) < PUD_SIZE`.

>
> Thanks.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-30 16:23               ` James Houghton
@ 2022-06-30 17:40                 ` Mike Kravetz
  2022-07-01  3:32                 ` Muchun Song
  1 sibling, 0 replies; 123+ messages in thread
From: Mike Kravetz @ 2022-06-30 17:40 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/30/22 09:23, James Houghton wrote:
> On Thu, Jun 30, 2022 at 2:35 AM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > On Wed, Jun 29, 2022 at 03:24:45PM -0700, Mike Kravetz wrote:
> > > On 06/29/22 14:39, James Houghton wrote:
> > > > On Wed, Jun 29, 2022 at 2:04 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > > >
> > > > > On 06/29/22 14:09, Muchun Song wrote:
> > > > > > On Mon, Jun 27, 2022 at 01:51:53PM -0700, Mike Kravetz wrote:
> > > > > > > On 06/24/22 17:36, James Houghton wrote:
> > > > > > > > This is needed to handle PTL locking with high-granularity mapping. We
> > > > > > > > won't always be using the PMD-level PTL even if we're using the 2M
> > > > > > > > hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> > > > > > > > case, we need to lock the PTL for the 4K PTE.
> > > > > > >
> > > > > > > I'm not really sure why this would be required.
> > > > > > > Why not use the PMD level lock for 4K PTEs?  Seems that would scale better
> > > > > > > with less contention than using the more coarse mm lock.
> > > > > > >
> > > > > >
> > > > > > Your words make me thing of another question unrelated to this patch.
> > > > > > We __know__ that arm64 supports continues PTE HugeTLB. huge_pte_lockptr()
> > > > > > did not consider this case, in this case, those HugeTLB pages are contended
> > > > > > with mm lock. Seems we should optimize this case. Something like:
> > > > > >
> > > > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > > > > index 0d790fa3f297..68a1e071bfc0 100644
> > > > > > --- a/include/linux/hugetlb.h
> > > > > > +++ b/include/linux/hugetlb.h
> > > > > > @@ -893,7 +893,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> > > > > >  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> > > > > >                                            struct mm_struct *mm, pte_t *pte)
> > > > > >  {
> > > > > > -       if (huge_page_size(h) == PMD_SIZE)
> > > > > > +       if (huge_page_size(h) <= PMD_SIZE)
> > > > > >                 return pmd_lockptr(mm, (pmd_t *) pte);
> > > > > >         VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> > > > > >         return &mm->page_table_lock;
> > > > > >
> > > > > > I did not check if elsewhere needs to be changed as well. Just a primary
> > > > > > thought.
> > > >
> > > > I'm not sure if this works. If hugetlb_pte_size(hpte) is PAGE_SIZE,
> > > > then `hpte.ptep` will be a pte_t, not a pmd_t -- I assume that breaks
> > > > things. So I think, when doing a HugeTLB PT walk down to PAGE_SIZE, we
> > > > need to separately keep track of the location of the PMD so that we
> > > > can use it to get the PMD lock.
> > >
> > > I assume Muchun was talking about changing this in current code (before
> > > your changes) where huge_page_size(h) can not be PAGE_SIZE.
> > >
> >
> > Yes, that's what I meant.
> 
> Right -- but I think my point still stands. If `huge_page_size(h)` is
> CONT_PTE_SIZE, then the `pte_t *` passed to `huge_pte_lockptr` will
> *actually* point to a `pte_t` and not a `pmd_t` (I'm pretty sure the
> distinction is important). So it seems like we need to separately keep
> track of the real pmd_t that is being used in the CONT_PTE_SIZE case
> (and therefore, when considering HGM, the PAGE_SIZE case).

Ah yes, that is correct.  We would be passing in a pte not pmd in this
case.

> 
> However, we *can* make this optimization for CONT_PMD_SIZE (maybe this
> is what you originally meant, Muchun?), so instead of
> `huge_page_size(h) == PMD_SIZE`, we could do `huge_page_size(h) >=
> PMD_SIZE && huge_page_size(h) < PUD_SIZE`.
> 

Another 'optimization' may exist in hugetlb address range scanning code.
We currently have something like:

for addr=start, addr< end, addr += huge_page_size
	pte = huge_pte_offset(addr)
	ptl = huge_pte_lock(pte)
	...
	...
	spin_unlock(ptl)

Seems like ptl will be the same for all entries on the same pmd page.
We 'may' be able to go from 512 lock/unlock cycles to 1.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-28  0:04         ` Nadav Amit
@ 2022-06-30 19:21           ` Peter Xu
  2022-07-01  5:54             ` Nadav Amit
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2022-06-30 19:21 UTC (permalink / raw)
  To: Nadav Amit
  Cc: James Houghton, Dr. David Alan Gilbert, Matthew Wilcox,
	Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra, linux-mm,
	linux-kernel

On Tue, Jun 28, 2022 at 12:04:28AM +0000, Nadav Amit wrote:
> > [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
> 
> Indeed this change of behavior (not aligning to huge-pages when flags is
> not set) was unintentional. If you want to fix it in a separate patch so
> it would be backported, that may be a good idea.

The fix seems to be straightforward, though.  Nadav, wanna post a patch
yourself?

That seems to be an accident and it's just that having sub-page mapping
rely on the accident is probably not desirable..  So irrelevant of the
separate patch I'd suggest we keep the requirement on enabling the exact
addr feature for sub-page mapping.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-06-30 16:23               ` James Houghton
  2022-06-30 17:40                 ` Mike Kravetz
@ 2022-07-01  3:32                 ` Muchun Song
  1 sibling, 0 replies; 123+ messages in thread
From: Muchun Song @ 2022-07-01  3:32 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel, zhengqi.arch



> On Jul 1, 2022, at 00:23, James Houghton <jthoughton@google.com> wrote:
> 
> On Thu, Jun 30, 2022 at 2:35 AM Muchun Song <songmuchun@bytedance.com> wrote:
>> 
>> On Wed, Jun 29, 2022 at 03:24:45PM -0700, Mike Kravetz wrote:
>>> On 06/29/22 14:39, James Houghton wrote:
>>>> On Wed, Jun 29, 2022 at 2:04 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>>> 
>>>>> On 06/29/22 14:09, Muchun Song wrote:
>>>>>> On Mon, Jun 27, 2022 at 01:51:53PM -0700, Mike Kravetz wrote:
>>>>>>> On 06/24/22 17:36, James Houghton wrote:
>>>>>>>> This is needed to handle PTL locking with high-granularity mapping. We
>>>>>>>> won't always be using the PMD-level PTL even if we're using the 2M
>>>>>>>> hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
>>>>>>>> case, we need to lock the PTL for the 4K PTE.
>>>>>>> 
>>>>>>> I'm not really sure why this would be required.
>>>>>>> Why not use the PMD level lock for 4K PTEs? Seems that would scale better
>>>>>>> with less contention than using the more coarse mm lock.
>>>>>>> 
>>>>>> 
>>>>>> Your words make me thing of another question unrelated to this patch.
>>>>>> We __know__ that arm64 supports continues PTE HugeTLB. huge_pte_lockptr()
>>>>>> did not consider this case, in this case, those HugeTLB pages are contended
>>>>>> with mm lock. Seems we should optimize this case. Something like:
>>>>>> 
>>>>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>>>>> index 0d790fa3f297..68a1e071bfc0 100644
>>>>>> --- a/include/linux/hugetlb.h
>>>>>> +++ b/include/linux/hugetlb.h
>>>>>> @@ -893,7 +893,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>>>>>> static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>>>>>> struct mm_struct *mm, pte_t *pte)
>>>>>> {
>>>>>> - if (huge_page_size(h) == PMD_SIZE)
>>>>>> + if (huge_page_size(h) <= PMD_SIZE)
>>>>>> return pmd_lockptr(mm, (pmd_t *) pte);
>>>>>> VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
>>>>>> return &mm->page_table_lock;
>>>>>> 
>>>>>> I did not check if elsewhere needs to be changed as well. Just a primary
>>>>>> thought.
>>>> 
>>>> I'm not sure if this works. If hugetlb_pte_size(hpte) is PAGE_SIZE,
>>>> then `hpte.ptep` will be a pte_t, not a pmd_t -- I assume that breaks
>>>> things. So I think, when doing a HugeTLB PT walk down to PAGE_SIZE, we
>>>> need to separately keep track of the location of the PMD so that we
>>>> can use it to get the PMD lock.
>>> 
>>> I assume Muchun was talking about changing this in current code (before
>>> your changes) where huge_page_size(h) can not be PAGE_SIZE.
>>> 
>> 
>> Yes, that's what I meant.
> 
> Right -- but I think my point still stands. If `huge_page_size(h)` is
> CONT_PTE_SIZE, then the `pte_t *` passed to `huge_pte_lockptr` will
> *actually* point to a `pte_t` and not a `pmd_t` (I'm pretty sure the

Right. It is a pte in this case.

> distinction is important). So it seems like we need to separately keep
> track of the real pmd_t that is being used in the CONT_PTE_SIZE case

If we want to find pmd_t from pte_t, I think we can introduce a new field
in struct page just like the thread [1] does.

[1] https://lore.kernel.org/lkml/20211110105428.32458-7-zhengqi.arch@bytedance.com/

> (and therefore, when considering HGM, the PAGE_SIZE case).
> 
> However, we *can* make this optimization for CONT_PMD_SIZE (maybe this
> is what you originally meant, Muchun?), so instead of
> `huge_page_size(h) == PMD_SIZE`, we could do `huge_page_size(h) >=
> PMD_SIZE && huge_page_size(h) < PUD_SIZE`.

Right. It is a good start to optimize CONT_PMD_SIZE case.

Thanks.

> 
>> 
>> Thanks.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
  2022-06-30 19:21           ` Peter Xu
@ 2022-07-01  5:54             ` Nadav Amit
  0 siblings, 0 replies; 123+ messages in thread
From: Nadav Amit @ 2022-07-01  5:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, Dr. David Alan Gilbert, Matthew Wilcox,
	Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra, linux-mm,
	linux-kernel

On Jun 30, 2022, at 12:21 PM, Peter Xu <peterx@redhat.com> wrote:

> ⚠ External Email
> 
> On Tue, Jun 28, 2022 at 12:04:28AM +0000, Nadav Amit wrote:
>>> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
>> 
>> Indeed this change of behavior (not aligning to huge-pages when flags is
>> not set) was unintentional. If you want to fix it in a separate patch so
>> it would be backported, that may be a good idea.
> 
> The fix seems to be straightforward, though.  Nadav, wanna post a patch
> yourself?

Yes, even I can do it :)

Just busy right now, so I’ll try to do it over the weekend.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift
  2022-06-28 21:58   ` Mina Almasry
@ 2022-07-07 21:39     ` Mike Kravetz
  2022-07-08 15:52     ` James Houghton
  1 sibling, 0 replies; 123+ messages in thread
From: Mike Kravetz @ 2022-07-07 21:39 UTC (permalink / raw)
  To: Mina Almasry
  Cc: James Houghton, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/28/22 14:58, Mina Almasry wrote:
> On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> >
> > This is a helper macro to loop through all the usable page sizes for a
> > high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> > loop, in descending order, through the page sizes that HugeTLB supports
> > for this architecture; it always includes PAGE_SIZE.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  mm/hugetlb.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8b10b941458d..557b0afdb503 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> >         /* All shared VMAs have HGM enabled. */
> >         return vma->vm_flags & VM_SHARED;
> >  }
> > +static unsigned int __shift_for_hstate(struct hstate *h)
> > +{
> > +       if (h >= &hstates[hugetlb_max_hstate])
> > +               return PAGE_SHIFT;
> 
> h > &hstates[hugetlb_max_hstate] means that h is out of bounds, no? am
> I missing something here?
> 
> So is this intending to do:
> 
> if (h == hstates[hugetlb_max_hstate]
>     return PAGE_SHIFT;
> 
> ? If so, could we write it as so?
> 
> I'm also wondering why __shift_for_hstate(hstate[hugetlb_max_hstate])
> == PAGE_SHIFT? Isn't the last hstate the smallest hstate which should
> be 2MB on x86? Shouldn't this return PMD_SHIFT in that case?
> 

I too am missing how this is working for similar reasons.
-- 
Mike Kravetz

> > +       return huge_page_shift(h);
> > +}
> > +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> > +       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> > +                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> > +                              (tmp_h)++)
> >  #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> >  /*
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks
  2022-06-27 13:07   ` manish.mishra
@ 2022-07-07 23:03     ` Mike Kravetz
  0 siblings, 0 replies; 123+ messages in thread
From: Mike Kravetz @ 2022-07-07 23:03 UTC (permalink / raw)
  To: manish.mishra
  Cc: James Houghton, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/27/22 18:37, manish.mishra wrote:
> 
> On 24/06/22 11:06 pm, James Houghton wrote:
> > This adds it for architectures that use GENERAL_HUGETLB, including x86.

I expect this will be used in arch independent code and there will need to
be at least a stub for all architectures?

> > 
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   include/linux/hugetlb.h |  2 ++
> >   mm/hugetlb.c            | 45 +++++++++++++++++++++++++++++++++++++++++
> >   2 files changed, 47 insertions(+)
> > 
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index e7a6b944d0cc..605aa19d8572 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -258,6 +258,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> >   			unsigned long addr, unsigned long sz);
> >   pte_t *huge_pte_offset(struct mm_struct *mm,
> >   		       unsigned long addr, unsigned long sz);
> > +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +		    unsigned long addr, unsigned long sz, bool stop_at_none);
> >   int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> >   				unsigned long *addr, pte_t *ptep);
> >   void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 557b0afdb503..3ec2a921ee6f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6981,6 +6981,51 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
> >   	return (pte_t *)pmd;
> >   }
> 
> 
> not strong feeling but this name looks confusing to me as it does
> 
> not only walk over page-tables but can also alloc.
> 

Somewhat agree.  With this we have:
- huge_pte_offset to walk/lookup a pte
- huge_pte_alloc to allocate ptes
- hugetlb_walk_to which does some/all of both

Do not see anything obviously wrong with the routine, but future
direction would be to combine/clean up these routines with similar
purpose.
-- 
Mike Kravetz

> > +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +		    unsigned long addr, unsigned long sz, bool stop_at_none)
> > +{
> > +	pte_t *ptep;
> > +
> > +	if (!hpte->ptep) {
> > +		pgd_t *pgd = pgd_offset(mm, addr);
> > +
> > +		if (!pgd)
> > +			return -ENOMEM;
> > +		ptep = (pte_t *)p4d_alloc(mm, pgd, addr);
> > +		if (!ptep)
> > +			return -ENOMEM;
> > +		hugetlb_pte_populate(hpte, ptep, P4D_SHIFT);
> > +	}
> > +
> > +	while (hugetlb_pte_size(hpte) > sz &&
> > +			!hugetlb_pte_present_leaf(hpte) &&
> > +			!(stop_at_none && hugetlb_pte_none(hpte))) {
> 
> Should this ordering of if-else condition be in reverse, i mean it will look
> 
> more natural and possibly less condition checks as we go from top to bottom.
> 
> > +		if (hpte->shift == PMD_SHIFT) {
> > +			ptep = pte_alloc_map(mm, (pmd_t *)hpte->ptep, addr);
> > +			if (!ptep)
> > +				return -ENOMEM;
> > +			hpte->shift = PAGE_SHIFT;
> > +			hpte->ptep = ptep;
> > +		} else if (hpte->shift == PUD_SHIFT) {
> > +			ptep = (pte_t *)pmd_alloc(mm, (pud_t *)hpte->ptep,
> > +						  addr);
> > +			if (!ptep)
> > +				return -ENOMEM;
> > +			hpte->shift = PMD_SHIFT;
> > +			hpte->ptep = ptep;
> > +		} else if (hpte->shift == P4D_SHIFT) {
> > +			ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep,
> > +						  addr);
> > +			if (!ptep)
> > +				return -ENOMEM;
> > +			hpte->shift = PUD_SHIFT;
> > +			hpte->ptep = ptep;
> > +		} else
> > +			BUG();
> > +	}
> > +	return 0;
> > +}
> > +
> >   #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> >   #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift
  2022-06-28 21:58   ` Mina Almasry
  2022-07-07 21:39     ` Mike Kravetz
@ 2022-07-08 15:52     ` James Houghton
  2022-07-09 21:55       ` Mina Almasry
  1 sibling, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-07-08 15:52 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Tue, Jun 28, 2022 at 2:58 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> >
> > This is a helper macro to loop through all the usable page sizes for a
> > high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> > loop, in descending order, through the page sizes that HugeTLB supports
> > for this architecture; it always includes PAGE_SIZE.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  mm/hugetlb.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8b10b941458d..557b0afdb503 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> >         /* All shared VMAs have HGM enabled. */
> >         return vma->vm_flags & VM_SHARED;
> >  }
> > +static unsigned int __shift_for_hstate(struct hstate *h)
> > +{
> > +       if (h >= &hstates[hugetlb_max_hstate])
> > +               return PAGE_SHIFT;
>
> h > &hstates[hugetlb_max_hstate] means that h is out of bounds, no? am
> I missing something here?

Yeah, it goes out of bounds intentionally. Maybe I should have called
this out. We need for_each_hgm_shift to include PAGE_SHIFT, and there
is no hstate for it. So to handle it, we iterate past the end of the
hstate array, and when we are past the end, we return PAGE_SHIFT and
stop iterating further. This is admittedly kind of gross; if you have
other suggestions for a way to get a clean `for_each_hgm_shift` macro
like this, I'm all ears. :)

>
> So is this intending to do:
>
> if (h == hstates[hugetlb_max_hstate]
>     return PAGE_SHIFT;
>
> ? If so, could we write it as so?

Yeah, this works. I'll write it this way instead. If that condition is
true, `h` is out of bounds (`hugetlb_max_hstate` is past the end, not
the index for the final element). I guess `hugetlb_max_hstate` is a
bit of a misnomer.

>
> I'm also wondering why __shift_for_hstate(hstate[hugetlb_max_hstate])
> == PAGE_SHIFT? Isn't the last hstate the smallest hstate which should
> be 2MB on x86? Shouldn't this return PMD_SHIFT in that case?

`huge_page_shift(hstate[hugetlb_max_hstate-1])` is PMD_SHIFT on x86.
Actually reading `hstate[hugetlb_max_hstate]` would be bad, which is
why `__shift_for_hstate` exists: to return PAGE_SIZE when we would
otherwise attempt to compute
`huge_page_shift(hstate[hugetlb_max_hstate])`.

>
> > +       return huge_page_shift(h);
> > +}
> > +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> > +       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> > +                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \

Note the <= here. If we wanted to always remain inbounds here, we'd
want < instead. But we don't have an hstate for PAGE_SIZE.

> > +                              (tmp_h)++)
> >  #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> >  /*
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift
  2022-07-08 15:52     ` James Houghton
@ 2022-07-09 21:55       ` Mina Almasry
  0 siblings, 0 replies; 123+ messages in thread
From: Mina Almasry @ 2022-07-09 21:55 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jul 8, 2022 at 8:52 AM James Houghton <jthoughton@google.com> wrote:
>
> On Tue, Jun 28, 2022 at 2:58 PM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > This is a helper macro to loop through all the usable page sizes for a
> > > high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> > > loop, in descending order, through the page sizes that HugeTLB supports
> > > for this architecture; it always includes PAGE_SIZE.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  mm/hugetlb.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 8b10b941458d..557b0afdb503 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > >         /* All shared VMAs have HGM enabled. */
> > >         return vma->vm_flags & VM_SHARED;
> > >  }
> > > +static unsigned int __shift_for_hstate(struct hstate *h)
> > > +{
> > > +       if (h >= &hstates[hugetlb_max_hstate])
> > > +               return PAGE_SHIFT;
> >
> > h > &hstates[hugetlb_max_hstate] means that h is out of bounds, no? am
> > I missing something here?
>
> Yeah, it goes out of bounds intentionally. Maybe I should have called
> this out. We need for_each_hgm_shift to include PAGE_SHIFT, and there
> is no hstate for it. So to handle it, we iterate past the end of the
> hstate array, and when we are past the end, we return PAGE_SHIFT and
> stop iterating further. This is admittedly kind of gross; if you have
> other suggestions for a way to get a clean `for_each_hgm_shift` macro
> like this, I'm all ears. :)
>
> >
> > So is this intending to do:
> >
> > if (h == hstates[hugetlb_max_hstate]
> >     return PAGE_SHIFT;
> >
> > ? If so, could we write it as so?
>
> Yeah, this works. I'll write it this way instead. If that condition is
> true, `h` is out of bounds (`hugetlb_max_hstate` is past the end, not
> the index for the final element). I guess `hugetlb_max_hstate` is a
> bit of a misnomer.
>
> >
> > I'm also wondering why __shift_for_hstate(hstate[hugetlb_max_hstate])
> > == PAGE_SHIFT? Isn't the last hstate the smallest hstate which should
> > be 2MB on x86? Shouldn't this return PMD_SHIFT in that case?
>
> `huge_page_shift(hstate[hugetlb_max_hstate-1])` is PMD_SHIFT on x86.
> Actually reading `hstate[hugetlb_max_hstate]` would be bad, which is
> why `__shift_for_hstate` exists: to return PAGE_SIZE when we would
> otherwise attempt to compute
> `huge_page_shift(hstate[hugetlb_max_hstate])`.
>
> >
> > > +       return huge_page_shift(h);
> > > +}
> > > +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> > > +       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> > > +                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
>
> Note the <= here. If we wanted to always remain inbounds here, we'd
> want < instead. But we don't have an hstate for PAGE_SIZE.
>

I see, thanks for the explanation. I can see 2 options here to make
the code more understandable:

option (a), don't go past the array. I.e. for_each_hgm_shift() will
loop over all the hugetlb-supported shifts on this arch, and the
calling code falls back to PAGE_SHIFT if the hugetlb page shifts don't
work for it. I admit that could lead to code dup in the calling code,
but I have not gotten to the patch that calls this yet.

option (b), simply add a comment and/or make it more obvious that
you're intentionally going out of bounds, and you want to loop over
PAGE_SHIFT at the end. Something like:

+ /* Returns huge_page_shift(h) if h is a pointer to an hstate in
hstates[] array, PAGE_SIZE otherwise. */
+static unsigned int __shift_for_hstate(struct hstate *h)
+{
+       if (h < &hstates[0] || h > &hstates[hugetlb_max_hstate - 1])
+               return PAGE_SHIFT;
+       return huge_page_shift(h);
+}
+
+ /* Loops over all the HGM shifts supported on this arch, from the
largest shift possible down to PAGE_SHIFT inclusive. */
+#define for_each_hgm_shift(hstate, tmp_h, shift) \
+       for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
+                              (tmp_h) <= &hstates[hugetlb_max_hstate]; \
+                              (tmp_h)++)
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

> > > +                              (tmp_h)++)
> > >  #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> > >
> > >  /*
> > > --
> > > 2.37.0.rc0.161.g10f37bed90-goog
> > >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
                     ` (2 preceding siblings ...)
  2022-06-28 20:44   ` Mike Kravetz
@ 2022-07-11 23:32   ` Mike Kravetz
  2022-07-12  9:42     ` Dr. David Alan Gilbert
  2022-09-08 17:38   ` Peter Xu
  4 siblings, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-07-11 23:32 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/24/22 17:36, James Houghton wrote:
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.

This has been rolling around in my head.

Will this first use case (live migration) actually make use of this
'mixed mapping' model where hugetlb pages could be mapped at the PUD,
PMD and PTE level all within the same vma?  I only understand the use
case from a high level.  But, it seems that we would want to only want
to migrate PTE (or PMD) sized pages and not necessarily a mix.

The only reason I ask is because the code might be much simpler if all
mappings within a vma were of the same size.  Of course, the
performance/latency of converting a large mapping may be prohibitively
expensive.

Looking to the future when supporting memory error handling/page poisoning
it seems like we would certainly want multiple size mappings.

Just a thought.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range
  2022-06-24 17:36 ` [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
@ 2022-07-11 23:41   ` Mike Kravetz
  2022-07-12 17:19     ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-07-11 23:41 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On 06/24/22 17:36, James Houghton wrote:
> This allows fork() to work with high-granularity mappings. The page
> table structure is copied such that partially mapped regions will remain
> partially mapped in the same way for the new process.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 59 insertions(+), 15 deletions(-)

FYI -
With https://lore.kernel.org/linux-mm/20220621235620.291305-5-mike.kravetz@oracle.com/
copy_hugetlb_page_range() should never be called for shared mappings.
Since HGM only works on shared mappings, code in this patch will never
be executed.

I have a TODO to remove shared mapping support from copy_hugetlb_page_range.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-07-11 23:32   ` Mike Kravetz
@ 2022-07-12  9:42     ` Dr. David Alan Gilbert
  2022-07-12 17:51       ` Mike Kravetz
  2022-07-15 16:35       ` Peter Xu
  0 siblings, 2 replies; 123+ messages in thread
From: Dr. David Alan Gilbert @ 2022-07-12  9:42 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: James Houghton, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Manish Mishra, linux-mm, linux-kernel

* Mike Kravetz (mike.kravetz@oracle.com) wrote:
> On 06/24/22 17:36, James Houghton wrote:
> > After high-granularity mapping, page table entries for HugeTLB pages can
> > be of any size/type. (For example, we can have a 1G page mapped with a
> > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > PTE after we have done a page table walk.
> 
> This has been rolling around in my head.
> 
> Will this first use case (live migration) actually make use of this
> 'mixed mapping' model where hugetlb pages could be mapped at the PUD,
> PMD and PTE level all within the same vma?  I only understand the use
> case from a high level.  But, it seems that we would want to only want
> to migrate PTE (or PMD) sized pages and not necessarily a mix.

I suspect we would pick one size and use that size for all transfers
when in postcopy; not sure if there are any side cases though.

> The only reason I ask is because the code might be much simpler if all
> mappings within a vma were of the same size.  Of course, the
> performance/latency of converting a large mapping may be prohibitively
> expensive.

Imagine we're migrating a few TB VM, backed by 1GB hugepages, I'm guessing it
would be nice to clean up the PTE/PMDs for split 1GB pages as they're
completed rather than having thousands of them for the whole VM.
(I'm not sure if that is already doable)

Dave

> Looking to the future when supporting memory error handling/page poisoning
> it seems like we would certainly want multiple size mappings.
> 
> Just a thought.
> -- 
> Mike Kravetz
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range
  2022-07-11 23:41   ` Mike Kravetz
@ 2022-07-12 17:19     ` James Houghton
  2022-07-12 18:06       ` Mike Kravetz
  0 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-07-12 17:19 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Mon, Jul 11, 2022 at 4:41 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 06/24/22 17:36, James Houghton wrote:
> > This allows fork() to work with high-granularity mappings. The page
> > table structure is copied such that partially mapped regions will remain
> > partially mapped in the same way for the new process.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 59 insertions(+), 15 deletions(-)
>
> FYI -
> With https://lore.kernel.org/linux-mm/20220621235620.291305-5-mike.kravetz@oracle.com/
> copy_hugetlb_page_range() should never be called for shared mappings.
> Since HGM only works on shared mappings, code in this patch will never
> be executed.
>
> I have a TODO to remove shared mapping support from copy_hugetlb_page_range.

Thanks Mike. If I understand things correctly, it seems like I don't
have to do anything to support fork() then; we just don't copy the
page table structure from the old VMA to the new one. That is, as
opposed to having the same bits of the old VMA being mapped in the new
one, the new VMA will have an empty page table. This would slightly
change how userfaultfd's behavior on the new VMA, but that seems fine
to me.

- James

> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-07-12  9:42     ` Dr. David Alan Gilbert
@ 2022-07-12 17:51       ` Mike Kravetz
  2022-07-15 16:35       ` Peter Xu
  1 sibling, 0 replies; 123+ messages in thread
From: Mike Kravetz @ 2022-07-12 17:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: James Houghton, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Manish Mishra, linux-mm, linux-kernel

On 07/12/22 10:42, Dr. David Alan Gilbert wrote:
> * Mike Kravetz (mike.kravetz@oracle.com) wrote:
> > On 06/24/22 17:36, James Houghton wrote:
> > > After high-granularity mapping, page table entries for HugeTLB pages can
> > > be of any size/type. (For example, we can have a 1G page mapped with a
> > > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > > PTE after we have done a page table walk.
> > 
> > This has been rolling around in my head.
> > 
> > Will this first use case (live migration) actually make use of this
> > 'mixed mapping' model where hugetlb pages could be mapped at the PUD,
> > PMD and PTE level all within the same vma?  I only understand the use
> > case from a high level.  But, it seems that we would want to only want
> > to migrate PTE (or PMD) sized pages and not necessarily a mix.
> 
> I suspect we would pick one size and use that size for all transfers
> when in postcopy; not sure if there are any side cases though.
> 
> > The only reason I ask is because the code might be much simpler if all
> > mappings within a vma were of the same size.  Of course, the
> > performance/latency of converting a large mapping may be prohibitively
> > expensive.
> 
> Imagine we're migrating a few TB VM, backed by 1GB hugepages, I'm guessing it
> would be nice to clean up the PTE/PMDs for split 1GB pages as they're
> completed rather than having thousands of them for the whole VM.
> (I'm not sure if that is already doable)

Seems that would be doable by calling MADV_COLLAPSE for 1GB pages as
they are completed.

Thanks for information on post copy.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range
  2022-07-12 17:19     ` James Houghton
@ 2022-07-12 18:06       ` Mike Kravetz
  2022-07-15 21:39         ` Axel Rasmussen
  0 siblings, 1 reply; 123+ messages in thread
From: Mike Kravetz @ 2022-07-12 18:06 UTC (permalink / raw)
  To: James Houghton, Axel Rasmussen
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Mina Almasry, Jue Wang, Manish Mishra, Dr . David Alan Gilbert,
	linux-mm, linux-kernel

On 07/12/22 10:19, James Houghton wrote:
> On Mon, Jul 11, 2022 at 4:41 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 06/24/22 17:36, James Houghton wrote:
> > > This allows fork() to work with high-granularity mappings. The page
> > > table structure is copied such that partially mapped regions will remain
> > > partially mapped in the same way for the new process.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
> > >  1 file changed, 59 insertions(+), 15 deletions(-)
> >
> > FYI -
> > With https://lore.kernel.org/linux-mm/20220621235620.291305-5-mike.kravetz@oracle.com/
> > copy_hugetlb_page_range() should never be called for shared mappings.
> > Since HGM only works on shared mappings, code in this patch will never
> > be executed.
> >
> > I have a TODO to remove shared mapping support from copy_hugetlb_page_range.
> 
> Thanks Mike. If I understand things correctly, it seems like I don't
> have to do anything to support fork() then; we just don't copy the
> page table structure from the old VMA to the new one.

Yes, for now.  We will not copy the page tables for shared mappings.
When adding support for private mapping, we will need to handle the
HGM case.

>                                                       That is, as
> opposed to having the same bits of the old VMA being mapped in the new
> one, the new VMA will have an empty page table. This would slightly
> change how userfaultfd's behavior on the new VMA, but that seems fine
> to me.

Right.  Since the 'mapping size information' is essentially carried in
the page tables, it will be lost if page tables are not copied.

Not sure if anyone would depend on that behavior.

Axel, this may also impact minor fault processing.  Any concerns?
Patch is sitting in Andrew's tree for next merge window.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 06/26] mm: make free_p?d_range functions public
  2022-06-28 20:35   ` Mike Kravetz
@ 2022-07-12 20:52     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-07-12 20:52 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Tue, Jun 28, 2022 at 1:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 06/24/22 17:36, James Houghton wrote:
> > This makes them usable for HugeTLB page table freeing operations.
> > After HugeTLB high-granularity mapping, the page table for a HugeTLB VMA
> > can get more complex, and these functions handle freeing page tables
> > generally.
> >
>
> Hmmmm?
>
> free_pgd_range is not generally called directly for hugetlb mappings.
> There is a wrapper hugetlb_free_pgd_range which can have architecture
> specific implementations.  It makes me wonder if these lower level
> routines can be directly used on hugetlb mappings.  My 'guess' is that any
> such details will be hidden in the callers.  Suspect this will become clear
> in later patches.

Thanks for pointing out hugetlb_free_pgd_range. I think I'll need to
change how freeing HugeTLB HGM PTEs is written, because as written, we
don't do any architecture-specific things. I think I have a good idea
for what I need to do, probably something like this: make
`hugetlb_free_range` overridable, and then provide an implementation
for it for all the architectures that
HAVE_ARCH_HUGETLB_FREE_PGD_RANGE. Making the regular `free_p?d_range`
functions public *does* help with implementing the regular/general
`hugetlb_free_range` function though, so I think this commit is still
useful.

- James

> --
> Mike Kravetz
>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/mm.h | 7 +++++++
> >  mm/memory.c        | 8 ++++----
> >  2 files changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index bc8f326be0ce..07f5da512147 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1847,6 +1847,13 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> >
> >  struct mmu_notifier_range;
> >
> > +void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr);
> > +void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, unsigned long addr,
> > +             unsigned long end, unsigned long floor, unsigned long ceiling);
> > +void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, unsigned long addr,
> > +             unsigned long end, unsigned long floor, unsigned long ceiling);
> > +void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long addr,
> > +             unsigned long end, unsigned long floor, unsigned long ceiling);
> >  void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
> >               unsigned long end, unsigned long floor, unsigned long ceiling);
> >  int
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 7a089145cad4..bb3b9b5b94fb 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -227,7 +227,7 @@ static void check_sync_rss_stat(struct task_struct *task)
> >   * Note: this doesn't free the actual pages themselves. That
> >   * has been handled earlier when unmapping all the memory regions.
> >   */
> > -static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
> > +void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
> >                          unsigned long addr)
> >  {
> >       pgtable_t token = pmd_pgtable(*pmd);
> > @@ -236,7 +236,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
> >       mm_dec_nr_ptes(tlb->mm);
> >  }
> >
> > -static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> > +inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> >                               unsigned long addr, unsigned long end,
> >                               unsigned long floor, unsigned long ceiling)
> >  {
> > @@ -270,7 +270,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> >       mm_dec_nr_pmds(tlb->mm);
> >  }
> >
> > -static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> > +inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> >                               unsigned long addr, unsigned long end,
> >                               unsigned long floor, unsigned long ceiling)
> >  {
> > @@ -304,7 +304,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> >       mm_dec_nr_puds(tlb->mm);
> >  }
> >
> > -static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
> > +inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
> >                               unsigned long addr, unsigned long end,
> >                               unsigned long floor, unsigned long ceiling)
> >  {
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
  2022-06-24 17:36 ` [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE James Houghton
@ 2022-07-15 16:21   ` Peter Xu
  2022-07-15 16:58     ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2022-07-15 16:21 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> The changes here are very similar to the changes made to
> hugetlb_no_page, where we do a high-granularity page table walk and
> do accounting slightly differently because we are mapping only a piece
> of a page.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  fs/userfaultfd.c        |  3 +++
>  include/linux/hugetlb.h |  6 +++--
>  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
>  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
>  4 files changed, 82 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index e943370107d0..77c1b8a7d0b9 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
>  	if (!ptep)
>  		goto out;
>  
> +	if (hugetlb_hgm_enabled(vma))
> +		goto out;
> +

This is weird.  It means we'll never wait for sub-page mapping enabled
vmas.  Why?

Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
means we'll stop waiting for all shared hugetlbfs uffd page faults..

I'd expect in the in-house postcopy tests you should see vcpu threads
spinning on the page faults until it's serviced.

IMO we still need to properly wait when the pgtable doesn't have the
faulted address covered.  For sub-page mapping it'll probably need to walk
into sub-page levels.

>  	ret = false;
>  	pte = huge_ptep_get(ptep);
>  
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ac4ac8fbd901..c207b1ac6195 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -221,13 +221,15 @@ unsigned long hugetlb_total_pages(void);
>  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags);
>  #ifdef CONFIG_USERFAULTFD
> -int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
> +int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> +				struct hugetlb_pte *dst_hpte,
>  				struct vm_area_struct *dst_vma,
>  				unsigned long dst_addr,
>  				unsigned long src_addr,
>  				enum mcopy_atomic_mode mode,
>  				struct page **pagep,
> -				bool wp_copy);
> +				bool wp_copy,
> +				bool new_mapping);
>  #endif /* CONFIG_USERFAULTFD */
>  bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
>  						struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0ec2f231524e..09fa57599233 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5808,6 +5808,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  		vma_end_reservation(h, vma, haddr);
>  	}
>  
> +	/* This lock will get pretty expensive at 4K. */
>  	ptl = hugetlb_pte_lock(mm, hpte);
>  	ret = 0;
>  	/* If pte changed from under us, retry */
> @@ -6098,24 +6099,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   * modifications for huge pages.
>   */
>  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> -			    pte_t *dst_pte,
> +			    struct hugetlb_pte *dst_hpte,
>  			    struct vm_area_struct *dst_vma,
>  			    unsigned long dst_addr,
>  			    unsigned long src_addr,
>  			    enum mcopy_atomic_mode mode,
>  			    struct page **pagep,
> -			    bool wp_copy)
> +			    bool wp_copy,
> +			    bool new_mapping)
>  {
>  	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
>  	struct hstate *h = hstate_vma(dst_vma);
>  	struct address_space *mapping = dst_vma->vm_file->f_mapping;
> +	unsigned long haddr = dst_addr & huge_page_mask(h);
>  	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
>  	unsigned long size;
>  	int vm_shared = dst_vma->vm_flags & VM_SHARED;
>  	pte_t _dst_pte;
>  	spinlock_t *ptl;
>  	int ret = -ENOMEM;
> -	struct page *page;
> +	struct page *page, *subpage;
>  	int writable;
>  	bool page_in_pagecache = false;
>  
> @@ -6130,12 +6133,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		 * a non-missing case. Return -EEXIST.
>  		 */
>  		if (vm_shared &&
> -		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> +		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
>  			ret = -EEXIST;
>  			goto out;
>  		}
>  
> -		page = alloc_huge_page(dst_vma, dst_addr, 0);
> +		page = alloc_huge_page(dst_vma, haddr, 0);
>  		if (IS_ERR(page)) {
>  			ret = -ENOMEM;
>  			goto out;
> @@ -6151,13 +6154,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			/* Free the allocated page which may have
>  			 * consumed a reservation.
>  			 */
> -			restore_reserve_on_error(h, dst_vma, dst_addr, page);
> +			restore_reserve_on_error(h, dst_vma, haddr, page);
>  			put_page(page);
>  
>  			/* Allocate a temporary page to hold the copied
>  			 * contents.
>  			 */
> -			page = alloc_huge_page_vma(h, dst_vma, dst_addr);
> +			page = alloc_huge_page_vma(h, dst_vma, haddr);
>  			if (!page) {
>  				ret = -ENOMEM;
>  				goto out;
> @@ -6171,14 +6174,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		}
>  	} else {
>  		if (vm_shared &&
> -		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> +		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
>  			put_page(*pagep);
>  			ret = -EEXIST;
>  			*pagep = NULL;
>  			goto out;
>  		}
>  
> -		page = alloc_huge_page(dst_vma, dst_addr, 0);
> +		page = alloc_huge_page(dst_vma, haddr, 0);
>  		if (IS_ERR(page)) {
>  			ret = -ENOMEM;
>  			*pagep = NULL;
> @@ -6216,8 +6219,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		page_in_pagecache = true;
>  	}
>  
> -	ptl = huge_pte_lockptr(huge_page_shift(h), dst_mm, dst_pte);
> -	spin_lock(ptl);
> +	ptl = hugetlb_pte_lock(dst_mm, dst_hpte);
>  
>  	/*
>  	 * Recheck the i_size after holding PT lock to make sure not
> @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	 * registered, we firstly wr-protect a none pte which has no page cache
>  	 * page backing it, then access the page.
>  	 */
> -	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> +	if (!hugetlb_pte_none_mostly(dst_hpte))
>  		goto out_release_unlock;
>  
> -	if (vm_shared) {
> -		page_dup_file_rmap(page, true);
> -	} else {
> -		ClearHPageRestoreReserve(page);
> -		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> +	if (new_mapping) {

IIUC you wanted to avoid the mapcount accountings when it's the sub-page
that was going to be mapped.

Is it a must we get this only from the caller?  Can we know we're doing
sub-page mapping already here and make a decision with e.g. dst_hpte?

It looks weird to me to pass this explicitly from the caller, especially
that's when we don't really have the pgtable lock so I'm wondering about
possible race conditions too on having stale new_mapping values.

> +		if (vm_shared) {
> +			page_dup_file_rmap(page, true);
> +		} else {
> +			ClearHPageRestoreReserve(page);
> +			hugepage_add_new_anon_rmap(page, dst_vma, haddr);
> +		}
>  	}
>  
>  	/*
> @@ -6258,7 +6262,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	else
>  		writable = dst_vma->vm_flags & VM_WRITE;
>  
> -	_dst_pte = make_huge_pte(dst_vma, page, writable);
> +	subpage = hugetlb_find_subpage(h, page, dst_addr);
> +	if (subpage != page)
> +		BUG_ON(!hugetlb_hgm_enabled(dst_vma));
> +
> +	_dst_pte = make_huge_pte(dst_vma, subpage, writable);
>  	/*
>  	 * Always mark UFFDIO_COPY page dirty; note that this may not be
>  	 * extremely important for hugetlbfs for now since swapping is not
> @@ -6271,14 +6279,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	if (wp_copy)
>  		_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
>  
> -	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> +	set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);
>  
> -	(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
> -					dst_vma->vm_flags & VM_WRITE);
> -	hugetlb_count_add(pages_per_huge_page(h), dst_mm);
> +	(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_hpte->ptep,
> +			_dst_pte, dst_vma->vm_flags & VM_WRITE);
> +	hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);
>  
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache(dst_vma, dst_addr, dst_pte);
> +	update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);
>  
>  	spin_unlock(ptl);
>  	if (!is_continue)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 4f4892a5f767..ee40d98068bf 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -310,14 +310,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  {
>  	int vm_shared = dst_vma->vm_flags & VM_SHARED;
>  	ssize_t err;
> -	pte_t *dst_pte;
>  	unsigned long src_addr, dst_addr;
>  	long copied;
>  	struct page *page;
> -	unsigned long vma_hpagesize;
> +	unsigned long vma_hpagesize, vma_altpagesize;
>  	pgoff_t idx;
>  	u32 hash;
>  	struct address_space *mapping;
> +	bool use_hgm = hugetlb_hgm_enabled(dst_vma) &&
> +		mode == MCOPY_ATOMIC_CONTINUE;
> +	struct hstate *h = hstate_vma(dst_vma);
>  
>  	/*
>  	 * There is no default zero huge page for all huge page sizes as
> @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	copied = 0;
>  	page = NULL;
>  	vma_hpagesize = vma_kernel_pagesize(dst_vma);
> +	if (use_hgm)
> +		vma_altpagesize = PAGE_SIZE;

Do we need to check the "len" to know whether we should use sub-page
mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
still want the old behavior I think.

> +	else
> +		vma_altpagesize = vma_hpagesize;
>  
>  	/*
>  	 * Validate alignment based on huge page size
>  	 */
>  	err = -EINVAL;
> -	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> +	if (dst_start & (vma_altpagesize - 1) || len & (vma_altpagesize - 1))
>  		goto out_unlock;
>  
>  retry:
> @@ -361,6 +367,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		vm_shared = dst_vma->vm_flags & VM_SHARED;
>  	}
>  
> +	BUG_ON(!vm_shared && use_hgm);
> +
>  	/*
>  	 * If not shared, ensure the dst_vma has a anon_vma.
>  	 */
> @@ -371,11 +379,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	}
>  
>  	while (src_addr < src_start + len) {
> +		struct hugetlb_pte hpte;
> +		bool new_mapping;
>  		BUG_ON(dst_addr >= dst_start + len);
>  
>  		/*
>  		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
> -		 * i_mmap_rwsem ensures the dst_pte remains valid even
> +		 * i_mmap_rwsem ensures the hpte.ptep remains valid even
>  		 * in the case of shared pmds.  fault mutex prevents
>  		 * races with other faulting threads.
>  		 */
> @@ -383,27 +393,47 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		i_mmap_lock_read(mapping);
>  		idx = linear_page_index(dst_vma, dst_addr);
>  		hash = hugetlb_fault_mutex_hash(mapping, idx);
> +		/* This lock will get expensive at 4K. */
>  		mutex_lock(&hugetlb_fault_mutex_table[hash]);
>  
> -		err = -ENOMEM;
> -		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
> -		if (!dst_pte) {
> +		err = 0;
> +
> +		pte_t *ptep = huge_pte_alloc(dst_mm, dst_vma, dst_addr,
> +					     vma_hpagesize);
> +		if (!ptep)
> +			err = -ENOMEM;
> +		else {
> +			hugetlb_pte_populate(&hpte, ptep,
> +					huge_page_shift(h));
> +			/*
> +			 * If the hstate-level PTE is not none, then a mapping
> +			 * was previously established.
> +			 * The per-hpage mutex prevents double-counting.
> +			 */
> +			new_mapping = hugetlb_pte_none(&hpte);
> +			if (use_hgm)
> +				err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
> +								dst_addr,
> +								dst_start + len);
> +		}
> +
> +		if (err) {
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			i_mmap_unlock_read(mapping);
>  			goto out_unlock;
>  		}
>  
>  		if (mode != MCOPY_ATOMIC_CONTINUE &&
> -		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
> +		    !hugetlb_pte_none_mostly(&hpte)) {
>  			err = -EEXIST;
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			i_mmap_unlock_read(mapping);
>  			goto out_unlock;
>  		}
>  
> -		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
> +		err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
>  					       dst_addr, src_addr, mode, &page,
> -					       wp_copy);
> +					       wp_copy, new_mapping);
>  
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  		i_mmap_unlock_read(mapping);
> @@ -413,6 +443,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		if (unlikely(err == -ENOENT)) {
>  			mmap_read_unlock(dst_mm);
>  			BUG_ON(!page);
> +			BUG_ON(hpte.shift != huge_page_shift(h));
>  
>  			err = copy_huge_page_from_user(page,
>  						(const void __user *)src_addr,
> @@ -430,9 +461,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  			BUG_ON(page);
>  
>  		if (!err) {
> -			dst_addr += vma_hpagesize;
> -			src_addr += vma_hpagesize;
> -			copied += vma_hpagesize;
> +			dst_addr += hugetlb_pte_size(&hpte);
> +			src_addr += hugetlb_pte_size(&hpte);
> +			copied += hugetlb_pte_size(&hpte);
>  
>  			if (fatal_signal_pending(current))
>  				err = -EINTR;
> -- 
> 2.37.0.rc0.161.g10f37bed90-goog
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-07-12  9:42     ` Dr. David Alan Gilbert
  2022-07-12 17:51       ` Mike Kravetz
@ 2022-07-15 16:35       ` Peter Xu
  2022-07-15 21:52         ` Axel Rasmussen
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2022-07-15 16:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Mike Kravetz, James Houghton, Muchun Song, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Manish Mishra, linux-mm, linux-kernel

On Tue, Jul 12, 2022 at 10:42:17AM +0100, Dr. David Alan Gilbert wrote:
> * Mike Kravetz (mike.kravetz@oracle.com) wrote:
> > On 06/24/22 17:36, James Houghton wrote:
> > > After high-granularity mapping, page table entries for HugeTLB pages can
> > > be of any size/type. (For example, we can have a 1G page mapped with a
> > > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > > PTE after we have done a page table walk.
> > 
> > This has been rolling around in my head.
> > 
> > Will this first use case (live migration) actually make use of this
> > 'mixed mapping' model where hugetlb pages could be mapped at the PUD,
> > PMD and PTE level all within the same vma?  I only understand the use
> > case from a high level.  But, it seems that we would want to only want
> > to migrate PTE (or PMD) sized pages and not necessarily a mix.
> 
> I suspect we would pick one size and use that size for all transfers
> when in postcopy; not sure if there are any side cases though.

Yes, I'm also curious whether the series can be much simplified if we have
a static way to do sub-page mappings, e.g., when sub-page mapping enabled
we always map to PAGE_SIZE only; if not we keep the old hpage size mappings
only.

> > Looking to the future when supporting memory error handling/page poisoning
> > it seems like we would certainly want multiple size mappings.

If we treat page poisoning as very rare events anyway, IMHO it'll even be
acceptable if we always split 1G pages into 4K ones but only rule out the
real poisoned 4K phys page.  After all IIUC the major goal is for reducing
poisoned memory footprint.

It'll be definitely nicer if we can keep 511 2M pages and 511 4K pages in
that case so the 511 2M pages performs slightly better, but it'll be
something extra to me.  It can always be something worked upon a simpler
version of sub-page mapping which is only PAGE_SIZE based.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
  2022-07-15 16:21   ` Peter Xu
@ 2022-07-15 16:58     ` James Houghton
  2022-07-15 17:20       ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-07-15 16:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > The changes here are very similar to the changes made to
> > hugetlb_no_page, where we do a high-granularity page table walk and
> > do accounting slightly differently because we are mapping only a piece
> > of a page.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  fs/userfaultfd.c        |  3 +++
> >  include/linux/hugetlb.h |  6 +++--
> >  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
> >  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
> >  4 files changed, 82 insertions(+), 38 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index e943370107d0..77c1b8a7d0b9 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> >       if (!ptep)
> >               goto out;
> >
> > +     if (hugetlb_hgm_enabled(vma))
> > +             goto out;
> > +
>
> This is weird.  It means we'll never wait for sub-page mapping enabled
> vmas.  Why?
>

`ret` is true in this case, so we're actually *always* waiting.

> Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> means we'll stop waiting for all shared hugetlbfs uffd page faults..
>
> I'd expect in the in-house postcopy tests you should see vcpu threads
> spinning on the page faults until it's serviced.
>
> IMO we still need to properly wait when the pgtable doesn't have the
> faulted address covered.  For sub-page mapping it'll probably need to walk
> into sub-page levels.

Ok, SGTM. I'll do that for the next version. I'm not sure of the
consequences of returning `true` here when we should be returning
`false`.

>
> >       ret = false;
> >       pte = huge_ptep_get(ptep);
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index ac4ac8fbd901..c207b1ac6195 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -221,13 +221,15 @@ unsigned long hugetlb_total_pages(void);
> >  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >                       unsigned long address, unsigned int flags);
> >  #ifdef CONFIG_USERFAULTFD
> > -int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
> > +int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > +                             struct hugetlb_pte *dst_hpte,
> >                               struct vm_area_struct *dst_vma,
> >                               unsigned long dst_addr,
> >                               unsigned long src_addr,
> >                               enum mcopy_atomic_mode mode,
> >                               struct page **pagep,
> > -                             bool wp_copy);
> > +                             bool wp_copy,
> > +                             bool new_mapping);
> >  #endif /* CONFIG_USERFAULTFD */
> >  bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
> >                                               struct vm_area_struct *vma,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 0ec2f231524e..09fa57599233 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5808,6 +5808,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >               vma_end_reservation(h, vma, haddr);
> >       }
> >
> > +     /* This lock will get pretty expensive at 4K. */
> >       ptl = hugetlb_pte_lock(mm, hpte);
> >       ret = 0;
> >       /* If pte changed from under us, retry */
> > @@ -6098,24 +6099,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >   * modifications for huge pages.
> >   */
> >  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > -                         pte_t *dst_pte,
> > +                         struct hugetlb_pte *dst_hpte,
> >                           struct vm_area_struct *dst_vma,
> >                           unsigned long dst_addr,
> >                           unsigned long src_addr,
> >                           enum mcopy_atomic_mode mode,
> >                           struct page **pagep,
> > -                         bool wp_copy)
> > +                         bool wp_copy,
> > +                         bool new_mapping)
> >  {
> >       bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> >       struct hstate *h = hstate_vma(dst_vma);
> >       struct address_space *mapping = dst_vma->vm_file->f_mapping;
> > +     unsigned long haddr = dst_addr & huge_page_mask(h);
> >       pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> >       unsigned long size;
> >       int vm_shared = dst_vma->vm_flags & VM_SHARED;
> >       pte_t _dst_pte;
> >       spinlock_t *ptl;
> >       int ret = -ENOMEM;
> > -     struct page *page;
> > +     struct page *page, *subpage;
> >       int writable;
> >       bool page_in_pagecache = false;
> >
> > @@ -6130,12 +6133,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >                * a non-missing case. Return -EEXIST.
> >                */
> >               if (vm_shared &&
> > -                 hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> > +                 hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
> >                       ret = -EEXIST;
> >                       goto out;
> >               }
> >
> > -             page = alloc_huge_page(dst_vma, dst_addr, 0);
> > +             page = alloc_huge_page(dst_vma, haddr, 0);
> >               if (IS_ERR(page)) {
> >                       ret = -ENOMEM;
> >                       goto out;
> > @@ -6151,13 +6154,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >                       /* Free the allocated page which may have
> >                        * consumed a reservation.
> >                        */
> > -                     restore_reserve_on_error(h, dst_vma, dst_addr, page);
> > +                     restore_reserve_on_error(h, dst_vma, haddr, page);
> >                       put_page(page);
> >
> >                       /* Allocate a temporary page to hold the copied
> >                        * contents.
> >                        */
> > -                     page = alloc_huge_page_vma(h, dst_vma, dst_addr);
> > +                     page = alloc_huge_page_vma(h, dst_vma, haddr);
> >                       if (!page) {
> >                               ret = -ENOMEM;
> >                               goto out;
> > @@ -6171,14 +6174,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >               }
> >       } else {
> >               if (vm_shared &&
> > -                 hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> > +                 hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
> >                       put_page(*pagep);
> >                       ret = -EEXIST;
> >                       *pagep = NULL;
> >                       goto out;
> >               }
> >
> > -             page = alloc_huge_page(dst_vma, dst_addr, 0);
> > +             page = alloc_huge_page(dst_vma, haddr, 0);
> >               if (IS_ERR(page)) {
> >                       ret = -ENOMEM;
> >                       *pagep = NULL;
> > @@ -6216,8 +6219,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >               page_in_pagecache = true;
> >       }
> >
> > -     ptl = huge_pte_lockptr(huge_page_shift(h), dst_mm, dst_pte);
> > -     spin_lock(ptl);
> > +     ptl = hugetlb_pte_lock(dst_mm, dst_hpte);
> >
> >       /*
> >        * Recheck the i_size after holding PT lock to make sure not
> > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >        * registered, we firstly wr-protect a none pte which has no page cache
> >        * page backing it, then access the page.
> >        */
> > -     if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > +     if (!hugetlb_pte_none_mostly(dst_hpte))
> >               goto out_release_unlock;
> >
> > -     if (vm_shared) {
> > -             page_dup_file_rmap(page, true);
> > -     } else {
> > -             ClearHPageRestoreReserve(page);
> > -             hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > +     if (new_mapping) {
>
> IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> that was going to be mapped.
>
> Is it a must we get this only from the caller?  Can we know we're doing
> sub-page mapping already here and make a decision with e.g. dst_hpte?
>
> It looks weird to me to pass this explicitly from the caller, especially
> that's when we don't really have the pgtable lock so I'm wondering about
> possible race conditions too on having stale new_mapping values.

The only way to know what the correct value for `new_mapping` should
be is to know if we had to change the hstate-level P*D to non-none to
service this UFFDIO_CONTINUE request. I'll see if there is a nice way
to do that check in `hugetlb_mcopy_atomic_pte`. Right now there is no
race, because we synchronize on the per-hpage mutex.

>
> > +             if (vm_shared) {
> > +                     page_dup_file_rmap(page, true);
> > +             } else {
> > +                     ClearHPageRestoreReserve(page);
> > +                     hugepage_add_new_anon_rmap(page, dst_vma, haddr);
> > +             }
> >       }
> >
> >       /*
> > @@ -6258,7 +6262,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >       else
> >               writable = dst_vma->vm_flags & VM_WRITE;
> >
> > -     _dst_pte = make_huge_pte(dst_vma, page, writable);
> > +     subpage = hugetlb_find_subpage(h, page, dst_addr);
> > +     if (subpage != page)
> > +             BUG_ON(!hugetlb_hgm_enabled(dst_vma));
> > +
> > +     _dst_pte = make_huge_pte(dst_vma, subpage, writable);
> >       /*
> >        * Always mark UFFDIO_COPY page dirty; note that this may not be
> >        * extremely important for hugetlbfs for now since swapping is not
> > @@ -6271,14 +6279,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >       if (wp_copy)
> >               _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
> >
> > -     set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> > +     set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);
> >
> > -     (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
> > -                                     dst_vma->vm_flags & VM_WRITE);
> > -     hugetlb_count_add(pages_per_huge_page(h), dst_mm);
> > +     (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_hpte->ptep,
> > +                     _dst_pte, dst_vma->vm_flags & VM_WRITE);
> > +     hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);
> >
> >       /* No need to invalidate - it was non-present before */
> > -     update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > +     update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);
> >
> >       spin_unlock(ptl);
> >       if (!is_continue)
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 4f4892a5f767..ee40d98068bf 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -310,14 +310,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> >  {
> >       int vm_shared = dst_vma->vm_flags & VM_SHARED;
> >       ssize_t err;
> > -     pte_t *dst_pte;
> >       unsigned long src_addr, dst_addr;
> >       long copied;
> >       struct page *page;
> > -     unsigned long vma_hpagesize;
> > +     unsigned long vma_hpagesize, vma_altpagesize;
> >       pgoff_t idx;
> >       u32 hash;
> >       struct address_space *mapping;
> > +     bool use_hgm = hugetlb_hgm_enabled(dst_vma) &&
> > +             mode == MCOPY_ATOMIC_CONTINUE;
> > +     struct hstate *h = hstate_vma(dst_vma);
> >
> >       /*
> >        * There is no default zero huge page for all huge page sizes as
> > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> >       copied = 0;
> >       page = NULL;
> >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > +     if (use_hgm)
> > +             vma_altpagesize = PAGE_SIZE;
>
> Do we need to check the "len" to know whether we should use sub-page
> mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> still want the old behavior I think.

I think that's a fair point; however, if we enable HGM and the address
and len happen to be hstate-aligned, we basically do the same thing as
if HGM wasn't enabled. It could be a minor performance optimization to
do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
the page tables are set up, the end result would be the same.

>
> > +     else
> > +             vma_altpagesize = vma_hpagesize;
> >
> >       /*
> >        * Validate alignment based on huge page size
> >        */
> >       err = -EINVAL;
> > -     if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> > +     if (dst_start & (vma_altpagesize - 1) || len & (vma_altpagesize - 1))
> >               goto out_unlock;
> >
> >  retry:
> > @@ -361,6 +367,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> >               vm_shared = dst_vma->vm_flags & VM_SHARED;
> >       }
> >
> > +     BUG_ON(!vm_shared && use_hgm);
> > +
> >       /*
> >        * If not shared, ensure the dst_vma has a anon_vma.
> >        */
> > @@ -371,11 +379,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> >       }
> >
> >       while (src_addr < src_start + len) {
> > +             struct hugetlb_pte hpte;
> > +             bool new_mapping;
> >               BUG_ON(dst_addr >= dst_start + len);
> >
> >               /*
> >                * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
> > -              * i_mmap_rwsem ensures the dst_pte remains valid even
> > +              * i_mmap_rwsem ensures the hpte.ptep remains valid even
> >                * in the case of shared pmds.  fault mutex prevents
> >                * races with other faulting threads.
> >                */
> > @@ -383,27 +393,47 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> >               i_mmap_lock_read(mapping);
> >               idx = linear_page_index(dst_vma, dst_addr);
> >               hash = hugetlb_fault_mutex_hash(mapping, idx);
> > +             /* This lock will get expensive at 4K. */
> >               mutex_lock(&hugetlb_fault_mutex_table[hash]);
> >
> > -             err = -ENOMEM;
> > -             dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
> > -             if (!dst_pte) {
> > +             err = 0;
> > +
> > +             pte_t *ptep = huge_pte_alloc(dst_mm, dst_vma, dst_addr,
> > +                                          vma_hpagesize);
> > +             if (!ptep)
> > +                     err = -ENOMEM;
> > +             else {
> > +                     hugetlb_pte_populate(&hpte, ptep,
> > +                                     huge_page_shift(h));
> > +                     /*
> > +                      * If the hstate-level PTE is not none, then a mapping
> > +                      * was previously established.
> > +                      * The per-hpage mutex prevents double-counting.
> > +                      */
> > +                     new_mapping = hugetlb_pte_none(&hpte);
> > +                     if (use_hgm)
> > +                             err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
> > +                                                             dst_addr,
> > +                                                             dst_start + len);
> > +             }
> > +
> > +             if (err) {
> >                       mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> >                       i_mmap_unlock_read(mapping);
> >                       goto out_unlock;
> >               }
> >
> >               if (mode != MCOPY_ATOMIC_CONTINUE &&
> > -                 !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
> > +                 !hugetlb_pte_none_mostly(&hpte)) {
> >                       err = -EEXIST;
> >                       mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> >                       i_mmap_unlock_read(mapping);
> >                       goto out_unlock;
> >               }
> >
> > -             err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
> > +             err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
> >                                              dst_addr, src_addr, mode, &page,
> > -                                            wp_copy);
> > +                                            wp_copy, new_mapping);
> >
> >               mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> >               i_mmap_unlock_read(mapping);
> > @@ -413,6 +443,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> >               if (unlikely(err == -ENOENT)) {
> >                       mmap_read_unlock(dst_mm);
> >                       BUG_ON(!page);
> > +                     BUG_ON(hpte.shift != huge_page_shift(h));
> >
> >                       err = copy_huge_page_from_user(page,
> >                                               (const void __user *)src_addr,
> > @@ -430,9 +461,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> >                       BUG_ON(page);
> >
> >               if (!err) {
> > -                     dst_addr += vma_hpagesize;
> > -                     src_addr += vma_hpagesize;
> > -                     copied += vma_hpagesize;
> > +                     dst_addr += hugetlb_pte_size(&hpte);
> > +                     src_addr += hugetlb_pte_size(&hpte);
> > +                     copied += hugetlb_pte_size(&hpte);
> >
> >                       if (fatal_signal_pending(current))
> >                               err = -EINTR;
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >
>
> --
> Peter Xu
>

Thanks, Peter! :)

- James

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
  2022-07-15 16:58     ` James Houghton
@ 2022-07-15 17:20       ` Peter Xu
  2022-07-20 20:58         ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2022-07-15 17:20 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote:
> On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > The changes here are very similar to the changes made to
> > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > do accounting slightly differently because we are mapping only a piece
> > > of a page.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  fs/userfaultfd.c        |  3 +++
> > >  include/linux/hugetlb.h |  6 +++--
> > >  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
> > >  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
> > >  4 files changed, 82 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index e943370107d0..77c1b8a7d0b9 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > >       if (!ptep)
> > >               goto out;
> > >
> > > +     if (hugetlb_hgm_enabled(vma))
> > > +             goto out;
> > > +
> >
> > This is weird.  It means we'll never wait for sub-page mapping enabled
> > vmas.  Why?
> >
> 
> `ret` is true in this case, so we're actually *always* waiting.

Aha!  Then I think that's another problem, sorry. :) See Below.

> 
> > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> >
> > I'd expect in the in-house postcopy tests you should see vcpu threads
> > spinning on the page faults until it's serviced.
> >
> > IMO we still need to properly wait when the pgtable doesn't have the
> > faulted address covered.  For sub-page mapping it'll probably need to walk
> > into sub-page levels.
> 
> Ok, SGTM. I'll do that for the next version. I'm not sure of the
> consequences of returning `true` here when we should be returning
> `false`.

We've put ourselves onto the wait queue, if another concurrent
UFFDIO_CONTINUE happened and pte is already installed, I think this thread
could be waiting forever on the next schedule().

The solution should be the same - walking the sub-page pgtable would work,
afaict.

[...]

> > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >        * registered, we firstly wr-protect a none pte which has no page cache
> > >        * page backing it, then access the page.
> > >        */
> > > -     if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > +     if (!hugetlb_pte_none_mostly(dst_hpte))
> > >               goto out_release_unlock;
> > >
> > > -     if (vm_shared) {
> > > -             page_dup_file_rmap(page, true);
> > > -     } else {
> > > -             ClearHPageRestoreReserve(page);
> > > -             hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > +     if (new_mapping) {
> >
> > IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> > that was going to be mapped.
> >
> > Is it a must we get this only from the caller?  Can we know we're doing
> > sub-page mapping already here and make a decision with e.g. dst_hpte?
> >
> > It looks weird to me to pass this explicitly from the caller, especially
> > that's when we don't really have the pgtable lock so I'm wondering about
> > possible race conditions too on having stale new_mapping values.
> 
> The only way to know what the correct value for `new_mapping` should
> be is to know if we had to change the hstate-level P*D to non-none to
> service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> to do that check in `hugetlb_mcopy_atomic_pte`.
> Right now there is no

Would "new_mapping = dest_hpte->shift != huge_page_shift(hstate)" work (or
something alike)?

> race, because we synchronize on the per-hpage mutex.

Yeah not familiar with that mutex enough to tell, as long as that mutex
guarantees no pgtable update (hmm, then why we need the pgtable lock
here???) then it looks fine.

[...]

> > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > >       copied = 0;
> > >       page = NULL;
> > >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > +     if (use_hgm)
> > > +             vma_altpagesize = PAGE_SIZE;
> >
> > Do we need to check the "len" to know whether we should use sub-page
> > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > still want the old behavior I think.
> 
> I think that's a fair point; however, if we enable HGM and the address
> and len happen to be hstate-aligned

The address can, but len (note! not "end" here) cannot?

> , we basically do the same thing as
> if HGM wasn't enabled. It could be a minor performance optimization to
> do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
> the page tables are set up, the end result would be the same.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range
  2022-07-12 18:06       ` Mike Kravetz
@ 2022-07-15 21:39         ` Axel Rasmussen
  0 siblings, 0 replies; 123+ messages in thread
From: Axel Rasmussen @ 2022-07-15 21:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: James Houghton, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, Linux MM, LKML

On Tue, Jul 12, 2022 at 11:07 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 07/12/22 10:19, James Houghton wrote:
> > On Mon, Jul 11, 2022 at 4:41 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 06/24/22 17:36, James Houghton wrote:
> > > > This allows fork() to work with high-granularity mappings. The page
> > > > table structure is copied such that partially mapped regions will remain
> > > > partially mapped in the same way for the new process.
> > > >
> > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > > ---
> > > >  mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
> > > >  1 file changed, 59 insertions(+), 15 deletions(-)
> > >
> > > FYI -
> > > With https://lore.kernel.org/linux-mm/20220621235620.291305-5-mike.kravetz@oracle.com/
> > > copy_hugetlb_page_range() should never be called for shared mappings.
> > > Since HGM only works on shared mappings, code in this patch will never
> > > be executed.
> > >
> > > I have a TODO to remove shared mapping support from copy_hugetlb_page_range.
> >
> > Thanks Mike. If I understand things correctly, it seems like I don't
> > have to do anything to support fork() then; we just don't copy the
> > page table structure from the old VMA to the new one.
>
> Yes, for now.  We will not copy the page tables for shared mappings.
> When adding support for private mapping, we will need to handle the
> HGM case.
>
> >                                                       That is, as
> > opposed to having the same bits of the old VMA being mapped in the new
> > one, the new VMA will have an empty page table. This would slightly
> > change how userfaultfd's behavior on the new VMA, but that seems fine
> > to me.
>
> Right.  Since the 'mapping size information' is essentially carried in
> the page tables, it will be lost if page tables are not copied.
>
> Not sure if anyone would depend on that behavior.
>
> Axel, this may also impact minor fault processing.  Any concerns?
> Patch is sitting in Andrew's tree for next merge window.

Sorry for the slow response, just catching up a bit here. :)

If I understand correctly, let's say we have a process where some
hugetlb pages are fully mapped (pages are in page cache, page table
entries exist). Once we fork(), we in the future won't copy the page
table entries, but I assume we do setup the underlying pages for CoW
still. So I guess this means in the old process no fault would happen
if the memory was touched, but in the forked process it would generate
a minor fault?

To me that seems fine. When userspace gets a minor fault it's always
fine for it to just say "don't care, just UFFDIO_CONTINUE, no work
needed". For VM migration I don't think it's unreasonable to expect
userspace to remember whether or not the page is clean (it already
does this anyway) and whether or not a fork (without exec) had
happened. It seems to me it should work fine.

> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-07-15 16:35       ` Peter Xu
@ 2022-07-15 21:52         ` Axel Rasmussen
  2022-07-15 23:03           ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Axel Rasmussen @ 2022-07-15 21:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, Mike Kravetz, James Houghton,
	Muchun Song, David Hildenbrand, David Rientjes, Mina Almasry,
	Jue Wang, Manish Mishra, Linux MM, LKML

On Fri, Jul 15, 2022 at 9:35 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jul 12, 2022 at 10:42:17AM +0100, Dr. David Alan Gilbert wrote:
> > * Mike Kravetz (mike.kravetz@oracle.com) wrote:
> > > On 06/24/22 17:36, James Houghton wrote:
> > > > After high-granularity mapping, page table entries for HugeTLB pages can
> > > > be of any size/type. (For example, we can have a 1G page mapped with a
> > > > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > > > PTE after we have done a page table walk.
> > >
> > > This has been rolling around in my head.
> > >
> > > Will this first use case (live migration) actually make use of this
> > > 'mixed mapping' model where hugetlb pages could be mapped at the PUD,
> > > PMD and PTE level all within the same vma?  I only understand the use
> > > case from a high level.  But, it seems that we would want to only want
> > > to migrate PTE (or PMD) sized pages and not necessarily a mix.
> >
> > I suspect we would pick one size and use that size for all transfers
> > when in postcopy; not sure if there are any side cases though.

Sorry for chiming in late. At least from my perspective being able to
do multiple sizes is a nice to have optmization.

As talked about above, imagine a guest VM backed by 1G hugetlb pages.
We're going along doing demand paging at 4K; because we want each
request to complete as quickly as possible, we want very small
granularity.

Guest access in terms of "physical" memory address is basically
random. So, actually filling in all 262k 4K PTEs making up a
contiguous 1G region might take quite some time. Once we've completed
any of the various 2M contiguous regions, it would be nice to go ahead
and collapse those right away. The benefit is, the guest will see some
performance benefit from the 2G page already, without having to wait
for the full 1G page to complete. Once we do complete a 1G page, it
would be nice to collapse that one level further. If we do this, the
whole guest memory will be a mix of 1G, 2M, and 4K.

>
> Yes, I'm also curious whether the series can be much simplified if we have
> a static way to do sub-page mappings, e.g., when sub-page mapping enabled
> we always map to PAGE_SIZE only; if not we keep the old hpage size mappings
> only.
>
> > > Looking to the future when supporting memory error handling/page poisoning
> > > it seems like we would certainly want multiple size mappings.
>
> If we treat page poisoning as very rare events anyway, IMHO it'll even be
> acceptable if we always split 1G pages into 4K ones but only rule out the
> real poisoned 4K phys page.  After all IIUC the major goal is for reducing
> poisoned memory footprint.
>
> It'll be definitely nicer if we can keep 511 2M pages and 511 4K pages in
> that case so the 511 2M pages performs slightly better, but it'll be
> something extra to me.  It can always be something worked upon a simpler
> version of sub-page mapping which is only PAGE_SIZE based.
>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-07-15 21:52         ` Axel Rasmussen
@ 2022-07-15 23:03           ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2022-07-15 23:03 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Dr. David Alan Gilbert, Mike Kravetz, James Houghton,
	Muchun Song, David Hildenbrand, David Rientjes, Mina Almasry,
	Jue Wang, Manish Mishra, Linux MM, LKML

On Fri, Jul 15, 2022 at 02:52:27PM -0700, Axel Rasmussen wrote:
> Guest access in terms of "physical" memory address is basically
> random. So, actually filling in all 262k 4K PTEs making up a
> contiguous 1G region might take quite some time. Once we've completed
> any of the various 2M contiguous regions, it would be nice to go ahead
> and collapse those right away. The benefit is, the guest will see some
> performance benefit from the 2G page already, without having to wait
> for the full 1G page to complete. Once we do complete a 1G page, it
> would be nice to collapse that one level further. If we do this, the
> whole guest memory will be a mix of 1G, 2M, and 4K.

Just to mention that we've got quite some other things that drags perf down
much more than tlb hits on page sizes during any VM migration process.

For example, when we split & wr-protect pages during the starting phase of
migration on src host, it's not about 10% or 20% drop but much drastic.  In
the postcopy case it's for dest but still it's part of the whole migration
process and probably guest-aware too.  If the guest wants, it can simply
start writting some pages continuously and it'll see obvious drag downs any
time during migration I bet.

It'll always be nice to have multi-level sub-mappings and I fully agree.
IMHO it's a matter of whether keeping 4k-only would greatly simplify the
work, especially on the rework of hugetlb sub-mage aware pgtable ops.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 15/26] hugetlb: make unmapping compatible with high-granularity mappings
  2022-06-24 17:36 ` [RFC PATCH 15/26] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
@ 2022-07-19 10:19   ` manish.mishra
  2022-07-19 15:58     ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-07-19 10:19 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This enlightens __unmap_hugepage_range to deal with high-granularity
> mappings. This doesn't change its API; it still must be called with
> hugepage alignment, but it will correctly unmap hugepages that have been
> mapped at high granularity.
>
> Analogous to the mapcount rules introduced by hugetlb_no_page, we only
> drop mapcount in this case if we are unmapping an entire hugepage in one
> operation. This is the case when a VMA is destroyed.
>
> Eventually, functionality here can be expanded to allow users to call
> MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is
> not done here.

Sorry i may have misunderstood something here, but allowing something like

MADV_DONTNEED on PAGE_SIZE in hugetlbfs can cause fragmentation

in hugetlbfs pool which kind of looks opposite of prupose of hugetlbfs?

>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   include/asm-generic/tlb.h |  6 +--
>   mm/hugetlb.c              | 85 ++++++++++++++++++++++++++-------------
>   2 files changed, 59 insertions(+), 32 deletions(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index ff3e82553a76..8daa3ae460d9 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -562,9 +562,9 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
>   		__tlb_remove_tlb_entry(tlb, ptep, address);	\
>   	} while (0)
>   
> -#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
> +#define tlb_remove_huge_tlb_entry(tlb, hpte, address)	\
>   	do {							\
> -		unsigned long _sz = huge_page_size(h);		\
> +		unsigned long _sz = hugetlb_pte_size(&hpte);	\
>   		if (_sz >= P4D_SIZE)				\
>   			tlb_flush_p4d_range(tlb, address, _sz);	\
>   		else if (_sz >= PUD_SIZE)			\
> @@ -573,7 +573,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
>   			tlb_flush_pmd_range(tlb, address, _sz);	\
>   		else						\
>   			tlb_flush_pte_range(tlb, address, _sz);	\
> -		__tlb_remove_tlb_entry(tlb, ptep, address);	\
> +		__tlb_remove_tlb_entry(tlb, hpte.ptep, address);\
>   	} while (0)
>   
>   /**
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index da30621656b8..51fc1d3f122f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5120,24 +5120,20 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>   {
>   	struct mm_struct *mm = vma->vm_mm;
>   	unsigned long address;
> -	pte_t *ptep;
> +	struct hugetlb_pte hpte;
>   	pte_t pte;
>   	spinlock_t *ptl;
> -	struct page *page;
> +	struct page *hpage, *subpage;
>   	struct hstate *h = hstate_vma(vma);
>   	unsigned long sz = huge_page_size(h);
>   	struct mmu_notifier_range range;
>   	bool force_flush = false;
> +	bool hgm_enabled = hugetlb_hgm_enabled(vma);
>   
>   	WARN_ON(!is_vm_hugetlb_page(vma));
>   	BUG_ON(start & ~huge_page_mask(h));
>   	BUG_ON(end & ~huge_page_mask(h));
>   
> -	/*
> -	 * This is a hugetlb vma, all the pte entries should point
> -	 * to huge page.
> -	 */
> -	tlb_change_page_size(tlb, sz);
>   	tlb_start_vma(tlb, vma);
>   
>   	/*
> @@ -5148,25 +5144,43 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>   	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
>   	mmu_notifier_invalidate_range_start(&range);
>   	address = start;
> -	for (; address < end; address += sz) {
> -		ptep = huge_pte_offset(mm, address, sz);
> -		if (!ptep)
> +
> +	while (address < end) {
> +		pte_t *ptep = huge_pte_offset(mm, address, sz);
> +
> +		if (!ptep) {
> +			address += sz;
>   			continue;
> +		}
> +		hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
> +		if (hgm_enabled) {
> +			int ret = huge_pte_alloc_high_granularity(
> +					&hpte, mm, vma, address, PAGE_SHIFT,
> +					HUGETLB_SPLIT_NEVER,
> +					/*write_locked=*/true);

I see huge_pte_alloc_high_granularity with HUGETLB_SPLIT_NEVER just

do huge_tlb_walk. So is HUGETLB_SPLIT_NEVER even required, i mean

for those cases you can directly do huge_tlb_walk? I mean name

huge_pte_alloc_high_granularity confuses for those cases.

> +			/*
> +			 * We will never split anything, so this should always
> +			 * succeed.
> +			 */
> +			BUG_ON(ret);
> +		}
>   
> -		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
> +		ptl = hugetlb_pte_lock(mm, &hpte);
> +		if (!hgm_enabled && huge_pmd_unshare(
> +					mm, vma, &address, hpte.ptep)) {
>   			spin_unlock(ptl);
>   			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
>   			force_flush = true;
> -			continue;
> +			goto next_hpte;
>   		}
>   
> -		pte = huge_ptep_get(ptep);
> -		if (huge_pte_none(pte)) {
> +		if (hugetlb_pte_none(&hpte)) {
>   			spin_unlock(ptl);
> -			continue;
> +			goto next_hpte;
>   		}
>   
> +		pte = hugetlb_ptep_get(&hpte);
> +
>   		/*
>   		 * Migrating hugepage or HWPoisoned hugepage is already
>   		 * unmapped and its refcount is dropped, so just clear pte here.
> @@ -5180,24 +5194,27 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>   			 */
>   			if (pte_swp_uffd_wp_any(pte) &&
>   			    !(zap_flags & ZAP_FLAG_DROP_MARKER))
> -				set_huge_pte_at(mm, address, ptep,
> +				set_huge_pte_at(mm, address, hpte.ptep,
>   						make_pte_marker(PTE_MARKER_UFFD_WP));
>   			else
> -				huge_pte_clear(mm, address, ptep, sz);
> +				huge_pte_clear(mm, address, hpte.ptep,
> +						hugetlb_pte_size(&hpte));
>   			spin_unlock(ptl);
> -			continue;
> +			goto next_hpte;
>   		}
>   
> -		page = pte_page(pte);
> +		subpage = pte_page(pte);
> +		BUG_ON(!subpage);
> +		hpage = compound_head(subpage);
>   		/*
>   		 * If a reference page is supplied, it is because a specific
>   		 * page is being unmapped, not a range. Ensure the page we
>   		 * are about to unmap is the actual page of interest.
>   		 */
>   		if (ref_page) {
> -			if (page != ref_page) {
> +			if (hpage != ref_page) {
>   				spin_unlock(ptl);
> -				continue;
> +				goto next_hpte;
>   			}
>   			/*
>   			 * Mark the VMA as having unmapped its page so that
> @@ -5207,25 +5224,35 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>   			set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
>   		}
>   
> -		pte = huge_ptep_get_and_clear(mm, address, ptep);
> -		tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
> +		pte = huge_ptep_get_and_clear(mm, address, hpte.ptep);
> +		tlb_change_page_size(tlb, hugetlb_pte_size(&hpte));
> +		tlb_remove_huge_tlb_entry(tlb, hpte, address);
>   		if (huge_pte_dirty(pte))
> -			set_page_dirty(page);
> +			set_page_dirty(hpage);
>   		/* Leave a uffd-wp pte marker if needed */
>   		if (huge_pte_uffd_wp(pte) &&
>   		    !(zap_flags & ZAP_FLAG_DROP_MARKER))
> -			set_huge_pte_at(mm, address, ptep,
> +			set_huge_pte_at(mm, address, hpte.ptep,
>   					make_pte_marker(PTE_MARKER_UFFD_WP));
> -		hugetlb_count_sub(pages_per_huge_page(h), mm);
> -		page_remove_rmap(page, vma, true);
> +
> +		hugetlb_count_sub(hugetlb_pte_size(&hpte)/PAGE_SIZE, mm);
> +
> +		/*
> +		 * If we are unmapping the entire page, remove it from the
> +		 * rmap.
> +		 */
> +		if (IS_ALIGNED(address, sz) && address + sz <= end)
> +			page_remove_rmap(hpage, vma, true);
>   
>   		spin_unlock(ptl);
> -		tlb_remove_page_size(tlb, page, huge_page_size(h));
> +		tlb_remove_page_size(tlb, subpage, hugetlb_pte_size(&hpte));
>   		/*
>   		 * Bail out after unmapping reference page if supplied
>   		 */
>   		if (ref_page)
>   			break;
> +next_hpte:
> +		address += hugetlb_pte_size(&hpte);
>   	}
>   	mmu_notifier_invalidate_range_end(&range);
>   	tlb_end_vma(tlb, vma);

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 17/26] hugetlb: update follow_hugetlb_page to support HGM
  2022-06-24 17:36 ` [RFC PATCH 17/26] hugetlb: update follow_hugetlb_page to support HGM James Houghton
@ 2022-07-19 10:48   ` manish.mishra
  2022-07-19 16:19     ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: manish.mishra @ 2022-07-19 10:48 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Jue Wang, Dr . David Alan Gilbert, linux-mm, linux-kernel


On 24/06/22 11:06 pm, James Houghton wrote:
> This enables support for GUP, and it is needed for the KVM demand paging
> self-test to work.
>
> One important change here is that, before, we never needed to grab the
> i_mmap_sem, but now, to prevent someone from collapsing the page tables
> out from under us, we grab it for reading when doing high-granularity PT
> walks.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   mm/hugetlb.c | 70 ++++++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 57 insertions(+), 13 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f9c7daa6c090..aadfcee947cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6298,14 +6298,18 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   	unsigned long vaddr = *position;
>   	unsigned long remainder = *nr_pages;
>   	struct hstate *h = hstate_vma(vma);
> +	struct address_space *mapping = vma->vm_file->f_mapping;
>   	int err = -EFAULT, refs;
> +	bool has_i_mmap_sem = false;
>   
>   	while (vaddr < vma->vm_end && remainder) {
>   		pte_t *pte;
>   		spinlock_t *ptl = NULL;
>   		bool unshare = false;
>   		int absent;
> +		unsigned long pages_per_hpte;
>   		struct page *page;
> +		struct hugetlb_pte hpte;
>   
>   		/*
>   		 * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -6325,9 +6329,23 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   		 */
>   		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
>   				      huge_page_size(h));
> -		if (pte)
> -			ptl = huge_pte_lock(h, mm, pte);
> -		absent = !pte || huge_pte_none(huge_ptep_get(pte));
> +		if (pte) {
> +			hugetlb_pte_populate(&hpte, pte, huge_page_shift(h));
> +			if (hugetlb_hgm_enabled(vma)) {
> +				BUG_ON(has_i_mmap_sem);

Just thinking can we do without i_mmap_lock_read in most cases. Like earlier

this function was good without i_mmap_lock_read doing almost everything

which is happening now?

> +				i_mmap_lock_read(mapping);
> +				/*
> +				 * Need to hold the mapping semaphore for
> +				 * reading to do a HGM walk.
> +				 */
> +				has_i_mmap_sem = true;
> +				hugetlb_walk_to(mm, &hpte, vaddr, PAGE_SIZE,
> +						/*stop_at_none=*/true);
> +			}
> +			ptl = hugetlb_pte_lock(mm, &hpte);
> +		}
> +
> +		absent = !pte || hugetlb_pte_none(&hpte);
>   
>   		/*
>   		 * When coredumping, it suits get_dump_page if we just return
> @@ -6338,8 +6356,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   		 */
>   		if (absent && (flags & FOLL_DUMP) &&
>   		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
> -			if (pte)
> +			if (pte) {
> +				if (has_i_mmap_sem) {
> +					i_mmap_unlock_read(mapping);
> +					has_i_mmap_sem = false;
> +				}
>   				spin_unlock(ptl);
> +			}
>   			remainder = 0;
>   			break;
>   		}
> @@ -6359,8 +6382,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   			vm_fault_t ret;
>   			unsigned int fault_flags = 0;
>   
> -			if (pte)
> +			if (pte) {
> +				if (has_i_mmap_sem) {
> +					i_mmap_unlock_read(mapping);
> +					has_i_mmap_sem = false;
> +				}
>   				spin_unlock(ptl);
> +			}
>   			if (flags & FOLL_WRITE)
>   				fault_flags |= FAULT_FLAG_WRITE;
>   			else if (unshare)
> @@ -6403,8 +6431,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   			continue;
>   		}
>   
> -		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> -		page = pte_page(huge_ptep_get(pte));
> +		pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
> +		page = pte_page(hugetlb_ptep_get(&hpte));
> +		pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
> +		if (hugetlb_hgm_enabled(vma))
> +			page = compound_head(page);
>   
>   		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
>   			       !PageAnonExclusive(page), page);
> @@ -6414,17 +6445,21 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   		 * and skip the same_page loop below.
>   		 */
>   		if (!pages && !vmas && !pfn_offset &&
> -		    (vaddr + huge_page_size(h) < vma->vm_end) &&
> -		    (remainder >= pages_per_huge_page(h))) {
> -			vaddr += huge_page_size(h);
> -			remainder -= pages_per_huge_page(h);
> -			i += pages_per_huge_page(h);
> +		    (vaddr + pages_per_hpte < vma->vm_end) &&
> +		    (remainder >= pages_per_hpte)) {
> +			vaddr += pages_per_hpte;
> +			remainder -= pages_per_hpte;
> +			i += pages_per_hpte;
>   			spin_unlock(ptl);
> +			if (has_i_mmap_sem) {
> +				has_i_mmap_sem = false;
> +				i_mmap_unlock_read(mapping);
> +			}
>   			continue;
>   		}
>   
>   		/* vaddr may not be aligned to PAGE_SIZE */
> -		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
> +		refs = min3(pages_per_hpte - pfn_offset, remainder,
>   		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
>   
>   		if (pages || vmas)
> @@ -6447,6 +6482,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   			if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
>   							 flags))) {
>   				spin_unlock(ptl);
> +				if (has_i_mmap_sem) {
> +					has_i_mmap_sem = false;
> +					i_mmap_unlock_read(mapping);
> +				}
>   				remainder = 0;
>   				err = -ENOMEM;
>   				break;
> @@ -6458,8 +6497,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   		i += refs;
>   
>   		spin_unlock(ptl);
> +		if (has_i_mmap_sem) {
> +			has_i_mmap_sem = false;
> +			i_mmap_unlock_read(mapping);
> +		}
>   	}
>   	*nr_pages = remainder;
> +	BUG_ON(has_i_mmap_sem);
>   	/*
>   	 * setting position is actually required only if remainder is
>   	 * not zero but it's faster not to add a "if (remainder)"

Thanks

Manish Mishra


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 15/26] hugetlb: make unmapping compatible with high-granularity mappings
  2022-07-19 10:19   ` manish.mishra
@ 2022-07-19 15:58     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-07-19 15:58 UTC (permalink / raw)
  To: manish.mishra
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Tue, Jul 19, 2022 at 3:20 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > This enlightens __unmap_hugepage_range to deal with high-granularity
> > mappings. This doesn't change its API; it still must be called with
> > hugepage alignment, but it will correctly unmap hugepages that have been
> > mapped at high granularity.
> >
> > Analogous to the mapcount rules introduced by hugetlb_no_page, we only
> > drop mapcount in this case if we are unmapping an entire hugepage in one
> > operation. This is the case when a VMA is destroyed.
> >
> > Eventually, functionality here can be expanded to allow users to call
> > MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is
> > not done here.
>
> Sorry i may have misunderstood something here, but allowing something like
>
> MADV_DONTNEED on PAGE_SIZE in hugetlbfs can cause fragmentation
>
> in hugetlbfs pool which kind of looks opposite of prupose of hugetlbfs?

It can be helpful for some applications, like if we want to get page
fault notifications through userfaultfd on a 4K piece of a hugepage.
It kind of goes against the purpose of HugeTLB, but we sort of get
this functionality automatically with this patch.

>
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   include/asm-generic/tlb.h |  6 +--
> >   mm/hugetlb.c              | 85 ++++++++++++++++++++++++++-------------
> >   2 files changed, 59 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index ff3e82553a76..8daa3ae460d9 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -562,9 +562,9 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> >               __tlb_remove_tlb_entry(tlb, ptep, address);     \
> >       } while (0)
> >
> > -#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)     \
> > +#define tlb_remove_huge_tlb_entry(tlb, hpte, address)        \
> >       do {                                                    \
> > -             unsigned long _sz = huge_page_size(h);          \
> > +             unsigned long _sz = hugetlb_pte_size(&hpte);    \
> >               if (_sz >= P4D_SIZE)                            \
> >                       tlb_flush_p4d_range(tlb, address, _sz); \
> >               else if (_sz >= PUD_SIZE)                       \
> > @@ -573,7 +573,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> >                       tlb_flush_pmd_range(tlb, address, _sz); \
> >               else                                            \
> >                       tlb_flush_pte_range(tlb, address, _sz); \
> > -             __tlb_remove_tlb_entry(tlb, ptep, address);     \
> > +             __tlb_remove_tlb_entry(tlb, hpte.ptep, address);\
> >       } while (0)
> >
> >   /**
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index da30621656b8..51fc1d3f122f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5120,24 +5120,20 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
> >   {
> >       struct mm_struct *mm = vma->vm_mm;
> >       unsigned long address;
> > -     pte_t *ptep;
> > +     struct hugetlb_pte hpte;
> >       pte_t pte;
> >       spinlock_t *ptl;
> > -     struct page *page;
> > +     struct page *hpage, *subpage;
> >       struct hstate *h = hstate_vma(vma);
> >       unsigned long sz = huge_page_size(h);
> >       struct mmu_notifier_range range;
> >       bool force_flush = false;
> > +     bool hgm_enabled = hugetlb_hgm_enabled(vma);
> >
> >       WARN_ON(!is_vm_hugetlb_page(vma));
> >       BUG_ON(start & ~huge_page_mask(h));
> >       BUG_ON(end & ~huge_page_mask(h));
> >
> > -     /*
> > -      * This is a hugetlb vma, all the pte entries should point
> > -      * to huge page.
> > -      */
> > -     tlb_change_page_size(tlb, sz);
> >       tlb_start_vma(tlb, vma);
> >
> >       /*
> > @@ -5148,25 +5144,43 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
> >       adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> >       mmu_notifier_invalidate_range_start(&range);
> >       address = start;
> > -     for (; address < end; address += sz) {
> > -             ptep = huge_pte_offset(mm, address, sz);
> > -             if (!ptep)
> > +
> > +     while (address < end) {
> > +             pte_t *ptep = huge_pte_offset(mm, address, sz);
> > +
> > +             if (!ptep) {
> > +                     address += sz;
> >                       continue;
> > +             }
> > +             hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
> > +             if (hgm_enabled) {
> > +                     int ret = huge_pte_alloc_high_granularity(
> > +                                     &hpte, mm, vma, address, PAGE_SHIFT,
> > +                                     HUGETLB_SPLIT_NEVER,
> > +                                     /*write_locked=*/true);
>
> I see huge_pte_alloc_high_granularity with HUGETLB_SPLIT_NEVER just
>
> do huge_tlb_walk. So is HUGETLB_SPLIT_NEVER even required, i mean
>
> for those cases you can directly do huge_tlb_walk? I mean name
>
> huge_pte_alloc_high_granularity confuses for those cases.

Agreed. huge_pte_alloc_high_granularity with HUGETLB_SPLIT_NEVER is
pretty much the same as hugetlb_walk_to (+hugetlb_pte_init). It is
confusing to have two ways of doing the exact same thing, so I'll get
rid of HUGETLB_SPLIT_NEVER (and the "alloc" name is confusing in this
case too, yeah).

>
> > +                     /*
> > +                      * We will never split anything, so this should always
> > +                      * succeed.
> > +                      */
> > +                     BUG_ON(ret);
> > +             }
> >
> > -             ptl = huge_pte_lock(h, mm, ptep);
> > -             if (huge_pmd_unshare(mm, vma, &address, ptep)) {
> > +             ptl = hugetlb_pte_lock(mm, &hpte);
> > +             if (!hgm_enabled && huge_pmd_unshare(
> > +                                     mm, vma, &address, hpte.ptep)) {
> >                       spin_unlock(ptl);
> >                       tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
> >                       force_flush = true;
> > -                     continue;
> > +                     goto next_hpte;
> >               }
> >
> > -             pte = huge_ptep_get(ptep);
> > -             if (huge_pte_none(pte)) {
> > +             if (hugetlb_pte_none(&hpte)) {
> >                       spin_unlock(ptl);
> > -                     continue;
> > +                     goto next_hpte;
> >               }
> >
> > +             pte = hugetlb_ptep_get(&hpte);
> > +
> >               /*
> >                * Migrating hugepage or HWPoisoned hugepage is already
> >                * unmapped and its refcount is dropped, so just clear pte here.
> > @@ -5180,24 +5194,27 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
> >                        */
> >                       if (pte_swp_uffd_wp_any(pte) &&
> >                           !(zap_flags & ZAP_FLAG_DROP_MARKER))
> > -                             set_huge_pte_at(mm, address, ptep,
> > +                             set_huge_pte_at(mm, address, hpte.ptep,
> >                                               make_pte_marker(PTE_MARKER_UFFD_WP));
> >                       else
> > -                             huge_pte_clear(mm, address, ptep, sz);
> > +                             huge_pte_clear(mm, address, hpte.ptep,
> > +                                             hugetlb_pte_size(&hpte));
> >                       spin_unlock(ptl);
> > -                     continue;
> > +                     goto next_hpte;
> >               }
> >
> > -             page = pte_page(pte);
> > +             subpage = pte_page(pte);
> > +             BUG_ON(!subpage);
> > +             hpage = compound_head(subpage);
> >               /*
> >                * If a reference page is supplied, it is because a specific
> >                * page is being unmapped, not a range. Ensure the page we
> >                * are about to unmap is the actual page of interest.
> >                */
> >               if (ref_page) {
> > -                     if (page != ref_page) {
> > +                     if (hpage != ref_page) {
> >                               spin_unlock(ptl);
> > -                             continue;
> > +                             goto next_hpte;
> >                       }
> >                       /*
> >                        * Mark the VMA as having unmapped its page so that
> > @@ -5207,25 +5224,35 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
> >                       set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
> >               }
> >
> > -             pte = huge_ptep_get_and_clear(mm, address, ptep);
> > -             tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
> > +             pte = huge_ptep_get_and_clear(mm, address, hpte.ptep);
> > +             tlb_change_page_size(tlb, hugetlb_pte_size(&hpte));
> > +             tlb_remove_huge_tlb_entry(tlb, hpte, address);
> >               if (huge_pte_dirty(pte))
> > -                     set_page_dirty(page);
> > +                     set_page_dirty(hpage);
> >               /* Leave a uffd-wp pte marker if needed */
> >               if (huge_pte_uffd_wp(pte) &&
> >                   !(zap_flags & ZAP_FLAG_DROP_MARKER))
> > -                     set_huge_pte_at(mm, address, ptep,
> > +                     set_huge_pte_at(mm, address, hpte.ptep,
> >                                       make_pte_marker(PTE_MARKER_UFFD_WP));
> > -             hugetlb_count_sub(pages_per_huge_page(h), mm);
> > -             page_remove_rmap(page, vma, true);
> > +
> > +             hugetlb_count_sub(hugetlb_pte_size(&hpte)/PAGE_SIZE, mm);
> > +
> > +             /*
> > +              * If we are unmapping the entire page, remove it from the
> > +              * rmap.
> > +              */
> > +             if (IS_ALIGNED(address, sz) && address + sz <= end)
> > +                     page_remove_rmap(hpage, vma, true);
> >
> >               spin_unlock(ptl);
> > -             tlb_remove_page_size(tlb, page, huge_page_size(h));
> > +             tlb_remove_page_size(tlb, subpage, hugetlb_pte_size(&hpte));
> >               /*
> >                * Bail out after unmapping reference page if supplied
> >                */
> >               if (ref_page)
> >                       break;
> > +next_hpte:
> > +             address += hugetlb_pte_size(&hpte);
> >       }
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_end_vma(tlb, vma);

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 17/26] hugetlb: update follow_hugetlb_page to support HGM
  2022-07-19 10:48   ` manish.mishra
@ 2022-07-19 16:19     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-07-19 16:19 UTC (permalink / raw)
  To: manish.mishra
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Jue Wang,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Tue, Jul 19, 2022 at 3:48 AM manish.mishra <manish.mishra@nutanix.com> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > This enables support for GUP, and it is needed for the KVM demand paging
> > self-test to work.
> >
> > One important change here is that, before, we never needed to grab the
> > i_mmap_sem, but now, to prevent someone from collapsing the page tables
> > out from under us, we grab it for reading when doing high-granularity PT
> > walks.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >   mm/hugetlb.c | 70 ++++++++++++++++++++++++++++++++++++++++++----------
> >   1 file changed, 57 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f9c7daa6c090..aadfcee947cf 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6298,14 +6298,18 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >       unsigned long vaddr = *position;
> >       unsigned long remainder = *nr_pages;
> >       struct hstate *h = hstate_vma(vma);
> > +     struct address_space *mapping = vma->vm_file->f_mapping;
> >       int err = -EFAULT, refs;
> > +     bool has_i_mmap_sem = false;
> >
> >       while (vaddr < vma->vm_end && remainder) {
> >               pte_t *pte;
> >               spinlock_t *ptl = NULL;
> >               bool unshare = false;
> >               int absent;
> > +             unsigned long pages_per_hpte;
> >               struct page *page;
> > +             struct hugetlb_pte hpte;
> >
> >               /*
> >                * If we have a pending SIGKILL, don't keep faulting pages and
> > @@ -6325,9 +6329,23 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                */
> >               pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
> >                                     huge_page_size(h));
> > -             if (pte)
> > -                     ptl = huge_pte_lock(h, mm, pte);
> > -             absent = !pte || huge_pte_none(huge_ptep_get(pte));
> > +             if (pte) {
> > +                     hugetlb_pte_populate(&hpte, pte, huge_page_shift(h));
> > +                     if (hugetlb_hgm_enabled(vma)) {
> > +                             BUG_ON(has_i_mmap_sem);
>
> Just thinking can we do without i_mmap_lock_read in most cases. Like earlier
>
> this function was good without i_mmap_lock_read doing almost everything
>
> which is happening now?

We need something to prevent the page tables from being rearranged
while we're walking them. In this RFC, I used the i_mmap_lock. I'm
going to change it, probably to a per-VMA lock (or maybe a per-hpage
lock. I'm trying to figure out if a system with PTLs/hugetlb_pte_lock
could work too :)).

>
> > +                             i_mmap_lock_read(mapping);
> > +                             /*
> > +                              * Need to hold the mapping semaphore for
> > +                              * reading to do a HGM walk.
> > +                              */
> > +                             has_i_mmap_sem = true;
> > +                             hugetlb_walk_to(mm, &hpte, vaddr, PAGE_SIZE,
> > +                                             /*stop_at_none=*/true);
> > +                     }
> > +                     ptl = hugetlb_pte_lock(mm, &hpte);
> > +             }
> > +
> > +             absent = !pte || hugetlb_pte_none(&hpte);
> >
> >               /*
> >                * When coredumping, it suits get_dump_page if we just return
> > @@ -6338,8 +6356,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                */
> >               if (absent && (flags & FOLL_DUMP) &&
> >                   !hugetlbfs_pagecache_present(h, vma, vaddr)) {
> > -                     if (pte)
> > +                     if (pte) {
> > +                             if (has_i_mmap_sem) {
> > +                                     i_mmap_unlock_read(mapping);
> > +                                     has_i_mmap_sem = false;
> > +                             }
> >                               spin_unlock(ptl);
> > +                     }
> >                       remainder = 0;
> >                       break;
> >               }
> > @@ -6359,8 +6382,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                       vm_fault_t ret;
> >                       unsigned int fault_flags = 0;
> >
> > -                     if (pte)
> > +                     if (pte) {
> > +                             if (has_i_mmap_sem) {
> > +                                     i_mmap_unlock_read(mapping);
> > +                                     has_i_mmap_sem = false;
> > +                             }
> >                               spin_unlock(ptl);
> > +                     }
> >                       if (flags & FOLL_WRITE)
> >                               fault_flags |= FAULT_FLAG_WRITE;
> >                       else if (unshare)
> > @@ -6403,8 +6431,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                       continue;
> >               }
> >
> > -             pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> > -             page = pte_page(huge_ptep_get(pte));
> > +             pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
> > +             page = pte_page(hugetlb_ptep_get(&hpte));
> > +             pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
> > +             if (hugetlb_hgm_enabled(vma))
> > +                     page = compound_head(page);
> >
> >               VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> >                              !PageAnonExclusive(page), page);
> > @@ -6414,17 +6445,21 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                * and skip the same_page loop below.
> >                */
> >               if (!pages && !vmas && !pfn_offset &&
> > -                 (vaddr + huge_page_size(h) < vma->vm_end) &&
> > -                 (remainder >= pages_per_huge_page(h))) {
> > -                     vaddr += huge_page_size(h);
> > -                     remainder -= pages_per_huge_page(h);
> > -                     i += pages_per_huge_page(h);
> > +                 (vaddr + pages_per_hpte < vma->vm_end) &&
> > +                 (remainder >= pages_per_hpte)) {
> > +                     vaddr += pages_per_hpte;
> > +                     remainder -= pages_per_hpte;
> > +                     i += pages_per_hpte;
> >                       spin_unlock(ptl);
> > +                     if (has_i_mmap_sem) {
> > +                             has_i_mmap_sem = false;
> > +                             i_mmap_unlock_read(mapping);
> > +                     }
> >                       continue;
> >               }
> >
> >               /* vaddr may not be aligned to PAGE_SIZE */
> > -             refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
> > +             refs = min3(pages_per_hpte - pfn_offset, remainder,
> >                   (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
> >
> >               if (pages || vmas)
> > @@ -6447,6 +6482,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                       if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
> >                                                        flags))) {
> >                               spin_unlock(ptl);
> > +                             if (has_i_mmap_sem) {
> > +                                     has_i_mmap_sem = false;
> > +                                     i_mmap_unlock_read(mapping);
> > +                             }
> >                               remainder = 0;
> >                               err = -ENOMEM;
> >                               break;
> > @@ -6458,8 +6497,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >               i += refs;
> >
> >               spin_unlock(ptl);
> > +             if (has_i_mmap_sem) {
> > +                     has_i_mmap_sem = false;
> > +                     i_mmap_unlock_read(mapping);
> > +             }
> >       }
> >       *nr_pages = remainder;
> > +     BUG_ON(has_i_mmap_sem);
> >       /*
> >        * setting position is actually required only if remainder is
> >        * not zero but it's faster not to add a "if (remainder)"
>
> Thanks
>
> Manish Mishra
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
  2022-07-15 17:20       ` Peter Xu
@ 2022-07-20 20:58         ` James Houghton
  2022-07-21 19:09           ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-07-20 20:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jul 15, 2022 at 10:20 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote:
> > On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > > The changes here are very similar to the changes made to
> > > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > > do accounting slightly differently because we are mapping only a piece
> > > > of a page.
> > > >
> > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > > ---
> > > >  fs/userfaultfd.c        |  3 +++
> > > >  include/linux/hugetlb.h |  6 +++--
> > > >  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
> > > >  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
> > > >  4 files changed, 82 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index e943370107d0..77c1b8a7d0b9 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > > >       if (!ptep)
> > > >               goto out;
> > > >
> > > > +     if (hugetlb_hgm_enabled(vma))
> > > > +             goto out;
> > > > +
> > >
> > > This is weird.  It means we'll never wait for sub-page mapping enabled
> > > vmas.  Why?
> > >
> >
> > `ret` is true in this case, so we're actually *always* waiting.
>
> Aha!  Then I think that's another problem, sorry. :) See Below.
>
> >
> > > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> > >
> > > I'd expect in the in-house postcopy tests you should see vcpu threads
> > > spinning on the page faults until it's serviced.
> > >
> > > IMO we still need to properly wait when the pgtable doesn't have the
> > > faulted address covered.  For sub-page mapping it'll probably need to walk
> > > into sub-page levels.
> >
> > Ok, SGTM. I'll do that for the next version. I'm not sure of the
> > consequences of returning `true` here when we should be returning
> > `false`.
>
> We've put ourselves onto the wait queue, if another concurrent
> UFFDIO_CONTINUE happened and pte is already installed, I think this thread
> could be waiting forever on the next schedule().
>
> The solution should be the same - walking the sub-page pgtable would work,
> afaict.
>
> [...]
>
> > > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > >        * registered, we firstly wr-protect a none pte which has no page cache
> > > >        * page backing it, then access the page.
> > > >        */
> > > > -     if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > > +     if (!hugetlb_pte_none_mostly(dst_hpte))
> > > >               goto out_release_unlock;
> > > >
> > > > -     if (vm_shared) {
> > > > -             page_dup_file_rmap(page, true);
> > > > -     } else {
> > > > -             ClearHPageRestoreReserve(page);
> > > > -             hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > > +     if (new_mapping) {
> > >
> > > IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> > > that was going to be mapped.
> > >
> > > Is it a must we get this only from the caller?  Can we know we're doing
> > > sub-page mapping already here and make a decision with e.g. dst_hpte?
> > >
> > > It looks weird to me to pass this explicitly from the caller, especially
> > > that's when we don't really have the pgtable lock so I'm wondering about
> > > possible race conditions too on having stale new_mapping values.
> >
> > The only way to know what the correct value for `new_mapping` should
> > be is to know if we had to change the hstate-level P*D to non-none to
> > service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> > to do that check in `hugetlb_mcopy_atomic_pte`.
> > Right now there is no
>
> Would "new_mapping = dest_hpte->shift != huge_page_shift(hstate)" work (or
> something alike)?

This works in the hugetlb_fault case, because in the hugetlb_fault
case, we install the largest PTE possible. If we are mapping a page
for the first time, we will use an hstate-sized PTE. But for
UFFDIO_CONTINUE, we may be installing a 4K PTE as the first PTE for
the whole hpage.

>
> > race, because we synchronize on the per-hpage mutex.
>
> Yeah not familiar with that mutex enough to tell, as long as that mutex
> guarantees no pgtable update (hmm, then why we need the pgtable lock
> here???) then it looks fine.

Let me take a closer look at this. I'll have a more detailed
explanation for the next version of the RFC.

>
> [...]
>
> > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > >       copied = 0;
> > > >       page = NULL;
> > > >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > +     if (use_hgm)
> > > > +             vma_altpagesize = PAGE_SIZE;
> > >
> > > Do we need to check the "len" to know whether we should use sub-page
> > > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > > still want the old behavior I think.
> >
> > I think that's a fair point; however, if we enable HGM and the address
> > and len happen to be hstate-aligned
>
> The address can, but len (note! not "end" here) cannot?

They both (dst_start and len) need to be hpage-aligned, otherwise we
won't be able to install hstate-sized PTEs. Like if we're installing
4K at the beginning of a 1G hpage, we can't install a PUD, because we
only want to install that 4K.

>
> > , we basically do the same thing as
> > if HGM wasn't enabled. It could be a minor performance optimization to
> > do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
> > the page tables are set up, the end result would be the same.
>
> Thanks,

Thanks!

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
  2022-07-20 20:58         ` James Houghton
@ 2022-07-21 19:09           ` Peter Xu
  2022-07-21 19:44             ` James Houghton
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2022-07-21 19:09 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Wed, Jul 20, 2022 at 01:58:06PM -0700, James Houghton wrote:
> > > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > > >       copied = 0;
> > > > >       page = NULL;
> > > > >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > > +     if (use_hgm)
> > > > > +             vma_altpagesize = PAGE_SIZE;
> > > >
> > > > Do we need to check the "len" to know whether we should use sub-page
> > > > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > > > still want the old behavior I think.
> > >
> > > I think that's a fair point; however, if we enable HGM and the address
> > > and len happen to be hstate-aligned
> >
> > The address can, but len (note! not "end" here) cannot?
> 
> They both (dst_start and len) need to be hpage-aligned, otherwise we
> won't be able to install hstate-sized PTEs. Like if we're installing
> 4K at the beginning of a 1G hpage, we can't install a PUD, because we
> only want to install that 4K.

I'm still confused...

Shouldn't one of the major goals of sub-page mapping is to grant user the
capability to do UFFDIO_CONTINUE with len<hpagesize (so we install pages in
sub-page level)?  If so, why len needs to be always hpagesize aligned?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
  2022-07-21 19:09           ` Peter Xu
@ 2022-07-21 19:44             ` James Houghton
  2022-07-21 19:53               ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: James Houghton @ 2022-07-21 19:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Thu, Jul 21, 2022 at 12:09 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jul 20, 2022 at 01:58:06PM -0700, James Houghton wrote:
> > > > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > > > >       copied = 0;
> > > > > >       page = NULL;
> > > > > >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > > > +     if (use_hgm)
> > > > > > +             vma_altpagesize = PAGE_SIZE;
> > > > >
> > > > > Do we need to check the "len" to know whether we should use sub-page
> > > > > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > > > > still want the old behavior I think.
> > > >
> > > > I think that's a fair point; however, if we enable HGM and the address
> > > > and len happen to be hstate-aligned
> > >
> > > The address can, but len (note! not "end" here) cannot?
> >
> > They both (dst_start and len) need to be hpage-aligned, otherwise we
> > won't be able to install hstate-sized PTEs. Like if we're installing
> > 4K at the beginning of a 1G hpage, we can't install a PUD, because we
> > only want to install that 4K.
>
> I'm still confused...
>
> Shouldn't one of the major goals of sub-page mapping is to grant user the
> capability to do UFFDIO_CONTINUE with len<hpagesize (so we install pages in
> sub-page level)?  If so, why len needs to be always hpagesize aligned?

Sorry I misunderstood what you were asking. We allow both to be
PAGE_SIZE-aligned. :) That is indeed the goal of HGM.

If dst_start and len were both hpage-aligned, then we *could* set
`use_hgm = false`, and everything would still work. That's what I
thought you were asking about. I don't see any reason to do this
though, as `use_hgm = true` will only grant additional functionality,
and `use_hgm = false` would only -- at best -- be a minor performance
optimization in this case.

- James

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE
  2022-07-21 19:44             ` James Houghton
@ 2022-07-21 19:53               ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2022-07-21 19:53 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Thu, Jul 21, 2022 at 12:44:58PM -0700, James Houghton wrote:
> On Thu, Jul 21, 2022 at 12:09 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Jul 20, 2022 at 01:58:06PM -0700, James Houghton wrote:
> > > > > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > > > > >       copied = 0;
> > > > > > >       page = NULL;
> > > > > > >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > > > > +     if (use_hgm)
> > > > > > > +             vma_altpagesize = PAGE_SIZE;
> > > > > >
> > > > > > Do we need to check the "len" to know whether we should use sub-page
> > > > > > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > > > > > still want the old behavior I think.
> > > > >
> > > > > I think that's a fair point; however, if we enable HGM and the address
> > > > > and len happen to be hstate-aligned
> > > >
> > > > The address can, but len (note! not "end" here) cannot?
> > >
> > > They both (dst_start and len) need to be hpage-aligned, otherwise we
> > > won't be able to install hstate-sized PTEs. Like if we're installing
> > > 4K at the beginning of a 1G hpage, we can't install a PUD, because we
> > > only want to install that 4K.
> >
> > I'm still confused...
> >
> > Shouldn't one of the major goals of sub-page mapping is to grant user the
> > capability to do UFFDIO_CONTINUE with len<hpagesize (so we install pages in
> > sub-page level)?  If so, why len needs to be always hpagesize aligned?
> 
> Sorry I misunderstood what you were asking. We allow both to be
> PAGE_SIZE-aligned. :) That is indeed the goal of HGM.

Ah OK. :)

> 
> If dst_start and len were both hpage-aligned, then we *could* set
> `use_hgm = false`, and everything would still work. That's what I
> thought you were asking about. I don't see any reason to do this
> though, as `use_hgm = true` will only grant additional functionality,
> and `use_hgm = false` would only -- at best -- be a minor performance
> optimization in this case.

I just want to make sure this patch won't break existing uffd-minor users,
or it'll be an kernel abi breakage.

We'd still want to have e.g. existing compiled apps run like before, which
iiuc means we should only use sub-page mapping when len!=hpagesize here.

I'm not sure it's only about perf - the app may not even be prepared to
receive yet another page faults within the same huge page range.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
                     ` (3 preceding siblings ...)
  2022-07-11 23:32   ` Mike Kravetz
@ 2022-09-08 17:38   ` Peter Xu
  2022-09-08 17:54     ` James Houghton
  4 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2022-09-08 17:38 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

James,

On Fri, Jun 24, 2022 at 05:36:37PM +0000, James Houghton wrote:
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +	BUG_ON(!hpte->ptep);
> +	// Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
> +	// the regular page table lock.
> +	if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
> +		return huge_pte_lockptr(hugetlb_pte_shift(hpte),
> +				mm, hpte->ptep);
> +	return &mm->page_table_lock;
> +}

Today when I re-read part of this thread, I found that I'm not sure whether
this is safe.  IIUC taking different locks depending on the state of pte
may lead to issues.

For example, could below race happen where two threads can be taking
different locks even if stumbled over the same pmd entry?

         thread 1                          thread 2
         --------                          --------

    hugetlb_pte_lockptr (for pmd level)
      pte_none()==true,
        take pmd lock
    pmd_alloc()
                                hugetlb_pte_lockptr (for pmd level)
                                  pte is pgtable entry (so !none, !present_leaf)
                                    take page_table_lock
                                (can run concurrently with thread 1...)
    pte_alloc()
    ...

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-09-08 17:38   ` Peter Xu
@ 2022-09-08 17:54     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-09-08 17:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Thu, Sep 8, 2022 at 10:38 AM Peter Xu <peterx@redhat.com> wrote:
>
> James,
>
> On Fri, Jun 24, 2022 at 05:36:37PM +0000, James Houghton wrote:
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > +     BUG_ON(!hpte->ptep);
> > +     // Only use huge_pte_lockptr if we are at leaf-level. Otherwise use
> > +     // the regular page table lock.
> > +     if (hugetlb_pte_none(hpte) || hugetlb_pte_present_leaf(hpte))
> > +             return huge_pte_lockptr(hugetlb_pte_shift(hpte),
> > +                             mm, hpte->ptep);
> > +     return &mm->page_table_lock;
> > +}
>
> Today when I re-read part of this thread, I found that I'm not sure whether
> this is safe.  IIUC taking different locks depending on the state of pte
> may lead to issues.
>
> For example, could below race happen where two threads can be taking
> different locks even if stumbled over the same pmd entry?
>
>          thread 1                          thread 2
>          --------                          --------
>
>     hugetlb_pte_lockptr (for pmd level)
>       pte_none()==true,
>         take pmd lock
>     pmd_alloc()
>                                 hugetlb_pte_lockptr (for pmd level)
>                                   pte is pgtable entry (so !none, !present_leaf)
>                                     take page_table_lock
>                                 (can run concurrently with thread 1...)
>     pte_alloc()
>     ...

Thanks for pointing out this race. Yes, it is wrong to change which
lock we take depending on the value of the PTE, as we would need to
lock the PTE first to correctly make the decision. This has already
been fixed in the next version of this series :). That is, we choose
which lock to grab based on the PTE's page table level.

- James

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled
  2022-06-24 17:36 ` [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled James Houghton
  2022-06-28 20:33   ` Mina Almasry
@ 2022-09-08 18:07   ` Peter Xu
  2022-09-08 18:13     ` James Houghton
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2022-09-08 18:07 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 05:36:39PM +0000, James Houghton wrote:
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +	/* All shared VMAs have HGM enabled. */
> +	return vma->vm_flags & VM_SHARED;
> +}
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

Another nitpick: suggest to rename this to "hugetlb_***_supported()" (with
whatever the new name could be..), as long as it cannot be "disabled". :)

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled
  2022-09-08 18:07   ` Peter Xu
@ 2022-09-08 18:13     ` James Houghton
  0 siblings, 0 replies; 123+ messages in thread
From: James Houghton @ 2022-09-08 18:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Thu, Sep 8, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:39PM +0000, James Houghton wrote:
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > +{
> > +     /* All shared VMAs have HGM enabled. */
> > +     return vma->vm_flags & VM_SHARED;
> > +}
> > +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>
> Another nitpick: suggest to rename this to "hugetlb_***_supported()" (with
> whatever the new name could be..), as long as it cannot be "disabled". :)
>

Will do. :)

- James

> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks
  2022-06-24 17:36 ` [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks James Houghton
  2022-06-27 13:07   ` manish.mishra
@ 2022-09-08 18:20   ` Peter Xu
  1 sibling, 0 replies; 123+ messages in thread
From: Peter Xu @ 2022-09-08 18:20 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Jue Wang, Manish Mishra,
	Dr . David Alan Gilbert, linux-mm, linux-kernel

On Fri, Jun 24, 2022 at 05:36:41PM +0000, James Houghton wrote:
> This adds it for architectures that use GENERAL_HUGETLB, including x86.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h |  2 ++
>  mm/hugetlb.c            | 45 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e7a6b944d0cc..605aa19d8572 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -258,6 +258,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long addr, unsigned long sz);
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz);
> +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		    unsigned long addr, unsigned long sz, bool stop_at_none);
>  int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  				unsigned long *addr, pte_t *ptep);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 557b0afdb503..3ec2a921ee6f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6981,6 +6981,51 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  	return (pte_t *)pmd;
>  }
>  
> +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		    unsigned long addr, unsigned long sz, bool stop_at_none)
> +{
> +	pte_t *ptep;
> +
> +	if (!hpte->ptep) {
> +		pgd_t *pgd = pgd_offset(mm, addr);
> +
> +		if (!pgd)
> +			return -ENOMEM;
> +		ptep = (pte_t *)p4d_alloc(mm, pgd, addr);
> +		if (!ptep)
> +			return -ENOMEM;
> +		hugetlb_pte_populate(hpte, ptep, P4D_SHIFT);
> +	}
> +
> +	while (hugetlb_pte_size(hpte) > sz &&
> +			!hugetlb_pte_present_leaf(hpte) &&
> +			!(stop_at_none && hugetlb_pte_none(hpte))) {
> +		if (hpte->shift == PMD_SHIFT) {
> +			ptep = pte_alloc_map(mm, (pmd_t *)hpte->ptep, addr);

I had a feeling that the pairing pte_unmap() was lost.

I think most distros are not with CONFIG_HIGHPTE at all, but still..

> +			if (!ptep)
> +				return -ENOMEM;
> +			hpte->shift = PAGE_SHIFT;
> +			hpte->ptep = ptep;
> +		} else if (hpte->shift == PUD_SHIFT) {
> +			ptep = (pte_t *)pmd_alloc(mm, (pud_t *)hpte->ptep,
> +						  addr);
> +			if (!ptep)
> +				return -ENOMEM;
> +			hpte->shift = PMD_SHIFT;
> +			hpte->ptep = ptep;
> +		} else if (hpte->shift == P4D_SHIFT) {
> +			ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep,
> +						  addr);
> +			if (!ptep)
> +				return -ENOMEM;
> +			hpte->shift = PUD_SHIFT;
> +			hpte->ptep = ptep;
> +		} else
> +			BUG();
> +	}
> +	return 0;
> +}
> +
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>  
>  #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> -- 
> 2.37.0.rc0.161.g10f37bed90-goog
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

end of thread, other threads:[~2022-09-08 18:21 UTC | newest]

Thread overview: 123+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
2022-06-24 17:36 ` [RFC PATCH 01/26] hugetlb: make hstate accessor functions const James Houghton
2022-06-24 18:43   ` Mina Almasry
     [not found]   ` <e55f90f5-ba14-5d6e-8f8f-abf731b9095e@nutanix.com>
     [not found]     ` <bb903be9-546d-04a7-e9e4-f5ba313319de@nutanix.com>
2022-06-28 17:08       ` James Houghton
2022-06-29  6:18   ` Muchun Song
2022-06-24 17:36 ` [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
2022-06-24 18:51   ` Mina Almasry
2022-06-27 12:08   ` manish.mishra
2022-06-28 15:35     ` James Houghton
2022-06-27 18:42   ` Mike Kravetz
2022-06-28 15:40     ` James Houghton
2022-06-29  6:39       ` Muchun Song
2022-06-29 21:06         ` Mike Kravetz
2022-06-29 21:13           ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift James Houghton
2022-06-24 19:01   ` Mina Almasry
2022-06-27 12:13   ` manish.mishra
2022-06-24 17:36 ` [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
2022-06-27 12:26   ` manish.mishra
2022-06-27 20:51   ` Mike Kravetz
2022-06-28 15:29     ` James Houghton
2022-06-29  6:09     ` Muchun Song
2022-06-29 21:03       ` Mike Kravetz
2022-06-29 21:39         ` James Houghton
2022-06-29 22:24           ` Mike Kravetz
2022-06-30  9:35             ` Muchun Song
2022-06-30 16:23               ` James Houghton
2022-06-30 17:40                 ` Mike Kravetz
2022-07-01  3:32                 ` Muchun Song
2022-06-24 17:36 ` [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
2022-06-27 12:28   ` manish.mishra
2022-06-28 20:03     ` Mina Almasry
2022-06-24 17:36 ` [RFC PATCH 06/26] mm: make free_p?d_range functions public James Houghton
2022-06-27 12:31   ` manish.mishra
2022-06-28 20:35   ` Mike Kravetz
2022-07-12 20:52     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
2022-06-27 12:47   ` manish.mishra
2022-06-29 16:28     ` James Houghton
2022-06-28 20:25   ` Mina Almasry
2022-06-29 16:42     ` James Houghton
2022-06-28 20:44   ` Mike Kravetz
2022-06-29 16:24     ` James Houghton
2022-07-11 23:32   ` Mike Kravetz
2022-07-12  9:42     ` Dr. David Alan Gilbert
2022-07-12 17:51       ` Mike Kravetz
2022-07-15 16:35       ` Peter Xu
2022-07-15 21:52         ` Axel Rasmussen
2022-07-15 23:03           ` Peter Xu
2022-09-08 17:38   ` Peter Xu
2022-09-08 17:54     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures James Houghton
2022-06-27 12:52   ` manish.mishra
2022-06-28 20:27   ` Mina Almasry
2022-06-24 17:36 ` [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled James Houghton
2022-06-28 20:33   ` Mina Almasry
2022-09-08 18:07   ` Peter Xu
2022-09-08 18:13     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift James Houghton
2022-06-27 13:01   ` manish.mishra
2022-06-28 21:58   ` Mina Almasry
2022-07-07 21:39     ` Mike Kravetz
2022-07-08 15:52     ` James Houghton
2022-07-09 21:55       ` Mina Almasry
2022-06-24 17:36 ` [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks James Houghton
2022-06-27 13:07   ` manish.mishra
2022-07-07 23:03     ` Mike Kravetz
2022-09-08 18:20   ` Peter Xu
2022-06-24 17:36 ` [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality James Houghton
2022-06-27 13:50   ` manish.mishra
2022-06-29 16:10     ` James Houghton
2022-06-29 14:33   ` manish.mishra
2022-06-29 16:20     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 13/26] hugetlb: add huge_pte_alloc_high_granularity James Houghton
2022-06-29 14:11   ` manish.mishra
2022-06-24 17:36 ` [RFC PATCH 14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
2022-06-29 14:40   ` manish.mishra
2022-06-29 15:56     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 15/26] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
2022-07-19 10:19   ` manish.mishra
2022-07-19 15:58     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 16/26] hugetlb: make hugetlb_change_protection compatible with HGM James Houghton
2022-06-24 17:36 ` [RFC PATCH 17/26] hugetlb: update follow_hugetlb_page to support HGM James Houghton
2022-07-19 10:48   ` manish.mishra
2022-07-19 16:19     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 18/26] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
2022-06-24 17:36 ` [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
2022-07-11 23:41   ` Mike Kravetz
2022-07-12 17:19     ` James Houghton
2022-07-12 18:06       ` Mike Kravetz
2022-07-15 21:39         ` Axel Rasmussen
2022-06-24 17:36 ` [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE James Houghton
2022-07-15 16:21   ` Peter Xu
2022-07-15 16:58     ` James Houghton
2022-07-15 17:20       ` Peter Xu
2022-07-20 20:58         ` James Houghton
2022-07-21 19:09           ` Peter Xu
2022-07-21 19:44             ` James Houghton
2022-07-21 19:53               ` Peter Xu
2022-06-24 17:36 ` [RFC PATCH 21/26] hugetlb: add hugetlb_collapse James Houghton
2022-06-24 17:36 ` [RFC PATCH 22/26] madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE James Houghton
2022-06-24 17:36 ` [RFC PATCH 23/26] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM James Houghton
2022-06-24 17:36 ` [RFC PATCH 24/26] arm64/hugetlb: add support for high-granularity mappings James Houghton
2022-06-24 17:36 ` [RFC PATCH 25/26] selftests: add HugeTLB HGM to userfaultfd selftest James Houghton
2022-06-24 17:36 ` [RFC PATCH 26/26] selftests: add HugeTLB HGM to KVM demand paging selftest James Houghton
2022-06-24 18:29 ` [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping Matthew Wilcox
2022-06-27 16:36   ` James Houghton
2022-06-27 17:56     ` Dr. David Alan Gilbert
2022-06-27 20:31       ` James Houghton
2022-06-28  0:04         ` Nadav Amit
2022-06-30 19:21           ` Peter Xu
2022-07-01  5:54             ` Nadav Amit
2022-06-28  8:20         ` Dr. David Alan Gilbert
2022-06-30 16:09           ` Peter Xu
2022-06-24 18:41 ` Mina Almasry
2022-06-27 16:27   ` James Houghton
2022-06-28 14:17     ` Muchun Song
2022-06-28 17:26     ` Mina Almasry
2022-06-28 17:56       ` Dr. David Alan Gilbert
2022-06-29 18:31         ` James Houghton
2022-06-29 20:39       ` Axel Rasmussen
2022-06-24 18:47 ` Matthew Wilcox
2022-06-27 16:48   ` James Houghton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).