* [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping
@ 2022-10-21 16:36 James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
                   ` (46 more replies)
  0 siblings, 47 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This RFC v2 is a more complete and correct implementation of
the original high-granularity mapping RFC[1]. For HGM background and
motivation, please see the original RFC.

This series has changed quite significantly since its first version, so
I've dropped all the Reviewed-bys that it picked up, and I am not
including a full changelog here. Some notable changes:
  1. mapcount rules have been simplified (now: the number of times a
     hugepage is referenced in page tables, still tracked on the head
     page).
  2. Synchronizing page table collapsing is now done using the VMA lock
     that Mike introduced recently.
  3. PTE splitting is only supported for blank PTEs, and it is done
     without needing to hold the VMA lock for writing. In many places,
     we explicitly check if a PTE has been split from under us.
  4. The userspace API has changed slightly.

This series implements high-granularity mapping basics, enough to
support PAGE_SIZE-aligned UFFDIO_CONTINUE operations and MADV_COLLAPSE
for shared HugeTLB VMAs on x86. The main use case for this is post-copy
for virtual machines, one of the important HGM use cases described in
[1]. MADV_COLLAPSE was originally introduced for THPs[2], but it is
now meaningful for HGM, and so I am co-opting the same API.

- Userspace API

There are two main ways userspace interacts with high-granularity
mappings:
  1. Create them with UFFDIO_CONTINUE in an appropriately configured
     userfaultfd VMA.
  2. Collapse high-granularity mappings with MADV_COLLAPSE.

The userfaultfd bits of the userspace API have changed slightly since
RFC v1. To configure a userfaultfd VMA to enable HGM, userspace must
provide UFFD_FEATURE_MINOR_HUGETLBFS_HGM and UFFD_FEATURE_EXACT_ADDRESS
in its call to UFFDIO_API.
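
To illustrate the intended flow, here is a rough userspace sketch
(error handling trimmed, helper names are made up for this example;
UFFD_FEATURE_MINOR_HUGETLBFS_HGM is the feature bit this series adds,
everything else is the existing userfaultfd minor-fault API):

	#include <fcntl.h>
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Register a MAP_SHARED HugeTLB region for minor faults with HGM. */
	static int setup_hgm_uffd(void *hugetlb_va, unsigned long len)
	{
		struct uffdio_api api = {
			.api = UFFD_API,
			.features = UFFD_FEATURE_MINOR_HUGETLBFS |
				    UFFD_FEATURE_MINOR_HUGETLBFS_HGM |
				    UFFD_FEATURE_EXACT_ADDRESS,
		};
		struct uffdio_register reg = {
			.range = {
				.start = (unsigned long)hugetlb_va,
				.len = len,
			},
			.mode = UFFDIO_REGISTER_MODE_MINOR,
		};
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
		    ioctl(uffd, UFFDIO_REGISTER, &reg))
			return -1;
		return uffd;
	}

	/* Resolve one minor fault at PAGE_SIZE (not hugepage) granularity. */
	static int continue_small_page(int uffd, unsigned long fault_addr,
				       unsigned long page_size)
	{
		struct uffdio_continue cont = {
			.range = {
				.start = fault_addr & ~(page_size - 1),
				.len = page_size,
			},
		};

		return ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}

Once all of a hugepage has been CONTINUE'd this way, the mapping can be
collapsed back to huge mappings with madvise(hugetlb_va, len,
MADV_COLLAPSE).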

- A Note About KVM

Normally KVM (as well as any other non-HugeTLB code that assumes that
HugeTLB pages will always be mapped with huge PTEs) would need to be
enlightened to do the correct thing with high-granularity-mapped HugeTLB
pages. It turns out that the x86 TDP MMU already handles HGM mappings
correctly, but other architectures' KVM MMUs, like arm64's, will need to
be updated before HGM can be enabled for those architectures.

- How complete is this series?

I have tested this series with the self-tests that I have modified and
added, and I have run real, large end-to-end migration tests. This
series should be mostly stable, though I haven't tested DAMON and other
pieces that were slightly changed by this series.

There is a bug in the current x86 TDP MMU that prevents MADV_COLLAPSE
from having an effect. That is, the second-stage mappings will remain
small. This will be fixed with [3], so unless you have [3] merged in
your tree, you will see that MADV_COLLAPSE has no effect on virtual
machine performance.

- Future Work

The main areas of future work are:
  1) Support more architectures (arm64 support is mostly complete, but
     supporting it is not trivial, and to keep this RFC as short as
     possible, I will send the arm64 support series separately).
  2) Improve performance. Right now we take two per-hpage locks in the
     hotpath for userfaultfd-based post-copy live migration, the page
     lock and the fault mutex. To improve post-copy performance as much
     as possible, we likely need to improve this locking strategy.
  3) Support PAGE_SIZE poisoning of HugeTLB pages. To provide userspace
     with consistent poison behavior whether using MAP_PRIVATE or
     MAP_SHARED, more work is needed to implement basic HGM support for
     MAP_PRIVATE mappings.

- Patches

Patches 1-4:	Cleanup.
Patches 5-6:	Extend the HugeTLB shared VMA lock struct.
Patches 7-14:	Create hugetlb_pte and implement HGM basics (PT walking,
		enabling HGM).
Patches 15-30:	Make existing routines compatible with HGM.
Patches 31-35:	Extend userfaultfd to support high-granularity CONTINUEs.
Patch   36:	Add HugeTLB HGM support to MADV_COLLAPSE.
Patches 37-40:	Cleanup, add HGM stats, and enable HGM for x86.
Patches 41-47:	Documentation and selftests.

This series is based on mm-everything-2022-10-20-00-43.

Finally, I will be on vacation next week (until Nov 2, unfortunate
timing). I will try to respond before Nov 2; I wanted to get this series
up ASAP.

[1] https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
[2] commit 7d8faaf155454 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
[3] https://lore.kernel.org/kvm/20220830235537.4004585-1-seanjc@google.com/

James Houghton (47):
  hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  hugetlb: remove mk_huge_pte; it is unused
  hugetlb: remove redundant pte_mkhuge in migration path
  hugetlb: only adjust address ranges when VMAs want PMD sharing
  hugetlb: make hugetlb_vma_lock_alloc return its failure reason
  hugetlb: extend vma lock for shared vmas
  hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  hugetlb: add HGM enablement functions
  hugetlb: make huge_pte_lockptr take an explicit shift argument.
  hugetlb: add hugetlb_pte to track HugeTLB page table entries
  hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc
  hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  hugetlb: add make_huge_pte_with_shift
  hugetlb: make default arch_make_huge_pte understand small mappings
  hugetlbfs: for unmapping, treat HGM-mapped pages as potentially mapped
  hugetlb: make unmapping compatible with high-granularity mappings
  hugetlb: make hugetlb_change_protection compatible with HGM
  hugetlb: enlighten follow_hugetlb_page to support HGM
  hugetlb: make hugetlb_follow_page_mask HGM-enabled
  hugetlb: use struct hugetlb_pte for walk_hugetlb_range
  mm: rmap: provide pte_order in page_vma_mapped_walk
  mm: rmap: make page_vma_mapped_walk callers use pte_order
  rmap: update hugetlb lock comment for HGM
  hugetlb: update page_vma_mapped to do high-granularity walks
  hugetlb: add HGM support for copy_hugetlb_page_range
  hugetlb: make move_hugetlb_page_tables compatible with HGM
  hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
  rmap: in try_to_{migrate,unmap}_one, check head page for page flags
  hugetlb: add high-granularity migration support
  hugetlb: add high-granularity check for hwpoison in fault path
  hugetlb: sort hstates in hugetlb_init_hstates
  hugetlb: add for_each_hgm_shift
  userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
  userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM
  hugetlb: add MADV_COLLAPSE for hugetlb
  hugetlb: remove huge_pte_lock and huge_pte_lockptr
  hugetlb: replace make_huge_pte with make_huge_pte_with_shift
  mm: smaps: add stats for HugeTLB mapping size
  hugetlb: x86: enable high-granularity mapping
  docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM
    info
  docs: proc: include information about HugeTLB HGM
  selftests/vm: add HugeTLB HGM to userfaultfd selftest
  selftests/kvm: add HugeTLB HGM to KVM demand paging selftest
  selftests/vm: add anon and shared hugetlb to migration test
  selftests/vm: add hugetlb HGM test to migration selftest
  selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests

 Documentation/admin-guide/mm/hugetlbpage.rst  |    4 +
 Documentation/admin-guide/mm/userfaultfd.rst  |   16 +-
 Documentation/filesystems/proc.rst            |   56 +-
 arch/powerpc/mm/pgtable.c                     |    3 +-
 arch/s390/include/asm/hugetlb.h               |    5 -
 arch/s390/mm/gmap.c                           |   20 +-
 arch/x86/Kconfig                              |    1 +
 fs/Kconfig                                    |    7 +
 fs/hugetlbfs/inode.c                          |   27 +-
 fs/proc/task_mmu.c                            |  184 ++-
 fs/userfaultfd.c                              |   56 +-
 include/asm-generic/hugetlb.h                 |    5 -
 include/asm-generic/tlb.h                     |    6 +-
 include/linux/huge_mm.h                       |   12 +-
 include/linux/hugetlb.h                       |  173 ++-
 include/linux/pagewalk.h                      |   11 +-
 include/linux/rmap.h                          |    5 +
 include/linux/swapops.h                       |    8 +-
 include/linux/userfaultfd_k.h                 |    7 +
 include/uapi/linux/userfaultfd.h              |    2 +
 mm/damon/vaddr.c                              |   57 +-
 mm/debug_vm_pgtable.c                         |    2 +-
 mm/hmm.c                                      |   21 +-
 mm/hugetlb.c                                  | 1209 ++++++++++++++---
 mm/khugepaged.c                               |    4 +-
 mm/madvise.c                                  |   24 +-
 mm/memory-failure.c                           |   17 +-
 mm/mempolicy.c                                |   28 +-
 mm/migrate.c                                  |   20 +-
 mm/mincore.c                                  |   17 +-
 mm/mprotect.c                                 |   18 +-
 mm/page_vma_mapped.c                          |   60 +-
 mm/pagewalk.c                                 |   32 +-
 mm/rmap.c                                     |  102 +-
 mm/userfaultfd.c                              |   46 +-
 .../selftests/kvm/demand_paging_test.c        |   20 +-
 .../testing/selftests/kvm/include/test_util.h |    2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |    2 +-
 tools/testing/selftests/kvm/lib/test_util.c   |   14 +
 tools/testing/selftests/vm/Makefile           |    1 +
 tools/testing/selftests/vm/hugetlb-hgm.c      |  326 +++++
 tools/testing/selftests/vm/migration.c        |  222 ++-
 tools/testing/selftests/vm/userfaultfd.c      |   90 +-
 43 files changed, 2449 insertions(+), 493 deletions(-)
 create mode 100644 tools/testing/selftests/vm/hugetlb-hgm.c

-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 16:30   ` Peter Xu
  2022-10-21 16:36 ` [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused James Houghton
                   ` (45 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This is how it should have been to begin with. It would be very bad if
we actually set PageUptodate with a UFFDIO_CONTINUE, as UFFDIO_CONTINUE
doesn't actually set/update the contents of the page, so we would be
exposing a non-zeroed page to the user.

The reason this change is being made now is that UFFDIO_CONTINUEs on
subpages definitely shouldn't set this page flag on the head page.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1a7dc7b2e16c..650761cdd2f6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6097,7 +6097,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * preceding stores to the page contents become visible before
 	 * the set_pte_at() write.
 	 */
-	__SetPageUptodate(page);
+	if (!is_continue)
+		__SetPageUptodate(page);
+	else
+		VM_WARN_ON_ONCE_PAGE(!PageUptodate(page), page);
 
 	/* Add shared, newly allocated pages to the page cache. */
 	if (vm_shared && !is_continue) {
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 16:35   ` Peter Xu
                     ` (2 more replies)
  2022-10-21 16:36 ` [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
                   ` (44 subsequent siblings)
  46 siblings, 3 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

mk_huge_pte is unused and not necessary. pte_mkhuge is the appropriate
function to call to create a HugeTLB PTE (see
Documentation/mm/arch_pgtable_helpers.rst).

It is being removed now to avoid complicating the implementation of
HugeTLB high-granularity mapping.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/s390/include/asm/hugetlb.h | 5 -----
 include/asm-generic/hugetlb.h   | 5 -----
 mm/debug_vm_pgtable.c           | 2 +-
 mm/hugetlb.c                    | 7 +++----
 4 files changed, 4 insertions(+), 15 deletions(-)

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index ccdbccfde148..c34893719715 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -77,11 +77,6 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
 }
 
-static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
-{
-	return mk_pte(page, pgprot);
-}
-
 static inline int huge_pte_none(pte_t pte)
 {
 	return pte_none(pte);
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index a57d667addd2..aab9e46fa628 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -5,11 +5,6 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
-static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
-{
-	return mk_pte(page, pgprot);
-}
-
 static inline unsigned long huge_pte_write(pte_t pte)
 {
 	return pte_write(pte);
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 2b61fde8c38c..10573a283a12 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -929,7 +929,7 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args)
 	 * as it was previously derived from a real kernel symbol.
 	 */
 	page = pfn_to_page(args->fixed_pmd_pfn);
-	pte = mk_huge_pte(page, args->page_prot);
+	pte = mk_pte(page, args->page_prot);
 
 	WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte)));
 	WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte))));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 650761cdd2f6..20a111b532aa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4728,11 +4728,10 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
 	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
 	if (writable) {
-		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
-					 vma->vm_page_prot)));
+		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
+						vma->vm_page_prot)));
 	} else {
-		entry = huge_pte_wrprotect(mk_huge_pte(page,
-					   vma->vm_page_prot));
+		entry = huge_pte_wrprotect(mk_pte(page, vma->vm_page_prot));
 	}
 	entry = pte_mkyoung(entry);
 	entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 16:36   ` Peter Xu
                     ` (2 more replies)
  2022-10-21 16:36 ` [RFC PATCH v2 04/47] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
                   ` (43 subsequent siblings)
  46 siblings, 3 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

arch_make_huge_pte, which is called immediately following pte_mkhuge,
already makes the necessary changes to the PTE that pte_mkhuge would
have. The generic implementation of arch_make_huge_pte simply calls
pte_mkhuge.
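
For reference, the generic definition in include/linux/hugetlb.h is
(roughly) the following, so the explicit pte_mkhuge removed below only
duplicated work that arch_make_huge_pte already does:

	#ifndef arch_make_huge_pte
	static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
					       vm_flags_t flags)
	{
		return pte_mkhuge(entry);
	}
	#endif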

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/migrate.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8e5eb6ed9da2..1457cdbb7828 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -237,7 +237,6 @@ static bool remove_migration_pte(struct folio *folio,
 		if (folio_test_hugetlb(folio)) {
 			unsigned int shift = huge_page_shift(hstate_vma(vma));
 
-			pte = pte_mkhuge(pte);
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			if (folio_test_anon(folio))
 				hugepage_add_anon_rmap(new, vma, pvmw.address,
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 04/47] hugetlb: only adjust address ranges when VMAs want PMD sharing
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (2 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 16:50   ` Peter Xu
  2022-12-09  0:22   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason James Houghton
                   ` (42 subsequent siblings)
  46 siblings, 2 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Currently this check is overly aggressive. For some userfaultfd VMAs,
PMD sharing is disabled, yet we still widen the address range, which is
used for flushing TLBs and sending MMU notifiers.

This change is made now because HGM VMAs also have PMD sharing disabled,
yet they would still have their flush ranges adjusted. Overaggressively
flushing TLBs and triggering MMU notifiers is particularly harmful with
lots of high-granularity operations.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 20a111b532aa..52cec5b0789e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6835,22 +6835,31 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	return saddr;
 }
 
-bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
+static bool pmd_sharing_possible(struct vm_area_struct *vma)
 {
-	unsigned long start = addr & PUD_MASK;
-	unsigned long end = start + PUD_SIZE;
-
 #ifdef CONFIG_USERFAULTFD
 	if (uffd_disable_huge_pmd_share(vma))
 		return false;
 #endif
 	/*
-	 * check on proper vm_flags and page table alignment
+	 * Only shared VMAs can share PMDs.
 	 */
 	if (!(vma->vm_flags & VM_MAYSHARE))
 		return false;
 	if (!vma->vm_private_data)	/* vma lock required for sharing */
 		return false;
+	return true;
+}
+
+bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long start = addr & PUD_MASK;
+	unsigned long end = start + PUD_SIZE;
+	/*
+	 * check on proper vm_flags and page table alignment
+	 */
+	if (!pmd_sharing_possible(vma))
+		return false;
 	if (!range_in_vma(vma, start, end))
 		return false;
 	return true;
@@ -6871,7 +6880,7 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 	 * vma needs to span at least one aligned PUD size, and the range
 	 * must be at least partially within in.
 	 */
-	if (!(vma->vm_flags & VM_MAYSHARE) || !(v_end > v_start) ||
+	if (!pmd_sharing_possible(vma) || !(v_end > v_start) ||
 		(*end <= v_start) || (*start >= v_end))
 		return;
 
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (3 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 04/47] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 17:08   ` Peter Xu
                     ` (2 more replies)
  2022-10-21 16:36 ` [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas James Houghton
                   ` (41 subsequent siblings)
  46 siblings, 3 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Currently hugetlb_vma_lock_alloc doesn't return anything, as there is no
need: if it fails, PMD sharing won't be enabled. However, HGM requires
that the VMA lock exists, so we need to verify that
hugetlb_vma_lock_alloc actually succeeded. If hugetlb_vma_lock_alloc
fails, then we can pass that up to the caller that is attempting to
enable HGM.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 52cec5b0789e..dc82256b89dd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -92,7 +92,7 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
+static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
@@ -7001,17 +7001,17 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 	}
 }
 
-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
 {
 	struct hugetlb_vma_lock *vma_lock;
 
 	/* Only establish in (flags) sharable vmas */
 	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
-		return;
+		return -EINVAL;
 
-	/* Should never get here with non-NULL vm_private_data */
+	/* We've already allocated the lock. */
 	if (vma->vm_private_data)
-		return;
+		return 0;
 
 	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
 	if (!vma_lock) {
@@ -7026,13 +7026,14 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
 		 * allocation failure.
 		 */
 		pr_warn_once("HugeTLB: unable to allocate vma specific lock\n");
-		return;
+		return -ENOMEM;
 	}
 
 	kref_init(&vma_lock->refs);
 	init_rwsem(&vma_lock->rw_sema);
 	vma_lock->vma = vma;
 	vma->vm_private_data = vma_lock;
+	return 0;
 }
 
 /*
@@ -7160,8 +7161,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 {
 }
 
-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
 {
+	return 0;
 }
 
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (4 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-30 21:01   ` Peter Xu
  2022-10-21 16:36 ` [RFC PATCH v2 07/47] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
                   ` (40 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This allows us to add more data into the shared structure, which we
will use to store whether or not HGM is enabled for this VMA, as HGM is
only available for shared mappings.

It may be better to include HGM as a VMA flag instead of extending the
VMA lock structure.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  4 +++
 mm/hugetlb.c            | 65 +++++++++++++++++++++--------------------
 2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..534958499ac4 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -121,6 +121,10 @@ struct hugetlb_vma_lock {
 	struct vm_area_struct *vma;
 };
 
+struct hugetlb_shared_vma_data {
+	struct hugetlb_vma_lock vma_lock;
+};
+
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dc82256b89dd..5ae8bc8c928e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -91,8 +91,8 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
-static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
-static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
+static void hugetlb_vma_data_free(struct vm_area_struct *vma);
+static int hugetlb_vma_data_alloc(struct vm_area_struct *vma);
 static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
@@ -4643,11 +4643,11 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 		if (vma_lock) {
 			if (vma_lock->vma != vma) {
 				vma->vm_private_data = NULL;
-				hugetlb_vma_lock_alloc(vma);
+				hugetlb_vma_data_alloc(vma);
 			} else
 				pr_warn("HugeTLB: vma_lock already exists in %s.\n", __func__);
 		} else
-			hugetlb_vma_lock_alloc(vma);
+			hugetlb_vma_data_alloc(vma);
 	}
 }
 
@@ -4659,7 +4659,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 	unsigned long reserve, start, end;
 	long gbl_reserve;
 
-	hugetlb_vma_lock_free(vma);
+	hugetlb_vma_data_free(vma);
 
 	resv = vma_resv_map(vma);
 	if (!resv || !is_vma_resv_set(vma, HPAGE_RESV_OWNER))
@@ -6629,7 +6629,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
 	/*
 	 * vma specific semaphore used for pmd sharing synchronization
 	 */
-	hugetlb_vma_lock_alloc(vma);
+	hugetlb_vma_data_alloc(vma);
 
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
@@ -6753,7 +6753,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
 	hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
 					    chg * pages_per_huge_page(h), h_cg);
 out_err:
-	hugetlb_vma_lock_free(vma);
+	hugetlb_vma_data_free(vma);
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		/* Only call region_abort if the region_chg succeeded but the
 		 * region_add failed or didn't run.
@@ -6901,55 +6901,55 @@ static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
 void hugetlb_vma_lock_read(struct vm_area_struct *vma)
 {
 	if (__vma_shareable_flags_pmd(vma)) {
-		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+		struct hugetlb_shared_vma_data *data = vma->vm_private_data;
 
-		down_read(&vma_lock->rw_sema);
+		down_read(&data->vma_lock.rw_sema);
 	}
 }
 
 void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
 {
 	if (__vma_shareable_flags_pmd(vma)) {
-		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+		struct hugetlb_shared_vma_data *data = vma->vm_private_data;
 
-		up_read(&vma_lock->rw_sema);
+		up_read(&data->vma_lock.rw_sema);
 	}
 }
 
 void hugetlb_vma_lock_write(struct vm_area_struct *vma)
 {
 	if (__vma_shareable_flags_pmd(vma)) {
-		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+		struct hugetlb_shared_vma_data *data = vma->vm_private_data;
 
-		down_write(&vma_lock->rw_sema);
+		down_write(&data->vma_lock.rw_sema);
 	}
 }
 
 void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
 {
 	if (__vma_shareable_flags_pmd(vma)) {
-		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+		struct hugetlb_shared_vma_data *data = vma->vm_private_data;
 
-		up_write(&vma_lock->rw_sema);
+		up_write(&data->vma_lock.rw_sema);
 	}
 }
 
 int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
 {
-	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+	struct hugetlb_shared_vma_data *data = vma->vm_private_data;
 
 	if (!__vma_shareable_flags_pmd(vma))
 		return 1;
 
-	return down_write_trylock(&vma_lock->rw_sema);
+	return down_write_trylock(&data->vma_lock.rw_sema);
 }
 
 void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
 {
 	if (__vma_shareable_flags_pmd(vma)) {
-		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+		struct hugetlb_shared_vma_data *data = vma->vm_private_data;
 
-		lockdep_assert_held(&vma_lock->rw_sema);
+		lockdep_assert_held(&data->vma_lock.rw_sema);
 	}
 }
 
@@ -6985,7 +6985,7 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma)
 	}
 }
 
-static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+static void hugetlb_vma_data_free(struct vm_area_struct *vma)
 {
 	/*
 	 * Only present in sharable vmas.
@@ -6994,16 +6994,17 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 		return;
 
 	if (vma->vm_private_data) {
-		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+		struct hugetlb_shared_vma_data *data = vma->vm_private_data;
+		struct hugetlb_vma_lock *vma_lock = &data->vma_lock;
 
 		down_write(&vma_lock->rw_sema);
 		__hugetlb_vma_unlock_write_put(vma_lock);
 	}
 }
 
-static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
 {
-	struct hugetlb_vma_lock *vma_lock;
+	struct hugetlb_shared_vma_data *data;
 
 	/* Only establish in (flags) sharable vmas */
 	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
@@ -7013,8 +7014,8 @@ static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
 	if (vma->vm_private_data)
 		return 0;
 
-	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
-	if (!vma_lock) {
+	data = kmalloc(sizeof(*data), GFP_KERNEL);
+	if (!data) {
 		/*
 		 * If we can not allocate structure, then vma can not
 		 * participate in pmd sharing.  This is only a possible
@@ -7025,14 +7026,14 @@ static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
 		 * until the file is removed.  Warn in the unlikely case of
 		 * allocation failure.
 		 */
-		pr_warn_once("HugeTLB: unable to allocate vma specific lock\n");
+		pr_warn_once("HugeTLB: unable to allocate vma shared data\n");
 		return -ENOMEM;
 	}
 
-	kref_init(&vma_lock->refs);
-	init_rwsem(&vma_lock->rw_sema);
-	vma_lock->vma = vma;
-	vma->vm_private_data = vma_lock;
+	kref_init(&data->vma_lock.refs);
+	init_rwsem(&data->vma_lock.rw_sema);
+	data->vma_lock.vma = vma;
+	vma->vm_private_data = data;
 	return 0;
 }
 
@@ -7157,11 +7158,11 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma)
 {
 }
 
-static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+static void hugetlb_vma_data_free(struct vm_area_struct *vma)
 {
 }
 
-static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
 {
 	return 0;
 }
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 07/47] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (5 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-09 22:52   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions James Houghton
                   ` (39 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This adds the Kconfig to enable or disable high-granularity mapping.
Each architecture must explicitly opt in to it (via
ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING), but when opted in, HGM will
be enabled by default if HUGETLB_PAGE is enabled.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/Kconfig | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2685a4d0d353..ce2567946016 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -267,6 +267,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
 	  enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
 	  (boot command line) or hugetlb_optimize_vmemmap (sysctl).
 
+config ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+	bool
+
+config HUGETLB_HIGH_GRANULARITY_MAPPING
+	def_bool HUGETLB_PAGE
+	depends on ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
 
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (6 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 07/47] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 17:19   ` Peter Xu
                     ` (2 more replies)
  2022-10-21 16:36 ` [RFC PATCH v2 09/47] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
                   ` (38 subsequent siblings)
  46 siblings, 3 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Currently it is possible for all shared VMAs to use HGM, but it must be
enabled first. This is because with HGM, we lose PMD sharing, and page
table walks require additional synchronization (we need to take the VMA
lock).
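
A rough sketch of how enablement is expected to be used (the real caller
is the userfaultfd hook added later in this series; apart from the
standard mmap lock helpers, the names below are the ones introduced by
this patch):

	int ret;

	mmap_write_lock(vma->vm_mm);
	/*
	 * enable_hugetlb_hgm checks eligibility itself and is a no-op if
	 * HGM is already enabled for this VMA.
	 */
	ret = enable_hugetlb_hgm(vma);
	mmap_write_unlock(vma->vm_mm);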

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 22 +++++++++++++
 mm/hugetlb.c            | 69 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 534958499ac4..6e0c36b08a0c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -123,6 +123,9 @@ struct hugetlb_vma_lock {
 
 struct hugetlb_shared_vma_data {
 	struct hugetlb_vma_lock vma_lock;
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	bool hgm_enabled;
+#endif
 };
 
 extern struct resv_map *resv_map_alloc(void);
@@ -1179,6 +1182,25 @@ static inline void hugetlb_unregister_node(struct node *node)
 }
 #endif	/* CONFIG_HUGETLB_PAGE */
 
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
+bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
+int enable_hugetlb_hgm(struct vm_area_struct *vma);
+#else
+static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
+{
+	return false;
+}
+static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
+{
+	return false;
+}
+static inline int enable_hugetlb_hgm(struct vm_area_struct *vma)
+{
+	return -EINVAL;
+}
+#endif
+
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
 					struct mm_struct *mm, pte_t *pte)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5ae8bc8c928e..a18143add956 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6840,6 +6840,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
 #ifdef CONFIG_USERFAULTFD
 	if (uffd_disable_huge_pmd_share(vma))
 		return false;
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	if (hugetlb_hgm_enabled(vma))
+		return false;
 #endif
 	/*
 	 * Only shared VMAs can share PMDs.
@@ -7033,6 +7037,9 @@ static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
 	kref_init(&data->vma_lock.refs);
 	init_rwsem(&data->vma_lock.rw_sema);
 	data->vma_lock.vma = vma;
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	data->hgm_enabled = false;
+#endif
 	vma->vm_private_data = data;
 	return 0;
 }
@@ -7290,6 +7297,68 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
 
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
 
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
+{
+	/*
+	 * All shared VMAs may have HGM.
+	 *
+	 * HGM requires using the VMA lock, which only exists for shared VMAs.
+	 * To make HGM work for private VMAs, we would need to use another
+	 * scheme to prevent collapsing/splitting from invalidating other
+	 * threads' page table walks.
+	 */
+	return vma && (vma->vm_flags & VM_MAYSHARE);
+}
+bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
+{
+	struct hugetlb_shared_vma_data *data = vma->vm_private_data;
+
+	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
+		return false;
+
+	return data && data->hgm_enabled;
+}
+
+/*
+ * Enable high-granularity mapping (HGM) for this VMA. Once enabled, HGM
+ * cannot be turned off.
+ *
+ * PMDs cannot be shared in HGM VMAs.
+ */
+int enable_hugetlb_hgm(struct vm_area_struct *vma)
+{
+	int ret;
+	struct hugetlb_shared_vma_data *data;
+
+	if (!hugetlb_hgm_eligible(vma))
+		return -EINVAL;
+
+	if (hugetlb_hgm_enabled(vma))
+		return 0;
+
+	/*
+	 * We must hold the mmap lock for writing so that callers can rely on
+	 * hugetlb_hgm_enabled returning a consistent result while holding
+	 * the mmap lock for reading.
+	 */
+	mmap_assert_write_locked(vma->vm_mm);
+
+	/* HugeTLB HGM requires the VMA lock to synchronize collapsing. */
+	ret = hugetlb_vma_data_alloc(vma);
+	if (ret)
+		return ret;
+
+	data = vma->vm_private_data;
+	BUG_ON(!data);
+	data->hgm_enabled = true;
+
+	/* We don't support PMD sharing with HGM. */
+	hugetlb_unshare_all_pmds(vma);
+	return 0;
+}
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
 /*
  * These functions are overwritable if your architecture needs its own
  * behavior.
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 09/47] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (7 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-08  0:30   ` Mina Almasry
  2022-12-13  0:25   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
                   ` (37 subsequent siblings)
  46 siblings, 2 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This is needed to handle PTL locking with high-granularity mapping. We
won't always be using the PMD-level PTL even if we're using the 2M
hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
case, we need to lock the PTL for the 4K PTE.
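
A sketch of what this enables (not code from this patch): the lock is
now chosen by the shift of the PTE actually being mapped rather than by
the hstate, e.g. for a 4K PTE under a 2M hstate VMA:

	spinlock_t *ptl = huge_pte_lockptr(PAGE_SHIFT, mm, ptep);

	spin_lock(ptl);
	/* ... read or modify the 4K PTE ... */
	spin_unlock(ptl);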

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/powerpc/mm/pgtable.c | 3 ++-
 include/linux/hugetlb.h   | 9 ++++-----
 mm/hugetlb.c              | 7 ++++---
 mm/migrate.c              | 3 ++-
 4 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..035a0df47af0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -261,7 +261,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 
 		psize = hstate_get_psize(h);
 #ifdef CONFIG_DEBUG_VM
-		assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep));
+		assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),
+						    vma->vm_mm, ptep));
 #endif
 
 #else
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6e0c36b08a0c..db3ed6095b1c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -934,12 +934,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return modified_mask;
 }
 
-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
 					   struct mm_struct *mm, pte_t *pte)
 {
-	if (huge_page_size(h) == PMD_SIZE)
+	if (shift == PMD_SHIFT)
 		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
 	return &mm->page_table_lock;
 }
 
@@ -1144,7 +1143,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
 					   struct mm_struct *mm, pte_t *pte)
 {
 	return &mm->page_table_lock;
@@ -1206,7 +1205,7 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 {
 	spinlock_t *ptl;
 
-	ptl = huge_pte_lockptr(h, mm, pte);
+	ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
 	spin_lock(ptl);
 	return ptl;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a18143add956..ef7662bd0068 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4847,7 +4847,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		}
 
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
-		src_ptl = huge_pte_lockptr(h, src, src_pte);
+		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 again:
@@ -4925,7 +4925,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 				/* Install the new huge page if src pte stable */
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
-				src_ptl = huge_pte_lockptr(h, src, src_pte);
+				src_ptl = huge_pte_lockptr(huge_page_shift(h),
+							   src, src_pte);
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
@@ -4979,7 +4980,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 	pte_t pte;
 
 	dst_ptl = huge_pte_lock(h, mm, dst_pte);
-	src_ptl = huge_pte_lockptr(h, mm, src_pte);
+	src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst ptlocks
diff --git a/mm/migrate.c b/mm/migrate.c
index 1457cdbb7828..a0105fa6e3b2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -334,7 +334,8 @@ void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
 
 void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
 {
-	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
+	spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
+					   vma->vm_mm, pte);
 
 	__migration_entry_wait_huge(pte, ptl);
 }
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (8 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 09/47] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 22:17   ` Peter Xu
  2022-12-08  0:46   ` Mina Almasry
  2022-10-21 16:36 ` [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc James Houghton
                   ` (36 subsequent siblings)
  46 siblings, 2 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

After high-granularity mapping, page table entries for HugeTLB pages can
be of any size/type. (For example, we can have a 1G page mapped with a
mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
PTE after we have done a page table walk.

Without this, we'd have to pass around the "size" of the PTE everywhere.
We effectively did this before; it could be fetched from the hstate,
which we pass around pretty much everywhere.

hugetlb_pte_present_leaf is included here as a helper function that will
be used frequently later on.
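
A sketch of the intended usage pattern (the walkers that fill in a
hugetlb_pte come later in the series; the PMD-level values here are just
an example):

	struct hugetlb_pte hpte;
	pte_t pte;

	hugetlb_pte_populate(&hpte, ptep, PMD_SHIFT, HUGETLB_LEVEL_PMD);
	pte = huge_ptep_get(hpte.ptep);
	if (hugetlb_pte_present_leaf(&hpte, pte))
		/* hpte maps hugetlb_pte_size(&hpte) bytes at this level */;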

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb.c            | 29 ++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index db3ed6095b1c..d30322108b34 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -50,6 +50,75 @@ enum {
 	__NR_USED_SUBPAGE,
 };
 
+enum hugetlb_level {
+	HUGETLB_LEVEL_PTE = 1,
+	/*
+	 * We always include PMD, PUD, and P4D in this enum definition so that,
+	 * when logged as an integer, we can easily tell which level it is.
+	 */
+	HUGETLB_LEVEL_PMD,
+	HUGETLB_LEVEL_PUD,
+	HUGETLB_LEVEL_P4D,
+	HUGETLB_LEVEL_PGD,
+};
+
+struct hugetlb_pte {
+	pte_t *ptep;
+	unsigned int shift;
+	enum hugetlb_level level;
+	spinlock_t *ptl;
+};
+
+static inline
+void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
+			  unsigned int shift, enum hugetlb_level level)
+{
+	WARN_ON_ONCE(!ptep);
+	hpte->ptep = ptep;
+	hpte->shift = shift;
+	hpte->level = level;
+	hpte->ptl = NULL;
+}
+
+static inline
+unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return 1UL << hpte->shift;
+}
+
+static inline
+unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return ~(hugetlb_pte_size(hpte) - 1);
+}
+
+static inline
+unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return hpte->shift;
+}
+
+static inline
+enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return hpte->level;
+}
+
+static inline
+void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
+{
+	dest->ptep = src->ptep;
+	dest->shift = src->shift;
+	dest->level = src->level;
+	dest->ptl = src->ptl;
+}
+
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
@@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 	return ptl;
 }
 
+static inline
+spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+
+	BUG_ON(!hpte->ptep);
+	if (hpte->ptl)
+		return hpte->ptl;
+	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
+}
+
+static inline
+spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
 extern void __init hugetlb_cma_reserve(int order);
 #else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef7662bd0068..a0e46d35dabc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
+{
+	pgd_t pgd;
+	p4d_t p4d;
+	pud_t pud;
+	pmd_t pmd;
+
+	WARN_ON_ONCE(!hpte->ptep);
+	switch (hugetlb_pte_level(hpte)) {
+	case HUGETLB_LEVEL_PGD:
+		pgd = __pgd(pte_val(pte));
+		return pgd_present(pgd) && pgd_leaf(pgd);
+	case HUGETLB_LEVEL_P4D:
+		p4d = __p4d(pte_val(pte));
+		return p4d_present(p4d) && p4d_leaf(p4d);
+	case HUGETLB_LEVEL_PUD:
+		pud = __pud(pte_val(pte));
+		return pud_present(pud) && pud_leaf(pud);
+	case HUGETLB_LEVEL_PMD:
+		pmd = __pmd(pte_val(pte));
+		return pmd_present(pmd) && pmd_leaf(pmd);
+	case HUGETLB_LEVEL_PTE:
+		return pte_present(pte);
+	default:
+		WARN_ON_ONCE(1);
+		return false;
+	}
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-- 
2.38.0.135.g90850a2211-goog



* [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (9 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-13 19:32   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
                   ` (35 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

These functions are used to allocate new PTEs below the hstate PTE. This
will be used by hugetlb_walk_step, which implements stepping forwards in
a HugeTLB high-granularity page table walk.

The reasons that we don't use the standard pmd_alloc/pte_alloc*
functions are:
 1) This prevents us from accidentally overwriting swap entries or
    attempting to use swap entries as present non-leaf PTEs (see
    pmd_alloc(); we assume that !pte_none means pte_present and
    non-leaf).
 2) Locking hugetlb PTEs can be different than locking regular PTEs.
    (Although, as implemented right now, locking is the same.)
 3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
    HGM won't use HIGHPTE, but the kernel can still be built with it,
    and other mm code will use it.

When GENERAL_HUGETLB supports P4D-based hugepages, we will need to
implement hugetlb_pud_alloc to implement hugetlb_walk_step.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  5 +++
 mm/hugetlb.c            | 94 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d30322108b34..003255b0e40f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -119,6 +119,11 @@ void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
 
 bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
 
+pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		unsigned long addr);
+pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		unsigned long addr);
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a0e46d35dabc..e3733388adee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -341,6 +341,100 @@ static bool has_same_uncharge_info(struct file_region *rg,
 #endif
 }
 
+pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		unsigned long addr)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
+	pmd_t *new;
+	pud_t *pudp;
+	pud_t pud;
+
+	if (hpte->level != HUGETLB_LEVEL_PUD)
+		return ERR_PTR(-EINVAL);
+
+	pudp = (pud_t *)hpte->ptep;
+retry:
+	pud = *pudp;
+	if (likely(pud_present(pud)))
+		return unlikely(pud_leaf(pud))
+			? ERR_PTR(-EEXIST)
+			: pmd_offset(pudp, addr);
+	else if (!huge_pte_none(huge_ptep_get(hpte->ptep)))
+		/*
+		 * Not present and not none means that a swap entry lives here,
+		 * and we can't get rid of it.
+		 */
+		return ERR_PTR(-EEXIST);
+
+	new = pmd_alloc_one(mm, addr);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock(ptl);
+	if (!pud_same(pud, *pudp)) {
+		spin_unlock(ptl);
+		pmd_free(mm, new);
+		goto retry;
+	}
+
+	mm_inc_nr_pmds(mm);
+	smp_wmb(); /* See comment in pmd_install() */
+	pud_populate(mm, pudp, new);
+	spin_unlock(ptl);
+	return pmd_offset(pudp, addr);
+}
+
+pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		unsigned long addr)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
+	pgtable_t new;
+	pmd_t *pmdp;
+	pmd_t pmd;
+
+	if (hpte->level != HUGETLB_LEVEL_PMD)
+		return ERR_PTR(-EINVAL);
+
+	pmdp = (pmd_t *)hpte->ptep;
+retry:
+	pmd = *pmdp;
+	if (likely(pmd_present(pmd)))
+		return unlikely(pmd_leaf(pmd))
+			? ERR_PTR(-EEXIST)
+			: pte_offset_kernel(pmdp, addr);
+	else if (!huge_pte_none(huge_ptep_get(hpte->ptep)))
+		/*
+		 * Not present and not none means that a swap entry lives here,
+		 * and we can't get rid of it.
+		 */
+		return ERR_PTR(-EEXIST);
+
+	/*
+	 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
+	 * in page tables being allocated in high memory, needing a kmap to
+	 * access. Instead, we call __pte_alloc_one directly with
+	 * GFP_PGTABLE_USER to prevent these PTEs being allocated in high
+	 * memory.
+	 */
+	new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock(ptl);
+	if (!pmd_same(pmd, *pmdp)) {
+		spin_unlock(ptl);
+		pgtable_pte_page_dtor(new);
+		__free_page(new);
+		goto retry;
+	}
+
+	mm_inc_nr_ptes(mm);
+	smp_wmb(); /* See comment in pmd_install() */
+	pmd_populate(mm, pmdp, new);
+	spin_unlock(ptl);
+	return pte_offset_kernel(pmdp, addr);
+}
+
 static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
 {
 	struct file_region *nrg, *prg;
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (10 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 22:02   ` Peter Xu
                     ` (2 more replies)
  2022-10-21 16:36 ` [RFC PATCH v2 13/47] hugetlb: add make_huge_pte_with_shift James Houghton
                   ` (34 subsequent siblings)
  46 siblings, 3 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

hugetlb_hgm_walk implements high-granularity page table walks for
HugeTLB. It is safe to call on non-HGM-enabled VMAs; in that case it
returns immediately.

hugetlb_walk_step implements how we step forwards in the walk.
Architectures that don't use GENERAL_HUGETLB will need to provide their
own implementation.
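
To illustrate the intended calling convention, later patches in this
series use these helpers roughly like this (a sketch of the pattern
only; mm, vma, addr, and h are assumed to be the usual hugetlb locals):

	struct hugetlb_pte hpte;
	pte_t *ptep;
	spinlock_t *ptl;

	/* Start from the hstate-level PTE... */
	ptep = huge_pte_offset(mm, addr & huge_page_mask(h),
			       huge_page_size(h));
	if (!ptep)
		return;		/* nothing mapped at this address */

	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
			hpage_size_to_level(huge_page_size(h)));

	/* ...then walk as far down as we can, stopping at empty PTEs. */
	hugetlb_hgm_walk(mm, vma, &hpte, addr, PAGE_SIZE,
			/*stop_at_none=*/true);

	/* hpte now describes the PTE that actually covers addr. */
	ptl = hugetlb_pte_lock(mm, &hpte);
	/* ... examine or modify huge_ptep_get(hpte.ptep) ... */
	spin_unlock(ptl);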

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  13 +++++
 mm/hugetlb.c            | 125 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 138 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 003255b0e40f..4b1548adecde 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -276,6 +276,10 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud);
 
+int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
+		     struct hugetlb_pte *hpte, unsigned long addr,
+		     unsigned long sz, bool stop_at_none);
+
 struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
 
 extern int sysctl_hugetlb_shm_group;
@@ -288,6 +292,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
+int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		      unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
@@ -1066,6 +1072,8 @@ void hugetlb_register_node(struct node *node);
 void hugetlb_unregister_node(struct node *node);
 #endif
 
+enum hugetlb_level hpage_size_to_level(unsigned long sz);
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
@@ -1253,6 +1261,11 @@ static inline void hugetlb_register_node(struct node *node)
 static inline void hugetlb_unregister_node(struct node *node)
 {
 }
+
+static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
+{
+	return HUGETLB_LEVEL_PTE;
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e3733388adee..90db59632559 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -95,6 +95,29 @@ static void hugetlb_vma_data_free(struct vm_area_struct *vma);
 static int hugetlb_vma_data_alloc(struct vm_area_struct *vma);
 static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
 
+/*
+ * hpage_size_to_level() - convert @sz to the corresponding page table level
+ *
+ * @sz must be less than or equal to a valid hugepage size.
+ */
+enum hugetlb_level hpage_size_to_level(unsigned long sz)
+{
+	/*
+	 * We order the conditionals from smallest to largest to pick the
+	 * smallest level when multiple levels have the same size (i.e.,
+	 * when levels are folded).
+	 */
+	if (sz < PMD_SIZE)
+		return HUGETLB_LEVEL_PTE;
+	if (sz < PUD_SIZE)
+		return HUGETLB_LEVEL_PMD;
+	if (sz < P4D_SIZE)
+		return HUGETLB_LEVEL_PUD;
+	if (sz < PGDIR_SIZE)
+		return HUGETLB_LEVEL_P4D;
+	return HUGETLB_LEVEL_PGD;
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -7321,6 +7344,70 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
 }
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
+/* hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
+ * the page table entry for @addr.
+ *
+ * @hpte must always be pointing at an hstate-level PTE (or deeper).
+ *
+ * This function will never walk further if it encounters a PTE of a size
+ * less than or equal to @sz.
+ *
+ * @stop_at_none determines what we do when we encounter an empty PTE. If true,
+ * we return that PTE. If false and @sz is less than the current PTE's size,
+ * we make that PTE point to the next level down, and we keep walking
+ * down until the current PTE's size is @sz.
+ *
+ * If @stop_at_none is true and @sz is PAGE_SIZE, this function will always
+ * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz.
+ *
+ * Return:
+ *	-ENOMEM if we couldn't allocate new PTEs.
+ *	-EEXIST if the caller wanted to walk further than a migration PTE,
+ *		poison PTE, or a PTE marker. The caller needs to manually deal
+ *		with this scenario.
+ *	-EINVAL if called with invalid arguments (@sz invalid, @hpte not
+ *		initialized).
+ *	0 otherwise.
+ *
+ *	Even if this function fails, @hpte is guaranteed to always remain
+ *	valid.
+ */
+int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
+		     struct hugetlb_pte *hpte, unsigned long addr,
+		     unsigned long sz, bool stop_at_none)
+{
+	int ret = 0;
+	pte_t pte;
+
+	if (WARN_ON_ONCE(sz < PAGE_SIZE))
+		return -EINVAL;
+
+	if (!hugetlb_hgm_enabled(vma)) {
+		if (stop_at_none)
+			return 0;
+		return sz == huge_page_size(hstate_vma(vma)) ? 0 : -EINVAL;
+	}
+
+	hugetlb_vma_assert_locked(vma);
+
+	if (WARN_ON_ONCE(!hpte->ptep))
+		return -EINVAL;
+
+	while (hugetlb_pte_size(hpte) > sz && !ret) {
+		pte = huge_ptep_get(hpte->ptep);
+		if (!pte_present(pte)) {
+			if (stop_at_none)
+				return 0;
+			if (unlikely(!huge_pte_none(pte)))
+				return -EEXIST;
+		} else if (hugetlb_pte_present_leaf(hpte, pte))
+			return 0;
+		ret = hugetlb_walk_step(mm, hpte, addr, sz);
+	}
+
+	return ret;
+}
+
 #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
@@ -7388,6 +7475,44 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	return (pte_t *)pmd;
 }
 
+/*
+ * hugetlb_walk_step() - Walk the page table one step to resolve the page
+ * (hugepage or subpage) entry at address @addr.
+ *
+ * @sz always points at the final target PTE size (e.g. PAGE_SIZE for the
+ * lowest level PTE).
+ *
+ * @hpte will always remain valid, even if this function fails.
+ */
+int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		      unsigned long addr, unsigned long sz)
+{
+	pte_t *ptep;
+	spinlock_t *ptl;
+
+	switch (hpte->level) {
+	case HUGETLB_LEVEL_PUD:
+		ptep = (pte_t *)hugetlb_pmd_alloc(mm, hpte, addr);
+		if (IS_ERR(ptep))
+			return PTR_ERR(ptep);
+		hugetlb_pte_populate(hpte, ptep, PMD_SHIFT, HUGETLB_LEVEL_PMD);
+		break;
+	case HUGETLB_LEVEL_PMD:
+		ptep = hugetlb_pte_alloc(mm, hpte, addr);
+		if (IS_ERR(ptep))
+			return PTR_ERR(ptep);
+		ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
+		hugetlb_pte_populate(hpte, ptep, PAGE_SHIFT, HUGETLB_LEVEL_PTE);
+		hpte->ptl = ptl;
+		break;
+	default:
+		WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n",
+				__func__, hpte->level, hpte->shift);
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /*
  * Return a mask that can be used to update an address to the last huge
  * page in a page table page mapping size.  Used to skip non-present
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 13/47] hugetlb: add make_huge_pte_with_shift
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (11 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-14  1:08   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 14/47] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
                   ` (33 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This allows us to make huge PTEs at shifts other than the hstate shift,
which will be necessary for high-granularity mappings.
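
For example, a PAGE_SIZE mapping of a single base page of a hugepage
could be constructed like this (illustrative fragment only; hpte is
assumed to be a PAGE_SIZE-level hugetlb_pte from an HGM walk, and this
also relies on the arch_make_huge_pte change in the next patch):

	pte_t entry = make_huge_pte_with_shift(vma, page, writable,
					       PAGE_SHIFT);

	set_huge_pte_at(mm, addr, hpte.ptep, entry);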

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 90db59632559..74a4afda1a7e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4867,11 +4867,11 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.pagesize = hugetlb_vm_op_pagesize,
 };
 
-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
-				int writable)
+static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
+				      struct page *page, int writable,
+				      int shift)
 {
 	pte_t entry;
-	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
 	if (writable) {
 		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
@@ -4885,6 +4885,14 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
 	return entry;
 }
 
+static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
+			   int writable)
+{
+	unsigned int shift = huge_page_shift(hstate_vma(vma));
+
+	return make_huge_pte_with_shift(vma, page, writable, shift);
+}
+
 static void set_huge_ptep_writable(struct vm_area_struct *vma,
 				   unsigned long address, pte_t *ptep)
 {
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 14/47] hugetlb: make default arch_make_huge_pte understand small mappings
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (12 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 13/47] hugetlb: add make_huge_pte_with_shift James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-14 22:17   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 15/47] hugetlbfs: for unmapping, treat HGM-mapped pages as potentially mapped James Houghton
                   ` (32 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This is a simple change: don't create a "huge" PTE if we are making a
regular, PAGE_SIZE PTE. All architectures that want to implement HGM
likely need to be changed in a similar way if they implement their own
version of arch_make_huge_pte.
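
That is, an architecture-specific implementation would need a guard
along these lines (sketch only; the real per-arch construction will
vary):

pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
{
	if (shift <= PAGE_SHIFT)
		return entry;	/* HGM: leave PAGE_SIZE PTEs alone */

	/* existing arch-specific huge-PTE construction goes here */
	return pte_mkhuge(entry);
}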

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4b1548adecde..d305742e9d44 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -907,7 +907,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
 static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
 				       vm_flags_t flags)
 {
-	return pte_mkhuge(entry);
+	return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
 }
 #endif
 
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 15/47] hugetlbfs: for unmapping, treat HGM-mapped pages as potentially mapped
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (13 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 14/47] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-14 23:37   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 16/47] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
                   ` (31 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

hugetlb_vma_maps_page was mostly used as an optimization: if the VMA
isn't mapping the page, then we don't have to attempt to unmap it. It
is still safe to call the unmap routine even if the page isn't mapped.

For high-granularity mapped pages, we can't easily do a full walk to see
if the page is actually mapped or not, so simply return that it might
be.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/hugetlbfs/inode.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7f836f8f9db1..a7ab62e39b8c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -383,21 +383,34 @@ static void hugetlb_delete_from_page_cache(struct folio *folio)
  * mutex for the page in the mapping.  So, we can not race with page being
  * faulted into the vma.
  */
-static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
-				unsigned long addr, struct page *page)
+static bool hugetlb_vma_maybe_maps_page(struct vm_area_struct *vma,
+					unsigned long addr, struct page *page)
 {
 	pte_t *ptep, pte;
+	struct hugetlb_pte hpte;
+	struct hstate *h = hstate_vma(vma);
 
-	ptep = huge_pte_offset(vma->vm_mm, addr,
-			huge_page_size(hstate_vma(vma)));
+	ptep = huge_pte_offset(vma->vm_mm, addr, huge_page_size(h));
 
 	if (!ptep)
 		return false;
 
+	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+			hpage_size_to_level(huge_page_size(h)));
+
 	pte = huge_ptep_get(ptep);
 	if (huge_pte_none(pte) || !pte_present(pte))
 		return false;
 
+	if (!hugetlb_pte_present_leaf(&hpte, pte))
+		/*
+		 * The top-level PTE is not a leaf, so it's possible that a PTE
+		 * under us is mapping the page. We aren't holding the VMA
+		 * lock, so it is unsafe to continue the walk further. Instead,
+		 * return true to indicate that we might be mapping the page.
+		 */
+		return true;
+
 	if (pte_page(pte) == page)
 		return true;
 
@@ -457,7 +470,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
 
-		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+		if (!hugetlb_vma_maybe_maps_page(vma, vma->vm_start + v_start,
+					page))
 			continue;
 
 		if (!hugetlb_vma_trylock_write(vma)) {
@@ -507,7 +521,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 		 */
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
-		if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+		if (hugetlb_vma_maybe_maps_page(vma, vma->vm_start + v_start,
+					page))
 			unmap_hugepage_range(vma, vma->vm_start + v_start,
 						v_end, NULL,
 						ZAP_FLAG_DROP_MARKER);
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 16/47] hugetlb: make unmapping compatible with high-granularity mappings
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (14 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 15/47] hugetlbfs: for unmapping, treat HGM-mapped pages as potentially mapped James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-15  0:28   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 17/47] hugetlb: make hugetlb_change_protection compatible with HGM James Houghton
                   ` (30 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Enlighten __unmap_hugepage_range to deal with high-granularity mappings.
This doesn't change its API; it still must be called with hugepage
alignment, but it will correctly unmap hugepages that have been mapped
at high granularity.

The rules for mapcount and refcount here are:
 1. Refcount and mapcount are tracked on the head page.
 2. Each page table entry that maps any part of a hugepage increases
    that hugepage's mapcount and refcount by 1 (see the example below).

Eventually, functionality here can be expanded to allow users to call
MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is
not done here.
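
As a concrete example of rule 2 (illustrative numbers only): a 2M
hugepage that is entirely mapped by 512 PAGE_SIZE PTEs in one VMA
contributes 512 to the head page's mapcount and refcount. Each
hugetlb_pte that __unmap_hugepage_range clears therefore drops exactly
one of each, as in the diff below:

	hugetlb_count_sub(hugetlb_pte_size(&hpte) / PAGE_SIZE, mm);
	page_remove_rmap(hpage, vma, true);	/* one mapcount reference */
	...
	tlb_remove_page_size(tlb, hpage, sz);	/* one refcount reference */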

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/asm-generic/tlb.h |  6 ++--
 mm/hugetlb.c              | 76 +++++++++++++++++++++++++--------------
 2 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..c378a44915a9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -566,9 +566,9 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
+#define tlb_remove_huge_tlb_entry(tlb, hpte, address)	\
 	do {							\
-		unsigned long _sz = huge_page_size(h);		\
+		unsigned long _sz = hugetlb_pte_size(&hpte);	\
 		if (_sz >= P4D_SIZE)				\
 			tlb_flush_p4d_range(tlb, address, _sz);	\
 		else if (_sz >= PUD_SIZE)			\
@@ -577,7 +577,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 			tlb_flush_pmd_range(tlb, address, _sz);	\
 		else						\
 			tlb_flush_pte_range(tlb, address, _sz);	\
-		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+		__tlb_remove_tlb_entry(tlb, hpte.ptep, address);\
 	} while (0)
 
 /**
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 74a4afda1a7e..227150c25763 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5221,10 +5221,10 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *ptep;
+	struct hugetlb_pte hpte;
 	pte_t pte;
 	spinlock_t *ptl;
-	struct page *page;
+	struct page *hpage, *subpage;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	struct mmu_notifier_range range;
@@ -5235,11 +5235,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
-	/*
-	 * This is a hugetlb vma, all the pte entries should point
-	 * to huge page.
-	 */
-	tlb_change_page_size(tlb, sz);
 	tlb_start_vma(tlb, vma);
 
 	/*
@@ -5251,26 +5246,35 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	mmu_notifier_invalidate_range_start(&range);
 	last_addr_mask = hugetlb_mask_last_page(h);
 	address = start;
-	for (; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address, sz);
+
+	while (address < end) {
+		pte_t *ptep = huge_pte_offset(mm, address, sz);
+
 		if (!ptep) {
 			address |= last_addr_mask;
+			address += sz;
 			continue;
 		}
+		hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+				hpage_size_to_level(huge_page_size(h)));
+		hugetlb_hgm_walk(mm, vma, &hpte, address,
+				PAGE_SIZE, /*stop_at_none=*/true);
 
-		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, address, ptep)) {
+		ptl = hugetlb_pte_lock(mm, &hpte);
+		if (huge_pmd_unshare(mm, vma, address, hpte.ptep)) {
 			spin_unlock(ptl);
 			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
 			force_flush = true;
 			address |= last_addr_mask;
+			address += sz;
 			continue;
 		}
 
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(hpte.ptep);
+
 		if (huge_pte_none(pte)) {
 			spin_unlock(ptl);
-			continue;
+			goto next_hpte;
 		}
 
 		/*
@@ -5287,25 +5291,36 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 			 */
 			if (pte_swp_uffd_wp_any(pte) &&
 			    !(zap_flags & ZAP_FLAG_DROP_MARKER))
-				set_huge_pte_at(mm, address, ptep,
+				set_huge_pte_at(mm, address, hpte.ptep,
 						make_pte_marker(PTE_MARKER_UFFD_WP));
 			else
 #endif
-				huge_pte_clear(mm, address, ptep, sz);
+				huge_pte_clear(mm, address, hpte.ptep,
+						hugetlb_pte_size(&hpte));
+			spin_unlock(ptl);
+			goto next_hpte;
+		}
+
+		if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) {
+			/*
+			 * We raced with someone splitting out from under us.
+			 * Retry the walk.
+			 */
 			spin_unlock(ptl);
 			continue;
 		}
 
-		page = pte_page(pte);
+		subpage = pte_page(pte);
+		hpage = compound_head(subpage);
 		/*
 		 * If a reference page is supplied, it is because a specific
 		 * page is being unmapped, not a range. Ensure the page we
 		 * are about to unmap is the actual page of interest.
 		 */
 		if (ref_page) {
-			if (page != ref_page) {
+			if (hpage != ref_page) {
 				spin_unlock(ptl);
-				continue;
+				goto next_hpte;
 			}
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -5315,27 +5330,34 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 			set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
 		}
 
-		pte = huge_ptep_get_and_clear(mm, address, ptep);
-		tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
+		pte = huge_ptep_get_and_clear(mm, address, hpte.ptep);
+		tlb_change_page_size(tlb, hugetlb_pte_size(&hpte));
+		tlb_remove_huge_tlb_entry(tlb, hpte, address);
 		if (huge_pte_dirty(pte))
-			set_page_dirty(page);
+			set_page_dirty(hpage);
 #ifdef CONFIG_PTE_MARKER_UFFD_WP
 		/* Leave a uffd-wp pte marker if needed */
 		if (huge_pte_uffd_wp(pte) &&
 		    !(zap_flags & ZAP_FLAG_DROP_MARKER))
-			set_huge_pte_at(mm, address, ptep,
+			set_huge_pte_at(mm, address, hpte.ptep,
 					make_pte_marker(PTE_MARKER_UFFD_WP));
 #endif
-		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, vma, true);
+		hugetlb_count_sub(hugetlb_pte_size(&hpte)/PAGE_SIZE, mm);
+		page_remove_rmap(hpage, vma, true);
 
 		spin_unlock(ptl);
-		tlb_remove_page_size(tlb, page, huge_page_size(h));
 		/*
-		 * Bail out after unmapping reference page if supplied
+		 * Lower the reference count on the head page.
+		 */
+		tlb_remove_page_size(tlb, hpage, sz);
+		/*
+		 * Bail out after unmapping reference page if supplied,
+		 * and there's only one PTE mapping this page.
 		 */
-		if (ref_page)
+		if (ref_page && hugetlb_pte_size(&hpte) == sz)
 			break;
+next_hpte:
+		address += hugetlb_pte_size(&hpte);
 	}
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_end_vma(tlb, vma);
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 17/47] hugetlb: make hugetlb_change_protection compatible with HGM
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (15 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 16/47] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-15 18:15   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 18/47] hugetlb: enlighten follow_hugetlb_page to support HGM James Houghton
                   ` (29 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

The main change here is to do a high-granularity walk and to pull the
shift from the walk (not from the hstate).
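
Concretely (from the diff below), the entry is now rebuilt with the
PTE's own shift rather than the hstate's:

	unsigned int shift = hpte.shift;

	pte = huge_pte_modify(old_pte, newprot);
	pte = arch_make_huge_pte(pte, shift, vma->vm_flags);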

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 65 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 227150c25763..2d096cef53cd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6654,15 +6654,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
-	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
-	unsigned long pages = 0, psize = huge_page_size(h);
+	unsigned long base_pages = 0, psize = huge_page_size(h);
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	struct hugetlb_pte hpte;
 
 	/*
 	 * In the case of shared PMDs, the area to flush could be beyond
@@ -6680,31 +6680,38 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	last_addr_mask = hugetlb_mask_last_page(h);
-	for (; address < end; address += psize) {
+	while (address < end) {
 		spinlock_t *ptl;
-		ptep = huge_pte_offset(mm, address, psize);
+		pte_t *ptep = huge_pte_offset(mm, address, psize);
+
 		if (!ptep) {
 			address |= last_addr_mask;
+			address += huge_page_size(h);
 			continue;
 		}
-		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, address, ptep)) {
+		hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+				hpage_size_to_level(psize));
+		hugetlb_hgm_walk(mm, vma, &hpte, address, PAGE_SIZE,
+					/*stop_at_none=*/true);
+
+		ptl = hugetlb_pte_lock(mm, &hpte);
+		if (huge_pmd_unshare(mm, vma, address, hpte.ptep)) {
 			/*
 			 * When uffd-wp is enabled on the vma, unshare
 			 * shouldn't happen at all.  Warn about it if it
 			 * happened due to some reason.
 			 */
 			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
-			pages++;
+			base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
 			spin_unlock(ptl);
 			shared_pmd = true;
 			address |= last_addr_mask;
-			continue;
+			goto next_hpte;
 		}
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(hpte.ptep);
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			spin_unlock(ptl);
-			continue;
+			goto next_hpte;
 		}
 		if (unlikely(is_hugetlb_entry_migration(pte))) {
 			swp_entry_t entry = pte_to_swp_entry(pte);
@@ -6724,11 +6731,11 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 					newpte = pte_swp_mkuffd_wp(newpte);
 				else if (uffd_wp_resolve)
 					newpte = pte_swp_clear_uffd_wp(newpte);
-				set_huge_pte_at(mm, address, ptep, newpte);
-				pages++;
+				set_huge_pte_at(mm, address, hpte.ptep, newpte);
+				base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
 			}
 			spin_unlock(ptl);
-			continue;
+			goto next_hpte;
 		}
 		if (unlikely(pte_marker_uffd_wp(pte))) {
 			/*
@@ -6736,21 +6743,37 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			 * no need for huge_ptep_modify_prot_start/commit().
 			 */
 			if (uffd_wp_resolve)
-				huge_pte_clear(mm, address, ptep, psize);
+				huge_pte_clear(mm, address, hpte.ptep,
+						hugetlb_pte_size(&hpte));
 		}
 		if (!huge_pte_none(pte)) {
 			pte_t old_pte;
-			unsigned int shift = huge_page_shift(hstate_vma(vma));
+			unsigned int shift = hpte.shift;
 
-			old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
+			/*
+			 * Because we are holding the VMA lock for writing, pte
+			 * will always be a leaf. WARN if it is not.
+			 */
+			if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) {
+				spin_unlock(ptl);
+				WARN_ONCE(1, "Unexpected non-leaf PTE: ptep:%p, address:0x%lx\n",
+					     hpte.ptep, address);
+				continue;
+			}
+
+			old_pte = huge_ptep_modify_prot_start(
+					vma, address, hpte.ptep);
 			pte = huge_pte_modify(old_pte, newprot);
-			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+			pte = arch_make_huge_pte(
+					pte, shift, vma->vm_flags);
 			if (uffd_wp)
 				pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
 			else if (uffd_wp_resolve)
 				pte = huge_pte_clear_uffd_wp(pte);
-			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
-			pages++;
+			huge_ptep_modify_prot_commit(
+					vma, address, hpte.ptep,
+					old_pte, pte);
+			base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
 		} else {
 			/* None pte */
 			if (unlikely(uffd_wp))
@@ -6759,6 +6782,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 						make_pte_marker(PTE_MARKER_UFFD_WP));
 		}
 		spin_unlock(ptl);
+next_hpte:
+		address += hugetlb_pte_size(&hpte);
 	}
 	/*
 	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
@@ -6781,7 +6806,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	hugetlb_vma_unlock_write(vma);
 	mmu_notifier_invalidate_range_end(&range);
 
-	return pages << h->order;
+	return base_pages;
 }
 
 /* Return true if reservation was successful, false otherwise.  */
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 18/47] hugetlb: enlighten follow_hugetlb_page to support HGM
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (16 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 17/47] hugetlb: make hugetlb_change_protection compatible with HGM James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-15 19:29   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 19/47] hugetlb: make hugetlb_follow_page_mask HGM-enabled James Houghton
                   ` (28 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This enables high-granularity mapping support in GUP.

One important change here is that, before, we never needed to grab the
VMA lock; now, to prevent someone from collapsing the page tables out
from under us, we grab it for reading when doing high-granularity page
table walks.

In case it is confusing: pfn_offset is the offset of vaddr, in PAGE_SIZE
units, within the range mapped by hpte (counting from the subpage that
hpte points to).
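
A worked example (illustrative numbers only), with a 2M hstate and
vaddr = haddr + 0x42000, where haddr is the hugepage-aligned base:

	pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;

If that range is still mapped by a single 2M leaf, hugetlb_pte_mask()
is ~(2M - 1) and pfn_offset == 0x42, i.e. subpage 66 of the hugepage.
If it has been split down to PAGE_SIZE PTEs, hpte maps a single 4K
page, hugetlb_pte_mask() is PAGE_MASK, and pfn_offset == 0.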

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 23 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2d096cef53cd..d76ab32fb6d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6382,11 +6382,9 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
 	}
 }
 
-static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t pteval,
 					       bool *unshare)
 {
-	pte_t pteval = huge_ptep_get(pte);
-
 	*unshare = false;
 	if (is_swap_pte(pteval))
 		return true;
@@ -6478,12 +6476,20 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	int err = -EFAULT, refs;
 
+	/*
+	 * Grab the VMA lock for reading now so no one can collapse the page
+	 * table from under us.
+	 */
+	hugetlb_vma_lock_read(vma);
+
 	while (vaddr < vma->vm_end && remainder) {
-		pte_t *pte;
+		pte_t *ptep, pte;
 		spinlock_t *ptl = NULL;
 		bool unshare = false;
 		int absent;
-		struct page *page;
+		unsigned long pages_per_hpte;
+		struct page *page, *subpage;
+		struct hugetlb_pte hpte;
 
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -6499,13 +6505,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
 		 *
-		 * Note that page table lock is not held when pte is null.
+		 * Note that page table lock is not held when ptep is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
-				      huge_page_size(h));
-		if (pte)
-			ptl = huge_pte_lock(h, mm, pte);
-		absent = !pte || huge_pte_none(huge_ptep_get(pte));
+		ptep = huge_pte_offset(mm, vaddr & huge_page_mask(h),
+				       huge_page_size(h));
+		if (ptep) {
+			hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+					hpage_size_to_level(huge_page_size(h)));
+			hugetlb_hgm_walk(mm, vma, &hpte, vaddr,
+					PAGE_SIZE,
+					/*stop_at_none=*/true);
+			ptl = hugetlb_pte_lock(mm, &hpte);
+			ptep = hpte.ptep;
+			pte = huge_ptep_get(ptep);
+		}
+
+		absent = !ptep || huge_pte_none(pte);
 
 		/*
 		 * When coredumping, it suits get_dump_page if we just return
@@ -6516,12 +6531,19 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
-			if (pte)
+			if (ptep)
 				spin_unlock(ptl);
 			remainder = 0;
 			break;
 		}
 
+		if (!absent && pte_present(pte) &&
+				!hugetlb_pte_present_leaf(&hpte, pte)) {
+			/* We raced with someone splitting the PTE, so retry. */
+			spin_unlock(ptl);
+			continue;
+		}
+
 		/*
 		 * We need call hugetlb_fault for both hugepages under migration
 		 * (in which case hugetlb_fault waits for the migration,) and
@@ -6537,7 +6559,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			vm_fault_t ret;
 			unsigned int fault_flags = 0;
 
-			if (pte)
+			/* Drop the lock before entering hugetlb_fault. */
+			hugetlb_vma_unlock_read(vma);
+
+			if (ptep)
 				spin_unlock(ptl);
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
@@ -6560,7 +6585,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (ret & VM_FAULT_ERROR) {
 				err = vm_fault_to_errno(ret, flags);
 				remainder = 0;
-				break;
+				goto out;
 			}
 			if (ret & VM_FAULT_RETRY) {
 				if (locked &&
@@ -6578,11 +6603,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				 */
 				return i;
 			}
+			hugetlb_vma_lock_read(vma);
 			continue;
 		}
 
-		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
-		page = pte_page(huge_ptep_get(pte));
+		pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
+		subpage = pte_page(pte);
+		pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
+		page = compound_head(subpage);
 
 		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 			       !PageAnonExclusive(page), page);
@@ -6592,21 +6620,21 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * and skip the same_page loop below.
 		 */
 		if (!pages && !vmas && !pfn_offset &&
-		    (vaddr + huge_page_size(h) < vma->vm_end) &&
-		    (remainder >= pages_per_huge_page(h))) {
-			vaddr += huge_page_size(h);
-			remainder -= pages_per_huge_page(h);
-			i += pages_per_huge_page(h);
+		    (vaddr + hugetlb_pte_size(&hpte) < vma->vm_end) &&
+		    (remainder >= pages_per_hpte)) {
+			vaddr += hugetlb_pte_size(&hpte);
+			remainder -= pages_per_hpte;
+			i += pages_per_hpte;
 			spin_unlock(ptl);
 			continue;
 		}
 
 		/* vaddr may not be aligned to PAGE_SIZE */
-		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
+		refs = min3(pages_per_hpte - pfn_offset, remainder,
 		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
 
 		if (pages || vmas)
-			record_subpages_vmas(nth_page(page, pfn_offset),
+			record_subpages_vmas(nth_page(subpage, pfn_offset),
 					     vma, refs,
 					     likely(pages) ? pages + i : NULL,
 					     vmas ? vmas + i : NULL);
@@ -6637,6 +6665,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		spin_unlock(ptl);
 	}
+	hugetlb_vma_unlock_read(vma);
+out:
 	*nr_pages = remainder;
 	/*
 	 * setting position is actually required only if remainder is
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 19/47] hugetlb: make hugetlb_follow_page_mask HGM-enabled
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (17 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 18/47] hugetlb: enlighten follow_hugetlb_page to support HGM James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-16  0:25   ` Mike Kravetz
  2022-10-21 16:36 ` [RFC PATCH v2 20/47] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
                   ` (27 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

The change here is very simple: do a high-granularity walk.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d76ab32fb6d3..5783a8307a77 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6408,6 +6408,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	pte_t *pte, entry;
+	struct hugetlb_pte hpte;
 
 	/*
 	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
@@ -6429,9 +6430,22 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 		return NULL;
 	}
 
-	ptl = huge_pte_lock(h, mm, pte);
+retry_walk:
+	hugetlb_pte_populate(&hpte, pte, huge_page_shift(h),
+			hpage_size_to_level(huge_page_size(h)));
+	hugetlb_hgm_walk(mm, vma, &hpte, address,
+			PAGE_SIZE,
+			/*stop_at_none=*/true);
+
+	ptl = hugetlb_pte_lock(mm, &hpte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(hpte.ptep);
 	if (pte_present(entry)) {
+		if (unlikely(!hugetlb_pte_present_leaf(&hpte, entry))) {
+			/* We raced with someone splitting from under us. */
+			spin_unlock(ptl);
+			goto retry_walk;
+		}
+
-		page = pte_page(entry) +
-				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
+		page = pte_page(entry) +
+			((address & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT);
 		/*
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 20/47] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (18 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 19/47] hugetlb: make hugetlb_follow_page_mask HGM-enabled James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 21/47] mm: rmap: provide pte_order in page_vma_mapped_walk James Houghton
                   ` (26 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

The main change in this commit is to walk_hugetlb_range, which now
supports walking HGM mappings; all walk_hugetlb_range callers must be
updated to use the new API and take the correct action.

Listing all the changes to the callers:

For s390, we simply ignore HGM PTEs (we don't support HGM on s390 yet).

For smaps, shared_hugetlb (and private_hugetlb, although private
mappings don't support HGM) may now not be divisible by the hugepage
size. The appropriate changes have been made to support analyzing HGM
PTEs.

For pagemap, we ignore non-leaf PTEs by treating them as if they were
none PTEs. We can only end up with non-leaf PTEs if they had just been
updated from a none PTE.

For show_numa_map, the challenge is that, if any part of a hugepage is
mapped, we have to count that entire page exactly once, as the results
are given in units of hugepages. To support HGM mappings, we keep track
of the last page that we looked at. If the hugepage we are currently
looking at is the same as the last one, then it must be a hugepage that
is mapped at high granularity, and we've already accounted for it.

For DAMON, we treat non-leaf PTEs as if they were blank, for the same
reason as pagemap.

For hwpoison, we proactively update the logic to support the case when
hpte is pointing to a subpage within the poisoned hugepage.

For queue_pages_hugetlb/migration, we ignore all HGM-enabled VMAs for
now.

For mincore, we ignore non-leaf PTEs for the same reason as pagemap.

For mprotect/prot_none_hugetlb_entry, we retry the walk when we get a
non-leaf PTE.
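
For reference, the shape of the new callback looks like this (a sketch,
not one of the callers changed below):

static int example_hugetlb_entry(struct hugetlb_pte *hpte,
				 unsigned long addr, struct mm_walk *walk)
{
	spinlock_t *ptl = hugetlb_pte_lock(walk->mm, hpte);
	pte_t pte = huge_ptep_get(hpte->ptep);

	/*
	 * With HGM, a present entry may be a non-leaf that was split from
	 * under us; most callers treat that the same as a none PTE.
	 */
	if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte)) {
		spin_unlock(ptl);
		return 0;
	}

	/* ... handle [addr, addr + hugetlb_pte_size(hpte)) ... */

	spin_unlock(ptl);
	return 0;
}

static const struct mm_walk_ops example_walk_ops = {
	.hugetlb_entry	= example_hugetlb_entry,
};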

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/s390/mm/gmap.c      | 20 ++++++++--
 fs/proc/task_mmu.c       | 83 +++++++++++++++++++++++++++++-----------
 include/linux/pagewalk.h | 11 ++++--
 mm/damon/vaddr.c         | 57 +++++++++++++++++----------
 mm/hmm.c                 | 21 ++++++----
 mm/memory-failure.c      | 17 ++++----
 mm/mempolicy.c           | 12 ++++--
 mm/mincore.c             | 17 ++++++--
 mm/mprotect.c            | 18 ++++++---
 mm/pagewalk.c            | 32 +++++++++++++---
 10 files changed, 203 insertions(+), 85 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 02d15c8dc92e..d65c15b5dccb 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2622,13 +2622,25 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
-				      unsigned long hmask, unsigned long next,
+static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte,
+				      unsigned long addr,
 				      struct mm_walk *walk)
 {
-	pmd_t *pmd = (pmd_t *)pte;
+	struct hstate *h = hstate_vma(walk->vma);
+	pmd_t *pmd;
 	unsigned long start, end;
-	struct page *page = pmd_page(*pmd);
+	struct page *page;
+
+	if (huge_page_size(h) != hugetlb_pte_size(hpte))
+		/* Ignore high-granularity PTEs. */
+		return 0;
+
+	if (!pte_present(huge_ptep_get(hpte->ptep)))
+		/* Ignore non-present PTEs. */
+		return 0;
+
+	pmd = (pmd_t *)hpte->ptep;
+	page = pmd_page(*pmd);
 
 	/*
 	 * The write check makes sure we do not set a key on shared
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a74cdcc9af0..be78cdb7677e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -720,18 +720,28 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
-				 struct mm_walk *walk)
+static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
+				unsigned long addr,
+				struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	struct page *page = NULL;
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
-	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
-	} else if (is_swap_pte(*pte)) {
-		swp_entry_t swpent = pte_to_swp_entry(*pte);
+	if (pte_present(pte)) {
+		/* We only care about leaf-level PTEs. */
+		if (!hugetlb_pte_present_leaf(hpte, pte))
+			/*
+			 * The only way hpte can be a non-leaf here is if it
+			 * was originally none and was split from under us.
+			 * Since it was originally none, exclude it.
+			 */
+			return 0;
+
+		page = vm_normal_page(vma, addr, pte);
+	} else if (is_swap_pte(pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(pte);
 
 		if (is_pfn_swap_entry(swpent))
 			page = pfn_swap_entry_to_page(swpent);
@@ -740,9 +750,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 		int mapcount = page_mapcount(page);
 
 		if (mapcount >= 2)
-			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->shared_hugetlb += hugetlb_pte_size(hpte);
 		else
-			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->private_hugetlb += hugetlb_pte_size(hpte);
 	}
 	return 0;
 }
@@ -1561,22 +1571,31 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
+static int pagemap_hugetlb_range(struct hugetlb_pte *hpte,
+				 unsigned long addr,
 				 struct mm_walk *walk)
 {
 	struct pagemapread *pm = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	u64 flags = 0, frame = 0;
 	int err = 0;
-	pte_t pte;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
+	unsigned long end = addr + hugetlb_pte_size(hpte);
+	pte_t pte = huge_ptep_get(hpte->ptep);
+	struct page *page;
 
 	if (vma->vm_flags & VM_SOFTDIRTY)
 		flags |= PM_SOFT_DIRTY;
 
-	pte = huge_ptep_get(ptep);
 	if (pte_present(pte)) {
-		struct page *page = pte_page(pte);
+		/*
+		 * We raced with this PTE being split, which can only happen if
+		 * it was blank before. Treat it is as if it were blank.
+		 */
+		if (!hugetlb_pte_present_leaf(hpte, pte))
+			return 0;
+
+		page = pte_page(pte);
 
 		if (!PageAnon(page))
 			flags |= PM_FILE;
@@ -1857,10 +1876,16 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 }
 #endif
 
+struct show_numa_map_private {
+	struct numa_maps *md;
+	struct page *last_page;
+};
+
 static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
-	struct numa_maps *md = walk->private;
+	struct show_numa_map_private *priv = walk->private;
+	struct numa_maps *md = priv->md;
 	struct vm_area_struct *vma = walk->vma;
 	spinlock_t *ptl;
 	pte_t *orig_pte;
@@ -1872,6 +1897,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		struct page *page;
 
 		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		priv->last_page = page;
 		if (page)
 			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1885,6 +1911,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
+		priv->last_page = page;
 		if (!page)
 			continue;
 		gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1895,19 +1922,25 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long addr,
+		struct mm_walk *walk)
 {
-	pte_t huge_pte = huge_ptep_get(pte);
+	struct show_numa_map_private *priv = walk->private;
+	pte_t huge_pte = huge_ptep_get(hpte->ptep);
 	struct numa_maps *md;
 	struct page *page;
 
-	if (!pte_present(huge_pte))
+	if (!hugetlb_pte_present_leaf(hpte, huge_pte))
+		return 0;
+
+	page = compound_head(pte_page(huge_pte));
+	if (priv->last_page == page)
+		/* we've already accounted for this page */
 		return 0;
 
-	page = pte_page(huge_pte);
+	priv->last_page = page;
 
-	md = walk->private;
+	md = priv->md;
 	gather_stats(page, md, pte_dirty(huge_pte), 1);
 	return 0;
 }
@@ -1937,9 +1970,15 @@ static int show_numa_map(struct seq_file *m, void *v)
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mempolicy *pol;
+
 	char buffer[64];
 	int nid;
 
+	struct show_numa_map_private numa_map_private;
+
+	numa_map_private.md = md;
+	numa_map_private.last_page = NULL;
+
 	if (!mm)
 		return 0;
 
@@ -1969,7 +2008,7 @@ static int show_numa_map(struct seq_file *m, void *v)
 		seq_puts(m, " huge");
 
 	/* mmap_lock is held by m_start */
-	walk_page_vma(vma, &show_numa_ops, md);
+	walk_page_vma(vma, &show_numa_ops, &numa_map_private);
 
 	if (!md->pages)
 		goto out;
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 2f8f6cc980b4..7ed065ea5dba 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -3,6 +3,7 @@
 #define _LINUX_PAGEWALK_H
 
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
 
 struct mm_walk;
 
@@ -21,7 +22,10 @@ struct mm_walk;
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
  *			are skipped.
- * @hugetlb_entry:	if set, called for each hugetlb entry
+ * @hugetlb_entry:	if set, called for each hugetlb entry. In the presence
+ *			of high-granularity hugetlb entries, @hugetlb_entry is
+ *			called only for leaf-level entries (i.e., hstate-level
+ *			page table entries are ignored if they are not leaves).
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
@@ -47,9 +51,8 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_hole)(unsigned long addr, unsigned long next,
 			int depth, struct mm_walk *walk);
-	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
-			     unsigned long addr, unsigned long next,
-			     struct mm_walk *walk);
+	int (*hugetlb_entry)(struct hugetlb_pte *hpte,
+			     unsigned long addr, struct mm_walk *walk);
 	int (*test_walk)(unsigned long addr, unsigned long next,
 			struct mm_walk *walk);
 	int (*pre_vma)(unsigned long start, unsigned long end,
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 15f03df66db6..42845e1b560d 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -330,48 +330,55 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
+static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
+				struct mm_struct *mm,
 				struct vm_area_struct *vma, unsigned long addr)
 {
 	bool referenced = false;
-	pte_t entry = huge_ptep_get(pte);
 	struct page *page = pte_page(entry);
+	struct page *hpage = compound_head(page);
 
-	get_page(page);
+	get_page(hpage);
 
 	if (pte_young(entry)) {
 		referenced = true;
 		entry = pte_mkold(entry);
-		set_huge_pte_at(mm, addr, pte, entry);
+		set_huge_pte_at(mm, addr, hpte->ptep, entry);
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
 	if (mmu_notifier_clear_young(mm, addr,
-				     addr + huge_page_size(hstate_vma(vma))))
+				     addr + hugetlb_pte_size(hpte)))
 		referenced = true;
 #endif /* CONFIG_MMU_NOTIFIER */
 
 	if (referenced)
-		set_page_young(page);
+		set_page_young(hpage);
 
-	set_page_idle(page);
-	put_page(page);
+	set_page_idle(hpage);
+	put_page(hpage);
 }
 
-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
+static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte,
+				     unsigned long addr,
 				     struct mm_walk *walk)
 {
-	struct hstate *h = hstate_vma(walk->vma);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(walk->mm, hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
-	damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
+	if (!hugetlb_pte_present_leaf(hpte, entry))
+		/*
+		 * We raced with someone splitting a blank PTE. Treat this PTE
+		 * as if it were blank.
+		 */
+		goto out;
+
+	damon_hugetlb_mkold(hpte, entry, walk->mm, walk->vma, addr);
 
 out:
 	spin_unlock(ptl);
@@ -484,31 +491,39 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
+static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte,
+				     unsigned long addr,
 				     struct mm_walk *walk)
 {
 	struct damon_young_walk_private *priv = walk->private;
 	struct hstate *h = hstate_vma(walk->vma);
-	struct page *page;
+	struct page *page, *hpage;
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
+	ptl = hugetlb_pte_lock(walk->mm, hpte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
+	if (!hugetlb_pte_present_leaf(hpte, entry))
+		/*
+		 * We raced with someone splitting a blank PTE. Treat this PTE
+		 * as if it were blank.
+		 */
+		goto out;
+
 	page = pte_page(entry);
-	get_page(page);
+	hpage = compound_head(page);
+	get_page(hpage);
 
-	if (pte_young(entry) || !page_is_idle(page) ||
+	if (pte_young(entry) || !page_is_idle(hpage) ||
 	    mmu_notifier_test_young(walk->mm, addr)) {
 		*priv->page_sz = huge_page_size(h);
 		priv->young = true;
 	}
 
-	put_page(page);
+	put_page(hpage);
 
 out:
 	spin_unlock(ptl);
diff --git a/mm/hmm.c b/mm/hmm.c
index 3850fb625dda..76679b46ad5e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -469,27 +469,34 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				      unsigned long start, unsigned long end,
+static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte,
+				      unsigned long start,
 				      struct mm_walk *walk)
 {
 	unsigned long addr = start, i, pfn;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	struct vm_area_struct *vma = walk->vma;
 	unsigned int required_fault;
 	unsigned long pfn_req_flags;
 	unsigned long cpu_flags;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
+	unsigned int order = hugetlb_pte_shift(hpte) - PAGE_SHIFT;
+	unsigned long end = start + hugetlb_pte_size(hpte);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(walk->mm, hpte);
+	entry = huge_ptep_get(hpte->ptep);
+
+	if (!hugetlb_pte_present_leaf(hpte, entry)) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
 
 	i = (start - range->start) >> PAGE_SHIFT;
 	pfn_req_flags = range->hmm_pfns[i];
 	cpu_flags = pte_to_hmm_pfn_flags(range, entry) |
-		    hmm_pfn_flags_order(huge_page_order(hstate_vma(vma)));
+		    hmm_pfn_flags_order(order);
 	required_fault =
 		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
 	if (required_fault) {
@@ -593,7 +600,7 @@ int hmm_range_fault(struct hmm_range *range)
 		 * in pfns. All entries < last in the pfn array are set to their
 		 * output, and all >= are still at their input values.
 		 */
-	} while (ret == -EBUSY);
+	} while (ret == -EBUSY || ret == -EAGAIN);
 	return ret;
 }
 EXPORT_SYMBOL(hmm_range_fault);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bead6bccc7f2..505efba59d29 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -628,6 +628,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 				unsigned long poisoned_pfn, struct to_kill *tk)
 {
 	unsigned long pfn = 0;
+	unsigned long base_pages_poisoned = (1UL << shift) / PAGE_SIZE;
 
 	if (pte_present(pte)) {
 		pfn = pte_pfn(pte);
@@ -638,7 +639,8 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 			pfn = swp_offset_pfn(swp);
 	}
 
-	if (!pfn || pfn != poisoned_pfn)
+	if (!pfn || pfn < poisoned_pfn ||
+			pfn >= poisoned_pfn + base_pages_poisoned)
 		return 0;
 
 	set_to_kill(tk, addr, shift);
@@ -704,16 +706,15 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
-			    unsigned long addr, unsigned long end,
-			    struct mm_walk *walk)
+static int hwpoison_hugetlb_range(struct hugetlb_pte *hpte,
+				  unsigned long addr,
+				  struct mm_walk *walk)
 {
 	struct hwp_walk *hwp = walk->private;
-	pte_t pte = huge_ptep_get(ptep);
-	struct hstate *h = hstate_vma(walk->vma);
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
-	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
-				      hwp->pfn, &hwp->tk);
+	return check_hwpoisoned_entry(pte, addr & hugetlb_pte_mask(hpte),
+			hpte->shift, hwp->pfn, &hwp->tk);
 }
 #else
 #define hwpoison_hugetlb_range	NULL
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 61aa9aedb728..275bc549590e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -558,8 +558,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	return addr != end ? -EIO : 0;
 }
 
-static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
-			       unsigned long addr, unsigned long end,
+static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
+			       unsigned long addr,
 			       struct mm_walk *walk)
 {
 	int ret = 0;
@@ -570,8 +570,12 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	/* We don't migrate high-granularity HugeTLB mappings for now. */
+	if (hugetlb_hgm_enabled(walk->vma))
+		return -EINVAL;
+
+	ptl = hugetlb_pte_lock(walk->mm, hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto unlock;
 	page = pte_page(entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index a085a2aeabd8..0894965b3944 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -22,18 +22,29 @@
 #include <linux/uaccess.h>
 #include "swap.h"
 
-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
-			unsigned long end, struct mm_walk *walk)
+static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr,
+			   struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
 	unsigned char present;
+	unsigned long end = addr + hugetlb_pte_size(hpte);
 	unsigned char *vec = walk->private;
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
 	/*
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
-	present = pte && !huge_pte_none(huge_ptep_get(pte));
+	present = !huge_pte_none(pte);
+
+	/*
+	 * If the pte is present but not a leaf, we raced with someone
+	 * splitting it. For someone to have split it, it must have been
+	 * huge_pte_none before, so treat it as such.
+	 */
+	if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte))
+		present = false;
+
 	for (; addr != end; vec++, addr += PAGE_SIZE)
 		*vec = present;
 	walk->private = vec;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 99762403cc8f..9975b86035e0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -524,12 +524,16 @@ static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
 		0 : -EACCES;
 }
 
-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				   unsigned long addr, unsigned long next,
+static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte,
+				   unsigned long addr,
 				   struct mm_walk *walk)
 {
-	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
-		0 : -EACCES;
+	pte_t pte = huge_ptep_get(hpte->ptep);
+
+	if (!hugetlb_pte_present_leaf(hpte, pte))
+		return -EAGAIN;
+	return pfn_modify_allowed(pte_pfn(pte),
+			*(pgprot_t *)(walk->private)) ? 0 : -EACCES;
 }
 
 static int prot_none_test(unsigned long addr, unsigned long next,
@@ -572,8 +576,10 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	    (newflags & VM_ACCESS_FLAGS) == 0) {
 		pgprot_t new_pgprot = vm_get_page_prot(newflags);
 
-		error = walk_page_range(current->mm, start, end,
-				&prot_none_walk_ops, &new_pgprot);
+		do {
+			error = walk_page_range(current->mm, start, end,
+					&prot_none_walk_ops, &new_pgprot);
+		} while (error == -EAGAIN);
 		if (error)
 			return error;
 	}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index bb33c1e8c017..2318aae98f1e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/minmax.h>
 
 /*
  * We want to know the real level where a entry is located ignoring any
@@ -301,20 +302,39 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	pte_t *pte;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
+	struct hugetlb_pte hpte;
+
+	if (hugetlb_hgm_enabled(vma))
+		/*
+		 * We could potentially do high-granularity walks. Grab the
+		 * VMA lock to prevent PTEs from becoming invalid.
+		 */
+		hugetlb_vma_lock_read(vma);
 
 	do {
-		next = hugetlb_entry_end(h, addr, end);
 		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
-
-		if (pte)
-			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
-		else if (ops->pte_hole)
-			err = ops->pte_hole(addr, next, -1, walk);
+		if (!pte) {
+			next = hugetlb_entry_end(h, addr, end);
+			if (ops->pte_hole)
+				err = ops->pte_hole(addr, next, -1, walk);
+		} else {
+			hugetlb_pte_populate(&hpte, pte, huge_page_shift(h),
+					hpage_size_to_level(sz));
+			hugetlb_hgm_walk(walk->mm, vma, &hpte, addr,
+					PAGE_SIZE,
+					/*stop_at_none=*/true);
+			err = ops->hugetlb_entry(
+					&hpte, addr, walk);
+			next = min(addr + hugetlb_pte_size(&hpte), end);
+		}
 
 		if (err)
 			break;
 	} while (addr = next, addr != end);
 
+	if (hugetlb_hgm_enabled(vma))
+		hugetlb_vma_unlock_read(vma);
+
 	return err;
 }
 
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 21/47] mm: rmap: provide pte_order in page_vma_mapped_walk
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (19 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 20/47] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 22/47] mm: rmap: make page_vma_mapped_walk callers use pte_order James Houghton
                   ` (25 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

page_vma_mapped_walk callers will need this information to know how
HugeTLB pages are mapped. pte_order only applies if pte is not NULL.
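
For illustration only (not part of this patch): once page_vma_mapped_walk()
has returned with pvmw.pte set, a caller can derive the size of the mapping
backing the current PTE from pte_order:

	/* Size of the current mapping and the number of base pages it covers. */
	unsigned long map_size = PAGE_SIZE << pvmw.pte_order;
	unsigned long nr_pages = 1UL << pvmw.pte_order;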

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/rmap.h | 1 +
 mm/page_vma_mapped.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..e0557ede2951 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -378,6 +378,7 @@ struct page_vma_mapped_walk {
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned int pte_order;
 	unsigned int flags;
 };
 
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..395ca4e21c56 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -16,6 +16,7 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 static bool map_pte(struct page_vma_mapped_walk *pvmw)
 {
 	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+	pvmw->pte_order = 0;
 	if (!(pvmw->flags & PVMW_SYNC)) {
 		if (pvmw->flags & PVMW_MIGRATION) {
 			if (!is_swap_pte(*pvmw->pte))
@@ -174,6 +175,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (!pvmw->pte)
 			return false;
 
+		pvmw->pte_order = huge_page_order(hstate);
 		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
 		if (!check_pte(pvmw))
 			return not_found(pvmw);
@@ -269,6 +271,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 				}
 				pte_unmap(pvmw->pte);
 				pvmw->pte = NULL;
+				pvmw->pte_order = 0;
 				goto restart;
 			}
 			pvmw->pte++;
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 22/47] mm: rmap: make page_vma_mapped_walk callers use pte_order
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (20 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 21/47] mm: rmap: provide pte_order in page_vma_mapped_walk James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 23/47] rmap: update hugetlb lock comment for HGM James Houghton
                   ` (24 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Switch the page_vma_mapped_walk callers over to pte_order, so they use the
size of the PTE that was actually found instead of assuming the full
hugepage size. This also updates the callers' hugetlb mapcounting code to
handle mapcount properly for subpage-mapped hugetlb pages.
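
As a concrete example (assuming x86 with 4K base pages): when a 1G hugepage
is mapped at 2M granularity, pvmw.pte_order is 9 for each such mapping, so
the accounting below adjusts the counters by 1UL << 9 = 512 base pages per
mapping rather than by folio_nr_pages(folio) = 262144 for the whole hugepage.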

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/migrate.c |  2 +-
 mm/rmap.c    | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index a0105fa6e3b2..8712b694c5a7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,7 +235,7 @@ static bool remove_migration_pte(struct folio *folio,
 
 #ifdef CONFIG_HUGETLB_PAGE
 		if (folio_test_hugetlb(folio)) {
-			unsigned int shift = huge_page_shift(hstate_vma(vma));
+			unsigned int shift = pvmw.pte_order + PAGE_SHIFT;
 
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			if (folio_test_anon(folio))
diff --git a/mm/rmap.c b/mm/rmap.c
index 9bba65b30e4d..19850d955aea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1626,7 +1626,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
-				hugetlb_count_sub(folio_nr_pages(folio), mm);
+				hugetlb_count_sub(1UL << pvmw.pte_order, mm);
 				set_huge_pte_at(mm, address, pvmw.pte, pteval);
 			} else {
 				dec_mm_counter(mm, mm_counter(&folio->page));
@@ -1785,7 +1785,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/mm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+		if (folio_test_hugetlb(folio))
+			page_remove_rmap(&folio->page, vma, true);
+		else
+			page_remove_rmap(subpage, vma, false);
+
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_page_drain_local();
 		folio_put(folio);
@@ -2034,7 +2038,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		} else if (PageHWPoison(subpage)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
-				hugetlb_count_sub(folio_nr_pages(folio), mm);
+				hugetlb_count_sub(1L << pvmw.pte_order, mm);
 				set_huge_pte_at(mm, address, pvmw.pte, pteval);
 			} else {
 				dec_mm_counter(mm, mm_counter(&folio->page));
@@ -2126,7 +2130,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/mm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+		if (folio_test_hugetlb(folio))
+			page_remove_rmap(&folio->page, vma, true);
+		else
+			page_remove_rmap(subpage, vma, false);
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_page_drain_local();
 		folio_put(folio);
@@ -2210,6 +2217,8 @@ static bool page_make_device_exclusive_one(struct folio *folio,
 				      args->owner);
 	mmu_notifier_invalidate_range_start(&range);
 
+	VM_BUG_ON_FOLIO(folio_test_hugetlb(folio), folio);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 23/47] rmap: update hugetlb lock comment for HGM
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (21 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 22/47] mm: rmap: make page_vma_mapped_walk callers use pte_order James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 24/47] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
                   ` (23 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

The VMA lock is used to prevent high-granularity HugeTLB mappings from
being collapsed while other threads are doing high-granularity page
table walks.
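
As an illustration of the intended usage (a sketch, not code taken from this
patch): high-granularity page table walkers take the lock for reading, while
the collapse path takes it for writing, so a walk can never observe a page
table that is being collapsed underneath it.

	/* Walker side. */
	hugetlb_vma_lock_read(vma);
	/* ... walk high-granularity page tables ... */
	hugetlb_vma_unlock_read(vma);

	/* Collapse side. */
	hugetlb_vma_lock_write(vma);
	/* ... collapse high-granularity mappings ... */
	hugetlb_vma_unlock_write(vma);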

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/rmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 19850d955aea..527463c1e936 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -47,7 +47,8 @@
  *
  * hugetlbfs PageHuge() take locks in this order:
  *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *     vma_lock (hugetlb specific lock for pmd_sharing)
+ *     vma_lock (hugetlb specific lock for pmd_sharing and high-granularity
+ *               mapping)
  *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
  *         page->flags PG_locked (lock_page)
  */
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 24/47] hugetlb: update page_vma_mapped to do high-granularity walks
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (22 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 23/47] rmap: update hugetlb lock comment for HGM James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-15 17:49   ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
                   ` (22 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This updates the HugeTLB logic to look a lot more like the PTE-mapped
THP logic. When a user calls us in a loop, we will update pvmw->address
to walk to each page table entry that could possibly map the hugepage
containing pvmw->pfn.

This makes use of the new pte_order so callers know what size PTE
they're getting.
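
As a rough sketch (illustration only; handle_one_mapping() is just a
placeholder, not a real function), a caller now iterates over every PTE that
maps the hugepage, the same way it would for a PTE-mapped THP:

	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

	while (page_vma_mapped_walk(&pvmw)) {
		/*
		 * Each iteration yields one mapping of the hugepage;
		 * pvmw.pte_order gives its size (PAGE_SIZE << pte_order),
		 * and pvmw.address is advanced to the next mapping.
		 */
		handle_one_mapping(pvmw.address, pvmw.pte, pvmw.pte_order);
	}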

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/rmap.h |  4 +++
 mm/page_vma_mapped.c | 59 ++++++++++++++++++++++++++++++++++++--------
 mm/rmap.c            | 48 +++++++++++++++++++++--------------
 3 files changed, 83 insertions(+), 28 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e0557ede2951..d7d2d9f65a01 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -13,6 +13,7 @@
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/memremap.h>
+#include <linux/hugetlb.h>
 
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
@@ -409,6 +410,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 		pte_unmap(pvmw->pte);
 	if (pvmw->ptl)
 		spin_unlock(pvmw->ptl);
+	if (pvmw->pte && is_vm_hugetlb_page(pvmw->vma) &&
+			hugetlb_hgm_enabled(pvmw->vma))
+		hugetlb_vma_unlock_read(pvmw->vma);
 }
 
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 395ca4e21c56..1994b3f9a4c2 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -133,7 +133,8 @@ static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
  *
  * Returns true if the page is mapped in the vma. @pvmw->pmd and @pvmw->pte point
  * to relevant page table entries. @pvmw->ptl is locked. @pvmw->address is
- * adjusted if needed (for PTE-mapped THPs).
+ * adjusted if needed (for PTE-mapped THPs and high-granularity-mapped HugeTLB
+ * pages).
  *
  * If @pvmw->pmd is set but @pvmw->pte is not, you have found PMD-mapped page
  * (usually THP). For PTE-mapped THP, you should run page_vma_mapped_walk() in
@@ -166,19 +167,57 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (unlikely(is_vm_hugetlb_page(vma))) {
 		struct hstate *hstate = hstate_vma(vma);
 		unsigned long size = huge_page_size(hstate);
-		/* The only possible mapping was handled on last iteration */
-		if (pvmw->pte)
-			return not_found(pvmw);
+		struct hugetlb_pte hpte;
+		pte_t *pte;
+		pte_t pteval;
+
+		end = (pvmw->address & huge_page_mask(hstate)) +
+			huge_page_size(hstate);
 
 		/* when pud is not present, pte will be NULL */
-		pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
-		if (!pvmw->pte)
+		pte = huge_pte_offset(mm, pvmw->address, size);
+		if (!pte)
 			return false;
 
-		pvmw->pte_order = huge_page_order(hstate);
-		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
-		if (!check_pte(pvmw))
-			return not_found(pvmw);
+		do {
+			hugetlb_pte_populate(&hpte, pte, huge_page_shift(hstate),
+					hpage_size_to_level(size));
+
+			/*
+			 * Do a high granularity page table walk. The vma lock
+			 * is grabbed to prevent the page table from being
+			 * collapsed mid-walk. It is dropped in
+			 * page_vma_mapped_walk_done().
+			 */
+			if (pvmw->pte) {
+				if (pvmw->ptl)
+					spin_unlock(pvmw->ptl);
+				pvmw->ptl = NULL;
+				pvmw->address += PAGE_SIZE << pvmw->pte_order;
+				if (pvmw->address >= end)
+					return not_found(pvmw);
+			} else if (hugetlb_hgm_enabled(vma))
+				/* Only grab the lock once. */
+				hugetlb_vma_lock_read(vma);
+
+retry_walk:
+			hugetlb_hgm_walk(mm, vma, &hpte, pvmw->address,
+					PAGE_SIZE, /*stop_at_none=*/true);
+
+			pvmw->pte = hpte.ptep;
+			pvmw->pte_order = hpte.shift - PAGE_SHIFT;
+			pvmw->ptl = hugetlb_pte_lock(mm, &hpte);
+			pteval = huge_ptep_get(hpte.ptep);
+			if (pte_present(pteval) && !hugetlb_pte_present_leaf(
+						&hpte, pteval)) {
+				/*
+				 * Someone split from under us, so keep
+				 * walking.
+				 */
+				spin_unlock(pvmw->ptl);
+				goto retry_walk;
+			}
+		} while (!check_pte(pvmw));
 		return true;
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 527463c1e936..a8359584467e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1552,17 +1552,23 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			flush_cache_range(vma, range.start, range.end);
 
 			/*
-			 * To call huge_pmd_unshare, i_mmap_rwsem must be
-			 * held in write mode.  Caller needs to explicitly
-			 * do this outside rmap routines.
-			 *
-			 * We also must hold hugetlb vma_lock in write mode.
-			 * Lock order dictates acquiring vma_lock BEFORE
-			 * i_mmap_rwsem.  We can only try lock here and fail
-			 * if unsuccessful.
+			 * If HGM is enabled, we have already grabbed the VMA
+			 * lock for reading, and we cannot safely release it.
+			 * Because HGM-enabled VMAs have already unshared all
+			 * PMDs, we can safely ignore PMD unsharing here.
 			 */
-			if (!anon) {
+			if (!anon && !hugetlb_hgm_enabled(vma)) {
 				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+				/*
+				 * To call huge_pmd_unshare, i_mmap_rwsem must
+				 * be held in write mode.  Caller needs to
+				 * explicitly do this outside rmap routines.
+				 *
+				 * We also must hold hugetlb vma_lock in write
+				 * mode. Lock order dictates acquiring vma_lock
+				 * BEFORE i_mmap_rwsem.  We can only try lock
+				 * here and fail if unsuccessful.
+				 */
 				if (!hugetlb_vma_trylock_write(vma)) {
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
@@ -1946,17 +1952,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			flush_cache_range(vma, range.start, range.end);
 
 			/*
-			 * To call huge_pmd_unshare, i_mmap_rwsem must be
-			 * held in write mode.  Caller needs to explicitly
-			 * do this outside rmap routines.
-			 *
-			 * We also must hold hugetlb vma_lock in write mode.
-			 * Lock order dictates acquiring vma_lock BEFORE
-			 * i_mmap_rwsem.  We can only try lock here and
-			 * fail if unsuccessful.
+			 * If HGM is enabled, we have already grabbed the VMA
+			 * lock for reading, and we cannot safely release it.
+			 * Because HGM-enabled VMAs have already unshared all
+			 * PMDs, we can safely ignore PMD unsharing here.
 			 */
-			if (!anon) {
+			if (!anon && !hugetlb_hgm_enabled(vma)) {
 				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+				/*
+				 * To call huge_pmd_unshare, i_mmap_rwsem must
+				 * be held in write mode.  Caller needs to
+				 * explicitly do this outside rmap routines.
+				 *
+				 * We also must hold hugetlb vma_lock in write
+				 * mode. Lock order dictates acquiring vma_lock
+				 * BEFORE i_mmap_rwsem.  We can only try lock
+				 * here and fail if unsuccessful.
+				 */
 				if (!hugetlb_vma_trylock_write(vma)) {
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (23 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 24/47] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-30 21:32   ` Peter Xu
  2022-10-21 16:36 ` [RFC PATCH v2 26/47] hugetlb: make move_hugetlb_page_tables compatible with HGM James Houghton
                   ` (21 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This allows fork() to work with high-granularity mappings. The page
table structure is copied such that partially mapped regions will remain
partially mapped in the same way for the new process.

A page's reference count is incremented for *each* portion of it that is
mapped in the page table. For example, if you have a PMD-mapped 1G page,
the reference count and mapcount will be incremented by 512.
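
Concretely, each copied mapping takes its own reference on the head page
(sketch of the relevant lines from the hunk below):

	ptepage = pte_page(entry);       /* may point at a subpage */
	hpage = compound_head(ptepage);  /* refcount/mapcount live on the head */
	get_page(hpage);                 /* one reference per copied mapping */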

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 81 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 65 insertions(+), 16 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5783a8307a77..7d692907cbf3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4946,7 +4946,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *src_vma)
 {
 	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
+	struct hugetlb_pte src_hpte, dst_hpte;
+	struct page *ptepage, *hpage;
 	unsigned long addr;
 	bool cow = is_cow_mapping(src_vma->vm_flags);
 	struct hstate *h = hstate_vma(src_vma);
@@ -4956,6 +4957,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	unsigned long last_addr_mask;
 	int ret = 0;
 
+	if (hugetlb_hgm_enabled(src_vma)) {
+		/*
+		 * src_vma might have high-granularity PTEs, and dst_vma will
+		 * need to copy those.
+		 */
+		ret = enable_hugetlb_hgm(dst_vma);
+		if (ret)
+			return ret;
+	}
+
 	if (cow) {
 		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_vma, src,
 					src_vma->vm_start,
@@ -4967,18 +4978,22 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		/*
 		 * For shared mappings the vma lock must be held before
 		 * calling huge_pte_offset in the src vma. Otherwise, the
-		 * returned ptep could go away if part of a shared pmd and
-		 * another thread calls huge_pmd_unshare.
+		 * returned ptep could go away if
+		 *  - part of a shared pmd and another thread calls
+		 *    huge_pmd_unshare, or
+		 *  - another thread collapses a high-granularity mapping.
 		 */
 		hugetlb_vma_lock_read(src_vma);
 	}
 
 	last_addr_mask = hugetlb_mask_last_page(h);
-	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
+	addr = src_vma->vm_start;
+	while (addr < src_vma->vm_end) {
 		spinlock_t *src_ptl, *dst_ptl;
+		unsigned long hpte_sz;
 		src_pte = huge_pte_offset(src, addr, sz);
 		if (!src_pte) {
-			addr |= last_addr_mask;
+			addr = (addr | last_addr_mask) + sz;
 			continue;
 		}
 		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
@@ -4987,6 +5002,26 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			break;
 		}
 
+		hugetlb_pte_populate(&src_hpte, src_pte, huge_page_shift(h),
+				hpage_size_to_level(huge_page_size(h)));
+		hugetlb_pte_populate(&dst_hpte, dst_pte, huge_page_shift(h),
+				hpage_size_to_level(huge_page_size(h)));
+
+		if (hugetlb_hgm_enabled(src_vma)) {
+			hugetlb_hgm_walk(src, src_vma, &src_hpte, addr,
+				      PAGE_SIZE, /*stop_at_none=*/true);
+			ret = hugetlb_hgm_walk(dst, dst_vma, &dst_hpte, addr,
+					hugetlb_pte_size(&src_hpte),
+					/*stop_at_none=*/false);
+			if (ret)
+				break;
+
+			src_pte = src_hpte.ptep;
+			dst_pte = dst_hpte.ptep;
+		}
+
+		hpte_sz = hugetlb_pte_size(&src_hpte);
+
 		/*
 		 * If the pagetables are shared don't copy or take references.
 		 *
@@ -4996,12 +5031,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		 * to reliably determine whether pte is shared.
 		 */
 		if (page_count(virt_to_page(dst_pte)) > 1) {
-			addr |= last_addr_mask;
+			addr = (addr | last_addr_mask) + sz;
 			continue;
 		}
 
-		dst_ptl = huge_pte_lock(h, dst, dst_pte);
-		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
+		dst_ptl = hugetlb_pte_lock(dst, &dst_hpte);
+		src_ptl = hugetlb_pte_lockptr(src, &src_hpte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 again:
@@ -5042,10 +5077,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 */
 			if (userfaultfd_wp(dst_vma))
 				set_huge_pte_at(dst, addr, dst_pte, entry);
+		} else if (!hugetlb_pte_present_leaf(&src_hpte, entry)) {
+			/* Retry the walk. */
+			spin_unlock(src_ptl);
+			spin_unlock(dst_ptl);
+			continue;
 		} else {
-			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
-			get_page(ptepage);
+			hpage = compound_head(ptepage);
+			get_page(hpage);
 
 			/*
 			 * Failing to duplicate the anon rmap is a rare case
@@ -5058,24 +5098,29 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 * sleep during the process.
 			 */
 			if (!PageAnon(ptepage)) {
-				page_dup_file_rmap(ptepage, true);
-			} else if (page_try_dup_anon_rmap(ptepage, true,
+				page_dup_file_rmap(hpage, true);
+			} else if (page_try_dup_anon_rmap(hpage, true,
 							  src_vma)) {
 				pte_t src_pte_old = entry;
 				struct page *new;
 
+				if (hugetlb_hgm_enabled(src_vma)) {
+					ret = -EINVAL;
+					break;
+				}
+
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				/* Do not use reserve as it's private owned */
 				new = alloc_huge_page(dst_vma, addr, 1);
 				if (IS_ERR(new)) {
-					put_page(ptepage);
+					put_page(hpage);
 					ret = PTR_ERR(new);
 					break;
 				}
-				copy_user_huge_page(new, ptepage, addr, dst_vma,
+				copy_user_huge_page(new, hpage, addr, dst_vma,
 						    npages);
-				put_page(ptepage);
+				put_page(hpage);
 
 				/* Install the new huge page if src pte stable */
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
@@ -5093,6 +5138,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				hugetlb_install_page(dst_vma, dst_pte, addr, new);
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
+				addr += hugetlb_pte_size(&src_hpte);
 				continue;
 			}
 
@@ -5109,10 +5155,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 
 			set_huge_pte_at(dst, addr, dst_pte, entry);
-			hugetlb_count_add(npages, dst);
+			hugetlb_count_add(
+					hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
+					dst);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
+		addr += hugetlb_pte_size(&src_hpte);
 	}
 
 	if (cow) {
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 26/47] hugetlb: make move_hugetlb_page_tables compatible with HGM
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (24 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 27/47] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
                   ` (20 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This is very similar to the support that was added to
copy_hugetlb_page_range. We simply do a high-granularity walk now, and
most of the rest of the code stays the same.
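
The core pattern, inside the per-PTE loop, is the same as in
copy_hugetlb_page_range (sketch, error handling elided): walk the source at
the finest granularity, stopping at the first non-present entry, then walk
and allocate the destination down to whatever size the source walk ended at:

	/* Find the (possibly high-granularity) source PTE. */
	hugetlb_hgm_walk(mm, vma, &src_hpte, old_addr,
			 PAGE_SIZE, /*stop_at_none=*/true);

	/* Allocate destination page tables of the same size. */
	if (hugetlb_hgm_walk(mm, vma, &dst_hpte, new_addr,
			     hugetlb_pte_size(&src_hpte),
			     /*stop_at_none=*/false))
		break;

	move_hugetlb_pte(vma, old_addr, new_addr, &src_hpte, &dst_hpte);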

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 47 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7d692907cbf3..16b0d192445c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5174,16 +5174,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	return ret;
 }
 
-static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
-			  unsigned long new_addr, pte_t *src_pte, pte_t *dst_pte)
+static void move_hugetlb_pte(struct vm_area_struct *vma, unsigned long old_addr,
+			     unsigned long new_addr, struct hugetlb_pte *src_hpte,
+			     struct hugetlb_pte *dst_hpte)
 {
-	struct hstate *h = hstate_vma(vma);
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *src_ptl, *dst_ptl;
 	pte_t pte;
 
-	dst_ptl = huge_pte_lock(h, mm, dst_pte);
-	src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
+	dst_ptl = hugetlb_pte_lock(mm, dst_hpte);
+	src_ptl = hugetlb_pte_lockptr(mm, src_hpte);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst ptlocks
@@ -5192,8 +5192,8 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 	if (src_ptl != dst_ptl)
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
-	pte = huge_ptep_get_and_clear(mm, old_addr, src_pte);
-	set_huge_pte_at(mm, new_addr, dst_pte, pte);
+	pte = huge_ptep_get_and_clear(mm, old_addr, src_hpte->ptep);
+	set_huge_pte_at(mm, new_addr, dst_hpte->ptep, pte);
 
 	if (src_ptl != dst_ptl)
 		spin_unlock(src_ptl);
@@ -5214,6 +5214,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	pte_t *src_pte, *dst_pte;
 	struct mmu_notifier_range range;
 	bool shared_pmd = false;
+	struct hugetlb_pte src_hpte, dst_hpte;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, old_addr,
 				old_end);
@@ -5229,20 +5230,28 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	/* Prevent race with file truncation */
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(mapping);
-	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
+	while (old_addr < old_end) {
 		src_pte = huge_pte_offset(mm, old_addr, sz);
 		if (!src_pte) {
-			old_addr |= last_addr_mask;
-			new_addr |= last_addr_mask;
+			old_addr = (old_addr | last_addr_mask) + sz;
+			new_addr = (new_addr | last_addr_mask) + sz;
 			continue;
 		}
-		if (huge_pte_none(huge_ptep_get(src_pte)))
+
+		hugetlb_pte_populate(&src_hpte, src_pte, huge_page_shift(h),
+				     hpage_size_to_level(sz));
+		hugetlb_hgm_walk(mm, vma, &src_hpte, old_addr,
+				PAGE_SIZE, /*stop_at_none=*/true);
+		if (huge_pte_none(huge_ptep_get(src_hpte.ptep))) {
+			old_addr += hugetlb_pte_size(&src_hpte);
+			new_addr += hugetlb_pte_size(&src_hpte);
 			continue;
+		}
 
-		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
+		if (huge_pmd_unshare(mm, vma, old_addr, src_hpte.ptep)) {
 			shared_pmd = true;
-			old_addr |= last_addr_mask;
-			new_addr |= last_addr_mask;
+			old_addr = (old_addr | last_addr_mask) + sz;
+			new_addr = (new_addr | last_addr_mask) + sz;
 			continue;
 		}
 
@@ -5250,7 +5259,15 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 		if (!dst_pte)
 			break;
 
-		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte);
+		hugetlb_pte_populate(&dst_hpte, dst_pte, huge_page_shift(h),
+				     hpage_size_to_level(sz));
+		if (hugetlb_hgm_walk(mm, vma, &dst_hpte, new_addr,
+				     hugetlb_pte_size(&src_hpte),
+				     /*stop_at_none=*/false))
+			break;
+		move_hugetlb_pte(vma, old_addr, new_addr, &src_hpte, &dst_hpte);
+		old_addr += hugetlb_pte_size(&src_hpte);
+		new_addr += hugetlb_pte_size(&src_hpte);
 	}
 
 	if (shared_pmd)
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 27/47] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (25 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 26/47] hugetlb: make move_hugetlb_page_tables compatible with HGM James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 28/47] rmap: in try_to_{migrate,unmap}_one, check head page for page flags James Houghton
                   ` (19 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Update the page fault handler to support high-granularity page faults.
While handling a page fault on a partially-mapped HugeTLB page, if the
PTE we find with the high-granularity walk is none, then we will replace it with
a leaf-level PTE to map the page. To give some examples:
1. For a completely unmapped 1G page, it will be mapped with a 1G PUD.
2. For a 1G page that has its first 512M mapped, any faults on the
   unmapped sections will result in 2M PMDs mapping each unmapped 2M
   section.
3. For a 1G page that has only its first 4K mapped, a page fault on its
   second 4K section will get a 4K PTE to map it.

Unless high-granularity mappings are created via UFFDIO_CONTINUE, it is
impossible for hugetlb_fault to create high-granularity mappings.

This commit does not handle hugetlb_wp right now, and it doesn't handle
HugeTLB page migration and swap entries.
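
In terms of the code, the level selection falls out of the high-granularity
walk in hugetlb_fault (sketch of the relevant steps):

	/* Start from the hstate-level entry ... */
	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
			hpage_size_to_level(huge_page_size(h)));
	/*
	 * ... and walk downward, stopping at the first none entry.
	 * hugetlb_no_page() then installs a leaf at whatever level the
	 * walk stopped at: a 1G PUD, a 2M PMD, or a 4K PTE, matching
	 * the examples above.
	 */
	hugetlb_hgm_walk(mm, vma, &hpte, address, PAGE_SIZE,
			/*stop_at_none=*/true);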

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 90 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 64 insertions(+), 26 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 16b0d192445c..2ee2c48ee79c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -118,6 +118,18 @@ enum hugetlb_level hpage_size_to_level(unsigned long sz)
 	return HUGETLB_LEVEL_PGD;
 }
 
+/*
+ * Find the subpage that corresponds to `addr` in `hpage`.
+ */
+static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
+				 unsigned long addr)
+{
+	size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
+
+	BUG_ON(idx >= pages_per_huge_page(h));
+	return &hpage[idx];
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -5810,13 +5822,13 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
  * false if pte changed or is changing.
  */
 static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
-			       pte_t *ptep, pte_t old_pte)
+			       struct hugetlb_pte *hpte, pte_t old_pte)
 {
 	spinlock_t *ptl;
 	bool same;
 
-	ptl = huge_pte_lock(h, mm, ptep);
-	same = pte_same(huge_ptep_get(ptep), old_pte);
+	ptl = hugetlb_pte_lock(mm, hpte);
+	same = pte_same(huge_ptep_get(hpte->ptep), old_pte);
 	spin_unlock(ptl);
 
 	return same;
@@ -5825,17 +5837,18 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
-			unsigned long address, pte_t *ptep,
+			unsigned long address, struct hugetlb_pte *hpte,
 			pte_t old_pte, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 	int anon_rmap = 0;
 	unsigned long size;
-	struct page *page;
+	struct page *page, *subpage;
 	pte_t new_pte;
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
+	unsigned long haddr_hgm = address & hugetlb_pte_mask(hpte);
 	bool new_page, new_pagecache_page = false;
 	u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
 
@@ -5880,7 +5893,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * never happen on the page after UFFDIO_COPY has
 			 * correctly installed the page and returned.
 			 */
-			if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+			if (!hugetlb_pte_stable(h, mm, hpte, old_pte)) {
 				ret = 0;
 				goto out;
 			}
@@ -5904,7 +5917,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * here.  Before returning error, get ptl and make
 			 * sure there really is no pte entry.
 			 */
-			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+			if (hugetlb_pte_stable(h, mm, hpte, old_pte))
 				ret = vmf_error(PTR_ERR(page));
 			else
 				ret = 0;
@@ -5954,7 +5967,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			unlock_page(page);
 			put_page(page);
 			/* See comment in userfaultfd_missing() block above */
-			if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+			if (!hugetlb_pte_stable(h, mm, hpte, old_pte)) {
 				ret = 0;
 				goto out;
 			}
@@ -5979,10 +5992,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		vma_end_reservation(h, vma, haddr);
 	}
 
-	ptl = huge_pte_lock(h, mm, ptep);
+	ptl = hugetlb_pte_lock(mm, hpte);
 	ret = 0;
 	/* If pte changed from under us, retry */
-	if (!pte_same(huge_ptep_get(ptep), old_pte))
+	if (!pte_same(huge_ptep_get(hpte->ptep), old_pte))
 		goto backout;
 
 	if (anon_rmap) {
@@ -5990,20 +6003,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		hugepage_add_new_anon_rmap(page, vma, haddr);
 	} else
 		page_dup_file_rmap(page, true);
-	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
-				&& (vma->vm_flags & VM_SHARED)));
+
+	subpage = hugetlb_find_subpage(h, page, haddr_hgm);
+	new_pte = make_huge_pte_with_shift(vma, subpage,
+			((vma->vm_flags & VM_WRITE)
+			 && (vma->vm_flags & VM_SHARED)),
+			hpte->shift);
 	/*
 	 * If this pte was previously wr-protected, keep it wr-protected even
 	 * if populated.
 	 */
 	if (unlikely(pte_marker_uffd_wp(old_pte)))
 		new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
-	set_huge_pte_at(mm, haddr, ptep, new_pte);
+	set_huge_pte_at(mm, haddr_hgm, hpte->ptep, new_pte);
 
-	hugetlb_count_add(pages_per_huge_page(h), mm);
+	hugetlb_count_add(hugetlb_pte_size(hpte) / PAGE_SIZE, mm);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+		BUG_ON(hugetlb_pte_size(hpte) != huge_page_size(h));
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(mm, vma, address, ptep, flags, page, ptl);
+		ret = hugetlb_wp(mm, vma, address, hpte->ptep, flags, page, ptl);
 	}
 
 	spin_unlock(ptl);
@@ -6066,11 +6084,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	u32 hash;
 	pgoff_t idx;
 	struct page *page = NULL;
+	struct page *subpage = NULL;
 	struct page *pagecache_page = NULL;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
+	unsigned long haddr_hgm;
+	struct hugetlb_pte hpte;
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
@@ -6115,15 +6136,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	entry = huge_ptep_get(ptep);
+	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+			hpage_size_to_level(huge_page_size(h)));
+	/* Do a high-granularity page table walk. */
+	hugetlb_hgm_walk(mm, vma, &hpte, address, PAGE_SIZE,
+			/*stop_at_none=*/true);
+
+	entry = huge_ptep_get(hpte.ptep);
 	/* PTE markers should be handled the same way as none pte */
-	if (huge_pte_none_mostly(entry))
+	if (huge_pte_none_mostly(entry)) {
 		/*
 		 * hugetlb_no_page will drop vma lock and hugetlb fault
 		 * mutex internally, which make us return immediately.
 		 */
-		return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+		return hugetlb_no_page(mm, vma, mapping, idx, address, &hpte,
 				      entry, flags);
+	}
 
 	ret = 0;
 
@@ -6137,6 +6165,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_present(entry))
 		goto out_mutex;
 
+	if (!hugetlb_pte_present_leaf(&hpte, entry))
+		/* We raced with someone splitting the entry. */
+		goto out_mutex;
+
 	/*
 	 * If we are going to COW/unshare the mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that any
@@ -6156,14 +6188,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pagecache_page = find_lock_page(mapping, idx);
 	}
 
-	ptl = huge_pte_lock(h, mm, ptep);
+	ptl = hugetlb_pte_lock(mm, &hpte);
 
 	/* Check for a racing update before calling hugetlb_wp() */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+	if (unlikely(!pte_same(entry, huge_ptep_get(hpte.ptep))))
 		goto out_ptl;
 
+	/* haddr_hgm is the base address of the region that hpte maps. */
+	haddr_hgm = address & hugetlb_pte_mask(&hpte);
+
 	/* Handle userfault-wp first, before trying to lock more pages */
-	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(entry) &&
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
 		struct vm_fault vmf = {
 			.vma = vma,
@@ -6187,7 +6222,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * pagecache_page, so here we need take the former one
 	 * when page != pagecache_page or !pagecache_page.
 	 */
-	page = pte_page(entry);
+	subpage = pte_page(entry);
+	page = compound_head(subpage);
 	if (page != pagecache_page)
 		if (!trylock_page(page)) {
 			need_wait_lock = 1;
@@ -6198,7 +6234,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(entry)) {
-			ret = hugetlb_wp(mm, vma, address, ptep, flags,
+			BUG_ON(hugetlb_pte_size(&hpte) != huge_page_size(h));
+			ret = hugetlb_wp(mm, vma, address, hpte.ptep, flags,
 					 pagecache_page, ptl);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
@@ -6206,9 +6243,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	}
 	entry = pte_mkyoung(entry);
-	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
+	if (huge_ptep_set_access_flags(vma, haddr_hgm, hpte.ptep, entry,
 						flags & FAULT_FLAG_WRITE))
-		update_mmu_cache(vma, haddr, ptep);
+		update_mmu_cache(vma, haddr_hgm, hpte.ptep);
 out_put_page:
 	if (page != pagecache_page)
 		unlock_page(page);
@@ -7598,7 +7635,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 				pte = (pte_t *)pmd_alloc(mm, pud, addr);
 		}
 	}
-	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
+	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte) &&
+			!hugetlb_hgm_enabled(vma));
 
 	return pte;
 }
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 28/47] rmap: in try_to_{migrate,unmap}_one, check head page for page flags
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (26 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 27/47] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 29/47] hugetlb: add high-granularity migration support James Houghton
                   ` (18 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

The main complication here is that HugeTLB pages have their poison
status stored in the head page as the HWPoison page flag. Because
HugeTLB high-granularity mapping can create PTEs that point to subpages
instead of always the head of a hugepage, we need to check the
compound_head for page flags.
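
For example, a 4K high-granularity PTE may point at an arbitrary subpage of
a poisoned hugepage; PageHWPoison() on that subpage is false even though the
hugepage is poisoned, so the check has to be done on the head page (here,
&folio->page) instead.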

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/rmap.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index a8359584467e..d5e1eb6b8ce5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1474,10 +1474,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	pte_t pteval;
-	struct page *subpage;
+	struct page *subpage, *page_flags_page;
 	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	bool page_poisoned;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -1530,9 +1531,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 		subpage = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
+		/*
+		 * We check the page flags of HugeTLB pages by checking the
+		 * head page.
+		 */
+		page_flags_page = folio_test_hugetlb(folio)
+			? &folio->page
+			: subpage;
+		page_poisoned = PageHWPoison(page_flags_page);
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
-				 PageAnonExclusive(subpage);
+				 PageAnonExclusive(page_flags_page);
 
 		if (folio_test_hugetlb(folio)) {
 			bool anon = folio_test_anon(folio);
@@ -1541,7 +1550,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * The try_to_unmap() is only passed a hugetlb page
 			 * in the case where the hugetlb page is poisoned.
 			 */
-			VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
+			VM_BUG_ON_FOLIO(!page_poisoned, folio);
 			/*
 			 * huge_pmd_unshare may unmap an entire PMD page.
 			 * There is no way of knowing exactly which PMDs may
@@ -1630,7 +1639,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
-		if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
+		if (page_poisoned && !(flags & TTU_IGNORE_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
 				hugetlb_count_sub(1UL << pvmw.pte_order, mm);
@@ -1656,7 +1665,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
 		} else if (folio_test_anon(folio)) {
-			swp_entry_t entry = { .val = page_private(subpage) };
+			swp_entry_t entry = {
+				.val = page_private(page_flags_page)
+			};
 			pte_t swp_pte;
 			/*
 			 * Store the swap location in the pte.
@@ -1855,7 +1866,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	pte_t pteval;
-	struct page *subpage;
+	struct page *subpage, *page_flags_page;
 	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
@@ -1935,9 +1946,16 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			subpage = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
 		}
+		/*
+		 * We check the page flags of HugeTLB pages by checking the
+		 * head page.
+		 */
+		page_flags_page = folio_test_hugetlb(folio)
+			? &folio->page
+			: subpage;
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
-				 PageAnonExclusive(subpage);
+				 PageAnonExclusive(page_flags_page);
 
 		if (folio_test_hugetlb(folio)) {
 			bool anon = folio_test_anon(folio);
@@ -2048,7 +2066,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * No need to invalidate here it will synchronize on
 			 * against the special swap migration pte.
 			 */
-		} else if (PageHWPoison(subpage)) {
+		} else if (PageHWPoison(page_flags_page)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
 				hugetlb_count_sub(1L << pvmw.pte_order, mm);
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 29/47] hugetlb: add high-granularity migration support
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (27 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 28/47] rmap: in try_to_{migrate,unmap}_one, check head page for page flags James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 30/47] hugetlb: add high-granularity check for hwpoison in fault path James Houghton
                   ` (17 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

To prevent queueing a hugepage for migration multiple times, we use
last_page to keep track of the last page we saw in queue_pages_hugetlb,
and if the page we're looking at is last_page, then we skip it.
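
The dedup itself is just a head page comparison (sketch of the relevant
lines):

	page = compound_head(pte_page(entry));
	/*
	 * Every high-granularity PTE of the same hugepage resolves to the
	 * same head page, so only queue the page the first time we see it.
	 */
	if (page == qp->last_page)
		goto unlock;
	/* ... */
	qp->last_page = page;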

For the non-hugetlb cases, last_page, although unused, is still updated
so that it has a consistent meaning with the hugetlb case.

This commit adds a check in hugetlb_fault for high-granularity migration
PTEs.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/swapops.h |  8 ++++++--
 mm/hugetlb.c            | 15 ++++++++++++++-
 mm/mempolicy.c          | 24 +++++++++++++++++++-----
 mm/migrate.c            | 18 +++++++++++-------
 4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 86b95ccb81bb..2939323d0fd2 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -66,6 +66,8 @@
 
 static inline bool is_pfn_swap_entry(swp_entry_t entry);
 
+struct hugetlb_pte;
+
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
@@ -346,7 +348,8 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					unsigned long address);
 #ifdef CONFIG_HUGETLB_PAGE
 extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
-extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
+extern void migration_entry_wait_huge(struct vm_area_struct *vma,
+					struct hugetlb_pte *hpte);
 #endif	/* CONFIG_HUGETLB_PAGE */
 #else  /* CONFIG_MIGRATION */
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
@@ -375,7 +378,8 @@ static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					 unsigned long address) { }
 #ifdef CONFIG_HUGETLB_PAGE
 static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
-static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
+static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
+						struct hugetlb_pte *hpte) { }
 #endif	/* CONFIG_HUGETLB_PAGE */
 static inline int is_writable_migration_entry(swp_entry_t entry)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2ee2c48ee79c..8dba8d59ebe5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6100,9 +6100,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * OK as we are only making decisions based on content and
 		 * not actually modifying content here.
 		 */
+		hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+				hpage_size_to_level(huge_page_size(h)));
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait_huge(vma, ptep);
+			migration_entry_wait_huge(vma, &hpte);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
@@ -6142,7 +6144,18 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	hugetlb_hgm_walk(mm, vma, &hpte, address, PAGE_SIZE,
 			/*stop_at_none=*/true);
 
+	/*
+	 * Now that we have done a high-granularity walk, check again if we are
+	 * looking at a migration entry.
+	 */
 	entry = huge_ptep_get(hpte.ptep);
+	if (unlikely(is_hugetlb_entry_migration(entry))) {
+		hugetlb_vma_unlock_read(vma);
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		migration_entry_wait_huge(vma, &hpte);
+		return 0;
+	}
+
 	/* PTE markers should be handled the same way as none pte */
 	if (huge_pte_none_mostly(entry)) {
 		/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 275bc549590e..47bf9b16a9c0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -424,6 +424,7 @@ struct queue_pages {
 	unsigned long start;
 	unsigned long end;
 	struct vm_area_struct *first;
+	struct page *last_page;
 };
 
 /*
@@ -475,6 +476,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
 	flags = qp->flags;
 	/* go to thp migration */
 	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+		qp->last_page = page;
 		if (!vma_migratable(walk->vma) ||
 		    migrate_page_add(page, qp->pagelist, flags)) {
 			ret = 1;
@@ -532,6 +534,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		if (!queue_pages_required(page, qp))
 			continue;
+
 		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
 			/* MPOL_MF_STRICT must be specified if we get here */
 			if (!vma_migratable(vma)) {
@@ -539,6 +542,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 				break;
 			}
 
+			qp->last_page = page;
+
 			/*
 			 * Do not abort immediately since there may be
 			 * temporary off LRU pages in the range.  Still
@@ -570,15 +575,22 @@ static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	/* We don't migrate high-granularity HugeTLB mappings for now. */
-	if (hugetlb_hgm_enabled(walk->vma))
-		return -EINVAL;
-
 	ptl = hugetlb_pte_lock(walk->mm, hpte);
 	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto unlock;
-	page = pte_page(entry);
+
+	if (!hugetlb_pte_present_leaf(hpte, entry)) {
+		ret = -EAGAIN;
+		goto unlock;
+	}
+
+	page = compound_head(pte_page(entry));
+
+	/* We already queued this page with another high-granularity PTE. */
+	if (page == qp->last_page)
+		goto unlock;
+
 	if (!queue_pages_required(page, qp))
 		goto unlock;
 
@@ -605,6 +617,7 @@ static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
+		qp->last_page = page;
 		if (isolate_hugetlb(page, qp->pagelist) &&
 			(flags & MPOL_MF_STRICT))
 			/*
@@ -740,6 +753,7 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		.start = start,
 		.end = end,
 		.first = NULL,
+		.last_page = NULL,
 	};
 
 	err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp);
diff --git a/mm/migrate.c b/mm/migrate.c
index 8712b694c5a7..197662dd1dc0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -186,6 +186,9 @@ static bool remove_migration_pte(struct folio *folio,
 		/* pgoff is invalid for ksm pages, but they are never large */
 		if (folio_test_large(folio) && !folio_test_hugetlb(folio))
 			idx = linear_page_index(vma, pvmw.address) - pvmw.pgoff;
+		else if (folio_test_hugetlb(folio))
+			idx = (pvmw.address & ~huge_page_mask(hstate_vma(vma)))/
+				PAGE_SIZE;
 		new = folio_page(folio, idx);
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
@@ -235,14 +238,15 @@ static bool remove_migration_pte(struct folio *folio,
 
 #ifdef CONFIG_HUGETLB_PAGE
 		if (folio_test_hugetlb(folio)) {
+			struct page *hpage = folio_page(folio, 0);
 			unsigned int shift = pvmw.pte_order + PAGE_SHIFT;
 
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			if (folio_test_anon(folio))
-				hugepage_add_anon_rmap(new, vma, pvmw.address,
+				hugepage_add_anon_rmap(hpage, vma, pvmw.address,
 						       rmap_flags);
 			else
-				page_dup_file_rmap(new, true);
+				page_dup_file_rmap(hpage, true);
 			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		} else
 #endif
@@ -258,7 +262,7 @@ static bool remove_migration_pte(struct folio *folio,
 			mlock_page_drain_local();
 
 		trace_remove_migration_pte(pvmw.address, pte_val(pte),
-					   compound_order(new));
+					   pvmw.pte_order);
 
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -332,12 +336,12 @@ void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
 		migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
 }
 
-void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
+void migration_entry_wait_huge(struct vm_area_struct *vma,
+				struct hugetlb_pte *hpte)
 {
-	spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
-					   vma->vm_mm, pte);
+	spinlock_t *ptl = hugetlb_pte_lockptr(vma->vm_mm, hpte);
 
-	__migration_entry_wait_huge(pte, ptl);
+	__migration_entry_wait_huge(hpte->ptep, ptl);
 }
 #endif
 
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 30/47] hugetlb: add high-granularity check for hwpoison in fault path
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (28 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 29/47] hugetlb: add high-granularity migration support James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 31/47] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
                   ` (16 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Because hwpoison swap entries may be placed beneath the hstate-level
PTE, we need to check for them separately (on top of the hstate-level PTE
check that remains).

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8dba8d59ebe5..bb0005d57cab 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6154,6 +6154,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		migration_entry_wait_huge(vma, &hpte);
 		return 0;
+	} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
+		hugetlb_vma_unlock_read(vma);
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		return VM_FAULT_HWPOISON_LARGE |
+			VM_FAULT_SET_HINDEX(hstate_index(h));
 	}
 
 	/* PTE markers should be handled the same way as none pte */
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 31/47] hugetlb: sort hstates in hugetlb_init_hstates
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (29 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 30/47] hugetlb: add high-granularity check for hwpoison in fault path James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 32/47] hugetlb: add for_each_hgm_shift James Houghton
                   ` (15 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

When using HugeTLB high-granularity mapping, we need to go through the
supported hugepage sizes in decreasing order so that we pick the largest
size that works. Consider the case where we're faulting in a 1G hugepage
for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
a PUD. By going through the sizes in decreasing order, we will find that
PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.

This commit also changes bootmem hugepages from storing hstate pointers
directly to storing the hstate sizes. The hstate pointers recorded for
boot-time-allocated hugepages would refer to the wrong hstate once the
hstates have been sorted.
`gather_bootmem_prealloc`, called after the hstates have been sorted,
now converts the size to the correct hstate.
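
To make the ordering concrete, here is a minimal sketch (not part of
this patch; addr and end are assumed to bound the region being mapped)
of why largest-first ordering lets a walker simply stop at the first
size that fits:

	struct hstate *h;

	for_each_hstate(h) {
		unsigned long sz = huge_page_size(h);

		/*
		 * hstates are now sorted largest-first, so the first
		 * size that is aligned and fits is the largest usable
		 * one.
		 */
		if (IS_ALIGNED(addr, sz) && addr + sz <= end)
			break;
	}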

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb.c            | 49 ++++++++++++++++++++++++++++++++---------
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d305742e9d44..e25f97cdd086 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -772,7 +772,7 @@ struct hstate {
 
 struct huge_bootmem_page {
 	struct list_head list;
-	struct hstate *hstate;
+	unsigned long hstate_sz;
 };
 
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb0005d57cab..d6f07968156c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,6 +34,7 @@
 #include <linux/nospec.h>
 #include <linux/delayacct.h>
 #include <linux/memory.h>
+#include <linux/sort.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -49,6 +50,10 @@
 
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
+/*
+ * After hugetlb_init_hstates is called, hstates will be sorted from largest
+ * to smallest.
+ */
 struct hstate hstates[HUGE_MAX_HSTATE];
 
 #ifdef CONFIG_CMA
@@ -3189,7 +3194,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 	/* Put them into a private list first because mem_map is not up yet */
 	INIT_LIST_HEAD(&m->list);
 	list_add(&m->list, &huge_boot_pages);
-	m->hstate = h;
+	m->hstate_sz = huge_page_size(h);
 	return 1;
 }
 
@@ -3203,7 +3208,7 @@ static void __init gather_bootmem_prealloc(void)
 
 	list_for_each_entry(m, &huge_boot_pages, list) {
 		struct page *page = virt_to_page(m);
-		struct hstate *h = m->hstate;
+		struct hstate *h = size_to_hstate(m->hstate_sz);
 
 		VM_BUG_ON(!hstate_is_gigantic(h));
 		WARN_ON(page_count(page) != 1);
@@ -3319,9 +3324,38 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	kfree(node_alloc_noretry);
 }
 
+static int compare_hstates_decreasing(const void *a, const void *b)
+{
+	unsigned long sz_a = huge_page_size((const struct hstate *)a);
+	unsigned long sz_b = huge_page_size((const struct hstate *)b);
+
+	if (sz_a < sz_b)
+		return 1;
+	if (sz_a > sz_b)
+		return -1;
+	return 0;
+}
+
+static void sort_hstates(void)
+{
+	unsigned long default_hstate_sz = huge_page_size(&default_hstate);
+
+	/* Sort from largest to smallest. */
+	sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
+	     compare_hstates_decreasing, NULL);
+
+	/*
+	 * We may have changed the location of the default hstate, so we need to
+	 * update it.
+	 */
+	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
+}
+
 static void __init hugetlb_init_hstates(void)
 {
-	struct hstate *h, *h2;
+	struct hstate *h;
+
+	sort_hstates();
 
 	for_each_hstate(h) {
 		/* oversize hugepages were init'ed in early boot */
@@ -3340,13 +3374,8 @@ static void __init hugetlb_init_hstates(void)
 			continue;
 		if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
 			continue;
-		for_each_hstate(h2) {
-			if (h2 == h)
-				continue;
-			if (h2->order < h->order &&
-			    h2->order > h->demote_order)
-				h->demote_order = h2->order;
-		}
+		if (h - 1 >= &hstates[0])
+			h->demote_order = huge_page_order(h - 1);
 	}
 }
 
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 32/47] hugetlb: add for_each_hgm_shift
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (30 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 31/47] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM James Houghton
                   ` (14 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This is a helper macro to loop through all the usable page sizes for a
high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
loop, in descending order, through the page sizes that HugeTLB supports
for this architecture. It always includes PAGE_SIZE.

This is done by looping through the hstates; however, there is no
hstate for PAGE_SIZE. To handle this case, the loop intentionally goes
out of bounds, and the out-of-bounds pointer is mapped to PAGE_SIZE.
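
As a rough usage sketch (mirroring the hugetlb_alloc_largest_pte caller
added later in this series; start and end are assumed to bound the
region being mapped):

	struct hstate *h = hstate_vma(vma), *tmp_h;
	unsigned int shift;

	for_each_hgm_shift(h, tmp_h, shift) {
		unsigned long sz = 1UL << shift;

		/* Stop at the largest size that is aligned and fits. */
		if (IS_ALIGNED(start, sz) && start + sz <= end)
			break;
	}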

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d6f07968156c..6eaec40d66ad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7856,6 +7856,25 @@ int enable_hugetlb_hgm(struct vm_area_struct *vma)
 	hugetlb_unshare_all_pmds(vma);
 	return 0;
 }
+
+/* Should only be used by the for_each_hgm_shift macro. */
+static unsigned int __shift_for_hstate(struct hstate *h)
+{
+	/* If h is out of bounds, we have reached the end, so give PAGE_SIZE */
+	if (h >= &hstates[hugetlb_max_hstate])
+		return PAGE_SHIFT;
+	return huge_page_shift(h);
+}
+
+/*
+ * Intentionally go out of bounds. An out-of-bounds hstate will be converted to
+ * PAGE_SIZE.
+ */
+#define for_each_hgm_shift(hstate, tmp_h, shift) \
+	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
+			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
+			       (tmp_h)++)
+
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (31 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 32/47] hugetlb: add for_each_hgm_shift James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 22:28   ` Peter Xu
  2022-10-21 16:36 ` [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
                   ` (13 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Userspace must provide this new feature when it calls UFFDIO_API to
enable HGM. Userspace can check whether the feature is present in the
returned uffdio_api.features; if it is absent, the kernel does not
support HGM and therefore did not enable it.
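
For illustration, a rough sketch of how userspace might request and
probe for the feature (uffd is assumed to be a freshly opened
userfaultfd; UFFD_FEATURE_EXACT_ADDRESS is included because a later
patch in this series requires it alongside HGM):

	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_HUGETLBFS |
			    UFFD_FEATURE_MINOR_HUGETLBFS_HGM |
			    UFFD_FEATURE_EXACT_ADDRESS,
	};

	if (ioctl(uffd, UFFDIO_API, &api) ||
	    !(api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM)) {
		/* The kernel does not support (or did not enable) HGM. */
	}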

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/userfaultfd.c                 | 12 +++++++++++-
 include/linux/userfaultfd_k.h    |  7 +++++++
 include/uapi/linux/userfaultfd.h |  2 ++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..3a3e9ef74dab 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -226,6 +226,11 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
 	return msg;
 }
 
+bool uffd_ctx_has_hgm(struct vm_userfaultfd_ctx *ctx)
+{
+	return ctx->ctx->features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+}
+
 #ifdef CONFIG_HUGETLB_PAGE
 /*
  * Same functionality as userfaultfd_must_wait below with modifications for
@@ -1954,10 +1959,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 		goto err_out;
 	/* report all available features and ioctls to userland */
 	uffdio_api.features = UFFD_API_FEATURES;
+
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 	uffdio_api.features &=
 		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
-#endif
+#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+#endif  /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+#endif  /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
 #endif
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index f07e6998bb68..d8fa37f308f7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -162,6 +162,8 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 	    vma_is_shmem(vma);
 }
 
+extern bool uffd_ctx_has_hgm(struct vm_userfaultfd_ctx *);
+
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
@@ -228,6 +230,11 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool uffd_ctx_has_hgm(struct vm_userfaultfd_ctx *ctx)
+{
+	return false;
+}
+
 static inline int dup_userfaultfd(struct vm_area_struct *vma,
 				  struct list_head *l)
 {
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..ae8080003560 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -36,6 +36,7 @@
 			   UFFD_FEATURE_SIGBUS |		\
 			   UFFD_FEATURE_THREAD_ID |		\
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
+			   UFFD_FEATURE_MINOR_HUGETLBFS_HGM |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
@@ -217,6 +218,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_MINOR_HUGETLBFS_HGM	(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (32 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-17 16:58   ` Peter Xu
  2022-12-23 18:38   ` Peter Xu
  2022-10-21 16:36 ` [RFC PATCH v2 35/47] userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM James Houghton
                   ` (12 subsequent siblings)
  46 siblings, 2 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Changes here are similar to the changes made for hugetlb_no_page.

Pass vmf->real_address to userfaultfd_huge_must_wait because
vmf->address is rounded down to the hugepage size, and a
high-granularity page table walk would look up the wrong PTE. Also
change the call to userfaultfd_must_wait in the same way for
consistency.

This commit introduces hugetlb_alloc_largest_pte which is used to find
the appropriate PTE size to map pages with UFFDIO_CONTINUE.
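
As a usage sketch (not part of this patch), once HGM is enabled a minor
fault can be resolved one base page at a time; fault_addr and page_size
are assumed to come from the userfaultfd message and
sysconf(_SC_PAGESIZE) respectively:

	struct uffdio_continue cont = {
		.range = {
			.start = fault_addr & ~(page_size - 1),
			.len = page_size,
		},
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont)) {
		/* e.g. retry, or fall back to continuing the whole hugepage */
	}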

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/userfaultfd.c        | 33 +++++++++++++++---
 include/linux/hugetlb.h | 14 +++++++-
 mm/hugetlb.c            | 76 +++++++++++++++++++++++++++++++++--------
 mm/userfaultfd.c        | 46 +++++++++++++++++--------
 4 files changed, 135 insertions(+), 34 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3a3e9ef74dab..0204108e3882 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -245,14 +245,22 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 	struct mm_struct *mm = ctx->mm;
 	pte_t *ptep, pte;
 	bool ret = true;
+	struct hugetlb_pte hpte;
+	unsigned long sz = vma_mmu_pagesize(vma);
+	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
 	mmap_assert_locked(mm);
 
-	ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
+	ptep = huge_pte_offset(mm, address, sz);
 
 	if (!ptep)
 		goto out;
 
+	hugetlb_pte_populate(&hpte, ptep, shift, hpage_size_to_level(sz));
+	hugetlb_hgm_walk(mm, vma, &hpte, address, PAGE_SIZE,
+			/*stop_at_none=*/true);
+	ptep = hpte.ptep;
+
 	ret = false;
 	pte = huge_ptep_get(ptep);
 
@@ -498,6 +506,14 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
 
+	if (is_vm_hugetlb_page(vmf->vma) && hugetlb_hgm_enabled(vmf->vma))
+		/*
+		 * Lock the VMA lock so we can do a high-granularity walk in
+		 * userfaultfd_huge_must_wait. We have to grab this lock before
+		 * we set our state to blocking.
+		 */
+		hugetlb_vma_lock_read(vmf->vma);
+
 	spin_lock_irq(&ctx->fault_pending_wqh.lock);
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
@@ -513,12 +529,15 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
 	if (!is_vm_hugetlb_page(vmf->vma))
-		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
-						  reason);
+		must_wait = userfaultfd_must_wait(ctx, vmf->real_address,
+				vmf->flags, reason);
 	else
 		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
-						       vmf->address,
+						       vmf->real_address,
 						       vmf->flags, reason);
+
+	if (is_vm_hugetlb_page(vmf->vma) && hugetlb_hgm_enabled(vmf->vma))
+		hugetlb_vma_unlock_read(vmf->vma);
 	mmap_read_unlock(mm);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
@@ -1463,6 +1482,12 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 			mas_pause(&mas);
 		}
 	next:
+		if (is_vm_hugetlb_page(vma) && (ctx->features &
+					UFFD_FEATURE_MINOR_HUGETLBFS_HGM)) {
+			ret = enable_hugetlb_hgm(vma);
+			if (ret)
+				break;
+		}
 		/*
 		 * In the vma_merge() successful mprotect-like case 8:
 		 * the next vma was merged into the current one and
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e25f97cdd086..00c22a84a1c6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -250,7 +250,8 @@ unsigned long hugetlb_total_pages(void);
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
 #ifdef CONFIG_USERFAULTFD
-int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
+int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
+				struct hugetlb_pte *dst_hpte,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
 				unsigned long src_addr,
@@ -1272,6 +1273,9 @@ static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
 bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
 bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
 int enable_hugetlb_hgm(struct vm_area_struct *vma);
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end);
 #else
 static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
@@ -1285,6 +1289,14 @@ static inline int enable_hugetlb_hgm(struct vm_area_struct *vma)
 {
 	return -EINVAL;
 }
+
+static inline
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end)
+{
+	return -EINVAL;
+}
 #endif
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6eaec40d66ad..c25d3cd73ac9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6325,7 +6325,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  * modifications for huge pages.
  */
 int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
-			    pte_t *dst_pte,
+			    struct hugetlb_pte *dst_hpte,
 			    struct vm_area_struct *dst_vma,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
@@ -6336,13 +6336,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
 	struct hstate *h = hstate_vma(dst_vma);
 	struct address_space *mapping = dst_vma->vm_file->f_mapping;
-	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+	unsigned long haddr = dst_addr & huge_page_mask(h);
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, haddr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	pte_t _dst_pte;
 	spinlock_t *ptl;
 	int ret = -ENOMEM;
-	struct page *page;
+	struct page *page, *subpage;
 	int writable;
 	bool page_in_pagecache = false;
 
@@ -6357,12 +6358,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		 * a non-missing case. Return -EEXIST.
 		 */
 		if (vm_shared &&
-		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
 			ret = -EEXIST;
 			goto out;
 		}
 
-		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		page = alloc_huge_page(dst_vma, haddr, 0);
 		if (IS_ERR(page)) {
 			ret = -ENOMEM;
 			goto out;
@@ -6378,13 +6379,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			/* Free the allocated page which may have
 			 * consumed a reservation.
 			 */
-			restore_reserve_on_error(h, dst_vma, dst_addr, page);
+			restore_reserve_on_error(h, dst_vma, haddr, page);
 			put_page(page);
 
 			/* Allocate a temporary page to hold the copied
 			 * contents.
 			 */
-			page = alloc_huge_page_vma(h, dst_vma, dst_addr);
+			page = alloc_huge_page_vma(h, dst_vma, haddr);
 			if (!page) {
 				ret = -ENOMEM;
 				goto out;
@@ -6398,14 +6399,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		}
 	} else {
 		if (vm_shared &&
-		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
 			put_page(*pagep);
 			ret = -EEXIST;
 			*pagep = NULL;
 			goto out;
 		}
 
-		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		page = alloc_huge_page(dst_vma, haddr, 0);
 		if (IS_ERR(page)) {
 			put_page(*pagep);
 			ret = -ENOMEM;
@@ -6447,7 +6448,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		page_in_pagecache = true;
 	}
 
-	ptl = huge_pte_lock(h, dst_mm, dst_pte);
+	ptl = hugetlb_pte_lock(dst_mm, dst_hpte);
 
 	ret = -EIO;
 	if (PageHWPoison(page))
@@ -6459,7 +6460,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * page backing it, then access the page.
 	 */
 	ret = -EEXIST;
-	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
+	if (!huge_pte_none_mostly(huge_ptep_get(dst_hpte->ptep)))
 		goto out_release_unlock;
 
 	if (page_in_pagecache) {
@@ -6478,7 +6479,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
 
-	_dst_pte = make_huge_pte(dst_vma, page, writable);
+	subpage = hugetlb_find_subpage(h, page, dst_addr);
+	WARN_ON_ONCE(subpage != page && !hugetlb_hgm_enabled(dst_vma));
+
+	_dst_pte = make_huge_pte_with_shift(dst_vma, subpage, writable,
+			dst_hpte->shift);
 	/*
 	 * Always mark UFFDIO_COPY page dirty; note that this may not be
 	 * extremely important for hugetlbfs for now since swapping is not
@@ -6491,12 +6496,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (wp_copy)
 		_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
 
-	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);
 
-	hugetlb_count_add(pages_per_huge_page(h), dst_mm);
+	hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);
 
 	spin_unlock(ptl);
 	if (!is_continue)
@@ -7875,6 +7880,47 @@ static unsigned int __shift_for_hstate(struct hstate *h)
 			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
 			       (tmp_h)++)
 
+/*
+ * Allocate a HugeTLB PTE that maps as much of [start, end) as possible with a
+ * single page table entry. The allocated HugeTLB PTE is returned in @hpte.
+ */
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma), *tmp_h;
+	unsigned int shift;
+	unsigned long sz;
+	int ret;
+	pte_t *ptep;
+
+	for_each_hgm_shift(h, tmp_h, shift) {
+		sz = 1UL << shift;
+
+		if (!IS_ALIGNED(start, sz) || start + sz > end)
+			continue;
+		goto found;
+	}
+	return -EINVAL;
+found:
+	ptep = huge_pte_alloc(mm, vma, start, huge_page_size(h));
+	if (!ptep)
+		return -ENOMEM;
+
+	hugetlb_pte_populate(hpte, ptep, huge_page_shift(h),
+			hpage_size_to_level(huge_page_size(h)));
+
+	ret = hugetlb_hgm_walk(mm, vma, hpte, start, 1L << shift,
+			/*stop_at_none=*/false);
+	if (ret)
+		return ret;
+
+	if (hpte->shift > shift)
+		return -EEXIST;
+
+	return 0;
+}
+
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e24e8a47ce8a..c4a8e6666ea6 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -315,14 +315,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 {
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
-	pte_t *dst_pte;
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct page *page;
-	unsigned long vma_hpagesize;
+	unsigned long vma_hpagesize, target_pagesize;
 	pgoff_t idx;
 	u32 hash;
 	struct address_space *mapping;
+	bool use_hgm = uffd_ctx_has_hgm(&dst_vma->vm_userfaultfd_ctx) &&
+		mode == MCOPY_ATOMIC_CONTINUE;
+	struct hstate *h = hstate_vma(dst_vma);
 
 	/*
 	 * There is no default zero huge page for all huge page sizes as
@@ -340,12 +342,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	copied = 0;
 	page = NULL;
 	vma_hpagesize = vma_kernel_pagesize(dst_vma);
+	target_pagesize = use_hgm ? PAGE_SIZE : vma_hpagesize;
 
 	/*
-	 * Validate alignment based on huge page size
+	 * Validate alignment based on the targeted page size.
 	 */
 	err = -EINVAL;
-	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+	if (dst_start & (target_pagesize - 1) || len & (target_pagesize - 1))
 		goto out_unlock;
 
 retry:
@@ -362,6 +365,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		err = -EINVAL;
 		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
 			goto out_unlock;
+		if (use_hgm && !hugetlb_hgm_enabled(dst_vma))
+			goto out_unlock;
 
 		vm_shared = dst_vma->vm_flags & VM_SHARED;
 	}
@@ -376,13 +381,15 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	}
 
 	while (src_addr < src_start + len) {
+		struct hugetlb_pte hpte;
+		pte_t *dst_pte;
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
 		 * Serialize via vma_lock and hugetlb_fault_mutex.
-		 * vma_lock ensures the dst_pte remains valid even
-		 * in the case of shared pmds.  fault mutex prevents
-		 * races with other faulting threads.
+		 * vma_lock ensures the hpte.ptep remains valid even
+		 * in the case of shared pmds and page table collapsing.
+		 * fault mutex prevents races with other faulting threads.
 		 */
 		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
@@ -390,23 +397,33 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 		hugetlb_vma_lock_read(dst_vma);
 
-		err = -ENOMEM;
+		err = 0;
 		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
-		if (!dst_pte) {
+		if (!dst_pte)
+			err = -ENOMEM;
+		else {
+			hugetlb_pte_populate(&hpte, dst_pte, huge_page_shift(h),
+					hpage_size_to_level(huge_page_size(h)));
+			if (use_hgm)
+				err = hugetlb_alloc_largest_pte(&hpte,
+						dst_mm, dst_vma, dst_addr,
+						dst_start + len);
+		}
+		if (err) {
 			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out_unlock;
 		}
 
 		if (mode != MCOPY_ATOMIC_CONTINUE &&
-		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
+		    !huge_pte_none_mostly(huge_ptep_get(hpte.ptep))) {
 			err = -EEXIST;
 			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out_unlock;
 		}
 
-		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
+		err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
 					       dst_addr, src_addr, mode, &page,
 					       wp_copy);
 
@@ -418,6 +435,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		if (unlikely(err == -ENOENT)) {
 			mmap_read_unlock(dst_mm);
 			BUG_ON(!page);
+			BUG_ON(hpte.shift != huge_page_shift(h));
 
 			err = copy_huge_page_from_user(page,
 						(const void __user *)src_addr,
@@ -435,9 +453,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 			BUG_ON(page);
 
 		if (!err) {
-			dst_addr += vma_hpagesize;
-			src_addr += vma_hpagesize;
-			copied += vma_hpagesize;
+			dst_addr += hugetlb_pte_size(&hpte);
+			src_addr += hugetlb_pte_size(&hpte);
+			copied += hugetlb_pte_size(&hpte);
 
 			if (fatal_signal_pending(current))
 				err = -EINTR;
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 35/47] userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (33 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-12-22 21:47   ` Peter Xu
  2022-10-21 16:36 ` [RFC PATCH v2 36/47] hugetlb: add MADV_COLLAPSE for hugetlb James Houghton
                   ` (11 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

To avoid bugs in userspace, we require that userspace provide
UFFD_FEATURE_EXACT_ADDRESS when using UFFD_FEATURE_MINOR_HUGETLBFS_HGM,
otherwise UFFDIO_API will fail with EINVAL.

The potential confusion is this: without EXACT_ADDRESS, the address
given in the userfaultfd message will be rounded down to the hugepage
size. Userspace may think that, because it is using HGM, it can just
UFFDIO_CONTINUE the interval [address, address+PAGE_SIZE), but for
faults that didn't occur in the first base page of the hugepage, this
won't resolve the fault. The only choice it has in this scenario is to
UFFDIO_CONTINUE the interval [address, address+hugepage_size), which
negates the purpose of using HGM in the first place.
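
To make this concrete (the numbers here are made up): with a 2MiB
hugepage mapped at 0x40000000 and a fault touching 0x40003000, the
rounded message reports 0x40000000. A UFFDIO_CONTINUE of
[0x40000000, 0x40001000) maps only the first base page, so the faulting
access still isn't resolved; only continuing the full
[0x40000000, 0x40200000) range is guaranteed to make progress.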

By requiring userspace to provide UFFD_FEATURE_EXACT_ADDRESS, there is
no rounding, and userspace now has the information it needs to
appropriately resolve the fault.

Another potential solution here is to change the behavior when
UFFD_FEATURE_EXACT_ADDRESS is not provided: when HGM is enabled, start
rounding to PAGE_SIZE instead of to the hugepage size. I think requiring
UFFD_FEATURE_EXACT_ADDRESS is cleaner.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/userfaultfd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0204108e3882..c8f21f53e37d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1990,6 +1990,17 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
 #ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
 	uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+#else
+
+	ret = -EINVAL;
+	if ((uffdio_api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM) &&
+	    !(uffdio_api.features & UFFD_FEATURE_EXACT_ADDRESS))
+		/*
+		 * UFFD_FEATURE_MINOR_HUGETLBFS_HGM is mostly
+		 * useless without UFFD_FEATURE_EXACT_ADDRESS,
+		 * so require userspace to provide both.
+		 */
+		goto err_out;
 #endif  /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 #endif  /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 36/47] hugetlb: add MADV_COLLAPSE for hugetlb
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (34 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 35/47] userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 37/47] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
                   ` (10 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This is a necessary extension to the UFFDIO_CONTINUE changes. When
userspace finishes mapping an entire hugepage with UFFDIO_CONTINUE, the
kernel has no mechanism to automatically collapse the page table to map
the whole hugepage normally. We require userspace to inform us that it
would like the mapping to be collapsed; it does this with MADV_COLLAPSE.
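
For example, after UFFDIO_CONTINUE-ing every base page of a hugepage,
userspace might collapse the mapping with something like the following
(a sketch, not part of the patch; haddr and hpage_size are assumed to
describe the hugepage):

	if (madvise(haddr, hpage_size, MADV_COLLAPSE)) {
		/* e.g. EBUSY (pending migration) or EHWPOISON; see below. */
	}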

If userspace has not mapped all of a hugepage with UFFDIO_CONTINUE, but
only some, hugetlb_collapse will cause the requested range to be mapped
as if it were UFFDIO_CONTINUE'd already. The effects of any
UFFDIO_WRITEPROTECT calls may be undone by a call to MADV_COLLAPSE for
intersecting address ranges.

This commit is co-opting the same madvise mode that has been introduced
to synchronously collapse THPs. The function that does THP collapsing
has been renamed to madvise_collapse_thp.

As with the rest of the high-granularity mapping support, MADV_COLLAPSE
is only supported for shared VMAs right now.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/huge_mm.h |  12 ++--
 include/linux/hugetlb.h |   8 +++
 mm/hugetlb.c            | 142 ++++++++++++++++++++++++++++++++++++++++
 mm/khugepaged.c         |   4 +-
 mm/madvise.c            |  24 ++++++-
 5 files changed, 181 insertions(+), 9 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5d861905df46..fc2813db5e2e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -226,9 +226,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
-int madvise_collapse(struct vm_area_struct *vma,
-		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end);
+int madvise_collapse_thp(struct vm_area_struct *vma,
+			 struct vm_area_struct **prev,
+			 unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -373,9 +373,9 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	return -EINVAL;
 }
 
-static inline int madvise_collapse(struct vm_area_struct *vma,
-				   struct vm_area_struct **prev,
-				   unsigned long start, unsigned long end)
+static inline int madvise_collapse_thp(struct vm_area_struct *vma,
+				       struct vm_area_struct **prev,
+				       unsigned long start, unsigned long end)
 {
 	return -EINVAL;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 00c22a84a1c6..5378b98cc7b8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1276,6 +1276,8 @@ int enable_hugetlb_hgm(struct vm_area_struct *vma);
 int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end);
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end);
 #else
 static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
@@ -1297,6 +1299,12 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 {
 	return -EINVAL;
 }
+static inline
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end)
+{
+	return -EINVAL;
+}
 #endif
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c25d3cd73ac9..d80db81a1fa5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7921,6 +7921,148 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 	return 0;
 }
 
+/*
+ * Collapse the address range from @start to @end to be mapped optimally.
+ *
+ * This is only valid for shared mappings. The main use case for this function
+ * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
+ * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
+ * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
+ * it up to userspace to tell us to do so, via MADV_COLLAPSE.
+ *
+ * Any holes in the mapping will be filled. If there is no page in the
+ * pagecache for a region we're collapsing, the PTEs will be cleared.
+ *
+ * If high-granularity PTEs are uffd-wp markers, those markers will be dropped.
+ */
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long start, unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+	unsigned long curr = start;
+	int ret = 0;
+	struct page *hpage, *subpage;
+	pgoff_t idx;
+	bool writable = vma->vm_flags & VM_WRITE;
+	bool shared = vma->vm_flags & VM_SHARED;
+	struct hugetlb_pte hpte;
+	pte_t entry;
+
+	/*
+	 * This is only supported for shared VMAs, because we need to look up
+	 * the page to use for any PTEs we end up creating.
+	 */
+	if (!shared)
+		return -EINVAL;
+
+	if (!hugetlb_hgm_enabled(vma))
+		return 0;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
+				start, end);
+	mmu_notifier_invalidate_range_start(&range);
+	tlb_gather_mmu(&tlb, mm);
+
+	/*
+	 * Grab the VMA lock for writing. This will prevent concurrent
+	 * high-granularity page table walks, so that we can safely collapse
+	 * and free page tables.
+	 */
+	hugetlb_vma_lock_write(vma);
+
+	while (curr < end) {
+		ret = hugetlb_alloc_largest_pte(&hpte, mm, vma, curr, end);
+		if (ret)
+			goto out;
+
+		entry = huge_ptep_get(hpte.ptep);
+
+		/*
+		 * There is no work to do if the PTE doesn't point to page
+		 * tables.
+		 */
+		if (!pte_present(entry))
+			goto next_hpte;
+		if (hugetlb_pte_present_leaf(&hpte, entry))
+			goto next_hpte;
+
+		idx = vma_hugecache_offset(h, vma, curr);
+		hpage = find_get_page(mapping, idx);
+
+		if (hpage && !HPageMigratable(hpage)) {
+			/*
+			 * Don't collapse a mapping to a page that is pending
+			 * a migration. Migration swap entries may have been
+			 * placed in the page table.
+			 */
+			ret = -EBUSY;
+			put_page(hpage);
+			goto out;
+		}
+
+		if (hpage && PageHWPoison(hpage)) {
+			/*
+			 * Don't collapse a mapping to a page that is
+			 * hwpoisoned.
+			 */
+			ret = -EHWPOISON;
+			put_page(hpage);
+			/*
+			 * By setting ret to -EHWPOISON, if nothing else
+			 * happens, we will tell userspace that we couldn't
+			 * fully collapse everything due to poison.
+			 *
+			 * Skip this page, and continue to collapse the rest
+			 * of the mapping.
+			 */
+			curr = (curr & huge_page_mask(h)) + huge_page_size(h);
+			continue;
+		}
+
+		/*
+		 * Clear all the PTEs, and drop ref/mapcounts
+		 * (on tlb_finish_mmu).
+		 */
+		__unmap_hugepage_range(&tlb, vma, curr,
+			curr + hugetlb_pte_size(&hpte),
+			NULL,
+			ZAP_FLAG_DROP_MARKER);
+		/* Free the PTEs. */
+		hugetlb_free_pgd_range(&tlb,
+				curr, curr + hugetlb_pte_size(&hpte),
+				curr, curr + hugetlb_pte_size(&hpte));
+		if (!hpage) {
+			huge_pte_clear(mm, curr, hpte.ptep,
+					hugetlb_pte_size(&hpte));
+			goto next_hpte;
+		}
+
+		page_dup_file_rmap(hpage, true);
+
+		subpage = hugetlb_find_subpage(h, hpage, curr);
+		entry = make_huge_pte_with_shift(vma, subpage,
+						 writable, hpte.shift);
+		set_huge_pte_at(mm, curr, hpte.ptep, entry);
+next_hpte:
+		curr += hugetlb_pte_size(&hpte);
+
+		if (curr < end) {
+			/* Don't hold the VMA lock for too long. */
+			hugetlb_vma_unlock_write(vma);
+			cond_resched();
+			hugetlb_vma_lock_write(vma);
+		}
+	}
+out:
+	hugetlb_vma_unlock_write(vma);
+	tlb_finish_mmu(&tlb);
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4734315f7940..70796824e9d2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2555,8 +2555,8 @@ static int madvise_collapse_errno(enum scan_result r)
 	}
 }
 
-int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end)
+int madvise_collapse_thp(struct vm_area_struct *vma, struct vm_area_struct **prev,
+			 unsigned long start, unsigned long end)
 {
 	struct collapse_control *cc;
 	struct mm_struct *mm = vma->vm_mm;
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..6aed9bd68476 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -986,6 +986,24 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+static int madvise_collapse(struct vm_area_struct *vma,
+			    struct vm_area_struct **prev,
+			    unsigned long start, unsigned long end)
+{
+	/* Only allow collapsing for HGM-enabled, shared mappings. */
+	if (is_vm_hugetlb_page(vma)) {
+		*prev = vma;
+		if (!hugetlb_hgm_eligible(vma))
+			return -EINVAL;
+		if (!hugetlb_hgm_enabled(vma))
+			return 0;
+		return hugetlb_collapse(vma->vm_mm, vma, start, end);
+	}
+
+	return madvise_collapse_thp(vma, prev, start, end);
+
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1157,6 +1175,9 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+#endif
+#if defined(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) || \
+		defined(CONFIG_TRANSPARENT_HUGEPAGE)
 	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
@@ -1347,7 +1368,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
- *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP, or, for HugeTLB
+ *		pages, collapse the mapping.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 37/47] hugetlb: remove huge_pte_lock and huge_pte_lockptr
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (35 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 36/47] hugetlb: add MADV_COLLAPSE for hugetlb James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-11-16 20:16   ` Peter Xu
  2022-10-21 16:36 ` [RFC PATCH v2 38/47] hugetlb: replace make_huge_pte with make_huge_pte_with_shift James Houghton
                   ` (9 subsequent siblings)
  46 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

huge_pte_lock and huge_pte_lockptr are replaced with
hugetlb_pte_lock{,ptr}. The callers that haven't already been converted
are never reached when HGM is in use, so we handle them by populating
hugetlb_ptes with the standard, hstate-sized huge PTEs.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 28 +++-------------------------
 mm/hugetlb.c            | 15 ++++++++++-----
 2 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5378b98cc7b8..e6dc25b15403 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1015,14 +1015,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return modified_mask;
 }
 
-static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
-					   struct mm_struct *mm, pte_t *pte)
-{
-	if (shift == PMD_SHIFT)
-		return pmd_lockptr(mm, (pmd_t *) pte);
-	return &mm->page_table_lock;
-}
-
 #ifndef hugepages_supported
 /*
  * Some platform decide whether they support huge pages at boot
@@ -1226,12 +1218,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
-					   struct mm_struct *mm, pte_t *pte)
-{
-	return &mm->page_table_lock;
-}
-
 static inline void hugetlb_count_init(struct mm_struct *mm)
 {
 }
@@ -1307,16 +1293,6 @@ int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 #endif
 
-static inline spinlock_t *huge_pte_lock(struct hstate *h,
-					struct mm_struct *mm, pte_t *pte)
-{
-	spinlock_t *ptl;
-
-	ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
-	spin_lock(ptl);
-	return ptl;
-}
-
 static inline
 spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
 {
@@ -1324,7 +1300,9 @@ spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
 	BUG_ON(!hpte->ptep);
 	if (hpte->ptl)
 		return hpte->ptl;
-	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
+	if (hugetlb_pte_level(hpte) == HUGETLB_LEVEL_PMD)
+		return pmd_lockptr(mm, (pmd_t *) hpte->ptep);
+	return &mm->page_table_lock;
 }
 
 static inline
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d80db81a1fa5..9d4e41c41f78 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5164,9 +5164,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				put_page(hpage);
 
 				/* Install the new huge page if src pte stable */
-				dst_ptl = huge_pte_lock(h, dst, dst_pte);
-				src_ptl = huge_pte_lockptr(huge_page_shift(h),
-							   src, src_pte);
+				dst_ptl = hugetlb_pte_lock(dst, &dst_hpte);
+				src_ptl = hugetlb_pte_lockptr(src, &src_hpte);
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
@@ -7465,6 +7464,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *spte = NULL;
 	pte_t *pte;
 	spinlock_t *ptl;
+	struct hugetlb_pte hpte;
 
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
@@ -7485,7 +7485,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!spte)
 		goto out;
 
-	ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
+	hugetlb_pte_populate(&hpte, (pte_t *)pud, PUD_SHIFT, HUGETLB_LEVEL_PUD);
+	ptl = hugetlb_pte_lock(mm, &hpte);
 	if (pud_none(*pud)) {
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
@@ -8179,6 +8180,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 	unsigned long address, start, end;
 	spinlock_t *ptl;
 	pte_t *ptep;
+	struct hugetlb_pte hpte;
 
 	if (!(vma->vm_flags & VM_MAYSHARE))
 		return;
@@ -8203,7 +8205,10 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 		ptep = huge_pte_offset(mm, address, sz);
 		if (!ptep)
 			continue;
-		ptl = huge_pte_lock(h, mm, ptep);
+
+		hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+				hpage_size_to_level(sz));
+		ptl = hugetlb_pte_lock(mm, &hpte);
 		huge_pmd_unshare(mm, vma, address, ptep);
 		spin_unlock(ptl);
 	}
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 38/47] hugetlb: replace make_huge_pte with make_huge_pte_with_shift
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (36 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 37/47] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 39/47] mm: smaps: add stats for HugeTLB mapping size James Houghton
                   ` (8 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This removes the old definition of make_huge_pte and renames
make_huge_pte_with_shift to make_huge_pte, so the shift must now always
be given explicitly. All callsites are cleaned up.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d4e41c41f78..b26142bec4fe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4908,9 +4908,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.pagesize = hugetlb_vm_op_pagesize,
 };
 
-static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
-				      struct page *page, int writable,
-				      int shift)
+static pte_t make_huge_pte(struct vm_area_struct *vma,
+			   struct page *page, int writable,
+			   int shift)
 {
 	pte_t entry;
 
@@ -4926,14 +4926,6 @@ static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
 	return entry;
 }
 
-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
-			   int writable)
-{
-	unsigned int shift = huge_page_shift(hstate_vma(vma));
-
-	return make_huge_pte_with_shift(vma, page, writable, shift);
-}
-
 static void set_huge_ptep_writable(struct vm_area_struct *vma,
 				   unsigned long address, pte_t *ptep)
 {
@@ -4974,10 +4966,12 @@ static void
 hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
 		     struct page *new_page)
 {
+	struct hstate *h = hstate_vma(vma);
 	__SetPageUptodate(new_page);
 	hugepage_add_new_anon_rmap(new_page, vma, addr);
-	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
-	hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
+	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1,
+				huge_page_shift(h)));
+	hugetlb_count_add(pages_per_huge_page(h), vma->vm_mm);
 	ClearHPageRestoreReserve(new_page);
 	SetHPageMigratable(new_page);
 }
@@ -5737,7 +5731,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_remove_rmap(old_page, vma, true);
 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
 		set_huge_pte_at(mm, haddr, ptep,
-				make_huge_pte(vma, new_page, !unshare));
+				make_huge_pte(vma, new_page, !unshare,
+					huge_page_shift(h)));
 		SetHPageMigratable(new_page);
 		/* Make the old page be freed below */
 		new_page = old_page;
@@ -6033,7 +6028,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		page_dup_file_rmap(page, true);
 
 	subpage = hugetlb_find_subpage(h, page, haddr_hgm);
-	new_pte = make_huge_pte_with_shift(vma, subpage,
+	new_pte = make_huge_pte(vma, subpage,
 			((vma->vm_flags & VM_WRITE)
 			 && (vma->vm_flags & VM_SHARED)),
 			hpte->shift);
@@ -6481,8 +6476,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	subpage = hugetlb_find_subpage(h, page, dst_addr);
 	WARN_ON_ONCE(subpage != page && !hugetlb_hgm_enabled(dst_vma));
 
-	_dst_pte = make_huge_pte_with_shift(dst_vma, subpage, writable,
-			dst_hpte->shift);
+	_dst_pte = make_huge_pte(dst_vma, subpage, writable, dst_hpte->shift);
 	/*
 	 * Always mark UFFDIO_COPY page dirty; note that this may not be
 	 * extremely important for hugetlbfs for now since swapping is not
@@ -8044,8 +8038,7 @@ int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_dup_file_rmap(hpage, true);
 
 		subpage = hugetlb_find_subpage(h, hpage, curr);
-		entry = make_huge_pte_with_shift(vma, subpage,
-						 writable, hpte.shift);
+		entry = make_huge_pte(vma, subpage, writable, hpte.shift);
 		set_huge_pte_at(mm, curr, hpte.ptep, entry);
 next_hpte:
 		curr += hugetlb_pte_size(&hpte);
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 39/47] mm: smaps: add stats for HugeTLB mapping size
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (37 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 38/47] hugetlb: replace make_huge_pte with make_huge_pte_with_shift James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 40/47] hugetlb: x86: enable high-granularity mapping James Houghton
                   ` (7 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

When the kernel is compiled with HUGETLB_HIGH_GRANULARITY_MAPPING,
smaps may provide HugetlbPudMapped, HugetlbPmdMapped, and
HugetlbPteMapped. Page table levels that are folded are not reported.
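
With HGM in use, the new counters appear next to the existing HugeTLB
fields in /proc/<pid>/smaps, for example (values purely illustrative):

    Shared_Hugetlb:       2048 kB
    Private_Hugetlb:         0 kB
    HugetlbPudMapped:        0 kB
    HugetlbPmdMapped:     2048 kB
    HugetlbPteMapped:        0 kB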

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/proc/task_mmu.c | 101 +++++++++++++++++++++++++++++++++------------
 1 file changed, 75 insertions(+), 26 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index be78cdb7677e..16288d6dbf1d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -405,6 +405,15 @@ struct mem_size_stats {
 	unsigned long swap;
 	unsigned long shared_hugetlb;
 	unsigned long private_hugetlb;
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+#ifndef __PAGETABLE_PUD_FOLDED
+	unsigned long hugetlb_pud_mapped;
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+	unsigned long hugetlb_pmd_mapped;
+#endif
+	unsigned long hugetlb_pte_mapped;
+#endif
 	u64 pss;
 	u64 pss_anon;
 	u64 pss_file;
@@ -720,6 +729,35 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
+
+static void smaps_hugetlb_hgm_account(struct mem_size_stats *mss,
+		struct hugetlb_pte *hpte)
+{
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	unsigned long size = hugetlb_pte_size(hpte);
+
+	switch (hpte->level) {
+#ifndef __PAGETABLE_PUD_FOLDED
+	case HUGETLB_LEVEL_PUD:
+		mss->hugetlb_pud_mapped += size;
+		break;
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+	case HUGETLB_LEVEL_PMD:
+		mss->hugetlb_pmd_mapped += size;
+		break;
+#endif
+	case HUGETLB_LEVEL_PTE:
+		mss->hugetlb_pte_mapped += size;
+		break;
+	default:
+		break;
+	}
+#else
+	return;
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+}
+
 static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
 				unsigned long addr,
 				struct mm_walk *walk)
@@ -753,6 +791,8 @@ static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
 			mss->shared_hugetlb += hugetlb_pte_size(hpte);
 		else
 			mss->private_hugetlb += hugetlb_pte_size(hpte);
+
+		smaps_hugetlb_hgm_account(mss, hpte);
 	}
 	return 0;
 }
@@ -822,38 +862,47 @@ static void smap_gather_stats(struct vm_area_struct *vma,
 static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
 	bool rollup_mode)
 {
-	SEQ_PUT_DEC("Rss:            ", mss->resident);
-	SEQ_PUT_DEC(" kB\nPss:            ", mss->pss >> PSS_SHIFT);
-	SEQ_PUT_DEC(" kB\nPss_Dirty:      ", mss->pss_dirty >> PSS_SHIFT);
+	SEQ_PUT_DEC("Rss:              ", mss->resident);
+	SEQ_PUT_DEC(" kB\nPss:              ", mss->pss >> PSS_SHIFT);
+	SEQ_PUT_DEC(" kB\nPss_Dirty:        ", mss->pss_dirty >> PSS_SHIFT);
 	if (rollup_mode) {
 		/*
 		 * These are meaningful only for smaps_rollup, otherwise two of
 		 * them are zero, and the other one is the same as Pss.
 		 */
-		SEQ_PUT_DEC(" kB\nPss_Anon:       ",
+		SEQ_PUT_DEC(" kB\nPss_Anon:         ",
 			mss->pss_anon >> PSS_SHIFT);
-		SEQ_PUT_DEC(" kB\nPss_File:       ",
+		SEQ_PUT_DEC(" kB\nPss_File:         ",
 			mss->pss_file >> PSS_SHIFT);
-		SEQ_PUT_DEC(" kB\nPss_Shmem:      ",
+		SEQ_PUT_DEC(" kB\nPss_Shmem:        ",
 			mss->pss_shmem >> PSS_SHIFT);
 	}
-	SEQ_PUT_DEC(" kB\nShared_Clean:   ", mss->shared_clean);
-	SEQ_PUT_DEC(" kB\nShared_Dirty:   ", mss->shared_dirty);
-	SEQ_PUT_DEC(" kB\nPrivate_Clean:  ", mss->private_clean);
-	SEQ_PUT_DEC(" kB\nPrivate_Dirty:  ", mss->private_dirty);
-	SEQ_PUT_DEC(" kB\nReferenced:     ", mss->referenced);
-	SEQ_PUT_DEC(" kB\nAnonymous:      ", mss->anonymous);
-	SEQ_PUT_DEC(" kB\nLazyFree:       ", mss->lazyfree);
-	SEQ_PUT_DEC(" kB\nAnonHugePages:  ", mss->anonymous_thp);
-	SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
-	SEQ_PUT_DEC(" kB\nFilePmdMapped:  ", mss->file_thp);
-	SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
-	seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ",
+	SEQ_PUT_DEC(" kB\nShared_Clean:     ", mss->shared_clean);
+	SEQ_PUT_DEC(" kB\nShared_Dirty:     ", mss->shared_dirty);
+	SEQ_PUT_DEC(" kB\nPrivate_Clean:    ", mss->private_clean);
+	SEQ_PUT_DEC(" kB\nPrivate_Dirty:    ", mss->private_dirty);
+	SEQ_PUT_DEC(" kB\nReferenced:       ", mss->referenced);
+	SEQ_PUT_DEC(" kB\nAnonymous:        ", mss->anonymous);
+	SEQ_PUT_DEC(" kB\nLazyFree:         ", mss->lazyfree);
+	SEQ_PUT_DEC(" kB\nAnonHugePages:    ", mss->anonymous_thp);
+	SEQ_PUT_DEC(" kB\nShmemPmdMapped:   ", mss->shmem_thp);
+	SEQ_PUT_DEC(" kB\nFilePmdMapped:    ", mss->file_thp);
+	SEQ_PUT_DEC(" kB\nShared_Hugetlb:   ", mss->shared_hugetlb);
+	seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb:   ",
 				  mss->private_hugetlb >> 10, 7);
-	SEQ_PUT_DEC(" kB\nSwap:           ", mss->swap);
-	SEQ_PUT_DEC(" kB\nSwapPss:        ",
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+#ifndef __PAGETABLE_PUD_FOLDED
+	SEQ_PUT_DEC(" kB\nHugetlbPudMapped: ", mss->hugetlb_pud_mapped);
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+	SEQ_PUT_DEC(" kB\nHugetlbPmdMapped: ", mss->hugetlb_pmd_mapped);
+#endif
+	SEQ_PUT_DEC(" kB\nHugetlbPteMapped: ", mss->hugetlb_pte_mapped);
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+	SEQ_PUT_DEC(" kB\nSwap:             ", mss->swap);
+	SEQ_PUT_DEC(" kB\nSwapPss:          ",
 					mss->swap_pss >> PSS_SHIFT);
-	SEQ_PUT_DEC(" kB\nLocked:         ",
+	SEQ_PUT_DEC(" kB\nLocked:           ",
 					mss->pss_locked >> PSS_SHIFT);
 	seq_puts(m, " kB\n");
 }
@@ -869,18 +918,18 @@ static int show_smap(struct seq_file *m, void *v)
 
 	show_map_vma(m, vma);
 
-	SEQ_PUT_DEC("Size:           ", vma->vm_end - vma->vm_start);
-	SEQ_PUT_DEC(" kB\nKernelPageSize: ", vma_kernel_pagesize(vma));
-	SEQ_PUT_DEC(" kB\nMMUPageSize:    ", vma_mmu_pagesize(vma));
+	SEQ_PUT_DEC("Size:             ", vma->vm_end - vma->vm_start);
+	SEQ_PUT_DEC(" kB\nKernelPageSize:   ", vma_kernel_pagesize(vma));
+	SEQ_PUT_DEC(" kB\nMMUPageSize:      ", vma_mmu_pagesize(vma));
 	seq_puts(m, " kB\n");
 
 	__show_smap(m, &mss, false);
 
-	seq_printf(m, "THPeligible:    %d\n",
+	seq_printf(m, "THPeligible:      %d\n",
 		   hugepage_vma_check(vma, vma->vm_flags, true, false, true));
 
 	if (arch_pkeys_enabled())
-		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
+		seq_printf(m, "ProtectionKey:    %8u\n", vma_pkey(vma));
 	show_smap_vma_flags(m, vma);
 
 	return 0;
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 40/47] hugetlb: x86: enable high-granularity mapping
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (38 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 39/47] mm: smaps: add stats for HugeTLB mapping size James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 41/47] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info James Houghton
                   ` (6 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Now that HGM is fully supported for GENERAL_HUGETLB, x86 can enable it.
The x86 KVM MMU already properly handles HugeTLB HGM pages (it does a
page table walk to determine which size to use in the second-stage page
table instead of, for example, checking vma_mmu_pagesize, like arm64
does).

We could also enable HugeTLB HGM for arm (32-bit) at this point, as it
also uses GENERAL_HUGETLB and I don't see anything else that is needed
for it. However, I haven't tested on arm at all, so I won't enable it.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6d1879ef933a..6d7103266e61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -124,6 +124,7 @@ config X86
 	select ARCH_WANT_GENERAL_HUGETLB
 	select ARCH_WANT_HUGE_PMD_SHARE
 	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP	if X86_64
+	select ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 41/47] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (39 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 40/47] hugetlb: x86: enable high-granularity mapping James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 42/47] docs: proc: include information about HugeTLB HGM James Houghton
                   ` (5 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This includes information about how UFFD_FEATURE_MINOR_HUGETLBFS_HGM
should be used and when MADV_COLLAPSE should be used with it.
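
To make the intended flow concrete, here is a minimal userspace sketch
(error handling omitted; uffd, fault_addr, page_size, hugepage_start and
hugepage_size are assumed to already exist, the range is assumed to have
been registered with UFFDIO_REGISTER_MODE_MINOR, and the HGM feature bit
is only available with this series applied; needs <linux/userfaultfd.h>,
<sys/ioctl.h> and <sys/mman.h>):

	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_HUGETLBFS |
			    UFFD_FEATURE_MINOR_HUGETLBFS_HGM |
			    UFFD_FEATURE_EXACT_ADDRESS,
	};
	ioctl(uffd, UFFDIO_API, &api);

	/* Resolve a minor fault by mapping a single small page. */
	struct uffdio_continue cont = {
		.range = {
			.start = fault_addr & ~(page_size - 1),
			.len = page_size,
		},
	};
	ioctl(uffd, UFFDIO_CONTINUE, &cont);

	/* Once the hugepage is fully populated, let the kernel collapse it. */
	madvise(hugepage_start, hugepage_size, MADV_COLLAPSE);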

Signed-off-by: James Houghton <jthoughton@google.com>
---
 Documentation/admin-guide/mm/hugetlbpage.rst |  4 ++++
 Documentation/admin-guide/mm/userfaultfd.rst | 16 +++++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 19f27c0d92e0..ca7db15ae768 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -454,6 +454,10 @@ errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
 not hugepage aligned.  For example, munmap(2) will fail if memory is backed by
 a hugetlb page and the length is smaller than the hugepage size.
 
+It is possible for users to map HugeTLB pages at a higher granularity than
+normal using HugeTLB high-granularity mapping (HGM). For example, when using 1G
+pages on x86, a user could map that page with 4K PTEs, 2M PMDs, or a combination
+of the two. See Documentation/admin-guide/mm/userfaultfd.rst.
 
 Examples
 ========
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 83f31919ebb3..19877aaad61b 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -115,6 +115,14 @@ events, except page fault notifications, may be generated:
   areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
   support for shmem virtual memory areas.
 
+- ``UFFD_FEATURE_MINOR_HUGETLBFS_HGM`` indicates that the kernel supports
+  small-page-aligned regions for ``UFFDIO_CONTINUE`` in HugeTLB-backed
+  virtual memory areas. ``UFFD_FEATURE_MINOR_HUGETLBFS_HGM`` and
+  ``UFFD_FEATURE_EXACT_ADDRESS`` must both be specified explicitly to enable
+  this behavior. If ``UFFD_FEATURE_MINOR_HUGETLBFS_HGM`` is specified but
+  ``UFFD_FEATURE_EXACT_ADDRESS`` is not, then ``UFFDIO_API`` will fail with
+  ``EINVAL``.
+
 The userland application should set the feature flags it intends to use
 when invoking the ``UFFDIO_API`` ioctl, to request that those features be
 enabled if supported.
@@ -169,7 +177,13 @@ like to do to resolve it:
   the page cache). Userspace has the option of modifying the page's
   contents before resolving the fault. Once the contents are correct
   (modified or not), userspace asks the kernel to map the page and let the
-  faulting thread continue with ``UFFDIO_CONTINUE``.
+  faulting thread continue with ``UFFDIO_CONTINUE``. If this is done at the
+  base-page size in a transparent-hugepage-eligible VMA or in a HugeTLB VMA
+  (requires ``UFFD_FEATURE_MINOR_HUGETLBFS_HGM``), then userspace may want to
+  use ``MADV_COLLAPSE`` when a hugepage is fully populated to inform the kernel
+  that it may be able to collapse the mapping. ``MADV_COLLAPSE`` may undo
+  the effect of any ``UFFDIO_WRITEPROTECT`` calls on the collapsed address
+  range.
 
 Notes:
 
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 42/47] docs: proc: include information about HugeTLB HGM
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (40 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 41/47] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:36 ` [RFC PATCH v2 43/47] selftests/vm: add HugeTLB HGM to userfaultfd selftest James Houghton
                   ` (4 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This includes the updates that have been made to smaps, specifically
the addition of Hugetlb[Pud,Pmd,Pte]Mapped.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 Documentation/filesystems/proc.rst | 56 +++++++++++++++++-------------
 1 file changed, 32 insertions(+), 24 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index ec6cfdf1796a..807d6c0694c2 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -444,29 +444,32 @@ Memory Area, or VMA) there is a series of lines such as the following::
 
     08048000-080bc000 r-xp 00000000 03:02 13130      /bin/bash
 
-    Size:               1084 kB
-    KernelPageSize:        4 kB
-    MMUPageSize:           4 kB
-    Rss:                 892 kB
-    Pss:                 374 kB
-    Pss_Dirty:             0 kB
-    Shared_Clean:        892 kB
-    Shared_Dirty:          0 kB
-    Private_Clean:         0 kB
-    Private_Dirty:         0 kB
-    Referenced:          892 kB
-    Anonymous:             0 kB
-    LazyFree:              0 kB
-    AnonHugePages:         0 kB
-    ShmemPmdMapped:        0 kB
-    Shared_Hugetlb:        0 kB
-    Private_Hugetlb:       0 kB
-    Swap:                  0 kB
-    SwapPss:               0 kB
-    KernelPageSize:        4 kB
-    MMUPageSize:           4 kB
-    Locked:                0 kB
-    THPeligible:           0
+    Size:                 1084 kB
+    KernelPageSize:          4 kB
+    MMUPageSize:             4 kB
+    Rss:                   892 kB
+    Pss:                   374 kB
+    Pss_Dirty:               0 kB
+    Shared_Clean:          892 kB
+    Shared_Dirty:            0 kB
+    Private_Clean:           0 kB
+    Private_Dirty:           0 kB
+    Referenced:            892 kB
+    Anonymous:               0 kB
+    LazyFree:                0 kB
+    AnonHugePages:           0 kB
+    ShmemPmdMapped:          0 kB
+    Shared_Hugetlb:          0 kB
+    Private_Hugetlb:         0 kB
+    HugetlbPudMapped:        0 kB
+    HugetlbPmdMapped:        0 kB
+    HugetlbPteMapped:        0 kB
+    Swap:                    0 kB
+    SwapPss:                 0 kB
+    KernelPageSize:          4 kB
+    MMUPageSize:             4 kB
+    Locked:                  0 kB
+    THPeligible:             0
     VmFlags: rd ex mr mw me dw
 
 The first of these lines shows the same information as is displayed for the
@@ -507,10 +510,15 @@ implementation. If this is not desirable please file a bug report.
 "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
 huge pages.
 
-"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
+"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
 hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
 reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
 
+If the kernel was compiled with ``CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING``,
+"HugetlbPudMapped", "HugetlbPmdMapped", and "HugetlbPteMapped" will appear and
+show the amount of HugeTLB memory mapped with PUDs, PMDs, and PTEs respectively.
+See Documentation/admin-guide/mm/hugetlbpage.rst.
+
 "Swap" shows how much would-be-anonymous memory is also used, but out on swap.
 
 For shmem mappings, "Swap" includes also the size of the mapped (and not
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 43/47] selftests/vm: add HugeTLB HGM to userfaultfd selftest
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (41 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 42/47] docs: proc: include information about HugeTLB HGM James Houghton
@ 2022-10-21 16:36 ` James Houghton
  2022-10-21 16:37 ` [RFC PATCH v2 44/47] selftests/kvm: add HugeTLB HGM to KVM demand paging selftest James Houghton
                   ` (3 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:36 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This test case behaves similarly to the regular shared HugeTLB
configuration, except that it uses 4K pages instead of hugepages and
skips the UFFDIO_COPY tests, as UFFDIO_CONTINUE is the only ioctl that
supports PAGE_SIZE-aligned regions.

This doesn't test MADV_COLLAPSE; tests added later in the series
exercise it.
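
As with the existing hugetlb modes, the new case is selected on the
command line, e.g. "./userfaultfd hugetlb_shared_hgm 128 32" (the MiB
and bounce counts here are only examples).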

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 90 +++++++++++++++++++-----
 1 file changed, 74 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 7f22844ed704..c9cdfb20f292 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -73,9 +73,10 @@ static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size, hpage_size;
 #define BOUNCE_POLL		(1<<3)
 static int bounces;
 
-#define TEST_ANON	1
-#define TEST_HUGETLB	2
-#define TEST_SHMEM	3
+#define TEST_ANON		1
+#define TEST_HUGETLB		2
+#define TEST_HUGETLB_HGM	3
+#define TEST_SHMEM		4
 static int test_type;
 
 #define UFFD_FLAGS	(O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY)
@@ -93,6 +94,8 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
+static bool test_uffdio_copy = true;
+
 static bool map_shared;
 static int mem_fd;
 static unsigned long long *count_verify;
@@ -151,7 +154,7 @@ static void usage(void)
 	fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
 		"[hugetlbfs_file]\n\n");
 	fprintf(stderr, "Supported <test type>: anon, hugetlb, "
-		"hugetlb_shared, shmem\n\n");
+		"hugetlb_shared, hugetlb_shared_hgm, shmem\n\n");
 	fprintf(stderr, "'Test mods' can be joined to the test type string with a ':'. "
 		"Supported mods:\n");
 	fprintf(stderr, "\tsyscall - Use userfaultfd(2) (default)\n");
@@ -167,6 +170,11 @@ static void usage(void)
 	exit(1);
 }
 
+static bool test_is_hugetlb(void)
+{
+	return test_type == TEST_HUGETLB || test_type == TEST_HUGETLB_HGM;
+}
+
 #define _err(fmt, ...)						\
 	do {							\
 		int ret = errno;				\
@@ -381,8 +389,12 @@ static struct uffd_test_ops *uffd_test_ops;
 
 static inline uint64_t uffd_minor_feature(void)
 {
-	if (test_type == TEST_HUGETLB && map_shared)
-		return UFFD_FEATURE_MINOR_HUGETLBFS;
+	if (test_is_hugetlb() && map_shared)
+		return UFFD_FEATURE_MINOR_HUGETLBFS |
+			(test_type == TEST_HUGETLB_HGM
+			 ? (UFFD_FEATURE_MINOR_HUGETLBFS_HGM |
+				 UFFD_FEATURE_EXACT_ADDRESS)
+			 : 0);
 	else if (test_type == TEST_SHMEM)
 		return UFFD_FEATURE_MINOR_SHMEM;
 	else
@@ -393,7 +405,7 @@ static uint64_t get_expected_ioctls(uint64_t mode)
 {
 	uint64_t ioctls = UFFD_API_RANGE_IOCTLS;
 
-	if (test_type == TEST_HUGETLB)
+	if (test_is_hugetlb())
 		ioctls &= ~(1 << _UFFDIO_ZEROPAGE);
 
 	if (!((mode & UFFDIO_REGISTER_MODE_WP) && test_uffdio_wp))
@@ -500,13 +512,16 @@ static void uffd_test_ctx_clear(void)
 static void uffd_test_ctx_init(uint64_t features)
 {
 	unsigned long nr, cpu;
+	uint64_t enabled_features = features;
 
 	uffd_test_ctx_clear();
 
 	uffd_test_ops->allocate_area((void **)&area_src, true);
 	uffd_test_ops->allocate_area((void **)&area_dst, false);
 
-	userfaultfd_open(&features);
+	userfaultfd_open(&enabled_features);
+	if ((enabled_features & features) != features)
+		err("couldn't enable all features");
 
 	count_verify = malloc(nr_pages * sizeof(unsigned long long));
 	if (!count_verify)
@@ -726,13 +741,21 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 				   struct uffd_stats *stats)
 {
 	unsigned long offset;
+	unsigned long address;
 
 	if (msg->event != UFFD_EVENT_PAGEFAULT)
 		err("unexpected msg event %u", msg->event);
 
+	/*
+	 * Round down address to nearest page_size.
+	 * We do this manually because we specified UFFD_FEATURE_EXACT_ADDRESS
+	 * to support UFFD_FEATURE_MINOR_HUGETLBFS_HGM.
+	 */
+	address = msg->arg.pagefault.address & ~(page_size - 1);
+
 	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
 		/* Write protect page faults */
-		wp_range(uffd, msg->arg.pagefault.address, page_size, false);
+		wp_range(uffd, address, page_size, false);
 		stats->wp_faults++;
 	} else if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
 		uint8_t *area;
@@ -751,11 +774,10 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 		 */
 
 		area = (uint8_t *)(area_dst +
-				   ((char *)msg->arg.pagefault.address -
-				    area_dst_alias));
+				   ((char *)address - area_dst_alias));
 		for (b = 0; b < page_size; ++b)
 			area[b] = ~area[b];
-		continue_range(uffd, msg->arg.pagefault.address, page_size);
+		continue_range(uffd, address, page_size);
 		stats->minor_faults++;
 	} else {
 		/*
@@ -782,7 +804,7 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 		if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
 			err("unexpected write fault");
 
-		offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+		offset = (char *)address - area_dst;
 		offset &= ~(page_size-1);
 
 		if (copy_page(uffd, offset))
@@ -1192,6 +1214,12 @@ static int userfaultfd_events_test(void)
 	char c;
 	struct uffd_stats stats = { 0 };
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd events test "
+			"(test_uffdio_copy=false)\n");
+		return 0;
+	}
+
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
 
@@ -1245,6 +1273,12 @@ static int userfaultfd_sig_test(void)
 	char c;
 	struct uffd_stats stats = { 0 };
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd signal test "
+			"(test_uffdio_copy=false)\n");
+		return 0;
+	}
+
 	printf("testing signal delivery: ");
 	fflush(stdout);
 
@@ -1538,6 +1572,12 @@ static int userfaultfd_stress(void)
 	pthread_attr_init(&attr);
 	pthread_attr_setstacksize(&attr, 16*1024*1024);
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd stress test "
+			"(test_uffdio_copy=false)\n");
+		bounces = 0;
+	}
+
 	while (bounces--) {
 		printf("bounces: %d, mode:", bounces);
 		if (bounces & BOUNCE_RANDOM)
@@ -1696,6 +1736,16 @@ static void set_test_type(const char *type)
 		uffd_test_ops = &hugetlb_uffd_test_ops;
 		/* Minor faults require shared hugetlb; only enable here. */
 		test_uffdio_minor = true;
+	} else if (!strcmp(type, "hugetlb_shared_hgm")) {
+		map_shared = true;
+		test_type = TEST_HUGETLB_HGM;
+		uffd_test_ops = &hugetlb_uffd_test_ops;
+		/*
+		 * HugeTLB HGM only changes UFFDIO_CONTINUE, so don't test
+		 * UFFDIO_COPY.
+		 */
+		test_uffdio_minor = true;
+		test_uffdio_copy = false;
 	} else if (!strcmp(type, "shmem")) {
 		map_shared = true;
 		test_type = TEST_SHMEM;
@@ -1731,6 +1781,7 @@ static void parse_test_type_arg(const char *raw_type)
 		err("Unsupported test: %s", raw_type);
 
 	if (test_type == TEST_HUGETLB)
+		/* TEST_HUGETLB_HGM gets small pages. */
 		page_size = hpage_size;
 	else
 		page_size = sysconf(_SC_PAGE_SIZE);
@@ -1813,22 +1864,29 @@ int main(int argc, char **argv)
 		nr_cpus = x < y ? x : y;
 	}
 	nr_pages_per_cpu = bytes / page_size / nr_cpus;
+	if (test_type == TEST_HUGETLB_HGM)
+		/*
+		 * `page_size` refers to the page_size we can use in
+		 * UFFDIO_CONTINUE. We still need nr_pages to be appropriately
+		 * aligned, so align it here.
+		 */
+		nr_pages_per_cpu -= nr_pages_per_cpu % (hpage_size / page_size);
 	if (!nr_pages_per_cpu) {
 		_err("invalid MiB");
 		usage();
 	}
+	nr_pages = nr_pages_per_cpu * nr_cpus;
 
 	bounces = atoi(argv[3]);
 	if (bounces <= 0) {
 		_err("invalid bounces");
 		usage();
 	}
-	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_SHMEM || test_type == TEST_HUGETLB) {
+	if (test_type == TEST_SHMEM || test_is_hugetlb()) {
 		unsigned int memfd_flags = 0;
 
-		if (test_type == TEST_HUGETLB)
+		if (test_is_hugetlb())
 			memfd_flags = MFD_HUGETLB;
 		mem_fd = memfd_create(argv[0], memfd_flags);
 		if (mem_fd < 0)
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 44/47] selftests/kvm: add HugeTLB HGM to KVM demand paging selftest
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (42 preceding siblings ...)
  2022-10-21 16:36 ` [RFC PATCH v2 43/47] selftests/vm: add HugeTLB HGM to userfaultfd selftest James Houghton
@ 2022-10-21 16:37 ` James Houghton
  2022-10-21 16:37 ` [RFC PATCH v2 45/47] selftests/vm: add anon and shared hugetlb to migration test James Houghton
                   ` (2 subsequent siblings)
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:37 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This test exercises the GUP paths for HGM. MADV_COLLAPSE is not tested.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 20 ++++++++++++++++---
 .../testing/selftests/kvm/include/test_util.h |  2 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    |  2 +-
 tools/testing/selftests/kvm/lib/test_util.c   | 14 +++++++++++++
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 779ae54f89c4..67ca8703c6b7 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -76,6 +76,12 @@ static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t addr)
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 
+	/*
+	 * We're using UFFD_FEATURE_EXACT_ADDRESS, so round down the address.
+	 * This is needed to support HugeTLB high-granularity mapping.
+	 */
+	addr &= ~(demand_paging_size - 1);
+
 	if (uffd_mode == UFFDIO_REGISTER_MODE_MISSING) {
 		struct uffdio_copy copy;
 
@@ -214,7 +220,8 @@ static void setup_demand_paging(struct kvm_vm *vm,
 				pthread_t *uffd_handler_thread, int pipefd,
 				int uffd_mode, useconds_t uffd_delay,
 				struct uffd_handler_args *uffd_args,
-				void *hva, void *alias, uint64_t len)
+				void *hva, void *alias, uint64_t len,
+				enum vm_mem_backing_src_type src_type)
 {
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
 	int uffd;
@@ -244,9 +251,15 @@ static void setup_demand_paging(struct kvm_vm *vm,
 	TEST_ASSERT(uffd >= 0, __KVM_SYSCALL_ERROR("userfaultfd()", uffd));
 
 	uffdio_api.api = UFFD_API;
-	uffdio_api.features = 0;
+	uffdio_api.features = is_minor
+		? UFFD_FEATURE_EXACT_ADDRESS | UFFD_FEATURE_MINOR_HUGETLBFS_HGM
+		: 0;
 	ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
 	TEST_ASSERT(ret != -1, __KVM_SYSCALL_ERROR("UFFDIO_API", ret));
+	if (src_type == VM_MEM_SRC_SHARED_HUGETLB_HGM)
+		TEST_ASSERT(uffdio_api.features &
+			    UFFD_FEATURE_MINOR_HUGETLBFS_HGM,
+			    "UFFD_FEATURE_MINOR_HUGETLBFS_HGM not present");
 
 	uffdio_register.range.start = (uint64_t)hva;
 	uffdio_register.range.len = len;
@@ -329,7 +342,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 					    pipefds[i * 2], p->uffd_mode,
 					    p->uffd_delay, &uffd_args[i],
 					    vcpu_hva, vcpu_alias,
-					    vcpu_args->pages * perf_test_args.guest_page_size);
+					    vcpu_args->pages * perf_test_args.guest_page_size,
+					    p->src_type);
 		}
 	}
 
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index befc754ce9b3..0410326dbc18 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -96,6 +96,7 @@ enum vm_mem_backing_src_type {
 	VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
 	VM_MEM_SRC_SHMEM,
 	VM_MEM_SRC_SHARED_HUGETLB,
+	VM_MEM_SRC_SHARED_HUGETLB_HGM,
 	NUM_SRC_TYPES,
 };
 
@@ -114,6 +115,7 @@ size_t get_def_hugetlb_pagesz(void);
 const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
 size_t get_backing_src_pagesz(uint32_t i);
 bool is_backing_src_hugetlb(uint32_t i);
+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type);
 void backing_src_help(const char *flag);
 enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
 long get_run_delay(void);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index f1cb1627161f..7d769a117e14 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -896,7 +896,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	region->fd = -1;
 	if (backing_src_is_shared(src_type))
 		region->fd = kvm_memfd_alloc(region->mmap_size,
-					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
+				is_backing_src_shared_hugetlb(src_type));
 
 	region->mmap_start = mmap(NULL, region->mmap_size,
 				  PROT_READ | PROT_WRITE,
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 6d23878bbfe1..710dc42077fe 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -254,6 +254,13 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
 			 */
 			.flag = MAP_SHARED,
 		},
+		[VM_MEM_SRC_SHARED_HUGETLB_HGM] = {
+			/*
+			 * Identical to shared_hugetlb except for the name.
+			 */
+			.name = "shared_hugetlb_hgm",
+			.flag = MAP_SHARED,
+		},
 	};
 	_Static_assert(ARRAY_SIZE(aliases) == NUM_SRC_TYPES,
 		       "Missing new backing src types?");
@@ -272,6 +279,7 @@ size_t get_backing_src_pagesz(uint32_t i)
 	switch (i) {
 	case VM_MEM_SRC_ANONYMOUS:
 	case VM_MEM_SRC_SHMEM:
+	case VM_MEM_SRC_SHARED_HUGETLB_HGM:
 		return getpagesize();
 	case VM_MEM_SRC_ANONYMOUS_THP:
 		return get_trans_hugepagesz();
@@ -288,6 +296,12 @@ bool is_backing_src_hugetlb(uint32_t i)
 	return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
 }
 
+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type)
+{
+	return src_type == VM_MEM_SRC_SHARED_HUGETLB ||
+		src_type == VM_MEM_SRC_SHARED_HUGETLB_HGM;
+}
+
 static void print_available_backing_src_types(const char *prefix)
 {
 	int i;
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 45/47] selftests/vm: add anon and shared hugetlb to migration test
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (43 preceding siblings ...)
  2022-10-21 16:37 ` [RFC PATCH v2 44/47] selftests/kvm: add HugeTLB HGM to KVM demand paging selftest James Houghton
@ 2022-10-21 16:37 ` James Houghton
  2022-10-21 16:37 ` [RFC PATCH v2 46/47] selftests/vm: add hugetlb HGM test to migration selftest James Houghton
  2022-10-21 16:37 ` [RFC PATCH v2 47/47] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:37 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

Shared HugeTLB mappings are migrated best-effort. Sometimes, when the
VMA lock cannot be taken for writing, migration may spuriously fail. To
allow for that, we allow retries.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/vm/migration.c | 83 ++++++++++++++++++++++++--
 1 file changed, 79 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/migration.c b/tools/testing/selftests/vm/migration.c
index 1cec8425e3ca..21577a84d7e4 100644
--- a/tools/testing/selftests/vm/migration.c
+++ b/tools/testing/selftests/vm/migration.c
@@ -13,6 +13,7 @@
 #include <sys/types.h>
 #include <signal.h>
 #include <time.h>
+#include <sys/statfs.h>
 
 #define TWOMEG (2<<20)
 #define RUNTIME (60)
@@ -59,11 +60,12 @@ FIXTURE_TEARDOWN(migration)
 	free(self->pids);
 }
 
-int migrate(uint64_t *ptr, int n1, int n2)
+int migrate(uint64_t *ptr, int n1, int n2, int retries)
 {
 	int ret, tmp;
 	int status = 0;
 	struct timespec ts1, ts2;
+	int failed = 0;
 
 	if (clock_gettime(CLOCK_MONOTONIC, &ts1))
 		return -1;
@@ -78,6 +80,9 @@ int migrate(uint64_t *ptr, int n1, int n2)
 		ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
 				MPOL_MF_MOVE_ALL);
 		if (ret) {
+			if (++failed < retries)
+				continue;
+
 			if (ret > 0)
 				printf("Didn't migrate %d pages\n", ret);
 			else
@@ -88,6 +93,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
 		tmp = n2;
 		n2 = n1;
 		n1 = tmp;
+		failed = 0;
 	}
 
 	return 0;
@@ -128,7 +134,7 @@ TEST_F_TIMEOUT(migration, private_anon, 2*RUNTIME)
 		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
 			perror("Couldn't create thread");
 
-	ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
 	for (i = 0; i < self->nthreads - 1; i++)
 		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
 }
@@ -158,7 +164,7 @@ TEST_F_TIMEOUT(migration, shared_anon, 2*RUNTIME)
 			self->pids[i] = pid;
 	}
 
-	ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
 	for (i = 0; i < self->nthreads - 1; i++)
 		ASSERT_EQ(kill(self->pids[i], SIGTERM), 0);
 }
@@ -185,9 +191,78 @@ TEST_F_TIMEOUT(migration, private_anon_thp, 2*RUNTIME)
 		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
 			perror("Couldn't create thread");
 
-	ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
+	for (i = 0; i < self->nthreads - 1; i++)
+		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+}
+
+/*
+ * Tests the anon hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, private_anon_hugetlb, 2*RUNTIME)
+{
+	uint64_t *ptr;
+	int i;
+
+	if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+		SKIP(return, "Not enough threads or NUMA nodes available");
+
+	ptr = mmap(NULL, TWOMEG, PROT_READ | PROT_WRITE,
+		MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+	if (ptr == MAP_FAILED)
+		SKIP(return, "Could not allocate hugetlb pages");
+
+	memset(ptr, 0xde, TWOMEG);
+	for (i = 0; i < self->nthreads - 1; i++)
+		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+			perror("Couldn't create thread");
+
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
 	for (i = 0; i < self->nthreads - 1; i++)
 		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
 }
 
+/*
+ * Tests the shared hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME)
+{
+	uint64_t *ptr;
+	int i;
+	int fd;
+	unsigned long sz;
+	struct statfs filestat;
+
+	if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+		SKIP(return, "Not enough threads or NUMA nodes available");
+
+	fd = memfd_create("tmp_hugetlb", MFD_HUGETLB);
+	if (fd < 0)
+		SKIP(return, "Couldn't create hugetlb memfd");
+
+	if (fstatfs(fd, &filestat) < 0)
+		SKIP(return, "Couldn't fstatfs hugetlb file");
+
+	sz = filestat.f_bsize;
+
+	if (ftruncate(fd, sz))
+		SKIP(return, "Couldn't allocate hugetlb pages");
+	ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (ptr == MAP_FAILED)
+		SKIP(return, "Could not map hugetlb pages");
+
+	memset(ptr, 0xde, sz);
+	for (i = 0; i < self->nthreads - 1; i++)
+		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+			perror("Couldn't create thread");
+
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0);
+	for (i = 0; i < self->nthreads - 1; i++) {
+		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+		pthread_join(self->threads[i], NULL);
+	}
+	ftruncate(fd, 0);
+	close(fd);
+}
+
 TEST_HARNESS_MAIN
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 46/47] selftests/vm: add hugetlb HGM test to migration selftest
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (44 preceding siblings ...)
  2022-10-21 16:37 ` [RFC PATCH v2 45/47] selftests/vm: add anon and shared hugetlb to migration test James Houghton
@ 2022-10-21 16:37 ` James Houghton
  2022-10-21 16:37 ` [RFC PATCH v2 47/47] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:37 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This is mostly the same as the shared HugeTLB case, but instead of
mapping the page with a regular page fault, we map it with many small
UFFDIO_CONTINUE operations. We also verify that the contents haven't
changed after the migration, as they would have if the post-migration
PTEs pointed to the wrong page.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/vm/migration.c | 139 +++++++++++++++++++++++++
 1 file changed, 139 insertions(+)

diff --git a/tools/testing/selftests/vm/migration.c b/tools/testing/selftests/vm/migration.c
index 21577a84d7e4..89cb5934f139 100644
--- a/tools/testing/selftests/vm/migration.c
+++ b/tools/testing/selftests/vm/migration.c
@@ -14,6 +14,11 @@
 #include <signal.h>
 #include <time.h>
 #include <sys/statfs.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <sys/syscall.h>
+#include <fcntl.h>
 
 #define TWOMEG (2<<20)
 #define RUNTIME (60)
@@ -265,4 +270,138 @@ TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME)
 	close(fd);
 }
 
+#ifdef __NR_userfaultfd
+static int map_at_high_granularity(char *mem, size_t length)
+{
+	int i;
+	int ret;
+	int uffd = syscall(__NR_userfaultfd, 0);
+	struct uffdio_api api;
+	struct uffdio_register reg;
+	int pagesize = getpagesize();
+
+	if (uffd < 0) {
+		perror("couldn't create uffd");
+		return uffd;
+	}
+
+	api.api = UFFD_API;
+	api.features = UFFD_FEATURE_MISSING_HUGETLBFS
+		| UFFD_FEATURE_MINOR_HUGETLBFS
+		| UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+
+	ret = ioctl(uffd, UFFDIO_API, &api);
+	if (ret || api.api != UFFD_API) {
+		perror("UFFDIO_API failed");
+		goto out;
+	}
+
+	reg.range.start = (unsigned long)mem;
+	reg.range.len = length;
+
+	reg.mode = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_MINOR;
+
+	ret = ioctl(uffd, UFFDIO_REGISTER, &reg);
+	if (ret) {
+		perror("UFFDIO_REGISTER failed");
+		goto out;
+	}
+
+	/* UFFDIO_CONTINUE each 4K segment of the 2M page. */
+	for (i = 0; i < length/pagesize; ++i) {
+		struct uffdio_continue cont;
+
+		cont.range.start = (unsigned long long)mem + i * pagesize;
+		cont.range.len = pagesize;
+		cont.mode = 0;
+		ret = ioctl(uffd, UFFDIO_CONTINUE, &cont);
+		if (ret) {
+			fprintf(stderr, "UFFDIO_CONTINUE failed "
+					"for %llx -> %llx: %d\n",
+					cont.range.start,
+					cont.range.start + cont.range.len,
+					errno);
+			goto out;
+		}
+	}
+	ret = 0;
+out:
+	close(uffd);
+	return ret;
+}
+#else
+static int map_at_high_granularity(char *mem, size_t length)
+{
+	fprintf(stderr, "Userfaultfd missing\n");
+	return -1;
+}
+#endif /* __NR_userfaultfd */
+
+/*
+ * Tests the high-granularity hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, shared_hugetlb_hgm, 2*RUNTIME)
+{
+	uint64_t *ptr;
+	int i;
+	int fd;
+	unsigned long sz;
+	struct statfs filestat;
+
+	if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+		SKIP(return, "Not enough threads or NUMA nodes available");
+
+	fd = memfd_create("tmp_hugetlb", MFD_HUGETLB);
+	if (fd < 0)
+		SKIP(return, "Couldn't create hugetlb memfd");
+
+	if (fstatfs(fd, &filestat) < 0)
+		SKIP(return, "Couldn't fstatfs hugetlb file");
+
+	sz = filestat.f_bsize;
+
+	if (ftruncate(fd, sz))
+		SKIP(return, "Couldn't allocate hugetlb pages");
+
+	if (fallocate(fd, 0, 0, sz) < 0) {
+		perror("fallocate failed");
+		SKIP(return, "fallocate failed");
+	}
+
+	ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (ptr == MAP_FAILED)
+		SKIP(return, "Could not allocate hugetlb pages");
+
+	/*
+	 * We have to map_at_high_granularity before we memset, otherwise
+	 * memset will map everything at the hugepage size.
+	 */
+	if (map_at_high_granularity((char *)ptr, sz) < 0)
+		SKIP(return, "Could not map HugeTLB range at high granularity");
+
+	/* Populate the page we're migrating. */
+	for (i = 0; i < sz/sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	for (i = 0; i < self->nthreads - 1; i++)
+		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+			perror("Couldn't create thread");
+
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0);
+	for (i = 0; i < self->nthreads - 1; i++) {
+		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+		pthread_join(self->threads[i], NULL);
+	}
+
+	/* Check that the contents didn't change. */
+	for (i = 0; i < sz/sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], i);
+		if (ptr[i] != i)
+			break;
+	}
+
+	ftruncate(fd, 0);
+	close(fd);
+}
+
 TEST_HARNESS_MAIN
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* [RFC PATCH v2 47/47] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests
  2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (45 preceding siblings ...)
  2022-10-21 16:37 ` [RFC PATCH v2 46/47] selftests/vm: add hugetlb HGM test to migration selftest James Houghton
@ 2022-10-21 16:37 ` James Houghton
  46 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-10-21 16:37 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel, James Houghton

This tests that high-granularity CONTINUEs at all sizes work
(exercising contiguous PTE sizes for arm64, when support is added). It
also tests that collapse and hwpoison work correctly (although we
aren't yet testing high-granularity poison).

Signed-off-by: James Houghton <jthoughton@google.com>
---
 tools/testing/selftests/vm/Makefile      |   1 +
 tools/testing/selftests/vm/hugetlb-hgm.c | 326 +++++++++++++++++++++++
 2 files changed, 327 insertions(+)
 create mode 100644 tools/testing/selftests/vm/hugetlb-hgm.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 00920cb8b499..da1e01a5ac9b 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -32,6 +32,7 @@ TEST_GEN_FILES += compaction_test
 TEST_GEN_FILES += gup_test
 TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugetlb-madvise
+TEST_GEN_FILES += hugetlb-hgm
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/vm/hugetlb-hgm.c b/tools/testing/selftests/vm/hugetlb-hgm.c
new file mode 100644
index 000000000000..e36a1c988bb4
--- /dev/null
+++ b/tools/testing/selftests/vm/hugetlb-hgm.c
@@ -0,0 +1,326 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test uncommon cases in HugeTLB high-granularity mapping:
+ *  1. Test all supported high-granularity page sizes (with MADV_COLLAPSE).
+ *  2. Test MADV_HWPOISON behavior.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/poll.h>
+#include <stdint.h>
+#include <string.h>
+
+#include <linux/userfaultfd.h>
+#include <linux/magic.h>
+#include <sys/mman.h>
+#include <sys/statfs.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <pthread.h>
+
+#define PAGE_MASK ~(4096 - 1)
+
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
+
+#define PREFIX " ... "
+
+int userfaultfd(int flags)
+{
+	return syscall(__NR_userfaultfd, flags);
+}
+
+int map_range(int uffd, char *addr, uint64_t length)
+{
+	struct uffdio_continue cont = {
+		.range = (struct uffdio_range) {
+			.start = (uint64_t)addr,
+			.len = length,
+		},
+		.mode = 0,
+		.mapped = 0,
+	};
+
+	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0) {
+		perror("UFFDIO_CONTINUE failed");
+		return -1;
+	}
+	return 0;
+}
+
+int check_equal(char *mapping, size_t length, char value)
+{
+	size_t i;
+
+	for (i = 0; i < length; ++i)
+		if (mapping[i] != value) {
+			printf("mismatch at %p (%d != %d)\n", &mapping[i],
+					mapping[i], value);
+			return -1;
+		}
+
+	return 0;
+}
+
+int test_continues(int uffd, char *primary_map, char *secondary_map, size_t len,
+		   bool verify)
+{
+	size_t offset = 0;
+	unsigned char iter = 0;
+	unsigned long pagesize = getpagesize();
+	uint64_t size;
+
+	for (size = len/2; size >= pagesize;
+			offset += size, size /= 2) {
+		iter++;
+		memset(secondary_map + offset, iter, size);
+		printf(PREFIX "UFFDIO_CONTINUE: %p -> %p = %d%s\n",
+				primary_map + offset,
+				primary_map + offset + size,
+				iter,
+				verify ? " (and verify)" : "");
+		if (map_range(uffd, primary_map + offset, size))
+			return -1;
+		if (verify && check_equal(primary_map + offset, size, iter))
+			return -1;
+	}
+	return 0;
+}
+
+int test_collapse(char *primary_map, size_t len, bool hwpoison)
+{
+	size_t offset;
+	int i;
+	uint64_t size;
+
+	printf(PREFIX "collapsing %p -> %p\n", primary_map, primary_map + len);
+	if (madvise(primary_map, len, MADV_COLLAPSE) < 0) {
+		if (errno == EHWPOISON && hwpoison) {
+			/* this is expected for the hwpoison test. */
+			printf(PREFIX "could not collapse due to poison\n");
+			return 0;
+		}
+		perror("collapse failed");
+		return -1;
+	}
+
+	printf(PREFIX "verifying %p -> %p\n", primary_map, primary_map + len);
+
+	offset = 0;
+	i = 0;
+	for (size = len/2; size > 4096; offset += size, size /= 2) {
+		if (check_equal(primary_map + offset, size, ++i))
+			return -1;
+	}
+	/* expect the last 4K to be zero. */
+	if (check_equal(primary_map + len - 4096, 4096, 0))
+		return -1;
+
+	return 0;
+}
+
+static void *poisoned_addr;
+
+void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+	if (info->si_code != BUS_MCEERR_AR)
+		goto kill;
+	poisoned_addr = info->si_addr;
+kill:
+	pthread_exit(NULL);
+}
+
+void *access_mem(void *addr)
+{
+	volatile char *ptr = addr;
+
+	*ptr;
+	return NULL;
+}
+
+int test_poison_sigbus(char *addr)
+{
+	int ret = 0;
+	pthread_t pthread;
+
+	poisoned_addr = (void *)0xBADBADBAD;
+	ret = pthread_create(&pthread, NULL, &access_mem, addr);
+	if (ret) {
+		printf("failed to create thread: %s\n", strerror(ret));
+		return ret;
+	}
+
+	pthread_join(pthread, NULL);
+	if (poisoned_addr != addr) {
+		printf("got incorrect poisoned address: %p vs %p\n",
+				poisoned_addr, addr);
+		return -1;
+	}
+	return 0;
+}
+
+int test_hwpoison(char *primary_map, size_t len)
+{
+	const unsigned long pagesize = getpagesize();
+	const int num_poison_checks = 512;
+	unsigned long bytes_per_check = len/num_poison_checks;
+	struct sigaction new, old;
+	int i;
+
+	printf(PREFIX "poisoning %p -> %p\n", primary_map, primary_map + len);
+	if (madvise(primary_map, len, MADV_HWPOISON) < 0) {
+		perror("MADV_HWPOISON failed");
+		return -1;
+	}
+
+	printf(PREFIX "checking that it was poisoned "
+	       "(%d addresses within %p -> %p)\n",
+	       num_poison_checks, primary_map, primary_map + len);
+
+	new.sa_sigaction = &sigbus_handler;
+	new.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGBUS, &new, &old) < 0) {
+		perror("could not setup SIGBUS handler");
+		return -1;
+	}
+
+	if (pagesize > bytes_per_check)
+		bytes_per_check = pagesize;
+
+	for (i = 0; i < len; i += bytes_per_check)
+		if (test_poison_sigbus(primary_map + i) < 0)
+			return -1;
+	/* check very last byte, because we left it unmapped */
+	if (test_poison_sigbus(primary_map + len - 1))
+		return -1;
+
+	return 0;
+}
+
+int test_hgm(int fd, size_t hugepagesize, size_t len, bool hwpoison)
+{
+	int ret = 0;
+	int uffd;
+	char *primary_map, *secondary_map;
+	struct uffdio_api api;
+	struct uffdio_register reg;
+
+	if (ftruncate(fd, len) < 0) {
+		perror("ftruncate failed");
+		return -1;
+	}
+
+	uffd = userfaultfd(O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		perror("uffd not created");
+		return -1;
+	}
+
+	primary_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (primary_map == MAP_FAILED) {
+		perror("mmap for primary mapping failed");
+		ret = -1;
+		goto close_uffd;
+	}
+	secondary_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (secondary_map == MAP_FAILED) {
+		perror("mmap for secondary mapping failed");
+		ret = -1;
+		goto unmap_primary;
+	}
+
+	printf(PREFIX "primary mapping: %p\n", primary_map);
+	printf(PREFIX "secondary mapping: %p\n", secondary_map);
+
+	api.api = UFFD_API;
+	api.features = UFFD_FEATURE_MINOR_HUGETLBFS |
+		UFFD_FEATURE_MISSING_HUGETLBFS |
+		UFFD_FEATURE_MINOR_HUGETLBFS_HGM | UFFD_FEATURE_SIGBUS |
+		UFFD_FEATURE_EXACT_ADDRESS;
+	if (ioctl(uffd, UFFDIO_API, &api) == -1) {
+		perror("UFFDIO_API failed");
+		ret = -1;
+		goto out;
+	}
+	if (!(api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM)) {
+		puts("UFFD_FEATURE_MINOR_HUGETLBFS_HGM not present");
+		ret = -1;
+		goto out;
+	}
+
+	reg.range.start = (unsigned long)primary_map;
+	reg.range.len = len;
+	reg.mode = UFFDIO_REGISTER_MODE_MINOR | UFFDIO_REGISTER_MODE_MISSING;
+	reg.ioctls = 0;
+	if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
+		perror("register failed");
+		ret = -1;
+		goto out;
+	}
+
+	if (test_continues(uffd, primary_map, secondary_map, len, !hwpoison)
+		|| (hwpoison && test_hwpoison(primary_map, len))
+		|| test_collapse(primary_map, len, hwpoison)) {
+		ret = -1;
+	}
+
+	if (ftruncate(fd, 0) < 0) {
+		perror("ftruncate back to 0 failed");
+		ret = -1;
+	}
+
+out:
+	munmap(secondary_map, len);
+unmap_primary:
+	munmap(primary_map, len);
+close_uffd:
+	close(uffd);
+	return ret;
+}
+
+int main(void)
+{
+	int fd;
+	struct statfs file_stat;
+	size_t hugepagesize;
+	size_t len;
+
+	fd = memfd_create("hugetlb_tmp", MFD_HUGETLB);
+	if (fd < 0) {
+		perror("could not open hugetlbfs file");
+		return -1;
+	}
+
+	memset(&file_stat, 0, sizeof(file_stat));
+	if (fstatfs(fd, &file_stat)) {
+		perror("fstatfs failed");
+		goto close;
+	}
+	if (file_stat.f_type != HUGETLBFS_MAGIC) {
+		printf("not hugetlbfs file\n");
+		goto close;
+	}
+
+	hugepagesize = file_stat.f_bsize;
+	len = 2 * hugepagesize;
+	printf("HGM regular test...\n");
+	printf("HGM regular test:  %s\n",
+			test_hgm(fd, hugepagesize, len, false)
+			? "FAILED" : "PASSED");
+	printf("HGM hwpoison test...\n");
+	printf("HGM hwpoison test: %s\n",
+			test_hgm(fd, hugepagesize, len, true)
+			? "FAILED" : "PASSED");
+close:
+	close(fd);
+
+	return 0;
+}
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2022-10-21 16:36 ` [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
@ 2022-11-16 16:30   ` Peter Xu
  2022-11-21 18:33     ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-11-16 16:30 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:17PM +0000, James Houghton wrote:
> This is how it should have been to begin with. It would be very bad if
> we actually set PageUptodate with a UFFDIO_CONTINUE, as UFFDIO_CONTINUE
> doesn't actually set/update the contents of the page, so we would be
> exposing a non-zeroed page to the user.
> 
> The reason this change is being made now is because UFFDIO_CONTINUEs on
> subpages definitely shouldn't set this page flag on the head page.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1a7dc7b2e16c..650761cdd2f6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6097,7 +6097,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	 * preceding stores to the page contents become visible before
>  	 * the set_pte_at() write.
>  	 */
> -	__SetPageUptodate(page);
> +	if (!is_continue)
> +		__SetPageUptodate(page);
> +	else
> +		VM_WARN_ON_ONCE_PAGE(!PageUptodate(page), page);

Yeah the old code looks wrong, I'm just wondering whether we can 100%
guarantee this for hugetlb.  E.g. for shmem that won't hold when we
uffd-continue on an unused page (e.g. one left by an over-sized fallocate()).

Another, safer approach is to simply fail the ioctl if !uptodate, but if
you're certain then WARN_ON_ONCE sounds all good too.  At least I did have a
quick look at hugetlb fallocate() and pages will be uptodate immediately.
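
Just to illustrate the "fail the ioctl" option, a rough sketch (the error
value and the unwind label are placeholders; they'd need to match the
existing cleanup in hugetlb_mcopy_atomic_pte()):

	if (!is_continue)
		__SetPageUptodate(page);
	else if (!PageUptodate(page)) {
		/* Reject UFFDIO_CONTINUE on a page whose contents were
		 * never marked initialized. */
		ret = -EFAULT;
		goto out;
	}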

>  
>  	/* Add shared, newly allocated pages to the page cache. */
>  	if (vm_shared && !is_continue) {
> -- 
> 2.38.0.135.g90850a2211-goog
> 
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused
  2022-10-21 16:36 ` [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused James Houghton
@ 2022-11-16 16:35   ` Peter Xu
  2022-12-07 23:13   ` Mina Almasry
  2022-12-08 23:42   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2022-11-16 16:35 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:18PM +0000, James Houghton wrote:
> mk_huge_pte is unused and not necessary. pte_mkhuge is the appropriate
> function to call to create a HugeTLB PTE (see
> Documentation/mm/arch_pgtable_helpers.rst).
> 
> It is being removed now to avoid complicating the implementation of
> HugeTLB high-granularity mapping.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path
  2022-10-21 16:36 ` [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
@ 2022-11-16 16:36   ` Peter Xu
  2022-12-07 23:16   ` Mina Almasry
  2022-12-09  0:10   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2022-11-16 16:36 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:19PM +0000, James Houghton wrote:
> arch_make_huge_pte, which is called immediately following pte_mkhuge,
> already makes the necessary changes to the PTE that pte_mkhuge would
> have. The generic implementation of arch_make_huge_pte simply calls
> pte_mkhuge.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 04/47] hugetlb: only adjust address ranges when VMAs want PMD sharing
  2022-10-21 16:36 ` [RFC PATCH v2 04/47] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
@ 2022-11-16 16:50   ` Peter Xu
  2022-12-09  0:22   ` Mike Kravetz
  1 sibling, 0 replies; 122+ messages in thread
From: Peter Xu @ 2022-11-16 16:50 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:20PM +0000, James Houghton wrote:
> Currently this check is overly aggressive. For some userfaultfd VMAs,
> VMA sharing is disabled, yet we still widen the address range, which is
> used for flushing TLBs and sending MMU notifiers.
> 
> This is done now, as HGM VMAs also have sharing disabled, yet would
> still have flush ranges adjusted. Overaggressively flushing TLBs and
> triggering MMU notifiers is particularly harmful with lots of
> high-granularity operations.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason
  2022-10-21 16:36 ` [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason James Houghton
@ 2022-11-16 17:08   ` Peter Xu
  2022-11-21 18:11     ` James Houghton
  2022-12-07 23:33   ` Mina Almasry
  2022-12-09 22:36   ` Mike Kravetz
  2 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-11-16 17:08 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:21PM +0000, James Houghton wrote:
> Currently hugetlb_vma_lock_alloc doesn't return anything, as there is no
> need: if it fails, PMD sharing won't be enabled. However, HGM requires
> that the VMA lock exists, so we need to verify that
> hugetlb_vma_lock_alloc actually succeeded. If hugetlb_vma_lock_alloc
> fails, then we can pass that up to the caller that is attempting to
> enable HGM.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 52cec5b0789e..dc82256b89dd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -92,7 +92,7 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
>  /* Forward declaration */
>  static int hugetlb_acct_memory(struct hstate *h, long delta);
>  static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
> -static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
> +static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
>  static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
>  
>  static inline bool subpool_is_free(struct hugepage_subpool *spool)
> @@ -7001,17 +7001,17 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>  	}
>  }
>  
> -static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> +static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
>  {
>  	struct hugetlb_vma_lock *vma_lock;
>  
>  	/* Only establish in (flags) sharable vmas */
>  	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> -		return;
> +		return -EINVAL;
>  
> -	/* Should never get here with non-NULL vm_private_data */
> +	/* We've already allocated the lock. */
>  	if (vma->vm_private_data)
> -		return;
> +		return 0;

No objection on the patch itself, but I am just wondering what guarantees
thread-safety for this function to not leak vm_private_data when two
threads try to allocate at the same time.

I think it should be the write mmap lock.  I saw that in your latest code
base there's:

	/*
	 * We must hold the mmap lock for writing so that callers can rely on
	 * hugetlb_hgm_enabled returning a consistent result while holding
	 * the mmap lock for reading.
	 */
	mmap_assert_write_locked(vma->vm_mm);

	/* HugeTLB HGM requires the VMA lock to synchronize collapsing. */
	ret = hugetlb_vma_data_alloc(vma);
	if (ret)
		return ret;

So that's covered there.  The remaining places are hugetlb_vm_op_open() and
hugetlb_reserve_pages() and they all seem fine too: hugetlb_vm_op_open() is
during mmap(), the latter has vma==NULL so allocation will be skipped.

I'm wondering whether it would make sense to move this assert to be inside
of hugetlb_vma_data_alloc() after the !vma check, or just add the same
assert there too, but for a different reason.
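
E.g. a rough sketch against your newer code, just to show the placement (not
an exact diff):

static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
{
	/* Only establish in (flags) sharable vmas */
	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
		return -EINVAL;

	/*
	 * Callers must hold the mmap lock for writing, otherwise two
	 * threads could race here and leak vm_private_data.
	 */
	mmap_assert_write_locked(vma->vm_mm);

	/* ... rest of the allocation unchanged ... */
}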

>  
>  	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
>  	if (!vma_lock) {
> @@ -7026,13 +7026,14 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
>  		 * allocation failure.
>  		 */
>  		pr_warn_once("HugeTLB: unable to allocate vma specific lock\n");
> -		return;
> +		return -ENOMEM;
>  	}
>  
>  	kref_init(&vma_lock->refs);
>  	init_rwsem(&vma_lock->rw_sema);
>  	vma_lock->vma = vma;
>  	vma->vm_private_data = vma_lock;
> +	return 0;
>  }
>  
>  /*
> @@ -7160,8 +7161,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>  {
>  }
>  
> -static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> +static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
>  {
> +	return 0;
>  }
>  
>  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> -- 
> 2.38.0.135.g90850a2211-goog
> 
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions
  2022-10-21 16:36 ` [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions James Houghton
@ 2022-11-16 17:19   ` Peter Xu
  2022-12-08  0:26   ` Mina Almasry
  2022-12-13  0:13   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2022-11-16 17:19 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:24PM +0000, James Houghton wrote:
> Currently it is possible for all shared VMAs to use HGM, but it must be
> enabled first. This is because with HGM, we lose PMD sharing, and page
> table walks require additional synchronization (we need to take the VMA
> lock).
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 22 +++++++++++++
>  mm/hugetlb.c            | 69 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 91 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 534958499ac4..6e0c36b08a0c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -123,6 +123,9 @@ struct hugetlb_vma_lock {
>  
>  struct hugetlb_shared_vma_data {
>  	struct hugetlb_vma_lock vma_lock;
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	bool hgm_enabled;
> +#endif
>  };
>  
>  extern struct resv_map *resv_map_alloc(void);
> @@ -1179,6 +1182,25 @@ static inline void hugetlb_unregister_node(struct node *node)
>  }
>  #endif	/* CONFIG_HUGETLB_PAGE */
>  
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> +int enable_hugetlb_hgm(struct vm_area_struct *vma);
> +#else
> +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +static inline int enable_hugetlb_hgm(struct vm_area_struct *vma)
> +{
> +	return -EINVAL;
> +}
> +#endif
> +
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
>  					struct mm_struct *mm, pte_t *pte)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5ae8bc8c928e..a18143add956 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6840,6 +6840,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
>  #ifdef CONFIG_USERFAULTFD
>  	if (uffd_disable_huge_pmd_share(vma))
>  		return false;
> +#endif
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	if (hugetlb_hgm_enabled(vma))
> +		return false;
>  #endif
>  	/*
>  	 * Only shared VMAs can share PMDs.
> @@ -7033,6 +7037,9 @@ static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
>  	kref_init(&data->vma_lock.refs);
>  	init_rwsem(&data->vma_lock.rw_sema);
>  	data->vma_lock.vma = vma;
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	data->hgm_enabled = false;
> +#endif
>  	vma->vm_private_data = data;
>  	return 0;
>  }
> @@ -7290,6 +7297,68 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
>  
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>  
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +	/*
> +	 * All shared VMAs may have HGM.
> +	 *
> +	 * HGM requires using the VMA lock, which only exists for shared VMAs.
> +	 * To make HGM work for private VMAs, we would need to use another
> +	 * scheme to prevent collapsing/splitting from invalidating other
> +	 * threads' page table walks.
> +	 */
> +	return vma && (vma->vm_flags & VM_MAYSHARE);
> +}
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +	struct hugetlb_shared_vma_data *data = vma->vm_private_data;
> +
> +	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> +		return false;

Nit: smells like an open-coded hugetlb_hgm_eligible().
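
I.e. something like (sketch):

bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
	struct hugetlb_shared_vma_data *data;

	if (!hugetlb_hgm_eligible(vma))
		return false;

	data = vma->vm_private_data;
	return data && data->hgm_enabled;
}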

> +
> +	return data && data->hgm_enabled;
> +}
> +
> +/*
> + * Enable high-granularity mapping (HGM) for this VMA. Once enabled, HGM
> + * cannot be turned off.
> + *
> + * PMDs cannot be shared in HGM VMAs.
> + */
> +int enable_hugetlb_hgm(struct vm_area_struct *vma)
> +{
> +	int ret;
> +	struct hugetlb_shared_vma_data *data;
> +
> +	if (!hugetlb_hgm_eligible(vma))
> +		return -EINVAL;
> +
> +	if (hugetlb_hgm_enabled(vma))
> +		return 0;
> +
> +	/*
> +	 * We must hold the mmap lock for writing so that callers can rely on
> +	 * hugetlb_hgm_enabled returning a consistent result while holding
> +	 * the mmap lock for reading.
> +	 */
> +	mmap_assert_write_locked(vma->vm_mm);
> +
> +	/* HugeTLB HGM requires the VMA lock to synchronize collapsing. */
> +	ret = hugetlb_vma_data_alloc(vma);
> +	if (ret)
> +		return ret;
> +
> +	data = vma->vm_private_data;
> +	BUG_ON(!data);

Let's avoid BUG_ON(); AFAIU it's mostly not welcome unless extremely
necessary.  In this case it'll crash immediately on the next dereference
anyway, with the whole stack dumped, so we won't miss anything important. :)
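
I.e. simply (sketch):

	data = vma->vm_private_data;
	/* A NULL pointer here would crash on the very next line anyway. */
	data->hgm_enabled = true;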

> +	data->hgm_enabled = true;
> +
> +	/* We don't support PMD sharing with HGM. */
> +	hugetlb_unshare_all_pmds(vma);
> +	return 0;
> +}
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> +
>  /*
>   * These functions are overwritable if your architecture needs its own
>   * behavior.
> -- 
> 2.38.0.135.g90850a2211-goog
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 37/47] hugetlb: remove huge_pte_lock and huge_pte_lockptr
  2022-10-21 16:36 ` [RFC PATCH v2 37/47] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
@ 2022-11-16 20:16   ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2022-11-16 20:16 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:53PM +0000, James Houghton wrote:
> They are replaced with hugetlb_pte_lock{,ptr}. All callers that haven't
> already been replaced don't get called when using HGM, so we handle them
> by populating hugetlb_ptes with the standard, hstate-sized huge PTEs.

I haven't checked the rationale yet, but I just noticed there's one more
instance of it in the ppc code:

*** arch/powerpc/mm/pgtable.c:
huge_ptep_set_access_flags[264] assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2022-10-21 16:36 ` [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
@ 2022-11-16 22:02   ` Peter Xu
  2022-11-17  1:39     ` James Houghton
  2022-12-14  0:47   ` Mike Kravetz
  2023-01-05  0:57   ` Jane Chu
  2 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-11-16 22:02 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:28PM +0000, James Houghton wrote:
> +/* hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
> + * the page table entry for @addr.
> + *
> + * @hpte must always be pointing at an hstate-level PTE (or deeper).
> + *
> + * This function will never walk further if it encounters a PTE of a size
> + * less than or equal to @sz.
> + *
> + * @stop_at_none determines what we do when we encounter an empty PTE.

IIUC it is not about empty PTE but swap-or-empty pte?

I'm not sure whether it'll be more straightforward to have "bool alloc"
just to show whether the caller would like to allocate pgtables when
walking the sub-level pgtable until the level specified.
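
I.e. something like (rough sketch; note the sense would be flipped compared
to stop_at_none):

/* Walk (and, if @alloc, create) sub-hstate page tables down to @sz. */
int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
		     struct hugetlb_pte *hpte, unsigned long addr,
		     unsigned long sz, bool alloc);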

In the final version of the code I also think we should drop all the "/*
stop_at_none */" comments in the callers. Maybe that already means the
meaning of the bool is confusing, so we always need a hint.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-10-21 16:36 ` [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
@ 2022-11-16 22:17   ` Peter Xu
  2022-11-17  1:00     ` James Houghton
  2022-12-08  0:46   ` Mina Almasry
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-11-16 22:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> +struct hugetlb_pte {
> +	pte_t *ptep;
> +	unsigned int shift;
> +	enum hugetlb_level level;
> +	spinlock_t *ptl;
> +};

Do we need both shift + level?  Maybe it's only meaningful for ARM where
the shift may not be directly calculated from level?

I'm wondering whether we can just maintain "shift" then we calculate
"level" realtime.  It just reads a bit weird to have these two fields, also
a burden to most of the call sites where shift and level exactly match..

> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> +			  unsigned int shift, enum hugetlb_level level)

I'd think it's nicer to replace "populate" with something else, as populate
is definitely a meaningful word in vm world for "making something appear if
it wasn't".  Maybe hugetlb_pte_setup()?

Even one step back, on the naming of hugetlb_pte..  Sorry to comment on
namings especially on this one, I really don't like to do that normally..
but here hugetlb_pte only walks the sub-page level of pgtables, meanwhile
it's not really a pte but an iterator.  How about hugetlb_hgm_iter?  "hgm"
tells that it only walks sub-level, and "iter" tells that it is an
iterator, being updated for each stepping downwards.

Then hugetlb_pte_populate() can be hugetlb_hgm_iter_init().

Take these comments with a grain of salt, and it never hurts to wait for a
2nd opinion before anything.

> +{
> +	WARN_ON_ONCE(!ptep);
> +	hpte->ptep = ptep;
> +	hpte->shift = shift;
> +	hpte->level = level;
> +	hpte->ptl = NULL;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);
> +	return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);
> +	return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);
> +	return hpte->shift;
> +}
> +
> +static inline
> +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);

There are definitely a bunch of hpte->ptep WARN_ON_ONCE()s.  AFAIK the
hugetlb_pte* will be set up once with a valid ptep and then it should always
stay valid.  I remember someone commented that these helpers don't look very
useful, and I must confess I had the same feeling.  But besides that, I'd
rather drop all these WARN_ON_ONCE()s and only check the ptep when init()ing
the iterator/pte.

> +	return hpte->level;
> +}
> +
> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> +	dest->ptep = src->ptep;
> +	dest->shift = src->shift;
> +	dest->level = src->level;
> +	dest->ptl = src->ptl;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> +
>  struct hugepage_subpool {
>  	spinlock_t lock;
>  	long count;
> @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
>  	return ptl;
>  }
>  
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +	BUG_ON(!hpte->ptep);

Another BUG_ON(); better be dropped too.

> +	if (hpte->ptl)
> +		return hpte->ptl;
> +	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);

I'm curious whether we can always have hpte->ptl set for a valid
hugetlb_pte.  I think that means we'll need to also init the ptl in the
init() fn of the iterator.  Then it'll be clear on which lock to take for
each valid hugetlb_pte.
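
E.g. resolving the ptl once at init time (rough sketch, using the iterator
naming suggested above):

static inline
void hugetlb_hgm_iter_init(struct mm_struct *mm, struct hugetlb_pte *hpte,
			   pte_t *ptep, unsigned int shift,
			   enum hugetlb_level level)
{
	hpte->ptep = ptep;
	hpte->shift = shift;
	hpte->level = level;
	/* Every valid iterator then knows its lock up front. */
	hpte->ptl = huge_pte_lockptr(shift, mm, ptep);
}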

> +}

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-10-21 16:36 ` [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM James Houghton
@ 2022-11-16 22:28   ` Peter Xu
  2022-11-16 23:30     ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-11-16 22:28 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> Userspace must provide this new feature when it calls UFFDIO_API to
> enable HGM. Userspace can check if the feature exists in
> uffdio_api.features, and if it does not exist, the kernel does not
> support and therefore did not enable HGM.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>

It's still a slight pity that this can only be enabled by an uffd context
plus a minor fault, so generic hugetlb users cannot directly leverage this.

The patch itself looks good.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-11-16 22:28   ` Peter Xu
@ 2022-11-16 23:30     ` James Houghton
  2022-12-21 19:23       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-11-16 23:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > Userspace must provide this new feature when it calls UFFDIO_API to
> > enable HGM. Userspace can check if the feature exists in
> > uffdio_api.features, and if it does not exist, the kernel does not
> > support and therefore did not enable HGM.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
>
> It's still a slight pity that this can only be enabled by an uffd context
> plus a minor fault, so generic hugetlb users cannot directly leverage this.

The idea here is that, for applications that can conceivably benefit
from HGM, we have a mechanism for enabling it for that application. So
this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
prefer this approach over something more general like MADV_ENABLE_HGM
or something.
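
For example, the userspace side would look roughly like this (sketch):

	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_HUGETLBFS_HGM,
	};

	if (ioctl(uffd, UFFDIO_API, &api) == -1)
		err(1, "UFFDIO_API");
	if (!(api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM))
		errx(1, "kernel does not support HGM");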

For hwpoison, HGM will be automatically enabled, but that isn't
implemented in this series. We could also extend MADV_DONTNEED to do
high-granularity unmapping in some way, but that also isn't attempted
here. I'm sure that if there are other cases where HGM may be useful,
we can add/change some uapi to make it possible to take advantage of HGM.

- James

>
> The patch itself looks good.
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-11-16 22:17   ` Peter Xu
@ 2022-11-17  1:00     ` James Houghton
  2022-11-17 16:27       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-11-17  1:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> > +struct hugetlb_pte {
> > +     pte_t *ptep;
> > +     unsigned int shift;
> > +     enum hugetlb_level level;
> > +     spinlock_t *ptl;
> > +};
>
> Do we need both shift + level?  Maybe it's only meaningful for ARM where
> the shift may not be directly calculated from level?
>
> I'm wondering whether we can just maintain "shift" then we calculate
> "level" realtime.  It just reads a bit weird to have these two fields, also
> a burden to most of the call sites where shift and level exactly match..

My main concern is interaction with folded levels. For example, if
PUD_SIZE and PMD_SIZE are the same, we want to do something like this:

pud = pud_offset(p4d, addr)
pmd = pmd_offset(pud, addr) /* this is just pmd = (pmd_t *) pud */
pte = pte_offset(pmd, addr)

and I think we should avoid quietly skipping the folded level, which
could happen:

pud = pud_offset(p4d, addr)
/* Each time, we go back to pte_t *, so if we stored PUD_SHIFT here,
it is impossible to know that `pud` came from `pud_offset` and not
`pmd_offset`. We must assume the deeper level so that we don't get
stuck in a loop. */
pte = pte_offset(pud, addr) /* pud is cast from (pud_t * -> pte_t * ->
pmd_t *) */

Quietly dropping p*d_offset for folded levels is safe; it's just a
cast that we're doing anyway. If you think this is fine, then I can
remove `level`. It might also be that this is a non-issue and that
there will never be a folded level underneath a hugepage level.

We could also change `ptep` to a union eventually (to clean up
"hugetlb casts everything to pte_t *" messiness), and having an
explicit `level` as a tag for the union would be a nice help. In the
same way: I like having `level` explicitly so that we know for sure
where `ptep` came from.
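
(Eventually, maybe something like this; just a sketch:)

struct hugetlb_pte {
	union {
		pud_t *pud;
		pmd_t *pmd;
		pte_t *pte;
	};
	unsigned int shift;
	enum hugetlb_level level;	/* tags which union member is live */
	spinlock_t *ptl;
};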

I can try to reduce the burden at the callsite while keeping `level`:
hpage_size_to_level() is really annoying to have everywhere.

>
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > +                       unsigned int shift, enum hugetlb_level level)
>
> I'd think it's nicer to replace "populate" with something else, as populate
> is definitely a meaningful word in vm world for "making something appear if
> it wasn't".  Maybe hugetlb_pte_setup()?
>
> Even one step back, on the naming of hugetlb_pte..  Sorry to comment on
> namings especially on this one, I really don't like to do that normally..
> but here hugetlb_pte only walks the sub-page level of pgtables, meanwhile
> it's not really a pte but an iterator.  How about hugetlb_hgm_iter?  "hgm"
> tells that it only walks sub-level, and "iter" tells that it is an
> iterator, being updated for each stepping downwards.
>
> Then hugetlb_pte_populate() can be hugetlb_hgm_iter_init().
>
> Take these comments with a grain of salt, and it never hurts to wait for a
> 2nd opinion before anything.

I think this is a great idea. :) Thank you! I'll make this change for
v1 unless someone has a better suggestion.

>
> > +{
> > +     WARN_ON_ONCE(!ptep);
> > +     hpte->ptep = ptep;
> > +     hpte->shift = shift;
> > +     hpte->level = level;
> > +     hpte->ptl = NULL;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > +     WARN_ON_ONCE(!hpte->ptep);
> > +     return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > +     WARN_ON_ONCE(!hpte->ptep);
> > +     return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > +     WARN_ON_ONCE(!hpte->ptep);
> > +     return hpte->shift;
> > +}
> > +
> > +static inline
> > +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> > +{
> > +     WARN_ON_ONCE(!hpte->ptep);
>
> There are definitely a bunch of hpte->ptep WARN_ON_ONCE()s.  AFAIK the
> hugetlb_pte* will be set up once with a valid ptep and then it should always
> stay valid.  I remember someone commented that these helpers don't look very
> useful, and I must confess I had the same feeling.  But besides that, I'd
> rather drop all these WARN_ON_ONCE()s and only check the ptep when init()ing
> the iterator/pte.

The idea with these WARN_ON_ONCE()s is that it WARNs for the case that
`hpte` was never populated/initialized, but I realize that we can't
even rely on hpte->ptep == NULL. I'll remove the WARN_ON_ONCE()s, and
I'll drop hugetlb_pte_shift and hugetlb_pte_level entirely.

I'll keep the hugetlb_pte_{size,mask,copy,present_leaf} helpers as
they are legitimately helpful.

>
> > +     return hpte->level;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > +     dest->ptep = src->ptep;
> > +     dest->shift = src->shift;
> > +     dest->level = src->level;
> > +     dest->ptl = src->ptl;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > +
> >  struct hugepage_subpool {
> >       spinlock_t lock;
> >       long count;
> > @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >       return ptl;
> >  }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > +     BUG_ON(!hpte->ptep);
>
> Another BUG_ON(); better be dropped too.

Can do.

>
> > +     if (hpte->ptl)
> > +             return hpte->ptl;
> > +     return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
>
> I'm curious whether we can always have hpte->ptl set for a valid
> hugetlb_pte.  I think that means we'll need to also init the ptl in the
> init() fn of the iterator.  Then it'll be clear on which lock to take for
> each valid hugetlb_pte.

I can work on this for v1. Right now it's not very good: for 4K PTEs,
we manually set ->ptl while walking. I'll make it so that ->ptl is
always populated so the code is easier to read.

- James

>
> > +}

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2022-11-16 22:02   ` Peter Xu
@ 2022-11-17  1:39     ` James Houghton
  0 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-11-17  1:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Nov 16, 2022 at 2:02 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:28PM +0000, James Houghton wrote:
> > +/* hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
> > + * the page table entry for @addr.
> > + *
> > + * @hpte must always be pointing at an hstate-level PTE (or deeper).
> > + *
> > + * This function will never walk further if it encounters a PTE of a size
> > + * less than or equal to @sz.
> > + *
> > + * @stop_at_none determines what we do when we encounter an empty PTE.
>
> IIUC it is not about empty PTE but swap-or-empty pte?
>
> I'm not sure whether it'll be more straightforward to have "bool alloc"
> just to show whether the caller would like to allocate pgtables when
> walking the sub-level pgtable until the level specified.

I think "bool alloc" is cleaner. I'll do that. Thanks for the suggestion.

>
> In the final version of the code I also think we should drop all the "/*
> stop_at_none */" comments in the callers. Maybe that already means the
> meaning of the bool is confusing, so we always need a hint.

I did that to hopefully make things easier to read. I'll remove it.

- James

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-11-17  1:00     ` James Houghton
@ 2022-11-17 16:27       ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2022-11-17 16:27 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Nov 16, 2022 at 05:00:08PM -0800, James Houghton wrote:
> On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> > > +struct hugetlb_pte {
> > > +     pte_t *ptep;
> > > +     unsigned int shift;
> > > +     enum hugetlb_level level;
> > > +     spinlock_t *ptl;
> > > +};
> >
> > Do we need both shift + level?  Maybe it's only meaningful for ARM where
> > the shift may not be directly calculated from level?
> >
> > I'm wondering whether we can just maintain "shift" then we calculate
> > "level" realtime.  It just reads a bit weird to have these two fields, also
> > a burden to most of the call sites where shift and level exactly match..
> 
> My main concern is interaction with folded levels. For example, if
> PUD_SIZE and PMD_SIZE are the same, we want to do something like this:
> 
> pud = pud_offset(p4d, addr)
> pmd = pmd_offset(pud, addr) /* this is just pmd = (pmd_t *) pud */
> pte = pte_offset(pmd, addr)
> 
> and I think we should avoid quietly skipping the folded level, which
> could happen:
> 
> pud = pud_offset(p4d, addr)
> /* Each time, we go back to pte_t *, so if we stored PUD_SHIFT here,
> it is impossible to know that `pud` came from `pud_offset` and not
> `pmd_offset`. We must assume the deeper level so that we don't get
> stuck in a loop. */
> pte = pte_offset(pud, addr) /* pud is cast from (pud_t * -> pte_t * ->
> pmd_t *) */
> 
> Quietly dropping p*d_offset for folded levels is safe; it's just a
> cast that we're doing anyway. If you think this is fine, then I can
> remove `level`. It might also be that this is a non-issue and that
> there will never be a folded level underneath a hugepage level.
> 
> We could also change `ptep` to a union eventually (to clean up
> "hugetlb casts everything to pte_t *" messiness), and having an
> explicit `level` as a tag for the union would be a nice help. In the
> same way: I like having `level` explicitly so that we know for sure
> where `ptep` came from.

Makes sense.

> 
> I can try to reduce the burden at the callsite while keeping `level`:
> hpage_size_to_level() is really annoying to have everywhere.

Yeah this would be nice.

Per what I see, most callers have paired level/shift, so maybe we can make
hugetlb_hgm_iter_init() only take one of them and calculate the other. Then
there can also be a __hugetlb_hgm_iter_init() which takes both, used only in
the few places where an explicit level/shift is necessary.
hugetlb_hgm_iter_init() would then call __hugetlb_hgm_iter_init().
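
Roughly (sketch, assuming __hugetlb_hgm_iter_init() keeps the full parameter
list):

static inline void
hugetlb_hgm_iter_init(struct mm_struct *mm, struct hugetlb_pte *hpte,
		      pte_t *ptep, unsigned int shift)
{
	/* Common case: derive the level from the shift. */
	__hugetlb_hgm_iter_init(mm, hpte, ptep, shift,
				hpage_size_to_level(1UL << shift));
}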

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
  2022-10-21 16:36 ` [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
@ 2022-11-17 16:58   ` Peter Xu
  2022-12-23 18:38   ` Peter Xu
  1 sibling, 0 replies; 122+ messages in thread
From: Peter Xu @ 2022-11-17 16:58 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:50PM +0000, James Houghton wrote:
> Changes here are similar to the changes made for hugetlb_no_page.
> 
> Pass vmf->real_address to userfaultfd_huge_must_wait because
> vmf->address is rounded down to the hugepage size, and a
> high-granularity page table walk would look up the wrong PTE. Also
> change the call to userfaultfd_must_wait in the same way for
> consistency.
> 
> This commit introduces hugetlb_alloc_largest_pte which is used to find
> the appropriate PTE size to map pages with UFFDIO_CONTINUE.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  fs/userfaultfd.c        | 33 +++++++++++++++---
>  include/linux/hugetlb.h | 14 +++++++-
>  mm/hugetlb.c            | 76 +++++++++++++++++++++++++++++++++--------
>  mm/userfaultfd.c        | 46 +++++++++++++++++--------
>  4 files changed, 135 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 3a3e9ef74dab..0204108e3882 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -245,14 +245,22 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
>  	struct mm_struct *mm = ctx->mm;
>  	pte_t *ptep, pte;
>  	bool ret = true;
> +	struct hugetlb_pte hpte;
> +	unsigned long sz = vma_mmu_pagesize(vma);
> +	unsigned int shift = huge_page_shift(hstate_vma(vma));
>  
>  	mmap_assert_locked(mm);
>  
> -	ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
> +	ptep = huge_pte_offset(mm, address, sz);
>  
>  	if (!ptep)
>  		goto out;
>  
> +	hugetlb_pte_populate(&hpte, ptep, shift, hpage_size_to_level(sz));
> +	hugetlb_hgm_walk(mm, vma, &hpte, address, PAGE_SIZE,
> +			/*stop_at_none=*/true);

Side note: I have a feeling that we may want a helper function to walk the
whole hugetlb pgtable, one that combines huge_pte_offset() and the hgm walk.
Not really needed for this series, but maybe for the future.
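
E.g. something like this (rough sketch; the name and the error value are just
placeholders):

static int hugetlb_full_walk(struct hugetlb_pte *hpte, struct mm_struct *mm,
			     struct vm_area_struct *vma, unsigned long addr,
			     unsigned long sz, bool stop_at_none)
{
	unsigned long hsz = vma_mmu_pagesize(vma);
	pte_t *ptep = huge_pte_offset(mm, addr, hsz);

	if (!ptep)
		return -ENOMEM;

	hugetlb_pte_populate(hpte, ptep, huge_page_shift(hstate_vma(vma)),
			     hpage_size_to_level(hsz));
	return hugetlb_hgm_walk(mm, vma, hpte, addr, sz, stop_at_none);
}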

> +	ptep = hpte.ptep;
> +
>  	ret = false;
>  	pte = huge_ptep_get(ptep);
>  
> @@ -498,6 +506,14 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>  
>  	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>  
> +	if (is_vm_hugetlb_page(vmf->vma) && hugetlb_hgm_enabled(vmf->vma))
> +		/*
> +		 * Lock the VMA lock so we can do a high-granularity walk in
> +		 * userfaultfd_huge_must_wait. We have to grab this lock before
> +		 * we set our state to blocking.
> +		 */
> +		hugetlb_vma_lock_read(vmf->vma);

Yeah this will help with/without hgm, afaict.  Maybe when I rework my other
patchset I'll just take the vma lock unconditionally for this path.

> +
>  	spin_lock_irq(&ctx->fault_pending_wqh.lock);
>  	/*
>  	 * After the __add_wait_queue the uwq is visible to userland
> @@ -513,12 +529,15 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>  	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>  
>  	if (!is_vm_hugetlb_page(vmf->vma))
> -		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> -						  reason);
> +		must_wait = userfaultfd_must_wait(ctx, vmf->real_address,
> +				vmf->flags, reason);
>  	else
>  		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
> -						       vmf->address,
> +						       vmf->real_address,
>  						       vmf->flags, reason);
> +
> +	if (is_vm_hugetlb_page(vmf->vma) && hugetlb_hgm_enabled(vmf->vma))
> +		hugetlb_vma_unlock_read(vmf->vma);
>  	mmap_read_unlock(mm);
>  
>  	if (likely(must_wait && !READ_ONCE(ctx->released))) {
> @@ -1463,6 +1482,12 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  			mas_pause(&mas);
>  		}
>  	next:
> +		if (is_vm_hugetlb_page(vma) && (ctx->features &
> +					UFFD_FEATURE_MINOR_HUGETLBFS_HGM)) {
> +			ret = enable_hugetlb_hgm(vma);
> +			if (ret)
> +				break;

[1]

> +		}
>  		/*
>  		 * In the vma_merge() successful mprotect-like case 8:
>  		 * the next vma was merged into the current one and
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e25f97cdd086..00c22a84a1c6 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -250,7 +250,8 @@ unsigned long hugetlb_total_pages(void);
>  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags);
>  #ifdef CONFIG_USERFAULTFD
> -int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
> +int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> +				struct hugetlb_pte *dst_hpte,
>  				struct vm_area_struct *dst_vma,
>  				unsigned long dst_addr,
>  				unsigned long src_addr,
> @@ -1272,6 +1273,9 @@ static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
>  bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
>  bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
>  int enable_hugetlb_hgm(struct vm_area_struct *vma);
> +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> +			      struct vm_area_struct *vma, unsigned long start,
> +			      unsigned long end);
>  #else
>  static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
>  {
> @@ -1285,6 +1289,14 @@ static inline int enable_hugetlb_hgm(struct vm_area_struct *vma)
>  {
>  	return -EINVAL;
>  }
> +
> +static inline
> +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> +			      struct vm_area_struct *vma, unsigned long start,
> +			      unsigned long end)
> +{
> +	return -EINVAL;
> +}
>  #endif
>  
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6eaec40d66ad..c25d3cd73ac9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6325,7 +6325,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   * modifications for huge pages.
>   */
>  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> -			    pte_t *dst_pte,
> +			    struct hugetlb_pte *dst_hpte,
>  			    struct vm_area_struct *dst_vma,
>  			    unsigned long dst_addr,
>  			    unsigned long src_addr,
> @@ -6336,13 +6336,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
>  	struct hstate *h = hstate_vma(dst_vma);
>  	struct address_space *mapping = dst_vma->vm_file->f_mapping;
> -	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> +	unsigned long haddr = dst_addr & huge_page_mask(h);
> +	pgoff_t idx = vma_hugecache_offset(h, dst_vma, haddr);
>  	unsigned long size;
>  	int vm_shared = dst_vma->vm_flags & VM_SHARED;
>  	pte_t _dst_pte;
>  	spinlock_t *ptl;
>  	int ret = -ENOMEM;
> -	struct page *page;
> +	struct page *page, *subpage;
>  	int writable;
>  	bool page_in_pagecache = false;
>  
> @@ -6357,12 +6358,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		 * a non-missing case. Return -EEXIST.
>  		 */
>  		if (vm_shared &&
> -		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> +		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
>  			ret = -EEXIST;
>  			goto out;
>  		}
>  
> -		page = alloc_huge_page(dst_vma, dst_addr, 0);
> +		page = alloc_huge_page(dst_vma, haddr, 0);
>  		if (IS_ERR(page)) {
>  			ret = -ENOMEM;
>  			goto out;
> @@ -6378,13 +6379,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			/* Free the allocated page which may have
>  			 * consumed a reservation.
>  			 */
> -			restore_reserve_on_error(h, dst_vma, dst_addr, page);
> +			restore_reserve_on_error(h, dst_vma, haddr, page);
>  			put_page(page);
>  
>  			/* Allocate a temporary page to hold the copied
>  			 * contents.
>  			 */
> -			page = alloc_huge_page_vma(h, dst_vma, dst_addr);
> +			page = alloc_huge_page_vma(h, dst_vma, haddr);
>  			if (!page) {
>  				ret = -ENOMEM;
>  				goto out;
> @@ -6398,14 +6399,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		}
>  	} else {
>  		if (vm_shared &&
> -		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> +		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
>  			put_page(*pagep);
>  			ret = -EEXIST;
>  			*pagep = NULL;
>  			goto out;
>  		}
>  
> -		page = alloc_huge_page(dst_vma, dst_addr, 0);
> +		page = alloc_huge_page(dst_vma, haddr, 0);
>  		if (IS_ERR(page)) {
>  			put_page(*pagep);
>  			ret = -ENOMEM;
> @@ -6447,7 +6448,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		page_in_pagecache = true;
>  	}
>  
> -	ptl = huge_pte_lock(h, dst_mm, dst_pte);
> +	ptl = hugetlb_pte_lock(dst_mm, dst_hpte);
>  
>  	ret = -EIO;
>  	if (PageHWPoison(page))
> @@ -6459,7 +6460,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	 * page backing it, then access the page.
>  	 */
>  	ret = -EEXIST;
> -	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> +	if (!huge_pte_none_mostly(huge_ptep_get(dst_hpte->ptep)))
>  		goto out_release_unlock;
>  
>  	if (page_in_pagecache) {
> @@ -6478,7 +6479,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	else
>  		writable = dst_vma->vm_flags & VM_WRITE;
>  
> -	_dst_pte = make_huge_pte(dst_vma, page, writable);
> +	subpage = hugetlb_find_subpage(h, page, dst_addr);
> +	WARN_ON_ONCE(subpage != page && !hugetlb_hgm_enabled(dst_vma));
> +
> +	_dst_pte = make_huge_pte_with_shift(dst_vma, subpage, writable,
> +			dst_hpte->shift);
>  	/*
>  	 * Always mark UFFDIO_COPY page dirty; note that this may not be
>  	 * extremely important for hugetlbfs for now since swapping is not
> @@ -6491,12 +6496,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	if (wp_copy)
>  		_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
>  
> -	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> +	set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);
>  
> -	hugetlb_count_add(pages_per_huge_page(h), dst_mm);
> +	hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);
>  
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache(dst_vma, dst_addr, dst_pte);
> +	update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);
>  
>  	spin_unlock(ptl);
>  	if (!is_continue)
> @@ -7875,6 +7880,47 @@ static unsigned int __shift_for_hstate(struct hstate *h)
>  			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
>  			       (tmp_h)++)
>  
> +/*
> + * Allocate a HugeTLB PTE that maps as much of [start, end) as possible with a
> + * single page table entry. The allocated HugeTLB PTE is returned in @hpte.
> + */
> +int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> +			      struct vm_area_struct *vma, unsigned long start,
> +			      unsigned long end)
> +{
> +	struct hstate *h = hstate_vma(vma), *tmp_h;
> +	unsigned int shift;
> +	unsigned long sz;
> +	int ret;
> +	pte_t *ptep;
> +
> +	for_each_hgm_shift(h, tmp_h, shift) {

The fallback to PAGE_SIZE (in __shift_for_hstate()) is not obvious.  Would
it be clearer if we handled PAGE_SIZE explicitly here?
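
E.g. (rough sketch):

	for_each_hgm_shift(h, tmp_h, shift) {
		...
	}
	/* Explicit base-page fallback instead of hiding it in
	 * __shift_for_hstate(). */
	if (IS_ALIGNED(start, PAGE_SIZE) && start + PAGE_SIZE <= end) {
		shift = PAGE_SHIFT;
		goto found;
	}
	return -EINVAL;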

> +		sz = 1UL << shift;
> +
> +		if (!IS_ALIGNED(start, sz) || start + sz > end)
> +			continue;
> +		goto found;
> +	}
> +	return -EINVAL;
> +found:
> +	ptep = huge_pte_alloc(mm, vma, start, huge_page_size(h));
> +	if (!ptep)
> +		return -ENOMEM;
> +
> +	hugetlb_pte_populate(hpte, ptep, huge_page_shift(h),
> +			hpage_size_to_level(huge_page_size(h)));
> +
> +	ret = hugetlb_hgm_walk(mm, vma, hpte, start, 1L << shift,
> +			/*stop_at_none=*/false);
> +	if (ret)
> +		return ret;
> +
> +	if (hpte->shift > shift)
> +		return -EEXIST;
> +
> +	return 0;
> +}
> +
>  #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>  
>  /*
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e24e8a47ce8a..c4a8e6666ea6 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -315,14 +315,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  {
>  	int vm_shared = dst_vma->vm_flags & VM_SHARED;
>  	ssize_t err;
> -	pte_t *dst_pte;
>  	unsigned long src_addr, dst_addr;
>  	long copied;
>  	struct page *page;
> -	unsigned long vma_hpagesize;
> +	unsigned long vma_hpagesize, target_pagesize;
>  	pgoff_t idx;
>  	u32 hash;
>  	struct address_space *mapping;
> +	bool use_hgm = uffd_ctx_has_hgm(&dst_vma->vm_userfaultfd_ctx) &&
> +		mode == MCOPY_ATOMIC_CONTINUE;
> +	struct hstate *h = hstate_vma(dst_vma);
>  
>  	/*
>  	 * There is no default zero huge page for all huge page sizes as
> @@ -340,12 +342,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	copied = 0;
>  	page = NULL;
>  	vma_hpagesize = vma_kernel_pagesize(dst_vma);
> +	target_pagesize = use_hgm ? PAGE_SIZE : vma_hpagesize;

Nit: "target_pagesize" is slightly misleading?  Because hgm can do e.g. 2M
on 1G too.  I feel like what you want to check here is the minimum
requirement, hence.. "min_pagesize"?

>  
>  	/*
> -	 * Validate alignment based on huge page size
> +	 * Validate alignment based on the targeted page size.
>  	 */
>  	err = -EINVAL;
> -	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> +	if (dst_start & (target_pagesize - 1) || len & (target_pagesize - 1))
>  		goto out_unlock;
>  
>  retry:
> @@ -362,6 +365,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		err = -EINVAL;
>  		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
>  			goto out_unlock;
> +		if (use_hgm && !hugetlb_hgm_enabled(dst_vma))
> +			goto out_unlock;

Nit: this seems unneeded, because enabling the hgm feature for uffd already
requires enabling it in hugetlb when registering uffd above [1].

>  
>  		vm_shared = dst_vma->vm_flags & VM_SHARED;
>  	}
> @@ -376,13 +381,15 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	}
>  
>  	while (src_addr < src_start + len) {
> +		struct hugetlb_pte hpte;
> +		pte_t *dst_pte;
>  		BUG_ON(dst_addr >= dst_start + len);
>  
>  		/*
>  		 * Serialize via vma_lock and hugetlb_fault_mutex.
> -		 * vma_lock ensures the dst_pte remains valid even
> -		 * in the case of shared pmds.  fault mutex prevents
> -		 * races with other faulting threads.
> +		 * vma_lock ensures the hpte.ptep remains valid even
> +		 * in the case of shared pmds and page table collapsing.
> +		 * fault mutex prevents races with other faulting threads.
>  		 */
>  		idx = linear_page_index(dst_vma, dst_addr);
>  		mapping = dst_vma->vm_file->f_mapping;
> @@ -390,23 +397,33 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		mutex_lock(&hugetlb_fault_mutex_table[hash]);
>  		hugetlb_vma_lock_read(dst_vma);
>  
> -		err = -ENOMEM;
> +		err = 0;
>  		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
> -		if (!dst_pte) {
> +		if (!dst_pte)
> +			err = -ENOMEM;
> +		else {
> +			hugetlb_pte_populate(&hpte, dst_pte, huge_page_shift(h),
> +					hpage_size_to_level(huge_page_size(h)));
> +			if (use_hgm)
> +				err = hugetlb_alloc_largest_pte(&hpte,
> +						dst_mm, dst_vma, dst_addr,
> +						dst_start + len);

dst_addr, not dst_start?

> +		}
> +		if (err) {
>  			hugetlb_vma_unlock_read(dst_vma);
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			goto out_unlock;
>  		}
>  
>  		if (mode != MCOPY_ATOMIC_CONTINUE &&
> -		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
> +		    !huge_pte_none_mostly(huge_ptep_get(hpte.ptep))) {
>  			err = -EEXIST;
>  			hugetlb_vma_unlock_read(dst_vma);
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			goto out_unlock;
>  		}
>  
> -		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
> +		err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
>  					       dst_addr, src_addr, mode, &page,
>  					       wp_copy);
>  
> @@ -418,6 +435,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		if (unlikely(err == -ENOENT)) {
>  			mmap_read_unlock(dst_mm);
>  			BUG_ON(!page);
> +			BUG_ON(hpte.shift != huge_page_shift(h));
>  
>  			err = copy_huge_page_from_user(page,
>  						(const void __user *)src_addr,
> @@ -435,9 +453,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  			BUG_ON(page);
>  
>  		if (!err) {
> -			dst_addr += vma_hpagesize;
> -			src_addr += vma_hpagesize;
> -			copied += vma_hpagesize;
> +			dst_addr += hugetlb_pte_size(&hpte);
> +			src_addr += hugetlb_pte_size(&hpte);
> +			copied += hugetlb_pte_size(&hpte);
>  
>  			if (fatal_signal_pending(current))
>  				err = -EINTR;
> -- 
> 2.38.0.135.g90850a2211-goog
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason
  2022-11-16 17:08   ` Peter Xu
@ 2022-11-21 18:11     ` James Houghton
  0 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-11-21 18:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Nov 16, 2022 at 9:08 AM Peter Xu <peterx@redhat.com> wrote:
>
> No objection on the patch itself, but I am just wondering what guarantees
> thread-safety for this function to not leak vm_private_data when two
> threads try to allocate at the same time.
>
> I think it should be the write mmap lock.  I saw that in your latest code
> base there's:
>
>         /*
>          * We must hold the mmap lock for writing so that callers can rely on
>          * hugetlb_hgm_enabled returning a consistent result while holding
>          * the mmap lock for reading.
>          */
>         mmap_assert_write_locked(vma->vm_mm);
>
>         /* HugeTLB HGM requires the VMA lock to synchronize collapsing. */
>         ret = hugetlb_vma_data_alloc(vma);
>         if (ret)
>                 return ret;
>
> So that's covered there.  The remaining places are hugetlb_vm_op_open() and
> hugetlb_reserve_pages() and they all seem fine too: hugetlb_vm_op_open() is
> during mmap(), the latter has vma==NULL so allocation will be skipped.
>
> I'm wondering whether it would make sense to move this assert to be inside
> of hugetlb_vma_data_alloc() after the !vma check, or just add the same
> assert there too, but for a different reason.

I think leaving the assert here and adding a new assert inside
hugetlb_vma_data_alloc() makes sense. Thanks Peter.

- James

>
> >
> >       vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
> >       if (!vma_lock) {
> > @@ -7026,13 +7026,14 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> >                * allocation failure.
> >                */
> >               pr_warn_once("HugeTLB: unable to allocate vma specific lock\n");
> > -             return;
> > +             return -ENOMEM;
> >       }
> >
> >       kref_init(&vma_lock->refs);
> >       init_rwsem(&vma_lock->rw_sema);
> >       vma_lock->vma = vma;
> >       vma->vm_private_data = vma_lock;
> > +     return 0;
> >  }
> >
> >  /*
> > @@ -7160,8 +7161,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
> >  {
> >  }
> >
> > -static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> > +static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> >  {
> > +     return 0;
> >  }
> >
> >  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> > --
> > 2.38.0.135.g90850a2211-goog
> >
> >
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2022-11-16 16:30   ` Peter Xu
@ 2022-11-21 18:33     ` James Houghton
  2022-12-08 22:55       ` Mike Kravetz
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-11-21 18:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Nov 16, 2022 at 8:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:17PM +0000, James Houghton wrote:
> > This is how it should have been to begin with. It would be very bad if
> > we actually set PageUptodate with a UFFDIO_CONTINUE, as UFFDIO_CONTINUE
> > doesn't actually set/update the contents of the page, so we would be
> > exposing a non-zeroed page to the user.
> >
> > The reason this change is being made now is because UFFDIO_CONTINUEs on
> > subpages definitely shouldn't set this page flag on the head page.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  mm/hugetlb.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1a7dc7b2e16c..650761cdd2f6 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6097,7 +6097,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >        * preceding stores to the page contents become visible before
> >        * the set_pte_at() write.
> >        */
> > -     __SetPageUptodate(page);
> > +     if (!is_continue)
> > +             __SetPageUptodate(page);
> > +     else
> > +             VM_WARN_ON_ONCE_PAGE(!PageUptodate(page), page);
>
> Yeah, the old code looks wrong; I'm just wondering whether we can 100%
> guarantee this for hugetlb.  E.g. for shmem that won't hold when we
> uffd-continue on an unused page (e.g. one created by an over-sized fallocate()).
>
> Another, safer approach is to simply fail the ioctl if !uptodate, but if you're
> certain then WARN_ON_ONCE sounds all good too.  At least I did have a quick
> look at hugetlb fallocate(), and pages become uptodate immediately.

Failing the ioctl sounds better than only WARNing. I'll do that and
drop the WARN_ON_ONCE for v1. Thanks!
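
Concretely, something along these lines (assuming -EFAULT and the
existing out_release_nounlock error path are the right choices; to be
confirmed when I write it):

        if (!is_continue)
                __SetPageUptodate(page);
        else if (!PageUptodate(page)) {
                /*
                 * Don't map a page whose contents were never written;
                 * fail the UFFDIO_CONTINUE instead.
                 */
                ret = -EFAULT;
                goto out_release_nounlock;
        }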

- James

>
> >
> >       /* Add shared, newly allocated pages to the page cache. */
> >       if (vm_shared && !is_continue) {
> > --
> > 2.38.0.135.g90850a2211-goog
> >
> >
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas
  2022-10-21 16:36 ` [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas James Houghton
@ 2022-11-30 21:01   ` Peter Xu
  2022-11-30 23:29     ` James Houghton
  2022-12-09 22:48     ` Mike Kravetz
  0 siblings, 2 replies; 122+ messages in thread
From: Peter Xu @ 2022-11-30 21:01 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:22PM +0000, James Houghton wrote:
> This allows us to add more data into the shared structure, which we will
> use to store whether or not HGM is enabled for this VMA or not, as HGM
> is only available for shared mappings.
> 
> It may be better to include HGM as a VMA flag instead of extending the
> VMA lock structure.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h |  4 +++
>  mm/hugetlb.c            | 65 +++++++++++++++++++++--------------------
>  2 files changed, 37 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a899bc76d677..534958499ac4 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -121,6 +121,10 @@ struct hugetlb_vma_lock {
>  	struct vm_area_struct *vma;
>  };
>  
> +struct hugetlb_shared_vma_data {
> +	struct hugetlb_vma_lock vma_lock;
> +};

How about adding a comment above hugetlb_vma_lock showing how it should be
used correctly?  We lack documentation on the lock for pmd sharing
protection, and now that the same lock is being reused for HGM page tables,
I think some documentation will definitely help.

To summarize, I think so far it means:

  - Read lock needed when one wants to stabilize VM_SHARED pgtables (covers
    both pmd shared pgtables or hgm low-level pgtables)

  - Write lock needed when one wants to release VM_SHARED pgtable pages
    (covers both pmd unshare or releasing hgm low-level pgtables)

Or something like that.
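
Maybe something like this above the struct definition, just to
illustrate (wording to be polished):

/*
 * The hugetlb VMA lock, hanging off vm_private_data of VM_SHARED
 * hugetlb VMAs:
 *
 *  - hold it for reading to stabilize the VMA's hugetlb page tables,
 *    i.e. to keep shared PMDs (and, with HGM, lower-level page tables)
 *    from being freed from under a page table walker;
 *
 *  - hold it for writing to free such page table pages, i.e. when
 *    unsharing a PMD or when collapsing/releasing HGM lower-level
 *    page tables.
 */
struct hugetlb_vma_lock {
        struct kref refs;
        struct rw_semaphore rw_sema;
        struct vm_area_struct *vma;
};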

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range
  2022-10-21 16:36 ` [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
@ 2022-11-30 21:32   ` Peter Xu
  2022-11-30 23:18     ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-11-30 21:32 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:41PM +0000, James Houghton wrote:
> This allows fork() to work with high-granularity mappings. The page
> table structure is copied such that partially mapped regions will remain
> partially mapped in the same way for the new process.
> 
> A page's reference count is incremented for *each* portion of it that is
> mapped in the page table. For example, if you have a PMD-mapped 1G page,
> the reference count and mapcount will be incremented by 512.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>

I have a feeling that this path is not triggered.  See:

bcd51a3c679d ("hugetlb: lazy page table copies in fork()", 2022-07-17)

It might be helpful to have it when exploring private mapping support of
HGM for page poisoning in the future.  But the thing is, if we want this
to be accepted, we still need a way to test it.  I just don't see how to
test this without the private support being there.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range
  2022-11-30 21:32   ` Peter Xu
@ 2022-11-30 23:18     ` James Houghton
  0 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-11-30 23:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Nov 30, 2022 at 4:32 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:41PM +0000, James Houghton wrote:
> > This allows fork() to work with high-granularity mappings. The page
> > table structure is copied such that partially mapped regions will remain
> > partially mapped in the same way for the new process.
> >
> > A page's reference count is incremented for *each* portion of it that is
> > mapped in the page table. For example, if you have a PMD-mapped 1G page,
> > the reference count and mapcount will be incremented by 512.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
>
> I have a feeling that this path is not triggered.  See:
>
> bcd51a3c679d ("hugetlb: lazy page table copies in fork()", 2022-07-17)
>
> It might be helpful to have it when exploring private mapping support of
> HGM for page poisoning in the future.  But the thing is, if we want this
> to be accepted, we still need a way to test it.  I just don't see how to
> test this without the private support being there.

We can trigger this behavior by registering the VMA with
uffd-writeprotect. I didn't include any self-tests for this though;
I'll make sure to actually test this path in v1.
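
Roughly, the userspace side of such a test would look like this (map
and len below stand for the shared hugetlb region; exact flags to be
double-checked when I actually write it):

        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)map, .len = len },
                .mode = UFFDIO_REGISTER_MODE_WP,
        };

        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /*
         * With uffd-wp armed on the VMA, fork() can no longer skip
         * copying the hugetlb page tables, so the HGM-aware
         * copy_hugetlb_page_range() path actually runs.
         */
        if (fork() == 0)
                _exit(0);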

- James

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas
  2022-11-30 21:01   ` Peter Xu
@ 2022-11-30 23:29     ` James Houghton
  2022-12-09 22:48     ` Mike Kravetz
  1 sibling, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-11-30 23:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Nov 30, 2022 at 4:01 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:22PM +0000, James Houghton wrote:
> > This allows us to add more data into the shared structure, which we will
> > use to store whether or not HGM is enabled for this VMA or not, as HGM
> > is only available for shared mappings.
> >
> > It may be better to include HGM as a VMA flag instead of extending the
> > VMA lock structure.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h |  4 +++
> >  mm/hugetlb.c            | 65 +++++++++++++++++++++--------------------
> >  2 files changed, 37 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index a899bc76d677..534958499ac4 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -121,6 +121,10 @@ struct hugetlb_vma_lock {
> >       struct vm_area_struct *vma;
> >  };
> >
> > +struct hugetlb_shared_vma_data {
> > +     struct hugetlb_vma_lock vma_lock;
> > +};
>
> How about adding a comment above hugetlb_vma_lock showing how it should be
> used correctly?  We lack documentation on the lock for pmd sharing
> protection, and now that the same lock is being reused for HGM page tables,
> I think some documentation will definitely help.
>
> To summarize, I think so far it means:
>
>   - Read lock needed when one wants to stabilize VM_SHARED pgtables (covers
>     both pmd shared pgtables or hgm low-level pgtables)
>
>   - Write lock needed when one wants to release VM_SHARED pgtable pages
>     (covers both pmd unshare or releasing hgm low-level pgtables)
>
> Or something like that.

Will do. I'll make this change together with the rmap comment update
("rmap: update hugetlb lock comment for HGM").

- James

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused
  2022-10-21 16:36 ` [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused James Houghton
  2022-11-16 16:35   ` Peter Xu
@ 2022-12-07 23:13   ` Mina Almasry
  2022-12-08 23:42   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Mina Almasry @ 2022-12-07 23:13 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
>
> mk_huge_pte is unused and not necessary. pte_mkhuge is the appropriate
> function to call to create a HugeTLB PTE (see
> Documentation/mm/arch_pgtable_helpers.rst).
>
> It is being removed now to avoid complicating the implementation of
> HugeTLB high-granularity mapping.
>
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Mina Almasry <almasrymina@google.com>

> ---
>  arch/s390/include/asm/hugetlb.h | 5 -----
>  include/asm-generic/hugetlb.h   | 5 -----
>  mm/debug_vm_pgtable.c           | 2 +-
>  mm/hugetlb.c                    | 7 +++----
>  4 files changed, 4 insertions(+), 15 deletions(-)
>
> diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
> index ccdbccfde148..c34893719715 100644
> --- a/arch/s390/include/asm/hugetlb.h
> +++ b/arch/s390/include/asm/hugetlb.h
> @@ -77,11 +77,6 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>         set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
>  }
>
> -static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
> -{
> -       return mk_pte(page, pgprot);
> -}
> -
>  static inline int huge_pte_none(pte_t pte)
>  {
>         return pte_none(pte);
> diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> index a57d667addd2..aab9e46fa628 100644
> --- a/include/asm-generic/hugetlb.h
> +++ b/include/asm-generic/hugetlb.h
> @@ -5,11 +5,6 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>
> -static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
> -{
> -       return mk_pte(page, pgprot);
> -}
> -
>  static inline unsigned long huge_pte_write(pte_t pte)
>  {
>         return pte_write(pte);
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 2b61fde8c38c..10573a283a12 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -929,7 +929,7 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args)
>          * as it was previously derived from a real kernel symbol.
>          */
>         page = pfn_to_page(args->fixed_pmd_pfn);
> -       pte = mk_huge_pte(page, args->page_prot);
> +       pte = mk_pte(page, args->page_prot);
>
>         WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte)));
>         WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte))));
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 650761cdd2f6..20a111b532aa 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4728,11 +4728,10 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
>         unsigned int shift = huge_page_shift(hstate_vma(vma));
>
>         if (writable) {
> -               entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
> -                                        vma->vm_page_prot)));
> +               entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
> +                                               vma->vm_page_prot)));
>         } else {
> -               entry = huge_pte_wrprotect(mk_huge_pte(page,
> -                                          vma->vm_page_prot));
> +               entry = huge_pte_wrprotect(mk_pte(page, vma->vm_page_prot));
>         }
>         entry = pte_mkyoung(entry);
>         entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> --
> 2.38.0.135.g90850a2211-goog
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path
  2022-10-21 16:36 ` [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
  2022-11-16 16:36   ` Peter Xu
@ 2022-12-07 23:16   ` Mina Almasry
  2022-12-09  0:10   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Mina Almasry @ 2022-12-07 23:16 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
>
> arch_make_huge_pte, which is called immediately following pte_mkhuge,
> already makes the necessary changes to the PTE that pte_mkhuge would
> have. The generic implementation of arch_make_huge_pte simply calls
> pte_mkhuge.
>
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Mina Almasry <almasrymina@google.com>

> ---
>  mm/migrate.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8e5eb6ed9da2..1457cdbb7828 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -237,7 +237,6 @@ static bool remove_migration_pte(struct folio *folio,
>                 if (folio_test_hugetlb(folio)) {
>                         unsigned int shift = huge_page_shift(hstate_vma(vma));
>
> -                       pte = pte_mkhuge(pte);
>                         pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
>                         if (folio_test_anon(folio))
>                                 hugepage_add_anon_rmap(new, vma, pvmw.address,
> --
> 2.38.0.135.g90850a2211-goog
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason
  2022-10-21 16:36 ` [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason James Houghton
  2022-11-16 17:08   ` Peter Xu
@ 2022-12-07 23:33   ` Mina Almasry
  2022-12-09 22:36   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Mina Almasry @ 2022-12-07 23:33 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
>
> Currently hugetlb_vma_lock_alloc doesn't return anything, as there is no
> need: if it fails, PMD sharing won't be enabled. However, HGM requires
> that the VMA lock exists, so we need to verify that
> hugetlb_vma_lock_alloc actually succeeded. If hugetlb_vma_lock_alloc
> fails, then we can pass that up to the caller that is attempting to
> enable HGM.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 52cec5b0789e..dc82256b89dd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -92,7 +92,7 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
>  /* Forward declaration */
>  static int hugetlb_acct_memory(struct hstate *h, long delta);
>  static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
> -static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
> +static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
>  static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
>
>  static inline bool subpool_is_free(struct hugepage_subpool *spool)
> @@ -7001,17 +7001,17 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>         }
>  }
>
> -static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> +static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
>  {
>         struct hugetlb_vma_lock *vma_lock;
>
>         /* Only establish in (flags) sharable vmas */
>         if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> -               return;
> +               return -EINVAL;
>
> -       /* Should never get here with non-NULL vm_private_data */
> +       /* We've already allocated the lock. */
>         if (vma->vm_private_data)
> -               return;
> +               return 0;

I would have expected -EEXIST here.

Also, even if the patch looks generally fine, it's hard to provide an
Acked-by now. I need to look at the call site, which is in another
patch in the series. If there is an opportunity to squash changes to
helpers and their call sites, please do.

>
>         vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
>         if (!vma_lock) {
> @@ -7026,13 +7026,14 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
>                  * allocation failure.
>                  */
>                 pr_warn_once("HugeTLB: unable to allocate vma specific lock\n");
> -               return;
> +               return -ENOMEM;
>         }
>
>         kref_init(&vma_lock->refs);
>         init_rwsem(&vma_lock->rw_sema);
>         vma_lock->vma = vma;
>         vma->vm_private_data = vma_lock;
> +       return 0;
>  }
>
>  /*
> @@ -7160,8 +7161,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>  {
>  }
>
> -static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> +static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
>  {
> +       return 0;
>  }
>
>  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> --
> 2.38.0.135.g90850a2211-goog
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions
  2022-10-21 16:36 ` [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions James Houghton
  2022-11-16 17:19   ` Peter Xu
@ 2022-12-08  0:26   ` Mina Almasry
  2022-12-09 15:41     ` James Houghton
  2022-12-13  0:13   ` Mike Kravetz
  2 siblings, 1 reply; 122+ messages in thread
From: Mina Almasry @ 2022-12-08  0:26 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
>
> Currently it is possible for all shared VMAs to use HGM, but it must be
> enabled first. This is because with HGM, we lose PMD sharing, and page
> table walks require additional synchronization (we need to take the VMA
> lock).
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 22 +++++++++++++
>  mm/hugetlb.c            | 69 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 91 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 534958499ac4..6e0c36b08a0c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -123,6 +123,9 @@ struct hugetlb_vma_lock {
>
>  struct hugetlb_shared_vma_data {
>         struct hugetlb_vma_lock vma_lock;
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +       bool hgm_enabled;
> +#endif
>  };
>
>  extern struct resv_map *resv_map_alloc(void);
> @@ -1179,6 +1182,25 @@ static inline void hugetlb_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_HUGETLB_PAGE */
>
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> +int enable_hugetlb_hgm(struct vm_area_struct *vma);
> +#else
> +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +       return false;
> +}
> +static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +       return false;
> +}
> +static inline int enable_hugetlb_hgm(struct vm_area_struct *vma)
> +{
> +       return -EINVAL;
> +}
> +#endif
> +
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
>                                         struct mm_struct *mm, pte_t *pte)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5ae8bc8c928e..a18143add956 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6840,6 +6840,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
>  #ifdef CONFIG_USERFAULTFD
>         if (uffd_disable_huge_pmd_share(vma))
>                 return false;
> +#endif
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +       if (hugetlb_hgm_enabled(vma))
> +               return false;
>  #endif
>         /*
>          * Only shared VMAs can share PMDs.
> @@ -7033,6 +7037,9 @@ static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
>         kref_init(&data->vma_lock.refs);
>         init_rwsem(&data->vma_lock.rw_sema);
>         data->vma_lock.vma = vma;
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +       data->hgm_enabled = false;
> +#endif
>         vma->vm_private_data = data;
>         return 0;
>  }
> @@ -7290,6 +7297,68 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
>
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +       /*
> +        * All shared VMAs may have HGM.
> +        *
> +        * HGM requires using the VMA lock, which only exists for shared VMAs.
> +        * To make HGM work for private VMAs, we would need to use another
> +        * scheme to prevent collapsing/splitting from invalidating other
> +        * threads' page table walks.
> +        */
> +       return vma && (vma->vm_flags & VM_MAYSHARE);
> +}
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +       struct hugetlb_shared_vma_data *data = vma->vm_private_data;
> +
> +       if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> +               return false;
> +
> +       return data && data->hgm_enabled;

Don't you need to lock data->vma_lock before you access data? Or did I
> misunderstand the locking? Or are you assuming this is safe because
hgm_enabled can't be disabled?
> +}
> +
> +/*
> + * Enable high-granularity mapping (HGM) for this VMA. Once enabled, HGM
> + * cannot be turned off.
> + *
> + * PMDs cannot be shared in HGM VMAs.
> + */
> +int enable_hugetlb_hgm(struct vm_area_struct *vma)
> +{
> +       int ret;
> +       struct hugetlb_shared_vma_data *data;
> +
> +       if (!hugetlb_hgm_eligible(vma))
> +               return -EINVAL;
> +
> +       if (hugetlb_hgm_enabled(vma))
> +               return 0;
> +
> +       /*
> +        * We must hold the mmap lock for writing so that callers can rely on
> +        * hugetlb_hgm_enabled returning a consistent result while holding
> +        * the mmap lock for reading.
> +        */
> +       mmap_assert_write_locked(vma->vm_mm);
> +
> +       /* HugeTLB HGM requires the VMA lock to synchronize collapsing. */
> +       ret = hugetlb_vma_data_alloc(vma);

Confused we need to vma_data_alloc() here. Shouldn't this be done by
hugetlb_vm_op_open()?

> +       if (ret)
> +               return ret;
> +
> +       data = vma->vm_private_data;
> +       BUG_ON(!data);
> +       data->hgm_enabled = true;
> +
> +       /* We don't support PMD sharing with HGM. */
> +       hugetlb_unshare_all_pmds(vma);
> +       return 0;
> +}
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> +
>  /*
>   * These functions are overwritable if your architecture needs its own
>   * behavior.
> --
> 2.38.0.135.g90850a2211-goog
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 09/47] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-10-21 16:36 ` [RFC PATCH v2 09/47] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
@ 2022-12-08  0:30   ` Mina Almasry
  2022-12-13  0:25   ` Mike Kravetz
  1 sibling, 0 replies; 122+ messages in thread
From: Mina Almasry @ 2022-12-08  0:30 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
>
> This is needed to handle PTL locking with high-granularity mapping. We
> won't always be using the PMD-level PTL even if we're using the 2M
> hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> case, we need to lock the PTL for the 4K PTE.
>
> Signed-off-by: James Houghton <jthoughton@google.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>

> ---
>  arch/powerpc/mm/pgtable.c | 3 ++-
>  include/linux/hugetlb.h   | 9 ++++-----
>  mm/hugetlb.c              | 7 ++++---
>  mm/migrate.c              | 3 ++-
>  4 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index cb2dcdb18f8e..035a0df47af0 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -261,7 +261,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>
>                 psize = hstate_get_psize(h);
>  #ifdef CONFIG_DEBUG_VM
> -               assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep));
> +               assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),
> +                                                   vma->vm_mm, ptep));
>  #endif
>
>  #else
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 6e0c36b08a0c..db3ed6095b1c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -934,12 +934,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>         return modified_mask;
>  }
>
> -static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> +static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
>                                            struct mm_struct *mm, pte_t *pte)
>  {
> -       if (huge_page_size(h) == PMD_SIZE)
> +       if (shift == PMD_SHIFT)
>                 return pmd_lockptr(mm, (pmd_t *) pte);
> -       VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
>         return &mm->page_table_lock;
>  }
>
> @@ -1144,7 +1143,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
>         return 0;
>  }
>
> -static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> +static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
>                                            struct mm_struct *mm, pte_t *pte)
>  {
>         return &mm->page_table_lock;
> @@ -1206,7 +1205,7 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
>  {
>         spinlock_t *ptl;
>
> -       ptl = huge_pte_lockptr(h, mm, pte);
> +       ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
>         spin_lock(ptl);
>         return ptl;
>  }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a18143add956..ef7662bd0068 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4847,7 +4847,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                 }
>
>                 dst_ptl = huge_pte_lock(h, dst, dst_pte);
> -               src_ptl = huge_pte_lockptr(h, src, src_pte);
> +               src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
>                 spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>                 entry = huge_ptep_get(src_pte);
>  again:
> @@ -4925,7 +4925,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>
>                                 /* Install the new huge page if src pte stable */
>                                 dst_ptl = huge_pte_lock(h, dst, dst_pte);
> -                               src_ptl = huge_pte_lockptr(h, src, src_pte);
> +                               src_ptl = huge_pte_lockptr(huge_page_shift(h),
> +                                                          src, src_pte);
>                                 spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>                                 entry = huge_ptep_get(src_pte);
>                                 if (!pte_same(src_pte_old, entry)) {
> @@ -4979,7 +4980,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
>         pte_t pte;
>
>         dst_ptl = huge_pte_lock(h, mm, dst_pte);
> -       src_ptl = huge_pte_lockptr(h, mm, src_pte);
> +       src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
>
>         /*
>          * We don't have to worry about the ordering of src and dst ptlocks
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1457cdbb7828..a0105fa6e3b2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -334,7 +334,8 @@ void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
>
>  void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
>  {
> -       spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
> +       spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
> +                                          vma->vm_mm, pte);
>
>         __migration_entry_wait_huge(pte, ptl);
>  }
> --
> 2.38.0.135.g90850a2211-goog
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-10-21 16:36 ` [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
  2022-11-16 22:17   ` Peter Xu
@ 2022-12-08  0:46   ` Mina Almasry
  2022-12-09 16:02     ` James Houghton
  1 sibling, 1 reply; 122+ messages in thread
From: Mina Almasry @ 2022-12-08  0:46 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
>
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.
>
> Without this, we'd have to pass around the "size" of the PTE everywhere.
> We effectively did this before; it could be fetched from the hstate,
> which we pass around pretty much everywhere.
>
> hugetlb_pte_present_leaf is included here as a helper function that will
> be used frequently later on.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
>  mm/hugetlb.c            | 29 ++++++++++++++
>  2 files changed, 117 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index db3ed6095b1c..d30322108b34 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -50,6 +50,75 @@ enum {
>         __NR_USED_SUBPAGE,
>  };
>
> +enum hugetlb_level {
> +       HUGETLB_LEVEL_PTE = 1,
> +       /*
> +        * We always include PMD, PUD, and P4D in this enum definition so that,
> +        * when logged as an integer, we can easily tell which level it is.
> +        */
> +       HUGETLB_LEVEL_PMD,
> +       HUGETLB_LEVEL_PUD,
> +       HUGETLB_LEVEL_P4D,
> +       HUGETLB_LEVEL_PGD,
> +};
> +

Don't we need to support CONTIG_PTE/PMD levels here for ARM64?

> +struct hugetlb_pte {
> +       pte_t *ptep;
> +       unsigned int shift;
> +       enum hugetlb_level level;

Is shift + level redundant? When would those diverge?

> +       spinlock_t *ptl;
> +};
> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> +                         unsigned int shift, enum hugetlb_level level)
> +{
> +       WARN_ON_ONCE(!ptep);
> +       hpte->ptep = ptep;
> +       hpte->shift = shift;
> +       hpte->level = level;
> +       hpte->ptl = NULL;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +       WARN_ON_ONCE(!hpte->ptep);
> +       return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +       WARN_ON_ONCE(!hpte->ptep);
> +       return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> +       WARN_ON_ONCE(!hpte->ptep);
> +       return hpte->shift;
> +}
> +
> +static inline
> +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> +{
> +       WARN_ON_ONCE(!hpte->ptep);
> +       return hpte->level;
> +}
> +
> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> +       dest->ptep = src->ptep;
> +       dest->shift = src->shift;
> +       dest->level = src->level;
> +       dest->ptl = src->ptl;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> +
>  struct hugepage_subpool {
>         spinlock_t lock;
>         long count;
> @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
>         return ptl;
>  }
>
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +       BUG_ON(!hpte->ptep);

I think BUG_ON()s will be frowned upon. This function also doesn't
really need ptep. Maybe let hugetlb_pte_shift() decide to BUG_ON() if
necessary.


> +       if (hpte->ptl)
> +               return hpte->ptl;
> +       return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);

> I don't know if this fallback to huge_pte_lockptr() should be obvious
to the reader. If not, a comment would help.

> +}
> +
> +static inline
> +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +       spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +
> +       spin_lock(ptl);
> +       return ptl;
> +}
> +
>  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
>  extern void __init hugetlb_cma_reserve(int order);
>  #else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ef7662bd0068..a0e46d35dabc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>         return false;
>  }
>
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)

I also don't know if this is obvious to other readers, but I'm quite
confused that we pass both hugetlb_pte and pte_t here, especially when
hpte has a pte_t inside of it. Maybe a comment would help.

> +{
> +       pgd_t pgd;
> +       p4d_t p4d;
> +       pud_t pud;
> +       pmd_t pmd;
> +
> +       WARN_ON_ONCE(!hpte->ptep);
> +       switch (hugetlb_pte_level(hpte)) {
> +       case HUGETLB_LEVEL_PGD:
> +               pgd = __pgd(pte_val(pte));
> +               return pgd_present(pgd) && pgd_leaf(pgd);
> +       case HUGETLB_LEVEL_P4D:
> +               p4d = __p4d(pte_val(pte));
> +               return p4d_present(p4d) && p4d_leaf(p4d);
> +       case HUGETLB_LEVEL_PUD:
> +               pud = __pud(pte_val(pte));
> +               return pud_present(pud) && pud_leaf(pud);
> +       case HUGETLB_LEVEL_PMD:
> +               pmd = __pmd(pte_val(pte));
> +               return pmd_present(pmd) && pmd_leaf(pmd);
> +       case HUGETLB_LEVEL_PTE:
> +               return pte_present(pte);
> +       default:
> +               WARN_ON_ONCE(1);
> +               return false;
> +       }
> +}
> +
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>         int nid = page_to_nid(page);
> --
> 2.38.0.135.g90850a2211-goog
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2022-11-21 18:33     ` James Houghton
@ 2022-12-08 22:55       ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-08 22:55 UTC (permalink / raw)
  To: James Houghton
  Cc: Peter Xu, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 11/21/22 10:33, James Houghton wrote:
> On Wed, Nov 16, 2022 at 8:30 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Oct 21, 2022 at 04:36:17PM +0000, James Houghton wrote:
> > > This is how it should have been to begin with. It would be very bad if
> > > we actually set PageUptodate with a UFFDIO_CONTINUE, as UFFDIO_CONTINUE
> > > doesn't actually set/update the contents of the page, so we would be
> > > exposing a non-zeroed page to the user.
> > >
> > > The reason this change is being made now is because UFFDIO_CONTINUEs on
> > > subpages definitely shouldn't set this page flag on the head page.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  mm/hugetlb.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 1a7dc7b2e16c..650761cdd2f6 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -6097,7 +6097,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >        * preceding stores to the page contents become visible before
> > >        * the set_pte_at() write.
> > >        */
> > > -     __SetPageUptodate(page);
> > > +     if (!is_continue)
> > > +             __SetPageUptodate(page);
> > > +     else
> > > +             VM_WARN_ON_ONCE_PAGE(!PageUptodate(page), page);
> >
> > Yeah, the old code looks wrong; I'm just wondering whether we can 100%
> > guarantee this for hugetlb.  E.g. for shmem that won't hold when we
> > uffd-continue on an unused page (e.g. one created by an over-sized fallocate()).
> >
> > Another, safer approach is to simply fail the ioctl if !uptodate, but if you're
> > certain then WARN_ON_ONCE sounds all good too.  At least I did have a quick
> > look at hugetlb fallocate(), and pages become uptodate immediately.
> 
> Failing the ioctl sounds better than only WARNing. I'll do that and
> drop the WARN_ON_ONCE for v1. Thanks!
> 

Sorry for the VERY late reply ...

After checking all the code paths, I do not think it is possible for a
!PageUptodate page to be in the cache (the target of a continue).

ACK to failing the ioctl if not set, although I don't think it is possible
in current code.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused
  2022-10-21 16:36 ` [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused James Houghton
  2022-11-16 16:35   ` Peter Xu
  2022-12-07 23:13   ` Mina Almasry
@ 2022-12-08 23:42   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-08 23:42 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> mk_huge_pte is unused and not necessary. pte_mkhuge is the appropriate
> function to call to create a HugeTLB PTE (see
> Documentation/mm/arch_pgtable_helpers.rst).
> 
> It is being removed now to avoid complicating the implementation of
> HugeTLB high-granularity mapping.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/s390/include/asm/hugetlb.h | 5 -----
>  include/asm-generic/hugetlb.h   | 5 -----
>  mm/debug_vm_pgtable.c           | 2 +-
>  mm/hugetlb.c                    | 7 +++----
>  4 files changed, 4 insertions(+), 15 deletions(-)

Thanks!

I suspect there is more cleanup of 'hugetlb page table helpers' that
could be done.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path
  2022-10-21 16:36 ` [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
  2022-11-16 16:36   ` Peter Xu
  2022-12-07 23:16   ` Mina Almasry
@ 2022-12-09  0:10   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-09  0:10 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> arch_make_huge_pte, which is called immediately following pte_mkhuge,
> already makes the necessary changes to the PTE that pte_mkhuge would
> have. The generic implementation of arch_make_huge_pte simply calls
> pte_mkhuge.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/migrate.c | 1 -
>  1 file changed, 1 deletion(-)

Thanks,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 04/47] hugetlb: only adjust address ranges when VMAs want PMD sharing
  2022-10-21 16:36 ` [RFC PATCH v2 04/47] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
  2022-11-16 16:50   ` Peter Xu
@ 2022-12-09  0:22   ` Mike Kravetz
  1 sibling, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-09  0:22 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> Currently this check is overly aggressive. For some userfaultfd VMAs,
> PMD sharing is disabled, yet we still widen the address range, which is
> used for flushing TLBs and sending MMU notifiers.

Yes, the userfaultfd check is missing in the code today.

> This is done now, as HGM VMAs also have sharing disabled, yet would
> still have flush ranges adjusted. Overaggressively flushing TLBs and
> triggering MMU notifiers is particularly harmful with lots of
> high-granularity operations.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)

Thanks,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions
  2022-12-08  0:26   ` Mina Almasry
@ 2022-12-09 15:41     ` James Houghton
  0 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-12-09 15:41 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Dec 7, 2022 at 7:26 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
> >
> > Currently it is possible for all shared VMAs to use HGM, but it must be
> > enabled first. This is because with HGM, we lose PMD sharing, and page
> > table walks require additional synchronization (we need to take the VMA
> > lock).
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h | 22 +++++++++++++
> >  mm/hugetlb.c            | 69 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 91 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 534958499ac4..6e0c36b08a0c 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -123,6 +123,9 @@ struct hugetlb_vma_lock {
> >
> >  struct hugetlb_shared_vma_data {
> >         struct hugetlb_vma_lock vma_lock;
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +       bool hgm_enabled;
> > +#endif
> >  };
> >
> >  extern struct resv_map *resv_map_alloc(void);
> > @@ -1179,6 +1182,25 @@ static inline void hugetlb_unregister_node(struct node *node)
> >  }
> >  #endif /* CONFIG_HUGETLB_PAGE */
> >
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> > +bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> > +int enable_hugetlb_hgm(struct vm_area_struct *vma);
> > +#else
> > +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > +{
> > +       return false;
> > +}
> > +static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> > +{
> > +       return false;
> > +}
> > +static inline int enable_hugetlb_hgm(struct vm_area_struct *vma)
> > +{
> > +       return -EINVAL;
> > +}
> > +#endif
> > +
> >  static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >                                         struct mm_struct *mm, pte_t *pte)
> >  {
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 5ae8bc8c928e..a18143add956 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6840,6 +6840,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
> >  #ifdef CONFIG_USERFAULTFD
> >         if (uffd_disable_huge_pmd_share(vma))
> >                 return false;
> > +#endif
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +       if (hugetlb_hgm_enabled(vma))
> > +               return false;
> >  #endif
> >         /*
> >          * Only shared VMAs can share PMDs.
> > @@ -7033,6 +7037,9 @@ static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
> >         kref_init(&data->vma_lock.refs);
> >         init_rwsem(&data->vma_lock.rw_sema);
> >         data->vma_lock.vma = vma;
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +       data->hgm_enabled = false;
> > +#endif
> >         vma->vm_private_data = data;
> >         return 0;
> >  }
> > @@ -7290,6 +7297,68 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
> >
> >  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> >
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> > +{
> > +       /*
> > +        * All shared VMAs may have HGM.
> > +        *
> > +        * HGM requires using the VMA lock, which only exists for shared VMAs.
> > +        * To make HGM work for private VMAs, we would need to use another
> > +        * scheme to prevent collapsing/splitting from invalidating other
> > +        * threads' page table walks.
> > +        */
> > +       return vma && (vma->vm_flags & VM_MAYSHARE);
> > +}
> > +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > +{
> > +       struct hugetlb_shared_vma_data *data = vma->vm_private_data;
> > +
> > +       if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> > +               return false;
> > +
> > +       return data && data->hgm_enabled;
>
> Don't you need to lock data->vma_lock before you access data? Or did I
> misunderstand the locking? Or are you assuming this is safe because
> hgm_enabled can't be disabled?

This should be protected by the mmap_lock (we must be holding it for
at least reading here). `data` and `data->hgm_enabled` are only
changed when holding the mmap_lock for writing.
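
To make that expectation explicit, hugetlb_hgm_enabled() could grow an
assert, roughly like this (just a sketch, not final):

bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
        struct hugetlb_shared_vma_data *data;

        if (!vma || !(vma->vm_flags & VM_MAYSHARE))
                return false;

        /* Readers rely on the mmap lock to keep this result stable. */
        mmap_assert_locked(vma->vm_mm);

        data = vma->vm_private_data;
        return data && data->hgm_enabled;
}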

> > +}
> > +
> > +/*
> > + * Enable high-granularity mapping (HGM) for this VMA. Once enabled, HGM
> > + * cannot be turned off.
> > + *
> > + * PMDs cannot be shared in HGM VMAs.
> > + */
> > +int enable_hugetlb_hgm(struct vm_area_struct *vma)
> > +{
> > +       int ret;
> > +       struct hugetlb_shared_vma_data *data;
> > +
> > +       if (!hugetlb_hgm_eligible(vma))
> > +               return -EINVAL;
> > +
> > +       if (hugetlb_hgm_enabled(vma))
> > +               return 0;
> > +
> > +       /*
> > +        * We must hold the mmap lock for writing so that callers can rely on
> > +        * hugetlb_hgm_enabled returning a consistent result while holding
> > +        * the mmap lock for reading.
> > +        */
> > +       mmap_assert_write_locked(vma->vm_mm);
> > +
> > +       /* HugeTLB HGM requires the VMA lock to synchronize collapsing. */
> > +       ret = hugetlb_vma_data_alloc(vma);
>
> Confused we need to vma_data_alloc() here. Shouldn't this be done by
> hugetlb_vm_op_open()?

hugetlb_vma_data_alloc() can fail. In hugetlb_vm_op_open()/other
places, it is allowed to fail, and so we call it again here and check
that it succeeded so that we can rely on the VMA lock.

I think I need to be a little bit more careful with how I handle VMA
splitting, though. It's possible for `data` not to be allocated after
we split, but for some things to be mapped at high-granularity. The
easiest solution here would be to disallow splitting when HGM is
enabled; not sure what the best solution is though.
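
If we go the route of disallowing splits, it could be as simple as one
extra check in the existing hugetlb_vm_op_split() (sketch only):

static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
{
        if (addr & ~(huge_page_mask(hstate_vma(vma))))
                return -EINVAL;

        /* Splitting would leave the per-VMA HGM state inconsistent. */
        if (hugetlb_hgm_enabled(vma))
                return -EINVAL;

        return 0;
}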

Thanks for the review, Mina!

>
> > +       if (ret)
> > +               return ret;
> > +
> > +       data = vma->vm_private_data;
> > +       BUG_ON(!data);
> > +       data->hgm_enabled = true;
> > +
> > +       /* We don't support PMD sharing with HGM. */
> > +       hugetlb_unshare_all_pmds(vma);
> > +       return 0;
> > +}
> > +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> > +
> >  /*
> >   * These functions are overwritable if your architecture needs its own
> >   * behavior.
> > --
> > 2.38.0.135.g90850a2211-goog
> >

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-12-08  0:46   ` Mina Almasry
@ 2022-12-09 16:02     ` James Houghton
  2022-12-13 18:44       ` Mike Kravetz
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-09 16:02 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Dec 7, 2022 at 7:46 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
> >
> > After high-granularity mapping, page table entries for HugeTLB pages can
> > be of any size/type. (For example, we can have a 1G page mapped with a
> > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > PTE after we have done a page table walk.
> >
> > Without this, we'd have to pass around the "size" of the PTE everywhere.
> > We effectively did this before; it could be fetched from the hstate,
> > which we pass around pretty much everywhere.
> >
> > hugetlb_pte_present_leaf is included here as a helper function that will
> > be used frequently later on.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
> >  mm/hugetlb.c            | 29 ++++++++++++++
> >  2 files changed, 117 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index db3ed6095b1c..d30322108b34 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -50,6 +50,75 @@ enum {
> >         __NR_USED_SUBPAGE,
> >  };
> >
> > +enum hugetlb_level {
> > +       HUGETLB_LEVEL_PTE = 1,
> > +       /*
> > +        * We always include PMD, PUD, and P4D in this enum definition so that,
> > +        * when logged as an integer, we can easily tell which level it is.
> > +        */
> > +       HUGETLB_LEVEL_PMD,
> > +       HUGETLB_LEVEL_PUD,
> > +       HUGETLB_LEVEL_P4D,
> > +       HUGETLB_LEVEL_PGD,
> > +};
> > +
>
> Don't we need to support CONTIG_PTE/PMD levels here for ARM64?

Yeah, which is why shift and level aren't quite the same thing.
Contiguous PMDs would be HUGETLB_LEVEL_PMD but have shift =
CONT_PMD_SHIFT, whereas regular PMDs would have shift = PMD_SHIFT.

>
> > +struct hugetlb_pte {
> > +       pte_t *ptep;
> > +       unsigned int shift;
> > +       enum hugetlb_level level;
>
> Is shift + level redundant? When would those diverge?

Peter asked a very similar question. `shift` can be used to determine
`level` if no levels are being folded. In the case of folded levels,
you might have a single shift that corresponds to multiple "levels".
That isn't necessarily a problem, as folding a level just means
casting your p?d_t* differently, but I think it's good to be able to
*know* that, if the hugetlb_pte was populated with a pud_t*, we always
treat it like a pud_t*.

If `ptep` was instead a union, then `level` would be the tag. Perhaps
it should be written that way.
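
For example, on arm64 a page table walk could end up populating the
hugetlb_pte in any of these ways (hypothetical call sites, just to show
where shift and level diverge):

        /* regular PMD leaf: level and shift agree */
        hugetlb_pte_populate(&hpte, (pte_t *)pmdp, PMD_SHIFT,
                             HUGETLB_LEVEL_PMD);
        /* contiguous PMD leaf: same level, bigger shift */
        hugetlb_pte_populate(&hpte, (pte_t *)pmdp, CONT_PMD_SHIFT,
                             HUGETLB_LEVEL_PMD);
        /* contiguous PTE leaf: PTE level, CONT_PTE_SHIFT */
        hugetlb_pte_populate(&hpte, ptep, CONT_PTE_SHIFT,
                             HUGETLB_LEVEL_PTE);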

>
> > +       spinlock_t *ptl;
> > +};
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > +                         unsigned int shift, enum hugetlb_level level)
> > +{
> > +       WARN_ON_ONCE(!ptep);
> > +       hpte->ptep = ptep;
> > +       hpte->shift = shift;
> > +       hpte->level = level;
> > +       hpte->ptl = NULL;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       return hpte->shift;
> > +}
> > +
> > +static inline
> > +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> > +{
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       return hpte->level;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > +       dest->ptep = src->ptep;
> > +       dest->shift = src->shift;
> > +       dest->level = src->level;
> > +       dest->ptl = src->ptl;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > +
> >  struct hugepage_subpool {
> >         spinlock_t lock;
> >         long count;
> > @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >         return ptl;
> >  }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > +       BUG_ON(!hpte->ptep);
>
> I think BUG_ON()s will be frowned upon. This function also doesn't
> really need ptep. Maybe let hugetlb_pte_shift() decide to BUG_ON() if
> necessary.

Right. I'll remove this (and others that aren't really necessary).
Peter's suggestion to just let the kernel take a #pf and crash
(thereby logging more info) SGTM.

>
>
> > +       if (hpte->ptl)
> > +               return hpte->ptl;
> > +       return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
>
> I don't know if this fallback to huge_pte_lockptr() should be obvious
> to the reader. If not, a comment would help.

I'll clean this up a little for the next version. If something like
this branch stays, I'll add a comment.
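
Roughly like this (sketch only -- same logic as above, with the BUG_ON
from the earlier comment dropped and the fallback documented):

	static inline
	spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm,
					struct hugetlb_pte *hpte)
	{
		/*
		 * A hugetlb_pte for a lower-level page table may carry its
		 * own PTL (the split lock for that page table page); if
		 * not, fall back to huge_pte_lockptr(), which picks the
		 * lock based on the PTE's shift.
		 */
		if (hpte->ptl)
			return hpte->ptl;
		return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm,
					hpte->ptep);
	}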

>
> > +}
> > +
> > +static inline
> > +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +       spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > +
> > +       spin_lock(ptl);
> > +       return ptl;
> > +}
> > +
> >  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
> >  extern void __init hugetlb_cma_reserve(int order);
> >  #else
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ef7662bd0068..a0e46d35dabc 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> >         return false;
> >  }
> >
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
>
> I also don't know if this is obvious to other readers, but I'm quite
> confused that we pass both hugetlb_pte and pte_t here, especially when
> hpte has a pte_t inside of it. Maybe a comment would help.

It's possible for the value of the pte to change if we haven't locked
the PTL; we only store a pte_t* in hugetlb_pte, not the value itself.

Thinking about this... we *do* store `shift` which technically depends
on the value of the PTE. If the PTE is pte_none, the true `shift` of
the PTE is ambiguous, and so we just provide what the user asked for.
That could lead to a scenario where UFFDIO_CONTINUE(some 4K page) then
UFFDIO_CONTINUE(CONT_PTE_SIZE range around that page) can both succeed
because we merely check if the *first* PTE in the contiguous bunch is
none/has changed.

So, in the case of a contiguous PTE where we *think* we're overwriting
a bunch of none PTEs, we need to check that each PTE we're overwriting
is still none while holding the PTL. That means that the PTL we use
for cont PTEs and non-cont PTEs of the same level must be the same.

So for the next version, I'll:
- add some requirement that contiguous and non-contiguous PTEs on the
same level must use the same PTL
- think up some kind of API like all_contig_ptes_none() (rough sketch
below), but it only really applies to arm64, so I think actually
putting it in can wait. I'll at least put a comment in
hugetlb_mcopy_atomic_pte and hugetlb_no_page (near the final
huge_pte_none() and pte_same() checks).
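
The arm64-only helper could look something like this (rough sketch
using the existing CONT_PTES and ptep_get() helpers; it assumes ptep
points at the first entry of the contiguous group and that the caller
holds the PTL covering the whole group):

	static bool all_contig_ptes_none(pte_t *ptep)
	{
		int i;

		for (i = 0; i < CONT_PTES; i++)
			if (!huge_pte_none(ptep_get(ptep + i)))
				return false;
		return true;
	}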


>
> > +{
> > +       pgd_t pgd;
> > +       p4d_t p4d;
> > +       pud_t pud;
> > +       pmd_t pmd;
> > +
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       switch (hugetlb_pte_level(hpte)) {
> > +       case HUGETLB_LEVEL_PGD:
> > +               pgd = __pgd(pte_val(pte));
> > +               return pgd_present(pgd) && pgd_leaf(pgd);
> > +       case HUGETLB_LEVEL_P4D:
> > +               p4d = __p4d(pte_val(pte));
> > +               return p4d_present(p4d) && p4d_leaf(p4d);
> > +       case HUGETLB_LEVEL_PUD:
> > +               pud = __pud(pte_val(pte));
> > +               return pud_present(pud) && pud_leaf(pud);
> > +       case HUGETLB_LEVEL_PMD:
> > +               pmd = __pmd(pte_val(pte));
> > +               return pmd_present(pmd) && pmd_leaf(pmd);
> > +       case HUGETLB_LEVEL_PTE:
> > +               return pte_present(pte);
> > +       default:
> > +               WARN_ON_ONCE(1);
> > +               return false;
> > +       }
> > +}
> > +
> >  static void enqueue_huge_page(struct hstate *h, struct page *page)
> >  {
> >         int nid = page_to_nid(page);
> > --
> > 2.38.0.135.g90850a2211-goog
> >

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason
  2022-10-21 16:36 ` [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason James Houghton
  2022-11-16 17:08   ` Peter Xu
  2022-12-07 23:33   ` Mina Almasry
@ 2022-12-09 22:36   ` Mike Kravetz
  2 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-09 22:36 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> Currently hugetlb_vma_lock_alloc doesn't return anything, as there is no
> need: if it fails, PMD sharing won't be enabled. However, HGM requires
> that the VMA lock exists, so we need to verify that
> hugetlb_vma_lock_alloc actually succeeded. If hugetlb_vma_lock_alloc
> fails, then we can pass that up to the caller that is attempting to
> enable HGM.

No serious objections to this change ...

However, there are currently only two places today where hugetlb_vma_lock_alloc
is called: hugetlb_reserve_pages and hugetlb_vm_op_open.  hugetlb_reserve_pages
is not an issue.  Since hugetlb_vm_op_open (as a defined vm_operation) returns
void, I am not sure how you plan to pass up an allocation failure.
Suspect this will become evident in subsequent patches.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas
  2022-11-30 21:01   ` Peter Xu
  2022-11-30 23:29     ` James Houghton
@ 2022-12-09 22:48     ` Mike Kravetz
  1 sibling, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-09 22:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 11/30/22 16:01, Peter Xu wrote:
> On Fri, Oct 21, 2022 at 04:36:22PM +0000, James Houghton wrote:
> > This allows us to add more data into the shared structure, which we will
> > use to store whether or not HGM is enabled for this VMA or not, as HGM
> > is only available for shared mappings.
> > 
> > It may be better to include HGM as a VMA flag instead of extending the
> > VMA lock structure.
> > 
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h |  4 +++
> >  mm/hugetlb.c            | 65 +++++++++++++++++++++--------------------
> >  2 files changed, 37 insertions(+), 32 deletions(-)
> > 
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index a899bc76d677..534958499ac4 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -121,6 +121,10 @@ struct hugetlb_vma_lock {
> >  	struct vm_area_struct *vma;
> >  };
> >  
> > +struct hugetlb_shared_vma_data {
> > +	struct hugetlb_vma_lock vma_lock;
> > +};
> 
> How about add a comment above hugetlb_vma_lock showing how it should be
> used correctly?  We lacked documents on the lock for pmd sharing
> protections, now if to reuse the same lock for HGM pgtables I think some
> doc will definitely help.
> 
> To summarize, I think so far it means:
> 
>   - Read lock needed when one wants to stablize VM_SHARED pgtables (covers
>     both pmd shared pgtables or hgm low-level pgtables)
> 
>   - Write lock needed when one wants to release VM_SHARED pgtable pages
>     (covers both pmd unshare or releasing hgm low-level pgtables)

Peter must have read ahead and knows that you plan to use the vma_lock for HGM.

The commit message implies that you only need some type of indication (a flag,
for instance) that HGM is enabled for the vma.

No objections to expanding the structure as is done here.

If this is the direction we take, and someday this is extended to private
mappings, we could use the same scheme to expand the reserve map structure.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 07/47] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  2022-10-21 16:36 ` [RFC PATCH v2 07/47] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
@ 2022-12-09 22:52   ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-09 22:52 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> This adds the Kconfig to enable or disable high-granularity mapping.
> Each architecture must explicitly opt-in to it (via
> ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING), but when opted in, HGM will
> be enabled by default if HUGETLB_PAGE is enabled.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  fs/Kconfig | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 2685a4d0d353..ce2567946016 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -267,6 +267,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
>  	  enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
>  	  (boot command line) or hugetlb_optimize_vmemmap (sysctl).
>  
> +config ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	bool
> +
> +config HUGETLB_HIGH_GRANULARITY_MAPPING
> +	def_bool HUGETLB_PAGE
> +	depends on ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING

Might also need to make this depend on CONFIG_ARCH_WANT_HUGE_PMD_SHARE
as the vma local allocation will only happen in this case.

-- 
Mike Kravetz

> +
>  config MEMFD_CREATE
>  	def_bool TMPFS || HUGETLBFS
>  
> -- 
> 2.38.0.135.g90850a2211-goog
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions
  2022-10-21 16:36 ` [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions James Houghton
  2022-11-16 17:19   ` Peter Xu
  2022-12-08  0:26   ` Mina Almasry
@ 2022-12-13  0:13   ` Mike Kravetz
  2022-12-13 15:49     ` James Houghton
  2 siblings, 1 reply; 122+ messages in thread
From: Mike Kravetz @ 2022-12-13  0:13 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> Currently it is possible for all shared VMAs to use HGM, but it must be
> enabled first. This is because with HGM, we lose PMD sharing, and page
> table walks require additional synchronization (we need to take the VMA
> lock).

Not sure yet, but I expect Peter's series will help with locking for
hugetlb specific page table walks.

> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 22 +++++++++++++
>  mm/hugetlb.c            | 69 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 91 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 534958499ac4..6e0c36b08a0c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -123,6 +123,9 @@ struct hugetlb_vma_lock {
>  
>  struct hugetlb_shared_vma_data {
>  	struct hugetlb_vma_lock vma_lock;
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	bool hgm_enabled;
> +#endif
>  };
>  
>  extern struct resv_map *resv_map_alloc(void);
> @@ -1179,6 +1182,25 @@ static inline void hugetlb_unregister_node(struct node *node)
>  }
>  #endif	/* CONFIG_HUGETLB_PAGE */
>  
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> +int enable_hugetlb_hgm(struct vm_area_struct *vma);
> +#else
> +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +static inline int enable_hugetlb_hgm(struct vm_area_struct *vma)
> +{
> +	return -EINVAL;
> +}
> +#endif
> +
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
>  					struct mm_struct *mm, pte_t *pte)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5ae8bc8c928e..a18143add956 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6840,6 +6840,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
>  #ifdef CONFIG_USERFAULTFD
>  	if (uffd_disable_huge_pmd_share(vma))
>  		return false;
> +#endif
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	if (hugetlb_hgm_enabled(vma))
> +		return false;
>  #endif
>  	/*
>  	 * Only shared VMAs can share PMDs.
> @@ -7033,6 +7037,9 @@ static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
>  	kref_init(&data->vma_lock.refs);
>  	init_rwsem(&data->vma_lock.rw_sema);
>  	data->vma_lock.vma = vma;
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	data->hgm_enabled = false;
> +#endif
>  	vma->vm_private_data = data;
>  	return 0;
>  }
> @@ -7290,6 +7297,68 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
>  
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>  
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +	/*
> +	 * All shared VMAs may have HGM.
> +	 *
> +	 * HGM requires using the VMA lock, which only exists for shared VMAs.
> +	 * To make HGM work for private VMAs, we would need to use another
> +	 * scheme to prevent collapsing/splitting from invalidating other
> +	 * threads' page table walks.
> +	 */
> +	return vma && (vma->vm_flags & VM_MAYSHARE);

I am not yet 100% convinced you can/will take care of all possible code
paths where hugetlb_vma_data allocation may fail.  If not, then you
should be checking vm_private_data here as well.

> +}
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +	struct hugetlb_shared_vma_data *data = vma->vm_private_data;
> +
> +	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> +		return false;
> +
> +	return data && data->hgm_enabled;
> +}
> +
> +/*
> + * Enable high-granularity mapping (HGM) for this VMA. Once enabled, HGM
> + * cannot be turned off.
> + *
> + * PMDs cannot be shared in HGM VMAs.
> + */
> +int enable_hugetlb_hgm(struct vm_area_struct *vma)
> +{
> +	int ret;
> +	struct hugetlb_shared_vma_data *data;
> +
> +	if (!hugetlb_hgm_eligible(vma))
> +		return -EINVAL;
> +
> +	if (hugetlb_hgm_enabled(vma))
> +		return 0;
> +
> +	/*
> +	 * We must hold the mmap lock for writing so that callers can rely on
> +	 * hugetlb_hgm_enabled returning a consistent result while holding
> +	 * the mmap lock for reading.
> +	 */
> +	mmap_assert_write_locked(vma->vm_mm);
> +
> +	/* HugeTLB HGM requires the VMA lock to synchronize collapsing. */
> +	ret = hugetlb_vma_data_alloc(vma);
> +	if (ret)
> +		return ret;
> +
> +	data = vma->vm_private_data;
> +	BUG_ON(!data);

Would rather have hugetlb_hgm_eligible check for vm_private_data as
suggested above instead of the BUG here.

-- 
Mike Kravetz

> +	data->hgm_enabled = true;
> +
> +	/* We don't support PMD sharing with HGM. */
> +	hugetlb_unshare_all_pmds(vma);
> +	return 0;
> +}
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> +
>  /*
>   * These functions are overwritable if your architecture needs its own
>   * behavior.
> -- 
> 2.38.0.135.g90850a2211-goog
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 09/47] hugetlb: make huge_pte_lockptr take an explicit shift argument.
  2022-10-21 16:36 ` [RFC PATCH v2 09/47] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
  2022-12-08  0:30   ` Mina Almasry
@ 2022-12-13  0:25   ` Mike Kravetz
  1 sibling, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-13  0:25 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> This is needed to handle PTL locking with high-granularity mapping. We
> won't always be using the PMD-level PTL even if we're using the 2M
> hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
> case, we need to lock the PTL for the 4K PTE.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/powerpc/mm/pgtable.c | 3 ++-
>  include/linux/hugetlb.h   | 9 ++++-----
>  mm/hugetlb.c              | 7 ++++---
>  mm/migrate.c              | 3 ++-
>  4 files changed, 12 insertions(+), 10 deletions(-)

Straight forward substitution,

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions
  2022-12-13  0:13   ` Mike Kravetz
@ 2022-12-13 15:49     ` James Houghton
  2022-12-15 17:51       ` Mike Kravetz
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-13 15:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Mon, Dec 12, 2022 at 7:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 10/21/22 16:36, James Houghton wrote:
> > Currently it is possible for all shared VMAs to use HGM, but it must be
> > enabled first. This is because with HGM, we lose PMD sharing, and page
> > table walks require additional synchronization (we need to take the VMA
> > lock).
>
> Not sure yet, but I expect Peter's series will help with locking for
> hugetlb specific page table walks.

It should make things a little bit cleaner in this series; I'll rebase
HGM on top of those patches this week (and hopefully get a v1 out
soon).

I don't think it's possible to implement MADV_COLLAPSE with RCU alone
(as implemented in Peter's series anyway); we still need the VMA lock.

>
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h | 22 +++++++++++++
> >  mm/hugetlb.c            | 69 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 91 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 534958499ac4..6e0c36b08a0c 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -123,6 +123,9 @@ struct hugetlb_vma_lock {
> >
> >  struct hugetlb_shared_vma_data {
> >       struct hugetlb_vma_lock vma_lock;
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +     bool hgm_enabled;
> > +#endif
> >  };
> >
> >  extern struct resv_map *resv_map_alloc(void);
> > @@ -1179,6 +1182,25 @@ static inline void hugetlb_unregister_node(struct node *node)
> >  }
> >  #endif       /* CONFIG_HUGETLB_PAGE */
> >
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> > +bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> > +int enable_hugetlb_hgm(struct vm_area_struct *vma);
> > +#else
> > +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > +{
> > +     return false;
> > +}
> > +static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> > +{
> > +     return false;
> > +}
> > +static inline int enable_hugetlb_hgm(struct vm_area_struct *vma)
> > +{
> > +     return -EINVAL;
> > +}
> > +#endif
> > +
> >  static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >                                       struct mm_struct *mm, pte_t *pte)
> >  {
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 5ae8bc8c928e..a18143add956 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6840,6 +6840,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
> >  #ifdef CONFIG_USERFAULTFD
> >       if (uffd_disable_huge_pmd_share(vma))
> >               return false;
> > +#endif
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +     if (hugetlb_hgm_enabled(vma))
> > +             return false;
> >  #endif
> >       /*
> >        * Only shared VMAs can share PMDs.
> > @@ -7033,6 +7037,9 @@ static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
> >       kref_init(&data->vma_lock.refs);
> >       init_rwsem(&data->vma_lock.rw_sema);
> >       data->vma_lock.vma = vma;
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +     data->hgm_enabled = false;
> > +#endif
> >       vma->vm_private_data = data;
> >       return 0;
> >  }
> > @@ -7290,6 +7297,68 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
> >
> >  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> >
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> > +{
> > +     /*
> > +      * All shared VMAs may have HGM.
> > +      *
> > +      * HGM requires using the VMA lock, which only exists for shared VMAs.
> > +      * To make HGM work for private VMAs, we would need to use another
> > +      * scheme to prevent collapsing/splitting from invalidating other
> > +      * threads' page table walks.
> > +      */
> > +     return vma && (vma->vm_flags & VM_MAYSHARE);
>
> I am not yet 100% convinced you can/will take care of all possible code
> paths where hugetlb_vma_data allocation may fail.  If not, then you
> should be checking vm_private_data here as well.

I think the check here makes sense -- if a VMA is shared, then it is
eligible for HGM, but we might fail to enable it because we can't
allocate the VMA lock. I'll reword the comment to clearly say this.

There is the problem of splitting, though: if we have high-granularity
mapped PTEs in a VMA and that VMA gets split, we need to remember that
the VMA had HGM enabled even if allocating the VMA lock fails,
otherwise things get out of sync. How does PMD sharing handle the
splitting case?

An easy way HGM could handle this is by disallowing splitting, but I
think we can do better. If we fail to allocate the VMA lock, then we
can no longer MADV_COLLAPSE safely, but everything else can proceed as
normal, and so some "hugetlb_hgm_enabled" checks can be
removed/changed. This should make things easier for when we have to
handle (some bits of) HGM for private mappings, too. I'll make some
improvements here for v1.

>
> > +}
> > +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > +{
> > +     struct hugetlb_shared_vma_data *data = vma->vm_private_data;
> > +
> > +     if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> > +             return false;
> > +
> > +     return data && data->hgm_enabled;
> > +}
> > +
> > +/*
> > + * Enable high-granularity mapping (HGM) for this VMA. Once enabled, HGM
> > + * cannot be turned off.
> > + *
> > + * PMDs cannot be shared in HGM VMAs.
> > + */
> > +int enable_hugetlb_hgm(struct vm_area_struct *vma)
> > +{
> > +     int ret;
> > +     struct hugetlb_shared_vma_data *data;
> > +
> > +     if (!hugetlb_hgm_eligible(vma))
> > +             return -EINVAL;
> > +
> > +     if (hugetlb_hgm_enabled(vma))
> > +             return 0;
> > +
> > +     /*
> > +      * We must hold the mmap lock for writing so that callers can rely on
> > +      * hugetlb_hgm_enabled returning a consistent result while holding
> > +      * the mmap lock for reading.
> > +      */
> > +     mmap_assert_write_locked(vma->vm_mm);
> > +
> > +     /* HugeTLB HGM requires the VMA lock to synchronize collapsing. */
> > +     ret = hugetlb_vma_data_alloc(vma);
> > +     if (ret)
> > +             return ret;
> > +
> > +     data = vma->vm_private_data;
> > +     BUG_ON(!data);
>
> Would rather have hugetlb_hgm_eligible check for vm_private_data as
> suggested above instead of the BUG here.

I don't think we'd ever actually BUG() here. Please correct me if I'm
wrong, but if we are eligible for HGM, then hugetlb_vma_data_alloc()
will only succeed if we actually allocated the VMA data/lock, so
vma->vm_private_data should never be NULL (with the BUG_ON to inform
the reader). Maybe I should just drop the BUG()?
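
Or, instead of dropping it entirely, downgrade it to something
non-fatal -- just a sketch of that option, not necessarily what I'll
do:

	data = vma->vm_private_data;
	if (WARN_ON_ONCE(!data))
		/* Shouldn't happen: we're eligible and the alloc succeeded. */
		return -ENOMEM;

	data->hgm_enabled = true;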

>
> --
> Mike Kravetz
>
> > +     data->hgm_enabled = true;
> > +
> > +     /* We don't support PMD sharing with HGM. */
> > +     hugetlb_unshare_all_pmds(vma);
> > +     return 0;
> > +}
> > +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> > +
> >  /*
> >   * These functions are overwritable if your architecture needs its own
> >   * behavior.
> > --
> > 2.38.0.135.g90850a2211-goog
> >

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2022-12-09 16:02     ` James Houghton
@ 2022-12-13 18:44       ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-13 18:44 UTC (permalink / raw)
  To: James Houghton
  Cc: Mina Almasry, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 12/09/22 11:02, James Houghton wrote:
> On Wed, Dec 7, 2022 at 7:46 PM Mina Almasry <almasrymina@google.com> wrote:
> > On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
> >
> > I also don't know if this is obvious to other readers, but I'm quite
> > confused that we pass both hugetlb_pte and pte_t here, especially when
> > hpte has a pte_t inside of it. Maybe a comment would help.
> 
> It's possible for the value of the pte to change if we haven't locked
> the PTL; we only store a pte_t* in hugetlb_pte, not the value itself.

I had comments similar to Mina and Peter on other parts of this patch.  Calling
this without some type of locking is 'interesting'.  I have not yet looked at
callers (without locking), but I assume such callers can handle stale results.

> Thinking about this... we *do* store `shift` which technically depends
> on the value of the PTE. If the PTE is pte_none, the true `shift` of
> the PTE is ambiguous, and so we just provide what the user asked for.
> That could lead to a scenario where UFFDIO_CONTINUE(some 4K page) then
> UFFDIO_CONTINUE(CONT_PTE_SIZE range around that page) can both succeed
> because we merely check if the *first* PTE in the contiguous bunch is
> none/has changed.

Right, Yuck!

> 
> So, in the case of a contiguous PTE where we *think* we're overwriting
> a bunch of none PTEs, we need to check that each PTE we're overwriting
> is still none while holding the PTL. That means that the PTL we use
> for cont PTEs and non-cont PTEs of the same level must be the same.
> 
> So for the next version, I'll:
> - add some requirement that contiguous and non-contiguous PTEs on the
> same level must use the same PTL
> - think up some kind of API like all_contig_ptes_none(), but it only
> really applies for arm64, so I think actually putting it in can wait.
> I'll at least put a comment in hugetlb_mcopy_atomic_pte and
> hugetlb_no_page (near the final huge_pte_none() and pte_same()
> checks).
> 
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc
  2022-10-21 16:36 ` [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc James Houghton
@ 2022-12-13 19:32   ` Mike Kravetz
  2022-12-13 20:18     ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Mike Kravetz @ 2022-12-13 19:32 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> These functions are used to allocate new PTEs below the hstate PTE. This
> will be used by hugetlb_walk_step, which implements stepping forwards in
> a HugeTLB high-granularity page table walk.
> 
> The reasons that we don't use the standard pmd_alloc/pte_alloc*
> functions are:
>  1) This prevents us from accidentally overwriting swap entries or
>     attempting to use swap entries as present non-leaf PTEs (see
>     pmd_alloc(); we assume that !pte_none means pte_present and
>     non-leaf).
>  2) Locking hugetlb PTEs can be different than regular PTEs. (Although, as
>     implemented right now, locking is the same.)
>  3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
>     HGM won't use HIGHPTE, but the kernel can still be built with it,
>     and other mm code will use it.
> 
> When GENERAL_HUGETLB supports P4D-based hugepages, we will need to
> implement hugetlb_pud_alloc to implement hugetlb_walk_step.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h |  5 +++
>  mm/hugetlb.c            | 94 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 99 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d30322108b34..003255b0e40f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -119,6 +119,11 @@ void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
>  
>  bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
>  
> +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr);
> +pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr);
> +
>  struct hugepage_subpool {
>  	spinlock_t lock;
>  	long count;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a0e46d35dabc..e3733388adee 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -341,6 +341,100 @@ static bool has_same_uncharge_info(struct file_region *rg,
>  #endif
>  }
>  
> +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr)

A little confused as there are no users yet ... Is hpte the 'hstate PTE'
that we are trying to allocate ptes under?  For example, in the case of
a hugetlb_pmd_alloc caller hpte would be a PUD or CONT_PMD size pte?

> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +	pmd_t *new;
> +	pud_t *pudp;
> +	pud_t pud;
> +
> +	if (hpte->level != HUGETLB_LEVEL_PUD)
> +		return ERR_PTR(-EINVAL);

Ah yes, it is PUD level.  However, I guess CONT_PMD would also be valid
on arm64?

> +
> +	pudp = (pud_t *)hpte->ptep;
> +retry:
> +	pud = *pudp;

We might want to consider a READ_ONCE here.  I am not an expert on such
things, but recall a similar issue being pointed out in the now obsolete commit
27ceae9833843.

-- 
Mike Kravetz

> +	if (likely(pud_present(pud)))
> +		return unlikely(pud_leaf(pud))
> +			? ERR_PTR(-EEXIST)
> +			: pmd_offset(pudp, addr);
> +	else if (!huge_pte_none(huge_ptep_get(hpte->ptep)))
> +		/*
> +		 * Not present and not none means that a swap entry lives here,
> +		 * and we can't get rid of it.
> +		 */
> +		return ERR_PTR(-EEXIST);
> +
> +	new = pmd_alloc_one(mm, addr);
> +	if (!new)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock(ptl);
> +	if (!pud_same(pud, *pudp)) {
> +		spin_unlock(ptl);
> +		pmd_free(mm, new);
> +		goto retry;
> +	}
> +
> +	mm_inc_nr_pmds(mm);
> +	smp_wmb(); /* See comment in pmd_install() */
> +	pud_populate(mm, pudp, new);
> +	spin_unlock(ptl);
> +	return pmd_offset(pudp, addr);
> +}
> +
> +pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr)
> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +	pgtable_t new;
> +	pmd_t *pmdp;
> +	pmd_t pmd;
> +
> +	if (hpte->level != HUGETLB_LEVEL_PMD)
> +		return ERR_PTR(-EINVAL);
> +
> +	pmdp = (pmd_t *)hpte->ptep;
> +retry:
> +	pmd = *pmdp;
> +	if (likely(pmd_present(pmd)))
> +		return unlikely(pmd_leaf(pmd))
> +			? ERR_PTR(-EEXIST)
> +			: pte_offset_kernel(pmdp, addr);
> +	else if (!huge_pte_none(huge_ptep_get(hpte->ptep)))
> +		/*
> +		 * Not present and not none means that a swap entry lives here,
> +		 * and we can't get rid of it.
> +		 */
> +		return ERR_PTR(-EEXIST);
> +
> +	/*
> +	 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
> +	 * in page tables being allocated in high memory, needing a kmap to
> +	 * access. Instead, we call __pte_alloc_one directly with
> +	 * GFP_PGTABLE_USER to prevent these PTEs being allocated in high
> +	 * memory.
> +	 */
> +	new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
> +	if (!new)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock(ptl);
> +	if (!pmd_same(pmd, *pmdp)) {
> +		spin_unlock(ptl);
> +		pgtable_pte_page_dtor(new);
> +		__free_page(new);
> +		goto retry;
> +	}
> +
> +	mm_inc_nr_ptes(mm);
> +	smp_wmb(); /* See comment in pmd_install() */
> +	pmd_populate(mm, pmdp, new);
> +	spin_unlock(ptl);
> +	return pte_offset_kernel(pmdp, addr);
> +}
> +
>  static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
>  {
>  	struct file_region *nrg, *prg;
> -- 
> 2.38.0.135.g90850a2211-goog
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc
  2022-12-13 19:32   ` Mike Kravetz
@ 2022-12-13 20:18     ` James Houghton
  2022-12-14  0:04       ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-13 20:18 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Tue, Dec 13, 2022 at 2:32 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 10/21/22 16:36, James Houghton wrote:
> > These functions are used to allocate new PTEs below the hstate PTE. This
> > will be used by hugetlb_walk_step, which implements stepping forwards in
> > a HugeTLB high-granularity page table walk.
> >
> > The reasons that we don't use the standard pmd_alloc/pte_alloc*
> > functions are:
> >  1) This prevents us from accidentally overwriting swap entries or
> >     attempting to use swap entries as present non-leaf PTEs (see
> >     pmd_alloc(); we assume that !pte_none means pte_present and
> >     non-leaf).
> >  2) Locking hugetlb PTEs can be different than regular PTEs. (Although, as
> >     implemented right now, locking is the same.)
> >  3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
> >     HGM won't use HIGHPTE, but the kernel can still be built with it,
> >     and other mm code will use it.
> >
> > When GENERAL_HUGETLB supports P4D-based hugepages, we will need to
> > implement hugetlb_pud_alloc to implement hugetlb_walk_step.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h |  5 +++
> >  mm/hugetlb.c            | 94 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 99 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index d30322108b34..003255b0e40f 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -119,6 +119,11 @@ void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> >
> >  bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> >
> > +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +             unsigned long addr);
> > +pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +             unsigned long addr);
> > +
> >  struct hugepage_subpool {
> >       spinlock_t lock;
> >       long count;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a0e46d35dabc..e3733388adee 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -341,6 +341,100 @@ static bool has_same_uncharge_info(struct file_region *rg,
> >  #endif
> >  }
> >
> > +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +             unsigned long addr)
>
> A little confused as there are no users yet ... Is hpte the 'hstate PTE'
> that we are trying to allocate ptes under?  For example, in the case of
> a hugetlb_pmd_alloc caller hpte would be a PUD or CONT_PMD size pte?

The hpte is the level above the level we're trying to allocate (not
necessarily the 'hstate PTE'). I'll make that clear in the comments
for both functions.

So consider allocating 4K PTEs for a 1G HugeTLB page:
- With the hstate 'PTE' (PUD), we make a hugetlb_pte with that PUD
(let's call it 'hpte')
- We call hugetlb_pmd_alloc(hpte) which will leave 'hpte' the same,
but the pud_t that hpte->ptep points to is no longer a leaf.
- We call hugetlb_walk_step(hpte) to step down a level to get a PMD,
changing hpte. The hpte->ptep is now pointing to a blank pmd_t.
- We call hugetlb_pte_alloc(hpte) to allocate a bunch of PTEs and
populate the pmd_t.
- We call hugetlb_walk_step(hpte) to step down again.

This is basically what hugetlb_hgm_walk does (in the next patch). We
only change 'hpte' when we do a step, and that is when we populate
'shift'. The 'sz' parameter for hugetlb_walk_step is what
architectures can use to populate hpte->shift appropriately (ignored
for x86).

For arm64, we can use 'sz' to populate hpte->shift with what the
caller wants when we are free to choose (like if all the PTEs are
none, we can do CONT_PTE_SHIFT). See [1]'s implementation of
hugetlb_walk_step for what I *think* is correct for arm64.

[1] https://github.com/48ca/linux/commit/bf3b8742e95c58c2431c80c5bed5cb5cb95885af

>
> > +{
> > +     spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > +     pmd_t *new;
> > +     pud_t *pudp;
> > +     pud_t pud;
> > +
> > +     if (hpte->level != HUGETLB_LEVEL_PUD)
> > +             return ERR_PTR(-EINVAL);
>
> Ah yes, it is PUD level.  However, I guess CONT_PMD would also be valid
> on arm64?

The level is always PGD, P4D, PUD, PMD, or PTE. CONT_PTE is on
HUGETLB_LEVEL_PTE, CONT_PMD is on HUGETLB_LEVEL_PMD.

These functions are supposed to be used by all architectures (in
their implementations of 'hugetlb_walk_step'); that's why they're not
static, actually. I'll make that clear in the commit description.

>
> > +
> > +     pudp = (pud_t *)hpte->ptep;
> > +retry:
> > +     pud = *pudp;
>
> We might want to consider a READ_ONCE here.  I am not an expert on such
> > things, but recall a similar issue being pointed out in the now obsolete commit
> 27ceae9833843.

Agreed. Will try to change all PTE reads to use READ_ONCE, though
they can be easy to miss... :(
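
For the snippet above, that would just be (sketch):

	pudp = (pud_t *)hpte->ptep;
retry:
	pud = READ_ONCE(*pudp);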

Thanks very much for the reviews so far, Mike!

- James

>
> --
> Mike Kravetz
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc
  2022-12-13 20:18     ` James Houghton
@ 2022-12-14  0:04       ` James Houghton
  0 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-12-14  0:04 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Tue, Dec 13, 2022 at 3:18 PM James Houghton <jthoughton@google.com> wrote:
>
> On Tue, Dec 13, 2022 at 2:32 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 10/21/22 16:36, James Houghton wrote:
> > > These functions are used to allocate new PTEs below the hstate PTE. This
> > > will be used by hugetlb_walk_step, which implements stepping forwards in
> > > a HugeTLB high-granularity page table walk.
> > >
> > > The reasons that we don't use the standard pmd_alloc/pte_alloc*
> > > functions are:
> > >  1) This prevents us from accidentally overwriting swap entries or
> > >     attempting to use swap entries as present non-leaf PTEs (see
> > >     pmd_alloc(); we assume that !pte_none means pte_present and
> > >     non-leaf).
> > >  2) Locking hugetlb PTEs can be different than regular PTEs. (Although, as
> > >     implemented right now, locking is the same.)
> > >  3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
> > >     HGM won't use HIGHPTE, but the kernel can still be built with it,
> > >     and other mm code will use it.
> > >
> > > When GENERAL_HUGETLB supports P4D-based hugepages, we will need to
> > > implement hugetlb_pud_alloc to implement hugetlb_walk_step.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >  include/linux/hugetlb.h |  5 +++
> > >  mm/hugetlb.c            | 94 +++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 99 insertions(+)
> > >
> > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > index d30322108b34..003255b0e40f 100644
> > > --- a/include/linux/hugetlb.h
> > > +++ b/include/linux/hugetlb.h
> > > @@ -119,6 +119,11 @@ void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > >
> > >  bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > >
> > > +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > > +             unsigned long addr);
> > > +pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > > +             unsigned long addr);
> > > +
> > >  struct hugepage_subpool {
> > >       spinlock_t lock;
> > >       long count;
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index a0e46d35dabc..e3733388adee 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -341,6 +341,100 @@ static bool has_same_uncharge_info(struct file_region *rg,
> > >  #endif
> > >  }
> > >
> > > +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > > +             unsigned long addr)
> >
> > A little confused as there are no users yet ... Is hpte the 'hstate PTE'
> > that we are trying to allocate ptes under?  For example, in the case of
> > a hugetlb_pmd_alloc caller hpte would be a PUD or CONT_PMD size pte?
>
> The hpte is the level above the level we're trying to allocate (not
> necessarily the 'hstate PTE'). I'll make that clear in the comments
> for both functions.
>
> So consider allocating 4K PTEs for a 1G HugeTLB page:
> - With the hstate 'PTE' (PUD), we make a hugetlb_pte with that PUD
> (let's call it 'hpte')
> - We call hugetlb_pmd_alloc(hpte) which will leave 'hpte' the same,
> but the pud_t that hpte->ptep points to is no longer a leaf.
> - We call hugetlb_walk_step(hpte) to step down a level to get a PMD,
> changing hpte. The hpte->ptep is now pointing to a blank pmd_t.
> - We call hugetlb_pte_alloc(hpte) to allocate a bunch of PTEs and
> populate the pmd_t.
> - We call hugetlb_walk_step(hpte) to step down again.

Erm actually this isn't entirely accurate. The general flow is about
right, but hugetlb_pmd_alloc/hugetlb_pte_alloc are actually part of
hugetlb_walk_step. (See hugetlb_hgm_walk for the ground truth :P)
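
To make the corrected flow concrete, the caller side ends up looking
roughly like this (a sketch pieced together from the patches in this
series; error handling omitted, and the VMA lock is held for HGM VMAs):

	struct hugetlb_pte hpte;
	pte_t *ptep;
	int ret;

	/* Get/allocate the hstate-level PTE (a PUD for a 1G page). */
	ptep = huge_pte_alloc(mm, vma, addr, huge_page_size(h));

	/* Start the walk at that hstate-level PTE. */
	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
			     hpage_size_to_level(huge_page_size(h)));

	/*
	 * Walk (and allocate) down to 4K. Each hugetlb_walk_step() call
	 * inside uses hugetlb_pmd_alloc()/hugetlb_pte_alloc() as needed.
	 */
	ret = hugetlb_hgm_walk(mm, vma, &hpte, addr, PAGE_SIZE,
			       /*stop_at_none=*/false);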

- James

>
> This is basically what hugetlb_hgm_walk does (in the next patch). We
> only change 'hpte' when we do a step, and that is when we populate
> 'shift'. The 'sz' parameter for hugetlb_walk_step is what
> architectures can use to populate hpte->shift appropriately (ignored
> for x86).
>
> For arm64, we can use 'sz' to populate hpte->shift with what the
> caller wants when we are free to choose (like if all the PTEs are
> none, we can do CONT_PTE_SHIFT). See [1]'s implementation of
> hugetlb_walk_step for what I *think* is correct for arm64.
>
> [1] https://github.com/48ca/linux/commit/bf3b8742e95c58c2431c80c5bed5cb5cb95885af
>
> >
> > > +{
> > > +     spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > > +     pmd_t *new;
> > > +     pud_t *pudp;
> > > +     pud_t pud;
> > > +
> > > +     if (hpte->level != HUGETLB_LEVEL_PUD)
> > > +             return ERR_PTR(-EINVAL);
> >
> > Ah yes, it is PUD level.  However, I guess CONT_PMD would also be valid
> > on arm64?
>
> The level is always PGD, P4D, PUD, PMD, or PTE. CONT_PTE is on
> HUGETLB_LEVEL_PTE, CONT_PMD is on HUGETLB_LEVEL_PMD.
>
> These functions are supposed to be used for all architectures (in
> their implementations of 'hugetlb_walk_step'; that's why they're not
> static, actually. I'll make that clear in the commit description).
>
> >
> > > +
> > > +     pudp = (pud_t *)hpte->ptep;
> > > +retry:
> > > +     pud = *pudp;
> >
> > We might want to consider a READ_ONCE here.  I am not an expert on such
> > things, but recall a similar issue being pointed out in the now obsolete commit
> > 27ceae9833843.
>
> Agreed. Will try to change all PTE reading to use READ_ONCE, though
> they can be easy to miss... :(
>
> Thanks very much for the reviews so far, Mike!
>
> - James
>
> >
> > --
> > Mike Kravetz
> >

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2022-10-21 16:36 ` [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
  2022-11-16 22:02   ` Peter Xu
@ 2022-12-14  0:47   ` Mike Kravetz
  2023-01-05  0:57   ` Jane Chu
  2 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-14  0:47 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> hugetlb_hgm_walk implements high-granularity page table walks for
> HugeTLB. It is safe to call on non-HGM enabled VMAs; it will return
> immediately.
> 
> hugetlb_walk_step implements how we step forwards in the walk. For
> architectures that don't use GENERAL_HUGETLB, they will need to provide
> their own implementation.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h |  13 +++++
>  mm/hugetlb.c            | 125 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 138 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 003255b0e40f..4b1548adecde 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -276,6 +276,10 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
>  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long addr, pud_t *pud);
>  
> +int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
> +		     struct hugetlb_pte *hpte, unsigned long addr,
> +		     unsigned long sz, bool stop_at_none);
> +
>  struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
>  
>  extern int sysctl_hugetlb_shm_group;
> @@ -288,6 +292,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz);
>  unsigned long hugetlb_mask_last_page(struct hstate *h);
> +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		      unsigned long addr, unsigned long sz);
>  int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> @@ -1066,6 +1072,8 @@ void hugetlb_register_node(struct node *node);
>  void hugetlb_unregister_node(struct node *node);
>  #endif
>  
> +enum hugetlb_level hpage_size_to_level(unsigned long sz);
> +
>  #else	/* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>  
> @@ -1253,6 +1261,11 @@ static inline void hugetlb_register_node(struct node *node)
>  static inline void hugetlb_unregister_node(struct node *node)
>  {
>  }
> +
> +static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
> +{
> +	return HUGETLB_LEVEL_PTE;
> +}
>  #endif	/* CONFIG_HUGETLB_PAGE */
>  
>  #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e3733388adee..90db59632559 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -95,6 +95,29 @@ static void hugetlb_vma_data_free(struct vm_area_struct *vma);
>  static int hugetlb_vma_data_alloc(struct vm_area_struct *vma);
>  static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
>  
> +/*
> + * hpage_size_to_level() - convert @sz to the corresponding page table level
> + *
> + * @sz must be less than or equal to a valid hugepage size.
> + */
> +enum hugetlb_level hpage_size_to_level(unsigned long sz)
> +{
> +	/*
> +	 * We order the conditionals from smallest to largest to pick the
> +	 * smallest level when multiple levels have the same size (i.e.,
> +	 * when levels are folded).
> +	 */
> +	if (sz < PMD_SIZE)
> +		return HUGETLB_LEVEL_PTE;
> +	if (sz < PUD_SIZE)
> +		return HUGETLB_LEVEL_PMD;
> +	if (sz < P4D_SIZE)
> +		return HUGETLB_LEVEL_PUD;
> +	if (sz < PGDIR_SIZE)
> +		return HUGETLB_LEVEL_P4D;
> +	return HUGETLB_LEVEL_PGD;
> +}
> +
>  static inline bool subpool_is_free(struct hugepage_subpool *spool)
>  {
>  	if (spool->count)
> @@ -7321,6 +7344,70 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
>  }
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
> +/* hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
> + * the page table entry for @addr.
> + *
> + * @hpte must always be pointing at an hstate-level PTE (or deeper).
> + *
> + * This function will never walk further if it encounters a PTE of a size
> + * less than or equal to @sz.
> + *
> + * @stop_at_none determines what we do when we encounter an empty PTE. If true,
> + * we return that PTE. If false and @sz is less than the current PTE's size,
> + * we make that PTE point to the next level down, going until @sz is the same
> + * as our current PTE.

I was a bit confused about 'we return that PTE' when the function is of type
int.  TBH, I am not a fan of the current scheme of passing in *hpte and having
the hpte modified by the function.

> + *
> + * If @stop_at_none is true and @sz is PAGE_SIZE, this function will always
> + * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz.
> + *
> + * Return:
> + *	-ENOMEM if we couldn't allocate new PTEs.
> + *	-EEXIST if the caller wanted to walk further than a migration PTE,
> + *		poison PTE, or a PTE marker. The caller needs to manually deal
> + *		with this scenario.
> + *	-EINVAL if called with invalid arguments (@sz invalid, @hpte not
> + *		initialized).
> + *	0 otherwise.
> + *
> + *	Even if this function fails, @hpte is guaranteed to always remain
> + *	valid.
> + */
> +int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
> +		     struct hugetlb_pte *hpte, unsigned long addr,
> +		     unsigned long sz, bool stop_at_none)

Since we are potentially populating lower level page tables, we may want a
different function name.  It may just be me, but I think of walk as a read
only operation.  I would suggest putting populate in the name, but as Peter
pointed out elsewhere, that has other implications.  Sorry, I can not think
of something better right now.

-- 
Mike Kravetz

> +{
> +	int ret = 0;
> +	pte_t pte;
> +
> +	if (WARN_ON_ONCE(sz < PAGE_SIZE))
> +		return -EINVAL;
> +
> +	if (!hugetlb_hgm_enabled(vma)) {
> +		if (stop_at_none)
> +			return 0;
> +		return sz == huge_page_size(hstate_vma(vma)) ? 0 : -EINVAL;
> +	}
> +
> +	hugetlb_vma_assert_locked(vma);
> +
> +	if (WARN_ON_ONCE(!hpte->ptep))
> +		return -EINVAL;
> +
> +	while (hugetlb_pte_size(hpte) > sz && !ret) {
> +		pte = huge_ptep_get(hpte->ptep);
> +		if (!pte_present(pte)) {
> +			if (stop_at_none)
> +				return 0;
> +			if (unlikely(!huge_pte_none(pte)))
> +				return -EEXIST;
> +		} else if (hugetlb_pte_present_leaf(hpte, pte))
> +			return 0;
> +		ret = hugetlb_walk_step(mm, hpte, addr, sz);
> +	}
> +
> +	return ret;
> +}
> +
>  #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
>  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long addr, unsigned long sz)
> @@ -7388,6 +7475,44 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  	return (pte_t *)pmd;
>  }
>  
> +/*
> + * hugetlb_walk_step() - Walk the page table one step to resolve the page
> + * (hugepage or subpage) entry at address @addr.
> + *
> + * @sz always points at the final target PTE size (e.g. PAGE_SIZE for the
> + * lowest level PTE).
> + *
> + * @hpte will always remain valid, even if this function fails.
> + */
> +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		      unsigned long addr, unsigned long sz)
> +{
> +	pte_t *ptep;
> +	spinlock_t *ptl;
> +
> +	switch (hpte->level) {
> +	case HUGETLB_LEVEL_PUD:
> +		ptep = (pte_t *)hugetlb_pmd_alloc(mm, hpte, addr);
> +		if (IS_ERR(ptep))
> +			return PTR_ERR(ptep);
> +		hugetlb_pte_populate(hpte, ptep, PMD_SHIFT, HUGETLB_LEVEL_PMD);
> +		break;
> +	case HUGETLB_LEVEL_PMD:
> +		ptep = hugetlb_pte_alloc(mm, hpte, addr);
> +		if (IS_ERR(ptep))
> +			return PTR_ERR(ptep);
> +		ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
> +		hugetlb_pte_populate(hpte, ptep, PAGE_SHIFT, HUGETLB_LEVEL_PTE);
> +		hpte->ptl = ptl;
> +		break;
> +	default:
> +		WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n",
> +				__func__, hpte->level, hpte->shift);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * Return a mask that can be used to update an address to the last huge
>   * page in a page table page mapping size.  Used to skip non-present
> -- 
> 2.38.0.135.g90850a2211-goog
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 13/47] hugetlb: add make_huge_pte_with_shift
  2022-10-21 16:36 ` [RFC PATCH v2 13/47] hugetlb: add make_huge_pte_with_shift James Houghton
@ 2022-12-14  1:08   ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-14  1:08 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> This allows us to make huge PTEs at shifts other than the hstate shift,
> which will be necessary for high-granularity mappings.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)

Straightforward,

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 14/47] hugetlb: make default arch_make_huge_pte understand small mappings
  2022-10-21 16:36 ` [RFC PATCH v2 14/47] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
@ 2022-12-14 22:17   ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-14 22:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> This is a simple change: don't create a "huge" PTE if we are making a
> regular, PAGE_SIZE PTE. All architectures that want to implement HGM
> likely need to be changed in a similar way if they implement their own
> version of arch_make_huge_pte.

Nothing wrong with this patch.

However, I wish there were some way we could flag this requirement in
arch-specific code.  It just seems like something that would be easy to
overlook.
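
For example, an arch-specific implementation would need the same guard as
the generic version in the hunk below; a purely illustrative, untested
sketch:

	pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
				 vm_flags_t flags)
	{
		/* HGM can ask for a PAGE_SIZE PTE; don't mark it huge. */
		if (shift <= PAGE_SHIFT)
			return entry;
		/* ... existing arch-specific huge PTE setup ... */
		return pte_mkhuge(entry);
	}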

-- 
Mike Kravetz

> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 4b1548adecde..d305742e9d44 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -907,7 +907,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
>  static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
>  				       vm_flags_t flags)
>  {
> -	return pte_mkhuge(entry);
> +	return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
>  }
>  #endif
>  
> -- 
> 2.38.0.135.g90850a2211-goog
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 15/47] hugetlbfs: for unmapping, treat HGM-mapped pages as potentially mapped
  2022-10-21 16:36 ` [RFC PATCH v2 15/47] hugetlbfs: for unmapping, treat HGM-mapped pages as potentially mapped James Houghton
@ 2022-12-14 23:37   ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-14 23:37 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> hugetlb_vma_maps_page was mostly used as an optimization: if the VMA
> isn't mapping a page, then we don't have to attempt to unmap it again.
> We are still able to call the unmap routine if we need to.
> 
> For high-granularity mapped pages, we can't easily do a full walk to see
> if the page is actually mapped or not, so simply return that it might
> be.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  fs/hugetlbfs/inode.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 7f836f8f9db1..a7ab62e39b8c 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -383,21 +383,34 @@ static void hugetlb_delete_from_page_cache(struct folio *folio)
>   * mutex for the page in the mapping.  So, we can not race with page being
>   * faulted into the vma.
>   */
> -static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> -				unsigned long addr, struct page *page)
> +static bool hugetlb_vma_maybe_maps_page(struct vm_area_struct *vma,
> +					unsigned long addr, struct page *page)
>  {
>  	pte_t *ptep, pte;
> +	struct hugetlb_pte hpte;
> +	struct hstate *h = hstate_vma(vma);
>  
> -	ptep = huge_pte_offset(vma->vm_mm, addr,
> -			huge_page_size(hstate_vma(vma)));
> +	ptep = huge_pte_offset(vma->vm_mm, addr, huge_page_size(h));
>  
>  	if (!ptep)
>  		return false;
>  
> +	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
> +			hpage_size_to_level(huge_page_size(h)));
> +

Only a nit: the hugetlb_pte_populate call should probably come after the
check below.
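
Something like this (untested, based on the quoted hunk), so the
hugetlb_pte is only built once we know the PTE is worth examining:

	ptep = huge_pte_offset(vma->vm_mm, addr, huge_page_size(h));
	if (!ptep)
		return false;

	pte = huge_ptep_get(ptep);
	if (huge_pte_none(pte) || !pte_present(pte))
		return false;

	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
			hpage_size_to_level(huge_page_size(h)));
	if (!hugetlb_pte_present_leaf(&hpte, pte))
		/* Might be mapped at high granularity below us. */
		return true;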

This makes sense, and as you mention, hugetlb_vma_maybe_maps_page is mostly an
optimization that will still work as designed for non-HGM vmas.

-- 
Mike Kravetz

>  	pte = huge_ptep_get(ptep);
>  	if (huge_pte_none(pte) || !pte_present(pte))
>  		return false;
>  
> +	if (!hugetlb_pte_present_leaf(&hpte, pte))
> +		/*
> +		 * The top-level PTE is not a leaf, so it's possible that a PTE
> +		 * under us is mapping the page. We aren't holding the VMA
> +		 * lock, so it is unsafe to continue the walk further. Instead,
> +		 * return true to indicate that we might be mapping the page.
> +		 */
> +		return true;
> +
>  	if (pte_page(pte) == page)
>  		return true;
>  
> @@ -457,7 +470,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>  		v_start = vma_offset_start(vma, start);
>  		v_end = vma_offset_end(vma, end);
>  
> -		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> +		if (!hugetlb_vma_maybe_maps_page(vma, vma->vm_start + v_start,
> +					page))
>  			continue;
>  
>  		if (!hugetlb_vma_trylock_write(vma)) {
> @@ -507,7 +521,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>  		 */
>  		v_start = vma_offset_start(vma, start);
>  		v_end = vma_offset_end(vma, end);
> -		if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> +		if (hugetlb_vma_maybe_maps_page(vma, vma->vm_start + v_start,
> +					page))
>  			unmap_hugepage_range(vma, vma->vm_start + v_start,
>  						v_end, NULL,
>  						ZAP_FLAG_DROP_MARKER);
> -- 
> 2.38.0.135.g90850a2211-goog
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 16/47] hugetlb: make unmapping compatible with high-granularity mappings
  2022-10-21 16:36 ` [RFC PATCH v2 16/47] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
@ 2022-12-15  0:28   ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-15  0:28 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> Enlighten __unmap_hugepage_range to deal with high-granularity mappings.
> This doesn't change its API; it still must be called with hugepage
> alignment, but it will correctly unmap hugepages that have been mapped
> at high granularity.
> 
> The rules for mapcount and refcount here are:
>  1. Refcount and mapcount are tracked on the head page.
>  2. Each page table mapping into some part of an hpage will increase that
>     hpage's mapcount and refcount by 1.
> 
> Eventually, functionality here can be expanded to allow users to call
> MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is
> not done here.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/asm-generic/tlb.h |  6 ++--
>  mm/hugetlb.c              | 76 +++++++++++++++++++++++++--------------
>  2 files changed, 52 insertions(+), 30 deletions(-)

All looks reasonable, nothing stands out.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 24/47] hugetlb: update page_vma_mapped to do high-granularity walks
  2022-10-21 16:36 ` [RFC PATCH v2 24/47] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
@ 2022-12-15 17:49   ` James Houghton
  2022-12-15 18:45     ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-15 17:49 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 12:37 PM James Houghton <jthoughton@google.com> wrote:
>
> This updates the HugeTLB logic to look a lot more like the PTE-mapped
> THP logic. When a user calls us in a loop, we will update pvmw->address
> to walk to each page table entry that could possibly map the hugepage
> containing pvmw->pfn.
>
> This makes use of the new pte_order so callers know what size PTE
> they're getting.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/rmap.h |  4 +++
>  mm/page_vma_mapped.c | 59 ++++++++++++++++++++++++++++++++++++--------
>  mm/rmap.c            | 48 +++++++++++++++++++++--------------
>  3 files changed, 83 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index e0557ede2951..d7d2d9f65a01 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -13,6 +13,7 @@
>  #include <linux/highmem.h>
>  #include <linux/pagemap.h>
>  #include <linux/memremap.h>
> +#include <linux/hugetlb.h>
>
>  /*
>   * The anon_vma heads a list of private "related" vmas, to scan if
> @@ -409,6 +410,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>                 pte_unmap(pvmw->pte);
>         if (pvmw->ptl)
>                 spin_unlock(pvmw->ptl);
> +       if (pvmw->pte && is_vm_hugetlb_page(pvmw->vma) &&
> +                       hugetlb_hgm_enabled(pvmw->vma))
> +               hugetlb_vma_unlock_read(pvmw->vma);
>  }
>
>  bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 395ca4e21c56..1994b3f9a4c2 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -133,7 +133,8 @@ static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>   *
>   * Returns true if the page is mapped in the vma. @pvmw->pmd and @pvmw->pte point
>   * to relevant page table entries. @pvmw->ptl is locked. @pvmw->address is
> - * adjusted if needed (for PTE-mapped THPs).
> + * adjusted if needed (for PTE-mapped THPs and high-granularity--mapped HugeTLB
> + * pages).
>   *
>   * If @pvmw->pmd is set but @pvmw->pte is not, you have found PMD-mapped page
>   * (usually THP). For PTE-mapped THP, you should run page_vma_mapped_walk() in
> @@ -166,19 +167,57 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>         if (unlikely(is_vm_hugetlb_page(vma))) {
>                 struct hstate *hstate = hstate_vma(vma);
>                 unsigned long size = huge_page_size(hstate);
> -               /* The only possible mapping was handled on last iteration */
> -               if (pvmw->pte)
> -                       return not_found(pvmw);
> +               struct hugetlb_pte hpte;
> +               pte_t *pte;
> +               pte_t pteval;
> +
> +               end = (pvmw->address & huge_page_mask(hstate)) +
> +                       huge_page_size(hstate);
>
>                 /* when pud is not present, pte will be NULL */
> -               pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
> -               if (!pvmw->pte)
> +               pte = huge_pte_offset(mm, pvmw->address, size);
> +               if (!pte)
>                         return false;
>
> -               pvmw->pte_order = huge_page_order(hstate);
> -               pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
> -               if (!check_pte(pvmw))
> -                       return not_found(pvmw);
> +               do {
> +                       hugetlb_pte_populate(&hpte, pte, huge_page_shift(hstate),
> +                                       hpage_size_to_level(size));
> +
> +                       /*
> +                        * Do a high granularity page table walk. The vma lock
> +                        * is grabbed to prevent the page table from being
> +                        * collapsed mid-walk. It is dropped in
> +                        * page_vma_mapped_walk_done().
> +                        */
> +                       if (pvmw->pte) {
> +                               if (pvmw->ptl)
> +                                       spin_unlock(pvmw->ptl);
> +                               pvmw->ptl = NULL;
> +                               pvmw->address += PAGE_SIZE << pvmw->pte_order;
> +                               if (pvmw->address >= end)
> +                                       return not_found(pvmw);
> +                       } else if (hugetlb_hgm_enabled(vma))
> +                               /* Only grab the lock once. */
> +                               hugetlb_vma_lock_read(vma);

I realize that I can't do this -- we're already holding the
i_mmap_rwsem, and we have to take the VMA lock first. It seems like
we're always holding it for writing in this case, so if I make
hugetlb_collapse take the i_mmap_rwsem for reading, this will be
safe.

Peter, you looked at this recently [1] -- do you know if we're always
holding i_mmap_rwsem *for writing* here?

[1] https://lore.kernel.org/linux-mm/20221209170100.973970-10-peterx@redhat.com/

Thanks!

- James

> +
> +retry_walk:
> +                       hugetlb_hgm_walk(mm, vma, &hpte, pvmw->address,
> +                                       PAGE_SIZE, /*stop_at_none=*/true);
> +
> +                       pvmw->pte = hpte.ptep;
> +                       pvmw->pte_order = hpte.shift - PAGE_SHIFT;
> +                       pvmw->ptl = hugetlb_pte_lock(mm, &hpte);
> +                       pteval = huge_ptep_get(hpte.ptep);
> +                       if (pte_present(pteval) && !hugetlb_pte_present_leaf(
> +                                               &hpte, pteval)) {
> +                               /*
> +                                * Someone split from under us, so keep
> +                                * walking.
> +                                */
> +                               spin_unlock(pvmw->ptl);
> +                               goto retry_walk;
> +                       }
> +               } while (!check_pte(pvmw));
>                 return true;
>         }
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 527463c1e936..a8359584467e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1552,17 +1552,23 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                         flush_cache_range(vma, range.start, range.end);
>
>                         /*
> -                        * To call huge_pmd_unshare, i_mmap_rwsem must be
> -                        * held in write mode.  Caller needs to explicitly
> -                        * do this outside rmap routines.
> -                        *
> -                        * We also must hold hugetlb vma_lock in write mode.
> -                        * Lock order dictates acquiring vma_lock BEFORE
> -                        * i_mmap_rwsem.  We can only try lock here and fail
> -                        * if unsuccessful.
> +                        * If HGM is enabled, we have already grabbed the VMA
> +                        * lock for reading, and we cannot safely release it.
> +                        * Because HGM-enabled VMAs have already unshared all
> +                        * PMDs, we can safely ignore PMD unsharing here.
>                          */
> -                       if (!anon) {
> +                       if (!anon && !hugetlb_hgm_enabled(vma)) {
>                                 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +                               /*
> +                                * To call huge_pmd_unshare, i_mmap_rwsem must
> +                                * be held in write mode.  Caller needs to
> +                                * explicitly do this outside rmap routines.
> +                                *
> +                                * We also must hold hugetlb vma_lock in write
> +                                * mode. Lock order dictates acquiring vma_lock
> +                                * BEFORE i_mmap_rwsem.  We can only try lock
> +                                * here and fail if unsuccessful.
> +                                */
>                                 if (!hugetlb_vma_trylock_write(vma)) {
>                                         page_vma_mapped_walk_done(&pvmw);
>                                         ret = false;
> @@ -1946,17 +1952,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>                         flush_cache_range(vma, range.start, range.end);
>
>                         /*
> -                        * To call huge_pmd_unshare, i_mmap_rwsem must be
> -                        * held in write mode.  Caller needs to explicitly
> -                        * do this outside rmap routines.
> -                        *
> -                        * We also must hold hugetlb vma_lock in write mode.
> -                        * Lock order dictates acquiring vma_lock BEFORE
> -                        * i_mmap_rwsem.  We can only try lock here and
> -                        * fail if unsuccessful.
> +                        * If HGM is enabled, we have already grabbed the VMA
> +                        * lock for reading, and we cannot safely release it.
> +                        * Because HGM-enabled VMAs have already unshared all
> +                        * PMDs, we can safely ignore PMD unsharing here.
>                          */
> -                       if (!anon) {
> +                       if (!anon && !hugetlb_hgm_enabled(vma)) {
>                                 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +                               /*
> +                                * To call huge_pmd_unshare, i_mmap_rwsem must
> +                                * be held in write mode.  Caller needs to
> +                                * explicitly do this outside rmap routines.
> +                                *
> +                                * We also must hold hugetlb vma_lock in write
> +                                * mode. Lock order dictates acquiring vma_lock
> +                                * BEFORE i_mmap_rwsem.  We can only try lock
> +                                * here and fail if unsuccessful.
> +                                */
>                                 if (!hugetlb_vma_trylock_write(vma)) {
>                                         page_vma_mapped_walk_done(&pvmw);
>                                         ret = false;
> --
> 2.38.0.135.g90850a2211-goog
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions
  2022-12-13 15:49     ` James Houghton
@ 2022-12-15 17:51       ` Mike Kravetz
  2022-12-15 18:08         ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Mike Kravetz @ 2022-12-15 17:51 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 12/13/22 10:49, James Houghton wrote:
> On Mon, Dec 12, 2022 at 7:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 10/21/22 16:36, James Houghton wrote:
> > > Currently it is possible for all shared VMAs to use HGM, but it must be
> > > enabled first. This is because with HGM, we lose PMD sharing, and page
> > > table walks require additional synchronization (we need to take the VMA
> > > lock).
> >
> > Not sure yet, but I expect Peter's series will help with locking for
> > hugetlb specific page table walks.
> 
> It should make things a little bit cleaner in this series; I'll rebase
> HGM on top of those patches this week (and hopefully get a v1 out
> soon).
> 
> I don't think it's possible to implement MADV_COLLAPSE with RCU alone
> (as implemented in Peter's series anyway); we still need the VMA lock.

As I continue going through the series, I realize that I am not exactly
sure what synchronization HGM requires from the vma lock.  As you are
aware, the lock was originally designed to protect against someone doing a
pmd_unshare and effectively removing part of the page table.  However,
since pmd sharing is disabled for vmas with HGM enabled (I think?), it
might be a good idea to explicitly say somewhere why the lock is needed.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions
  2022-12-15 17:51       ` Mike Kravetz
@ 2022-12-15 18:08         ` James Houghton
  0 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-12-15 18:08 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Thu, Dec 15, 2022 at 12:52 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 12/13/22 10:49, James Houghton wrote:
> > On Mon, Dec 12, 2022 at 7:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 10/21/22 16:36, James Houghton wrote:
> > > > Currently it is possible for all shared VMAs to use HGM, but it must be
> > > > enabled first. This is because with HGM, we lose PMD sharing, and page
> > > > table walks require additional synchronization (we need to take the VMA
> > > > lock).
> > >
> > > Not sure yet, but I expect Peter's series will help with locking for
> > > hugetlb specific page table walks.
> >
> > It should make things a little bit cleaner in this series; I'll rebase
> > HGM on top of those patches this week (and hopefully get a v1 out
> > soon).
> >
> > I don't think it's possible to implement MADV_COLLAPSE with RCU alone
> > (as implemented in Peter's series anyway); we still need the VMA lock.
>
> As I continue going through the series, I realize that I am not exactly
> sure what synchronization HGM requires from the vma lock.  As you are
> aware, the lock was originally designed to protect against someone doing a
> pmd_unshare and effectively removing part of the page table.  However,
> since pmd sharing is disabled for vmas with HGM enabled (I think?), it
> might be a good idea to explicitly say somewhere why the lock is needed.

It synchronizes MADV_COLLAPSE for hugetlb (hugetlb_collapse).
MADV_COLLAPSE will take it for writing and free some page table pages,
and high-granularity walks will generally take it for reading. I'll
make this clear in a comment somewhere and in commit messages.

It might be easier if hugetlb_collapse() had the exact same
synchronization as huge_pmd_unshare, where we not only take the VMA
lock for writing but also take the i_mmap_rwsem for writing, so that
anywhere hugetlb_walk() is safe, high-granularity walks are also
safe. I think I should just do that for the sake of simplicity.
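
In other words, hugetlb_collapse() would do roughly this (untested sketch,
not the actual implementation; it only uses the existing hugetlb vma lock
and i_mmap helpers):

	/*
	 * Lock order: hugetlb vma_lock before i_mmap_rwsem, matching
	 * huge_pmd_unshare.  Holding both for writing excludes every
	 * high-granularity walker, which holds one of them for reading.
	 */
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(vma->vm_file->f_mapping);

	/* ... walk [start, end), free HGM page tables, re-map huge ... */

	i_mmap_unlock_write(vma->vm_file->f_mapping);
	hugetlb_vma_unlock_write(vma);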

- James

> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 17/47] hugetlb: make hugetlb_change_protection compatible with HGM
  2022-10-21 16:36 ` [RFC PATCH v2 17/47] hugetlb: make hugetlb_change_protection compatible with HGM James Houghton
@ 2022-12-15 18:15   ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-15 18:15 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> The main change here is to do a high-granularity walk and pulling the
> shift from the walk (not from the hstate).
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 65 ++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 45 insertions(+), 20 deletions(-)

Nothing stands out.  The more patches I look at, the more familiar I am
becoming with the new data structures and interfaces.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 24/47] hugetlb: update page_vma_mapped to do high-granularity walks
  2022-12-15 17:49   ` James Houghton
@ 2022-12-15 18:45     ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2022-12-15 18:45 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

Hi, James,

On Thu, Dec 15, 2022 at 12:49:18PM -0500, James Houghton wrote:
> > @@ -166,19 +167,57 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)

[...]

> I realize that I can't do this -- we're already holding the
> i_mmap_rwsem, and we have to take the VMA lock first. It seems like
> we're always holding it for writing in this case, so if I make
> hugetlb_collapse taking the i_mmap_rwsem for reading, this will be
> safe.
> 
> Peter, you looked at this recently [1] -- do you know if we're always
> holding i_mmap_rwsem *for writing* here?
> 
> [1] https://lore.kernel.org/linux-mm/20221209170100.973970-10-peterx@redhat.com/

I think so, an analysis is in previous v2 in one of my reply to John:

https://lore.kernel.org/all/Y5JjTPTxCWSklCan@x1n/

No hurt to double check, though.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 18/47] hugetlb: enlighten follow_hugetlb_page to support HGM
  2022-10-21 16:36 ` [RFC PATCH v2 18/47] hugetlb: enlighten follow_hugetlb_page to support HGM James Houghton
@ 2022-12-15 19:29   ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-15 19:29 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> This enables high-granularity mapping support in GUP.
> 
> One important change here is that, before, we never needed to grab the
> VMA lock, but now, to prevent someone from collapsing the page tables
> out from under us, we grab it for reading when doing high-granularity PT
> walks.

Once again, I think Peter's series will already take the vma lock here.

> In case it is confusing, pfn_offset is the offset (in PAGE_SIZE units)
> that vaddr points to within the subpage that hpte points to.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 53 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2d096cef53cd..d76ab32fb6d3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6382,11 +6382,9 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
>  	}
>  }
>  
> -static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
> +static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t pteval,
>  					       bool *unshare)
>  {
> -	pte_t pteval = huge_ptep_get(pte);
> -
>  	*unshare = false;
>  	if (is_swap_pte(pteval))
>  		return true;
> @@ -6478,12 +6476,20 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct hstate *h = hstate_vma(vma);
>  	int err = -EFAULT, refs;
>  
> +	/*
> +	 * Grab the VMA lock for reading now so no one can collapse the page
> +	 * table from under us.
> +	 */
> +	hugetlb_vma_lock_read(vma);
> +
>  	while (vaddr < vma->vm_end && remainder) {
> -		pte_t *pte;
> +		pte_t *ptep, pte;

Thanks, that really would be better as ptep in the existing code.

>  		spinlock_t *ptl = NULL;
>  		bool unshare = false;
>  		int absent;
> -		struct page *page;
> +		unsigned long pages_per_hpte;
> +		struct page *page, *subpage;
> +		struct hugetlb_pte hpte;
>  
>  		/*
>  		 * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -6499,13 +6505,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * each hugepage.  We have to make sure we get the
>  		 * first, for the page indexing below to work.
>  		 *
> -		 * Note that page table lock is not held when pte is null.
> +		 * Note that page table lock is not held when ptep is null.
>  		 */
> -		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
> -				      huge_page_size(h));
> -		if (pte)
> -			ptl = huge_pte_lock(h, mm, pte);
> -		absent = !pte || huge_pte_none(huge_ptep_get(pte));
> +		ptep = huge_pte_offset(mm, vaddr & huge_page_mask(h),
> +				       huge_page_size(h));
> +		if (ptep) {
> +			hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
> +					hpage_size_to_level(huge_page_size(h)));
> +			hugetlb_hgm_walk(mm, vma, &hpte, vaddr,
> +					PAGE_SIZE,
> +					/*stop_at_none=*/true);
> +			ptl = hugetlb_pte_lock(mm, &hpte);
> +			ptep = hpte.ptep;
> +			pte = huge_ptep_get(ptep);
> +		}
> +
> +		absent = !ptep || huge_pte_none(pte);

In Peter's series, huge_pte_offset calls are replaced with hugetlb_walk, which
takes a vma pointer.  It might make sense now to consolidate the hugetlb
page table walkers; I know that was discussed at some point.  Just thinking
we could possibly fold much of the above into hugetlb_walk.
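
Maybe something like this (untested; the name and signature are made up)
once the walkers are consolidated, built on hugetlb_walk() from Peter's
series plus the HGM walk from this one:

	static bool hugetlb_full_walk(struct hugetlb_pte *hpte,
				      struct vm_area_struct *vma,
				      unsigned long addr, unsigned long sz,
				      bool stop_at_none)
	{
		struct hstate *h = hstate_vma(vma);
		pte_t *ptep = hugetlb_walk(vma, addr & huge_page_mask(h),
					   huge_page_size(h));

		if (!ptep)
			return false;

		hugetlb_pte_populate(hpte, ptep, huge_page_shift(h),
				hpage_size_to_level(huge_page_size(h)));
		hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, sz,
				stop_at_none);
		return true;
	}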

>  
>  		/*
>  		 * When coredumping, it suits get_dump_page if we just return
> @@ -6516,12 +6531,19 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 */
>  		if (absent && (flags & FOLL_DUMP) &&
>  		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
> -			if (pte)
> +			if (ptep)
>  				spin_unlock(ptl);
>  			remainder = 0;
>  			break;
>  		}
>  
> +		if (!absent && pte_present(pte) &&
> +				!hugetlb_pte_present_leaf(&hpte, pte)) {
> +			/* We raced with someone splitting the PTE, so retry. */

I do not think I have gotten to the splitting code yet, but I am assuming we do
not hold the vma lock for write when splitting.  We would of course hold the
page table lock.

> +			spin_unlock(ptl);
> +			continue;
> +		}
> +
>  		/*
>  		 * We need call hugetlb_fault for both hugepages under migration
>  		 * (in which case hugetlb_fault waits for the migration,) and
> @@ -6537,7 +6559,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			vm_fault_t ret;
>  			unsigned int fault_flags = 0;
>  
> -			if (pte)
> +			/* Drop the lock before entering hugetlb_fault. */
> +			hugetlb_vma_unlock_read(vma);
> +
> +			if (ptep)
>  				spin_unlock(ptl);
>  			if (flags & FOLL_WRITE)
>  				fault_flags |= FAULT_FLAG_WRITE;
> @@ -6560,7 +6585,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			if (ret & VM_FAULT_ERROR) {
>  				err = vm_fault_to_errno(ret, flags);
>  				remainder = 0;
> -				break;
> +				goto out;
>  			}
>  			if (ret & VM_FAULT_RETRY) {
>  				if (locked &&
> @@ -6578,11 +6603,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 */
>  				return i;
>  			}
> +			hugetlb_vma_lock_read(vma);
>  			continue;
>  		}
>  
> -		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> -		page = pte_page(huge_ptep_get(pte));
> +		pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
> +		subpage = pte_page(pte);
> +		pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
> +		page = compound_head(subpage);
>  
>  		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
>  			       !PageAnonExclusive(page), page);
> @@ -6592,21 +6620,21 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * and skip the same_page loop below.
>  		 */
>  		if (!pages && !vmas && !pfn_offset &&
> -		    (vaddr + huge_page_size(h) < vma->vm_end) &&
> -		    (remainder >= pages_per_huge_page(h))) {
> -			vaddr += huge_page_size(h);
> -			remainder -= pages_per_huge_page(h);
> -			i += pages_per_huge_page(h);
> +		    (vaddr + pages_per_hpte < vma->vm_end) &&
> +		    (remainder >= pages_per_hpte)) {
> +			vaddr += pages_per_hpte;
> +			remainder -= pages_per_hpte;
> +			i += pages_per_hpte;
>  			spin_unlock(ptl);
>  			continue;
>  		}
>  
>  		/* vaddr may not be aligned to PAGE_SIZE */
> -		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
> +		refs = min3(pages_per_hpte - pfn_offset, remainder,
>  		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
>  
>  		if (pages || vmas)
> -			record_subpages_vmas(nth_page(page, pfn_offset),
> +			record_subpages_vmas(nth_page(subpage, pfn_offset),
>  					     vma, refs,
>  					     likely(pages) ? pages + i : NULL,
>  					     vmas ? vmas + i : NULL);

Not your fault, but all the above was difficult to follow before HGM. :(
Did not notice any issues.
-- 
Mike Kravetz

> @@ -6637,6 +6665,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  		spin_unlock(ptl);
>  	}
> +	hugetlb_vma_unlock_read(vma);
> +out:
>  	*nr_pages = remainder;
>  	/*
>  	 * setting position is actually required only if remainder is
> -- 
> 2.38.0.135.g90850a2211-goog
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 19/47] hugetlb: make hugetlb_follow_page_mask HGM-enabled
  2022-10-21 16:36 ` [RFC PATCH v2 19/47] hugetlb: make hugetlb_follow_page_mask HGM-enabled James Houghton
@ 2022-12-16  0:25   ` Mike Kravetz
  0 siblings, 0 replies; 122+ messages in thread
From: Mike Kravetz @ 2022-12-16  0:25 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 10/21/22 16:36, James Houghton wrote:
> The change here is very simple: do a high-granularity walk.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)

Looks fine

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-11-16 23:30     ` James Houghton
@ 2022-12-21 19:23       ` Peter Xu
  2022-12-21 20:21         ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-12-21 19:23 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

James,

On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > enable HGM. Userspace can check if the feature exists in
> > > uffdio_api.features, and if it does not exist, the kernel does not
> > > support and therefore did not enable HGM.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > It's still slightly a pity that this can only be enabled by an uffd context
> > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> 
> The idea here is that, for applications that can conceivably benefit
> from HGM, we have a mechanism for enabling it for that application. So
> this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> prefer this approach over something more general like MADV_ENABLE_HGM
> or something.

Sorry to get back to this very late - I know this has been discussed since
the very early stage of the feature, but is there any reasoning behind?

When I start to think seriously on applying this to process snapshot with
uffd-wp I found that the minor mode trick won't easily play - normally
that's a case where all the pages were there mapped huge, but when the app
wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
pages, probably some size that the user can specify.  It'll be non-trivial
to enable HGM during that phase using MINOR mode because in that case the
pages are all mapped.

For the long term, I am just still worried the current interface is still
not as flexible.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-21 19:23       ` Peter Xu
@ 2022-12-21 20:21         ` James Houghton
  2022-12-21 21:39           ` Mike Kravetz
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-21 20:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,
>
> On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > enable HGM. Userspace can check if the feature exists in
> > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > support and therefore did not enable HGM.
> > > >
> > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > >
> > > It's still slightly a pity that this can only be enabled by an uffd context
> > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> >
> > The idea here is that, for applications that can conceivably benefit
> > from HGM, we have a mechanism for enabling it for that application. So
> > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > prefer this approach over something more general like MADV_ENABLE_HGM
> > or something.
>
> Sorry to get back to this very late - I know this has been discussed since
> the very early stage of the feature, but is there any reasoning behind?
>
> When I start to think seriously on applying this to process snapshot with
> uffd-wp I found that the minor mode trick won't easily play - normally
> that's a case where all the pages were there mapped huge, but when the app
> wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> pages, probably some size that the user can specify.  It'll be non-trivial
> to enable HGM during that phase using MINOR mode because in that case the
> pages are all mapped.
>
> For the long term, I am just still worried the current interface is still
> not as flexible.

Thanks for bringing this up, Peter. I think the main reason was:
having separate UFFD_FEATUREs clearly indicates to userspace what is
and is not supported.
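
For example, checking for HGM support from userspace looks roughly like the
untested sketch below (feature names are the ones from this series; on a
kernel without HGM the ioctl either fails or leaves the bit clear, so either
way userspace knows HGM was not enabled):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	/* Returns a uffd with HGM minor faults enabled, or -1. */
	int uffd_with_hgm(void)
	{
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		struct uffdio_api api = {
			.api = UFFD_API,
			.features = UFFD_FEATURE_MINOR_HUGETLBFS |
				    UFFD_FEATURE_MINOR_HUGETLBFS_HGM,
		};

		if (uffd < 0)
			return -1;
		if (ioctl(uffd, UFFDIO_API, &api) ||
		    !(api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM)) {
			/* Kernel does not support (or did not enable) HGM. */
			close(uffd);
			return -1;
		}
		return uffd;
	}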

For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
allowed as of this patch series, but it could be allowed in the
future. To add support in the same way as this series, we would add
another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
having to add another feature isn't great; is this what you're
concerned about?

Considering MADV_ENABLE_HUGETLB...
1. If a user provides this, then the contract becomes: "the kernel may
allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
high-granularities, provided the support exists", but it becomes
unclear to userspace to know what's supported and what isn't.
2. We would then need to keep track if a user explicitly enabled it,
or if it got enabled automatically in response to memory poison, for
example. Not a big problem, just a complication. (Otherwise, if HGM
got enabled for poison, suddenly userspace would be allowed to do
things it wasn't allowed to do before.)
3. This API makes sense for enabling HGM for something outside of
userfaultfd, like MADV_DONTNEED.

Maybe (1) is solvable if we provide a bit field that describes what's
supported, or maybe (1) isn't even a problem.

Another possibility is to have a feature like
UFFD_FEATURE_HUGETLB_HGM, which will enable the possibility of HGM for
all relevant userfaultfd ioctls, but we have the same problem where
it's unclear what's supported and what isn't.

I'm happy to change the API to whatever you think makes the most sense.

Thanks!
- James

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-21 20:21         ` James Houghton
@ 2022-12-21 21:39           ` Mike Kravetz
  2022-12-21 22:10             ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: Mike Kravetz @ 2022-12-21 21:39 UTC (permalink / raw)
  To: James Houghton
  Cc: Peter Xu, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 12/21/22 15:21, James Houghton wrote:
> On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > James,
> >
> > On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > > enable HGM. Userspace can check if the feature exists in
> > > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > > support and therefore did not enable HGM.
> > > > >
> > > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > >
> > > > It's still slightly a pity that this can only be enabled by an uffd context
> > > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> > >
> > > The idea here is that, for applications that can conceivably benefit
> > > from HGM, we have a mechanism for enabling it for that application. So
> > > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > > prefer this approach over something more general like MADV_ENABLE_HGM
> > > or something.
> >
> > Sorry to get back to this very late - I know this has been discussed since
> > the very early stage of the feature, but is there any reasoning behind?
> >
> > When I start to think seriously on applying this to process snapshot with
> > uffd-wp I found that the minor mode trick won't easily play - normally
> > that's a case where all the pages were there mapped huge, but when the app
> > wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> > pages, probably some size that the user can specify.  It'll be non-trivial
> > to enable HGM during that phase using MINOR mode because in that case the
> > pages are all mapped.
> >
> > For the long term, I am just still worried the current interface is still
> > not as flexible.
> 
> Thanks for bringing this up, Peter. I think the main reason was:
> having separate UFFD_FEATUREs clearly indicates to userspace what is
> and is not supported.

IIRC, I think we wanted to initially limit the usage to the very
specific use case (live migration).  The idea is that we could then
expand usage as more use cases came to light.

Another good thing is that userfaultfd has versioning built into the
API.  Thus a user can determine if HGM is enabled in their running
kernel.

> For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> allowed as of this patch series, but it could be allowed in the
> future. To add support in the same way as this series, we would add
> another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> having to add another feature isn't great; is this what you're
> concerned about?
> 
> Considering MADV_ENABLE_HUGETLB...
> 1. If a user provides this, then the contract becomes: "the kernel may
> allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> high-granularities, provided the support exists", but it becomes
> unclear to userspace to know what's supported and what isn't.
> 2. We would then need to keep track if a user explicitly enabled it,
> or if it got enabled automatically in response to memory poison, for
> example. Not a big problem, just a complication. (Otherwise, if HGM
> got enabled for poison, suddenly userspace would be allowed to do
> things it wasn't allowed to do before.)
> 3. This API makes sense for enabling HGM for something outside of
> userfaultfd, like MADV_DONTNEED.

I think #3 is key here.  Once we start applying HGM to things outside
userfaultfd, then more thought will be required on APIs.  The API is
somewhat limited by design until the basic functionality is in place.
-- 
Mike Kravetz

> Maybe (1) is solvable if we provide a bit field that describes what's
> supported, or maybe (1) isn't even a problem.
> 
> Another possibility is to have a feature like
> UFFD_FEATURE_HUGETLB_HGM, which will enable the possibility of HGM for
> all relevant userfaultfd ioctls, but we have the same problem where
> it's unclear what's supported and what isn't.
> 
> I'm happy to change the API to whatever you think makes the most sense.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-21 21:39           ` Mike Kravetz
@ 2022-12-21 22:10             ` Peter Xu
  2022-12-21 22:31               ` Mike Kravetz
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-12-21 22:10 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: James Houghton, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> On 12/21/22 15:21, James Houghton wrote:
> > On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > James,
> > >
> > > On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > > > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > > > enable HGM. Userspace can check if the feature exists in
> > > > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > > > support and therefore did not enable HGM.
> > > > > >
> > > > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > > >
> > > > > It's still slightly a pity that this can only be enabled by an uffd context
> > > > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> > > >
> > > > The idea here is that, for applications that can conceivably benefit
> > > > from HGM, we have a mechanism for enabling it for that application. So
> > > > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > > > prefer this approach over something more general like MADV_ENABLE_HGM
> > > > or something.
> > >
> > > Sorry to get back to this very late - I know this has been discussed since
> > > the very early stage of the feature, but is there any reasoning behind?
> > >
> > > When I start to think seriously on applying this to process snapshot with
> > > uffd-wp I found that the minor mode trick won't easily play - normally
> > > that's a case where all the pages were there mapped huge, but when the app
> > > wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> > > pages, probably some size that the user can specify.  It'll be non-trivial
> > > to enable HGM during that phase using MINOR mode because in that case the
> > > pages are all mapped.
> > >
> > > For the long term, I am just still worried the current interface is still
> > > not as flexible.
> > 
> > Thanks for bringing this up, Peter. I think the main reason was:
> > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > and is not supported.
> 
> IIRC, I think we wanted to initially limit the usage to the very
> specific use case (live migration).  The idea is that we could then
> expand usage as more use cases came to light.
> 
> Another good thing is that userfaultfd has versioning built into the
> API.  Thus a user can determine if HGM is enabled in their running
> kernel.

I don't worry much about this one; afaiu, if we have any way to enable hgm then
the user can just try enabling it on a test vma, just like when an app
wants to detect whether a new madvise() is present on the current host OS.

Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
would work too.
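
e.g. something like this untested probe, where MADV_ENABLE_HGM is a purely
hypothetical advice value used only for illustration (needs <sys/mman.h>):

	static int hgm_madvise_supported(void)
	{
		long sz = 2UL << 20;	/* assumes a 2M default hugepage size */
		void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		int ok;

		if (p == MAP_FAILED)
			return 0;
		/* MADV_ENABLE_HGM is hypothetical; EINVAL means unsupported. */
		ok = madvise(p, sz, MADV_ENABLE_HGM) == 0;
		munmap(p, sz);
		return ok;
	}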

> 
> > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > allowed as of this patch series, but it could be allowed in the
> > future. To add support in the same way as this series, we would add
> > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > having to add another feature isn't great; is this what you're
> > concerned about?
> > 
> > Considering MADV_ENABLE_HUGETLB...
> > 1. If a user provides this, then the contract becomes: "the kernel may
> > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > high-granularities, provided the support exists", but it becomes
> > unclear to userspace to know what's supported and what isn't.
> > 2. We would then need to keep track if a user explicitly enabled it,
> > or if it got enabled automatically in response to memory poison, for
> > example. Not a big problem, just a complication. (Otherwise, if HGM
> > got enabled for poison, suddenly userspace would be allowed to do
> > things it wasn't allowed to do before.)

We could alternatively have two flags for each vma: (a) hgm_advised and (b)
hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
to set (b) only.  For this patchset, it can be all about (a).
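
Roughly like this (the structure and helper names are hypothetical, just to
illustrate the two-flag idea):

	/* Hypothetical per-vma state. */
	struct hugetlb_hgm_state {
		bool hgm_advised;	/* userspace explicitly asked for HGM */
		bool hgm_enabled;	/* HGM mappings may exist in the vma */
	};

	static void hgm_advise(struct hugetlb_hgm_state *s)
	{
		s->hgm_advised = true;
		s->hgm_enabled = true;		/* (a) always sets (b) */
	}

	static void hgm_enable_for_poison(struct hugetlb_hgm_state *s)
	{
		s->hgm_enabled = true;		/* (b) only; the API stays off */
	}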

> > 3. This API makes sense for enabling HGM for something outside of
> > userfaultfd, like MADV_DONTNEED.
> 
> I think #3 is key here.  Once we start applying HGM to things outside
> userfaultfd, then more thought will be required on APIs.  The API is
> somewhat limited by design until the basic functionality is in place.

Mike, could you elaborate on the major concern with having hgm used
outside the uffd and live migration use cases?

I feel like I'm missing something here.  I can understand that we want to
limit the usage to cases where the user explicitly asks for hgm, because we
want to keep the old behavior intact.  However, if we want another way to
enable hgm it'll still need a knob anyway even outside uffd, and I thought
that would serve the same purpose, or maybe not?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-21 22:10             ` Peter Xu
@ 2022-12-21 22:31               ` Mike Kravetz
  2022-12-22  0:02                 ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Mike Kravetz @ 2022-12-21 22:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 12/21/22 17:10, Peter Xu wrote:
> On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> > On 12/21/22 15:21, James Houghton wrote:
> > > On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > James,
> > > >
> > > > On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > > > > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > > > > enable HGM. Userspace can check if the feature exists in
> > > > > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > > > > support and therefore did not enable HGM.
> > > > > > >
> > > > > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > > > >
> > > > > > It's still slightly a pity that this can only be enabled by an uffd context
> > > > > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> > > > >
> > > > > The idea here is that, for applications that can conceivably benefit
> > > > > from HGM, we have a mechanism for enabling it for that application. So
> > > > > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > > > > prefer this approach over something more general like MADV_ENABLE_HGM
> > > > > or something.
> > > >
> > > > Sorry to get back to this very late - I know this has been discussed since
> > > > the very early stage of the feature, but is there any reasoning behind?
> > > >
> > > > When I start to think seriously on applying this to process snapshot with
> > > > uffd-wp I found that the minor mode trick won't easily play - normally
> > > > that's a case where all the pages were there mapped huge, but when the app
> > > > wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> > > > pages, probably some size that the user can specify.  It'll be non-trivial
> > > > to enable HGM during that phase using MINOR mode because in that case the
> > > > pages are all mapped.
> > > >
> > > > For the long term, I am just still worried the current interface is still
> > > > not as flexible.
> > > 
> > > Thanks for bringing this up, Peter. I think the main reason was:
> > > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > > and is not supported.
> > 
> > IIRC, I think we wanted to initially limit the usage to the very
> > specific use case (live migration).  The idea is that we could then
> > expand usage as more use cases came to light.
> > 
> > Another good thing is that userfaultfd has versioning built into the
> > API.  Thus a user can determine if HGM is enabled in their running
> > kernel.
> 
> I don't worry much about this one; afaiu, if we have any way to enable hgm then
> the user can just try enabling it on a test vma, just like when an app
> wants to detect whether a new madvise() is present on the current host OS.
> 
> Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
> would work too.
> 
> > 
> > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > > allowed as of this patch series, but it could be allowed in the
> > > future. To add support in the same way as this series, we would add
> > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > > having to add another feature isn't great; is this what you're
> > > concerned about?
> > > 
> > > Considering MADV_ENABLE_HUGETLB...
> > > 1. If a user provides this, then the contract becomes: "the kernel may
> > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > > high-granularities, provided the support exists", but it becomes
> > > unclear to userspace to know what's supported and what isn't.
> > > 2. We would then need to keep track if a user explicitly enabled it,
> > > or if it got enabled automatically in response to memory poison, for
> > > example. Not a big problem, just a complication. (Otherwise, if HGM
> > > got enabled for poison, suddenly userspace would be allowed to do
> > > things it wasn't allowed to do before.)
> 
> We could alternatively have two flags for each vma: (a) hgm_advised and (b)
> hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
> to set (b) only.  For this patchset, it can be all about (a).
> 
> > > 3. This API makes sense for enabling HGM for something outside of
> > > userfaultfd, like MADV_DONTNEED.
> > 
> > I think #3 is key here.  Once we start applying HGM to things outside
> > userfaultfd, then more thought will be required on APIs.  The API is
> > somewhat limited by design until the basic functionality is in place.
> 
> Mike, could you elaborate what's the major concern of having hgm used
> outside uffd and live migration use cases?
> 
> I feel like I miss something here.  I can understand we want to limit the
> usage only when the user specifies using hgm because we want to keep the
> old behavior intact.  However if we want another way to enable hgm it'll
> still need one knob anyway even outside uffd, and I thought that'll service
> the same purpose, or maybe not?

I am not opposed to using hgm outside the use cases targeted by this series.

It seems that when we were previously discussing the API we spent a bunch of
time going around in circles trying to get the API correct.  That is expected
as it is more difficult to take all users/uses/abuses of the API into account.

Since the initial use case was fairly limited, it seemed like a good idea to
limit the API to userfaultfd.  In this way we could focus on the underlying
code/implementation and then expand as needed.  Of course, with an eye on
anything that may be a limiting factor in the future.

I was not aware of the uffd-wp use case, and am more than happy to discuss
expanding the API.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-21 22:31               ` Mike Kravetz
@ 2022-12-22  0:02                 ` James Houghton
  2022-12-22  0:38                   ` Mike Kravetz
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-22  0:02 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Peter Xu, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Dec 21, 2022 at 5:32 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 12/21/22 17:10, Peter Xu wrote:
> > On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> > > On 12/21/22 15:21, James Houghton wrote:
> > > > Thanks for bringing this up, Peter. I think the main reason was:
> > > > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > > > and is not supported.
> > >
> > > IIRC, I think we wanted to initially limit the usage to the very
> > > specific use case (live migration).  The idea is that we could then
> > > expand usage as more use cases came to light.
> > >
> > > Another good thing is that userfaultfd has versioning built into the
> > > API.  Thus a user can determine if HGM is enabled in their running
> > > kernel.
> >
> > I don't worry much on this one, afaiu if we have any way to enable hgm then
> > the user can just try enabling it on a test vma, just like when an app
> > wants to detect whether a new madvise() is present on the current host OS.

That would be enough to test whether HGM was merely present, but not
whether specific features like 4K UFFDIO_CONTINUEs or 4K
UFFDIO_WRITEPROTECTs were available. You could always check these by
making a HugeTLB VMA and setting it up correctly for userfaultfd/etc.,
but that's a little messy.
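
(For reference, a minimal userspace probe could look something like the
sketch below; UFFD_FEATURE_MINOR_HUGETLBFS_HGM is the bit proposed by
this series, and calling UFFDIO_API with features == 0 just asks the
kernel to report the full supported set.)

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if the kernel advertises HGM for minor faults, 0 otherwise. */
static int hgm_minor_supported(void)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	int ret = 0;

	if (uffd < 0)
		return 0;
	/* Requesting no features makes the kernel report the full set. */
	if (!ioctl(uffd, UFFDIO_API, &api))
		ret = !!(api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM);
	close(uffd);
	return ret;
}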

> >
> > Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
> > would work too.

I'm not opposed to this.

> >
> > >
> > > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > > > allowed as of this patch series, but it could be allowed in the
> > > > future. To add support in the same way as this series, we would add
> > > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > > > having to add another feature isn't great; is this what you're
> > > > concerned about?
> > > >
> > > > Considering MADV_ENABLE_HUGETLB...
> > > > 1. If a user provides this, then the contract becomes: "the kernel may
> > > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > > > high-granularities, provided the support exists", but it becomes
> > > > unclear to userspace to know what's supported and what isn't.
> > > > 2. We would then need to keep track if a user explicitly enabled it,
> > > > or if it got enabled automatically in response to memory poison, for
> > > > example. Not a big problem, just a complication. (Otherwise, if HGM
> > > > got enabled for poison, suddenly userspace would be allowed to do
> > > > things it wasn't allowed to do before.)
> >
> > We could alternatively have two flags for each vma: (a) hgm_advised and (b)
> > hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
> > to set (b) only.  For this patchset, it can be all about (a).

My thoughts exactly. :)

> >
> > > > 3. This API makes sense for enabling HGM for something outside of
> > > > userfaultfd, like MADV_DONTNEED.
> > >
> > > I think #3 is key here.  Once we start applying HGM to things outside
> > > userfaultfd, then more thought will be required on APIs.  The API is
> > > somewhat limited by design until the basic functionality is in place.
> >
> > Mike, could you elaborate what's the major concern of having hgm used
> > outside uffd and live migration use cases?
> >
> > I feel like I miss something here.  I can understand we want to limit the
> > usage only when the user specifies using hgm because we want to keep the
> > old behavior intact.  However if we want another way to enable hgm it'll
> > still need one knob anyway even outside uffd, and I thought that'll service
> > the same purpose, or maybe not?
>
> I am not opposed to using hgm outside the use cases targeted by this series.
>
> It seems that when we were previously discussing the API we spent a bunch of
> time going around in circles trying to get the API correct.  That is expected
> as it is more difficult to take all users/uses/abuses of the API into account.
>
> Since the initial use case was fairly limited, it seemed like a good idea to
> limit the API to userfaultfd.  In this way we could focus on the underlying
> code/implementation and then expand as needed.  Of course, with an eye on
> anything that may be a limiting factor in the future.
>
> I was not aware of the uffd-wp use case, and am more than happy to discuss
> expanding the API.

So considering two API choices:

1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for
UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for
UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start
allowing high-granularity choices (not sure if this is bad; we started
allowing it for HugeTLB recently with no other API change, AFAIA).

2. MADV_ENABLE_HGM or something similar. The changes to
UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically,
provided they are implemented.

I don't mind one way or the other. Peter, I assume you prefer #2.
Mike, what about you? If we decide on something other than #1, I'll
make the change before sending v1 out.

- James

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-22  0:02                 ` James Houghton
@ 2022-12-22  0:38                   ` Mike Kravetz
  2022-12-22  1:24                     ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Mike Kravetz @ 2022-12-22  0:38 UTC (permalink / raw)
  To: James Houghton
  Cc: Peter Xu, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On 12/21/22 19:02, James Houghton wrote:
> On Wed, Dec 21, 2022 at 5:32 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 12/21/22 17:10, Peter Xu wrote:
> > > On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> > > > On 12/21/22 15:21, James Houghton wrote:
> > > > > Thanks for bringing this up, Peter. I think the main reason was:
> > > > > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > > > > and is not supported.
> > > >
> > > > IIRC, I think we wanted to initially limit the usage to the very
> > > > specific use case (live migration).  The idea is that we could then
> > > > expand usage as more use cases came to light.
> > > >
> > > > Another good thing is that userfaultfd has versioning built into the
> > > > API.  Thus a user can determine if HGM is enabled in their running
> > > > kernel.
> > >
> > > I don't worry much on this one, afaiu if we have any way to enable hgm then
> > > the user can just try enabling it on a test vma, just like when an app
> > > wants to detect whether a new madvise() is present on the current host OS.
> 
> That would be enough to test whether HGM was merely present, but not
> whether specific features like 4K UFFDIO_CONTINUEs or 4K
> UFFDIO_WRITEPROTECTs were available. You could always check these by
> making a HugeTLB VMA and setting it up correctly for userfaultfd/etc.,
> but that's a little messy.
> 
> > >
> > > Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
> > > would work too.
> 
> I'm not opposed to this.
> 
> > >
> > > >
> > > > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > > > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > > > > allowed as of this patch series, but it could be allowed in the
> > > > > future. To add support in the same way as this series, we would add
> > > > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > > > > having to add another feature isn't great; is this what you're
> > > > > concerned about?
> > > > >
> > > > > Considering MADV_ENABLE_HUGETLB...
> > > > > 1. If a user provides this, then the contract becomes: "the kernel may
> > > > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > > > > high-granularities, provided the support exists", but it becomes
> > > > > unclear to userspace to know what's supported and what isn't.
> > > > > 2. We would then need to keep track if a user explicitly enabled it,
> > > > > or if it got enabled automatically in response to memory poison, for
> > > > > example. Not a big problem, just a complication. (Otherwise, if HGM
> > > > > got enabled for poison, suddenly userspace would be allowed to do
> > > > > things it wasn't allowed to do before.)
> > >
> > > We could alternatively have two flags for each vma: (a) hgm_advised and (b)
> > > hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
> > > to set (b) only.  For this patchset, it can be all about (a).
> 
> My thoughts exactly. :)
> 
> > >
> > > > > 3. This API makes sense for enabling HGM for something outside of
> > > > > userfaultfd, like MADV_DONTNEED.
> > > >
> > > > I think #3 is key here.  Once we start applying HGM to things outside
> > > > userfaultfd, then more thought will be required on APIs.  The API is
> > > > somewhat limited by design until the basic functionality is in place.
> > >
> > > Mike, could you elaborate what's the major concern of having hgm used
> > > outside uffd and live migration use cases?
> > >
> > > I feel like I miss something here.  I can understand we want to limit the
> > > usage only when the user specifies using hgm because we want to keep the
> > > old behavior intact.  However if we want another way to enable hgm it'll
> > > still need one knob anyway even outside uffd, and I thought that'll service
> > > the same purpose, or maybe not?
> >
> > I am not opposed to using hgm outside the use cases targeted by this series.
> >
> > It seems that when we were previously discussing the API we spent a bunch of
> > time going around in circles trying to get the API correct.  That is expected
> > as it is more difficult to take all users/uses/abuses of the API into account.
> >
> > Since the initial use case was fairly limited, it seemed like a good idea to
> > limit the API to userfaultfd.  In this way we could focus on the underlying
> > code/implementation and then expand as needed.  Of course, with an eye on
> > anything that may be a limiting factor in the future.
> >
> > I was not aware of the uffd-wp use case, and am more than happy to discuss
> > expanding the API.
> 
> So considering two API choices:
> 
> 1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for
> UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for
> UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start
> allowing high-granularity choices (not sure if this is bad; we started
> allowing it for HugeTLB recently with no other API change, AFAIA).

I don't think we can just start allowing HGM for MADV_DONTNEED without
some type of user interaction/request.  Otherwise, a user that passes
in non-hugetlb page size requests may get unexpected results.  And, one
of the threads about MADV_DONTNEED points out a valid use case where
the caller may not know whether the mapping is hugetlb and is likely to
pass in non-hugetlb page size requests.

> 2. MADV_ENABLE_HGM or something similar. The changes to
> UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically,
> provided they are implemented.
> 
> I don't mind one way or the other. Peter, I assume you prefer #2.
> Mike, what about you? If we decide on something other than #1, I'll
> make the change before sending v1 out.

Since I do not believe 1) is an option, MADV_ENABLE_HGM might be the way
to go.  Any thoughts about MADV_ENABLE_HGM?  I'm thinking:
- Make it have the same restrictions as other madvise hugetlb calls,
  . addr must be huge page aligned
  . length is rounded down to a multiple of huge page size
- We split the vma as required
- Flags carrying HGM state reside in the hugetlb_shared_vma_data struct
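
Roughly, usage could look like the sketch below (MADV_ENABLE_HGM is just
the name being kicked around here, and hugetlbfs_fd/len are placeholders):

/* Hypothetical: map a hugetlbfs file and opt the vma into HGM. */
void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		  hugetlbfs_fd, 0);

if (addr == MAP_FAILED)
	err(1, "mmap");
/*
 * addr is huge page aligned because it came from mmap() of a hugetlbfs
 * file; a misaligned addr would be rejected, and len would be rounded
 * down to a multiple of the huge page size.
 */
if (madvise(addr, len, MADV_ENABLE_HGM))
	err(1, "madvise(MADV_ENABLE_HGM)");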

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-22  0:38                   ` Mike Kravetz
@ 2022-12-22  1:24                     ` James Houghton
  2022-12-22 14:30                       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-22  1:24 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Peter Xu, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

> > So considering two API choices:
> >
> > 1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for
> > UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for
> > UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start
> > allowing high-granularity choices (not sure if this is bad; we started
> > allowing it for HugeTLB recently with no other API change, AFAIA).
>
> I don't think we can just start allowing HGM for MADV_DONTNEED without
> some type of user interaction/request.  Otherwise, a user that passes
> in non-hugetlb page size requests may get unexpected results.  And, one
> of the threads about MADV_DONTNEED points out a valid use case where
> the caller may not know whether the mapping is hugetlb and is likely to
> pass in non-hugetlb page size requests.
>
> > 2. MADV_ENABLE_HGM or something similar. The changes to
> > UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically,
> > provided they are implemented.
> >
> > I don't mind one way or the other. Peter, I assume you prefer #2.
> > Mike, what about you? If we decide on something other than #1, I'll
> > make the change before sending v1 out.
>
> Since I do not believe 1) is an option, MADV_ENABLE_HGM might be the way
> to go.  Any thoughts about MADV_ENABLE_HGM?  I'm thinking:
> - Make it have the same restrictions as other madvise hugetlb calls,
>   . addr must be huge page aligned
>   . length is rounded down to a multiple of huge page size
> - We split the vma as required
I agree with these.
> - Flags carrying HGM state reside in the hugetlb_shared_vma_data struct
I actually changed this in v1 to storing HGM state as a VMA flag to
avoid problems with splitting VMAs (like, when we split a VMA, it's
possible the VMA data/lock struct doesn't get allocated). It seems
better to me; I can change it back if you disagree.

Not sure what the best name for this flag is either. MADV_ENABLE_HGM
sounds ok. MADV_HUGETLB_HGM or MADV_HUGETLB_SMALL_PAGES could work
too. No need to figure it out now.

Thanks Mike and Peter :) I'll make this change for v1 and send it out
sometime soon.

- James

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-22  1:24                     ` James Houghton
@ 2022-12-22 14:30                       ` Peter Xu
  2022-12-27 17:02                         ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-12-22 14:30 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Wed, Dec 21, 2022 at 08:24:45PM -0500, James Houghton wrote:
> Not sure what the best name for this flag is either. MADV_ENABLE_HGM
> sounds ok. MADV_HUGETLB_HGM or MADV_HUGETLB_SMALL_PAGES could work
> too. No need to figure it out now.

One more option to consider is MADV_SPLIT (hopefully to be more generic).

We already decided to reuse thp's MADV_COLLAPSE; we can also introduce
MADV_SPLIT and leave the thp side for later if it turns out to be
helpful (I remember we used to discuss this for thp split).

For hugetlb, one MADV_SPLIT should enable the hgm advise bit on the vma
forever.
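
To sketch the lifecycle I have in mind (MADV_SPLIT is only the proposed
name here; addr/len/uffd/fault_addr/page_size are placeholders):

/* Once at setup: mark the hugetlb vma as HGM-capable. */
madvise(addr, len, MADV_SPLIT);

/* During postcopy: install individual base pages as they arrive. */
struct uffdio_continue cont = {
	.range = {
		.start = fault_addr & ~(page_size - 1),
		.len = page_size,
	},
};
ioctl(uffd, UFFDIO_CONTINUE, &cont);

/* After postcopy completes: collapse back to huge mappings. */
madvise(addr, len, MADV_COLLAPSE);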

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 35/47] userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM
  2022-10-21 16:36 ` [RFC PATCH v2 35/47] userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM James Houghton
@ 2022-12-22 21:47   ` Peter Xu
  2022-12-27 16:39     ` James Houghton
  0 siblings, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-12-22 21:47 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2022 at 04:36:51PM +0000, James Houghton wrote:
> @@ -1990,6 +1990,17 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
>  		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
>  #ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
>  	uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
> +#else
> +
> +	ret = -EINVAL;
> +	if ((uffdio_api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM) &&
> +	    !(uffdio_api.features & UFFD_FEATURE_EXACT_ADDRESS))

This check needs to be done on "features" or "ctx_features", rather than
"uffdio_api.features".  The latter is only what we report back to the user.

> +		/*
> +		 * UFFD_FEATURE_MINOR_HUGETLBFS_HGM is mostly
> +		 * useless without UFFD_FEATURE_EXACT_ADDRESS,
> +		 * so require userspace to provide both.
> +		 */
> +		goto err_out;
>  #endif  /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>  #endif  /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
>  
> -- 
> 2.38.0.135.g90850a2211-goog
> 
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
  2022-10-21 16:36 ` [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
  2022-11-17 16:58   ` Peter Xu
@ 2022-12-23 18:38   ` Peter Xu
  2022-12-27 16:38     ` James Houghton
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Xu @ 2022-12-23 18:38 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

James,

On Fri, Oct 21, 2022 at 04:36:50PM +0000, James Houghton wrote:
> +	bool use_hgm = uffd_ctx_has_hgm(&dst_vma->vm_userfaultfd_ctx) &&
> +		mode == MCOPY_ATOMIC_CONTINUE;

Do you think in your new version use_hgm can work even for MISSING by
default?

I had a feeling that the major components are ready for that anyway.  Then
no matter how HGM is enabled (assuming it'll switch to MADV, or even that
one can just register with MISSING+MINOR and enable the uffd HGM feature),
an existing MISSING-only app could easily switch to HGM support if it's
using huge pages.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
  2022-12-23 18:38   ` Peter Xu
@ 2022-12-27 16:38     ` James Houghton
  2023-01-03 17:09       ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-27 16:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 23, 2022 at 1:38 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,
>
> On Fri, Oct 21, 2022 at 04:36:50PM +0000, James Houghton wrote:
> > +     bool use_hgm = uffd_ctx_has_hgm(&dst_vma->vm_userfaultfd_ctx) &&
> > +             mode == MCOPY_ATOMIC_CONTINUE;
>
> Do you think in your new version use_hgm can work even for MISSING by
> default?

I don't think so -- UFFDIO_COPY will allocate a hugepage, so I'm not
sure if it makes sense to allow it at high-granularity. If UFFDIO_COPY
didn't allocate a new page, then it could make sense (maybe we'd need
a new ioctl or new UFFDIO_COPY mode?). I think it makes most sense to
add this with another series.

Thanks,
- James

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 35/47] userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM
  2022-12-22 21:47   ` Peter Xu
@ 2022-12-27 16:39     ` James Houghton
  0 siblings, 0 replies; 122+ messages in thread
From: James Houghton @ 2022-12-27 16:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Thu, Dec 22, 2022 at 4:47 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:51PM +0000, James Houghton wrote:
> > @@ -1990,6 +1990,17 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
> >               ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
> >  #ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> >       uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
> > +#else
> > +
> > +     ret = -EINVAL;
> > +     if ((uffdio_api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM) &&
> > +         !(uffdio_api.features & UFFD_FEATURE_EXACT_ADDRESS))
>
> This check needs to be done on "features" or "ctx_features", rather than
> "uffdio_api.features".  The latter is only what we report back to the user.

Ack, thanks Peter. I'm going to drop this patch given the API change
(switching to MADV_SPLIT).

- James

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-22 14:30                       ` Peter Xu
@ 2022-12-27 17:02                         ` James Houghton
  2023-01-03 17:06                           ` Peter Xu
  0 siblings, 1 reply; 122+ messages in thread
From: James Houghton @ 2022-12-27 17:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Thu, Dec 22, 2022 at 9:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Dec 21, 2022 at 08:24:45PM -0500, James Houghton wrote:
> > Not sure what the best name for this flag is either. MADV_ENABLE_HGM
> > sounds ok. MADV_HUGETLB_HGM or MADV_HUGETLB_SMALL_PAGES could work
> > too. No need to figure it out now.
>
> One more option to consider is MADV_SPLIT (hopefully to be more generic).
>
> We already decided to reuse thp's MADV_COLLAPSE; we can also introduce
> MADV_SPLIT and leave the thp side for later if it turns out to be
> helpful (I remember we used to discuss this for thp split).
>
> For hugetlb, one MADV_SPLIT should enable the hgm advise bit on the vma
> forever.

MADV_SPLIT sounds okay to me -- we'll see how it turns out when I send
v1. However, there's an interesting API question regarding what
address userfaultfd provides. We previously required
UFFD_FEATURE_EXACT_ADDRESS when you specified
UFFD_FEATURE_MINOR_HUGETLBFS_HGM so that there was no ambiguity. Now,
we can do:

1. When MADV_SPLIT is given, userfaultfd will now round addresses to
PAGE_SIZE instead of huge_page_size(hstate), and
UFFD_FEATURE_EXACT_ADDRESS is not needed.
2. Don't change anything. A user must know to provide
UFFD_FEATURE_EXACT_ADDRESS to get the real address, otherwise they get
an (unusable) hugepage-aligned address.

I think #1 sounds fine; let me know if you disagree.
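
For concreteness, under #1 the fault-handling side would look roughly
like the sketch below (install_base_page() is just a placeholder for a
PAGE_SIZE UFFDIO_CONTINUE, and the PAGE_SIZE alignment is the proposed
behavior, not something the kernel does today):

struct uffd_msg msg;

if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
	err(1, "read");
if (msg.event == UFFD_EVENT_PAGEFAULT) {
	/*
	 * Already PAGE_SIZE-aligned under option 1, so no extra
	 * UFFD_FEATURE_EXACT_ADDRESS handling is needed.
	 */
	unsigned long addr = msg.arg.pagefault.address;

	install_base_page(addr);
}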

Thanks!
- James

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
  2022-12-27 17:02                         ` James Houghton
@ 2023-01-03 17:06                           ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2023-01-03 17:06 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Tue, Dec 27, 2022 at 12:02:52PM -0500, James Houghton wrote:
> On Thu, Dec 22, 2022 at 9:30 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 08:24:45PM -0500, James Houghton wrote:
> > > Not sure what the best name for this flag is either. MADV_ENABLE_HGM
> > > sounds ok. MADV_HUGETLB_HGM or MADV_HUGETLB_SMALL_PAGES could work
> > > too. No need to figure it out now.
> >
> > One more option to consider is MADV_SPLIT (hopefully to be more generic).
> >
> > We already decided to reuse thp's MADV_COLLAPSE; we can also introduce
> > MADV_SPLIT and leave the thp side for later if it turns out to be
> > helpful (I remember we used to discuss this for thp split).
> >
> > For hugetlb, one MADV_SPLIT should enable the hgm advise bit on the vma
> > forever.
> 
> MADV_SPLIT sounds okay to me -- we'll see how it turns out when I send
> v1. However, there's an interesting API question regarding what
> address userfaultfd provides. We previously required
> UFFD_FEATURE_EXACT_ADDRESS when you specified
> UFFD_FEATURE_MINOR_HUGETLBFS_HGM so that there was no ambiguity. Now,
> we can do:
> 
> 1. When MADV_SPLIT is given, userfaultfd will now round addresses to
> PAGE_SIZE instead of huge_page_size(hstate), and
> UFFD_FEATURE_EXACT_ADDRESS is not needed.
> 2. Don't change anything. A user must know to provide
> UFFD_FEATURE_EXACT_ADDRESS to get the real address, otherwise they get
> an (unusable) hugepage-aligned address.
> 
> I think #1 sounds fine; let me know if you disagree.

Sounds good to me, thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
  2022-12-27 16:38     ` James Houghton
@ 2023-01-03 17:09       ` Peter Xu
  0 siblings, 0 replies; 122+ messages in thread
From: Peter Xu @ 2023-01-03 17:09 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, David Hildenbrand, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Tue, Dec 27, 2022 at 11:38:31AM -0500, James Houghton wrote:
> On Fri, Dec 23, 2022 at 1:38 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > James,
> >
> > On Fri, Oct 21, 2022 at 04:36:50PM +0000, James Houghton wrote:
> > > +     bool use_hgm = uffd_ctx_has_hgm(&dst_vma->vm_userfaultfd_ctx) &&
> > > +             mode == MCOPY_ATOMIC_CONTINUE;
> >
> > Do you think in your new version use_hgm can work even for MISSING by
> > default?
> 
> I don't think so -- UFFDIO_COPY will allocate a hugepage, so I'm not
> sure if it makes sense to allow it at high-granularity. If UFFDIO_COPY
> didn't allocate a new page, then it could make sense (maybe we'd need
> a new ioctl or new UFFDIO_COPY mode?). I think it makes most sense to
> add this with another series.

I forgot again how the page cache is managed for the split pages,
sorry.  Yeah, let's stick with minor mode for now.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2022-10-21 16:36 ` [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
  2022-11-16 22:02   ` Peter Xu
  2022-12-14  0:47   ` Mike Kravetz
@ 2023-01-05  0:57   ` Jane Chu
  2023-01-05  1:12     ` Jane Chu
  2023-01-05  1:23     ` James Houghton
  2 siblings, 2 replies; 122+ messages in thread
From: Jane Chu @ 2023-01-05  0:57 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

> + * @stop_at_none determines what we do when we encounter an empty PTE. If true,
> + * we return that PTE. If false and @sz is less than the current PTE's size,
> + * we make that PTE point to the next level down, going until @sz is the same
> + * as our current PTE.
[..]
> +int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
> +		     struct hugetlb_pte *hpte, unsigned long addr,
> +		     unsigned long sz, bool stop_at_none)
> +{
[..]
> +	while (hugetlb_pte_size(hpte) > sz && !ret) {
> +		pte = huge_ptep_get(hpte->ptep);
> +		if (!pte_present(pte)) {
> +			if (stop_at_none)
> +				return 0;
> +			if (unlikely(!huge_pte_none(pte)))
> +				return -EEXIST;

If 'stop_at_none' means settling down on the just encountered empty PTE,
should the above two "if" clauses switch order?  I thought Peter has
raised this question too, but I'm not seeing a response.

Regards,
-jane


> +		} else if (hugetlb_pte_present_leaf(hpte, pte))
> +			return 0;
> +		ret = hugetlb_walk_step(mm, hpte, addr, sz);
> +	}
> +
> +	return ret;
> +}
> +


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2023-01-05  0:57   ` Jane Chu
@ 2023-01-05  1:12     ` Jane Chu
  2023-01-05  1:23     ` James Houghton
  1 sibling, 0 replies; 122+ messages in thread
From: Jane Chu @ 2023-01-05  1:12 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel



On 1/4/2023 4:57 PM, Jane Chu wrote:
>> + * @stop_at_none determines what we do when we encounter an empty PTE. If true,
>> + * we return that PTE. If false and @sz is less than the current PTE's size,
>> + * we make that PTE point to the next level down, going until @sz is the same
>> + * as our current PTE.
> [..]
>> +int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
>> +             struct hugetlb_pte *hpte, unsigned long addr,
>> +             unsigned long sz, bool stop_at_none)
>> +{

Also here below, the way 'stop_at_none' is used when HGM isn't enabled
is puzzling.  Could you elaborate please?

+	if (!hugetlb_hgm_enabled(vma)) {
+		if (stop_at_none)
+			return 0;
+		return sz == huge_page_size(hstate_vma(vma)) ? 0 : -EINVAL;
+	}

> [..]
>> +    while (hugetlb_pte_size(hpte) > sz && !ret) {
>> +        pte = huge_ptep_get(hpte->ptep);
>> +        if (!pte_present(pte)) {
>> +            if (stop_at_none)
>> +                return 0;
>> +            if (unlikely(!huge_pte_none(pte)))
>> +                return -EEXIST;
> 
> If 'stop_at_none' means settling down on the just encountered empty PTE,
> should the above two "if" clauses switch order?  I thought Peter has
> raised this question too, but I'm not seeing a response.
> 
> Regards,
> -jane
> 
> 
>> +        } else if (hugetlb_pte_present_leaf(hpte, pte))
>> +            return 0;
>> +        ret = hugetlb_walk_step(mm, hpte, addr, sz);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
> 
> 

thanks,
-jane

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2023-01-05  0:57   ` Jane Chu
  2023-01-05  1:12     ` Jane Chu
@ 2023-01-05  1:23     ` James Houghton
  1 sibling, 0 replies; 122+ messages in thread
From: James Houghton @ 2023-01-05  1:23 UTC (permalink / raw)
  To: Jane Chu
  Cc: Mike Kravetz, Muchun Song, Peter Xu, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Andrew Morton, linux-mm, linux-kernel

On Thu, Jan 5, 2023 at 12:58 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> > + * @stop_at_none determines what we do when we encounter an empty PTE. If true,
> > + * we return that PTE. If false and @sz is less than the current PTE's size,
> > + * we make that PTE point to the next level down, going until @sz is the same
> > + * as our current PTE.
> [..]
> > +int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
> > +                  struct hugetlb_pte *hpte, unsigned long addr,
> > +                  unsigned long sz, bool stop_at_none)
> > +{
> [..]
> > +     while (hugetlb_pte_size(hpte) > sz && !ret) {
> > +             pte = huge_ptep_get(hpte->ptep);
> > +             if (!pte_present(pte)) {
> > +                     if (stop_at_none)
> > +                             return 0;
> > +                     if (unlikely(!huge_pte_none(pte)))
> > +                             return -EEXIST;
>
> If 'stop_at_none' means settling down on the just encountered empty PTE,
> should the above two "if" clauses switch order?  I thought Peter has
> raised this question too, but I'm not seeing a response.

A better name for "stop_at_none" would be "dont_allocate"; it will be
changed in the next version. The idea is that, with "stop_at_none" set,
we simply do a walk and the caller deals with whatever it finds. If we
can't continue the walk for any reason, we just return 0. So in this
case, if we land on a non-present, non-none PTE, we can't continue the
walk, so we just return 0.

Another way to justify this order: we want to ensure that calls to
this function with stop_at_none=1 and sz=PAGE_SIZE will never fail,
and that gives us the order that you see. (This requirement is
documented in the comment above the definition of hugetlb_hgm_walk().
This guarantee makes it easier to write code that uses HGM walks.)
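
To illustrate with a rough caller sketch (hugetlb_hgm_walk(),
hugetlb_pte_present_leaf(), huge_ptep_get(), and the hugetlb_pte fields
are names from this series; assume hpte has already been initialized to
the hstate-level PTE for addr):

struct hugetlb_pte hpte;	/* assume this was set up for (vma, addr) */
pte_t pte;

/*
 * Lookup-only walk: with stop_at_none=true this never allocates, and
 * with sz=PAGE_SIZE it is documented to never fail.
 */
WARN_ON_ONCE(hugetlb_hgm_walk(mm, vma, &hpte, addr, PAGE_SIZE, true));

pte = huge_ptep_get(hpte.ptep);
if (hugetlb_pte_present_leaf(&hpte, pte)) {
	/* Mapped at whatever granularity the walk stopped at. */
} else {
	/* None or non-present PTE: the caller decides what to do. */
}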

> Also here below, the way 'stop_at_none' is used when HGM isn't enabled
> is puzzling.  Could you elaborate please?
>
> > +       if (!hugetlb_hgm_enabled(vma)) {
> > +               if (stop_at_none)
> > +                       return 0;
> > +               return sz == huge_page_size(hstate_vma(vma)) ? 0 : -EINVAL;
> > +       }

This is for the same reason; if "stop_at_none" is provided, we need to
guarantee that this function won't fail. If "stop_at_none" is false
and sz != huge_page_size(), then the caller is attempting to use HGM
without having enabled it, hence -EINVAL.

Both of these bits will be cleaned up with the next version of this series. :)

Thanks!

- James

^ permalink raw reply	[flat|nested] 122+ messages in thread

end of thread, other threads:[~2023-01-05  1:23 UTC | newest]

Thread overview: 122+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-21 16:36 [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 01/47] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
2022-11-16 16:30   ` Peter Xu
2022-11-21 18:33     ` James Houghton
2022-12-08 22:55       ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 02/47] hugetlb: remove mk_huge_pte; it is unused James Houghton
2022-11-16 16:35   ` Peter Xu
2022-12-07 23:13   ` Mina Almasry
2022-12-08 23:42   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 03/47] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
2022-11-16 16:36   ` Peter Xu
2022-12-07 23:16   ` Mina Almasry
2022-12-09  0:10   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 04/47] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
2022-11-16 16:50   ` Peter Xu
2022-12-09  0:22   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 05/47] hugetlb: make hugetlb_vma_lock_alloc return its failure reason James Houghton
2022-11-16 17:08   ` Peter Xu
2022-11-21 18:11     ` James Houghton
2022-12-07 23:33   ` Mina Almasry
2022-12-09 22:36   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas James Houghton
2022-11-30 21:01   ` Peter Xu
2022-11-30 23:29     ` James Houghton
2022-12-09 22:48     ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 07/47] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
2022-12-09 22:52   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 08/47] hugetlb: add HGM enablement functions James Houghton
2022-11-16 17:19   ` Peter Xu
2022-12-08  0:26   ` Mina Almasry
2022-12-09 15:41     ` James Houghton
2022-12-13  0:13   ` Mike Kravetz
2022-12-13 15:49     ` James Houghton
2022-12-15 17:51       ` Mike Kravetz
2022-12-15 18:08         ` James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 09/47] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
2022-12-08  0:30   ` Mina Almasry
2022-12-13  0:25   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
2022-11-16 22:17   ` Peter Xu
2022-11-17  1:00     ` James Houghton
2022-11-17 16:27       ` Peter Xu
2022-12-08  0:46   ` Mina Almasry
2022-12-09 16:02     ` James Houghton
2022-12-13 18:44       ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 11/47] hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc James Houghton
2022-12-13 19:32   ` Mike Kravetz
2022-12-13 20:18     ` James Houghton
2022-12-14  0:04       ` James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 12/47] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
2022-11-16 22:02   ` Peter Xu
2022-11-17  1:39     ` James Houghton
2022-12-14  0:47   ` Mike Kravetz
2023-01-05  0:57   ` Jane Chu
2023-01-05  1:12     ` Jane Chu
2023-01-05  1:23     ` James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 13/47] hugetlb: add make_huge_pte_with_shift James Houghton
2022-12-14  1:08   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 14/47] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
2022-12-14 22:17   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 15/47] hugetlbfs: for unmapping, treat HGM-mapped pages as potentially mapped James Houghton
2022-12-14 23:37   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 16/47] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
2022-12-15  0:28   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 17/47] hugetlb: make hugetlb_change_protection compatible with HGM James Houghton
2022-12-15 18:15   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 18/47] hugetlb: enlighten follow_hugetlb_page to support HGM James Houghton
2022-12-15 19:29   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 19/47] hugetlb: make hugetlb_follow_page_mask HGM-enabled James Houghton
2022-12-16  0:25   ` Mike Kravetz
2022-10-21 16:36 ` [RFC PATCH v2 20/47] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 21/47] mm: rmap: provide pte_order in page_vma_mapped_walk James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 22/47] mm: rmap: make page_vma_mapped_walk callers use pte_order James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 23/47] rmap: update hugetlb lock comment for HGM James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 24/47] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
2022-12-15 17:49   ` James Houghton
2022-12-15 18:45     ` Peter Xu
2022-10-21 16:36 ` [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
2022-11-30 21:32   ` Peter Xu
2022-11-30 23:18     ` James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 26/47] hugetlb: make move_hugetlb_page_tables compatible with HGM James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 27/47] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 28/47] rmap: in try_to_{migrate,unmap}_one, check head page for page flags James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 29/47] hugetlb: add high-granularity migration support James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 30/47] hugetlb: add high-granularity check for hwpoison in fault path James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 31/47] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 32/47] hugetlb: add for_each_hgm_shift James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM James Houghton
2022-11-16 22:28   ` Peter Xu
2022-11-16 23:30     ` James Houghton
2022-12-21 19:23       ` Peter Xu
2022-12-21 20:21         ` James Houghton
2022-12-21 21:39           ` Mike Kravetz
2022-12-21 22:10             ` Peter Xu
2022-12-21 22:31               ` Mike Kravetz
2022-12-22  0:02                 ` James Houghton
2022-12-22  0:38                   ` Mike Kravetz
2022-12-22  1:24                     ` James Houghton
2022-12-22 14:30                       ` Peter Xu
2022-12-27 17:02                         ` James Houghton
2023-01-03 17:06                           ` Peter Xu
2022-10-21 16:36 ` [RFC PATCH v2 34/47] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
2022-11-17 16:58   ` Peter Xu
2022-12-23 18:38   ` Peter Xu
2022-12-27 16:38     ` James Houghton
2023-01-03 17:09       ` Peter Xu
2022-10-21 16:36 ` [RFC PATCH v2 35/47] userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM James Houghton
2022-12-22 21:47   ` Peter Xu
2022-12-27 16:39     ` James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 36/47] hugetlb: add MADV_COLLAPSE for hugetlb James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 37/47] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
2022-11-16 20:16   ` Peter Xu
2022-10-21 16:36 ` [RFC PATCH v2 38/47] hugetlb: replace make_huge_pte with make_huge_pte_with_shift James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 39/47] mm: smaps: add stats for HugeTLB mapping size James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 40/47] hugetlb: x86: enable high-granularity mapping James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 41/47] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 42/47] docs: proc: include information about HugeTLB HGM James Houghton
2022-10-21 16:36 ` [RFC PATCH v2 43/47] selftests/vm: add HugeTLB HGM to userfaultfd selftest James Houghton
2022-10-21 16:37 ` [RFC PATCH v2 44/47] selftests/kvm: add HugeTLB HGM to KVM demand paging selftest James Houghton
2022-10-21 16:37 ` [RFC PATCH v2 45/47] selftests/vm: add anon and shared hugetlb to migration test James Houghton
2022-10-21 16:37 ` [RFC PATCH v2 46/47] selftests/vm: add hugetlb HGM test to migration selftest James Houghton
2022-10-21 16:37 ` [RFC PATCH v2 47/47] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
