* [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
@ 2023-02-18  0:27 James Houghton
  2023-02-18  0:27 ` [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
                   ` (46 more replies)
  0 siblings, 47 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This series introduces the concept of HugeTLB high-granularity mapping
(HGM): it teaches HugeTLB how to map HugeTLB pages at high granularity,
similar to how THPs can be PTE-mapped.

Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
architectures and (some) support for MAP_PRIVATE will come later.

This series is based on latest mm-unstable (ccd6a73daba9).

Notable changes with this series
================================

 - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
   mapcounting for non-anon hugetlb.
 - The mapcounting scheme uses subpages' mapcounts for high-granularity
   mappings, but it does not use subpages_mapcount(). This scheme
   prevents the HugeTLB VMEMMAP optimization from being used, so it
   will be improved in a later series.
 - page_add_file_rmap and page_remove_rmap are updated so they can be
   used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
 - MADV_SPLIT has been added to enable the userspace API changes that
   HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
   changes in the future). MADV_SPLIT does NOT force all the mappings to
   be PAGE_SIZE.
 - MADV_COLLAPSE is expanded to include HugeTLB mappings.

Old versions:
v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/

Changelog:
v1 -> v2 (thanks Peter for all your suggestions!):
- Changed mapcount to be more THP-like, and made HGM incompatible with
  HVO.
- HGM is now disabled by default to leave HVO enabled by default.
- Added refcount overflow check.
- Removed cond_resched() in hugetlb_collapse().
- Take mmap_lock for writing in hugetlb_collapse().
- Fixed high-granularity UFFDIO_CONTINUE on a UFFDIO_WRITEPROTECTed page (+tests)
- Fixed vaddr math in follow_hugetlb_page.
- Fixed Kconfig to limit HGM to x86_64.
- Fixed some compile errors.
RFC v2 -> v1:
- Userspace API to enable HGM changed from
  UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT.
- Picked up Acked-bys and Reviewed-bys. Thanks Mike, Peter, and Mina!
- Rebased onto latest mm-unstable, notably picking up Peter's
  HugeTLB walk synchronization fix [1].
- Changed MADV_COLLAPSE to take i_mmap_rwsem for writing to make its
  synchronization the same as huge_pmd_unshare, so anywhere where
  hugetlb_pte_walk() is safe, HGM walks are also safe.
- hugetlb_hgm_walk API has changed -- should reduce complexity where
  callers wish to do HGM walks.
- Always round addresses properly before populating hugetlb_ptes (always
  pick up first PTE in a contiguous bunch).
- Added a VMA flag for HGM: VM_HUGETLB_HGM; the hugetlb_shared_vma_data
  struct has been removed.
- Make hugetlb_pte.ptl always hold the PTL to use.
- Added a requirement that overlapping contiguous and non-contiguous
  PTEs must use the same PTL.
- Some things have been slightly renamed for clarity, and I've added
  lots of comments that I said I would.
- Added a test for fork() + uffd-wp to cover
  copy_hugetlb_page_range().

Patch breakdown:
Patches 1-4:	Cleanup.
Patch   5:	rmap preliminary changes.
Patches 6-9:	Add HGM config option, VM flag, MADV_SPLIT.
Patches 10-15:	Create hugetlb_pte and implement HGM basics.
Patches 16-30:	Make existing routines compatible with HGM.
Patches 31-33:	Extend userfaultfd to support high-granularity CONTINUEs.
Patch   34:	Add HugeTLB HGM support to MADV_COLLAPSE.
Patch   35:	Add refcount overflow check.
Patches 36-39:	Cleanup, add HGM stats, and enable HGM for x86_64.
Patches 40-46:	Documentation and selftests.

Motivation
==========

Being able to map HugeTLB pages at PAGE_SIZE has important use cases in
post-copy live migration and memory poisoning.

- Live Migration (userfaultfd)
For post-copy live migration using userfaultfd, we currently have to
install an entire hugepage before we can allow a guest to access that
page. This is because, right now, a hugepage is either mapped in its
entirety or not at all, so the guest can access either the whole
hugepage or none of it. This makes post-copy live migration for 1G
HugeTLB-backed VMs completely infeasible.

With high-granularity mapping, we can map PAGE_SIZE pieces of a
hugepage, thereby allowing the guest to access only PAGE_SIZE chunks,
and getting page faults on the rest (and triggering another
demand-fetch). This gives userspace the flexibility to install PAGE_SIZE
chunks of memory into a hugepage, making migration of 1G-backed VMs
perfectly feasible, and it vastly reduces the vCPU stall time during
post-copy for 2M-backed VMs.

At Google, for a 48 vCPU VM in post-copy, we can expect these approximate
per-page median fetch latencies:
     4K: <100us
     2M: >10ms
Being able to unpause a vCPU 100x quicker is helpful for guest stability,
and being able to use 1G pages at all can significantly improve
steady-state guest performance.

After fully copying a hugepage over the network, we will want to
collapse the mapping down to what it would normally be (e.g., one PUD
for a 1G page). Rather than having the kernel do this automatically,
we leave it up to userspace to tell us to collapse a range (via
MADV_COLLAPSE).

- Memory Failure
When a memory error is found within a HugeTLB page, it would be ideal
if we could unmap only the PAGE_SIZE section that contained the error.
This is what THPs are able to do. Using high-granularity mapping, we
could do this, but this isn't tackled in this patch series.

Userspace API
=============

This series introduces the first application of high-granularity
mapping: high-granularity userfaultfd post-copy for HugeTLB.

The userspace API for this consists of:
- MADV_SPLIT: to enable the following userfaultfd API changes.
  1. read(uffd): addresses are rounded to PAGE_SIZE instead of the
     hugepage size.
  2. UFFDIO_CONTINUE for HugeTLB VMAs is now allowed in
     PAGE_SIZE-aligned chunks.
- MADV_COLLAPSE is now available for MAP_SHARED HugeTLB VMAs. It is used
  to collapse the page table mappings, but it does not undo the API
  changes that MADV_SPLIT provides.
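
As a rough sketch of the intended flow (illustrative only: error
handling is omitted, and the region described by addr/len is assumed to
be a MAP_SHARED HugeTLB mapping already registered with the userfaultfd
'uffd' in minor mode):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

#ifndef MADV_SPLIT
#define MADV_SPLIT	26	/* added by this series (mman-common.h) */
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE	25
#endif

static void demo_hgm_uffd(int uffd, char *addr, size_t len, char *fault_addr)
{
	/* Enable the PAGE_SIZE userfaultfd behavior described above. */
	madvise(addr, len, MADV_SPLIT);

	/* Resolve one fault by installing a single 4K piece. */
	struct uffdio_continue cont = {
		.range = {
			.start = (unsigned long)fault_addr,
			.len = 4096,
		},
	};
	ioctl(uffd, UFFDIO_CONTINUE, &cont);

	/*
	 * Later, once every piece of the hugepage has been installed,
	 * collapse the page table mapping back to hugepage size.
	 */
	madvise(addr, len, MADV_COLLAPSE);
}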

HugeTLB changes
===============

- hugetlb_pte
`hugetlb_pte` is used to keep track of "HugeTLB" PTEs, which are PTEs at
any level and of any size. page_vma_mapped_walk and pagewalk have both
been changed to provide `hugetlb_pte`s to callers so that they can get
size+level information that, before, came from the hstate.
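
For reference, the struct as it is introduced later in this series
(patch 11) looks like:

struct hugetlb_pte {
	pte_t *ptep;
	unsigned int shift;
	enum hugetlb_level level;
	spinlock_t *ptl;
};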

- Mapcount
Previously, file-backed HugeTLB pages had their mapcount incremented by
page_dup_file_rmap. This is replaced with page_add_file_rmap, which is
wrapped by hugetlb_add_file_rmap to implement new mapcount behavior.

HugeTLB pages mapped at hugepage-granularity still have their
compound_mapcount incremented by 1, but when a page is mapped at
high granularity, we increase the subpages' mapcounts for all the
subpages that get mapped. For example, for a 1G page, if a 2M piece of
it is mapped with a PMD, all of the 4K subpages within that 2M piece
have their mapcounts incremented.

This behavior means that HGM is incompatible with the HugeTLB Vmemmap
Optimization (HVO). HGM is disabled by default, and if it gets enabled,
HVO will be disabled. Also, collapsing to the hugepage size requires us
to decrement the subpage mapcounts for all of the subpages we had
mapped. For a 1G page, this can get really slow. This thread[3] has some
discussion.
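
To make the bookkeeping cost concrete, a quick back-of-the-envelope
calculation (plain userspace C, purely illustrative):

#include <stdio.h>

int main(void)
{
	long base = 4096L;			/* 4K base page */
	long pmd = 2L * 1024 * 1024;		/* 2M (PMD-sized) mapping */
	long pud = 1024L * 1024 * 1024;		/* 1G hugepage */

	/*
	 * Mapping a 2M piece of a 1G page with a PMD bumps the mapcount
	 * of every 4K subpage under that PMD:
	 */
	printf("increments per 2M PMD mapping: %ld\n", pmd / base);	/* 512 */

	/*
	 * Collapsing a fully 4K-mapped 1G page has to undo one increment
	 * per 4K subpage:
	 */
	printf("decrements to collapse 1G: %ld\n", pud / base);	/* 262144 */
	return 0;
}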

- Synchronization
Collapsing high-granularity HugeTLB mappings requires taking the
mmap_lock for writing.

Supporting arm64 & contiguous PTEs
==================================

As implemented, HGM does not yet fully support contiguous PTEs. To do
this, the HugeTLB API that architectures implement will need to change.
For example, set_huge_pte_at merely takes a `pte_t *`; there is no
information about the "size" of that PTE (for example, whether we need
to overwrite multiple contiguous PTEs).

To handle this, in a follow-up series, set_huge_pte_at and many other
similar functions will be replaced with variants that take
`hugetlb_pte`s. See [2] for how this may be implemented, plus a full HGM
implementation for arm64.
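
Purely as a hypothetical sketch (hugetlb_set_pte_at is a made-up name
here; the real interface will be defined in the follow-up series), the
idea is that a variant taking a hugetlb_pte carries the mapping size, so
a contiguous-PTE architecture knows how many entries to write:

/* Hypothetical sketch only -- not from this series or from [2]. */
static void hugetlb_set_pte_at(struct mm_struct *mm, unsigned long addr,
			       const struct hugetlb_pte *hpte, pte_t pte)
{
	/*
	 * hugetlb_pte_size(hpte) tells the architecture how much this
	 * entry maps, e.g. 16 contiguous 4K PTEs on arm64 for a 64K
	 * contiguous mapping. The generic version can keep writing a
	 * single entry:
	 */
	set_huge_pte_at(mm, addr, hpte->ptep, pte);
}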

Supporting architectures beyond arm64
=====================================

Each architecture must audit its HugeTLB implementation to make sure
that it supports HGM. For example, architectures that implement
arch_make_huge_pte need to ensure that a `shift` of `PAGE_SHIFT` is
acceptable.
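
For instance, the generic arch_make_huge_pte unconditionally applies
pte_mkhuge today; a sketch of the kind of check that is needed (see the
"make default arch_make_huge_pte understand small mappings" patch for
the actual generic change) is:

static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
				       vm_flags_t flags)
{
	/* Don't mark a base-page-sized entry as a huge mapping. */
	return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
}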

Architectures must also audit code that might depend on HugeTLB always
having large mappings (i.e., check huge_page_size(), huge_page_shift(),
vma_kernel_pagesize(), and vma_mmu_pagesize() callers). For example, the
arm64 KVM MMU implementation thinks that all hugepages are mapped at
huge_page_size(), and thus builds the second-stage page table
accordingly. In an HGM world, this isn't true; it is corrected in [2].

[1]: https://lore.kernel.org/linux-mm/20221216155100.2043537-1-peterx@redhat.com/
[2]: https://github.com/48ca/linux/tree/hgmv1-dec19-2
[3]: https://lore.kernel.org/linux-mm/CADrL8HUSx6=K0QXQtTmv9ZJQmvhe6KEb+FiAviRfO3HjmRUeTw@mail.gmail.com/

James Houghton (46):
  hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  hugetlb: remove mk_huge_pte; it is unused
  hugetlb: remove redundant pte_mkhuge in migration path
  hugetlb: only adjust address ranges when VMAs want PMD sharing
  rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap
  hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  mm: add VM_HUGETLB_HGM VMA flag
  hugetlb: add HugeTLB HGM enablement helpers
  mm: add MADV_SPLIT to enable HugeTLB HGM
  hugetlb: make huge_pte_lockptr take an explicit shift argument
  hugetlb: add hugetlb_pte to track HugeTLB page table entries
  hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte
  hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  hugetlb: split PTE markers when doing HGM walks
  hugetlb: add make_huge_pte_with_shift
  hugetlb: make default arch_make_huge_pte understand small mappings
  hugetlbfs: do a full walk to check if vma maps a page
  hugetlb: add HGM support to __unmap_hugepage_range
  hugetlb: add HGM support to hugetlb_change_protection
  hugetlb: add HGM support to follow_hugetlb_page
  hugetlb: add HGM support to hugetlb_follow_page_mask
  hugetlb: add HGM support to copy_hugetlb_page_range
  hugetlb: add HGM support to move_hugetlb_page_tables
  hugetlb: add HGM support to hugetlb_fault and hugetlb_no_page
  hugetlb: use struct hugetlb_pte for walk_hugetlb_range
  mm: rmap: provide pte_order in page_vma_mapped_walk
  mm: rmap: update try_to_{migrate,unmap} to handle mapcount for HGM
  mm: rmap: in try_to_{migrate,unmap}, check head page for hugetlb page
    flags
  hugetlb: update page_vma_mapped to do high-granularity walks
  hugetlb: add high-granularity migration support
  hugetlb: sort hstates in hugetlb_init_hstates
  hugetlb: add for_each_hgm_shift
  hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
  hugetlb: add MADV_COLLAPSE for hugetlb
  hugetlb: add check to prevent refcount overflow via HGM
  hugetlb: remove huge_pte_lock and huge_pte_lockptr
  hugetlb: replace make_huge_pte with make_huge_pte_with_shift
  mm: smaps: add stats for HugeTLB mapping size
  hugetlb: x86: enable high-granularity mapping for x86_64
  docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM
    info
  docs: proc: include information about HugeTLB HGM
  selftests/mm: add HugeTLB HGM to userfaultfd selftest
  KVM: selftests: add HugeTLB HGM to KVM demand paging selftest
  selftests/mm: add anon and shared hugetlb to migration test
  selftests/mm: add hugetlb HGM test to migration selftest
  selftests/mm: add HGM UFFDIO_CONTINUE and hwpoison tests

 Documentation/admin-guide/mm/hugetlbpage.rst  |    4 +
 Documentation/admin-guide/mm/userfaultfd.rst  |    8 +-
 Documentation/filesystems/proc.rst            |   56 +-
 arch/alpha/include/uapi/asm/mman.h            |    2 +
 arch/mips/include/uapi/asm/mman.h             |    2 +
 arch/parisc/include/uapi/asm/mman.h           |    2 +
 arch/powerpc/mm/pgtable.c                     |    6 +-
 arch/s390/include/asm/hugetlb.h               |    5 -
 arch/s390/mm/gmap.c                           |   12 +-
 arch/x86/Kconfig                              |    1 +
 arch/xtensa/include/uapi/asm/mman.h           |    2 +
 fs/Kconfig                                    |   13 +
 fs/hugetlbfs/inode.c                          |   17 +-
 fs/proc/task_mmu.c                            |  190 ++-
 fs/userfaultfd.c                              |   14 +-
 include/asm-generic/hugetlb.h                 |    5 -
 include/asm-generic/tlb.h                     |    6 +-
 include/linux/huge_mm.h                       |   12 +-
 include/linux/hugetlb.h                       |  170 +-
 include/linux/mm.h                            |    7 +
 include/linux/pagewalk.h                      |   10 +-
 include/linux/rmap.h                          |    1 +
 include/linux/swapops.h                       |    8 +-
 include/trace/events/mmflags.h                |    7 +
 include/uapi/asm-generic/mman-common.h        |    2 +
 mm/damon/vaddr.c                              |   41 +-
 mm/debug_vm_pgtable.c                         |    2 +-
 mm/hmm.c                                      |   20 +-
 mm/hugetlb.c                                  | 1390 ++++++++++++++---
 mm/khugepaged.c                               |    4 +-
 mm/madvise.c                                  |   56 +-
 mm/memory-failure.c                           |   17 +-
 mm/mempolicy.c                                |   28 +-
 mm/migrate.c                                  |   21 +-
 mm/mincore.c                                  |   17 +-
 mm/mprotect.c                                 |   18 +-
 mm/page_vma_mapped.c                          |   60 +-
 mm/pagewalk.c                                 |   20 +-
 mm/rmap.c                                     |   85 +-
 mm/userfaultfd.c                              |   40 +-
 .../selftests/kvm/demand_paging_test.c        |    2 +-
 .../testing/selftests/kvm/include/test_util.h |    2 +
 .../selftests/kvm/include/userfaultfd_util.h  |    6 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    |    2 +-
 tools/testing/selftests/kvm/lib/test_util.c   |   14 +
 .../selftests/kvm/lib/userfaultfd_util.c      |   14 +-
 tools/testing/selftests/mm/Makefile           |    1 +
 tools/testing/selftests/mm/hugetlb-hgm.c      |  608 +++++++
 tools/testing/selftests/mm/migration.c        |  229 ++-
 tools/testing/selftests/mm/userfaultfd.c      |   84 +-
 50 files changed, 2841 insertions(+), 502 deletions(-)
 create mode 100644 tools/testing/selftests/mm/hugetlb-hgm.c

-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:41   ` Mina Almasry
  2023-02-18  0:27 ` [PATCH v2 02/46] hugetlb: remove mk_huge_pte; it is unused James Houghton
                   ` (45 subsequent siblings)
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

It would be bad if we actually set PageUptodate with UFFDIO_CONTINUE;
PageUptodate indicates that the page has been zeroed, and we don't want
to give a non-zeroed page to the user.

This change is being made now because UFFDIO_CONTINUEs on subpages
definitely shouldn't set this page flag on the head page.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 07abcb6eb203..792cb2e67ce5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6256,7 +6256,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * preceding stores to the page contents become visible before
 	 * the set_pte_at() write.
 	 */
-	__folio_mark_uptodate(folio);
+	if (!is_continue)
+		__folio_mark_uptodate(folio);
+	else if (!folio_test_uptodate(folio)) {
+		/*
+		 * This should never happen; HugeTLB pages are always Uptodate
+		 * as soon as they are allocated.
+		 */
+		ret = -EFAULT;
+		goto out_release_nounlock;
+	}
 
 	/* Add shared, newly allocated pages to the page cache. */
 	if (vm_shared && !is_continue) {
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 02/46] hugetlb: remove mk_huge_pte; it is unused
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
  2023-02-18  0:27 ` [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 03/46] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
                   ` (44 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

mk_huge_pte is unused and not necessary. pte_mkhuge is the appropriate
function to call to create a HugeTLB PTE (see
Documentation/mm/arch_pgtable_helpers.rst).

It is being removed now to avoid complicating the implementation of
HugeTLB high-granularity mapping.

Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index ccdbccfde148..c34893719715 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -77,11 +77,6 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
 }
 
-static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
-{
-	return mk_pte(page, pgprot);
-}
-
 static inline int huge_pte_none(pte_t pte)
 {
 	return pte_none(pte);
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index d7f6335d3999..be2e763e956f 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -5,11 +5,6 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
-static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
-{
-	return mk_pte(page, pgprot);
-}
-
 static inline unsigned long huge_pte_write(pte_t pte)
 {
 	return pte_write(pte);
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index af59cc7bd307..fbbc53113473 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -925,7 +925,7 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args)
 	 * as it was previously derived from a real kernel symbol.
 	 */
 	page = pfn_to_page(args->fixed_pmd_pfn);
-	pte = mk_huge_pte(page, args->page_prot);
+	pte = mk_pte(page, args->page_prot);
 
 	WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte)));
 	WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte))));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 792cb2e67ce5..540cdf9570d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4899,11 +4899,10 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
 	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
 	if (writable) {
-		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
-					 vma->vm_page_prot)));
+		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
+						vma->vm_page_prot)));
 	} else {
-		entry = huge_pte_wrprotect(mk_huge_pte(page,
-					   vma->vm_page_prot));
+		entry = huge_pte_wrprotect(mk_pte(page, vma->vm_page_prot));
 	}
 	entry = pte_mkyoung(entry);
 	entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 03/46] hugetlb: remove redundant pte_mkhuge in migration path
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
  2023-02-18  0:27 ` [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
  2023-02-18  0:27 ` [PATCH v2 02/46] hugetlb: remove mk_huge_pte; it is unused James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
                   ` (43 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

arch_make_huge_pte, which is called immediately following pte_mkhuge,
already makes the necessary changes to the PTE that pte_mkhuge would
have. The generic implementation of arch_make_huge_pte simply calls
pte_mkhuge.

Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/migrate.c b/mm/migrate.c
index 37865f85df6d..d3964c414010 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -249,7 +249,6 @@ static bool remove_migration_pte(struct folio *folio,
 		if (folio_test_hugetlb(folio)) {
 			unsigned int shift = huge_page_shift(hstate_vma(vma));
 
-			pte = pte_mkhuge(pte);
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			if (folio_test_anon(folio))
 				hugepage_add_anon_rmap(new, vma, pvmw.address,
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (2 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 03/46] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  1:10   ` Mina Almasry
  2023-02-18  0:27 ` [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap James Houghton
                   ` (42 subsequent siblings)
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Currently this check is overly aggressive: for some userfaultfd VMAs,
PMD sharing is disabled, yet we still widen the address range that is
used for flushing TLBs and sending MMU notifiers.

This change is made now because HGM VMAs also have PMD sharing disabled,
yet they would still have their flush ranges adjusted. Over-aggressively
flushing TLBs and triggering MMU notifiers is particularly harmful with
lots of high-granularity operations.

Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 540cdf9570d3..08004371cfed 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6999,22 +6999,31 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	return saddr;
 }
 
-bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
+static bool pmd_sharing_possible(struct vm_area_struct *vma)
 {
-	unsigned long start = addr & PUD_MASK;
-	unsigned long end = start + PUD_SIZE;
-
 #ifdef CONFIG_USERFAULTFD
 	if (uffd_disable_huge_pmd_share(vma))
 		return false;
 #endif
 	/*
-	 * check on proper vm_flags and page table alignment
+	 * Only shared VMAs can share PMDs.
 	 */
 	if (!(vma->vm_flags & VM_MAYSHARE))
 		return false;
 	if (!vma->vm_private_data)	/* vma lock required for sharing */
 		return false;
+	return true;
+}
+
+bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long start = addr & PUD_MASK;
+	unsigned long end = start + PUD_SIZE;
+	/*
+	 * check on proper vm_flags and page table alignment
+	 */
+	if (!pmd_sharing_possible(vma))
+		return false;
 	if (!range_in_vma(vma, start, end))
 		return false;
 	return true;
@@ -7035,7 +7044,7 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 	 * vma needs to span at least one aligned PUD size, and the range
 	 * must be at least partially within in.
 	 */
-	if (!(vma->vm_flags & VM_MAYSHARE) || !(v_end > v_start) ||
+	if (!pmd_sharing_possible(vma) || !(v_end > v_start) ||
 		(*end <= v_start) || (*start >= v_end))
 		return;
 
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (3 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-03-02  1:06   ` Jiaqi Yan
  2023-02-18  0:27 ` [PATCH v2 06/46] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
                   ` (41 subsequent siblings)
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This only applies to file-backed HugeTLB, and it should be a no-op until
high-granularity mapping is possible. Also update page_remove_rmap to
support the eventual case where !compound && folio_test_hugetlb().

HugeTLB doesn't use LRU or mlock, so we avoid those bits. This also
means we don't need to use subpage_mapcount; if we did, it would
overflow with only a few mappings.

There is still one caller of page_dup_file_rmap left: copy_present_pte,
and it is always called with compound=false in this case.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 08004371cfed..6c008c9de80e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5077,7 +5077,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 * sleep during the process.
 			 */
 			if (!PageAnon(ptepage)) {
-				page_dup_file_rmap(ptepage, true);
+				page_add_file_rmap(ptepage, src_vma, true);
 			} else if (page_try_dup_anon_rmap(ptepage, true,
 							  src_vma)) {
 				pte_t src_pte_old = entry;
@@ -5910,7 +5910,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	if (anon_rmap)
 		hugepage_add_new_anon_rmap(folio, vma, haddr);
 	else
-		page_dup_file_rmap(&folio->page, true);
+		page_add_file_rmap(&folio->page, vma, true);
 	new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	/*
@@ -6301,7 +6301,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_unlock;
 
 	if (folio_in_pagecache)
-		page_dup_file_rmap(&folio->page, true);
+		page_add_file_rmap(&folio->page, dst_vma, true);
 	else
 		hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index d3964c414010..b0f87f19b536 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -254,7 +254,7 @@ static bool remove_migration_pte(struct folio *folio,
 				hugepage_add_anon_rmap(new, vma, pvmw.address,
 						       rmap_flags);
 			else
-				page_dup_file_rmap(new, true);
+				page_add_file_rmap(new, vma, true);
 			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		} else
 #endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..c010d0af3a82 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1318,21 +1318,21 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 	int nr = 0, nr_pmdmapped = 0;
 	bool first;
 
-	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+	VM_BUG_ON_PAGE(compound && !PageTransHuge(page)
+				&& !folio_test_hugetlb(folio), page);
 
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
-		if (first && folio_test_large(folio)) {
+		if (first && folio_test_large(folio)
+			  && !folio_test_hugetlb(folio)) {
 			nr = atomic_inc_return_relaxed(mapped);
 			nr = (nr < COMPOUND_MAPPED);
 		}
-	} else if (folio_test_pmd_mappable(folio)) {
-		/* That test is redundant: it's for safety or to optimize out */
-
+	} else {
 		first = atomic_inc_and_test(&folio->_entire_mapcount);
-		if (first) {
+		if (first && !folio_test_hugetlb(folio)) {
 			nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
 			if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) {
 				nr_pmdmapped = folio_nr_pages(folio);
@@ -1347,6 +1347,9 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 		}
 	}
 
+	if (folio_test_hugetlb(folio))
+		return;
+
 	if (nr_pmdmapped)
 		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
 			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
@@ -1376,8 +1379,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
 
 	/* Hugetlb pages are not counted in NR_*MAPPED */
-	if (unlikely(folio_test_hugetlb(folio))) {
-		/* hugetlb pages are always mapped with pmds */
+	if (unlikely(folio_test_hugetlb(folio)) && compound) {
 		atomic_dec(&folio->_entire_mapcount);
 		return;
 	}
@@ -1386,15 +1388,14 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 	if (likely(!compound)) {
 		last = atomic_add_negative(-1, &page->_mapcount);
 		nr = last;
-		if (last && folio_test_large(folio)) {
+		if (last && folio_test_large(folio)
+			 && !folio_test_hugetlb(folio)) {
 			nr = atomic_dec_return_relaxed(mapped);
 			nr = (nr < COMPOUND_MAPPED);
 		}
-	} else if (folio_test_pmd_mappable(folio)) {
-		/* That test is redundant: it's for safety or to optimize out */
-
+	} else {
 		last = atomic_add_negative(-1, &folio->_entire_mapcount);
-		if (last) {
+		if (last && !folio_test_hugetlb(folio)) {
 			nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
 			if (likely(nr < COMPOUND_MAPPED)) {
 				nr_pmdmapped = folio_nr_pages(folio);
@@ -1409,6 +1410,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		}
 	}
 
+	if (folio_test_hugetlb(folio))
+		return;
+
 	if (nr_pmdmapped) {
 		if (folio_test_anon(folio))
 			idx = NR_ANON_THPS;
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 06/46] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (4 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 07/46] mm: add VM_HUGETLB_HGM VMA flag James Houghton
                   ` (40 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This adds the Kconfig to enable or disable high-granularity mapping.
Each architecture must explicitly opt in to it (via
ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING), and even then HGM is off by
default, as enabling it disables the HugeTLB Vmemmap Optimization (HVO).

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/fs/Kconfig b/fs/Kconfig
index 2685a4d0d353..a072bbe3439a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -246,6 +246,18 @@ config HUGETLBFS
 config HUGETLB_PAGE
 	def_bool HUGETLBFS
 
+config ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+	bool
+
+config HUGETLB_HIGH_GRANULARITY_MAPPING
+	bool "HugeTLB high-granularity mapping support"
+	default n
+	depends on ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+	help
+	  HugeTLB high-granularity mapping (HGM) allows userspace to issue
+	  UFFDIO_CONTINUE on HugeTLB mappings in PAGE_SIZE chunks.
+	  HGM is incompatible with the HugeTLB Vmemmap Optimization (HVO).
+
 #
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB Vmemmap Optimization (HVO).
@@ -257,6 +269,7 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	def_bool HUGETLB_PAGE
 	depends on ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	depends on SPARSEMEM_VMEMMAP
+	depends on !HUGETLB_HIGH_GRANULARITY_MAPPING
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
 	bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 07/46] mm: add VM_HUGETLB_HGM VMA flag
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (5 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 06/46] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-24 22:35   ` Mike Kravetz
  2023-02-18  0:27 ` [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers James Houghton
                   ` (39 subsequent siblings)
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

VM_HUGETLB_HGM indicates that a HugeTLB VMA may contain high-granularity
mappings. Its VmFlags string is "hm".

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6a96e1713fd5..77b72f42556a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+		[ilog2(VM_HUGETLB_HGM)]	= "hm",
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2992a2d55aee..9d3216b4284a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -383,6 +383,13 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_UFFD_MINOR		VM_NONE
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+# define VM_HUGETLB_HGM_BIT	38
+# define VM_HUGETLB_HGM		BIT(VM_HUGETLB_HGM_BIT)	/* HugeTLB high-granularity mapping */
+#else /* !CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+# define VM_HUGETLB_HGM		VM_NONE
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 9db52bc4ce19..bceb960dbada 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -162,6 +162,12 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif
 
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+# define IF_HAVE_HUGETLB_HGM(flag, name) {flag, name},
+#else
+# define IF_HAVE_HUGETLB_HGM(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -186,6 +192,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)		\
 	{VM_ACCOUNT,			"account"	},		\
 	{VM_NORESERVE,			"noreserve"	},		\
 	{VM_HUGETLB,			"hugetlb"	},		\
+IF_HAVE_HUGETLB_HGM(VM_HUGETLB_HGM,	"hugetlb_hgm"	)		\
 	{VM_SYNC,			"sync"		},		\
 	__VM_ARCH_SPECIFIC_1				,		\
 	{VM_WIPEONFORK,			"wipeonfork"	},		\
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (6 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 07/46] mm: add VM_HUGETLB_HGM VMA flag James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  1:40   ` Mina Almasry
  2023-02-24 23:08   ` Mike Kravetz
  2023-02-18  0:27 ` [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM James Houghton
                   ` (38 subsequent siblings)
  46 siblings, 2 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

hugetlb_hgm_eligible indicates that a VMA is eligible to have HGM
explicitly enabled via MADV_SPLIT, and hugetlb_hgm_enabled indicates
that HGM has been enabled.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7c977d234aba..efd2635a87f5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1211,6 +1211,20 @@ static inline void hugetlb_unregister_node(struct node *node)
 }
 #endif	/* CONFIG_HUGETLB_PAGE */
 
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
+bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
+#else
+static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
+{
+	return false;
+}
+static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
+{
+	return false;
+}
+#endif
+
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
 					struct mm_struct *mm, pte_t *pte)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c008c9de80e..0576dcc98044 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7004,6 +7004,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
 #ifdef CONFIG_USERFAULTFD
 	if (uffd_disable_huge_pmd_share(vma))
 		return false;
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	if (hugetlb_hgm_enabled(vma))
+		return false;
 #endif
 	/*
 	 * Only shared VMAs can share PMDs.
@@ -7267,6 +7271,18 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
 
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
 
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
+{
+	/* All shared VMAs may have HGM. */
+	return vma && (vma->vm_flags & VM_MAYSHARE);
+}
+bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
+{
+	return vma && (vma->vm_flags & VM_HUGETLB_HGM);
+}
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
 /*
  * These functions are overwritable if your architecture needs its own
  * behavior.
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (7 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  1:58   ` Mina Almasry
  2023-02-24 23:25   ` Mike Kravetz
  2023-02-18  0:27 ` [PATCH v2 10/46] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
                   ` (37 subsequent siblings)
  46 siblings, 2 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
applied to non-HugeTLB memory in the future, should such an application
arise.

MADV_SPLIT provides several API changes for some syscalls on HugeTLB
address ranges:
1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
   alignment.
2. read()ing a page fault event from a userfaultfd will yield a
   PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
   address (unless UFFD_FEATURE_EXACT_ADDRESS is used).

There is no way to disable the API changes that come with issuing
MADV_SPLIT. MADV_COLLAPSE can be used to collapse the high-granularity
page table mappings that the MADV_SPLIT-enabled functionality creates.

For post-copy live migration, the expected use-case is:
1. mmap(MAP_SHARED, some_fd) primary mapping
2. mmap(MAP_SHARED, some_fd) alias mapping
3. MADV_SPLIT the primary mapping
4. UFFDIO_REGISTER/etc. the primary mapping
5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
   corresponding PAGE_SIZE sections in the primary mapping.

More API changes may be added in the future.
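
A rough sketch of that flow in userspace C (illustrative only: error
handling, userfaultfd feature negotiation, and the fault-handling loop
are omitted, and the hugetlbfs path is hypothetical):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

#ifndef MADV_SPLIT
#define MADV_SPLIT	26	/* added by this patch */
#endif

static void demo_postcopy_setup(int uffd, size_t size,
				const char *src, size_t offset)
{
	int fd = open("/mnt/hugetlbfs/guest_mem", O_RDWR);	/* hypothetical path */

	/* Steps 1 and 2: primary and alias mappings of the same file. */
	char *primary = mmap(NULL, size, PROT_READ | PROT_WRITE,
			     MAP_SHARED, fd, 0);
	char *alias = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);

	/* Step 3: enable HGM on the primary mapping. */
	madvise(primary, size, MADV_SPLIT);

	/* Step 4: register the primary mapping for minor faults. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)primary, .len = size },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * Step 5: copy one PAGE_SIZE chunk through the alias mapping,
	 * then install it into the primary mapping.
	 */
	memcpy(alias + offset, src, 4096);
	struct uffdio_continue cont = {
		.range = { .start = (unsigned long)primary + offset,
			   .len = 4096 },
	};
	ioctl(uffd, UFFDIO_CONTINUE, &cont);
}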

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..7a26f3648b90 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..f8a74a3a0928 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..a6dc6a56c941 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -72,6 +72,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	74		/* Enable hugepage high-granularity APIs */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..f98a77c430a9 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..996e8ded092f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index c2202f51e9dd..8c004c678262 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+static int madvise_split(struct vm_area_struct *vma,
+			 unsigned long *new_flags)
+{
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
+		return -EINVAL;
+
+	/*
+	 * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
+	 * of a VMA, then we will split the VMA. Here, we're unsharing before
+	 * splitting because it's simpler, although we may be unsharing more
+	 * than we need.
+	 */
+	hugetlb_unshare_all_pmds(vma);
+
+	*new_flags |= VM_HUGETLB_HGM;
+	return 0;
+#else
+	return -EINVAL;
+#endif
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	case MADV_COLLAPSE:
 		return madvise_collapse(vma, prev, start, end);
+	case MADV_SPLIT:
+		error = madvise_split(vma, &new_flags);
+		if (error)
+			goto out;
+		break;
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1178,6 +1205,9 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 	case MADV_COLLAPSE:
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	case MADV_SPLIT:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1368,6 +1398,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
  *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows
+ *		UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 10/46] hugetlb: make huge_pte_lockptr take an explicit shift argument
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (8 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
                   ` (36 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This is needed to handle PTL locking with high-granularity mapping. We
won't always be using the PMD-level PTL even if we're using the 2M
hugepage hstate. It's possible that we're dealing with 4K PTEs, in which
case, we need to lock the PTL for the 4K PTE.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..035a0df47af0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -261,7 +261,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 
 		psize = hstate_get_psize(h);
 #ifdef CONFIG_DEBUG_VM
-		assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep));
+		assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),
+						    vma->vm_mm, ptep));
 #endif
 
 #else
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index efd2635a87f5..a1ceb9417f01 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -958,12 +958,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return modified_mask;
 }
 
-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
 					   struct mm_struct *mm, pte_t *pte)
 {
-	if (huge_page_size(h) == PMD_SIZE)
+	if (shift == PMD_SHIFT)
 		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
 	return &mm->page_table_lock;
 }
 
@@ -1173,7 +1172,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
 					   struct mm_struct *mm, pte_t *pte)
 {
 	return &mm->page_table_lock;
@@ -1230,7 +1229,7 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 {
 	spinlock_t *ptl;
 
-	ptl = huge_pte_lockptr(h, mm, pte);
+	ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
 	spin_lock(ptl);
 	return ptl;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0576dcc98044..5ca9eae0ac42 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5017,7 +5017,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		}
 
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
-		src_ptl = huge_pte_lockptr(h, src, src_pte);
+		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 again:
@@ -5098,7 +5098,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 				/* Install the new hugetlb folio if src pte stable */
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
-				src_ptl = huge_pte_lockptr(h, src, src_pte);
+				src_ptl = huge_pte_lockptr(huge_page_shift(h),
+							   src, src_pte);
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
@@ -5152,7 +5153,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 	pte_t pte;
 
 	dst_ptl = huge_pte_lock(h, mm, dst_pte);
-	src_ptl = huge_pte_lockptr(h, mm, src_pte);
+	src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst ptlocks
diff --git a/mm/migrate.c b/mm/migrate.c
index b0f87f19b536..9b4a7e75f6e6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -363,7 +363,8 @@ void __migration_entry_wait_huge(struct vm_area_struct *vma,
 
 void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
 {
-	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
+	spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
+					   vma->vm_mm, pte);
 
 	__migration_entry_wait_huge(vma, pte, ptl);
 }
-- 
2.39.2.637.g21b0678d19-goog



* [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (9 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 10/46] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  5:24   ` Mina Almasry
  2023-02-25  0:09   ` Mike Kravetz
  2023-02-18  0:27 ` [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte James Houghton
                   ` (35 subsequent siblings)
  46 siblings, 2 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

After high-granularity mapping, page table entries for HugeTLB pages can
be of any size/type. (For example, we can have a 1G page mapped with a
mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
PTE after we have done a page table walk.

Without this, we'd have to pass around the "size" of the PTE everywhere.
We effectively did this before; it could be fetched from the hstate,
which we pass around pretty much everywhere.

hugetlb_pte_present_leaf is included here as a helper function that will
be used frequently later on.
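
As an illustration only (not something this patch adds), a caller that
has been handed a hugetlb_pte after a page table walk could use the new
helpers like this hypothetical function, which returns the size of a
present leaf mapping (or 0):

/* Hypothetical example -- not part of this patch. */
static unsigned long hpte_present_leaf_size(struct hugetlb_pte *hpte)
{
	spinlock_t *ptl = hugetlb_pte_lock(hpte);
	pte_t pte = huge_ptep_get(hpte->ptep);
	unsigned long size = 0;

	if (hugetlb_pte_present_leaf(hpte, pte))
		size = hugetlb_pte_size(hpte);

	spin_unlock(ptl);
	return size;
}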

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a1ceb9417f01..eeacadf3272b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -26,6 +26,25 @@ typedef struct { unsigned long pd; } hugepd_t;
 #define __hugepd(x) ((hugepd_t) { (x) })
 #endif
 
+enum hugetlb_level {
+	HUGETLB_LEVEL_PTE = 1,
+	/*
+	 * We always include PMD, PUD, and P4D in this enum definition so that,
+	 * when logged as an integer, we can easily tell which level it is.
+	 */
+	HUGETLB_LEVEL_PMD,
+	HUGETLB_LEVEL_PUD,
+	HUGETLB_LEVEL_P4D,
+	HUGETLB_LEVEL_PGD,
+};
+
+struct hugetlb_pte {
+	pte_t *ptep;
+	unsigned int shift;
+	enum hugetlb_level level;
+	spinlock_t *ptl;
+};
+
 #ifdef CONFIG_HUGETLB_PAGE
 
 #include <linux/mempolicy.h>
@@ -39,6 +58,20 @@ typedef struct { unsigned long pd; } hugepd_t;
  */
 #define __NR_USED_SUBPAGE 3
 
+static inline
+unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
+{
+	return 1UL << hpte->shift;
+}
+
+static inline
+unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
+{
+	return ~(hugetlb_pte_size(hpte) - 1);
+}
+
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
@@ -1234,6 +1267,45 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 	return ptl;
 }
 
+static inline
+spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
+{
+	return hpte->ptl;
+}
+
+static inline
+spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
+static inline
+void __hugetlb_pte_init(struct hugetlb_pte *hpte, pte_t *ptep,
+			unsigned int shift, enum hugetlb_level level,
+			spinlock_t *ptl)
+{
+	/*
+	 * If 'shift' indicates that this PTE is contiguous, then @ptep must
+	 * be the first pte of the contiguous bunch.
+	 */
+	hpte->ptl = ptl;
+	hpte->ptep = ptep;
+	hpte->shift = shift;
+	hpte->level = level;
+}
+
+static inline
+void hugetlb_pte_init(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		      pte_t *ptep, unsigned int shift,
+		      enum hugetlb_level level)
+{
+	__hugetlb_pte_init(hpte, ptep, shift, level,
+			   huge_pte_lockptr(shift, mm, ptep));
+}
+
 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
 extern void __init hugetlb_cma_reserve(int order);
 #else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5ca9eae0ac42..6c74adff43b6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1269,6 +1269,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
+{
+	pgd_t pgd;
+	p4d_t p4d;
+	pud_t pud;
+	pmd_t pmd;
+
+	switch (hpte->level) {
+	case HUGETLB_LEVEL_PGD:
+		pgd = __pgd(pte_val(pte));
+		return pgd_present(pgd) && pgd_leaf(pgd);
+	case HUGETLB_LEVEL_P4D:
+		p4d = __p4d(pte_val(pte));
+		return p4d_present(p4d) && p4d_leaf(p4d);
+	case HUGETLB_LEVEL_PUD:
+		pud = __pud(pte_val(pte));
+		return pud_present(pud) && pud_leaf(pud);
+	case HUGETLB_LEVEL_PMD:
+		pmd = __pmd(pte_val(pte));
+		return pmd_present(pmd) && pmd_leaf(pmd);
+	case HUGETLB_LEVEL_PTE:
+		return pte_present(pte);
+	default:
+		WARN_ON_ONCE(1);
+		return false;
+	}
+}
+
+
 static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
 	int nid = folio_nid(folio);
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (10 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18 17:46   ` kernel test robot
  2023-02-27 19:16   ` Mike Kravetz
  2023-02-18  0:27 ` [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
                   ` (34 subsequent siblings)
  46 siblings, 2 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

These functions are used to allocate new PTEs below the hstate PTE. This
will be used by hugetlb_walk_step, which implements stepping forwards in
a HugeTLB high-granularity page table walk.

The reasons that we don't use the standard pmd_alloc/pte_alloc*
functions are:
 1) This prevents us from accidentally overwriting swap entries or
    attempting to use swap entries as present non-leaf PTEs (see
    pmd_alloc(); we assume that !pte_none means pte_present and
    non-leaf).
 2) Locking hugetlb PTEs can be different from locking regular PTEs.
    (Although, as implemented right now, the locking is the same.)
 3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
    HGM won't use HIGHPTE, but the kernel can still be built with it,
    and other mm code will use it.

When GENERAL_HUGETLB supports P4D-based hugepages, we will need to add
hugetlb_pud_alloc to implement hugetlb_walk_step.
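
As a hedged illustration (not part of this patch; example_step_to_pmd
is hypothetical), a caller is expected to handle the ERR_PTR-style
returns and, on success, re-initialize its hugetlb_pte one level down,
roughly like this:

/*
 * Illustrative sketch only: consume hugetlb_alloc_pmd()'s documented
 * return values and step the hugetlb_pte down to PMD level on success.
 */
static int example_step_to_pmd(struct mm_struct *mm,
			       struct hugetlb_pte *hpte, unsigned long addr)
{
	pmd_t *pmdp = hugetlb_alloc_pmd(mm, hpte, addr);

	if (IS_ERR(pmdp))
		/* -EINVAL, -EEXIST, or -ENOMEM, as documented above. */
		return PTR_ERR(pmdp);

	hugetlb_pte_init(mm, hpte, (pte_t *)pmdp, PMD_SHIFT,
			 HUGETLB_LEVEL_PMD);
	return 0;
}

This is essentially the shape that hugetlb_walk_step takes later in the
series.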

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index eeacadf3272b..9d839519c875 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -72,6 +72,11 @@ unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
 
 bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
 
+pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		unsigned long addr);
+pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		unsigned long addr);
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c74adff43b6..bb424cdf79e4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -483,6 +483,120 @@ static bool has_same_uncharge_info(struct file_region *rg,
 #endif
 }
 
+/*
+ * hugetlb_alloc_pmd -- Allocate or find a PMD beneath a PUD-level hpte.
+ *
+ * This is meant to be used to implement hugetlb_walk_step when one must
+ * step down to a PMD. Different architectures may implement hugetlb_walk_step
+ * differently, but hugetlb_alloc_pmd and hugetlb_alloc_pte are architecture-
+ * independent.
+ *
+ * Returns:
+ *	On success: the pointer to the PMD. This should be placed into a
+ *		    hugetlb_pte. @hpte is not changed.
+ *	ERR_PTR(-EINVAL): hpte is not PUD-level
+ *	ERR_PTR(-EEXIST): there is a non-leaf and non-empty PUD in @hpte
+ *	ERR_PTR(-ENOMEM): could not allocate the new PMD
+ */
+pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		unsigned long addr)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
+	pmd_t *new;
+	pud_t *pudp;
+	pud_t pud;
+
+	if (hpte->level != HUGETLB_LEVEL_PUD)
+		return ERR_PTR(-EINVAL);
+
+	pudp = (pud_t *)hpte->ptep;
+retry:
+	pud = READ_ONCE(*pudp);
+	if (likely(pud_present(pud)))
+		return unlikely(pud_leaf(pud))
+			? ERR_PTR(-EEXIST)
+			: pmd_offset(pudp, addr);
+	else if (!pud_none(pud))
+		/*
+		 * Not present and not none means that a swap entry lives here,
+		 * and we can't get rid of it.
+		 */
+		return ERR_PTR(-EEXIST);
+
+	new = pmd_alloc_one(mm, addr);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock(ptl);
+	if (!pud_same(pud, *pudp)) {
+		spin_unlock(ptl);
+		pmd_free(mm, new);
+		goto retry;
+	}
+
+	mm_inc_nr_pmds(mm);
+	smp_wmb(); /* See comment in pmd_install() */
+	pud_populate(mm, pudp, new);
+	spin_unlock(ptl);
+	return pmd_offset(pudp, addr);
+}
+
+/*
+ * hugetlb_alloc_pte -- Allocate a PTE beneath a pmd_none PMD-level hpte.
+ *
+ * See the comment above hugetlb_alloc_pmd.
+ */
+pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		unsigned long addr)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
+	pgtable_t new;
+	pmd_t *pmdp;
+	pmd_t pmd;
+
+	if (hpte->level != HUGETLB_LEVEL_PMD)
+		return ERR_PTR(-EINVAL);
+
+	pmdp = (pmd_t *)hpte->ptep;
+retry:
+	pmd = READ_ONCE(*pmdp);
+	if (likely(pmd_present(pmd)))
+		return unlikely(pmd_leaf(pmd))
+			? ERR_PTR(-EEXIST)
+			: pte_offset_kernel(pmdp, addr);
+	else if (!pmd_none(pmd))
+		/*
+		 * Not present and not none means that a swap entry lives here,
+		 * and we can't get rid of it.
+		 */
+		return ERR_PTR(-EEXIST);
+
+	/*
+	 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
+	 * in page tables being allocated in high memory, needing a kmap to
+	 * access. Instead, we call __pte_alloc_one directly with
+	 * GFP_PGTABLE_USER to prevent these PTEs being allocated in high
+	 * memory.
+	 */
+	new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock(ptl);
+	if (!pmd_same(pmd, *pmdp)) {
+		spin_unlock(ptl);
+		pgtable_pte_page_dtor(new);
+		__free_page(new);
+		goto retry;
+	}
+
+	mm_inc_nr_ptes(mm);
+	smp_wmb(); /* See comment in pmd_install() */
+	pmd_populate(mm, pmdp, new);
+	spin_unlock(ptl);
+	return pte_offset_kernel(pmdp, addr);
+}
+
 static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
 {
 	struct file_region *nrg, *prg;
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (11 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  7:43   ` kernel test robot
                     ` (2 more replies)
  2023-02-18  0:27 ` [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks James Houghton
                   ` (33 subsequent siblings)
  46 siblings, 3 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

hugetlb_hgm_walk implements high-granularity page table walks for
HugeTLB. It is safe to call on non-HGM enabled VMAs; it will return
immediately.

hugetlb_walk_step implements how we step forwards in the walk.
Architectures that don't use GENERAL_HUGETLB will need to provide their
own implementation.

The broader API that should be used is
hugetlb_full_walk[,alloc|,continue].
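
For illustration only (not part of this patch; example_read_hpte is
hypothetical), a typical read-side use of this API looks roughly like
the following, with the VMA locking that real callers take omitted:

/*
 * Illustrative sketch only: read the PTE that currently maps @addr,
 * at whatever granularity it happens to be mapped.
 */
static pte_t example_read_hpte(struct vm_area_struct *vma,
			       unsigned long addr)
{
	struct hugetlb_pte hpte;
	spinlock_t *ptl;
	pte_t pte;

	/* Fails only if the hstate-level PTE is not allocated. */
	if (hugetlb_full_walk(&hpte, vma, addr))
		return __pte(0);

	ptl = hugetlb_pte_lock(&hpte);
	pte = huge_ptep_get(hpte.ptep);
	spin_unlock(ptl);

	return pte;
}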

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9d839519c875..726d581158b1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -223,6 +223,14 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud);
 
+int hugetlb_full_walk(struct hugetlb_pte *hpte, struct vm_area_struct *vma,
+		      unsigned long addr);
+void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
+				struct vm_area_struct *vma, unsigned long addr);
+int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
+			    struct vm_area_struct *vma, unsigned long addr,
+			    unsigned long target_sz);
+
 struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
 
 extern int sysctl_hugetlb_shm_group;
@@ -272,6 +280,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
+int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		      unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
@@ -1054,6 +1064,8 @@ void hugetlb_register_node(struct node *node);
 void hugetlb_unregister_node(struct node *node);
 #endif
 
+enum hugetlb_level hpage_size_to_level(unsigned long sz);
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
@@ -1246,6 +1258,11 @@ static inline void hugetlb_register_node(struct node *node)
 static inline void hugetlb_unregister_node(struct node *node)
 {
 }
+
+static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
+{
+	return HUGETLB_LEVEL_PTE;
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb424cdf79e4..810c05feb41f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -97,6 +97,29 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
 static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end);
 
+/*
+ * hpage_size_to_level() - convert @sz to the corresponding page table level
+ *
+ * @sz must be less than or equal to a valid hugepage size.
+ */
+enum hugetlb_level hpage_size_to_level(unsigned long sz)
+{
+	/*
+	 * We order the conditionals from smallest to largest to pick the
+	 * smallest level when multiple levels have the same size (i.e.,
+	 * when levels are folded).
+	 */
+	if (sz < PMD_SIZE)
+		return HUGETLB_LEVEL_PTE;
+	if (sz < PUD_SIZE)
+		return HUGETLB_LEVEL_PMD;
+	if (sz < P4D_SIZE)
+		return HUGETLB_LEVEL_PUD;
+	if (sz < PGDIR_SIZE)
+		return HUGETLB_LEVEL_P4D;
+	return HUGETLB_LEVEL_PGD;
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -7315,6 +7338,154 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
 }
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
+/* __hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
+ * the page table entry for @addr. We might allocate new PTEs.
+ *
+ * @hpte must always be pointing at an hstate-level PTE or deeper.
+ *
+ * This function will never walk further if it encounters a PTE of a size
+ * less than or equal to @sz.
+ *
+ * @alloc determines what we do when we encounter an empty PTE. If false,
+ * we stop walking. If true and @sz is less than the current PTE's size,
+ * we make that PTE point to the next level down, going until @sz is the same
+ * as our current PTE.
+ *
+ * If @alloc is false and @sz is PAGE_SIZE, this function will always
+ * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz.
+ *
+ * Return:
+ *	-ENOMEM if we couldn't allocate new PTEs.
+ *	-EEXIST if the caller wanted to walk further than a migration PTE,
+ *		poison PTE, or a PTE marker. The caller needs to manually deal
+ *		with this scenario.
+ *	-EINVAL if called with invalid arguments (@sz invalid, @hpte not
+ *		initialized).
+ *	0 otherwise.
+ *
+ *	Even if this function fails, @hpte is guaranteed to always remain
+ *	valid.
+ */
+static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
+			      struct hugetlb_pte *hpte, unsigned long addr,
+			      unsigned long sz, bool alloc)
+{
+	int ret = 0;
+	pte_t pte;
+
+	if (WARN_ON_ONCE(sz < PAGE_SIZE))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(!hpte->ptep))
+		return -EINVAL;
+
+	while (hugetlb_pte_size(hpte) > sz && !ret) {
+		pte = huge_ptep_get(hpte->ptep);
+		if (!pte_present(pte)) {
+			if (!alloc)
+				return 0;
+			if (unlikely(!huge_pte_none(pte)))
+				return -EEXIST;
+		} else if (hugetlb_pte_present_leaf(hpte, pte))
+			return 0;
+		ret = hugetlb_walk_step(mm, hpte, addr, sz);
+	}
+
+	return ret;
+}
+
+/*
+ * hugetlb_hgm_walk - Has the same behavior as __hugetlb_hgm_walk but will
+ * initialize @hpte with hstate-level PTE pointer @ptep.
+ */
+static int hugetlb_hgm_walk(struct hugetlb_pte *hpte,
+			    pte_t *ptep,
+			    struct vm_area_struct *vma,
+			    unsigned long addr,
+			    unsigned long target_sz,
+			    bool alloc)
+{
+	struct hstate *h = hstate_vma(vma);
+
+	hugetlb_pte_init(vma->vm_mm, hpte, ptep, huge_page_shift(h),
+			 hpage_size_to_level(huge_page_size(h)));
+	return __hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, target_sz,
+				  alloc);
+}
+
+/*
+ * hugetlb_full_walk_continue - continue a high-granularity page-table walk.
+ *
+ * If a user has a valid @hpte but knows that @hpte is not a leaf, they can
+ * attempt to continue walking by calling this function.
+ *
+ * This function will never fail, but @hpte might not change.
+ *
+ * If @hpte hasn't been initialized, then this function's behavior is
+ * undefined.
+ */
+void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
+				struct vm_area_struct *vma,
+				unsigned long addr)
+{
+	/* __hugetlb_hgm_walk will never fail with these arguments. */
+	WARN_ON_ONCE(__hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr,
+					PAGE_SIZE, false));
+}
+
+/*
+ * hugetlb_full_walk - do a high-granularity page-table walk; never allocate.
+ *
+ * This function can only fail if we find that the hstate-level PTE is not
+ * allocated. Callers can take advantage of this fact to skip address regions
+ * that cannot be mapped in that case.
+ *
+ * If this function succeeds, @hpte is guaranteed to be valid.
+ */
+int hugetlb_full_walk(struct hugetlb_pte *hpte,
+		      struct vm_area_struct *vma,
+		      unsigned long addr)
+{
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
+	/*
+	 * We must mask the address appropriately so that we pick up the first
+	 * PTE in a contiguous group.
+	 */
+	pte_t *ptep = hugetlb_walk(vma, addr & huge_page_mask(h), sz);
+
+	if (!ptep)
+		return -ENOMEM;
+
+	/* hugetlb_hgm_walk will never fail with these arguments. */
+	WARN_ON_ONCE(hugetlb_hgm_walk(hpte, ptep, vma, addr, PAGE_SIZE, false));
+	return 0;
+}
+
+/*
+ * hugetlb_full_walk_alloc - do a high-granularity walk, potentially allocate
+ *	new PTEs.
+ */
+int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
+				   struct vm_area_struct *vma,
+				   unsigned long addr,
+				   unsigned long target_sz)
+{
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
+	/*
+	 * We must mask the address appropriately so that we pick up the first
+	 * PTE in a contiguous group.
+	 */
+	pte_t *ptep = huge_pte_alloc(vma->vm_mm, vma, addr & huge_page_mask(h),
+				     sz);
+
+	if (!ptep)
+		return -ENOMEM;
+
+	return hugetlb_hgm_walk(hpte, ptep, vma, addr, target_sz, true);
+}
+
 #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
@@ -7382,6 +7553,48 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	return (pte_t *)pmd;
 }
 
+/*
+ * hugetlb_walk_step() - Walk the page table one step to resolve the page
+ * (hugepage or subpage) entry at address @addr.
+ *
+ * @sz always points at the final target PTE size (e.g. PAGE_SIZE for the
+ * lowest level PTE).
+ *
+ * @hpte will always remain valid, even if this function fails.
+ *
+ * Architectures that implement this function must ensure that if @hpte does
+ * not change levels, then its PTL must also stay the same.
+ */
+int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
+		      unsigned long addr, unsigned long sz)
+{
+	pte_t *ptep;
+	spinlock_t *ptl;
+
+	switch (hpte->level) {
+	case HUGETLB_LEVEL_PUD:
+		ptep = (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr);
+		if (IS_ERR(ptep))
+			return PTR_ERR(ptep);
+		hugetlb_pte_init(mm, hpte, ptep, PMD_SHIFT,
+				 HUGETLB_LEVEL_PMD);
+		break;
+	case HUGETLB_LEVEL_PMD:
+		ptep = hugetlb_alloc_pte(mm, hpte, addr);
+		if (IS_ERR(ptep))
+			return PTR_ERR(ptep);
+		ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
+		__hugetlb_pte_init(hpte, ptep, PAGE_SHIFT,
+				   HUGETLB_LEVEL_PTE, ptl);
+		break;
+	default:
+		WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n",
+				__func__, hpte->level, hpte->shift);
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /*
  * Return a mask that can be used to update an address to the last huge
  * page in a page table page mapping size.  Used to skip non-present
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (12 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18 19:49   ` kernel test robot
  2023-02-28 22:48   ` Mike Kravetz
  2023-02-18  0:27 ` [PATCH v2 15/46] hugetlb: add make_huge_pte_with_shift James Houghton
                   ` (32 subsequent siblings)
  46 siblings, 2 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Fix how UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT interact in these two
ways:
 - UFFDIO_WRITEPROTECT no longer prevents a high-granularity
   UFFDIO_CONTINUE.
 - UFFD-WP PTE markers installed with UFFDIO_WRITEPROTECT will be
   properly propagated when high-granularity UFFDIO_CONTINUEs are
   performed.

Note: UFFDIO_WRITEPROTECT is not yet permitted at PAGE_SIZE granularity.
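
To sketch what "splitting" a marker means here (illustrative only, not
part of this patch; example_split_marker is hypothetical): when a newly
allocated lower-level page table replaces a UFFD-WP marker, every entry
of the new table starts out holding the same marker, so the
write-protect state survives at the finer granularity:

/*
 * Illustrative sketch only: pre-fill a freshly allocated page table
 * with the marker that used to live one level up.
 */
static void example_split_marker(pte_t *table, unsigned int nr_entries,
				 pte_marker marker)
{
	unsigned int i;

	for (i = 0; i < nr_entries; i++)
		table[i] = make_pte_marker(marker);
}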

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 810c05feb41f..f74183acc521 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -506,6 +506,30 @@ static bool has_same_uncharge_info(struct file_region *rg,
 #endif
 }
 
+static void hugetlb_install_markers_pmd(pmd_t *pmdp, pte_marker marker)
+{
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; ++i)
+		/*
+		 * WRITE_ONCE not needed because the pud hasn't been
+		 * installed yet.
+		 */
+		pmdp[i] = __pmd(pte_val(make_pte_marker(marker)));
+}
+
+static void hugetlb_install_markers_pte(pte_t *ptep, pte_marker marker)
+{
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; ++i)
+		/*
+		 * WRITE_ONCE not needed because the pmd hasn't been
+		 * installed yet.
+		 */
+		ptep[i] = make_pte_marker(marker);
+}
+
 /*
  * hugetlb_alloc_pmd -- Allocate or find a PMD beneath a PUD-level hpte.
  *
@@ -528,23 +552,32 @@ pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
 	pmd_t *new;
 	pud_t *pudp;
 	pud_t pud;
+	bool is_marker;
+	pte_marker marker;
 
 	if (hpte->level != HUGETLB_LEVEL_PUD)
 		return ERR_PTR(-EINVAL);
 
 	pudp = (pud_t *)hpte->ptep;
 retry:
+	is_marker = false;
 	pud = READ_ONCE(*pudp);
 	if (likely(pud_present(pud)))
 		return unlikely(pud_leaf(pud))
 			? ERR_PTR(-EEXIST)
 			: pmd_offset(pudp, addr);
-	else if (!pud_none(pud))
+	else if (!pud_none(pud)) {
 		/*
-		 * Not present and not none means that a swap entry lives here,
-		 * and we can't get rid of it.
+		 * Not present and not none means that a swap entry lives here.
+		 * If it's a PTE marker, we can deal with it. If it's another
+		 * swap entry, we don't attempt to split it.
 		 */
-		return ERR_PTR(-EEXIST);
+		is_marker = is_pte_marker(__pte(pud_val(pud)));
+		if (!is_marker)
+			return ERR_PTR(-EEXIST);
+
+		marker = pte_marker_get(pte_to_swp_entry(__pte(pud_val(pud))));
+	}
 
 	new = pmd_alloc_one(mm, addr);
 	if (!new)
@@ -557,6 +590,13 @@ pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
 		goto retry;
 	}
 
+	/*
+	 * Install markers before PUD to avoid races with other
+	 * page tables walks.
+	 */
+	if (is_marker)
+		hugetlb_install_markers_pmd(new, marker);
+
 	mm_inc_nr_pmds(mm);
 	smp_wmb(); /* See comment in pmd_install() */
 	pud_populate(mm, pudp, new);
@@ -576,23 +616,32 @@ pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
 	pgtable_t new;
 	pmd_t *pmdp;
 	pmd_t pmd;
+	bool is_marker;
+	pte_marker marker;
 
 	if (hpte->level != HUGETLB_LEVEL_PMD)
 		return ERR_PTR(-EINVAL);
 
 	pmdp = (pmd_t *)hpte->ptep;
 retry:
+	is_marker = false;
 	pmd = READ_ONCE(*pmdp);
 	if (likely(pmd_present(pmd)))
 		return unlikely(pmd_leaf(pmd))
 			? ERR_PTR(-EEXIST)
 			: pte_offset_kernel(pmdp, addr);
-	else if (!pmd_none(pmd))
+	else if (!pmd_none(pmd)) {
 		/*
-		 * Not present and not none means that a swap entry lives here,
-		 * and we can't get rid of it.
+		 * Not present and not none means that a swap entry lives here.
+		 * If it's a PTE marker, we can deal with it. If it's another
+		 * swap entry, we don't attempt to split it.
 		 */
-		return ERR_PTR(-EEXIST);
+		is_marker = is_pte_marker(__pte(pmd_val(pmd)));
+		if (!is_marker)
+			return ERR_PTR(-EEXIST);
+
+		marker = pte_marker_get(pte_to_swp_entry(__pte(pmd_val(pmd))));
+	}
 
 	/*
 	 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
@@ -613,6 +662,9 @@ pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
 		goto retry;
 	}
 
+	if (is_marker)
+		hugetlb_install_markers_pte(page_address(new), marker);
+
 	mm_inc_nr_ptes(mm);
 	smp_wmb(); /* See comment in pmd_install() */
 	pmd_populate(mm, pmdp, new);
@@ -7384,7 +7436,12 @@ static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (!pte_present(pte)) {
 			if (!alloc)
 				return 0;
-			if (unlikely(!huge_pte_none(pte)))
+			/*
+			 * In hugetlb_alloc_pmd and hugetlb_alloc_pte,
+			 * we split PTE markers, so we can tolerate
+			 * PTE markers here.
+			 */
+			if (unlikely(!huge_pte_none_mostly(pte)))
 				return -EEXIST;
 		} else if (hugetlb_pte_present_leaf(hpte, pte))
 			return 0;
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 15/46] hugetlb: add make_huge_pte_with_shift
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (13 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-22 21:14   ` Mina Almasry
  2023-02-18  0:27 ` [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
                   ` (31 subsequent siblings)
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This allows us to make huge PTEs at shifts other than the hstate shift,
which will be necessary for high-granularity mappings.
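
For illustration only (make_huge_pte_with_shift is static to
mm/hugetlb.c, and example_small_leaf below is hypothetical): the point
of making the shift a parameter is that a leaf can now be built for
something smaller than the hstate size, e.g. a single 4K piece of a
huge page:

/*
 * Illustrative sketch only: build a PAGE_SIZE leaf for one 4K piece of
 * a huge page, something the hstate-derived make_huge_pte() cannot
 * express.
 */
static pte_t example_small_leaf(struct vm_area_struct *vma,
				struct page *subpage, int writable)
{
	return make_huge_pte_with_shift(vma, subpage, writable, PAGE_SHIFT);
}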

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f74183acc521..ed1d806020de 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5110,11 +5110,11 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.pagesize = hugetlb_vm_op_pagesize,
 };
 
-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
-				int writable)
+static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
+				      struct page *page, int writable,
+				      int shift)
 {
 	pte_t entry;
-	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
 	if (writable) {
 		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
@@ -5128,6 +5128,14 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
 	return entry;
 }
 
+static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
+			   int writable)
+{
+	unsigned int shift = huge_page_shift(hstate_vma(vma));
+
+	return make_huge_pte_with_shift(vma, page, writable, shift);
+}
+
 static void set_huge_ptep_writable(struct vm_area_struct *vma,
 				   unsigned long address, pte_t *ptep)
 {
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (14 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 15/46] hugetlb: add make_huge_pte_with_shift James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-22 21:17   ` Mina Almasry
  2023-02-28 23:02   ` Mike Kravetz
  2023-02-18  0:27 ` [PATCH v2 17/46] hugetlbfs: do a full walk to check if vma maps a page James Houghton
                   ` (30 subsequent siblings)
  46 siblings, 2 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This is a simple change: don't create a "huge" PTE if we are making a
regular, PAGE_SIZE PTE. All architectures that want to implement HGM
likely need to be changed in a similar way if they implement their own
version of arch_make_huge_pte.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 726d581158b1..b767b6889dea 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -899,7 +899,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
 static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
 				       vm_flags_t flags)
 {
-	return pte_mkhuge(entry);
+	return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
 }
 #endif
 
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 17/46] hugetlbfs: do a full walk to check if vma maps a page
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (15 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-22 15:46   ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 18/46] hugetlb: add HGM support to __unmap_hugepage_range James Houghton
                   ` (29 subsequent siblings)
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Because it is safe to do so, do a full high-granularity page table walk
to check if the page is mapped.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cfd09f95551b..c0ee69f0418e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -386,17 +386,24 @@ static void hugetlb_delete_from_page_cache(struct folio *folio)
 static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
 				unsigned long addr, struct page *page)
 {
-	pte_t *ptep, pte;
+	pte_t pte;
+	struct hugetlb_pte hpte;
 
-	ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
-	if (!ptep)
+	if (hugetlb_full_walk(&hpte, vma, addr))
 		return false;
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(hpte.ptep);
 	if (huge_pte_none(pte) || !pte_present(pte))
 		return false;
 
-	if (pte_page(pte) == page)
+	if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte)))
+		/*
+		 * We raced with someone splitting us, and the only case
+		 * where this is impossible is when the pte was none.
+		 */
+		return false;
+
+	if (compound_head(pte_page(pte)) == page)
 		return true;
 
 	return false;
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 18/46] hugetlb: add HGM support to __unmap_hugepage_range
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (16 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 17/46] hugetlbfs: do a full walk to check if vma maps a page James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 19/46] hugetlb: add HGM support to hugetlb_change_protection James Houghton
                   ` (28 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Enlighten __unmap_hugepage_range to deal with high-granularity mappings.
This doesn't change its API; it still must be called with hugepage
alignment, but it will correctly unmap hugepages that have been mapped
at high granularity.

Eventually, functionality here can be expanded to allow users to call
MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is
not done here.

Introduce hugetlb_remove_rmap to properly decrement mapcount for
high-granularity-mapped HugeTLB pages.
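
To make the accounting concrete (an illustrative sketch, not part of
the patch; example_rmap_drops is hypothetical): unmapping one
hugetlb_pte drops either the compound mapcount once, or one subpage
mapcount per base page covered by that PTE:

/*
 * Illustrative sketch only: how many mapcount decrements one unmapped
 * hugetlb_pte accounts for under the scheme described above.
 */
static unsigned long example_rmap_drops(const struct hugetlb_pte *hpte,
					struct hstate *h)
{
	if (hpte->shift == huge_page_shift(h))
		return 1;	/* one compound_mapcount decrement */

	/* one decrement per base page covered by this PTE */
	return hugetlb_pte_size(hpte) / PAGE_SIZE;
}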

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b46617207c93..31267471760e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -598,9 +598,9 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
+#define tlb_remove_huge_tlb_entry(tlb, hpte, address)	\
 	do {							\
-		unsigned long _sz = huge_page_size(h);		\
+		unsigned long _sz = hugetlb_pte_size(&hpte);	\
 		if (_sz >= P4D_SIZE)				\
 			tlb_flush_p4d_range(tlb, address, _sz);	\
 		else if (_sz >= PUD_SIZE)			\
@@ -609,7 +609,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 			tlb_flush_pmd_range(tlb, address, _sz);	\
 		else						\
 			tlb_flush_pte_range(tlb, address, _sz);	\
-		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+		__tlb_remove_tlb_entry(tlb, hpte.ptep, address);\
 	} while (0)
 
 /**
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b767b6889dea..1a1a71868dfd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -160,6 +160,9 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 						long min_hpages);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
+void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
+			 struct hstate *h, struct vm_area_struct *vma);
+
 void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1d806020de..ecf1a28dbaaa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -120,6 +120,28 @@ enum hugetlb_level hpage_size_to_level(unsigned long sz)
 	return HUGETLB_LEVEL_PGD;
 }
 
+void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
+			 struct hstate *h, struct vm_area_struct *vma)
+{
+	struct page *hpage = compound_head(subpage);
+
+	if (shift == huge_page_shift(h)) {
+		VM_BUG_ON_PAGE(subpage != hpage, subpage);
+		page_remove_rmap(hpage, vma, true);
+	} else {
+		unsigned long nr_subpages = 1UL << (shift - PAGE_SHIFT);
+		struct page *final_page = &subpage[nr_subpages];
+
+		VM_BUG_ON_PAGE(HPageVmemmapOptimized(hpage), hpage);
+		/*
+		 * Decrement the mapcount on each page that is getting
+		 * unmapped.
+		 */
+		for (; subpage < final_page; ++subpage)
+			page_remove_rmap(subpage, vma, false);
+	}
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -5466,10 +5488,10 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *ptep;
+	struct hugetlb_pte hpte;
 	pte_t pte;
 	spinlock_t *ptl;
-	struct page *page;
+	struct page *hpage, *subpage;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	unsigned long last_addr_mask;
@@ -5479,35 +5501,33 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
-	/*
-	 * This is a hugetlb vma, all the pte entries should point
-	 * to huge page.
-	 */
-	tlb_change_page_size(tlb, sz);
 	tlb_start_vma(tlb, vma);
 
 	last_addr_mask = hugetlb_mask_last_page(h);
 	address = start;
-	for (; address < end; address += sz) {
-		ptep = hugetlb_walk(vma, address, sz);
-		if (!ptep) {
-			address |= last_addr_mask;
+
+	while (address < end) {
+		if (hugetlb_full_walk(&hpte, vma, address)) {
+			address = (address | last_addr_mask) + sz;
 			continue;
 		}
 
-		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, address, ptep)) {
+		ptl = hugetlb_pte_lock(&hpte);
+		if (hugetlb_pte_size(&hpte) == sz &&
+		    huge_pmd_unshare(mm, vma, address, hpte.ptep)) {
 			spin_unlock(ptl);
 			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
 			force_flush = true;
 			address |= last_addr_mask;
+			address += sz;
 			continue;
 		}
 
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(hpte.ptep);
+
 		if (huge_pte_none(pte)) {
 			spin_unlock(ptl);
-			continue;
+			goto next_hpte;
 		}
 
 		/*
@@ -5523,24 +5543,35 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 			 */
 			if (pte_swp_uffd_wp_any(pte) &&
 			    !(zap_flags & ZAP_FLAG_DROP_MARKER))
-				set_huge_pte_at(mm, address, ptep,
+				set_huge_pte_at(mm, address, hpte.ptep,
 						make_pte_marker(PTE_MARKER_UFFD_WP));
 			else
-				huge_pte_clear(mm, address, ptep, sz);
+				huge_pte_clear(mm, address, hpte.ptep,
+						hugetlb_pte_size(&hpte));
+			spin_unlock(ptl);
+			goto next_hpte;
+		}
+
+		if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) {
+			/*
+			 * We raced with someone splitting out from under us.
+			 * Retry the walk.
+			 */
 			spin_unlock(ptl);
 			continue;
 		}
 
-		page = pte_page(pte);
+		subpage = pte_page(pte);
+		hpage = compound_head(subpage);
 		/*
 		 * If a reference page is supplied, it is because a specific
 		 * page is being unmapped, not a range. Ensure the page we
 		 * are about to unmap is the actual page of interest.
 		 */
 		if (ref_page) {
-			if (page != ref_page) {
+			if (hpage != ref_page) {
 				spin_unlock(ptl);
-				continue;
+				goto next_hpte;
 			}
 			/*
 			 * Mark the VMA as having unmapped its page so that
@@ -5550,25 +5581,32 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 			set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
 		}
 
-		pte = huge_ptep_get_and_clear(mm, address, ptep);
-		tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
+		pte = huge_ptep_get_and_clear(mm, address, hpte.ptep);
+		tlb_change_page_size(tlb, hugetlb_pte_size(&hpte));
+		tlb_remove_huge_tlb_entry(tlb, hpte, address);
 		if (huge_pte_dirty(pte))
-			set_page_dirty(page);
+			set_page_dirty(hpage);
 		/* Leave a uffd-wp pte marker if needed */
 		if (huge_pte_uffd_wp(pte) &&
 		    !(zap_flags & ZAP_FLAG_DROP_MARKER))
-			set_huge_pte_at(mm, address, ptep,
+			set_huge_pte_at(mm, address, hpte.ptep,
 					make_pte_marker(PTE_MARKER_UFFD_WP));
-		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, vma, true);
+		hugetlb_count_sub(hugetlb_pte_size(&hpte)/PAGE_SIZE, mm);
+		hugetlb_remove_rmap(subpage, hpte.shift, h, vma);
 
 		spin_unlock(ptl);
-		tlb_remove_page_size(tlb, page, huge_page_size(h));
 		/*
-		 * Bail out after unmapping reference page if supplied
+		 * Lower the reference count on the head page.
+		 */
+		tlb_remove_page_size(tlb, hpage, sz);
+		/*
+		 * Bail out after unmapping reference page if supplied,
+		 * and there's only one PTE mapping this page.
 		 */
-		if (ref_page)
+		if (ref_page && hugetlb_pte_size(&hpte) == sz)
 			break;
+next_hpte:
+		address += hugetlb_pte_size(&hpte);
 	}
 	tlb_end_vma(tlb, vma);
 
@@ -5846,7 +5884,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Break COW or unshare */
 		huge_ptep_clear_flush(vma, haddr, ptep);
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
-		page_remove_rmap(old_page, vma, true);
+		hugetlb_remove_rmap(old_page, huge_page_shift(h), h, vma);
 		hugepage_add_new_anon_rmap(new_folio, vma, haddr);
 		set_huge_pte_at(mm, haddr, ptep,
 				make_huge_pte(vma, &new_folio->page, !unshare));
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 19/46] hugetlb: add HGM support to hugetlb_change_protection
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (17 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 18/46] hugetlb: add HGM support to __unmap_hugepage_range James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 20/46] hugetlb: add HGM support to follow_hugetlb_page James Houghton
                   ` (27 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

The main change here is to do a high-granularity walk and pull the
shift from the walk (not from the hstate).

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ecf1a28dbaaa..7321c6602d6f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6900,15 +6900,15 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
-	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
-	long pages = 0, psize = huge_page_size(h);
+	long base_pages = 0, psize = huge_page_size(h);
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	struct hugetlb_pte hpte;
 
 	/*
 	 * In the case of shared PMDs, the area to flush could be beyond
@@ -6926,39 +6926,43 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	last_addr_mask = hugetlb_mask_last_page(h);
-	for (; address < end; address += psize) {
+	while (address < end) {
 		spinlock_t *ptl;
-		ptep = hugetlb_walk(vma, address, psize);
-		if (!ptep) {
+		if (hugetlb_full_walk(&hpte, vma, address)) {
 			if (!uffd_wp) {
-				address |= last_addr_mask;
+				address = (address | last_addr_mask) + psize;
 				continue;
 			}
 			/*
 			 * Userfaultfd wr-protect requires pgtable
 			 * pre-allocations to install pte markers.
+			 *
+			 * Use hugetlb_full_walk_alloc to allocate
+			 * the hstate-level PTE.
 			 */
-			ptep = huge_pte_alloc(mm, vma, address, psize);
-			if (!ptep) {
-				pages = -ENOMEM;
+			if (hugetlb_full_walk_alloc(&hpte, vma,
+						    address, psize)) {
+				base_pages = -ENOMEM;
 				break;
 			}
 		}
-		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, address, ptep)) {
+
+		ptl = hugetlb_pte_lock(&hpte);
+		if (hugetlb_pte_size(&hpte) == psize &&
+		    huge_pmd_unshare(mm, vma, address, hpte.ptep)) {
 			/*
 			 * When uffd-wp is enabled on the vma, unshare
 			 * shouldn't happen at all.  Warn about it if it
 			 * happened due to some reason.
 			 */
 			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
-			pages++;
+			base_pages += psize / PAGE_SIZE;
 			spin_unlock(ptl);
 			shared_pmd = true;
-			address |= last_addr_mask;
+			address = (address | last_addr_mask) + psize;
 			continue;
 		}
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(hpte.ptep);
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			/* Nothing to do. */
 		} else if (unlikely(is_hugetlb_entry_migration(pte))) {
@@ -6974,7 +6978,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 					entry = make_readable_migration_entry(
 								swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
-				pages++;
+				base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
 			}
 
 			if (uffd_wp)
@@ -6982,34 +6986,49 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 			else if (uffd_wp_resolve)
 				newpte = pte_swp_clear_uffd_wp(newpte);
 			if (!pte_same(pte, newpte))
-				set_huge_pte_at(mm, address, ptep, newpte);
+				set_huge_pte_at(mm, address, hpte.ptep, newpte);
 		} else if (unlikely(is_pte_marker(pte))) {
 			/* No other markers apply for now. */
 			WARN_ON_ONCE(!pte_marker_uffd_wp(pte));
 			if (uffd_wp_resolve)
 				/* Safe to modify directly (non-present->none). */
-				huge_pte_clear(mm, address, ptep, psize);
+				huge_pte_clear(mm, address, hpte.ptep,
+						hugetlb_pte_size(&hpte));
 		} else if (!huge_pte_none(pte)) {
 			pte_t old_pte;
-			unsigned int shift = huge_page_shift(hstate_vma(vma));
+			unsigned int shift = hpte.shift;
+
+			if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) {
+				/*
+				 * Someone split the PTE from under us, so retry
+				 * the walk.
+				 */
+				spin_unlock(ptl);
+				continue;
+			}
 
-			old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
+			old_pte = huge_ptep_modify_prot_start(
+					vma, address, hpte.ptep);
 			pte = huge_pte_modify(old_pte, newprot);
-			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+			pte = arch_make_huge_pte(
+					pte, shift, vma->vm_flags);
 			if (uffd_wp)
 				pte = huge_pte_mkuffd_wp(pte);
 			else if (uffd_wp_resolve)
 				pte = huge_pte_clear_uffd_wp(pte);
-			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
-			pages++;
+			huge_ptep_modify_prot_commit(
+					vma, address, hpte.ptep,
+					old_pte, pte);
+			base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
 		} else {
 			/* None pte */
 			if (unlikely(uffd_wp))
 				/* Safe to modify directly (none->non-present). */
-				set_huge_pte_at(mm, address, ptep,
+				set_huge_pte_at(mm, address, hpte.ptep,
 						make_pte_marker(PTE_MARKER_UFFD_WP));
 		}
 		spin_unlock(ptl);
+		address += hugetlb_pte_size(&hpte);
 	}
 	/*
 	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
@@ -7032,7 +7051,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 	hugetlb_vma_unlock_write(vma);
 	mmu_notifier_invalidate_range_end(&range);
 
-	return pages > 0 ? (pages << h->order) : pages;
+	return base_pages;
 }
 
 /* Return true if reservation was successful, false otherwise.  */
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 20/46] hugetlb: add HGM support to follow_hugetlb_page
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (18 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 19/46] hugetlb: add HGM support to hugetlb_change_protection James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 21/46] hugetlb: add HGM support to hugetlb_follow_page_mask James Houghton
                   ` (26 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Enable high-granularity mapping support in GUP.

For clarity: pfn_offset is the offset (in PAGE_SIZE units) of vaddr
within the region that hpte maps.
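
A small worked example (illustrative only; example_pfn_offset is
hypothetical): pfn_offset is derived from the PTE's own mask, so for a
2M (PMD-sized) hpte, a vaddr that is 20K past the start of that 2M
region gives pfn_offset == 5:

/*
 * Illustrative sketch only: pfn_offset as computed in the code below.
 */
static unsigned long example_pfn_offset(const struct hugetlb_pte *hpte,
					unsigned long vaddr)
{
	return (vaddr & ~hugetlb_pte_mask(hpte)) >> PAGE_SHIFT;
}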

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7321c6602d6f..c26b040f4fb5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6634,11 +6634,9 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
 }
 
 static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
-					       unsigned int flags, pte_t *pte,
+					       unsigned int flags, pte_t pteval,
 					       bool *unshare)
 {
-	pte_t pteval = huge_ptep_get(pte);
-
 	*unshare = false;
 	if (is_swap_pte(pteval))
 		return true;
@@ -6713,11 +6711,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int err = -EFAULT, refs;
 
 	while (vaddr < vma->vm_end && remainder) {
-		pte_t *pte;
+		pte_t *ptep, pte;
 		spinlock_t *ptl = NULL;
 		bool unshare = false;
 		int absent;
-		struct page *page;
+		unsigned long pages_per_hpte;
+		struct page *page, *subpage;
+		struct hugetlb_pte hpte;
 
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -6734,13 +6734,19 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
 		 *
-		 * Note that page table lock is not held when pte is null.
+		 * hugetlb_full_walk will mask the address appropriately.
+		 *
+		 * Note that page table lock is not held when ptep is null.
 		 */
-		pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
-				   huge_page_size(h));
-		if (pte)
-			ptl = huge_pte_lock(h, mm, pte);
-		absent = !pte || huge_pte_none(huge_ptep_get(pte));
+		if (hugetlb_full_walk(&hpte, vma, vaddr)) {
+			ptep = NULL;
+			absent = true;
+		} else {
+			ptl = hugetlb_pte_lock(&hpte);
+			ptep = hpte.ptep;
+			pte = huge_ptep_get(ptep);
+			absent = huge_pte_none(pte);
+		}
 
 		/*
 		 * When coredumping, it suits get_dump_page if we just return
@@ -6751,13 +6757,21 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (absent && (flags & FOLL_DUMP) &&
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
-			if (pte)
+			if (ptep)
 				spin_unlock(ptl);
 			hugetlb_vma_unlock_read(vma);
 			remainder = 0;
 			break;
 		}
 
+		if (!absent && pte_present(pte) &&
+				!hugetlb_pte_present_leaf(&hpte, pte)) {
+			/* We raced with someone splitting the PTE, so retry. */
+			spin_unlock(ptl);
+			hugetlb_vma_unlock_read(vma);
+			continue;
+		}
+
 		/*
 		 * We need call hugetlb_fault for both hugepages under migration
 		 * (in which case hugetlb_fault waits for the migration,) and
@@ -6773,7 +6787,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			vm_fault_t ret;
 			unsigned int fault_flags = 0;
 
-			if (pte)
+			if (ptep)
 				spin_unlock(ptl);
 			hugetlb_vma_unlock_read(vma);
 
@@ -6822,8 +6836,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 		}
 
-		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
-		page = pte_page(huge_ptep_get(pte));
+		pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
+		subpage = pte_page(pte);
+		pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
+		page = compound_head(subpage);
 
 		VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 			       !PageAnonExclusive(page), page);
@@ -6833,22 +6849,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * and skip the same_page loop below.
 		 */
 		if (!pages && !vmas && !pfn_offset &&
-		    (vaddr + huge_page_size(h) < vma->vm_end) &&
-		    (remainder >= pages_per_huge_page(h))) {
-			vaddr += huge_page_size(h);
-			remainder -= pages_per_huge_page(h);
-			i += pages_per_huge_page(h);
+		    (vaddr + hugetlb_pte_size(&hpte) < vma->vm_end) &&
+		    (remainder >= pages_per_hpte)) {
+			vaddr += hugetlb_pte_size(&hpte);
+			remainder -= pages_per_hpte;
+			i += pages_per_hpte;
 			spin_unlock(ptl);
 			hugetlb_vma_unlock_read(vma);
 			continue;
 		}
 
 		/* vaddr may not be aligned to PAGE_SIZE */
-		refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
+		refs = min3(pages_per_hpte - pfn_offset, remainder,
 		    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
 
 		if (pages || vmas)
-			record_subpages_vmas(nth_page(page, pfn_offset),
+			record_subpages_vmas(nth_page(subpage, pfn_offset),
 					     vma, refs,
 					     likely(pages) ? pages + i : NULL,
 					     vmas ? vmas + i : NULL);
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 21/46] hugetlb: add HGM support to hugetlb_follow_page_mask
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (19 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 20/46] hugetlb: add HGM support to follow_hugetlb_page James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 22/46] hugetlb: add HGM support to copy_hugetlb_page_range James Houghton
                   ` (25 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

The change here is very simple: do a high-granularity walk.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c26b040f4fb5..693332b7e186 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6655,11 +6655,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 				unsigned long address, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & huge_page_mask(h);
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	pte_t *pte, entry;
+	pte_t entry;
+	struct hugetlb_pte hpte;
 
 	/*
 	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
@@ -6669,13 +6668,24 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 		return NULL;
 
 	hugetlb_vma_lock_read(vma);
-	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
-	if (!pte)
+
+	if (hugetlb_full_walk(&hpte, vma, address))
 		goto out_unlock;
 
-	ptl = huge_pte_lock(h, mm, pte);
-	entry = huge_ptep_get(pte);
+retry:
+	ptl = hugetlb_pte_lock(&hpte);
+	entry = huge_ptep_get(hpte.ptep);
 	if (pte_present(entry)) {
+		if (unlikely(!hugetlb_pte_present_leaf(&hpte, entry))) {
+			/*
+			 * We raced with someone splitting from under us.
+			 * Keep walking to get to the real leaf.
+			 */
+			spin_unlock(ptl);
+			hugetlb_full_walk_continue(&hpte, vma, address);
+			goto retry;
+		}
+
 		page = pte_page(entry) +
 				((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
 		/*
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 22/46] hugetlb: add HGM support to copy_hugetlb_page_range
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (20 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 21/46] hugetlb: add HGM support to hugetlb_follow_page_mask James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-24 17:39   ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 23/46] hugetlb: add HGM support to move_hugetlb_page_tables James Houghton
                   ` (24 subsequent siblings)
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This allows fork() to work with high-granularity mappings. The page
table structure is copied such that partially mapped regions will remain
partially mapped in the same way for the new process.

A page's reference count is incremented for *each* portion of it that
is mapped in the page table. For example, if you have a PMD-mapped 1G
page, the reference count will be incremented by 512.

mapcount is handled similarly to THPs: if you're completely mapping a
hugepage, then the compound_mapcount is incremented. If you're mapping
part of it, the subpages that are getting mapped will have their
mapcounts incremented.
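
To make the 512 in the example above concrete (illustrative sketch, not
part of the patch; example_entries_to_map is hypothetical): the number
of page table entries, and therefore the number of references taken, to
map an entire hugepage at a given mapping shift is just the size ratio:

/*
 * Illustrative sketch only: entries (== references taken) needed to
 * map a whole hugepage at @mapping_shift. For a 1G page mapped with
 * PMDs: 1UL << (30 - 21) == 512.
 */
static unsigned long example_entries_to_map(struct hstate *h,
					    unsigned int mapping_shift)
{
	return 1UL << (huge_page_shift(h) - mapping_shift);
}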

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1a1a71868dfd..2fe1eb6897d4 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -162,6 +162,8 @@ void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
 			 struct hstate *h, struct vm_area_struct *vma);
+void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
+			   struct hstate *h, struct vm_area_struct *vma);
 
 void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 693332b7e186..210c6f2b16a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -141,6 +141,37 @@ void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
 			page_remove_rmap(subpage, vma, false);
 	}
 }
+/*
+ * hugetlb_add_file_rmap() - increment the mapcounts for file-backed hugetlb
+ * pages appropriately.
+ *
+ * For pages that are being mapped with their hstate-level PTE (e.g., a 1G page
+ * being mapped with a 1G PUD), then we increment the compound_mapcount for the
+ * head page.
+ *
+ * For pages that are being mapped with high-granularity, we increment the
+ * mapcounts for the individual subpages that are getting mapped.
+ */
+void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
+			   struct hstate *h, struct vm_area_struct *vma)
+{
+	struct page *hpage = compound_head(subpage);
+
+	if (shift == huge_page_shift(h)) {
+		VM_BUG_ON_PAGE(subpage != hpage, subpage);
+		page_add_file_rmap(hpage, vma, true);
+	} else {
+		unsigned long nr_subpages = 1UL << (shift - PAGE_SHIFT);
+		struct page *final_page = &subpage[nr_subpages];
+
+		VM_BUG_ON_PAGE(HPageVmemmapOptimized(hpage), hpage);
+		/*
+		 * Increment the mapcount on each page that is getting mapped.
+		 */
+		for (; subpage < final_page; ++subpage)
+			page_add_file_rmap(subpage, vma, false);
+	}
+}
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
@@ -5210,7 +5241,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *src_vma)
 {
 	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
+	struct hugetlb_pte src_hpte, dst_hpte;
+	struct page *ptepage, *hpage;
 	unsigned long addr;
 	bool cow = is_cow_mapping(src_vma->vm_flags);
 	struct hstate *h = hstate_vma(src_vma);
@@ -5238,18 +5270,24 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	last_addr_mask = hugetlb_mask_last_page(h);
-	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
+	addr = src_vma->vm_start;
+	while (addr < src_vma->vm_end) {
 		spinlock_t *src_ptl, *dst_ptl;
-		src_pte = hugetlb_walk(src_vma, addr, sz);
-		if (!src_pte) {
-			addr |= last_addr_mask;
+		unsigned long hpte_sz;
+
+		if (hugetlb_full_walk(&src_hpte, src_vma, addr)) {
+			addr = (addr | last_addr_mask) + sz;
 			continue;
 		}
-		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
-		if (!dst_pte) {
-			ret = -ENOMEM;
+		ret = hugetlb_full_walk_alloc(&dst_hpte, dst_vma, addr,
+				hugetlb_pte_size(&src_hpte));
+		if (ret)
 			break;
-		}
+
+		src_pte = src_hpte.ptep;
+		dst_pte = dst_hpte.ptep;
+
+		hpte_sz = hugetlb_pte_size(&src_hpte);
 
 		/*
 		 * If the pagetables are shared don't copy or take references.
@@ -5259,13 +5297,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		 * another vma. So page_count of ptep page is checked instead
 		 * to reliably determine whether pte is shared.
 		 */
-		if (page_count(virt_to_page(dst_pte)) > 1) {
-			addr |= last_addr_mask;
+		if (hugetlb_pte_size(&dst_hpte) == sz &&
+		    page_count(virt_to_page(dst_pte)) > 1) {
+			addr = (addr | last_addr_mask) + sz;
 			continue;
 		}
 
-		dst_ptl = huge_pte_lock(h, dst, dst_pte);
-		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
+		dst_ptl = hugetlb_pte_lock(&dst_hpte);
+		src_ptl = hugetlb_pte_lockptr(&src_hpte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 again:
@@ -5309,10 +5348,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 */
 			if (userfaultfd_wp(dst_vma))
 				set_huge_pte_at(dst, addr, dst_pte, entry);
+		} else if (!hugetlb_pte_present_leaf(&src_hpte, entry)) {
+			/* Retry the walk. */
+			spin_unlock(src_ptl);
+			spin_unlock(dst_ptl);
+			continue;
 		} else {
-			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
-			get_page(ptepage);
+			hpage = compound_head(ptepage);
+			get_page(hpage);
 
 			/*
 			 * Failing to duplicate the anon rmap is a rare case
@@ -5324,13 +5368,34 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 * need to be without the pgtable locks since we could
 			 * sleep during the process.
 			 */
-			if (!PageAnon(ptepage)) {
-				page_add_file_rmap(ptepage, src_vma, true);
-			} else if (page_try_dup_anon_rmap(ptepage, true,
+			if (!PageAnon(hpage)) {
+				hugetlb_add_file_rmap(ptepage,
+						src_hpte.shift, h, src_vma);
+			}
+			/*
+			 * It is currently impossible to get anonymous HugeTLB
+			 * high-granularity mappings, so we use 'hpage' here.
+			 *
+			 * This will need to be changed when HGM support for
+			 * anon mappings is added.
+			 */
+			else if (page_try_dup_anon_rmap(hpage, true,
 							  src_vma)) {
 				pte_t src_pte_old = entry;
 				struct folio *new_folio;
 
+				/*
+				 * If we are mapped at high granularity, we
+				 * may end up allocating lots and lots of
+				 * hugepages when we only need one. Bail out
+				 * now.
+				 */
+				if (hugetlb_pte_size(&src_hpte) != sz) {
+					put_page(hpage);
+					ret = -EINVAL;
+					break;
+				}
+
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				/* Do not use reserve as it's private owned */
@@ -5342,7 +5407,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				}
 				copy_user_huge_page(&new_folio->page, ptepage, addr, dst_vma,
 						    npages);
-				put_page(ptepage);
+				put_page(hpage);
 
 				/* Install the new hugetlb folio if src pte stable */
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
@@ -5360,6 +5425,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				hugetlb_install_folio(dst_vma, dst_pte, addr, new_folio);
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
+				addr += hugetlb_pte_size(&src_hpte);
 				continue;
 			}
 
@@ -5376,10 +5442,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 
 			set_huge_pte_at(dst, addr, dst_pte, entry);
-			hugetlb_count_add(npages, dst);
+			hugetlb_count_add(
+					hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
+					dst);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
+		addr += hugetlb_pte_size(&src_hpte);
 	}
 
 	if (cow) {
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 23/46] hugetlb: add HGM support to move_hugetlb_page_tables
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (21 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 22/46] hugetlb: add HGM support to copy_hugetlb_page_range James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 24/46] hugetlb: add HGM support to hugetlb_fault and hugetlb_no_page James Houghton
                   ` (23 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This is very similar to the support that was added to
copy_hugetlb_page_range. We simply do a high-granularity walk now, and
most of the rest of the code stays the same.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 210c6f2b16a5..6c4678b7a07d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5461,16 +5461,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	return ret;
 }
 
-static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
-			  unsigned long new_addr, pte_t *src_pte, pte_t *dst_pte)
+static void move_hugetlb_pte(struct vm_area_struct *vma, unsigned long old_addr,
+			     unsigned long new_addr, struct hugetlb_pte *src_hpte,
+			     struct hugetlb_pte *dst_hpte)
 {
-	struct hstate *h = hstate_vma(vma);
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *src_ptl, *dst_ptl;
 	pte_t pte;
 
-	dst_ptl = huge_pte_lock(h, mm, dst_pte);
-	src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
+	dst_ptl = hugetlb_pte_lock(dst_hpte);
+	src_ptl = hugetlb_pte_lockptr(src_hpte);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst ptlocks
@@ -5479,8 +5479,8 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 	if (src_ptl != dst_ptl)
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
-	pte = huge_ptep_get_and_clear(mm, old_addr, src_pte);
-	set_huge_pte_at(mm, new_addr, dst_pte, pte);
+	pte = huge_ptep_get_and_clear(mm, old_addr, src_hpte->ptep);
+	set_huge_pte_at(mm, new_addr, dst_hpte->ptep, pte);
 
 	if (src_ptl != dst_ptl)
 		spin_unlock(src_ptl);
@@ -5498,9 +5498,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long old_end = old_addr + len;
 	unsigned long last_addr_mask;
-	pte_t *src_pte, *dst_pte;
 	struct mmu_notifier_range range;
 	bool shared_pmd = false;
+	struct hugetlb_pte src_hpte, dst_hpte;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr,
 				old_end);
@@ -5516,28 +5516,35 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	/* Prevent race with file truncation */
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(mapping);
-	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
-		src_pte = hugetlb_walk(vma, old_addr, sz);
-		if (!src_pte) {
-			old_addr |= last_addr_mask;
-			new_addr |= last_addr_mask;
+	while (old_addr < old_end) {
+		if (hugetlb_full_walk(&src_hpte, vma, old_addr)) {
+			/* The hstate-level PTE wasn't allocated. */
+			old_addr = (old_addr | last_addr_mask) + sz;
+			new_addr = (new_addr | last_addr_mask) + sz;
 			continue;
 		}
-		if (huge_pte_none(huge_ptep_get(src_pte)))
+
+		if (huge_pte_none(huge_ptep_get(src_hpte.ptep))) {
+			old_addr += hugetlb_pte_size(&src_hpte);
+			new_addr += hugetlb_pte_size(&src_hpte);
 			continue;
+		}
 
-		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
+		if (hugetlb_pte_size(&src_hpte) == sz &&
+		    huge_pmd_unshare(mm, vma, old_addr, src_hpte.ptep)) {
 			shared_pmd = true;
-			old_addr |= last_addr_mask;
-			new_addr |= last_addr_mask;
+			old_addr = (old_addr | last_addr_mask) + sz;
+			new_addr = (new_addr | last_addr_mask) + sz;
 			continue;
 		}
 
-		dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
-		if (!dst_pte)
+		if (hugetlb_full_walk_alloc(&dst_hpte, new_vma, new_addr,
+					hugetlb_pte_size(&src_hpte)))
 			break;
 
-		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte);
+		move_hugetlb_pte(vma, old_addr, new_addr, &src_hpte, &dst_hpte);
+		old_addr += hugetlb_pte_size(&src_hpte);
+		new_addr += hugetlb_pte_size(&src_hpte);
 	}
 
 	if (shared_pmd)
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 24/46] hugetlb: add HGM support to hugetlb_fault and hugetlb_no_page
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (22 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 23/46] hugetlb: add HGM support to move_hugetlb_page_tables James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 25/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
                   ` (22 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Update the page fault handler to support high-granularity page faults.
While handling a page fault on a partially-mapped HugeTLB page, if the
PTE we find with hugetlb_pte_walk is none, then we will replace it with
a leaf-level PTE to map the page. To give some examples:
1. For a completely unmapped 1G page, it will be mapped with a 1G PUD.
2. For a 1G page that has its first 512M mapped, any faults on the
   unmapped sections will result in 2M PMDs mapping each unmapped 2M
   section.
3. For a 1G page that has only its first 4K mapped, a page fault on its
   second 4K section will get a 4K PTE to map it.

hugetlb_fault will not create high-granularity mappings on its own; it
can only end up mapping at high granularity when high-granularity
mappings have already been created via UFFDIO_CONTINUE.

This commit does not handle hugetlb_wp right now, nor does it handle
HugeTLB page migration or swap entries.

The BUG_ON in huge_pte_alloc is removed, as it is no longer valid when
HGM is possible: HGM can be disabled if the VMA lock cannot be allocated
after a VMA is split, yet high-granularity mappings may still exist.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c4678b7a07d..86cd51beb02c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -173,6 +173,18 @@ void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
 	}
 }
 
+/*
+ * Find the subpage that corresponds to `addr` in `folio`.
+ */
+static struct page *hugetlb_find_subpage(struct hstate *h, struct folio *folio,
+					 unsigned long addr)
+{
+	size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
+
+	BUG_ON(idx >= pages_per_huge_page(h));
+	return folio_page(folio, idx);
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -6072,14 +6084,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
  * Recheck pte with pgtable lock.  Returns true if pte didn't change, or
  * false if pte changed or is changing.
  */
-static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
-			       pte_t *ptep, pte_t old_pte)
+static bool hugetlb_pte_stable(struct hstate *h, struct hugetlb_pte *hpte,
+			       pte_t old_pte)
 {
 	spinlock_t *ptl;
 	bool same;
 
-	ptl = huge_pte_lock(h, mm, ptep);
-	same = pte_same(huge_ptep_get(ptep), old_pte);
+	ptl = hugetlb_pte_lock(hpte);
+	same = pte_same(huge_ptep_get(hpte->ptep), old_pte);
 	spin_unlock(ptl);
 
 	return same;
@@ -6088,7 +6100,7 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
-			unsigned long address, pte_t *ptep,
+			unsigned long address, struct hugetlb_pte *hpte,
 			pte_t old_pte, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
@@ -6096,10 +6108,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	int anon_rmap = 0;
 	unsigned long size;
 	struct folio *folio;
+	struct page *subpage;
 	pte_t new_pte;
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_folio, new_pagecache_folio = false;
+	unsigned long haddr_hgm = address & hugetlb_pte_mask(hpte);
 	u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
 
 	/*
@@ -6143,7 +6157,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * never happen on the page after UFFDIO_COPY has
 			 * correctly installed the page and returned.
 			 */
-			if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+			if (!hugetlb_pte_stable(h, hpte, old_pte)) {
 				ret = 0;
 				goto out;
 			}
@@ -6167,7 +6181,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * here.  Before returning error, get ptl and make
 			 * sure there really is no pte entry.
 			 */
-			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+			if (hugetlb_pte_stable(h, hpte, old_pte))
 				ret = vmf_error(PTR_ERR(folio));
 			else
 				ret = 0;
@@ -6217,7 +6231,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			folio_unlock(folio);
 			folio_put(folio);
 			/* See comment in userfaultfd_missing() block above */
-			if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+			if (!hugetlb_pte_stable(h, hpte, old_pte)) {
 				ret = 0;
 				goto out;
 			}
@@ -6242,30 +6256,46 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		vma_end_reservation(h, vma, haddr);
 	}
 
-	ptl = huge_pte_lock(h, mm, ptep);
+	ptl = hugetlb_pte_lock(hpte);
 	ret = 0;
-	/* If pte changed from under us, retry */
-	if (!pte_same(huge_ptep_get(ptep), old_pte))
+	/*
+	 * If pte changed from under us, retry.
+	 *
+	 * When dealing with high-granularity-mapped PTEs, it's possible that
+	 * a non-contiguous PTE within our contiguous PTE group gets populated,
+	 * in which case, we need to retry here. This is NOT caught here, and
+	 * will need to be addressed when HGM is supported for architectures
+	 * that support contiguous PTEs.
+	 */
+	if (!pte_same(huge_ptep_get(hpte->ptep), old_pte))
 		goto backout;
 
-	if (anon_rmap)
+	subpage = hugetlb_find_subpage(h, folio, haddr_hgm);
+
+	if (anon_rmap) {
+		VM_BUG_ON(&folio->page != subpage);
 		hugepage_add_new_anon_rmap(folio, vma, haddr);
+	}
 	else
-		page_add_file_rmap(&folio->page, vma, true);
-	new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
-				&& (vma->vm_flags & VM_SHARED)));
+		hugetlb_add_file_rmap(subpage, hpte->shift, h, vma);
+
+	new_pte = make_huge_pte_with_shift(vma, subpage,
+			((vma->vm_flags & VM_WRITE)
+			 && (vma->vm_flags & VM_SHARED)),
+			hpte->shift);
 	/*
 	 * If this pte was previously wr-protected, keep it wr-protected even
 	 * if populated.
 	 */
 	if (unlikely(pte_marker_uffd_wp(old_pte)))
 		new_pte = huge_pte_mkuffd_wp(new_pte);
-	set_huge_pte_at(mm, haddr, ptep, new_pte);
+	set_huge_pte_at(mm, haddr_hgm, hpte->ptep, new_pte);
 
-	hugetlb_count_add(pages_per_huge_page(h), mm);
+	hugetlb_count_add(hugetlb_pte_size(hpte) / PAGE_SIZE, mm);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+		WARN_ON_ONCE(hugetlb_pte_size(hpte) != huge_page_size(h));
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(mm, vma, address, ptep, flags, folio, ptl);
+		ret = hugetlb_wp(mm, vma, address, hpte->ptep, flags, folio, ptl);
 	}
 
 	spin_unlock(ptl);
@@ -6322,17 +6352,19 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-	pte_t *ptep, entry;
+	pte_t entry;
 	spinlock_t *ptl;
 	vm_fault_t ret;
 	u32 hash;
 	pgoff_t idx;
-	struct page *page = NULL;
-	struct folio *pagecache_folio = NULL;
+	struct page *subpage = NULL;
+	struct folio *pagecache_folio = NULL, *folio = NULL;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
+	unsigned long haddr_hgm;
+	struct hugetlb_pte hpte;
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
@@ -6346,26 +6378,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire vma lock before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This prevents huge_pmd_unshare from
-	 * being called elsewhere and making the ptep no longer valid.
+	 * until finished with hpte.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the hpte no longer valid.
 	 */
 	hugetlb_vma_lock_read(vma);
-	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
-	if (!ptep) {
+	if (hugetlb_full_walk_alloc(&hpte, vma, address, 0)) {
 		hugetlb_vma_unlock_read(vma);
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		return VM_FAULT_OOM;
 	}
 
-	entry = huge_ptep_get(ptep);
+	entry = huge_ptep_get(hpte.ptep);
 	/* PTE markers should be handled the same way as none pte */
-	if (huge_pte_none_mostly(entry))
+	if (huge_pte_none_mostly(entry)) {
 		/*
 		 * hugetlb_no_page will drop vma lock and hugetlb fault
 		 * mutex internally, which make us return immediately.
 		 */
-		return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+		return hugetlb_no_page(mm, vma, mapping, idx, address, &hpte,
 				      entry, flags);
+	}
 
 	ret = 0;
 
@@ -6386,7 +6418,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * be released there.
 			 */
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			migration_entry_wait_huge(vma, ptep);
+			migration_entry_wait_huge(vma, hpte.ptep);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			ret = VM_FAULT_HWPOISON_LARGE |
@@ -6394,6 +6426,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_mutex;
 	}
 
+	if (!hugetlb_pte_present_leaf(&hpte, entry))
+		/* We raced with someone splitting the entry. */
+		goto out_mutex;
+
 	/*
 	 * If we are going to COW/unshare the mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that any
@@ -6413,14 +6449,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pagecache_folio = filemap_lock_folio(mapping, idx);
 	}
 
-	ptl = huge_pte_lock(h, mm, ptep);
+	ptl = hugetlb_pte_lock(&hpte);
 
 	/* Check for a racing update before calling hugetlb_wp() */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+	if (unlikely(!pte_same(entry, huge_ptep_get(hpte.ptep))))
 		goto out_ptl;
 
+	/* haddr_hgm is the base address of the region that hpte maps. */
+	haddr_hgm = address & hugetlb_pte_mask(&hpte);
+
 	/* Handle userfault-wp first, before trying to lock more pages */
-	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(entry) &&
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
 		struct vm_fault vmf = {
 			.vma = vma,
@@ -6444,18 +6483,21 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * pagecache_folio, so here we need take the former one
 	 * when page != pagecache_folio or !pagecache_folio.
 	 */
-	page = pte_page(entry);
-	if (page_folio(page) != pagecache_folio)
-		if (!trylock_page(page)) {
+	subpage = pte_page(entry);
+	folio = page_folio(subpage);
+	if (folio != pagecache_folio)
+		if (!trylock_page(&folio->page)) {
 			need_wait_lock = 1;
 			goto out_ptl;
 		}
 
-	get_page(page);
+	folio_get(folio);
 
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(entry)) {
-			ret = hugetlb_wp(mm, vma, address, ptep, flags,
+			WARN_ON_ONCE(hugetlb_pte_size(&hpte) !=
+					huge_page_size(h));
+			ret = hugetlb_wp(mm, vma, address, hpte.ptep, flags,
 					 pagecache_folio, ptl);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
@@ -6463,13 +6505,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	}
 	entry = pte_mkyoung(entry);
-	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
+	if (huge_ptep_set_access_flags(vma, haddr_hgm, hpte.ptep, entry,
 						flags & FAULT_FLAG_WRITE))
-		update_mmu_cache(vma, haddr, ptep);
+		update_mmu_cache(vma, haddr_hgm, hpte.ptep);
 out_put_page:
-	if (page_folio(page) != pagecache_folio)
-		unlock_page(page);
-	put_page(page);
+	if (folio != pagecache_folio)
+		folio_unlock(folio);
+	folio_put(folio);
 out_ptl:
 	spin_unlock(ptl);
 
@@ -6488,7 +6530,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * here without taking refcount.
 	 */
 	if (need_wait_lock)
-		wait_on_page_locked(page);
+		wait_on_page_locked(&folio->page);
 	return ret;
 }
 
@@ -7689,6 +7731,9 @@ int hugetlb_full_walk(struct hugetlb_pte *hpte,
 /*
  * hugetlb_full_walk_alloc - do a high-granularity walk, potentially allocate
  *	new PTEs.
+ *
+ * If @target_sz is 0, then only attempt to allocate the hstate-level PTE and
+ * walk as far as we can go.
  */
 int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
 				   struct vm_area_struct *vma,
@@ -7707,6 +7752,12 @@ int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
 	if (!ptep)
 		return -ENOMEM;
 
+	if (!target_sz) {
+		WARN_ON_ONCE(hugetlb_hgm_walk(hpte, ptep, vma, addr,
+					      PAGE_SIZE, false));
+		return 0;
+	}
+
 	return hugetlb_hgm_walk(hpte, ptep, vma, addr, target_sz, true);
 }
 
@@ -7735,7 +7786,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 				pte = (pte_t *)pmd_alloc(mm, pud, addr);
 		}
 	}
-	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
 
 	return pte;
 }
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 25/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (23 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 24/46] hugetlb: add HGM support to hugetlb_fault and hugetlb_no_page James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:27 ` [PATCH v2 26/46] mm: rmap: provide pte_order in page_vma_mapped_walk James Houghton
                   ` (21 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

The main change in this commit is to walk_hugetlb_range, which now
supports walking HGM mappings; all walk_hugetlb_range callers must be
updated to use the new API and take the appropriate action.

Listing all the changes to the callers:

For s390 changes, we simply BUILD_BUG_ON if HGM is enabled.

For smaps, shared_hugetlb (and private_hugetlb, although private
mappings don't support HGM) may no longer be divisible by the hugepage
size. The appropriate changes have been made to support analyzing HGM
PTEs.

For pagemap, we ignore non-leaf PTEs by treating them as if they were
none PTEs. We can only end up with non-leaf PTEs if they had just been
updated from a none PTE.

For show_numa_map, the challenge is that, if any part of a hugepage is
mapped, we have to count that entire page exactly once, as the results
are given in units of hugepages. To support HGM mappings, we keep track
of the last page that we looked at. If the hugepage we are currently
looking at is the same as the last one, then it must be a page that has
been mapped at high granularity, and we've already accounted for it.

For DAMON, we treat non-leaf PTEs as if they were blank, for the same
reason as pagemap.

For hwpoison, we proactively update the logic to support the case where
hpte points to a subpage within the poisoned hugepage.

For queue_pages_hugetlb/migration, we ignore all HGM-enabled VMAs for
now.

For mincore, we ignore non-leaf PTEs for the same reason as pagemap.

For mprotect/prot_none_hugetlb_entry, we retry the walk when we get a
non-leaf PTE.
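
The converted callbacks below all follow roughly the same pattern: read
the PTE through hpte->ptep, bail out when the entry is present but not a
leaf (we raced with a split of a none PTE), and use hugetlb_pte_size()
and hugetlb_pte_mask() instead of the hstate's size and mask. A minimal
sketch of that shape, with a made-up callback name:

	static int example_hugetlb_entry(struct hugetlb_pte *hpte, unsigned long addr,
					 struct mm_walk *walk)
	{
		pte_t pte = huge_ptep_get(hpte->ptep);

		/* Present but not a leaf: we raced with a split of a none PTE. */
		if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte))
			return 0;

		/*
		 * Act on [addr, addr + hugetlb_pte_size(hpte)), which may be
		 * smaller than a full hugepage, e.g. via pte_page(pte) for
		 * present leaves.
		 */
		return 0;
	}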

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 5a716bdcba05..e1d41caa8504 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2629,14 +2629,20 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
-				      unsigned long hmask, unsigned long next,
+static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte,
+				      unsigned long addr,
 				      struct mm_walk *walk)
 {
-	pmd_t *pmd = (pmd_t *)pte;
+	pmd_t *pmd = (pmd_t *)hpte->ptep;
 	unsigned long start, end;
 	struct page *page = pmd_page(*pmd);
 
+	/*
+	 * We don't support high-granularity mappings yet. If we did, the
+	 * pmd_page() call above would be unsafe.
+	 */
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING));
+
 	/*
 	 * The write check makes sure we do not set a key on shared
 	 * memory. This is needed as the walker does not differentiate
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 77b72f42556a..2f293b5dabc0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -731,27 +731,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
-				 struct mm_walk *walk)
+static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
+				unsigned long addr,
+				struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	struct page *page = NULL;
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
-	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
-	} else if (is_swap_pte(*pte)) {
-		swp_entry_t swpent = pte_to_swp_entry(*pte);
+	if (pte_present(pte)) {
+		/* We only care about leaf-level PTEs. */
+		if (!hugetlb_pte_present_leaf(hpte, pte))
+			/*
+			 * The only case where hpte is not a leaf is that
+			 * it was originally none, but it was split from
+			 * under us. It was originally none, so exclude it.
+			 */
+			return 0;
+
+		page = vm_normal_page(vma, addr, pte);
+	} else if (is_swap_pte(pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(pte);
 
 		if (is_pfn_swap_entry(swpent))
 			page = pfn_swap_entry_to_page(swpent);
 	}
 	if (page) {
-		if (page_mapcount(page) >= 2 || hugetlb_pmd_shared(pte))
-			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+		unsigned long sz = hugetlb_pte_size(hpte);
+
+		if (page_mapcount(page) >= 2 || hugetlb_pmd_shared(hpte->ptep))
+			mss->shared_hugetlb += sz;
 		else
-			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->private_hugetlb += sz;
 	}
 	return 0;
 }
@@ -1569,22 +1581,31 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
+static int pagemap_hugetlb_range(struct hugetlb_pte *hpte,
+				 unsigned long addr,
 				 struct mm_walk *walk)
 {
 	struct pagemapread *pm = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	u64 flags = 0, frame = 0;
 	int err = 0;
-	pte_t pte;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
+	unsigned long end = addr + hugetlb_pte_size(hpte);
+	pte_t pte = huge_ptep_get(hpte->ptep);
+	struct page *page;
 
 	if (vma->vm_flags & VM_SOFTDIRTY)
 		flags |= PM_SOFT_DIRTY;
 
-	pte = huge_ptep_get(ptep);
 	if (pte_present(pte)) {
-		struct page *page = pte_page(pte);
+		/*
+		 * We raced with this PTE being split, which can only happen if
+		 * it was blank before. Treat it as if it were blank.
+		 */
+		if (!hugetlb_pte_present_leaf(hpte, pte))
+			return 0;
+
+		page = pte_page(pte);
 
 		if (!PageAnon(page))
 			flags |= PM_FILE;
@@ -1865,10 +1886,16 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 }
 #endif
 
+struct show_numa_map_private {
+	struct numa_maps *md;
+	struct page *last_page;
+};
+
 static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
-	struct numa_maps *md = walk->private;
+	struct show_numa_map_private *priv = walk->private;
+	struct numa_maps *md = priv->md;
 	struct vm_area_struct *vma = walk->vma;
 	spinlock_t *ptl;
 	pte_t *orig_pte;
@@ -1880,6 +1907,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		struct page *page;
 
 		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		priv->last_page = page;
 		if (page)
 			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1893,6 +1921,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
+		priv->last_page = page;
 		if (!page)
 			continue;
 		gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1903,19 +1932,25 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long addr,
+		struct mm_walk *walk)
 {
-	pte_t huge_pte = huge_ptep_get(pte);
+	struct show_numa_map_private *priv = walk->private;
+	pte_t huge_pte = huge_ptep_get(hpte->ptep);
 	struct numa_maps *md;
 	struct page *page;
 
-	if (!pte_present(huge_pte))
+	if (!hugetlb_pte_present_leaf(hpte, huge_pte))
+		return 0;
+
+	page = compound_head(pte_page(huge_pte));
+	if (priv->last_page == page)
+		/* we've already accounted for this page */
 		return 0;
 
-	page = pte_page(huge_pte);
+	priv->last_page = page;
 
-	md = walk->private;
+	md = priv->md;
 	gather_stats(page, md, pte_dirty(huge_pte), 1);
 	return 0;
 }
@@ -1945,9 +1980,15 @@ static int show_numa_map(struct seq_file *m, void *v)
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mempolicy *pol;
+
 	char buffer[64];
 	int nid;
 
+	struct show_numa_map_private numa_map_private;
+
+	numa_map_private.md = md;
+	numa_map_private.last_page = NULL;
+
 	if (!mm)
 		return 0;
 
@@ -1977,7 +2018,7 @@ static int show_numa_map(struct seq_file *m, void *v)
 		seq_puts(m, " huge");
 
 	/* mmap_lock is held by m_start */
-	walk_page_vma(vma, &show_numa_ops, md);
+	walk_page_vma(vma, &show_numa_ops, &numa_map_private);
 
 	if (!md->pages)
 		goto out;
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 27a6df448ee5..f4bddad615c2 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -3,6 +3,7 @@
 #define _LINUX_PAGEWALK_H
 
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
 
 struct mm_walk;
 
@@ -31,6 +32,10 @@ struct mm_walk;
  *			ptl after dropping the vma lock, or else revalidate
  *			those items after re-acquiring the vma lock and before
  *			accessing them.
+ *			In the presence of high-granularity hugetlb entries,
+ *			@hugetlb_entry is called only for leaf-level entries
+ *			(hstate-level entries are ignored if they are not
+ *			leaves).
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
@@ -58,9 +63,8 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_hole)(unsigned long addr, unsigned long next,
 			int depth, struct mm_walk *walk);
-	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
-			     unsigned long addr, unsigned long next,
-			     struct mm_walk *walk);
+	int (*hugetlb_entry)(struct hugetlb_pte *hpte,
+			     unsigned long addr, struct mm_walk *walk);
 	int (*test_walk)(unsigned long addr, unsigned long next,
 			struct mm_walk *walk);
 	int (*pre_vma)(unsigned long start, unsigned long end,
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 1fec16d7263e..0f001950498a 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -330,11 +330,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
+static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
+				struct mm_struct *mm,
 				struct vm_area_struct *vma, unsigned long addr)
 {
 	bool referenced = false;
-	pte_t entry = huge_ptep_get(pte);
 	struct folio *folio = pfn_folio(pte_pfn(entry));
 
 	folio_get(folio);
@@ -342,12 +342,12 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 	if (pte_young(entry)) {
 		referenced = true;
 		entry = pte_mkold(entry);
-		set_huge_pte_at(mm, addr, pte, entry);
+		set_huge_pte_at(mm, addr, hpte->ptep, entry);
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
 	if (mmu_notifier_clear_young(mm, addr,
-				     addr + huge_page_size(hstate_vma(vma))))
+				     addr + hugetlb_pte_size(hpte)))
 		referenced = true;
 #endif /* CONFIG_MMU_NOTIFIER */
 
@@ -358,20 +358,26 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 	folio_put(folio);
 }
 
-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
+static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte,
+				     unsigned long addr,
 				     struct mm_walk *walk)
 {
-	struct hstate *h = hstate_vma(walk->vma);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
-	damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
+	if (!hugetlb_pte_present_leaf(hpte, entry))
+		/*
+		 * We raced with someone splitting a blank PTE. Treat this PTE
+		 * as if it were blank.
+		 */
+		goto out;
+
+	damon_hugetlb_mkold(hpte, entry, walk->mm, walk->vma, addr);
 
 out:
 	spin_unlock(ptl);
@@ -483,8 +489,8 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
+static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte,
+				     unsigned long addr,
 				     struct mm_walk *walk)
 {
 	struct damon_young_walk_private *priv = walk->private;
@@ -493,11 +499,18 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
+	if (!hugetlb_pte_present_leaf(hpte, entry))
+		/*
+		 * We raced with someone splitting a blank PTE. Treat this PTE
+		 * as if it were blank.
+		 */
+		goto out;
+
 	folio = pfn_folio(pte_pfn(entry));
 	folio_get(folio);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 6a151c09de5e..d3e40cfdd4cb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -468,8 +468,8 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				      unsigned long start, unsigned long end,
+static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte,
+				      unsigned long start,
 				      struct mm_walk *walk)
 {
 	unsigned long addr = start, i, pfn;
@@ -479,16 +479,24 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	unsigned int required_fault;
 	unsigned long pfn_req_flags;
 	unsigned long cpu_flags;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
+	unsigned int order = hpte->shift - PAGE_SHIFT;
+	unsigned long end = start + hugetlb_pte_size(hpte);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
+
+	if (!hugetlb_pte_present_leaf(hpte, entry)) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
 
 	i = (start - range->start) >> PAGE_SHIFT;
 	pfn_req_flags = range->hmm_pfns[i];
 	cpu_flags = pte_to_hmm_pfn_flags(range, entry) |
-		    hmm_pfn_flags_order(huge_page_order(hstate_vma(vma)));
+		    hmm_pfn_flags_order(order);
 	required_fault =
 		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
 	if (required_fault) {
@@ -605,7 +613,7 @@ int hmm_range_fault(struct hmm_range *range)
 		 * in pfns. All entries < last in the pfn array are set to their
 		 * output, and all >= are still at their input values.
 		 */
-	} while (ret == -EBUSY);
+	} while (ret == -EBUSY || ret == -EAGAIN);
 	return ret;
 }
 EXPORT_SYMBOL(hmm_range_fault);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a1ede7bdce95..0b37cbc6e8ae 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -676,6 +676,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 				unsigned long poisoned_pfn, struct to_kill *tk)
 {
 	unsigned long pfn = 0;
+	unsigned long base_pages_poisoned = (1UL << shift) / PAGE_SIZE;
 
 	if (pte_present(pte)) {
 		pfn = pte_pfn(pte);
@@ -686,7 +687,8 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 			pfn = swp_offset_pfn(swp);
 	}
 
-	if (!pfn || pfn != poisoned_pfn)
+	if (!pfn || pfn < poisoned_pfn ||
+			pfn >= poisoned_pfn + base_pages_poisoned)
 		return 0;
 
 	set_to_kill(tk, addr, shift);
@@ -752,16 +754,15 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
-			    unsigned long addr, unsigned long end,
-			    struct mm_walk *walk)
+static int hwpoison_hugetlb_range(struct hugetlb_pte *hpte,
+				  unsigned long addr,
+				  struct mm_walk *walk)
 {
 	struct hwp_walk *hwp = walk->private;
-	pte_t pte = huge_ptep_get(ptep);
-	struct hstate *h = hstate_vma(walk->vma);
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
-	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
-				      hwp->pfn, &hwp->tk);
+	return check_hwpoisoned_entry(pte, addr & hugetlb_pte_mask(hpte),
+			hpte->shift, hwp->pfn, &hwp->tk);
 }
 #else
 #define hwpoison_hugetlb_range	NULL
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a256a241fd1d..0f91be88392b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -558,8 +558,8 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 	return addr != end ? -EIO : 0;
 }
 
-static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
-			       unsigned long addr, unsigned long end,
+static int queue_folios_hugetlb(struct hugetlb_pte *hpte,
+			       unsigned long addr,
 			       struct mm_walk *walk)
 {
 	int ret = 0;
@@ -570,8 +570,12 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	/* We don't migrate high-granularity HugeTLB mappings for now. */
+	if (hugetlb_hgm_enabled(walk->vma))
+		return -EINVAL;
+
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto unlock;
 	folio = pfn_folio(pte_pfn(entry));
@@ -608,7 +612,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
 	 */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
 	    (flags & MPOL_MF_MOVE && folio_estimated_sharers(folio) == 1 &&
-	     !hugetlb_pmd_shared(pte))) {
+	     !hugetlb_pmd_shared(hpte->ptep))) {
 		if (!isolate_hugetlb(folio, qp->pagelist) &&
 			(flags & MPOL_MF_STRICT))
 			/*
diff --git a/mm/mincore.c b/mm/mincore.c
index a085a2aeabd8..0894965b3944 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -22,18 +22,29 @@
 #include <linux/uaccess.h>
 #include "swap.h"
 
-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
-			unsigned long end, struct mm_walk *walk)
+static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr,
+			   struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
 	unsigned char present;
+	unsigned long end = addr + hugetlb_pte_size(hpte);
 	unsigned char *vec = walk->private;
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
 	/*
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
-	present = pte && !huge_pte_none(huge_ptep_get(pte));
+	present = !huge_pte_none(pte);
+
+	/*
+	 * If the pte is present but not a leaf, we raced with someone
+	 * splitting it. For someone to have split it, it must have been
+	 * huge_pte_none before, so treat it as such.
+	 */
+	if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte))
+		present = false;
+
 	for (; addr != end; vec++, addr += PAGE_SIZE)
 		*vec = present;
 	walk->private = vec;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1d4843c97c2a..61263ce9d925 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -564,12 +564,16 @@ static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
 		0 : -EACCES;
 }
 
-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				   unsigned long addr, unsigned long next,
+static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte,
+				   unsigned long addr,
 				   struct mm_walk *walk)
 {
-	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
-		0 : -EACCES;
+	pte_t pte = huge_ptep_get(hpte->ptep);
+
+	if (!hugetlb_pte_present_leaf(hpte, pte))
+		return -EAGAIN;
+	return pfn_modify_allowed(pte_pfn(pte),
+			*(pgprot_t *)(walk->private)) ? 0 : -EACCES;
 }
 
 static int prot_none_test(unsigned long addr, unsigned long next,
@@ -612,8 +616,10 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 	    (newflags & VM_ACCESS_FLAGS) == 0) {
 		pgprot_t new_pgprot = vm_get_page_prot(newflags);
 
-		error = walk_page_range(current->mm, start, end,
-				&prot_none_walk_ops, &new_pgprot);
+		do {
+			error = walk_page_range(current->mm, start, end,
+					&prot_none_walk_ops, &new_pgprot);
+		} while (error == -EAGAIN);
 		if (error)
 			return error;
 	}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cb23f8a15c13..05ce242f8b7e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/minmax.h>
 
 /*
  * We want to know the real level where a entry is located ignoring any
@@ -296,20 +297,21 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma = walk->vma;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long next;
-	unsigned long hmask = huge_page_mask(h);
-	unsigned long sz = huge_page_size(h);
-	pte_t *pte;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
+	struct hugetlb_pte hpte;
 
 	hugetlb_vma_lock_read(vma);
 	do {
-		next = hugetlb_entry_end(h, addr, end);
-		pte = hugetlb_walk(vma, addr & hmask, sz);
-		if (pte)
-			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
-		else if (ops->pte_hole)
-			err = ops->pte_hole(addr, next, -1, walk);
+		if (hugetlb_full_walk(&hpte, vma, addr)) {
+			next = hugetlb_entry_end(h, addr, end);
+			if (ops->pte_hole)
+				err = ops->pte_hole(addr, next, -1, walk);
+		} else {
+			err = ops->hugetlb_entry(
+					&hpte, addr, walk);
+			next = min(addr + hugetlb_pte_size(&hpte), end);
+		}
 		if (err)
 			break;
 	} while (addr = next, addr != end);
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 26/46] mm: rmap: provide pte_order in page_vma_mapped_walk
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (24 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 25/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
@ 2023-02-18  0:27 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 27/46] mm: rmap: update try_to_{migrate,unmap} to handle mapcount for HGM James Houghton
                   ` (20 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:27 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

page_vma_mapped_walk callers will need this information to know how
HugeTLB pages are mapped. pte_order only applies if pte is not NULL.
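
For illustration, the later patches in this series consume the field
roughly like this (a sketch, inside a page_vma_mapped_walk() loop once
pvmw.pte is non-NULL; the local variable names are made up):

	unsigned long map_pages = 1UL << pvmw.pte_order;        /* pages mapped by this PTE */
	unsigned long map_size  = PAGE_SIZE << pvmw.pte_order;  /* bytes mapped by this PTE */
	unsigned int  map_shift = pvmw.pte_order + PAGE_SHIFT;  /* shift, e.g. for hugetlb_remove_rmap() */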

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a4570da03e58..87a2c7f422bf 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -387,6 +387,7 @@ struct page_vma_mapped_walk {
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned int pte_order;
 	unsigned int flags;
 };
 
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..08295b122ad6 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -16,6 +16,7 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 static bool map_pte(struct page_vma_mapped_walk *pvmw)
 {
 	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+	pvmw->pte_order = 0;
 	if (!(pvmw->flags & PVMW_SYNC)) {
 		if (pvmw->flags & PVMW_MIGRATION) {
 			if (!is_swap_pte(*pvmw->pte))
@@ -177,6 +178,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (!pvmw->pte)
 			return false;
 
+		pvmw->pte_order = huge_page_order(hstate);
 		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
 		if (!check_pte(pvmw))
 			return not_found(pvmw);
@@ -272,6 +274,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 				}
 				pte_unmap(pvmw->pte);
 				pvmw->pte = NULL;
+				pvmw->pte_order = 0;
 				goto restart;
 			}
 			pvmw->pte++;
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 27/46] mm: rmap: update try_to_{migrate,unmap} to handle mapcount for HGM
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (25 preceding siblings ...)
  2023-02-18  0:27 ` [PATCH v2 26/46] mm: rmap: provide pte_order in page_vma_mapped_walk James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 28/46] mm: rmap: in try_to_{migrate,unmap}, check head page for hugetlb page flags James Houghton
                   ` (19 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Make use of the new pvmw->pte_order field to determine the size of the
PTE we're unmapping/migrating.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/migrate.c b/mm/migrate.c
index 9b4a7e75f6e6..616afcc40fdc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -247,7 +247,7 @@ static bool remove_migration_pte(struct folio *folio,
 
 #ifdef CONFIG_HUGETLB_PAGE
 		if (folio_test_hugetlb(folio)) {
-			unsigned int shift = huge_page_shift(hstate_vma(vma));
+			unsigned int shift = pvmw.pte_order + PAGE_SHIFT;
 
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			if (folio_test_anon(folio))
diff --git a/mm/rmap.c b/mm/rmap.c
index c010d0af3a82..0a019ae32f04 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1609,7 +1609,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
-				hugetlb_count_sub(folio_nr_pages(folio), mm);
+				hugetlb_count_sub(1UL << pvmw.pte_order, mm);
 				set_huge_pte_at(mm, address, pvmw.pte, pteval);
 			} else {
 				dec_mm_counter(mm, mm_counter(&folio->page));
@@ -1757,7 +1757,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/mm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+		if (folio_test_hugetlb(folio))
+			hugetlb_remove_rmap(subpage,
+					pvmw.pte_order + PAGE_SHIFT,
+					hstate_vma(vma), vma);
+		else
+			page_remove_rmap(subpage, vma, false);
+
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
 		folio_put(folio);
@@ -2020,7 +2026,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		} else if (PageHWPoison(subpage)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
-				hugetlb_count_sub(folio_nr_pages(folio), mm);
+				hugetlb_count_sub(1L << pvmw.pte_order, mm);
 				set_huge_pte_at(mm, address, pvmw.pte, pteval);
 			} else {
 				dec_mm_counter(mm, mm_counter(&folio->page));
@@ -2112,7 +2118,12 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/mm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+		if (folio_test_hugetlb(folio))
+			hugetlb_remove_rmap(subpage,
+					pvmw.pte_order + PAGE_SHIFT,
+					hstate_vma(vma), vma);
+		else
+			page_remove_rmap(subpage, vma, false);
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
 		folio_put(folio);
@@ -2196,6 +2207,8 @@ static bool page_make_device_exclusive_one(struct folio *folio,
 				      args->owner);
 	mmu_notifier_invalidate_range_start(&range);
 
+	VM_BUG_ON_FOLIO(folio_test_hugetlb(folio), folio);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 28/46] mm: rmap: in try_to_{migrate,unmap}, check head page for hugetlb page flags
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (26 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 27/46] mm: rmap: update try_to_{migrate,unmap} to handle mapcount for HGM James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 29/46] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
                   ` (18 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

The main complication here is that HugeTLB pages have their poison
status stored in the head page as the HWPoison page flag. Because
HugeTLB high-granularity mapping can create PTEs that point to subpages
instead of always the head of a hugepage, we need to check the
compound_head for page flags.
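
Sketched out, the check the diff below performs looks like this (using
the folio/subpage names from try_to_unmap_one; only a sketch, not the
full hunk):

	/* Under HGM, subpage may be a tail page; HugeTLB flags live on the head. */
	struct page *page_flags_page = folio_test_hugetlb(folio)
					? &folio->page
					: subpage;

	if (PageHWPoison(page_flags_page)) {
		/* install a hwpoison entry for this mapping */
	}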

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/rmap.c b/mm/rmap.c
index 0a019ae32f04..4908ede83173 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1456,10 +1456,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	pte_t pteval;
-	struct page *subpage;
+	struct page *subpage, *page_flags_page;
 	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	bool page_poisoned;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -1512,9 +1513,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 		subpage = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
+		/*
+		 * We check the page flags of HugeTLB pages by checking the
+		 * head page.
+		 */
+		page_flags_page = folio_test_hugetlb(folio)
+			? &folio->page
+			: subpage;
+		page_poisoned = PageHWPoison(page_flags_page);
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
-				 PageAnonExclusive(subpage);
+				 PageAnonExclusive(page_flags_page);
 
 		if (folio_test_hugetlb(folio)) {
 			bool anon = folio_test_anon(folio);
@@ -1523,7 +1532,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * The try_to_unmap() is only passed a hugetlb page
 			 * in the case where the hugetlb page is poisoned.
 			 */
-			VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
+			VM_BUG_ON_FOLIO(!page_poisoned, folio);
 			/*
 			 * huge_pmd_unshare may unmap an entire PMD page.
 			 * There is no way of knowing exactly which PMDs may
@@ -1606,7 +1615,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
-		if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
+		if (page_poisoned && !(flags & TTU_IGNORE_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
 				hugetlb_count_sub(1UL << pvmw.pte_order, mm);
@@ -1632,7 +1641,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
 		} else if (folio_test_anon(folio)) {
-			swp_entry_t entry = { .val = page_private(subpage) };
+			swp_entry_t entry = {
+				.val = page_private(page_flags_page)
+			};
 			pte_t swp_pte;
 			/*
 			 * Store the swap location in the pte.
@@ -1822,7 +1833,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	pte_t pteval;
-	struct page *subpage;
+	struct page *subpage, *page_flags_page;
 	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
@@ -1902,9 +1913,16 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			subpage = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
 		}
+		/*
+		 * We check the page flags of HugeTLB pages by checking the
+		 * head page.
+		 */
+		page_flags_page = folio_test_hugetlb(folio)
+			? &folio->page
+			: subpage;
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
-				 PageAnonExclusive(subpage);
+				 PageAnonExclusive(page_flags_page);
 
 		if (folio_test_hugetlb(folio)) {
 			bool anon = folio_test_anon(folio);
@@ -2023,7 +2041,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * No need to invalidate here it will synchronize on
 			 * against the special swap migration pte.
 			 */
-		} else if (PageHWPoison(subpage)) {
+		} else if (PageHWPoison(page_flags_page)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
 				hugetlb_count_sub(1L << pvmw.pte_order, mm);
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 29/46] hugetlb: update page_vma_mapped to do high-granularity walks
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (27 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 28/46] mm: rmap: in try_to_{migrate,unmap}, check head page for hugetlb page flags James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 30/46] hugetlb: add high-granularity migration support James Houghton
                   ` (17 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Update the HugeTLB logic to look a lot more like the PTE-mapped THP
logic. When a caller invokes page_vma_mapped_walk() in a loop, we
update pvmw->address to walk to each page table entry that could
possibly map the hugepage containing pvmw->pfn.

Make use of the new pte_order so callers know what size PTE
they're getting.

The !pte failure case is changed to call not_found() instead of just
returning false. This should be a no-op, but if somehow the hstate-level
PTE were deallocated between iterations, not_found() should be called to
drop locks.

Signed-off-by: James Houghton <jthoughton@google.com>
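
As a sketch, the resulting caller pattern matches the PTE-mapped THP
case (illustrative only; pte_order was added to page_vma_mapped_walk
earlier in this series, and the loop body stands in for whatever rmap
work the caller does):

	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

	while (page_vma_mapped_walk(&pvmw)) {
		/*
		 * Each iteration returns one mapping of the folio:
		 * pvmw.pte covers PAGE_SIZE << pvmw.pte_order bytes at
		 * pvmw.address, and pvmw.ptl is held.
		 */
		unsigned long map_size = PAGE_SIZE << pvmw.pte_order;

		/* ... operate on [pvmw.address, pvmw.address + map_size) ... */
	}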

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 08295b122ad6..03e8a4987272 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -133,7 +133,8 @@ static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
  *
  * Returns true if the page is mapped in the vma. @pvmw->pmd and @pvmw->pte point
  * to relevant page table entries. @pvmw->ptl is locked. @pvmw->address is
- * adjusted if needed (for PTE-mapped THPs).
+ * adjusted if needed (for PTE-mapped THPs and high-granularity-mapped HugeTLB
+ * pages).
  *
  * If @pvmw->pmd is set but @pvmw->pte is not, you have found PMD-mapped page
  * (usually THP). For PTE-mapped THP, you should run page_vma_mapped_walk() in
@@ -165,23 +166,47 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 
 	if (unlikely(is_vm_hugetlb_page(vma))) {
 		struct hstate *hstate = hstate_vma(vma);
-		unsigned long size = huge_page_size(hstate);
-		/* The only possible mapping was handled on last iteration */
-		if (pvmw->pte)
-			return not_found(pvmw);
-		/*
-		 * All callers that get here will already hold the
-		 * i_mmap_rwsem.  Therefore, no additional locks need to be
-		 * taken before calling hugetlb_walk().
-		 */
-		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
-		if (!pvmw->pte)
-			return false;
+		struct hugetlb_pte hpte;
+		pte_t pteval;
+
+		end = (pvmw->address & huge_page_mask(hstate)) +
+			huge_page_size(hstate);
+
+		do {
+			if (pvmw->pte) {
+				if (pvmw->ptl)
+					spin_unlock(pvmw->ptl);
+				pvmw->ptl = NULL;
+				pvmw->address += PAGE_SIZE << pvmw->pte_order;
+				if (pvmw->address >= end)
+					return not_found(pvmw);
+			}
 
-		pvmw->pte_order = huge_page_order(hstate);
-		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
-		if (!check_pte(pvmw))
-			return not_found(pvmw);
+			/*
+			 * All callers that get here will already hold the
+			 * i_mmap_rwsem. Therefore, no additional locks need to
+			 * be taken before calling hugetlb_walk().
+			 */
+			if (hugetlb_full_walk(&hpte, vma, pvmw->address))
+				return not_found(pvmw);
+
+retry:
+			pvmw->pte = hpte.ptep;
+			pvmw->pte_order = hpte.shift - PAGE_SHIFT;
+			pvmw->ptl = hugetlb_pte_lock(&hpte);
+			pteval = huge_ptep_get(hpte.ptep);
+			if (pte_present(pteval) && !hugetlb_pte_present_leaf(
+						&hpte, pteval)) {
+				/*
+				 * Someone split from under us, so keep
+				 * walking.
+				 */
+				spin_unlock(pvmw->ptl);
+				hugetlb_full_walk_continue(&hpte, vma,
+						pvmw->address);
+				goto retry;
+			}
+		} while (!check_pte(pvmw));
 		return true;
 	}
 
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 30/46] hugetlb: add high-granularity migration support
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (28 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 29/46] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 31/46] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
                   ` (16 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

To prevent queueing a hugepage for migration multiple times, we use
last_folio to keep track of the last folio we saw in queue_folios_hugetlb,
and if the folio we're looking at is last_folio, we skip it.

For the non-hugetlb cases, last_folio, although unused, is still updated
so that it has a consistent meaning with the hugetlb case.

Signed-off-by: James Houghton <jthoughton@google.com>
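
The last_folio bookkeeping is plain consecutive-duplicate suppression.
As a standalone illustration in userspace C (nothing below is kernel
API; the integers stand in for folios):

	#include <stdio.h>

	/* Queue each folio once even when several consecutive HGM PTEs
	 * map it, mirroring how qp->last_folio is used. */
	static void queue_unique(const int *folios, int n)
	{
		int last = -1;	/* plays the role of last_folio == NULL */

		for (int i = 0; i < n; i++) {
			if (folios[i] == last)
				continue;	/* already queued */
			last = folios[i];
			printf("queue folio %d\n", folios[i]);
		}
	}

	int main(void)
	{
		/* Three PTEs mapping folio 7, then two mapping folio 9. */
		int ptes[] = { 7, 7, 7, 9, 9 };

		queue_unique(ptes, 5);
		return 0;
	}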

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 3a451b7afcb3..6ef80763e629 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -68,6 +68,8 @@
 
 static inline bool is_pfn_swap_entry(swp_entry_t entry);
 
+struct hugetlb_pte;
+
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
@@ -339,7 +341,8 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 #ifdef CONFIG_HUGETLB_PAGE
 extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
 					pte_t *ptep, spinlock_t *ptl);
-extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
+extern void migration_entry_wait_huge(struct vm_area_struct *vma,
+					struct hugetlb_pte *hpte);
 #endif	/* CONFIG_HUGETLB_PAGE */
 #else  /* CONFIG_MIGRATION */
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
@@ -369,7 +372,8 @@ static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 #ifdef CONFIG_HUGETLB_PAGE
 static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
 					       pte_t *ptep, spinlock_t *ptl) { }
-static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
+static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
+						struct hugetlb_pte *hpte) { }
 #endif	/* CONFIG_HUGETLB_PAGE */
 static inline int is_writable_migration_entry(swp_entry_t entry)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 86cd51beb02c..39f541b4a0a8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6418,7 +6418,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * be released there.
 			 */
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			migration_entry_wait_huge(vma, hpte.ptep);
+			migration_entry_wait_huge(vma, &hpte);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			ret = VM_FAULT_HWPOISON_LARGE |
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0f91be88392b..43e210181cce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -424,6 +424,7 @@ struct queue_pages {
 	unsigned long start;
 	unsigned long end;
 	struct vm_area_struct *first;
+	struct folio *last_folio;
 };
 
 /*
@@ -475,6 +476,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
 	flags = qp->flags;
 	/* go to folio migration */
 	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+		qp->last_folio = folio;
 		if (!vma_migratable(walk->vma) ||
 		    migrate_folio_add(folio, qp->pagelist, flags)) {
 			ret = 1;
@@ -539,6 +541,8 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 				break;
 			}
 
+			qp->last_folio = folio;
+
 			/*
 			 * Do not abort immediately since there may be
 			 * temporary off LRU pages in the range.  Still
@@ -570,15 +574,22 @@ static int queue_folios_hugetlb(struct hugetlb_pte *hpte,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	/* We don't migrate high-granularity HugeTLB mappings for now. */
-	if (hugetlb_hgm_enabled(walk->vma))
-		return -EINVAL;
-
 	ptl = hugetlb_pte_lock(hpte);
 	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto unlock;
-	folio = pfn_folio(pte_pfn(entry));
+
+	if (!hugetlb_pte_present_leaf(hpte, entry)) {
+		ret = -EAGAIN;
+		goto unlock;
+	}
+
+	folio = page_folio(pte_page(entry));
+
+	/* We already queued this page with another high-granularity PTE. */
+	if (folio == qp->last_folio)
+		goto unlock;
+
 	if (!queue_folio_required(folio, qp))
 		goto unlock;
 
@@ -747,6 +758,7 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		.start = start,
 		.end = end,
 		.first = NULL,
+		.last_folio = NULL,
 	};
 
 	err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp);
diff --git a/mm/migrate.c b/mm/migrate.c
index 616afcc40fdc..b26169990532 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -196,6 +196,9 @@ static bool remove_migration_pte(struct folio *folio,
 		/* pgoff is invalid for ksm pages, but they are never large */
 		if (folio_test_large(folio) && !folio_test_hugetlb(folio))
 			idx = linear_page_index(vma, pvmw.address) - pvmw.pgoff;
+		else if (folio_test_hugetlb(folio))
+			idx = (pvmw.address & ~huge_page_mask(hstate_vma(vma)))/
+				PAGE_SIZE;
 		new = folio_page(folio, idx);
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
@@ -247,14 +250,16 @@ static bool remove_migration_pte(struct folio *folio,
 
 #ifdef CONFIG_HUGETLB_PAGE
 		if (folio_test_hugetlb(folio)) {
+			struct page *hpage = folio_page(folio, 0);
 			unsigned int shift = pvmw.pte_order + PAGE_SHIFT;
 
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			if (folio_test_anon(folio))
-				hugepage_add_anon_rmap(new, vma, pvmw.address,
+				hugepage_add_anon_rmap(hpage, vma, pvmw.address,
 						       rmap_flags);
 			else
-				page_add_file_rmap(new, vma, true);
+				hugetlb_add_file_rmap(new, shift,
+						hstate_vma(vma), vma);
 			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		} else
 #endif
@@ -270,7 +275,7 @@ static bool remove_migration_pte(struct folio *folio,
 			mlock_drain_local();
 
 		trace_remove_migration_pte(pvmw.address, pte_val(pte),
-					   compound_order(new));
+					   pvmw.pte_order);
 
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -361,12 +366,10 @@ void __migration_entry_wait_huge(struct vm_area_struct *vma,
 	}
 }
 
-void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
+void migration_entry_wait_huge(struct vm_area_struct *vma,
+				struct hugetlb_pte *hpte)
 {
-	spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
-					   vma->vm_mm, pte);
-
-	__migration_entry_wait_huge(vma, pte, ptl);
+	__migration_entry_wait_huge(vma, hpte->ptep, hpte->ptl);
 }
 #endif
 
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 31/46] hugetlb: sort hstates in hugetlb_init_hstates
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (29 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 30/46] hugetlb: add high-granularity migration support James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 32/46] hugetlb: add for_each_hgm_shift James Houghton
                   ` (15 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

When using HugeTLB high-granularity mapping, we need to go through the
supported hugepage sizes in decreasing order so that we pick the largest
size that works. Consider the case where we're faulting in a 1G hugepage
for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
a PUD. By going through the sizes in decreasing order, we will find that
PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.

This commit also changes bootmem hugepages from storing hstate pointers
directly to storing the hstate sizes. The hstate pointers used for
boot-time-allocated hugepages become invalid after we sort the hstates.
`gather_bootmem_prealloc`, called after the hstates have been sorted,
now converts the size to the correct hstate.

Signed-off-by: James Houghton <jthoughton@google.com>
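
For reference, the ordering produced by compare_hstates_decreasing()
can be reproduced in plain userspace C (illustrative only; the sizes
are the x86_64 hugepage sizes):

	#include <stdio.h>
	#include <stdlib.h>

	/* Order sizes from largest to smallest. */
	static int cmp_decreasing(const void *a, const void *b)
	{
		unsigned long sz_a = *(const unsigned long *)a;
		unsigned long sz_b = *(const unsigned long *)b;

		if (sz_a < sz_b)
			return 1;
		if (sz_a > sz_b)
			return -1;
		return 0;
	}

	int main(void)
	{
		/* 2M and 1G, in arbitrary boot order. */
		unsigned long sizes[] = { 2UL << 20, 1UL << 30 };

		qsort(sizes, 2, sizeof(sizes[0]), cmp_decreasing);
		printf("%lu %lu\n", sizes[0], sizes[1]);	/* 1G, then 2M */
		return 0;
	}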

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2fe1eb6897d4..a344f9d9eba1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -766,7 +766,7 @@ struct hstate {
 
 struct huge_bootmem_page {
 	struct list_head list;
-	struct hstate *hstate;
+	unsigned long hstate_sz;
 };
 
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 39f541b4a0a8..e20df8f6216e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,6 +34,7 @@
 #include <linux/nospec.h>
 #include <linux/delayacct.h>
 #include <linux/memory.h>
+#include <linux/sort.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -49,6 +50,10 @@
 
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
+/*
+ * After hugetlb_init_hstates is called, hstates will be sorted from largest
+ * to smallest.
+ */
 struct hstate hstates[HUGE_MAX_HSTATE];
 
 #ifdef CONFIG_CMA
@@ -3464,7 +3469,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 	/* Put them into a private list first because mem_map is not up yet */
 	INIT_LIST_HEAD(&m->list);
 	list_add(&m->list, &huge_boot_pages);
-	m->hstate = h;
+	m->hstate_sz = huge_page_size(h);
 	return 1;
 }
 
@@ -3479,7 +3484,7 @@ static void __init gather_bootmem_prealloc(void)
 	list_for_each_entry(m, &huge_boot_pages, list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = page_folio(page);
-		struct hstate *h = m->hstate;
+		struct hstate *h = size_to_hstate(m->hstate_sz);
 
 		VM_BUG_ON(!hstate_is_gigantic(h));
 		WARN_ON(folio_ref_count(folio) != 1);
@@ -3595,9 +3600,38 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	kfree(node_alloc_noretry);
 }
 
+static int compare_hstates_decreasing(const void *a, const void *b)
+{
+	unsigned long sz_a = huge_page_size((const struct hstate *)a);
+	unsigned long sz_b = huge_page_size((const struct hstate *)b);
+
+	if (sz_a < sz_b)
+		return 1;
+	if (sz_a > sz_b)
+		return -1;
+	return 0;
+}
+
+static void sort_hstates(void)
+{
+	unsigned long default_hstate_sz = huge_page_size(&default_hstate);
+
+	/* Sort from largest to smallest. */
+	sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
+	     compare_hstates_decreasing, NULL);
+
+	/*
+	 * We may have changed the location of the default hstate, so we need to
+	 * update it.
+	 */
+	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
+}
+
 static void __init hugetlb_init_hstates(void)
 {
-	struct hstate *h, *h2;
+	struct hstate *h;
+
+	sort_hstates();
 
 	for_each_hstate(h) {
 		/* oversize hugepages were init'ed in early boot */
@@ -3616,13 +3650,8 @@ static void __init hugetlb_init_hstates(void)
 			continue;
 		if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
 			continue;
-		for_each_hstate(h2) {
-			if (h2 == h)
-				continue;
-			if (h2->order < h->order &&
-			    h2->order > h->demote_order)
-				h->demote_order = h2->order;
-		}
+		if (h + 1 < &hstates[hugetlb_max_hstate])
+			h->demote_order = huge_page_order(h + 1);
 	}
 }
 
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 32/46] hugetlb: add for_each_hgm_shift
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (30 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 31/46] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 33/46] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
                   ` (14 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This is a helper macro to loop through all the usable page sizes for a
high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
loop, in descending order, through the page sizes that HugeTLB supports
for this architecture. It always includes PAGE_SIZE.

This is done by looping through the hstates; however, there is no
hstate for PAGE_SIZE. To handle this case, the loop intentionally goes
out of bounds, and the out-of-bounds pointer is mapped to PAGE_SIZE.

Signed-off-by: James Houghton <jthoughton@google.com>
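
For illustration, the intended usage pattern looks like the sketch
below (addr and end are hypothetical; the alignment check mirrors the
hugetlb_alloc_largest_pte() added later in this series):

	struct hstate *h = hstate_vma(vma), *tmp_h;
	unsigned int shift;

	for_each_hgm_shift(h, tmp_h, shift) {
		unsigned long sz = 1UL << shift;

		/*
		 * On x86_64 with a 1G hstate this visits PUD_SHIFT (30),
		 * PMD_SHIFT (21), and finally PAGE_SHIFT (12).
		 */
		if (IS_ALIGNED(addr, sz) && addr + sz <= end)
			break;	/* largest usable mapping size found */
	}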

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e20df8f6216e..667e82b7a0ff 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7941,6 +7941,24 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
 	return vma && (vma->vm_flags & VM_HUGETLB_HGM);
 }
+/* Should only be used by the for_each_hgm_shift macro. */
+static unsigned int __shift_for_hstate(struct hstate *h)
+{
+	/* If h is out of bounds, we have reached the end, so give PAGE_SIZE */
+	if (h >= &hstates[hugetlb_max_hstate])
+		return PAGE_SHIFT;
+	return huge_page_shift(h);
+}
+
+/*
+ * Intentionally go out of bounds. An out-of-bounds hstate will be converted to
+ * PAGE_SIZE.
+ */
+#define for_each_hgm_shift(hstate, tmp_h, shift) \
+	for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
+			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
+			       (tmp_h)++)
+
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 33/46] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (31 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 32/46] hugetlb: add for_each_hgm_shift James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 34/46] hugetlb: add MADV_COLLAPSE for hugetlb James Houghton
                   ` (13 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Changes here are similar to the changes made for hugetlb_no_page.

Pass vmf->real_address to userfaultfd_huge_must_wait because
vmf->address may be rounded down to the hugepage size, and a
high-granularity page table walk would look up the wrong PTE. Also
change the call to userfaultfd_must_wait in the same way for
consistency.

This commit introduces hugetlb_alloc_largest_pte which is used to find
the appropriate PTE size to map pages with UFFDIO_CONTINUE.

When MADV_SPLIT is provided, page fault events will report
PAGE_SIZE-aligned addresses instead of huge_page_size(h)-aligned
addresses, regardless of whether UFFD_FEATURE_EXACT_ADDRESS is used.

Signed-off-by: James Houghton <jthoughton@google.com>
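
For illustration only, the userspace side might look like the sketch
below. MADV_SPLIT is the madvise mode added earlier in this series (it
is not in released uapi headers yet, so it must be defined by hand),
uffd is a userfaultfd already registered on the hugetlbfs mapping with
UFFDIO_REGISTER_MODE_MINOR, and fault_addr is the PAGE_SIZE-aligned
address reported in the fault event:

	#include <stddef.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/userfaultfd.h>

	/* Opt the mapping in to high-granularity UFFDIO_CONTINUE. */
	static int enable_hgm(void *map, size_t len)
	{
		return madvise(map, len, MADV_SPLIT);
	}

	/* Resolve one minor fault at 4K granularity (x86_64 PAGE_SIZE). */
	static int continue_one_page(int uffd, unsigned long fault_addr)
	{
		struct uffdio_continue cont = {
			.range = { .start = fault_addr, .len = 4096 },
			.mode = 0,
		};

		return ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}

Without MADV_SPLIT, the same UFFDIO_CONTINUE would have to cover a full
huge_page_size(h)-aligned region.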

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 44d1ee429eb0..bb30001b63ba 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -252,17 +252,17 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 					 unsigned long flags,
 					 unsigned long reason)
 {
-	pte_t *ptep, pte;
+	pte_t pte;
 	bool ret = true;
+	struct hugetlb_pte hpte;
 
 	mmap_assert_locked(ctx->mm);
 
-	ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
-	if (!ptep)
+	if (hugetlb_full_walk(&hpte, vma, address))
 		goto out;
 
 	ret = false;
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(hpte.ptep);
 
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
@@ -531,11 +531,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
 	if (!is_vm_hugetlb_page(vma))
-		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
-						  reason);
+		must_wait = userfaultfd_must_wait(ctx, vmf->real_address,
+						  vmf->flags, reason);
 	else
 		must_wait = userfaultfd_huge_must_wait(ctx, vma,
-						       vmf->address,
+						       vmf->real_address,
 						       vmf->flags, reason);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_vma_unlock_read(vma);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a344f9d9eba1..e0e51bb06112 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -201,7 +201,8 @@ unsigned long hugetlb_total_pages(void);
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
 #ifdef CONFIG_USERFAULTFD
-int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
+int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
+				struct hugetlb_pte *dst_hpte,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
 				unsigned long src_addr,
@@ -1272,16 +1273,31 @@ static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
 
 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
 bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
+bool hugetlb_hgm_advised(struct vm_area_struct *vma);
 bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end);
 #else
 static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
 	return false;
 }
+static inline bool hugetlb_hgm_advised(struct vm_area_struct *vma)
+{
+	return false;
+}
 static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
 {
 	return false;
 }
+static inline
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end)
+{
+	return -EINVAL;
+}
 #endif
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 667e82b7a0ff..a00b4ac07046 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6083,9 +6083,15 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 						  unsigned long reason)
 {
 	u32 hash;
+	/*
+	 * Don't use the hpage-aligned address if the user has explicitly
+	 * enabled HGM.
+	 */
+	bool round_to_pagesize = hugetlb_hgm_advised(vma) &&
+				 reason == VM_UFFD_MINOR;
 	struct vm_fault vmf = {
 		.vma = vma,
-		.address = haddr,
+		.address = round_to_pagesize ? addr & PAGE_MASK : haddr,
 		.real_address = addr,
 		.flags = flags,
 
@@ -6569,7 +6575,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  * modifications for huge pages.
  */
 int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
-			    pte_t *dst_pte,
+			    struct hugetlb_pte *dst_hpte,
 			    struct vm_area_struct *dst_vma,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
@@ -6580,13 +6586,15 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
 	struct hstate *h = hstate_vma(dst_vma);
 	struct address_space *mapping = dst_vma->vm_file->f_mapping;
-	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+	unsigned long haddr = dst_addr & huge_page_mask(h);
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, haddr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	pte_t _dst_pte;
 	spinlock_t *ptl;
 	int ret = -ENOMEM;
 	struct folio *folio;
+	struct page *subpage;
 	int writable;
 	bool folio_in_pagecache = false;
 
@@ -6601,12 +6609,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		 * a non-missing case. Return -EEXIST.
 		 */
 		if (vm_shared &&
-		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
 			ret = -EEXIST;
 			goto out;
 		}
 
-		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
+		folio = alloc_hugetlb_folio(dst_vma, haddr, 0);
 		if (IS_ERR(folio)) {
 			ret = -ENOMEM;
 			goto out;
@@ -6622,13 +6630,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			/* Free the allocated folio which may have
 			 * consumed a reservation.
 			 */
-			restore_reserve_on_error(h, dst_vma, dst_addr, folio);
+			restore_reserve_on_error(h, dst_vma, haddr, folio);
 			folio_put(folio);
 
 			/* Allocate a temporary folio to hold the copied
 			 * contents.
 			 */
-			folio = alloc_hugetlb_folio_vma(h, dst_vma, dst_addr);
+			folio = alloc_hugetlb_folio_vma(h, dst_vma, haddr);
 			if (!folio) {
 				ret = -ENOMEM;
 				goto out;
@@ -6642,14 +6650,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		}
 	} else {
 		if (vm_shared &&
-		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+		    hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
 			put_page(*pagep);
 			ret = -EEXIST;
 			*pagep = NULL;
 			goto out;
 		}
 
-		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
+		folio = alloc_hugetlb_folio(dst_vma, haddr, 0);
 		if (IS_ERR(folio)) {
 			put_page(*pagep);
 			ret = -ENOMEM;
@@ -6697,7 +6705,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		folio_in_pagecache = true;
 	}
 
-	ptl = huge_pte_lock(h, dst_mm, dst_pte);
+	ptl = hugetlb_pte_lock(dst_hpte);
 
 	ret = -EIO;
 	if (folio_test_hwpoison(folio))
@@ -6709,11 +6717,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * page backing it, then access the page.
 	 */
 	ret = -EEXIST;
-	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
+	if (!huge_pte_none_mostly(huge_ptep_get(dst_hpte->ptep)))
 		goto out_release_unlock;
 
+	subpage = hugetlb_find_subpage(h, folio, dst_addr);
+
 	if (folio_in_pagecache)
-		page_add_file_rmap(&folio->page, dst_vma, true);
+		hugetlb_add_file_rmap(subpage, dst_hpte->shift, h, dst_vma);
 	else
 		hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr);
 
@@ -6726,7 +6736,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
 
-	_dst_pte = make_huge_pte(dst_vma, &folio->page, writable);
+	_dst_pte = make_huge_pte_with_shift(dst_vma, subpage, writable,
+			dst_hpte->shift);
 	/*
 	 * Always mark UFFDIO_COPY page dirty; note that this may not be
 	 * extremely important for hugetlbfs for now since swapping is not
@@ -6739,12 +6750,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (wp_copy)
 		_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
 
-	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);
 
-	hugetlb_count_add(pages_per_huge_page(h), dst_mm);
+	hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);
 
 	spin_unlock(ptl);
 	if (!is_continue)
@@ -7941,6 +7952,18 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
 	return vma && (vma->vm_flags & VM_HUGETLB_HGM);
 }
+bool hugetlb_hgm_advised(struct vm_area_struct *vma)
+{
+	/*
+	 * Right now, the only way for HGM to be enabled is if a user
+	 * explicitly enables it via MADV_SPLIT, but in the future, there
+	 * may be cases where it gets enabled automatically.
+	 *
+	 * Provide hugetlb_hgm_advised() now for call sites that care whether
+	 * the user explicitly enabled HGM.
+	 */
+	return hugetlb_hgm_enabled(vma);
+}
 /* Should only be used by the for_each_hgm_shift macro. */
 static unsigned int __shift_for_hstate(struct hstate *h)
 {
@@ -7959,6 +7982,38 @@ static unsigned int __shift_for_hstate(struct hstate *h)
 			       (tmp_h) <= &hstates[hugetlb_max_hstate]; \
 			       (tmp_h)++)
 
+/*
+ * Find the HugeTLB PTE that maps as much of [start, end) as possible with a
+ * single page table entry. It is returned in @hpte.
+ */
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+			      struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma), *tmp_h;
+	unsigned int shift;
+	unsigned long sz;
+	int ret;
+
+	for_each_hgm_shift(h, tmp_h, shift) {
+		sz = 1UL << shift;
+
+		if (!IS_ALIGNED(start, sz) || start + sz > end)
+			continue;
+		goto found;
+	}
+	return -EINVAL;
+found:
+	ret = hugetlb_full_walk_alloc(hpte, vma, start, sz);
+	if (ret)
+		return ret;
+
+	if (hpte->shift > shift)
+		return -EEXIST;
+
+	return 0;
+}
+
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 53c3d916ff66..b56bc12f600e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -320,14 +320,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 {
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
-	pte_t *dst_pte;
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct page *page;
-	unsigned long vma_hpagesize;
+	unsigned long vma_hpagesize, target_pagesize;
 	pgoff_t idx;
 	u32 hash;
 	struct address_space *mapping;
+	bool use_hgm = hugetlb_hgm_advised(dst_vma) &&
+		mode == MCOPY_ATOMIC_CONTINUE;
+	struct hstate *h = hstate_vma(dst_vma);
 
 	/*
 	 * There is no default zero huge page for all huge page sizes as
@@ -345,12 +347,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	copied = 0;
 	page = NULL;
 	vma_hpagesize = vma_kernel_pagesize(dst_vma);
+	target_pagesize = use_hgm ? PAGE_SIZE : vma_hpagesize;
 
 	/*
-	 * Validate alignment based on huge page size
+	 * Validate alignment based on the targeted page size.
 	 */
 	err = -EINVAL;
-	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+	if (dst_start & (target_pagesize - 1) || len & (target_pagesize - 1))
 		goto out_unlock;
 
 retry:
@@ -381,13 +384,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	}
 
 	while (src_addr < src_start + len) {
+		struct hugetlb_pte hpte;
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
 		 * Serialize via vma_lock and hugetlb_fault_mutex.
-		 * vma_lock ensures the dst_pte remains valid even
-		 * in the case of shared pmds.  fault mutex prevents
-		 * races with other faulting threads.
+		 * vma_lock ensures the hpte.ptep remains valid even
+		 * in the case of shared pmds and page table collapsing.
+		 * fault mutex prevents races with other faulting threads.
 		 */
 		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
@@ -395,23 +399,28 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 		hugetlb_vma_lock_read(dst_vma);
 
-		err = -ENOMEM;
-		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
-		if (!dst_pte) {
+		if (use_hgm)
+			err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
+							dst_addr,
+							dst_start + len);
+		else
+			err = hugetlb_full_walk_alloc(&hpte, dst_vma, dst_addr,
+						      vma_hpagesize);
+		if (err) {
 			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out_unlock;
 		}
 
 		if (mode != MCOPY_ATOMIC_CONTINUE &&
-		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
+		    !huge_pte_none_mostly(huge_ptep_get(hpte.ptep))) {
 			err = -EEXIST;
 			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out_unlock;
 		}
 
-		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
+		err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
 					       dst_addr, src_addr, mode, &page,
 					       wp_copy);
 
@@ -423,6 +432,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		if (unlikely(err == -ENOENT)) {
 			mmap_read_unlock(dst_mm);
 			BUG_ON(!page);
+			WARN_ON_ONCE(hpte.shift != huge_page_shift(h));
 
 			err = copy_huge_page_from_user(page,
 						(const void __user *)src_addr,
@@ -440,9 +450,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 			BUG_ON(page);
 
 		if (!err) {
-			dst_addr += vma_hpagesize;
-			src_addr += vma_hpagesize;
-			copied += vma_hpagesize;
+			dst_addr += hugetlb_pte_size(&hpte);
+			src_addr += hugetlb_pte_size(&hpte);
+			copied += hugetlb_pte_size(&hpte);
 
 			if (fatal_signal_pending(current))
 				err = -EINTR;
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 34/46] hugetlb: add MADV_COLLAPSE for hugetlb
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (32 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 33/46] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 35/46] hugetlb: add check to prevent refcount overflow via HGM James Houghton
                   ` (12 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This is a necessary extension to the UFFDIO_CONTINUE changes. When
userspace finishes mapping an entire hugepage with UFFDIO_CONTINUE, the
kernel has no mechanism to automatically collapse the page table to map
the whole hugepage normally. We require userspace to inform us that they
would like the mapping to be collapsed; they do this with MADV_COLLAPSE.

If userspace has mapped only part of a hugepage with UFFDIO_CONTINUE,
hugetlb_collapse will cause the requested range to be mapped as if it
had already been UFFDIO_CONTINUE'd in full. The effects of any
UFFDIO_WRITEPROTECT calls may be undone by a call to MADV_COLLAPSE for
intersecting address ranges.

This commit is co-opting the same madvise mode that has been introduced
to synchronously collapse THPs. The function that does THP collapsing
has been renamed to madvise_collapse_thp.

As with the rest of the high-granularity mapping support, MADV_COLLAPSE
is only supported for shared VMAs right now.

MADV_COLLAPSE for HugeTLB takes the mmap_lock for writing.

It is important that we check PageHWPoison before checking
!HPageMigratable, as PageHWPoison implies !HPageMigratable.
!PageHWPoison && !HPageMigratable means that the page has been isolated
for migration.

Signed-off-by: James Houghton <jthoughton@google.com>
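
For illustration only, the intended userspace sequence is roughly the
sketch below (MADV_COLLAPSE is the existing madvise value; older libc
headers may not define it yet):

	#include <stddef.h>
	#include <sys/mman.h>

	/*
	 * After UFFDIO_CONTINUE has been issued for the pieces of a
	 * hugepage that userspace cares about, collapse the page tables
	 * back to an optimally-sized mapping. Pieces that were never
	 * CONTINUE'd are filled from the page cache, as described above.
	 */
	static int collapse_range(void *addr, size_t len)
	{
		return madvise(addr, len, MADV_COLLAPSE);
	}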

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 70bd867eba94..fa63a56ebaf0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -218,9 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
-int madvise_collapse(struct vm_area_struct *vma,
-		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end);
+int madvise_collapse_thp(struct vm_area_struct *vma,
+			 struct vm_area_struct **prev,
+			 unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -358,9 +358,9 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	return -EINVAL;
 }
 
-static inline int madvise_collapse(struct vm_area_struct *vma,
-				   struct vm_area_struct **prev,
-				   unsigned long start, unsigned long end)
+static inline int madvise_collapse_thp(struct vm_area_struct *vma,
+				       struct vm_area_struct **prev,
+				       unsigned long start, unsigned long end)
 {
 	return -EINVAL;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e0e51bb06112..6cd4ae08d84d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1278,6 +1278,8 @@ bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
 int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end);
+int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
+		     unsigned long end);
 #else
 static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
@@ -1298,6 +1300,12 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 {
 	return -EINVAL;
 }
+static inline
+int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
+		     unsigned long end)
+{
+	return -EINVAL;
+}
 #endif
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a00b4ac07046..c4d189e5f1fd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -8014,6 +8014,158 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 	return 0;
 }
 
+/*
+ * Collapse the address range from @start to @end to be mapped optimally.
+ *
+ * This is only valid for shared mappings. The main use case for this function
+ * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
+ * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
+ * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
+ * it up to userspace to tell us to do so, via MADV_COLLAPSE.
+ *
+ * Any holes in the mapping will be filled. If there is no page in the
+ * pagecache for a region we're collapsing, the PTEs will be cleared.
+ *
+ * If high-granularity PTEs are uffd-wp markers, those markers will be dropped.
+ */
+static int __hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+			      unsigned long start, unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+	unsigned long curr = start;
+	int ret = 0;
+	struct folio *folio;
+	struct page *subpage;
+	pgoff_t idx;
+	bool writable = vma->vm_flags & VM_WRITE;
+	struct hugetlb_pte hpte;
+	pte_t entry;
+	spinlock_t *ptl;
+
+	/*
+	 * This is only supported for shared VMAs, because we need to look up
+	 * the page to use for any PTEs we end up creating.
+	 */
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return -EINVAL;
+
+	/* If HGM is not enabled, there is nothing to collapse. */
+	if (!hugetlb_hgm_enabled(vma))
+		return 0;
+
+	tlb_gather_mmu(&tlb, mm);
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start, end);
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (curr < end) {
+		ret = hugetlb_alloc_largest_pte(&hpte, mm, vma, curr, end);
+		if (ret)
+			goto out;
+
+		entry = huge_ptep_get(hpte.ptep);
+
+		/*
+		 * There is no work to do if the PTE doesn't point to page
+		 * tables.
+		 */
+		if (!pte_present(entry))
+			goto next_hpte;
+		if (hugetlb_pte_present_leaf(&hpte, entry))
+			goto next_hpte;
+
+		idx = vma_hugecache_offset(h, vma, curr);
+		folio = filemap_get_folio(mapping, idx);
+
+		if (folio && folio_test_hwpoison(folio)) {
+			/*
+			 * Don't collapse a mapping to a page that is
+			 * hwpoisoned. The entire page will be poisoned.
+			 *
+			 * When HugeTLB supports poisoning PAGE_SIZE bits of
+			 * the hugepage, the logic here can be improved.
+			 *
+			 * Skip this page, and continue to collapse the rest
+			 * of the mapping.
+			 */
+			folio_put(folio);
+			curr = (curr & huge_page_mask(h)) + huge_page_size(h);
+			continue;
+		}
+
+		if (folio && !folio_test_hugetlb_migratable(folio)) {
+			/*
+			 * Don't collapse a mapping to a page that is pending
+			 * a migration. Migration swap entries may have been
+			 * placed in the page table.
+			 */
+			ret = -EBUSY;
+			folio_put(folio);
+			goto out;
+		}
+
+		/*
+		 * Clear all the PTEs, and drop ref/mapcounts
+		 * (on tlb_finish_mmu).
+		 */
+		__unmap_hugepage_range(&tlb, vma, curr,
+			curr + hugetlb_pte_size(&hpte),
+			NULL,
+			ZAP_FLAG_DROP_MARKER);
+		/* Free the PTEs. */
+		hugetlb_free_pgd_range(&tlb,
+				curr, curr + hugetlb_pte_size(&hpte),
+				curr, curr + hugetlb_pte_size(&hpte));
+
+		ptl = hugetlb_pte_lock(&hpte);
+
+		if (!folio) {
+			huge_pte_clear(mm, curr, hpte.ptep,
+					hugetlb_pte_size(&hpte));
+			spin_unlock(ptl);
+			goto next_hpte;
+		}
+
+		subpage = hugetlb_find_subpage(h, folio, curr);
+		entry = make_huge_pte_with_shift(vma, subpage,
+						 writable, hpte.shift);
+		hugetlb_add_file_rmap(subpage, hpte.shift, h, vma);
+		set_huge_pte_at(mm, curr, hpte.ptep, entry);
+		spin_unlock(ptl);
+next_hpte:
+		curr += hugetlb_pte_size(&hpte);
+	}
+out:
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
+
+	return ret;
+}
+
+int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
+		     unsigned long end)
+{
+	int ret = 0;
+	struct vm_area_struct *vma;
+
+	mmap_write_lock(mm);
+	while (start < end && !ret) {
+		vma = find_vma(mm, start);
+		if (!vma || !is_vm_hugetlb_page(vma)) {
+			ret = -EINVAL;
+			break;
+		}
+		ret = __hugetlb_collapse(mm, vma, start,
+				end < vma->vm_end ? end : vma->vm_end);
+		start = vma->vm_end;
+	}
+	mmap_write_unlock(mm);
+	return ret;
+}
+
 #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8dbc39896811..58cda5020537 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2750,8 +2750,8 @@ static int madvise_collapse_errno(enum scan_result r)
 	}
 }
 
-int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end)
+int madvise_collapse_thp(struct vm_area_struct *vma, struct vm_area_struct **prev,
+			 unsigned long start, unsigned long end)
 {
 	struct collapse_control *cc;
 	struct mm_struct *mm = vma->vm_mm;
diff --git a/mm/madvise.c b/mm/madvise.c
index 8c004c678262..e121d135252a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1028,6 +1028,24 @@ static int madvise_split(struct vm_area_struct *vma,
 #endif
 }
 
+static int madvise_collapse(struct vm_area_struct *vma,
+			    struct vm_area_struct **prev,
+			    unsigned long start, unsigned long end)
+{
+	if (is_vm_hugetlb_page(vma)) {
+		struct mm_struct *mm = vma->vm_mm;
+		int ret;
+
+		*prev = NULL; /* tell sys_madvise we dropped the mmap lock */
+		mmap_read_unlock(mm);
+		ret = hugetlb_collapse(mm, start, end);
+		mmap_read_lock(mm);
+		return ret;
+	}
+
+	return madvise_collapse_thp(vma, prev, start, end);
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1204,6 +1222,9 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+#endif
+#if defined(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) || \
+		defined(CONFIG_TRANSPARENT_HUGEPAGE)
 	case MADV_COLLAPSE:
 #endif
 #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
@@ -1397,7 +1418,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
- *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP, or, for HugeTLB
+ *		pages, collapse the mapping.
  *  MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows
  *		UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 35/46] hugetlb: add check to prevent refcount overflow via HGM
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (33 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 34/46] hugetlb: add MADV_COLLAPSE for hugetlb James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-24 17:42   ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 36/46] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
                   ` (11 subsequent siblings)
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

With high-granularity mappings, it becomes quite trivial for userspace
to overflow a page's refcount or mapcount. It can be done like so:

1. Create a 1G hugetlbfs file with a single 1G page.
2. Create 8192 mappings of the file.
3. Use UFFDIO_CONTINUE to map each mapping entirely at 4K granularity.

Each time step 3 is done for a mapping, the refcount and mapcount will
increase by 2^18 (512 * 512). Do that 2^13 times (8192), and you reach
2^31.

To avoid this, WARN_ON_ONCE when the refcount goes negative. If this
happens as a result of a page fault, return VM_FAULT_SIGBUS, and if it
happens as a result of a UFFDIO_CONTINUE, return EFAULT.

We can also create too many mappings by fork()ing a lot with VMAs set up
such that page tables must be copied at fork() time (e.g., if we have
VM_UFFD_WP). Use try_get_page() in copy_hugetlb_page_range() to deal
with this.

Signed-off-by: James Houghton <jthoughton@google.com>
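
Spelling out the arithmetic (plain C, just to check the numbers above):

	#include <stdio.h>

	int main(void)
	{
		/* References taken per mapping of a 1G page at 4K. */
		unsigned long per_mapping = (1UL << 30) / (1UL << 12); /* 2^18 */
		unsigned long mappings = 1UL << 13;                    /* 8192 */

		/* Prints 2147483648 == 2^31, which exceeds the largest
		 * positive value of a signed 32-bit refcount. */
		printf("%lu\n", per_mapping * mappings);
		return 0;
	}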

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c4d189e5f1fd..34368072dabe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5397,7 +5397,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else {
 			ptepage = pte_page(entry);
 			hpage = compound_head(ptepage);
-			get_page(hpage);
+			if (!try_get_page(hpage)) {
+				ret = -EFAULT;
+				break;
+			}
 
 			/*
 			 * Failing to duplicate the anon rmap is a rare case
@@ -6132,6 +6135,30 @@ static bool hugetlb_pte_stable(struct hstate *h, struct hugetlb_pte *hpte,
 	return same;
 }
 
+/*
+ * Like filemap_lock_folio, but check the refcount of the page afterwards to
+ * check if we are at risk of overflowing refcount back to 0.
+ *
+ * This should be used in places that can be used to easily overflow refcount,
+ * like places that create high-granularity mappings.
+ */
+static struct folio *hugetlb_try_find_lock_folio(struct address_space *mapping,
+						pgoff_t idx)
+{
+	struct folio *folio = filemap_lock_folio(mapping, idx);
+
+	/*
+	 * This check is very similar to the one in try_get_page().
+	 *
+	 * This check is inherently racy, so WARN_ON_ONCE() if this condition
+	 * ever occurs.
+	 */
+	if (WARN_ON_ONCE(folio && folio_ref_count(folio) <= 0))
+		return ERR_PTR(-EFAULT);
+
+	return folio;
+}
+
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
@@ -6168,7 +6195,15 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	 * before we get page_table_lock.
 	 */
 	new_folio = false;
-	folio = filemap_lock_folio(mapping, idx);
+	folio = hugetlb_try_find_lock_folio(mapping, idx);
+	if (IS_ERR(folio)) {
+		/*
+		 * We don't want to invoke the OOM killer here, as we aren't
+		 * actually OOMing.
+		 */
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
 	if (!folio) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
@@ -6600,8 +6635,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 	if (is_continue) {
 		ret = -EFAULT;
-		folio = filemap_lock_folio(mapping, idx);
-		if (!folio)
+		folio = hugetlb_try_find_lock_folio(mapping, idx);
+		if (IS_ERR_OR_NULL(folio))
 			goto out;
 		folio_in_pagecache = true;
 	} else if (!*pagep) {
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 36/46] hugetlb: remove huge_pte_lock and huge_pte_lockptr
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (34 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 35/46] hugetlb: add check to prevent refcount overflow via HGM James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 37/46] hugetlb: replace make_huge_pte with make_huge_pte_with_shift James Houghton
                   ` (10 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

They are replaced with hugetlb_pte_lock{,ptr}. Any call sites that have
not already been converted are never reached when HGM is in use, so we
handle them by populating hugetlb_ptes with the standard, hstate-sized
huge PTEs.

Signed-off-by: James Houghton <jthoughton@google.com>
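
For illustration, the conversion pattern for an hstate-sized entry is
the following sketch (h, mm, and ptep come from the surrounding code;
hugetlb_pte_init(), hpage_size_to_level(), and hugetlb_pte_lock() were
introduced earlier in this series):

	struct hugetlb_pte hpte;
	spinlock_t *ptl;

	/* Before: */
	ptl = huge_pte_lock(h, mm, ptep);

	/* After: describe the hstate-sized entry, then lock through it. */
	hugetlb_pte_init(mm, &hpte, ptep, huge_page_shift(h),
			 hpage_size_to_level(huge_page_size(h)));
	ptl = hugetlb_pte_lock(&hpte);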

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 035a0df47af0..c90ac06dc8d9 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -258,11 +258,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 		struct hstate *h = hstate_vma(vma);
+		struct hugetlb_pte hpte;
 
 		psize = hstate_get_psize(h);
 #ifdef CONFIG_DEBUG_VM
-		assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),
-						    vma->vm_mm, ptep));
+		/* HGM is not supported for powerpc yet. */
+		hugetlb_pte_init(vma->vm_mm, &hpte, ptep, huge_page_shift(h),
+				 hpage_size_to_level(huge_page_size(h)));
+		assert_spin_locked(hpte.ptl);
 #endif
 
 #else
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6cd4ae08d84d..742e7f2cb170 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1012,14 +1012,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return modified_mask;
 }
 
-static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
-					   struct mm_struct *mm, pte_t *pte)
-{
-	if (shift == PMD_SHIFT)
-		return pmd_lockptr(mm, (pmd_t *) pte);
-	return &mm->page_table_lock;
-}
-
 #ifndef hugepages_supported
 /*
  * Some platform decide whether they support huge pages at boot
@@ -1228,12 +1220,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
-					   struct mm_struct *mm, pte_t *pte)
-{
-	return &mm->page_table_lock;
-}
-
 static inline void hugetlb_count_init(struct mm_struct *mm)
 {
 }
@@ -1308,16 +1294,6 @@ int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
 }
 #endif
 
-static inline spinlock_t *huge_pte_lock(struct hstate *h,
-					struct mm_struct *mm, pte_t *pte)
-{
-	spinlock_t *ptl;
-
-	ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
-	spin_lock(ptl);
-	return ptl;
-}
-
 static inline
 spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
 {
@@ -1353,8 +1329,22 @@ void hugetlb_pte_init(struct mm_struct *mm, struct hugetlb_pte *hpte,
 		      pte_t *ptep, unsigned int shift,
 		      enum hugetlb_level level)
 {
-	__hugetlb_pte_init(hpte, ptep, shift, level,
-			   huge_pte_lockptr(shift, mm, ptep));
+	spinlock_t *ptl;
+
+	/*
+	 * For contiguous HugeTLB PTEs that can contain other HugeTLB PTEs
+	 * on the same level, the same PTL for both must be used.
+	 *
+	 * For some architectures that implement hugetlb_walk_step, this
+	 * version of hugetlb_pte_init() may not be correct to use for
+	 * high-granularity PTEs. Instead, call __hugetlb_pte_init()
+	 * directly.
+	 */
+	if (level == HUGETLB_LEVEL_PMD)
+		ptl = pmd_lockptr(mm, (pmd_t *) ptep);
+	else
+		ptl = &mm->page_table_lock;
+	__hugetlb_pte_init(hpte, ptep, shift, level, ptl);
 }
 
 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 34368072dabe..e0a92e7c1755 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5454,9 +5454,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				put_page(hpage);
 
 				/* Install the new hugetlb folio if src pte stable */
-				dst_ptl = huge_pte_lock(h, dst, dst_pte);
-				src_ptl = huge_pte_lockptr(huge_page_shift(h),
-							   src, src_pte);
+				dst_ptl = hugetlb_pte_lock(&dst_hpte);
+				src_ptl = hugetlb_pte_lockptr(&src_hpte);
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
@@ -7582,7 +7581,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
-	spinlock_t *ptl;
+	struct hugetlb_pte hpte;
+	struct hstate *shstate;
 
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
@@ -7603,7 +7603,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!spte)
 		goto out;
 
-	ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
+	shstate = hstate_vma(svma);
+
+	hugetlb_pte_init(mm, &hpte, spte, huge_page_shift(shstate),
+			 hpage_size_to_level(huge_page_size(shstate)));
+	spin_lock(hpte.ptl);
 	if (pud_none(*pud)) {
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
@@ -7611,7 +7615,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	} else {
 		put_page(virt_to_page(spte));
 	}
-	spin_unlock(ptl);
+	spin_unlock(hpte.ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	i_mmap_unlock_read(mapping);
@@ -8315,6 +8319,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 	unsigned long address;
 	spinlock_t *ptl;
 	pte_t *ptep;
+	struct hugetlb_pte hpte;
 
 	if (!(vma->vm_flags & VM_MAYSHARE))
 		return;
@@ -8336,7 +8341,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 		ptep = hugetlb_walk(vma, address, sz);
 		if (!ptep)
 			continue;
-		ptl = huge_pte_lock(h, mm, ptep);
+
+		hugetlb_pte_init(mm, &hpte, ptep, huge_page_shift(h),
+				 hpage_size_to_level(sz));
+		ptl = hugetlb_pte_lock(&hpte);
 		huge_pmd_unshare(mm, vma, address, ptep);
 		spin_unlock(ptl);
 	}
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 37/46] hugetlb: replace make_huge_pte with make_huge_pte_with_shift
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (35 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 36/46] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 38/46] mm: smaps: add stats for HugeTLB mapping size James Houghton
                   ` (9 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This removes the old definition of make_huge_pte and renames
make_huge_pte_with_shift to take its place, so the shift must now always
be given explicitly. All callsites are cleaned up accordingly.
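
For illustration, a converted callsite now reads as in the sketch below
(condensed from the __hugetlb_collapse() hunk in this patch, not new
code):

	subpage = hugetlb_find_subpage(h, folio, curr);
	entry = make_huge_pte(vma, subpage, writable, hpte.shift);
	set_huge_pte_at(mm, curr, hpte.ptep, entry);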

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e0a92e7c1755..4c9b3c5379b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5204,9 +5204,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.pagesize = hugetlb_vm_op_pagesize,
 };
 
-static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
-				      struct page *page, int writable,
-				      int shift)
+static pte_t make_huge_pte(struct vm_area_struct *vma,
+			   struct page *page, int writable,
+			   int shift)
 {
 	pte_t entry;
 
@@ -5222,14 +5222,6 @@ static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
 	return entry;
 }
 
-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
-			   int writable)
-{
-	unsigned int shift = huge_page_shift(hstate_vma(vma));
-
-	return make_huge_pte_with_shift(vma, page, writable, shift);
-}
-
 static void set_huge_ptep_writable(struct vm_area_struct *vma,
 				   unsigned long address, pte_t *ptep)
 {
@@ -5272,7 +5264,9 @@ hugetlb_install_folio(struct vm_area_struct *vma, pte_t *ptep, unsigned long add
 {
 	__folio_mark_uptodate(new_folio);
 	hugepage_add_new_anon_rmap(new_folio, vma, addr);
-	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, &new_folio->page, 1));
+	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(
+				vma, &new_folio->page, 1,
+				huge_page_shift(hstate_vma(vma))));
 	hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
 	folio_set_hugetlb_migratable(new_folio);
 }
@@ -6006,7 +6000,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		hugetlb_remove_rmap(old_page, huge_page_shift(h), h, vma);
 		hugepage_add_new_anon_rmap(new_folio, vma, haddr);
 		set_huge_pte_at(mm, haddr, ptep,
-				make_huge_pte(vma, &new_folio->page, !unshare));
+				make_huge_pte(vma, &new_folio->page, !unshare,
+					      huge_page_shift(h)));
 		folio_set_hugetlb_migratable(new_folio);
 		/* Make the old page be freed below */
 		new_folio = page_folio(old_page);
@@ -6348,7 +6343,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	else
 		hugetlb_add_file_rmap(subpage, hpte->shift, h, vma);
 
-	new_pte = make_huge_pte_with_shift(vma, subpage,
+	new_pte = make_huge_pte(vma, subpage,
 			((vma->vm_flags & VM_WRITE)
 			 && (vma->vm_flags & VM_SHARED)),
 			hpte->shift);
@@ -6770,8 +6765,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
 
-	_dst_pte = make_huge_pte_with_shift(dst_vma, subpage, writable,
-			dst_hpte->shift);
+	_dst_pte = make_huge_pte(dst_vma, subpage, writable, dst_hpte->shift);
 	/*
 	 * Always mark UFFDIO_COPY page dirty; note that this may not be
 	 * extremely important for hugetlbfs for now since swapping is not
@@ -8169,8 +8163,7 @@ static int __hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		subpage = hugetlb_find_subpage(h, folio, curr);
-		entry = make_huge_pte_with_shift(vma, subpage,
-						 writable, hpte.shift);
+		entry = make_huge_pte(vma, subpage, writable, hpte.shift);
 		hugetlb_add_file_rmap(subpage, hpte.shift, h, vma);
 		set_huge_pte_at(mm, curr, hpte.ptep, entry);
 		spin_unlock(ptl);
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 38/46] mm: smaps: add stats for HugeTLB mapping size
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (36 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 37/46] hugetlb: replace make_huge_pte with make_huge_pte_with_shift James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 39/46] hugetlb: x86: enable high-granularity mapping for x86_64 James Houghton
                   ` (8 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

When the kernel is compiled with CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING,
smaps may report HugetlbPudMapped, HugetlbPmdMapped, and HugetlbPteMapped.
Levels that are folded will not be shown.
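
For illustration only (not part of this patch), a userspace consumer can
pick the new fields out of smaps by matching the "Hugetlb" key prefix;
the fragment below is a minimal sketch assuming <stdio.h> and <string.h>:

	FILE *f = fopen("/proc/self/smaps", "r");
	char line[128];

	/* The new keys are the only ones that start with "Hugetlb". */
	while (f && fgets(line, sizeof(line), f))
		if (!strncmp(line, "Hugetlb", 7))
			fputs(line, stdout);
	if (f)
		fclose(f);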

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2f293b5dabc0..1ced7300f8cd 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -412,6 +412,15 @@ struct mem_size_stats {
 	unsigned long swap;
 	unsigned long shared_hugetlb;
 	unsigned long private_hugetlb;
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+#ifndef __PAGETABLE_PUD_FOLDED
+	unsigned long hugetlb_pud_mapped;
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+	unsigned long hugetlb_pmd_mapped;
+#endif
+	unsigned long hugetlb_pte_mapped;
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
 	u64 pss;
 	u64 pss_anon;
 	u64 pss_file;
@@ -731,6 +740,33 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
+
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+static void smaps_hugetlb_hgm_account(struct mem_size_stats *mss,
+		struct hugetlb_pte *hpte)
+{
+	unsigned long size = hugetlb_pte_size(hpte);
+
+	switch (hpte->level) {
+#ifndef __PAGETABLE_PUD_FOLDED
+	case HUGETLB_LEVEL_PUD:
+		mss->hugetlb_pud_mapped += size;
+		break;
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+	case HUGETLB_LEVEL_PMD:
+		mss->hugetlb_pmd_mapped += size;
+		break;
+#endif
+	case HUGETLB_LEVEL_PTE:
+		mss->hugetlb_pte_mapped += size;
+		break;
+	default:
+		break;
+	}
+}
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
 static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
 				unsigned long addr,
 				struct mm_walk *walk)
@@ -764,6 +800,9 @@ static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
 			mss->shared_hugetlb += sz;
 		else
 			mss->private_hugetlb += sz;
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+		smaps_hugetlb_hgm_account(mss, hpte);
+#endif
 	}
 	return 0;
 }
@@ -833,38 +872,47 @@ static void smap_gather_stats(struct vm_area_struct *vma,
 static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
 	bool rollup_mode)
 {
-	SEQ_PUT_DEC("Rss:            ", mss->resident);
-	SEQ_PUT_DEC(" kB\nPss:            ", mss->pss >> PSS_SHIFT);
-	SEQ_PUT_DEC(" kB\nPss_Dirty:      ", mss->pss_dirty >> PSS_SHIFT);
+	SEQ_PUT_DEC("Rss:              ", mss->resident);
+	SEQ_PUT_DEC(" kB\nPss:              ", mss->pss >> PSS_SHIFT);
+	SEQ_PUT_DEC(" kB\nPss_Dirty:        ", mss->pss_dirty >> PSS_SHIFT);
 	if (rollup_mode) {
 		/*
 		 * These are meaningful only for smaps_rollup, otherwise two of
 		 * them are zero, and the other one is the same as Pss.
 		 */
-		SEQ_PUT_DEC(" kB\nPss_Anon:       ",
+		SEQ_PUT_DEC(" kB\nPss_Anon:         ",
 			mss->pss_anon >> PSS_SHIFT);
-		SEQ_PUT_DEC(" kB\nPss_File:       ",
+		SEQ_PUT_DEC(" kB\nPss_File:         ",
 			mss->pss_file >> PSS_SHIFT);
-		SEQ_PUT_DEC(" kB\nPss_Shmem:      ",
+		SEQ_PUT_DEC(" kB\nPss_Shmem:        ",
 			mss->pss_shmem >> PSS_SHIFT);
 	}
-	SEQ_PUT_DEC(" kB\nShared_Clean:   ", mss->shared_clean);
-	SEQ_PUT_DEC(" kB\nShared_Dirty:   ", mss->shared_dirty);
-	SEQ_PUT_DEC(" kB\nPrivate_Clean:  ", mss->private_clean);
-	SEQ_PUT_DEC(" kB\nPrivate_Dirty:  ", mss->private_dirty);
-	SEQ_PUT_DEC(" kB\nReferenced:     ", mss->referenced);
-	SEQ_PUT_DEC(" kB\nAnonymous:      ", mss->anonymous);
-	SEQ_PUT_DEC(" kB\nLazyFree:       ", mss->lazyfree);
-	SEQ_PUT_DEC(" kB\nAnonHugePages:  ", mss->anonymous_thp);
-	SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
-	SEQ_PUT_DEC(" kB\nFilePmdMapped:  ", mss->file_thp);
-	SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
-	seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ",
+	SEQ_PUT_DEC(" kB\nShared_Clean:     ", mss->shared_clean);
+	SEQ_PUT_DEC(" kB\nShared_Dirty:     ", mss->shared_dirty);
+	SEQ_PUT_DEC(" kB\nPrivate_Clean:    ", mss->private_clean);
+	SEQ_PUT_DEC(" kB\nPrivate_Dirty:    ", mss->private_dirty);
+	SEQ_PUT_DEC(" kB\nReferenced:       ", mss->referenced);
+	SEQ_PUT_DEC(" kB\nAnonymous:        ", mss->anonymous);
+	SEQ_PUT_DEC(" kB\nLazyFree:         ", mss->lazyfree);
+	SEQ_PUT_DEC(" kB\nAnonHugePages:    ", mss->anonymous_thp);
+	SEQ_PUT_DEC(" kB\nShmemPmdMapped:   ", mss->shmem_thp);
+	SEQ_PUT_DEC(" kB\nFilePmdMapped:    ", mss->file_thp);
+	SEQ_PUT_DEC(" kB\nShared_Hugetlb:   ", mss->shared_hugetlb);
+	seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb:   ",
 				  mss->private_hugetlb >> 10, 7);
-	SEQ_PUT_DEC(" kB\nSwap:           ", mss->swap);
-	SEQ_PUT_DEC(" kB\nSwapPss:        ",
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+#ifndef __PAGETABLE_PUD_FOLDED
+	SEQ_PUT_DEC(" kB\nHugetlbPudMapped: ", mss->hugetlb_pud_mapped);
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+	SEQ_PUT_DEC(" kB\nHugetlbPmdMapped: ", mss->hugetlb_pmd_mapped);
+#endif
+	SEQ_PUT_DEC(" kB\nHugetlbPteMapped: ", mss->hugetlb_pte_mapped);
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+	SEQ_PUT_DEC(" kB\nSwap:             ", mss->swap);
+	SEQ_PUT_DEC(" kB\nSwapPss:          ",
 					mss->swap_pss >> PSS_SHIFT);
-	SEQ_PUT_DEC(" kB\nLocked:         ",
+	SEQ_PUT_DEC(" kB\nLocked:           ",
 					mss->pss_locked >> PSS_SHIFT);
 	seq_puts(m, " kB\n");
 }
@@ -880,18 +928,18 @@ static int show_smap(struct seq_file *m, void *v)
 
 	show_map_vma(m, vma);
 
-	SEQ_PUT_DEC("Size:           ", vma->vm_end - vma->vm_start);
-	SEQ_PUT_DEC(" kB\nKernelPageSize: ", vma_kernel_pagesize(vma));
-	SEQ_PUT_DEC(" kB\nMMUPageSize:    ", vma_mmu_pagesize(vma));
+	SEQ_PUT_DEC("Size:             ", vma->vm_end - vma->vm_start);
+	SEQ_PUT_DEC(" kB\nKernelPageSize:   ", vma_kernel_pagesize(vma));
+	SEQ_PUT_DEC(" kB\nMMUPageSize:      ", vma_mmu_pagesize(vma));
 	seq_puts(m, " kB\n");
 
 	__show_smap(m, &mss, false);
 
-	seq_printf(m, "THPeligible:    %d\n",
+	seq_printf(m, "THPeligible:      %d\n",
 		   hugepage_vma_check(vma, vma->vm_flags, true, false, true));
 
 	if (arch_pkeys_enabled())
-		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
+		seq_printf(m, "ProtectionKey:    %8u\n", vma_pkey(vma));
 	show_smap_vma_flags(m, vma);
 
 	return 0;
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 39/46] hugetlb: x86: enable high-granularity mapping for x86_64
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (37 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 38/46] mm: smaps: add stats for HugeTLB mapping size James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 40/46] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info James Houghton
                   ` (7 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Now that HGM is fully supported for GENERAL_HUGETLB, we can enable it
for x86_64. We can only enable it for 64-bit architectures because the
vm flag VM_HUGETLB_HGM uses a high bit.

The x86 KVM MMU already properly handles HugeTLB HGM pages (it does a
page table walk to determine which size to use in the second-stage page
table instead of, for example, checking vma_mmu_pagesize, like arm64
does).

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..fde9ba1dd8d7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -126,6 +126,7 @@ config X86
 	select ARCH_WANT_GENERAL_HUGETLB
 	select ARCH_WANT_HUGE_PMD_SHARE
 	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP	if X86_64
+	select ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING if X86_64
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 40/46] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (38 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 39/46] hugetlb: x86: enable high-granularity mapping for x86_64 James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 41/46] docs: proc: include information about HugeTLB HGM James Houghton
                   ` (6 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Include information about how MADV_SPLIT should be used to enable
high-granularity UFFDIO_CONTINUE operations, and how MADV_COLLAPSE should
be used to collapse the mappings at the end.
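
For illustration, the userspace flow being documented boils down to the
sketch below (error handling omitted; addr, len, and uffd are placeholder
names for a MAP_SHARED HugeTLB mapping and its userfaultfd, not
identifiers taken from this patch):

	/* Opt the VMA in to high-granularity userfaultfd operations. */
	madvise(addr, len, MADV_SPLIT);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Resolve a minor fault for a single 4K piece of the hugepage. */
	struct uffdio_continue cont = {
		.range = { .start = (unsigned long)addr, .len = 4096 },
	};
	ioctl(uffd, UFFDIO_CONTINUE, &cont);

	/* Once the hugepage is fully populated, collapse the mapping. */
	madvise(addr, len, MADV_COLLAPSE);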

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index a969a2c742b2..c6eaef785609 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -454,6 +454,10 @@ errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
 not hugepage aligned.  For example, munmap(2) will fail if memory is backed by
 a hugetlb page and the length is smaller than the hugepage size.
 
+It is possible for users to map HugeTLB pages at a higher granularity than
+normal using HugeTLB high-granularity mapping (HGM). For example, when using 1G
+pages on x86, a user could map that page with 4K PTEs, 2M PMDs, or a
+combination of the two. See Documentation/admin-guide/mm/userfaultfd.rst.
 
 Examples
 ========
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 83f31919ebb3..cc496a307ea2 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -169,7 +169,13 @@ like to do to resolve it:
   the page cache). Userspace has the option of modifying the page's
   contents before resolving the fault. Once the contents are correct
   (modified or not), userspace asks the kernel to map the page and let the
-  faulting thread continue with ``UFFDIO_CONTINUE``.
+  faulting thread continue with ``UFFDIO_CONTINUE``. If this is done at the
+  base-page size in a transparent-hugepage-eligible VMA or in a HugeTLB VMA
+  (requires ``MADV_SPLIT``), then userspace may want to use
+  ``MADV_COLLAPSE`` when a hugepage is fully populated to inform the kernel
+  that it may be able to collapse the mapping. ``MADV_COLLAPSE`` will undo
+  the effect of any ``UFFDIO_WRITEPROTECT`` calls on the collapsed address
+  range.
 
 Notes:
 
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 41/46] docs: proc: include information about HugeTLB HGM
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (39 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 40/46] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 42/46] selftests/mm: add HugeTLB HGM to userfaultfd selftest James Houghton
                   ` (5 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Document the updates that have been made to smaps, specifically the
addition of Hugetlb[Pud,Pmd,Pte]Mapped.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e224b6d5b642..1d2a1cd1fe6a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -447,29 +447,32 @@ Memory Area, or VMA) there is a series of lines such as the following::
 
     08048000-080bc000 r-xp 00000000 03:02 13130      /bin/bash
 
-    Size:               1084 kB
-    KernelPageSize:        4 kB
-    MMUPageSize:           4 kB
-    Rss:                 892 kB
-    Pss:                 374 kB
-    Pss_Dirty:             0 kB
-    Shared_Clean:        892 kB
-    Shared_Dirty:          0 kB
-    Private_Clean:         0 kB
-    Private_Dirty:         0 kB
-    Referenced:          892 kB
-    Anonymous:             0 kB
-    LazyFree:              0 kB
-    AnonHugePages:         0 kB
-    ShmemPmdMapped:        0 kB
-    Shared_Hugetlb:        0 kB
-    Private_Hugetlb:       0 kB
-    Swap:                  0 kB
-    SwapPss:               0 kB
-    KernelPageSize:        4 kB
-    MMUPageSize:           4 kB
-    Locked:                0 kB
-    THPeligible:           0
+    Size:                 1084 kB
+    KernelPageSize:          4 kB
+    MMUPageSize:             4 kB
+    Rss:                   892 kB
+    Pss:                   374 kB
+    Pss_Dirty:               0 kB
+    Shared_Clean:          892 kB
+    Shared_Dirty:            0 kB
+    Private_Clean:           0 kB
+    Private_Dirty:           0 kB
+    Referenced:            892 kB
+    Anonymous:               0 kB
+    LazyFree:                0 kB
+    AnonHugePages:           0 kB
+    ShmemPmdMapped:          0 kB
+    Shared_Hugetlb:          0 kB
+    Private_Hugetlb:         0 kB
+    HugetlbPudMapped:        0 kB
+    HugetlbPmdMapped:        0 kB
+    HugetlbPteMapped:        0 kB
+    Swap:                    0 kB
+    SwapPss:                 0 kB
+    KernelPageSize:          4 kB
+    MMUPageSize:             4 kB
+    Locked:                  0 kB
+    THPeligible:             0
     VmFlags: rd ex mr mw me dw
 
 The first of these lines shows the same information as is displayed for the
@@ -510,10 +513,15 @@ implementation. If this is not desirable please file a bug report.
 "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
 huge pages.
 
-"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
+"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
 hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
 reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
 
+If the kernel was compiled with ``CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING``,
+"HugetlbPudMapped", "HugetlbPmdMapped", and "HugetlbPteMapped" may appear and
+show the amount of HugeTLB memory mapped with PUDs, PMDs, and PTEs respectively.
+Folded levels won't appear. See Documentation/admin-guide/mm/hugetlbpage.rst.
+
 "Swap" shows how much would-be-anonymous memory is also used, but out on swap.
 
 For shmem mappings, "Swap" includes also the size of the mapped (and not
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 42/46] selftests/mm: add HugeTLB HGM to userfaultfd selftest
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (40 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 41/46] docs: proc: include information about HugeTLB HGM James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 43/46] KVM: selftests: add HugeTLB HGM to KVM demand paging selftest James Houghton
                   ` (4 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This test case behaves similarly to the regular shared HugeTLB
configuration, except that it uses 4K pages instead of hugepages and skips
the UFFDIO_COPY tests, as UFFDIO_CONTINUE is the only ioctl that supports
PAGE_SIZE-aligned regions.

This doesn't test MADV_COLLAPSE. Other tests are added later to exercise
MADV_COLLAPSE.
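
For readers skimming the diff, the sizing rule the new mode relies on
condenses to the following (variable names as in userfaultfd.c; a
simplified sketch, not a verbatim excerpt):

	/* Faults are resolved at the base page size, but the test area
	 * must still cover a whole number of hugepages. */
	page_size = sysconf(_SC_PAGE_SIZE);
	nr_pages_per_cpu -= nr_pages_per_cpu % (hpage_size / page_size);
	nr_pages = nr_pages_per_cpu * nr_cpus;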

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/tools/testing/selftests/mm/userfaultfd.c b/tools/testing/selftests/mm/userfaultfd.c
index 7f22844ed704..681c5c5f863b 100644
--- a/tools/testing/selftests/mm/userfaultfd.c
+++ b/tools/testing/selftests/mm/userfaultfd.c
@@ -73,9 +73,10 @@ static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size, hpage_size;
 #define BOUNCE_POLL		(1<<3)
 static int bounces;
 
-#define TEST_ANON	1
-#define TEST_HUGETLB	2
-#define TEST_SHMEM	3
+#define TEST_ANON		1
+#define TEST_HUGETLB		2
+#define TEST_HUGETLB_HGM	3
+#define TEST_SHMEM		4
 static int test_type;
 
 #define UFFD_FLAGS	(O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY)
@@ -93,6 +94,8 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
+static bool test_uffdio_copy = true;
+
 static bool map_shared;
 static int mem_fd;
 static unsigned long long *count_verify;
@@ -151,7 +154,7 @@ static void usage(void)
 	fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
 		"[hugetlbfs_file]\n\n");
 	fprintf(stderr, "Supported <test type>: anon, hugetlb, "
-		"hugetlb_shared, shmem\n\n");
+		"hugetlb_shared, hugetlb_shared_hgm, shmem\n\n");
 	fprintf(stderr, "'Test mods' can be joined to the test type string with a ':'. "
 		"Supported mods:\n");
 	fprintf(stderr, "\tsyscall - Use userfaultfd(2) (default)\n");
@@ -167,6 +170,11 @@ static void usage(void)
 	exit(1);
 }
 
+static bool test_is_hugetlb(void)
+{
+	return test_type == TEST_HUGETLB || test_type == TEST_HUGETLB_HGM;
+}
+
 #define _err(fmt, ...)						\
 	do {							\
 		int ret = errno;				\
@@ -381,7 +389,7 @@ static struct uffd_test_ops *uffd_test_ops;
 
 static inline uint64_t uffd_minor_feature(void)
 {
-	if (test_type == TEST_HUGETLB && map_shared)
+	if (test_is_hugetlb() && map_shared)
 		return UFFD_FEATURE_MINOR_HUGETLBFS;
 	else if (test_type == TEST_SHMEM)
 		return UFFD_FEATURE_MINOR_SHMEM;
@@ -393,7 +401,7 @@ static uint64_t get_expected_ioctls(uint64_t mode)
 {
 	uint64_t ioctls = UFFD_API_RANGE_IOCTLS;
 
-	if (test_type == TEST_HUGETLB)
+	if (test_is_hugetlb())
 		ioctls &= ~(1 << _UFFDIO_ZEROPAGE);
 
 	if (!((mode & UFFDIO_REGISTER_MODE_WP) && test_uffdio_wp))
@@ -500,13 +508,16 @@ static void uffd_test_ctx_clear(void)
 static void uffd_test_ctx_init(uint64_t features)
 {
 	unsigned long nr, cpu;
+	uint64_t enabled_features = features;
 
 	uffd_test_ctx_clear();
 
 	uffd_test_ops->allocate_area((void **)&area_src, true);
 	uffd_test_ops->allocate_area((void **)&area_dst, false);
 
-	userfaultfd_open(&features);
+	userfaultfd_open(&enabled_features);
+	if ((enabled_features & features) != features)
+		err("couldn't enable all features");
 
 	count_verify = malloc(nr_pages * sizeof(unsigned long long));
 	if (!count_verify)
@@ -726,13 +737,16 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 				   struct uffd_stats *stats)
 {
 	unsigned long offset;
+	unsigned long address;
 
 	if (msg->event != UFFD_EVENT_PAGEFAULT)
 		err("unexpected msg event %u", msg->event);
 
+	address = msg->arg.pagefault.address;
+
 	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
 		/* Write protect page faults */
-		wp_range(uffd, msg->arg.pagefault.address, page_size, false);
+		wp_range(uffd, address, page_size, false);
 		stats->wp_faults++;
 	} else if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
 		uint8_t *area;
@@ -751,11 +765,10 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 		 */
 
 		area = (uint8_t *)(area_dst +
-				   ((char *)msg->arg.pagefault.address -
-				    area_dst_alias));
+				   ((char *)address - area_dst_alias));
 		for (b = 0; b < page_size; ++b)
 			area[b] = ~area[b];
-		continue_range(uffd, msg->arg.pagefault.address, page_size);
+		continue_range(uffd, address, page_size);
 		stats->minor_faults++;
 	} else {
 		/*
@@ -782,7 +795,7 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 		if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
 			err("unexpected write fault");
 
-		offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+		offset = (char *)address - area_dst;
 		offset &= ~(page_size-1);
 
 		if (copy_page(uffd, offset))
@@ -1192,6 +1205,12 @@ static int userfaultfd_events_test(void)
 	char c;
 	struct uffd_stats stats = { 0 };
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd events test "
+			"(test_uffdio_copy=false)\n");
+		return 0;
+	}
+
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
 
@@ -1245,6 +1264,12 @@ static int userfaultfd_sig_test(void)
 	char c;
 	struct uffd_stats stats = { 0 };
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd signal test "
+			"(test_uffdio_copy=false)\n");
+		return 0;
+	}
+
 	printf("testing signal delivery: ");
 	fflush(stdout);
 
@@ -1329,6 +1354,11 @@ static int userfaultfd_minor_test(void)
 
 	uffd_test_ctx_init(uffd_minor_feature());
 
+	if (test_type == TEST_HUGETLB_HGM)
+		/* Enable high-granularity userfaultfd ioctls for HugeTLB */
+		if (madvise(area_dst_alias, nr_pages * page_size, MADV_SPLIT))
+			err("MADV_SPLIT failed");
+
 	uffdio_register.range.start = (unsigned long)area_dst_alias;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR;
@@ -1538,6 +1568,12 @@ static int userfaultfd_stress(void)
 	pthread_attr_init(&attr);
 	pthread_attr_setstacksize(&attr, 16*1024*1024);
 
+	if (!test_uffdio_copy) {
+		printf("Skipping userfaultfd stress test "
+			"(test_uffdio_copy=false)\n");
+		bounces = 0;
+	}
+
 	while (bounces--) {
 		printf("bounces: %d, mode:", bounces);
 		if (bounces & BOUNCE_RANDOM)
@@ -1696,6 +1732,16 @@ static void set_test_type(const char *type)
 		uffd_test_ops = &hugetlb_uffd_test_ops;
 		/* Minor faults require shared hugetlb; only enable here. */
 		test_uffdio_minor = true;
+	} else if (!strcmp(type, "hugetlb_shared_hgm")) {
+		map_shared = true;
+		test_type = TEST_HUGETLB_HGM;
+		uffd_test_ops = &hugetlb_uffd_test_ops;
+		/*
+		 * HugeTLB HGM only changes UFFDIO_CONTINUE, so don't test
+		 * UFFDIO_COPY.
+		 */
+		test_uffdio_minor = true;
+		test_uffdio_copy = false;
 	} else if (!strcmp(type, "shmem")) {
 		map_shared = true;
 		test_type = TEST_SHMEM;
@@ -1731,6 +1777,7 @@ static void parse_test_type_arg(const char *raw_type)
 		err("Unsupported test: %s", raw_type);
 
 	if (test_type == TEST_HUGETLB)
+		/* TEST_HUGETLB_HGM gets small pages. */
 		page_size = hpage_size;
 	else
 		page_size = sysconf(_SC_PAGE_SIZE);
@@ -1813,22 +1860,29 @@ int main(int argc, char **argv)
 		nr_cpus = x < y ? x : y;
 	}
 	nr_pages_per_cpu = bytes / page_size / nr_cpus;
+	if (test_type == TEST_HUGETLB_HGM)
+		/*
+		 * `page_size` refers to the page_size we can use in
+		 * UFFDIO_CONTINUE. We still need nr_pages to be appropriately
+		 * aligned, so align it here.
+		 */
+		nr_pages_per_cpu -= nr_pages_per_cpu % (hpage_size / page_size);
 	if (!nr_pages_per_cpu) {
 		_err("invalid MiB");
 		usage();
 	}
+	nr_pages = nr_pages_per_cpu * nr_cpus;
 
 	bounces = atoi(argv[3]);
 	if (bounces <= 0) {
 		_err("invalid bounces");
 		usage();
 	}
-	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_SHMEM || test_type == TEST_HUGETLB) {
+	if (test_type == TEST_SHMEM || test_is_hugetlb()) {
 		unsigned int memfd_flags = 0;
 
-		if (test_type == TEST_HUGETLB)
+		if (test_is_hugetlb())
 			memfd_flags = MFD_HUGETLB;
 		mem_fd = memfd_create(argv[0], memfd_flags);
 		if (mem_fd < 0)
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 43/46] KVM: selftests: add HugeTLB HGM to KVM demand paging selftest
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (41 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 42/46] selftests/mm: add HugeTLB HGM to userfaultfd selftest James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 44/46] selftests/mm: add anon and shared hugetlb to migration test James Houghton
                   ` (3 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This test exercises the GUP paths for HGM. MADV_COLLAPSE is not tested.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index b0e1fc4de9e2..e534f9c927bf 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -170,7 +170,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			uffd_descs[i] = uffd_setup_demand_paging(
 				p->uffd_mode, p->uffd_delay, vcpu_hva,
 				vcpu_args->pages * memstress_args.guest_page_size,
-				&handle_uffd_page_request);
+				p->src_type, &handle_uffd_page_request);
 		}
 	}
 
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 80d6416f3012..a2106c19a614 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -103,6 +103,7 @@ enum vm_mem_backing_src_type {
 	VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
 	VM_MEM_SRC_SHMEM,
 	VM_MEM_SRC_SHARED_HUGETLB,
+	VM_MEM_SRC_SHARED_HUGETLB_HGM,
 	NUM_SRC_TYPES,
 };
 
@@ -121,6 +122,7 @@ size_t get_def_hugetlb_pagesz(void);
 const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
 size_t get_backing_src_pagesz(uint32_t i);
 bool is_backing_src_hugetlb(uint32_t i);
+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type);
 void backing_src_help(const char *flag);
 enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
 long get_run_delay(void);
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c34592..d91528a58245 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -26,9 +26,9 @@ struct uffd_desc {
 	pthread_t thread;
 };
 
-struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
-					   void *hva, uint64_t len,
-					   uffd_handler_t handler);
+struct uffd_desc *uffd_setup_demand_paging(
+		int uffd_mode, useconds_t delay, void *hva, uint64_t len,
+		enum vm_mem_backing_src_type src_type, uffd_handler_t handler);
 
 void uffd_stop_demand_paging(struct uffd_desc *uffd);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 56d5ea949cbb..b9c398dc295d 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -981,7 +981,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	region->fd = -1;
 	if (backing_src_is_shared(src_type))
 		region->fd = kvm_memfd_alloc(region->mmap_size,
-					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
+				is_backing_src_shared_hugetlb(src_type));
 
 	region->mmap_start = mmap(NULL, region->mmap_size,
 				  PROT_READ | PROT_WRITE,
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 5c22fa4c2825..712a0878932e 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -271,6 +271,13 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
 			 */
 			.flag = MAP_SHARED,
 		},
+		[VM_MEM_SRC_SHARED_HUGETLB_HGM] = {
+			/*
+			 * Identical to shared_hugetlb except for the name.
+			 */
+			.name = "shared_hugetlb_hgm",
+			.flag = MAP_SHARED,
+		},
 	};
 	_Static_assert(ARRAY_SIZE(aliases) == NUM_SRC_TYPES,
 		       "Missing new backing src types?");
@@ -289,6 +296,7 @@ size_t get_backing_src_pagesz(uint32_t i)
 	switch (i) {
 	case VM_MEM_SRC_ANONYMOUS:
 	case VM_MEM_SRC_SHMEM:
+	case VM_MEM_SRC_SHARED_HUGETLB_HGM:
 		return getpagesize();
 	case VM_MEM_SRC_ANONYMOUS_THP:
 		return get_trans_hugepagesz();
@@ -305,6 +313,12 @@ bool is_backing_src_hugetlb(uint32_t i)
 	return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
 }
 
+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type)
+{
+	return src_type == VM_MEM_SRC_SHARED_HUGETLB ||
+		src_type == VM_MEM_SRC_SHARED_HUGETLB_HGM;
+}
+
 static void print_available_backing_src_types(const char *prefix)
 {
 	int i;
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 92cef20902f1..3c7178d6c4f4 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -25,6 +25,10 @@
 
 #ifdef __NR_userfaultfd
 
+#ifndef MADV_SPLIT
+#define MADV_SPLIT 26
+#endif
+
 static void *uffd_handler_thread_fn(void *arg)
 {
 	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
@@ -108,9 +112,9 @@ static void *uffd_handler_thread_fn(void *arg)
 	return NULL;
 }
 
-struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
-					   void *hva, uint64_t len,
-					   uffd_handler_t handler)
+struct uffd_desc *uffd_setup_demand_paging(
+		int uffd_mode, useconds_t delay, void *hva, uint64_t len,
+		enum vm_mem_backing_src_type src_type, uffd_handler_t handler)
 {
 	struct uffd_desc *uffd_desc;
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -140,6 +144,10 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 		    "ioctl UFFDIO_API failed: %" PRIu64,
 		    (uint64_t)uffdio_api.api);
 
+	if (src_type == VM_MEM_SRC_SHARED_HUGETLB_HGM)
+		TEST_ASSERT(!madvise(hva, len, MADV_SPLIT),
+				"Could not enable HGM");
+
 	uffdio_register.range.start = (uint64_t)hva;
 	uffdio_register.range.len = len;
 	uffdio_register.mode = uffd_mode;
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 44/46] selftests/mm: add anon and shared hugetlb to migration test
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (42 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 43/46] KVM: selftests: add HugeTLB HGM to KVM demand paging selftest James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 45/46] selftests/mm: add hugetlb HGM test to migration selftest James Houghton
                   ` (2 subsequent siblings)
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Shared HugeTLB mappings are migrated best-effort. Sometimes, when the VMA
lock cannot be taken for writing, migration may simply fail. To account
for that, allow a bounded number of retries.
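
For reference, the retry pattern amounts to the standalone sketch below
(not the test code itself; it assumes <numaif.h>, libnuma, and a
CONFIG_NUMA kernel, and the helper name is made up for illustration):

	/* Move one page to `node`, tolerating a bounded number of
	 * best-effort failures. */
	static int migrate_with_retries(void *ptr, int node, int retries)
	{
		int status, failed = 0;

		while (move_pages(0, 1, &ptr, &node, &status, MPOL_MF_MOVE_ALL)) {
			if (++failed >= retries)
				return -1;
		}
		return 0;
	}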

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
index 1cec8425e3ca..21577a84d7e4 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -13,6 +13,7 @@
 #include <sys/types.h>
 #include <signal.h>
 #include <time.h>
+#include <sys/statfs.h>
 
 #define TWOMEG (2<<20)
 #define RUNTIME (60)
@@ -59,11 +60,12 @@ FIXTURE_TEARDOWN(migration)
 	free(self->pids);
 }
 
-int migrate(uint64_t *ptr, int n1, int n2)
+int migrate(uint64_t *ptr, int n1, int n2, int retries)
 {
 	int ret, tmp;
 	int status = 0;
 	struct timespec ts1, ts2;
+	int failed = 0;
 
 	if (clock_gettime(CLOCK_MONOTONIC, &ts1))
 		return -1;
@@ -78,6 +80,9 @@ int migrate(uint64_t *ptr, int n1, int n2)
 		ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
 				MPOL_MF_MOVE_ALL);
 		if (ret) {
+			if (++failed < retries)
+				continue;
+
 			if (ret > 0)
 				printf("Didn't migrate %d pages\n", ret);
 			else
@@ -88,6 +93,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
 		tmp = n2;
 		n2 = n1;
 		n1 = tmp;
+		failed = 0;
 	}
 
 	return 0;
@@ -128,7 +134,7 @@ TEST_F_TIMEOUT(migration, private_anon, 2*RUNTIME)
 		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
 			perror("Couldn't create thread");
 
-	ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
 	for (i = 0; i < self->nthreads - 1; i++)
 		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
 }
@@ -158,7 +164,7 @@ TEST_F_TIMEOUT(migration, shared_anon, 2*RUNTIME)
 			self->pids[i] = pid;
 	}
 
-	ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
 	for (i = 0; i < self->nthreads - 1; i++)
 		ASSERT_EQ(kill(self->pids[i], SIGTERM), 0);
 }
@@ -185,9 +191,78 @@ TEST_F_TIMEOUT(migration, private_anon_thp, 2*RUNTIME)
 		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
 			perror("Couldn't create thread");
 
-	ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
+	for (i = 0; i < self->nthreads - 1; i++)
+		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+}
+
+/*
+ * Tests the anon hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, private_anon_hugetlb, 2*RUNTIME)
+{
+	uint64_t *ptr;
+	int i;
+
+	if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+		SKIP(return, "Not enough threads or NUMA nodes available");
+
+	ptr = mmap(NULL, TWOMEG, PROT_READ | PROT_WRITE,
+		MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+	if (ptr == MAP_FAILED)
+		SKIP(return, "Could not allocate hugetlb pages");
+
+	memset(ptr, 0xde, TWOMEG);
+	for (i = 0; i < self->nthreads - 1; i++)
+		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+			perror("Couldn't create thread");
+
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
 	for (i = 0; i < self->nthreads - 1; i++)
 		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
 }
 
+/*
+ * Tests the shared hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME)
+{
+	uint64_t *ptr;
+	int i;
+	int fd;
+	unsigned long sz;
+	struct statfs filestat;
+
+	if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+		SKIP(return, "Not enough threads or NUMA nodes available");
+
+	fd = memfd_create("tmp_hugetlb", MFD_HUGETLB);
+	if (fd < 0)
+		SKIP(return, "Couldn't create hugetlb memfd");
+
+	if (fstatfs(fd, &filestat) < 0)
+		SKIP(return, "Couldn't fstatfs hugetlb file");
+
+	sz = filestat.f_bsize;
+
+	if (ftruncate(fd, sz))
+		SKIP(return, "Couldn't allocate hugetlb pages");
+	ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (ptr == MAP_FAILED)
+		SKIP(return, "Could not map hugetlb pages");
+
+	memset(ptr, 0xde, sz);
+	for (i = 0; i < self->nthreads - 1; i++)
+		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+			perror("Couldn't create thread");
+
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0);
+	for (i = 0; i < self->nthreads - 1; i++) {
+		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+		pthread_join(self->threads[i], NULL);
+	}
+	ftruncate(fd, 0);
+	close(fd);
+}
+
 TEST_HARNESS_MAIN
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 45/46] selftests/mm: add hugetlb HGM test to migration selftest
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (43 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 44/46] selftests/mm: add anon and shared hugetlb to migration test James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-18  0:28 ` [PATCH v2 46/46] selftests/mm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
  2023-02-21 21:46 ` [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping Mike Kravetz
  46 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

This is mostly the same as the shared HugeTLB case, but instead of
mapping the page with a regular page fault, we map it with many
UFFDIO_CONTINUE operations. We also verify that the contents haven't
changed after the migration; they would have changed if the
post-migration PTEs pointed to the wrong page.

Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
index 21577a84d7e4..1fb3607accab 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -14,12 +14,21 @@
 #include <signal.h>
 #include <time.h>
 #include <sys/statfs.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <sys/syscall.h>
+#include <fcntl.h>
 
 #define TWOMEG (2<<20)
 #define RUNTIME (60)
 
 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
 
+#ifndef MADV_SPLIT
+#define MADV_SPLIT 26
+#endif
+
 FIXTURE(migration)
 {
 	pthread_t *threads;
@@ -265,4 +274,141 @@ TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME)
 	close(fd);
 }
 
+#ifdef __NR_userfaultfd
+static int map_at_high_granularity(char *mem, size_t length)
+{
+	int i;
+	int ret;
+	int uffd = syscall(__NR_userfaultfd, 0);
+	struct uffdio_api api;
+	struct uffdio_register reg;
+	int pagesize = getpagesize();
+
+	if (uffd < 0) {
+		perror("couldn't create uffd");
+		return uffd;
+	}
+
+	api.api = UFFD_API;
+	api.features = 0;
+
+	ret = ioctl(uffd, UFFDIO_API, &api);
+	if (ret || api.api != UFFD_API) {
+		perror("UFFDIO_API failed");
+		goto out;
+	}
+
+	if (madvise(mem, length, MADV_SPLIT) == -1) {
+		perror("MADV_SPLIT failed");
+		goto out;
+	}
+
+	reg.range.start = (unsigned long)mem;
+	reg.range.len = length;
+
+	reg.mode = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_MINOR;
+
+	ret = ioctl(uffd, UFFDIO_REGISTER, &reg);
+	if (ret) {
+		perror("UFFDIO_REGISTER failed");
+		goto out;
+	}
+
+	/* UFFDIO_CONTINUE each 4K segment of the 2M page. */
+	for (i = 0; i < length/pagesize; ++i) {
+		struct uffdio_continue cont;
+
+		cont.range.start = (unsigned long long)mem + i * pagesize;
+		cont.range.len = pagesize;
+		cont.mode = 0;
+		ret = ioctl(uffd, UFFDIO_CONTINUE, &cont);
+		if (ret) {
+			fprintf(stderr, "UFFDIO_CONTINUE failed "
+					"for %llx -> %llx: %d\n",
+					cont.range.start,
+					cont.range.start + cont.range.len,
+					errno);
+			goto out;
+		}
+	}
+	ret = 0;
+out:
+	close(uffd);
+	return ret;
+}
+#else
+static int map_at_high_granularity(char *mem, size_t length)
+{
+	fprintf(stderr, "Userfaultfd missing\n");
+	return -1;
+}
+#endif /* __NR_userfaultfd */
+
+/*
+ * Tests the high-granularity hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, shared_hugetlb_hgm, 2*RUNTIME)
+{
+	uint64_t *ptr;
+	int i;
+	int fd;
+	unsigned long sz;
+	struct statfs filestat;
+
+	if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+		SKIP(return, "Not enough threads or NUMA nodes available");
+
+	fd = memfd_create("tmp_hugetlb", MFD_HUGETLB);
+	if (fd < 0)
+		SKIP(return, "Couldn't create hugetlb memfd");
+
+	if (fstatfs(fd, &filestat) < 0)
+		SKIP(return, "Couldn't fstatfs hugetlb file");
+
+	sz = filestat.f_bsize;
+
+	if (ftruncate(fd, sz))
+		SKIP(return, "Couldn't allocate hugetlb pages");
+
+	if (fallocate(fd, 0, 0, sz) < 0) {
+		perror("fallocate failed");
+		SKIP(return, "fallocate failed");
+	}
+
+	ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (ptr == MAP_FAILED)
+		SKIP(return, "Could not map hugetlb pages");
+
+	/*
+	 * We have to map_at_high_granularity before we memset, otherwise
+	 * memset will map everything at the hugepage size.
+	 */
+	if (map_at_high_granularity((char *)ptr, sz) < 0)
+		SKIP(return, "Could not map HugeTLB range at high granularity");
+
+	/* Populate the page we're migrating. */
+	for (i = 0; i < sz/sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	for (i = 0; i < self->nthreads - 1; i++)
+		if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+			perror("Couldn't create thread");
+
+	ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0);
+	for (i = 0; i < self->nthreads - 1; i++) {
+		ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+		pthread_join(self->threads[i], NULL);
+	}
+
+	/* Check that the contents didn't change. */
+	for (i = 0; i < sz/sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], i);
+		if (ptr[i] != i)
+			break;
+	}
+
+	ftruncate(fd, 0);
+	close(fd);
+}
+
 TEST_HARNESS_MAIN
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 46/46] selftests/mm: add HGM UFFDIO_CONTINUE and hwpoison tests
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (44 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 45/46] selftests/mm: add hugetlb HGM test to migration selftest James Houghton
@ 2023-02-18  0:28 ` James Houghton
  2023-02-24 17:37   ` James Houghton
  2023-02-21 21:46 ` [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping Mike Kravetz
  46 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-18  0:28 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel,
	James Houghton

Test that high-granularity CONTINUEs at all sizes work (exercising
contiguous PTE sizes for arm64, when support is added). Also test that
collapse works and hwpoison works correctly (although we aren't yet
testing high-granularity poison).

This test uses UFFD_FEATURE_EVENT_FORK + UFFD_REGISTER_MODE_WP to force
the kernel to copy page tables on fork(), exercising the changes to
copy_hugetlb_page_range().

Also test that UFFDIO_WRITEPROTECT doesn't prevent UFFDIO_CONTINUE
from behaving properly (in other words, that HGM walks treat UFFD-WP
markers like blank PTEs in the appropriate cases). We also test that
the uffd-wp PTE markers are preserved properly.
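
For illustration, the registration that forces the page-table copy looks
roughly like the sketch below (error handling omitted; addr, len, and
uffd are placeholders, not identifiers from the test):

	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_EVENT_FORK,
	};
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		/* MINOR for UFFDIO_CONTINUE, WP so fork() copies page tables. */
		.mode = UFFDIO_REGISTER_MODE_MINOR | UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);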

Signed-off-by: James Houghton <jthoughton@google.com>

 create mode 100644 tools/testing/selftests/mm/hugetlb-hgm.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index d90cdc06aa59..920baccccb9e 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -36,6 +36,7 @@ TEST_GEN_FILES += compaction_test
 TEST_GEN_FILES += gup_test
 TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugetlb-madvise
+TEST_GEN_FILES += hugetlb-hgm
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/mm/hugetlb-hgm.c b/tools/testing/selftests/mm/hugetlb-hgm.c
new file mode 100644
index 000000000000..4c27a6a11818
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb-hgm.c
@@ -0,0 +1,608 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test uncommon cases in HugeTLB high-granularity mapping:
+ *  1. Test all supported high-granularity page sizes (with MADV_COLLAPSE).
+ *  2. Test MADV_HWPOISON behavior.
+ *  3. Test interaction with UFFDIO_WRITEPROTECT.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/poll.h>
+#include <stdint.h>
+#include <string.h>
+
+#include <linux/userfaultfd.h>
+#include <linux/magic.h>
+#include <sys/mman.h>
+#include <sys/statfs.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#define PAGE_SIZE 4096
+#define PAGE_MASK ~(PAGE_SIZE - 1)
+
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
+
+#ifndef MADV_SPLIT
+#define MADV_SPLIT 26
+#endif
+
+#define PREFIX " ... "
+#define ERROR_PREFIX " !!! "
+
+static void *sigbus_addr;
+bool was_mceerr;
+bool got_sigbus;
+bool expecting_sigbus;
+
+enum test_status {
+	TEST_PASSED = 0,
+	TEST_FAILED = 1,
+	TEST_SKIPPED = 2,
+};
+
+static char *status_to_str(enum test_status status)
+{
+	switch (status) {
+	case TEST_PASSED:
+		return "TEST_PASSED";
+	case TEST_FAILED:
+		return "TEST_FAILED";
+	case TEST_SKIPPED:
+		return "TEST_SKIPPED";
+	default:
+		return "TEST_???";
+	}
+}
+
+static int userfaultfd(int flags)
+{
+	return syscall(__NR_userfaultfd, flags);
+}
+
+static int map_range(int uffd, char *addr, uint64_t length)
+{
+	struct uffdio_continue cont = {
+		.range = (struct uffdio_range) {
+			.start = (uint64_t)addr,
+			.len = length,
+		},
+		.mode = 0,
+		.mapped = 0,
+	};
+
+	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0) {
+		perror(ERROR_PREFIX "UFFDIO_CONTINUE failed");
+		return -1;
+	}
+	return 0;
+}
+
+static int userfaultfd_writeprotect(int uffd, char *addr, uint64_t length,
+				    bool protect)
+{
+	struct uffdio_writeprotect wp = {
+		.range = (struct uffdio_range) {
+			.start = (uint64_t)addr,
+			.len = length,
+		},
+		.mode = UFFDIO_WRITEPROTECT_MODE_DONTWAKE,
+	};
+
+	if (protect)
+		wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
+
+	printf(PREFIX "UFFDIO_WRITEPROTECT: %p -> %p (%sprotected)\n", addr,
+			addr + length, protect ? "" : "un");
+
+	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0) {
+		perror(ERROR_PREFIX "UFFDIO_WRITEPROTECT failed");
+		return -1;
+	}
+	return 0;
+}
+
+static int check_equal(char *mapping, size_t length, char value)
+{
+	size_t i;
+
+	for (i = 0; i < length; ++i)
+		if (mapping[i] != value) {
+			printf(ERROR_PREFIX "mismatch at %p (%d != %d)\n",
+					&mapping[i], mapping[i], value);
+			return -1;
+		}
+
+	return 0;
+}
+
+static int test_continues(int uffd, char *primary_map, char *secondary_map,
+			  size_t len, bool verify)
+{
+	size_t offset = 0;
+	unsigned char iter = 0;
+	unsigned long pagesize = getpagesize();
+	uint64_t size;
+
+	for (size = len/2; size >= pagesize;
+			offset += size, size /= 2) {
+		iter++;
+		memset(secondary_map + offset, iter, size);
+		printf(PREFIX "UFFDIO_CONTINUE: %p -> %p = %d%s\n",
+				primary_map + offset,
+				primary_map + offset + size,
+				iter,
+				verify ? " (and verify)" : "");
+		if (map_range(uffd, primary_map + offset, size))
+			return -1;
+		if (verify && check_equal(primary_map + offset, size, iter))
+			return -1;
+	}
+	return 0;
+}
+
+static int verify_contents(char *map, size_t len, bool last_page_zero)
+{
+	size_t offset = 0;
+	int i = 0;
+	uint64_t size;
+
+	for (size = len/2; size > PAGE_SIZE; offset += size, size /= 2)
+		if (check_equal(map + offset, size, ++i))
+			return -1;
+
+	if (last_page_zero)
+		if (check_equal(map + len - PAGE_SIZE, PAGE_SIZE, 0))
+			return -1;
+
+	return 0;
+}
+
+static int test_collapse(char *primary_map, size_t len, bool verify)
+{
+	int ret = 0;
+
+	printf(PREFIX "collapsing %p -> %p\n", primary_map, primary_map + len);
+	if (madvise(primary_map, len, MADV_COLLAPSE) < 0) {
+		perror(ERROR_PREFIX "collapse failed");
+		return -1;
+	}
+
+	if (verify) {
+		printf(PREFIX "verifying %p -> %p\n", primary_map,
+				primary_map + len);
+		ret = verify_contents(primary_map, len, true);
+	}
+	return ret;
+}
+
+static void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+	if (!expecting_sigbus)
+		printf(ERROR_PREFIX "unexpected sigbus: %p\n", info->si_addr);
+
+	got_sigbus = true;
+	was_mceerr = info->si_code == BUS_MCEERR_AR;
+	sigbus_addr = info->si_addr;
+
+	pthread_exit(NULL);
+}
+
+static void *access_mem(void *addr)
+{
+	volatile char *ptr = addr;
+
+	/*
+	 * Do a write without changing memory contents, as other routines will
+	 * need to verify that mapping contents haven't changed.
+	 *
+	 * We do a write so that we trigger uffd-wp SIGBUSes. To test that we
+	 * get HWPOISON SIGBUSes, we would only need to read.
+	 */
+	*ptr = *ptr;
+	return NULL;
+}
+
+static int test_sigbus(char *addr, bool poison)
+{
+	int ret;
+	pthread_t pthread;
+
+	sigbus_addr = (void *)0xBADBADBAD;
+	was_mceerr = false;
+	got_sigbus = false;
+	expecting_sigbus = true;
+	ret = pthread_create(&pthread, NULL, &access_mem, addr);
+	if (ret) {
+		printf(ERROR_PREFIX "failed to create thread: %s\n",
+				strerror(ret));
+		goto out;
+	}
+
+	pthread_join(pthread, NULL);
+
+	ret = -1;
+	if (!got_sigbus)
+		printf(ERROR_PREFIX "didn't get a SIGBUS: %p\n", addr);
+	else if (sigbus_addr != addr)
+		printf(ERROR_PREFIX "got incorrect sigbus address: %p vs %p\n",
+				sigbus_addr, addr);
+	else if (poison && !was_mceerr)
+		printf(ERROR_PREFIX "didn't get an MCEERR?\n");
+	else
+		ret = 0;
+out:
+	expecting_sigbus = false;
+	return ret;
+}
+
+static void *read_from_uffd_thd(void *arg)
+{
+	int uffd = *(int *)arg;
+	struct uffd_msg msg;
+	/* opened without O_NONBLOCK */
+	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
+		printf(ERROR_PREFIX "reading uffd failed\n");
+
+	return NULL;
+}
+
+static int read_event_from_uffd(int *uffd, pthread_t *pthread)
+{
+	int ret = 0;
+
+	ret = pthread_create(pthread, NULL, &read_from_uffd_thd, (void *)uffd);
+	if (ret) {
+		printf(ERROR_PREFIX "failed to create thread: %s\n",
+				strerror(ret));
+		return ret;
+	}
+	return 0;
+}
+
+static int test_sigbus_range(char *primary_map, size_t len, bool hwpoison)
+{
+	const unsigned long pagesize = getpagesize();
+	const int num_checks = 512;
+	unsigned long bytes_per_check = len/num_checks;
+	int i;
+
+	printf(PREFIX "checking that we can't access "
+	       "(%d addresses within %p -> %p)\n",
+	       num_checks, primary_map, primary_map + len);
+
+	if (pagesize > bytes_per_check)
+		bytes_per_check = pagesize;
+
+	for (i = 0; i < len; i += bytes_per_check)
+		if (test_sigbus(primary_map + i, hwpoison) < 0)
+			return 1;
+	/* check very last byte, because we left it unmapped */
+	if (test_sigbus(primary_map + len - 1, hwpoison))
+		return 1;
+
+	return 0;
+}
+
+static enum test_status test_hwpoison(char *primary_map, size_t len)
+{
+	printf(PREFIX "poisoning %p -> %p\n", primary_map, primary_map + len);
+	if (madvise(primary_map, len, MADV_HWPOISON) < 0) {
+		perror(ERROR_PREFIX "MADV_HWPOISON failed");
+		return TEST_SKIPPED;
+	}
+
+	return test_sigbus_range(primary_map, len, true)
+		? TEST_FAILED : TEST_PASSED;
+}
+
+static int test_fork(int uffd, char *primary_map, size_t len)
+{
+	int status;
+	int ret = 0;
+	pid_t pid;
+	pthread_t uffd_thd;
+
+	/*
+	 * UFFD_FEATURE_EVENT_FORK will put a fork event on the userfaultfd,
+	 * which we must read, otherwise we block fork(). Set up a thread to
+	 * read that event now.
+	 *
+	 * Page fault events should result in a SIGBUS, so we expect only a
+	 * single event from the uffd (the fork event).
+	 */
+	if (read_event_from_uffd(&uffd, &uffd_thd))
+		return -1;
+
+	pid = fork();
+
+	if (!pid) {
+		/*
+		 * Because we have UFFDIO_REGISTER_MODE_WP and
+		 * UFFD_FEATURE_EVENT_FORK, the page tables should be copied
+		 * exactly.
+		 *
+		 * Check that everything except that last 4K has correct
+		 * contents, and then check that the last 4K gets a SIGBUS.
+		 */
+		printf(PREFIX "child validating...\n");
+		ret = verify_contents(primary_map, len, false) ||
+			test_sigbus(primary_map + len - 1, false);
+		exit(ret ? 1 : 0);
+	} else {
+		/* wait for the child to finish. */
+		waitpid(pid, &status, 0);
+		ret = WEXITSTATUS(status);
+		if (!ret) {
+			printf(PREFIX "parent validating...\n");
+			/* Same check as the child. */
+			ret = verify_contents(primary_map, len, false) ||
+				test_sigbus(primary_map + len - 1, false);
+		}
+	}
+
+	pthread_join(uffd_thd, NULL);
+	return ret;
+}
+
+static int uffd_register(int uffd, char *primary_map, unsigned long len,
+			 int mode)
+{
+	struct uffdio_register reg;
+
+	reg.range.start = (unsigned long)primary_map;
+	reg.range.len = len;
+	reg.mode = mode;
+
+	reg.ioctls = 0;
+	return ioctl(uffd, UFFDIO_REGISTER, &reg);
+}
+
+enum test_type {
+	TEST_DEFAULT,
+	TEST_UFFDWP,
+	TEST_HWPOISON
+};
+
+static enum test_status
+test_hgm(int fd, size_t hugepagesize, size_t len, enum test_type type)
+{
+	int uffd;
+	char *primary_map, *secondary_map;
+	struct uffdio_api api;
+	struct sigaction new, old;
+	enum test_status status = TEST_SKIPPED;
+	bool hwpoison = type == TEST_HWPOISON;
+	bool uffd_wp = type == TEST_UFFDWP;
+	bool verify = type == TEST_DEFAULT;
+	int register_args;
+
+	if (ftruncate(fd, len) < 0) {
+		perror(ERROR_PREFIX "ftruncate failed");
+		return status;
+	}
+
+	uffd = userfaultfd(O_CLOEXEC);
+	if (uffd < 0) {
+		perror(ERROR_PREFIX "uffd not created");
+		return status;
+	}
+
+	primary_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (primary_map == MAP_FAILED) {
+		perror(ERROR_PREFIX "mmap for primary mapping failed");
+		goto close_uffd;
+	}
+	secondary_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (secondary_map == MAP_FAILED) {
+		perror(ERROR_PREFIX "mmap for secondary mapping failed");
+		goto unmap_primary;
+	}
+
+	printf(PREFIX "primary mapping: %p\n", primary_map);
+	printf(PREFIX "secondary mapping: %p\n", secondary_map);
+
+	api.api = UFFD_API;
+	api.features = UFFD_FEATURE_SIGBUS | UFFD_FEATURE_EXACT_ADDRESS |
+		UFFD_FEATURE_EVENT_FORK;
+	if (ioctl(uffd, UFFDIO_API, &api) == -1) {
+		perror(ERROR_PREFIX "UFFDIO_API failed");
+		goto out;
+	}
+
+	if (madvise(primary_map, len, MADV_SPLIT)) {
+		perror(ERROR_PREFIX "MADV_SPLIT failed");
+		goto out;
+	}
+
+	/*
+	 * Register with UFFDIO_REGISTER_MODE_WP to force fork() to copy page
+	 * tables (also need UFFD_FEATURE_EVENT_FORK, which we have).
+	 */
+	register_args = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP;
+	if (!uffd_wp)
+		/*
+		 * If we're testing UFFDIO_WRITEPROTECT, then we don't want
+		 * minor faults. With minor faults enabled, we'll get SIGBUSes
+		 * for any minor fault, whereas without minor faults enabled,
+		 * writes will verify that uffd-wp PTE markers were installed
+		 * properly.
+		 */
+		register_args |= UFFDIO_REGISTER_MODE_MINOR;
+
+	if (uffd_register(uffd, primary_map, len, register_args)) {
+		perror(ERROR_PREFIX "UFFDIO_REGISTER failed");
+		goto out;
+	}
+
+	new.sa_sigaction = &sigbus_handler;
+	new.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGBUS, &new, &old) < 0) {
+		perror(ERROR_PREFIX "could not setup SIGBUS handler");
+		goto out;
+	}
+
+	status = TEST_FAILED;
+
+	if (uffd_wp) {
+		/*
+		 * Install uffd-wp PTE markers now. They should be preserved
+		 * as we split the mappings with UFFDIO_CONTINUE later.
+		 */
+		if (userfaultfd_writeprotect(uffd, primary_map, len, true))
+			goto done;
+		/* Verify that we really are write-protected. */
+		if (test_sigbus(primary_map, false))
+			goto done;
+	}
+
+	/*
+	 * Main piece of the test: map primary_map at all the possible
+	 * page sizes, starting at the hugepage size and going down to
+	 * PAGE_SIZE. This leaves the final PAGE_SIZE piece of the mapping
+	 * unmapped.
+	 */
+	if (test_continues(uffd, primary_map, secondary_map, len, verify))
+		goto done;
+
+	/*
+	 * Verify that MADV_HWPOISON is able to properly poison the entire
+	 * mapping.
+	 */
+	if (hwpoison) {
+		enum test_status new_status = test_hwpoison(primary_map, len);
+
+		if (new_status != TEST_PASSED) {
+			status = new_status;
+			goto done;
+		}
+	}
+
+	if (uffd_wp) {
+		/*
+		 * Check that the uffd-wp marker we installed initially still
+		 * exists in the unmapped 4K piece at the end of the mapping.
+		 *
+		 * test_sigbus() will do a write. When this happens:
+		 *  1. The page fault handler will find the uffd-wp marker and
+		 *     create a read-only PTE.
+		 *  2. The memory access is retried, and the page fault handler
+		 *     will find that a write was attempted in a UFFD_WP VMA
+		 *     where a RO mapping exists, so SIGBUS
+		 *     (we have UFFD_FEATURE_SIGBUS).
+		 *
+		 * We only check the final page because UFFDIO_CONTINUE will
+		 * have cleared the write-protection on all the other pieces
+		 * of the mapping.
+		 */
+		printf(PREFIX "verifying that we can't write to final page\n");
+		if (test_sigbus(primary_map + len - 1, false))
+			goto done;
+	}
+
+	if (!hwpoison)
+		/*
+		 * test_fork() will verify memory contents. We can't do
+		 * that if memory has been poisoned.
+		 */
+		if (test_fork(uffd, primary_map, len))
+			goto done;
+
+	/*
+	 * Check that MADV_COLLAPSE functions properly. That is:
+	 *  - the PAGE_SIZE hole we had is no longer unmapped.
+	 *  - poisoned regions are still poisoned.
+	 *
+	 *  Verify the data is correct if we haven't poisoned.
+	 */
+	if (test_collapse(primary_map, len, !hwpoison))
+		goto done;
+	/*
+	 * Verify that memory is still poisoned.
+	 */
+	if (hwpoison && test_sigbus_range(primary_map, len, true))
+		goto done;
+
+	status = TEST_PASSED;
+
+done:
+	if (ftruncate(fd, 0) < 0) {
+		perror(ERROR_PREFIX "ftruncate back to 0 failed");
+		status = TEST_FAILED;
+	}
+
+out:
+	munmap(secondary_map, len);
+unmap_primary:
+	munmap(primary_map, len);
+close_uffd:
+	close(uffd);
+	return status;
+}
+
+int main(void)
+{
+	int fd;
+	struct statfs file_stat;
+	size_t hugepagesize;
+	size_t len;
+	enum test_status status;
+	int ret = 0;
+
+	fd = memfd_create("hugetlb_tmp", MFD_HUGETLB);
+	if (fd < 0) {
+		perror(ERROR_PREFIX "could not open hugetlbfs file");
+		return -1;
+	}
+
+	memset(&file_stat, 0, sizeof(file_stat));
+	if (fstatfs(fd, &file_stat)) {
+		perror(ERROR_PREFIX "fstatfs failed");
+		goto close;
+	}
+	if (file_stat.f_type != HUGETLBFS_MAGIC) {
+		printf(ERROR_PREFIX "not hugetlbfs file\n");
+		goto close;
+	}
+
+	hugepagesize = file_stat.f_bsize;
+	len = 2 * hugepagesize;
+
+	printf("HGM regular test...\n");
+	status = test_hgm(fd, hugepagesize, len, TEST_DEFAULT);
+	printf("HGM regular test:  %s\n", status_to_str(status));
+	if (status == TEST_FAILED)
+		ret = -1;
+
+	printf("HGM uffd-wp test...\n");
+	status = test_hgm(fd, hugepagesize, len, TEST_UFFDWP);
+	printf("HGM uffd-wp test:  %s\n", status_to_str(status));
+	if (status == TEST_FAILED)
+		ret = -1;
+
+	printf("HGM hwpoison test...\n");
+	status = test_hgm(fd, hugepagesize, len, TEST_HWPOISON);
+	printf("HGM hwpoison test: %s\n", status_to_str(status));
+	if (status == TEST_FAILED)
+		ret = -1;
+close:
+	close(fd);
+
+	return ret;
+}
-- 
2.39.2.637.g21b0678d19-goog


^ permalink raw reply related	[flat|nested] 96+ messages in thread
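
A note on the geometry used by test_continues() and verify_contents() above:
the walk halves the chunk size each step, so for the power-of-two len used by
the test the UFFDIO_CONTINUEd chunks sum to len minus one base page, which is
why the final PAGE_SIZE piece stays unmapped until MADV_COLLAPSE. A small
standalone sketch (not part of the selftest; it assumes a 2 MiB huge page and
a 4 KiB base page) that prints the chunks:

#include <stdio.h>

/*
 * Standalone illustration only (not part of the selftest): print the
 * chunks a halving walk like test_continues() would UFFDIO_CONTINUE,
 * assuming a 2 MiB huge page (len = 4 MiB) and a 4 KiB base page.
 */
int main(void)
{
	const unsigned long page = 4096UL;
	const unsigned long len = 2 * (2UL << 20);	/* two 2 MiB huge pages */
	unsigned long offset = 0, size;

	for (size = len / 2; size >= page; offset += size, size /= 2)
		printf("UFFDIO_CONTINUE [%#lx, %#lx) (%lu KiB)\n",
		       offset, offset + size, size / 1024);

	/* Everything except the last base page is now mapped. */
	printf("left unmapped:  [%#lx, %#lx)\n", offset, len);
	return 0;
}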

* Re: [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2023-02-18  0:27 ` [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
@ 2023-02-18  0:41   ` Mina Almasry
  2023-02-21 15:59     ` James Houghton
  0 siblings, 1 reply; 96+ messages in thread
From: Mina Almasry @ 2023-02-18  0:41 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> It would be bad if we actually set PageUptodate with UFFDIO_CONTINUE;
> PageUptodate indicates that the page has been zeroed, and we don't want
> to give a non-zeroed page to the user.
>
> The reason this change is being made now is because UFFDIO_CONTINUEs on
> subpages definitely shouldn't set this page flag on the head page.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 07abcb6eb203..792cb2e67ce5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6256,7 +6256,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>          * preceding stores to the page contents become visible before
>          * the set_pte_at() write.
>          */
> -       __folio_mark_uptodate(folio);
> +       if (!is_continue)
> +               __folio_mark_uptodate(folio);
> +       else if (!folio_test_uptodate(folio)) {
> +               /*
> +                * This should never happen; HugeTLB pages are always Uptodate
> +                * as soon as they are allocated.
> +                */

if (is_continue), then we grab a page from the page cache, no? Are
pages in the page cache always uptodate? Why? I guess that means
they're already mapped and hence uptodate?

Also this comment should explain why pages in the page cache are
always uptodate, no? Because this error branch is hit if (is_continue
&& !folio_test_uptodate()), not when pages are freshly allocated.

> +               ret = -EFAULT;
> +               goto out_release_nounlock;
> +       }
>
>         /* Add shared, newly allocated pages to the page cache. */
>         if (vm_shared && !is_continue) {
> --
> 2.39.2.637.g21b0678d19-goog
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing
  2023-02-18  0:27 ` [PATCH v2 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
@ 2023-02-18  1:10   ` Mina Almasry
  0 siblings, 0 replies; 96+ messages in thread
From: Mina Almasry @ 2023-02-18  1:10 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> Currently this check is overly aggressive. For some userfaultfd VMAs,
> VMA sharing is disabled, yet we still widen the address range, which is
> used for flushing TLBs and sending MMU notifiers.
>
> This is done now, as HGM VMAs also have sharing disabled, yet would
> still have flush ranges adjusted. Overaggressively flushing TLBs and
> triggering MMU notifiers is particularly harmful with lots of
> high-granularity operations.
>
> Acked-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: James Houghton <jthoughton@google.com>

Acked-by: Mina Almasry <almasrymina@google.com>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 540cdf9570d3..08004371cfed 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6999,22 +6999,31 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
>         return saddr;
>  }
>
> -bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
> +static bool pmd_sharing_possible(struct vm_area_struct *vma)
>  {
> -       unsigned long start = addr & PUD_MASK;
> -       unsigned long end = start + PUD_SIZE;
> -
>  #ifdef CONFIG_USERFAULTFD
>         if (uffd_disable_huge_pmd_share(vma))
>                 return false;
>  #endif
>         /*
> -        * check on proper vm_flags and page table alignment
> +        * Only shared VMAs can share PMDs.
>          */
>         if (!(vma->vm_flags & VM_MAYSHARE))
>                 return false;
>         if (!vma->vm_private_data)      /* vma lock required for sharing */
>                 return false;
> +       return true;
> +}
> +
> +bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
> +{
> +       unsigned long start = addr & PUD_MASK;
> +       unsigned long end = start + PUD_SIZE;
> +       /*
> +        * check on proper vm_flags and page table alignment
> +        */
> +       if (!pmd_sharing_possible(vma))
> +               return false;
>         if (!range_in_vma(vma, start, end))
>                 return false;
>         return true;
> @@ -7035,7 +7044,7 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>          * vma needs to span at least one aligned PUD size, and the range
>          * must be at least partially within in.
>          */
> -       if (!(vma->vm_flags & VM_MAYSHARE) || !(v_end > v_start) ||
> +       if (!pmd_sharing_possible(vma) || !(v_end > v_start) ||
>                 (*end <= v_start) || (*start >= v_end))
>                 return;
>
> --
> 2.39.2.637.g21b0678d19-goog
>

^ permalink raw reply	[flat|nested] 96+ messages in thread
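
To put a number on the over-widening this patch avoids: want_pmd_share() above
rounds to PUD boundaries (start = addr & PUD_MASK, end = start + PUD_SIZE), so
when that adjustment is applied to a VMA that cannot actually share PMDs, even
a single-page operation can be widened to a full PUD worth of TLB flushing and
MMU notification. A small arithmetic sketch, assuming x86_64's 1 GiB PUD_SIZE
(an assumption of this example, not something stated in the patch):

#include <stdio.h>

/*
 * Illustrative arithmetic only, not kernel code: mimic the PUD rounding
 * shown in want_pmd_share() above, assuming PUD_SIZE = 1 GiB (x86_64).
 */
#define EX_PUD_SHIFT	30
#define EX_PUD_SIZE	(1UL << EX_PUD_SHIFT)
#define EX_PUD_MASK	(~(EX_PUD_SIZE - 1))

int main(void)
{
	unsigned long addr = 0x7f1234561000UL;	/* arbitrary user address */
	unsigned long start = addr & EX_PUD_MASK;
	unsigned long end = start + EX_PUD_SIZE;

	printf("requested: one 4 KiB page at %#lx\n", addr);
	printf("widened:   [%#lx, %#lx) = %lu MiB\n",
	       start, end, (end - start) >> 20);
	return 0;
}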

* Re: [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers
  2023-02-18  0:27 ` [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers James Houghton
@ 2023-02-18  1:40   ` Mina Almasry
  2023-02-21 16:16     ` James Houghton
  2023-02-24 23:08   ` Mike Kravetz
  1 sibling, 1 reply; 96+ messages in thread
From: Mina Almasry @ 2023-02-18  1:40 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> hugetlb_hgm_eligible indicates that a VMA is eligible to have HGM
> explicitly enabled via MADV_SPLIT, and hugetlb_hgm_enabled indicates
> that HGM has been enabled.
>
> Signed-off-by: James Houghton <jthoughton@google.com>

Only nits:
Reviewed-by: Mina Almasry <almasrymina@google.com>

>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 7c977d234aba..efd2635a87f5 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1211,6 +1211,20 @@ static inline void hugetlb_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_HUGETLB_PAGE */
>
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> +#else
> +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +       return false;
> +}
> +static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +       return false;
> +}
> +#endif
> +
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
>                                         struct mm_struct *mm, pte_t *pte)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6c008c9de80e..0576dcc98044 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7004,6 +7004,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
>  #ifdef CONFIG_USERFAULTFD
>         if (uffd_disable_huge_pmd_share(vma))
>                 return false;
> +#endif
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +       if (hugetlb_hgm_enabled(vma))
> +               return false;
>  #endif
>         /*
>          * Only shared VMAs can share PMDs.
> @@ -7267,6 +7271,18 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
>
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma)

I think you named the other function pmd_sharing_possible(), so I
suggest hugetlb_hgm_possible() here for some consistency.

> +{
> +       /* All shared VMAs may have HGM. */

I think this is a redundant comment.

> +       return vma && (vma->vm_flags & VM_MAYSHARE);
> +}
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +       return vma && (vma->vm_flags & VM_HUGETLB_HGM);
> +}
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> +
>  /*
>   * These functions are overwritable if your architecture needs its own
>   * behavior.
> --
> 2.39.2.637.g21b0678d19-goog
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM
  2023-02-18  0:27 ` [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM James Houghton
@ 2023-02-18  1:58   ` Mina Almasry
  2023-02-21 16:33     ` James Houghton
  2023-02-24 23:25   ` Mike Kravetz
  1 sibling, 1 reply; 96+ messages in thread
From: Mina Almasry @ 2023-02-18  1:58 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable
> HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> applied to non-HugeTLB memory in the future, if such an application is
> to arise.
>
> MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> address ranges:
> 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
>    alignment.
> 2. read()ing a page fault event from a userfaultfd will yield a
>    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
>    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
>
> There is no way to disable the API changes that come with issuing
> MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> table mappings that result from the extended functionality that
> MADV_SPLIT enables.
>

So is a hugetlb page or VMA that has been MADV_SPLIT + MADV_COLLAPSE
distinct from a hugetlb page or VMA that has not been? I thought
COLLAPSE would reverse the effects of SPLIT completely.

> For post-copy live migration, the expected use-case is:
> 1. mmap(MAP_SHARED, some_fd) primary mapping
> 2. mmap(MAP_SHARED, some_fd) alias mapping
> 3. MADV_SPLIT the primary mapping
> 4. UFFDIO_REGISTER/etc. the primary mapping
> 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
>    corresponding PAGE_SIZE sections in the primary mapping.
>

Huh, so MADV_SPLIT doesn't actually split an existing PMD mapping into
high-granularity mappings. Instead it says that future mappings may be
high-granularity? I assume they may not even be high-granularity; for
example, if the alias mapping faulted in a full hugetlb page (without
UFFDIO_CONTINUE), that page would be mapped normally, not at high
granularity.

This may be bikeshedding but I do think a clearer name is warranted.
Maybe MADV_MAY_SPLIT or something.

> More API changes may be added in the future.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..7a26f3648b90 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -78,6 +78,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     26              /* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..f8a74a3a0928 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -105,6 +105,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     26              /* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..a6dc6a56c941 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -72,6 +72,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     74              /* Enable hugepage high-granularity APIs */
> +
>  #define MADV_HWPOISON     100          /* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
>
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..f98a77c430a9 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -113,6 +113,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     26              /* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..996e8ded092f 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -79,6 +79,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     26              /* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c2202f51e9dd..8c004c678262 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
>         return error;
>  }
>
> +static int madvise_split(struct vm_area_struct *vma,
> +                        unsigned long *new_flags)
> +{
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +       if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
> +               return -EINVAL;
> +
> +       /*
> +        * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
> +        * of a VMA, then we will split the VMA. Here, we're unsharing before
> +        * splitting because it's simpler, although we may be unsharing more
> +        * than we need.
> +        */
> +       hugetlb_unshare_all_pmds(vma);
> +
> +       *new_flags |= VM_HUGETLB_HGM;
> +       return 0;
> +#else
> +       return -EINVAL;
> +#endif
> +}
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>                 break;
>         case MADV_COLLAPSE:
>                 return madvise_collapse(vma, prev, start, end);
> +       case MADV_SPLIT:
> +               error = madvise_split(vma, &new_flags);
> +               if (error)
> +                       goto out;
> +               break;
>         }
>
>         anon_name = anon_vma_name(vma);
> @@ -1178,6 +1205,9 @@ madvise_behavior_valid(int behavior)
>         case MADV_HUGEPAGE:
>         case MADV_NOHUGEPAGE:
>         case MADV_COLLAPSE:
> +#endif
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +       case MADV_SPLIT:
>  #endif
>         case MADV_DONTDUMP:
>         case MADV_DODUMP:
> @@ -1368,6 +1398,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *             transparent huge pages so the existing pages will not be
>   *             coalesced into THP and new pages will not be allocated as THP.
>   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> + *  MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows
> + *             UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *             from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> --
> 2.39.2.637.g21b0678d19-goog
>

^ permalink raw reply	[flat|nested] 96+ messages in thread
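
For reference alongside the discussion above, the five-step post-copy flow
from the quoted commit message might look roughly like the sketch below in
userspace. This is a minimal, untested outline based only on those steps and
on the selftest earlier in the thread: error handling, the userfaultfd
fault-handling loop, and the receive path are omitted, and the sizes and
register mode chosen here are assumptions, not requirements of the series.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/memfd.h>
#include <linux/userfaultfd.h>

/*
 * Rough sketch of the commit message's post-copy steps 1-5. Error
 * handling and the fault-handling thread are omitted; PAGE_SIZE-aligned
 * UFFDIO_CONTINUE on hugetlb only works with MADV_SPLIT from this series.
 */
int main(void)
{
	size_t page = 4096, len = 2UL << 20;	/* assume one 2 MiB huge page */
	int fd = memfd_create("postcopy", MFD_HUGETLB);

	/* 1-2: primary and alias mappings of the same hugetlb file. */
	ftruncate(fd, len);
	char *primary = mmap(NULL, len, PROT_READ | PROT_WRITE,
			     MAP_SHARED, fd, 0);
	char *alias = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);

	/* 3: opt the primary mapping in to high-granularity UFFDIO_CONTINUE. */
	madvise(primary, len, MADV_SPLIT);

	/* 4: register the primary mapping with userfaultfd (minor faults). */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)primary, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* 5: fill one base page via the alias, then CONTINUE it in primary. */
	memset(alias, 0xab, page);	/* stands in for received guest data */
	struct uffdio_continue cont = {
		.range = { .start = (unsigned long)primary, .len = page },
	};
	ioctl(uffd, UFFDIO_CONTINUE, &cont);
	return 0;
}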

* Re: [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2023-02-18  0:27 ` [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
@ 2023-02-18  5:24   ` Mina Almasry
  2023-02-21 16:36     ` James Houghton
  2023-02-25  0:09   ` Mike Kravetz
  1 sibling, 1 reply; 96+ messages in thread
From: Mina Almasry @ 2023-02-18  5:24 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.
>
> Without this, we'd have to pass around the "size" of the PTE everywhere.
> We effectively did this before; it could be fetched from the hstate,
> which we pass around pretty much everywhere.
>
> hugetlb_pte_present_leaf is included here as a helper function that will
> be used frequently later on.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>

Only nits.

Reviewed-by: Mina Almasry <almasrymina@google.com>

> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a1ceb9417f01..eeacadf3272b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -26,6 +26,25 @@ typedef struct { unsigned long pd; } hugepd_t;
>  #define __hugepd(x) ((hugepd_t) { (x) })
>  #endif
>
> +enum hugetlb_level {
> +       HUGETLB_LEVEL_PTE = 1,
> +       /*
> +        * We always include PMD, PUD, and P4D in this enum definition so that,
> +        * when logged as an integer, we can easily tell which level it is.
> +        */
> +       HUGETLB_LEVEL_PMD,
> +       HUGETLB_LEVEL_PUD,
> +       HUGETLB_LEVEL_P4D,
> +       HUGETLB_LEVEL_PGD,
> +};
> +
> +struct hugetlb_pte {
> +       pte_t *ptep;
> +       unsigned int shift;
> +       enum hugetlb_level level;
> +       spinlock_t *ptl;
> +};
> +
>  #ifdef CONFIG_HUGETLB_PAGE
>
>  #include <linux/mempolicy.h>
> @@ -39,6 +58,20 @@ typedef struct { unsigned long pd; } hugepd_t;
>   */
>  #define __NR_USED_SUBPAGE 3
>
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +       return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +       return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> +
>  struct hugepage_subpool {
>         spinlock_t lock;
>         long count;
> @@ -1234,6 +1267,45 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
>         return ptl;
>  }
>
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
> +{
> +       return hpte->ptl;
> +}

I find this helper unnecessary. I would remove it.

> +
> +static inline
> +spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
> +{
> +       spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
> +
> +       spin_lock(ptl);

Here 'spin_lock(hpte->ptl)' would be more immediately understandable
IMO, for example.

> +       return ptl;
> +}
> +
> +static inline
> +void __hugetlb_pte_init(struct hugetlb_pte *hpte, pte_t *ptep,
> +                       unsigned int shift, enum hugetlb_level level,
> +                       spinlock_t *ptl)
> +{
> +       /*
> +        * If 'shift' indicates that this PTE is contiguous, then @ptep must
> +        * be the first pte of the contiguous bunch.
> +        */

I would move the comment above the function as a pseudo doc-comment. It
seems to instruct the user of the function on how to use it.

> +       hpte->ptl = ptl;
> +       hpte->ptep = ptep;
> +       hpte->shift = shift;
> +       hpte->level = level;
> +}
> +
> +static inline
> +void hugetlb_pte_init(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +                     pte_t *ptep, unsigned int shift,
> +                     enum hugetlb_level level)
> +{
> +       __hugetlb_pte_init(hpte, ptep, shift, level,
> +                          huge_pte_lockptr(shift, mm, ptep));
> +}
> +
>  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
>  extern void __init hugetlb_cma_reserve(int order);
>  #else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5ca9eae0ac42..6c74adff43b6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1269,6 +1269,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>         return false;
>  }
>
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
> +{
> +       pgd_t pgd;
> +       p4d_t p4d;
> +       pud_t pud;
> +       pmd_t pmd;
> +
> +       switch (hpte->level) {
> +       case HUGETLB_LEVEL_PGD:
> +               pgd = __pgd(pte_val(pte));
> +               return pgd_present(pgd) && pgd_leaf(pgd);
> +       case HUGETLB_LEVEL_P4D:
> +               p4d = __p4d(pte_val(pte));
> +               return p4d_present(p4d) && p4d_leaf(p4d);
> +       case HUGETLB_LEVEL_PUD:
> +               pud = __pud(pte_val(pte));
> +               return pud_present(pud) && pud_leaf(pud);
> +       case HUGETLB_LEVEL_PMD:
> +               pmd = __pmd(pte_val(pte));
> +               return pmd_present(pmd) && pmd_leaf(pmd);
> +       case HUGETLB_LEVEL_PTE:
> +               return pte_present(pte);
> +       default:
> +               WARN_ON_ONCE(1);
> +               return false;
> +       }
> +}
> +
> +
>  static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
>  {
>         int nid = folio_nid(folio);
> --
> 2.39.2.637.g21b0678d19-goog
>

^ permalink raw reply	[flat|nested] 96+ messages in thread
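
As a quick numeric illustration of the hugetlb_pte_size()/hugetlb_pte_mask()
helpers discussed above: for a given shift they give the bytes covered by one
entry and the mask that rounds an address down to that entry's boundary. The
shift values in the sketch below (12, 21, 30 for PTE, PMD and PUD) assume
x86_64 page-table geometry; the sketch itself is userspace-only arithmetic,
not kernel code.

#include <stdio.h>

/*
 * Userspace illustration of the arithmetic in hugetlb_pte_size() and
 * hugetlb_pte_mask(): size = 1UL << shift, mask = ~(size - 1).
 * The shifts assume x86_64 (4 KiB PTE, 2 MiB PMD, 1 GiB PUD).
 */
int main(void)
{
	const unsigned int shifts[] = { 12, 21, 30 };
	const char *levels[] = { "PTE", "PMD", "PUD" };
	unsigned long addr = 0x7f000abcd123UL;	/* arbitrary address */

	for (int i = 0; i < 3; i++) {
		unsigned long size = 1UL << shifts[i];
		unsigned long mask = ~(size - 1);

		printf("%s: shift %2u, size %10lu, %#lx & mask = %#lx\n",
		       levels[i], shifts[i], size, addr, addr & mask);
	}
	return 0;
}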

* Re: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2023-02-18  0:27 ` [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
@ 2023-02-18  7:43   ` kernel test robot
  2023-02-18 18:07   ` kernel test robot
  2023-02-28 22:14   ` Mike Kravetz
  2 siblings, 0 replies; 96+ messages in thread
From: kernel test robot @ 2023-02-18  7:43 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: oe-kbuild-all, Linux Memory Management List, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-kernel, James Houghton

Hi James,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20230217]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc8 v6.2-rc7 v6.2-rc6 v6.2-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
patch link:    https://lore.kernel.org/r/20230218002819.1486479-14-jthoughton%40google.com
patch subject: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20230218/202302181558.6o0zw4Cl-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/7e55fe945a1b5f042746277050390bdeba9e22d2
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
        git checkout 7e55fe945a1b5f042746277050390bdeba9e22d2
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302181558.6o0zw4Cl-lkp@intel.com/

All errors (new ones prefixed by >>):

   loongarch64-linux-ld: mm/hugetlb.o: in function `.L142':
>> hugetlb.c:(.text+0x9ec): undefined reference to `hugetlb_walk_step'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 96+ messages in thread
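
The undefined reference above is the usual symptom of a function that is
called from code built for every configuration but only compiled in when a
particular Kconfig option is enabled. The sketch below shows the generic
declare-or-stub pattern with made-up names; it is not the actual
hugetlb_walk_step() signature, which is not shown in this report, and not
necessarily how the series resolves the error.

/*
 * Hypothetical illustration of the declare-or-stub pattern; the names
 * here are invented and are not the real hugetlb_walk_step().
 */
struct ex_walker { int level; };

#ifdef CONFIG_EXAMPLE_FEATURE
/* Real definition lives in a file only built for this config. */
int ex_walk_step(struct ex_walker *w);
#else
/* Stub keeps callers linkable when the feature is compiled out. */
static inline int ex_walk_step(struct ex_walker *w)
{
	(void)w;
	return -1;
}
#endif

int main(void)
{
	struct ex_walker w = { .level = 0 };

	return ex_walk_step(&w) ? 0 : 1;
}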

* Re: [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte
  2023-02-18  0:27 ` [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte James Houghton
@ 2023-02-18 17:46   ` kernel test robot
  2023-02-27 19:16   ` Mike Kravetz
  1 sibling, 0 replies; 96+ messages in thread
From: kernel test robot @ 2023-02-18 17:46 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: llvm, oe-kbuild-all, Linux Memory Management List,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-kernel, James Houghton

Hi James,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20230217]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc8 v6.2-rc7 v6.2-rc6 v6.2-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
patch link:    https://lore.kernel.org/r/20230218002819.1486479-13-jthoughton%40google.com
patch subject: [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte
config: powerpc-randconfig-r001-20230217 (https://download.01.org/0day-ci/archive/20230219/202302190142.59JVVPVm-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project db89896bbbd2251fff457699635acbbedeead27f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install powerpc cross compiling tool for clang build
        # apt-get install binutils-powerpc-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/af805b45df0b60a0ef2231f41413b8265e1e8d93
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
        git checkout af805b45df0b60a0ef2231f41413b8265e1e8d93
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=powerpc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302190142.59JVVPVm-lkp@intel.com/

All errors (new ones prefixed by >>):

   __do_insb
   ^
   arch/powerpc/include/asm/io.h:577:56: note: expanded from macro '__do_insb'
   #define __do_insb(p, b, n)      readsb((PCI_IO_ADDR)_IO_BASE+(p), (b), (n))
                                          ~~~~~~~~~~~~~~~~~~~~~^
   In file included from mm/hugetlb.c:11:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:45:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(insw, (unsigned long p, void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:102:1: note: expanded from here
   __do_insw
   ^
   arch/powerpc/include/asm/io.h:578:56: note: expanded from macro '__do_insw'
   #define __do_insw(p, b, n)      readsw((PCI_IO_ADDR)_IO_BASE+(p), (b), (n))
                                          ~~~~~~~~~~~~~~~~~~~~~^
   In file included from mm/hugetlb.c:11:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:47:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(insl, (unsigned long p, void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:104:1: note: expanded from here
   __do_insl
   ^
   arch/powerpc/include/asm/io.h:579:56: note: expanded from macro '__do_insl'
   #define __do_insl(p, b, n)      readsl((PCI_IO_ADDR)_IO_BASE+(p), (b), (n))
                                          ~~~~~~~~~~~~~~~~~~~~~^
   In file included from mm/hugetlb.c:11:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:49:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(outsb, (unsigned long p, const void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:106:1: note: expanded from here
   __do_outsb
   ^
   arch/powerpc/include/asm/io.h:580:58: note: expanded from macro '__do_outsb'
   #define __do_outsb(p, b, n)     writesb((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
                                           ~~~~~~~~~~~~~~~~~~~~~^
   In file included from mm/hugetlb.c:11:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:51:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(outsw, (unsigned long p, const void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:108:1: note: expanded from here
   __do_outsw
   ^
   arch/powerpc/include/asm/io.h:581:58: note: expanded from macro '__do_outsw'
   #define __do_outsw(p, b, n)     writesw((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
                                           ~~~~~~~~~~~~~~~~~~~~~^
   In file included from mm/hugetlb.c:11:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:53:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(outsl, (unsigned long p, const void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:110:1: note: expanded from here
   __do_outsl
   ^
   arch/powerpc/include/asm/io.h:582:58: note: expanded from macro '__do_outsl'
   #define __do_outsl(p, b, n)     writesl((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
                                           ~~~~~~~~~~~~~~~~~~~~~^
>> mm/hugetlb.c:581:8: error: call to undeclared function '__pte_alloc_one'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
                 ^
   mm/hugetlb.c:581:8: note: did you mean 'pte_alloc_one'?
   arch/powerpc/include/asm/pgalloc.h:30:25: note: 'pte_alloc_one' declared here
   static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
                           ^
>> mm/hugetlb.c:581:28: error: use of undeclared identifier 'GFP_PGTABLE_USER'
           new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
                                     ^
>> mm/hugetlb.c:588:25: error: incompatible pointer types passing 'pgtable_t' (aka 'unsigned long *') to parameter of type 'struct page *' [-Werror,-Wincompatible-pointer-types]
                   pgtable_pte_page_dtor(new);
                                         ^~~
   include/linux/mm.h:2661:55: note: passing argument to parameter 'page' here
   static inline void pgtable_pte_page_dtor(struct page *page)
                                                         ^
   mm/hugetlb.c:589:3: error: incompatible pointer types passing 'pgtable_t' (aka 'unsigned long *') to parameter of type 'struct page *' [-Werror,-Wincompatible-pointer-types]
                   __free_page(new);
                   ^~~~~~~~~~~~~~~~
   include/linux/gfp.h:319:40: note: expanded from macro '__free_page'
   #define __free_page(page) __free_pages((page), 0)
                                          ^~~~~~
   include/linux/gfp.h:302:39: note: passing argument to parameter 'page' here
   extern void __free_pages(struct page *page, unsigned int order);
                                         ^
   6 warnings and 4 errors generated.


vim +/__pte_alloc_one +581 mm/hugetlb.c

   543	
   544	/*
   545	 * hugetlb_alloc_pte -- Allocate a PTE beneath a pmd_none PMD-level hpte.
   546	 *
   547	 * See the comment above hugetlb_alloc_pmd.
   548	 */
   549	pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
   550			unsigned long addr)
   551	{
   552		spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
   553		pgtable_t new;
   554		pmd_t *pmdp;
   555		pmd_t pmd;
   556	
   557		if (hpte->level != HUGETLB_LEVEL_PMD)
   558			return ERR_PTR(-EINVAL);
   559	
   560		pmdp = (pmd_t *)hpte->ptep;
   561	retry:
   562		pmd = READ_ONCE(*pmdp);
   563		if (likely(pmd_present(pmd)))
   564			return unlikely(pmd_leaf(pmd))
   565				? ERR_PTR(-EEXIST)
   566				: pte_offset_kernel(pmdp, addr);
   567		else if (!pmd_none(pmd))
   568			/*
   569			 * Not present and not none means that a swap entry lives here,
   570			 * and we can't get rid of it.
   571			 */
   572			return ERR_PTR(-EEXIST);
   573	
   574		/*
   575		 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
   576		 * in page tables being allocated in high memory, needing a kmap to
   577		 * access. Instead, we call __pte_alloc_one directly with
   578		 * GFP_PGTABLE_USER to prevent these PTEs being allocated in high
   579		 * memory.
   580		 */
 > 581		new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
   582		if (!new)
   583			return ERR_PTR(-ENOMEM);
   584	
   585		spin_lock(ptl);
   586		if (!pmd_same(pmd, *pmdp)) {
   587			spin_unlock(ptl);
 > 588			pgtable_pte_page_dtor(new);
   589			__free_page(new);
   590			goto retry;
   591		}
   592	
   593		mm_inc_nr_ptes(mm);
   594		smp_wmb(); /* See comment in pmd_install() */
   595		pmd_populate(mm, pmdp, new);
   596		spin_unlock(ptl);
   597		return pte_offset_kernel(pmdp, addr);
   598	}
   599	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2023-02-18  0:27 ` [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
  2023-02-18  7:43   ` kernel test robot
@ 2023-02-18 18:07   ` kernel test robot
  2023-02-21 17:09     ` James Houghton
  2023-02-28 22:14   ` Mike Kravetz
  2 siblings, 1 reply; 96+ messages in thread
From: kernel test robot @ 2023-02-18 18:07 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: llvm, oe-kbuild-all, Linux Memory Management List,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-kernel, James Houghton

Hi James,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20230217]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc8 v6.2-rc7 v6.2-rc6 v6.2-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
patch link:    https://lore.kernel.org/r/20230218002819.1486479-14-jthoughton%40google.com
patch subject: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
config: arm64-randconfig-r005-20230217 (https://download.01.org/0day-ci/archive/20230219/202302190101.aoXrbN26-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project db89896bbbd2251fff457699635acbbedeead27f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/7e55fe945a1b5f042746277050390bdeba9e22d2
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
        git checkout 7e55fe945a1b5f042746277050390bdeba9e22d2
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302190101.aoXrbN26-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: hugetlb_walk_step
   >>> referenced by hugetlb.c
   >>>               mm/hugetlb.o:(__hugetlb_hgm_walk) in archive vmlinux.a

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks
  2023-02-18  0:27 ` [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks James Houghton
@ 2023-02-18 19:49   ` kernel test robot
  2023-02-28 22:48   ` Mike Kravetz
  1 sibling, 0 replies; 96+ messages in thread
From: kernel test robot @ 2023-02-18 19:49 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: llvm, oe-kbuild-all, Linux Memory Management List,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-kernel, James Houghton

Hi James,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20230217]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc8 v6.2-rc7 v6.2-rc6 v6.2-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
patch link:    https://lore.kernel.org/r/20230218002819.1486479-15-jthoughton%40google.com
patch subject: [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks
config: powerpc-randconfig-r001-20230217 (https://download.01.org/0day-ci/archive/20230219/202302190304.YdPwtMZS-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project db89896bbbd2251fff457699635acbbedeead27f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install powerpc cross compiling tool for clang build
        # apt-get install binutils-powerpc-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/55c33d65b06ad109b87a418540fe98f7365185d4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
        git checkout 55c33d65b06ad109b87a418540fe98f7365185d4
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=powerpc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302190304.YdPwtMZS-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:47:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(insl, (unsigned long p, void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:104:1: note: expanded from here
   __do_insl
   ^
   arch/powerpc/include/asm/io.h:579:56: note: expanded from macro '__do_insl'
   #define __do_insl(p, b, n)      readsl((PCI_IO_ADDR)_IO_BASE+(p), (b), (n))
                                          ~~~~~~~~~~~~~~~~~~~~~^
   In file included from mm/hugetlb.c:11:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:49:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(outsb, (unsigned long p, const void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:106:1: note: expanded from here
   __do_outsb
   ^
   arch/powerpc/include/asm/io.h:580:58: note: expanded from macro '__do_outsb'
   #define __do_outsb(p, b, n)     writesb((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
                                           ~~~~~~~~~~~~~~~~~~~~~^
   In file included from mm/hugetlb.c:11:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:51:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(outsw, (unsigned long p, const void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:108:1: note: expanded from here
   __do_outsw
   ^
   arch/powerpc/include/asm/io.h:581:58: note: expanded from macro '__do_outsw'
   #define __do_outsw(p, b, n)     writesw((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
                                           ~~~~~~~~~~~~~~~~~~~~~^
   In file included from mm/hugetlb.c:11:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/powerpc/include/asm/hardirq.h:6:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/powerpc/include/asm/io.h:640:
   arch/powerpc/include/asm/io-defs.h:53:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
   DEF_PCI_AC_NORET(outsl, (unsigned long p, const void *b, unsigned long c),
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/io.h:637:3: note: expanded from macro 'DEF_PCI_AC_NORET'
                   __do_##name al;                                 \
                   ^~~~~~~~~~~~~~
   <scratch space>:110:1: note: expanded from here
   __do_outsl
   ^
   arch/powerpc/include/asm/io.h:582:58: note: expanded from macro '__do_outsl'
   #define __do_outsl(p, b, n)     writesl((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
                                           ~~~~~~~~~~~~~~~~~~~~~^
   mm/hugetlb.c:653:8: error: call to undeclared function '__pte_alloc_one'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
                 ^
   mm/hugetlb.c:653:8: note: did you mean 'pte_alloc_one'?
   arch/powerpc/include/asm/pgalloc.h:30:25: note: 'pte_alloc_one' declared here
   static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
                           ^
   mm/hugetlb.c:653:28: error: use of undeclared identifier 'GFP_PGTABLE_USER'
           new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
                                     ^
   mm/hugetlb.c:660:25: error: incompatible pointer types passing 'pgtable_t' (aka 'unsigned long *') to parameter of type 'struct page *' [-Werror,-Wincompatible-pointer-types]
                   pgtable_pte_page_dtor(new);
                                         ^~~
   include/linux/mm.h:2661:55: note: passing argument to parameter 'page' here
   static inline void pgtable_pte_page_dtor(struct page *page)
                                                         ^
   mm/hugetlb.c:661:3: error: incompatible pointer types passing 'pgtable_t' (aka 'unsigned long *') to parameter of type 'struct page *' [-Werror,-Wincompatible-pointer-types]
                   __free_page(new);
                   ^~~~~~~~~~~~~~~~
   include/linux/gfp.h:319:40: note: expanded from macro '__free_page'
   #define __free_page(page) __free_pages((page), 0)
                                          ^~~~~~
   include/linux/gfp.h:302:39: note: passing argument to parameter 'page' here
   extern void __free_pages(struct page *page, unsigned int order);
                                         ^
>> mm/hugetlb.c:666:44: error: incompatible pointer types passing 'pgtable_t' (aka 'unsigned long *') to parameter of type 'const struct page *' [-Werror,-Wincompatible-pointer-types]
                   hugetlb_install_markers_pte(page_address(new), marker);
                                                            ^~~
   include/linux/mm.h:2001:39: note: passing argument to parameter 'page' here
   void *page_address(const struct page *page);
                                         ^
   6 warnings and 5 errors generated.


vim +666 mm/hugetlb.c

   606	
   607	/*
   608	 * hugetlb_alloc_pte -- Allocate a PTE beneath a pmd_none PMD-level hpte.
   609	 *
   610	 * See the comment above hugetlb_alloc_pmd.
   611	 */
   612	pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
   613			unsigned long addr)
   614	{
   615		spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
   616		pgtable_t new;
   617		pmd_t *pmdp;
   618		pmd_t pmd;
   619		bool is_marker;
   620		pte_marker marker;
   621	
   622		if (hpte->level != HUGETLB_LEVEL_PMD)
   623			return ERR_PTR(-EINVAL);
   624	
   625		pmdp = (pmd_t *)hpte->ptep;
   626	retry:
   627		is_marker = false;
   628		pmd = READ_ONCE(*pmdp);
   629		if (likely(pmd_present(pmd)))
   630			return unlikely(pmd_leaf(pmd))
   631				? ERR_PTR(-EEXIST)
   632				: pte_offset_kernel(pmdp, addr);
   633		else if (!pmd_none(pmd)) {
   634			/*
   635			 * Not present and not none means that a swap entry lives here.
   636			 * If it's a PTE marker, we can deal with it. If it's another
   637			 * swap entry, we don't attempt to split it.
   638			 */
   639			is_marker = is_pte_marker(__pte(pmd_val(pmd)));
   640			if (!is_marker)
   641				return ERR_PTR(-EEXIST);
   642	
   643			marker = pte_marker_get(pte_to_swp_entry(__pte(pmd_val(pmd))));
   644		}
   645	
   646		/*
   647		 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
   648		 * in page tables being allocated in high memory, needing a kmap to
   649		 * access. Instead, we call __pte_alloc_one directly with
   650		 * GFP_PGTABLE_USER to prevent these PTEs being allocated in high
   651		 * memory.
   652		 */
   653		new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
   654		if (!new)
   655			return ERR_PTR(-ENOMEM);
   656	
   657		spin_lock(ptl);
   658		if (!pmd_same(pmd, *pmdp)) {
   659			spin_unlock(ptl);
   660			pgtable_pte_page_dtor(new);
   661			__free_page(new);
   662			goto retry;
   663		}
   664	
   665		if (is_marker)
 > 666			hugetlb_install_markers_pte(page_address(new), marker);
   667	
   668		mm_inc_nr_ptes(mm);
   669		smp_wmb(); /* See comment in pmd_install() */
   670		pmd_populate(mm, pmdp, new);
   671		spin_unlock(ptl);
   672		return pte_offset_kernel(pmdp, addr);
   673	}
   674	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2023-02-18  0:41   ` Mina Almasry
@ 2023-02-21 15:59     ` James Houghton
  2023-02-21 19:33       ` Mike Kravetz
  0 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-21 15:59 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:42 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > It would be bad if we actually set PageUptodate with UFFDIO_CONTINUE;
> > PageUptodate indicates that the page has been zeroed, and we don't want
> > to give a non-zeroed page to the user.
> >
> > The reason this change is being made now is because UFFDIO_CONTINUEs on
> > subpages definitely shouldn't set this page flag on the head page.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 07abcb6eb203..792cb2e67ce5 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6256,7 +6256,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >          * preceding stores to the page contents become visible before
> >          * the set_pte_at() write.
> >          */
> > -       __folio_mark_uptodate(folio);
> > +       if (!is_continue)
> > +               __folio_mark_uptodate(folio);
> > +       else if (!folio_test_uptodate(folio)) {
> > +               /*
> > +                * This should never happen; HugeTLB pages are always Uptodate
> > +                * as soon as they are allocated.
> > +                */
>
> if (is_continue) then we grab a page from the page cache, no? Are
> pages in page caches always uptodate? Why? I guess that means they're
> mapped hence uptodate?
>
> Also this comment should explain why pages in the page cache are
> always uptodate, no? Because this error branch is hit if (is_continue
> && !folio_test_uptodate()), not when pages are freshly allocated.

There was some discussion about it here[1].

Without even thinking about how the pages become uptodate, I think
this patch is justified like this: UFFDIO_CONTINUE => we aren't
actually changing the contents of the page, so we shouldn't be
changing the uptodate-ness of the page.

HugeTLB pages in the page cache are always uptodate:
1. fallocate -- the page is allocated, zeroed, marked as uptodate, and
then placed in the page cache.
2. hugetlb_no_page -- same as above.

So uptodate <=> "the page has been zeroed", and it would be very bad if
we gave a !uptodate page to userspace via UFFDIO_CONTINUE.

I'll update the comment to something like:

"HugeTLB pages are always Uptodate as soon as they are added to the
page cache. Given that we aren't changing the contents of the page, we
shouldn't be updating the Uptodate-ness of the page."

[1]: https://lore.kernel.org/linux-mm/Y5JrS4o5Detzid9V@monkey/

Thanks, Mina. :)

- James

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers
  2023-02-18  1:40   ` Mina Almasry
@ 2023-02-21 16:16     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-21 16:16 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 5:40 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > hugetlb_hgm_eligible indicates that a VMA is eligible to have HGM
> > explicitly enabled via MADV_SPLIT, and hugetlb_hgm_enabled indicates
> > that HGM has been enabled.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
>
> Only nits:
> Reviewed-by: Mina Almasry <almasrymina@google.com>

Thanks Mina. :)

>
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 7c977d234aba..efd2635a87f5 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -1211,6 +1211,20 @@ static inline void hugetlb_unregister_node(struct node *node)
> >  }
> >  #endif /* CONFIG_HUGETLB_PAGE */
> >
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> > +bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> > +#else
> > +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > +{
> > +       return false;
> > +}
> > +static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> > +{
> > +       return false;
> > +}
> > +#endif
> > +
> >  static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >                                         struct mm_struct *mm, pte_t *pte)
> >  {
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6c008c9de80e..0576dcc98044 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -7004,6 +7004,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
> >  #ifdef CONFIG_USERFAULTFD
> >         if (uffd_disable_huge_pmd_share(vma))
> >                 return false;
> > +#endif
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +       if (hugetlb_hgm_enabled(vma))
> > +               return false;
> >  #endif
> >         /*
> >          * Only shared VMAs can share PMDs.
> > @@ -7267,6 +7271,18 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
> >
> >  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> >
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
>
> I think the other function you named pmd_sharing_possible(), I suggest
> hugetlb_hgm_possible() for some consistency.

Good idea. Will do.

>
> > +{
> > +       /* All shared VMAs may have HGM. */
>
> I think this is a redundant comment.

Indeed. I'll change it to something like:

"HGM is not supported for private mappings. Operations that apply to
MAP_PRIVATE VMAs like hugetlb_wp haven't been updated to support
high-granularity mappings."

>
> > +       return vma && (vma->vm_flags & VM_MAYSHARE);
> > +}
> > +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > +{
> > +       return vma && (vma->vm_flags & VM_HUGETLB_HGM);
> > +}
> > +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> > +
> >  /*
> >   * These functions are overwritable if your architecture needs its own
> >   * behavior.
> > --
> > 2.39.2.637.g21b0678d19-goog
> >

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM
  2023-02-18  1:58   ` Mina Almasry
@ 2023-02-21 16:33     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-21 16:33 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 5:58 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
> > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> > applied to non-HugeTLB memory in the future, if such an application is
> > to arise.
> >
> > MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> > address ranges:
> > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> >    alignment.
> > 2. read()ing a page fault event from a userfaultfd will yield a
> >    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> >    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> >
> > There is no way to disable the API changes that come with issuing
> > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> > table mappings that come from the extended functionality that comes with
> > using MADV_SPLIT.
> >
>
> So is a hugetlb page or VMA that has been MADV_SPLIT + MADV_COLLAPSE
> distinct from a hugetlb page or vma that has not been? I thought
> COLLAPSE would reverse the effects on SPLIT completely.

Right now, MADV_COLLAPSE does *not* completely undo the effects of an
MADV_SPLIT. The API changes that come from MADV_SPLIT aren't undone
with an MADV_COLLAPSE.

>
> > For post-copy live migration, the expected use-case is:
> > 1. mmap(MAP_SHARED, some_fd) primary mapping
> > 2. mmap(MAP_SHARED, some_fd) alias mapping
> > 3. MADV_SPLIT the primary mapping
> > 4. UFFDIO_REGISTER/etc. the primary mapping
> > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> >    corresponding PAGE_SIZE sections in the primary mapping.
> >
>
> Huh, so MADV_SPLIT doesn't actually split an existing PMD mapping into
> high granularity mappings. Instead it says that future mappings may be
> high granularity? I assume they may not even be high granularity, like
> if the alias mapping faulted in a full hugetlb page (without
> UFFDIO_CONTINUE) that page would be regular mapped not high
> granularity mapped.

MADV_SPLIT just means "userspace is aware that they are able to start
mapping HugeTLB pages at high-granularity". Right now the only way to
get high-granularity mappings is with UFFDIO_CONTINUE, but there may
be other ways in the future.

As of this series, if you MADV_SPLIT a HugeTLB VMA and you aren't
using userfaultfd minor faults, it's basically a no-op. The mappings
that are created will still be huge. I could change this, but I don't
really see a reason to right now.
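
To make the intended use concrete, the minimal flow with this series is
something like the below (just a sketch -- uffd setup, error handling,
and the definitions of primary/alias/uffd/offset/src are omitted, and
those names are only placeholders):

/* Step 3 from the cover letter: opt in to the HGM API changes. */
madvise(primary, len, MADV_SPLIT);

/* Step 4: UFFDIO_REGISTER the primary mapping in minor mode, as usual. */

/*
 * Step 5: resolve one minor fault at PAGE_SIZE granularity: copy the
 * data in through the alias mapping, then install just that 4K piece
 * (4096 == base page size on x86) in the primary mapping.
 */
memcpy(alias + offset, src, 4096);

struct uffdio_continue cont = {
        .range = {
                .start = (unsigned long)primary + offset,
                .len = 4096,
        },
};
ioctl(uffd, UFFDIO_CONTINUE, &cont);

Without the MADV_SPLIT, that UFFDIO_CONTINUE would have to cover a whole
hugepage; with it, PAGE_SIZE-aligned ranges are accepted.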

>
> This may be bikeshedding but I do think a clearer name is warranted.
> Maybe MADV_MAY_SPLIT or something.

I agree -- MADV_MAY_SPLIT more accurately describes the HugeTLB
functionality. I really don't mind what the MADV is called.

I think enabling the high-granularity userfaultfd bits with a
userfaultfd feature[1] worked reasonably well. There is some API
discussion in that thread[1].

[1]: https://lore.kernel.org/linux-mm/20221021163703.3218176-34-jthoughton@google.com/

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2023-02-18  5:24   ` Mina Almasry
@ 2023-02-21 16:36     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-21 16:36 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 9:24 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > After high-granularity mapping, page table entries for HugeTLB pages can
> > be of any size/type. (For example, we can have a 1G page mapped with a
> > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > PTE after we have done a page table walk.
> >
> > Without this, we'd have to pass around the "size" of the PTE everywhere.
> > We effectively did this before; it could be fetched from the hstate,
> > which we pass around pretty much everywhere.
> >
> > hugetlb_pte_present_leaf is included here as a helper function that will
> > be used frequently later on.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
>
> Only nits.
>
> Reviewed-by: Mina Almasry <almasrymina@google.com>

Thanks Mina!

>
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index a1ceb9417f01..eeacadf3272b 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -26,6 +26,25 @@ typedef struct { unsigned long pd; } hugepd_t;
> >  #define __hugepd(x) ((hugepd_t) { (x) })
> >  #endif
> >
> > +enum hugetlb_level {
> > +       HUGETLB_LEVEL_PTE = 1,
> > +       /*
> > +        * We always include PMD, PUD, and P4D in this enum definition so that,
> > +        * when logged as an integer, we can easily tell which level it is.
> > +        */
> > +       HUGETLB_LEVEL_PMD,
> > +       HUGETLB_LEVEL_PUD,
> > +       HUGETLB_LEVEL_P4D,
> > +       HUGETLB_LEVEL_PGD,
> > +};
> > +
> > +struct hugetlb_pte {
> > +       pte_t *ptep;
> > +       unsigned int shift;
> > +       enum hugetlb_level level;
> > +       spinlock_t *ptl;
> > +};
> > +
> >  #ifdef CONFIG_HUGETLB_PAGE
> >
> >  #include <linux/mempolicy.h>
> > @@ -39,6 +58,20 @@ typedef struct { unsigned long pd; } hugepd_t;
> >   */
> >  #define __NR_USED_SUBPAGE 3
> >
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > +       return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > +       return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > +
> >  struct hugepage_subpool {
> >         spinlock_t lock;
> >         long count;
> > @@ -1234,6 +1267,45 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >         return ptl;
> >  }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
> > +{
> > +       return hpte->ptl;
> > +}
>
> I find this helper unnecessary. I would remove it.

Ok. Will do.

>
> > +
> > +static inline
> > +spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
> > +{
> > +       spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
> > +
> > +       spin_lock(ptl);
>
> Here 'spin_lock(hpte->ptl)' would be more immediately understandable
> IMO, for example.
>
> > +       return ptl;
> > +}
> > +
> > +static inline
> > +void __hugetlb_pte_init(struct hugetlb_pte *hpte, pte_t *ptep,
> > +                       unsigned int shift, enum hugetlb_level level,
> > +                       spinlock_t *ptl)
> > +{
> > +       /*
> > +        * If 'shift' indicates that this PTE is contiguous, then @ptep must
> > +        * be the first pte of the contiguous bunch.
> > +        */
>
> I would move the comment to above the function as a pseudo doc. It
> seems to instruct the user of the function of how to use it.

Right. Will do.

>
> > +       hpte->ptl = ptl;
> > +       hpte->ptep = ptep;
> > +       hpte->shift = shift;
> > +       hpte->level = level;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_init(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +                     pte_t *ptep, unsigned int shift,
> > +                     enum hugetlb_level level)
> > +{
> > +       __hugetlb_pte_init(hpte, ptep, shift, level,
> > +                          huge_pte_lockptr(shift, mm, ptep));
> > +}
> > +
> >  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
> >  extern void __init hugetlb_cma_reserve(int order);
> >  #else
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 5ca9eae0ac42..6c74adff43b6 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1269,6 +1269,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> >         return false;
> >  }
> >
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
> > +{
> > +       pgd_t pgd;
> > +       p4d_t p4d;
> > +       pud_t pud;
> > +       pmd_t pmd;
> > +
> > +       switch (hpte->level) {
> > +       case HUGETLB_LEVEL_PGD:
> > +               pgd = __pgd(pte_val(pte));
> > +               return pgd_present(pgd) && pgd_leaf(pgd);
> > +       case HUGETLB_LEVEL_P4D:
> > +               p4d = __p4d(pte_val(pte));
> > +               return p4d_present(p4d) && p4d_leaf(p4d);
> > +       case HUGETLB_LEVEL_PUD:
> > +               pud = __pud(pte_val(pte));
> > +               return pud_present(pud) && pud_leaf(pud);
> > +       case HUGETLB_LEVEL_PMD:
> > +               pmd = __pmd(pte_val(pte));
> > +               return pmd_present(pmd) && pmd_leaf(pmd);
> > +       case HUGETLB_LEVEL_PTE:
> > +               return pte_present(pte);
> > +       default:
> > +               WARN_ON_ONCE(1);
> > +               return false;
> > +       }
> > +}
> > +
> > +
> >  static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
> >  {
> >         int nid = folio_nid(folio);
> > --
> > 2.39.2.637.g21b0678d19-goog
> >

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2023-02-18 18:07   ` kernel test robot
@ 2023-02-21 17:09     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-21 17:09 UTC (permalink / raw)
  To: kernel test robot
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton, llvm,
	oe-kbuild-all, Linux Memory Management List, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-kernel

On Sat, Feb 18, 2023 at 10:08 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi James,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on next-20230217]
> [cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc8 v6.2-rc7 v6.2-rc6 v6.2-rc8]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting a patch, we suggest using '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
> patch link:    https://lore.kernel.org/r/20230218002819.1486479-14-jthoughton%40google.com
> patch subject: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
> config: arm64-randconfig-r005-20230217 (https://download.01.org/0day-ci/archive/20230219/202302190101.aoXrbN26-lkp@intel.com/config)
> compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project db89896bbbd2251fff457699635acbbedeead27f)
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # install arm64 cross compiling tool for clang build
>         # apt-get install binutils-aarch64-linux-gnu
>         # https://github.com/intel-lab-lkp/linux/commit/7e55fe945a1b5f042746277050390bdeba9e22d2
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
>         git checkout 7e55fe945a1b5f042746277050390bdeba9e22d2
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash
>
> If you fix the issue, kindly add the following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Link: https://lore.kernel.org/oe-kbuild-all/202302190101.aoXrbN26-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> >> ld.lld: error: undefined symbol: hugetlb_walk_step
>    >>> referenced by hugetlb.c
>    >>>               mm/hugetlb.o:(__hugetlb_hgm_walk) in archive vmlinux.a
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests
>

This is fixed by providing a trivial definition of __hugetlb_hgm_walk
when !CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING. Will be done for v3.
hugetlb_walk_step() is only defined by architectures that support HGM.
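
Concretely, I'm thinking of something like this in mm/hugetlb.c, with the
real walker kept under the #ifdef (the signature and return convention
here are approximate, just to show the shape of the fix):

#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
                              struct hugetlb_pte *hpte, unsigned long addr,
                              unsigned long target_sz, bool alloc)
{
        /* HGM compiled out: never walk below the hstate level. */
        return -EINVAL;
}
#endif

With that, the real walker (and its call to hugetlb_walk_step()) is only
built when CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING is set, so the
undefined reference goes away on architectures without HGM support.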

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2023-02-21 15:59     ` James Houghton
@ 2023-02-21 19:33       ` Mike Kravetz
  2023-02-21 19:58         ` James Houghton
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Kravetz @ 2023-02-21 19:33 UTC (permalink / raw)
  To: James Houghton
  Cc: Mina Almasry, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/21/23 07:59, James Houghton wrote:
> On Fri, Feb 17, 2023 at 4:42 PM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> > >
> > > It would be bad if we actually set PageUptodate with UFFDIO_CONTINUE;
> > > PageUptodate indicates that the page has been zeroed, and we don't want
> > > to give a non-zeroed page to the user.
> > >
> > > The reason this change is being made now is because UFFDIO_CONTINUEs on
> > > subpages definitely shouldn't set this page flag on the head page.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 07abcb6eb203..792cb2e67ce5 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -6256,7 +6256,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >          * preceding stores to the page contents become visible before
> > >          * the set_pte_at() write.
> > >          */
> > > -       __folio_mark_uptodate(folio);
> > > +       if (!is_continue)
> > > +               __folio_mark_uptodate(folio);
> > > +       else if (!folio_test_uptodate(folio)) {
> > > +               /*
> > > +                * This should never happen; HugeTLB pages are always Uptodate
> > > +                * as soon as they are allocated.
> > > +                */
> >
> > if (is_continue) then we grab a page from the page cache, no? Are
> > pages in page caches always uptodate? Why? I guess that means they're
> > mapped hence uptodate?
> >
> > Also this comment should explain why pages in the page cache are
> > always uptodate, no? Because this error branch is hit if (is_continue
> > && !folio_test_uptodate()), not when pages are freshly allocated.
> 
> There was some discussion about it here[1].
> 
> Without even thinking about how the pages become uptodate, I think
> this patch is justified like this: UFFDIO_CONTINUE => we aren't
> actually changing the contents of the page, so we shouldn't be
> changing the uptodate-ness of the page.

Agree!

> HugeTLB pages in the page cache are always uptodate:
> 1. fallocate -- the page is allocated, zeroed, marked as uptodate, and
> then placed in the page cache.
> 2. hugetlb_no_page -- same as above.
> 
> So uptodate <=> "the page has been zeroed", so it would be very bad if
> we gave a !uptodate page to userspace via UFFDIO_CONTINUE.
> 
> I'll update the comment to something like:
> 
> "HugeTLB pages are always Uptodate as soon as they are added to the
> page cache. Given that we aren't changing the contents of the page, we
> shouldn't be updating the Uptodate-ness of the page."

Perhaps a better way of saying it is that hugetlb pages are marked uptodate
shortly after allocation when their contents are initialized.  Initialized
data could be zero, or it could be contents copied from another location
(such as in the UFFDIO_COPY case also handled in this routine).

Saying "PageUptodate indicates that the page has been zeroed" as in the
commit message is technically not correct.

Ack to the patch.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
  2023-02-21 19:33       ` Mike Kravetz
@ 2023-02-21 19:58         ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-21 19:58 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Mina Almasry, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Tue, Feb 21, 2023 at 11:34 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 02/21/23 07:59, James Houghton wrote:
> > On Fri, Feb 17, 2023 at 4:42 PM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> > > >
> > > > It would be bad if we actually set PageUptodate with UFFDIO_CONTINUE;
> > > > PageUptodate indicates that the page has been zeroed, and we don't want
> > > > to give a non-zeroed page to the user.
> > > >
> > > > The reason this change is being made now is because UFFDIO_CONTINUEs on
> > > > subpages definitely shouldn't set this page flag on the head page.
> > > >
> > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > >
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index 07abcb6eb203..792cb2e67ce5 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -6256,7 +6256,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > >          * preceding stores to the page contents become visible before
> > > >          * the set_pte_at() write.
> > > >          */
> > > > -       __folio_mark_uptodate(folio);
> > > > +       if (!is_continue)
> > > > +               __folio_mark_uptodate(folio);
> > > > +       else if (!folio_test_uptodate(folio)) {
> > > > +               /*
> > > > +                * This should never happen; HugeTLB pages are always Uptodate
> > > > +                * as soon as they are allocated.
> > > > +                */
> > >
> > > if (is_continue) then we grab a page from the page cache, no? Are
> > > pages in page caches always uptodate? Why? I guess that means they're
> > > mapped hence uptodate?
> > >
> > > Also this comment should explain why pages in the page cache are
> > > always uptodate, no? Because this error branch is hit if (is_continue
> > > && !folio_test_uptodate()), not when pages are freshly allocated.
> >
> > There was some discussion about it here[1].
> >
> > Without even thinking about how the pages become uptodate, I think
> > this patch is justified like this: UFFDIO_CONTINUE => we aren't
> > actually changing the contents of the page, so we shouldn't be
> > changing the uptodate-ness of the page.
>
> Agree!
>
> > HugeTLB pages in the page cache are always uptodate:
> > 1. fallocate -- the page is allocated, zeroed, marked as uptodate, and
> > then placed in the page cache.
> > 2. hugetlb_no_page -- same as above.
> >
> > So uptodate <=> "the page has been zeroed", so it would be very bad if
> > we gave a !uptodate page to userspace via UFFDIO_CONTINUE.
> >
> > I'll update the comment to something like:
> >
> > "HugeTLB pages are always Uptodate as soon as they are added to the
> > page cache. Given that we aren't changing the contents of the page, we
> > shouldn't be updating the Uptodate-ness of the page."
>
> Perhaps a better way of saying it is that hugetlb pages are marked uptodate
> shortly after allocation when their contents are initialized.  Initialized
> data could be zero, or it could be contents copied from another location
> (such as in the UFFDIO_COPY case also handled in this routine).

I'll write something like this. Thank you!

>
> Saying "PageUptodate indicates that the page has been zeroed" as in the
> commit message is technically not correct.

And I'll make sure to update the commit description as well.

>
> Ack to the patch.

Thanks, Mike!

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
  2023-02-18  0:27 [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping James Houghton
                   ` (45 preceding siblings ...)
  2023-02-18  0:28 ` [PATCH v2 46/46] selftests/mm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
@ 2023-02-21 21:46 ` Mike Kravetz
  2023-02-22 15:48   ` David Hildenbrand
  46 siblings, 1 reply; 96+ messages in thread
From: Mike Kravetz @ 2023-02-21 21:46 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> This series introduces the concept of HugeTLB high-granularity mapping
> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
> high-granularity, similar to how THPs can be PTE-mapped.
> 
> Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
> architectures and (some) support for MAP_PRIVATE will come later.
> 
> This series is based on latest mm-unstable (ccd6a73daba9).
> 
> Notable changes with this series
> ================================
> 
>  - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
>    mapcounting for non-anon hugetlb.
>  - The mapcounting scheme uses subpages' mapcounts for high-granularity
>    mappings, but it does not use subpages_mapcount(). This scheme
>    prevents the HugeTLB VMEMMAP optimization from being used, so it
>    will be improved in a later series.
>  - page_add_file_rmap and page_remove_rmap are updated so they can be
>    used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
>  - MADV_SPLIT has been added to enable the userspace API changes that
>    HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
>    changes in the future). MADV_SPLIT does NOT force all the mappings to
>    be PAGE_SIZE.
>  - MADV_COLLAPSE is expanded to include HugeTLB mappings.
> 
> Old versions:
> v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
> RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
> RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
> 
> Changelog:
> v1 -> v2 (thanks Peter for all your suggestions!):
> - Changed mapcount to be more THP-like, and make HGM incompatible with
>   HVO.
> - HGM is now disabled by default to leave HVO enabled by default.

I understand the reasoning behind the move to THP-like mapcounting, and the
incompatibility with HVO.  However, I just got to patch 5 and realized either
HGM or HVO will need to be chosen at kernel build time.  That may not be an
issue for cloud providers or others building their own kernels for internal
use.  However, distro kernels will need to pick one option or the other.
Right now, my Fedora desktop has HVO enabled so it would likely not have
HGM enabled.  That is not a big deal for a desktop.
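
(To illustrate what I mean by "chosen at kernel build time" -- this is not
a quote of the actual patch, and the exact dependency line may differ, but
the constraint amounts to something like:

config HUGETLB_HIGH_GRANULARITY_MAPPING
        bool "HugeTLB high-granularity mapping support"
        depends on X86_64 && HUGETLB_PAGE
        depends on !HUGETLB_PAGE_OPTIMIZE_VMEMMAP

so a config that keeps HVO enabled simply cannot also turn on HGM.)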

Just curious, do we have distro kernel users that want to use HGM?

I see it mentioned that this incompatibility will be addressed in a future
series. This certainly will be required before HGM can be expanded for
use cases such as memory errors and page poisoning.

Just curious of other thoughts?  Does the first version of HGM need to be
compatible with HVO?
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 17/46] hugetlbfs: do a full walk to check if vma maps a page
  2023-02-18  0:27 ` [PATCH v2 17/46] hugetlbfs: do a full walk to check if vma maps a page James Houghton
@ 2023-02-22 15:46   ` James Houghton
  2023-02-28 23:52     ` Mike Kravetz
  0 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-22 15:46 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:29 PM James Houghton <jthoughton@google.com> wrote:
>
> Because it is safe to do so, do a full high-granularity page table walk
> to check if the page is mapped.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index cfd09f95551b..c0ee69f0418e 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -386,17 +386,24 @@ static void hugetlb_delete_from_page_cache(struct folio *folio)
>  static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
>                                 unsigned long addr, struct page *page)
>  {
> -       pte_t *ptep, pte;
> +       pte_t pte;
> +       struct hugetlb_pte hpte;
>
> -       ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
> -       if (!ptep)
> +       if (hugetlb_full_walk(&hpte, vma, addr))
>                 return false;
>
> -       pte = huge_ptep_get(ptep);
> +       pte = huge_ptep_get(hpte.ptep);
>         if (huge_pte_none(pte) || !pte_present(pte))
>                 return false;
>
> -       if (pte_page(pte) == page)
> +       if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte)))
> +               /*
> +                * We raced with someone splitting us, and the only case
> +                * where this is impossible is when the pte was none.
> +                */
> +               return false;
> +
> +       if (compound_head(pte_page(pte)) == page)
>                 return true;
>
>         return false;
> --
> 2.39.2.637.g21b0678d19-goog
>

I think this patch is actually incorrect.

This function is *supposed* to check if the page is mapped at all in
this VMA, but really we're only checking if the base address of the
page is mapped. If we did the 'hugetlb_vma_maybe_maps_page' approach
that I did previously and returned 'true' if
!hugetlb_pte_present_leaf(), then this code would be correct again.

But what I really think this function should do is just call
page_vma_mapped_walk(). We're sort of reimplementing it here anyway.
Unless someone disagrees, I'll do this for v3.
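
Roughly what I have in mind (untested sketch):

static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
                                  unsigned long addr, struct page *page)
{
        DEFINE_PAGE_VMA_WALK(pvmw, page, vma, addr, 0);

        if (!page_vma_mapped_walk(&pvmw))
                return false;
        /* Found a mapping of some piece of the page in this VMA. */
        page_vma_mapped_walk_done(&pvmw);
        return true;
}

page_vma_mapped_walk() already knows how to visit every PTE that could
map the page, so we wouldn't miss high-granularity mappings of the
non-head subpages the way the current code does.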

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
  2023-02-21 21:46 ` [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping Mike Kravetz
@ 2023-02-22 15:48   ` David Hildenbrand
  2023-02-22 20:57     ` Mina Almasry
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2023-02-22 15:48 UTC (permalink / raw)
  To: Mike Kravetz, James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 21.02.23 22:46, Mike Kravetz wrote:
> On 02/18/23 00:27, James Houghton wrote:
>> This series introduces the concept of HugeTLB high-granularity mapping
>> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
>> high-granularity, similar to how THPs can be PTE-mapped.
>>
>> Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
>> architectures and (some) support for MAP_PRIVATE will come later.
>>
>> This series is based on latest mm-unstable (ccd6a73daba9).
>>
>> Notable changes with this series
>> ================================
>>
>>   - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
>>     mapcounting for non-anon hugetlb.
>>   - The mapcounting scheme uses subpages' mapcounts for high-granularity
>>     mappings, but it does not use subpages_mapcount(). This scheme
>>     prevents the HugeTLB VMEMMAP optimization from being used, so it
>>     will be improved in a later series.
>>   - page_add_file_rmap and page_remove_rmap are updated so they can be
>>     used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
>>   - MADV_SPLIT has been added to enable the userspace API changes that
>>     HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
>>     changes in the future). MADV_SPLIT does NOT force all the mappings to
>>     be PAGE_SIZE.
>>   - MADV_COLLAPSE is expanded to include HugeTLB mappings.
>>
>> Old versions:
>> v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
>> RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
>> RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
>>
>> Changelog:
>> v1 -> v2 (thanks Peter for all your suggestions!):
>> - Changed mapcount to be more THP-like, and make HGM incompatible with
>>    HVO.
>> - HGM is now disabled by default to leave HVO enabled by default.
> 
> I understand the reasoning behind the move to THP-like mapcounting, and the
> incompatibility with HVO.  However, I just got to patch 5 and realized either
> HGM or HVO will need to be chosen at kernel build time.  That may not be an
> issue for cloud providers or others building their own kernels for internal
> use.  However, distro kernels will need to pick one option or the other.
> Right now, my Fedora desktop has HVO enabled so it would likely not have
> HGM enabled.  That is not a big deal for a desktop.
> 
> Just curious, do we have distro kernel users that want to use HGM?

Most certainly I would say :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
  2023-02-22 15:48   ` David Hildenbrand
@ 2023-02-22 20:57     ` Mina Almasry
  2023-02-23  9:07       ` David Hildenbrand
  0 siblings, 1 reply; 96+ messages in thread
From: Mina Almasry @ 2023-02-22 20:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, James Houghton, Muchun Song, Peter Xu,
	Andrew Morton, David Rientjes, Axel Rasmussen, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Wed, Feb 22, 2023 at 7:49 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.02.23 22:46, Mike Kravetz wrote:
> > On 02/18/23 00:27, James Houghton wrote:
> >> This series introduces the concept of HugeTLB high-granularity mapping
> >> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
> >> high-granularity, similar to how THPs can be PTE-mapped.
> >>
> >> Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
> >> architectures and (some) support for MAP_PRIVATE will come later.
> >>
> >> This series is based on latest mm-unstable (ccd6a73daba9).
> >>
> >> Notable changes with this series
> >> ================================
> >>
> >>   - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
> >>     mapcounting for non-anon hugetlb.
> >>   - The mapcounting scheme uses subpages' mapcounts for high-granularity
> >>     mappings, but it does not use subpages_mapcount(). This scheme
> >>     prevents the HugeTLB VMEMMAP optimization from being used, so it
> >>     will be improved in a later series.
> >>   - page_add_file_rmap and page_remove_rmap are updated so they can be
> >>     used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
> >>   - MADV_SPLIT has been added to enable the userspace API changes that
> >>     HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
> >>     changes in the future). MADV_SPLIT does NOT force all the mappings to
> >>     be PAGE_SIZE.
> >>   - MADV_COLLAPSE is expanded to include HugeTLB mappings.
> >>
> >> Old versions:
> >> v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
> >> RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
> >> RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
> >>
> >> Changelog:
> >> v1 -> v2 (thanks Peter for all your suggestions!):
> >> - Changed mapcount to be more THP-like, and make HGM incompatible with
> >>    HVO.
> >> - HGM is now disabled by default to leave HVO enabled by default.
> >
> > I understand the reasoning behind the move to THP-like mapcounting, and the
> > incompatibility with HVO.  However, I just got to patch 5 and realized either
> > HGM or HVO will need to be chosen at kernel build time.  That may not be an
> > issue for cloud providers or others building their own kernels for internal
> > use.  However, distro kernels will need to pick one option or the other.
> > Right now, my Fedora desktop has HVO enabled so it would likely not have
> > HGM enabled.  That is not a big deal for a desktop.
> >
> > Just curious, do we have distro kernel users that want to use HGM?
>
> Most certainly I would say :)
>

Is it a blocker to merge in an initial implementation though? Do
distro kernel users have a pressing need for HVO + HGM used in tandem?


> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 15/46] hugetlb: add make_huge_pte_with_shift
  2023-02-18  0:27 ` [PATCH v2 15/46] hugetlb: add make_huge_pte_with_shift James Houghton
@ 2023-02-22 21:14   ` Mina Almasry
  2023-02-22 22:53     ` James Houghton
  0 siblings, 1 reply; 96+ messages in thread
From: Mina Almasry @ 2023-02-22 21:14 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> This allows us to make huge PTEs at shifts other than the hstate shift,
> which will be necessary for high-granularity mappings.
>
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
>

Reviewed-by: Mina Almasry <almasrymina@google.com>

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f74183acc521..ed1d806020de 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5110,11 +5110,11 @@ const struct vm_operations_struct hugetlb_vm_ops = {
>         .pagesize = hugetlb_vm_op_pagesize,
>  };
>
> -static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> -                               int writable)
> +static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
> +                                     struct page *page, int writable,
> +                                     int shift)

Nit: can this be 'unsigned int shift'. Because you're actually passing
it an unsigned int below and there is an implicit cast there. Yes it
will never matter, I know...

>  {
>         pte_t entry;
> -       unsigned int shift = huge_page_shift(hstate_vma(vma));
>
>         if (writable) {
>                 entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
> @@ -5128,6 +5128,14 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
>         return entry;
>  }
>
> +static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> +                          int writable)
> +{
> +       unsigned int shift = huge_page_shift(hstate_vma(vma));
> +
> +       return make_huge_pte_with_shift(vma, page, writable, shift);
> +}
> +
>  static void set_huge_ptep_writable(struct vm_area_struct *vma,
>                                    unsigned long address, pte_t *ptep)
>  {
> --
> 2.39.2.637.g21b0678d19-goog
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings
  2023-02-18  0:27 ` [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
@ 2023-02-22 21:17   ` Mina Almasry
  2023-02-22 22:52     ` James Houghton
  2023-02-28 23:02   ` Mike Kravetz
  1 sibling, 1 reply; 96+ messages in thread
From: Mina Almasry @ 2023-02-22 21:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> This is a simple change: don't create a "huge" PTE if we are making a
> regular, PAGE_SIZE PTE. All architectures that want to implement HGM
> likely need to be changed in a similar way if they implement their own
> version of arch_make_huge_pte.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 726d581158b1..b767b6889dea 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -899,7 +899,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
>  static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
>                                        vm_flags_t flags)
>  {
> -       return pte_mkhuge(entry);
> +       return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
>  }
>  #endif
>

How are contig_pte's handled here? Will shift show that it's actually
a contig_pte and not just PAGE_SHIFT? Or is that arm64 specific so it
exists only in the arm64 version of this function? Do we need to worry
about it here?

> --
> 2.39.2.637.g21b0678d19-goog
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings
  2023-02-22 21:17   ` Mina Almasry
@ 2023-02-22 22:52     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-22 22:52 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Wed, Feb 22, 2023 at 1:18 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > This is a simple change: don't create a "huge" PTE if we are making a
> > regular, PAGE_SIZE PTE. All architectures that want to implement HGM
> > likely need to be changed in a similar way if they implement their own
> > version of arch_make_huge_pte.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 726d581158b1..b767b6889dea 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -899,7 +899,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
> >  static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
> >                                        vm_flags_t flags)
> >  {
> > -       return pte_mkhuge(entry);
> > +       return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
> >  }
> >  #endif
> >
>
> How are contig_pte's handled here? Will shift show that it's actually
> a contig_pte and not just PAGE_SHIFT? Or is that arm64 specific so it
> exists only in the arm64 version of this function? Do we need to worry
> about it here?

arm64 implements its own version of arch_make_huge_pte, and 'shift'
does indeed tell arm64 whether the PTE is contiguous (it will be
CONT_PTE_SHIFT in that case, for example), so the generic version here
doesn't need to worry about it.
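
For reference, the arm64 override currently looks roughly like the
below (paraphrased from arch/arm64/mm/hugetlbpage.c from memory, so
treat it as a sketch rather than the exact upstream code):

pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
{
	size_t pagesize = 1UL << shift;

	entry = pte_mkhuge(entry);
	if (pagesize == CONT_PTE_SIZE)
		entry = pte_mkcont(entry);
	else if (pagesize == CONT_PMD_SIZE)
		entry = pmd_pte(pmd_mkcont(pte_pmd(entry)));
	else if (pagesize != PUD_SIZE && pagesize != PMD_SIZE)
		pr_warn("%s: unrecognized huge page size 0x%lx\n",
			__func__, pagesize);
	return entry;
}

So the contiguous-bit handling lives entirely in the arch override; the
generic helper only needs to avoid setting the huge bit for PAGE_SHIFT
entries, which is all this patch changes. (The arm64 version will need
an equivalent change when arm64 HGM support lands.)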

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 15/46] hugetlb: add make_huge_pte_with_shift
  2023-02-22 21:14   ` Mina Almasry
@ 2023-02-22 22:53     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-22 22:53 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Wed, Feb 22, 2023 at 1:15 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > This allows us to make huge PTEs at shifts other than the hstate shift,
> > which will be necessary for high-granularity mappings.
> >
> > Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
>
> Reviewed-by: Mina Almasry <almasrymina@google.com>

Thank you :)

>
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f74183acc521..ed1d806020de 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5110,11 +5110,11 @@ const struct vm_operations_struct hugetlb_vm_ops = {
> >         .pagesize = hugetlb_vm_op_pagesize,
> >  };
> >
> > -static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> > -                               int writable)
> > +static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
> > +                                     struct page *page, int writable,
> > +                                     int shift)
>
> Nit: can this be 'unsigned int shift'. Because you're actually passing
> it an unsigned int below and there is an implicit cast there. Yes it
> will never matter, I know...

Yes I think it should be unsigned int. Thanks for the catch.

>
> >  {
> >         pte_t entry;
> > -       unsigned int shift = huge_page_shift(hstate_vma(vma));
> >
> >         if (writable) {
> >                 entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
> > @@ -5128,6 +5128,14 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> >         return entry;
> >  }
> >
> > +static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> > +                          int writable)
> > +{
> > +       unsigned int shift = huge_page_shift(hstate_vma(vma));
> > +
> > +       return make_huge_pte_with_shift(vma, page, writable, shift);
> > +}
> > +
> >  static void set_huge_ptep_writable(struct vm_area_struct *vma,
> >                                    unsigned long address, pte_t *ptep)
> >  {
> > --
> > 2.39.2.637.g21b0678d19-goog
> >

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
  2023-02-22 20:57     ` Mina Almasry
@ 2023-02-23  9:07       ` David Hildenbrand
  2023-02-23 15:53         ` James Houghton
  0 siblings, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2023-02-23  9:07 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Mike Kravetz, James Houghton, Muchun Song, Peter Xu,
	Andrew Morton, David Rientjes, Axel Rasmussen, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 22.02.23 21:57, Mina Almasry wrote:
> On Wed, Feb 22, 2023 at 7:49 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 21.02.23 22:46, Mike Kravetz wrote:
>>> On 02/18/23 00:27, James Houghton wrote:
>>>> This series introduces the concept of HugeTLB high-granularity mapping
>>>> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
>>>> high-granularity, similar to how THPs can be PTE-mapped.
>>>>
>>>> Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
>>>> architectures and (some) support for MAP_PRIVATE will come later.
>>>>
>>>> This series is based on latest mm-unstable (ccd6a73daba9).
>>>>
>>>> Notable changes with this series
>>>> ================================
>>>>
>>>>    - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
>>>>      mapcounting for non-anon hugetlb.
>>>>    - The mapcounting scheme uses subpages' mapcounts for high-granularity
>>>>      mappings, but it does not use subpages_mapcount(). This scheme
>>>>      prevents the HugeTLB VMEMMAP optimization from being used, so it
>>>>      will be improved in a later series.
>>>>    - page_add_file_rmap and page_remove_rmap are updated so they can be
>>>>      used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
>>>>    - MADV_SPLIT has been added to enable the userspace API changes that
>>>>      HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
>>>>      changes in the future). MADV_SPLIT does NOT force all the mappings to
>>>>      be PAGE_SIZE.
>>>>    - MADV_COLLAPSE is expanded to include HugeTLB mappings.
>>>>
>>>> Old versions:
>>>> v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
>>>> RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
>>>> RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
>>>>
>>>> Changelog:
>>>> v1 -> v2 (thanks Peter for all your suggestions!):
>>>> - Changed mapcount to be more THP-like, and make HGM incompatible with
>>>>     HVO.
>>>> - HGM is now disabled by default to leave HVO enabled by default.
>>>
>>> I understand the reasoning behind the move to THP-like mapcounting, and the
>>> incompatibility with HVO.  However, I just got to patch 5 and realized either
>>> HGM or HVO will need to be chosen at kernel build time.  That may not be an
>>> issue for cloud providers or others building their own kernels for internal
>>> use.  However, distro kernels will need to pick one option or the other.
>>> Right now, my Fedora desktop has HVO enabled so it would likely not have
>>> HGM enabled.  That is not a big deal for a desktop.
>>>
>>> Just curious, do we have distro kernel users that want to use HGM?
>>
>> Most certainly I would say :)
>>
> 
> Is it a blocker to merge in an initial implementation though? Do
> distro kernel users have a pressing need for HVO + HGM used in tandem?

At least RHEL9 seems to include HVO. It's not enabled as default 
(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON not set), but compiled 
in so it can be runtime-enabled. Disabling HVO is not an option IMHO.

Maybe, one could make both features compile-time compatible but 
runtime-mutually exclusive. Or work on a way to make them fully 
compatible right from the start.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
  2023-02-23  9:07       ` David Hildenbrand
@ 2023-02-23 15:53         ` James Houghton
  2023-02-23 16:17           ` David Hildenbrand
  2023-02-23 18:25           ` Mike Kravetz
  0 siblings, 2 replies; 96+ messages in thread
From: James Houghton @ 2023-02-23 15:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mina Almasry, Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Thu, Feb 23, 2023 at 1:07 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.02.23 21:57, Mina Almasry wrote:
> > On Wed, Feb 22, 2023 at 7:49 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 21.02.23 22:46, Mike Kravetz wrote:
> >>> On 02/18/23 00:27, James Houghton wrote:
> >>>> This series introduces the concept of HugeTLB high-granularity mapping
> >>>> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
> >>>> high-granularity, similar to how THPs can be PTE-mapped.
> >>>>
> >>>> Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
> >>>> architectures and (some) support for MAP_PRIVATE will come later.
> >>>>
> >>>> This series is based on latest mm-unstable (ccd6a73daba9).
> >>>>
> >>>> Notable changes with this series
> >>>> ================================
> >>>>
> >>>>    - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
> >>>>      mapcounting for non-anon hugetlb.
> >>>>    - The mapcounting scheme uses subpages' mapcounts for high-granularity
> >>>>      mappings, but it does not use subpages_mapcount(). This scheme
> >>>>      prevents the HugeTLB VMEMMAP optimization from being used, so it
> >>>>      will be improved in a later series.
> >>>>    - page_add_file_rmap and page_remove_rmap are updated so they can be
> >>>>      used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
> >>>>    - MADV_SPLIT has been added to enable the userspace API changes that
> >>>>      HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
> >>>>      changes in the future). MADV_SPLIT does NOT force all the mappings to
> >>>>      be PAGE_SIZE.
> >>>>    - MADV_COLLAPSE is expanded to include HugeTLB mappings.
> >>>>
> >>>> Old versions:
> >>>> v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
> >>>> RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
> >>>> RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
> >>>>
> >>>> Changelog:
> >>>> v1 -> v2 (thanks Peter for all your suggestions!):
> >>>> - Changed mapcount to be more THP-like, and make HGM incompatible with
> >>>>     HVO.
> >>>> - HGM is now disabled by default to leave HVO enabled by default.
> >>>
> >>> I understand the reasoning behind the move to THP-like mapcounting, and the
> >>> incompatibility with HVO.  However, I just got to patch 5 and realized either
> >>> HGM or HVO will need to be chosen at kernel build time.  That may not be an
> >>> issue for cloud providers or others building their own kernels for internal
> >>> use.  However, distro kernels will need to pick one option or the other.
> >>> Right now, my Fedora desktop has HVO enabled so it would likely not have
> >>> HGM enabled.  That is not a big deal for a desktop.
> >>>
> >>> Just curious, do we have distro kernel users that want to use HGM?
> >>
> >> Most certainly I would say :)

I'm not sure. Maybe distros want the hwpoison benefits HGM provides?
But that's not implemented in this series.

> >>
> >
> > Is it a blocker to merge in an initial implementation though? Do
> > distro kernel users have a pressing need for HVO + HGM used in tandem?

+1. I don't see why this should be a blocker.

>
> At least RHEL9 seems to include HVO. It's not enabled as default
> (CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON not set), but compiled
> in so it can be runtime-enabled. Disabling HVO is not an option IMHO.

I agree!

CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y is still the default here; I
made sure not to change that. :)

>
> Maybe, one could make both features compile-time compatible but
> runtime-mutually exclusive. Or work on a way to make them fully
> compatible right from the start.

For the sake of simplifying this series as much as possible, going
with the THP-like mapcount scheme that we know works properly seems
like the right decision to me, even though it is incompatible with
HVO.

Making HGM and HVO play nice at runtime is a little bit complicated,
and it becomes worthless as soon as we optimize the mapcount strategy.
So let's just optimize the mapcount strategy, but in a later series.

As soon as this series has been fully reviewed, patches will be sent up to:
1. Change the mapcount scheme to make HGM and HVO compatible again
(and make MADV_COLLAPSE faster)
2. Add arm64 support
3. Add hwpoison support

If we try to integrate #1 with this series now, I fear that that will
just slow things down more than if #1 is sent up by itself later.

(FWIW, #2 is basically fully implemented and #3 is basically done for
MAP_SHARED. Each of these series are MUCH smaller than this main one.)

- James

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
  2023-02-23 15:53         ` James Houghton
@ 2023-02-23 16:17           ` David Hildenbrand
  2023-02-23 18:33             ` Dr. David Alan Gilbert
  2023-02-23 18:25           ` Mike Kravetz
  1 sibling, 1 reply; 96+ messages in thread
From: David Hildenbrand @ 2023-02-23 16:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Mina Almasry, Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Rientjes, Axel Rasmussen, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 23.02.23 16:53, James Houghton wrote:
> On Thu, Feb 23, 2023 at 1:07 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 22.02.23 21:57, Mina Almasry wrote:
>>> On Wed, Feb 22, 2023 at 7:49 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 21.02.23 22:46, Mike Kravetz wrote:
>>>>> On 02/18/23 00:27, James Houghton wrote:
>>>>>> This series introduces the concept of HugeTLB high-granularity mapping
>>>>>> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
>>>>>> high-granularity, similar to how THPs can be PTE-mapped.
>>>>>>
>>>>>> Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
>>>>>> architectures and (some) support for MAP_PRIVATE will come later.
>>>>>>
>>>>>> This series is based on latest mm-unstable (ccd6a73daba9).
>>>>>>
>>>>>> Notable changes with this series
>>>>>> ================================
>>>>>>
>>>>>>     - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
>>>>>>       mapcounting for non-anon hugetlb.
>>>>>>     - The mapcounting scheme uses subpages' mapcounts for high-granularity
>>>>>>       mappings, but it does not use subpages_mapcount(). This scheme
>>>>>>       prevents the HugeTLB VMEMMAP optimization from being used, so it
>>>>>>       will be improved in a later series.
>>>>>>     - page_add_file_rmap and page_remove_rmap are updated so they can be
>>>>>>       used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
>>>>>>     - MADV_SPLIT has been added to enable the userspace API changes that
>>>>>>       HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
>>>>>>       changes in the future). MADV_SPLIT does NOT force all the mappings to
>>>>>>       be PAGE_SIZE.
>>>>>>     - MADV_COLLAPSE is expanded to include HugeTLB mappings.
>>>>>>
>>>>>> Old versions:
>>>>>> v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
>>>>>> RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
>>>>>> RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
>>>>>>
>>>>>> Changelog:
>>>>>> v1 -> v2 (thanks Peter for all your suggestions!):
>>>>>> - Changed mapcount to be more THP-like, and make HGM incompatible with
>>>>>>      HVO.
>>>>>> - HGM is now disabled by default to leave HVO enabled by default.
>>>>>
>>>>> I understand the reasoning behind the move to THP-like mapcounting, and the
>>>>> incompatibility with HVO.  However, I just got to patch 5 and realized either
>>>>> HGM or HVO will need to be chosen at kernel build time.  That may not be an
>>>>> issue for cloud providers or others building their own kernels for internal
>>>>> use.  However, distro kernels will need to pick one option or the other.
>>>>> Right now, my Fedora desktop has HVO enabled so it would likely not have
>>>>> HGM enabled.  That is not a big deal for a desktop.
>>>>>
>>>>> Just curious, do we have distro kernel users that want to use HGM?
>>>>
>>>> Most certainly I would say :)
> 
> I'm not sure. Maybe distros want the hwpoison benefits HGM provides?
> But that's not implemented in this series.

 From what I can tell, HGM helps to improve live migration of VMs with 
gigantic pages. That sounds like a good reason why distros (that support 
virtualization) might want it independent of hwpoison changes.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
  2023-02-23 15:53         ` James Houghton
  2023-02-23 16:17           ` David Hildenbrand
@ 2023-02-23 18:25           ` Mike Kravetz
  1 sibling, 0 replies; 96+ messages in thread
From: Mike Kravetz @ 2023-02-23 18:25 UTC (permalink / raw)
  To: James Houghton
  Cc: David Hildenbrand, Mina Almasry, Muchun Song, Peter Xu,
	Andrew Morton, David Rientjes, Axel Rasmussen, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/23/23 07:53, James Houghton wrote:
> On Thu, Feb 23, 2023 at 1:07 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 22.02.23 21:57, Mina Almasry wrote:
> > > On Wed, Feb 22, 2023 at 7:49 AM David Hildenbrand <david@redhat.com> wrote:
> > >>
> > >> On 21.02.23 22:46, Mike Kravetz wrote:
> > >>> On 02/18/23 00:27, James Houghton wrote:
> > >>>> This series introduces the concept of HugeTLB high-granularity mapping
> > >>>> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
> > >>>> high-granularity, similar to how THPs can be PTE-mapped.
> > >>>>
> > >>>> Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
> > >>>> architectures and (some) support for MAP_PRIVATE will come later.
> > >>>>
> > >>>> This series is based on latest mm-unstable (ccd6a73daba9).
> > >>>>
> > >>>> Notable changes with this series
> > >>>> ================================
> > >>>>
> > >>>>    - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
> > >>>>      mapcounting for non-anon hugetlb.
> > >>>>    - The mapcounting scheme uses subpages' mapcounts for high-granularity
> > >>>>      mappings, but it does not use subpages_mapcount(). This scheme
> > >>>>      prevents the HugeTLB VMEMMAP optimization from being used, so it
> > >>>>      will be improved in a later series.
> > >>>>    - page_add_file_rmap and page_remove_rmap are updated so they can be
> > >>>>      used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
> > >>>>    - MADV_SPLIT has been added to enable the userspace API changes that
> > >>>>      HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
> > >>>>      changes in the future). MADV_SPLIT does NOT force all the mappings to
> > >>>>      be PAGE_SIZE.
> > >>>>    - MADV_COLLAPSE is expanded to include HugeTLB mappings.
> > >>>>
> > >>>> Old versions:
> > >>>> v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
> > >>>> RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
> > >>>> RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
> > >>>>
> > >>>> Changelog:
> > >>>> v1 -> v2 (thanks Peter for all your suggestions!):
> > >>>> - Changed mapcount to be more THP-like, and make HGM incompatible with
> > >>>>     HVO.
> > >>>> - HGM is now disabled by default to leave HVO enabled by default.
> > >>>
> > >>> I understand the reasoning behind the move to THP-like mapcounting, and the
> > >>> incompatibility with HVO.  However, I just got to patch 5 and realized either
> > >>> HGM or HVO will need to be chosen at kernel build time.  That may not be an
> > >>> issue for cloud providers or others building their own kernels for internal
> > >>> use.  However, distro kernels will need to pick one option or the other.
> > >>> Right now, my Fedora desktop has HVO enabled so it would likely not have
> > >>> HGM enabled.  That is not a big deal for a desktop.
> > >>>
> > >>> Just curious, do we have distro kernel users that want to use HGM?
> > >>
> > >> Most certainly I would say :)
> 
> I'm not sure. Maybe distros want the hwpoison benefits HGM provides?
> But that's not implemented in this series.
> 
> > >>
> > >
> > > Is it a blocker to merge in an initial implementation though? Do
> > > distro kernel users have a pressing need for HVO + HGM used in tandem?
> 
> +1. I don't see why this should be a blocker.
> 
> >
> > At least RHEL9 seems to include HVO. It's not enabled as default
> > (CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON not set), but compiled
> > in so it can be runtime-enabled. Disabling HVO is not an option IMHO.
> 
> I agree!
> 
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y is still the default here; I
> made sure not to change that. :)
> 
> >
> > Maybe, one could make both features compile-time compatible but
> > runtime-mutually exclusive. Or work on a way to make them fully
> > compatible right from the start.
> 
> For the sake of simplifying this series as much as possible, going
> with the THP-like mapcount scheme that we know works properly seems
> like the right decision to me, even though it is incompatible with
> HVO.
> 
> Making HGM and HVO play nice at runtime is a little bit complicated,
> and it becomes worthless as soon as we optimize the mapcount strategy.
> So let's just optimize the mapcount strategy, but in a later series.
> 
> As soon as this series has been fully reviewed, patches will be sent up to:
> 1. Change the mapcount scheme to make HGM and HVO compatible again
> (and make MADV_COLLAPSE faster)
> 2. Add arm64 support
> 3. Add hwpoison support
> 
> If we try to integrate #1 with this series now, I fear that that will
> just slow things down more than if #1 is sent up by itself later.
> 
> (FWIW, #2 is basically fully implemented and #3 is basically done for
> MAP_SHARED. Each of these series are MUCH smaller than this main one.)

By asking this question, my intention was NOT to force HGM and HVO
compatibility now.  Rather, just to ask if there were any distro kernels
or environments that enable HVO now, and want HGM ASAP.  Was hoping someone
from Red Hat would chime in: thanks David!

FYI - Oracle is keen on using HVO to recover every possible bit of memory. See,
https://lore.kernel.org/linux-mm/20221110121214.6297-1-joao.m.martins@oracle.com/
In addition, the Oracle kernel has CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y,
so it cannot immediately take advantage of HGM.  That is OK 'for now'.

I will try to ignore the mapcount issue right now and focus on the rest
of the series.  Thanks for all your efforts, James!
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 00/46] hugetlb: introduce HugeTLB high-granularity mapping
  2023-02-23 16:17           ` David Hildenbrand
@ 2023-02-23 18:33             ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 96+ messages in thread
From: Dr. David Alan Gilbert @ 2023-02-23 18:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: James Houghton, Mina Almasry, Mike Kravetz, Muchun Song,
	Peter Xu, Andrew Morton, David Rientjes, Axel Rasmussen,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

* David Hildenbrand (david@redhat.com) wrote:
> On 23.02.23 16:53, James Houghton wrote:
> > On Thu, Feb 23, 2023 at 1:07 AM David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > On 22.02.23 21:57, Mina Almasry wrote:
> > > > On Wed, Feb 22, 2023 at 7:49 AM David Hildenbrand <david@redhat.com> wrote:
> > > > > 
> > > > > On 21.02.23 22:46, Mike Kravetz wrote:
> > > > > > On 02/18/23 00:27, James Houghton wrote:
> > > > > > > This series introduces the concept of HugeTLB high-granularity mapping
> > > > > > > (HGM). This series teaches HugeTLB how to map HugeTLB pages at
> > > > > > > high-granularity, similar to how THPs can be PTE-mapped.
> > > > > > > 
> > > > > > > Support for HGM in this series is for MAP_SHARED VMAs on x86_64 only. Other
> > > > > > > architectures and (some) support for MAP_PRIVATE will come later.
> > > > > > > 
> > > > > > > This series is based on latest mm-unstable (ccd6a73daba9).
> > > > > > > 
> > > > > > > Notable changes with this series
> > > > > > > ================================
> > > > > > > 
> > > > > > >     - hugetlb_add_file_rmap / hugetlb_remove_rmap are added to handle
> > > > > > >       mapcounting for non-anon hugetlb.
> > > > > > >     - The mapcounting scheme uses subpages' mapcounts for high-granularity
> > > > > > >       mappings, but it does not use subpages_mapcount(). This scheme
> > > > > > >       prevents the HugeTLB VMEMMAP optimization from being used, so it
> > > > > > >       will be improved in a later series.
> > > > > > >     - page_add_file_rmap and page_remove_rmap are updated so they can be
> > > > > > >       used by hugetlb_add_file_rmap / hugetlb_remove_rmap.
> > > > > > >     - MADV_SPLIT has been added to enable the userspace API changes that
> > > > > > >       HGM allows for: high-granularity UFFDIO_CONTINUE (and maybe other
> > > > > > >       changes in the future). MADV_SPLIT does NOT force all the mappings to
> > > > > > >       be PAGE_SIZE.
> > > > > > >     - MADV_COLLAPSE is expanded to include HugeTLB mappings.
> > > > > > > 
> > > > > > > Old versions:
> > > > > > > v1: https://lore.kernel.org/linux-mm/20230105101844.1893104-1-jthoughton@google.com/
> > > > > > > RFC v2: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/
> > > > > > > RFC v1: https://lore.kernel.org/linux-mm/20220624173656.2033256-1-jthoughton@google.com/
> > > > > > > 
> > > > > > > Changelog:
> > > > > > > v1 -> v2 (thanks Peter for all your suggestions!):
> > > > > > > - Changed mapcount to be more THP-like, and make HGM incompatible with
> > > > > > >      HVO.
> > > > > > > - HGM is now disabled by default to leave HVO enabled by default.
> > > > > > 
> > > > > > I understand the reasoning behind the move to THP-like mapcounting, and the
> > > > > > incompatibility with HVO.  However, I just got to patch 5 and realized either
> > > > > > HGM or HVO will need to be chosen at kernel build time.  That may not be an
> > > > > > issue for cloud providers or others building their own kernels for internal
> > > > > > use.  However, distro kernels will need to pick one option or the other.
> > > > > > Right now, my Fedora desktop has HVO enabled so it would likely not have
> > > > > > HGM enabled.  That is not a big deal for a desktop.
> > > > > > 
> > > > > > Just curious, do we have distro kernel users that want to use HGM?
> > > > > 
> > > > > Most certainly I would say :)
> > 
> > I'm not sure. Maybe distros want the hwpoison benefits HGM provides?
> > But that's not implemented in this series.
> 
> From what I can tell, HGM helps to improve live migration of VMs with
> gigantic pages. That sounds like a good reason why distros (that support
> virtualization) might want it independent of hwpoison changes.

Yes, in particular for postcopy migration of those VMs, where we can't
afford the latency of waiting for the entire gigantic page to bubble
along the network.

Dave

> -- 
> Thanks,
> 
> David / dhildenb
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 46/46] selftests/mm: add HGM UFFDIO_CONTINUE and hwpoison tests
  2023-02-18  0:28 ` [PATCH v2 46/46] selftests/mm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
@ 2023-02-24 17:37   ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-24 17:37 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

> +static int test_fork(int uffd, char *primary_map, size_t len)
> +{
> +       int status;
> +       int ret = 0;
> +       pid_t pid;
> +       pthread_t uffd_thd;
> +
> +       /*
> +        * UFFD_FEATURE_EVENT_FORK will put fork event on the userfaultfd,
> +        * which we must read, otherwise we block fork(). Setup a thread to
> +        * read that event now.
> +        *
> +        * Page fault events should result in a SIGBUS, so we expect only a
> +        * single event from the uffd (the fork event).
> +        */
> +       if (read_event_from_uffd(&uffd, &uffd_thd))
> +               return -1;
> +
> +       pid = fork();
> +
> +       if (!pid) {
> +               /*
> +                * Because we have UFFDIO_REGISTER_MODE_WP and
> +                * UFFD_FEATURE_EVENT_FORK, the page tables should be copied
> +                * exactly.
> +                *
> +                * Check that everything except that last 4K has correct
> +                * contents, and then check that the last 4K gets a SIGBUS.
> +                */
> +               printf(PREFIX "child validating...\n");
> +               ret = verify_contents(primary_map, len, false) ||
> +                       test_sigbus(primary_map + len - 1, false);
> +               ret = 0;
> +               exit(ret ? 1 : 0);
> +       } else {
> +               /* wait for the child to finish. */
> +               waitpid(pid, &status, 0);
> +               ret = WEXITSTATUS(status);
> +               if (!ret) {
> +                       printf(PREFIX "parent validating...\n");
> +                       /* Same check as the child. */
> +                       ret = verify_contents(primary_map, len, false) ||
> +                               test_sigbus(primary_map + len - 1, false);
> +                       ret = 0;

I'm not sure how these 'ret = 0's got here -- they will be removed.

> +               }
> +       }

This else block also runs when fork() fails; we need to fail the test instead.
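
Something like this, roughly (untested sketch, reusing the helpers
already in this test and dropping the stray 'ret = 0's):

	pid = fork();
	if (pid < 0) {
		perror(PREFIX "fork failed");
		ret = -1;
	} else if (!pid) {
		printf(PREFIX "child validating...\n");
		ret = verify_contents(primary_map, len, false) ||
			test_sigbus(primary_map + len - 1, false);
		exit(ret ? 1 : 0);
	} else {
		/* wait for the child to finish. */
		waitpid(pid, &status, 0);
		ret = WEXITSTATUS(status);
		if (!ret) {
			printf(PREFIX "parent validating...\n");
			/* Same check as the child. */
			ret = verify_contents(primary_map, len, false) ||
				test_sigbus(primary_map + len - 1, false);
		}
	}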

> +
> +       pthread_join(uffd_thd, NULL);
> +       return ret;
> +
> +}

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 22/46] hugetlb: add HGM support to copy_hugetlb_page_range
  2023-02-18  0:27 ` [PATCH v2 22/46] hugetlb: add HGM support to copy_hugetlb_page_range James Houghton
@ 2023-02-24 17:39   ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-24 17:39 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:29 PM James Houghton <jthoughton@google.com> wrote:
>
> This allows fork() to work with high-granularity mappings. The page
> table structure is copied such that partially mapped regions will remain
> partially mapped in the same way for the new process.
>
> A page's reference count is incremented for *each* portion of it that
> is mapped in the page table. For example, if you have a PMD-mapped 1G
> page, the reference count will be incremented by 512.
>
> mapcount is handled similar to THPs: if you're completely mapping a
> hugepage, then the compound_mapcount is incremented. If you're mapping a
> part of it, the subpages that are getting mapped will have their
> mapcounts incremented.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1a1a71868dfd..2fe1eb6897d4 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -162,6 +162,8 @@ void hugepage_put_subpool(struct hugepage_subpool *spool);
>
>  void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
>                          struct hstate *h, struct vm_area_struct *vma);
> +void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
> +                          struct hstate *h, struct vm_area_struct *vma);
>
>  void hugetlb_dup_vma_private(struct vm_area_struct *vma);
>  void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 693332b7e186..210c6f2b16a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -141,6 +141,37 @@ void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
>                         page_remove_rmap(subpage, vma, false);
>         }
>  }
> +/*
> + * hugetlb_add_file_rmap() - increment the mapcounts for file-backed hugetlb
> + * pages appropriately.
> + *
> + * For pages that are being mapped with their hstate-level PTE (e.g., a 1G page
> + * being mapped with a 1G PUD), then we increment the compound_mapcount for the
> + * head page.
> + *
> + * For pages that are being mapped with high-granularity, we increment the
> + * mapcounts for the individual subpages that are getting mapped.
> + */
> +void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
> +                          struct hstate *h, struct vm_area_struct *vma)
> +{
> +       struct page *hpage = compound_head(subpage);
> +
> +       if (shift == huge_page_shift(h)) {
> +               VM_BUG_ON_PAGE(subpage != hpage, subpage);
> +               page_add_file_rmap(hpage, vma, true);
> +       } else {
> +               unsigned long nr_subpages = 1UL << (shift - PAGE_SHIFT);
> +               struct page *final_page = &subpage[nr_subpages];
> +
> +               VM_BUG_ON_PAGE(HPageVmemmapOptimized(hpage), hpage);
> +               /*
> +                * Increment the mapcount on each page that is getting mapped.
> +                */
> +               for (; subpage < final_page; ++subpage)
> +                       page_add_file_rmap(subpage, vma, false);
> +       }
> +}
>
>  static inline bool subpool_is_free(struct hugepage_subpool *spool)
>  {
> @@ -5210,7 +5241,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                             struct vm_area_struct *src_vma)
>  {
>         pte_t *src_pte, *dst_pte, entry;
> -       struct page *ptepage;
> +       struct hugetlb_pte src_hpte, dst_hpte;
> +       struct page *ptepage, *hpage;
>         unsigned long addr;
>         bool cow = is_cow_mapping(src_vma->vm_flags);
>         struct hstate *h = hstate_vma(src_vma);
> @@ -5238,18 +5270,24 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>         }
>
>         last_addr_mask = hugetlb_mask_last_page(h);
> -       for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
> +       addr = src_vma->vm_start;
> +       while (addr < src_vma->vm_end) {
>                 spinlock_t *src_ptl, *dst_ptl;
> -               src_pte = hugetlb_walk(src_vma, addr, sz);
> -               if (!src_pte) {
> -                       addr |= last_addr_mask;
> +               unsigned long hpte_sz;
> +
> +               if (hugetlb_full_walk(&src_hpte, src_vma, addr)) {
> +                       addr = (addr | last_addr_mask) + sz;
>                         continue;
>                 }
> -               dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
> -               if (!dst_pte) {
> -                       ret = -ENOMEM;
> +               ret = hugetlb_full_walk_alloc(&dst_hpte, dst_vma, addr,
> +                               hugetlb_pte_size(&src_hpte));
> +               if (ret)
>                         break;
> -               }
> +
> +               src_pte = src_hpte.ptep;
> +               dst_pte = dst_hpte.ptep;
> +
> +               hpte_sz = hugetlb_pte_size(&src_hpte);
>
>                 /*
>                  * If the pagetables are shared don't copy or take references.
> @@ -5259,13 +5297,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                  * another vma. So page_count of ptep page is checked instead
>                  * to reliably determine whether pte is shared.
>                  */
> -               if (page_count(virt_to_page(dst_pte)) > 1) {
> -                       addr |= last_addr_mask;
> +               if (hugetlb_pte_size(&dst_hpte) == sz &&
> +                   page_count(virt_to_page(dst_pte)) > 1) {
> +                       addr = (addr | last_addr_mask) + sz;
>                         continue;
>                 }
>
> -               dst_ptl = huge_pte_lock(h, dst, dst_pte);
> -               src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
> +               dst_ptl = hugetlb_pte_lock(&dst_hpte);
> +               src_ptl = hugetlb_pte_lockptr(&src_hpte);
>                 spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>                 entry = huge_ptep_get(src_pte);
>  again:
> @@ -5309,10 +5348,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                          */
>                         if (userfaultfd_wp(dst_vma))
>                                 set_huge_pte_at(dst, addr, dst_pte, entry);
> +               } else if (!hugetlb_pte_present_leaf(&src_hpte, entry)) {
> +                       /* Retry the walk. */
> +                       spin_unlock(src_ptl);
> +                       spin_unlock(dst_ptl);
> +                       continue;
>                 } else {
> -                       entry = huge_ptep_get(src_pte);
>                         ptepage = pte_page(entry);
> -                       get_page(ptepage);
> +                       hpage = compound_head(ptepage);
> +                       get_page(hpage);
>
>                         /*
>                          * Failing to duplicate the anon rmap is a rare case
> @@ -5324,13 +5368,34 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                          * need to be without the pgtable locks since we could
>                          * sleep during the process.
>                          */
> -                       if (!PageAnon(ptepage)) {
> -                               page_add_file_rmap(ptepage, src_vma, true);
> -                       } else if (page_try_dup_anon_rmap(ptepage, true,
> +                       if (!PageAnon(hpage)) {
> +                               hugetlb_add_file_rmap(ptepage,
> +                                               src_hpte.shift, h, src_vma);
> +                       }
> +                       /*
> +                        * It is currently impossible to get anonymous HugeTLB
> +                        * high-granularity mappings, so we use 'hpage' here.
> +                        *
> +                        * This will need to be changed when HGM support for
> +                        * anon mappings is added.
> +                        */
> +                       else if (page_try_dup_anon_rmap(hpage, true,
>                                                           src_vma)) {
>                                 pte_t src_pte_old = entry;
>                                 struct folio *new_folio;
>
> +                               /*
> +                                * If we are mapped at high granularity, we
> +                                * may end up allocating lots and lots of
> +                                * hugepages when we only need one. Bail out
> +                                * now.
> +                                */
> +                               if (hugetlb_pte_size(&src_hpte) != sz) {
> +                                       put_page(hpage);
> +                                       ret = -EINVAL;
> +                                       break;
> +                               }
> +

Although this block can never execute today (anonymous HGM mappings
aren't possible yet, per the comment just above it), it should come
after the following spin_unlock()s.
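
That is, move the hunk down (unchanged) so it reads roughly:

				spin_unlock(src_ptl);
				spin_unlock(dst_ptl);
				/*
				 * If we are mapped at high granularity, we
				 * may end up allocating lots and lots of
				 * hugepages when we only need one. Bail out
				 * now.
				 */
				if (hugetlb_pte_size(&src_hpte) != sz) {
					put_page(hpage);
					ret = -EINVAL;
					break;
				}
				/* Do not use reserve as it's private owned */

Breaking at that point is fine because both page table locks have
already been dropped.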

>                                 spin_unlock(src_ptl);
>                                 spin_unlock(dst_ptl);
>                                 /* Do not use reserve as it's private owned */
> @@ -5342,7 +5407,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                                 }
>                                 copy_user_huge_page(&new_folio->page, ptepage, addr, dst_vma,
>                                                     npages);
> -                               put_page(ptepage);
> +                               put_page(hpage);
>
>                                 /* Install the new hugetlb folio if src pte stable */
>                                 dst_ptl = huge_pte_lock(h, dst, dst_pte);
> @@ -5360,6 +5425,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                                 hugetlb_install_folio(dst_vma, dst_pte, addr, new_folio);
>                                 spin_unlock(src_ptl);
>                                 spin_unlock(dst_ptl);
> +                               addr += hugetlb_pte_size(&src_hpte);
>                                 continue;
>                         }
>
> @@ -5376,10 +5442,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                         }
>
>                         set_huge_pte_at(dst, addr, dst_pte, entry);
> -                       hugetlb_count_add(npages, dst);
> +                       hugetlb_count_add(
> +                                       hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
> +                                       dst);
>                 }
>                 spin_unlock(src_ptl);
>                 spin_unlock(dst_ptl);
> +               addr += hugetlb_pte_size(&src_hpte);
>         }
>
>         if (cow) {
> --
> 2.39.2.637.g21b0678d19-goog
>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 35/46] hugetlb: add check to prevent refcount overflow via HGM
  2023-02-18  0:28 ` [PATCH v2 35/46] hugetlb: add check to prevent refcount overflow via HGM James Houghton
@ 2023-02-24 17:42   ` James Houghton
  2023-02-24 18:05     ` James Houghton
  0 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-02-24 17:42 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

> @@ -5397,7 +5397,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                 } else {
>                         ptepage = pte_page(entry);
>                         hpage = compound_head(ptepage);
> -                       get_page(hpage);
> +                       if (try_get_page(hpage)) {
> +                               ret = -EFAULT;
> +                               break;

spin_unlock(src_ptl) and spin_unlock(dst_ptl) are required here.

For v3, I'll make sure there's a selftest that actually verifies that
refcount overflow is handled gracefully.

> +                       }
>
>                         /*
>                          * Failing to duplicate the anon rmap is a rare case
> @@ -6132,6 +6135,30 @@ static bool hugetlb_pte_stable(struct hstate *h, struct hugetlb_pte *hpte,
>         return same;
>  }

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 35/46] hugetlb: add check to prevent refcount overflow via HGM
  2023-02-24 17:42   ` James Houghton
@ 2023-02-24 18:05     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-24 18:05 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 24, 2023 at 9:42 AM James Houghton <jthoughton@google.com> wrote:
>
> > @@ -5397,7 +5397,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >                 } else {
> >                         ptepage = pte_page(entry);
> >                         hpage = compound_head(ptepage);
> > -                       get_page(hpage);
> > +                       if (try_get_page(hpage)) {
> > +                               ret = -EFAULT;
> > +                               break;
>
> spin_unlock(src_ptl) and spin_unlock(dst_ptl) is required here.
>
> I'll make sure there's a selftest that actually makes sure that
> refcount overflowing is handled gracefully for v3.

And this should be !try_get_page(). This hunk was a last-minute
addition to this commit; apparently I hadn't retested fork() after I
made this change. Sorry! The hugetlb-hgm selftest immediately catches
this problem.
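
Putting the two fixes together, the hunk should end up looking roughly
like this (untested):

		} else {
			ptepage = pte_page(entry);
			hpage = compound_head(ptepage);
			if (!try_get_page(hpage)) {
				/* The refcount would overflow: fail gracefully. */
				spin_unlock(src_ptl);
				spin_unlock(dst_ptl);
				ret = -EFAULT;
				break;
			}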

- James

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 07/46] mm: add VM_HUGETLB_HGM VMA flag
  2023-02-18  0:27 ` [PATCH v2 07/46] mm: add VM_HUGETLB_HGM VMA flag James Houghton
@ 2023-02-24 22:35   ` Mike Kravetz
  0 siblings, 0 replies; 96+ messages in thread
From: Mike Kravetz @ 2023-02-24 22:35 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> VM_HUGETLB_HGM indicates that a HugeTLB VMA may contain high-granularity
> mappings. Its VmFlags string is "hm".
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6a96e1713fd5..77b72f42556a 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

If there is any push back on using a bit in vm flags, we can go back to
your original scheme of embedding info in the hugetlb per-vma structure.
-- 
Mike Kravetz

> @@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
>  		[ilog2(VM_UFFD_MINOR)]	= "ui",
>  #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +		[ilog2(VM_HUGETLB_HGM)]	= "hm",
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>  	};
>  	size_t i;
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2992a2d55aee..9d3216b4284a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -383,6 +383,13 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_UFFD_MINOR		VM_NONE
>  #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
>  
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +# define VM_HUGETLB_HGM_BIT	38
> +# define VM_HUGETLB_HGM		BIT(VM_HUGETLB_HGM_BIT)	/* HugeTLB high-granularity mapping */
> +#else /* !CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> +# define VM_HUGETLB_HGM		VM_NONE
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> +
>  /* Bits set in the VMA until the stack is in its final location */
>  #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
>  
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 9db52bc4ce19..bceb960dbada 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -162,6 +162,12 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
>  # define IF_HAVE_UFFD_MINOR(flag, name)
>  #endif
>  
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +# define IF_HAVE_HUGETLB_HGM(flag, name) {flag, name},
> +#else
> +# define IF_HAVE_HUGETLB_HGM(flag, name)
> +#endif
> +
>  #define __def_vmaflag_names						\
>  	{VM_READ,			"read"		},		\
>  	{VM_WRITE,			"write"		},		\
> @@ -186,6 +192,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)		\
>  	{VM_ACCOUNT,			"account"	},		\
>  	{VM_NORESERVE,			"noreserve"	},		\
>  	{VM_HUGETLB,			"hugetlb"	},		\
> +IF_HAVE_HUGETLB_HGM(VM_HUGETLB_HGM,	"hugetlb_hgm"	)		\
>  	{VM_SYNC,			"sync"		},		\
>  	__VM_ARCH_SPECIFIC_1				,		\
>  	{VM_WIPEONFORK,			"wipeonfork"	},		\
> -- 
> 2.39.2.637.g21b0678d19-goog
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers
  2023-02-18  0:27 ` [PATCH v2 08/46] hugetlb: add HugeTLB HGM enablement helpers James Houghton
  2023-02-18  1:40   ` Mina Almasry
@ 2023-02-24 23:08   ` Mike Kravetz
  1 sibling, 0 replies; 96+ messages in thread
From: Mike Kravetz @ 2023-02-24 23:08 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> hugetlb_hgm_eligible indicates that a VMA is eligible to have HGM
> explicitly enabled via MADV_SPLIT, and hugetlb_hgm_enabled indicates
> that HGM has been enabled.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 7c977d234aba..efd2635a87f5 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1211,6 +1211,20 @@ static inline void hugetlb_unregister_node(struct node *node)
>  }
>  #endif	/* CONFIG_HUGETLB_PAGE */
>  
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> +#else
> +static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +#endif
> +
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
>  					struct mm_struct *mm, pte_t *pte)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6c008c9de80e..0576dcc98044 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7004,6 +7004,10 @@ static bool pmd_sharing_possible(struct vm_area_struct *vma)
>  #ifdef CONFIG_USERFAULTFD
>  	if (uffd_disable_huge_pmd_share(vma))
>  		return false;
> +#endif
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	if (hugetlb_hgm_enabled(vma))
> +		return false;
>  #endif
>  	/*
>  	 * Only shared VMAs can share PMDs.
> @@ -7267,6 +7271,18 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
>  
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>  
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
> +{
> +	/* All shared VMAs may have HGM. */
> +	return vma && (vma->vm_flags & VM_MAYSHARE);

I think the only user is madvise_split().  We should probably check here,
or more likely at the beginning of madvise_split(), whether VM_HUGETLB_HGM
is already set.  No sense in invoking the overhead of
hugetlb_unshare_all_pmds() if it is not needed.
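
Something like this at the top of madvise_split() is what I have in mind
(just a sketch, with the CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING ifdefs
left out):

static int madvise_split(struct vm_area_struct *vma,
			 unsigned long *new_flags)
{
	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
		return -EINVAL;

	/* HGM already enabled: nothing to unshare, nothing to do. */
	if (hugetlb_hgm_enabled(vma))
		return 0;

	hugetlb_unshare_all_pmds(vma);

	*new_flags |= VM_HUGETLB_HGM;
	return 0;
}
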
-- 
Mike Kravetz

> +}
> +bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> +{
> +	return vma && (vma->vm_flags & VM_HUGETLB_HGM);
> +}
> +#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> +
>  /*
>   * These functions are overwritable if your architecture needs its own
>   * behavior.
> -- 
> 2.39.2.637.g21b0678d19-goog
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM
  2023-02-18  0:27 ` [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM James Houghton
  2023-02-18  1:58   ` Mina Almasry
@ 2023-02-24 23:25   ` Mike Kravetz
  2023-02-27 15:14     ` James Houghton
  1 sibling, 1 reply; 96+ messages in thread
From: Mike Kravetz @ 2023-02-24 23:25 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable
> HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> applied to non-HugeTLB memory in the future, if such an application is
> to arise.
> 
> MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> address ranges:
> 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
>    alignment.
> 2. read()ing a page fault event from a userfaultfd will yield a
>    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
>    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> 
> There is no way to disable the API changes that come with issuing
> MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> table mappings that come from the extended functionality that comes with
> using MADV_SPLIT.
> 
> For post-copy live migration, the expected use-case is:
> 1. mmap(MAP_SHARED, some_fd) primary mapping
> 2. mmap(MAP_SHARED, some_fd) alias mapping
> 3. MADV_SPLIT the primary mapping
> 4. UFFDIO_REGISTER/etc. the primary mapping
> 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
>    corresponding PAGE_SIZE sections in the primary mapping.
> 
> More API changes may be added in the future.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..7a26f3648b90 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -78,6 +78,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..f8a74a3a0928 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -105,6 +105,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..a6dc6a56c941 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -72,6 +72,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	74		/* Enable hugepage high-granularity APIs */
> +
>  #define MADV_HWPOISON     100		/* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
>  
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..f98a77c430a9 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -113,6 +113,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..996e8ded092f 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -79,6 +79,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c2202f51e9dd..8c004c678262 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
>  	return error;
>  }
>  
> +static int madvise_split(struct vm_area_struct *vma,
> +			 unsigned long *new_flags)
> +{
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
> +		return -EINVAL;
> +
> +	/*
> +	 * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
> +	 * of a VMA, then we will split the VMA. Here, we're unsharing before
> +	 * splitting because it's simpler, although we may be unsharing more
> +	 * than we need.
> +	 */
> +	hugetlb_unshare_all_pmds(vma);

I think we should just unshare the (appropriately aligned) range within the
vma that is the target of MADV_SPLIT.  No need to unshare the entire vma.
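
A sketch of that alternative, assuming start/end are plumbed into
madvise_split() and the ranged hugetlb_unshare_pmds() helper (currently static
in mm/hugetlb.c) is made callable from here:

	/*
	 * Hypothetical: unshare only the PUD_SIZE-aligned region covering
	 * [start, end) rather than every shared PMD in the VMA.
	 */
	hugetlb_unshare_pmds(vma, ALIGN_DOWN(start, PUD_SIZE),
			     ALIGN(end, PUD_SIZE));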

> +
> +	*new_flags |= VM_HUGETLB_HGM;
> +	return 0;
> +#else
> +	return -EINVAL;
> +#endif
> +}
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		break;
>  	case MADV_COLLAPSE:
>  		return madvise_collapse(vma, prev, start, end);
> +	case MADV_SPLIT:
> +		error = madvise_split(vma, &new_flags);
> +		if (error)
> +			goto out;

Not a huge deal, but if one passes an invalid range (such as not huge page
size aligned) to MADV_SPLIT, then we will not notice the error until
later in madvise_update_vma() when the vma split fails.  By then, we will
have unshared all pmds in the entire vma (or just the range if you agree
with my suggestion above).


-- 
Mike Kravetz

> +		break;
>  	}
>  
>  	anon_name = anon_vma_name(vma);
> @@ -1178,6 +1205,9 @@ madvise_behavior_valid(int behavior)
>  	case MADV_HUGEPAGE:
>  	case MADV_NOHUGEPAGE:
>  	case MADV_COLLAPSE:
> +#endif
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	case MADV_SPLIT:
>  #endif
>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
> @@ -1368,6 +1398,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *		transparent huge pages so the existing pages will not be
>   *		coalesced into THP and new pages will not be allocated as THP.
>   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> + *  MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows
> + *		UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> -- 
> 2.39.2.637.g21b0678d19-goog
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries
  2023-02-18  0:27 ` [PATCH v2 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
  2023-02-18  5:24   ` Mina Almasry
@ 2023-02-25  0:09   ` Mike Kravetz
  1 sibling, 0 replies; 96+ messages in thread
From: Mike Kravetz @ 2023-02-25  0:09 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.
> 
> Without this, we'd have to pass around the "size" of the PTE everywhere.
> We effectively did this before; it could be fetched from the hstate,
> which we pass around pretty much everywhere.

Agreed.  I can not think of a better way to handle the possibility of
having hugetlb page table entries at any level.  The somewhat unfortunate
part of this is that code outside hugetlbfs proper needs to know about this.
However, there is already 'special handling' with hugetlb assumptions
in those places today.
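
As a concrete (hypothetical) illustration of how a caller would consume the
struct using only the helpers from this patch, once some walk has filled in
hpte for a placeholder address `addr`:

	struct hugetlb_pte hpte;
	spinlock_t *ptl;
	pte_t entry;

	/* ... hpte filled in by a page table walk (added in later patches) ... */

	ptl = hugetlb_pte_lock(&hpte);
	entry = huge_ptep_get(hpte.ptep);
	if (hugetlb_pte_present_leaf(&hpte, entry)) {
		unsigned long sz = hugetlb_pte_size(&hpte);	/* 4K, 2M, 1G, ... */
		unsigned long base = addr & hugetlb_pte_mask(&hpte);

		/* operate on the mapping covering [base, base + sz) */
	}
	spin_unlock(ptl);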

> hugetlb_pte_present_leaf is included here as a helper function that will
> be used frequently later on.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a1ceb9417f01..eeacadf3272b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> @@ -26,6 +26,25 @@ typedef struct { unsigned long pd; } hugepd_t;
>  #define __hugepd(x) ((hugepd_t) { (x) })
>  #endif
>  
> +enum hugetlb_level {
> +	HUGETLB_LEVEL_PTE = 1,
> +	/*
> +	 * We always include PMD, PUD, and P4D in this enum definition so that,
> +	 * when logged as an integer, we can easily tell which level it is.
> +	 */
> +	HUGETLB_LEVEL_PMD,
> +	HUGETLB_LEVEL_PUD,
> +	HUGETLB_LEVEL_P4D,
> +	HUGETLB_LEVEL_PGD,
> +};
> +
> +struct hugetlb_pte {
> +	pte_t *ptep;
> +	unsigned int shift;
> +	enum hugetlb_level level;
> +	spinlock_t *ptl;
> +};
> +
>  #ifdef CONFIG_HUGETLB_PAGE
>  
>  #include <linux/mempolicy.h>
> @@ -39,6 +58,20 @@ typedef struct { unsigned long pd; } hugepd_t;
>   */
>  #define __NR_USED_SUBPAGE 3
>  
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +	return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +	return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> +
>  struct hugepage_subpool {
>  	spinlock_t lock;
>  	long count;
> @@ -1234,6 +1267,45 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
>  	return ptl;
>  }
>  
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
> +{
> +	return hpte->ptl;
> +}
> +
> +static inline
> +spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
> +
> +	spin_lock(ptl);
> +	return ptl;
> +}
> +
> +static inline
> +void __hugetlb_pte_init(struct hugetlb_pte *hpte, pte_t *ptep,
> +			unsigned int shift, enum hugetlb_level level,
> +			spinlock_t *ptl)
> +{
> +	/*
> +	 * If 'shift' indicates that this PTE is contiguous, then @ptep must
> +	 * be the first pte of the contiguous bunch.
> +	 */
> +	hpte->ptl = ptl;
> +	hpte->ptep = ptep;
> +	hpte->shift = shift;
> +	hpte->level = level;
> +}
> +
> +static inline
> +void hugetlb_pte_init(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		      pte_t *ptep, unsigned int shift,
> +		      enum hugetlb_level level)
> +{
> +	__hugetlb_pte_init(hpte, ptep, shift, level,
> +			   huge_pte_lockptr(shift, mm, ptep));
> +}
> +
>  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
>  extern void __init hugetlb_cma_reserve(int order);
>  #else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5ca9eae0ac42..6c74adff43b6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1269,6 +1269,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>  	return false;
>  }
>  
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
> +{
> +	pgd_t pgd;
> +	p4d_t p4d;
> +	pud_t pud;
> +	pmd_t pmd;
> +
> +	switch (hpte->level) {
> +	case HUGETLB_LEVEL_PGD:
> +		pgd = __pgd(pte_val(pte));
> +		return pgd_present(pgd) && pgd_leaf(pgd);
> +	case HUGETLB_LEVEL_P4D:
> +		p4d = __p4d(pte_val(pte));
> +		return p4d_present(p4d) && p4d_leaf(p4d);
> +	case HUGETLB_LEVEL_PUD:
> +		pud = __pud(pte_val(pte));
> +		return pud_present(pud) && pud_leaf(pud);
> +	case HUGETLB_LEVEL_PMD:
> +		pmd = __pmd(pte_val(pte));
> +		return pmd_present(pmd) && pmd_leaf(pmd);
> +	case HUGETLB_LEVEL_PTE:
> +		return pte_present(pte);
> +	default:
> +		WARN_ON_ONCE(1);
> +		return false;
> +	}
> +}
> +
> +
>  static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
>  {
>  	int nid = folio_nid(folio);
> -- 
> 2.39.2.637.g21b0678d19-goog
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM
  2023-02-24 23:25   ` Mike Kravetz
@ 2023-02-27 15:14     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-27 15:14 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Fri, Feb 24, 2023 at 3:25 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 02/18/23 00:27, James Houghton wrote:
> > Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable
> > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> > applied to non-HugeTLB memory in the future, if such an application is
> > to arise.
> >
> > MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> > address ranges:
> > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> >    alignment.
> > 2. read()ing a page fault event from a userfaultfd will yield a
> >    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> >    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> >
> > There is no way to disable the API changes that come with issuing
> > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> > table mappings that come from the extended functionality that comes with
> > using MADV_SPLIT.
> >
> > For post-copy live migration, the expected use-case is:
> > 1. mmap(MAP_SHARED, some_fd) primary mapping
> > 2. mmap(MAP_SHARED, some_fd) alias mapping
> > 3. MADV_SPLIT the primary mapping
> > 4. UFFDIO_REGISTER/etc. the primary mapping
> > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> >    corresponding PAGE_SIZE sections in the primary mapping.
> >
> > More API changes may be added in the future.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 763929e814e9..7a26f3648b90 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -78,6 +78,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   26              /* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE     0
> >
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index c6e1fc77c996..f8a74a3a0928 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -105,6 +105,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   26              /* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE     0
> >
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index 68c44f99bc93..a6dc6a56c941 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -72,6 +72,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   74              /* Enable hugepage high-granularity APIs */
> > +
> >  #define MADV_HWPOISON     100                /* poison a page for testing */
> >  #define MADV_SOFT_OFFLINE 101                /* soft offline page for testing */
> >
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 1ff0c858544f..f98a77c430a9 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -113,6 +113,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   26              /* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE     0
> >
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..996e8ded092f 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -79,6 +79,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   26              /* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE     0
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c2202f51e9dd..8c004c678262 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
> >       return error;
> >  }
> >
> > +static int madvise_split(struct vm_area_struct *vma,
> > +                      unsigned long *new_flags)
> > +{
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +     if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
> > +             return -EINVAL;
> > +
> > +     /*
> > +      * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
> > +      * of a VMA, then we will split the VMA. Here, we're unsharing before
> > +      * splitting because it's simpler, although we may be unsharing more
> > +      * than we need.
> > +      */
> > +     hugetlb_unshare_all_pmds(vma);
>
> I think we should just unshare the (appropriately aligned) range within the
> vma that is the target of MADV_SPLIT.  No need to unshare the entire vma.

Right I can do that, and I can check for appropriate alignment here
(else fail with -EINVAL).
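
For instance, something like this (hypothetical, and assuming start/end are
passed down to madvise_split(), which in this version only takes vma and
new_flags):

	struct hstate *h = hstate_vma(vma);

	/* Fail early instead of unsharing and only then failing the VMA split. */
	if (!IS_ALIGNED(start, huge_page_size(h)) ||
	    !IS_ALIGNED(end, huge_page_size(h)))
		return -EINVAL;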

>
> > +
> > +     *new_flags |= VM_HUGETLB_HGM;
> > +     return 0;
> > +#else
> > +     return -EINVAL;
> > +#endif
> > +}
> > +
> >  /*
> >   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
> >   * will handle splitting a vm area into separate areas, each area with its own
> > @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >               break;
> >       case MADV_COLLAPSE:
> >               return madvise_collapse(vma, prev, start, end);
> > +     case MADV_SPLIT:
> > +             error = madvise_split(vma, &new_flags);
> > +             if (error)
> > +                     goto out;
>
> Not a huge deal, but if one passes an invalid range (such as not huge page
> size aligned) to MADV_SPLIT, then we will not notice the error until
> later in madvise_update_vma() when the vma split fails.  By then, we will
> have unshared all pmds in the entire vma (or just the range if you agree
> with my suggestion above).

Good point. I'll fix this for v3. :) Thanks Mike.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte
  2023-02-18  0:27 ` [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte James Houghton
  2023-02-18 17:46   ` kernel test robot
@ 2023-02-27 19:16   ` Mike Kravetz
  2023-02-27 19:31     ` James Houghton
  1 sibling, 1 reply; 96+ messages in thread
From: Mike Kravetz @ 2023-02-27 19:16 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> These functions are used to allocate new PTEs below the hstate PTE. This
> will be used by hugetlb_walk_step, which implements stepping forwards in
> a HugeTLB high-granularity page table walk.
> 
> The reasons that we don't use the standard pmd_alloc/pte_alloc*
> functions are:
>  1) This prevents us from accidentally overwriting swap entries or
>     attempting to use swap entries as present non-leaf PTEs (see
>     pmd_alloc(); we assume that !pte_none means pte_present and
>     non-leaf).
>  2) Locking hugetlb PTEs can be different than regular PTEs. (Although, as
>     implemented right now, locking is the same.)
>  3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
>     HGM won't use HIGHPTE, but the kernel can still be built with it,
>     and other mm code will use it.
> 
> When GENERAL_HUGETLB supports P4D-based hugepages, we will need to
> implement hugetlb_pud_alloc to implement hugetlb_walk_step.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index eeacadf3272b..9d839519c875 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -72,6 +72,11 @@ unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
>  
>  bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
>  
> +pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr);
> +pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr);
> +
>  struct hugepage_subpool {
>  	spinlock_t lock;
>  	long count;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6c74adff43b6..bb424cdf79e4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -483,6 +483,120 @@ static bool has_same_uncharge_info(struct file_region *rg,
>  #endif
>  }
>  
> +/*
> + * hugetlb_alloc_pmd -- Allocate or find a PMD beneath a PUD-level hpte.
> + *
> + * This is meant to be used to implement hugetlb_walk_step when one must go to
> + * step down to a PMD. Different architectures may implement hugetlb_walk_step
> + * differently, but hugetlb_alloc_pmd and hugetlb_alloc_pte are architecture-
> + * independent.
> + *
> + * Returns:
> + *	On success: the pointer to the PMD. This should be placed into a
> + *		    hugetlb_pte. @hpte is not changed.
> + *	ERR_PTR(-EINVAL): hpte is not PUD-level
> + *	ERR_PTR(-EEXIST): there is a non-leaf and non-empty PUD in @hpte

I often get this confused, should this really be 'non-leaf'?  Because, ...

> + *	ERR_PTR(-ENOMEM): could not allocate the new PMD
> + */
> +pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr)
> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
> +	pmd_t *new;
> +	pud_t *pudp;
> +	pud_t pud;
> +
> +	if (hpte->level != HUGETLB_LEVEL_PUD)
> +		return ERR_PTR(-EINVAL);
> +
> +	pudp = (pud_t *)hpte->ptep;
> +retry:
> +	pud = READ_ONCE(*pudp);
> +	if (likely(pud_present(pud)))
> +		return unlikely(pud_leaf(pud))
> +			? ERR_PTR(-EEXIST)
> +			: pmd_offset(pudp, addr);

... it seems we return -EEXIST in the pud_leaf case.
-- 
Mike Kravetz

> +	else if (!pud_none(pud))
> +		/*
> +		 * Not present and not none means that a swap entry lives here,
> +		 * and we can't get rid of it.
> +		 */
> +		return ERR_PTR(-EEXIST);
> +
> +	new = pmd_alloc_one(mm, addr);
> +	if (!new)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock(ptl);
> +	if (!pud_same(pud, *pudp)) {
> +		spin_unlock(ptl);
> +		pmd_free(mm, new);
> +		goto retry;
> +	}
> +
> +	mm_inc_nr_pmds(mm);
> +	smp_wmb(); /* See comment in pmd_install() */
> +	pud_populate(mm, pudp, new);
> +	spin_unlock(ptl);
> +	return pmd_offset(pudp, addr);
> +}
> +
> +/*
> + * hugetlb_alloc_pte -- Allocate a PTE beneath a pmd_none PMD-level hpte.
> + *
> + * See the comment above hugetlb_alloc_pmd.
> + */
> +pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr)
> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
> +	pgtable_t new;
> +	pmd_t *pmdp;
> +	pmd_t pmd;
> +
> +	if (hpte->level != HUGETLB_LEVEL_PMD)
> +		return ERR_PTR(-EINVAL);
> +
> +	pmdp = (pmd_t *)hpte->ptep;
> +retry:
> +	pmd = READ_ONCE(*pmdp);
> +	if (likely(pmd_present(pmd)))
> +		return unlikely(pmd_leaf(pmd))
> +			? ERR_PTR(-EEXIST)
> +			: pte_offset_kernel(pmdp, addr);
> +	else if (!pmd_none(pmd))
> +		/*
> +		 * Not present and not none means that a swap entry lives here,
> +		 * and we can't get rid of it.
> +		 */
> +		return ERR_PTR(-EEXIST);
> +
> +	/*
> +	 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
> +	 * in page tables being allocated in high memory, needing a kmap to
> +	 * access. Instead, we call __pte_alloc_one directly with
> +	 * GFP_PGTABLE_USER to prevent these PTEs being allocated in high
> +	 * memory.
> +	 */
> +	new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
> +	if (!new)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock(ptl);
> +	if (!pmd_same(pmd, *pmdp)) {
> +		spin_unlock(ptl);
> +		pgtable_pte_page_dtor(new);
> +		__free_page(new);
> +		goto retry;
> +	}
> +
> +	mm_inc_nr_ptes(mm);
> +	smp_wmb(); /* See comment in pmd_install() */
> +	pmd_populate(mm, pmdp, new);
> +	spin_unlock(ptl);
> +	return pte_offset_kernel(pmdp, addr);
> +}
> +
>  static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
>  {
>  	struct file_region *nrg, *prg;
> -- 
> 2.39.2.637.g21b0678d19-goog
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte
  2023-02-27 19:16   ` Mike Kravetz
@ 2023-02-27 19:31     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-27 19:31 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

"

On Mon, Feb 27, 2023 at 11:17 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 02/18/23 00:27, James Houghton wrote:
> > These functions are used to allocate new PTEs below the hstate PTE. This
> > will be used by hugetlb_walk_step, which implements stepping forwards in
> > a HugeTLB high-granularity page table walk.
> >
> > The reasons that we don't use the standard pmd_alloc/pte_alloc*
> > functions are:
> >  1) This prevents us from accidentally overwriting swap entries or
> >     attempting to use swap entries as present non-leaf PTEs (see
> >     pmd_alloc(); we assume that !pte_none means pte_present and
> >     non-leaf).
> >  2) Locking hugetlb PTEs can be different than regular PTEs. (Although, as
> >     implemented right now, locking is the same.)
> >  3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
> >     HGM won't use HIGHPTE, but the kernel can still be built with it,
> >     and other mm code will use it.
> >
> > When GENERAL_HUGETLB supports P4D-based hugepages, we will need to
> > implement hugetlb_pud_alloc to implement hugetlb_walk_step.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index eeacadf3272b..9d839519c875 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -72,6 +72,11 @@ unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> >
> >  bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> >
> > +pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +             unsigned long addr);
> > +pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +             unsigned long addr);
> > +
> >  struct hugepage_subpool {
> >       spinlock_t lock;
> >       long count;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6c74adff43b6..bb424cdf79e4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -483,6 +483,120 @@ static bool has_same_uncharge_info(struct file_region *rg,
> >  #endif
> >  }
> >
> > +/*
> > + * hugetlb_alloc_pmd -- Allocate or find a PMD beneath a PUD-level hpte.
> > + *
> > + * This is meant to be used to implement hugetlb_walk_step when one must
> > + * step down to a PMD. Different architectures may implement hugetlb_walk_step
> > + * differently, but hugetlb_alloc_pmd and hugetlb_alloc_pte are architecture-
> > + * independent.
> > + *
> > + * Returns:
> > + *   On success: the pointer to the PMD. This should be placed into a
> > + *               hugetlb_pte. @hpte is not changed.
> > + *   ERR_PTR(-EINVAL): hpte is not PUD-level
> > + *   ERR_PTR(-EEXIST): there is a non-leaf and non-empty PUD in @hpte
>
> I often get this confused, should this really be 'non-leaf'?  Because, ...

This comment is wrong, it should be "non-empty PUD that is not
pointing to page tables". Maybe it would be better to say "-EEXIST
unless @hpte is pud_none() or already points to page tables".

In this commit, PTEs containing PTE markers are treated as non-empty
here (and not pointing to page tables), but after the commit "hugetlb:
split PTE markers when doing HGM walks", they are treated as empty.
I'll update the comment in that commit as well.
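
For example, the Returns block in hugetlb_alloc_pmd() could read roughly as
follows (wording only a suggestion):

 * Returns:
 *	On success: the pointer to the PMD. This should be placed into a
 *		    hugetlb_pte. @hpte is not changed.
 *	ERR_PTR(-EINVAL): hpte is not PUD-level
 *	ERR_PTR(-EEXIST): @hpte is neither pud_none() nor already pointing to
 *			  page tables, i.e. it is a present leaf or (in this
 *			  commit) a swap entry such as a PTE marker
 *	ERR_PTR(-ENOMEM): could not allocate the new PMD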

>
> > + *   ERR_PTR(-ENOMEM): could not allocate the new PMD
> > + */
> > +pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +             unsigned long addr)
> > +{
> > +     spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
> > +     pmd_t *new;
> > +     pud_t *pudp;
> > +     pud_t pud;
> > +
> > +     if (hpte->level != HUGETLB_LEVEL_PUD)
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     pudp = (pud_t *)hpte->ptep;
> > +retry:
> > +     pud = READ_ONCE(*pudp);
> > +     if (likely(pud_present(pud)))
> > +             return unlikely(pud_leaf(pud))
> > +                     ? ERR_PTR(-EEXIST)
> > +                     : pmd_offset(pudp, addr);
>
> ... it seems we return -EEXIST in the pud_leaf case.

This code is correct. :) We don't want to overwrite a leaf. Sorry for
the confusion!

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2023-02-18  0:27 ` [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
  2023-02-18  7:43   ` kernel test robot
  2023-02-18 18:07   ` kernel test robot
@ 2023-02-28 22:14   ` Mike Kravetz
  2023-02-28 23:03     ` James Houghton
  2 siblings, 1 reply; 96+ messages in thread
From: Mike Kravetz @ 2023-02-28 22:14 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> hugetlb_hgm_walk implements high-granularity page table walks for
> HugeTLB. It is safe to call on non-HGM enabled VMAs; it will return
> immediately.
> 
> hugetlb_walk_step implements how we step forwards in the walk. For
> architectures that don't use GENERAL_HUGETLB, they will need to provide
> their own implementation.
> 
> The broader API that should be used is
> hugetlb_full_walk[,alloc|,continue].

I guess 'full' in the name implies walking to the PTE (PAGE_SIZE) level.
It could just be me and my over-familiarity with the existing hugetlb
walking code, but that was not obvious.

Again, perhaps it is just how familiar I am with the existing code, but
I found the routines difficult to follow.  Nothing looks obviously wrong.

Just a couple comments/questions below.

> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 9d839519c875..726d581158b1 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -223,6 +223,14 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
>  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long addr, pud_t *pud);
>  
> +int hugetlb_full_walk(struct hugetlb_pte *hpte, struct vm_area_struct *vma,
> +		      unsigned long addr);
> +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
> +				struct vm_area_struct *vma, unsigned long addr);
> +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
> +			    struct vm_area_struct *vma, unsigned long addr,
> +			    unsigned long target_sz);
> +
>  struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
>  
>  extern int sysctl_hugetlb_shm_group;
> @@ -272,6 +280,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz);
>  unsigned long hugetlb_mask_last_page(struct hstate *h);
> +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		      unsigned long addr, unsigned long sz);
>  int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> @@ -1054,6 +1064,8 @@ void hugetlb_register_node(struct node *node);
>  void hugetlb_unregister_node(struct node *node);
>  #endif
>  
> +enum hugetlb_level hpage_size_to_level(unsigned long sz);
> +
>  #else	/* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>  
> @@ -1246,6 +1258,11 @@ static inline void hugetlb_register_node(struct node *node)
>  static inline void hugetlb_unregister_node(struct node *node)
>  {
>  }
> +
> +static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
> +{
> +	return HUGETLB_LEVEL_PTE;
> +}
>  #endif	/* CONFIG_HUGETLB_PAGE */
>  
>  #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bb424cdf79e4..810c05feb41f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -97,6 +97,29 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
>  static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  		unsigned long start, unsigned long end);
>  
> +/*
> + * hpage_size_to_level() - convert @sz to the corresponding page table level
> + *
> + * @sz must be less than or equal to a valid hugepage size.
> + */
> +enum hugetlb_level hpage_size_to_level(unsigned long sz)
> +{
> +	/*
> +	 * We order the conditionals from smallest to largest to pick the
> +	 * smallest level when multiple levels have the same size (i.e.,
> +	 * when levels are folded).
> +	 */
> +	if (sz < PMD_SIZE)
> +		return HUGETLB_LEVEL_PTE;
> +	if (sz < PUD_SIZE)
> +		return HUGETLB_LEVEL_PMD;
> +	if (sz < P4D_SIZE)
> +		return HUGETLB_LEVEL_PUD;
> +	if (sz < PGDIR_SIZE)
> +		return HUGETLB_LEVEL_P4D;
> +	return HUGETLB_LEVEL_PGD;
> +}
> +
>  static inline bool subpool_is_free(struct hugepage_subpool *spool)
>  {
>  	if (spool->count)
> @@ -7315,6 +7338,154 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
>  }
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
> +/* __hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
> + * the page table entry for @addr. We might allocate new PTEs.
> + *
> + * @hpte must always be pointing at an hstate-level PTE or deeper.
> + *
> + * This function will never walk further if it encounters a PTE of a size
> + * less than or equal to @sz.
> + *
> + * @alloc determines what we do when we encounter an empty PTE. If false,
> + * we stop walking. If true and @sz is less than the current PTE's size,
> + * we make that PTE point to the next level down, going until @sz is the same
> + * as our current PTE.
> + *
> + * If @alloc is false and @sz is PAGE_SIZE, this function will always
> + * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz.
> + *
> + * Return:
> + *	-ENOMEM if we couldn't allocate new PTEs.
> + *	-EEXIST if the caller wanted to walk further than a migration PTE,
> + *		poison PTE, or a PTE marker. The caller needs to manually deal
> + *		with this scenario.
> + *	-EINVAL if called with invalid arguments (@sz invalid, @hpte not
> + *		initialized).
> + *	0 otherwise.
> + *
> + *	Even if this function fails, @hpte is guaranteed to always remain
> + *	valid.
> + */
> +static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
> +			      struct hugetlb_pte *hpte, unsigned long addr,
> +			      unsigned long sz, bool alloc)
> +{
> +	int ret = 0;
> +	pte_t pte;
> +
> +	if (WARN_ON_ONCE(sz < PAGE_SIZE))
> +		return -EINVAL;
> +
> +	if (WARN_ON_ONCE(!hpte->ptep))
> +		return -EINVAL;
> +
> +	while (hugetlb_pte_size(hpte) > sz && !ret) {
> +		pte = huge_ptep_get(hpte->ptep);
> +		if (!pte_present(pte)) {
> +			if (!alloc)
> +				return 0;
> +			if (unlikely(!huge_pte_none(pte)))
> +				return -EEXIST;
> +		} else if (hugetlb_pte_present_leaf(hpte, pte))
> +			return 0;
> +		ret = hugetlb_walk_step(mm, hpte, addr, sz);
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * hugetlb_hgm_walk - Has the same behavior as __hugetlb_hgm_walk but will
> + * initialize @hpte with hstate-level PTE pointer @ptep.
> + */
> +static int hugetlb_hgm_walk(struct hugetlb_pte *hpte,
> +			    pte_t *ptep,
> +			    struct vm_area_struct *vma,
> +			    unsigned long addr,
> +			    unsigned long target_sz,
> +			    bool alloc)
> +{
> +	struct hstate *h = hstate_vma(vma);
> +
> +	hugetlb_pte_init(vma->vm_mm, hpte, ptep, huge_page_shift(h),
> +			 hpage_size_to_level(huge_page_size(h)));
> +	return __hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, target_sz,
> +				  alloc);
> +}
> +
> +/*
> + * hugetlb_full_walk_continue - continue a high-granularity page-table walk.
> + *
> + * If a user has a valid @hpte but knows that @hpte is not a leaf, they can
> + * attempt to continue walking by calling this function.
> + *
> + * This function will never fail, but @hpte might not change.
> + *
> + * If @hpte hasn't been initialized, then this function's behavior is
> + * undefined.
> + */
> +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
> +				struct vm_area_struct *vma,
> +				unsigned long addr)
> +{
> +	/* __hugetlb_hgm_walk will never fail with these arguments. */
> +	WARN_ON_ONCE(__hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr,
> +					PAGE_SIZE, false));
> +}
> +
> +/*
> + * hugetlb_full_walk - do a high-granularity page-table walk; never allocate.
> + *
> + * This function can only fail if we find that the hstate-level PTE is not
> + * allocated. Callers can take advantage of this fact to skip address regions
> + * that cannot be mapped in that case.
> + *
> + * If this function succeeds, @hpte is guaranteed to be valid.
> + */
> +int hugetlb_full_walk(struct hugetlb_pte *hpte,
> +		      struct vm_area_struct *vma,
> +		      unsigned long addr)
> +{
> +	struct hstate *h = hstate_vma(vma);
> +	unsigned long sz = huge_page_size(h);
> +	/*
> +	 * We must mask the address appropriately so that we pick up the first
> +	 * PTE in a contiguous group.
> +	 */
> +	pte_t *ptep = hugetlb_walk(vma, addr & huge_page_mask(h), sz);
> +
> +	if (!ptep)
> +		return -ENOMEM;

-ENOMEM does not seem appropriate, but I can not think of something
better.  -ENOENT perhaps?

> +
> +	/* hugetlb_hgm_walk will never fail with these arguments. */
> +	WARN_ON_ONCE(hugetlb_hgm_walk(hpte, ptep, vma, addr, PAGE_SIZE, false));
> +	return 0;
> +}
> +
> +/*
> + * hugetlb_full_walk_alloc - do a high-granularity walk, potentially allocate
> + *	new PTEs.
> + */
> +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
> +				   struct vm_area_struct *vma,
> +				   unsigned long addr,
> +				   unsigned long target_sz)
> +{
> +	struct hstate *h = hstate_vma(vma);
> +	unsigned long sz = huge_page_size(h);
> +	/*
> +	 * We must mask the address appropriately so that we pick up the first
> +	 * PTE in a contiguous group.
> +	 */
> +	pte_t *ptep = huge_pte_alloc(vma->vm_mm, vma, addr & huge_page_mask(h),
> +				     sz);
> +
> +	if (!ptep)
> +		return -ENOMEM;
> +
> +	return hugetlb_hgm_walk(hpte, ptep, vma, addr, target_sz, true);
> +}
> +
>  #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
>  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long addr, unsigned long sz)
> @@ -7382,6 +7553,48 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  	return (pte_t *)pmd;
>  }
>  
> +/*
> + * hugetlb_walk_step() - Walk the page table one step to resolve the page
> + * (hugepage or subpage) entry at address @addr.
> + *
> + * @sz always points at the final target PTE size (e.g. PAGE_SIZE for the
> + * lowest level PTE).
> + *
> + * @hpte will always remain valid, even if this function fails.
> + *
> + * Architectures that implement this function must ensure that if @hpte does
> + * not change levels, then its PTL must also stay the same.
> + */
> +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		      unsigned long addr, unsigned long sz)
> +{
> +	pte_t *ptep;
> +	spinlock_t *ptl;
> +
> +	switch (hpte->level) {
> +	case HUGETLB_LEVEL_PUD:
> +		ptep = (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr);
> +		if (IS_ERR(ptep))
> +			return PTR_ERR(ptep);
> +		hugetlb_pte_init(mm, hpte, ptep, PMD_SHIFT,
> +				 HUGETLB_LEVEL_PMD);
> +		break;
> +	case HUGETLB_LEVEL_PMD:
> +		ptep = hugetlb_alloc_pte(mm, hpte, addr);
> +		if (IS_ERR(ptep))
> +			return PTR_ERR(ptep);
> +		ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);

Is that right?  hpte->ptep is the PMD level entry.  It seems
pte_lockptr() -> ptlock_ptr(pmd_page(*pmd)) -> return page->ptl
But, I would think we want the mm->page_table_lock for the newly
allocated PTE.

-- 
Mike Kravetz

> +		__hugetlb_pte_init(hpte, ptep, PAGE_SHIFT,
> +				   HUGETLB_LEVEL_PTE, ptl);
> +		break;
> +	default:
> +		WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n",
> +				__func__, hpte->level, hpte->shift);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * Return a mask that can be used to update an address to the last huge
>   * page in a page table page mapping size.  Used to skip non-present
> -- 
> 2.39.2.637.g21b0678d19-goog
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks
  2023-02-18  0:27 ` [PATCH v2 14/46] hugetlb: split PTE markers when doing HGM walks James Houghton
  2023-02-18 19:49   ` kernel test robot
@ 2023-02-28 22:48   ` Mike Kravetz
  1 sibling, 0 replies; 96+ messages in thread
From: Mike Kravetz @ 2023-02-28 22:48 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> Fix how UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT interact in these two
> ways:
>  - UFFDIO_WRITEPROTECT no longer prevents a high-granularity
>    UFFDIO_CONTINUE.
>  - UFFD-WP PTE markers installed with UFFDIO_WRITEPROTECT will be
>    properly propagated when high-granularily UFFDIO_CONTINUEs are
>    performed.
> 
> Note: UFFDIO_WRITEPROTECT is not yet permitted at PAGE_SIZE granularity.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 810c05feb41f..f74183acc521 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c

Seems relatively straightforward,

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> @@ -506,6 +506,30 @@ static bool has_same_uncharge_info(struct file_region *rg,
>  #endif
>  }
>  
> +static void hugetlb_install_markers_pmd(pmd_t *pmdp, pte_marker marker)
> +{
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; ++i)
> +		/*
> +		 * WRITE_ONCE not needed because the pud hasn't been
> +		 * installed yet.
> +		 */
> +		pmdp[i] = __pmd(pte_val(make_pte_marker(marker)));
> +}
> +
> +static void hugetlb_install_markers_pte(pte_t *ptep, pte_marker marker)
> +{
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; ++i)
> +		/*
> +		 * WRITE_ONCE not needed because the pmd hasn't been
> +		 * installed yet.
> +		 */
> +		ptep[i] = make_pte_marker(marker);
> +}
> +
>  /*
>   * hugetlb_alloc_pmd -- Allocate or find a PMD beneath a PUD-level hpte.
>   *
> @@ -528,23 +552,32 @@ pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
>  	pmd_t *new;
>  	pud_t *pudp;
>  	pud_t pud;
> +	bool is_marker;
> +	pte_marker marker;
>  
>  	if (hpte->level != HUGETLB_LEVEL_PUD)
>  		return ERR_PTR(-EINVAL);
>  
>  	pudp = (pud_t *)hpte->ptep;
>  retry:
> +	is_marker = false;
>  	pud = READ_ONCE(*pudp);
>  	if (likely(pud_present(pud)))
>  		return unlikely(pud_leaf(pud))
>  			? ERR_PTR(-EEXIST)
>  			: pmd_offset(pudp, addr);
> -	else if (!pud_none(pud))
> +	else if (!pud_none(pud)) {
>  		/*
> -		 * Not present and not none means that a swap entry lives here,
> -		 * and we can't get rid of it.
> +		 * Not present and not none means that a swap entry lives here.
> +		 * If it's a PTE marker, we can deal with it. If it's another
> +		 * swap entry, we don't attempt to split it.
>  		 */
> -		return ERR_PTR(-EEXIST);
> +		is_marker = is_pte_marker(__pte(pud_val(pud)));
> +		if (!is_marker)
> +			return ERR_PTR(-EEXIST);
> +
> +		marker = pte_marker_get(pte_to_swp_entry(__pte(pud_val(pud))));
> +	}
>  
>  	new = pmd_alloc_one(mm, addr);
>  	if (!new)
> @@ -557,6 +590,13 @@ pmd_t *hugetlb_alloc_pmd(struct mm_struct *mm, struct hugetlb_pte *hpte,
>  		goto retry;
>  	}
>  
> +	/*
> +	 * Install markers before PUD to avoid races with other
> +	 * page tables walks.
> +	 */
> +	if (is_marker)
> +		hugetlb_install_markers_pmd(new, marker);
> +
>  	mm_inc_nr_pmds(mm);
>  	smp_wmb(); /* See comment in pmd_install() */
>  	pud_populate(mm, pudp, new);
> @@ -576,23 +616,32 @@ pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
>  	pgtable_t new;
>  	pmd_t *pmdp;
>  	pmd_t pmd;
> +	bool is_marker;
> +	pte_marker marker;
>  
>  	if (hpte->level != HUGETLB_LEVEL_PMD)
>  		return ERR_PTR(-EINVAL);
>  
>  	pmdp = (pmd_t *)hpte->ptep;
>  retry:
> +	is_marker = false;
>  	pmd = READ_ONCE(*pmdp);
>  	if (likely(pmd_present(pmd)))
>  		return unlikely(pmd_leaf(pmd))
>  			? ERR_PTR(-EEXIST)
>  			: pte_offset_kernel(pmdp, addr);
> -	else if (!pmd_none(pmd))
> +	else if (!pmd_none(pmd)) {
>  		/*
> -		 * Not present and not none means that a swap entry lives here,
> -		 * and we can't get rid of it.
> +		 * Not present and not none means that a swap entry lives here.
> +		 * If it's a PTE marker, we can deal with it. If it's another
> +		 * swap entry, we don't attempt to split it.
>  		 */
> -		return ERR_PTR(-EEXIST);
> +		is_marker = is_pte_marker(__pte(pmd_val(pmd)));
> +		if (!is_marker)
> +			return ERR_PTR(-EEXIST);
> +
> +		marker = pte_marker_get(pte_to_swp_entry(__pte(pmd_val(pmd))));
> +	}
>  
>  	/*
>  	 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
> @@ -613,6 +662,9 @@ pte_t *hugetlb_alloc_pte(struct mm_struct *mm, struct hugetlb_pte *hpte,
>  		goto retry;
>  	}
>  
> +	if (is_marker)
> +		hugetlb_install_markers_pte(page_address(new), marker);
> +
>  	mm_inc_nr_ptes(mm);
>  	smp_wmb(); /* See comment in pmd_install() */
>  	pmd_populate(mm, pmdp, new);
> @@ -7384,7 +7436,12 @@ static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (!pte_present(pte)) {
>  			if (!alloc)
>  				return 0;
> -			if (unlikely(!huge_pte_none(pte)))
> +			/*
> +			 * In hugetlb_alloc_pmd and hugetlb_alloc_pte,
> +			 * we split PTE markers, so we can tolerate
> +			 * PTE markers here.
> +			 */
> +			if (unlikely(!huge_pte_none_mostly(pte)))
>  				return -EEXIST;
>  		} else if (hugetlb_pte_present_leaf(hpte, pte))
>  			return 0;
> -- 
> 2.39.2.637.g21b0678d19-goog
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings
  2023-02-18  0:27 ` [PATCH v2 16/46] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
  2023-02-22 21:17   ` Mina Almasry
@ 2023-02-28 23:02   ` Mike Kravetz
  1 sibling, 0 replies; 96+ messages in thread
From: Mike Kravetz @ 2023-02-28 23:02 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/18/23 00:27, James Houghton wrote:
> This is a simple change: don't create a "huge" PTE if we are making a
> regular, PAGE_SIZE PTE. All architectures that want to implement HGM
> likely need to be changed in a similar way if they implement their own
> version of arch_make_huge_pte.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 726d581158b1..b767b6889dea 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h

Thanks,
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> @@ -899,7 +899,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
>  static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
>  				       vm_flags_t flags)
>  {
> -	return pte_mkhuge(entry);
> +	return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
>  }
>  #endif
>  
> -- 
> 2.39.2.637.g21b0678d19-goog
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
  2023-02-28 22:14   ` Mike Kravetz
@ 2023-02-28 23:03     ` James Houghton
  0 siblings, 0 replies; 96+ messages in thread
From: James Houghton @ 2023-02-28 23:03 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On Tue, Feb 28, 2023 at 2:15 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 02/18/23 00:27, James Houghton wrote:
> > hugetlb_hgm_walk implements high-granularity page table walks for
> > HugeTLB. It is safe to call on non-HGM enabled VMAs; it will return
> > immediately.
> >
> > hugetlb_walk_step implements how we step forwards in the walk. For
> > architectures that don't use GENERAL_HUGETLB, they will need to provide
> > their own implementation.
> >
> > The broader API that should be used is
> > hugetlb_full_walk[,alloc|,continue].
>
> I guess 'full' in the name implies walking to the PTE (PAGE_SIZE) level.
> It could just be me and my over-familiarity with the existing hugetlb
> walking code, but that was not obvious.

Yeah "full" means it walks all the way down to the leaf. "alloc" means
it may allocate if required to reach some target PTE size (target_sz).

I'll try to be clearer in the comments for hugetlb_full_walk and
hugetlb_full_walk_alloc.
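
To make the distinction concrete, a hypothetical caller might do something
like the following (sketch only, using just the APIs added in this patch):

	struct hugetlb_pte hpte;
	int ret;

	/* Read side: find whatever leaf currently maps @addr; never allocate. */
	if (hugetlb_full_walk(&hpte, vma, addr))
		return 0;	/* hstate-level PTE not allocated; skip this range */
	/* hugetlb_pte_size(&hpte) may be anything from PAGE_SIZE up to the hstate size. */

	/* Write side: make sure @addr can be mapped at PAGE_SIZE, allocating as needed. */
	ret = hugetlb_full_walk_alloc(&hpte, vma, addr, PAGE_SIZE);
	if (ret)
		return ret;	/* -ENOMEM, or -EEXIST on migration/poison entries */
	/* hugetlb_pte_size(&hpte) is now PAGE_SIZE unless a larger present leaf was hit. */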

>
> Again, perhaps it is just how familiar I am with the existing code, but
> I found the routines difficult to follow.  Nothing looks obviously wrong.
>
> Just a couple comments.questions below.
>
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 9d839519c875..726d581158b1 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -223,6 +223,14 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
> >  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> >                     unsigned long addr, pud_t *pud);
> >
> > +int hugetlb_full_walk(struct hugetlb_pte *hpte, struct vm_area_struct *vma,
> > +                   unsigned long addr);
> > +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
> > +                             struct vm_area_struct *vma, unsigned long addr);
> > +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
> > +                         struct vm_area_struct *vma, unsigned long addr,
> > +                         unsigned long target_sz);
> > +
> >  struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
> >
> >  extern int sysctl_hugetlb_shm_group;
> > @@ -272,6 +280,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> >  pte_t *huge_pte_offset(struct mm_struct *mm,
> >                      unsigned long addr, unsigned long sz);
> >  unsigned long hugetlb_mask_last_page(struct hstate *h);
> > +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +                   unsigned long addr, unsigned long sz);
> >  int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> >                               unsigned long addr, pte_t *ptep);
> >  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> > @@ -1054,6 +1064,8 @@ void hugetlb_register_node(struct node *node);
> >  void hugetlb_unregister_node(struct node *node);
> >  #endif
> >
> > +enum hugetlb_level hpage_size_to_level(unsigned long sz);
> > +
> >  #else        /* CONFIG_HUGETLB_PAGE */
> >  struct hstate {};
> >
> > @@ -1246,6 +1258,11 @@ static inline void hugetlb_register_node(struct node *node)
> >  static inline void hugetlb_unregister_node(struct node *node)
> >  {
> >  }
> > +
> > +static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
> > +{
> > +     return HUGETLB_LEVEL_PTE;
> > +}
> >  #endif       /* CONFIG_HUGETLB_PAGE */
> >
> >  #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bb424cdf79e4..810c05feb41f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -97,6 +97,29 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
> >  static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> >               unsigned long start, unsigned long end);
> >
> > +/*
> > + * hpage_size_to_level() - convert @sz to the corresponding page table level
> > + *
> > + * @sz must be less than or equal to a valid hugepage size.
> > + */
> > +enum hugetlb_level hpage_size_to_level(unsigned long sz)
> > +{
> > +     /*
> > +      * We order the conditionals from smallest to largest to pick the
> > +      * smallest level when multiple levels have the same size (i.e.,
> > +      * when levels are folded).
> > +      */
> > +     if (sz < PMD_SIZE)
> > +             return HUGETLB_LEVEL_PTE;
> > +     if (sz < PUD_SIZE)
> > +             return HUGETLB_LEVEL_PMD;
> > +     if (sz < P4D_SIZE)
> > +             return HUGETLB_LEVEL_PUD;
> > +     if (sz < PGDIR_SIZE)
> > +             return HUGETLB_LEVEL_P4D;
> > +     return HUGETLB_LEVEL_PGD;
> > +}
> > +
> >  static inline bool subpool_is_free(struct hugepage_subpool *spool)
> >  {
> >       if (spool->count)
> > @@ -7315,6 +7338,154 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
> >  }
> >  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
> >
> > +/* __hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
> > + * the page table entry for @addr. We might allocate new PTEs.
> > + *
> > + * @hpte must always be pointing at an hstate-level PTE or deeper.
> > + *
> > + * This function will never walk further if it encounters a PTE of a size
> > + * less than or equal to @sz.
> > + *
> > + * @alloc determines what we do when we encounter an empty PTE. If false,
> > + * we stop walking. If true and @sz is less than the current PTE's size,
> > + * we make that PTE point to the next level down, going until @sz is the same
> > + * as our current PTE.
> > + *
> > + * If @alloc is false and @sz is PAGE_SIZE, this function will always
> > + * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz.
> > + *
> > + * Return:
> > + *   -ENOMEM if we couldn't allocate new PTEs.
> > + *   -EEXIST if the caller wanted to walk further than a migration PTE,
> > + *           poison PTE, or a PTE marker. The caller needs to manually deal
> > + *           with this scenario.
> > + *   -EINVAL if called with invalid arguments (@sz invalid, @hpte not
> > + *           initialized).
> > + *   0 otherwise.
> > + *
> > + *   Even if this function fails, @hpte is guaranteed to always remain
> > + *   valid.
> > + */
> > +static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
> > +                           struct hugetlb_pte *hpte, unsigned long addr,
> > +                           unsigned long sz, bool alloc)
> > +{
> > +     int ret = 0;
> > +     pte_t pte;
> > +
> > +     if (WARN_ON_ONCE(sz < PAGE_SIZE))
> > +             return -EINVAL;
> > +
> > +     if (WARN_ON_ONCE(!hpte->ptep))
> > +             return -EINVAL;
> > +
> > +     while (hugetlb_pte_size(hpte) > sz && !ret) {
> > +             pte = huge_ptep_get(hpte->ptep);
> > +             if (!pte_present(pte)) {
> > +                     if (!alloc)
> > +                             return 0;
> > +                     if (unlikely(!huge_pte_none(pte)))
> > +                             return -EEXIST;
> > +             } else if (hugetlb_pte_present_leaf(hpte, pte))
> > +                     return 0;
> > +             ret = hugetlb_walk_step(mm, hpte, addr, sz);
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +/*
> > + * hugetlb_hgm_walk - Has the same behavior as __hugetlb_hgm_walk but will
> > + * initialize @hpte with hstate-level PTE pointer @ptep.
> > + */
> > +static int hugetlb_hgm_walk(struct hugetlb_pte *hpte,
> > +                         pte_t *ptep,
> > +                         struct vm_area_struct *vma,
> > +                         unsigned long addr,
> > +                         unsigned long target_sz,
> > +                         bool alloc)
> > +{
> > +     struct hstate *h = hstate_vma(vma);
> > +
> > +     hugetlb_pte_init(vma->vm_mm, hpte, ptep, huge_page_shift(h),
> > +                      hpage_size_to_level(huge_page_size(h)));
> > +     return __hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, target_sz,
> > +                               alloc);
> > +}
> > +
> > +/*
> > + * hugetlb_full_walk_continue - continue a high-granularity page-table walk.
> > + *
> > + * If a user has a valid @hpte but knows that @hpte is not a leaf, they can
> > + * attempt to continue walking by calling this function.
> > + *
> > + * This function will never fail, but @hpte might not change.
> > + *
> > + * If @hpte hasn't been initialized, then this function's behavior is
> > + * undefined.
> > + */
> > +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
> > +                             struct vm_area_struct *vma,
> > +                             unsigned long addr)
> > +{
> > +     /* __hugetlb_hgm_walk will never fail with these arguments. */
> > +     WARN_ON_ONCE(__hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr,
> > +                                     PAGE_SIZE, false));
> > +}
> > +
> > +/*
> > + * hugetlb_full_walk - do a high-granularity page-table walk; never allocate.
> > + *
> > + * This function can only fail if we find that the hstate-level PTE is not
> > + * allocated. Callers can take advantage of this fact to skip address regions
> > + * that cannot be mapped in that case.
> > + *
> > + * If this function succeeds, @hpte is guaranteed to be valid.
> > + */
> > +int hugetlb_full_walk(struct hugetlb_pte *hpte,
> > +                   struct vm_area_struct *vma,
> > +                   unsigned long addr)
> > +{
> > +     struct hstate *h = hstate_vma(vma);
> > +     unsigned long sz = huge_page_size(h);
> > +     /*
> > +      * We must mask the address appropriately so that we pick up the first
> > +      * PTE in a contiguous group.
> > +      */
> > +     pte_t *ptep = hugetlb_walk(vma, addr & huge_page_mask(h), sz);
> > +
> > +     if (!ptep)
> > +             return -ENOMEM;
>
> -ENOMEM does not seem appropriate, but I can not think of something
> better.  -ENOENT perhaps?

The callers only ever check if hugetlb_full_walk() is 0 or not, so
returning 1 here could be a fine solution too. What do you think?

>
> > +
> > +     /* hugetlb_hgm_walk will never fail with these arguments. */
> > +     WARN_ON_ONCE(hugetlb_hgm_walk(hpte, ptep, vma, addr, PAGE_SIZE, false));
> > +     return 0;
> > +}
> > +
> > +/*
> > + * hugetlb_full_walk_alloc - do a high-granularity walk, potentially allocate
> > + *   new PTEs.
> > + */
> > +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
> > +                                struct vm_area_struct *vma,
> > +                                unsigned long addr,
> > +                                unsigned long target_sz)
> > +{
> > +     struct hstate *h = hstate_vma(vma);
> > +     unsigned long sz = huge_page_size(h);
> > +     /*
> > +      * We must mask the address appropriately so that we pick up the first
> > +      * PTE in a contiguous group.
> > +      */
> > +     pte_t *ptep = huge_pte_alloc(vma->vm_mm, vma, addr & huge_page_mask(h),
> > +                                  sz);
> > +
> > +     if (!ptep)
> > +             return -ENOMEM;
> > +
> > +     return hugetlb_hgm_walk(hpte, ptep, vma, addr, target_sz, true);
> > +}
> > +
> >  #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
> >  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> >                       unsigned long addr, unsigned long sz)
> > @@ -7382,6 +7553,48 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
> >       return (pte_t *)pmd;
> >  }
> >
> > +/*
> > + * hugetlb_walk_step() - Walk the page table one step to resolve the page
> > + * (hugepage or subpage) entry at address @addr.
> > + *
> > + * @sz always points at the final target PTE size (e.g. PAGE_SIZE for the
> > + * lowest level PTE).
> > + *
> > + * @hpte will always remain valid, even if this function fails.
> > + *
> > + * Architectures that implement this function must ensure that if @hpte does
> > + * not change levels, then its PTL must also stay the same.
> > + */
> > +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > +                   unsigned long addr, unsigned long sz)
> > +{
> > +     pte_t *ptep;
> > +     spinlock_t *ptl;
> > +
> > +     switch (hpte->level) {
> > +     case HUGETLB_LEVEL_PUD:
> > +             ptep = (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr);
> > +             if (IS_ERR(ptep))
> > +                     return PTR_ERR(ptep);
> > +             hugetlb_pte_init(mm, hpte, ptep, PMD_SHIFT,
> > +                              HUGETLB_LEVEL_PMD);
> > +             break;
> > +     case HUGETLB_LEVEL_PMD:
> > +             ptep = hugetlb_alloc_pte(mm, hpte, addr);
> > +             if (IS_ERR(ptep))
> > +                     return PTR_ERR(ptep);
> > +             ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
>
> Is that right?  hpte->ptep is the PMD level entry.  It seems
> pte_lockptr() -> ptlock_ptr(pmd_page(*pmd)) -> return page->ptl
> But, I would think we want mm->page_table_lock for the newly
> allocated PTE.

If we used mm->page_table_lock for 4K PTEs, that would be a
performance nightmare (right?). The PTL we set here will be used for
page faults, UFFDIO_CONTINUEs, etc.

This code should be right. It's doing:
(1) Overwrite our leaf-level PMD with a non-leaf PMD (returning the 4K
PTE we should populate the hpte with).
(2) Find the PTL for the non-leaf PMD (the same way that generic mm does).
(3) Update hpte to point to the PTE we are supposed to use.

It's really important that (2) happens after (1), otherwise pmd_page()
won't be pointing to a page table page (and will change shortly), so
ptlock_ptr() won't give us what we want.

This is probably worth a comment; I'll write something like this here.
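
Maybe something along these lines (rough sketch of the comment, exact
wording to be sorted out for v3):

        case HUGETLB_LEVEL_PMD:
                ptep = hugetlb_alloc_pte(mm, hpte, addr);
                if (IS_ERR(ptep))
                        return PTR_ERR(ptep);
                /*
                 * hugetlb_alloc_pte() has already replaced our leaf PMD
                 * with a non-leaf PMD pointing at a PTE table, so
                 * pmd_page() now refers to a page table page and
                 * pte_lockptr() returns its split PTL. The ordering
                 * matters: grabbing the PTL before the allocation would
                 * look at the old leaf PMD instead.
                 */
                ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
                __hugetlb_pte_init(hpte, ptep, PAGE_SHIFT,
                                   HUGETLB_LEVEL_PTE, ptl);
                break;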

Thanks Mike. Hopefully the rest of the series isn't too confusing.

- James

>
> --
> Mike Kravetz
>
> > +             __hugetlb_pte_init(hpte, ptep, PAGE_SHIFT,
> > +                                HUGETLB_LEVEL_PTE, ptl);
> > +             break;
> > +     default:
> > +             WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n",
> > +                             __func__, hpte->level, hpte->shift);
> > +             return -EINVAL;
> > +     }
> > +     return 0;
> > +}
> > +
> >  /*
> >   * Return a mask that can be used to update an address to the last huge
> >   * page in a page table page mapping size.  Used to skip non-present
> > --
> > 2.39.2.637.g21b0678d19-goog
> >

* Re: [PATCH v2 17/46] hugetlbfs: do a full walk to check if vma maps a page
  2023-02-22 15:46   ` James Houghton
@ 2023-02-28 23:52     ` Mike Kravetz
  0 siblings, 0 replies; 96+ messages in thread
From: Mike Kravetz @ 2023-02-28 23:52 UTC (permalink / raw)
  To: James Houghton
  Cc: Muchun Song, Peter Xu, Andrew Morton, David Hildenbrand,
	David Rientjes, Axel Rasmussen, Mina Almasry, Zach O'Keefe,
	Manish Mishra, Naoya Horiguchi, Dr . David Alan Gilbert,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, Jiaqi Yan, linux-mm, linux-kernel

On 02/22/23 07:46, James Houghton wrote:
> On Fri, Feb 17, 2023 at 4:29 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Because it is safe to do so, do a full high-granularity page table walk
> > to check if the page is mapped.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index cfd09f95551b..c0ee69f0418e 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -386,17 +386,24 @@ static void hugetlb_delete_from_page_cache(struct folio *folio)
> >  static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> >                                 unsigned long addr, struct page *page)
> >  {
> > -       pte_t *ptep, pte;
> > +       pte_t pte;
> > +       struct hugetlb_pte hpte;
> >
> > -       ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
> > -       if (!ptep)
> > +       if (hugetlb_full_walk(&hpte, vma, addr))
> >                 return false;
> >
> > -       pte = huge_ptep_get(ptep);
> > +       pte = huge_ptep_get(hpte.ptep);
> >         if (huge_pte_none(pte) || !pte_present(pte))
> >                 return false;
> >
> > -       if (pte_page(pte) == page)
> > +       if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte)))
> > +               /*
> > +                * We raced with someone splitting us, and the only case
> > +                * where this is impossible is when the pte was none.
> > +                */
> > +               return false;
> > +
> > +       if (compound_head(pte_page(pte)) == page)
> >                 return true;
> >
> >         return false;
> > --
> > 2.39.2.637.g21b0678d19-goog
> >
> 
> I think this patch is actually incorrect.
> 
> This function is *supposed* to check if the page is mapped at all in
> this VMA, but really we're only checking if the base address of the
> page is mapped.

The function is/was only checking if the page is mapped at the specific
address.  That is because when walking the interval tree, we know where
it would be mapped and only check there.

I suppose it would still be functionally correct if we checked for the
page being mapped anywhere in the vma.

>                 If we did the 'hugetlb_vma_maybe_maps_page' approach
> that I did previously and returned 'true' if
> !hugetlb_pte_present_leaf(), then this code would be correct again.
> 
> But what I really think this function should do is just call
> page_vma_mapped_walk(). We're sort of reimplementing it here anyway.
> Unless someone disagrees, I'll do this for v3.

Yes, I think page_vma_mapped_walk would provide the same functionality.
I did not consider this when writing hugetlb_vma_maps_page, and
hugetlb_vma_maps_page was pretty simple for the current hugetlb
possibilities.  Things get more complicated with HGM.
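
FWIW, for the current (non-HGM) possibilities I would expect the
page_vma_mapped_walk based version to be about as simple as the sketch
below. Untested, and the locking in this path would need a second look:

        DEFINE_PAGE_VMA_WALK(pvmw, page, vma, addr, 0);

        if (!page_vma_mapped_walk(&pvmw))
                return false;
        /* The walk returned with the PTL held; drop it. */
        page_vma_mapped_walk_done(&pvmw);
        return true;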
-- 
Mike Kravetz

* Re: [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap
  2023-02-18  0:27 ` [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap James Houghton
@ 2023-03-02  1:06   ` Jiaqi Yan
  2023-03-02 15:44     ` James Houghton
  0 siblings, 1 reply; 96+ messages in thread
From: Jiaqi Yan @ 2023-03-02  1:06 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> This only applies to file-backed HugeTLB, and it should be a no-op until
> high-granularity mapping is possible. Also update page_remove_rmap to
> support the eventual case where !compound && folio_test_hugetlb().
>
> HugeTLB doesn't use LRU or mlock, so we avoid those bits. This also
> means we don't need to use subpage_mapcount; if we did, it would
> overflow with only a few mappings.
>
> There is still one caller of page_dup_file_rmap left: copy_present_pte,
> and it is always called with compound=false in this case.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 08004371cfed..6c008c9de80e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5077,7 +5077,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                          * sleep during the process.
>                          */
>                         if (!PageAnon(ptepage)) {
> -                               page_dup_file_rmap(ptepage, true);
> +                               page_add_file_rmap(ptepage, src_vma, true);
>                         } else if (page_try_dup_anon_rmap(ptepage, true,
>                                                           src_vma)) {
>                                 pte_t src_pte_old = entry;
> @@ -5910,7 +5910,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>         if (anon_rmap)
>                 hugepage_add_new_anon_rmap(folio, vma, haddr);
>         else
> -               page_dup_file_rmap(&folio->page, true);
> +               page_add_file_rmap(&folio->page, vma, true);
>         new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
>                                 && (vma->vm_flags & VM_SHARED)));
>         /*
> @@ -6301,7 +6301,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                 goto out_release_unlock;
>
>         if (folio_in_pagecache)
> -               page_dup_file_rmap(&folio->page, true);
> +               page_add_file_rmap(&folio->page, dst_vma, true);
>         else
>                 hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr);
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d3964c414010..b0f87f19b536 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -254,7 +254,7 @@ static bool remove_migration_pte(struct folio *folio,
>                                 hugepage_add_anon_rmap(new, vma, pvmw.address,
>                                                        rmap_flags);
>                         else
> -                               page_dup_file_rmap(new, true);
> +                               page_add_file_rmap(new, vma, true);
>                         set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
>                 } else
>  #endif
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 15ae24585fc4..c010d0af3a82 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c

Given you are making hugetlb's ref/mapcount mechanism consistent with
THP, I think the special folio_test_hugetlb checks you added in this
commit will break page_mapped() and folio_mapped() if the page/folio is
HGMed. With these checks, folio->_nr_pages_mapped is not properly
increased/decreased.

> @@ -1318,21 +1318,21 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
>         int nr = 0, nr_pmdmapped = 0;
>         bool first;
>
> -       VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
> +       VM_BUG_ON_PAGE(compound && !PageTransHuge(page)
> +                               && !folio_test_hugetlb(folio), page);
>
>         /* Is page being mapped by PTE? Is this its first map to be added? */
>         if (likely(!compound)) {
>                 first = atomic_inc_and_test(&page->_mapcount);
>                 nr = first;
> -               if (first && folio_test_large(folio)) {
> +               if (first && folio_test_large(folio)
> +                         && !folio_test_hugetlb(folio)) {

So we should still increment _nr_pages_mapped for hugetlb case here,
and decrement in the corresponding place in page_remove_rmap.

>                         nr = atomic_inc_return_relaxed(mapped);
>                         nr = (nr < COMPOUND_MAPPED);
>                 }
> -       } else if (folio_test_pmd_mappable(folio)) {
> -               /* That test is redundant: it's for safety or to optimize out */
> -
> +       } else {
>                 first = atomic_inc_and_test(&folio->_entire_mapcount);
> -               if (first) {
> +               if (first && !folio_test_hugetlb(folio)) {

Same here: we should still increase _nr_pages_mapped by
COMPOUND_MAPPED and decrease by COMPOUND_MAPPED in the corresponding
place in page_remove_rmap.

>                         nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
>                         if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) {
>                                 nr_pmdmapped = folio_nr_pages(folio);
> @@ -1347,6 +1347,9 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
>                 }
>         }
>
> +       if (folio_test_hugetlb(folio))
> +               return;
> +
>         if (nr_pmdmapped)
>                 __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
>                         NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
> @@ -1376,8 +1379,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>         VM_BUG_ON_PAGE(compound && !PageHead(page), page);
>
>         /* Hugetlb pages are not counted in NR_*MAPPED */
> -       if (unlikely(folio_test_hugetlb(folio))) {
> -               /* hugetlb pages are always mapped with pmds */
> +       if (unlikely(folio_test_hugetlb(folio)) && compound) {
>                 atomic_dec(&folio->_entire_mapcount);
>                 return;
>         }

This entire if-block should be removed after you remove the
!folio_test_hugetlb checks in page_add_file_rmap.

> @@ -1386,15 +1388,14 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>         if (likely(!compound)) {
>                 last = atomic_add_negative(-1, &page->_mapcount);
>                 nr = last;
> -               if (last && folio_test_large(folio)) {
> +               if (last && folio_test_large(folio)
> +                        && !folio_test_hugetlb(folio)) {

ditto.

>                         nr = atomic_dec_return_relaxed(mapped);
>                         nr = (nr < COMPOUND_MAPPED);
>                 }
> -       } else if (folio_test_pmd_mappable(folio)) {
> -               /* That test is redundant: it's for safety or to optimize out */
> -
> +       } else {
>                 last = atomic_add_negative(-1, &folio->_entire_mapcount);
> -               if (last) {
> +               if (last && !folio_test_hugetlb(folio)) {

ditto.

>                         nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
>                         if (likely(nr < COMPOUND_MAPPED)) {
>                                 nr_pmdmapped = folio_nr_pages(folio);
> @@ -1409,6 +1410,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>                 }
>         }
>
> +       if (folio_test_hugetlb(folio))
> +               return;
> +
>         if (nr_pmdmapped) {
>                 if (folio_test_anon(folio))
>                         idx = NR_ANON_THPS;
> --
> 2.39.2.637.g21b0678d19-goog
>

* Re: [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap
  2023-03-02  1:06   ` Jiaqi Yan
@ 2023-03-02 15:44     ` James Houghton
  2023-03-02 16:43       ` James Houghton
  0 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-03-02 15:44 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, linux-mm, linux-kernel

On Wed, Mar 1, 2023 at 5:06 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > This only applies to file-backed HugeTLB, and it should be a no-op until
> > high-granularity mapping is possible. Also update page_remove_rmap to
> > support the eventual case where !compound && folio_test_hugetlb().
> >
> > HugeTLB doesn't use LRU or mlock, so we avoid those bits. This also
> > means we don't need to use subpage_mapcount; if we did, it would
> > overflow with only a few mappings.

This is wrong; I guess I misunderstood the code when I wrote this
commit. subpages_mapcount (now called _nr_pages_mapped) won't overflow
(unless HugeTLB pages could be greater than 16G). It is indeed a bug
not to update _nr_pages_mapped the same way THPs do.

>
> >
> > There is still one caller of page_dup_file_rmap left: copy_present_pte,
> > and it is always called with compound=false in this case.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 08004371cfed..6c008c9de80e 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5077,7 +5077,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >                          * sleep during the process.
> >                          */
> >                         if (!PageAnon(ptepage)) {
> > -                               page_dup_file_rmap(ptepage, true);
> > +                               page_add_file_rmap(ptepage, src_vma, true);
> >                         } else if (page_try_dup_anon_rmap(ptepage, true,
> >                                                           src_vma)) {
> >                                 pte_t src_pte_old = entry;
> > @@ -5910,7 +5910,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >         if (anon_rmap)
> >                 hugepage_add_new_anon_rmap(folio, vma, haddr);
> >         else
> > -               page_dup_file_rmap(&folio->page, true);
> > +               page_add_file_rmap(&folio->page, vma, true);
> >         new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
> >                                 && (vma->vm_flags & VM_SHARED)));
> >         /*
> > @@ -6301,7 +6301,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >                 goto out_release_unlock;
> >
> >         if (folio_in_pagecache)
> > -               page_dup_file_rmap(&folio->page, true);
> > +               page_add_file_rmap(&folio->page, dst_vma, true);
> >         else
> >                 hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr);
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index d3964c414010..b0f87f19b536 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -254,7 +254,7 @@ static bool remove_migration_pte(struct folio *folio,
> >                                 hugepage_add_anon_rmap(new, vma, pvmw.address,
> >                                                        rmap_flags);
> >                         else
> > -                               page_dup_file_rmap(new, true);
> > +                               page_add_file_rmap(new, vma, true);
> >                         set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
> >                 } else
> >  #endif
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 15ae24585fc4..c010d0af3a82 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
>
> Given you are making hugetlb's ref/mapcount mechanism consistent with
> THP, I think the special folio_test_hugetlb checks you added in this
> commit will break page_mapped() and folio_mapped() if the page/folio is
> HGMed. With these checks, folio->_nr_pages_mapped is not properly
> increased/decreased.

Thank you, Jiaqi! I didn't realize I broke
folio_mapped()/page_mapped(). The end result is that page_mapped() may
report that an HGMed page isn't mapped when it is. Not good!
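
(To spell out why, if I'm reading mm.h right: folio_mapped() on a large
folio boils down to roughly

        atomic_read(&folio->_nr_pages_mapped) > 0 ||
                atomic_read(&folio->_entire_mapcount) >= 0

and with this patch a hugetlb folio that is only mapped at high
granularity has _nr_pages_mapped == 0 and _entire_mapcount == -1, so
folio_mapped()/page_mapped() on the head page would say it is not
mapped at all.)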

>
> > @@ -1318,21 +1318,21 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> >         int nr = 0, nr_pmdmapped = 0;
> >         bool first;
> >
> > -       VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
> > +       VM_BUG_ON_PAGE(compound && !PageTransHuge(page)
> > +                               && !folio_test_hugetlb(folio), page);
> >
> >         /* Is page being mapped by PTE? Is this its first map to be added? */
> >         if (likely(!compound)) {
> >                 first = atomic_inc_and_test(&page->_mapcount);
> >                 nr = first;
> > -               if (first && folio_test_large(folio)) {
> > +               if (first && folio_test_large(folio)
> > +                         && !folio_test_hugetlb(folio)) {
>
> So we should still increment _nr_pages_mapped for hugetlb case here,
> and decrement in the corresponding place in page_remove_rmap.
>
> >                         nr = atomic_inc_return_relaxed(mapped);
> >                         nr = (nr < COMPOUND_MAPPED);
> >                 }
> > -       } else if (folio_test_pmd_mappable(folio)) {
> > -               /* That test is redundant: it's for safety or to optimize out */
> > -
> > +       } else {
> >                 first = atomic_inc_and_test(&folio->_entire_mapcount);
> > -               if (first) {
> > +               if (first && !folio_test_hugetlb(folio)) {
>
> Same here: we should still increase _nr_pages_mapped by
> COMPOUND_MAPPED and decrease by COMPOUND_MAPPED in the corresponding
> place in page_remove_rmap.
>
> >                         nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
> >                         if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) {
> >                                 nr_pmdmapped = folio_nr_pages(folio);
> > @@ -1347,6 +1347,9 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> >                 }
> >         }
> >
> > +       if (folio_test_hugetlb(folio))
> > +               return;
> > +
> >         if (nr_pmdmapped)
> >                 __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> >                         NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
> > @@ -1376,8 +1379,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> >         VM_BUG_ON_PAGE(compound && !PageHead(page), page);
> >
> >         /* Hugetlb pages are not counted in NR_*MAPPED */
> > -       if (unlikely(folio_test_hugetlb(folio))) {
> > -               /* hugetlb pages are always mapped with pmds */
> > +       if (unlikely(folio_test_hugetlb(folio)) && compound) {
> >                 atomic_dec(&folio->_entire_mapcount);
> >                 return;
> >         }
>
> This entire if-block should be removed after you remove the
> !folio_test_hugetlb checks in page_add_file_rmap.

This is the not-so-obvious change that is needed. Thank you!

>
> > @@ -1386,15 +1388,14 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> >         if (likely(!compound)) {
> >                 last = atomic_add_negative(-1, &page->_mapcount);
> >                 nr = last;
> > -               if (last && folio_test_large(folio)) {
> > +               if (last && folio_test_large(folio)
> > +                        && !folio_test_hugetlb(folio)) {
>
> ditto.
>
> >                         nr = atomic_dec_return_relaxed(mapped);
> >                         nr = (nr < COMPOUND_MAPPED);
> >                 }
> > -       } else if (folio_test_pmd_mappable(folio)) {
> > -               /* That test is redundant: it's for safety or to optimize out */
> > -
> > +       } else {
> >                 last = atomic_add_negative(-1, &folio->_entire_mapcount);
> > -               if (last) {
> > +               if (last && !folio_test_hugetlb(folio)) {
>
> ditto.

I agree with all of your suggestions. Testing with the hugetlb-hgm
selftest, nothing seems to break. :)

Given that this is at least the third or fourth major bug in this
version of the series, I'll go ahead and send a v3 sooner rather than
later.

>
> >                         nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
> >                         if (likely(nr < COMPOUND_MAPPED)) {
> >                                 nr_pmdmapped = folio_nr_pages(folio);
> > @@ -1409,6 +1410,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> >                 }
> >         }
> >
> > +       if (folio_test_hugetlb(folio))
> > +               return;
> > +
> >         if (nr_pmdmapped) {
> >                 if (folio_test_anon(folio))
> >                         idx = NR_ANON_THPS;
> > --
> > 2.39.2.637.g21b0678d19-goog
> >

* Re: [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap
  2023-03-02 15:44     ` James Houghton
@ 2023-03-02 16:43       ` James Houghton
  2023-03-02 19:22         ` Mike Kravetz
  0 siblings, 1 reply; 96+ messages in thread
From: James Houghton @ 2023-03-02 16:43 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, linux-mm, linux-kernel

On Thu, Mar 2, 2023 at 7:44 AM James Houghton <jthoughton@google.com> wrote:
>
> On Wed, Mar 1, 2023 at 5:06 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
> >
> > On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> > >
> > > This only applies to file-backed HugeTLB, and it should be a no-op until
> > > high-granularity mapping is possible. Also update page_remove_rmap to
> > > support the eventual case where !compound && folio_test_hugetlb().
> > >
> > > HugeTLB doesn't use LRU or mlock, so we avoid those bits. This also
> > > means we don't need to use subpage_mapcount; if we did, it would
> > > overflow with only a few mappings.
>
> This is wrong; I guess I misunderstood the code when I wrote this
> commit. subpages_mapcount (now called _nr_pages_mapped) won't overflow
> (unless HugeTLB pages could be greater than 16G). It is indeed a bug
> not to update _nr_pages_mapped the same way THPs do.
>
> [snip]
>
> I agree with all of your suggestions. Testing with the hugetlb-hgm
> selftest, nothing seems to break. :)
>
> Given that this is at least the third or fourth major bug in this
> version of the series, I'll go ahead and send a v3 sooner rather than
> later.

This solution limits the size of a HugeTLB page to 16G. I'm not sure
if there are any architectures that support HugeTLB pages larger than
16G (it looks like powerpc supports 16G). If they do, I think we can
just increase the value of COMPOUND_MAPPED. If that's not possible, we
would need another solution than participating in _nr_pages_mapped
like THPs.
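
For reference, the arithmetic (assuming I have the current constants
right): COMPOUND_MAPPED is 0x800000, and _nr_pages_mapped has to keep
the per-page count below that bit, so with 4K base pages:

        16G / 4K = 0x400000 subpages  (fits below COMPOUND_MAPPED)
        32G / 4K = 0x800000 subpages  (would collide with COMPOUND_MAPPED)

i.e. 16G is the largest power-of-two HugeTLB size this scheme handles
without bumping COMPOUND_MAPPED.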

* Re: [PATCH v2 05/46] rmap: hugetlb: switch from page_dup_file_rmap to page_add_file_rmap
  2023-03-02 16:43       ` James Houghton
@ 2023-03-02 19:22         ` Mike Kravetz
  0 siblings, 0 replies; 96+ messages in thread
From: Mike Kravetz @ 2023-03-02 19:22 UTC (permalink / raw)
  To: James Houghton
  Cc: Jiaqi Yan, Muchun Song, Peter Xu, Andrew Morton,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
	Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
	Dr . David Alan Gilbert, Matthew Wilcox (Oracle),
	Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi,
	Frank van der Linden, linux-mm, linux-kernel

On 03/02/23 08:43, James Houghton wrote:
> 
> This solution limits the size of a HugeTLB page to 16G. I'm not sure
> if there are any architectures that support HugeTLB pages larger than
> 16G (it looks like powerpc supports 16G). If they do, I think we can
> just increase the value of COMPOUND_MAPPED. If that's not possible, we
> would need another solution than participating in _nr_pages_mapped
> like THPs.

For now, I believe you should continue with the THP-like approach.

A couple days back, Matthew asked me to take a look and comment on his
latest mapcount proposal.  I have not done that yet. :( The hope is that
this will make hugetlb HGM simpler as well.

It is somewhat troubling that mapcount may be changed in the near future.
Better to be consistent with THP than to invent something hugetlb-HGM
specific. That way, when THP is updated, the hugetlb updates should be
nearly the same.
-- 
Mike Kravetz
