* [RFC PATCH 0/3] hugetlb: speed up linear address scanning
@ 2022-05-27 22:58 Mike Kravetz
  2022-05-27 22:58 ` [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present Mike Kravetz
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Mike Kravetz @ 2022-05-27 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Muchun Song, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton, Mike Kravetz

At unmap, fork and remap time, hugetlb address ranges are linearly
scanned.  We can optimize these scans if the ranges are sparsely
populated.  Also, enable page table "Lazy copy" for hugetlb at fork.

Mike Kravetz (3):
  hugetlb: skip to end of PT page mapping when pte not present
  hugetlb: do not update address in huge_pmd_unshare
  hugetlb: Lazy page table copies in fork()

 include/linux/hugetlb.h |   5 +-
 mm/hugetlb.c            | 104 ++++++++++++++++++++++++++++------------
 mm/memory.c             |   2 +-
 mm/rmap.c               |   4 +-
 4 files changed, 79 insertions(+), 36 deletions(-)

-- 
2.35.3



* [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-27 22:58 [RFC PATCH 0/3] hugetlb: speed up linear address scanning Mike Kravetz
@ 2022-05-27 22:58 ` Mike Kravetz
  2022-05-30 10:10   ` Baolin Wang
  2022-05-30 19:56   ` Peter Xu
  2022-05-27 22:58 ` [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare Mike Kravetz
  2022-05-27 22:58 ` [RFC PATCH 3/3] hugetlb: Lazy page table copies in fork() Mike Kravetz
  2 siblings, 2 replies; 20+ messages in thread
From: Mike Kravetz @ 2022-05-27 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Muchun Song, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton, Mike Kravetz

HugeTLB address ranges are linearly scanned during fork, unmap and
remap operations.  If a non-present entry is encountered, the code
currently continues to the next huge page aligned address.  However,
a non-present entry implies that the page table page for that entry
is not present.  Therefore, the linear scan can skip to the end of
the range mapped by the page table page.  This can speed operations on
large sparsely populated hugetlb mappings.

Create a new routine hugetlb_mask_last_hp() that will return an
address mask.  When the mask is ORed with an address, the result
will be the address of the last huge page mapped by the associated
page table page.  Use this mask to update addresses in routines which
linearly scan hugetlb address ranges when a non-present pte is
encountered.
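
For illustration only (values below assume x86-64 with 2MB huge pages,
where one PMD page table page maps PUD_SIZE = 1GB; not part of the patch):

	unsigned long sz = PMD_SIZE;			/* 0x200000 */
	unsigned long mask = PUD_SIZE - PMD_SIZE;	/* 0x3fe00000 */
	unsigned long addr = 0x40000000UL;	/* huge_pte_offset() returned NULL */

	addr |= mask;	/* 0x7fe00000: last huge page in this PMD page */
	addr += sz;	/* 0x80000000: scan resumes at the next PMD page */

A single non-present lookup therefore skips the whole 1GB region instead
of stepping through 512 individual huge page addresses.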

hugetlb_mask_last_hp is related to the implementation of huge_pte_offset
as hugetlb_mask_last_hp is called when huge_pte_offset returns NULL.
This patch only provides a complete hugetlb_mask_last_hp implementation
when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined.  Architectures which
provide their own versions of huge_pte_offset can also provide their own
version of hugetlb_mask_last_hp.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 58 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e4cff27d1198..25078a0ea1d8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -194,6 +194,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
+unsigned long hugetlb_mask_last_hp(struct hstate *h);
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long *addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c468ac1d069..a2db878b2255 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4731,6 +4731,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	unsigned long npages = pages_per_huge_page(h);
 	struct address_space *mapping = src_vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
+	unsigned long last_addr_mask;
 	int ret = 0;
 
 	if (cow) {
@@ -4750,11 +4751,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		i_mmap_lock_read(mapping);
 	}
 
+	last_addr_mask = hugetlb_mask_last_hp(h);
 	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr, sz);
-		if (!src_pte)
+		if (!src_pte) {
+			addr |= last_addr_mask;
 			continue;
+		}
 		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
 		if (!dst_pte) {
 			ret = -ENOMEM;
@@ -4771,8 +4775,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		 * after taking the lock below.
 		 */
 		dst_entry = huge_ptep_get(dst_pte);
-		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
+		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry)) {
+			addr |= last_addr_mask;
 			continue;
+		}
 
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
 		src_ptl = huge_pte_lockptr(h, src, src_pte);
@@ -4933,6 +4939,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	unsigned long sz = huge_page_size(h);
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long old_end = old_addr + len;
+	unsigned long last_addr_mask;
 	unsigned long old_addr_copy;
 	pte_t *src_pte, *dst_pte;
 	struct mmu_notifier_range range;
@@ -4948,12 +4955,16 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	flush_cache_range(vma, range.start, range.end);
 
 	mmu_notifier_invalidate_range_start(&range);
+	last_addr_mask = hugetlb_mask_last_hp(h);
 	/* Prevent race with file truncation */
 	i_mmap_lock_write(mapping);
 	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
 		src_pte = huge_pte_offset(mm, old_addr, sz);
-		if (!src_pte)
+		if (!src_pte) {
+			old_addr |= last_addr_mask;
+			new_addr |= last_addr_mask;
 			continue;
+		}
 		if (huge_pte_none(huge_ptep_get(src_pte)))
 			continue;
 
@@ -4998,6 +5009,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	struct mmu_notifier_range range;
+	unsigned long last_addr_mask;
 	bool force_flush = false;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
@@ -5018,11 +5030,14 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 				end);
 	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
 	mmu_notifier_invalidate_range_start(&range);
+	last_addr_mask = hugetlb_mask_last_hp(h);
 	address = start;
 	for (; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address, sz);
-		if (!ptep)
+		if (!ptep) {
+			address |= last_addr_mask;
 			continue;
+		}
 
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
@@ -6285,6 +6300,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	unsigned long pages = 0, psize = huge_page_size(h);
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
+	unsigned long last_addr_mask;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
@@ -6301,12 +6317,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, range.start, range.end);
 
 	mmu_notifier_invalidate_range_start(&range);
+	last_addr_mask = hugetlb_mask_last_hp(h);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += psize) {
 		spinlock_t *ptl;
 		ptep = huge_pte_offset(mm, address, psize);
-		if (!ptep)
+		if (!ptep) {
+			address |= last_addr_mask;
 			continue;
+		}
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
 			/*
@@ -6857,6 +6876,35 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	return (pte_t *)pmd;
 }
 
+/*
+ * Return a mask that can be used to update an address to the last huge
+ * page in a page table page mapping size.  Used to skip non-present
+ * page table entries when linearly scanning address ranges.  Architectures
+ * with unique huge page to page table relationships can define their own
+ * version of this routine.
+ */
+unsigned long hugetlb_mask_last_hp(struct hstate *h)
+{
+	unsigned long hp_size = huge_page_size(h);
+
+	if (hp_size == P4D_SIZE)
+		return PGDIR_SIZE - P4D_SIZE;
+	else if (hp_size == PUD_SIZE)
+		return P4D_SIZE - PUD_SIZE;
+	else if (hp_size == PMD_SIZE)
+		return PUD_SIZE - PMD_SIZE;
+
+	return ~(0);
+}
+
+#else
+
+/* See description above.  Architectures can provide their own version. */
+__weak unsigned long hugetlb_mask_last_hp(struct hstate *h)
+{
+	return ~(0);
+}
+
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
 
 /*
-- 
2.35.3



* [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare
  2022-05-27 22:58 [RFC PATCH 0/3] hugetlb: speed up linear address scanning Mike Kravetz
  2022-05-27 22:58 ` [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present Mike Kravetz
@ 2022-05-27 22:58 ` Mike Kravetz
  2022-05-30 10:14   ` Baolin Wang
  2022-05-30 15:36   ` Muchun Song
  2022-05-27 22:58 ` [RFC PATCH 3/3] hugetlb: Lazy page table copies in fork() Mike Kravetz
  2 siblings, 2 replies; 20+ messages in thread
From: Mike Kravetz @ 2022-05-27 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Muchun Song, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton, Mike Kravetz

As an optimization for loops sequentially processing hugetlb address
ranges, huge_pmd_unshare would update a passed address if it unshared a
pmd.  Updating a loop control variable outside the loop like this is
generally a bad idea.  These loops are now using hugetlb_mask_last_hp
to optimize scanning when non-present ptes are discovered.  The same
can be done when huge_pmd_unshare returns 1 indicating a pmd was
unshared.

Remove address update from huge_pmd_unshare.  Change the passed argument
type and update all callers.  In loops sequentially processing addresses
use hugetlb_mask_last_hp to update address if pmd is unshared.
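
The resulting caller pattern looks like this (simplified sketch of the
__unmap_hugepage_range loop after this change; TLB flush details elided,
see the hunks below for the real code):

	ptep = huge_pte_offset(mm, address, sz);
	if (!ptep) {
		address |= last_addr_mask;	/* PT page not present */
		continue;
	}

	ptl = huge_pte_lock(h, mm, ptep);
	if (huge_pmd_unshare(mm, vma, address, ptep)) {
		spin_unlock(ptl);
		/* a PUD_SIZE region was just unmapped; skip it the same way */
		address |= last_addr_mask;
		continue;
	}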

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  4 ++--
 mm/hugetlb.c            | 46 ++++++++++++++++++-----------------------
 mm/rmap.c               |  4 ++--
 3 files changed, 24 insertions(+), 30 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 25078a0ea1d8..307c8f6e6752 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -196,7 +196,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_hp(struct hstate *h);
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-				unsigned long *addr, pte_t *ptep);
+				unsigned long addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -243,7 +243,7 @@ static inline struct address_space *hugetlb_page_mapping_lock_write(
 
 static inline int huge_pmd_unshare(struct mm_struct *mm,
 					struct vm_area_struct *vma,
-					unsigned long *addr, pte_t *ptep)
+					unsigned long addr, pte_t *ptep)
 {
 	return 0;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a2db878b2255..c7d3fbf3ec05 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4940,7 +4940,6 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long old_end = old_addr + len;
 	unsigned long last_addr_mask;
-	unsigned long old_addr_copy;
 	pte_t *src_pte, *dst_pte;
 	struct mmu_notifier_range range;
 	bool shared_pmd = false;
@@ -4968,14 +4967,10 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 		if (huge_pte_none(huge_ptep_get(src_pte)))
 			continue;
 
-		/* old_addr arg to huge_pmd_unshare() is a pointer and so the
-		 * arg may be modified. Pass a copy instead to preserve the
-		 * value in old_addr.
-		 */
-		old_addr_copy = old_addr;
-
-		if (huge_pmd_unshare(mm, vma, &old_addr_copy, src_pte)) {
+		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
 			shared_pmd = true;
+			old_addr |= last_addr_mask;
+			new_addr |= last_addr_mask;
 			continue;
 		}
 
@@ -5040,10 +5035,11 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		}
 
 		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+		if (huge_pmd_unshare(mm, vma, address, ptep)) {
 			spin_unlock(ptl);
 			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
 			force_flush = true;
+			address |= last_addr_mask;
 			continue;
 		}
 
@@ -6327,7 +6323,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			continue;
 		}
 		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+		if (huge_pmd_unshare(mm, vma, address, ptep)) {
 			/*
 			 * When uffd-wp is enabled on the vma, unshare
 			 * shouldn't happen at all.  Warn about it if it
@@ -6337,6 +6333,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			pages++;
 			spin_unlock(ptl);
 			shared_pmd = true;
+			address |= last_addr_mask;
 			continue;
 		}
 		pte = huge_ptep_get(ptep);
@@ -6760,11 +6757,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
  *	    0 the underlying pte page is not shared, or it is the last user
  */
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-					unsigned long *addr, pte_t *ptep)
+					unsigned long addr, pte_t *ptep)
 {
-	pgd_t *pgd = pgd_offset(mm, *addr);
-	p4d_t *p4d = p4d_offset(pgd, *addr);
-	pud_t *pud = pud_offset(p4d, *addr);
+	pgd_t *pgd = pgd_offset(mm, addr);
+	p4d_t *p4d = p4d_offset(pgd, addr);
+	pud_t *pud = pud_offset(p4d, addr);
 
 	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
 	BUG_ON(page_count(virt_to_page(ptep)) == 0);
@@ -6774,14 +6771,6 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 	pud_clear(pud);
 	put_page(virt_to_page(ptep));
 	mm_dec_nr_pmds(mm);
-	/*
-	 * This update of passed address optimizes loops sequentially
-	 * processing addresses in increments of huge page size (PMD_SIZE
-	 * in this case).  By clearing the pud, a PUD_SIZE area is unmapped.
-	 * Update address to the 'last page' in the cleared area so that
-	 * calling loop can move to first page past this area.
-	 */
-	*addr |= PUD_SIZE - PMD_SIZE;
 	return 1;
 }
 
@@ -6793,7 +6782,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-				unsigned long *addr, pte_t *ptep)
+				unsigned long addr, pte_t *ptep)
 {
 	return 0;
 }
@@ -6902,6 +6891,13 @@ unsigned long hugetlb_mask_last_hp(struct hstate *h)
 /* See description above.  Architectures can provide their own version. */
 __weak unsigned long hugetlb_mask_last_hp(struct hstate *h)
 {
+	unsigned long hp_size = huge_page_size(h);
+
+#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+	if (hp_size == PMD_SIZE)	/* required for pmd sharing */
+		return PUD_SIZE - PMD_SIZE;
+#endif
+
 	return ~(0);
 }
 
@@ -7128,14 +7124,12 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 	mmu_notifier_invalidate_range_start(&range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (address = start; address < end; address += PUD_SIZE) {
-		unsigned long tmp = address;
-
 		ptep = huge_pte_offset(mm, address, sz);
 		if (!ptep)
 			continue;
 		ptl = huge_pte_lock(h, mm, ptep);
 		/* We don't want 'address' to be changed */
-		huge_pmd_unshare(mm, vma, &tmp, ptep);
+		huge_pmd_unshare(mm, vma, address, ptep);
 		spin_unlock(ptl);
 	}
 	flush_hugetlb_tlb_range(vma, start, end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..45b04e2e83ab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1559,7 +1559,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
 
-				if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
+				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
 					flush_tlb_range(vma, range.start, range.end);
 					mmu_notifier_invalidate_range(mm, range.start,
 								      range.end);
@@ -1923,7 +1923,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
 
-				if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
+				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
 					flush_tlb_range(vma, range.start, range.end);
 					mmu_notifier_invalidate_range(mm, range.start,
 								      range.end);
-- 
2.35.3



* [RFC PATCH 3/3] hugetlb: Lazy page table copies in fork()
  2022-05-27 22:58 [RFC PATCH 0/3] hugetlb: speed up linear address scanning Mike Kravetz
  2022-05-27 22:58 ` [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present Mike Kravetz
  2022-05-27 22:58 ` [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare Mike Kravetz
@ 2022-05-27 22:58 ` Mike Kravetz
  2022-05-31 17:25   ` David Hildenbrand
  2022-06-01  5:20   ` Muchun Song
  2 siblings, 2 replies; 20+ messages in thread
From: Mike Kravetz @ 2022-05-27 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Muchun Song, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton, Mike Kravetz

Lazy page table copying at fork time was introduced with commit
d992895ba2b2 ("Lazy page table copies in fork()").  At the time,
hugetlb was very new and did not support page faulting.  As a result,
it was excluded.  When full page fault support was added for hugetlb,
the exclusion was not removed.

Simply remove the check that prevents lazy copying of hugetlb page
tables at fork.  Of course, like other mappings this only applies to
shared mappings.

Lazy page table copying at fork will be less advantageous for hugetlb
mappings because:
- There are fewer page table entries with hugetlb
- hugetlb pmds can be shared instead of copied

In any case, completely eliminating the copy at fork time should speed
things up.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 21dadf03f089..f390cc11886d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1245,7 +1245,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (userfaultfd_wp(dst_vma))
 		return true;
 
-	if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
+	if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 		return true;
 
 	if (src_vma->anon_vma)
-- 
2.35.3



* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-27 22:58 ` [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present Mike Kravetz
@ 2022-05-30 10:10   ` Baolin Wang
  2022-05-31 16:56     ` Mike Kravetz
  2022-06-15 21:22     ` Mike Kravetz
  2022-05-30 19:56   ` Peter Xu
  1 sibling, 2 replies; 20+ messages in thread
From: Baolin Wang @ 2022-05-30 10:10 UTC (permalink / raw)
  To: Mike Kravetz, linux-kernel, linux-mm
  Cc: Muchun Song, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton



On 5/28/2022 6:58 AM, Mike Kravetz wrote:
> HugeTLB address ranges are linearly scanned during fork, unmap and
> remap operations.  If a non-present entry is encountered, the code
> currently continues to the next huge page aligned address.  However,
> a non-present entry implies that the page table page for that entry
> is not present.  Therefore, the linear scan can skip to the end of
> the range mapped by the page table page.  This can speed operations on
> large sparsely populated hugetlb mappings.
> 
> Create a new routine hugetlb_mask_last_hp() that will return an
> address mask.  When the mask is ORed with an address, the result
> will be the address of the last huge page mapped by the associated
> page table page.  Use this mask to update addresses in routines which
> linearly scan hugetlb address ranges when a non-present pte is
> encountered.
> 
> hugetlb_mask_last_hp is related to the implementation of huge_pte_offset
> as hugetlb_mask_last_hp is called when huge_pte_offset returns NULL.
> This patch only provides a complete hugetlb_mask_last_hp implementation
> when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined.  Architectures which
> provide their own versions of huge_pte_offset can also provide their own
> version of hugetlb_mask_last_hp.

I tested on my ARM64 machine by implementing an arm64-specific
hugetlb_mask_last_hp() as below, and it works well.

Just a few nits inline, otherwise looks good to me.
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index d93ba128a2b0..e04a097ffcc4 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -376,6 +376,28 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
         return NULL;
  }

+unsigned long hugetlb_mask_last_hp(struct hstate *h)
+{
+       unsigned long hp_size = huge_page_size(h);
+
+       switch (hp_size) {
+       case P4D_SIZE:
+               return PGDIR_SIZE - P4D_SIZE;
+       case PUD_SIZE:
+               return P4D_SIZE - PUD_SIZE;
+       case CONT_PMD_SIZE:
+               return PUD_SIZE - CONT_PMD_SIZE;
+       case PMD_SIZE:
+               return PUD_SIZE - PMD_SIZE;
+       case CONT_PTE_SIZE:
+               return PMD_SIZE - CONT_PTE_SIZE;
+       default:
+               break;
+       }
+
+       return ~(0UL);
+}

> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   include/linux/hugetlb.h |  1 +
>   mm/hugetlb.c            | 58 +++++++++++++++++++++++++++++++++++++----
>   2 files changed, 54 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e4cff27d1198..25078a0ea1d8 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -194,6 +194,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>   			unsigned long addr, unsigned long sz);
>   pte_t *huge_pte_offset(struct mm_struct *mm,
>   		       unsigned long addr, unsigned long sz);
> +unsigned long hugetlb_mask_last_hp(struct hstate *h);
>   int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>   				unsigned long *addr, pte_t *ptep);
>   void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7c468ac1d069..a2db878b2255 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4731,6 +4731,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   	unsigned long npages = pages_per_huge_page(h);
>   	struct address_space *mapping = src_vma->vm_file->f_mapping;
>   	struct mmu_notifier_range range;
> +	unsigned long last_addr_mask;
>   	int ret = 0;
>   
>   	if (cow) {
> @@ -4750,11 +4751,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   		i_mmap_lock_read(mapping);
>   	}
>   
> +	last_addr_mask = hugetlb_mask_last_hp(h);
>   	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
>   		spinlock_t *src_ptl, *dst_ptl;
>   		src_pte = huge_pte_offset(src, addr, sz);
> -		if (!src_pte)
> +		if (!src_pte) {
> +			addr |= last_addr_mask;
>   			continue;
> +		}
>   		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
>   		if (!dst_pte) {
>   			ret = -ENOMEM;
> @@ -4771,8 +4775,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>   		 * after taking the lock below.
>   		 */
>   		dst_entry = huge_ptep_get(dst_pte);
> -		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
> +		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry)) {
> +			addr |= last_addr_mask;
>   			continue;
> +		}
>   
>   		dst_ptl = huge_pte_lock(h, dst, dst_pte);
>   		src_ptl = huge_pte_lockptr(h, src, src_pte);
> @@ -4933,6 +4939,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>   	unsigned long sz = huge_page_size(h);
>   	struct mm_struct *mm = vma->vm_mm;
>   	unsigned long old_end = old_addr + len;
> +	unsigned long last_addr_mask;
>   	unsigned long old_addr_copy;
>   	pte_t *src_pte, *dst_pte;
>   	struct mmu_notifier_range range;
> @@ -4948,12 +4955,16 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>   	flush_cache_range(vma, range.start, range.end);
>   
>   	mmu_notifier_invalidate_range_start(&range);
> +	last_addr_mask = hugetlb_mask_last_hp(h);
>   	/* Prevent race with file truncation */
>   	i_mmap_lock_write(mapping);
>   	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
>   		src_pte = huge_pte_offset(mm, old_addr, sz);
> -		if (!src_pte)
> +		if (!src_pte) {
> +			old_addr |= last_addr_mask;
> +			new_addr |= last_addr_mask;
>   			continue;
> +		}
>   		if (huge_pte_none(huge_ptep_get(src_pte)))
>   			continue;
>   
> @@ -4998,6 +5009,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>   	struct hstate *h = hstate_vma(vma);
>   	unsigned long sz = huge_page_size(h);
>   	struct mmu_notifier_range range;
> +	unsigned long last_addr_mask;
>   	bool force_flush = false;
>   
>   	WARN_ON(!is_vm_hugetlb_page(vma));
> @@ -5018,11 +5030,14 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>   				end);
>   	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
>   	mmu_notifier_invalidate_range_start(&range);
> +	last_addr_mask = hugetlb_mask_last_hp(h);
>   	address = start;
>   	for (; address < end; address += sz) {
>   		ptep = huge_pte_offset(mm, address, sz);
> -		if (!ptep)
> +		if (!ptep) {
> +			address |= last_addr_mask;
>   			continue;
> +		}
>   
>   		ptl = huge_pte_lock(h, mm, ptep);
>   		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
> @@ -6285,6 +6300,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>   	unsigned long pages = 0, psize = huge_page_size(h);
>   	bool shared_pmd = false;
>   	struct mmu_notifier_range range;
> +	unsigned long last_addr_mask;
>   	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>   	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>   
> @@ -6301,12 +6317,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>   	flush_cache_range(vma, range.start, range.end);
>   
>   	mmu_notifier_invalidate_range_start(&range);
> +	last_addr_mask = hugetlb_mask_last_hp(h);
>   	i_mmap_lock_write(vma->vm_file->f_mapping);
>   	for (; address < end; address += psize) {
>   		spinlock_t *ptl;
>   		ptep = huge_pte_offset(mm, address, psize);
> -		if (!ptep)
> +		if (!ptep) {
> +			address |= last_addr_mask;
>   			continue;
> +		}
>   		ptl = huge_pte_lock(h, mm, ptep);
>   		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
>   			/*
> @@ -6857,6 +6876,35 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>   	return (pte_t *)pmd;
>   }
>   
> +/*
> + * Return a mask that can be used to update an address to the last huge
> + * page in a page table page mapping size.  Used to skip non-present
> + * page table entries when linearly scanning address ranges.  Architectures
> + * with unique huge page to page table relationships can define their own
> + * version of this routine.
> + */
> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
> +{
> +	unsigned long hp_size = huge_page_size(h);
> +
> +	if (hp_size == P4D_SIZE)
> +		return PGDIR_SIZE - P4D_SIZE;
> +	else if (hp_size == PUD_SIZE)
> +		return P4D_SIZE - PUD_SIZE;
> +	else if (hp_size == PMD_SIZE)
> +		return PUD_SIZE - PMD_SIZE;

Would changing this to use a 'switch' statement be more readable?

> +
> +	return ~(0);

Better to return '~(0UL)' to keep function type consistent.

> +}
> +
> +#else
> +
> +/* See description above.  Architectures can provide their own version. */
> +__weak unsigned long hugetlb_mask_last_hp(struct hstate *h)
> +{
> +	return ~(0);

Ditto.


* Re: [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare
  2022-05-27 22:58 ` [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare Mike Kravetz
@ 2022-05-30 10:14   ` Baolin Wang
  2022-05-30 15:36   ` Muchun Song
  1 sibling, 0 replies; 20+ messages in thread
From: Baolin Wang @ 2022-05-30 10:14 UTC (permalink / raw)
  To: Mike Kravetz, linux-kernel, linux-mm
  Cc: Muchun Song, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton



On 5/28/2022 6:58 AM, Mike Kravetz wrote:
> As an optimization for loops sequentially processing hugetlb address
> ranges, huge_pmd_unshare would update a passed address if it unshared a
> pmd.  Updating a loop control variable outside the loop like this is
> generally a bad idea.  These loops are now using hugetlb_mask_last_hp
> to optimize scanning when non-present ptes are discovered.  The same
> can be done when huge_pmd_unshare returns 1 indicating a pmd was
> unshared.
> 
> Remove address update from huge_pmd_unshare.  Change the passed argument
> type and update all callers.  In loops sequentially processing addresses
> use hugetlb_mask_last_hp to update address if pmd is unshared.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

LGTM. Please feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


* Re: [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare
  2022-05-27 22:58 ` [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare Mike Kravetz
  2022-05-30 10:14   ` Baolin Wang
@ 2022-05-30 15:36   ` Muchun Song
  2022-05-31 17:06     ` Mike Kravetz
  1 sibling, 1 reply; 20+ messages in thread
From: Muchun Song @ 2022-05-30 15:36 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton

On Fri, May 27, 2022 at 03:58:48PM -0700, Mike Kravetz wrote:
> As an optimization for loops sequentially processing hugetlb address
> ranges, huge_pmd_unshare would update a passed address if it unshared a
> pmd.  Updating a loop control variable outside the loop like this is
> generally a bad idea.  These loops are now using hugetlb_mask_last_hp

Totally agree.

> to optimize scanning when non-present ptes are discovered.  The same
> can be done when huge_pmd_unshare returns 1 indicating a pmd was
> unshared.
> 
> Remove address update from huge_pmd_unshare.  Change the passed argument
> type and update all callers.  In loops sequentially processing addresses
> use hugetlb_mask_last_hp to update address if pmd is unshared.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Some nits below.

> ---
>  include/linux/hugetlb.h |  4 ++--
>  mm/hugetlb.c            | 46 ++++++++++++++++++-----------------------
>  mm/rmap.c               |  4 ++--
>  3 files changed, 24 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 25078a0ea1d8..307c8f6e6752 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -196,7 +196,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz);
>  unsigned long hugetlb_mask_last_hp(struct hstate *h);
>  int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -				unsigned long *addr, pte_t *ptep);
> +				unsigned long addr, pte_t *ptep);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> @@ -243,7 +243,7 @@ static inline struct address_space *hugetlb_page_mapping_lock_write(
>  
>  static inline int huge_pmd_unshare(struct mm_struct *mm,
>  					struct vm_area_struct *vma,
> -					unsigned long *addr, pte_t *ptep)
> +					unsigned long addr, pte_t *ptep)
>  {
>  	return 0;
>  }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a2db878b2255..c7d3fbf3ec05 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4940,7 +4940,6 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long old_end = old_addr + len;
>  	unsigned long last_addr_mask;
> -	unsigned long old_addr_copy;
>  	pte_t *src_pte, *dst_pte;
>  	struct mmu_notifier_range range;
>  	bool shared_pmd = false;
> @@ -4968,14 +4967,10 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  		if (huge_pte_none(huge_ptep_get(src_pte)))
>  			continue;
>  
> -		/* old_addr arg to huge_pmd_unshare() is a pointer and so the
> -		 * arg may be modified. Pass a copy instead to preserve the
> -		 * value in old_addr.
> -		 */
> -		old_addr_copy = old_addr;
> -
> -		if (huge_pmd_unshare(mm, vma, &old_addr_copy, src_pte)) {
> +		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
>  			shared_pmd = true;
> +			old_addr |= last_addr_mask;
> +			new_addr |= last_addr_mask;
>  			continue;
>  		}
>  
> @@ -5040,10 +5035,11 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>  		}
>  
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
> +		if (huge_pmd_unshare(mm, vma, address, ptep)) {
>  			spin_unlock(ptl);
>  			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
>  			force_flush = true;
> +			address |= last_addr_mask;
>  			continue;
>  		}
>  
> @@ -6327,7 +6323,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  			continue;
>  		}
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
> +		if (huge_pmd_unshare(mm, vma, address, ptep)) {
>  			/*
>  			 * When uffd-wp is enabled on the vma, unshare
>  			 * shouldn't happen at all.  Warn about it if it
> @@ -6337,6 +6333,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  			pages++;
>  			spin_unlock(ptl);
>  			shared_pmd = true;
> +			address |= last_addr_mask;
>  			continue;
>  		}
>  		pte = huge_ptep_get(ptep);
> @@ -6760,11 +6757,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>   *	    0 the underlying pte page is not shared, or it is the last user
>   */
>  int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -					unsigned long *addr, pte_t *ptep)
> +					unsigned long addr, pte_t *ptep)
>  {
> -	pgd_t *pgd = pgd_offset(mm, *addr);
> -	p4d_t *p4d = p4d_offset(pgd, *addr);
> -	pud_t *pud = pud_offset(p4d, *addr);
> +	pgd_t *pgd = pgd_offset(mm, addr);
> +	p4d_t *p4d = p4d_offset(pgd, addr);
> +	pud_t *pud = pud_offset(p4d, addr);
>  
>  	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>  	BUG_ON(page_count(virt_to_page(ptep)) == 0);
> @@ -6774,14 +6771,6 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  	pud_clear(pud);
>  	put_page(virt_to_page(ptep));
>  	mm_dec_nr_pmds(mm);
> -	/*
> -	 * This update of passed address optimizes loops sequentially
> -	 * processing addresses in increments of huge page size (PMD_SIZE
> -	 * in this case).  By clearing the pud, a PUD_SIZE area is unmapped.
> -	 * Update address to the 'last page' in the cleared area so that
> -	 * calling loop can move to first page past this area.
> -	 */
> -	*addr |= PUD_SIZE - PMD_SIZE;
>  	return 1;
>  }
>  
> @@ -6793,7 +6782,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  }
>  
>  int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -				unsigned long *addr, pte_t *ptep)
> +				unsigned long addr, pte_t *ptep)
>  {
>  	return 0;
>  }
> @@ -6902,6 +6891,13 @@ unsigned long hugetlb_mask_last_hp(struct hstate *h)
>  /* See description above.  Architectures can provide their own version. */
>  __weak unsigned long hugetlb_mask_last_hp(struct hstate *h)
>  {
> +	unsigned long hp_size = huge_page_size(h);
> +
> +#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> +	if (hp_size == PMD_SIZE)	/* required for pmd sharing */
> +		return PUD_SIZE - PMD_SIZE;
> +#endif
> +
>  	return ~(0);

Should be ~0UL (However, this should belong to the previous patch).

>  }
>  
> @@ -7128,14 +7124,12 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>  	mmu_notifier_invalidate_range_start(&range);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
>  	for (address = start; address < end; address += PUD_SIZE) {
> -		unsigned long tmp = address;
> -
>  		ptep = huge_pte_offset(mm, address, sz);
>  		if (!ptep)
>  			continue;
>  		ptl = huge_pte_lock(h, mm, ptep);
>  		/* We don't want 'address' to be changed */

Dead comment, should be removed.

> -		huge_pmd_unshare(mm, vma, &tmp, ptep);
> +		huge_pmd_unshare(mm, vma, address, ptep);
>  		spin_unlock(ptl);
>  	}
>  	flush_hugetlb_tlb_range(vma, start, end);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5bcb334cd6f2..45b04e2e83ab 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1559,7 +1559,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				 */
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>  
> -				if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
> +				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
>  					flush_tlb_range(vma, range.start, range.end);
>  					mmu_notifier_invalidate_range(mm, range.start,
>  								      range.end);
> @@ -1923,7 +1923,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  				 */
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>  
> -				if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
> +				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
>  					flush_tlb_range(vma, range.start, range.end);
>  					mmu_notifier_invalidate_range(mm, range.start,
>  								      range.end);
> -- 
> 2.35.3
> 
> 


* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-27 22:58 ` [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present Mike Kravetz
  2022-05-30 10:10   ` Baolin Wang
@ 2022-05-30 19:56   ` Peter Xu
  2022-05-31  2:04     ` Muchun Song
  2022-05-31 17:00     ` Mike Kravetz
  1 sibling, 2 replies; 20+ messages in thread
From: Peter Xu @ 2022-05-30 19:56 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Muchun Song, Michal Hocko,
	Naoya Horiguchi, James Houghton, Mina Almasry,
	Aneesh Kumar K . V, Anshuman Khandual, Paul Walmsley,
	Christian Borntraeger, Andrew Morton

Hi, Mike,

On Fri, May 27, 2022 at 03:58:47PM -0700, Mike Kravetz wrote:
> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
> +{
> +	unsigned long hp_size = huge_page_size(h);
> +
> +	if (hp_size == P4D_SIZE)
> +		return PGDIR_SIZE - P4D_SIZE;
> +	else if (hp_size == PUD_SIZE)
> +		return P4D_SIZE - PUD_SIZE;
> +	else if (hp_size == PMD_SIZE)
> +		return PUD_SIZE - PMD_SIZE;
> +
> +	return ~(0);
> +}

How about:

unsigned long hugetlb_mask_last_hp(struct hstate *h)
{
	unsigned long hp_size = huge_page_size(h);

	return hp_size * (PTRS_PER_PTE - 1);
}

?
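
Worked out for the PMD case, assuming the usual relation between the level
sizes (PUD_SIZE == PTRS_PER_PMD * PMD_SIZE) and that PTRS_PER_PTE equals
the fan-out at each level:

	PMD_SIZE * (PTRS_PER_PTE - 1)
		== PTRS_PER_PMD * PMD_SIZE - PMD_SIZE
		== PUD_SIZE - PMD_SIZE

which matches the PMD branch of the if/else chain above.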

This is definitely a good idea, though I'm wondering the possibility to go
one step further to make hugetlb pgtable walk just like the normal pages.

Say, would it be non-trivial to bring some of huge_pte_offset() into the
walker functions, so that we can jump over even larger than PTRS_PER_PTE
entries (e.g. when p4d==NULL for 2m huge pages)?  It's very possible I
overlooked something, though.

Thanks,

-- 
Peter Xu



* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-30 19:56   ` Peter Xu
@ 2022-05-31  2:04     ` Muchun Song
  2022-05-31 17:05       ` Mike Kravetz
  2022-05-31 17:00     ` Mike Kravetz
  1 sibling, 1 reply; 20+ messages in thread
From: Muchun Song @ 2022-05-31  2:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, linux-kernel, linux-mm, Michal Hocko,
	Naoya Horiguchi, James Houghton, Mina Almasry,
	Aneesh Kumar K . V, Anshuman Khandual, Paul Walmsley,
	Christian Borntraeger, Andrew Morton

On Mon, May 30, 2022 at 03:56:43PM -0400, Peter Xu wrote:
> Hi, Mike,
> 
> On Fri, May 27, 2022 at 03:58:47PM -0700, Mike Kravetz wrote:
> > +unsigned long hugetlb_mask_last_hp(struct hstate *h)
> > +{
> > +	unsigned long hp_size = huge_page_size(h);
> > +
> > +	if (hp_size == P4D_SIZE)
> > +		return PGDIR_SIZE - P4D_SIZE;
> > +	else if (hp_size == PUD_SIZE)
> > +		return P4D_SIZE - PUD_SIZE;
> > +	else if (hp_size == PMD_SIZE)
> > +		return PUD_SIZE - PMD_SIZE;
> > +
> > +	return ~(0);
> > +}
> 
> How about:
> 
> unsigned long hugetlb_mask_last_hp(struct hstate *h)
> {
> 	unsigned long hp_size = huge_page_size(h);
> 
> 	return hp_size * (PTRS_PER_PTE - 1);
> }
>

+1
 
> ?
> 
> This is definitely a good idea, though I'm wondering the possibility to go
> one step further to make hugetlb pgtable walk just like the normal pages.
> 
> Say, would it be non-trivial to bring some of huge_pte_offset() into the
> walker functions, so that we can jump over even larger than PTRS_PER_PTE
> entries (e.g. when p4d==NULL for 2m huge pages)?  It's very possible I
> overlooked something, though.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
> 


* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-30 10:10   ` Baolin Wang
@ 2022-05-31 16:56     ` Mike Kravetz
  2022-06-15 21:22     ` Mike Kravetz
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Kravetz @ 2022-05-31 16:56 UTC (permalink / raw)
  To: Baolin Wang, linux-kernel, linux-mm
  Cc: Muchun Song, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton

On 5/30/22 03:10, Baolin Wang wrote:
> On 5/28/2022 6:58 AM, Mike Kravetz wrote:
> 
> I tested on my ARM64 machine by implementing an arm64-specific hugetlb_mask_last_hp() as below, and it works well.
> 
> Just a few nits inline, otherwise looks good to me.
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index d93ba128a2b0..e04a097ffcc4 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -376,6 +376,28 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>         return NULL;
>  }
> 
> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
> +{
> +       unsigned long hp_size = huge_page_size(h);
> +
> +       switch (hp_size) {
> +       case P4D_SIZE:
> +               return PGDIR_SIZE - P4D_SIZE;
> +       case PUD_SIZE:
> +               return P4D_SIZE - PUD_SIZE;
> +       case CONT_PMD_SIZE:
> +               return PUD_SIZE - CONT_PMD_SIZE;
> +       case PMD_SIZE:
> +               return PUD_SIZE - PMD_SIZE;
> +       case CONT_PTE_SIZE:
> +               return PMD_SIZE - CONT_PTE_SIZE;
> +       default:
> +               break;
> +       }
> +
> +       return ~(0UL);
> +}
> 

Thanks!  I was hesitant to put something together for those CONT_* sizes.

>>   +/*
>> + * Return a mask that can be used to update an address to the last huge
>> + * page in a page table page mapping size.  Used to skip non-present
>> + * page table entries when linearly scanning address ranges.  Architectures
>> + * with unique huge page to page table relationships can define their own
>> + * version of this routine.
>> + */
>> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
>> +{
>> +    unsigned long hp_size = huge_page_size(h);
>> +
>> +    if (hp_size == P4D_SIZE)
>> +        return PGDIR_SIZE - P4D_SIZE;
>> +    else if (hp_size == PUD_SIZE)
>> +        return P4D_SIZE - PUD_SIZE;
>> +    else if (hp_size == PMD_SIZE)
>> +        return PUD_SIZE - PMD_SIZE;
> 
> Would changing this to use a 'switch' statement be more readable?

Agree.  Or, I might just go with Peter's simplification.
> 
>> +
>> +    return ~(0);
> 
> Better to return '~(0UL)' to keep function type consistent.

Yes, thanks!
-- 
Mike Kravetz


* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-30 19:56   ` Peter Xu
  2022-05-31  2:04     ` Muchun Song
@ 2022-05-31 17:00     ` Mike Kravetz
  2022-06-15 17:27       ` Mike Kravetz
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Kravetz @ 2022-05-31 17:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Muchun Song, Michal Hocko,
	Naoya Horiguchi, James Houghton, Mina Almasry,
	Aneesh Kumar K . V, Anshuman Khandual, Paul Walmsley,
	Christian Borntraeger, Andrew Morton

On 5/30/22 12:56, Peter Xu wrote:
> Hi, Mike,
> 
> On Fri, May 27, 2022 at 03:58:47PM -0700, Mike Kravetz wrote:
>> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
>> +{
>> +	unsigned long hp_size = huge_page_size(h);
>> +
>> +	if (hp_size == P4D_SIZE)
>> +		return PGDIR_SIZE - P4D_SIZE;
>> +	else if (hp_size == PUD_SIZE)
>> +		return P4D_SIZE - PUD_SIZE;
>> +	else if (hp_size == PMD_SIZE)
>> +		return PUD_SIZE - PMD_SIZE;
>> +
>> +	return ~(0);
>> +}
> 
> How about:
> 
> unsigned long hugetlb_mask_last_hp(struct hstate *h)
> {
> 	unsigned long hp_size = huge_page_size(h);
> 
> 	return hp_size * (PTRS_PER_PTE - 1);
> }
> 
> ?
> 
> This is definitely a good idea, though I'm wondering the possibility to go
> one step further to make hugetlb pgtable walk just like the normal pages.
> 
> Say, would it be non-trivial to bring some of huge_pte_offset() into the
> walker functions, so that we can jump over even larger than PTRS_PER_PTE
> entries (e.g. when p4d==NULL for 2m huge pages)?  It's very possible I
> overlooked something, though.

Thanks Peter!

I did think of that as well.  But, I mostly wanted to throw out this simple
code while the idea of optimizations for sparse address range traversing was
fresh in my mind.

I'll take a closer look and see if we can use those general walker routines.
If we can, it would be great.
-- 
Mike Kravetz


* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-31  2:04     ` Muchun Song
@ 2022-05-31 17:05       ` Mike Kravetz
  2022-06-01  6:58         ` Anshuman Khandual
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Kravetz @ 2022-05-31 17:05 UTC (permalink / raw)
  To: Muchun Song, Peter Xu
  Cc: linux-kernel, linux-mm, Michal Hocko, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton

On 5/30/22 19:04, Muchun Song wrote:
> On Mon, May 30, 2022 at 03:56:43PM -0400, Peter Xu wrote:
>> Hi, Mike,
>>
>> On Fri, May 27, 2022 at 03:58:47PM -0700, Mike Kravetz wrote:
>>> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
>>> +{
>>> +	unsigned long hp_size = huge_page_size(h);
>>> +
>>> +	if (hp_size == P4D_SIZE)
>>> +		return PGDIR_SIZE - P4D_SIZE;
>>> +	else if (hp_size == PUD_SIZE)
>>> +		return P4D_SIZE - PUD_SIZE;
>>> +	else if (hp_size == PMD_SIZE)
>>> +		return PUD_SIZE - PMD_SIZE;
>>> +
>>> +	return ~(0);
>>> +}
>>
>> How about:
>>
>> unsigned long hugetlb_mask_last_hp(struct hstate *h)
>> {
>> 	unsigned long hp_size = huge_page_size(h);
>>
>> 	return hp_size * (PTRS_PER_PTE - 1);
>> }
>>
> 
> +1
>  


I like this as well, but I wonder if we should put something like the
following in just to be safe.

BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_PMD);
BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_PUD);
BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_P4D);
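
For instance (sketch only, folding those checks into the one-liner
suggested above):

unsigned long hugetlb_mask_last_hp(struct hstate *h)
{
	unsigned long hp_size = huge_page_size(h);

	/* the one-liner assumes the same fan-out at every level */
	BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_PMD);
	BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_PUD);
	BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_P4D);

	return hp_size * (PTRS_PER_PTE - 1);
}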

-- 
Mike Kravetz


* Re: [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare
  2022-05-30 15:36   ` Muchun Song
@ 2022-05-31 17:06     ` Mike Kravetz
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Kravetz @ 2022-05-31 17:06 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-kernel, linux-mm, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton

On 5/30/22 08:36, Muchun Song wrote:
> On Fri, May 27, 2022 at 03:58:48PM -0700, Mike Kravetz wrote:
>> As an optimization for loops sequentially processing hugetlb address
>> ranges, huge_pmd_unshare would update a passed address if it unshared a
>> pmd.  Updating a loop control variable outside the loop like this is
>> generally a bad idea.  These loops are now using hugetlb_mask_last_hp
> 
> Totally agree.
> 
>> to optimize scanning when non-present ptes are discovered.  The same
>> can be done when huge_pmd_unshare returns 1 indicating a pmd was
>> unshared.
>>
>> Remove address update from huge_pmd_unshare.  Change the passed argument
>> type and update all callers.  In loops sequentially processing addresses
>> use hugetlb_mask_last_hp to update address if pmd is unshared.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Acked-by: Muchun Song <songmuchun@bytedance.com>
> 
> Some nits below.
> 
>>  		ptep = huge_pte_offset(mm, address, sz);
>>  		if (!ptep)
>>  			continue;
>>  		ptl = huge_pte_lock(h, mm, ptep);
>>  		/* We don't want 'address' to be changed */
> 
> Dead comment, should be removed.

Thanks!  I missed that.

-- 
Mike Kravetz


* Re: [RFC PATCH 3/3] hugetlb: Lazy page table copies in fork()
  2022-05-27 22:58 ` [RFC PATCH 3/3] hugetlb: Lazy page table copies in fork() Mike Kravetz
@ 2022-05-31 17:25   ` David Hildenbrand
  2022-06-01  5:20   ` Muchun Song
  1 sibling, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-05-31 17:25 UTC (permalink / raw)
  To: Mike Kravetz, linux-kernel, linux-mm
  Cc: Muchun Song, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton

On 28.05.22 00:58, Mike Kravetz wrote:
> Lazy page table copying at fork time was introduced with commit
> d992895ba2b2 ("Lazy page table copies in fork()").  At the time,
> hugetlb was very new and did not support page faulting.  As a result,
> it was excluded.  When full page fault support was added for hugetlb,
> the exclusion was not removed.
> 
> Simply remove the check that prevents lazy copying of hugetlb page
> tables at fork.  Of course, like other mappings this only applies to
> shared mappings.
> 
> Lazy page table copying at fork will be less advantageous for hugetlb
> mappings because:
> - There are fewer page table entries with hugetlb
> - hugetlb pmds can be shared instead of copied
> 
> In any case, completely eliminating the copy at fork time should speed
> things up.

+ make hugetlb less special, at least a tiny little bit


Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [RFC PATCH 3/3] hugetlb: Lazy page table copies in fork()
  2022-05-27 22:58 ` [RFC PATCH 3/3] hugetlb: Lazy page table copies in fork() Mike Kravetz
  2022-05-31 17:25   ` David Hildenbrand
@ 2022-06-01  5:20   ` Muchun Song
  1 sibling, 0 replies; 20+ messages in thread
From: Muchun Song @ 2022-06-01  5:20 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Michal Hocko, Peter Xu, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V,
	Anshuman Khandual, Paul Walmsley, Christian Borntraeger,
	Andrew Morton

On Fri, May 27, 2022 at 03:58:49PM -0700, Mike Kravetz wrote:
> Lazy page table copying at fork time was introduced with commit
> d992895ba2b2 ("Lazy page table copies in fork()").  At the time,
> hugetlb was very new and did not support page faulting.  As a result,
> it was excluded.  When full page fault support was added for hugetlb,
> the exclusion was not removed.
> 
> Simply remove the check that prevents lazy copying of hugetlb page
> tables at fork.  Of course, like other mappings this only applies to
> shared mappings.
> 
> Lazy page table copying at fork will be less advantageous for hugetlb
> mappings because:
> - There are fewer page table entries with hugetlb
> - hugetlb pmds can be shared instead of copied
> 
> In any case, completely eliminating the copy at fork time should speed
> things up.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.



* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-31 17:05       ` Mike Kravetz
@ 2022-06-01  6:58         ` Anshuman Khandual
  0 siblings, 0 replies; 20+ messages in thread
From: Anshuman Khandual @ 2022-06-01  6:58 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song, Peter Xu
  Cc: linux-kernel, linux-mm, Michal Hocko, Naoya Horiguchi,
	James Houghton, Mina Almasry, Aneesh Kumar K . V, Paul Walmsley,
	Christian Borntraeger, Andrew Morton



On 5/31/22 22:35, Mike Kravetz wrote:
> On 5/30/22 19:04, Muchun Song wrote:
>> On Mon, May 30, 2022 at 03:56:43PM -0400, Peter Xu wrote:
>>> Hi, Mike,
>>>
>>> On Fri, May 27, 2022 at 03:58:47PM -0700, Mike Kravetz wrote:
>>>> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
>>>> +{
>>>> +	unsigned long hp_size = huge_page_size(h);
>>>> +
>>>> +	if (hp_size == P4D_SIZE)
>>>> +		return PGDIR_SIZE - P4D_SIZE;
>>>> +	else if (hp_size == PUD_SIZE)
>>>> +		return P4D_SIZE - PUD_SIZE;
>>>> +	else if (hp_size == PMD_SIZE)
>>>> +		return PUD_SIZE - PMD_SIZE;
>>>> +
>>>> +	return ~(0);
>>>> +}
>>>
>>> How about:
>>>
>>> unsigned long hugetlb_mask_last_hp(struct hstate *h)
>>> {
>>> 	unsigned long hp_size = huge_page_size(h);
>>>
>>> 	return hp_size * (PTRS_PER_PTE - 1);
>>> }
>>>
>>
>> +1
>>  
> 
> 
> I like this as well, but I wonder if we should put something like the
> following in just to be safe.
> 
> BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_PMD);
> BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_PUD);
> BUILD_BUG_ON(PTRS_PER_PTE != PTRS_PER_P4D);
Exactly. Assuming that PTRS_PER_PTE entries are present per page table page for
all supported HugeTLB levels might be a bit risky, particularly for the top level.
Hence, rather than adding all these BUILD_BUG_ON() checks just so the standard
PTRS_PER_PTE entry count can be used, it might be better to stick with the
original check, either via 'if else' as proposed or, better, with a switch case.


* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-31 17:00     ` Mike Kravetz
@ 2022-06-15 17:27       ` Mike Kravetz
  2022-06-15 17:51         ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Kravetz @ 2022-06-15 17:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Muchun Song, Michal Hocko,
	Naoya Horiguchi, James Houghton, Mina Almasry,
	Aneesh Kumar K . V, Anshuman Khandual, Paul Walmsley,
	Christian Borntraeger, Andrew Morton

On 05/31/22 10:00, Mike Kravetz wrote:
> On 5/30/22 12:56, Peter Xu wrote:
> > Hi, Mike,
> > 
> > On Fri, May 27, 2022 at 03:58:47PM -0700, Mike Kravetz wrote:
> >> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
> >> +{
> >> +	unsigned long hp_size = huge_page_size(h);
> >> +
> >> +	if (hp_size == P4D_SIZE)
> >> +		return PGDIR_SIZE - P4D_SIZE;
> >> +	else if (hp_size == PUD_SIZE)
> >> +		return P4D_SIZE - PUD_SIZE;
> >> +	else if (hp_size == PMD_SIZE)
> >> +		return PUD_SIZE - PMD_SIZE;
> >> +
> >> +	return ~(0);
> >> +}
> > 
> > How about:
> > 
> > unsigned long hugetlb_mask_last_hp(struct hstate *h)
> > {
> > 	unsigned long hp_size = huge_page_size(h);
> > 
> > 	return hp_size * (PTRS_PER_PTE - 1);
> > }
> > 
> > ?

As mentioned in a followup e-mail, I am a little worried about this
calculation not being accurate for all configurations.  Today,
PTRS_PER_PTE == PTRS_PER_PMD == PTRS_PER_PUD == PTRS_PER_P4D on all
architectures that select CONFIG_ARCH_WANT_GENERAL_HUGETLB.  However, if we
code things as above and that ever changes, the resulting bug might be hard
to find.

In the next version, I will keep the per-level calculation as above, but move
to a switch statement for better readability.

> > 
> > This is definitely a good idea, though I'm wondering the possibility to go
> > one step further to make hugetlb pgtable walk just like the normal pages.
> > 
> > Say, would it be non-trivial to bring some of huge_pte_offset() into the
> > walker functions, so that we can jump over even larger than PTRS_PER_PTE
> > entries (e.g. when p4d==NULL for 2m huge pages)?  It's very possible I
> > overlooked something, though.

I briefly looked at this.  To make it work, the walker zapping functions
such as zap_*_range would need an 'is_vm_hugetlb_page(vma)' check and, if
true, use hugetlb specific page table routines instead of the generic
routines.
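
(Purely for illustration, the shape such a branch would take is sketched
below; the helper name is made up and none of this comes from an actual
patch.)

static void zap_range_example(struct mmu_gather *tlb,
			      struct vm_area_struct *vma,
			      unsigned long addr, unsigned long end)
{
	if (is_vm_hugetlb_page(vma)) {
		/*
		 * hugetlb path: operate on huge entries directly, e.g.
		 * via huge_pte_offset()/huge_ptep_get_and_clear(),
		 * honouring the hstate page size and pmd sharing.
		 */
	} else {
		/*
		 * Existing generic path: descend level by level and
		 * walk the individual pte entries.
		 */
	}
}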

In many cases, the hugetlb specific page table routines are the same as
the generic routines.  But, there are a few exceptions.  IMO, it would
be better to first try to clean up and unify those routines.  That would
make changes to the walker routines less invasive and easier to
maintain.  I believe there is other code that would benefit from such a
cleanup.  Unless there are strong objections, I suggest we move forward
with the optimization here and move the cleanup and possible walker
changes to a later series.
-- 
Mike Kravetz


* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-06-15 17:27       ` Mike Kravetz
@ 2022-06-15 17:51         ` Peter Xu
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Xu @ 2022-06-15 17:51 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Muchun Song, Michal Hocko,
	Naoya Horiguchi, James Houghton, Mina Almasry,
	Aneesh Kumar K . V, Anshuman Khandual, Paul Walmsley,
	Christian Borntraeger, Andrew Morton

On Wed, Jun 15, 2022 at 10:27:25AM -0700, Mike Kravetz wrote:
> In many cases, the hugetlb specific page table routines are the same as
> the generic routines.  But, there are a few exceptions.  IMO, it would
> be better to first try to cleanup and unify those routines.  That would
> make changes to the walker routines less invasive and easier to
> maintain.  I believe is other code that would benefit from such a
> cleanup.  Unless there are strong objections, I suggest we move forward
> with the optimization here and move the cleanup and possible walker
> changes to a later series.

No objection on my side, thanks Mike for trying to look into it.

-- 
Peter Xu



* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-05-30 10:10   ` Baolin Wang
  2022-05-31 16:56     ` Mike Kravetz
@ 2022-06-15 21:22     ` Mike Kravetz
  2022-06-16  3:48       ` Baolin Wang
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Kravetz @ 2022-06-15 21:22 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-kernel, linux-mm, Muchun Song, Michal Hocko, Peter Xu,
	Naoya Horiguchi, James Houghton, Mina Almasry,
	Aneesh Kumar K . V, Anshuman Khandual, Paul Walmsley,
	Christian Borntraeger, Andrew Morton

On 05/30/22 18:10, Baolin Wang wrote:
> 
> 
> On 5/28/2022 6:58 AM, Mike Kravetz wrote:
> > HugeTLB address ranges are linearly scanned during fork, unmap and
> > remap operations.  If a non-present entry is encountered, the code
> > currently continues to the next huge page aligned address.  However,
> > a non-present entry implies that the page table page for that entry
> > is not present.  Therefore, the linear scan can skip to the end of
> > range mapped by the page table page.  This can speed operations on
> > large sparsely populated hugetlb mappings.
> > 
> > Create a new routine hugetlb_mask_last_hp() that will return an
> > address mask.  When the mask is ORed with an address, the result
> > will be the address of the last huge page mapped by the associated
> > page table page.  Use this mask to update addresses in routines which
> > linearly scan hugetlb address ranges when a non-present pte is
> > encountered.
> > 
> > hugetlb_mask_last_hp is related to the implementation of huge_pte_offset
> > as hugetlb_mask_last_hp is called when huge_pte_offset returns NULL.
> > This patch only provides a complete hugetlb_mask_last_hp implementation
> > when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined.  Architectures which
> > provide their own versions of huge_pte_offset can also provide their own
> > version of hugetlb_mask_last_hp.
> 
> I tested on my ARM64 machine after implementing the arm64 specific
> hugetlb_mask_last_hp() below, and it works well.
> 
> Just a few nits inline, otherwise looks good to me.
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index d93ba128a2b0..e04a097ffcc4 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -376,6 +376,28 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>         return NULL;
>  }
> 
> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
> +{
> +       unsigned long hp_size = huge_page_size(h);
> +
> +       switch (hp_size) {
> +       case P4D_SIZE:
> +               return PGDIR_SIZE - P4D_SIZE;
> +       case PUD_SIZE:
> +               return P4D_SIZE - PUD_SIZE;
> +       case CONT_PMD_SIZE:
> +               return PUD_SIZE - CONT_PMD_SIZE;
> +       case PMD_SIZE:
> +               return PUD_SIZE - PMD_SIZE;
> +       case CONT_PTE_SIZE:
> +               return PMD_SIZE - CONT_PTE_SIZE;
> +       default:
> +               break;
> +       }
> +
> +       return ~(0UL);
> +}

Hello Baolin,

Would you mind sending this as a proper patch with commit message and
'Signed-off-by:'?  I would like to include it in the upcoming patch series.

-- 
Mike Kravetz
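
To connect the implementation quoted above with its callers, here is a rough
sketch of how a linear scan consumes the returned mask.  The loop shape is
simplified and the variable names are illustrative, not taken verbatim from
the series.

	unsigned long sz = huge_page_size(h);
	unsigned long last_addr_mask = hugetlb_mask_last_hp(h);
	unsigned long address;
	pte_t *ptep;

	for (address = start; address < end; address += sz) {
		ptep = huge_pte_offset(mm, address, sz);
		if (!ptep) {
			/*
			 * No page table page maps this address: OR-ing in
			 * the mask jumps to the last huge page that page
			 * table page would have covered, and the next
			 * loop increment moves past it.
			 */
			address |= last_addr_mask;
			continue;
		}
		/* ... operate on the present entry ... */
	}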


* Re: [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present
  2022-06-15 21:22     ` Mike Kravetz
@ 2022-06-16  3:48       ` Baolin Wang
  0 siblings, 0 replies; 20+ messages in thread
From: Baolin Wang @ 2022-06-16  3:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Muchun Song, Michal Hocko, Peter Xu,
	Naoya Horiguchi, James Houghton, Mina Almasry,
	Aneesh Kumar K . V, Anshuman Khandual, Paul Walmsley,
	Christian Borntraeger, Andrew Morton



On 6/16/2022 5:22 AM, Mike Kravetz wrote:
> On 05/30/22 18:10, Baolin Wang wrote:
>>
>>
>> On 5/28/2022 6:58 AM, Mike Kravetz wrote:
>>> HugeTLB address ranges are linearly scanned during fork, unmap and
>>> remap operations.  If a non-present entry is encountered, the code
>>> currently continues to the next huge page aligned address.  However,
>>> a non-present entry implies that the page table page for that entry
>>> is not present.  Therefore, the linear scan can skip to the end of
>>> range mapped by the page table page.  This can speed operations on
>>> large sparsely populated hugetlb mappings.
>>>
>>> Create a new routine hugetlb_mask_last_hp() that will return an
>>> address mask.  When the mask is ORed with an address, the result
>>> will be the address of the last huge page mapped by the associated
>>> page table page.  Use this mask to update addresses in routines which
>>> linearly scan hugetlb address ranges when a non-present pte is
>>> encountered.
>>>
>>> hugetlb_mask_last_hp is related to the implementation of huge_pte_offset
>>> as hugetlb_mask_last_hp is called when huge_pte_offset returns NULL.
>>> This patch only provides a complete hugetlb_mask_last_hp implementation
>>> when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined.  Architectures which
>>> provide their own versions of huge_pte_offset can also provide their own
>>> version of hugetlb_mask_last_hp.
>>
>> I tested on my ARM64 machine after implementing the arm64 specific
>> hugetlb_mask_last_hp() below, and it works well.
>>
>> Just a few nits inline, otherwise looks good to me.
>> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index d93ba128a2b0..e04a097ffcc4 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -376,6 +376,28 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>>          return NULL;
>>   }
>>
>> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
>> +{
>> +       unsigned long hp_size = huge_page_size(h);
>> +
>> +       switch (hp_size) {
>> +       case P4D_SIZE:
>> +               return PGDIR_SIZE - P4D_SIZE;
>> +       case PUD_SIZE:
>> +               return P4D_SIZE - PUD_SIZE;
>> +       case CONT_PMD_SIZE:
>> +               return PUD_SIZE - CONT_PMD_SIZE;
>> +       case PMD_SIZE:
>> +               return PUD_SIZE - PMD_SIZE;
>> +       case CONT_PTE_SIZE:
>> +               return PMD_SIZE - CONT_PTE_SIZE;
>> +       default:
>> +               break;
>> +       }
>> +
>> +       return ~(0UL);
>> +}
> 
> Hello Baolin,
> 
> Would you mind sending this as a proper patch with commit message and
> 'Signed-off-by:'?  I would like to include it in the upcoming patch series.

Sure. I've sent it out [1], and please fold it into your series. Thanks.

[1] https://lore.kernel.org/all/7256dbe078d7231f45b0f47c2c52a3bd3aa10da7.1655350193.git.baolin.wang@linux.alibaba.com/


Thread overview: 20+ messages
2022-05-27 22:58 [RFC PATCH 0/3] hugetlb: speed up linear address scanning Mike Kravetz
2022-05-27 22:58 ` [RFC PATCH 1/3] hugetlb: skip to end of PT page mapping when pte not present Mike Kravetz
2022-05-30 10:10   ` Baolin Wang
2022-05-31 16:56     ` Mike Kravetz
2022-06-15 21:22     ` Mike Kravetz
2022-06-16  3:48       ` Baolin Wang
2022-05-30 19:56   ` Peter Xu
2022-05-31  2:04     ` Muchun Song
2022-05-31 17:05       ` Mike Kravetz
2022-06-01  6:58         ` Anshuman Khandual
2022-05-31 17:00     ` Mike Kravetz
2022-06-15 17:27       ` Mike Kravetz
2022-06-15 17:51         ` Peter Xu
2022-05-27 22:58 ` [RFC PATCH 2/3] hugetlb: do not update address in huge_pmd_unshare Mike Kravetz
2022-05-30 10:14   ` Baolin Wang
2022-05-30 15:36   ` Muchun Song
2022-05-31 17:06     ` Mike Kravetz
2022-05-27 22:58 ` [RFC PATCH 3/3] hugetlb: Lazy page table copies in fork() Mike Kravetz
2022-05-31 17:25   ` David Hildenbrand
2022-06-01  5:20   ` Muchun Song
