linux-mm.kvack.org archive mirror
* [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
@ 2022-04-06 20:48 Mike Kravetz
  2022-04-06 20:48 ` [RFC PATCH 1/5] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-04-06 20:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton, Mike Kravetz

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

To illustrate the regression, I created a simple program that does the
following in an infinite loop:
- mmap a 4GB hugetlb file (size ensures pmd sharing)
- fault in all pages
- unmap the hugetlb file

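For reference, a minimal sketch of such a test program follows.  The
hugetlbfs file name, page size, and error handling are illustrative
assumptions rather than the exact program used; all instances open the
same file so that pmd sharing between them is possible.

	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define MAP_SIZE	(4UL << 30)	/* 4GB, large enough for pmd sharing */
	#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MB default hugetlb page size */

	int main(void)
	{
		/* hypothetical file on an existing hugetlbfs mount */
		int fd = open("/dev/hugepages/pmd_share_test", O_CREAT | O_RDWR, 0644);

		if (fd < 0 || ftruncate(fd, MAP_SIZE))
			exit(1);

		for (;;) {
			unsigned long off;
			char *addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
					  MAP_SHARED, fd, 0);

			if (addr == MAP_FAILED)
				exit(1);
			/* touch one byte per huge page to fault everything in */
			for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
				addr[off] = 1;
			munmap(addr, MAP_SIZE);
		}
		return 0;
	}
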
The hugetlb fault code was then instrumented to collect the number of
times the fault mutex was locked and the time spent waiting for it.
Samples are from 10 second
intervals on a 4 CPU VM with 8GB memory.  Eight instances of the
map/fault/unmap program are running.

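The instrumentation was conceptually along the following lines.  This is
only a sketch; the counter names, the use of mutex_trylock() to detect
contention, and the reporting of totals are my assumptions and not the
actual debug patch that produced the output below.

	/* debug counters, reported and reset every 10 seconds */
	static atomic64_t wd_faults;	/* faults processed */
	static atomic64_t wd_waits;	/* fault mutex acquisitions that had to wait */
	static atomic64_t wd_wait_ns;	/* total time spent waiting for the mutex */

	/* in hugetlb_fault(), replacing the plain mutex_lock() */
	if (!mutex_trylock(&hugetlb_fault_mutex_table[hash])) {
		u64 start = ktime_get_ns();

		atomic64_inc(&wd_waits);
		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		atomic64_add(ktime_get_ns() - start, &wd_wait_ns);
	}
	atomic64_inc(&wd_faults);
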
v5.17
-----
[  708.763114] Wait_debug: faults sec  3622
[  708.764010]             num faults  36220
[  708.765016]             num waits   36220
[  708.766054]             intvl wait time 54074 msecs
[  708.767287]             max_wait_time   31000 usecs


v5.17 + this series (similar to v5.6)
-------------------------------------
[  282.191391] Wait_debug: faults sec  1777939
[  282.192571]             num faults  17779393
[  282.193746]             num locks   5517
[  282.194858]             intvl wait time 19907 msecs
[  282.196226]             max_wait_time   43000 usecs

As can be seen, fault time suffers when there are other operations
taking i_mmap_rwsem in write mode such as unmap.

This series proposes reverting c0d0381ade79 and 87bf91d39bb5 which
depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
fault path back to huge_pmd_share where it is only taken when necessary.
After reverting these patches, we still need to handle:
fault and truncate races
- Catch and properly backout faults beyond i_size
  Backing out reservations is much easier after 846be08578ed expanded
  restore_reserve_on_error functionality.
unshare and fault/lookup races
- Since the pointer returned from huge_pte_offset or huge_pte_alloc may
  become invalid at any time until we lock the page table, we must
  revalidate after taking the lock.  Code paths must back out and
  possibly retry if the page table pointer changes.

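The revalidation pattern used in the fault paths is roughly the
following (a simplified fragment of the check added to hugetlb_no_page()
in patch 5, using the same variable names as that function):

	ptl = huge_pte_lock(h, mm, ptep);

	/* re-walk the page table now that the lock is held */
	ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (ptep2 != ptep)
		goto backout;	/* pmd was unshared underneath us */
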
The commit message in patch 5 suggests that it is not safe to use
SPLIT_PMD_PTLOCKS for hugetlb mappings if sharing is possible.  If
others confirm/agree then there will need to be additional work.

Please help with comments or suggestions.  I would like to come up with
something that is performant and safe.

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/

Mike Kravetz (5):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlb: Check for pmd unshare and fault/lookup races

 fs/hugetlbfs/inode.c    |  84 ++++++++------------
 include/linux/hugetlb.h |   3 +-
 mm/hugetlb.c            | 169 +++++++++++++++++++---------------------
 mm/rmap.c               |  14 +---
 mm/userfaultfd.c        |  11 +--
 5 files changed, 118 insertions(+), 163 deletions(-)

-- 
2.35.1




* [RFC PATCH 1/5] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
@ 2022-04-06 20:48 ` Mike Kravetz
  2022-04-06 20:48 ` [RFC PATCH 2/5] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-04-06 20:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton, Mike Kravetz

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues, so that code will be reverted.  Since the
use of i_mmap_rwsem to address page fault/truncate races depends on that
code, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 30 +++++++++---------------------
 mm/hugetlb.c         | 23 ++++++++++++-----------
 2 files changed, 21 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a7c6c7498be0..e50de48c7707 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -450,9 +450,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
  *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() holds i_mmap_rwsem and prevents
- *	page faults in the truncated range by checking i_size.  i_size is
- *	modified while holding i_mmap_rwsem.
+ *	in this routine.  hugetlb_no_page() prevents page faults in the
+ *	truncated range.  It checks i_size before allocation, and again after
+ *	with the page table lock for the page held.  The same lock must be
+ *	acquired to unmap a page.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -488,16 +489,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash = 0;
 
 			index = page->index;
-			if (!truncate_op) {
-				/*
-				 * Only need to hold the fault mutex in the
-				 * hole punch case.  This prevents races with
-				 * page faults.  Races are not possible in the
-				 * case of truncation.
-				 */
-				hash = hugetlb_fault_mutex_hash(mapping, index);
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
-			}
+			hash = hugetlb_fault_mutex_hash(mapping, index);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
 			 * If page is mapped, it was faulted in after being
@@ -540,8 +533,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			unlock_page(page);
-			if (!truncate_op)
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
@@ -579,8 +571,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
-	i_mmap_lock_write(mapping);
 	i_size_write(inode, offset);
+	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
@@ -700,11 +692,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/*
-		 * fault mutex taken here, protects against fault path
-		 * and hole punch.  inode_lock previously taken protects
-		 * against truncation.
-		 */
+		/* mutex taken here, fault path and hole punch */
 		hash = hugetlb_fault_mutex_hash(mapping, index);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f294db835f4b..398b7742cc63 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5401,18 +5401,17 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	/*
-	 * We can not race with truncation due to holding i_mmap_rwsem.
-	 * i_size is modified when holding i_mmap_rwsem, so check here
-	 * once for faults beyond end of file.
+	 * Use page lock to guard against racing truncation
+	 * before we get page_table_lock.
 	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto out;
-
 retry:
 	new_page = false;
 	page = find_lock_page(mapping, idx);
 	if (!page) {
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
+		if (idx >= size)
+			goto out;
+
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
@@ -5502,6 +5501,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
+	if (idx >= size)
+		goto backout;
+
 	ret = 0;
 	if (!huge_pte_none(huge_ptep_get(ptep)))
 		goto backout;
@@ -5603,10 +5606,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This serves two purposes:
-	 * 1) It prevents huge_pmd_unshare from being called elsewhere
-	 *    and making the ptep no longer valid.
-	 * 2) It synchronizes us with i_size modifications during truncation.
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
 	 *
 	 * ptep could have already be assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless
-- 
2.35.1




* [RFC PATCH 2/5] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
  2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
  2022-04-06 20:48 ` [RFC PATCH 1/5] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
@ 2022-04-06 20:48 ` Mike Kravetz
  2022-04-06 20:48 ` [RFC PATCH 3/5] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-04-06 20:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton, Mike Kravetz

Commit c0d0381ade79 added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues.  Revert the code and go back to the method
of only taking the semaphore in huge_pmd_share.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.

In a subsequent patch, code will be added to detect when a pmd was
'unshared' during fault processing and deal with that.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c |  2 --
 mm/hugetlb.c         | 76 +++++++-------------------------------------
 mm/rmap.c            | 14 +-------
 mm/userfaultfd.c     | 11 ++-----
 4 files changed, 15 insertions(+), 88 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e50de48c7707..56cd75b6cfc0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -504,9 +504,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			if (unlikely(page_mapped(page))) {
 				BUG_ON(truncate_op);
 
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 				i_mmap_lock_write(mapping);
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
 					(index + 1) * pages_per_huge_page(h));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 398b7742cc63..8fa2386bf7c0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4701,7 +4701,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	unsigned long npages = pages_per_huge_page(h);
-	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
 
@@ -4710,14 +4709,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 					vma->vm_start,
 					vma->vm_end);
 		mmu_notifier_invalidate_range_start(&range);
-	} else {
-		/*
-		 * For shared mappings i_mmap_rwsem must be held to call
-		 * huge_pte_alloc, otherwise the returned ptep could go
-		 * away if part of a shared pmd and another thread calls
-		 * huge_pmd_unshare.
-		 */
-		i_mmap_lock_read(mapping);
 	}
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
@@ -4844,8 +4835,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 	if (cow)
 		mmu_notifier_invalidate_range_end(&range);
-	else
-		i_mmap_unlock_read(mapping);
 
 	return ret;
 }
@@ -5189,30 +5178,9 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * may get SIGKILLed if it later faults.
 		 */
 		if (outside_reserve) {
-			struct address_space *mapping = vma->vm_file->f_mapping;
-			pgoff_t idx;
-			u32 hash;
-
 			put_page(old_page);
 			BUG_ON(huge_pte_none(pte));
-			/*
-			 * Drop hugetlb_fault_mutex and i_mmap_rwsem before
-			 * unmapping.  unmapping needs to hold i_mmap_rwsem
-			 * in write mode.  Dropping i_mmap_rwsem in read mode
-			 * here is OK as COW mappings do not interact with
-			 * PMD sharing.
-			 *
-			 * Reacquire both after unmap operation.
-			 */
-			idx = vma_hugecache_offset(h, vma, haddr);
-			hash = hugetlb_fault_mutex_hash(mapping, idx);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
-
 			unmap_ref_private(mm, vma, old_page, haddr);
-
-			i_mmap_lock_read(mapping);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			spin_lock(ptl);
 			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 			if (likely(ptep &&
@@ -5366,9 +5334,7 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 	 */
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
 	ret = handle_userfault(&vmf, reason);
-	i_mmap_lock_read(mapping);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	return ret;
@@ -5590,11 +5556,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
-		/*
-		 * Since we hold no locks, ptep could be stale.  That is
-		 * OK as we are only making decisions based on content and
-		 * not actually modifying content here.
-		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
@@ -5602,31 +5563,20 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
+	} else {
+		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
+		if (!ptep)
+			return VM_FAULT_OOM;
 	}
 
-	/*
-	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This prevents huge_pmd_unshare from
-	 * being called elsewhere and making the ptep no longer valid.
-	 *
-	 * ptep could have already be assigned via huge_pte_offset.  That
-	 * is OK, as huge_pte_alloc will return the same value unless
-	 * something has changed.
-	 */
 	mapping = vma->vm_file->f_mapping;
-	i_mmap_lock_read(mapping);
-	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
-	if (!ptep) {
-		i_mmap_unlock_read(mapping);
-		return VM_FAULT_OOM;
-	}
+	idx = vma_hugecache_offset(h, vma, haddr);
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -5714,7 +5664,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -6475,12 +6424,10 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner.
- *
- * This routine must be called with i_mmap_rwsem held in at least read mode if
- * sharing is possible.  For hugetlbfs, this prevents removal of any page
- * table entries associated with the address space.  This is important as we
- * are setting up sharing based on existing page table entries (mappings).
+ * code much cleaner. pmd allocation is essential for the shared case because
+ * pud has to be populated inside the same i_mmap_rwsem section - otherwise
+ * racing tasks could either miss the sharing (see huge_pte_offset) or select a
+ * bad pmd for sharing.
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
@@ -6494,7 +6441,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	i_mmap_assert_locked(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -6524,6 +6471,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
+	i_mmap_unlock_read(mapping);
 	return pte;
 }
 
@@ -6534,7 +6482,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * Called with page table lock held and i_mmap_rwsem held in write mode.
+ * Called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..206e7d3efdb1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,10 +23,9 @@
  * inode->i_rwsem	(while writing or truncating, not reading or faulting)
  *   mm->mmap_lock
  *     mapping->invalidate_lock (in filemap_fault)
- *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
+ *       page->flags PG_locked (lock_page)
  *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
  *           mapping->i_mmap_rwsem
- *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *             anon_vma->rwsem
  *               mm->page_table_lock or pte_lock
  *                 swap_lock (in swap_duplicate, swap_info_get)
@@ -45,11 +44,6 @@
  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
- *
- * * hugetlbfs PageHuge() pages take locks in this order:
- *         mapping->i_mmap_rwsem
- *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *             page->flags PG_locked (lock_page)
  */
 
 #include <linux/mm.h>
@@ -1495,12 +1489,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		address = pvmw.address;
 
 		if (PageHuge(page) && !PageAnon(page)) {
-			/*
-			 * To call huge_pmd_unshare, i_mmap_rwsem must be
-			 * held in write mode.  Caller needs to explicitly
-			 * do this outside rmap routines.
-			 */
-			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
 			if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
 				/*
 				 * huge_pmd_unshare unmapped an entire PMD
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..81e299edbc1a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -351,14 +351,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
-		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
-		 * i_mmap_rwsem ensures the dst_pte remains valid even
-		 * in the case of shared pmds.  fault mutex prevents
-		 * races with other faulting threads.
+		 * Serialize via hugetlb_fault_mutex.
 		 */
-		mapping = dst_vma->vm_file->f_mapping;
-		i_mmap_lock_read(mapping);
 		idx = linear_page_index(dst_vma, dst_addr);
+		mapping = dst_vma->vm_file->f_mapping;
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -366,7 +362,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -374,7 +369,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		    !huge_pte_none(huge_ptep_get(dst_pte))) {
 			err = -EEXIST;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -382,7 +376,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					       dst_addr, src_addr, mode, &page);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-		i_mmap_unlock_read(mapping);
 
 		cond_resched();
 
-- 
2.35.1




* [RFC PATCH 3/5] hugetlbfs: move routine remove_huge_page to hugetlb.c
  2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
  2022-04-06 20:48 ` [RFC PATCH 1/5] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
  2022-04-06 20:48 ` [RFC PATCH 2/5] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
@ 2022-04-06 20:48 ` Mike Kravetz
  2022-04-06 20:48 ` [RFC PATCH 4/5] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-04-06 20:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton, Mike Kravetz

In preparation for code in hugetlb.c removing pages from the page
cache, move remove_huge_page to hugetlb.c.  For a more descriptive
global name, rename to hugetlb_delete_from_page_cache.  Also,
rename huge_add_to_page_cache to be consistent.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 24 ++++++++----------------
 include/linux/hugetlb.h |  3 ++-
 mm/hugetlb.c            | 15 +++++++++++----
 3 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 56cd75b6cfc0..0cf352555354 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -396,13 +396,6 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void remove_huge_page(struct page *page)
-{
-	ClearPageDirty(page);
-	ClearPageUptodate(page);
-	delete_from_page_cache(page);
-}
-
 static void
 hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
 {
@@ -514,15 +507,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			lock_page(page);
 			/*
 			 * We must free the huge page and remove from page
-			 * cache (remove_huge_page) BEFORE removing the
-			 * region/reserve map (hugetlb_unreserve_pages).  In
-			 * rare out of memory conditions, removal of the
-			 * region/reserve map could fail. Correspondingly,
-			 * the subpool and global reserve usage count can need
-			 * to be adjusted.
+			 * cache BEFORE removing the region/reserve map
+			 * (hugetlb_unreserve_pages).  In rare out of memory
+			 * conditions, removal of the region/reserve map could
+			 * fail. Correspondingly, the subpool and global
+			 * reserve usage count can need to be adjusted.
 			 */
 			VM_BUG_ON(HPageRestoreReserve(page));
-			remove_huge_page(page);
+			hugetlb_delete_from_page_cache(page);
 			freed++;
 			if (!truncate_op) {
 				if (unlikely(hugetlb_unreserve_pages(inode,
@@ -720,7 +712,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		}
 		clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
-		error = huge_add_to_page_cache(page, mapping, index);
+		error = hugetlb_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
 			restore_reserve_on_error(h, &pseudo_vma, addr, page);
 			put_page(page);
@@ -972,7 +964,7 @@ static int hugetlbfs_error_remove_page(struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	pgoff_t index = page->index;
 
-	remove_huge_page(page);
+	hugetlb_delete_from_page_cache(page);
 	if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
 		hugetlb_fix_reserve_counts(inode);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d1897a69c540..2cf99d769f61 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -640,8 +640,9 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
+void hugetlb_delete_from_page_cache(struct page *page);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address, struct page *page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8fa2386bf7c0..c6d76f61de98 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5281,7 +5281,7 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 	return page != NULL;
 }
 
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
 			   pgoff_t idx)
 {
 	struct inode *inode = mapping->host;
@@ -5304,6 +5304,13 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 	return 0;
 }
 
+void hugetlb_delete_from_page_cache(struct page *page)
+{
+	ClearPageDirty(page);
+	ClearPageUptodate(page);
+	delete_from_page_cache(page);
+}
+
 static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 						  struct address_space *mapping,
 						  pgoff_t idx,
@@ -5412,7 +5419,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		new_page = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err = huge_add_to_page_cache(page, mapping, idx);
+			int err = hugetlb_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				put_page(page);
 				if (err == -EEXIST)
@@ -5788,11 +5795,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 		/*
 		 * Serialization between remove_inode_hugepages() and
-		 * huge_add_to_page_cache() below happens through the
+		 * hugetlb_add_to_page_cache() below happens through the
 		 * hugetlb_fault_mutex_table that here must be hold by
 		 * the caller.
 		 */
-		ret = huge_add_to_page_cache(page, mapping, idx);
+		ret = hugetlb_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
 		page_in_pagecache = true;
-- 
2.35.1




* [RFC PATCH 4/5] hugetlbfs: catch and handle truncate racing with page faults
  2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
                   ` (2 preceding siblings ...)
  2022-04-06 20:48 ` [RFC PATCH 3/5] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
@ 2022-04-06 20:48 ` Mike Kravetz
  2022-04-06 20:48 ` [RFC PATCH 5/5] hugetlb: Check for pmd unshare and fault/lookup races Mike Kravetz
  2022-04-07 10:08 ` [RFC PATCH 0/5] hugetlb: Change huge pmd sharing David Hildenbrand
  5 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-04-06 20:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton, Mike Kravetz

Most hugetlb fault handling code checks for faults beyond i_size.
While there are early checks in the code paths, the most difficult
to handle are those discovered after taking the page table lock.
At this point, we have possibly allocated a page and consumed
associated reservations and possibly added the page to the page cache.

When discovering a fault beyond i_size, be sure to:
- Remove the page from page cache, else it will sit there until the
  file is removed.
- Do not restore any reservation for the page consumed.  Otherwise
  there will be an outstanding reservation for an offset beyond the
  end of file.

The 'truncation' code in remove_inode_hugepages must deal with fault
code potentially removing a page from the cache after the page was
returned by pagevec_lookup and before locking the page.  This can be
discovered by a change in page_mapping().

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 40 ++++++++++++++++++++++------------------
 mm/hugetlb.c         | 28 ++++++++++++++++++++--------
 2 files changed, 42 insertions(+), 26 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 0cf352555354..341156c2a7d0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -490,13 +490,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * unmapped in caller.  Unmap (again) now after taking
 			 * the fault mutex.  The mutex will prevent faults
 			 * until we finish removing the page.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
 			 */
 			if (unlikely(page_mapped(page))) {
-				BUG_ON(truncate_op);
-
 				i_mmap_lock_write(mapping);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
@@ -506,22 +501,31 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 			lock_page(page);
 			/*
-			 * We must free the huge page and remove from page
-			 * cache BEFORE removing the region/reserve map
-			 * (hugetlb_unreserve_pages).  In rare out of memory
-			 * conditions, removal of the region/reserve map could
-			 * fail. Correspondingly, the subpool and global
-			 * reserve usage count can need to be adjusted.
+			 * After locking page, make sure mapping is the same.
+			 * We could have raced with page fault populate and
+			 * backout code.
 			 */
-			VM_BUG_ON(HPageRestoreReserve(page));
-			hugetlb_delete_from_page_cache(page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
+			if (page_mapping(page) == mapping) {
+				/*
+				 * We must free the huge page and remove from
+				 * page cache BEFORE removing the region/
+				 * reserve map (hugetlb_unreserve_pages).  In
+				 * rare out of memory conditions, removal of
+				 * the region/reserve map could fail.
+				 * Correspondingly, the subpool and global
+				 * reserve usage count can need to be adjusted.
+				 */
+				VM_BUG_ON(HPageRestoreReserve(page));
+				hugetlb_delete_from_page_cache(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(
+					    hugetlb_unreserve_pages(inode,
 							index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
+						hugetlb_fix_reserve_counts(
+								inode);
+				}
 			}
-
 			unlock_page(page);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c6d76f61de98..b8f994961a68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5361,6 +5361,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool beyond_i_size = false;
+	bool reserve_alloc = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5417,6 +5419,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (HPageRestoreReserve(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(page, mapping, idx);
@@ -5475,8 +5479,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
+	if (idx >= size) {
+		beyond_i_size = true;
 		goto backout;
+	}
 
 	ret = 0;
 	if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -5514,10 +5520,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
+	if (new_pagecache_page && beyond_i_size)
+		hugetlb_delete_from_page_cache(page);
 	unlock_page(page);
 	/* restore reserve for newly allocated pages not in page cache */
-	if (new_page && !new_pagecache_page)
-		restore_reserve_on_error(h, vma, haddr, page);
+	if (!new_pagecache_page) {
+		if (reserve_alloc)
+			SetHPageRestoreReserve(page);
+		if (new_page)
+			restore_reserve_on_error(h, vma, haddr, page);
+	}
 	put_page(page);
 	goto out;
 }
@@ -5812,15 +5824,15 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * Recheck the i_size after holding PT lock to make sure not
 	 * to leave any page mapped (as page_mapped()) beyond the end
 	 * of the i_size (remove_inode_hugepages() is strict about
-	 * enforcing that). If we bail out here, we'll also leave a
-	 * page in the radix tree in the vm_shared case beyond the end
-	 * of the i_size, but remove_inode_hugepages() will take care
-	 * of it as soon as we drop the hugetlb_fault_mutex_table.
+	 * enforcing that). If we bail out here, remove the page
+	 * added to the radix tree.
 	 */
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	ret = -EFAULT;
-	if (idx >= size)
+	if (idx >= size) {
+		hugetlb_delete_from_page_cache(page);
 		goto out_release_unlock;
+	}
 
 	ret = -EEXIST;
 	if (!huge_pte_none(huge_ptep_get(dst_pte)))
-- 
2.35.1




* [RFC PATCH 5/5] hugetlb: Check for pmd unshare and fault/lookup races
  2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
                   ` (3 preceding siblings ...)
  2022-04-06 20:48 ` [RFC PATCH 4/5] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
@ 2022-04-06 20:48 ` Mike Kravetz
  2022-04-07 10:08 ` [RFC PATCH 0/5] hugetlb: Change huge pmd sharing David Hildenbrand
  5 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-04-06 20:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton, Mike Kravetz

When a pmd is 'unshared', it effectively deletes part of a process's
page tables.  The routine huge_pmd_unshare must be called with
i_mmap_rwsem held in write mode and the page table locked.  However,
consider a page fault happening within that same process.  We could
have the following race:

Faulting thread					Unsharing thread
...						     ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
						i_mmap_unlock_write
						lock_page table
ptep invalid   <------------------------	huge_pmd_unshare
Could be in a previously			unlock_page_table
sharing process or worse
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

If the above race happens, we can update the pte of another process.

Catch this situation by doing another huge_pte_offset/page table
walk after obtaining the page table lock and comparing the pointers.
If the pointers differ, we know a race happened and we can bail out
and clean up.

In fault code, make sure to check for this race AFTER checking for
faults beyond i_size so page cache can be cleaned up properly.

Do note that even this is not perfect.  The page table lock is in the
page struct of the pmd page.  We need the pmd pointer (ptep) to get the
page table lock.  As shown above, we cannot even be certain ptep is
still valid when getting/locking the page table.  The other option is
to always use 'mm->page_table_lock' for hugetlb page tables.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b8f994961a68..e5196f0fa09c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4695,6 +4695,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *vma)
 {
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
+	pte_t *src_pte2;
 	struct page *ptepage;
 	unsigned long addr;
 	bool cow = is_cow_mapping(vma->vm_flags);
@@ -4741,7 +4742,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
 again:
-		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
+
+		src_pte2 = huge_pte_offset(src, addr, sz);
+		if (unlikely(src_pte2 != src_pte)) {
+			/*
+			 * Another thread could have unshared src_pte.
+			 * Just skip.
+			 */
+			;
+		} else if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
 			/*
 			 * Skip if src entry none.  Also, skip in the
 			 * unlikely case dst entry !none as this implies
@@ -5363,6 +5372,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	bool new_page, new_pagecache_page = false;
 	bool beyond_i_size = false;
 	bool reserve_alloc = false;
+	pte_t *ptep2;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5410,8 +5420,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * sure there really is no pte entry.
 			 */
 			ptl = huge_pte_lock(h, mm, ptep);
+			/* ptep2 checks for racing unshare page tables */
+			ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
 			ret = 0;
-			if (huge_pte_none(huge_ptep_get(ptep)))
+			if (ptep2 == ptep && huge_pte_none(huge_ptep_get(ptep)))
 				ret = vmf_error(PTR_ERR(page));
 			spin_unlock(ptl);
 			goto out;
@@ -5484,6 +5496,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		goto backout;
 	}
 
+	/* Check for racing unshare page tables */
+	ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
+	if (ptep2 != ptep)
+		goto backout;
+
 	ret = 0;
 	if (!huge_pte_none(huge_ptep_get(ptep)))
 		goto backout;
@@ -5561,7 +5578,7 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-	pte_t *ptep, entry;
+	pte_t *ptep, *ptep2, entry;
 	spinlock_t *ptl;
 	vm_fault_t ret;
 	u32 hash;
@@ -5640,8 +5657,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 
-	/* Check for a racing update before calling hugetlb_cow */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+	/* Check for a racing update or unshare before calling hugetlb_cow */
+	if (unlikely(ptep2 != ptep || !pte_same(entry, huge_ptep_get(ptep))))
 		goto out_ptl;
 
 	/*
@@ -5720,6 +5737,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	struct page *page;
 	int writable;
 	bool page_in_pagecache = false;
+	pte_t *ptep2;
 
 	if (is_continue) {
 		ret = -EFAULT;
@@ -5834,6 +5852,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_unlock;
 	}
 
+	/* Check for racing unshare page tables */
+	ptep2 = huge_pte_offset(dst_mm, dst_addr, huge_page_size(h));
+	if (unlikely(ptep2 != dst_pte))
+		goto out_release_unlock;
+
 	ret = -EEXIST;
 	if (!huge_pte_none(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
-- 
2.35.1




* Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
  2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
                   ` (4 preceding siblings ...)
  2022-04-06 20:48 ` [RFC PATCH 5/5] hugetlb: Check for pmd unshare and fault/lookup races Mike Kravetz
@ 2022-04-07 10:08 ` David Hildenbrand
  2022-04-07 16:17   ` Mike Kravetz
  5 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2022-04-07 10:08 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton

On 06.04.22 22:48, Mike Kravetz wrote:
> hugetlb fault scalability regressions have recently been reported [1].
> This is not the first such report, as regressions were also noted when
> commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> synchronization") was added [2] in v5.7.  At that time, a proposal to
> address the regression was suggested [3] but went nowhere.
> 
> To illustrate the regression, I created a simple program that does the
> following in an infinite loop:
> - mmap a 4GB hugetlb file (size insures pmd sharing)
> - fault in all pages
> - unmap the hugetlb file
> 
> The hugetlb fault code was then instrumented to collect number of times
> the mutex was locked and wait time.  Samples are from 10 second
> intervals on a 4 CPU VM with 8GB memory.  Eight instances of the
> map/fault/unmap program are running.
> 
> v5.17
> -----
> [  708.763114] Wait_debug: faults sec  3622
> [  708.764010]             num faults  36220
> [  708.765016]             num waits   36220
> [  708.766054]             intvl wait time 54074 msecs
> [  708.767287]             max_wait_time   31000 usecs
> 
> 
> v5.17 + this series (similar to v5.6)
> -------------------------------------
> [  282.191391] Wait_debug: faults sec  1777939
> [  282.192571]             num faults  17779393
> [  282.193746]             num locks   5517
> [  282.194858]             intvl wait time 19907 msecs
> [  282.196226]             max_wait_time   43000 usecs
> 
> As can be seen, fault time suffers when there are other operations
> taking i_mmap_rwsem in write mode such as unmap.
> 
> This series proposes reverting c0d0381ade79 and 87bf91d39bb5 which
> depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
> fault path back to huge_pmd_share where it is only taken when necessary.
> After, reverting these patches we still need to handle:
> fault and truncate races
> - Catch and properly backout faults beyond i_size
>   Backing out reservations is much easier after 846be08578ed to expand
>   restore_reserve_on_error functionality.
> unshare and fault/lookup races
> - Since the pointer returned from huge_pte_offset or huge_pte_alloc may
>   become invalid until we lock the page table, we must revalidate after
>   taking the lock.  Code paths must backout and possibly retry if
>   page table pointer changes.
> 
> The commit message in patch 5 suggests that it is not safe to use
> SPLIT_PMD_PTLOCKS for hugetlb mappings if sharing is possible.  If
> others confirm/agree then there will need to be additional work.
> 
> Please help with comments or suggestions.  I would like to come up with
> something that is performant and safe.

May I challenge the existence of huge PMD sharing? TBH I am not
convinced that the code complexity is worth the benefit.


Let me know if I get something wrong:

Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.

Sure, with thousands of processes sharing that memory, the size of page
tables required would increase with each and every process. But TBH,
that's in no way different to other file systems where we're even
dealing with PTE tables.


Which results in me wondering if

a) We should simply use gigantic pages for such extreme use case. Allows
   for freeing up more memory via vmemmap either way.
b) We should instead look into reclaiming reconstruct-able page table.
   It's hard to imagine that each and every process accesses each and
   every part of the gigantic file all of the time.
c) We should instead establish a more generic page table sharing
   mechanism.


Consequently, I'd be much more in favor of ripping it out :/ but that's
just my personal opinion.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
  2022-04-07 10:08 ` [RFC PATCH 0/5] hugetlb: Change huge pmd sharing David Hildenbrand
@ 2022-04-07 16:17   ` Mike Kravetz
  2022-04-08  9:26     ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2022-04-07 16:17 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton

On 4/7/22 03:08, David Hildenbrand wrote:
> On 06.04.22 22:48, Mike Kravetz wrote:
>> hugetlb fault scalability regressions have recently been reported [1].
>> This is not the first such report, as regressions were also noted when
>> commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
>> synchronization") was added [2] in v5.7.  At that time, a proposal to
>> address the regression was suggested [3] but went nowhere.
<snip>
>> Please help with comments or suggestions.  I would like to come up with
>> something that is performant and safe.
> 
> May I challenge the existence of huge PMD sharing? TBH I am not
> convinced that the code complexity is worth the benefit.
> 

That is a fair question.
Huge PMD sharing is not a documented or well known feature.  Most people would
not notice it going away.  However, I suspect some people will notice.

> Let me know if I get something wrong:
> 
> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
> 
> Sure, with thousands of processes sharing that memory, the size of page
> tables required would increase with each and every process. But TBH,
> that's in no way different to other file systems where we're even
> dealing with PTE tables.

The numbers for a real use case I am frequently quoted are something like:
1TB shared mapping, 10,000 processes sharing the mapping
4K PMD Page per 1GB of shared mapping
4M saving for each shared process
9,999 * 4M ~= 39GB savings

However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
it states that performance rather than memory savings was the primary
objective.

"For hugetlb, the saving on page table memory is not the primary
 objective (as hugetlb itself already cuts down page table overhead
 significantly), instead, the purpose of using shared page table on hugetlb is
 to allow faster TLB refill and smaller cache pollution upon TLB miss.
    
 With PT sharing, pte entries are shared among hundreds of processes, the
 cache consumption used by all the page table is smaller and in return,
 application gets much higher cache hit ratio.  One other effect is that
 cache hit ratio with hardware page walker hitting on pte in cache will be
 higher and this helps to reduce tlb miss latency.  These two effects
 contribute to higher application performance."

That 'makes sense', but I have never tried to measure any such performance
benefit.  It is easier to calculate the memory savings.

> 
> Which results in me wondering if
> 
> a) We should simply use gigantic pages for such extreme use case. Allows
>    for freeing up more memory via vmemmap either way.

The only problem with this is that many processors in use today have
limited TLB entries for gigantic pages.

> b) We should instead look into reclaiming reconstruct-able page table.
>    It's hard to imagine that each and every process accesses each and
>    every part of the gigantic file all of the time.
> c) We should instead establish a more generic page table sharing
>    mechanism.

Yes.  I think that is the direction taken by mshare() proposal.  If we have
a more generic approach we can certainly start deprecating hugetlb pmd
sharing.

> 
> 
> Consequently, I'd be much more in favor of ripping it out :/ but that's
> just my personal opinion.
> 

-- 
Mike Kravetz



* Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
  2022-04-07 16:17   ` Mike Kravetz
@ 2022-04-08  9:26     ` David Hildenbrand
  2022-04-19 22:50       ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2022-04-08  9:26 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton

>>
>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>
>> Sure, with thousands of processes sharing that memory, the size of page
>> tables required would increase with each and every process. But TBH,
>> that's in no way different to other file systems where we're even
>> dealing with PTE tables.
> 
> The numbers for a real use case I am frequently quoted are something like:
> 1TB shared mapping, 10,000 processes sharing the mapping
> 4K PMD Page per 1GB of shared mapping
> 4M saving for each shared process
> 9,999 * 4M ~= 39GB savings

3.7 % of all memory. Noticeable if the feature is removed? yes. Do we
care about supporting such corner cases that result in a maintenance
burden? My take is a clear no.

> 
> However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
> it states that performance rather than memory savings was the primary
> objective.
> 
> "For hugetlb, the saving on page table memory is not the primary
>  objective (as hugetlb itself already cuts down page table overhead
>  significantly), instead, the purpose of using shared page table on hugetlb is
>  to allow faster TLB refill and smaller cache pollution upon TLB miss.
>     
>  With PT sharing, pte entries are shared among hundreds of processes, the
>  cache consumption used by all the page table is smaller and in return,
>  application gets much higher cache hit ratio.  One other effect is that
>  cache hit ratio with hardware page walker hitting on pte in cache will be
>  higher and this helps to reduce tlb miss latency.  These two effects
>  contribute to higher application performance."
> 
> That 'makes sense', but I have never tried to measure any such performance
> benefit.  It is easier to calculate the memory savings.

It does make sense; but then again, what's specific here about hugetlb?

Most probably it was just easy to add to hugetlb in contrast to other
types of shared memory.

> 
>>
>> Which results in me wondering if
>>
>> a) We should simply use gigantic pages for such extreme use case. Allows
>>    for freeing up more memory via vmemmap either way.
> 
> The only problem with this is that many processors in use today have
> limited TLB entries for gigantic pages.
> 
>> b) We should instead look into reclaiming reconstruct-able page table.
>>    It's hard to imagine that each and every process accesses each and
>>    every part of the gigantic file all of the time.
>> c) We should instead establish a more generic page table sharing
>>    mechanism.
> 
> Yes.  I think that is the direction taken by mshare() proposal.  If we have
> a more generic approach we can certainly start deprecating hugetlb pmd
> sharing.

My strong opinion is to remove it ASAP and get something proper into place.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
  2022-04-08  9:26     ` David Hildenbrand
@ 2022-04-19 22:50       ` Mike Kravetz
  2022-04-20  7:12         ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2022-04-19 22:50 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton

On 4/8/22 02:26, David Hildenbrand wrote:
>>>
>>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>>
>>> Sure, with thousands of processes sharing that memory, the size of page
>>> tables required would increase with each and every process. But TBH,
>>> that's in no way different to other file systems where we're even
>>> dealing with PTE tables.
>>
>> The numbers for a real use case I am frequently quoted are something like:
>> 1TB shared mapping, 10,000 processes sharing the mapping
>> 4K PMD Page per 1GB of shared mapping
>> 4M saving for each shared process
>> 9,999 * 4M ~= 39GB savings
> 
> 3.7 % of all memory. Noticeable if the feature is removed? yes. Do we
> care about supporting such corner cases that result in a maintenance
> burden? My take is a clear no.
> 
>>
>> However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
>> it states that performance rather than memory savings was the primary
>> objective.
>>
>> "For hugetlb, the saving on page table memory is not the primary
>>  objective (as hugetlb itself already cuts down page table overhead
>>  significantly), instead, the purpose of using shared page table on hugetlb is
>>  to allow faster TLB refill and smaller cache pollution upon TLB miss.
>>     
>>  With PT sharing, pte entries are shared among hundreds of processes, the
>>  cache consumption used by all the page table is smaller and in return,
>>  application gets much higher cache hit ratio.  One other effect is that
>>  cache hit ratio with hardware page walker hitting on pte in cache will be
>>  higher and this helps to reduce tlb miss latency.  These two effects
>>  contribute to higher application performance."
>>
>> That 'makes sense', but I have never tried to measure any such performance
>> benefit.  It is easier to calculate the memory savings.
> 
> It does makes sense; but then, again, what's specific here about hugetlb?
> 
> Most probably it was just easy to add to hugetlb in contrast to other
> types of shared memory.
> 
>>
>>>
>>> Which results in me wondering if
>>>
>>> a) We should simply use gigantic pages for such extreme use case. Allows
>>>    for freeing up more memory via vmemmap either way.
>>
>> The only problem with this is that many processors in use today have
>> limited TLB entries for gigantic pages.
>>
>>> b) We should instead look into reclaiming reconstruct-able page table.
>>>    It's hard to imagine that each and every process accesses each and
>>>    every part of the gigantic file all of the time.
>>> c) We should instead establish a more generic page table sharing
>>>    mechanism.
>>
>> Yes.  I think that is the direction taken by mshare() proposal.  If we have
>> a more generic approach we can certainly start deprecating hugetlb pmd
>> sharing.
> 
> My strong opinion is to remove it ASAP and get something proper into place.
> 

No arguments about the complexity of this code.  However, there will be some
people who will notice if it is removed.
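
As a rough sanity check on the numbers quoted above, here is a small
standalone C sketch of the arithmetic (assuming x86-64 page table
geometry: 4 KiB PMD pages holding 512 entries of 8 bytes each; the
mapping sizes and process count are simply the ones quoted above, not
measurements):

#include <stdio.h>

int main(void)
{
	/* Assumed x86-64 geometry: 8-byte entries, 512 per 4 KiB PMD page. */
	const unsigned long entry_size = 8;
	const unsigned long pmd_page_size = 512 * entry_size;	/* 4 KiB */

	/* 4 TiB mapped with 2 MiB huge pages: PMD entries to map it once. */
	unsigned long huge_pages = (4UL << 40) / (2UL << 20);
	printf("4 TiB / 2 MiB = %lu huge pages -> %lu MiB of PMD entries\n",
	       huge_pages, (huge_pages * entry_size) >> 20);

	/*
	 * 1 TiB shared mapping, 10,000 processes: one PMD page per 1 GiB,
	 * i.e. 1024 PMD pages (4 MiB) per process when nothing is shared.
	 */
	unsigned long pmd_pages = (1UL << 40) / (1UL << 30);
	unsigned long per_proc = pmd_pages * pmd_page_size;
	printf("per process: %lu MiB; 9,999 extra copies: ~%lu GiB\n",
	       per_proc >> 20, (9999UL * per_proc) >> 30);

	return 0;
}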

Whether or not we remove huge pmd sharing support, I would still like to
address the scalability issue.  To do so, taking i_mmap_rwsem in read mode
for fault processing needs to go away.  With this gone, the issue of faults
racing with truncation needs to be addressed as it depended on fault code
taking i_mmap_rwsem.  At a high level, this is fairly simple but hugetlb
reservations add to the complexity.  This was not completely addressed in
this series.
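
To make that race concrete, here is a minimal userspace analogy (not
kernel code; the names, the toy locking, and the backout step are all
made up for illustration): the "fault" side does its preparation without
any mapping-wide lock, then revalidates the file size under a
fine-grained lock and backs out if a concurrent "truncate" got there
first.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SZ		4096UL
#define NR_PAGES	1024UL

/* Toy stand-ins: a file size and a lock playing the role of the page
 * table lock.  No mapping-wide lock is taken in the fault path. */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t file_size = NR_PAGES * PAGE_SZ;
static bool page_present[NR_PAGES];

/* "truncate": shrink the size and drop pages beyond it, under the lock. */
static void toy_truncate(size_t new_size)
{
	pthread_mutex_lock(&page_lock);
	file_size = new_size;
	for (size_t i = new_size / PAGE_SZ; i < NR_PAGES; i++)
		page_present[i] = false;
	pthread_mutex_unlock(&page_lock);
}

/* "fault": prepare without any mapping-wide lock, then revalidate
 * offset < file_size under the lock; if a concurrent truncate won the
 * race, back out the preparation and fail. */
static int toy_fault(size_t offset)
{
	/* ... expensive preparation would happen here, unserialized ... */

	pthread_mutex_lock(&page_lock);
	if (offset >= file_size) {
		pthread_mutex_unlock(&page_lock);
		/* ... back out whatever was prepared above ... */
		fprintf(stderr, "fault at %zu beyond size, backed out\n", offset);
		return -1;
	}
	page_present[offset / PAGE_SZ] = true;
	pthread_mutex_unlock(&page_lock);
	return 0;
}

int main(void)
{
	toy_fault(8 * PAGE_SZ);		/* within the file: succeeds */
	toy_truncate(4 * PAGE_SZ);	/* the file shrinks */
	toy_fault(8 * PAGE_SZ);		/* now beyond the size: backs out */
	return 0;
}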

I will be sending out another RFC that more correctly addresses all the issues
this series attempted to handle.  I am not discounting your opinion that we
should get rid of huge pmd sharing.  Rather, I would at least like to get
some eyes on my approach to addressing the issue with reservations during
fault and truncate races.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
  2022-04-19 22:50       ` Mike Kravetz
@ 2022-04-20  7:12         ` David Hildenbrand
  0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2022-04-20  7:12 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Ray Fucillo,
	Andrew Morton

On 20.04.22 00:50, Mike Kravetz wrote:
> On 4/8/22 02:26, David Hildenbrand wrote:
>>>>
>>>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>>>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>>>
>>>> Sure, with thousands of processes sharing that memory, the size of page
>>>> tables required would increase with each and every process. But TBH,
>>>> that's in no way different to other file systems where we're even
>>>> dealing with PTE tables.
>>>
>>> The numbers for a real use case I am frequently quoted are something like:
>>> 1TB shared mapping, 10,000 processes sharing the mapping
>>> 4K PMD Page per 1GB of shared mapping
>>> 4M saving for each shared process
>>> 9,999 * 4M ~= 39GB savings
>>
>> 3.7% of all memory. Noticeable if the feature is removed? Yes. Do we
>> care about supporting such corner cases that result in a maintenance
>> burden? My take is a clear no.
>>
>>>
>>> However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
>>> it states that performance rather than memory savings was the primary
>>> objective.
>>>
>>> "For hugetlb, the saving on page table memory is not the primary
>>>  objective (as hugetlb itself already cuts down page table overhead
>>>  significantly), instead, the purpose of using shared page table on hugetlb is
>>>  to allow faster TLB refill and smaller cache pollution upon TLB miss.
>>>     
>>>  With PT sharing, pte entries are shared among hundreds of processes, the
>>>  cache consumption used by all the page table is smaller and in return,
>>>  application gets much higher cache hit ratio.  One other effect is that
>>>  cache hit ratio with hardware page walker hitting on pte in cache will be
>>>  higher and this helps to reduce tlb miss latency.  These two effects
>>>  contribute to higher application performance."
>>>
>>> That 'makes sense', but I have never tried to measure any such performance
>>> benefit.  It is easier to calculate the memory savings.
>>
>> It does make sense; but then again, what's specific here about hugetlb?
>>
>> Most probably it was just easy to add to hugetlb in contrast to other
>> types of shared memory.
>>
>>>
>>>>
>>>> Which results in me wondering if
>>>>
>>>> a) We should simply use gigantic pages for such extreme use case. Allows
>>>>    for freeing up more memory via vmemmap either way.
>>>
>>> The only problem with this is that many processors in use today have
>>> limited TLB entries for gigantic pages.
>>>
>>>> b) We should instead look into reclaiming reconstruct-able page table.
>>>>    It's hard to imagine that each and every process accesses each and
>>>>    every part of the gigantic file all of the time.
>>>> c) We should instead establish a more generic page table sharing
>>>>    mechanism.
>>>
>>> Yes.  I think that is the direction taken by mshare() proposal.  If we have
>>> a more generic approach we can certainly start deprecating hugetlb pmd
>>> sharing.
>>
>> My strong opinion is to remove it ASAP and get something proper into place.
>>
> 
> No arguments about the complexity of this code.  However, there will be some
> people who will notice if it is removed.

Yes, it should never have been added that way -- unfortunately.

> 
> Whether or not we remove huge pmd sharing support, I would still like to
> address the scalability issue.  To do so, taking i_mmap_rwsem in read mode
> for fault processing needs to go away.  With this gone, the issue of faults
> racing with truncation needs to be addressed as it depended on fault code
> taking i_mmap_rwsem.  At a high level, this is fairly simple but hugetlb
> reservations add to the complexity.  This was not completely addressed in
> this series.

Okay.

> 
> I will be sending out another RFC that more correctly addresses all the issues
> this series attempted to handle.  I am not discounting your opinion that we
> should get rid of huge pmd sharing.  Rather, I would at least like to get
> some eyes on my approach to addressing the issue with reservations during
> fault and truncate races.

Makes sense to me. I agree that we should fix all that. What I have
experienced is that pmd sharing over-complicates the situation quite
a lot and makes the code hard to follow.

[huge page reservation is another thing I dislike, especially because
it's no good in NUMA setups and we still have to preallocate huge pages
to make it work reliably]

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-04-20  7:12 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 1/5] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 2/5] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 3/5] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 4/5] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 5/5] hugetlb: Check for pmd unshare and fault/lookup races Mike Kravetz
2022-04-07 10:08 ` [RFC PATCH 0/5] hugetlb: Change huge pmd sharing David Hildenbrand
2022-04-07 16:17   ` Mike Kravetz
2022-04-08  9:26     ` David Hildenbrand
2022-04-19 22:50       ` Mike Kravetz
2022-04-20  7:12         ` David Hildenbrand
