* [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again
@ 2022-04-20 22:37 Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 1/6] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-04-20 22:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, David Hildenbrand,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Ray Fucillo, Andrew Morton, Mike Kravetz

I am sending this as a v2 RFC for the following reasons:
- The original RFC was incomplete and had a few issues.
- The only comments on the original RFC suggested eliminating huge pmd
  sharing to eliminate the associated complexity.  I do not believe this
  is possible, as user space code will notice its absence.  In any case,
  if we want to remove i_mmap_rwsem from the fault path to address fault
  scalability, we will need to address fault/truncate races.  Patches 3
  and 4 of this series do that.

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

To illustrate the regression, I created a simple program that does the
following in an infinite loop (a rough sketch of such a program follows
the list):
- mmap a 4GB hugetlb file (size ensures pmd sharing)
- fault in all pages
- unmap the hugetlb file
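
A rough sketch of such a reproducer is below.  It is not the exact
program used for the numbers that follow; the hugetlbfs mount point,
file name, and 2MB huge page size are assumptions.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(4UL << 30)	/* 4GB - large enough for pmd sharing */
#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MB huge pages */

int main(void)
{
	/* assumes hugetlbfs is mounted at /dev/hugepages */
	int fd = open("/dev/hugepages/repro", O_CREAT | O_RDWR, 0644);

	if (fd < 0 || ftruncate(fd, MAP_SIZE))
		return 1;

	for (;;) {
		size_t off;
		char *addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);

		if (addr == MAP_FAILED)
			return 1;

		/* touch each huge page to fault it in */
		for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
			addr[off] = 1;

		/* unmap takes i_mmap_rwsem in write mode */
		munmap(addr, MAP_SIZE);
	}
	return 0;
}

Running several instances against the same file makes every fault
contend with the unmaps of the other instances.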

The hugetlb fault code was then instrumented to collect the number of
times the mutex was locked and the wait time.  Samples are from 10 second
intervals on a 4 CPU VM with 8GB memory.  Eight instances of the
map/fault/unmap program are running.

next-20220420
-------------
[  690.117843] Wait_debug: faults sec  3506
[  690.118788]             num faults  35062
[  690.119825]             num locks   35062
[  690.120956]             intvl wait time 54688 msecs
[  690.122330]             max_wait_time   24000 usecs


next-20220420 + this series
---------------------------
[  484.965960] Wait_debug: faults sec  1419429
[  484.967294]             num faults  14194293
[  484.968656]             num locks   5611
[  484.969893]             intvl wait time 21087 msecs
[  484.971388]             max_wait_time   34000 usecs

As can be seen, fault performance suffers when other operations, such
as unmap, take i_mmap_rwsem in write mode.

This series proposes reverting c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
fault path back to huge_pmd_share, where it is only taken when necessary.
After reverting these patches, we still need to handle:
- fault and truncate races
  Catch and properly backout faults beyond i_size
  Backing out reservations is much easier after 846be08578ed to expand
  restore_reserve_on_error functionality.
- unshare and fault/lookup races
  Since the pointer returned from huge_pte_offset or huge_pte_alloc is
  not stable until we hold the page table lock, we must revalidate it
  after taking the lock.  Code paths must back out and possibly retry if
  the page table pointer changes.

Patch 5 in this series makes basic changes to page table locking for
hugetlb mappings.  Currently code uses (split) pmd locks for hugetlb
mappings if page size is PMD_SIZE.  A pointer to the pmd is required
to find the page struct containing the lock.  However, with pmd sharing
the pmd pointer is not stable until we hold the pmd lock.  To solve
this chicken/egg problem, we use the page_table_lock in mm_struct if
the pmd pointer is associated with a mapping where pmd sharing is
possible.  A study of the performance implications of this change still
needs to be performed.
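
In outline, the lock selection from patch 5 looks like the following
(condensed from the huge_pte_lockptr() change in that patch;
vma_range_shareable() is the helper it introduces):

spinlock_t *huge_pte_lockptr(struct hstate *h, struct vm_area_struct *vma,
			     pte_t *pte)
{
	struct mm_struct *mm = vma->vm_mm;

	/* split pmd lock only if no part of the vma could share pmds */
	if (huge_page_size(h) == PMD_SIZE &&
	    !vma_range_shareable(vma, vma->vm_start, vma->vm_end))
		return pmd_lockptr(mm, (pmd_t *)pte);

	/* otherwise, fall back to the single per-mm page_table_lock */
	return &mm->page_table_lock;
}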

Please help with comments or suggestions.  I would like to come up with
something that is performant AND safe.

Mike Kravetz (6):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlbfs: Do not use pmd locks if hugetlb sharing possible
  hugetlb: Check for pmd unshare and fault/lookup races

 arch/powerpc/mm/pgtable.c |   2 +-
 fs/hugetlbfs/inode.c      | 136 +++++++++++--------
 include/linux/hugetlb.h   |  30 ++---
 mm/damon/vaddr.c          |   4 +-
 mm/hmm.c                  |   2 +-
 mm/hugetlb.c              | 273 ++++++++++++++++++++++----------------
 mm/mempolicy.c            |   2 +-
 mm/migrate.c              |   2 +-
 mm/page_vma_mapped.c      |   2 +-
 mm/rmap.c                 |   8 +-
 mm/userfaultfd.c          |  11 +-
 11 files changed, 261 insertions(+), 211 deletions(-)

-- 
2.35.1



* [RFC PATCH v2 1/6] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  2022-04-20 22:37 [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
@ 2022-04-20 22:37 ` Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 2/6] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-04-20 22:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, David Hildenbrand,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Ray Fucillo, Andrew Morton, Mike Kravetz

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  The use of i_mmap_rwsem to prevent
fault/truncate races depends on this.  However, this has been shown to
cause performance/scaling issues.  As a result, that code will be
reverted.  Since the use of i_mmap_rwsem to address page fault/truncate
races depends on that code, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 30 +++++++++---------------------
 mm/hugetlb.c         | 23 ++++++++++++-----------
 2 files changed, 21 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8b5b9df2be7d..1ad76a7ae1cc 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -451,9 +451,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
  *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() holds i_mmap_rwsem and prevents
- *	page faults in the truncated range by checking i_size.  i_size is
- *	modified while holding i_mmap_rwsem.
+ *	in this routine.  hugetlb_no_page() prevents page faults in the
+ *	truncated range.  It checks i_size before allocation, and again after
+ *	with the page table lock for the page held.  The same lock must be
+ *	acquired to unmap a page.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -489,16 +490,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash = 0;
 
 			index = page->index;
-			if (!truncate_op) {
-				/*
-				 * Only need to hold the fault mutex in the
-				 * hole punch case.  This prevents races with
-				 * page faults.  Races are not possible in the
-				 * case of truncation.
-				 */
-				hash = hugetlb_fault_mutex_hash(mapping, index);
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
-			}
+			hash = hugetlb_fault_mutex_hash(mapping, index);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
 			 * If page is mapped, it was faulted in after being
@@ -542,8 +535,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			unlock_page(page);
-			if (!truncate_op)
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
@@ -581,8 +573,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
-	i_mmap_lock_write(mapping);
 	i_size_write(inode, offset);
+	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
@@ -703,11 +695,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/*
-		 * fault mutex taken here, protects against fault path
-		 * and hole punch.  inode_lock previously taken protects
-		 * against truncation.
-		 */
+		/* mutex taken here, fault path and hole punch */
 		hash = hugetlb_fault_mutex_hash(mapping, index);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index daa4bdd6c26c..9421d2aeddc0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5477,18 +5477,17 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	/*
-	 * We can not race with truncation due to holding i_mmap_rwsem.
-	 * i_size is modified when holding i_mmap_rwsem, so check here
-	 * once for faults beyond end of file.
+	 * Use page lock to guard against racing truncation
+	 * before we get page_table_lock.
 	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto out;
-
 retry:
 	new_page = false;
 	page = find_lock_page(mapping, idx);
 	if (!page) {
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
+		if (idx >= size)
+			goto out;
+
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
@@ -5578,6 +5577,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
+	if (idx >= size)
+		goto backout;
+
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5686,10 +5689,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This serves two purposes:
-	 * 1) It prevents huge_pmd_unshare from being called elsewhere
-	 *    and making the ptep no longer valid.
-	 * 2) It synchronizes us with i_size modifications during truncation.
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
 	 *
 	 * ptep could have already be assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless
-- 
2.35.1



* [RFC PATCH v2 2/6] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
  2022-04-20 22:37 [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 1/6] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
@ 2022-04-20 22:37 ` Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 3/6] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-04-20 22:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, David Hildenbrand,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Ray Fucillo, Andrew Morton, Mike Kravetz

Commit c0d0381ade79 added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues.  Revert the code and go back to the method
of only taking the semaphore in huge_pmd_share.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.

In a subsequent patch, code will be added to detect when a pmd was
'unshared' during fault processing and deal with that.

FIXME - Check locking in move_huge_pte and caller

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c |  2 --
 mm/hugetlb.c         | 76 +++++++-------------------------------------
 mm/rmap.c            |  8 +----
 mm/userfaultfd.c     | 11 ++-----
 4 files changed, 15 insertions(+), 82 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1ad76a7ae1cc..80573f0e8d9f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -505,9 +505,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			if (unlikely(page_mapped(page))) {
 				BUG_ON(truncate_op);
 
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 				i_mmap_lock_write(mapping);
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
 					(index + 1) * pages_per_huge_page(h),
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9421d2aeddc0..562ecac0168f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4717,7 +4717,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	struct hstate *h = hstate_vma(src_vma);
 	unsigned long sz = huge_page_size(h);
 	unsigned long npages = pages_per_huge_page(h);
-	struct address_space *mapping = src_vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
 
@@ -4728,14 +4727,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		mmu_notifier_invalidate_range_start(&range);
 		mmap_assert_write_locked(src);
 		raw_write_seqcount_begin(&src->write_protect_seq);
-	} else {
-		/*
-		 * For shared mappings i_mmap_rwsem must be held to call
-		 * huge_pte_alloc, otherwise the returned ptep could go
-		 * away if part of a shared pmd and another thread calls
-		 * huge_pmd_unshare.
-		 */
-		i_mmap_lock_read(mapping);
 	}
 
 	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
@@ -4878,8 +4869,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	if (cow) {
 		raw_write_seqcount_end(&src->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
-	} else {
-		i_mmap_unlock_read(mapping);
 	}
 
 	return ret;
@@ -5255,30 +5244,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * may get SIGKILLed if it later faults.
 		 */
 		if (outside_reserve) {
-			struct address_space *mapping = vma->vm_file->f_mapping;
-			pgoff_t idx;
-			u32 hash;
-
 			put_page(old_page);
 			BUG_ON(huge_pte_none(pte));
-			/*
-			 * Drop hugetlb_fault_mutex and i_mmap_rwsem before
-			 * unmapping.  unmapping needs to hold i_mmap_rwsem
-			 * in write mode.  Dropping i_mmap_rwsem in read mode
-			 * here is OK as COW mappings do not interact with
-			 * PMD sharing.
-			 *
-			 * Reacquire both after unmap operation.
-			 */
-			idx = vma_hugecache_offset(h, vma, haddr);
-			hash = hugetlb_fault_mutex_hash(mapping, idx);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
-
 			unmap_ref_private(mm, vma, old_page, haddr);
-
-			i_mmap_lock_read(mapping);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			spin_lock(ptl);
 			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 			if (likely(ptep &&
@@ -5440,9 +5408,7 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 	 */
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
 	ret = handle_userfault(&vmf, reason);
-	i_mmap_lock_read(mapping);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	return ret;
@@ -5673,11 +5639,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
-		/*
-		 * Since we hold no locks, ptep could be stale.  That is
-		 * OK as we are only making decisions based on content and
-		 * not actually modifying content here.
-		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
@@ -5685,31 +5646,20 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
+	} else {
+		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
+		if (!ptep)
+			return VM_FAULT_OOM;
 	}
 
-	/*
-	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This prevents huge_pmd_unshare from
-	 * being called elsewhere and making the ptep no longer valid.
-	 *
-	 * ptep could have already be assigned via huge_pte_offset.  That
-	 * is OK, as huge_pte_alloc will return the same value unless
-	 * something has changed.
-	 */
 	mapping = vma->vm_file->f_mapping;
-	i_mmap_lock_read(mapping);
-	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
-	if (!ptep) {
-		i_mmap_unlock_read(mapping);
-		return VM_FAULT_OOM;
-	}
+	idx = vma_hugecache_offset(h, vma, haddr);
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -5821,7 +5771,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -6659,12 +6608,10 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner.
- *
- * This routine must be called with i_mmap_rwsem held in at least read mode if
- * sharing is possible.  For hugetlbfs, this prevents removal of any page
- * table entries associated with the address space.  This is important as we
- * are setting up sharing based on existing page table entries (mappings).
+ * code much cleaner. pmd allocation is essential for the shared case because
+ * pud has to be populated inside the same i_mmap_rwsem section - otherwise
+ * racing tasks could either miss the sharing (see huge_pte_offset) or select a
+ * bad pmd for sharing.
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
@@ -6678,7 +6625,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	i_mmap_assert_locked(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -6708,6 +6655,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
+	i_mmap_unlock_read(mapping);
 	return pte;
 }
 
@@ -6718,7 +6666,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * Called with page table lock held and i_mmap_rwsem held in write mode.
+ * Called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/rmap.c b/mm/rmap.c
index edfe61f95a7f..33c717163112 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,10 +23,9 @@
  * inode->i_rwsem	(while writing or truncating, not reading or faulting)
  *   mm->mmap_lock
  *     mapping->invalidate_lock (in filemap_fault)
- *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
+ *       page->flags PG_locked (lock_page)
  *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
  *           mapping->i_mmap_rwsem
- *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *             anon_vma->rwsem
  *               mm->page_table_lock or pte_lock
  *                 swap_lock (in swap_duplicate, swap_info_get)
@@ -45,11 +44,6 @@
  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
- *
- * * hugetlbfs PageHuge() pages take locks in this order:
- *         mapping->i_mmap_rwsem
- *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *             page->flags PG_locked (lock_page)
  */
 
 #include <linux/mm.h>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4f4892a5f767..1a2cdac18ad7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -374,14 +374,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
-		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
-		 * i_mmap_rwsem ensures the dst_pte remains valid even
-		 * in the case of shared pmds.  fault mutex prevents
-		 * races with other faulting threads.
+		 * Serialize via hugetlb_fault_mutex.
 		 */
-		mapping = dst_vma->vm_file->f_mapping;
-		i_mmap_lock_read(mapping);
 		idx = linear_page_index(dst_vma, dst_addr);
+		mapping = dst_vma->vm_file->f_mapping;
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -389,7 +385,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -397,7 +392,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
 			err = -EEXIST;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -406,7 +400,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					       wp_copy);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-		i_mmap_unlock_read(mapping);
 
 		cond_resched();
 
-- 
2.35.1



* [RFC PATCH v2 3/6] hugetlbfs: move routine remove_huge_page to hugetlb.c
  2022-04-20 22:37 [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 1/6] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 2/6] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
@ 2022-04-20 22:37 ` Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 4/6] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-04-20 22:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, David Hildenbrand,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Ray Fucillo, Andrew Morton, Mike Kravetz

In preparation for code in hugetlb.c removing pages from the page
cache, move remove_huge_page to hugetlb.c.  For a more descriptive
global name, rename to hugetlb_delete_from_page_cache.  Also,
rename huge_add_to_page_cache to be consistent.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 24 ++++++++----------------
 include/linux/hugetlb.h |  3 ++-
 mm/hugetlb.c            | 15 +++++++++++----
 3 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 80573f0e8d9f..5e4bd2f1705f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -396,13 +396,6 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void remove_huge_page(struct page *page)
-{
-	ClearPageDirty(page);
-	ClearPageUptodate(page);
-	delete_from_page_cache(page);
-}
-
 static void
 hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 		      unsigned long zap_flags)
@@ -516,15 +509,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			lock_page(page);
 			/*
 			 * We must free the huge page and remove from page
-			 * cache (remove_huge_page) BEFORE removing the
-			 * region/reserve map (hugetlb_unreserve_pages).  In
-			 * rare out of memory conditions, removal of the
-			 * region/reserve map could fail. Correspondingly,
-			 * the subpool and global reserve usage count can need
-			 * to be adjusted.
+			 * cache BEFORE removing the region/reserve map
+			 * (hugetlb_unreserve_pages).  In rare out of memory
+			 * conditions, removal of the region/reserve map could
+			 * fail. Correspondingly, the subpool and global
+			 * reserve usage count can need to be adjusted.
 			 */
 			VM_BUG_ON(HPageRestoreReserve(page));
-			remove_huge_page(page);
+			hugetlb_delete_from_page_cache(page);
 			freed++;
 			if (!truncate_op) {
 				if (unlikely(hugetlb_unreserve_pages(inode,
@@ -723,7 +715,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		}
 		clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
-		error = huge_add_to_page_cache(page, mapping, index);
+		error = hugetlb_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
 			restore_reserve_on_error(h, &pseudo_vma, addr, page);
 			put_page(page);
@@ -975,7 +967,7 @@ static int hugetlbfs_error_remove_page(struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	pgoff_t index = page->index;
 
-	remove_huge_page(page);
+	hugetlb_delete_from_page_cache(page);
 	if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
 		hugetlb_fix_reserve_counts(inode);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5f4a2f69dd3..75f4ff481538 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -655,8 +655,9 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
+void hugetlb_delete_from_page_cache(struct page *page);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address, struct page *page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 562ecac0168f..d60997462df8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5353,7 +5353,7 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 	return page != NULL;
 }
 
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
 			   pgoff_t idx)
 {
 	struct inode *inode = mapping->host;
@@ -5376,6 +5376,13 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 	return 0;
 }
 
+void hugetlb_delete_from_page_cache(struct page *page)
+{
+	ClearPageDirty(page);
+	ClearPageUptodate(page);
+	delete_from_page_cache(page);
+}
+
 static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 						  struct address_space *mapping,
 						  pgoff_t idx,
@@ -5488,7 +5495,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		new_page = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err = huge_add_to_page_cache(page, mapping, idx);
+			int err = hugetlb_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				put_page(page);
 				if (err == -EEXIST)
@@ -5897,11 +5904,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 		/*
 		 * Serialization between remove_inode_hugepages() and
-		 * huge_add_to_page_cache() below happens through the
+		 * hugetlb_add_to_page_cache() below happens through the
 		 * hugetlb_fault_mutex_table that here must be hold by
 		 * the caller.
 		 */
-		ret = huge_add_to_page_cache(page, mapping, idx);
+		ret = hugetlb_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
 		page_in_pagecache = true;
-- 
2.35.1



* [RFC PATCH v2 4/6] hugetlbfs: catch and handle truncate racing with page faults
  2022-04-20 22:37 [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
                   ` (2 preceding siblings ...)
  2022-04-20 22:37 ` [RFC PATCH v2 3/6] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
@ 2022-04-20 22:37 ` Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 5/6] hugetlbfs: Do not use pmd locks if hugetlb sharing possible Mike Kravetz
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-04-20 22:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, David Hildenbrand,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Ray Fucillo, Andrew Morton, Mike Kravetz

Most hugetlb fault handling code checks for faults beyond i_size.
While there are early checks in the code paths, the most difficult
to handle are those discovered after taking the page table lock.
At this point, we may have already allocated a page, consumed the
associated reservation, and added the page to the page cache.

When discovering a fault beyond i_size, be sure to:
- Remove the page from page cache, else it will sit there until the
  file is removed.
- Do not restore any reservation for the page consumed.  Otherwise
  there will be an outstanding reservation for an offset beyond the
  end of file.

The 'truncation' code in remove_inode_hugepages must deal with fault
code potentially removing a page from the cache after the page was
returned by pagevec_lookup and before locking the page.  This can be
detected by a change in page_mapping() after taking the page lock.  In
addition, this code must deal with fault code potentially consuming
and returning reservations.  To synchronize this, remove_inode_hugepages
will now take the fault mutex for ALL indices in the hole or truncated
range.  In this way, it KNOWS the fault code has finished, or will see
the updated file size.
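
Condensed, the synchronization added to remove_inode_hugepages looks
roughly like the loop below (see the hunks that follow); the fault
mutex is cycled even for indices with no page in the page cache:

	for (m_index = m_start; m_index < m_end; m_index++) {
		hash = hugetlb_fault_mutex_hash(mapping, m_index);
		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		/* any in-flight fault for m_index has finished or backed out */
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	}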

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 104 +++++++++++++++++++++++++++++++------------
 mm/hugetlb.c         |  39 ++++++++++++----
 2 files changed, 105 insertions(+), 38 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 5e4bd2f1705f..d239646fa85d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -443,11 +443,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  * truncation is indicated by end of range being LLONG_MAX
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() prevents page faults in the
- *	truncated range.  It checks i_size before allocation, and again after
- *	with the page table lock for the page held.  The same lock must be
- *	acquired to unmap a page.
+ *	maps and global counts.  Page faults can race with truncation.
+ *	During faults, hugetlb_no_page() checks i_size before page allocation,
+ *	and again after	obtaining page table lock.  It will 'back out'
+ *	allocations in the truncated range.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -456,14 +455,26 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  *	This is indicated if we find a mapped page.
  * Note: If the passed end of range value is beyond the end of file, but
  * not LLONG_MAX this routine still performs a hole punch operation.
+ *
+ * Since page faults can race with this routine, care must be taken as both
+ * modify huge page reservation data.  To somewhat synchronize these operations
+ * the hugetlb fault mutex is taken for EVERY index in the range to be hole
+ * punched or truncated.  In this way, we KNOW fault code will either have
+ * completed backout operations under the mutex, or fault code will see the
+ * updated file size and not allocate a page for offsets beyond truncated size.
+ * The parameter 'lm_end' indicates the offset of the end of hole or file
+ * before truncation.  For hole punch lm_end == lend.
  */
 static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
-				   loff_t lend)
+				   loff_t lend, loff_t lm_end)
 {
+	u32 hash;
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
 	const pgoff_t end = lend >> huge_page_shift(h);
+	pgoff_t m_end = lm_end >> huge_page_shift(h);
+	pgoff_t m_start, m_index;
 	struct pagevec pvec;
 	pgoff_t next, index;
 	int i, freed = 0;
@@ -475,14 +486,33 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 		/*
 		 * When no more pages are found, we are done.
 		 */
-		if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1))
+		m_start = next;
+		if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1)) {
+			/*
+			 * To synchronize with faults, take fault mutex for
+			 * each index in range.
+			 */
+			for (m_index = m_start; m_index < m_end; m_index++) {
+				hash = hugetlb_fault_mutex_hash(mapping,
+						m_index);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			}
 			break;
+		}
 
 		for (i = 0; i < pagevec_count(&pvec); ++i) {
 			struct page *page = pvec.pages[i];
-			u32 hash = 0;
 
 			index = page->index;
+			/* Take fault mutex for missing pages before index */
+			for (m_index = m_start; m_index < index; m_index++) {
+				hash = hugetlb_fault_mutex_hash(mapping,
+						m_index);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			}
+			m_start = index + 1;
 			hash = hugetlb_fault_mutex_hash(mapping, index);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -491,13 +521,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * unmapped in caller.  Unmap (again) now after taking
 			 * the fault mutex.  The mutex will prevent faults
 			 * until we finish removing the page.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
 			 */
 			if (unlikely(page_mapped(page))) {
-				BUG_ON(truncate_op);
-
 				i_mmap_lock_write(mapping);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
@@ -508,27 +533,46 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 			lock_page(page);
 			/*
-			 * We must free the huge page and remove from page
-			 * cache BEFORE removing the region/reserve map
-			 * (hugetlb_unreserve_pages).  In rare out of memory
-			 * conditions, removal of the region/reserve map could
-			 * fail. Correspondingly, the subpool and global
-			 * reserve usage count can need to be adjusted.
+			 * After locking page, make sure mapping is the same.
+			 * We could have raced with page fault populate and
+			 * backout code.
 			 */
-			VM_BUG_ON(HPageRestoreReserve(page));
-			hugetlb_delete_from_page_cache(page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
+			if (page_mapping(page) == mapping) {
+				/*
+				 * We must free the huge page and remove from
+				 * page cache BEFORE removing the region/
+				 * reserve map (hugetlb_unreserve_pages).  In
+				 * rare out of memory conditions, removal of
+				 * the region/reserve map could fail.
+				 * Correspondingly, the subpool and global
+				 * reserve usage count can need to be adjusted.
+				 */
+				VM_BUG_ON(HPageRestoreReserve(page));
+				hugetlb_delete_from_page_cache(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(
+					    hugetlb_unreserve_pages(inode,
 							index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
+						hugetlb_fix_reserve_counts(
+								inode);
+				}
 			}
-
 			unlock_page(page);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
+
+		if (!(next < end)) {
+			/* Will exit loop, take mutex for indices up to end */
+			for (m_index = m_start; m_index < m_end; m_index++) {
+				hash = hugetlb_fault_mutex_hash(mapping,
+								m_index);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			}
+		}
 	}
 
 	if (truncate_op)
@@ -538,8 +582,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
+	loff_t prev_size = i_size_read(inode);
 
-	remove_inode_hugepages(inode, 0, LLONG_MAX);
+	remove_inode_hugepages(inode, 0, LLONG_MAX, prev_size);
 
 	/*
 	 * Get the resv_map from the address space embedded in the inode.
@@ -559,6 +604,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
+	loff_t prev_size = i_size_read(inode);
 
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
@@ -569,7 +615,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
-	remove_inode_hugepages(inode, offset, LLONG_MAX);
+	remove_inode_hugepages(inode, offset, LLONG_MAX, prev_size);
 }
 
 static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
@@ -603,7 +649,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 					      hole_start >> PAGE_SHIFT,
 					      hole_end >> PAGE_SHIFT, 0);
 		i_mmap_unlock_write(mapping);
-		remove_inode_hugepages(inode, hole_start, hole_end);
+		remove_inode_hugepages(inode, hole_start, hole_end, hole_end);
 		inode_unlock(inode);
 	}
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d60997462df8..e02df3527a9c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5436,6 +5436,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool beyond_i_size = false;
+	bool reserve_alloc = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5493,6 +5495,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (HPageRestoreReserve(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(page, mapping, idx);
@@ -5551,8 +5555,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
+	if (idx >= size) {
+		beyond_i_size = true;
 		goto backout;
+	}
 
 	ret = 0;
 	/* If pte changed from under us, retry */
@@ -5597,10 +5603,25 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
+	if (new_page) {
+		if (new_pagecache_page)
+			hugetlb_delete_from_page_cache(page);
+
+		/*
+		 * If reserve was consumed, make sure flag is set so that it
+		 * will be restored in free_huge_page().
+		 */
+		if (reserve_alloc)
+			SetHPageRestoreReserve(page);
+
+		/*
+		 * Do not restore reserve map entries beyond i_size.
+		 * Otherwise, there will be leaks when the file is removed.
+		 */
+		if (!beyond_i_size)
+			restore_reserve_on_error(h, vma, haddr, page);
+	}
 	unlock_page(page);
-	/* restore reserve for newly allocated pages not in page cache */
-	if (new_page && !new_pagecache_page)
-		restore_reserve_on_error(h, vma, haddr, page);
 	put_page(page);
 	goto out;
 }
@@ -5921,15 +5942,15 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * Recheck the i_size after holding PT lock to make sure not
 	 * to leave any page mapped (as page_mapped()) beyond the end
 	 * of the i_size (remove_inode_hugepages() is strict about
-	 * enforcing that). If we bail out here, we'll also leave a
-	 * page in the radix tree in the vm_shared case beyond the end
-	 * of the i_size, but remove_inode_hugepages() will take care
-	 * of it as soon as we drop the hugetlb_fault_mutex_table.
+	 * enforcing that). If we bail out here, remove the page
+	 * added to the radix tree.
 	 */
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	ret = -EFAULT;
-	if (idx >= size)
+	if (idx >= size) {
+		hugetlb_delete_from_page_cache(page);
 		goto out_release_unlock;
+	}
 
 	ret = -EEXIST;
 	/*
-- 
2.35.1



* [RFC PATCH v2 5/6] hugetlbfs: Do not use pmd locks if hugetlb sharing possible
  2022-04-20 22:37 [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
                   ` (3 preceding siblings ...)
  2022-04-20 22:37 ` [RFC PATCH v2 4/6] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
@ 2022-04-20 22:37 ` Mike Kravetz
  2022-04-20 22:37 ` [RFC PATCH v2 6/6] hugetlb: Check for pmd unshare and fault/lookup races Mike Kravetz
  2022-04-22 16:38 ` [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
  6 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-04-20 22:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, David Hildenbrand,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Ray Fucillo, Andrew Morton, Mike Kravetz

In hugetlbfs, split pmd page table locks are generally used if
huge_page_size is equal to PMD_SIZE.  These locks are located in the
struct page of the corresponding pmd page.  A pmd pointer is used to
locate the page.

In the case of pmd sharing, pmd pointers can become invalid unless one
holds the page table lock.  This creates a chicken/egg problem, as we
need to use the pointer to locate the lock.  To address this issue, use
the page_table_lock in the mm_struct if the pmd pointer is associated
with a sharable vma.

The routines dealing with huge pte locks (huge_pte_lockptr and
huge_pte_lock) are modified to take a vma pointer instead of an mm
pointer.  The vma is then checked to determine if sharing is possible.
If it is, then the page table lock in the mm_struct is used.  Otherwise,
the lock in the pmd page struct page is used.

Note that the code uses the mm_struct if any part of the vma is sharable.
This could be optimized by passing in the virtual address associated
with the pte pointer to determine if that specific address is sharable.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 arch/powerpc/mm/pgtable.c |  2 +-
 include/linux/hugetlb.h   | 27 ++++--------
 mm/damon/vaddr.c          |  4 +-
 mm/hmm.c                  |  2 +-
 mm/hugetlb.c              | 92 +++++++++++++++++++++++++++++----------
 mm/mempolicy.c            |  2 +-
 mm/migrate.c              |  2 +-
 mm/page_vma_mapped.c      |  2 +-
 8 files changed, 85 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 6ec5a7dd7913..02f76e8b735a 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -261,7 +261,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 
 		psize = hstate_get_psize(h);
 #ifdef CONFIG_DEBUG_VM
-		assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep));
+		assert_spin_locked(huge_pte_lockptr(h, vma, ptep));
 #endif
 
 #else
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 75f4ff481538..c37611eb8571 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -864,15 +864,8 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return modified_mask;
 }
 
-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
-					   struct mm_struct *mm, pte_t *pte)
-{
-	if (huge_page_size(h) == PMD_SIZE)
-		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
-	return &mm->page_table_lock;
-}
-
+spinlock_t *huge_pte_lockptr(struct hstate *h, struct vm_area_struct *vma,
+					   pte_t *pte);
 #ifndef hugepages_supported
 /*
  * Some platform decide whether they support huge pages at boot
@@ -1073,8 +1066,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 }
 
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
-					   struct mm_struct *mm, pte_t *pte)
+					   struct vm_area_struct *vma,
+					   pte_t *pte)
 {
+	struct mm_struct *mm = vma->vm_mm;
+
 	return &mm->page_table_lock;
 }
 
@@ -1096,15 +1092,8 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr
 }
 #endif	/* CONFIG_HUGETLB_PAGE */
 
-static inline spinlock_t *huge_pte_lock(struct hstate *h,
-					struct mm_struct *mm, pte_t *pte)
-{
-	spinlock_t *ptl;
-
-	ptl = huge_pte_lockptr(h, mm, pte);
-	spin_lock(ptl);
-	return ptl;
-}
+spinlock_t *huge_pte_lock(struct hstate *h, struct vm_area_struct *vma,
+					pte_t *pte);
 
 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
 extern void __init hugetlb_cma_reserve(int order);
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index b2ec0aa1ff45..125439fc88b6 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -432,7 +432,7 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
+	ptl = huge_pte_lock(h, walk->vma, pte);
 	entry = huge_ptep_get(pte);
 	if (!pte_present(entry))
 		goto out;
@@ -555,7 +555,7 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
+	ptl = huge_pte_lock(h, walk->vma, pte);
 	entry = huge_ptep_get(pte);
 	if (!pte_present(entry))
 		goto out;
diff --git a/mm/hmm.c b/mm/hmm.c
index 3fd3242c5e50..95b443f2e48e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -486,7 +486,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
+	ptl = huge_pte_lock(hstate_vma(vma), vma, pte);
 	entry = huge_ptep_get(pte);
 
 	i = (start - range->start) >> PAGE_SHIFT;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e02df3527a9c..c1352ab7f941 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -94,8 +94,32 @@ DEFINE_SPINLOCK(hugetlb_lock);
 static int num_fault_mutexes;
 struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
-/* Forward declaration */
+/* Forward declarations */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
+static bool vma_range_shareable(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end);
+
+spinlock_t *huge_pte_lockptr(struct hstate *h, struct vm_area_struct *vma,
+				pte_t *pte)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (huge_page_size(h) == PMD_SIZE &&
+			!vma_range_shareable(vma, vma->vm_start, vma->vm_end))
+		return pmd_lockptr(mm, (pmd_t *) pte);
+	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
+	return &mm->page_table_lock;
+}
+
+spinlock_t *huge_pte_lock(struct hstate *h, struct vm_area_struct *vma,
+				pte_t *pte)
+{
+	spinlock_t *ptl;
+
+	ptl = huge_pte_lockptr(h, vma, pte);
+	spin_lock(ptl);
+	return ptl;
+}
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
@@ -4753,8 +4777,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
 			continue;
 
-		dst_ptl = huge_pte_lock(h, dst, dst_pte);
-		src_ptl = huge_pte_lockptr(h, src, src_pte);
+		dst_ptl = huge_pte_lock(h, dst_vma, dst_pte);
+		src_ptl = huge_pte_lockptr(h, src_vma, src_pte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
@@ -4830,8 +4854,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				put_page(ptepage);
 
 				/* Install the new huge page if src pte stable */
-				dst_ptl = huge_pte_lock(h, dst, dst_pte);
-				src_ptl = huge_pte_lockptr(h, src, src_pte);
+				dst_ptl = huge_pte_lock(h, dst_vma, dst_pte);
+				src_ptl = huge_pte_lockptr(h, src_vma, src_pte);
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
@@ -4882,8 +4906,8 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 	spinlock_t *src_ptl, *dst_ptl;
 	pte_t pte;
 
-	dst_ptl = huge_pte_lock(h, mm, dst_pte);
-	src_ptl = huge_pte_lockptr(h, mm, src_pte);
+	dst_ptl = huge_pte_lock(h, vma, dst_pte);
+	src_ptl = huge_pte_lockptr(h, vma, src_pte);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst ptlocks
@@ -4988,7 +5012,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		if (!ptep)
 			continue;
 
-		ptl = huge_pte_lock(h, mm, ptep);
+		ptl = huge_pte_lock(h, vma, ptep);
 		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
 			spin_unlock(ptl);
 			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
@@ -5485,7 +5509,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * here.  Before returning error, get ptl and make
 			 * sure there really is no pte entry.
 			 */
-			ptl = huge_pte_lock(h, mm, ptep);
+			ptl = huge_pte_lock(h, vma, ptep);
 			ret = 0;
 			if (huge_pte_none(huge_ptep_get(ptep)))
 				ret = vmf_error(PTR_ERR(page));
@@ -5553,7 +5577,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		vma_end_reservation(h, vma, haddr);
 	}
 
-	ptl = huge_pte_lock(h, mm, ptep);
+	ptl = huge_pte_lock(h, vma, ptep);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size) {
 		beyond_i_size = true;
@@ -5733,7 +5757,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 								vma, haddr);
 	}
 
-	ptl = huge_pte_lock(h, mm, ptep);
+	ptl = huge_pte_lock(h, vma, ptep);
 
 	/* Check for a racing update before calling hugetlb_wp() */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
@@ -5935,7 +5959,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		page_in_pagecache = true;
 	}
 
-	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
+	ptl = huge_pte_lockptr(h, dst_vma, dst_pte);
 	spin_lock(ptl);
 
 	/*
@@ -6089,7 +6113,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
 				      huge_page_size(h));
 		if (pte)
-			ptl = huge_pte_lock(h, mm, pte);
+			ptl = huge_pte_lock(h, vma, pte);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
 
 		/*
@@ -6267,7 +6291,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, address, psize);
 		if (!ptep)
 			continue;
-		ptl = huge_pte_lock(h, mm, ptep);
+		ptl = huge_pte_lock(h, vma, ptep);
 		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
 			/*
 			 * When uffd-wp is enabled on the vma, unshare
@@ -6583,26 +6607,44 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	return saddr;
 }
 
-static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
+static bool __vma_aligned_range_shareable(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
 {
-	unsigned long base = addr & PUD_MASK;
-	unsigned long end = base + PUD_SIZE;
-
 	/*
 	 * check on proper vm_flags and page table alignment
 	 */
-	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
+	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, start, end))
 		return true;
 	return false;
 }
 
+static bool vma_range_shareable(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
+{
+	unsigned long v_start = ALIGN(vma->vm_start, PUD_SIZE),
+		      v_end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);
+
+	if (v_start >= v_end)
+		return false;
+
+	return __vma_aligned_range_shareable(vma, v_start, v_end);
+}
+
+static bool vma_addr_shareable(struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long start = addr & PUD_MASK;
+	unsigned long end = start + PUD_SIZE;
+
+	return __vma_aligned_range_shareable(vma, start, end);
+}
+
 bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
 {
 #ifdef CONFIG_USERFAULTFD
 	if (uffd_disable_huge_pmd_share(vma))
 		return false;
 #endif
-	return vma_shareable(vma, addr);
+	return vma_addr_shareable(vma, addr);
 }
 
 /*
@@ -6672,7 +6714,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!spte)
 		goto out;
 
-	ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
+	ptl = huge_pte_lock(hstate_vma(vma), vma, spte);
 	if (pud_none(*pud)) {
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
@@ -6719,6 +6761,12 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
+static bool vma_range_shareable(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
+{
+	return false;
+}
+
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
 {
@@ -7034,7 +7082,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 		ptep = huge_pte_offset(mm, address, sz);
 		if (!ptep)
 			continue;
-		ptl = huge_pte_lock(h, mm, ptep);
+		ptl = huge_pte_lock(h, vma, ptep);
 		/* We don't want 'address' to be changed */
 		huge_pmd_unshare(mm, vma, &tmp, ptep);
 		spin_unlock(ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 58af432a39b2..4692640847eb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -577,7 +577,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
+	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->vma, pte);
 	entry = huge_ptep_get(pte);
 	if (!pte_present(entry))
 		goto unlock;
diff --git a/mm/migrate.c b/mm/migrate.c
index b2678279eb43..3d765ee101ad 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -318,7 +318,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 void migration_entry_wait_huge(struct vm_area_struct *vma,
 		struct mm_struct *mm, pte_t *pte)
 {
-	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, pte);
+	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma, pte);
 	__migration_entry_wait(mm, pte, ptl);
 }
 
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index c10f839fc410..f09eaef2a828 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -174,7 +174,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (!pvmw->pte)
 			return false;
 
-		pvmw->ptl = huge_pte_lockptr(hstate, mm, pvmw->pte);
+		pvmw->ptl = huge_pte_lockptr(hstate, vma, pvmw->pte);
 		spin_lock(pvmw->ptl);
 		if (!check_pte(pvmw))
 			return not_found(pvmw);
-- 
2.35.1



* [RFC PATCH v2 6/6] hugetlb: Check for pmd unshare and fault/lookup races
  2022-04-20 22:37 [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
                   ` (4 preceding siblings ...)
  2022-04-20 22:37 ` [RFC PATCH v2 5/6] hugetlbfs: Do not use pmd locks if hugetlb sharing possible Mike Kravetz
@ 2022-04-20 22:37 ` Mike Kravetz
  2022-04-22 16:38 ` [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
  6 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-04-20 22:37 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, David Hildenbrand,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Ray Fucillo, Andrew Morton, Mike Kravetz

When a pmd is 'unshared' it effectively deletes part of a process's
page tables.  The routine huge_pmd_unshare must be called with
i_mmap_rwsem held in write mode and the page table locked.  However,
consider a page fault happening within that same process.  We could
have the following race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare
Could be in a previously                        unlock page table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

If the above race happens, we can update the pte of another process.

Catch this situation by doing another huge_pte_offset/page table
walk after obtaining the page table lock and comparing the pointers.
If the pointers differ, we know a race happened and can back out and
clean up.

In the fault code, make sure to check for this race AFTER checking for
faults beyond i_size so the page cache can be cleaned up properly.
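
To make the check concrete, here is a minimal sketch of the
lock-then-revalidate step used in the fault paths below
(hugetlb_no_page(), hugetlb_fault(), hugetlb_mcopy_atomic_pte()).
The helper name and its exact shape are invented for illustration;
the patch open-codes the check at each call site, and huge_pte_lock()
is assumed to take a vma argument as introduced in patch 5 of this
series:

/* kernel-context sketch; assumes the usual hugetlb declarations */
#include <linux/hugetlb.h>
#include <linux/mm.h>

/*
 * Take the page table lock for @ptep, then re-walk the page table.
 * If the walk no longer returns @ptep, a racing huge_pmd_unshare()
 * invalidated it: drop the lock and tell the caller to back out.
 */
static spinlock_t *hugetlb_lock_and_revalidate(struct hstate *h,
		struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long haddr, pte_t *ptep)
{
	spinlock_t *ptl = huge_pte_lock(h, vma, ptep);

	if (huge_pte_offset(mm, haddr, huge_page_size(h)) != ptep) {
		spin_unlock(ptl);
		return NULL;	/* caller must back out and retry/fail */
	}
	return ptl;		/* ptep stays valid while ptl is held */
}

A NULL return means any partial state (page cache insertion, reserve
adjustments) must be undone, exactly as the existing backout paths do.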

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c1352ab7f941..804a8d0a2cb8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4735,6 +4735,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *src_vma)
 {
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
+	pte_t *src_pte2;
 	struct page *ptepage;
 	unsigned long addr;
 	bool cow = is_cow_mapping(src_vma->vm_flags);
@@ -4783,7 +4784,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
 again:
-		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
+
+		src_pte2 = huge_pte_offset(src, addr, sz);
+		if (unlikely(src_pte2 != src_pte)) {
+			/*
+			 * Another thread could have unshared src_pte.
+			 * Just skip.
+			 */
+			;
+		} else if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
 			/*
 			 * Skip if src entry none.  Also, skip in the
 			 * unlikely case dst entry !none as this implies
@@ -5462,6 +5471,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	bool new_page, new_pagecache_page = false;
 	bool beyond_i_size = false;
 	bool reserve_alloc = false;
+	pte_t *ptep2;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5510,8 +5520,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * sure there really is no pte entry.
 			 */
 			ptl = huge_pte_lock(h, vma, ptep);
+			/* ptep2 checks for racing unshare page tables */
+			ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
 			ret = 0;
-			if (huge_pte_none(huge_ptep_get(ptep)))
+			if (ptep2 == ptep && huge_pte_none(huge_ptep_get(ptep)))
 				ret = vmf_error(PTR_ERR(page));
 			spin_unlock(ptl);
 			goto out;
@@ -5584,6 +5596,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		goto backout;
 	}
 
+	/* Check for racing unshare page tables */
+	ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
+	if (ptep2 != ptep)
+		goto backout;
+
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5677,7 +5694,7 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-	pte_t *ptep, entry;
+	pte_t *ptep, *ptep2, entry;
 	spinlock_t *ptl;
 	vm_fault_t ret;
 	u32 hash;
@@ -5759,8 +5776,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptl = huge_pte_lock(h, vma, ptep);
 
-	/* Check for a racing update before calling hugetlb_wp() */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+	/* Check for a racing update or unshare before calling hugetlb_wp() */
+	ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
+	if (unlikely(ptep2 != ptep || !pte_same(entry, huge_ptep_get(ptep))))
 		goto out_ptl;
 
 	/* Handle userfault-wp first, before trying to lock more pages */
@@ -5861,6 +5879,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	struct page *page;
 	int writable;
 	bool page_in_pagecache = false;
+	pte_t *ptep2;
 
 	if (is_continue) {
 		ret = -EFAULT;
@@ -5976,6 +5995,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_unlock;
 	}
 
+	/* Check for racing unshare page tables */
+	ptep2 = huge_pte_offset(dst_mm, dst_addr, huge_page_size(h));
+	if (unlikely(ptep2 != dst_pte))
+		goto out_release_unlock;
+
 	ret = -EEXIST;
 	/*
 	 * We allow to overwrite a pte marker: consider when both MISSING|WP
-- 
2.35.1



* Re: [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again
  2022-04-20 22:37 [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
                   ` (5 preceding siblings ...)
  2022-04-20 22:37 ` [RFC PATCH v2 6/6] hugetlb: Check for pmd unshare and fault/lookup races Mike Kravetz
@ 2022-04-22 16:38 ` Mike Kravetz
  6 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-04-22 16:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, David Hildenbrand,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Ray Fucillo, Andrew Morton

On 4/20/22 15:37, Mike Kravetz wrote:
> 
> Patch 5 in this series makes basic changes to page table locking for
> hugetlb mappings.  Currently code uses (split) pmd locks for hugetlb
> mappings if page size is PMD_SIZE.  A pointer to the pmd is required
> to find the page struct containing the lock.  However, with pmd sharing
> the pmd pointer is not stable until we hold the pmd lock.  To solve
> this chicken/egg problem, we use the page_table_lock in mm_struct if
> the pmd pointer is associated with a mapping where pmd sharing is
> possible.  A study of the performance implications of this change still
> needs to be performed.

Sorry, this approach is totally wrong!!!

If sharing pmds, we MUST use the pmd-specific lock, as that is the lock
common to all sharing processes.  If we use the process-specific lock in
mm_struct, we are not synchronizing pmd updates between processes.
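
As a plain-userspace analogy (pthreads only, nothing hugetlb-specific,
and every name below is invented for illustration): two writers that
each take their own private lock around a read-modify-write of shared
data are not serialized at all; the lock has to live with the shared
object, which for a shared pmd means the split lock in the pmd page
rather than a per-mm page_table_lock.

/* Build with: gcc -O2 -pthread lock_analogy.c */
#include <pthread.h>
#include <stdio.h>

static long shared_entry;	/* stands in for a shared pmd entry */
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

struct writer {
	pthread_mutex_t private_lock;	/* analogous to a per-mm lock */
};

static void *broken_update(void *arg)
{
	struct writer *w = arg;

	for (int i = 0; i < 1000000; i++) {
		/* only excludes this writer from itself: the RMW races */
		pthread_mutex_lock(&w->private_lock);
		shared_entry++;
		pthread_mutex_unlock(&w->private_lock);
	}
	return NULL;
}

static void *correct_update(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		/* lock belongs to the shared object: updates serialize */
		pthread_mutex_lock(&shared_lock);
		shared_entry++;
		pthread_mutex_unlock(&shared_lock);
	}
	return NULL;
}

int main(void)
{
	struct writer w1 = { .private_lock = PTHREAD_MUTEX_INITIALIZER };
	struct writer w2 = { .private_lock = PTHREAD_MUTEX_INITIALIZER };
	pthread_t t1, t2;

	pthread_create(&t1, NULL, broken_update, &w1);
	pthread_create(&t2, NULL, broken_update, &w2);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("per-writer locks: %ld (usually < 2000000, updates lost)\n",
	       shared_entry);

	shared_entry = 0;
	pthread_create(&t1, NULL, correct_update, NULL);
	pthread_create(&t2, NULL, correct_update, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("shared lock:      %ld (always 2000000)\n", shared_entry);
	return 0;
}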

I am going to rethink the idea of a vma (process) specific synchronization
mechanism for pmd sharing.  I abandoned it early because of some lock
ordering issues, but I think code may already exist to handle situations
where the lock order becomes a problem.
-- 
Mike Kravetz


end of thread, other threads:[~2022-04-22 16:39 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-20 22:37 [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
2022-04-20 22:37 ` [RFC PATCH v2 1/6] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-04-20 22:37 ` [RFC PATCH v2 2/6] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-04-20 22:37 ` [RFC PATCH v2 3/6] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
2022-04-20 22:37 ` [RFC PATCH v2 4/6] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
2022-04-20 22:37 ` [RFC PATCH v2 5/6] hugetlbfs: Do not use pmd locks if hugetlb sharing possible Mike Kravetz
2022-04-20 22:37 ` [RFC PATCH v2 6/6] hugetlb: Check for pmd unshare and fault/lookup races Mike Kravetz
2022-04-22 16:38 ` [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again Mike Kravetz
