* [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization
@ 2022-09-14 22:18 Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and the benefit of this patch series are not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running:
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

			48 sample Avg
next-20220913		next-20220913			next-20220913
unmodified	revert i_mmap_sema locking	vma sema locking, this series
-----------------------------------------------------------------------------
498150 KB/s		501934 KB/s			504793 KB/s

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings.  To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory.  The shared
mapping size was 250GB.  For comparison, a single instance of the program
was run.  Then, multiple instances were run in parallel to introduce
lock contention.  Changing the locking scheme results in a significant
performance benefit.

test		instances	unmodified	revert		vma
--------------------------------------------------------------------------
faults per sec	1		393043		395680		389932
faults per sec  24		 71405		 81191		 79048
forks per sec   1		  2802		  2747		  2725
forks per sec   24		   439		   536		   500
Combined faults 24		  1621		 68070		 53662
Combined forks  24		   358		    67		   142

Combined test is when running both faulting program and forking program
simultaneously.
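
For reference, a minimal sketch of the fault test (1) described above is
included below.  This is only an approximation: it assumes an anonymous
MAP_SHARED|MAP_HUGETLB mapping, 2MB huge pages and a fixed reporting
interval, and is not the exact program used to produce the numbers above.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define MAP_SIZE	(250UL << 30)	/* 250GB shared hugetlb area */
#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MB huge pages */

int main(void)
{
	unsigned long faults = 0;
	time_t start = time(NULL);

	for (;;) {
		/* map a shared hugetlb area */
		char *area = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB,
				  -1, 0);
		if (area == MAP_FAILED)
			return 1;

		/* write fault every huge page in the area */
		for (unsigned long off = 0; off < MAP_SIZE; off += HPAGE_SIZE) {
			area[off] = 1;
			faults++;
		}

		/* unmap area and repeat */
		munmap(area, MAP_SIZE);

		/* report an approximate faults per second rate */
		if (time(NULL) - start >= 10) {
			printf("%lu faults/sec\n", faults / 10);
			faults = 0;
			start = time(NULL);
		}
	}
}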

Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79.  Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share.  With c0d0381ade79 reverted, this race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d39bb5 exposes races between page faults and file
truncation.  The new vma lock, put to use in patch 8, handles these
races.  This is explained in patch 9, where code associated with the
races is cleaned up.

Patches 3 - 5 restructure existing code in preparation for using the
new vma lock (rw semaphore) for pmd sharing synchronization.  The idea
is that this semaphore will be held in read mode for the duration of
fault processing, and held in write mode for unmap operations which may
call huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required
to synchronize huge pmd sharing.  However, it is only required in the
fault path when setting up sharing, and will be acquired in huge_pmd_share().
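
Conceptually, the resulting lock ordering looks like the sketch below.
This is illustrative only, using the helper names added in patch 6; the
real call sites are added in patch 8 and differ in detail.

/* Fault path: vma lock held in read mode for the whole fault. */
static void fault_path_sketch(struct vm_area_struct *vma)
{
	hugetlb_vma_lock_read(vma);
	/* huge_pte_alloc()/huge_pmd_share(); i_mmap_rwsem taken inside */
	/* ... handle the fault, set_pte_at(), etc. ... */
	hugetlb_vma_unlock_read(vma);
}

/* Unmap path: anything that may call huge_pmd_unshare needs write mode. */
static void unmap_path_sketch(struct vm_area_struct *vma,
			      struct address_space *mapping)
{
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(mapping);
	/* __unmap_hugepage_range() may call huge_pmd_unshare() */
	i_mmap_unlock_write(mapping);
	hugetlb_vma_unlock_write(vma);
}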

Patch 6 adds the new vma lock and all supporting routines, but does not
actually change code to use the new lock.

Patch 7 refactors code in preparation for using the new lock, and patch
8 finally adds code to make use of this new vma lock.  Unfortunately, the
fault code and the truncate/hole punch code would naturally take locks in
opposite orders, which could lead to deadlock.  Since page fault
performance is more important, the truncation/hole punch code is modified
to back out and take locks in the correct order if necessary (see the
sketch below).
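
A rough sketch of that back out, in terms of the patch 6 interfaces, is
below.  It is illustrative only; the actual code lives in
hugetlb_unmap_file_folio (patches 7 and 8) and, among other details,
rechecks that the folio is still mapped by the vma after the locks are
reacquired.

static void truncate_unmap_vma_sketch(struct vm_area_struct *vma,
				      struct address_space *mapping)
{
	struct hugetlb_vma_lock *vma_lock = NULL;

	/* caller already holds i_mmap_rwsem in write mode */
	if (!hugetlb_vma_trylock_write(vma)) {
		/*
		 * A fault holds the vma lock and may be waiting on
		 * i_mmap_rwsem.  Take a ref so the lock structure cannot
		 * be freed, drop i_mmap_rwsem, then reacquire both in the
		 * order the fault path uses.
		 */
		vma_lock = vma->vm_private_data;
		kref_get(&vma_lock->refs);
		i_mmap_unlock_write(mapping);

		down_write(&vma_lock->rw_sema);
		i_mmap_lock_write(mapping);
	}

	/* ... unmap the folio from this vma ... */

	hugetlb_vma_unlock_write(vma);
	if (vma_lock)
		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
}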

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/

v1  -> v2
- Addressed more issues pointed out by Miaohe Lin.  Thanks again!  And,
  added some Reviewed-by's.
- Addressed a performance issue with files that have large holes, as
  reported by Sven Schnelle.  Ultimately, removed the code that used the
  fault mutex to address fault/truncation races, as the new vma lock is
  sufficient for this.
- Made the vma lock a ref counted structure so that it can exist
  separately from its associated vma.  This makes the unmap code that
  backs out and takes locks in order cleaner and a little simpler.
- Rebased and retested on next-20220913

RFC -> v1
- Addressed many issues pointed out by Miaohe Lin.  Thank you!  Most
  significant was not attempting to backout pages in fault code due to
  races with truncation (patch 4).
- Rebased and retested on next-20220822

Mike Kravetz (9):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
  hugetlb: create remove_inode_single_folio to remove single file folio
  hugetlb: rename vma_shareable() and refactor code
  hugetlb: add vma based lock for pmd sharing
  hugetlb: create hugetlb_unmap_file_folio to unmap single file folio
  hugetlb: use new vma_lock for pmd sharing synchronization
  hugetlb: clean up code checking for fault/truncation races

 fs/hugetlbfs/inode.c    | 300 +++++++++++++++++++++++---------
 include/linux/hugetlb.h |  45 ++++-
 kernel/fork.c           |   6 +-
 mm/hugetlb.c            | 373 ++++++++++++++++++++++++++++++----------
 mm/memory.c             |   2 +
 mm/rmap.c               | 114 +++++++-----
 mm/userfaultfd.c        |  14 +-
 7 files changed, 621 insertions(+), 233 deletions(-)

-- 
2.37.2


* [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 2/9] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  The use of i_mmap_rwsem to prevent
fault/truncate races depends on this.  However, this has been shown to
cause performance/scaling issues.  As a result, that code will be
reverted.  Since the use of i_mmap_rwsem to address page fault/truncate
races depends on this, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 fs/hugetlbfs/inode.c | 30 +++++++++---------------------
 mm/hugetlb.c         | 22 +++++++++++-----------
 2 files changed, 20 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index f7a5b5124d8a..a32031e751d1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -419,9 +419,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
  *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() holds i_mmap_rwsem and prevents
- *	page faults in the truncated range by checking i_size.  i_size is
- *	modified while holding i_mmap_rwsem.
+ *	in this routine.  hugetlb_no_page() prevents page faults in the
+ *	truncated range.  It checks i_size before allocation, and again after
+ *	with the page table lock for the page held.  The same lock must be
+ *	acquired to unmap a page.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -451,16 +452,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash = 0;
 
 			index = folio->index;
-			if (!truncate_op) {
-				/*
-				 * Only need to hold the fault mutex in the
-				 * hole punch case.  This prevents races with
-				 * page faults.  Races are not possible in the
-				 * case of truncation.
-				 */
-				hash = hugetlb_fault_mutex_hash(mapping, index);
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
-			}
+			hash = hugetlb_fault_mutex_hash(mapping, index);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
 			 * If folio is mapped, it was faulted in after being
@@ -504,8 +497,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			folio_unlock(folio);
-			if (!truncate_op)
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
@@ -543,8 +535,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
-	i_mmap_lock_write(mapping);
 	i_size_write(inode, offset);
+	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
@@ -703,11 +695,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/*
-		 * fault mutex taken here, protects against fault path
-		 * and hole punch.  inode_lock previously taken protects
-		 * against truncation.
-		 */
+		/* mutex taken here, fault path and hole punch */
 		hash = hugetlb_fault_mutex_hash(mapping, index);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c6b53bcf823d..6c97b97aa252 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5559,17 +5559,15 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	/*
-	 * We can not race with truncation due to holding i_mmap_rwsem.
-	 * i_size is modified when holding i_mmap_rwsem, so check here
-	 * once for faults beyond end of file.
+	 * Use page lock to guard against racing truncation
+	 * before we get page_table_lock.
 	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto out;
-
 	new_page = false;
 	page = find_lock_page(mapping, idx);
 	if (!page) {
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
+		if (idx >= size)
+			goto out;
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
@@ -5665,6 +5663,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
+	if (idx >= size)
+		goto backout;
+
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5773,10 +5775,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This serves two purposes:
-	 * 1) It prevents huge_pmd_unshare from being called elsewhere
-	 *    and making the ptep no longer valid.
-	 * 2) It synchronizes us with i_size modifications during truncation.
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
 	 *
 	 * ptep could have already be assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless
-- 
2.37.2


* [PATCH v2 2/9] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 3/9] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache Mike Kravetz
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues.  Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.

NOTE: Reverting this code does expose the following race condition.

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

It is unknown if the above race was ever experienced by a user.  It was
discovered via code inspection when the race was initially addressed.

In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 fs/hugetlbfs/inode.c |  2 --
 mm/hugetlb.c         | 77 +++++++-------------------------------------
 mm/rmap.c            |  8 +----
 mm/userfaultfd.c     | 11 ++-----
 4 files changed, 15 insertions(+), 83 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a32031e751d1..dfb735a91bbb 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -467,9 +467,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			if (unlikely(folio_mapped(folio))) {
 				BUG_ON(truncate_op);
 
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 				i_mmap_lock_write(mapping);
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
 					(index + 1) * pages_per_huge_page(h),
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c97b97aa252..00fba195a439 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4769,7 +4769,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	struct hstate *h = hstate_vma(src_vma);
 	unsigned long sz = huge_page_size(h);
 	unsigned long npages = pages_per_huge_page(h);
-	struct address_space *mapping = src_vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
 	int ret = 0;
@@ -4781,14 +4780,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		mmu_notifier_invalidate_range_start(&range);
 		mmap_assert_write_locked(src);
 		raw_write_seqcount_begin(&src->write_protect_seq);
-	} else {
-		/*
-		 * For shared mappings i_mmap_rwsem must be held to call
-		 * huge_pte_alloc, otherwise the returned ptep could go
-		 * away if part of a shared pmd and another thread calls
-		 * huge_pmd_unshare.
-		 */
-		i_mmap_lock_read(mapping);
 	}
 
 	last_addr_mask = hugetlb_mask_last_page(h);
@@ -4935,8 +4926,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	if (cow) {
 		raw_write_seqcount_end(&src->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
-	} else {
-		i_mmap_unlock_read(mapping);
 	}
 
 	return ret;
@@ -5345,29 +5334,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * may get SIGKILLed if it later faults.
 		 */
 		if (outside_reserve) {
-			struct address_space *mapping = vma->vm_file->f_mapping;
-			pgoff_t idx;
-			u32 hash;
-
 			put_page(old_page);
-			/*
-			 * Drop hugetlb_fault_mutex and i_mmap_rwsem before
-			 * unmapping.  unmapping needs to hold i_mmap_rwsem
-			 * in write mode.  Dropping i_mmap_rwsem in read mode
-			 * here is OK as COW mappings do not interact with
-			 * PMD sharing.
-			 *
-			 * Reacquire both after unmap operation.
-			 */
-			idx = vma_hugecache_offset(h, vma, haddr);
-			hash = hugetlb_fault_mutex_hash(mapping, idx);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
-
 			unmap_ref_private(mm, vma, old_page, haddr);
-
-			i_mmap_lock_read(mapping);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			spin_lock(ptl);
 			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 			if (likely(ptep &&
@@ -5522,9 +5490,7 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 	 */
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
 	ret = handle_userfault(&vmf, reason);
-	i_mmap_lock_read(mapping);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	return ret;
@@ -5759,11 +5725,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
-		/*
-		 * Since we hold no locks, ptep could be stale.  That is
-		 * OK as we are only making decisions based on content and
-		 * not actually modifying content here.
-		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, ptep);
@@ -5771,31 +5732,20 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
+	} else {
+		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
+		if (!ptep)
+			return VM_FAULT_OOM;
 	}
 
-	/*
-	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This prevents huge_pmd_unshare from
-	 * being called elsewhere and making the ptep no longer valid.
-	 *
-	 * ptep could have already be assigned via huge_pte_offset.  That
-	 * is OK, as huge_pte_alloc will return the same value unless
-	 * something has changed.
-	 */
 	mapping = vma->vm_file->f_mapping;
-	i_mmap_lock_read(mapping);
-	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
-	if (!ptep) {
-		i_mmap_unlock_read(mapping);
-		return VM_FAULT_OOM;
-	}
+	idx = vma_hugecache_offset(h, vma, haddr);
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -5860,7 +5810,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			put_page(pagecache_page);
 		}
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-		i_mmap_unlock_read(mapping);
 		return handle_userfault(&vmf, VM_UFFD_WP);
 	}
 
@@ -5904,7 +5853,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -6744,12 +6692,10 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner.
- *
- * This routine must be called with i_mmap_rwsem held in at least read mode if
- * sharing is possible.  For hugetlbfs, this prevents removal of any page
- * table entries associated with the address space.  This is important as we
- * are setting up sharing based on existing page table entries (mappings).
+ * code much cleaner. pmd allocation is essential for the shared case because
+ * pud has to be populated inside the same i_mmap_rwsem section - otherwise
+ * racing tasks could either miss the sharing (see huge_pte_offset) or select a
+ * bad pmd for sharing.
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
@@ -6763,7 +6709,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	i_mmap_assert_locked(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -6793,6 +6739,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
+	i_mmap_unlock_read(mapping);
 	return pte;
 }
 
@@ -6803,7 +6750,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * Called with page table lock held and i_mmap_rwsem held in write mode.
+ * Called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/rmap.c b/mm/rmap.c
index 08d552ea4ceb..d17d68a9b15b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,10 +23,9 @@
  * inode->i_rwsem	(while writing or truncating, not reading or faulting)
  *   mm->mmap_lock
  *     mapping->invalidate_lock (in filemap_fault)
- *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
+ *       page->flags PG_locked (lock_page)
  *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
  *           mapping->i_mmap_rwsem
- *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *             anon_vma->rwsem
  *               mm->page_table_lock or pte_lock
  *                 swap_lock (in swap_duplicate, swap_info_get)
@@ -45,11 +44,6 @@
  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
- *
- * * hugetlbfs PageHuge() pages take locks in this order:
- *         mapping->i_mmap_rwsem
- *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *             page->flags PG_locked (lock_page)
  */
 
 #include <linux/mm.h>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9c035be2148b..0fdbd2c05587 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -379,14 +379,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
-		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
-		 * i_mmap_rwsem ensures the dst_pte remains valid even
-		 * in the case of shared pmds.  fault mutex prevents
-		 * races with other faulting threads.
+		 * Serialize via hugetlb_fault_mutex.
 		 */
-		mapping = dst_vma->vm_file->f_mapping;
-		i_mmap_lock_read(mapping);
 		idx = linear_page_index(dst_vma, dst_addr);
+		mapping = dst_vma->vm_file->f_mapping;
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -394,7 +390,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -402,7 +397,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
 			err = -EEXIST;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -411,7 +405,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					       wp_copy);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-		i_mmap_unlock_read(mapping);
 
 		cond_resched();
 
-- 
2.37.2


* [PATCH v2 3/9] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 2/9] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio Mike Kravetz
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

remove_huge_page removes a hugetlb page from the page cache.  Change
to hugetlb_delete_from_page_cache as it is a more descriptive name.
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 fs/hugetlbfs/inode.c    | 21 ++++++++++-----------
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb.c            |  8 ++++----
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dfb735a91bbb..edd69cc43ca5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -364,7 +364,7 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void remove_huge_page(struct page *page)
+static void hugetlb_delete_from_page_cache(struct page *page)
 {
 	ClearPageDirty(page);
 	ClearPageUptodate(page);
@@ -478,15 +478,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			folio_lock(folio);
 			/*
 			 * We must free the huge page and remove from page
-			 * cache (remove_huge_page) BEFORE removing the
-			 * region/reserve map (hugetlb_unreserve_pages).  In
-			 * rare out of memory conditions, removal of the
-			 * region/reserve map could fail. Correspondingly,
-			 * the subpool and global reserve usage count can need
-			 * to be adjusted.
+			 * cache BEFORE removing the region/reserve map
+			 * (hugetlb_unreserve_pages).  In rare out of memory
+			 * conditions, removal of the region/reserve map could
+			 * fail. Correspondingly, the subpool and global
+			 * reserve usage count can need to be adjusted.
 			 */
 			VM_BUG_ON(HPageRestoreReserve(&folio->page));
-			remove_huge_page(&folio->page);
+			hugetlb_delete_from_page_cache(&folio->page);
 			freed++;
 			if (!truncate_op) {
 				if (unlikely(hugetlb_unreserve_pages(inode,
@@ -723,7 +722,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		}
 		clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
-		error = huge_add_to_page_cache(page, mapping, index);
+		error = hugetlb_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
 			restore_reserve_on_error(h, &pseudo_vma, addr, page);
 			put_page(page);
@@ -735,7 +734,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 
 		SetHPageMigratable(page);
 		/*
-		 * unlock_page because locked by huge_add_to_page_cache()
+		 * unlock_page because locked by hugetlb_add_to_page_cache()
 		 * put_page() due to reference from alloc_huge_page()
 		 */
 		unlock_page(page);
@@ -980,7 +979,7 @@ static int hugetlbfs_error_remove_page(struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	pgoff_t index = page->index;
 
-	remove_huge_page(page);
+	hugetlb_delete_from_page_cache(page);
 	if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
 		hugetlb_fix_reserve_counts(inode);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 890f7b6a2eff..0ce916d1afca 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -665,7 +665,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address, struct page *page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 00fba195a439..eb38ae3e7a83 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5429,7 +5429,7 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 	return page != NULL;
 }
 
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
 			   pgoff_t idx)
 {
 	struct folio *folio = page_folio(page);
@@ -5568,7 +5568,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		new_page = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err = huge_add_to_page_cache(page, mapping, idx);
+			int err = hugetlb_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				/*
 				 * err can't be -EEXIST which implies someone
@@ -5980,11 +5980,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 		/*
 		 * Serialization between remove_inode_hugepages() and
-		 * huge_add_to_page_cache() below happens through the
+		 * hugetlb_add_to_page_cache() below happens through the
 		 * hugetlb_fault_mutex_table that here must be hold by
 		 * the caller.
 		 */
-		ret = huge_add_to_page_cache(page, mapping, idx);
+		ret = hugetlb_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
 		page_in_pagecache = true;
-- 
2.37.2


* [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (2 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 3/9] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-24 12:36   ` Miaohe Lin
  2022-09-14 22:18 ` [PATCH v2 5/9] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Create the new routine remove_inode_single_folio that will remove a
single folio from a file.  This is refactored code from
remove_inode_hugepages.  It checks for the uncommon case in which the
folio is still mapped and unmaps it.

No functional change.  This refactoring will be put to use and expanded
upon in subsequent patches.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 105 ++++++++++++++++++++++++++-----------------
 1 file changed, 63 insertions(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index edd69cc43ca5..7112a9a9f54d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -411,6 +411,60 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 	}
 }
 
+/*
+ * Called with hugetlb fault mutex held.
+ * Returns true if page was actually removed, false otherwise.
+ */
+static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
+					struct address_space *mapping,
+					struct folio *folio, pgoff_t index,
+					bool truncate_op)
+{
+	bool ret = false;
+
+	/*
+	 * If folio is mapped, it was faulted in after being
+	 * unmapped in caller.  Unmap (again) while holding
+	 * the fault mutex.  The mutex will prevent faults
+	 * until we finish removing the folio.
+	 */
+	if (unlikely(folio_mapped(folio))) {
+		i_mmap_lock_write(mapping);
+		hugetlb_vmdelete_list(&mapping->i_mmap,
+			index * pages_per_huge_page(h),
+			(index + 1) * pages_per_huge_page(h),
+			ZAP_FLAG_DROP_MARKER);
+		i_mmap_unlock_write(mapping);
+	}
+
+	folio_lock(folio);
+	/*
+	 * After locking page, make sure mapping is the same.
+	 * We could have raced with page fault populate and
+	 * backout code.
+	 */
+	if (folio_mapping(folio) == mapping) {
+		/*
+		 * We must remove the folio from page cache before removing
+		 * the region/ reserve map (hugetlb_unreserve_pages).  In
+		 * rare out of memory conditions, removal of the region/reserve
+		 * map could fail.  Correspondingly, the subpool and global
+		 * reserve usage count can need to be adjusted.
+		 */
+		VM_BUG_ON(HPageRestoreReserve(&folio->page));
+		hugetlb_delete_from_page_cache(&folio->page);
+		ret = true;
+		if (!truncate_op) {
+			if (unlikely(hugetlb_unreserve_pages(inode, index,
+								index + 1, 1)))
+				hugetlb_fix_reserve_counts(inode);
+		}
+	}
+
+	folio_unlock(folio);
+	return ret;
+}
+
 /*
  * remove_inode_hugepages handles two distinct cases: truncation and hole
  * punch.  There are subtle differences in operation for each case.
@@ -418,11 +472,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  * truncation is indicated by end of range being LLONG_MAX
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() prevents page faults in the
- *	truncated range.  It checks i_size before allocation, and again after
- *	with the page table lock for the page held.  The same lock must be
- *	acquired to unmap a page.
+ *	maps and global counts.  Page faults can race with truncation.
+ *	During faults, hugetlb_no_page() checks i_size before page allocation,
+ *	and again after obtaining page table lock.  It will 'back out'
+ *	allocations in the truncated range.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -456,44 +509,12 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
-			 * If folio is mapped, it was faulted in after being
-			 * unmapped in caller.  Unmap (again) now after taking
-			 * the fault mutex.  The mutex will prevent faults
-			 * until we finish removing the folio.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
+			 * Remove folio that was part of folio_batch.
 			 */
-			if (unlikely(folio_mapped(folio))) {
-				BUG_ON(truncate_op);
-
-				i_mmap_lock_write(mapping);
-				hugetlb_vmdelete_list(&mapping->i_mmap,
-					index * pages_per_huge_page(h),
-					(index + 1) * pages_per_huge_page(h),
-					ZAP_FLAG_DROP_MARKER);
-				i_mmap_unlock_write(mapping);
-			}
-
-			folio_lock(folio);
-			/*
-			 * We must free the huge page and remove from page
-			 * cache BEFORE removing the region/reserve map
-			 * (hugetlb_unreserve_pages).  In rare out of memory
-			 * conditions, removal of the region/reserve map could
-			 * fail. Correspondingly, the subpool and global
-			 * reserve usage count can need to be adjusted.
-			 */
-			VM_BUG_ON(HPageRestoreReserve(&folio->page));
-			hugetlb_delete_from_page_cache(&folio->page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
-							index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
-			}
-
-			folio_unlock(folio);
+			if (remove_inode_single_folio(h, inode, mapping, folio,
+							index, truncate_op))
+				freed++;
+
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
-- 
2.37.2


* [PATCH v2 5/9] hugetlb: rename vma_shareable() and refactor code
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (3 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
checking a specific address within the vma.  Refactor code to check if
an aligned range is shareable as this will be needed in a subsequent
patch.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eb38ae3e7a83..8117bc299c46 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6639,26 +6639,33 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	return saddr;
 }
 
-static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
+static bool __vma_aligned_range_pmd_shareable(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
 {
-	unsigned long base = addr & PUD_MASK;
-	unsigned long end = base + PUD_SIZE;
-
 	/*
 	 * check on proper vm_flags and page table alignment
 	 */
-	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
+	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, start, end))
 		return true;
 	return false;
 }
 
+static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
+						unsigned long addr)
+{
+	unsigned long start = addr & PUD_MASK;
+	unsigned long end = start + PUD_SIZE;
+
+	return __vma_aligned_range_pmd_shareable(vma, start, end);
+}
+
 bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
 {
 #ifdef CONFIG_USERFAULTFD
 	if (uffd_disable_huge_pmd_share(vma))
 		return false;
 #endif
-	return vma_shareable(vma, addr);
+	return vma_addr_pmd_shareable(vma, addr);
 }
 
 /*
-- 
2.37.2


* [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (4 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 5/9] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-15 20:25   ` Mike Kravetz
  2022-09-24 13:11   ` Miaohe Lin
  2022-09-14 22:18 ` [PATCH v2 7/9] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Allocate a new hugetlb_vma_lock structure and hang off vm_private_data
for synchronization use by vmas that could be involved in pmd sharing.
This data structure contains a rw semaphore that is the primary tool
used for synchronization.

This new structure is ref counted, so that it can exist when NOT attached
to a vma.  This is only helpful in resolving lock ordering issues where
code may need to obtain the vma_lock while there is no guarantee the
vma will not go away.  By obtaining a ref on the structure, it can be
guaranteed that at least the rw semaphore will not go away.

Only add infrastructure for the new lock here.  Actual use will be added
in subsequent patches.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  43 ++++++++-
 kernel/fork.c           |   6 +-
 mm/hugetlb.c            | 202 ++++++++++++++++++++++++++++++++++++----
 mm/rmap.c               |   8 +-
 4 files changed, 235 insertions(+), 24 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0ce916d1afca..6a1bd172f943 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -114,6 +114,12 @@ struct file_region {
 #endif
 };
 
+struct hugetlb_vma_lock {
+	struct kref refs;
+	struct rw_semaphore rw_sema;
+	struct vm_area_struct *vma;
+};
+
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
 
@@ -126,7 +132,7 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 						long min_hpages);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
-void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
+void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void *, size_t *,
@@ -214,6 +220,14 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
 			     pgd_t *pgd, int flags);
 
+void hugetlb_vma_lock_read(struct vm_area_struct *vma);
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma);
+void hugetlb_vma_lock_write(struct vm_area_struct *vma);
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
+void hugetlb_vma_lock_release(struct kref *kref);
+
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -225,7 +239,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
-static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
+static inline void hugetlb_dup_vma_private(struct vm_area_struct *vma)
 {
 }
 
@@ -336,6 +350,31 @@ static inline int prepare_hugepage_range(struct file *file,
 	return -EINVAL;
 }
 
+static inline void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+}
+
+static inline int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+	return 1;
+}
+
+static inline void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+}
+
 static inline int pmd_huge(pmd_t pmd)
 {
 	return 0;
diff --git a/kernel/fork.c b/kernel/fork.c
index b3399184706c..e85e923537a2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -677,12 +677,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		}
 
 		/*
-		 * Clear hugetlb-related page reserves for children. This only
-		 * affects MAP_PRIVATE mappings. Faults generated by the child
-		 * are not guaranteed to succeed, even if read-only
+		 * Copy/update hugetlb private vma information.
 		 */
 		if (is_vm_hugetlb_page(tmp))
-			reset_vma_resv_huge_pages(tmp);
+			hugetlb_dup_vma_private(tmp);
 
 		/* Link the vma into the MT */
 		mas.index = tmp->vm_start;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8117bc299c46..616be891b798 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -90,6 +90,8 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
@@ -858,7 +860,7 @@ __weak unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
  * faults in a MAP_PRIVATE mapping. Only the process that called mmap()
  * is guaranteed to have their future faults succeed.
  *
- * With the exception of reset_vma_resv_huge_pages() which is called at fork(),
+ * With the exception of hugetlb_dup_vma_private() which is called at fork(),
  * the reserve counters are updated with the hugetlb_lock held. It is safe
  * to reset the VMA at fork() time as it is not in use yet and there is no
  * chance of the global counters getting corrupted as a result of the values.
@@ -1005,12 +1007,20 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
 	return (get_vma_private_data(vma) & flag) != 0;
 }
 
-/* Reset counters to 0 and clear all HPAGE_RESV_* flags */
-void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
+void hugetlb_dup_vma_private(struct vm_area_struct *vma)
 {
 	VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
+	/*
+	 * Clear vm_private_data
+	 * - For MAP_PRIVATE mappings, this is the reserve map which does
+	 *   not apply to children.  Faults generated by the children are
+	 *   not guaranteed to succeed, even if read-only.
+	 * - For shared mappings this is a per-vma semaphore that may be
+	 *   allocated in a subsequent call to hugetlb_vm_op_open.
+	 */
+	vma->vm_private_data = (void *)0;
 	if (!(vma->vm_flags & VM_MAYSHARE))
-		vma->vm_private_data = (void *)0;
+		return;
 }
 
 /*
@@ -1041,7 +1051,7 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
 		kref_put(&reservations->refs, resv_map_release);
 	}
 
-	reset_vma_resv_huge_pages(vma);
+	hugetlb_dup_vma_private(vma);
 }
 
 /* Returns true if the VMA has associated reserve pages */
@@ -4622,16 +4632,21 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 		resv_map_dup_hugetlb_cgroup_uncharge_info(resv);
 		kref_get(&resv->refs);
 	}
+
+	hugetlb_vma_lock_alloc(vma);
 }
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
 	struct hstate *h = hstate_vma(vma);
-	struct resv_map *resv = vma_resv_map(vma);
+	struct resv_map *resv;
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	unsigned long reserve, start, end;
 	long gbl_reserve;
 
+	hugetlb_vma_lock_free(vma);
+
+	resv = vma_resv_map(vma);
 	if (!resv || !is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		return;
 
@@ -6438,6 +6453,11 @@ bool hugetlb_reserve_pages(struct inode *inode,
 		return false;
 	}
 
+	/*
+	 * vma specific semaphore used for pmd sharing synchronization
+	 */
+	hugetlb_vma_lock_alloc(vma);
+
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
 	 * attempt will be made for VM_NORESERVE to allocate a page
@@ -6461,12 +6481,11 @@ bool hugetlb_reserve_pages(struct inode *inode,
 		resv_map = inode_resv_map(inode);
 
 		chg = region_chg(resv_map, from, to, &regions_needed);
-
 	} else {
 		/* Private mapping. */
 		resv_map = resv_map_alloc();
 		if (!resv_map)
-			return false;
+			goto out_err;
 
 		chg = to - from;
 
@@ -6561,6 +6580,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
 	hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
 					    chg * pages_per_huge_page(h), h_cg);
 out_err:
+	hugetlb_vma_lock_free(vma);
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		/* Only call region_abort if the region_chg succeeded but the
 		 * region_add failed or didn't run.
@@ -6640,14 +6660,34 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 }
 
 static bool __vma_aligned_range_pmd_shareable(struct vm_area_struct *vma,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool check_vma_lock)
 {
+#ifdef CONFIG_USERFAULTFD
+	if (uffd_disable_huge_pmd_share(vma))
+		return false;
+#endif
 	/*
 	 * check on proper vm_flags and page table alignment
 	 */
-	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, start, end))
-		return true;
-	return false;
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return false;
+	if (check_vma_lock && !vma->vm_private_data)
+		return false;
+	if (!range_in_vma(vma, start, end))
+		return false;
+	return true;
+}
+
+static bool vma_pmd_shareable(struct vm_area_struct *vma)
+{
+	unsigned long start = ALIGN(vma->vm_start, PUD_SIZE),
+		      end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);
+
+	if (start >= end)
+		return false;
+
+	return __vma_aligned_range_pmd_shareable(vma, start, end, false);
 }
 
 static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
@@ -6656,15 +6696,11 @@ static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
 	unsigned long start = addr & PUD_MASK;
 	unsigned long end = start + PUD_SIZE;
 
-	return __vma_aligned_range_pmd_shareable(vma, start, end);
+	return __vma_aligned_range_pmd_shareable(vma, start, end, true);
 }
 
 bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
 {
-#ifdef CONFIG_USERFAULTFD
-	if (uffd_disable_huge_pmd_share(vma))
-		return false;
-#endif
 	return vma_addr_pmd_shareable(vma, addr);
 }
 
@@ -6695,6 +6731,130 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 		*end = ALIGN(*end, PUD_SIZE);
 }
 
+static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
+		vma->vm_private_data;
+}
+
+void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		down_read(&vma_lock->rw_sema);
+	}
+}
+
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		up_read(&vma_lock->rw_sema);
+	}
+}
+
+void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		down_write(&vma_lock->rw_sema);
+	}
+}
+
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		up_write(&vma_lock->rw_sema);
+	}
+}
+
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+	if (!__vma_shareable_flags_pmd(vma))
+		return 1;
+
+	return down_write_trylock(&vma_lock->rw_sema);
+}
+
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		lockdep_assert_held(&vma_lock->rw_sema);
+	}
+}
+
+void hugetlb_vma_lock_release(struct kref *kref)
+{
+	struct hugetlb_vma_lock *vma_lock = container_of(kref,
+			struct hugetlb_vma_lock, refs);
+
+	kfree(vma_lock);
+}
+
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+{
+	/*
+	 * Only present in sharable vmas.  See comment in
+	 * __unmap_hugepage_range_final about how VM_SHARED could
+	 * be set without VM_MAYSHARE.  As a result, we need to
+	 * check if either is set in the free path.
+	 */
+	if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
+		return;
+
+	if (vma->vm_private_data) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		/*
+		 * vma_lock structure may or not be released, but it
+		 * certainly will no longer be attached to vma so clear
+		 * pointer.
+		 */
+		vma_lock->vma = NULL;
+		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
+		vma->vm_private_data = NULL;
+	}
+}
+
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+{
+	struct hugetlb_vma_lock *vma_lock;
+
+	/* Only establish in (flags) sharable vmas */
+	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
+		return;
+
+	/* Should never get here with non-NULL vm_private_data */
+	if (vma->vm_private_data)
+		return;
+
+	/* Check size/alignment for pmd sharing possible */
+	if (!vma_pmd_shareable(vma))
+		return;
+
+	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
+	if (!vma_lock)
+		/*
+		 * If we can not allocate structure, then vma can not
+		 * participate in pmd sharing.
+		 */
+		return;
+
+	kref_init(&vma_lock->refs);
+	init_rwsem(&vma_lock->rw_sema);
+	vma_lock->vma = vma;
+	vma->vm_private_data = vma_lock;
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
@@ -6781,6 +6941,14 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+{
+}
+
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+{
+}
+
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index d17d68a9b15b..744faaef0489 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -24,7 +24,7 @@
  *   mm->mmap_lock
  *     mapping->invalidate_lock (in filemap_fault)
  *       page->flags PG_locked (lock_page)
- *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
+ *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
  *           mapping->i_mmap_rwsem
  *             anon_vma->rwsem
  *               mm->page_table_lock or pte_lock
@@ -44,6 +44,12 @@
  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
+ *
+ * hugetlbfs PageHuge() take locks in this order:
+ *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
+ *     vma_lock (hugetlb specific lock for pmd_sharing)
+ *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
+ *         page->flags PG_locked (lock_page)
  */
 
 #include <linux/mm.h>
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 7/9] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (5 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
modified to do locking within the routine itself and check whether the
page is mapped within a specific vma before unmapping.

This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.
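
For illustration only (the numbers are made up): the new vma_offset_start
and vma_offset_end helpers convert a file page-offset range into virtual
offsets within each vma found in the interval tree.  For a vma with
vm_pgoff == 1024, removing the folio covering file pages [1536, 2048)
works out to:

        v_start = vma_offset_start(vma, 1536);  /* (1536 - 1024) << PAGE_SHIFT */
        v_end   = vma_offset_end(vma, 2048);    /* capped at vma->vm_end */
        if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
                unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
                                     NULL, ZAP_FLAG_DROP_MARKER);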

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 fs/hugetlbfs/inode.c | 123 +++++++++++++++++++++++++++++++++----------
 1 file changed, 94 insertions(+), 29 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7112a9a9f54d..3bb1772fce2f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -371,6 +371,94 @@ static void hugetlb_delete_from_page_cache(struct page *page)
 	delete_from_page_cache(page);
 }
 
+/*
+ * Called with i_mmap_rwsem held for inode based vma maps.  This makes
+ * sure vma (and vm_mm) will not go away.  We also hold the hugetlb fault
+ * mutex for the page in the mapping.  So, we can not race with page being
+ * faulted into the vma.
+ */
+static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
+				unsigned long addr, struct page *page)
+{
+	pte_t *ptep, pte;
+
+	ptep = huge_pte_offset(vma->vm_mm, addr,
+			huge_page_size(hstate_vma(vma)));
+
+	if (!ptep)
+		return false;
+
+	pte = huge_ptep_get(ptep);
+	if (huge_pte_none(pte) || !pte_present(pte))
+		return false;
+
+	if (pte_page(pte) == page)
+		return true;
+
+	return false;
+}
+
+/*
+ * Can vma_offset_start/vma_offset_end overflow on 32-bit arches?
+ * No, because the interval tree returns us only those vmas
+ * which overlap the truncated area starting at pgoff,
+ * and no vma on a 32-bit arch can span beyond the 4GB.
+ */
+static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
+{
+	if (vma->vm_pgoff < start)
+		return (start - vma->vm_pgoff) << PAGE_SHIFT;
+	else
+		return 0;
+}
+
+static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
+{
+	unsigned long t_end;
+
+	if (!end)
+		return vma->vm_end;
+
+	t_end = ((end - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
+	if (t_end > vma->vm_end)
+		t_end = vma->vm_end;
+	return t_end;
+}
+
+/*
+ * Called with hugetlb fault mutex held.  Therefore, no more mappings to
+ * this folio can be created while executing the routine.
+ */
+static void hugetlb_unmap_file_folio(struct hstate *h,
+					struct address_space *mapping,
+					struct folio *folio, pgoff_t index)
+{
+	struct rb_root_cached *root = &mapping->i_mmap;
+	struct page *page = &folio->page;
+	struct vm_area_struct *vma;
+	unsigned long v_start;
+	unsigned long v_end;
+	pgoff_t start, end;
+
+	start = index * pages_per_huge_page(h);
+	end = (index + 1) * pages_per_huge_page(h);
+
+	i_mmap_lock_write(mapping);
+
+	vma_interval_tree_foreach(vma, root, start, end - 1) {
+		v_start = vma_offset_start(vma, start);
+		v_end = vma_offset_end(vma, end);
+
+		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+			continue;
+
+		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
+				NULL, ZAP_FLAG_DROP_MARKER);
+	}
+
+	i_mmap_unlock_write(mapping);
+}
+
 static void
 hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 		      zap_flags_t zap_flags)
@@ -383,30 +471,13 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 	 * an inclusive "last".
 	 */
 	vma_interval_tree_foreach(vma, root, start, end ? end - 1 : ULONG_MAX) {
-		unsigned long v_offset;
+		unsigned long v_start;
 		unsigned long v_end;
 
-		/*
-		 * Can the expression below overflow on 32-bit arches?
-		 * No, because the interval tree returns us only those vmas
-		 * which overlap the truncated area starting at pgoff,
-		 * and no vma on a 32-bit arch can span beyond the 4GB.
-		 */
-		if (vma->vm_pgoff < start)
-			v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
-		else
-			v_offset = 0;
-
-		if (!end)
-			v_end = vma->vm_end;
-		else {
-			v_end = ((end - vma->vm_pgoff) << PAGE_SHIFT)
-							+ vma->vm_start;
-			if (v_end > vma->vm_end)
-				v_end = vma->vm_end;
-		}
+		v_start = vma_offset_start(vma, start);
+		v_end = vma_offset_end(vma, end);
 
-		unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
+		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
 				     NULL, zap_flags);
 	}
 }
@@ -428,14 +499,8 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
 	 * the fault mutex.  The mutex will prevent faults
 	 * until we finish removing the folio.
 	 */
-	if (unlikely(folio_mapped(folio))) {
-		i_mmap_lock_write(mapping);
-		hugetlb_vmdelete_list(&mapping->i_mmap,
-			index * pages_per_huge_page(h),
-			(index + 1) * pages_per_huge_page(h),
-			ZAP_FLAG_DROP_MARKER);
-		i_mmap_unlock_write(mapping);
-	}
+	if (unlikely(folio_mapped(folio)))
+		hugetlb_unmap_file_folio(h, mapping, folio, index);
 
 	folio_lock(folio);
 	/*
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (6 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 7/9] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-29  6:08   ` Miaohe Lin
  2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
  8 siblings, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

The new hugetlb vma lock is used to address this race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before
  doing a page table lock and allocation (huge_pte_alloc).  The lock is
  held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
  called.
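
For illustration only (a simplified sketch, not the exact hunks below),
the fault path usage implied by these rules is:

        mutex_lock(&hugetlb_fault_mutex_table[hash]);
        hugetlb_vma_lock_read(vma);     /* before walking/allocating page tables */
        ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
        /* ... use ptep; huge_pmd_unshare can not invalidate it here ... */
        hugetlb_vma_unlock_read(vma);
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);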

Lock ordering issues come into play when unmapping a page from all
vmas mapping the page.  The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare.  This is done today in:
- try_to_migrate_one and try_to_unmap_one for page migration and memory
  error handling.  In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful.  Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch.  This routine
  also tries to acquire the vma lock.  If it fails, it skips the
  unmapping.  However, we can not have file truncation or hole punch
  fail because of contention.  After hugetlb_vmdelete_list, truncation
  and hole punch call remove_inode_hugepages.  remove_inode_hugepages
  checks for mapped pages and calls hugetlb_unmap_file_folio to unmap them.
  hugetlb_unmap_file_folio is designed to drop locks and reacquire them in the
  correct order to guarantee unmap success.
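
A rough sketch (illustrative only, simplified from the hunks below) of the
two strategies:

        /* rmap paths: i_mmap_rwsem already held (TTU_RMAP_LOCKED), only try */
        if (!hugetlb_vma_trylock_write(vma))
                return false;           /* caller copes with the failed unmap */
        huge_pmd_unshare(mm, vma, address, ptep);
        hugetlb_vma_unlock_write(vma);

        /* hugetlb_unmap_file_folio: unmap must not fail, retake locks in order */
        kref_get(&vma_lock->refs);      /* lock structure can outlive the vma */
        i_mmap_unlock_write(mapping);
        down_write(&vma_lock->rw_sema); /* vma_lock first ... */
        i_mmap_lock_write(mapping);     /* ... then i_mmap_rwsem */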

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c |  66 +++++++++++++++++++++++++++-
 mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
 mm/memory.c          |   2 +
 mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
 mm/userfaultfd.c     |   9 +++-
 5 files changed, 233 insertions(+), 46 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3bb1772fce2f..009ae539b9b2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 					struct folio *folio, pgoff_t index)
 {
 	struct rb_root_cached *root = &mapping->i_mmap;
+	struct hugetlb_vma_lock *vma_lock;
 	struct page *page = &folio->page;
 	struct vm_area_struct *vma;
 	unsigned long v_start;
@@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 	end = (index + 1) * pages_per_huge_page(h);
 
 	i_mmap_lock_write(mapping);
-
+retry:
+	vma_lock = NULL;
 	vma_interval_tree_foreach(vma, root, start, end - 1) {
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
@@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
 			continue;
 
+		if (!hugetlb_vma_trylock_write(vma)) {
+			vma_lock = vma->vm_private_data;
+			/*
+			 * If we can not get vma lock, we need to drop
+			 * immap_sema and take locks in order.  First,
+			 * take a ref on the vma_lock structure so that
+			 * we can be guaranteed it will not go away when
+			 * dropping immap_sema.
+			 */
+			kref_get(&vma_lock->refs);
+			break;
+		}
+
 		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
 				NULL, ZAP_FLAG_DROP_MARKER);
+		hugetlb_vma_unlock_write(vma);
 	}
 
 	i_mmap_unlock_write(mapping);
+
+	if (vma_lock) {
+		/*
+		 * Wait on vma_lock.  We know it is still valid as we have
+		 * a reference.  We must 'open code' vma locking as we do
+		 * not know if vma_lock is still attached to vma.
+		 */
+		down_write(&vma_lock->rw_sema);
+		i_mmap_lock_write(mapping);
+
+		vma = vma_lock->vma;
+		if (!vma) {
+			/*
+			 * If lock is no longer attached to vma, then just
+			 * unlock, drop our reference and retry looking for
+			 * other vmas.
+			 */
+			up_write(&vma_lock->rw_sema);
+			kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
+			goto retry;
+		}
+
+		/*
+		 * vma_lock is still attached to vma.  Check to see if vma
+		 * still maps page and if so, unmap.
+		 */
+		v_start = vma_offset_start(vma, start);
+		v_end = vma_offset_end(vma, end);
+		if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+			unmap_hugepage_range(vma, vma->vm_start + v_start,
+						v_end, NULL,
+						ZAP_FLAG_DROP_MARKER);
+
+		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
+		hugetlb_vma_unlock_write(vma);
+
+		goto retry;
+	}
 }
 
 static void
@@ -474,11 +528,21 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 		unsigned long v_start;
 		unsigned long v_end;
 
+		if (!hugetlb_vma_trylock_write(vma))
+			continue;
+
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
 
 		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
 				     NULL, zap_flags);
+
+		/*
+		 * Note that vma lock only exists for shared/non-private
+		 * vmas.  Therefore, lock is not held when calling
+		 * unmap_hugepage_range for private vmas.
+		 */
+		hugetlb_vma_unlock_write(vma);
 	}
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 616be891b798..e8cbc0f7cdaa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4795,6 +4795,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		mmu_notifier_invalidate_range_start(&range);
 		mmap_assert_write_locked(src);
 		raw_write_seqcount_begin(&src->write_protect_seq);
+	} else {
+		/*
+		 * For shared mappings the vma lock must be held before
+		 * calling huge_pte_offset in the src vma. Otherwise, the
+		 * returned ptep could go away if part of a shared pmd and
+		 * another thread calls huge_pmd_unshare.
+		 */
+		hugetlb_vma_lock_read(src_vma);
 	}
 
 	last_addr_mask = hugetlb_mask_last_page(h);
@@ -4941,6 +4949,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	if (cow) {
 		raw_write_seqcount_end(&src->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
+	} else {
+		hugetlb_vma_unlock_read(src_vma);
 	}
 
 	return ret;
@@ -4999,6 +5009,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 	last_addr_mask = hugetlb_mask_last_page(h);
 	/* Prevent race with file truncation */
+	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(mapping);
 	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
 		src_pte = huge_pte_offset(mm, old_addr, sz);
@@ -5030,6 +5041,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 		flush_tlb_range(vma, old_end - len, old_end);
 	mmu_notifier_invalidate_range_end(&range);
 	i_mmap_unlock_write(mapping);
+	hugetlb_vma_unlock_write(vma);
 
 	return len + old_addr - old_end;
 }
@@ -5349,8 +5361,29 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * may get SIGKILLed if it later faults.
 		 */
 		if (outside_reserve) {
+			struct address_space *mapping = vma->vm_file->f_mapping;
+			pgoff_t idx;
+			u32 hash;
+
 			put_page(old_page);
+			/*
+			 * Drop hugetlb_fault_mutex and vma_lock before
+			 * unmapping.  unmapping needs to hold vma_lock
+			 * in write mode.  Dropping vma_lock in read mode
+			 * here is OK as COW mappings do not interact with
+			 * PMD sharing.
+			 *
+			 * Reacquire both after unmap operation.
+			 */
+			idx = vma_hugecache_offset(h, vma, haddr);
+			hash = hugetlb_fault_mutex_hash(mapping, idx);
+			hugetlb_vma_unlock_read(vma);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
 			unmap_ref_private(mm, vma, old_page, haddr);
+
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			hugetlb_vma_lock_read(vma);
 			spin_lock(ptl);
 			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 			if (likely(ptep &&
@@ -5499,14 +5532,16 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 	};
 
 	/*
-	 * hugetlb_fault_mutex and i_mmap_rwsem must be
+	 * vma_lock and hugetlb_fault_mutex must be
 	 * dropped before handling userfault.  Reacquire
 	 * after handling fault to make calling code simpler.
 	 */
+	hugetlb_vma_unlock_read(vma);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 	ret = handle_userfault(&vmf, reason);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
+	hugetlb_vma_lock_read(vma);
 
 	return ret;
 }
@@ -5740,6 +5775,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
+		/*
+		 * Since we hold no locks, ptep could be stale.  That is
+		 * OK as we are only making decisions based on content and
+		 * not actually modifying content here.
+		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, ptep);
@@ -5747,23 +5787,35 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
-	} else {
-		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
-		if (!ptep)
-			return VM_FAULT_OOM;
 	}
 
-	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, haddr);
-
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
+	mapping = vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
+	/*
+	 * Acquire vma lock before calling huge_pte_alloc and hold
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
+	 *
+	 * ptep could have already be assigned via huge_pte_offset.  That
+	 * is OK, as huge_pte_alloc will return the same value unless
+	 * something has changed.
+	 */
+	hugetlb_vma_lock_read(vma);
+	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
+	if (!ptep) {
+		hugetlb_vma_unlock_read(vma);
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		return VM_FAULT_OOM;
+	}
+
 	entry = huge_ptep_get(ptep);
 	/* PTE markers should be handled the same way as none pte */
 	if (huge_pte_none_mostly(entry)) {
@@ -5824,6 +5876,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unlock_page(pagecache_page);
 			put_page(pagecache_page);
 		}
+		hugetlb_vma_unlock_read(vma);
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		return handle_userfault(&vmf, VM_UFFD_WP);
 	}
@@ -5867,6 +5920,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		put_page(pagecache_page);
 	}
 out_mutex:
+	hugetlb_vma_unlock_read(vma);
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
@@ -6329,8 +6383,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, range.start, range.end);
 
 	mmu_notifier_invalidate_range_start(&range);
-	last_addr_mask = hugetlb_mask_last_page(h);
+	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
+	last_addr_mask = hugetlb_mask_last_page(h);
 	for (; address < end; address += psize) {
 		spinlock_t *ptl;
 		ptep = huge_pte_offset(mm, address, psize);
@@ -6429,6 +6484,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * See Documentation/mm/mmu_notifier.rst
 	 */
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	hugetlb_vma_unlock_write(vma);
 	mmu_notifier_invalidate_range_end(&range);
 
 	return pages << h->order;
@@ -6930,6 +6986,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 	pud_t *pud = pud_offset(p4d, addr);
 
 	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+	hugetlb_vma_assert_locked(vma);
 	BUG_ON(page_count(virt_to_page(ptep)) == 0);
 	if (page_count(virt_to_page(ptep)) == 1)
 		return 0;
@@ -6941,6 +6998,31 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
+void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+}
+
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+	return 1;
+}
+
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+}
+
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 {
 }
@@ -7318,6 +7400,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
 				start, end);
 	mmu_notifier_invalidate_range_start(&range);
+	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (address = start; address < end; address += PUD_SIZE) {
 		ptep = huge_pte_offset(mm, address, sz);
@@ -7329,6 +7412,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 	}
 	flush_hugetlb_tlb_range(vma, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	hugetlb_vma_unlock_write(vma);
 	/*
 	 * No need to call mmu_notifier_invalidate_range(), see
 	 * Documentation/mm/mmu_notifier.rst.
diff --git a/mm/memory.c b/mm/memory.c
index c4c3c2fd4f45..118e5f023597 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1685,10 +1685,12 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			if (vma->vm_file) {
 				zap_flags_t zap_flags = details ?
 				    details->zap_flags : 0;
+				hugetlb_vma_lock_write(vma);
 				i_mmap_lock_write(vma->vm_file->f_mapping);
 				__unmap_hugepage_range_final(tlb, vma, start, end,
 							     NULL, zap_flags);
 				i_mmap_unlock_write(vma->vm_file->f_mapping);
+				hugetlb_vma_unlock_write(vma);
 			}
 		} else
 			unmap_page_range(tlb, vma, start, end, details);
diff --git a/mm/rmap.c b/mm/rmap.c
index 744faaef0489..2ec925e5fa6a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1554,24 +1554,39 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * To call huge_pmd_unshare, i_mmap_rwsem must be
 			 * held in write mode.  Caller needs to explicitly
 			 * do this outside rmap routines.
+			 *
+			 * We also must hold hugetlb vma_lock in write mode.
+			 * Lock order dictates acquiring vma_lock BEFORE
+			 * i_mmap_rwsem.  We can only try lock here and fail
+			 * if unsuccessful.
 			 */
-			VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
-			if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-				flush_tlb_range(vma, range.start, range.end);
-				mmu_notifier_invalidate_range(mm, range.start,
-							      range.end);
-
-				/*
-				 * The ref count of the PMD page was dropped
-				 * which is part of the way map counting
-				 * is done for shared PMDs.  Return 'true'
-				 * here.  When there is no other sharing,
-				 * huge_pmd_unshare returns false and we will
-				 * unmap the actual page and drop map count
-				 * to zero.
-				 */
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+			if (!anon) {
+				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+				if (!hugetlb_vma_trylock_write(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
+				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+					hugetlb_vma_unlock_write(vma);
+					flush_tlb_range(vma,
+						range.start, range.end);
+					mmu_notifier_invalidate_range(mm,
+						range.start, range.end);
+					/*
+					 * The ref count of the PMD page was
+					 * dropped which is part of the way map
+					 * counting is done for shared PMDs.
+					 * Return 'true' here.  When there is
+					 * no other sharing, huge_pmd_unshare
+					 * returns false and we will unmap the
+					 * actual page and drop map count
+					 * to zero.
+					 */
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
+				hugetlb_vma_unlock_write(vma);
 			}
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
 		} else {
@@ -1929,26 +1944,41 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * To call huge_pmd_unshare, i_mmap_rwsem must be
 			 * held in write mode.  Caller needs to explicitly
 			 * do this outside rmap routines.
+			 *
+			 * We also must hold hugetlb vma_lock in write mode.
+			 * Lock order dictates acquiring vma_lock BEFORE
+			 * i_mmap_rwsem.  We can only try lock here and
+			 * fail if unsuccessful.
 			 */
-			VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
-			if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-				flush_tlb_range(vma, range.start, range.end);
-				mmu_notifier_invalidate_range(mm, range.start,
-							      range.end);
-
-				/*
-				 * The ref count of the PMD page was dropped
-				 * which is part of the way map counting
-				 * is done for shared PMDs.  Return 'true'
-				 * here.  When there is no other sharing,
-				 * huge_pmd_unshare returns false and we will
-				 * unmap the actual page and drop map count
-				 * to zero.
-				 */
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+			if (!anon) {
+				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+				if (!hugetlb_vma_trylock_write(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
+				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+					hugetlb_vma_unlock_write(vma);
+					flush_tlb_range(vma,
+						range.start, range.end);
+					mmu_notifier_invalidate_range(mm,
+						range.start, range.end);
+
+					/*
+					 * The ref count of the PMD page was
+					 * dropped which is part of the way map
+					 * counting is done for shared PMDs.
+					 * Return 'true' here.  When there is
+					 * no other sharing, huge_pmd_unshare
+					 * returns false and we will unmap the
+					 * actual page and drop map count
+					 * to zero.
+					 */
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
+				hugetlb_vma_unlock_write(vma);
 			}
-
 			/* Nuke the hugetlb page table entry */
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
 		} else {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0fdbd2c05587..e24e8a47ce8a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -379,16 +379,21 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
-		 * Serialize via hugetlb_fault_mutex.
+		 * Serialize via vma_lock and hugetlb_fault_mutex.
+		 * vma_lock ensures the dst_pte remains valid even
+		 * in the case of shared pmds.  fault mutex prevents
+		 * races with other faulting threads.
 		 */
 		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+		hugetlb_vma_lock_read(dst_vma);
 
 		err = -ENOMEM;
 		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
 		if (!dst_pte) {
+			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out_unlock;
 		}
@@ -396,6 +401,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		if (mode != MCOPY_ATOMIC_CONTINUE &&
 		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
 			err = -EEXIST;
+			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out_unlock;
 		}
@@ -404,6 +410,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					       dst_addr, src_addr, mode, &page,
 					       wp_copy);
 
+		hugetlb_vma_unlock_read(dst_vma);
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
 		cond_resched();
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (7 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-19 23:32   ` Mike Kravetz
  2022-09-29  6:25   ` Miaohe Lin
  8 siblings, 2 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

With the new hugetlb vma lock in place, it can also be used to handle
page fault races with file truncation.  The lock is taken at the
beginning of the page fault path in read mode.  During truncation, it
is taken in write mode for each vma which has the file mapped.  The
file's size (i_size) is modified before taking the vma lock to unmap.

How are races handled?

The page fault code checks i_size early in processing after taking the
vma lock.  If the fault is beyond i_size, the fault is aborted.  If the
fault is not beyond i_size the fault will continue and a new page will
be added to the file.  It could be that truncation code modifies i_size
after the check in fault code.  That is OK, as truncation code will soon
remove the page.  The truncation code will wait until the fault is
finished, as it must obtain the vma lock in write mode.

This patch cleans up/removes late checks in the fault paths that try to
back out pages racing with truncation.  As noted above, we just let the
truncation code remove the pages.
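
As an illustrative sketch only (not the patch itself), the ordering that
makes this safe is roughly:

        /* page fault path, vma lock held in read mode */
        size = i_size_read(mapping->host) >> huge_page_shift(h);
        if (idx >= size)
                goto out;               /* abort faults beyond i_size */

        /* truncation path */
        i_size_write(inode, offset);    /* shrink i_size first ... */
        hugetlb_vma_lock_write(vma);    /* ... then wait for in-flight faults */
        /* unmap and remove pages past the new i_size */
        hugetlb_vma_unlock_write(vma);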

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 31 ++++++++++++-------------------
 mm/hugetlb.c         | 27 ++++++---------------------
 2 files changed, 18 insertions(+), 40 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 009ae539b9b2..ed57a029eab0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,26 +568,19 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
 
 	folio_lock(folio);
 	/*
-	 * After locking page, make sure mapping is the same.
-	 * We could have raced with page fault populate and
-	 * backout code.
+	 * We must remove the folio from page cache before removing
+	 * the region/ reserve map (hugetlb_unreserve_pages).  In
+	 * rare out of memory conditions, removal of the region/reserve
+	 * map could fail.  Correspondingly, the subpool and global
+	 * reserve usage count can need to be adjusted.
 	 */
-	if (folio_mapping(folio) == mapping) {
-		/*
-		 * We must remove the folio from page cache before removing
-		 * the region/ reserve map (hugetlb_unreserve_pages).  In
-		 * rare out of memory conditions, removal of the region/reserve
-		 * map could fail.  Correspondingly, the subpool and global
-		 * reserve usage count can need to be adjusted.
-		 */
-		VM_BUG_ON(HPageRestoreReserve(&folio->page));
-		hugetlb_delete_from_page_cache(&folio->page);
-		ret = true;
-		if (!truncate_op) {
-			if (unlikely(hugetlb_unreserve_pages(inode, index,
-								index + 1, 1)))
-				hugetlb_fix_reserve_counts(inode);
-		}
+	VM_BUG_ON(HPageRestoreReserve(&folio->page));
+	hugetlb_delete_from_page_cache(&folio->page);
+	ret = true;
+	if (!truncate_op) {
+		if (unlikely(hugetlb_unreserve_pages(inode, index,
+							index + 1, 1)))
+			hugetlb_fix_reserve_counts(inode);
 	}
 
 	folio_unlock(folio);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e8cbc0f7cdaa..2207300791e5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5561,6 +5561,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool reserve_alloc = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5616,6 +5617,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (HPageRestoreReserve(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(page, mapping, idx);
@@ -5679,10 +5682,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto backout;
-
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5726,10 +5725,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
-	unlock_page(page);
-	/* restore reserve for newly allocated pages not in page cache */
 	if (new_page && !new_pagecache_page)
 		restore_reserve_on_error(h, vma, haddr, page);
+
+	unlock_page(page);
 	put_page(page);
 	goto out;
 }
@@ -6061,26 +6060,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 	ptl = huge_pte_lock(h, dst_mm, dst_pte);
 
-	/*
-	 * Recheck the i_size after holding PT lock to make sure not
-	 * to leave any page mapped (as page_mapped()) beyond the end
-	 * of the i_size (remove_inode_hugepages() is strict about
-	 * enforcing that). If we bail out here, we'll also leave a
-	 * page in the radix tree in the vm_shared case beyond the end
-	 * of the i_size, but remove_inode_hugepages() will take care
-	 * of it as soon as we drop the hugetlb_fault_mutex_table.
-	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	ret = -EFAULT;
-	if (idx >= size)
-		goto out_release_unlock;
-
-	ret = -EEXIST;
 	/*
 	 * We allow to overwrite a pte marker: consider when both MISSING|WP
 	 * registered, we firstly wr-protect a none pte which has no page cache
 	 * page backing it, then access the page.
 	 */
+	ret = -EEXIST;
 	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing
  2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
@ 2022-09-15 20:25   ` Mike Kravetz
  2022-09-24 13:11   ` Miaohe Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-15 20:25 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton

On 09/14/22 15:18, Mike Kravetz wrote:
> Allocate a new hugetlb_vma_lock structure and hang off vm_private_data
> for synchronization use by vmas that could be involved in pmd sharing.
> This data structure contains a rw semaphore that is the primary tool
> used for synchronization.
> 
> This new structure is ref counted, so that it can exist when NOT attached
> to a vma.  This is only helpful in resolving lock ordering issues where
> code may need to obtain the vma_lock while there is no guarantee the
> vma will not go away.  By obtaining a ref on the structure, it can be
> guaranteed that at least the rw semaphore will not go away.
> 
> Only add infrastructure for the new lock here.  Actual use will be added
> in subsequent patches.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/hugetlb.h |  43 ++++++++-
>  kernel/fork.c           |   6 +-
>  mm/hugetlb.c            | 202 ++++++++++++++++++++++++++++++++++++----
>  mm/rmap.c               |   8 +-
>  4 files changed, 235 insertions(+), 24 deletions(-)

Reviewers - FYI, the following was added to address a build issue caused
by this patch.

From 8b3031350154e8e401ccfbc5e71cb95ef654d017 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Thu, 15 Sep 2022 09:33:44 -0700
Subject: [PATCH] hugetlb: fix build issue for missing hugetlb_vma_lock_release

Add a stub for hugetlb_vma_lock_release to build in the case
CONFIG_HUGETLB_PAGE && !CONFIG_ARCH_WANT_HUGE_PMD_SHARE.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2207300791e5..cc7877da18d7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7008,6 +7008,10 @@ void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
 {
 }
 
+void hugetlb_vma_lock_release(struct kref *kref)
+{
+}
+
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 {
 }
-- 
2.37.2

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races
  2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
@ 2022-09-19 23:32   ` Mike Kravetz
  2022-09-29  6:25   ` Miaohe Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-19 23:32 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton

On 09/14/22 15:18, Mike Kravetz wrote:
> With the new hugetlb vma lock in place, it can also be used to handle
> page fault races with file truncation.  The lock is taken at the
> beginning of the page fault path in read mode.  During truncation, it
> is taken in write mode for each vma which has the file mapped.  The
> file's size (i_size) is modified before taking the vma lock to unmap.
> 
> How are races handled?
> 
> The page fault code checks i_size early in processing after taking the
> vma lock.  If the fault is beyond i_size, the fault is aborted.  If the
> fault is not beyond i_size the fault will continue and a new page will
> be added to the file.  It could be that truncation code modifies i_size
> after the check in fault code.  That is OK, as truncation code will soon
> remove the page.  The truncation code will wait until the fault is
> finished, as it must obtain the vma lock in write mode.
> 
> This patch cleans up/removes late checks in the fault paths that try to
> back out pages racing with truncation.  As noted above, we just let the
> truncation code remove the pages.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c | 31 ++++++++++++-------------------
>  mm/hugetlb.c         | 27 ++++++---------------------
>  2 files changed, 18 insertions(+), 40 deletions(-)

This patch introduced a compiler warning addressed here,

https://lore.kernel.org/linux-mm/Yyj7HsJWfHDoU24U@monkey/

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio
  2022-09-14 22:18 ` [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio Mike Kravetz
@ 2022-09-24 12:36   ` Miaohe Lin
  0 siblings, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2022-09-24 12:36 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, David Hildenbrand, Sven Schnelle, Michal Hocko,
	Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	James Houghton, Mina Almasry, Pasha Tatashin, Axel Rasmussen,
	Ray Fucillo, Andrew Morton

On 2022/9/15 6:18, Mike Kravetz wrote:
> Create the new routine remove_inode_single_folio that will remove a
> single folio from a file.  This is refactored code from
> remove_inode_hugepages.  It checks for the uncommon case in which the
> folio is still mapped and unmaps.
> 
> No functional change.  This refactoring will be put to use and expanded
> upon in subsequent patches.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

LGTM with one nit below.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> ---
>  fs/hugetlbfs/inode.c | 105 ++++++++++++++++++++++++++-----------------
>  1 file changed, 63 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index edd69cc43ca5..7112a9a9f54d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -411,6 +411,60 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>  	}
>  }
>  
> +/*
> + * Called with hugetlb fault mutex held.
> + * Returns true if page was actually removed, false otherwise.
> + */
> +static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> +					struct address_space *mapping,
> +					struct folio *folio, pgoff_t index,
> +					bool truncate_op)
> +{
> +	bool ret = false;
> +
> +	/*
> +	 * If folio is mapped, it was faulted in after being
> +	 * unmapped in caller.  Unmap (again) while holding
> +	 * the fault mutex.  The mutex will prevent faults
> +	 * until we finish removing the folio.
> +	 */
> +	if (unlikely(folio_mapped(folio))) {
> +		i_mmap_lock_write(mapping);
> +		hugetlb_vmdelete_list(&mapping->i_mmap,
> +			index * pages_per_huge_page(h),
> +			(index + 1) * pages_per_huge_page(h),
> +			ZAP_FLAG_DROP_MARKER);
> +		i_mmap_unlock_write(mapping);
> +	}
> +
> +	folio_lock(folio);
> +	/*
> +	 * After locking page, make sure mapping is the same.
> +	 * We could have raced with page fault populate and
> +	 * backout code.

Is this needed? remove_inode_single_folio() is called with hugetlb fault mutex held so it can't
race with page fault code?

Thanks,
Miaohe Lin



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing
  2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
  2022-09-15 20:25   ` Mike Kravetz
@ 2022-09-24 13:11   ` Miaohe Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2022-09-24 13:11 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, David Hildenbrand, Sven Schnelle, Michal Hocko,
	Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	James Houghton, Mina Almasry, Pasha Tatashin, Axel Rasmussen,
	Ray Fucillo, Andrew Morton

On 2022/9/15 6:18, Mike Kravetz wrote:
> Allocate a new hugetlb_vma_lock structure and hang off vm_private_data
> for synchronization use by vmas that could be involved in pmd sharing.
> This data structure contains a rw semaphore that is the primary tool
> used for synchronization.
> 
> This new structure is ref counted, so that it can exist when NOT attached
> to a vma.  This is only helpful in resolving lock ordering issues where
> code may need to obtain the vma_lock while there is no guarantee the
> vma will not go away.  By obtaining a ref on the structure, it can be
> guaranteed that at least the rw semaphore will not go away.
> 
> Only add infrastructure for the new lock here.  Actual use will be added
> in subsequent patches.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

LGTM with some nits below. Thanks for your work, Mike.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> -/* Reset counters to 0 and clear all HPAGE_RESV_* flags */
> -void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> +void hugetlb_dup_vma_private(struct vm_area_struct *vma)
>  {
>  	VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
> +	/*
> +	 * Clear vm_private_data
> +	 * - For MAP_PRIVATE mappings, this is the reserve map which does
> +	 *   not apply to children.  Faults generated by the children are
> +	 *   not guaranteed to succeed, even if read-only.
> +	 * - For shared mappings this is a per-vma semaphore that may be
> +	 *   allocated in a subsequent call to hugetlb_vm_op_open.
> +	 */
> +	vma->vm_private_data = (void *)0;
>  	if (!(vma->vm_flags & VM_MAYSHARE))
> -		vma->vm_private_data = (void *)0;
> +		return;

This if block can be deleted? It doesn't do anything here.

>  }
>  
>  /*

<snip>

> +static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
> +{
> +	/*
> +	 * Only present in sharable vmas.  See comment in
> +	 * __unmap_hugepage_range_final about how VM_SHARED could
> +	 * be set without VM_MAYSHARE.  As a result, we need to
> +	 * check if either is set in the free path.
> +	 */
> +	if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
> +		return;
> +
> +	if (vma->vm_private_data) {
> +		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> +
> +		/*
> +		 * vma_lock structure may or not be released, but it

may or not be released?

Thanks,
Miaohe Lin


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
  2022-09-14 22:18 ` [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
@ 2022-09-29  6:08   ` Miaohe Lin
  2022-10-01  0:00     ` Mike Kravetz
  0 siblings, 1 reply; 18+ messages in thread
From: Miaohe Lin @ 2022-09-29  6:08 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, David Hildenbrand, Sven Schnelle, Michal Hocko,
	Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	James Houghton, Mina Almasry, Pasha Tatashin, Axel Rasmussen,
	Ray Fucillo, Andrew Morton

On 2022/9/15 6:18, Mike Kravetz wrote:
> The new hugetlb vma lock is used to address this race:
> 
> Faulting thread                                 Unsharing thread
> ...                                                  ...
> ptep = huge_pte_offset()
>       or
> ptep = huge_pte_alloc()
> ...
>                                                 i_mmap_lock_write
>                                                 lock page table
> ptep invalid   <------------------------        huge_pmd_unshare()
> Could be in a previously                        unlock_page_table
> sharing process or worse                        i_mmap_unlock_write
> ...
> 
> The vma_lock is used as follows:
> - During fault processing. The lock is acquired in read mode before
>   doing a page table lock and allocation (huge_pte_alloc).  The lock is
>   held until code is finished with the page table entry (ptep).
> - The lock must be held in write mode whenever huge_pmd_unshare is
>   called.
> 
> Lock ordering issues come into play when unmapping a page from all
> vmas mapping the page.  The i_mmap_rwsem must be held to search for the
> vmas, and the vma lock must be held before calling unmap which will
> call huge_pmd_unshare.  This is done today in:
> - try_to_migrate_one and try_to_unmap_one for page migration and memory
>   error handling.  In these routines we 'try' to obtain the vma lock and
>   fail to unmap if unsuccessful.  Calling routines already deal with the
>   failure of unmapping.
> - hugetlb_vmdelete_list for truncation and hole punch.  This routine
>   also tries to acquire the vma lock.  If it fails, it skips the
>   unmapping.  However, we can not have file truncation or hole punch
>   fail because of contention.  After hugetlb_vmdelete_list, truncation
>   and hole punch call remove_inode_hugepages.  remove_inode_hugepages
>   checks for mapped pages and calls hugetlb_unmap_file_folio to unmap them.
>   hugetlb_unmap_file_folio is designed to drop locks and reacquire them in the
>   correct order to guarantee unmap success.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c |  66 +++++++++++++++++++++++++++-
>  mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
>  mm/memory.c          |   2 +
>  mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
>  mm/userfaultfd.c     |   9 +++-
>  5 files changed, 233 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 3bb1772fce2f..009ae539b9b2 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>  					struct folio *folio, pgoff_t index)
>  {
>  	struct rb_root_cached *root = &mapping->i_mmap;
> +	struct hugetlb_vma_lock *vma_lock;
>  	struct page *page = &folio->page;
>  	struct vm_area_struct *vma;
>  	unsigned long v_start;
> @@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>  	end = (index + 1) * pages_per_huge_page(h);
>  
>  	i_mmap_lock_write(mapping);
> -
> +retry:
> +	vma_lock = NULL;
>  	vma_interval_tree_foreach(vma, root, start, end - 1) {
>  		v_start = vma_offset_start(vma, start);
>  		v_end = vma_offset_end(vma, end);
> @@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>  			continue;
>  
> +		if (!hugetlb_vma_trylock_write(vma)) {
> +			vma_lock = vma->vm_private_data;
> +			/*
> +			 * If we can not get vma lock, we need to drop
> +			 * immap_sema and take locks in order.  First,
> +			 * take a ref on the vma_lock structure so that
> +			 * we can be guaranteed it will not go away when
> +			 * dropping immap_sema.
> +			 */
> +			kref_get(&vma_lock->refs);
> +			break;
> +		}
> +
>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>  				NULL, ZAP_FLAG_DROP_MARKER);
> +		hugetlb_vma_unlock_write(vma);
>  	}
>  
>  	i_mmap_unlock_write(mapping);
> +
> +	if (vma_lock) {
> +		/*
> +		 * Wait on vma_lock.  We know it is still valid as we have
> +		 * a reference.  We must 'open code' vma locking as we do
> +		 * not know if vma_lock is still attached to vma.
> +		 */
> +		down_write(&vma_lock->rw_sema);
> +		i_mmap_lock_write(mapping);
> +
> +		vma = vma_lock->vma;
> +		if (!vma) {

Thanks Mike. This method looks much simpler. But IIUC, this code can race with exit_mmap:

CPU 1					CPU 2
hugetlb_unmap_file_folio		exit_mmap
  kref_get(&vma_lock->refs);
  down_write(&vma_lock->rw_sema);
					  free_pgtables // i_mmap_lock_write is held inside it.
  i_mmap_lock_write(mapping);
  vma = vma_lock->vma;
					  remove_vma
					    hugetlb_vm_op_close
					      hugetlb_vma_lock_free
					        vma_lock->vma = NULL;
					    vm_area_free(vma);
  vma is used-after-free??

The root cause is that free_pgtables is protected with i_mmap_lock_write while remove_vma is not.
Or am I missing something again? ;)

> +			/*
> +			 * If lock is no longer attached to vma, then just
> +			 * unlock, drop our reference and retry looking for
> +			 * other vmas.
> +			 */
> +			up_write(&vma_lock->rw_sema);
> +			kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
> +			goto retry;
> +		}
> +
> +		/*
> +		 * vma_lock is still attached to vma.  Check to see if vma
> +		 * still maps page and if so, unmap.
> +		 */
> +		v_start = vma_offset_start(vma, start);
> +		v_end = vma_offset_end(vma, end);
> +		if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> +			unmap_hugepage_range(vma, vma->vm_start + v_start,
> +						v_end, NULL,
> +						ZAP_FLAG_DROP_MARKER);
> +
> +		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
> +		hugetlb_vma_unlock_write(vma);
> +
> +		goto retry;
> +	}
>  }
>  
>  static void
> @@ -474,11 +528,21 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>  		unsigned long v_start;
>  		unsigned long v_end;
>  
> +		if (!hugetlb_vma_trylock_write(vma))
> +			continue;
> +
>  		v_start = vma_offset_start(vma, start);
>  		v_end = vma_offset_end(vma, end);
>  
>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>  				     NULL, zap_flags);
> +
> +		/*
> +		 * Note that vma lock only exists for shared/non-private
> +		 * vmas.  Therefore, lock is not held when calling
> +		 * unmap_hugepage_range for private vmas.
> +		 */
> +		hugetlb_vma_unlock_write(vma);
>  	}
>  }
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 616be891b798..e8cbc0f7cdaa 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4795,6 +4795,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  		mmu_notifier_invalidate_range_start(&range);
>  		mmap_assert_write_locked(src);
>  		raw_write_seqcount_begin(&src->write_protect_seq);
> +	} else {
> +		/*
> +		 * For shared mappings the vma lock must be held before
> +		 * calling huge_pte_offset in the src vma. Otherwise, the
> +		 * returned ptep could go away if part of a shared pmd and
> +		 * another thread calls huge_pmd_unshare.
> +		 */
> +		hugetlb_vma_lock_read(src_vma);
>  	}
>  
>  	last_addr_mask = hugetlb_mask_last_page(h);
> @@ -4941,6 +4949,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	if (cow) {
>  		raw_write_seqcount_end(&src->write_protect_seq);
>  		mmu_notifier_invalidate_range_end(&range);
> +	} else {
> +		hugetlb_vma_unlock_read(src_vma);
>  	}
>  
>  	return ret;
> @@ -4999,6 +5009,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  	mmu_notifier_invalidate_range_start(&range);
>  	last_addr_mask = hugetlb_mask_last_page(h);
>  	/* Prevent race with file truncation */
> +	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(mapping);
>  	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
>  		src_pte = huge_pte_offset(mm, old_addr, sz);
> @@ -5030,6 +5041,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  		flush_tlb_range(vma, old_end - len, old_end);
>  	mmu_notifier_invalidate_range_end(&range);
>  	i_mmap_unlock_write(mapping);
> +	hugetlb_vma_unlock_write(vma);
>  
>  	return len + old_addr - old_end;
>  }
> @@ -5349,8 +5361,29 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * may get SIGKILLed if it later faults.
>  		 */
>  		if (outside_reserve) {
> +			struct address_space *mapping = vma->vm_file->f_mapping;
> +			pgoff_t idx;
> +			u32 hash;
> +
>  			put_page(old_page);
> +			/*
> +			 * Drop hugetlb_fault_mutex and vma_lock before
> +			 * unmapping.  unmapping needs to hold vma_lock
> +			 * in write mode.  Dropping vma_lock in read mode
> +			 * here is OK as COW mappings do not interact with
> +			 * PMD sharing.
> +			 *
> +			 * Reacquire both after unmap operation.
> +			 */
> +			idx = vma_hugecache_offset(h, vma, haddr);
> +			hash = hugetlb_fault_mutex_hash(mapping, idx);
> +			hugetlb_vma_unlock_read(vma);
> +			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +
>  			unmap_ref_private(mm, vma, old_page, haddr);
> +
> +			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +			hugetlb_vma_lock_read(vma);
>  			spin_lock(ptl);
>  			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>  			if (likely(ptep &&
> @@ -5499,14 +5532,16 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
>  	};
>  
>  	/*
> -	 * hugetlb_fault_mutex and i_mmap_rwsem must be
> +	 * vma_lock and hugetlb_fault_mutex must be
>  	 * dropped before handling userfault.  Reacquire
>  	 * after handling fault to make calling code simpler.
>  	 */
> +	hugetlb_vma_unlock_read(vma);
>  	hash = hugetlb_fault_mutex_hash(mapping, idx);
>  	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  	ret = handle_userfault(&vmf, reason);
>  	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +	hugetlb_vma_lock_read(vma);
>  
>  	return ret;
>  }
> @@ -5740,6 +5775,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>  	if (ptep) {
> +		/*
> +		 * Since we hold no locks, ptep could be stale.  That is
> +		 * OK as we are only making decisions based on content and
> +		 * not actually modifying content here.
> +		 */
>  		entry = huge_ptep_get(ptep);
>  		if (unlikely(is_hugetlb_entry_migration(entry))) {
>  			migration_entry_wait_huge(vma, ptep);
> @@ -5747,23 +5787,35 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>  			return VM_FAULT_HWPOISON_LARGE |
>  				VM_FAULT_SET_HINDEX(hstate_index(h));
> -	} else {
> -		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> -		if (!ptep)
> -			return VM_FAULT_OOM;
>  	}
>  
> -	mapping = vma->vm_file->f_mapping;
> -	idx = vma_hugecache_offset(h, vma, haddr);
> -
>  	/*
>  	 * Serialize hugepage allocation and instantiation, so that we don't
>  	 * get spurious allocation failures if two CPUs race to instantiate
>  	 * the same page in the page cache.
>  	 */
> +	mapping = vma->vm_file->f_mapping;
> +	idx = vma_hugecache_offset(h, vma, haddr);
>  	hash = hugetlb_fault_mutex_hash(mapping, idx);
>  	mutex_lock(&hugetlb_fault_mutex_table[hash]);
>  
> +	/*
> +	 * Acquire vma lock before calling huge_pte_alloc and hold
> +	 * until finished with ptep.  This prevents huge_pmd_unshare from
> +	 * being called elsewhere and making the ptep no longer valid.
> +	 *
> +	 * ptep could have already been assigned via huge_pte_offset.  That
> +	 * is OK, as huge_pte_alloc will return the same value unless
> +	 * something has changed.
> +	 */
> +	hugetlb_vma_lock_read(vma);

[1] says vma_lock for each vma mapping the file provides the same type of synchronization
around i_size as provided by the fault mutex. But what if vma->vm_private_data is NULL,
i.e. hugetlb_vma_lock_alloc fails to alloc vma_lock? There won't be such synchronization
in this case.

[1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
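
To illustrate the concern: when vm_private_data is NULL the lock helper silently
degrades to a no-op, roughly like the sketch below. This is only my paraphrase of
the shared-mapping path of the helper from patch 6, not the exact code:

void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	/*
	 * Paraphrased sketch: the real helper also checks that this is a
	 * sharable hugetlb mapping.  If hugetlb_vma_lock_alloc failed,
	 * vma_lock is NULL, we return without taking anything, and the
	 * fault path's i_size check runs unsynchronized with truncation.
	 */
	if (!vma_lock)
		return;

	down_read(&vma_lock->rw_sema);
}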


Other parts of the patch look good to me. Thanks for your work.

Thanks,
Miaohe Lin


> +	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> +	if (!ptep) {
> +		hugetlb_vma_unlock_read(vma);
> +		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +		return VM_FAULT_OOM;
> +	}
> +
>  	entry = huge_ptep_get(ptep);
>  	/* PTE markers should be handled the same way as none pte */
>  	if (huge_pte_none_mostly(entry)) {
> @@ -5824,6 +5876,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unlock_page(pagecache_page);
>  			put_page(pagecache_page);
>  		}
> +		hugetlb_vma_unlock_read(vma);
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  		return handle_userfault(&vmf, VM_UFFD_WP);
>  	}
> @@ -5867,6 +5920,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		put_page(pagecache_page);
>  	}
>  out_mutex:
> +	hugetlb_vma_unlock_read(vma);
>  	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  	/*
>  	 * Generally it's safe to hold refcount during waiting page lock. But
> @@ -6329,8 +6383,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	flush_cache_range(vma, range.start, range.end);
>  
>  	mmu_notifier_invalidate_range_start(&range);
> -	last_addr_mask = hugetlb_mask_last_page(h);
> +	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
> +	last_addr_mask = hugetlb_mask_last_page(h);
>  	for (; address < end; address += psize) {
>  		spinlock_t *ptl;
>  		ptep = huge_pte_offset(mm, address, psize);
> @@ -6429,6 +6484,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 * See Documentation/mm/mmu_notifier.rst
>  	 */
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	hugetlb_vma_unlock_write(vma);
>  	mmu_notifier_invalidate_range_end(&range);
>  
>  	return pages << h->order;
> @@ -6930,6 +6986,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  	pud_t *pud = pud_offset(p4d, addr);
>  
>  	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +	hugetlb_vma_assert_locked(vma);
>  	BUG_ON(page_count(virt_to_page(ptep)) == 0);
>  	if (page_count(virt_to_page(ptep)) == 1)
>  		return 0;
> @@ -6941,6 +6998,31 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  }
>  
>  #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
> +void hugetlb_vma_lock_read(struct vm_area_struct *vma)
> +{
> +}
> +
> +void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
> +{
> +}
> +
> +void hugetlb_vma_lock_write(struct vm_area_struct *vma)
> +{
> +}
> +
> +void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
> +{
> +}
> +
> +int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
> +{
> +	return 1;
> +}
> +
> +void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
> +{
> +}
> +
>  static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>  {
>  }
> @@ -7318,6 +7400,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
>  				start, end);
>  	mmu_notifier_invalidate_range_start(&range);
> +	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
>  	for (address = start; address < end; address += PUD_SIZE) {
>  		ptep = huge_pte_offset(mm, address, sz);
> @@ -7329,6 +7412,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>  	}
>  	flush_hugetlb_tlb_range(vma, start, end);
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	hugetlb_vma_unlock_write(vma);
>  	/*
>  	 * No need to call mmu_notifier_invalidate_range(), see
>  	 * Documentation/mm/mmu_notifier.rst.
> diff --git a/mm/memory.c b/mm/memory.c
> index c4c3c2fd4f45..118e5f023597 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1685,10 +1685,12 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  			if (vma->vm_file) {
>  				zap_flags_t zap_flags = details ?
>  				    details->zap_flags : 0;
> +				hugetlb_vma_lock_write(vma);
>  				i_mmap_lock_write(vma->vm_file->f_mapping);
>  				__unmap_hugepage_range_final(tlb, vma, start, end,
>  							     NULL, zap_flags);
>  				i_mmap_unlock_write(vma->vm_file->f_mapping);
> +				hugetlb_vma_unlock_write(vma);
>  			}
>  		} else
>  			unmap_page_range(tlb, vma, start, end, details);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 744faaef0489..2ec925e5fa6a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1554,24 +1554,39 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * To call huge_pmd_unshare, i_mmap_rwsem must be
>  			 * held in write mode.  Caller needs to explicitly
>  			 * do this outside rmap routines.
> +			 *
> +			 * We also must hold hugetlb vma_lock in write mode.
> +			 * Lock order dictates acquiring vma_lock BEFORE
> +			 * i_mmap_rwsem.  We can only try lock here and fail
> +			 * if unsuccessful.
>  			 */
> -			VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
> -			if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> -				flush_tlb_range(vma, range.start, range.end);
> -				mmu_notifier_invalidate_range(mm, range.start,
> -							      range.end);
> -
> -				/*
> -				 * The ref count of the PMD page was dropped
> -				 * which is part of the way map counting
> -				 * is done for shared PMDs.  Return 'true'
> -				 * here.  When there is no other sharing,
> -				 * huge_pmd_unshare returns false and we will
> -				 * unmap the actual page and drop map count
> -				 * to zero.
> -				 */
> -				page_vma_mapped_walk_done(&pvmw);
> -				break;
> +			if (!anon) {
> +				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +				if (!hugetlb_vma_trylock_write(vma)) {
> +					page_vma_mapped_walk_done(&pvmw);
> +					ret = false;
> +					break;
> +				}
> +				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +					hugetlb_vma_unlock_write(vma);
> +					flush_tlb_range(vma,
> +						range.start, range.end);
> +					mmu_notifier_invalidate_range(mm,
> +						range.start, range.end);
> +					/*
> +					 * The ref count of the PMD page was
> +					 * dropped which is part of the way map
> +					 * counting is done for shared PMDs.
> +					 * Return 'true' here.  When there is
> +					 * no other sharing, huge_pmd_unshare
> +					 * returns false and we will unmap the
> +					 * actual page and drop map count
> +					 * to zero.
> +					 */
> +					page_vma_mapped_walk_done(&pvmw);
> +					break;
> +				}
> +				hugetlb_vma_unlock_write(vma);
>  			}
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>  		} else {
> @@ -1929,26 +1944,41 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * To call huge_pmd_unshare, i_mmap_rwsem must be
>  			 * held in write mode.  Caller needs to explicitly
>  			 * do this outside rmap routines.
> +			 *
> +			 * We also must hold hugetlb vma_lock in write mode.
> +			 * Lock order dictates acquiring vma_lock BEFORE
> +			 * i_mmap_rwsem.  We can only try lock here and
> +			 * fail if unsuccessful.
>  			 */
> -			VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
> -			if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> -				flush_tlb_range(vma, range.start, range.end);
> -				mmu_notifier_invalidate_range(mm, range.start,
> -							      range.end);
> -
> -				/*
> -				 * The ref count of the PMD page was dropped
> -				 * which is part of the way map counting
> -				 * is done for shared PMDs.  Return 'true'
> -				 * here.  When there is no other sharing,
> -				 * huge_pmd_unshare returns false and we will
> -				 * unmap the actual page and drop map count
> -				 * to zero.
> -				 */
> -				page_vma_mapped_walk_done(&pvmw);
> -				break;
> +			if (!anon) {
> +				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +				if (!hugetlb_vma_trylock_write(vma)) {
> +					page_vma_mapped_walk_done(&pvmw);
> +					ret = false;
> +					break;
> +				}
> +				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +					hugetlb_vma_unlock_write(vma);
> +					flush_tlb_range(vma,
> +						range.start, range.end);
> +					mmu_notifier_invalidate_range(mm,
> +						range.start, range.end);
> +
> +					/*
> +					 * The ref count of the PMD page was
> +					 * dropped which is part of the way map
> +					 * counting is done for shared PMDs.
> +					 * Return 'true' here.  When there is
> +					 * no other sharing, huge_pmd_unshare
> +					 * returns false and we will unmap the
> +					 * actual page and drop map count
> +					 * to zero.
> +					 */
> +					page_vma_mapped_walk_done(&pvmw);
> +					break;
> +				}
> +				hugetlb_vma_unlock_write(vma);
>  			}
> -
>  			/* Nuke the hugetlb page table entry */
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>  		} else {
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 0fdbd2c05587..e24e8a47ce8a 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -379,16 +379,21 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		BUG_ON(dst_addr >= dst_start + len);
>  
>  		/*
> -		 * Serialize via hugetlb_fault_mutex.
> +		 * Serialize via vma_lock and hugetlb_fault_mutex.
> +		 * vma_lock ensures the dst_pte remains valid even
> +		 * in the case of shared pmds.  fault mutex prevents
> +		 * races with other faulting threads.
>  		 */
>  		idx = linear_page_index(dst_vma, dst_addr);
>  		mapping = dst_vma->vm_file->f_mapping;
>  		hash = hugetlb_fault_mutex_hash(mapping, idx);
>  		mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +		hugetlb_vma_lock_read(dst_vma);
>  
>  		err = -ENOMEM;
>  		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
>  		if (!dst_pte) {
> +			hugetlb_vma_unlock_read(dst_vma);
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			goto out_unlock;
>  		}
> @@ -396,6 +401,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		if (mode != MCOPY_ATOMIC_CONTINUE &&
>  		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
>  			err = -EEXIST;
> +			hugetlb_vma_unlock_read(dst_vma);
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			goto out_unlock;
>  		}
> @@ -404,6 +410,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  					       dst_addr, src_addr, mode, &page,
>  					       wp_copy);
>  
> +		hugetlb_vma_unlock_read(dst_vma);
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
>  		cond_resched();
> 



* Re: [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races
  2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
  2022-09-19 23:32   ` Mike Kravetz
@ 2022-09-29  6:25   ` Miaohe Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2022-09-29  6:25 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, David Hildenbrand, Sven Schnelle, Michal Hocko,
	Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	James Houghton, Mina Almasry, Pasha Tatashin, Axel Rasmussen,
	Ray Fucillo, Andrew Morton

On 2022/9/15 6:18, Mike Kravetz wrote:
> With the new hugetlb vma lock in place, it can also be used to handle
> page fault races with file truncation.  The lock is taken at the
> beginning of the code fault path in read mode.  During truncation, it
> is taken in write mode for each vma which has the file mapped.  The
> file's size (i_size) is modified before taking the vma lock to unmap.
> 
> How are races handled?
> 
> The page fault code checks i_size early in processing after taking the
> vma lock.  If the fault is beyond i_size, the fault is aborted.  If the
> fault is not beyond i_size the fault will continue and a new page will
> be added to the file.  It could be that truncation code modifies i_size
> after the check in fault code.  That is OK, as truncation code will soon
> remove the page.  The truncation code will wait until the fault is
> finished, as it must obtain the vma lock in write mode.

As the previous thread [1] points out, if vma->vm_private_data is NULL there is nothing that provides
the same type of synchronization around i_size as the fault mutex does in [2].
([2] takes the fault mutex for EVERY index in the truncated range.)

[1] https://lore.kernel.org/lkml/YyOKIhygl66cG8Yr@monkey/T/#m6b69af9e8cdba01246c2b210bd044bf895b815ee
[2] https://lore.kernel.org/lkml/20220824175757.20590-5-mike.kravetz@oracle.com/
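
For context, the per-index serialization [2] relies on is roughly the following.
This is a sketch from memory, not code quoted from [2]; only hugetlb_fault_mutex_hash
and hugetlb_fault_mutex_table are real names, the wrapper function is made up:

/* Sketch: how [2] serializes truncation against faults, index by index. */
static void fault_mutex_serialize_range(struct address_space *mapping,
					pgoff_t start, pgoff_t end)
{
	pgoff_t index;

	for (index = start; index < end; index++) {
		u32 hash = hugetlb_fault_mutex_hash(mapping, index);

		/* A fault on this index waits here until removal is done. */
		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		/* ... remove the page cache folio for this index ... */
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	}
}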

Except from that, this patch looks good to me. Thanks Mike.

Thanks,
Miaohe Lin




* Re: [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
  2022-09-29  6:08   ` Miaohe Lin
@ 2022-10-01  0:00     ` Mike Kravetz
  2022-10-08  2:29       ` Miaohe Lin
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2022-10-01  0:00 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: linux-mm, linux-kernel, Muchun Song, David Hildenbrand,
	Sven Schnelle, Michal Hocko, Peter Xu, Naoya Horiguchi,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Pasha Tatashin, Axel Rasmussen, Ray Fucillo, Andrew Morton

On 09/29/22 14:08, Miaohe Lin wrote:
> On 2022/9/15 6:18, Mike Kravetz wrote:
> > @@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  					struct folio *folio, pgoff_t index)
> >  {
> >  	struct rb_root_cached *root = &mapping->i_mmap;
> > +	struct hugetlb_vma_lock *vma_lock;
> >  	struct page *page = &folio->page;
> >  	struct vm_area_struct *vma;
> >  	unsigned long v_start;
> > @@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  	end = (index + 1) * pages_per_huge_page(h);
> >  
> >  	i_mmap_lock_write(mapping);
> > -
> > +retry:
> > +	vma_lock = NULL;
> >  	vma_interval_tree_foreach(vma, root, start, end - 1) {
> >  		v_start = vma_offset_start(vma, start);
> >  		v_end = vma_offset_end(vma, end);
> > @@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >  			continue;
> >  
> > +		if (!hugetlb_vma_trylock_write(vma)) {
> > +			vma_lock = vma->vm_private_data;
> > +			/*
> > +			 * If we can not get vma lock, we need to drop
> > +			 * immap_sema and take locks in order.  First,
> > +			 * take a ref on the vma_lock structure so that
> > +			 * we can be guaranteed it will not go away when
> > +			 * dropping immap_sema.
> > +			 */
> > +			kref_get(&vma_lock->refs);
> > +			break;
> > +		}
> > +
> >  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >  				NULL, ZAP_FLAG_DROP_MARKER);
> > +		hugetlb_vma_unlock_write(vma);
> >  	}
> >  
> >  	i_mmap_unlock_write(mapping);
> > +
> > +	if (vma_lock) {
> > +		/*
> > +		 * Wait on vma_lock.  We know it is still valid as we have
> > +		 * a reference.  We must 'open code' vma locking as we do
> > +		 * not know if vma_lock is still attached to vma.
> > +		 */
> > +		down_write(&vma_lock->rw_sema);
> > +		i_mmap_lock_write(mapping);
> > +
> > +		vma = vma_lock->vma;
> > +		if (!vma) {
> 
> Thanks Mike. This method looks much simpler. But IIUC, this code can race with exit_mmap:
> 
> CPU 1					CPU 2
> hugetlb_unmap_file_folio		exit_mmap
>   kref_get(&vma_lock->refs);
>   down_write(&vma_lock->rw_sema);
> 					  free_pgtables // i_mmap_lock_write is held inside it.
>   i_mmap_lock_write(mapping);
>   vma = vma_lock->vma;
> 					  remove_vma
> 					    hugetlb_vm_op_close
> 					      hugetlb_vma_lock_free
> 					        vma_lock->vma = NULL;
> 					    vm_area_free(vma);
>   vma is used-after-free??
> 
> The root cause is that free_pgtables is protected by i_mmap_lock_write while remove_vma is not.
> Or am I missing something again? ;)

Thank you Miaohe!  Sorry for the delay in responding.

Yes, I agree this is a possible race.  My first thought is that we may be
able to address this by simply taking the vma_lock when we clear the
vma_lock->vma field.  Something like this,

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4cb44a4629b8..bf0c220ebc32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6918,7 +6918,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 		 * certainly will no longer be attached to vma so clear
 		 * pointer.
 		 */
+		down_write(&vma_lock->rw_sema);
 		vma_lock->vma = NULL;
+		up_write(&vma_lock->rw_sema);
 		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
 		vma->vm_private_data = NULL;
 	}
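
For clarity, with that change folded in the helper would read roughly as below.
This is only a sketch; the guard at the top is simplified (the real code also
checks that the vma is a sharable mapping) and the rest follows the existing
function:

static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock;

	/* Simplified guard: only sharable vmas carry a vma_lock. */
	if (!vma || !vma->vm_private_data)
		return;

	vma_lock = vma->vm_private_data;

	/*
	 * Detach under the semaphore so a racing hugetlb_unmap_file_folio
	 * that already took a reference sees vma_lock->vma == NULL once it
	 * acquires the lock, never a vma that is about to be freed.
	 */
	down_write(&vma_lock->rw_sema);
	vma_lock->vma = NULL;
	up_write(&vma_lock->rw_sema);

	/* The semaphore is no longer held when the vma itself is freed. */
	kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
	vma->vm_private_data = NULL;
}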

I still need to do a bit more work to verify.

Andrew, if you are concerned I do not think this is a show stopper.  The
race should be extremely rare, and a fix should be coming quickly.

<snip>
> > +	mapping = vma->vm_file->f_mapping;
> > +	idx = vma_hugecache_offset(h, vma, haddr);
> >  	hash = hugetlb_fault_mutex_hash(mapping, idx);
> >  	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> >  
> > +	/*
> > +	 * Acquire vma lock before calling huge_pte_alloc and hold
> > +	 * until finished with ptep.  This prevents huge_pmd_unshare from
> > +	 * being called elsewhere and making the ptep no longer valid.
> > +	 *
> > +	 * ptep could have already been assigned via huge_pte_offset.  That
> > +	 * is OK, as huge_pte_alloc will return the same value unless
> > +	 * something has changed.
> > +	 */
> > +	hugetlb_vma_lock_read(vma);
> 
> [1] says vma_lock for each vma mapping the file provides the same type of synchronization
> around i_size as provided by the fault mutex. But what if vma->vm_private_data is NULL,
> i.e. hugetlb_vma_lock_alloc fails to alloc vma_lock? There won't be such synchronization
> in this case.
> 
> [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
> 

Right.

Of course, this (checking i_size) only applies to shared/file mappings.
The only time hugetlb_vma_lock_alloc should fail in such cases is when
we can not allocate the small vma_lock structure.  Since the vma_lock
is primarily for huge pmd sharing synchronization, my thought was that
allocation errors would just prevent sharing.  But, as you point out
it could also impact these checks.

It would be easy to check for the lock allocation failure at mmap time
and fail the mmap.  It would be a little more tricky at fork time.

This also is something that is highly unlikely to occur.  And, if we
can't allocate a vma_lock I suspect we will not be up and running long
enough for this to be an issue. :) Let me think about the best way to handle.
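
As a strawman for the mmap side, something like the sketch below would do it.
This is purely hypothetical: the helper name and its placement in the mmap path
are made up, and it assumes hugetlb_vma_lock_alloc reports failure only by
leaving vm_private_data NULL:

/* Hypothetical sketch: fail the mmap if a shared mapping gets no vma_lock. */
static int hugetlb_vma_lock_alloc_or_fail(struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_MAYSHARE))
		return 0;		/* only shared mappings need the lock */

	hugetlb_vma_lock_alloc(vma);
	if (!vma->vm_private_data)
		return -ENOMEM;		/* do not silently lose synchronization */

	return 0;
}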

> 
> Other parts of the patch look good to me. Thanks for your work.
> 

Thanks again,
-- 
Mike Kravetz


* Re: [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
  2022-10-01  0:00     ` Mike Kravetz
@ 2022-10-08  2:29       ` Miaohe Lin
  0 siblings, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2022-10-08  2:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Muchun Song, David Hildenbrand,
	Sven Schnelle, Michal Hocko, Peter Xu, Naoya Horiguchi,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Pasha Tatashin, Axel Rasmussen, Ray Fucillo, Andrew Morton

On 2022/10/1 8:00, Mike Kravetz wrote:
> On 09/29/22 14:08, Miaohe Lin wrote:
>> On 2022/9/15 6:18, Mike Kravetz wrote:
>>> @@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>  					struct folio *folio, pgoff_t index)
>>>  {
>>>  	struct rb_root_cached *root = &mapping->i_mmap;
>>> +	struct hugetlb_vma_lock *vma_lock;
>>>  	struct page *page = &folio->page;
>>>  	struct vm_area_struct *vma;
>>>  	unsigned long v_start;
>>> @@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>  	end = (index + 1) * pages_per_huge_page(h);
>>>  
>>>  	i_mmap_lock_write(mapping);
>>> -
>>> +retry:
>>> +	vma_lock = NULL;
>>>  	vma_interval_tree_foreach(vma, root, start, end - 1) {
>>>  		v_start = vma_offset_start(vma, start);
>>>  		v_end = vma_offset_end(vma, end);
>>> @@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>>>  			continue;
>>>  
>>> +		if (!hugetlb_vma_trylock_write(vma)) {
>>> +			vma_lock = vma->vm_private_data;
>>> +			/*
>>> +			 * If we can not get vma lock, we need to drop
>>> +			 * immap_sema and take locks in order.  First,
>>> +			 * take a ref on the vma_lock structure so that
>>> +			 * we can be guaranteed it will not go away when
>>> +			 * dropping immap_sema.
>>> +			 */
>>> +			kref_get(&vma_lock->refs);
>>> +			break;
>>> +		}
>>> +
>>>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>>  				NULL, ZAP_FLAG_DROP_MARKER);
>>> +		hugetlb_vma_unlock_write(vma);
>>>  	}
>>>  
>>>  	i_mmap_unlock_write(mapping);
>>> +
>>> +	if (vma_lock) {
>>> +		/*
>>> +		 * Wait on vma_lock.  We know it is still valid as we have
>>> +		 * a reference.  We must 'open code' vma locking as we do
>>> +		 * not know if vma_lock is still attached to vma.
>>> +		 */
>>> +		down_write(&vma_lock->rw_sema);
>>> +		i_mmap_lock_write(mapping);
>>> +
>>> +		vma = vma_lock->vma;
>>> +		if (!vma) {
>>
>> Thanks Mike. This method looks much simpler. But IIUC, this code can race with exit_mmap:
>>
>> CPU 1					CPU 2
>> hugetlb_unmap_file_folio		exit_mmap
>>   kref_get(&vma_lock->refs);
>>   down_write(&vma_lock->rw_sema);
>> 					  free_pgtables // i_mmap_lock_write is held inside it.
>>   i_mmap_lock_write(mapping);
>>   vma = vma_lock->vma;
>> 					  remove_vma
>> 					    hugetlb_vm_op_close
>> 					      hugetlb_vma_lock_free
>> 					        vma_lock->vma = NULL;
>> 					    vm_area_free(vma);
>>   vma is used-after-free??
>>
>> The root cause is that free_pgtables is protected by i_mmap_lock_write while remove_vma is not.
>> Or am I missing something again? ;)
> 
> Thank you Miaohe!  Sorry for the delay in responding.
> 
> Yes, I agree this is a possible race.  My first thought is that we may be
> able to address this by simply taking the vma_lock when we clear the
> vma_lock->vma field.  Something like this,
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4cb44a4629b8..bf0c220ebc32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6918,7 +6918,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>  		 * certainly will no longer be attached to vma so clear
>  		 * pointer.
>  		 */
> +		down_write(&vma_lock->rw_sema);
>  		vma_lock->vma = NULL;
> +		up_write(&vma_lock->rw_sema);
>  		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
>  		vma->vm_private_data = NULL;
>  	}

AFAICT, this should work. And since we won't hold the vma_lock->rw_sema when deleting the vma,
there should be no possible deadlock.

> 
> I still need to do a bit more work to verify.
> 
> Andrew, if you are concerned I do not think this is a show stopper.  The
> race should be extremely rare, and a fix should be coming quickly.

Agree, this should be really rare.

> 
> <snip>
>>> +	mapping = vma->vm_file->f_mapping;
>>> +	idx = vma_hugecache_offset(h, vma, haddr);
>>>  	hash = hugetlb_fault_mutex_hash(mapping, idx);
>>>  	mutex_lock(&hugetlb_fault_mutex_table[hash]);
>>>  
>>> +	/*
>>> +	 * Acquire vma lock before calling huge_pte_alloc and hold
>>> +	 * until finished with ptep.  This prevents huge_pmd_unshare from
>>> +	 * being called elsewhere and making the ptep no longer valid.
>>> +	 *
>>> +	 * ptep could have already been assigned via huge_pte_offset.  That
>>> +	 * is OK, as huge_pte_alloc will return the same value unless
>>> +	 * something has changed.
>>> +	 */
>>> +	hugetlb_vma_lock_read(vma);
>>
>> [1] says vma_lock for each vma mapping the file provides the same type of synchronization
>> around i_size as provided by the fault mutex. But what if vma->vm_private_data is NULL,
>> i.e. hugetlb_vma_lock_alloc fails to alloc vma_lock? There won't be such synchronization
>> in this case.
>>
>> [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
>>
> 
> Right.
> 
> Of course, this (checking i_size) only applies to shared/file mappings.
> The only time hugetlb_vma_lock_alloc should fail in such cases is when
> we can not allocate the small vma_lock structure.  Since the vma_lock
> is primarily for huge pmd sharing synchronization, my thought was that
> allocation errors would just prevent sharing.  But, as you point out
> it could also impact these checks.
> 
> It would be easy to check for the lock allocation failure at mmap time
> and fail the mmap.  It would be a little more tricky at fork time.
> 
> This also is something that is highly unlikely to occur.  And, if we
> can't allocate a vma_lock I suspect we will not be up and running long
> enough for this to be an issue. :) Let me think about the best way to handle.

Agree. This should be really rare too. Thanks for your work.

Thanks,
Miaohe Lin

> 
>>
>> Other parts of the patch look good to me. Thanks for your work.
>>
> 
> Thanks again,
> 



end of thread, other threads:[~2022-10-08  2:29 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 2/9] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 3/9] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio Mike Kravetz
2022-09-24 12:36   ` Miaohe Lin
2022-09-14 22:18 ` [PATCH v2 5/9] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
2022-09-15 20:25   ` Mike Kravetz
2022-09-24 13:11   ` Miaohe Lin
2022-09-14 22:18 ` [PATCH v2 7/9] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
2022-09-29  6:08   ` Miaohe Lin
2022-10-01  0:00     ` Mike Kravetz
2022-10-08  2:29       ` Miaohe Lin
2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
2022-09-19 23:32   ` Mike Kravetz
2022-09-29  6:25   ` Miaohe Lin
