* [PATCH v2 0/2] hugetlbfs: use i_mmap_rwsem for better synchronization
@ 2018-12-18 22:35 Mike Kravetz
From: Mike Kravetz @ 2018-12-18 22:35 UTC
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, Andrew Morton, Mike Kravetz

There are two primary issues addressed here:
1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
   invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
   reserve counts and state.
Both issues are addressed by expanding the use of i_mmap_rwsem.

These issues have existed for a long time.  They can be recreated with a
test program that causes page fault/truncation races.  For simple mappings,
this results in a negative HugePages_Rsvd count.  If racing with mappings
that contain shared pmds, we can hit "BUG at fs/hugetlbfs/inode.c:444!" or
an Oops caused by an invalid memory reference.
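
For reference, a minimal sketch of such a reproducer for the simple-mapping
case (illustrative only, not the author's test program; it assumes a
hugetlbfs mount at /dev/hugepages with preallocated 2MB huge pages, and
elides all error checking):

	#include <fcntl.h>
	#include <pthread.h>
	#include <setjmp.h>
	#include <signal.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HPAGE	(2UL << 20)		/* assumed huge page size */
	#define NPAGES	16UL
	#define LOOPS	100000UL

	static char *map;
	static sigjmp_buf env;

	static void on_sigbus(int sig)
	{
		siglongjmp(env, 1);	/* the fault lost the race with truncate */
	}

	static void *faulter(void *arg)
	{
		for (unsigned long n = 0; n < LOOPS; n++) {
			if (sigsetjmp(env, 1))
				continue;
			map[(n % NPAGES) * HPAGE] = 1;	/* hugetlb page fault */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		int fd = open("/dev/hugepages/race", O_CREAT | O_RDWR, 0600);

		signal(SIGBUS, on_sigbus);
		ftruncate(fd, NPAGES * HPAGE);
		map = mmap(NULL, NPAGES * HPAGE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		pthread_create(&t, NULL, faulter, NULL);
		for (unsigned long n = 0; n < LOOPS; n++) {
			ftruncate(fd, 0);	/* races with the faulting thread */
			ftruncate(fd, NPAGES * HPAGE);
		}
		pthread_join(t, NULL);
		return 0;
	}

On an affected kernel, runs of this can leave HugePages_Rsvd in
/proc/meminfo negative.  Hitting the shared-pmd BUG additionally requires
mappings large and aligned enough (PUD-sized ranges) for pmd sharing to
kick in.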

v1 -> v2
  Combined patches 2 and 3 of v1 series as suggested by Aneesh.  No other
  changes were made.
These patches are a follow-up to the RFC,
  http://lkml.kernel.org/r/20181024045053.1467-1-mike.kravetz@oracle.com
  Comments made by Naoya were addressed.

Mike Kravetz (2):
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race

 fs/hugetlbfs/inode.c | 50 +++++++++----------------
 mm/hugetlb.c         | 87 +++++++++++++++++++++++++++++++-------------
 mm/memory-failure.c  | 14 ++++++-
 mm/migrate.c         | 13 ++++++-
 mm/rmap.c            |  3 ++
 mm/userfaultfd.c     | 11 +++++-
 6 files changed, 116 insertions(+), 62 deletions(-)

-- 
2.17.2



* [PATCH v2 1/2] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  2018-12-18 22:35 [PATCH v2 0/2] hugetlbfs: use i_mmap_rwsem for better synchronization Mike Kravetz
@ 2018-12-18 22:35 ` Mike Kravetz
From: Mike Kravetz @ 2018-12-18 22:35 UTC
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, Andrew Morton, Mike Kravetz, stable

While looking at BUGs associated with invalid huge page map counts,
it was discovered and observed that a huge pte pointer could become
'invalid' and point to another task's page table.  Consider the
following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file.  As part of truncation,
it unmaps everyone who has the file mapped.  If the range being
truncated is covered by a shared pmd, huge_pmd_unshare will be called.
For all but the last user of the shared pmd, huge_pmd_unshare will
clear the pud pointing to the pmd.  If the task in the middle of the
page fault is not the last user, the ptep returned by huge_pte_alloc
now points to another task's page table or worse.  This leads to bad
things such as incorrect page map/reference counts or invalid memory
references.

To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with
  the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
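
In userspace terms the rule above is plain reader/writer exclusion.  A toy
analogue (illustrative only; the names merely mirror the kernel ones):

	#include <pthread.h>
	#include <stdio.h>

	/* Stands in for mapping->i_mmap_rwsem. */
	static pthread_rwlock_t i_mmap_rwsem = PTHREAD_RWLOCK_INITIALIZER;

	static void fault_path(void)
	{
		/* huge_pte_alloc() callers: read mode, held until done with the ptep */
		pthread_rwlock_rdlock(&i_mmap_rwsem);
		/* ... allocate and use the ptep; it cannot be unshared here ... */
		pthread_rwlock_unlock(&i_mmap_rwsem);
	}

	static void unshare_path(void)
	{
		/* huge_pmd_unshare() callers: write mode excludes all faulters */
		pthread_rwlock_wrlock(&i_mmap_rwsem);
		/* ... clear the pud and drop the shared pmd page reference ... */
		pthread_rwlock_unlock(&i_mmap_rwsem);
	}

	int main(void)
	{
		fault_path();
		unshare_path();
		puts("protocol exercised");
		return 0;
	}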

Cc: <stable@vger.kernel.org>
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c        | 70 ++++++++++++++++++++++++++++++++++-----------
 mm/memory-failure.c | 14 ++++++++-
 mm/migrate.c        | 13 ++++++++-
 mm/rmap.c           |  3 ++
 mm/userfaultfd.c    | 11 +++++--
 5 files changed, 91 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 309fb8c969af..ab4c77b8c72c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3239,6 +3239,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	struct address_space *mapping = vma->vm_file->f_mapping;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 	int ret = 0;
@@ -3252,11 +3253,23 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
+
 		src_pte = huge_pte_offset(src, addr, sz);
 		if (!src_pte)
 			continue;
+
+		/*
+		 * i_mmap_rwsem must be held to call huge_pte_alloc.
+		 * Continue to hold until finished with dst_pte, otherwise
+		 * it could go away if part of a shared pmd.
+		 *
+		 * Technically, i_mmap_rwsem is only needed in the non-cow
+		 * case as cow mappings are not shared.
+		 */
+		i_mmap_lock_read(mapping);
 		dst_pte = huge_pte_alloc(dst, addr, sz);
 		if (!dst_pte) {
+			i_mmap_unlock_read(mapping);
 			ret = -ENOMEM;
 			break;
 		}
@@ -3271,8 +3284,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		 * after taking the lock below.
 		 */
 		dst_entry = huge_ptep_get(dst_pte);
-		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
+		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry)) {
+			i_mmap_unlock_read(mapping);
 			continue;
+		}
 
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
 		src_ptl = huge_pte_lockptr(h, src, src_pte);
@@ -3321,6 +3336,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
+
+		i_mmap_unlock_read(mapping);
 	}
 
 	if (cow)
@@ -3772,14 +3789,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			};
 
 			/*
-			 * hugetlb_fault_mutex must be dropped before
-			 * handling userfault.  Reacquire after handling
-			 * fault to make calling code simpler.
+			 * hugetlb_fault_mutex and i_mmap_rwsem must be
+			 * dropped before handling userfault.  Reacquire
+			 * after handling fault to make calling code simpler.
 			 */
 			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
 							idx, haddr);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
+
 			ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+			i_mmap_lock_read(mapping);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			goto out;
 		}
@@ -3927,6 +3948,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
+		/*
+		 * Since we hold no locks, ptep could be stale.  That is
+		 * OK as we are only making decisions based on content and
+		 * not actually modifying content here.
+		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
@@ -3934,20 +3960,31 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
-	} else {
-		ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
-		if (!ptep)
-			return VM_FAULT_OOM;
 	}
 
+	/*
+	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
+	 *
+	 * ptep could have already been assigned via huge_pte_offset.  That
+	 * is OK, as huge_pte_alloc will return the same value unless
+	 * something changed.
+	 */
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, haddr);
+	i_mmap_lock_read(mapping);
+	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+	if (!ptep) {
+		i_mmap_unlock_read(mapping);
+		return VM_FAULT_OOM;
+	}
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
+	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -4035,6 +4072,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -4639,10 +4677,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space.  This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
@@ -4659,7 +4699,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -4689,7 +4728,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
-	i_mmap_unlock_write(mapping);
 	return pte;
 }
 
@@ -4700,7 +4738,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0cd3de3550f0..b992d1295578 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	if (kill)
 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_success = try_to_unmap(hpage, ttu);
+	if (!PageHuge(hpage)) {
+		unmap_success = try_to_unmap(hpage, ttu);
+	} else {
+		/*
+		 * For hugetlb pages, try_to_unmap could potentially call
+		 * huge_pmd_unshare.  Because of this, take semaphore in
+		 * write mode here and set TTU_RMAP_LOCKED to indicate we
+		 * have taken the lock at this higher level.
+		 */
+		i_mmap_lock_write(mapping);
+		unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+		i_mmap_unlock_write(mapping);
+	}
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
 		       pfn, page_mapcount(hpage));
diff --git a/mm/migrate.c b/mm/migrate.c
index 84381b55b2bd..725edaef238a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1307,8 +1307,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
+		struct address_space *mapping = page_mapping(hpage);
+
+		/*
+		 * try_to_unmap could potentially call huge_pmd_unshare.
+		 * Because of this, take semaphore in write mode here and
+		 * set TTU_RMAP_LOCKED to let lower levels know we have
+		 * taken the lock.
+		 */
+		i_mmap_lock_write(mapping);
 		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+			TTU_RMAP_LOCKED);
+		i_mmap_unlock_write(mapping);
 		page_was_mapped = 1;
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 85b7f9423352..322e656d0225 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,6 +1374,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
+		 *
+		 * If called for a huge page, caller must hold i_mmap_rwsem
+		 * in write mode as it is possible to call huge_pmd_unshare.
 		 */
 		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
 	}
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 458acda96f20..48368589f519 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -267,10 +267,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		VM_BUG_ON(dst_addr & ~huge_page_mask(h));
 
 		/*
-		 * Serialize via hugetlb_fault_mutex
+		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+		 * i_mmap_rwsem ensures the dst_pte remains valid even
+		 * in the case of shared pmds.  fault mutex prevents
+		 * races with other faulting threads.
 		 */
-		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
+		i_mmap_lock_read(mapping);
+		idx = linear_page_index(dst_vma, dst_addr);
 		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
 								idx, dst_addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -286,6 +291,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pteval = huge_ptep_get(dst_pte);
 		if (!huge_pte_none(dst_pteval)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -293,6 +299,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 						dst_addr, src_addr, &page);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		i_mmap_unlock_read(mapping);
 		vm_alloc_shared = vm_shared;
 
 		cond_resched();
-- 
2.17.2



* [PATCH v2 2/2] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  2018-12-18 22:35 [PATCH v2 0/2] hugetlbfs: use i_mmap_rwsem for better synchronization Mike Kravetz
@ 2018-12-18 22:35 ` Mike Kravetz
From: Mike Kravetz @ 2018-12-18 22:35 UTC
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, Andrew Morton, Mike Kravetz, stable

hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race.  One obvious omission in the
current code is removing a page newly added to the page cache.  This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations.  To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out.  There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv.  Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.

Instead of writing the required complicated code for this rare
occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
mode for the duration of page fault processing.  Hold i_mmap_rwsem
longer in truncation and hole punch code to cover the call to
remove_inode_hugepages.

With this modification, code in remove_inode_hugepages checking for
races becomes 'dead' as it can no longer happen.  Remove the dead code
and expand comments to explain reasoning.  Similarly, checks for races
with truncation in the page fault path can be simplified and removed.
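
In outline, hugetlb_vmtruncate becomes (a sketch mirroring the diff below,
with the RB_EMPTY_ROOT check omitted):

	i_mmap_lock_write(mapping);
	hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
	remove_inode_hugepages(inode, offset, LLONG_MAX);	/* now under the lock */
	i_mmap_unlock_write(mapping);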

Cc: <stable@vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 50 +++++++++++++++-----------------------------
 mm/hugetlb.c         | 21 +++++++++----------
 2 files changed, 27 insertions(+), 44 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 32920a10100e..a9c00c6ef80d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
  * truncation is indicated by end of range being LLONG_MAX
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() prevents page faults in the
- *	truncated range.  It checks i_size before allocation, and again after
- *	with the page table lock for the page held.  The same lock must be
- *	acquired to unmap a page.
+ *	maps and global counts.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserv map
  *	deleted.  The region/reserv map for ranges without associated
- *	pages are not modified.  Page faults can race with hole punch.
- *	This is indicated if we find a mapped page.
+ *	pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
  * Note: If the passed end of range value is beyond the end of file, but
  * not LLONG_MAX this routine still performs a hole punch operation.
  */
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 		for (i = 0; i < pagevec_count(&pvec); ++i) {
 			struct page *page = pvec.pages[i];
-			u32 hash;
 
 			index = page->index;
-			hash = hugetlb_fault_mutex_hash(h, current->mm,
-							&pseudo_vma,
-							mapping, index, 0);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
 			/*
-			 * If page is mapped, it was faulted in after being
-			 * unmapped in caller.  Unmap (again) now after taking
-			 * the fault mutex.  The mutex will prevent faults
-			 * until we finish removing the page.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
+			 * A mapped page is impossible as callers should unmap
+			 * all references before calling.  And, i_mmap_rwsem
+			 * prevents the creation of additional mappings.
 			 */
-			if (unlikely(page_mapped(page))) {
-				BUG_ON(truncate_op);
-
-				i_mmap_lock_write(mapping);
-				hugetlb_vmdelete_list(&mapping->i_mmap,
-					index * pages_per_huge_page(h),
-					(index + 1) * pages_per_huge_page(h));
-				i_mmap_unlock_write(mapping);
-			}
+			VM_BUG_ON(page_mapped(page));
 
 			lock_page(page);
 			/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			unlock_page(page);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
@@ -505,8 +485,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
-	i_mmap_unlock_write(mapping);
 	remove_inode_hugepages(inode, offset, LLONG_MAX);
+	i_mmap_unlock_write(mapping);
 	return 0;
 }
 
@@ -540,8 +520,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 			hugetlb_vmdelete_list(&mapping->i_mmap,
 						hole_start >> PAGE_SHIFT,
 						hole_end  >> PAGE_SHIFT);
-		i_mmap_unlock_write(mapping);
 		remove_inode_hugepages(inode, hole_start, hole_end);
+		i_mmap_unlock_write(mapping);
 		inode_unlock(inode);
 	}
 
@@ -624,7 +604,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/* mutex taken here, fault path and hole punch */
+		/*
+		 * fault mutex taken here, protects against fault path
+		 * and hole punch.  inode_lock previously taken protects
+		 * against truncation.
+		 */
 		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
 						index, addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ab4c77b8c72c..25a0cd2f8b39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3760,16 +3760,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	/*
-	 * Use page lock to guard against racing truncation
-	 * before we get page_table_lock.
+	 * We can not race with truncation due to holding i_mmap_rwsem.
+	 * Check once here for faults beyond end of file.
 	 */
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
+	if (idx >= size)
+		goto out;
+
 retry:
 	page = find_lock_page(mapping, idx);
 	if (!page) {
-		size = i_size_read(mapping->host) >> huge_page_shift(h);
-		if (idx >= size)
-			goto out;
-
 		/*
 		 * Check for page in userfault range
 		 */
@@ -3859,9 +3859,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto backout;
 
 	ret = 0;
 	if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3964,8 +3961,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This prevents huge_pmd_unshare from
-	 * being called elsewhere and making the ptep no longer valid.
+	 * until finished with ptep.  This serves two purposes:
+	 * 1) It prevents huge_pmd_unshare from being called elsewhere
+	 *    and making the ptep no longer valid.
+	 * 2) It synchronizes us with file truncation.
 	 *
	 * ptep could have already been assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless
-- 
2.17.2



* Re: [PATCH v2 0/2] hugetlbfs: use i_mmap_rwsem for better synchronization
  2018-12-18 22:35 [PATCH v2 0/2] hugetlbfs: use i_mmap_rwsem for better synchronization Mike Kravetz
@ 2018-12-20 21:06 ` Andrew Morton
From: Andrew Morton @ 2018-12-20 21:06 UTC
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Hugh Dickins,
	Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa

On Tue, 18 Dec 2018 14:35:55 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> There are two primary issues addressed here:
> 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
>    invalid via a call to huge_pmd_unshare by another thread.
> 2) hugetlbfs page faults can race with truncation causing invalid global
>    reserve counts and state.
> Both issues are addressed by expanding the use of i_mmap_rwsem.
> 
> These issues have existed for a long time.  They can be recreated with a
> test program that causes page fault/truncation races.  For simple mappings,
> this results in a negative HugePages_Rsvd count.  If racing with mappings
> that contain shared pmds, we can hit "BUG at fs/hugetlbfs/inode.c:444!" or
> an Oops caused by an invalid memory reference.

Still no reviewers or ackers :(

I'll queue these for 4.21-rc1.  The Fixes: commits are over a decade
old so I assume things aren't super-urgent and the cc:stable will do
its work.



* Re: [PATCH v2 1/2] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  2018-12-18 22:35 ` [PATCH v2 1/2] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
@ 2018-12-21 10:05   ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2018-12-21 10:05 UTC
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Hugh Dickins,
	Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	Andrew Morton, stable

On Tue, Dec 18, 2018 at 02:35:56PM -0800, Mike Kravetz wrote:
> While looking at BUGs associated with invalid huge page map counts,
> it was discovered and observed that a huge pte pointer could become
> 'invalid' and point to another task's page table.  Consider the
> following:
> 
> A task takes a page fault on a shared hugetlbfs file and calls
> huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
> shared pmd.
> 
> Now, another task truncates the hugetlbfs file.  As part of truncation,
> it unmaps everyone who has the file mapped.  If the range being
> truncated is covered by a shared pmd, huge_pmd_unshare will be called.
> For all but the last user of the shared pmd, huge_pmd_unshare will
> clear the pud pointing to the pmd.  If the task in the middle of the
> page fault is not the last user, the ptep returned by huge_pte_alloc
> now points to another task's page table or worse.  This leads to bad
> things such as incorrect page map/reference counts or invalid memory
> references.
> 
> To fix, expand the use of i_mmap_rwsem as follows:
> - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
>   huge_pmd_share is only called via huge_pte_alloc, so callers of
>   huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
>   of huge_pte_alloc continue to hold the semaphore until finished with
>   the ptep.
> - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Other than the few questions below, the patch looks reasonable to me.

> @@ -3252,11 +3253,23 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  
>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
>  		spinlock_t *src_ptl, *dst_ptl;
> +
>  		src_pte = huge_pte_offset(src, addr, sz);
>  		if (!src_pte)
>  			continue;
> +
> +		/*
> +		 * i_mmap_rwsem must be held to call huge_pte_alloc.
> +		 * Continue to hold until finished with dst_pte, otherwise
> +		 * it could go away if part of a shared pmd.
> +		 *
> +		 * Technically, i_mmap_rwsem is only needed in the non-cow
> +		 * case as cow mappings are not shared.
> +		 */
> +		i_mmap_lock_read(mapping);

Any reason you do lock/unlock on each iteration rather than around the
whole loop?

>  		dst_pte = huge_pte_alloc(dst, addr, sz);
>  		if (!dst_pte) {
> +			i_mmap_unlock_read(mapping);
>  			ret = -ENOMEM;
>  			break;
>  		}

...

> @@ -3772,14 +3789,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  			};
>  
>  			/*
> -			 * hugetlb_fault_mutex must be dropped before
> -			 * handling userfault.  Reacquire after handling
> -			 * fault to make calling code simpler.
> +			 * hugetlb_fault_mutex and i_mmap_rwsem must be
> +			 * dropped before handling userfault.  Reacquire
> +			 * after handling fault to make calling code simpler.
>  			 */
>  			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
>  							idx, haddr);
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +			i_mmap_unlock_read(mapping);
> +

Do we have the order of hugetlb_fault_mutex vs. i_mmap_lock documented?
It *looks* correct to me, but it's better to write it down somewhere.
Maybe add it to the header of mm/rmap.c?

>  			ret = handle_userfault(&vmf, VM_UFFD_MISSING);
> +
> +			i_mmap_lock_read(mapping);
>  			mutex_lock(&hugetlb_fault_mutex_table[hash]);
>  			goto out;
>  		}

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 2/2] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  2018-12-18 22:35 ` [PATCH v2 2/2] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race Mike Kravetz
@ 2018-12-21 10:28   ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2018-12-21 10:28 UTC
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Hugh Dickins,
	Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	Andrew Morton, stable

On Tue, Dec 18, 2018 at 02:35:57PM -0800, Mike Kravetz wrote:
> Instead of writing the required complicated code for this rare
> occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
> mode for the duration of page fault processing.  Hold i_mmap_rwsem
> longer in truncation and hole punch code to cover the call to
> remove_inode_hugepages.

One of the remove_inode_hugepages() callers is noticeably missing --
hugetlbfs_evict_inode(). Why?

It at least deserves a comment on why the lock rule doesn't apply to it.

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 1/2] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  2018-12-21 10:05   ` Kirill A. Shutemov
@ 2018-12-21 18:20     ` Mike Kravetz
From: Mike Kravetz @ 2018-12-21 18:20 UTC
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Michal Hocko, Hugh Dickins,
	Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	Andrew Morton, stable

On 12/21/18 2:05 AM, Kirill A. Shutemov wrote:
> On Tue, Dec 18, 2018 at 02:35:56PM -0800, Mike Kravetz wrote:
>> While looking at BUGs associated with invalid huge page map counts,
>> it was discovered and observed that a huge pte pointer could become
>> 'invalid' and point to another task's page table.  Consider the
>> following:
>>
>> A task takes a page fault on a shared hugetlbfs file and calls
>> huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
>> shared pmd.
>>
>> Now, another task truncates the hugetlbfs file.  As part of truncation,
>> it unmaps everyone who has the file mapped.  If the range being
>> truncated is covered by a shared pmd, huge_pmd_unshare will be called.
>> For all but the last user of the shared pmd, huge_pmd_unshare will
>> clear the pud pointing to the pmd.  If the task in the middle of the
>> page fault is not the last user, the ptep returned by huge_pte_alloc
>> now points to another task's page table or worse.  This leads to bad
>> things such as incorrect page map/reference counts or invalid memory
>> references.
>>
>> To fix, expand the use of i_mmap_rwsem as follows:
>> - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
>>   huge_pmd_share is only called via huge_pte_alloc, so callers of
>>   huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
>>   of huge_pte_alloc continue to hold the semaphore until finished with
>>   the ptep.
>> - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Other than the few questions below, the patch looks reasonable to me.

Thanks for taking a look.

> 
>> @@ -3252,11 +3253,23 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>>  
>>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
>>  		spinlock_t *src_ptl, *dst_ptl;
>> +
>>  		src_pte = huge_pte_offset(src, addr, sz);
>>  		if (!src_pte)
>>  			continue;
>> +
>> +		/*
>> +		 * i_mmap_rwsem must be held to call huge_pte_alloc.
>> +		 * Continue to hold until finished with dst_pte, otherwise
>> +		 * it could go away if part of a shared pmd.
>> +		 *
>> +		 * Technically, i_mmap_rwsem is only needed in the non-cow
>> +		 * case as cow mappings are not shared.
>> +		 */
>> +		i_mmap_lock_read(mapping);
> 
> Any reason you do lock/unlock on each iteration rather than around the
> whole loop?

I am simply mirroring the page table locking.  This is not necessary.
The page table lock can change while processing the range, but the
i_mmap_rwsem can not.  Therefore, we can hold it around the whole loop.

I will modify, test and put out an updated patch later today.
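
A sketch of that rework, for illustration (the actual v3 patch may differ):

	i_mmap_lock_read(mapping);
	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
		/*
		 * ... per-pte copying as before, minus the per-iteration
		 * lock/unlock; an early break still falls through to the
		 * unlock below ...
		 */
	}
	i_mmap_unlock_read(mapping);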

>>  		dst_pte = huge_pte_alloc(dst, addr, sz);
>>  		if (!dst_pte) {
>> +			i_mmap_unlock_read(mapping);
>>  			ret = -ENOMEM;
>>  			break;
>>  		}
> 
> ...
> 
>> @@ -3772,14 +3789,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>  			};
>>  
>>  			/*
>> -			 * hugetlb_fault_mutex must be dropped before
>> -			 * handling userfault.  Reacquire after handling
>> -			 * fault to make calling code simpler.
>> +			 * hugetlb_fault_mutex and i_mmap_rwsem must be
>> +			 * dropped before handling userfault.  Reacquire
>> +			 * after handling fault to make calling code simpler.
>>  			 */
>>  			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
>>  							idx, haddr);
>>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>> +			i_mmap_unlock_read(mapping);
>> +
> 
> Do we have the order of hugetlb_fault_mutex vs. i_mmap_lock documented?
> It *looks* correct to me, but it's better to write it down somewhere.
> Maybe add it to the header of mm/rmap.c?

No, it is not documented.  I don't think there is much/any documentation
for hugetlb_fault_mutex at all.  I will add it to the lock documentation
in mm/rmap.c as you suggest.
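
For reference, the order this series establishes is i_mmap_rwsem first, then
hugetlb_fault_mutex (hugetlb_fault takes the semaphore before the mutex).
One possible form for the mm/rmap.c comment (a sketch, not necessarily the
wording that ultimately lands):

	/*
	 * hugetlbfs PageHuge() pages take locks in this order:
	 *   mapping->i_mmap_rwsem
	 *     hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
	 *       page->flags PG_locked (lock_page)
	 */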

-- 
Mike Kravetz


* Re: [PATCH v2 2/2] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  2018-12-21 10:28   ` Kirill A. Shutemov
@ 2018-12-21 18:28     ` Mike Kravetz
From: Mike Kravetz @ 2018-12-21 18:28 UTC
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Michal Hocko, Hugh Dickins,
	Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	Andrew Morton, stable

On 12/21/18 2:28 AM, Kirill A. Shutemov wrote:
> On Tue, Dec 18, 2018 at 02:35:57PM -0800, Mike Kravetz wrote:
>> Instead of writing the required complicated code for this rare
>> occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
>> mode for the duration of page fault processing.  Hold i_mmap_rwsem
>> longer in truncation and hole punch code to cover the call to
>> remove_inode_hugepages.
> 
> One of the remove_inode_hugepages() callers is noticeably missing --
> hugetlbfs_evict_inode(). Why?
> 
> It at least deserves a comment on why the lock rule doesn't apply to it.

In the case of hugetlbfs_evict_inode, the vfs layer guarantees there are
no more users of the inode/file.  Therefore, it is safe to call it without
holding the semaphore.  But, I did add this comment to remove_inode_hugepages:

* Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
* races with page faults.

So, I violated the rule that I documented.  Thanks for catching that.

I will update the comments to note this exception to the rule.  Another
option is to simply take the semaphore and still note why it is technically
not needed.  Since there are no users, there will be no contention on the
semaphore and the overhead should be negligible.
-- 
Mike Kravetz


* Re: [PATCH v2 2/2] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  2018-12-21 18:28     ` Mike Kravetz
@ 2018-12-21 20:21       ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2018-12-21 20:21 UTC
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Hugh Dickins,
	Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	Andrew Morton, stable

On Fri, Dec 21, 2018 at 10:28:25AM -0800, Mike Kravetz wrote:
> On 12/21/18 2:28 AM, Kirill A. Shutemov wrote:
> > On Tue, Dec 18, 2018 at 02:35:57PM -0800, Mike Kravetz wrote:
> >> Instead of writing the required complicated code for this rare
> >> occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
> >> mode for the duration of page fault processing.  Hold i_mmap_rwsem
> >> longer in truncation and hole punch code to cover the call to
> >> remove_inode_hugepages.
> > 
> > One of the remove_inode_hugepages() callers is noticeably missing --
> > hugetlbfs_evict_inode(). Why?
> > 
> > It at least deserves a comment on why the lock rule doesn't apply to it.
> 
> In the case of hugetlbfs_evict_inode, the vfs layer guarantees there are
> no more users of the inode/file.

I'm not convinced that it is true. See documentation for ->evict_inode()
in Documentation/filesystems/porting:

	Caller does *not* evict the pagecache or inode-associated
	metadata buffers; the method has to use truncate_inode_pages_final() to get rid
	of those.

Is hugetlbfs special here?

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 2/2] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  2018-12-21 20:21       ` Kirill A. Shutemov
@ 2018-12-21 22:17         ` Mike Kravetz
From: Mike Kravetz @ 2018-12-21 22:17 UTC
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Michal Hocko, Hugh Dickins,
	Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	Andrew Morton, stable

On 12/21/18 12:21 PM, Kirill A. Shutemov wrote:
> On Fri, Dec 21, 2018 at 10:28:25AM -0800, Mike Kravetz wrote:
>> On 12/21/18 2:28 AM, Kirill A. Shutemov wrote:
>>> On Tue, Dec 18, 2018 at 02:35:57PM -0800, Mike Kravetz wrote:
>>>> Instead of writing the required complicated code for this rare
>>>> occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
>>>> mode for the duration of page fault processing.  Hold i_mmap_rwsem
>>>> longer in truncation and hole punch code to cover the call to
>>>> remove_inode_hugepages.
>>>
>>> One of the remove_inode_hugepages() callers is noticeably missing --
>>> hugetlbfs_evict_inode(). Why?
>>>
>>> It at least deserves a comment on why the lock rule doesn't apply to it.
>>
>> In the case of hugetlbfs_evict_inode, the vfs layer guarantees there are
>> no more users of the inode/file.
> 
> I'm not convinced that it is true. See documentation for ->evict_inode()
> in Documentation/filesystems/porting:
> 
> 	Caller does *not* evict the pagecache or inode-associated
> 	metadata buffers; the method has to use truncate_inode_pages_final() to get rid
> 	of those.
> 

We may be talking about different things.

When I say there are no more users, I am talking about users via user space.
We get to the hugetlbfs evict inode code via iput->iput_final->evict.  In
this path the count on the inode is zero, and is marked (I_FREEING) so that
nobody will start using it.  As a result, there can be no additional page
faults against the file.  This is what we are using i_mmap_rwsem to prevent.

The Documentation above says that the ->evict_inode() method must evict the
pagecache and get rid of metadata buffers.  hugetlbfs_evict_inode does this:
remove_inode_hugepages evicts pages from the page cache (and frees them) as
well as cleaning up the hugetlbfs-specific reserve map metadata.

Am I misunderstanding your question/concern?

I have decided to add the locking (although unnecessary) with something like
this in hugetlbfs_evict_inode.

	/*
	 * The vfs layer guarantees that there are no other users of this
	 * inode.  Therefore, it would be safe to call remove_inode_hugepages
	 * without holding i_mmap_rwsem.  We acquire and hold here to be
	 * consistent with other callers.  Since there will be no contention
	 * on the semaphore, overhead is negligible.
	 */
	i_mmap_lock_write(mapping);
	remove_inode_hugepages(inode, 0, LLONG_MAX);
	i_mmap_unlock_write(mapping);

-- 
Mike Kravetz


* Re: [PATCH v2 2/2] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  2018-12-21 22:17         ` Mike Kravetz
@ 2018-12-22 22:14           ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2018-12-22 22:14 UTC
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Hugh Dickins,
	Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	Andrew Morton, stable

On Fri, Dec 21, 2018 at 02:17:32PM -0800, Mike Kravetz wrote:
> Am I misunderstanding your question/concern?

No. Thanks for the clarification.

> 
> I have decided to add the locking (although unnecessary) with something like
> this in hugetlbfs_evict_inode.
> 
> 	/*
> 	 * The vfs layer guarantees that there are no other users of this
> 	 * inode.  Therefore, it would be safe to call remove_inode_hugepages
> 	 * without holding i_mmap_rwsem.  We acquire and hold here to be
> 	 * consistent with other callers.  Since there will be no contention
> 	 * on the semaphore, overhead is negligible.
> 	 */
> 	i_mmap_lock_write(mapping);
> 	remove_inode_hugepages(inode, 0, LLONG_MAX);
> 	i_mmap_unlock_write(mapping);

LGTM.

-- 
 Kirill A. Shutemov

