linux-mm.kvack.org archive mirror
From: Andrew Morton <akpm@linux-foundation.org>
To: aarcange@redhat.com, akpm@linux-foundation.org,
	aneesh.kumar@linux.vnet.ibm.com, dave@stgolabs.net,
	hughd@google.com, kirill.shutemov@linux.intel.com,
	linux-mm@kvack.org, mhocko@kernel.org, mike.kravetz@oracle.com,
	mm-commits@vger.kernel.org, n-horiguchi@ah.jp.nec.com,
	prakash.sangappa@oracle.com, torvalds@linux-foundation.org
Subject: [patch 139/155] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Date: Wed, 01 Apr 2020 21:11:05 -0700
Message-ID: <20200402041105.i6f4OY-6_%akpm@linux-foundation.org>
In-Reply-To: <20200401210155.09e3b9742e1c6e732f5a7250@linux-foundation.org>

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races.  These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
   invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
   reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2].  However, those patches were reverted starting with [3]
due to locking issues.

To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing.  However, during fault
processing we need to lock the page we will be adding.  Lock ordering
requires we take page lock before i_mmap_rwsem.  Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages.  This is not too invasive as hugetlbfs
processing is done separately from core mm in many places.  However, I don't
really like this idea.  Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching
all the races.  After catching a race, clean up, back out, retry, and so
on, as needed.  This can get really ugly, especially for huge page
reservations.  At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races.  Any other
suggestions would be welcome.

[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/


This patch (of 2):

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table.  Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file.  As part of truncation, it
unmaps everyone who has the file mapped.  If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd.  If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse.  This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with
  the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults.  This is not the order
specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
isolated today.  Therefore, we use this alternative locking order for
PageHuge() pages.

         mapping->i_mmap_rwsem
           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
             page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.

In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma.  A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.

Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/hugetlbfs/inode.c    |    2 
 include/linux/fs.h      |    5 +
 include/linux/hugetlb.h |    8 +
 mm/hugetlb.c            |  156 +++++++++++++++++++++++++++++++++++---
 mm/memory-failure.c     |   29 ++++++-
 mm/migrate.c            |   25 +++++-
 mm/rmap.c               |   17 +++-
 mm/userfaultfd.c        |   11 ++
 8 files changed, 234 insertions(+), 19 deletions(-)

--- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/fs/hugetlbfs/inode.c
@@ -450,7 +450,9 @@ static void remove_inode_hugepages(struc
 			if (unlikely(page_mapped(page))) {
 				BUG_ON(truncate_op);
 
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 				i_mmap_lock_write(mapping);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
 					(index + 1) * pages_per_huge_page(h));
--- a/include/linux/fs.h~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/include/linux/fs.h
@@ -526,6 +526,11 @@ static inline void i_mmap_lock_write(str
 	down_write(&mapping->i_mmap_rwsem);
 }
 
+static inline int i_mmap_trylock_write(struct address_space *mapping)
+{
+	return down_write_trylock(&mapping->i_mmap_rwsem);
+}
+
 static inline void i_mmap_unlock_write(struct address_space *mapping)
 {
 	up_write(&mapping->i_mmap_rwsem);
--- a/include/linux/hugetlb.h~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/include/linux/hugetlb.h
@@ -109,6 +109,8 @@ u32 hugetlb_fault_mutex_hash(struct addr
 
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
 
+struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
+
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
@@ -151,6 +153,12 @@ static inline unsigned long hugetlb_tota
 	return 0;
 }
 
+static inline struct address_space *hugetlb_page_mapping_lock_write(
+							struct page *hpage)
+{
+	return NULL;
+}
+
 static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
 					pte_t *ptep)
 {
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/hugetlb.c
@@ -1322,6 +1322,106 @@ int PageHeadHuge(struct page *page_head)
 	return get_compound_page_dtor(page_head) == free_huge_page;
 }
 
+/*
+ * Find address_space associated with hugetlbfs page.
+ * Upon entry page is locked and page 'was' mapped although mapped state
+ * could change.  If necessary, use anon_vma to find vma and associated
+ * address space.  The returned mapping may be stale, but it can not be
+ * invalid as page lock (which is held) is required to destroy mapping.
+ */
+static struct address_space *_get_hugetlb_page_mapping(struct page *hpage)
+{
+	struct anon_vma *anon_vma;
+	pgoff_t pgoff_start, pgoff_end;
+	struct anon_vma_chain *avc;
+	struct address_space *mapping = page_mapping(hpage);
+
+	/* Simple file based mapping */
+	if (mapping)
+		return mapping;
+
+	/*
+	 * Even anonymous hugetlbfs mappings are associated with an
+	 * underlying hugetlbfs file (see hugetlb_file_setup in mmap
+	 * code).  Find a vma associated with the anonymous vma, and
+	 * use the file pointer to get address_space.
+	 */
+	anon_vma = page_lock_anon_vma_read(hpage);
+	if (!anon_vma)
+		return mapping;  /* NULL */
+
+	/* Use first found vma */
+	pgoff_start = page_to_pgoff(hpage);
+	pgoff_end = pgoff_start + hpage_nr_pages(hpage) - 1;
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+					pgoff_start, pgoff_end) {
+		struct vm_area_struct *vma = avc->vma;
+
+		mapping = vma->vm_file->f_mapping;
+		break;
+	}
+
+	anon_vma_unlock_read(anon_vma);
+	return mapping;
+}
+
+/*
+ * Find and lock address space (mapping) in write mode.
+ *
+ * Upon entry, the page is locked which allows us to find the mapping
+ * even in the case of an anon page.  However, locking order dictates
+ * the i_mmap_rwsem be acquired BEFORE the page lock.  This is hugetlbfs
+ * specific.  So, we first try to lock the sema while still holding the
+ * page lock.  If this works, great!  If not, then we need to drop the
+ * page lock and then acquire i_mmap_rwsem and reacquire page lock.  Of
+ * course, need to revalidate state along the way.
+ */
+struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
+{
+	struct address_space *mapping, *mapping2;
+
+	mapping = _get_hugetlb_page_mapping(hpage);
+retry:
+	if (!mapping)
+		return mapping;
+
+	/*
+	 * If no contention, take lock and return
+	 */
+	if (i_mmap_trylock_write(mapping))
+		return mapping;
+
+	/*
+	 * Must drop page lock and wait on mapping sema.
+	 * Note:  Once page lock is dropped, mapping could become invalid.
+	 * As a hack, increase map count until we lock page again.
+	 */
+	atomic_inc(&hpage->_mapcount);
+	unlock_page(hpage);
+	i_mmap_lock_write(mapping);
+	lock_page(hpage);
+	atomic_add_negative(-1, &hpage->_mapcount);
+
+	/* verify page is still mapped */
+	if (!page_mapped(hpage)) {
+		i_mmap_unlock_write(mapping);
+		return NULL;
+	}
+
+	/*
+	 * Get address space again and verify it is the same one
+	 * we locked.  If not, drop lock and retry.
+	 */
+	mapping2 = _get_hugetlb_page_mapping(hpage);
+	if (mapping2 != mapping) {
+		i_mmap_unlock_write(mapping);
+		mapping = mapping2;
+		goto retry;
+	}
+
+	return mapping;
+}
+
 pgoff_t __basepage_index(struct page *page)
 {
 	struct page *page_head = compound_head(page);
@@ -3312,6 +3412,7 @@ int copy_hugetlb_page_range(struct mm_st
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
 
@@ -3322,6 +3423,14 @@ int copy_hugetlb_page_range(struct mm_st
 					vma->vm_start,
 					vma->vm_end);
 		mmu_notifier_invalidate_range_start(&range);
+	} else {
+		/*
+		 * For shared mappings i_mmap_rwsem must be held to call
+		 * huge_pte_alloc, otherwise the returned ptep could go
+		 * away if part of a shared pmd and another thread calls
+		 * huge_pmd_unshare.
+		 */
+		i_mmap_lock_read(mapping);
 	}
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
@@ -3399,6 +3508,8 @@ int copy_hugetlb_page_range(struct mm_st
 
 	if (cow)
 		mmu_notifier_invalidate_range_end(&range);
+	else
+		i_mmap_unlock_read(mapping);
 
 	return ret;
 }
@@ -3847,13 +3958,15 @@ retry:
 			};
 
 			/*
-			 * hugetlb_fault_mutex must be dropped before
-			 * handling userfault.  Reacquire after handling
-			 * fault to make calling code simpler.
+			 * hugetlb_fault_mutex and i_mmap_rwsem must be
+			 * dropped before handling userfault.  Reacquire
+			 * after handling fault to make calling code simpler.
 			 */
 			hash = hugetlb_fault_mutex_hash(mapping, idx);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
 			ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+			i_mmap_lock_read(mapping);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			goto out;
 		}
@@ -4018,6 +4131,11 @@ vm_fault_t hugetlb_fault(struct mm_struc
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
+		/*
+		 * Since we hold no locks, ptep could be stale.  That is
+		 * OK as we are only making decisions based on content and
+		 * not actually modifying content here.
+		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
@@ -4031,14 +4149,29 @@ vm_fault_t hugetlb_fault(struct mm_struc
 			return VM_FAULT_OOM;
 	}
 
+	/*
+	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
+	 *
+	 * ptep could have already been assigned via huge_pte_offset.  That
+	 * is OK, as huge_pte_alloc will return the same value unless
+	 * something has changed.
+	 */
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, haddr);
+	i_mmap_lock_read(mapping);
+	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+	if (!ptep) {
+		i_mmap_unlock_read(mapping);
+		return VM_FAULT_OOM;
+	}
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
+	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -4126,6 +4259,7 @@ out_ptl:
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -4776,10 +4910,12 @@ void adjust_range_if_pmd_sharing_possibl
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space.  This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
@@ -4796,7 +4932,6 @@ pte_t *huge_pmd_share(struct mm_struct *
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -4826,7 +4961,6 @@ pte_t *huge_pmd_share(struct mm_struct *
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
-	i_mmap_unlock_read(mapping);
 	return pte;
 }
 
@@ -4837,7 +4971,7 @@ out:
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
--- a/mm/memory-failure.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/memory-failure.c
@@ -954,7 +954,7 @@ static bool hwpoison_user_mappings(struc
 	enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
-	bool unmap_success;
+	bool unmap_success = true;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
@@ -1016,7 +1016,32 @@ static bool hwpoison_user_mappings(struc
 	if (kill)
 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_success = try_to_unmap(hpage, ttu);
+	if (!PageHuge(hpage)) {
+		unmap_success = try_to_unmap(hpage, ttu);
+	} else {
+		/*
+		 * For hugetlb pages, try_to_unmap could potentially call
+		 * huge_pmd_unshare.  Because of this, take semaphore in
+		 * write mode here and set TTU_RMAP_LOCKED to indicate we
+		 * have taken the lock at this higher level.
+		 *
+		 * Note that the call to hugetlb_page_mapping_lock_write
+		 * is necessary even if mapping is already set.  It handles
+		 * ugliness of potentially having to drop page lock to obtain
+		 * i_mmap_rwsem.
+		 */
+		mapping = hugetlb_page_mapping_lock_write(hpage);
+
+		if (mapping) {
+			unmap_success = try_to_unmap(hpage,
+						     ttu|TTU_RMAP_LOCKED);
+			i_mmap_unlock_write(mapping);
+		} else {
+			pr_info("Memory failure: %#lx: could not find mapping for mapped huge page\n",
+				pfn);
+			unmap_success = false;
+		}
+	}
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
 		       pfn, page_mapcount(hpage));
--- a/mm/migrate.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/migrate.c
@@ -1282,6 +1282,7 @@ static int unmap_and_move_huge_page(new_
 	int page_was_mapped = 0;
 	struct page *new_hpage;
 	struct anon_vma *anon_vma = NULL;
+	struct address_space *mapping = NULL;
 
 	/*
 	 * Migratability of hugepages depends on architectures and their size.
@@ -1329,18 +1330,36 @@ static int unmap_and_move_huge_page(new_
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
+		/*
+		 * try_to_unmap could potentially call huge_pmd_unshare.
+		 * Because of this, take semaphore in write mode here and
+		 * set TTU_RMAP_LOCKED to let lower levels know we have
+		 * taken the lock.
+		 */
+		mapping = hugetlb_page_mapping_lock_write(hpage);
+		if (unlikely(!mapping))
+			goto unlock_put_anon;
+
 		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+			TTU_RMAP_LOCKED);
 		page_was_mapped = 1;
+		/*
+		 * Leave mapping locked until after subsequent call to
+		 * remove_migration_ptes()
+		 */
 	}
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, mode);
 
-	if (page_was_mapped)
+	if (page_was_mapped) {
 		remove_migration_ptes(hpage,
-			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false);
+			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, true);
+		i_mmap_unlock_write(mapping);
+	}
 
+unlock_put_anon:
 	unlock_page(new_hpage);
 
 put_anon:
--- a/mm/rmap.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/rmap.c
@@ -22,9 +22,10 @@
  *
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
  *   mm->mmap_sem
- *     page->flags PG_locked (lock_page)
+ *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
  *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
  *         mapping->i_mmap_rwsem
+ *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
  *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -43,6 +44,11 @@
  * anon_vma->rwsem,mapping->i_mutex      (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
+ *
+ * * hugetlbfs PageHuge() pages take locks in this order:
+ *         mapping->i_mmap_rwsem
+ *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
+ *             page->flags PG_locked (lock_page)
  */
 
 #include <linux/mm.h>
@@ -1409,6 +1415,9 @@ static bool try_to_unmap_one(struct page
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
+		 *
+		 * If called for a huge page, caller must hold i_mmap_rwsem
+		 * in write mode as it is possible to call huge_pmd_unshare.
 		 */
 		adjust_range_if_pmd_sharing_possible(vma, &range.start,
 						     &range.end);
@@ -1456,6 +1465,12 @@ static bool try_to_unmap_one(struct page
 		address = pvmw.address;
 
 		if (PageHuge(page)) {
+			/*
+			 * To call huge_pmd_unshare, i_mmap_rwsem must be
+			 * held in write mode.  Caller needs to explicitly
+			 * do this outside rmap routines.
+			 */
+			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
 			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
 				/*
 				 * huge_pmd_unshare unmapped an entire PMD
--- a/mm/userfaultfd.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/userfaultfd.c
@@ -276,10 +276,14 @@ retry:
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
-		 * Serialize via hugetlb_fault_mutex
+		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+		 * i_mmap_rwsem ensures the dst_pte remains valid even
+		 * in the case of shared pmds.  fault mutex prevents
+		 * races with other faulting threads.
 		 */
-		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
+		i_mmap_lock_read(mapping);
+		idx = linear_page_index(dst_vma, dst_addr);
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -287,6 +291,7 @@ retry:
 		dst_pte = huge_pte_alloc(dst_mm, dst_addr, vma_hpagesize);
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -294,6 +299,7 @@ retry:
 		dst_pteval = huge_ptep_get(dst_pte);
 		if (!huge_pte_none(dst_pteval)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -301,6 +307,7 @@ retry:
 						dst_addr, src_addr, &page);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		i_mmap_unlock_read(mapping);
 		vm_alloc_shared = vm_shared;
 
 		cond_resched();
_


2020-04-02  4:09 ` [patch 111/155] mm/sparse.c: allocate memmap preferring the given node Andrew Morton
2020-04-02  4:09 ` [patch 112/155] kasan: detect negative size in memory operation function Andrew Morton
2020-04-02  4:09 ` [patch 113/155] kasan: add test for invalid size in memmove Andrew Morton
2020-04-02  4:09 ` [patch 114/155] mm/page_alloc: increase default min_free_kbytes bound Andrew Morton
2020-04-02  4:09 ` [patch 115/155] mm, pagealloc: micro-optimisation: save two branches on hot page allocation path Andrew Morton
2020-04-02  4:09 ` [patch 116/155] mm/page_alloc.c: use free_area_empty() instead of open-coding Andrew Morton
2020-04-02  4:09 ` [patch 117/155] mm/page_alloc.c: micro-optimisation Remove unnecessary branch Andrew Morton
2020-04-02  4:09 ` [patch 118/155] mm/page_alloc: simplify page_is_buddy() for better code readability Andrew Morton
2020-04-02  4:09 ` [patch 119/155] mm: vmpressure: don't need call kfree if kstrndup fails Andrew Morton
2020-04-02  4:10 ` [patch 120/155] mm: vmpressure: use mem_cgroup_is_root API Andrew Morton
2020-04-02  4:10 ` [patch 121/155] mm: vmscan: replace open codings to NUMA_NO_NODE Andrew Morton
2020-04-02  4:10 ` [patch 122/155] mm/vmscan.c: remove cpu online notification for now Andrew Morton
2020-04-02  4:10 ` [patch 123/155] mm/vmscan.c: fix data races using kswapd_classzone_idx Andrew Morton
2020-04-02  4:10 ` [patch 124/155] mm/vmscan.c: clean code by removing unnecessary assignment Andrew Morton
2020-04-02  4:10 ` [patch 125/155] mm/vmscan.c: make may_enter_fs bool in shrink_page_list() Andrew Morton
2020-04-02  4:10 ` [patch 126/155] mm/vmscan.c: do_try_to_free_pages(): clean code by removing unnecessary assignment Andrew Morton
2020-04-02  4:10 ` [patch 127/155] selftests: vm: drop dependencies on page flags from mlock2 tests Andrew Morton
2020-04-02  4:10 ` [patch 128/155] mm,compaction,cma: add alloc_contig flag to compact_control Andrew Morton
2020-04-02  4:10 ` [patch 129/155] mm,thp,compaction,cma: allow THP migration for CMA allocations Andrew Morton
2020-04-02  4:10 ` [patch 130/155] mm, compaction: fully assume capture is not NULL in compact_zone_order() Andrew Morton
2020-04-02  4:10 ` [patch 131/155] mm/compaction: really limit compact_unevictable_allowed to 0 and 1 Andrew Morton
2020-04-02  4:10 ` [patch 132/155] mm/compaction: Disable compact_unevictable_allowed on RT Andrew Morton
2020-04-02  4:10 ` [patch 133/155] mm/compaction.c: clean code by removing unnecessary assignment Andrew Morton
2020-04-02  4:10 ` [patch 134/155] mm/mempolicy: support MPOL_MF_STRICT for huge page mapping Andrew Morton
2020-04-02  4:10 ` [patch 135/155] mm/mempolicy: check hugepage migration is supported by arch in vma_migratable() Andrew Morton
2020-04-02  4:10 ` [patch 136/155] mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk() Andrew Morton
2020-04-02  4:10 ` [patch 137/155] mm: mempolicy: require at least one nodeid for MPOL_PREFERRED Andrew Morton
2020-04-02  4:11 ` [patch 138/155] mm/memblock.c: remove redundant assignment to variable max_addr Andrew Morton
2020-04-02  4:11 ` Andrew Morton [this message]
2020-04-02  4:11 ` [patch 140/155] hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race Andrew Morton
2020-04-02  4:11 ` [patch 141/155] hugetlb_cgroup: add hugetlb_cgroup reservation counter Andrew Morton
2020-04-02  4:11 ` [patch 142/155] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations Andrew Morton
2020-04-02  4:11 ` [patch 143/155] mm/hugetlb_cgroup: fix hugetlb_cgroup migration Andrew Morton
2020-04-02  4:11 ` [patch 144/155] hugetlb_cgroup: add reservation accounting for private mappings Andrew Morton
2020-04-02  4:11 ` [patch 145/155] hugetlb: disable region_add file_region coalescing Andrew Morton
2020-04-02  4:11 ` [patch 146/155] hugetlb_cgroup: add accounting for shared mappings Andrew Morton
2020-04-02  4:11 ` [patch 147/155] hugetlb_cgroup: support noreserve mappings Andrew Morton
2020-04-02  4:11 ` [patch 148/155] hugetlb: support file_region coalescing again Andrew Morton
2020-04-02  4:11 ` [patch 149/155] hugetlb_cgroup: add hugetlb_cgroup reservation tests Andrew Morton
2020-04-02  4:11 ` [patch 150/155] hugetlb_cgroup: add hugetlb_cgroup reservation docs Andrew Morton
2020-04-02  4:11 ` [patch 151/155] mm/hugetlb.c: clean code by removing unnecessary initialization Andrew Morton
2020-04-02  4:11 ` [patch 152/155] mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge() Andrew Morton
2020-04-02  4:11 ` [patch 153/155] selftests/vm: fix map_hugetlb length used for testing read and write Andrew Morton
2020-04-02  4:11 ` [patch 154/155] mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS Andrew Morton
2020-04-02  4:11 ` [patch 155/155] include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP Andrew Morton
