From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	David Hildenbrand <david@redhat.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Ray Fucillo <Ray.Fucillo@intersystems.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: [RFC PATCH v2 6/6] hugetlb: Check for pmd unshare and fault/lookup races
Date: Wed, 20 Apr 2022 15:37:53 -0700	[thread overview]
Message-ID: <20220420223753.386645-7-mike.kravetz@oracle.com> (raw)
In-Reply-To: <20220420223753.386645-1-mike.kravetz@oracle.com>

When a pmd is 'unshared' it effectively deletes part of a process's
page tables.  The routine huge_pmd_unshare must be called with
i_mmap_rwsem held in write mode and the page table locked.  However,
consider a page fault happening within that same process.  We could
have the following race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare
Could be in a previously                        unlock page table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

If the above race happens, the faulting thread can end up updating a
pte in another process's page tables.
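
For context, ptep becomes invalid because huge_pmd_unshare drops this
process's reference to the shared pmd page and clears the pud entry
pointing to it.  Roughly (a simplified paraphrase of the existing
routine, details elided):

	if (page_count(virt_to_page(ptep)) == 1)
		return 0;		/* pmd page not shared, nothing to do */

	pud_clear(pud);			/* this mm no longer maps the pmd page */
	put_page(virt_to_page(ptep));	/* drop this mm's reference to it */
	mm_dec_nr_pmds(mm);
	return 1;

After this point, a ptep obtained earlier via huge_pte_offset or
huge_pte_alloc points into a page table page that now belongs only to
other sharing processes, or may since have been freed.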

Catch this situation by doing another huge_pte_offset/page table
walk after obtaining the page table lock and comparing pointers.  If
the pointers are different, then we know a race happened and we can
bail out and clean up.

In the fault code, make sure to check for this race AFTER checking for
faults beyond i_size so that the page cache can be cleaned up properly.
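
At each site the check boils down to the following pattern (a sketch
distilled from the hunks below; the bail-out/cleanup path differs per
call site):

	ptl = huge_pte_lock(h, vma, ptep);
	/* Re-walk the page tables and compare against the earlier ptep */
	ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (unlikely(ptep2 != ptep)) {
		/* Raced with huge_pmd_unshare; ptep is stale, back out */
		spin_unlock(ptl);
		goto backout;
	}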

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c1352ab7f941..804a8d0a2cb8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4735,6 +4735,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *src_vma)
 {
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
+	pte_t *src_pte2;
 	struct page *ptepage;
 	unsigned long addr;
 	bool cow = is_cow_mapping(src_vma->vm_flags);
@@ -4783,7 +4784,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
 again:
-		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
+
+		src_pte2 = huge_pte_offset(src, addr, sz);
+		if (unlikely(src_pte2 != src_pte)) {
+			/*
+			 * Another thread could have unshared src_pte.
+			 * Just skip.
+			 */
+			;
+		} else if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
 			/*
 			 * Skip if src entry none.  Also, skip in the
 			 * unlikely case dst entry !none as this implies
@@ -5462,6 +5471,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	bool new_page, new_pagecache_page = false;
 	bool beyond_i_size = false;
 	bool reserve_alloc = false;
+	pte_t *ptep2;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5510,8 +5520,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * sure there really is no pte entry.
 			 */
 			ptl = huge_pte_lock(h, vma, ptep);
+			/* ptep2 checks for a racing page table unshare */
+			ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
 			ret = 0;
-			if (huge_pte_none(huge_ptep_get(ptep)))
+			if (ptep2 == ptep && huge_pte_none(huge_ptep_get(ptep)))
 				ret = vmf_error(PTR_ERR(page));
 			spin_unlock(ptl);
 			goto out;
@@ -5584,6 +5596,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		goto backout;
 	}
 
+	/* Check for a racing page table unshare */
+	ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
+	if (ptep2 != ptep)
+		goto backout;
+
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5677,7 +5694,7 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
-	pte_t *ptep, entry;
+	pte_t *ptep, *ptep2, entry;
 	spinlock_t *ptl;
 	vm_fault_t ret;
 	u32 hash;
@@ -5759,8 +5776,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptl = huge_pte_lock(h, vma, ptep);
 
-	/* Check for a racing update before calling hugetlb_wp() */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+	/* Check for a racing update or unshare before calling hugetlb_wp() */
+	ptep2 = huge_pte_offset(mm, haddr, huge_page_size(h));
+	if (unlikely(ptep2 != ptep || !pte_same(entry, huge_ptep_get(ptep))))
 		goto out_ptl;
 
 	/* Handle userfault-wp first, before trying to lock more pages */
@@ -5861,6 +5879,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	struct page *page;
 	int writable;
 	bool page_in_pagecache = false;
+	pte_t *ptep2;
 
 	if (is_continue) {
 		ret = -EFAULT;
@@ -5976,6 +5995,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_unlock;
 	}
 
+	/* Check for a racing page table unshare */
+	ptep2 = huge_pte_offset(dst_mm, dst_addr, huge_page_size(h));
+	if (unlikely(ptep2 != dst_pte))
+		goto out_release_unlock;
+
 	ret = -EEXIST;
 	/*
 	 * We allow to overwrite a pte marker: consider when both MISSING|WP
-- 
2.35.1

