From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	peterx@redhat.com, Jerome Glisse <jglisse@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Hugh Dickins <hughd@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Nadav Amit <nadav.amit@gmail.com>
Subject: [PATCH RFC 08/30] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
Date: Fri, 15 Jan 2021 12:08:45 -0500
Message-ID: <20210115170907.24498-9-peterx@redhat.com>
In-Reply-To: <20210115170907.24498-1-peterx@redhat.com>

File-backed memory is prone to being unmapped or swapped out, so its ptes are
always unstable.  This can cause userfaultfd-wp information to be lost when
such memory, for example shmem, is unmapped or swapped out.  To keep this
information persistent, we will start to use the newly introduced swap-like
special ptes to replace a null pte when those ptes are removed.

Prepare for this by handling such a special pte before it is put to use.  Here
a new fault flag FAULT_FLAG_UFFD_WP is introduced.  When this flag is set, it
means the current fault is to resolve a page access (either read or write) to
the uffd-wp special pte.

The handling of this special pte page fault is similar to a missing fault, but
it should happen after the pte missing logic since the special pte is designed
to be a swap-like pte.  Meanwhile it should be handled before do_swap_page() so
that the swap core logic won't be confused by such an illegal swap pte.
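
The ordering described above can be sketched as a small user-space model.
This is only an illustration under stated assumptions: the model_* enums and
model_dispatch() are hypothetical names made up for this sketch, not kernel
APIs, and the four pte kinds are a deliberate simplification of what
handle_pte_fault() really has to distinguish.

```c
/*
 * Toy model of the dispatch order: a none pte goes to the missing-fault
 * logic first; a non-present pte is checked for the uffd-wp special marker
 * before it can ever reach the swap path.  All names are hypothetical.
 */
enum model_pte_kind {
	MODEL_PTE_NONE,			/* nothing installed */
	MODEL_PTE_UFFD_WP_SPECIAL,	/* non-present, swap-like marker */
	MODEL_PTE_SWAP,			/* non-present, real swap entry */
	MODEL_PTE_PRESENT,		/* mapped page */
};

enum model_handler {
	MODEL_DO_MISSING,		/* stands in for the pte-missing path */
	MODEL_UFFD_WP_HANDLE_SPECIAL,	/* stands in for uffd_wp_handle_special() */
	MODEL_DO_SWAP_PAGE,		/* stands in for do_swap_page() */
	MODEL_PRESENT_PATH,		/* everything for present ptes */
};

static enum model_handler model_dispatch(enum model_pte_kind kind)
{
	/* The pte-missing logic runs first. */
	if (kind == MODEL_PTE_NONE)
		return MODEL_DO_MISSING;

	if (kind != MODEL_PTE_PRESENT) {
		/* The special pte must be filtered out before the swap core. */
		if (kind == MODEL_PTE_UFFD_WP_SPECIAL)
			return MODEL_UFFD_WP_HANDLE_SPECIAL;
		return MODEL_DO_SWAP_PAGE;
	}

	return MODEL_PRESENT_PATH;
}
```

In this model a special pte never reaches MODEL_DO_SWAP_PAGE, which is the
property the paragraph above asks for.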

This is a slow path of uffd-wp handling, because unmap of wr-protected shmem
ptes should be rare.  So far it should only trigger in two conditions:

  (1) When trying to punch holes in shmem_fallocate(), there will be a
      pre-unmap optimization before evicting the page.  That will create
      unmapped shmem ptes with wr-protected pages covered.

  (2) Swapping out of shmem pages

Because of this, the page fault handling is also simplified by always assuming
a read fault when calling do_fault().  If it is a write fault, it will fault
again when the page access is retried, and then do_wp_page() will handle the
rest of message generation and delivery to the userfaultfd.
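
The two-step behaviour for a write access can be sketched with a toy model.
It is a sketch only: struct model_pte, struct model_result, model_fault() and
model_access() are hypothetical names for illustration, the pte is reduced to
four booleans, and the real code paths (do_fault(), do_wp_page()) do far more
than is shown here.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified pte state; not the kernel's pte_t. */
struct model_pte {
	bool special;	/* uffd-wp special pte installed */
	bool present;
	bool writable;
	bool uffd_wp;
};

struct model_result {
	int faults;	/* number of page faults taken */
	int uffd_msgs;	/* number of uffd-wp messages generated */
};

static void model_fault(struct model_pte *pte, bool write,
			struct model_result *res)
{
	res->faults++;

	if (pte->special) {
		/*
		 * Always resolved as a read fault: install a read-only pte
		 * with the uffd-wp bit set, even if the access was a write.
		 */
		pte->special = false;
		pte->present = true;
		pte->writable = false;
		pte->uffd_wp = true;
		return;
	}

	/* Stands in for do_wp_page() notifying userfaultfd. */
	if (write && pte->present && pte->uffd_wp)
		res->uffd_msgs++;
}

/* Model one access; stops once a uffd-wp message has been generated. */
static struct model_result model_access(struct model_pte *pte, bool write)
{
	struct model_result res = { 0, 0 };

	if (pte->special)
		model_fault(pte, write, &res);
	/* The retried write hits the wr-protected pte and faults again. */
	if (write && pte->present && !pte->writable)
		model_fault(pte, write, &res);

	return res;
}
```

In this model a read of a special pte takes one fault and generates no
message, while a write takes two faults and generates the message on the
second one.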

Disable fault-around for such a special page fault, because the newly
introduced flag (FAULT_FLAG_UFFD_WP) only applies to the current pte rather
than all the pages around it.  Doing fault-around with the new flag could
confuse the rest of the pages when installing ptes from page cache when
there's a cache hit.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |   2 +
 mm/memory.c        | 107 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 105 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..85d928764b64 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -426,6 +426,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UFFD_WP: When installing new page entries, set the uffd-wp bit.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -456,6 +457,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION  		0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_UFFD_WP			0x400
 
 /*
  * The default fault flags that should be used by most of the
diff --git a/mm/memory.c b/mm/memory.c
index 394c2602dce7..0b687f0be4d0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3797,6 +3797,7 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	bool pte_changed, uffd_wp = vmf->flags & FAULT_FLAG_UFFD_WP;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	pte_t entry;
 	vm_fault_t ret;
@@ -3807,14 +3808,27 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 			return ret;
 	}
 
+	/*
+	 * Note: besides pte missing, FAULT_FLAG_UFFD_WP could also trigger
+	 * this path where vmf->pte got released before reaching here.  In that
+	 * case, even if vmf->pte==NULL, the pte actually still contains the
+	 * protection pte (by pte_swp_mkuffd_wp_special()).  For that case,
+	 * we'd also like to allocate a new pte like pte none, but check
+	 * differently for changing pte.
+	 */
 	if (!vmf->pte) {
 		ret = pte_alloc_one_map(vmf);
 		if (ret)
 			return ret;
 	}
 
+	if (unlikely(uffd_wp))
+		pte_changed = !pte_swp_uffd_wp_special(*vmf->pte);
+	else
+		pte_changed = !pte_none(*vmf->pte);
+
 	/* Re-check under ptl */
-	if (unlikely(!pte_none(*vmf->pte))) {
+	if (unlikely(pte_changed)) {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
 		return VM_FAULT_NOPAGE;
 	}
@@ -3824,6 +3838,11 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 	entry = pte_sw_mkyoung(entry);
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (uffd_wp) {
+		/* This should only be triggered by a read fault */
+		WARN_ON_ONCE(write);
+		entry = pte_mkuffd_wp(pte_wrprotect(entry));
+	}
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3997,9 +4016,27 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	return ret;
 }
 
+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+	/* No ->map_pages?  No way to fault around... */
+	if (!vmf->vma->vm_ops->map_pages)
+		return false;
+
+	/*
+	 * Don't do fault around for FAULT_FLAG_UFFD_WP because it means we
+	 * want to recover a previously wr-protected pte.  This flag is a
+	 * per-pte information, so it could confuse all the pages around the
+	 * current page when faulted in.  Give up on that quickly.
+	 */
+	if (vmf->flags & FAULT_FLAG_UFFD_WP)
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret = 0;
 
 	/*
@@ -4007,7 +4044,7 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
+	if (should_fault_around(vmf)) {
 		ret = do_fault_around(vmf);
 		if (ret)
 			return ret;
@@ -4322,6 +4359,68 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 	return VM_FAULT_FALLBACK;
 }
 
+static vm_fault_t uffd_wp_clear_special(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+	/*
+	 * Be careful so that we will only recover a special uffd-wp pte into a
+	 * none pte.  Otherwise it means the pte could have changed, so retry.
+	 */
+	if (pte_swp_uffd_wp_special(*vmf->pte))
+		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed.  It means this pte was wr-protected before being unmapped.
+ */
+vm_fault_t uffd_wp_handle_special(struct vm_fault *vmf)
+{
+	/* Careful!  vmf->pte unmapped after return */
+	if (!pte_unmap_same(vmf))
+		return 0;
+
+	/*
+	 * Just in case there're leftover special ptes even after the region
+	 * got unregistered - we can simply clear them.
+	 */
+	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+		return uffd_wp_clear_special(vmf);
+
+	/*
+	 * Tell all the rest of the fault code: we're handling a special pte,
+	 * always remember to arm the uffd-wp bit when installing the new pte.
+	 */
+	vmf->flags |= FAULT_FLAG_UFFD_WP;
+
+	/*
+	 * Let's assume this is a read fault no matter what.  If it is a real
+	 * write access, it'll fault again into do_wp_page() where the message
+	 * will be generated before the thread yields itself.
+	 *
+	 * Ideally we can also handle write immediately before return, but this
+	 * should be a slow path already (pte unmapped), so be simple first.
+	 */
+	vmf->flags &= ~FAULT_FLAG_WRITE;
+
+	return do_fault(vmf);
+}
+
+static vm_fault_t do_swap_pte(struct vm_fault *vmf)
+{
+	/*
+	 * We need to handle special swap ptes before handling ptes that
+	 * contain swap entries, always.
+	 */
+	if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
+		return uffd_wp_handle_special(vmf);
+
+	return do_swap_page(vmf);
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -4385,7 +4484,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	}
 
 	if (!pte_present(vmf->orig_pte))
-		return do_swap_page(vmf);
+		return do_swap_pte(vmf);
 
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
-- 
2.26.2