linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 00/24] userfaultfd: write protection support
@ 2019-01-21  7:56 Peter Xu
  2019-01-21  7:56 ` [PATCH RFC 01/24] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
                   ` (24 more replies)
  0 siblings, 25 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Hi,

This series implements initial write protection support for
userfaultfd.  Currently both shmem and hugetlbfs are not supported
yet, but only anonymous memory.

To be simple, either "userfaultfd-wp" or "uffd-wp" might be used in
later paragraphs.

The whole series can also be found at:

  https://github.com/xzpeter/linux/tree/uffd-wp-merged

Any comment would be greatly welcomed.   Thanks.

Overview
====================

The uffd-wp work was initialized by Shaohua Li [1], and later
continued by Andrea [2]. This series is based upon Andrea's latest
userfaultfd tree, and it is a continuous works from both Shaohua and
Andrea.  Many of the follow up ideas come from Andrea too.

Besides the old MISSING register mode of userfaultfd, the new uffd-wp
support provides another alternative register mode called
UFFDIO_REGISTER_MODE_WP that can be used to listen to not only missing
page faults but also write protection page faults, or even they can be
registered together.  At the same time, the new feature also provides
a new userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows the
userspace to write protect a range or memory or fixup write permission
of faulted pages.

Please refer to the document patch "userfaultfd: wp:
UFFDIO_REGISTER_MODE_WP documentation update" for more information on
the new interface and what it can do.

The major workflow of an uffd-wp program should be:

  1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP

  2. Write protect part of the whole registered region using
     UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
     show that we want to write protect the range.

  3. Start a working thread that modifies the protected pages,
     meanwhile listening to UFFD messages.

  4. When a write is detected upon the protected range, page fault
     happens, a UFFD message will be generated and reported to the
     page fault handling thread

  5. The page fault handler thread resolves the page fault using the
     new UFFDIO_WRITEPROTECT ioctl, but this time passing in
     !UFFDIO_WRITEPROTECT_MODE_WP instead showing that we want to
     recover the write permission.  Before this operation, the fault
     handler thread can do anything it wants, e.g., dumps the page to
     a persistent storage.

  6. The worker thread will continue running with the correctly
     applied write permission from step 5.

Currently there are already two projects that are based on this new
userfaultfd feature.

QEMU Live Snapshot: The project provides a way to allow the QEMU
                    hypervisor to take snapshot of VMs without
                    stopping the VM [3].

LLNL umap library:  The project provides a mmap-like interface and
                    "allow to have an application specific buffer of
                    pages cached from a large file, i.e. out-of-core
                    execution using memory map" [4][5].

Before posting the patchset, this series was smoke tested against QEMU
live snapshot and the LLNL umap library (by doing parallel quicksort
using 128 sorting threads + 80 uffd servicing threads).  My sincere
thanks to Marty Mcfadden and Denis Plotnikov for the help along the
way.

Implementation
==============

Patch 1-4: The whole uffd-wp requires the kernel page fault path to
           take more than one retries.  In the previous works starting
           from Shaohua, a new fault flag FAULT_FLAG_ALLOW_UFFD_RETRY
           was introduced for this [6]. However in this series we have
           dropped that patch, instead the whole work is based on the
           recent series "[PATCH RFC v3 0/4] mm: some enhancements to
           the page fault mechanism" [7] which removes the assuption
           that VM_FAULT_RETRY can only happen once.  This four
           patches are identital patches but picked up here.  Please
           refer to the cover letter [7] for more information.  More
           discussion upstream shows that this work could even benefit
           existing use case [8] so please help justify whether
           patches 1-4 can be consider to be accepted even earlier
           than the rest of the series.

Patch 5-21:   Implements the uffd-wp logic.  To avoid collision with
              existing write protections (e.g., an private anonymous
              page can be write protected if it was shared between
              multiple processes), a new PTE bit (_PAGE_UFFD_WP) was
              introduced to explicitly mark a PTE as userfault
              write-protected.  A similar bit was also used in the
              swap/migration entry (_PAGE_SWP_UFFD_WP) to make sure
              even if the pages were swapped or migrated, the uffd-wp
              tracking information won't be lost.  When resolving a
              page fault, we'll do a page copy before hand if the page
              was COWed to make sure we won't corrupt any shared
              pages.  Etc.  Please see separated patches for more
              details.

Patch 22:     Documentation update for uffd-wp

Patch 23,24:  Uffd-wp selftests

TODO
=============

- hugetlbfs/shmem support
- performance
- more architectures
- ...

References
==========

[1] https://lwn.net/Articles/666187/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
[3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
[4] https://github.com/LLNL/umap
[5] https://llnl-umap.readthedocs.io/en/develop/
[6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
[7] https://lkml.org/lkml/2018/11/21/370
[8] https://lkml.org/lkml/2018/12/30/64

Andrea Arcangeli (5):
  userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  userfaultfd: wp: hook userfault handler to write protection fault
  userfaultfd: wp: add WP pagetable tracking to x86
  userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
  userfaultfd: wp: add UFFDIO_COPY_MODE_WP

Martin Cracauer (1):
  userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update

Peter Xu (15):
  mm: gup: rename "nonblocking" to "locked" where proper
  mm: userfault: return VM_FAULT_RETRY on signals
  mm: allow VM_FAULT_RETRY for multiple times
  mm: gup: allow VM_FAULT_RETRY for multiple times
  mm: merge parameters for change_protection()
  userfaultfd: wp: apply _PAGE_UFFD_WP bit
  mm: export wp_page_copy()
  userfaultfd: wp: handle COW properly for uffd-wp
  userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
  userfaultfd: wp: support swap and page migration
  userfaultfd: wp: don't wake up when doing write protect
  khugepaged: skip collapse if uffd-wp detected
  userfaultfd: selftests: refactor statistics
  userfaultfd: selftests: add write-protect test

Shaohua Li (3):
  userfaultfd: wp: add helper for writeprotect check
  userfaultfd: wp: support write protection for userfault vma range
  userfaultfd: wp: enabled write protection in userfaultfd API

 Documentation/admin-guide/mm/userfaultfd.rst |  51 +++++
 arch/alpha/mm/fault.c                        |   4 +-
 arch/arc/mm/fault.c                          |  12 +-
 arch/arm/mm/fault.c                          |  17 +-
 arch/arm64/mm/fault.c                        |  11 +-
 arch/hexagon/mm/vm_fault.c                   |   3 +-
 arch/ia64/mm/fault.c                         |   3 +-
 arch/m68k/mm/fault.c                         |   5 +-
 arch/microblaze/mm/fault.c                   |   3 +-
 arch/mips/mm/fault.c                         |   3 +-
 arch/nds32/mm/fault.c                        |   7 +-
 arch/nios2/mm/fault.c                        |   5 +-
 arch/openrisc/mm/fault.c                     |   3 +-
 arch/parisc/mm/fault.c                       |   4 +-
 arch/powerpc/mm/fault.c                      |   9 +-
 arch/riscv/mm/fault.c                        |   9 +-
 arch/s390/mm/fault.c                         |  14 +-
 arch/sh/mm/fault.c                           |   5 +-
 arch/sparc/mm/fault_32.c                     |   4 +-
 arch/sparc/mm/fault_64.c                     |   4 +-
 arch/um/kernel/trap.c                        |   6 +-
 arch/unicore32/mm/fault.c                    |  10 +-
 arch/x86/Kconfig                             |   1 +
 arch/x86/include/asm/pgtable.h               |  67 ++++++
 arch/x86/include/asm/pgtable_64.h            |   8 +-
 arch/x86/include/asm/pgtable_types.h         |  11 +-
 arch/x86/mm/fault.c                          |  13 +-
 arch/xtensa/mm/fault.c                       |   4 +-
 fs/userfaultfd.c                             | 110 +++++----
 include/asm-generic/pgtable.h                |   1 +
 include/asm-generic/pgtable_uffd.h           |  66 ++++++
 include/linux/huge_mm.h                      |   2 +-
 include/linux/mm.h                           |  21 +-
 include/linux/swapops.h                      |   2 +
 include/linux/userfaultfd_k.h                |  41 +++-
 include/trace/events/huge_memory.h           |   1 +
 include/uapi/linux/userfaultfd.h             |  28 ++-
 init/Kconfig                                 |   5 +
 mm/gup.c                                     |  61 ++---
 mm/huge_memory.c                             |  28 ++-
 mm/hugetlb.c                                 |   8 +-
 mm/khugepaged.c                              |  23 ++
 mm/memory.c                                  |  28 ++-
 mm/mempolicy.c                               |   2 +-
 mm/migrate.c                                 |   7 +
 mm/mprotect.c                                |  99 +++++++--
 mm/rmap.c                                    |   6 +
 mm/userfaultfd.c                             |  92 +++++++-
 tools/testing/selftests/vm/userfaultfd.c     | 222 ++++++++++++++-----
 49 files changed, 898 insertions(+), 251 deletions(-)
 create mode 100644 include/asm-generic/pgtable_uffd.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH RFC 01/24] mm: gup: rename "nonblocking" to "locked" where proper
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
@ 2019-01-21  7:56 ` Peter Xu
  2019-01-21 10:20   ` Mike Rapoport
  2019-01-21  7:57 ` [PATCH RFC 02/24] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

There's plenty of places around __get_user_pages() that has a parameter
"nonblocking" which does not really mean that "it won't block" (because
it can really block) but instead it shows whether the mmap_sem is
released by up_read() during the page fault handling mostly when
VM_FAULT_RETRY is returned.

We have the correct naming in e.g. get_user_pages_locked() or
get_user_pages_remote() as "locked", however there're still many places
that are using the "nonblocking" as name.

Renaming the places to "locked" where proper to better suite the
functionality of the variable.  While at it, fixing up some of the
comments accordingly.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 44 +++++++++++++++++++++-----------------------
 mm/hugetlb.c |  8 ++++----
 2 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8cb68a50dbdf..7b1f452cc2ef 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -506,12 +506,12 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 }
 
 /*
- * mmap_sem must be held on entry.  If @nonblocking != NULL and
- * *@flags does not include FOLL_NOWAIT, the mmap_sem may be released.
- * If it is, *@nonblocking will be set to 0 and -EBUSY returned.
+ * mmap_sem must be held on entry.  If @locked != NULL and *@flags
+ * does not include FOLL_NOWAIT, the mmap_sem may be released.  If it
+ * is, *@locked will be set to 0 and -EBUSY returned.
  */
 static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
-		unsigned long address, unsigned int *flags, int *nonblocking)
+		unsigned long address, unsigned int *flags, int *locked)
 {
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
@@ -523,7 +523,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_WRITE;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
-	if (nonblocking)
+	if (locked)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
@@ -549,8 +549,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	}
 
 	if (ret & VM_FAULT_RETRY) {
-		if (nonblocking && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-			*nonblocking = 0;
+		if (locked && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
+			*locked = 0;
 		return -EBUSY;
 	}
 
@@ -627,7 +627,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  *		only intends to ensure the pages are faulted in.
  * @vmas:	array of pointers to vmas corresponding to each page.
  *		Or NULL if the caller does not require them.
- * @nonblocking: whether waiting for disk IO or mmap_sem contention
+ * @locked:     whether we're still with the mmap_sem held
  *
  * Returns number of pages pinned. This may be fewer than the number
  * requested. If nr_pages is 0 or negative, returns 0. If no pages
@@ -656,13 +656,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  * appropriate) must be called after the page is finished with, and
  * before put_page is called.
  *
- * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
- * or mmap_sem contention, and if waiting is needed to pin all pages,
- * *@nonblocking will be set to 0.  Further, if @gup_flags does not
- * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
- * this case.
+ * If @locked != NULL, *@locked will be set to 0 when mmap_sem is
+ * released by an up_read().  That can happen if @gup_flags does not
+ * has FOLL_NOWAIT.
  *
- * A caller using such a combination of @nonblocking and @gup_flags
+ * A caller using such a combination of @locked and @gup_flags
  * must therefore hold the mmap_sem for reading only, and recognize
  * when it's been released.  Otherwise, it must be held for either
  * reading or writing and will not be released.
@@ -674,7 +672,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
-		struct vm_area_struct **vmas, int *nonblocking)
+		struct vm_area_struct **vmas, int *locked)
 {
 	long ret = 0, i = 0;
 	struct vm_area_struct *vma = NULL;
@@ -718,7 +716,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (is_vm_hugetlb_page(vma)) {
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags, nonblocking);
+						gup_flags, locked);
 				continue;
 			}
 		}
@@ -736,7 +734,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,
-					nonblocking);
+					   locked);
 			switch (ret) {
 			case 0:
 				goto retry;
@@ -1195,7 +1193,7 @@ EXPORT_SYMBOL(get_user_pages_longterm);
  * @vma:   target vma
  * @start: start address
  * @end:   end address
- * @nonblocking:
+ * @locked: whether the mmap_sem is still held
  *
  * This takes care of mlocking the pages too if VM_LOCKED is set.
  *
@@ -1203,14 +1201,14 @@ EXPORT_SYMBOL(get_user_pages_longterm);
  *
  * vma->vm_mm->mmap_sem must be held.
  *
- * If @nonblocking is NULL, it may be held for read or write and will
+ * If @locked is NULL, it may be held for read or write and will
  * be unperturbed.
  *
- * If @nonblocking is non-NULL, it must held for read only and may be
- * released.  If it's released, *@nonblocking will be set to 0.
+ * If @locked is non-NULL, it must held for read only and may be
+ * released.  If it's released, *@locked will be set to 0.
  */
 long populate_vma_page_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, int *nonblocking)
+		unsigned long start, unsigned long end, int *locked)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long nr_pages = (end - start) / PAGE_SIZE;
@@ -1245,7 +1243,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	 * not result in a stack expansion that recurses back here.
 	 */
 	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
-				NULL, NULL, nonblocking);
+				NULL, NULL, locked);
 }
 
 /*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 705a3e9cc910..05b879bda10a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4181,7 +4181,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
 			 unsigned long *position, unsigned long *nr_pages,
-			 long i, unsigned int flags, int *nonblocking)
+			 long i, unsigned int flags, int *locked)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
@@ -4252,7 +4252,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				spin_unlock(ptl);
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
-			if (nonblocking)
+			if (locked)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 			if (flags & FOLL_NOWAIT)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
@@ -4269,8 +4269,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				break;
 			}
 			if (ret & VM_FAULT_RETRY) {
-				if (nonblocking)
-					*nonblocking = 0;
+				if (locked)
+					*locked = 0;
 				*nr_pages = 0;
 				/*
 				 * VM_FAULT_RETRY must not return an
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 02/24] mm: userfault: return VM_FAULT_RETRY on signals
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
  2019-01-21  7:56 ` [PATCH RFC 01/24] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 15:40   ` Jerome Glisse
  2019-01-21  7:57 ` [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

There was a special path in handle_userfault() in the past that we'll
return a VM_FAULT_NOPAGE when we detected non-fatal signals when waiting
for userfault handling.  We did that by reacquiring the mmap_sem before
returning.  However that brings a risk in that the vmas might have
changed when we retake the mmap_sem and even we could be holding an
invalid vma structure.  The problem was reported by syzbot.

This patch removes the special path and we'll return a VM_FAULT_RETRY
with the common path even if we have got such signals.  Then for all the
architectures that is passing in VM_FAULT_ALLOW_RETRY into
handle_mm_fault(), we check not only for SIGKILL but for all the rest of
userspace pending signals right after we returned from
handle_mm_fault().

The idea comes from the upstream discussion between Linus and Andrea:

  https://lkml.org/lkml/2017/10/30/560

(This patch contains a potential fix for a double-free of mmap_sem on
 ARC architecture; please see https://lkml.org/lkml/2018/11/1/723 for
 more information)

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c      |  2 +-
 arch/arc/mm/fault.c        | 11 +++++++----
 arch/arm/mm/fault.c        | 14 ++++++++++----
 arch/arm64/mm/fault.c      |  6 +++---
 arch/hexagon/mm/vm_fault.c |  2 +-
 arch/ia64/mm/fault.c       |  2 +-
 arch/m68k/mm/fault.c       |  2 +-
 arch/microblaze/mm/fault.c |  2 +-
 arch/mips/mm/fault.c       |  2 +-
 arch/nds32/mm/fault.c      |  6 +++---
 arch/nios2/mm/fault.c      |  2 +-
 arch/openrisc/mm/fault.c   |  2 +-
 arch/parisc/mm/fault.c     |  2 +-
 arch/powerpc/mm/fault.c    |  4 +++-
 arch/riscv/mm/fault.c      |  4 ++--
 arch/s390/mm/fault.c       |  9 ++++++---
 arch/sh/mm/fault.c         |  4 ++++
 arch/sparc/mm/fault_32.c   |  3 +++
 arch/sparc/mm/fault_64.c   |  3 +++
 arch/um/kernel/trap.c      |  5 ++++-
 arch/unicore32/mm/fault.c  |  4 ++--
 arch/x86/mm/fault.c        | 12 +++++++++++-
 arch/xtensa/mm/fault.c     |  3 +++
 fs/userfaultfd.c           | 24 ------------------------
 24 files changed, 73 insertions(+), 57 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index d73dc473fbb9..46e5e420ad2a 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	   the fault.  */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index e2d9fc3fea01..91492d244ea6 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -142,11 +142,14 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	fault = handle_mm_fault(vma, address, flags);
 
 	/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
-	if (unlikely(fatal_signal_pending(current))) {
-		if ((fault & VM_FAULT_ERROR) && !(fault & VM_FAULT_RETRY))
+	if (unlikely(fatal_signal_pending(current) && user_mode(regs))) {
+		/*
+		 * VM_FAULT_RETRY means we have released the mmap_sem,
+		 * otherwise we need to drop it before leaving
+		 */
+		if (!(fault & VM_FAULT_RETRY))
 			up_read(&mm->mmap_sem);
-		if (user_mode(regs))
-			return;
+		return;
 	}
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index f4ea4c62c613..743077d19669 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -308,14 +308,20 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 
 	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
 
-	/* If we need to retry but a fatal signal is pending, handle the
+	/* If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		if (!user_mode(regs))
+	if (fault & VM_FAULT_RETRY) {
+		if (fatal_signal_pending(current) && !user_mode(regs))
 			goto no_context;
-		return 0;
+		else if (signal_pending(current))
+			/*
+			 * It's either a common signal, or a fatal
+			 * signal but for the userspace, we return
+			 * immediately.
+			 */
+			return 0;
 	}
 
 	/*
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 7d9571f4ae3d..744d6451ea83 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -499,13 +499,13 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 
 	if (fault & VM_FAULT_RETRY) {
 		/*
-		 * If we need to retry but a fatal signal is pending,
+		 * If we need to retry but a signal is pending,
 		 * handle the signal first. We do not need to release
 		 * the mmap_sem because it would already be released
 		 * in __lock_page_or_retry in mm/filemap.c.
 		 */
-		if (fatal_signal_pending(current)) {
-			if (!user_mode(regs))
+		if (signal_pending(current)) {
+			if (fatal_signal_pending(current) && !user_mode(regs))
 				goto no_context;
 			return 0;
 		}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index eb263e61daf4..be10b441d9cc 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -104,7 +104,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	/* The most common case -- we are done. */
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 5baeb022f474..62c2d39d2bed 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -163,7 +163,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 9b6163c05a75..d9808a807ab8 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,7 +138,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	fault = handle_mm_fault(vma, address, flags);
 	pr_debug("handle_mm_fault returns %x\n", fault);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return 0;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 202ad6a494f5..4fd2dbd0c5ca 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -217,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 73d8a0f0b810..92374fd091d2 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -154,7 +154,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index b740534b152c..72461745d3e1 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -207,12 +207,12 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	fault = handle_mm_fault(vma, addr, flags);
 
 	/*
-	 * If we need to retry but a fatal signal is pending, handle the
+	 * If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		if (!user_mode(regs))
+	if (fault & VM_FAULT_RETRY && signal_pending(current)) {
+		if (fatal_signal_pending(current) && !user_mode(regs))
 			goto no_context;
 		return;
 	}
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 24fd84cf6006..5939434a31ae 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -134,7 +134,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index dc4dbafc1d83..873ecb5d82d7 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -165,7 +165,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index c8e8b7c05558..29422eec329d 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -303,7 +303,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 1697e903bbf2..8bc0d091f13c 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -575,8 +575,10 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 			 */
 			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
-			if (!fatal_signal_pending(current))
+			if (!signal_pending(current))
 				goto retry;
+			else if (!fatal_signal_pending(current) && is_user)
+				return 0;
 		}
 
 		/*
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 88401d5125bc..4fc8d746bec3 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -123,11 +123,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	fault = handle_mm_fault(vma, addr, flags);
 
 	/*
-	 * If we need to retry but a fatal signal is pending, handle the
+	 * If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(tsk))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 2b8f32f56e0c..19b4fb2fafab 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -500,9 +500,12 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
-	/* No reason to continue if interrupted by SIGKILL. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		fault = VM_FAULT_SIGNAL;
+	/* Do not continue if interrupted by signals. */
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
+		if (fatal_signal_pending(current))
+			fault = VM_FAULT_SIGNAL;
+		else
+			fault = 0;
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			goto out_up;
 		goto out;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 6defd2c6d9b1..baf5d73df40c 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -506,6 +506,10 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 			 * have already released it in __lock_page_or_retry
 			 * in mm/filemap.c.
 			 */
+
+			if (user_mode(regs) && signal_pending(tsk))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index b0440b0edd97..a2c83104fe35 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -269,6 +269,9 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(tsk))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 8f8a604c1300..cad71ec5c7b3 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -467,6 +467,9 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(current))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 0e8b6158f224..09baf37b65b9 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -76,8 +76,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 
 		fault = handle_mm_fault(vma, address, flags);
 
-		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+		if (fault & VM_FAULT_RETRY && signal_pending(current)) {
+			if (is_user && !fatal_signal_pending(current))
+				err = 0;
 			goto out_nosemaphore;
+		}
 
 		if (unlikely(fault & VM_FAULT_ERROR)) {
 			if (fault & VM_FAULT_OOM) {
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index b9a3a50644c1..3611f19234a1 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -248,11 +248,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 
 	fault = __do_pf(mm, addr, fsr, flags, tsk);
 
-	/* If we need to retry but a fatal signal is pending, handle the
+	/* If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return 0;
 
 	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 71d4b9d4d43f..b94ef0c2b98c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1433,8 +1433,18 @@ void do_user_addr_fault(struct pt_regs *regs,
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
 			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
-			if (!fatal_signal_pending(tsk))
+			if (!signal_pending(tsk))
 				goto retry;
+			else if (!fatal_signal_pending(tsk))
+				/*
+				 * There is a signal for the task but
+				 * it's not fatal, let's return
+				 * directly to the userspace.  This
+				 * gives chance for signals like
+				 * SIGSTOP/SIGCONT to be handled
+				 * faster, e.g., with GDB.
+				 */
+				return;
 		}
 
 		/* User mode? Just return to handle the fatal exception */
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 2ab0e0dcd166..792dad5e2f12 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -136,6 +136,9 @@ void do_page_fault(struct pt_regs *regs)
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(current))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 270d4888c6d5..bc9f6230a3f0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -515,30 +515,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	__set_current_state(TASK_RUNNING);
 
-	if (return_to_userland) {
-		if (signal_pending(current) &&
-		    !fatal_signal_pending(current)) {
-			/*
-			 * If we got a SIGSTOP or SIGCONT and this is
-			 * a normal userland page fault, just let
-			 * userland return so the signal will be
-			 * handled and gdb debugging works.  The page
-			 * fault code immediately after we return from
-			 * this function is going to release the
-			 * mmap_sem and it's not depending on it
-			 * (unlike gup would if we were not to return
-			 * VM_FAULT_RETRY).
-			 *
-			 * If a fatal signal is pending we still take
-			 * the streamlined VM_FAULT_RETRY failure path
-			 * and there's no need to retake the mmap_sem
-			 * in such case.
-			 */
-			down_read(&mm->mmap_sem);
-			ret = VM_FAULT_NOPAGE;
-		}
-	}
-
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
  2019-01-21  7:56 ` [PATCH RFC 01/24] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 02/24] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 15:55   ` Jerome Glisse
  2019-01-21  7:57 ` [PATCH RFC 04/24] mm: gup: " Peter Xu
                   ` (21 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

The idea comes from a discussion between Linus and Andrea [1].

Before this patch we only allow a page fault to retry once.  We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time.  This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle
the page fault on a single page.  However that should hardly happen, and
after all for each code path to return a VM_FAULT_RETRY we'll first wait
for a condition (during which time we should possibly yield the cpu) to
happen before VM_FAULT_RETRY is really returned.

This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY.  It means that the page fault
handler now can retry the page fault for multiple times if necessary
without the need to generate another page fault event. Meanwhile we
still keep the FAULT_FLAG_TRIED flag so page fault handler can still
identify whether a page fault is the first attempt or not.

GUP code is not touched yet and will be covered in follow up patch.

This will be a nice enhancement for current code at the same time a
supporting material for the future userfaultfd-writeprotect work since
in that work there will always be an explicit userfault writeprotect
retry for protected pages, and if that cannot resolve the page
fault (e.g., when userfaultfd-writeprotect is used in conjunction with
shared memory) then we'll possibly need a 3rd retry of the page fault.
It might also benefit other potential users who will have similar
requirement like userfault write-protection.

Please read the thread below for more information.

[1] https://lkml.org/lkml/2017/11/2/833

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c      | 2 +-
 arch/arc/mm/fault.c        | 1 -
 arch/arm/mm/fault.c        | 3 ---
 arch/arm64/mm/fault.c      | 5 -----
 arch/hexagon/mm/vm_fault.c | 1 -
 arch/ia64/mm/fault.c       | 1 -
 arch/m68k/mm/fault.c       | 3 ---
 arch/microblaze/mm/fault.c | 1 -
 arch/mips/mm/fault.c       | 1 -
 arch/nds32/mm/fault.c      | 1 -
 arch/nios2/mm/fault.c      | 3 ---
 arch/openrisc/mm/fault.c   | 1 -
 arch/parisc/mm/fault.c     | 2 --
 arch/powerpc/mm/fault.c    | 5 -----
 arch/riscv/mm/fault.c      | 5 -----
 arch/s390/mm/fault.c       | 5 +----
 arch/sh/mm/fault.c         | 1 -
 arch/sparc/mm/fault_32.c   | 1 -
 arch/sparc/mm/fault_64.c   | 1 -
 arch/um/kernel/trap.c      | 1 -
 arch/unicore32/mm/fault.c  | 6 +-----
 arch/x86/mm/fault.c        | 1 -
 arch/xtensa/mm/fault.c     | 1 -
 23 files changed, 3 insertions(+), 49 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 46e5e420ad2a..deae82bb83c1 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -169,7 +169,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 91492d244ea6..7f48b377028c 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -168,7 +168,6 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 			}
 
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 743077d19669..377781d8491a 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -342,9 +342,6 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 					regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 744d6451ea83..8a26e03fc2bf 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -510,12 +510,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 			return 0;
 		}
 
-		/*
-		 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
-		 * starvation.
-		 */
 		if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
-			mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			mm_flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index be10b441d9cc..576751597e77 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -115,7 +115,6 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 62c2d39d2bed..9de95d39935e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -189,7 +189,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index d9808a807ab8..b1b2109e4ab4 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -162,9 +162,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 4fd2dbd0c5ca..05a4847ac0bf 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -236,7 +236,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 92374fd091d2..9953b5b571df 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -178,7 +178,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 			tsk->min_flt++;
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 72461745d3e1..f0b775cb5cdf 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -237,7 +237,6 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 		else
 			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 5939434a31ae..9dd1c51acc22 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -158,9 +158,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 873ecb5d82d7..ff92c5674781 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -185,7 +185,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 29422eec329d..7d3e96a9a7ab 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -327,8 +327,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-
 			/*
 			 * No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 8bc0d091f13c..8bdc7e75d2e5 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -569,11 +569,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (unlikely(fault & VM_FAULT_RETRY)) {
 		/* We retry only once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			if (!signal_pending(current))
 				goto retry;
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 4fc8d746bec3..aad2c0557d2f 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -154,11 +154,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY);
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 19b4fb2fafab..819f87169ee1 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -537,10 +537,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 				fault = VM_FAULT_PFAULT;
 				goto out_up;
 			}
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY |
-				   FAULT_FLAG_RETRY_NOWAIT);
+			flags &= ~FAULT_FLAG_RETRY_NOWAIT;
 			flags |= FAULT_FLAG_TRIED;
 			down_read(&mm->mmap_sem);
 			goto retry;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index baf5d73df40c..cd710e2d7c57 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -498,7 +498,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 				      regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index a2c83104fe35..6735cd1c09b9 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -261,7 +261,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cad71ec5c7b3..28d5b4d012c6 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -459,7 +459,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 09baf37b65b9..c63fc292aea0 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -99,7 +99,6 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 
 				goto retry;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 3611f19234a1..fdf577956f5f 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -260,12 +260,8 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 			tsk->maj_flt++;
 		else
 			tsk->min_flt++;
-		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		if (fault & VM_FAULT_RETRY)
 			goto retry;
-		}
 	}
 
 	up_read(&mm->mmap_sem);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b94ef0c2b98c..645b1365a72d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1431,7 +1431,6 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (unlikely(fault & VM_FAULT_RETRY)) {
 		/* Retry at most once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			if (!signal_pending(tsk))
 				goto retry;
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 792dad5e2f12..7cd55f2d66c9 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -128,7 +128,6 @@ void do_page_fault(struct pt_regs *regs)
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 04/24] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (2 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 16:24   ` Jerome Glisse
  2019-01-21  7:57 ` [PATCH RFC 05/24] userfaultfd: wp: add helper for writeprotect check Peter Xu
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This is the gup counterpart of the change that allows the VM_FAULT_RETRY
to happen for more than once.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7b1f452cc2ef..22f1d419a849 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -528,7 +528,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
 	if (*flags & FOLL_TRIED) {
-		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		/*
+		 * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
+		 * can co-exist
+		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
 
@@ -943,17 +946,23 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
 		pages += ret;
 		start += ret << PAGE_SHIFT;
+		lock_dropped = true;
 
+retry:
 		/*
 		 * Repeat on the address that fired VM_FAULT_RETRY
-		 * without FAULT_FLAG_ALLOW_RETRY but with
+		 * with both FAULT_FLAG_ALLOW_RETRY and
 		 * FAULT_FLAG_TRIED.
 		 */
 		*locked = 1;
-		lock_dropped = true;
 		down_read(&mm->mmap_sem);
 		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
-				       pages, NULL, NULL);
+				       pages, NULL, locked);
+		if (!*locked) {
+			/* Continue to retry until we succeeded */
+			BUG_ON(ret != 0);
+			goto retry;
+		}
 		if (ret != 1) {
 			BUG_ON(ret > 1);
 			if (!pages_done)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 05/24] userfaultfd: wp: add helper for writeprotect check
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (3 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 04/24] mm: gup: " Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 10:23   ` Mike Rapoport
  2019-01-21  7:57 ` [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range Peter Xu
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

From: Shaohua Li <shli@fb.com>

add helper for writeprotect check. Will use it later.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 37c9eba75c98..38f748e7186e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -50,6 +50,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -94,6 +99,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (4 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 05/24] userfaultfd: wp: add helper for writeprotect check Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 10:20   ` Mike Rapoport
  2019-01-21 14:05   ` Jerome Glisse
  2019-01-21  7:57 ` [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
                   ` (18 subsequent siblings)
  24 siblings, 2 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

From: Shaohua Li <shli@fb.com>

Add API to enable/disable writeprotect a vma range. Unlike mprotect,
this doesn't split/merge vmas.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h |  2 ++
 mm/userfaultfd.c              | 52 +++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 38f748e7186e..e82f3156f4e9 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -37,6 +37,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len,
 			      bool *mmap_changing);
+extern int mwriteprotect_range(struct mm_struct *dst_mm,
+		unsigned long start, unsigned long len, bool enable_wp);
 
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 458acda96f20..c38903f501c7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -615,3 +615,55 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
 }
+
+int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
+	unsigned long len, bool enable_wp)
+{
+	struct vm_area_struct *dst_vma;
+	pgprot_t newprot;
+	int err;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(start + len <= start);
+
+	down_read(&dst_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the dst range is
+	 * both valid and fully within a single existing vma.
+	 */
+	err = -EINVAL;
+	dst_vma = find_vma(dst_mm, start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out_unlock;
+	if (start < dst_vma->vm_start ||
+	    start + len > dst_vma->vm_end)
+		goto out_unlock;
+
+	if (!dst_vma->vm_userfaultfd_ctx.ctx)
+		goto out_unlock;
+	if (!userfaultfd_wp(dst_vma))
+		goto out_unlock;
+
+	if (!vma_is_anonymous(dst_vma))
+		goto out_unlock;
+
+	if (enable_wp)
+		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
+	else
+		newprot = vm_get_page_prot(dst_vma->vm_flags);
+
+	change_protection(dst_vma, start, start + len, newprot,
+				!enable_wp, 0);
+
+	err = 0;
+out_unlock:
+	up_read(&dst_mm->mmap_sem);
+	return err;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (5 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 10:42   ` Mike Rapoport
  2019-01-21  7:57 ` [PATCH RFC 08/24] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

v1: From: Shaohua Li <shli@fb.com>

v2: cleanups, remove a branch.

[peterx writes up the commit message, as below...]

This patch introduces the new uffd-wp APIs for userspace.

Firstly, we'll allow to do UFFDIO_REGISTER with write protection
tracking using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this
flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in
which case the userspace program can not only resolve missing page
faults, and at the same time tracking page data changes along the way.

Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page
level write protection tracking.  Note that we will need to register
the memory region with UFFDIO_REGISTER_MODE_WP before that.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: remove useless block, write commit message]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                 | 78 +++++++++++++++++++++++++-------
 include/uapi/linux/userfaultfd.h | 11 +++++
 2 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bc9f6230a3f0..6ff8773d6797 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -305,8 +305,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	if (!pmd_present(_pmd))
 		goto out;
 
-	if (pmd_trans_huge(_pmd))
+	if (pmd_trans_huge(_pmd)) {
+		if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
+			ret = true;
 		goto out;
+	}
 
 	/*
 	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
@@ -319,6 +322,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 */
 	if (pte_none(*pte))
 		ret = true;
+	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
+		ret = true;
 	pte_unmap(pte);
 
 out:
@@ -1252,10 +1257,13 @@ static __always_inline int validate_range(struct mm_struct *mm,
 	return 0;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma)
+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+				     unsigned long vm_flags)
 {
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-		vma_is_shmem(vma);
+	/* FIXME: add WP support to hugetlbfs and shmem */
+	return vma_is_anonymous(vma) ||
+		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
+		 !(vm_flags & VM_UFFD_WP));
 }
 
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1287,15 +1295,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	vm_flags = 0;
 	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
 		vm_flags |= VM_UFFD_MISSING;
-	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
 		vm_flags |= VM_UFFD_WP;
-		/*
-		 * FIXME: remove the below error constraint by
-		 * implementing the wprotect tracking mode.
-		 */
-		ret = -EINVAL;
-		goto out;
-	}
 
 	ret = validate_range(mm, uffdio_register.range.start,
 			     uffdio_register.range.len);
@@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
-		if (!vma_can_userfault(cur))
+		if (!vma_can_userfault(cur, vm_flags))
 			goto out_unlock;
 
 		/*
@@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 			if (end & (vma_hpagesize - 1))
 				goto out_unlock;
 		}
+		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
+			goto out_unlock;
 
 		/*
 		 * Check that this vma isn't already owned by a
@@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma));
+		BUG_ON(!vma_can_userfault(vma, vm_flags));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
 		       vma->vm_userfaultfd_ctx.ctx != ctx);
 		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
@@ -1535,7 +1538,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * provides for more strict behavior to notice
 		 * unregistration errors.
 		 */
-		if (!vma_can_userfault(cur))
+		if (!vma_can_userfault(cur, cur->vm_flags))
 			goto out_unlock;
 
 		found = true;
@@ -1549,7 +1552,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma));
+		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
 		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
 
 		/*
@@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
+				    unsigned long arg)
+{
+	int ret;
+	struct uffdio_writeprotect uffdio_wp;
+	struct uffdio_writeprotect __user *user_uffdio_wp;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
+
+	if (copy_from_user(&uffdio_wp, user_uffdio_wp,
+			   sizeof(struct uffdio_writeprotect)))
+		return -EFAULT;
+
+	ret = validate_range(ctx->mm, uffdio_wp.range.start,
+			     uffdio_wp.range.len);
+	if (ret)
+		return ret;
+
+	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
+			       UFFDIO_WRITEPROTECT_MODE_WP))
+		return -EINVAL;
+	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
+	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+		return -EINVAL;
+
+	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
+				  uffdio_wp.range.len, uffdio_wp.mode &
+				  UFFDIO_WRITEPROTECT_MODE_WP);
+	if (ret)
+		return ret;
+
+	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+		range.start = uffdio_wp.range.start;
+		range.len = uffdio_wp.range.len;
+		wake_userfault(ctx, &range);
+	}
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -1837,6 +1880,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_ZEROPAGE:
 		ret = userfaultfd_zeropage(ctx, arg);
 		break;
+	case UFFDIO_WRITEPROTECT:
+		ret = userfaultfd_writeprotect(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..11517f796275 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -52,6 +52,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_WRITEPROTECT		(0x06)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -68,6 +69,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
+				      struct uffdio_writeprotect)
 
 /* read() structure */
 struct uffd_msg {
@@ -231,4 +234,12 @@ struct uffdio_zeropage {
 	__s64 zeropage;
 };
 
+struct uffdio_writeprotect {
+	struct uffdio_range range;
+	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
+#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
+#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+	__u64 mode;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 08/24] userfaultfd: wp: hook userfault handler to write protection fault
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (6 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 09/24] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

There are several cases write protection fault happens. It could be a
write to zero page, swaped page or userfault write protected
page. When the fault happens, there is no way to know if userfault
write protect the page before. Here we just blindly issue a userfault
notification for vma with VM_UFFD_WP regardless if app write protects
it yet. Application should be ready to handle such wp fault.

v1: From: Shaohua Li <shli@fb.com>

v2: Handle the userfault in the common do_wp_page. If we get there a
pagetable is present and readonly so no need to do further processing
until we solve the userfault.

In the swapin case, always swapin as readonly. This will cause false
positive userfaults. We need to decide later if to eliminate them with
a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).

hugetlbfs wouldn't need to worry about swapouts but and tmpfs would
be handled by a swap entry bit like anonymous memory.

The main problem with no easy solution to eliminate the false
positives, will be if/when userfaultfd is extended to real filesystem
pagecache. When the pagecache is freed by reclaim we can't leave the
radix tree pinned if the inode and in turn the radix tree is reclaimed
as well.

The estimation is that full accuracy and lack of false positives could
be easily provided only to anonymous memory (as long as there's no
fork or as long as MADV_DONTFORK is used on the userfaultfd anonymous
range) tmpfs and hugetlbfs, it's most certainly worth to achieve it
but in a later incremental patch.

v3: Add hooking point for THP wrprotect faults.

CC: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..89d51d1650e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2482,6 +2482,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
+	if (userfaultfd_wp(vma)) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return handle_userfault(vmf, VM_UFFD_WP);
+	}
+
 	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
 	if (!vmf->page) {
 		/*
@@ -2799,6 +2804,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
+	if (userfaultfd_wp(vma))
+		vmf->flags &= ~FAULT_FLAG_WRITE;
 	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -3662,8 +3669,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 /* `inline' is required to avoid gcc 4.1.2 build error */
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
-	if (vma_is_anonymous(vmf->vma))
+	if (vma_is_anonymous(vmf->vma)) {
+		if (userfaultfd_wp(vmf->vma))
+			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
+	}
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 09/24] userfaultfd: wp: enabled write protection in userfaultfd API
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (7 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 08/24] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

From: Shaohua Li <shli@fb.com>

Now it's safe to enable write protection in userfaultfd API

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 11517f796275..9de61cd8e228 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
  * means the userland is reading).
  */
 #define UFFD_API ((__u64)0xAA)
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |		\
+#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |	\
+			   UFFD_FEATURE_EVENT_FORK |		\
 			   UFFD_FEATURE_EVENT_REMAP |		\
 			   UFFD_FEATURE_EVENT_REMOVE |	\
 			   UFFD_FEATURE_EVENT_UNMAP |		\
@@ -34,7 +35,8 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (8 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 09/24] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 15:09   ` Jerome Glisse
  2019-01-21  7:57 ` [PATCH RFC 11/24] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

Accurate userfaultfd WP tracking is possible by tracking exactly which
virtual memory ranges were writeprotected by userland. We can't relay
only on the RW bit of the mapped pagetable because that information is
destroyed by fork() or KSM or swap. If we were to relay on that, we'd
need to stay on the safe side and generate false positive wp faults
for every swapped out page.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/pgtable.h       | 52 ++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_64.h    |  8 ++++-
 arch/x86/include/asm/pgtable_types.h |  9 +++++
 include/asm-generic/pgtable.h        |  1 +
 include/asm-generic/pgtable_uffd.h   | 51 +++++++++++++++++++++++++++
 init/Kconfig                         |  5 +++
 7 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 include/asm-generic/pgtable_uffd.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e794a43c..096c773452d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -207,6 +207,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select HAVE_ARCH_USERFAULTFD_WP		if USERFAULTFD
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..7a71158982f4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -23,6 +23,7 @@
 
 #ifndef __ASSEMBLY__
 #include <asm/x86_init.h>
+#include <asm-generic/pgtable_uffd.h>
 
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
@@ -293,6 +294,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pte_uffd_wp(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_UFFD_WP;
+}
+
+static inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_UFFD_WP);
+}
+
+static inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 static inline pte_t pte_mkclean(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_DIRTY);
@@ -372,6 +390,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pmd_uffd_wp(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_UFFD_WP;
+}
+
+static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_UFFD_WP);
+}
+
+static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 static inline pmd_t pmd_mkold(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
@@ -1351,6 +1386,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 #endif
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pte_swp_uffd_wp(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
 #define PKRU_BITS_PER_PKEY 2
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 9c85b54bf03c..e0c5d29b8685 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
  *
  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
  * erratum where they can be incorrectly set by hardware on
  * non-present PTEs.
  *
+ * SD Bits 1-4 are not used in non-present format and available for
+ * special use described below:
+ *
  * SD (1) in swp entry is used to store soft dirty bit, which helps us
  * remember soft dirty over page migration
  *
+ * F (2) in swp entry is used to record when a pagetable is
+ * writeprotected by userfaultfd WP support.
+ *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
  *
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 106b7d0e2dae..163043ab142d 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -32,6 +32,7 @@
 
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
+#define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
@@ -100,6 +101,14 @@
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+#define _PAGE_UFFD_WP		(_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
+#define _PAGE_SWP_UFFD_WP	_PAGE_USER
+#else
+#define _PAGE_UFFD_WP		(_AT(pteval_t, 0))
+#define _PAGE_SWP_UFFD_WP	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 359fb935ded6..0e1470ecf7b5 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -10,6 +10,7 @@
 #include <linux/mm_types.h>
 #include <linux/bug.h>
 #include <linux/errno.h>
+#include <asm-generic/pgtable_uffd.h>
 
 #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
 	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
new file mode 100644
index 000000000000..643d1bf559c2
--- /dev/null
+++ b/include/asm-generic/pgtable_uffd.h
@@ -0,0 +1,51 @@
+#ifndef _ASM_GENERIC_PGTABLE_UFFD_H
+#define _ASM_GENERIC_PGTABLE_UFFD_H
+
+#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static __always_inline int pte_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
+static __always_inline int pmd_uffd_wp(pmd_t pmd)
+{
+	return 0;
+}
+
+static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline int pte_swp_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
+static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
+#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index cf5b5a0dcbc2..2a02e004874e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1418,6 +1418,11 @@ config ADVISE_SYSCALLS
 	  applications use these syscalls, you can disable this option to save
 	  space.
 
+config HAVE_ARCH_USERFAULTFD_WP
+	bool
+	help
+	  Arch has userfaultfd write protection support
+
 config MEMBARRIER
 	bool "Enable membarrier() system call" if EXPERT
 	default y
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 11/24] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (9 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 12/24] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

Implement helpers methods to invoke userfaultfd wp faults more
selectively: not only when a wp fault triggers on a vma with
vma->vm_flags VM_UFFD_WP set, but only if the _PAGE_UFFD_WP bit is set
in the pagetable too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e82f3156f4e9..0d3b32b54e2a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -14,6 +14,8 @@
 #include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
 
 #include <linux/fcntl.h>
+#include <linux/mm.h>
+#include <asm-generic/pgtable_uffd.h>
 
 /*
  * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
@@ -57,6 +59,18 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_WP;
 }
 
+static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
+				      pte_t pte)
+{
+	return userfaultfd_wp(vma) && pte_uffd_wp(pte);
+}
+
+static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
+					   pmd_t pmd)
+{
+	return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -106,6 +120,19 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
+				      pte_t pte)
+{
+	return false;
+}
+
+static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
+					   pmd_t pmd)
+{
+	return false;
+}
+
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 12/24] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (10 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 11/24] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 13/24] mm: merge parameters for change_protection() Peter Xu
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

This allows UFFDIO_COPY to map pages wrprotected.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                 |  5 +++--
 include/linux/userfaultfd_k.h    |  2 +-
 include/uapi/linux/userfaultfd.h | 11 +++++-----
 mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
 4 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6ff8773d6797..455b87c0596f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1686,11 +1686,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
-	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
 		goto out;
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
-				   uffdio_copy.len, &ctx->mmap_changing);
+				   uffdio_copy.len, &ctx->mmap_changing,
+				   uffdio_copy.mode);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0d3b32b54e2a..7d870e9a5761 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -34,7 +34,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
-			    bool *mmap_changing);
+			    bool *mmap_changing, __u64 mode);
 extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9de61cd8e228..a50f1ed24d23 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -208,13 +208,14 @@ struct uffdio_copy {
 	__u64 dst;
 	__u64 src;
 	__u64 len;
+#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
 	/*
-	 * There will be a wrprotection flag later that allows to map
-	 * pages wrprotected on the fly. And such a flag will be
-	 * available if the wrprotection ioctl are implemented for the
-	 * range according to the uffdio_register.ioctls.
+	 * UFFDIO_COPY_MODE_WP will map the page wrprotected on the
+	 * fly. UFFDIO_COPY_MODE_WP is available only if the
+	 * wrprotection ioctl are implemented for the range according
+	 * to the uffdio_register.ioctls.
 	 */
-#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
 	__u64 mode;
 
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c38903f501c7..005291b9b62f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct vm_area_struct *dst_vma,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
-			    struct page **pagep)
+			    struct page **pagep,
+			    bool wp_copy)
 {
 	struct mem_cgroup *memcg;
 	pte_t _dst_pte, *dst_pte;
@@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
 		goto out_release;
 
-	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	if (dst_vma->vm_flags & VM_WRITE)
-		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
+	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
+		_dst_pte = pte_mkwrite(_dst_pte);
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 	if (dst_vma->vm_file) {
@@ -399,7 +400,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						struct page **page,
-						bool zeropage)
+						bool zeropage,
+						bool wp_copy)
 {
 	ssize_t err;
 
@@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 	if (!(dst_vma->vm_flags & VM_SHARED)) {
 		if (!zeropage)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					       dst_addr, src_addr, page);
+					       dst_addr, src_addr, page,
+					       wp_copy);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr);
 	} else {
+		VM_WARN_ON(wp_copy); /* WP only available for anon */
 		if (!zeropage)
 			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
 						     dst_vma, dst_addr,
@@ -438,7 +442,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long src_start,
 					      unsigned long len,
 					      bool zeropage,
-					      bool *mmap_changing)
+					      bool *mmap_changing,
+					      __u64 mode)
 {
 	struct vm_area_struct *dst_vma;
 	ssize_t err;
@@ -446,6 +451,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct page *page;
+	bool wp_copy;
 
 	/*
 	 * Sanitize the command parameters:
@@ -502,6 +508,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	    dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;
 
+	/*
+	 * validate 'mode' now that we know the dst_vma: don't allow
+	 * a wrprotect copy if the userfaultfd didn't register as WP.
+	 */
+	wp_copy = mode & UFFDIO_COPY_MODE_WP;
+	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
+		goto out_unlock;
+
 	/*
 	 * If this is a HUGETLB vma, pass off to appropriate routine
 	 */
@@ -557,7 +571,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, zeropage);
+				       src_addr, &page, zeropage, wp_copy);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
@@ -604,16 +618,16 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 		     unsigned long src_start, unsigned long len,
-		     bool *mmap_changing)
+		     bool *mmap_changing, __u64 mode)
 {
 	return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
-			      mmap_changing);
+			      mmap_changing, mode);
 }
 
 ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 		       unsigned long len, bool *mmap_changing)
 {
-	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
+	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
 }
 
 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 13/24] mm: merge parameters for change_protection()
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (11 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 12/24] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 13:54   ` Jerome Glisse
  2019-01-21  7:57 ` [PATCH RFC 14/24] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

change_protection() was used by either the NUMA or mprotect() code,
there's one parameter for each of the callers (dirty_accountable and
prot_numa).  Further, these parameters are passed along the calls:

  - change_protection_range()
  - change_p4d_range()
  - change_pud_range()
  - change_pmd_range()
  - ...

Now we introduce a flag for change_protect() and all these helpers to
replace these parameters.  Then we can avoid passing multiple parameters
multiple times along the way.

More importantly, it'll greatly simplify the work if we want to
introduce any new parameters to change_protection().  In the follow up
patches, a new parameter for userfaultfd write protection will be
introduced.

No functional change at all.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h |  2 +-
 include/linux/mm.h      | 14 +++++++++++++-
 mm/huge_memory.c        |  3 ++-
 mm/mempolicy.c          |  2 +-
 mm/mprotect.c           | 30 ++++++++++++++++--------------
 mm/userfaultfd.c        |  2 +-
 6 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4663ee96cf59..a8845eed6958 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -46,7 +46,7 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
-			int prot_numa);
+			unsigned long cp_flags);
 vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..452fcc31fa29 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1588,9 +1588,21 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
 		bool need_rmap_locks);
+
+/*
+ * Flags used by change_protection().  For now we make it a bitmap so
+ * that we can pass in multiple flags just like parameters.  However
+ * for now all the callers are only use one of the flags at the same
+ * time.
+ */
+/* Whether we should allow dirty bit accounting */
+#define  MM_CP_DIRTY_ACCT                  (1UL << 0)
+/* Whether this protection change is for NUMA hints */
+#define  MM_CP_PROT_NUMA                   (1UL << 1)
+
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
-			      int dirty_accountable, int prot_numa);
+			      unsigned long cp_flags);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e84a10b0d310..be8160bb7cac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1856,13 +1856,14 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot, int prot_numa)
+		unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
 	pmd_t entry;
 	bool preserve_write;
 	int ret;
+	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d4496d9d34f5..233194f3d69a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -554,7 +554,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
-	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
+	nr_updated = change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6d331620b9e5..416ede326c03 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,13 +37,15 @@
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
+	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
+	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 
 	/*
 	 * Can be called with only the mmap_sem for reading by
@@ -164,7 +166,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	pmd_t *pmd;
 	struct mm_struct *mm = vma->vm_mm;
@@ -193,7 +195,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-						newprot, prot_numa);
+							      newprot, cp_flags);
 
 				if (nr_ptes) {
 					if (nr_ptes == HPAGE_PMD_NR) {
@@ -208,7 +210,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			/* fall through, the trans huge pmd just split */
 		}
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					      cp_flags);
 		pages += this_pages;
 next:
 		cond_resched();
@@ -224,7 +226,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		p4d_t *p4d, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -236,7 +238,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -244,7 +246,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -256,7 +258,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (p4d++, addr = next, addr != end);
 
 	return pages;
@@ -264,7 +266,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -281,7 +283,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -294,14 +296,15 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 
 unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
-		       int dirty_accountable, int prot_numa)
+		       unsigned long cp_flags)
 {
 	unsigned long pages;
 
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot,
+						cp_flags);
 
 	return pages;
 }
@@ -428,8 +431,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
 	vma_set_page_prot(vma);
 
-	change_protection(vma, start, end, vma->vm_page_prot,
-			  dirty_accountable, 0);
+	change_protection(vma, start, end, vma->vm_page_prot, MM_CP_DIRTY_ACCT);
 
 	/*
 	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 005291b9b62f..23d4bbd117ee 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -674,7 +674,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 		newprot = vm_get_page_prot(dst_vma->vm_flags);
 
 	change_protection(dst_vma, start, start + len, newprot,
-				!enable_wp, 0);
+			  enable_wp ? 0 : MM_CP_DIRTY_ACCT);
 
 	err = 0;
 out_unlock:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 14/24] userfaultfd: wp: apply _PAGE_UFFD_WP bit
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (12 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 13/24] mm: merge parameters for change_protection() Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 15/24] mm: export wp_page_copy() Peter Xu
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
change_protection() when used with uffd-wp and make sure the two new
flags are exclusively used.  Then,

  - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

  - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

And use this new interface in mwriteprotect_range() to replace the old
MM_CP_DIRTY_ACCT.

Do this change for both PTEs and huge PMDs.  Then we can start to
identify which PTE/PMD is write protected by general (e.g., COW or soft
dirty tracking), and which is for userfaultfd-wp.

Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
can be even more strict when detecting uffd-wp page faults in either
do_wp_page() or wp_huge_pmd().

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |  2 +-
 include/linux/mm.h                   |  5 +++++
 mm/huge_memory.c                     | 14 +++++++++++++-
 mm/memory.c                          |  4 ++--
 mm/mprotect.c                        | 12 ++++++++++++
 mm/userfaultfd.c                     | 10 +++++++---
 6 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 163043ab142d..d6972b4c6abc 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -133,7 +133,7 @@
  */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_UFFD_WP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 452fcc31fa29..89345b51d8bd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1599,6 +1599,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_DIRTY_ACCT                  (1UL << 0)
 /* Whether this protection change is for NUMA hints */
 #define  MM_CP_PROT_NUMA                   (1UL << 1)
+/* Whether this change is for write protecting */
+#define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
+#define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
+#define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
+					    MM_CP_UFFD_WP_RESOLVE)
 
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index be8160bb7cac..169795c8e56c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1864,6 +1864,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	bool preserve_write;
 	int ret;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
@@ -1930,6 +1932,13 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
 		entry = pmd_mk_savedwrite(entry);
+	if (uffd_wp) {
+		entry = pmd_wrprotect(entry);
+		entry = pmd_mkuffd_wp(entry);
+	} else if (uffd_wp_resolve) {
+		entry = pmd_mkwrite(entry);
+		entry = pmd_clear_uffd_wp(entry);
+	}
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
@@ -2079,7 +2088,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
-	bool young, write, soft_dirty, pmd_migration = false;
+	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
 	unsigned long addr;
 	int i;
 
@@ -2161,6 +2170,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = pmd_write(old_pmd);
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
+		uffd_wp = pmd_uffd_wp(old_pmd);
 	}
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
@@ -2194,6 +2204,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_mkold(entry);
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
+			if (uffd_wp)
+				entry = pte_mkuffd_wp(entry);
 		}
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
diff --git a/mm/memory.c b/mm/memory.c
index 89d51d1650e4..7f276158683b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2482,7 +2482,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_wp(vma)) {
+	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3670,7 +3670,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_wp(vmf->vma))
+		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 416ede326c03..000e246c163b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -46,6 +46,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	int target_node = NUMA_NO_NODE;
 	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	/*
 	 * Can be called with only the mmap_sem for reading by
@@ -117,6 +119,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (preserve_write)
 				ptent = pte_mk_savedwrite(ptent);
 
+			if (uffd_wp) {
+				ptent = pte_wrprotect(ptent);
+				ptent = pte_mkuffd_wp(ptent);
+			} else if (uffd_wp_resolve) {
+				ptent = pte_mkwrite(ptent);
+				ptent = pte_clear_uffd_wp(ptent);
+			}
+
 			/* Avoid taking write faults for known dirty pages */
 			if (dirty_accountable && pte_dirty(ptent) &&
 					(pte_soft_dirty(ptent) ||
@@ -300,6 +310,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 {
 	unsigned long pages;
 
+	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
+
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 23d4bbd117ee..902247ca1474 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -73,8 +73,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
-	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
-		_dst_pte = pte_mkwrite(_dst_pte);
+	if (dst_vma->vm_flags & VM_WRITE) {
+		if (wp_copy)
+			_dst_pte = pte_mkuffd_wp(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	}
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 	if (dst_vma->vm_file) {
@@ -674,7 +678,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 		newprot = vm_get_page_prot(dst_vma->vm_flags);
 
 	change_protection(dst_vma, start, start + len, newprot,
-			  enable_wp ? 0 : MM_CP_DIRTY_ACCT);
+			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
 
 	err = 0;
 out_unlock:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 15/24] mm: export wp_page_copy()
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (13 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 14/24] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 16/24] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Export this function for usages outside page fault handlers.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 2 ++
 mm/memory.c        | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 89345b51d8bd..bf04e187fafe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -378,6 +378,8 @@ struct vm_fault {
 					 */
 };
 
+vm_fault_t wp_page_copy(struct vm_fault *vmf);
+
 /* page entry size for vm->huge_fault() */
 enum page_entry_size {
 	PE_SIZE_PTE = 0,
diff --git a/mm/memory.c b/mm/memory.c
index 7f276158683b..ef823c07f635 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2239,7 +2239,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
  *   held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old page.
  */
-static vm_fault_t wp_page_copy(struct vm_fault *vmf)
+vm_fault_t wp_page_copy(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 16/24] userfaultfd: wp: handle COW properly for uffd-wp
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (14 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 15/24] mm: export wp_page_copy() Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 17/24] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This allows uffd-wp to support write-protected pages for COW.

For example, the uffd write-protected PTE could also be write-protected
by other usages like COW or zero pages.  When that happens, we can't
simply set the write bit in the PTE since otherwise it'll change the
content of every single reference to the page.  Instead, we should do
the COW first if necessary, then handle the uffd-wp fault.

To correctly copy the page, we'll also need to carry over the
_PAGE_UFFD_WP bit if it was set in the original PTE.

For huge PMDs, we just simply split the huge PMDs where we want to
resolve an uffd-wp page fault always.  That matches what we do with
general huge PMD write protections.  In that way, we resolved the huge
PMD copy-on-write issue into PTE copy-on-write.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c   |  2 ++
 mm/mprotect.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ef823c07f635..a3de13b728f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2290,6 +2290,8 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
+		if (pte_uffd_wp(vmf->orig_pte))
+			entry = pte_mkuffd_wp(entry);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 000e246c163b..c37c9aa7a54e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,14 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		if (pte_present(oldpte)) {
 			pte_t ptent;
 			bool preserve_write = prot_numa && pte_write(oldpte);
+			struct page *page;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct page *page;
-
 				page = vm_normal_page(vma, addr, oldpte);
 				if (!page || PageKsm(page))
 					continue;
@@ -114,6 +113,46 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					continue;
 			}
 
+			/*
+			 * Detect whether we'll need to COW before
+			 * resolving an uffd-wp fault.  Note that this
+			 * includes detection of the zero page (where
+			 * page==NULL)
+			 */
+			if (uffd_wp_resolve) {
+				/* If the fault is resolved already, skip */
+				if (!pte_uffd_wp(*pte))
+					continue;
+				page = vm_normal_page(vma, addr, oldpte);
+				if (!page || page_mapcount(page) > 1) {
+					struct vm_fault vmf = {
+						.vma = vma,
+						.address = addr & PAGE_MASK,
+						.page = page,
+						.orig_pte = oldpte,
+						.pmd = pmd,
+						/* pte and ptl not needed */
+					};
+					vm_fault_t ret;
+
+					if (page)
+						get_page(page);
+					arch_leave_lazy_mmu_mode();
+					pte_unmap_unlock(pte, ptl);
+					ret = wp_page_copy(&vmf);
+					/* PTE is changed, or OOM */
+					if (ret == 0)
+						/* It's done by others */
+						continue;
+					else if (WARN_ON(ret != VM_FAULT_WRITE))
+						return pages;
+					pte = pte_offset_map_lock(vma->vm_mm,
+								  pmd, addr,
+								  &ptl);
+					arch_enter_lazy_mmu_mode();
+				}
+			}
+
 			ptent = ptep_modify_prot_start(mm, addr, pte);
 			ptent = pte_modify(ptent, newprot);
 			if (preserve_write)
@@ -184,6 +223,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
 	unsigned long mni_start = 0;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -201,7 +241,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE) {
+			/*
+			 * When resolving an userfaultfd write
+			 * protection fault, it's not easy to identify
+			 * whether a THP is shared with others and
+			 * whether we'll need to do copy-on-write, so
+			 * just split it always for now to simply the
+			 * procedure.  And that's the policy too for
+			 * general THP write-protect in af9e4d5f2de2.
+			 */
+			if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 17/24] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (15 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 16/24] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 18/24] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

UFFD_EVENT_FORK support for uffd-wp should be already there, except
that we should clean the uffd-wp bit if uffd fork event is not
enabled.  Detect that to avoid _PAGE_UFFD_WP being set even if the VMA
is not being tracked by VM_UFFD_WP.  Do this for both small PTEs and
huge PMDs.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 8 ++++++++
 mm/memory.c      | 8 ++++++++
 2 files changed, 16 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 169795c8e56c..2a3ec62e83b6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -928,6 +928,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	ret = -EAGAIN;
 	pmd = *src_pmd;
 
+	/*
+	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+	 * does not have the VM_UFFD_WP, which means that the uffd
+	 * fork event is not enabled.
+	 */
+	if (!(vma->vm_flags & VM_UFFD_WP))
+		pmd = pmd_clear_uffd_wp(pmd);
+
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
diff --git a/mm/memory.c b/mm/memory.c
index a3de13b728f4..f5497752d2a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -788,6 +788,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
+	/*
+	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+	 * does not have the VM_UFFD_WP, which means that the uffd
+	 * fork event is not enabled.
+	 */
+	if (!(vm_flags & VM_UFFD_WP))
+		pte = pte_clear_uffd_wp(pte);
+
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 18/24] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (16 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 17/24] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 19/24] userfaultfd: wp: support swap and page migration Peter Xu
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Adding these missing helpers for uffd-wp operations with pmd
swap/migration entries.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h     | 15 +++++++++++++++
 include/asm-generic/pgtable_uffd.h | 15 +++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7a71158982f4..aa2eb36d7edf 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1401,6 +1401,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
 }
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #define PKRU_AD_BIT 0x1
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 643d1bf559c2..828966d4c281 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -46,6 +46,21 @@ static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
 	return pte;
 }
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 19/24] userfaultfd: wp: support swap and page migration
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (17 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 18/24] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 20/24] userfaultfd: wp: don't wake up when doing write protect Peter Xu
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

For either swap and page migration, we all use the bit 2 of the entry to
identify whether this entry is uffd write-protected.  It plays a similar
role as the existing soft dirty bit in swap entries but only for keeping
the uffd-wp tracking for a specific PTE/PMD.

Something special here is that when we want to recover the uffd-wp bit
from a swap/migration entry to the PTE bit we'll also need to take care
of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
_PAGE_UFFD_WP bit we can't trap it at all.

Note that this patch removed two lines from "userfaultfd: wp: hook
userfault handler to write protection fault" where we try to remove the
VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
patch will still keep the write flag there.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 2 ++
 mm/huge_memory.c        | 3 +++
 mm/memory.c             | 8 ++++++--
 mm/migrate.c            | 7 +++++++
 mm/mprotect.c           | 2 ++
 mm/rmap.c               | 6 ++++++
 6 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4d961668e5fc..0c2923b1cdb7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -68,6 +68,8 @@ static inline swp_entry_t pte_to_swp_entry(pte_t pte)
 
 	if (pte_swp_soft_dirty(pte))
 		pte = pte_swp_clear_soft_dirty(pte);
+	if (pte_swp_uffd_wp(pte))
+		pte = pte_swp_clear_uffd_wp(pte);
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a3ec62e83b6..682f1427da1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2171,6 +2171,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_write_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
+		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
 		page = pmd_page(old_pmd);
 		if (pmd_dirty(old_pmd))
@@ -2203,6 +2204,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
+			if (uffd_wp)
+				entry = pte_swp_mkuffd_wp(entry);
 		} else {
 			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
 			entry = maybe_mkwrite(entry, vma);
diff --git a/mm/memory.c b/mm/memory.c
index f5497752d2a3..ac7d659e40fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(*src_pte))
 					pte = pte_swp_mksoft_dirty(pte);
+				if (pte_swp_uffd_wp(*src_pte))
+					pte = pte_swp_mkuffd_wp(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
 		} else if (is_device_private_entry(entry)) {
@@ -2814,8 +2816,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if (userfaultfd_wp(vma))
-		vmf->flags &= ~FAULT_FLAG_WRITE;
 	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -2825,6 +2825,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
+	if (pte_swp_uffd_wp(vmf->orig_pte)) {
+		pte = pte_mkuffd_wp(pte);
+		pte = pte_wrprotect(pte);
+	}
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 	vmf->orig_pte = pte;
diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..963d3dd65cf0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -242,6 +242,11 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 
+		if (pte_swp_uffd_wp(*pvmw.pte)) {
+			pte = pte_mkuffd_wp(pte);
+			pte = pte_wrprotect(pte);
+		}
+
 		if (unlikely(is_zone_device_page(new))) {
 			if (is_device_private_page(new)) {
 				entry = make_device_private_entry(new, pte_write(pte));
@@ -2265,6 +2270,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pte))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pte))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, addr, ptep, swp_pte);
 
 			/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c37c9aa7a54e..2ce62d806108 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -187,6 +187,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(oldpte))
 					newpte = pte_swp_mksoft_dirty(newpte);
+				if (pte_swp_uffd_wp(oldpte))
+					newpte = pte_swp_mkuffd_wp(newpte);
 				set_pte_at(mm, addr, pte, newpte);
 
 				pages++;
diff --git a/mm/rmap.c b/mm/rmap.c
index 85b7f9423352..e1cf191db4f3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1463,6 +1463,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
 			/*
 			 * No need to invalidate here it will synchronize on
@@ -1555,6 +1557,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/*
 			 * No need to invalidate here it will synchronize on
@@ -1621,6 +1625,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/* Invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 20/24] userfaultfd: wp: don't wake up when doing write protect
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (18 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 19/24] userfaultfd: wp: support swap and page migration Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 11:10   ` Mike Rapoport
  2019-01-21  7:57 ` [PATCH RFC 21/24] khugepaged: skip collapse if uffd-wp detected Peter Xu
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

It does not make sense to try to wake up any waiting thread when we're
write-protecting a memory region.  Only wake up when resolving a write
protected page fault.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 455b87c0596f..e54ab6076e13 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect uffdio_wp;
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
+	bool mode_wp, mode_dontwake;
 
 	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
 
@@ -1786,17 +1787,19 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP))
 		return -EINVAL;
-	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
-	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+
+	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
+	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
+
+	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
 	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
-				  uffdio_wp.range.len, uffdio_wp.mode &
-				  UFFDIO_WRITEPROTECT_MODE_WP);
+				  uffdio_wp.range.len, mode_wp);
 	if (ret)
 		return ret;
 
-	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+	if (!mode_wp && !mode_dontwake) {
 		range.start = uffdio_wp.range.start;
 		range.len = uffdio_wp.range.len;
 		wake_userfault(ctx, &range);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 21/24] khugepaged: skip collapse if uffd-wp detected
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (19 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 20/24] userfaultfd: wp: don't wake up when doing write protect Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 22/24] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Don't collapse the huge PMD if there is any userfault write protected
small PTEs.  The problem is that the write protection is in small page
granularity and there's no way to keep all these write protection
information if the small pages are going to be merged into a huge PMD.

The same thing needs to be considered for swap entries and migration
entries.  So do the check as well disregarding khugepaged_max_ptes_swap.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index dd4db334bd63..2d7bad9cb976 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -13,6 +13,7 @@
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
+	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
 	EM( SCAN_PAGE_RO,		"no_writable_page")		\
 	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
 	EM( SCAN_PAGE_NULL,		"page_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8e2ff195ecb3..92f06e1c941e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -29,6 +29,7 @@ enum scan_result {
 	SCAN_PMD_NULL,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_PTE_NON_PRESENT,
+	SCAN_PTE_UFFD_WP,
 	SCAN_PAGE_RO,
 	SCAN_LACK_REFERENCED_PAGE,
 	SCAN_PAGE_NULL,
@@ -1125,6 +1126,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
 			if (++unmapped <= khugepaged_max_ptes_swap) {
+				/*
+				 * Always be strict with uffd-wp
+				 * enabled swap entries.  Please see
+				 * comment below for pte_uffd_wp().
+				 */
+				if (pte_swp_uffd_wp(pteval)) {
+					result = SCAN_PTE_UFFD_WP;
+					goto out_unmap;
+				}
 				continue;
 			} else {
 				result = SCAN_EXCEED_SWAP_PTE;
@@ -1144,6 +1154,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			result = SCAN_PTE_NON_PRESENT;
 			goto out_unmap;
 		}
+		if (pte_uffd_wp(pteval)) {
+			/*
+			 * Don't collapse the page if any of the small
+			 * PTEs are armed with uffd write protection.
+			 * Here we can also mark the new huge pmd as
+			 * write protected if any of the small ones is
+			 * marked but that could bring uknown
+			 * userfault messages that falls outside of
+			 * the registered range.  So, just be simple.
+			 */
+			result = SCAN_PTE_UFFD_WP;
+			goto out_unmap;
+		}
 		if (pte_write(pteval))
 			writable = true;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 22/24] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (20 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 21/24] khugepaged: skip collapse if uffd-wp detected Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 23/24] userfaultfd: selftests: refactor statistics Peter Xu
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Martin Cracauer <cracauer@cons.org>

Adds documentation about the write protection support.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: rewrite in rst format; fixups here and there]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/admin-guide/mm/userfaultfd.rst | 51 ++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 5048cf661a8a..c30176e67900 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
 half copied page since it'll keep userfaulting until the copy has
 finished.
 
+Notes:
+
+- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
+  you must provide some kind of page in your thread after reading from
+  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
+  The normal behavior of the OS automatically providing a zero page on
+  an annonymous mmaping is not in place.
+
+- None of the page-delivering ioctls default to the range that you
+  registered with.  You must fill in all fields for the appropriate
+  ioctl struct including the range.
+
+- You get the address of the access that triggered the missing page
+  event out of a struct uffd_msg that you read in the thread from the
+  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
+  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
+  the first of any of those IOCTLs wakes up the faulting thread.
+
+- Be sure to test for all errors including (pollfd[0].revents &
+  POLLERR).  This can happen, e.g. when ranges supplied were
+  incorrect.
+
+Write Protect Notifications
+---------------------------
+
+This is equivalent to (but faster than) using mprotect and a SIGSEGV
+signal handler.
+
+Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
+Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
+struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
+in the struct passed in.  The range does not default to and does not
+have to be identical to the range you registered with.  You can write
+protect as many ranges as you like (inside the registered range).
+Then, in the thread reading from uffd the struct will have
+msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
+ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again
+while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set.
+This wakes up the thread which will continue to run with writes. This
+allows you to do the bookkeeping about the write in the uffd reading
+thread before the ioctl.
+
+If you registered with both UFFDIO_REGISTER_MODE_MISSING and
+UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
+which you supply a page and undo write protect.  Note that there is a
+difference between writes into a WP area and into a !WP area.  The
+former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
+UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
+you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
+used.
+
 QEMU/KVM
 ========
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 23/24] userfaultfd: selftests: refactor statistics
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (21 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 22/24] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21  7:57 ` [PATCH RFC 24/24] userfaultfd: selftests: add write-protect test Peter Xu
  2019-01-21 14:33 ` [PATCH RFC 00/24] userfaultfd: write protection support David Hildenbrand
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Introduce uffd_stats structure for statistics of the self test, at the
same time refactor the code to always pass in the uffd_stats for either
read() or poll() typed fault handling threads instead of using two
different ways to return the statistic results.  No functional change.

With the new structure, it's very easy to introduce new statistics.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 76 +++++++++++++++---------
 1 file changed, 49 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 5d1db824f73a..e5d12c209e09 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,6 +88,12 @@ static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
 static char *zeropage;
 pthread_attr_t attr;
 
+/* Userfaultfd test statistics */
+struct uffd_stats {
+	int cpu;
+	unsigned long missing_faults;
+};
+
 /* pthread_mutex_t starts at page offset 0 */
 #define area_mutex(___area, ___nr)					\
 	((pthread_mutex_t *) ((___area) + (___nr)*page_size))
@@ -127,6 +133,17 @@ static void usage(void)
 	exit(1);
 }
 
+static void uffd_stats_reset(struct uffd_stats *uffd_stats,
+			     unsigned long n_cpus)
+{
+	int i;
+
+	for (i = 0; i < n_cpus; i++) {
+		uffd_stats[i].cpu = i;
+		uffd_stats[i].missing_faults = 0;
+	}
+}
+
 static int anon_release_pages(char *rel_area)
 {
 	int ret = 0;
@@ -469,8 +486,8 @@ static int uffd_read_msg(int ufd, struct uffd_msg *msg)
 	return 0;
 }
 
-/* Return 1 if page fault handled by us; otherwise 0 */
-static int uffd_handle_page_fault(struct uffd_msg *msg)
+static void uffd_handle_page_fault(struct uffd_msg *msg,
+				   struct uffd_stats *stats)
 {
 	unsigned long offset;
 
@@ -485,18 +502,19 @@ static int uffd_handle_page_fault(struct uffd_msg *msg)
 	offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
 	offset &= ~(page_size-1);
 
-	return copy_page(uffd, offset);
+	if (copy_page(uffd, offset))
+		stats->missing_faults++;
 }
 
 static void *uffd_poll_thread(void *arg)
 {
-	unsigned long cpu = (unsigned long) arg;
+	struct uffd_stats *stats = (struct uffd_stats *)arg;
+	unsigned long cpu = stats->cpu;
 	struct pollfd pollfd[2];
 	struct uffd_msg msg;
 	struct uffdio_register uffd_reg;
 	int ret;
 	char tmp_chr;
-	unsigned long userfaults = 0;
 
 	pollfd[0].fd = uffd;
 	pollfd[0].events = POLLIN;
@@ -526,7 +544,7 @@ static void *uffd_poll_thread(void *arg)
 				msg.event), exit(1);
 			break;
 		case UFFD_EVENT_PAGEFAULT:
-			userfaults += uffd_handle_page_fault(&msg);
+			uffd_handle_page_fault(&msg, stats);
 			break;
 		case UFFD_EVENT_FORK:
 			close(uffd);
@@ -545,28 +563,27 @@ static void *uffd_poll_thread(void *arg)
 			break;
 		}
 	}
-	return (void *)userfaults;
+
+	return NULL;
 }
 
 pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER;
 
 static void *uffd_read_thread(void *arg)
 {
-	unsigned long *this_cpu_userfaults;
+	struct uffd_stats *stats = (struct uffd_stats *)arg;
 	struct uffd_msg msg;
 
-	this_cpu_userfaults = (unsigned long *) arg;
-	*this_cpu_userfaults = 0;
-
 	pthread_mutex_unlock(&uffd_read_mutex);
 	/* from here cancellation is ok */
 
 	for (;;) {
 		if (uffd_read_msg(uffd, &msg))
 			continue;
-		(*this_cpu_userfaults) += uffd_handle_page_fault(&msg);
+		uffd_handle_page_fault(&msg, stats);
 	}
-	return (void *)NULL;
+
+	return NULL;
 }
 
 static void *background_thread(void *arg)
@@ -582,13 +599,12 @@ static void *background_thread(void *arg)
 	return NULL;
 }
 
-static int stress(unsigned long *userfaults)
+static int stress(struct uffd_stats *uffd_stats)
 {
 	unsigned long cpu;
 	pthread_t locking_threads[nr_cpus];
 	pthread_t uffd_threads[nr_cpus];
 	pthread_t background_threads[nr_cpus];
-	void **_userfaults = (void **) userfaults;
 
 	finished = 0;
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
@@ -597,12 +613,13 @@ static int stress(unsigned long *userfaults)
 			return 1;
 		if (bounces & BOUNCE_POLL) {
 			if (pthread_create(&uffd_threads[cpu], &attr,
-					   uffd_poll_thread, (void *)cpu))
+					   uffd_poll_thread,
+					   (void *)&uffd_stats[cpu]))
 				return 1;
 		} else {
 			if (pthread_create(&uffd_threads[cpu], &attr,
 					   uffd_read_thread,
-					   &_userfaults[cpu]))
+					   (void *)&uffd_stats[cpu]))
 				return 1;
 			pthread_mutex_lock(&uffd_read_mutex);
 		}
@@ -639,7 +656,8 @@ static int stress(unsigned long *userfaults)
 				fprintf(stderr, "pipefd write error\n");
 				return 1;
 			}
-			if (pthread_join(uffd_threads[cpu], &_userfaults[cpu]))
+			if (pthread_join(uffd_threads[cpu],
+					 (void *)&uffd_stats[cpu]))
 				return 1;
 		} else {
 			if (pthread_cancel(uffd_threads[cpu]))
@@ -910,11 +928,11 @@ static int userfaultfd_events_test(void)
 {
 	struct uffdio_register uffdio_register;
 	unsigned long expected_ioctls;
-	unsigned long userfaults;
 	pthread_t uffd_mon;
 	int err, features;
 	pid_t pid;
 	char c;
+	struct uffd_stats stats = { 0 };
 
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
@@ -941,7 +959,7 @@ static int userfaultfd_events_test(void)
 			"unexpected missing ioctl for anon memory\n"),
 			exit(1);
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
 		perror("uffd_poll_thread create"), exit(1);
 
 	pid = fork();
@@ -957,13 +975,13 @@ static int userfaultfd_events_test(void)
 
 	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
 		perror("pipe write"), exit(1);
-	if (pthread_join(uffd_mon, (void **)&userfaults))
+	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
 	close(uffd);
-	printf("userfaults: %ld\n", userfaults);
+	printf("userfaults: %ld\n", stats.missing_faults);
 
-	return userfaults != nr_pages;
+	return stats.missing_faults != nr_pages;
 }
 
 static int userfaultfd_sig_test(void)
@@ -975,6 +993,7 @@ static int userfaultfd_sig_test(void)
 	int err, features;
 	pid_t pid;
 	char c;
+	struct uffd_stats stats = { 0 };
 
 	printf("testing signal delivery: ");
 	fflush(stdout);
@@ -1006,7 +1025,7 @@ static int userfaultfd_sig_test(void)
 	if (uffd_test_ops->release_pages(area_dst))
 		return 1;
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
 		perror("uffd_poll_thread create"), exit(1);
 
 	pid = fork();
@@ -1032,6 +1051,7 @@ static int userfaultfd_sig_test(void)
 	close(uffd);
 	return userfaults != 0;
 }
+
 static int userfaultfd_stress(void)
 {
 	void *area;
@@ -1040,7 +1060,7 @@ static int userfaultfd_stress(void)
 	struct uffdio_register uffdio_register;
 	unsigned long cpu;
 	int err;
-	unsigned long userfaults[nr_cpus];
+	struct uffd_stats uffd_stats[nr_cpus];
 
 	uffd_test_ops->allocate_area((void **)&area_src);
 	if (!area_src)
@@ -1169,8 +1189,10 @@ static int userfaultfd_stress(void)
 		if (uffd_test_ops->release_pages(area_dst))
 			return 1;
 
+		uffd_stats_reset(uffd_stats, nr_cpus);
+
 		/* bounce pass */
-		if (stress(userfaults))
+		if (stress(uffd_stats))
 			return 1;
 
 		/* unregister */
@@ -1213,7 +1235,7 @@ static int userfaultfd_stress(void)
 
 		printf("userfaults:");
 		for (cpu = 0; cpu < nr_cpus; cpu++)
-			printf(" %lu", userfaults[cpu]);
+			printf(" %lu", uffd_stats[cpu].missing_faults);
 		printf("\n");
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 24/24] userfaultfd: selftests: add write-protect test
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (22 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 23/24] userfaultfd: selftests: refactor statistics Peter Xu
@ 2019-01-21  7:57 ` Peter Xu
  2019-01-21 14:33 ` [PATCH RFC 00/24] userfaultfd: write protection support David Hildenbrand
  24 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-21  7:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	peterx, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This patch adds uffd tests for write protection.

Instead of introducing new tests for it, let's simply squashing uffd-wp
tests into existing uffd-missing test cases.  Changes are:

(1) Bouncing tests

  We do the write-protection in two ways during the bouncing test:

  - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
    we'll make sure for each bounce process every single page will be
    at least fault twice: once for MISSING, once for WP.

  - By direct call UFFDIO_WRITEPROTECT on existing faulted memories:
    To further torture the explicit page protection procedures of
    uffd-wp, we split each bounce procedure into two halves (in the
    background thread): the first half will be MISSING+WP for each
    page as explained above.  After the first half, we write protect
    the faulted region in the background thread to make sure at least
    half of the pages will be write protected again which is the first
    half to test the new UFFDIO_WRITEPROTECT call.  Then we continue
    with the 2nd half, which will contain both MISSING and WP faulting
    tests for the 2nd half and WP-only faults from the 1st half.

(2) Event/Signal test

  Mostly previous tests but will do MISSING+WP for each page.  For
  sigbus-mode test we'll need to provide standalone path to handle the
  write protection faults.

For all tests, do statistics as well for uffd-wp pages.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 154 ++++++++++++++++++-----
 1 file changed, 126 insertions(+), 28 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index e5d12c209e09..57b5ac02080a 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -56,6 +56,7 @@
 #include <linux/userfaultfd.h>
 #include <setjmp.h>
 #include <stdbool.h>
+#include <assert.h>
 
 #include "../kselftest.h"
 
@@ -78,6 +79,8 @@ static int test_type;
 #define ALARM_INTERVAL_SECS 10
 static volatile bool test_uffdio_copy_eexist = true;
 static volatile bool test_uffdio_zeropage_eexist = true;
+/* Whether to test uffd write-protection */
+static bool test_uffdio_wp = false;
 
 static bool map_shared;
 static int huge_fd;
@@ -92,6 +95,7 @@ pthread_attr_t attr;
 struct uffd_stats {
 	int cpu;
 	unsigned long missing_faults;
+	unsigned long wp_faults;
 };
 
 /* pthread_mutex_t starts at page offset 0 */
@@ -141,9 +145,29 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
 	for (i = 0; i < n_cpus; i++) {
 		uffd_stats[i].cpu = i;
 		uffd_stats[i].missing_faults = 0;
+		uffd_stats[i].wp_faults = 0;
 	}
 }
 
+static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
+{
+	int i;
+	unsigned long long miss_total = 0, wp_total = 0;
+
+	for (i = 0; i < n_cpus; i++) {
+		miss_total += stats[i].missing_faults;
+		wp_total += stats[i].wp_faults;
+	}
+
+	printf("userfaults: %llu missing (", miss_total);
+	for (i = 0; i < n_cpus; i++)
+		printf("%lu+", stats[i].missing_faults);
+	printf("\b), %llu wp (", wp_total);
+	for (i = 0; i < n_cpus; i++)
+		printf("%lu+", stats[i].wp_faults);
+	printf("\b)\n");
+}
+
 static int anon_release_pages(char *rel_area)
 {
 	int ret = 0;
@@ -264,19 +288,15 @@ struct uffd_test_ops {
 	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
 };
 
-#define ANON_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
-					 (1 << _UFFDIO_COPY) | \
-					 (1 << _UFFDIO_ZEROPAGE))
-
 static struct uffd_test_ops anon_uffd_test_ops = {
-	.expected_ioctls = ANON_EXPECTED_IOCTLS,
+	.expected_ioctls = UFFD_API_RANGE_IOCTLS,
 	.allocate_area	= anon_allocate_area,
 	.release_pages	= anon_release_pages,
 	.alias_mapping = noop_alias_mapping,
 };
 
 static struct uffd_test_ops shmem_uffd_test_ops = {
-	.expected_ioctls = ANON_EXPECTED_IOCTLS,
+	.expected_ioctls = UFFD_API_RANGE_IOCTLS,
 	.allocate_area	= shmem_allocate_area,
 	.release_pages	= shmem_release_pages,
 	.alias_mapping = noop_alias_mapping,
@@ -300,6 +320,21 @@ static int my_bcmp(char *str1, char *str2, size_t n)
 	return 0;
 }
 
+static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
+{
+	struct uffdio_writeprotect prms = { 0 };
+
+	/* Write protection page faults */
+	prms.range.start = start;
+	prms.range.len = len;
+	/* Undo write-protect, do wakeup after that */
+	prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
+
+	if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms))
+		fprintf(stderr, "clear WP failed for address 0x%Lx\n",
+			start), exit(1);
+}
+
 static void *locking_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
@@ -438,7 +473,10 @@ static int __copy_page(int ufd, unsigned long offset, bool retry)
 	uffdio_copy.dst = (unsigned long) area_dst + offset;
 	uffdio_copy.src = (unsigned long) area_src + offset;
 	uffdio_copy.len = page_size;
-	uffdio_copy.mode = 0;
+	if (test_uffdio_wp)
+		uffdio_copy.mode = UFFDIO_COPY_MODE_WP;
+	else
+		uffdio_copy.mode = 0;
 	uffdio_copy.copy = 0;
 	if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
@@ -495,15 +533,21 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 		fprintf(stderr, "unexpected msg event %u\n",
 			msg->event), exit(1);
 
-	if (bounces & BOUNCE_VERIFY &&
-	    msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
-		fprintf(stderr, "unexpected write fault\n"), exit(1);
+	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
+		wp_range(uffd, msg->arg.pagefault.address, page_size, false);
+		stats->wp_faults++;
+	} else {
+		/* Missing page faults */
+		if (bounces & BOUNCE_VERIFY &&
+		    msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+			fprintf(stderr, "unexpected write fault\n"), exit(1);
 
-	offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
-	offset &= ~(page_size-1);
+		offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+		offset &= ~(page_size-1);
 
-	if (copy_page(uffd, offset))
-		stats->missing_faults++;
+		if (copy_page(uffd, offset))
+			stats->missing_faults++;
+	}
 }
 
 static void *uffd_poll_thread(void *arg)
@@ -589,11 +633,30 @@ static void *uffd_read_thread(void *arg)
 static void *background_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
-	unsigned long page_nr;
+	unsigned long page_nr, start_nr, mid_nr, end_nr;
 
-	for (page_nr = cpu * nr_pages_per_cpu;
-	     page_nr < (cpu+1) * nr_pages_per_cpu;
-	     page_nr++)
+	start_nr = cpu * nr_pages_per_cpu;
+	end_nr = (cpu+1) * nr_pages_per_cpu;
+	mid_nr = (start_nr + end_nr) / 2;
+
+	/* Copy the first half of the pages */
+	for (page_nr = start_nr; page_nr < mid_nr; page_nr++)
+		copy_page_retry(uffd, page_nr * page_size);
+
+	/*
+	 * If we need to test uffd-wp, set it up now.  Then we'll have
+	 * at least the first half of the pages mapped already which
+	 * can be write-protected for testing
+	 */
+	if (test_uffdio_wp)
+		wp_range(uffd, (unsigned long)area_dst + start_nr * page_size,
+			nr_pages_per_cpu * page_size, true);
+
+	/*
+	 * Continue the 2nd half of the page copying, handling write
+	 * protection faults if any
+	 */
+	for (page_nr = mid_nr; page_nr < end_nr; page_nr++)
 		copy_page_retry(uffd, page_nr * page_size);
 
 	return NULL;
@@ -755,17 +818,31 @@ static int faulting_process(int signal_test)
 	}
 
 	for (nr = 0; nr < split_nr_pages; nr++) {
+		int steps = 1;
+		unsigned long offset = nr * page_size;
+
 		if (signal_test) {
 			if (sigsetjmp(*sigbuf, 1) != 0) {
-				if (nr == lastnr) {
+				if (steps == 1 && nr == lastnr) {
 					fprintf(stderr, "Signal repeated\n");
 					return 1;
 				}
 
 				lastnr = nr;
 				if (signal_test == 1) {
-					if (copy_page(uffd, nr * page_size))
-						signalled++;
+					if (steps == 1) {
+						/* This is a MISSING request */
+						steps++;
+						if (copy_page(uffd, offset))
+							signalled++;
+					} else {
+						/* This is a WP request */
+						assert(steps == 2);
+						wp_range(uffd,
+							 (__u64)area_dst +
+							 offset,
+							 page_size, false);
+					}
 				} else {
 					signalled++;
 					continue;
@@ -778,8 +855,13 @@ static int faulting_process(int signal_test)
 			fprintf(stderr,
 				"nr %lu memory corruption %Lu %Lu\n",
 				nr, count,
-				count_verify[nr]), exit(1);
-		}
+				count_verify[nr]);
+	        }
+		/*
+		 * Trigger write protection if there is by writting
+		 * the same value back.
+		 */
+		*area_count(area_dst, nr) = count;
 	}
 
 	if (signal_test)
@@ -801,6 +883,11 @@ static int faulting_process(int signal_test)
 				nr, count,
 				count_verify[nr]), exit(1);
 		}
+		/*
+		 * Trigger write protection if there is by writting
+		 * the same value back.
+		 */
+		*area_count(area_dst, nr) = count;
 	}
 
 	if (uffd_test_ops->release_pages(area_dst))
@@ -949,6 +1036,8 @@ static int userfaultfd_events_test(void)
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (test_uffdio_wp)
+		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		fprintf(stderr, "register failure\n"), exit(1);
 
@@ -979,7 +1068,8 @@ static int userfaultfd_events_test(void)
 		return 1;
 
 	close(uffd);
-	printf("userfaults: %ld\n", stats.missing_faults);
+
+	uffd_stats_report(&stats, 1);
 
 	return stats.missing_faults != nr_pages;
 }
@@ -1009,6 +1099,8 @@ static int userfaultfd_sig_test(void)
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (test_uffdio_wp)
+		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		fprintf(stderr, "register failure\n"), exit(1);
 
@@ -1141,6 +1233,8 @@ static int userfaultfd_stress(void)
 		uffdio_register.range.start = (unsigned long) area_dst;
 		uffdio_register.range.len = nr_pages * page_size;
 		uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+		if (test_uffdio_wp)
+			uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
 			fprintf(stderr, "register failure\n");
 			return 1;
@@ -1195,6 +1289,11 @@ static int userfaultfd_stress(void)
 		if (stress(uffd_stats))
 			return 1;
 
+		/* Clear all the write protections if there is any */
+		if (test_uffdio_wp)
+			wp_range(uffd, (unsigned long)area_dst,
+				 nr_pages * page_size, false);
+
 		/* unregister */
 		if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) {
 			fprintf(stderr, "unregister failure\n");
@@ -1233,10 +1332,7 @@ static int userfaultfd_stress(void)
 		area_src_alias = area_dst_alias;
 		area_dst_alias = tmp_area;
 
-		printf("userfaults:");
-		for (cpu = 0; cpu < nr_cpus; cpu++)
-			printf(" %lu", uffd_stats[cpu].missing_faults);
-		printf("\n");
+		uffd_stats_report(uffd_stats, nr_cpus);
 	}
 
 	if (err)
@@ -1276,6 +1372,8 @@ static void set_test_type(const char *type)
 	if (!strcmp(type, "anon")) {
 		test_type = TEST_ANON;
 		uffd_test_ops = &anon_uffd_test_ops;
+		/* Only enable write-protect test for anonymous test */
+		test_uffdio_wp = true;
 	} else if (!strcmp(type, "hugetlb")) {
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/24] mm: gup: rename "nonblocking" to "locked" where proper
  2019-01-21  7:56 ` [PATCH RFC 01/24] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
@ 2019-01-21 10:20   ` Mike Rapoport
  0 siblings, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2019-01-21 10:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:56:59PM +0800, Peter Xu wrote:
> There's plenty of places around __get_user_pages() that has a parameter
> "nonblocking" which does not really mean that "it won't block" (because
> it can really block) but instead it shows whether the mmap_sem is
> released by up_read() during the page fault handling mostly when
> VM_FAULT_RETRY is returned.
> 
> We have the correct naming in e.g. get_user_pages_locked() or
> get_user_pages_remote() as "locked", however there're still many places
> that are using the "nonblocking" as name.
> 
> Renaming the places to "locked" where proper to better suite the
> functionality of the variable.  While at it, fixing up some of the
> comments accordingly.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  mm/gup.c     | 44 +++++++++++++++++++++-----------------------
>  mm/hugetlb.c |  8 ++++----
>  2 files changed, 25 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8cb68a50dbdf..7b1f452cc2ef 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -506,12 +506,12 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>  }
> 
>  /*
> - * mmap_sem must be held on entry.  If @nonblocking != NULL and
> - * *@flags does not include FOLL_NOWAIT, the mmap_sem may be released.
> - * If it is, *@nonblocking will be set to 0 and -EBUSY returned.
> + * mmap_sem must be held on entry.  If @locked != NULL and *@flags
> + * does not include FOLL_NOWAIT, the mmap_sem may be released.  If it
> + * is, *@locked will be set to 0 and -EBUSY returned.
>   */
>  static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
> -		unsigned long address, unsigned int *flags, int *nonblocking)
> +		unsigned long address, unsigned int *flags, int *locked)
>  {
>  	unsigned int fault_flags = 0;
>  	vm_fault_t ret;
> @@ -523,7 +523,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  		fault_flags |= FAULT_FLAG_WRITE;
>  	if (*flags & FOLL_REMOTE)
>  		fault_flags |= FAULT_FLAG_REMOTE;
> -	if (nonblocking)
> +	if (locked)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> @@ -549,8 +549,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  	}
> 
>  	if (ret & VM_FAULT_RETRY) {
> -		if (nonblocking && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
> -			*nonblocking = 0;
> +		if (locked && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
> +			*locked = 0;
>  		return -EBUSY;
>  	}
> 
> @@ -627,7 +627,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   *		only intends to ensure the pages are faulted in.
>   * @vmas:	array of pointers to vmas corresponding to each page.
>   *		Or NULL if the caller does not require them.
> - * @nonblocking: whether waiting for disk IO or mmap_sem contention
> + * @locked:     whether we're still with the mmap_sem held
>   *
>   * Returns number of pages pinned. This may be fewer than the number
>   * requested. If nr_pages is 0 or negative, returns 0. If no pages
> @@ -656,13 +656,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   * appropriate) must be called after the page is finished with, and
>   * before put_page is called.
>   *
> - * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
> - * or mmap_sem contention, and if waiting is needed to pin all pages,
> - * *@nonblocking will be set to 0.  Further, if @gup_flags does not
> - * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
> - * this case.
> + * If @locked != NULL, *@locked will be set to 0 when mmap_sem is
> + * released by an up_read().  That can happen if @gup_flags does not
> + * has FOLL_NOWAIT.
>   *
> - * A caller using such a combination of @nonblocking and @gup_flags
> + * A caller using such a combination of @locked and @gup_flags
>   * must therefore hold the mmap_sem for reading only, and recognize
>   * when it's been released.  Otherwise, it must be held for either
>   * reading or writing and will not be released.
> @@ -674,7 +672,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned long start, unsigned long nr_pages,
>  		unsigned int gup_flags, struct page **pages,
> -		struct vm_area_struct **vmas, int *nonblocking)
> +		struct vm_area_struct **vmas, int *locked)
>  {
>  	long ret = 0, i = 0;
>  	struct vm_area_struct *vma = NULL;
> @@ -718,7 +716,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			if (is_vm_hugetlb_page(vma)) {
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
> -						gup_flags, nonblocking);
> +						gup_flags, locked);
>  				continue;
>  			}
>  		}
> @@ -736,7 +734,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		page = follow_page_mask(vma, start, foll_flags, &ctx);
>  		if (!page) {
>  			ret = faultin_page(tsk, vma, start, &foll_flags,
> -					nonblocking);
> +					   locked);
>  			switch (ret) {
>  			case 0:
>  				goto retry;
> @@ -1195,7 +1193,7 @@ EXPORT_SYMBOL(get_user_pages_longterm);
>   * @vma:   target vma
>   * @start: start address
>   * @end:   end address
> - * @nonblocking:
> + * @locked: whether the mmap_sem is still held
>   *
>   * This takes care of mlocking the pages too if VM_LOCKED is set.
>   *
> @@ -1203,14 +1201,14 @@ EXPORT_SYMBOL(get_user_pages_longterm);
>   *
>   * vma->vm_mm->mmap_sem must be held.
>   *
> - * If @nonblocking is NULL, it may be held for read or write and will
> + * If @locked is NULL, it may be held for read or write and will
>   * be unperturbed.
>   *
> - * If @nonblocking is non-NULL, it must held for read only and may be
> - * released.  If it's released, *@nonblocking will be set to 0.
> + * If @locked is non-NULL, it must held for read only and may be
> + * released.  If it's released, *@locked will be set to 0.
>   */
>  long populate_vma_page_range(struct vm_area_struct *vma,
> -		unsigned long start, unsigned long end, int *nonblocking)
> +		unsigned long start, unsigned long end, int *locked)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long nr_pages = (end - start) / PAGE_SIZE;
> @@ -1245,7 +1243,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
>  	 * not result in a stack expansion that recurses back here.
>  	 */
>  	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
> -				NULL, NULL, nonblocking);
> +				NULL, NULL, locked);
>  }
> 
>  /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 705a3e9cc910..05b879bda10a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4181,7 +4181,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			 struct page **pages, struct vm_area_struct **vmas,
>  			 unsigned long *position, unsigned long *nr_pages,
> -			 long i, unsigned int flags, int *nonblocking)
> +			 long i, unsigned int flags, int *locked)
>  {
>  	unsigned long pfn_offset;
>  	unsigned long vaddr = *position;
> @@ -4252,7 +4252,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  				spin_unlock(ptl);
>  			if (flags & FOLL_WRITE)
>  				fault_flags |= FAULT_FLAG_WRITE;
> -			if (nonblocking)
> +			if (locked)
>  				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>  			if (flags & FOLL_NOWAIT)
>  				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> @@ -4269,8 +4269,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  				break;
>  			}
>  			if (ret & VM_FAULT_RETRY) {
> -				if (nonblocking)
> -					*nonblocking = 0;
> +				if (locked)
> +					*locked = 0;
>  				*nr_pages = 0;
>  				/*
>  				 * VM_FAULT_RETRY must not return an
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-21  7:57 ` [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range Peter Xu
@ 2019-01-21 10:20   ` Mike Rapoport
  2019-01-22  8:55     ` Peter Xu
  2019-01-21 14:05   ` Jerome Glisse
  1 sibling, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2019-01-21 10:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Mon, Jan 21, 2019 at 03:57:04PM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> this doesn't split/merge vmas.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/userfaultfd_k.h |  2 ++
>  mm/userfaultfd.c              | 52 +++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 38f748e7186e..e82f3156f4e9 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -37,6 +37,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
>  			      unsigned long dst_start,
>  			      unsigned long len,
>  			      bool *mmap_changing);
> +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> +		unsigned long start, unsigned long len, bool enable_wp);
> 
>  /* mm helpers */
>  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 458acda96f20..c38903f501c7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -615,3 +615,55 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>  {
>  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
>  }
> +
> +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> +	unsigned long len, bool enable_wp)
> +{
> +	struct vm_area_struct *dst_vma;
> +	pgprot_t newprot;
> +	int err;
> +
> +	/*
> +	 * Sanitize the command parameters:
> +	 */
> +	BUG_ON(start & ~PAGE_MASK);
> +	BUG_ON(len & ~PAGE_MASK);
> +
> +	/* Does the address range wrap, or is the span zero-sized? */
> +	BUG_ON(start + len <= start);
> +
> +	down_read(&dst_mm->mmap_sem);
> +
> +	/*
> +	 * Make sure the vma is not shared, that the dst range is
> +	 * both valid and fully within a single existing vma.
> +	 */
> +	err = -EINVAL;

In non-cooperative mode, there can be a race between VM layout changes and
mcopy_atomic [1]. I believe the same races are possible here, so can we
please make err = -ENOENT for consistency with mcopy?

> +	dst_vma = find_vma(dst_mm, start);
> +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +		goto out_unlock;
> +	if (start < dst_vma->vm_start ||
> +	    start + len > dst_vma->vm_end)
> +		goto out_unlock;
> +
> +	if (!dst_vma->vm_userfaultfd_ctx.ctx)
> +		goto out_unlock;
> +	if (!userfaultfd_wp(dst_vma))
> +		goto out_unlock;
> +
> +	if (!vma_is_anonymous(dst_vma))
> +		goto out_unlock;

The sanity checks here seem to repeat those in mcopy_atomic(). I'd suggest
splitting them out to a helper function.

> +	if (enable_wp)
> +		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
> +	else
> +		newprot = vm_get_page_prot(dst_vma->vm_flags);
> +
> +	change_protection(dst_vma, start, start + len, newprot,
> +				!enable_wp, 0);
> +
> +	err = 0;
> +out_unlock:
> +	up_read(&dst_mm->mmap_sem);
> +	return err;
> +}
> -- 
> 2.17.1

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=27d02568f529e908399514dfbee8ee43bdfd5299

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 05/24] userfaultfd: wp: add helper for writeprotect check
  2019-01-21  7:57 ` [PATCH RFC 05/24] userfaultfd: wp: add helper for writeprotect check Peter Xu
@ 2019-01-21 10:23   ` Mike Rapoport
  2019-01-22  8:31     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2019-01-21 10:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Mon, Jan 21, 2019 at 03:57:03PM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> add helper for writeprotect check. Will use it later.

I'd merge this with the commit that actually uses this helper.
 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/userfaultfd_k.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 37c9eba75c98..38f748e7186e 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -50,6 +50,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_UFFD_MISSING;
>  }
> 
> +static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> +{
> +	return vma->vm_flags & VM_UFFD_WP;
> +}
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> @@ -94,6 +99,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
>  	return false;
>  }
> 
> +static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return false;
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-01-21  7:57 ` [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
@ 2019-01-21 10:42   ` Mike Rapoport
  2019-01-24  4:56     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2019-01-21 10:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:57:05PM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> v1: From: Shaohua Li <shli@fb.com>
> 
> v2: cleanups, remove a branch.
> 
> [peterx writes up the commit message, as below...]
> 
> This patch introduces the new uffd-wp APIs for userspace.
> 
> Firstly, we'll allow to do UFFDIO_REGISTER with write protection
> tracking using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this
> flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in
> which case the userspace program can not only resolve missing page
> faults, and at the same time tracking page data changes along the way.
> 
> Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page
> level write protection tracking.  Note that we will need to register
> the memory region with UFFDIO_REGISTER_MODE_WP before that.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx: remove useless block, write commit message]
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  fs/userfaultfd.c                 | 78 +++++++++++++++++++++++++-------
>  include/uapi/linux/userfaultfd.h | 11 +++++
>  2 files changed, 73 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index bc9f6230a3f0..6ff8773d6797 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -305,8 +305,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>  	if (!pmd_present(_pmd))
>  		goto out;
> 
> -	if (pmd_trans_huge(_pmd))
> +	if (pmd_trans_huge(_pmd)) {
> +		if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
> +			ret = true;
>  		goto out;
> +	}
> 
>  	/*
>  	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
> @@ -319,6 +322,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>  	 */
>  	if (pte_none(*pte))
>  		ret = true;
> +	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
> +		ret = true;
>  	pte_unmap(pte);
> 
>  out:
> @@ -1252,10 +1257,13 @@ static __always_inline int validate_range(struct mm_struct *mm,
>  	return 0;
>  }
> 
> -static inline bool vma_can_userfault(struct vm_area_struct *vma)
> +static inline bool vma_can_userfault(struct vm_area_struct *vma,
> +				     unsigned long vm_flags)
>  {
> -	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> -		vma_is_shmem(vma);
> +	/* FIXME: add WP support to hugetlbfs and shmem */
> +	return vma_is_anonymous(vma) ||
> +		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> +		 !(vm_flags & VM_UFFD_WP));
>  }
> 
>  static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> @@ -1287,15 +1295,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	vm_flags = 0;
>  	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
>  		vm_flags |= VM_UFFD_MISSING;
> -	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
> +	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
>  		vm_flags |= VM_UFFD_WP;
> -		/*
> -		 * FIXME: remove the below error constraint by
> -		 * implementing the wprotect tracking mode.
> -		 */
> -		ret = -EINVAL;
> -		goto out;
> -	}
> 
>  	ret = validate_range(mm, uffdio_register.range.start,
>  			     uffdio_register.range.len);
> @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> 
>  		/* check not compatible vmas */
>  		ret = -EINVAL;
> -		if (!vma_can_userfault(cur))
> +		if (!vma_can_userfault(cur, vm_flags))
>  			goto out_unlock;
> 
>  		/*
> @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  			if (end & (vma_hpagesize - 1))
>  				goto out_unlock;
>  		}
> +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> +			goto out_unlock;

This is problematic for the non-cooperative use-case. Way may still want to
monitor a read-only area because it may eventually become writable, e.g. if
the monitored process runs mprotect().
Particularity, for using uffd-wp as a replacement for soft-dirty would
require it.

> 
>  		/*
>  		 * Check that this vma isn't already owned by a
> @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	do {
>  		cond_resched();
> 
> -		BUG_ON(!vma_can_userfault(vma));
> +		BUG_ON(!vma_can_userfault(vma, vm_flags));
>  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
>  		       vma->vm_userfaultfd_ctx.ctx != ctx);
>  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> @@ -1535,7 +1538,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  		 * provides for more strict behavior to notice
>  		 * unregistration errors.
>  		 */
> -		if (!vma_can_userfault(cur))
> +		if (!vma_can_userfault(cur, cur->vm_flags))
>  			goto out_unlock;
> 
>  		found = true;
> @@ -1549,7 +1552,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  	do {
>  		cond_resched();
> 
> -		BUG_ON(!vma_can_userfault(vma));
> +		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
>  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> 
>  		/*
> @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>  	return ret;
>  }
> 
> +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> +				    unsigned long arg)
> +{
> +	int ret;
> +	struct uffdio_writeprotect uffdio_wp;
> +	struct uffdio_writeprotect __user *user_uffdio_wp;
> +	struct userfaultfd_wake_range range;
> +

In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
was to return -EAGAIN if such race is encountered. I think the same would
apply here.

> +	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
> +
> +	if (copy_from_user(&uffdio_wp, user_uffdio_wp,
> +			   sizeof(struct uffdio_writeprotect)))
> +		return -EFAULT;
> +
> +	ret = validate_range(ctx->mm, uffdio_wp.range.start,
> +			     uffdio_wp.range.len);
> +	if (ret)
> +		return ret;
> +
> +	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> +			       UFFDIO_WRITEPROTECT_MODE_WP))
> +		return -EINVAL;
> +	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> +	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> +		return -EINVAL;
> +
> +	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> +				  uffdio_wp.range.len, uffdio_wp.mode &
> +				  UFFDIO_WRITEPROTECT_MODE_WP);
> +	if (ret)
> +		return ret;
> +
> +	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
> +		range.start = uffdio_wp.range.start;
> +		range.len = uffdio_wp.range.len;
> +		wake_userfault(ctx, &range);
> +	}
> +	return ret;
> +}
> +
>  static inline unsigned int uffd_ctx_features(__u64 user_features)
>  {
>  	/*
> @@ -1837,6 +1880,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
>  	case UFFDIO_ZEROPAGE:
>  		ret = userfaultfd_zeropage(ctx, arg);
>  		break;
> +	case UFFDIO_WRITEPROTECT:
> +		ret = userfaultfd_writeprotect(ctx, arg);
> +		break;
>  	}
>  	return ret;
>  }
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 48f1a7c2f1f0..11517f796275 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -52,6 +52,7 @@
>  #define _UFFDIO_WAKE			(0x02)
>  #define _UFFDIO_COPY			(0x03)
>  #define _UFFDIO_ZEROPAGE		(0x04)
> +#define _UFFDIO_WRITEPROTECT		(0x06)
>  #define _UFFDIO_API			(0x3F)
> 
>  /* userfaultfd ioctl ids */
> @@ -68,6 +69,8 @@
>  				      struct uffdio_copy)
>  #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
>  				      struct uffdio_zeropage)
> +#define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
> +				      struct uffdio_writeprotect)
> 
>  /* read() structure */
>  struct uffd_msg {
> @@ -231,4 +234,12 @@ struct uffdio_zeropage {
>  	__s64 zeropage;
>  };
> 
> +struct uffdio_writeprotect {
> +	struct uffdio_range range;
> +	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
> +#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> +#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> +	__u64 mode;
> +};
> +
>  #endif /* _LINUX_USERFAULTFD_H */
> -- 
> 2.17.1
 
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df2cc96e77011cf7989208b206da9817e0321028

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 20/24] userfaultfd: wp: don't wake up when doing write protect
  2019-01-21  7:57 ` [PATCH RFC 20/24] userfaultfd: wp: don't wake up when doing write protect Peter Xu
@ 2019-01-21 11:10   ` Mike Rapoport
  2019-01-24  5:36     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2019-01-21 11:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:57:18PM +0800, Peter Xu wrote:
> It does not make sense to try to wake up any waiting thread when we're
> write-protecting a memory region.  Only wake up when resolving a write
> protected page fault.

Probably it would be better to make it default to wake up only when
requested explicitly?
Then we can simply disallow _DONTWAKE for uffd_wp and only use
UFFDIO_WRITEPROTECT_MODE_WP as possible mode.
 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  fs/userfaultfd.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 455b87c0596f..e54ab6076e13 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	struct uffdio_writeprotect uffdio_wp;
>  	struct uffdio_writeprotect __user *user_uffdio_wp;
>  	struct userfaultfd_wake_range range;
> +	bool mode_wp, mode_dontwake;
> 
>  	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
> 
> @@ -1786,17 +1787,19 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>  			       UFFDIO_WRITEPROTECT_MODE_WP))
>  		return -EINVAL;
> -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> +
> +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> +
> +	if (mode_wp && mode_dontwake)
>  		return -EINVAL;
> 
>  	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> -				  uffdio_wp.range.len, uffdio_wp.mode &
> -				  UFFDIO_WRITEPROTECT_MODE_WP);
> +				  uffdio_wp.range.len, mode_wp);
>  	if (ret)
>  		return ret;
> 
> -	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
> +	if (!mode_wp && !mode_dontwake) {
>  		range.start = uffdio_wp.range.start;
>  		range.len = uffdio_wp.range.len;
>  		wake_userfault(ctx, &range);
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 13/24] mm: merge parameters for change_protection()
  2019-01-21  7:57 ` [PATCH RFC 13/24] mm: merge parameters for change_protection() Peter Xu
@ 2019-01-21 13:54   ` Jerome Glisse
  2019-01-24  5:22     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-21 13:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:57:11PM +0800, Peter Xu wrote:
> change_protection() was used by either the NUMA or mprotect() code,
> there's one parameter for each of the callers (dirty_accountable and
> prot_numa).  Further, these parameters are passed along the calls:
> 
>   - change_protection_range()
>   - change_p4d_range()
>   - change_pud_range()
>   - change_pmd_range()
>   - ...
> 
> Now we introduce a flag for change_protect() and all these helpers to
> replace these parameters.  Then we can avoid passing multiple parameters
> multiple times along the way.
> 
> More importantly, it'll greatly simplify the work if we want to
> introduce any new parameters to change_protection().  In the follow up
> patches, a new parameter for userfaultfd write protection will be
> introduced.
> 
> No functional change at all.

There is one change i could spot and also something that looks wrong.

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

[...]

> @@ -428,8 +431,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
>  	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
>  	vma_set_page_prot(vma);
>  
> -	change_protection(vma, start, end, vma->vm_page_prot,
> -			  dirty_accountable, 0);
> +	change_protection(vma, start, end, vma->vm_page_prot, MM_CP_DIRTY_ACCT);

Here you unconditionaly see the DIRTY_ACCT flag instead it should be
something like:

    s/dirty_accountable/cp_flags
    if (vma_wants_writenotify(vma, vma->vm_page_prot))
        cp_flags = MM_CP_DIRTY_ACCT;
    else
        cp_flags = 0;

    change_protection(vma, start, end, vma->vm_page_prot, cp_flags);

Or any equivalent construct.

>  	/*
>  	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 005291b9b62f..23d4bbd117ee 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -674,7 +674,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>  		newprot = vm_get_page_prot(dst_vma->vm_flags);
>  
>  	change_protection(dst_vma, start, start + len, newprot,
> -				!enable_wp, 0);
> +			  enable_wp ? 0 : MM_CP_DIRTY_ACCT);

We had a discussion in the past on that, i have not look at other
patches but this seems wrong to me. MM_CP_DIRTY_ACCT is an
optimization to keep a pte with write permission if it is dirty
while my understanding is that you want to set write flag for pte
unconditionaly.

So maybe this patch that adds flag should be earlier in the serie
so that you can add a flag to do that before introducing the UFD
mwriteprotect_range() function.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-21  7:57 ` [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range Peter Xu
  2019-01-21 10:20   ` Mike Rapoport
@ 2019-01-21 14:05   ` Jerome Glisse
  2019-01-22  9:39     ` Peter Xu
  1 sibling, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-21 14:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Mon, Jan 21, 2019 at 03:57:04PM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> this doesn't split/merge vmas.

AFAICT it does not do that.

> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/userfaultfd_k.h |  2 ++
>  mm/userfaultfd.c              | 52 +++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 38f748e7186e..e82f3156f4e9 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -37,6 +37,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
>  			      unsigned long dst_start,
>  			      unsigned long len,
>  			      bool *mmap_changing);
> +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> +		unsigned long start, unsigned long len, bool enable_wp);
>  
>  /* mm helpers */
>  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 458acda96f20..c38903f501c7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -615,3 +615,55 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>  {
>  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
>  }
> +
> +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> +	unsigned long len, bool enable_wp)
> +{
> +	struct vm_area_struct *dst_vma;
> +	pgprot_t newprot;
> +	int err;
> +
> +	/*
> +	 * Sanitize the command parameters:
> +	 */
> +	BUG_ON(start & ~PAGE_MASK);
> +	BUG_ON(len & ~PAGE_MASK);
> +
> +	/* Does the address range wrap, or is the span zero-sized? */
> +	BUG_ON(start + len <= start);
> +
> +	down_read(&dst_mm->mmap_sem);
> +
> +	/*
> +	 * Make sure the vma is not shared, that the dst range is
> +	 * both valid and fully within a single existing vma.
> +	 */
> +	err = -EINVAL;
> +	dst_vma = find_vma(dst_mm, start);
> +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +		goto out_unlock;
> +	if (start < dst_vma->vm_start ||
> +	    start + len > dst_vma->vm_end)
> +		goto out_unlock;
> +
> +	if (!dst_vma->vm_userfaultfd_ctx.ctx)
> +		goto out_unlock;
> +	if (!userfaultfd_wp(dst_vma))
> +		goto out_unlock;
> +
> +	if (!vma_is_anonymous(dst_vma))
> +		goto out_unlock;
> +
> +	if (enable_wp)
> +		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
> +	else
> +		newprot = vm_get_page_prot(dst_vma->vm_flags);
> +
> +	change_protection(dst_vma, start, start + len, newprot,
> +				!enable_wp, 0);

So setting dirty_accountable bring us to that code in mprotect.c:

    if (dirty_accountable && pte_dirty(ptent) &&
            (pte_soft_dirty(ptent) ||
             !(vma->vm_flags & VM_SOFTDIRTY))) {
        ptent = pte_mkwrite(ptent);
    }

My understanding is that you want to set write flag when enable_wp
is false and you want to set the write flag unconditionaly, right ?

If so then you should really move the change_protection() flags
patch before this patch and add a flag for setting pte write flags.

Otherwise the above is broken at it will only set the write flag
for pte that were dirty and i am guessing so far you always were
lucky because pte were all dirty (change_protection will preserve
dirtyness) when you write protected them.

So i believe the above is broken or at very least unclear if what
you really want is to only set write flag to pte that have the
dirty flag set.


Cheers,
Jérôme


> +
> +	err = 0;
> +out_unlock:
> +	up_read(&dst_mm->mmap_sem);
> +	return err;
> +}
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 00/24] userfaultfd: write protection support
  2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
                   ` (23 preceding siblings ...)
  2019-01-21  7:57 ` [PATCH RFC 24/24] userfaultfd: selftests: add write-protect test Peter Xu
@ 2019-01-21 14:33 ` David Hildenbrand
  2019-01-22  3:18   ` Peter Xu
  24 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2019-01-21 14:33 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Hugh Dickins, Maya Gokhale, Jerome Glisse, Johannes Weiner,
	Martin Cracauer, Denis Plotnikov, Shaohua Li, Andrea Arcangeli,
	Pavel Emelyanov, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On 21.01.19 08:56, Peter Xu wrote:
> Hi,
> 
> This series implements initial write protection support for
> userfaultfd.  Currently both shmem and hugetlbfs are not supported
> yet, but only anonymous memory.
> 
> To be simple, either "userfaultfd-wp" or "uffd-wp" might be used in
> later paragraphs.
> 
> The whole series can also be found at:
> 
>   https://github.com/xzpeter/linux/tree/uffd-wp-merged
> 
> Any comment would be greatly welcomed.   Thanks.
> 
> Overview
> ====================
> 
> The uffd-wp work was initialized by Shaohua Li [1], and later
> continued by Andrea [2]. This series is based upon Andrea's latest
> userfaultfd tree, and it is a continuous works from both Shaohua and
> Andrea.  Many of the follow up ideas come from Andrea too.
> 
> Besides the old MISSING register mode of userfaultfd, the new uffd-wp
> support provides another alternative register mode called
> UFFDIO_REGISTER_MODE_WP that can be used to listen to not only missing
> page faults but also write protection page faults, or even they can be
> registered together.  At the same time, the new feature also provides
> a new userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows the
> userspace to write protect a range or memory or fixup write permission
> of faulted pages.
> 
> Please refer to the document patch "userfaultfd: wp:
> UFFDIO_REGISTER_MODE_WP documentation update" for more information on
> the new interface and what it can do.
> 
> The major workflow of an uffd-wp program should be:
> 
>   1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP
> 
>   2. Write protect part of the whole registered region using
>      UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
>      show that we want to write protect the range.
> 
>   3. Start a working thread that modifies the protected pages,
>      meanwhile listening to UFFD messages.
> 
>   4. When a write is detected upon the protected range, page fault
>      happens, a UFFD message will be generated and reported to the
>      page fault handling thread
> 
>   5. The page fault handler thread resolves the page fault using the
>      new UFFDIO_WRITEPROTECT ioctl, but this time passing in
>      !UFFDIO_WRITEPROTECT_MODE_WP instead showing that we want to
>      recover the write permission.  Before this operation, the fault
>      handler thread can do anything it wants, e.g., dumps the page to
>      a persistent storage.
> 
>   6. The worker thread will continue running with the correctly
>      applied write permission from step 5.
> 
> Currently there are already two projects that are based on this new
> userfaultfd feature.
> 
> QEMU Live Snapshot: The project provides a way to allow the QEMU
>                     hypervisor to take snapshot of VMs without
>                     stopping the VM [3].
> 
> LLNL umap library:  The project provides a mmap-like interface and
>                     "allow to have an application specific buffer of
>                     pages cached from a large file, i.e. out-of-core
>                     execution using memory map" [4][5].
> 
> Before posting the patchset, this series was smoke tested against QEMU
> live snapshot and the LLNL umap library (by doing parallel quicksort
> using 128 sorting threads + 80 uffd servicing threads).  My sincere
> thanks to Marty Mcfadden and Denis Plotnikov for the help along the
> way.
> 
> Implementation
> ==============
> 
> Patch 1-4: The whole uffd-wp requires the kernel page fault path to
>            take more than one retries.  In the previous works starting
>            from Shaohua, a new fault flag FAULT_FLAG_ALLOW_UFFD_RETRY
>            was introduced for this [6]. However in this series we have
>            dropped that patch, instead the whole work is based on the
>            recent series "[PATCH RFC v3 0/4] mm: some enhancements to
>            the page fault mechanism" [7] which removes the assuption
>            that VM_FAULT_RETRY can only happen once.  This four
>            patches are identital patches but picked up here.  Please
>            refer to the cover letter [7] for more information.  More
>            discussion upstream shows that this work could even benefit
>            existing use case [8] so please help justify whether
>            patches 1-4 can be consider to be accepted even earlier
>            than the rest of the series.
> 
> Patch 5-21:   Implements the uffd-wp logic.  To avoid collision with
>               existing write protections (e.g., an private anonymous
>               page can be write protected if it was shared between
>               multiple processes), a new PTE bit (_PAGE_UFFD_WP) was
>               introduced to explicitly mark a PTE as userfault
>               write-protected.  A similar bit was also used in the
>               swap/migration entry (_PAGE_SWP_UFFD_WP) to make sure
>               even if the pages were swapped or migrated, the uffd-wp
>               tracking information won't be lost.  When resolving a
>               page fault, we'll do a page copy before hand if the page
>               was COWed to make sure we won't corrupt any shared
>               pages.  Etc.  Please see separated patches for more
>               details.
> 
> Patch 22:     Documentation update for uffd-wp
> 
> Patch 23,24:  Uffd-wp selftests
> 
> TODO
> =============
> 
> - hugetlbfs/shmem support
> - performance
> - more architectures
> - ...
> 
> References
> ==========
> 
> [1] https://lwn.net/Articles/666187/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
> [3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
> [4] https://github.com/LLNL/umap
> [5] https://llnl-umap.readthedocs.io/en/develop/
> [6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
> [7] https://lkml.org/lkml/2018/11/21/370
> [8] https://lkml.org/lkml/2018/12/30/64
> 
> Andrea Arcangeli (5):
>   userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
>   userfaultfd: wp: hook userfault handler to write protection fault
>   userfaultfd: wp: add WP pagetable tracking to x86
>   userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
>   userfaultfd: wp: add UFFDIO_COPY_MODE_WP
> 
> Martin Cracauer (1):
>   userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
> 
> Peter Xu (15):
>   mm: gup: rename "nonblocking" to "locked" where proper
>   mm: userfault: return VM_FAULT_RETRY on signals
>   mm: allow VM_FAULT_RETRY for multiple times
>   mm: gup: allow VM_FAULT_RETRY for multiple times
>   mm: merge parameters for change_protection()
>   userfaultfd: wp: apply _PAGE_UFFD_WP bit
>   mm: export wp_page_copy()
>   userfaultfd: wp: handle COW properly for uffd-wp
>   userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
>   userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
>   userfaultfd: wp: support swap and page migration
>   userfaultfd: wp: don't wake up when doing write protect
>   khugepaged: skip collapse if uffd-wp detected
>   userfaultfd: selftests: refactor statistics
>   userfaultfd: selftests: add write-protect test
> 
> Shaohua Li (3):
>   userfaultfd: wp: add helper for writeprotect check
>   userfaultfd: wp: support write protection for userfault vma range
>   userfaultfd: wp: enabled write protection in userfaultfd API
> 
>  Documentation/admin-guide/mm/userfaultfd.rst |  51 +++++
>  arch/alpha/mm/fault.c                        |   4 +-
>  arch/arc/mm/fault.c                          |  12 +-
>  arch/arm/mm/fault.c                          |  17 +-
>  arch/arm64/mm/fault.c                        |  11 +-
>  arch/hexagon/mm/vm_fault.c                   |   3 +-
>  arch/ia64/mm/fault.c                         |   3 +-
>  arch/m68k/mm/fault.c                         |   5 +-
>  arch/microblaze/mm/fault.c                   |   3 +-
>  arch/mips/mm/fault.c                         |   3 +-
>  arch/nds32/mm/fault.c                        |   7 +-
>  arch/nios2/mm/fault.c                        |   5 +-
>  arch/openrisc/mm/fault.c                     |   3 +-
>  arch/parisc/mm/fault.c                       |   4 +-
>  arch/powerpc/mm/fault.c                      |   9 +-
>  arch/riscv/mm/fault.c                        |   9 +-
>  arch/s390/mm/fault.c                         |  14 +-
>  arch/sh/mm/fault.c                           |   5 +-
>  arch/sparc/mm/fault_32.c                     |   4 +-
>  arch/sparc/mm/fault_64.c                     |   4 +-
>  arch/um/kernel/trap.c                        |   6 +-
>  arch/unicore32/mm/fault.c                    |  10 +-
>  arch/x86/Kconfig                             |   1 +
>  arch/x86/include/asm/pgtable.h               |  67 ++++++
>  arch/x86/include/asm/pgtable_64.h            |   8 +-
>  arch/x86/include/asm/pgtable_types.h         |  11 +-
>  arch/x86/mm/fault.c                          |  13 +-
>  arch/xtensa/mm/fault.c                       |   4 +-
>  fs/userfaultfd.c                             | 110 +++++----
>  include/asm-generic/pgtable.h                |   1 +
>  include/asm-generic/pgtable_uffd.h           |  66 ++++++
>  include/linux/huge_mm.h                      |   2 +-
>  include/linux/mm.h                           |  21 +-
>  include/linux/swapops.h                      |   2 +
>  include/linux/userfaultfd_k.h                |  41 +++-
>  include/trace/events/huge_memory.h           |   1 +
>  include/uapi/linux/userfaultfd.h             |  28 ++-
>  init/Kconfig                                 |   5 +
>  mm/gup.c                                     |  61 ++---
>  mm/huge_memory.c                             |  28 ++-
>  mm/hugetlb.c                                 |   8 +-
>  mm/khugepaged.c                              |  23 ++
>  mm/memory.c                                  |  28 ++-
>  mm/mempolicy.c                               |   2 +-
>  mm/migrate.c                                 |   7 +
>  mm/mprotect.c                                |  99 +++++++--
>  mm/rmap.c                                    |   6 +
>  mm/userfaultfd.c                             |  92 +++++++-
>  tools/testing/selftests/vm/userfaultfd.c     | 222 ++++++++++++++-----
>  49 files changed, 898 insertions(+), 251 deletions(-)
>  create mode 100644 include/asm-generic/pgtable_uffd.h
> 

Does this series fix the "false positives" case I experienced on early
prototypes of uffd-wp? (getting notified about a write access although
it was not a write access?)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86
  2019-01-21  7:57 ` [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
@ 2019-01-21 15:09   ` Jerome Glisse
  2019-01-24  5:16     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-21 15:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:57:08PM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Accurate userfaultfd WP tracking is possible by tracking exactly which
> virtual memory ranges were writeprotected by userland. We can't relay
> only on the RW bit of the mapped pagetable because that information is
> destroyed by fork() or KSM or swap. If we were to relay on that, we'd
> need to stay on the safe side and generate false positive wp faults
> for every swapped out page.

So you want to forward write fault (of a protected range) to user space
only if page is not write protected because of fork(), KSM or swap.

This write protection feature is only for anonymous page right ? Other-
wise how would you protect a share page (ie anyone can look it up and
call page_mkwrite on it and start writting to it) ?

So for anonymous page for fork() the mapcount will tell you if page is
write protected for COW. For KSM it is easy check the page flag.

For swap you can use the page lock to synchronize. A page that is
write protected because of swap is write protected because it is being
write to disk thus either under page lock, or with PageWriteback()
returning true while write is on going.

So to me it seems you could properly identify if a page is write
protected for fork, swap or KSM without a new flag.

Cheers,
Jérôme

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/Kconfig                     |  1 +
>  arch/x86/include/asm/pgtable.h       | 52 ++++++++++++++++++++++++++++
>  arch/x86/include/asm/pgtable_64.h    |  8 ++++-
>  arch/x86/include/asm/pgtable_types.h |  9 +++++
>  include/asm-generic/pgtable.h        |  1 +
>  include/asm-generic/pgtable_uffd.h   | 51 +++++++++++++++++++++++++++
>  init/Kconfig                         |  5 +++
>  7 files changed, 126 insertions(+), 1 deletion(-)
>  create mode 100644 include/asm-generic/pgtable_uffd.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 8689e794a43c..096c773452d0 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -207,6 +207,7 @@ config X86
>  	select USER_STACKTRACE_SUPPORT
>  	select VIRT_TO_BUS
>  	select X86_FEATURE_NAMES		if PROC_FS
> +	select HAVE_ARCH_USERFAULTFD_WP		if USERFAULTFD
>  
>  config INSTRUCTION_DECODER
>  	def_bool y
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 40616e805292..7a71158982f4 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -23,6 +23,7 @@
>  
>  #ifndef __ASSEMBLY__
>  #include <asm/x86_init.h>
> +#include <asm-generic/pgtable_uffd.h>
>  
>  extern pgd_t early_top_pgt[PTRS_PER_PGD];
>  int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
> @@ -293,6 +294,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  	return native_make_pte(v & ~clear);
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline int pte_uffd_wp(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_UFFD_WP;
> +}
> +
> +static inline pte_t pte_mkuffd_wp(pte_t pte)
> +{
> +	return pte_set_flags(pte, _PAGE_UFFD_WP);
> +}
> +
> +static inline pte_t pte_clear_uffd_wp(pte_t pte)
> +{
> +	return pte_clear_flags(pte, _PAGE_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
>  	return pte_clear_flags(pte, _PAGE_DIRTY);
> @@ -372,6 +390,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
>  	return native_make_pmd(v & ~clear);
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline int pmd_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_UFFD_WP;
> +}
> +
> +static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_UFFD_WP);
> +}
> +
> +static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  static inline pmd_t pmd_mkold(pmd_t pmd)
>  {
>  	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
> @@ -1351,6 +1386,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
>  #endif
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
> +{
> +	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
> +}
> +
> +static inline int pte_swp_uffd_wp(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
> +}
> +
> +static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
> +{
> +	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  #define PKRU_AD_BIT 0x1
>  #define PKRU_WD_BIT 0x2
>  #define PKRU_BITS_PER_PKEY 2
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 9c85b54bf03c..e0c5d29b8685 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
>   *
>   * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
>   * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
> - * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
> + * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
>   *
>   * G (8) is aliased and used as a PROT_NONE indicator for
>   * !present ptes.  We need to start storing swap entries above
> @@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
>   * erratum where they can be incorrectly set by hardware on
>   * non-present PTEs.
>   *
> + * SD Bits 1-4 are not used in non-present format and available for
> + * special use described below:
> + *
>   * SD (1) in swp entry is used to store soft dirty bit, which helps us
>   * remember soft dirty over page migration
>   *
> + * F (2) in swp entry is used to record when a pagetable is
> + * writeprotected by userfaultfd WP support.
> + *
>   * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
>   * but also L and G.
>   *
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 106b7d0e2dae..163043ab142d 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -32,6 +32,7 @@
>  
>  #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
>  #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
> +#define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
>  #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
>  #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
>  
> @@ -100,6 +101,14 @@
>  #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +#define _PAGE_UFFD_WP		(_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
> +#define _PAGE_SWP_UFFD_WP	_PAGE_USER
> +#else
> +#define _PAGE_UFFD_WP		(_AT(pteval_t, 0))
> +#define _PAGE_SWP_UFFD_WP	(_AT(pteval_t, 0))
> +#endif
> +
>  #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
>  #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
>  #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 359fb935ded6..0e1470ecf7b5 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -10,6 +10,7 @@
>  #include <linux/mm_types.h>
>  #include <linux/bug.h>
>  #include <linux/errno.h>
> +#include <asm-generic/pgtable_uffd.h>
>  
>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>  	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> new file mode 100644
> index 000000000000..643d1bf559c2
> --- /dev/null
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -0,0 +1,51 @@
> +#ifndef _ASM_GENERIC_PGTABLE_UFFD_H
> +#define _ASM_GENERIC_PGTABLE_UFFD_H
> +
> +#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static __always_inline int pte_uffd_wp(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static __always_inline int pmd_uffd_wp(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
> +
> +static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
> +
> +static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline int pte_swp_uffd_wp(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
> +#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index cf5b5a0dcbc2..2a02e004874e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1418,6 +1418,11 @@ config ADVISE_SYSCALLS
>  	  applications use these syscalls, you can disable this option to save
>  	  space.
>  
> +config HAVE_ARCH_USERFAULTFD_WP
> +	bool
> +	help
> +	  Arch has userfaultfd write protection support
> +
>  config MEMBARRIER
>  	bool "Enable membarrier() system call" if EXPERT
>  	default y
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 02/24] mm: userfault: return VM_FAULT_RETRY on signals
  2019-01-21  7:57 ` [PATCH RFC 02/24] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
@ 2019-01-21 15:40   ` Jerome Glisse
  2019-01-22  6:10     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-21 15:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:57:00PM +0800, Peter Xu wrote:
> There was a special path in handle_userfault() in the past that we'll
> return a VM_FAULT_NOPAGE when we detected non-fatal signals when waiting
> for userfault handling.  We did that by reacquiring the mmap_sem before
> returning.  However that brings a risk in that the vmas might have
> changed when we retake the mmap_sem and even we could be holding an
> invalid vma structure.  The problem was reported by syzbot.

This is confusing this should be a patch on its own ie changes to
fs/userfaultfd.c where you remove that path.

> 
> This patch removes the special path and we'll return a VM_FAULT_RETRY
> with the common path even if we have got such signals.  Then for all the
> architectures that is passing in VM_FAULT_ALLOW_RETRY into
> handle_mm_fault(), we check not only for SIGKILL but for all the rest of
> userspace pending signals right after we returned from
> handle_mm_fault().
> 
> The idea comes from the upstream discussion between Linus and Andrea:
> 
>   https://lkml.org/lkml/2017/10/30/560
> 
> (This patch contains a potential fix for a double-free of mmap_sem on
>  ARC architecture; please see https://lkml.org/lkml/2018/11/1/723 for
>  more information)

This patch should only be about changing the return to userspace rule.
Before this patch the arch fault handler returned to userspace only
for fatal signal, after this patch it returns to userspace for any
signal.

It would be a lot better to have a fix for arc as a separate patch so
that we can focus on reviewing only one thing.

Cheers,
Jérôme


> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/alpha/mm/fault.c      |  2 +-
>  arch/arc/mm/fault.c        | 11 +++++++----
>  arch/arm/mm/fault.c        | 14 ++++++++++----
>  arch/arm64/mm/fault.c      |  6 +++---
>  arch/hexagon/mm/vm_fault.c |  2 +-
>  arch/ia64/mm/fault.c       |  2 +-
>  arch/m68k/mm/fault.c       |  2 +-
>  arch/microblaze/mm/fault.c |  2 +-
>  arch/mips/mm/fault.c       |  2 +-
>  arch/nds32/mm/fault.c      |  6 +++---
>  arch/nios2/mm/fault.c      |  2 +-
>  arch/openrisc/mm/fault.c   |  2 +-
>  arch/parisc/mm/fault.c     |  2 +-
>  arch/powerpc/mm/fault.c    |  4 +++-
>  arch/riscv/mm/fault.c      |  4 ++--
>  arch/s390/mm/fault.c       |  9 ++++++---
>  arch/sh/mm/fault.c         |  4 ++++
>  arch/sparc/mm/fault_32.c   |  3 +++
>  arch/sparc/mm/fault_64.c   |  3 +++
>  arch/um/kernel/trap.c      |  5 ++++-
>  arch/unicore32/mm/fault.c  |  4 ++--
>  arch/x86/mm/fault.c        | 12 +++++++++++-
>  arch/xtensa/mm/fault.c     |  3 +++
>  fs/userfaultfd.c           | 24 ------------------------
>  24 files changed, 73 insertions(+), 57 deletions(-)
> 
> diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
> index d73dc473fbb9..46e5e420ad2a 100644
> --- a/arch/alpha/mm/fault.c
> +++ b/arch/alpha/mm/fault.c
> @@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
>  	   the fault.  */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
> index e2d9fc3fea01..91492d244ea6 100644
> --- a/arch/arc/mm/fault.c
> +++ b/arch/arc/mm/fault.c
> @@ -142,11 +142,14 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
>  	fault = handle_mm_fault(vma, address, flags);
>  
>  	/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
> -	if (unlikely(fatal_signal_pending(current))) {
> -		if ((fault & VM_FAULT_ERROR) && !(fault & VM_FAULT_RETRY))
> +	if (unlikely(fatal_signal_pending(current) && user_mode(regs))) {
> +		/*
> +		 * VM_FAULT_RETRY means we have released the mmap_sem,
> +		 * otherwise we need to drop it before leaving
> +		 */
> +		if (!(fault & VM_FAULT_RETRY))
>  			up_read(&mm->mmap_sem);
> -		if (user_mode(regs))
> -			return;
> +		return;
>  	}
>  
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index f4ea4c62c613..743077d19669 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -308,14 +308,20 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  
>  	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
>  
> -	/* If we need to retry but a fatal signal is pending, handle the
> +	/* If we need to retry but a signal is pending, handle the
>  	 * signal first. We do not need to release the mmap_sem because
>  	 * it would already be released in __lock_page_or_retry in
>  	 * mm/filemap.c. */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> -		if (!user_mode(regs))
> +	if (fault & VM_FAULT_RETRY) {
> +		if (fatal_signal_pending(current) && !user_mode(regs))
>  			goto no_context;
> -		return 0;
> +		else if (signal_pending(current))
> +			/*
> +			 * It's either a common signal, or a fatal
> +			 * signal but for the userspace, we return
> +			 * immediately.
> +			 */
> +			return 0;
>  	}
>  
>  	/*
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 7d9571f4ae3d..744d6451ea83 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -499,13 +499,13 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>  
>  	if (fault & VM_FAULT_RETRY) {
>  		/*
> -		 * If we need to retry but a fatal signal is pending,
> +		 * If we need to retry but a signal is pending,
>  		 * handle the signal first. We do not need to release
>  		 * the mmap_sem because it would already be released
>  		 * in __lock_page_or_retry in mm/filemap.c.
>  		 */
> -		if (fatal_signal_pending(current)) {
> -			if (!user_mode(regs))
> +		if (signal_pending(current)) {
> +			if (fatal_signal_pending(current) && !user_mode(regs))
>  				goto no_context;
>  			return 0;
>  		}
> diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
> index eb263e61daf4..be10b441d9cc 100644
> --- a/arch/hexagon/mm/vm_fault.c
> +++ b/arch/hexagon/mm/vm_fault.c
> @@ -104,7 +104,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
>  
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return;
>  
>  	/* The most common case -- we are done. */
> diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
> index 5baeb022f474..62c2d39d2bed 100644
> --- a/arch/ia64/mm/fault.c
> +++ b/arch/ia64/mm/fault.c
> @@ -163,7 +163,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
> index 9b6163c05a75..d9808a807ab8 100644
> --- a/arch/m68k/mm/fault.c
> +++ b/arch/m68k/mm/fault.c
> @@ -138,7 +138,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
>  	fault = handle_mm_fault(vma, address, flags);
>  	pr_debug("handle_mm_fault returns %x\n", fault);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return 0;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
> index 202ad6a494f5..4fd2dbd0c5ca 100644
> --- a/arch/microblaze/mm/fault.c
> +++ b/arch/microblaze/mm/fault.c
> @@ -217,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
> index 73d8a0f0b810..92374fd091d2 100644
> --- a/arch/mips/mm/fault.c
> +++ b/arch/mips/mm/fault.c
> @@ -154,7 +154,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return;
>  
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
> index b740534b152c..72461745d3e1 100644
> --- a/arch/nds32/mm/fault.c
> +++ b/arch/nds32/mm/fault.c
> @@ -207,12 +207,12 @@ void do_page_fault(unsigned long entry, unsigned long addr,
>  	fault = handle_mm_fault(vma, addr, flags);
>  
>  	/*
> -	 * If we need to retry but a fatal signal is pending, handle the
> +	 * If we need to retry but a signal is pending, handle the
>  	 * signal first. We do not need to release the mmap_sem because it
>  	 * would already be released in __lock_page_or_retry in mm/filemap.c.
>  	 */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> -		if (!user_mode(regs))
> +	if (fault & VM_FAULT_RETRY && signal_pending(current)) {
> +		if (fatal_signal_pending(current) && !user_mode(regs))
>  			goto no_context;
>  		return;
>  	}
> diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
> index 24fd84cf6006..5939434a31ae 100644
> --- a/arch/nios2/mm/fault.c
> +++ b/arch/nios2/mm/fault.c
> @@ -134,7 +134,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
> index dc4dbafc1d83..873ecb5d82d7 100644
> --- a/arch/openrisc/mm/fault.c
> +++ b/arch/openrisc/mm/fault.c
> @@ -165,7 +165,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
>  
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
> index c8e8b7c05558..29422eec329d 100644
> --- a/arch/parisc/mm/fault.c
> +++ b/arch/parisc/mm/fault.c
> @@ -303,7 +303,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
>  
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 1697e903bbf2..8bc0d091f13c 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -575,8 +575,10 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
>  			 */
>  			flags &= ~FAULT_FLAG_ALLOW_RETRY;
>  			flags |= FAULT_FLAG_TRIED;
> -			if (!fatal_signal_pending(current))
> +			if (!signal_pending(current))
>  				goto retry;
> +			else if (!fatal_signal_pending(current) && is_user)
> +				return 0;
>  		}
>  
>  		/*
> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> index 88401d5125bc..4fc8d746bec3 100644
> --- a/arch/riscv/mm/fault.c
> +++ b/arch/riscv/mm/fault.c
> @@ -123,11 +123,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>  	fault = handle_mm_fault(vma, addr, flags);
>  
>  	/*
> -	 * If we need to retry but a fatal signal is pending, handle the
> +	 * If we need to retry but a signal is pending, handle the
>  	 * signal first. We do not need to release the mmap_sem because it
>  	 * would already be released in __lock_page_or_retry in mm/filemap.c.
>  	 */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(tsk))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index 2b8f32f56e0c..19b4fb2fafab 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -500,9 +500,12 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	 * the fault.
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
> -	/* No reason to continue if interrupted by SIGKILL. */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> -		fault = VM_FAULT_SIGNAL;
> +	/* Do not continue if interrupted by signals. */
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
> +		if (fatal_signal_pending(current))
> +			fault = VM_FAULT_SIGNAL;
> +		else
> +			fault = 0;
>  		if (flags & FAULT_FLAG_RETRY_NOWAIT)
>  			goto out_up;
>  		goto out;
> diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
> index 6defd2c6d9b1..baf5d73df40c 100644
> --- a/arch/sh/mm/fault.c
> +++ b/arch/sh/mm/fault.c
> @@ -506,6 +506,10 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
>  			 * have already released it in __lock_page_or_retry
>  			 * in mm/filemap.c.
>  			 */
> +
> +			if (user_mode(regs) && signal_pending(tsk))
> +				return;
> +
>  			goto retry;
>  		}
>  	}
> diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
> index b0440b0edd97..a2c83104fe35 100644
> --- a/arch/sparc/mm/fault_32.c
> +++ b/arch/sparc/mm/fault_32.c
> @@ -269,6 +269,9 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
>  			 * in mm/filemap.c.
>  			 */
>  
> +			if (user_mode(regs) && signal_pending(tsk))
> +				return;
> +
>  			goto retry;
>  		}
>  	}
> diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
> index 8f8a604c1300..cad71ec5c7b3 100644
> --- a/arch/sparc/mm/fault_64.c
> +++ b/arch/sparc/mm/fault_64.c
> @@ -467,6 +467,9 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
>  			 * in mm/filemap.c.
>  			 */
>  
> +			if (user_mode(regs) && signal_pending(current))
> +				return;
> +
>  			goto retry;
>  		}
>  	}
> diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
> index 0e8b6158f224..09baf37b65b9 100644
> --- a/arch/um/kernel/trap.c
> +++ b/arch/um/kernel/trap.c
> @@ -76,8 +76,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
>  
>  		fault = handle_mm_fault(vma, address, flags);
>  
> -		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +		if (fault & VM_FAULT_RETRY && signal_pending(current)) {
> +			if (is_user && !fatal_signal_pending(current))
> +				err = 0;
>  			goto out_nosemaphore;
> +		}
>  
>  		if (unlikely(fault & VM_FAULT_ERROR)) {
>  			if (fault & VM_FAULT_OOM) {
> diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
> index b9a3a50644c1..3611f19234a1 100644
> --- a/arch/unicore32/mm/fault.c
> +++ b/arch/unicore32/mm/fault.c
> @@ -248,11 +248,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  
>  	fault = __do_pf(mm, addr, fsr, flags, tsk);
>  
> -	/* If we need to retry but a fatal signal is pending, handle the
> +	/* If we need to retry but a signal is pending, handle the
>  	 * signal first. We do not need to release the mmap_sem because
>  	 * it would already be released in __lock_page_or_retry in
>  	 * mm/filemap.c. */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
>  		return 0;
>  
>  	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 71d4b9d4d43f..b94ef0c2b98c 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1433,8 +1433,18 @@ void do_user_addr_fault(struct pt_regs *regs,
>  		if (flags & FAULT_FLAG_ALLOW_RETRY) {
>  			flags &= ~FAULT_FLAG_ALLOW_RETRY;
>  			flags |= FAULT_FLAG_TRIED;
> -			if (!fatal_signal_pending(tsk))
> +			if (!signal_pending(tsk))
>  				goto retry;
> +			else if (!fatal_signal_pending(tsk))
> +				/*
> +				 * There is a signal for the task but
> +				 * it's not fatal, let's return
> +				 * directly to the userspace.  This
> +				 * gives chance for signals like
> +				 * SIGSTOP/SIGCONT to be handled
> +				 * faster, e.g., with GDB.
> +				 */
> +				return;
>  		}
>  
>  		/* User mode? Just return to handle the fatal exception */
> diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
> index 2ab0e0dcd166..792dad5e2f12 100644
> --- a/arch/xtensa/mm/fault.c
> +++ b/arch/xtensa/mm/fault.c
> @@ -136,6 +136,9 @@ void do_page_fault(struct pt_regs *regs)
>  			 * in mm/filemap.c.
>  			 */
>  
> +			if (user_mode(regs) && signal_pending(current))
> +				return;
> +
>  			goto retry;
>  		}
>  	}
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 270d4888c6d5..bc9f6230a3f0 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -515,30 +515,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>  
>  	__set_current_state(TASK_RUNNING);
>  
> -	if (return_to_userland) {
> -		if (signal_pending(current) &&
> -		    !fatal_signal_pending(current)) {
> -			/*
> -			 * If we got a SIGSTOP or SIGCONT and this is
> -			 * a normal userland page fault, just let
> -			 * userland return so the signal will be
> -			 * handled and gdb debugging works.  The page
> -			 * fault code immediately after we return from
> -			 * this function is going to release the
> -			 * mmap_sem and it's not depending on it
> -			 * (unlike gup would if we were not to return
> -			 * VM_FAULT_RETRY).
> -			 *
> -			 * If a fatal signal is pending we still take
> -			 * the streamlined VM_FAULT_RETRY failure path
> -			 * and there's no need to retake the mmap_sem
> -			 * in such case.
> -			 */
> -			down_read(&mm->mmap_sem);
> -			ret = VM_FAULT_NOPAGE;
> -		}
> -	}
> -
>  	/*
>  	 * Here we race with the list_del; list_add in
>  	 * userfaultfd_ctx_read(), however because we don't ever run
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times
  2019-01-21  7:57 ` [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
@ 2019-01-21 15:55   ` Jerome Glisse
  2019-01-22  8:22     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-21 15:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:57:01PM +0800, Peter Xu wrote:
> The idea comes from a discussion between Linus and Andrea [1].
> 
> Before this patch we only allow a page fault to retry once.  We achieved
> this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> handle_mm_fault() the second time.  This was majorly used to avoid
> unexpected starvation of the system by looping over forever to handle
> the page fault on a single page.  However that should hardly happen, and
> after all for each code path to return a VM_FAULT_RETRY we'll first wait
> for a condition (during which time we should possibly yield the cpu) to
> happen before VM_FAULT_RETRY is really returned.
> 
> This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
> flag when we receive VM_FAULT_RETRY.  It means that the page fault
> handler now can retry the page fault for multiple times if necessary
> without the need to generate another page fault event. Meanwhile we
> still keep the FAULT_FLAG_TRIED flag so page fault handler can still
> identify whether a page fault is the first attempt or not.

So there is nothing protecting starvation after this patch ? AFAICT.
Do we sufficient proof that we never have a scenario where one process
might starve fault another ?

For instance some page locking could starve one process.


> 
> GUP code is not touched yet and will be covered in follow up patch.
> 
> This will be a nice enhancement for current code at the same time a
> supporting material for the future userfaultfd-writeprotect work since
> in that work there will always be an explicit userfault writeprotect
> retry for protected pages, and if that cannot resolve the page
> fault (e.g., when userfaultfd-writeprotect is used in conjunction with
> shared memory) then we'll possibly need a 3rd retry of the page fault.
> It might also benefit other potential users who will have similar
> requirement like userfault write-protection.
> 
> Please read the thread below for more information.
> 
> [1] https://lkml.org/lkml/2017/11/2/833
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 04/24] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-01-21  7:57 ` [PATCH RFC 04/24] mm: gup: " Peter Xu
@ 2019-01-21 16:24   ` Jerome Glisse
  2019-01-24  7:05     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-21 16:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:57:02PM +0800, Peter Xu wrote:
> This is the gup counterpart of the change that allows the VM_FAULT_RETRY
> to happen for more than once.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

So it would be nice to add a comment in the code and in the commit message
about possible fault starvation (mostly due to previous patch changes) as
if some one experience that and try to bisect it might overlook the commit.

Otherwise:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  mm/gup.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 7b1f452cc2ef..22f1d419a849 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -528,7 +528,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
>  	if (*flags & FOLL_TRIED) {
> -		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +		/*
> +		 * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
> +		 * can co-exist
> +		 */
>  		fault_flags |= FAULT_FLAG_TRIED;
>  	}
>  
> @@ -943,17 +946,23 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
>  		pages += ret;
>  		start += ret << PAGE_SHIFT;
> +		lock_dropped = true;
>  
> +retry:
>  		/*
>  		 * Repeat on the address that fired VM_FAULT_RETRY
> -		 * without FAULT_FLAG_ALLOW_RETRY but with
> +		 * with both FAULT_FLAG_ALLOW_RETRY and
>  		 * FAULT_FLAG_TRIED.
>  		 */
>  		*locked = 1;
> -		lock_dropped = true;
>  		down_read(&mm->mmap_sem);
>  		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
> -				       pages, NULL, NULL);
> +				       pages, NULL, locked);
> +		if (!*locked) {
> +			/* Continue to retry until we succeeded */
> +			BUG_ON(ret != 0);
> +			goto retry;
> +		}
>  		if (ret != 1) {
>  			BUG_ON(ret > 1);
>  			if (!pages_done)
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 00/24] userfaultfd: write protection support
  2019-01-21 14:33 ` [PATCH RFC 00/24] userfaultfd: write protection support David Hildenbrand
@ 2019-01-22  3:18   ` Peter Xu
  2019-01-22  8:59     ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-22  3:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 03:33:21PM +0100, David Hildenbrand wrote:

[...]

> Does this series fix the "false positives" case I experienced on early
> prototypes of uffd-wp? (getting notified about a write access although
> it was not a write access?)

Hi, David,

Yes it should solve it.

The early prototype in Andrea's tree hasn't yet applied the new
PTE/swap bits for uffd-wp hence it was not able to avoid those fause
positives.  This series has applied all those ideas (which actually
come from Andrea as well) so the protection information will be
persisent per PTE rather than per VMA and it will be kept even through
swapping and page migrations.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 02/24] mm: userfault: return VM_FAULT_RETRY on signals
  2019-01-21 15:40   ` Jerome Glisse
@ 2019-01-22  6:10     ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-22  6:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 10:40:18AM -0500, Jerome Glisse wrote:
> On Mon, Jan 21, 2019 at 03:57:00PM +0800, Peter Xu wrote:
> > There was a special path in handle_userfault() in the past that we'll
> > return a VM_FAULT_NOPAGE when we detected non-fatal signals when waiting
> > for userfault handling.  We did that by reacquiring the mmap_sem before
> > returning.  However that brings a risk in that the vmas might have
> > changed when we retake the mmap_sem and even we could be holding an
> > invalid vma structure.  The problem was reported by syzbot.
> 
> This is confusing this should be a patch on its own ie changes to
> fs/userfaultfd.c where you remove that path.

Sure I will.

> 
> > 
> > This patch removes the special path and we'll return a VM_FAULT_RETRY
> > with the common path even if we have got such signals.  Then for all the
> > architectures that is passing in VM_FAULT_ALLOW_RETRY into
> > handle_mm_fault(), we check not only for SIGKILL but for all the rest of
> > userspace pending signals right after we returned from
> > handle_mm_fault().
> > 
> > The idea comes from the upstream discussion between Linus and Andrea:
> > 
> >   https://lkml.org/lkml/2017/10/30/560
> > 
> > (This patch contains a potential fix for a double-free of mmap_sem on
> >  ARC architecture; please see https://lkml.org/lkml/2018/11/1/723 for
> >  more information)
> 
> This patch should only be about changing the return to userspace rule.
> Before this patch the arch fault handler returned to userspace only
> for fatal signal, after this patch it returns to userspace for any
> signal.

Ok.  I'll make the first patch to do the signal changes, then the
second patch to remove the userfault path explicitly.

> 
> It would be a lot better to have a fix for arc as a separate patch so
> that we can focus on reviewing only one thing.

I just noticed that it was fixed just a few days ago in commit
4d447455e73b.  Then I'll just simply rebase to Linus master and use
the upstream fix, then I can drop this paragraph.

Thanks for the review!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times
  2019-01-21 15:55   ` Jerome Glisse
@ 2019-01-22  8:22     ` Peter Xu
  2019-01-22 16:53       ` Jerome Glisse
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-22  8:22 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 10:55:36AM -0500, Jerome Glisse wrote:
> On Mon, Jan 21, 2019 at 03:57:01PM +0800, Peter Xu wrote:
> > The idea comes from a discussion between Linus and Andrea [1].
> > 
> > Before this patch we only allow a page fault to retry once.  We achieved
> > this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> > handle_mm_fault() the second time.  This was majorly used to avoid
> > unexpected starvation of the system by looping over forever to handle
> > the page fault on a single page.  However that should hardly happen, and
> > after all for each code path to return a VM_FAULT_RETRY we'll first wait
> > for a condition (during which time we should possibly yield the cpu) to
> > happen before VM_FAULT_RETRY is really returned.
> > 
> > This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
> > flag when we receive VM_FAULT_RETRY.  It means that the page fault
> > handler now can retry the page fault for multiple times if necessary
> > without the need to generate another page fault event. Meanwhile we
> > still keep the FAULT_FLAG_TRIED flag so page fault handler can still
> > identify whether a page fault is the first attempt or not.
> 
> So there is nothing protecting starvation after this patch ? AFAICT.
> Do we sufficient proof that we never have a scenario where one process
> might starve fault another ?
> 
> For instance some page locking could starve one process.

Hi, Jerome,

Do you mean lock_page()?

AFAIU lock_page() will only yield the process itself until the lock is
released, so IMHO it's not really starving the process but a natural
behavior.  After all the process may not continue without handling the
page fault correctly.

Or when you say "starvation" do you mean that we might return
VM_FAULT_RETRY from handle_mm_fault() continuously so we'll looping
over and over inside the page fault handler?

Thanks,

> 
> 
> > 
> > GUP code is not touched yet and will be covered in follow up patch.
> > 
> > This will be a nice enhancement for current code at the same time a
> > supporting material for the future userfaultfd-writeprotect work since
> > in that work there will always be an explicit userfault writeprotect
> > retry for protected pages, and if that cannot resolve the page
> > fault (e.g., when userfaultfd-writeprotect is used in conjunction with
> > shared memory) then we'll possibly need a 3rd retry of the page fault.
> > It might also benefit other potential users who will have similar
> > requirement like userfault write-protection.
> > 
> > Please read the thread below for more information.
> > 
> > [1] https://lkml.org/lkml/2017/11/2/833
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 05/24] userfaultfd: wp: add helper for writeprotect check
  2019-01-21 10:23   ` Mike Rapoport
@ 2019-01-22  8:31     ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-22  8:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Mon, Jan 21, 2019 at 12:23:12PM +0200, Mike Rapoport wrote:
> On Mon, Jan 21, 2019 at 03:57:03PM +0800, Peter Xu wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > add helper for writeprotect check. Will use it later.
> 
> I'd merge this with the commit that actually uses this helper.

Hi, Mike,

Yeah actually that's what I'd prefer for most of the time.  But I'm
trying to avoid doing that because I wanted to keep the credit of the
original authors, not only for this single patch, but also for the
whole series.  Meanwhile, since this work has been there for quite a
few years (starting from 2015), IMHO keeping the old patches mostly
untouched at least in the RFC stage might also help the reviewers if
they have read or prior knowledge of the previous work.

And if the patch cannot even stand on itself (this one can; it only
introduces new functions) I'll do the merge no matter what.

Please correct me if this is not the good way to do.

Thanks!

>  
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/userfaultfd_k.h | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 37c9eba75c98..38f748e7186e 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -50,6 +50,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
> >  	return vma->vm_flags & VM_UFFD_MISSING;
> >  }
> > 
> > +static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> > +{
> > +	return vma->vm_flags & VM_UFFD_WP;
> > +}
> > +
> >  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
> >  {
> >  	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> > @@ -94,6 +99,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
> >  	return false;
> >  }
> > 
> > +static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> > +{
> > +	return false;
> > +}
> > +
> >  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
> >  {
> >  	return false;
> > -- 
> > 2.17.1
> > 
> 
> -- 
> Sincerely yours,
> Mike.
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-21 10:20   ` Mike Rapoport
@ 2019-01-22  8:55     ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-22  8:55 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Mon, Jan 21, 2019 at 12:20:35PM +0200, Mike Rapoport wrote:
> On Mon, Jan 21, 2019 at 03:57:04PM +0800, Peter Xu wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> > this doesn't split/merge vmas.
> > 
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/userfaultfd_k.h |  2 ++
> >  mm/userfaultfd.c              | 52 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 54 insertions(+)
> > 
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 38f748e7186e..e82f3156f4e9 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -37,6 +37,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> >  			      unsigned long dst_start,
> >  			      unsigned long len,
> >  			      bool *mmap_changing);
> > +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> > +		unsigned long start, unsigned long len, bool enable_wp);
> > 
> >  /* mm helpers */
> >  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 458acda96f20..c38903f501c7 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -615,3 +615,55 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> >  {
> >  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
> >  }
> > +
> > +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > +	unsigned long len, bool enable_wp)
> > +{
> > +	struct vm_area_struct *dst_vma;
> > +	pgprot_t newprot;
> > +	int err;
> > +
> > +	/*
> > +	 * Sanitize the command parameters:
> > +	 */
> > +	BUG_ON(start & ~PAGE_MASK);
> > +	BUG_ON(len & ~PAGE_MASK);
> > +
> > +	/* Does the address range wrap, or is the span zero-sized? */
> > +	BUG_ON(start + len <= start);
> > +
> > +	down_read(&dst_mm->mmap_sem);
> > +
> > +	/*
> > +	 * Make sure the vma is not shared, that the dst range is
> > +	 * both valid and fully within a single existing vma.
> > +	 */
> > +	err = -EINVAL;
> 
> In non-cooperative mode, there can be a race between VM layout changes and
> mcopy_atomic [1]. I believe the same races are possible here, so can we
> please make err = -ENOENT for consistency with mcopy?

Sure.

> 
> > +	dst_vma = find_vma(dst_mm, start);
> > +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> > +		goto out_unlock;
> > +	if (start < dst_vma->vm_start ||
> > +	    start + len > dst_vma->vm_end)
> > +		goto out_unlock;
> > +
> > +	if (!dst_vma->vm_userfaultfd_ctx.ctx)
> > +		goto out_unlock;
> > +	if (!userfaultfd_wp(dst_vma))
> > +		goto out_unlock;
> > +
> > +	if (!vma_is_anonymous(dst_vma))
> > +		goto out_unlock;
> 
> The sanity checks here seem to repeat those in mcopy_atomic(). I'd suggest
> splitting them out to a helper function.

It's a good suggestion.  Thanks!

> 
> > +	if (enable_wp)
> > +		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
> > +	else
> > +		newprot = vm_get_page_prot(dst_vma->vm_flags);
> > +
> > +	change_protection(dst_vma, start, start + len, newprot,
> > +				!enable_wp, 0);
> > +
> > +	err = 0;
> > +out_unlock:
> > +	up_read(&dst_mm->mmap_sem);
> > +	return err;
> > +}
> > -- 
> > 2.17.1
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=27d02568f529e908399514dfbee8ee43bdfd5299
> 
> -- 
> Sincerely yours,
> Mike.
> 

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 00/24] userfaultfd: write protection support
  2019-01-22  3:18   ` Peter Xu
@ 2019-01-22  8:59     ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2019-01-22  8:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On 22.01.19 04:18, Peter Xu wrote:
> On Mon, Jan 21, 2019 at 03:33:21PM +0100, David Hildenbrand wrote:
> 
> [...]
> 
>> Does this series fix the "false positives" case I experienced on early
>> prototypes of uffd-wp? (getting notified about a write access although
>> it was not a write access?)
> 
> Hi, David,
> 
> Yes it should solve it.

Terrific, as my use case for uffd-wp really rely on not having false
positives these are good news :)

... however it will take a while until I actually have time to look back
into it (too much stuff on my table).

Just for reference (we talked about this offline once):

My plan is to use this for virtio-mem in QEMU. Memory that a virtio-mem
device provides to a guest can either be plugged or unplugged. When
unplugging, memory will be MADVISE_DONTNEED'ed and uffd-wp'ed. The guest
can still read memory (e.g. for dumping) but writing to it is considered
bad (as the guest could this way consume more memory as intended). So I
can detect malicious guests without too much overhead this way.

False positives would mean that I would detect guests as malicious
although they are not. So it really would be harmful.

Thanks!

> 
> The early prototype in Andrea's tree hasn't yet applied the new
> PTE/swap bits for uffd-wp hence it was not able to avoid those fause
> positives.  This series has applied all those ideas (which actually
> come from Andrea as well) so the protection information will be
> persisent per PTE rather than per VMA and it will be kept even through
> swapping and page migrations.
> 
> Thanks,
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-21 14:05   ` Jerome Glisse
@ 2019-01-22  9:39     ` Peter Xu
  2019-01-22 17:02       ` Jerome Glisse
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-22  9:39 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Mon, Jan 21, 2019 at 09:05:35AM -0500, Jerome Glisse wrote:

[...]

> > +	change_protection(dst_vma, start, start + len, newprot,
> > +				!enable_wp, 0);
> 
> So setting dirty_accountable bring us to that code in mprotect.c:
> 
>     if (dirty_accountable && pte_dirty(ptent) &&
>             (pte_soft_dirty(ptent) ||
>              !(vma->vm_flags & VM_SOFTDIRTY))) {
>         ptent = pte_mkwrite(ptent);
>     }
> 
> My understanding is that you want to set write flag when enable_wp
> is false and you want to set the write flag unconditionaly, right ?

Right.

> 
> If so then you should really move the change_protection() flags
> patch before this patch and add a flag for setting pte write flags.
> 
> Otherwise the above is broken at it will only set the write flag
> for pte that were dirty and i am guessing so far you always were
> lucky because pte were all dirty (change_protection will preserve
> dirtyness) when you write protected them.
> 
> So i believe the above is broken or at very least unclear if what
> you really want is to only set write flag to pte that have the
> dirty flag set.

You are right, if we build the tree until this patch it won't work for
all the cases.  It'll only work if the page was at least writable
before and also it's dirty (as you explained).  Sorry to be unclear
about this, maybe I should at least mention that in the commit message
but I totally forgot it.

All these problems are solved in later on patches, please feel free to
have a look at:

  mm: merge parameters for change_protection()
  userfaultfd: wp: apply _PAGE_UFFD_WP bit
  userfaultfd: wp: handle COW properly for uffd-wp

Note that even in the follow up patches IMHO we can't directly change
the write permission since the page can be shared by other processes
(e.g., the zero page or COW pages).  But the general idea is the same
as you explained.

I tried to avoid squashing these stuff altogether as explained
previously.  Also, this patch can be seen as a standalone patch to
introduce the new interface which seems to make sense too, and it is
indeed still working in many cases so I see the latter patches as
enhancement of this one.  Please let me know if you still want me to
have all these stuff squashed, or if you'd like me to squash some of
them.

Thanks!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times
  2019-01-22  8:22     ` Peter Xu
@ 2019-01-22 16:53       ` Jerome Glisse
  2019-01-23  2:12         ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-22 16:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Jan 22, 2019 at 04:22:38PM +0800, Peter Xu wrote:
> On Mon, Jan 21, 2019 at 10:55:36AM -0500, Jerome Glisse wrote:
> > On Mon, Jan 21, 2019 at 03:57:01PM +0800, Peter Xu wrote:
> > > The idea comes from a discussion between Linus and Andrea [1].
> > > 
> > > Before this patch we only allow a page fault to retry once.  We achieved
> > > this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> > > handle_mm_fault() the second time.  This was majorly used to avoid
> > > unexpected starvation of the system by looping over forever to handle
> > > the page fault on a single page.  However that should hardly happen, and
> > > after all for each code path to return a VM_FAULT_RETRY we'll first wait
> > > for a condition (during which time we should possibly yield the cpu) to
> > > happen before VM_FAULT_RETRY is really returned.
> > > 
> > > This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
> > > flag when we receive VM_FAULT_RETRY.  It means that the page fault
> > > handler now can retry the page fault for multiple times if necessary
> > > without the need to generate another page fault event. Meanwhile we
> > > still keep the FAULT_FLAG_TRIED flag so page fault handler can still
> > > identify whether a page fault is the first attempt or not.
> > 
> > So there is nothing protecting starvation after this patch ? AFAICT.
> > Do we sufficient proof that we never have a scenario where one process
> > might starve fault another ?
> > 
> > For instance some page locking could starve one process.
> 
> Hi, Jerome,
> 
> Do you mean lock_page()?
> 
> AFAIU lock_page() will only yield the process itself until the lock is
> released, so IMHO it's not really starving the process but a natural
> behavior.  After all the process may not continue without handling the
> page fault correctly.
> 
> Or when you say "starvation" do you mean that we might return
> VM_FAULT_RETRY from handle_mm_fault() continuously so we'll looping
> over and over inside the page fault handler?

That one ie every time we retry someone else is holding the lock and
thus lock_page_or_retry() will continuously retry. Some process just
get unlucky ;)

With existing code because we remove the retry flag then on the second
try we end up waiting for the page lock while holding the mmap_sem so
we know that we are in line for the page lock and we will get it once
it is our turn.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-22  9:39     ` Peter Xu
@ 2019-01-22 17:02       ` Jerome Glisse
  2019-01-23  2:17         ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-22 17:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Tue, Jan 22, 2019 at 05:39:35PM +0800, Peter Xu wrote:
> On Mon, Jan 21, 2019 at 09:05:35AM -0500, Jerome Glisse wrote:
> 
> [...]
> 
> > > +	change_protection(dst_vma, start, start + len, newprot,
> > > +				!enable_wp, 0);
> > 
> > So setting dirty_accountable bring us to that code in mprotect.c:
> > 
> >     if (dirty_accountable && pte_dirty(ptent) &&
> >             (pte_soft_dirty(ptent) ||
> >              !(vma->vm_flags & VM_SOFTDIRTY))) {
> >         ptent = pte_mkwrite(ptent);
> >     }
> > 
> > My understanding is that you want to set write flag when enable_wp
> > is false and you want to set the write flag unconditionaly, right ?
> 
> Right.
> 
> > 
> > If so then you should really move the change_protection() flags
> > patch before this patch and add a flag for setting pte write flags.
> > 
> > Otherwise the above is broken at it will only set the write flag
> > for pte that were dirty and i am guessing so far you always were
> > lucky because pte were all dirty (change_protection will preserve
> > dirtyness) when you write protected them.
> > 
> > So i believe the above is broken or at very least unclear if what
> > you really want is to only set write flag to pte that have the
> > dirty flag set.
> 
> You are right, if we build the tree until this patch it won't work for
> all the cases.  It'll only work if the page was at least writable
> before and also it's dirty (as you explained).  Sorry to be unclear
> about this, maybe I should at least mention that in the commit message
> but I totally forgot it.
> 
> All these problems are solved in later on patches, please feel free to
> have a look at:
> 
>   mm: merge parameters for change_protection()
>   userfaultfd: wp: apply _PAGE_UFFD_WP bit
>   userfaultfd: wp: handle COW properly for uffd-wp
> 
> Note that even in the follow up patches IMHO we can't directly change
> the write permission since the page can be shared by other processes
> (e.g., the zero page or COW pages).  But the general idea is the same
> as you explained.
> 
> I tried to avoid squashing these stuff altogether as explained
> previously.  Also, this patch can be seen as a standalone patch to
> introduce the new interface which seems to make sense too, and it is
> indeed still working in many cases so I see the latter patches as
> enhancement of this one.  Please let me know if you still want me to
> have all these stuff squashed, or if you'd like me to squash some of
> them.

Yeah i have look at those after looking at this one. You should just
re-order the patch this one first and then one that add new flag,
then ones that add the new userfaultfd feature. Otherwise you are
adding a userfaultfd feature that is broken midway ie it is added
broken and then you fix it. Some one bisecting thing might get hurt
by that. It is better to add and change everything you need and then
add the new feature so that the new feature will work as intended.

So no squashing just change the order ie add the userfaultfd code
last.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times
  2019-01-22 16:53       ` Jerome Glisse
@ 2019-01-23  2:12         ` Peter Xu
  2019-01-23  2:39           ` Jerome Glisse
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-23  2:12 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Jan 22, 2019 at 11:53:10AM -0500, Jerome Glisse wrote:
> On Tue, Jan 22, 2019 at 04:22:38PM +0800, Peter Xu wrote:
> > On Mon, Jan 21, 2019 at 10:55:36AM -0500, Jerome Glisse wrote:
> > > On Mon, Jan 21, 2019 at 03:57:01PM +0800, Peter Xu wrote:
> > > > The idea comes from a discussion between Linus and Andrea [1].
> > > > 
> > > > Before this patch we only allow a page fault to retry once.  We achieved
> > > > this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> > > > handle_mm_fault() the second time.  This was majorly used to avoid
> > > > unexpected starvation of the system by looping over forever to handle
> > > > the page fault on a single page.  However that should hardly happen, and
> > > > after all for each code path to return a VM_FAULT_RETRY we'll first wait
> > > > for a condition (during which time we should possibly yield the cpu) to
> > > > happen before VM_FAULT_RETRY is really returned.
> > > > 
> > > > This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
> > > > flag when we receive VM_FAULT_RETRY.  It means that the page fault
> > > > handler now can retry the page fault for multiple times if necessary
> > > > without the need to generate another page fault event. Meanwhile we
> > > > still keep the FAULT_FLAG_TRIED flag so page fault handler can still
> > > > identify whether a page fault is the first attempt or not.
> > > 
> > > So there is nothing protecting starvation after this patch ? AFAICT.
> > > Do we sufficient proof that we never have a scenario where one process
> > > might starve fault another ?
> > > 
> > > For instance some page locking could starve one process.
> > 
> > Hi, Jerome,
> > 
> > Do you mean lock_page()?
> > 
> > AFAIU lock_page() will only yield the process itself until the lock is
> > released, so IMHO it's not really starving the process but a natural
> > behavior.  After all the process may not continue without handling the
> > page fault correctly.
> > 
> > Or when you say "starvation" do you mean that we might return
> > VM_FAULT_RETRY from handle_mm_fault() continuously so we'll looping
> > over and over inside the page fault handler?
> 
> That one ie every time we retry someone else is holding the lock and
> thus lock_page_or_retry() will continuously retry. Some process just
> get unlucky ;)
> 
> With existing code because we remove the retry flag then on the second
> try we end up waiting for the page lock while holding the mmap_sem so
> we know that we are in line for the page lock and we will get it once
> it is our turn.

Ah I see. :)  It's indeed a valid questioning.

Firstly note that even after this patch we can still identify whether
we're at the first attempt or not by checking against FAULT_FLAG_TRIED
(it will be applied to the fault flag in all the retries but not in
the first atttempt). So IMHO this change might suite if we want to
keep the old behavior [1]:

diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..44942c78bb92 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1351,7 +1351,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
                         unsigned int flags)
 {
-       if (flags & FAULT_FLAG_ALLOW_RETRY) {
+       if (!flags & FAULT_FLAG_TRIED) {
                /*
                 * CAUTION! In this case, mmap_sem is not released
                 * even though return 0.

But at the same time I'm stepping back trying to see the whole
picture... My understanding is that this is really a policy that we
can decide, and a trade off between "being polite or not on the
mmap_sem", that when taking the page lock in slow path we either:

  (1) release mmap_sem before waiting, polite enough but uncertain to
      finally have the lock, or,

  (2) keep mmap_sem before waiting, not polite enough but certain to
      take the lock.

We did (2) before on the reties because in existing code we only allow
to retry once, so we can't fail on the 2nd attempt.  That seems to be
a good reason to being "unpolite" - we took the mmap_sem without
considering others because we've been "polite" once.  I'm not that
experienced in mm development but AFAIU solution 2 is only reducing
our chance of starvation but adding that chance of starvation to other
processes that want the mmap_sem instead.  So IMHO the starvation
issue always existed even before this patch, and it looks natural and
sane to me so far...  And if with that in mind, I can't say that above
change at [1] would be better, and maybe, it'll be even more fair that
we should always release the mmap_sem first in this case (assuming
that we'll after all have that lock though we might pay more times of
retries)?

Or, is there a way to constantly starve the process that handles the
page fault that I've totally missed?

Thanks,

-- 
Peter Xu

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-22 17:02       ` Jerome Glisse
@ 2019-01-23  2:17         ` Peter Xu
  2019-01-23  2:43           ` Jerome Glisse
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-23  2:17 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Tue, Jan 22, 2019 at 12:02:24PM -0500, Jerome Glisse wrote:
> On Tue, Jan 22, 2019 at 05:39:35PM +0800, Peter Xu wrote:
> > On Mon, Jan 21, 2019 at 09:05:35AM -0500, Jerome Glisse wrote:
> > 
> > [...]
> > 
> > > > +	change_protection(dst_vma, start, start + len, newprot,
> > > > +				!enable_wp, 0);
> > > 
> > > So setting dirty_accountable bring us to that code in mprotect.c:
> > > 
> > >     if (dirty_accountable && pte_dirty(ptent) &&
> > >             (pte_soft_dirty(ptent) ||
> > >              !(vma->vm_flags & VM_SOFTDIRTY))) {
> > >         ptent = pte_mkwrite(ptent);
> > >     }
> > > 
> > > My understanding is that you want to set write flag when enable_wp
> > > is false and you want to set the write flag unconditionaly, right ?
> > 
> > Right.
> > 
> > > 
> > > If so then you should really move the change_protection() flags
> > > patch before this patch and add a flag for setting pte write flags.
> > > 
> > > Otherwise the above is broken at it will only set the write flag
> > > for pte that were dirty and i am guessing so far you always were
> > > lucky because pte were all dirty (change_protection will preserve
> > > dirtyness) when you write protected them.
> > > 
> > > So i believe the above is broken or at very least unclear if what
> > > you really want is to only set write flag to pte that have the
> > > dirty flag set.
> > 
> > You are right, if we build the tree until this patch it won't work for
> > all the cases.  It'll only work if the page was at least writable
> > before and also it's dirty (as you explained).  Sorry to be unclear
> > about this, maybe I should at least mention that in the commit message
> > but I totally forgot it.
> > 
> > All these problems are solved in later on patches, please feel free to
> > have a look at:
> > 
> >   mm: merge parameters for change_protection()
> >   userfaultfd: wp: apply _PAGE_UFFD_WP bit
> >   userfaultfd: wp: handle COW properly for uffd-wp
> > 
> > Note that even in the follow up patches IMHO we can't directly change
> > the write permission since the page can be shared by other processes
> > (e.g., the zero page or COW pages).  But the general idea is the same
> > as you explained.
> > 
> > I tried to avoid squashing these stuff altogether as explained
> > previously.  Also, this patch can be seen as a standalone patch to
> > introduce the new interface which seems to make sense too, and it is
> > indeed still working in many cases so I see the latter patches as
> > enhancement of this one.  Please let me know if you still want me to
> > have all these stuff squashed, or if you'd like me to squash some of
> > them.
> 
> Yeah i have look at those after looking at this one. You should just
> re-order the patch this one first and then one that add new flag,
> then ones that add the new userfaultfd feature. Otherwise you are
> adding a userfaultfd feature that is broken midway ie it is added
> broken and then you fix it. Some one bisecting thing might get hurt
> by that. It is better to add and change everything you need and then
> add the new feature so that the new feature will work as intended.
> 
> So no squashing just change the order ie add the userfaultfd code
> last.

Yes this makes sense, I'll do that in v2.  Thanks for the suggestion!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times
  2019-01-23  2:12         ` Peter Xu
@ 2019-01-23  2:39           ` Jerome Glisse
  2019-01-24  5:45             ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-23  2:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Wed, Jan 23, 2019 at 10:12:41AM +0800, Peter Xu wrote:
> On Tue, Jan 22, 2019 at 11:53:10AM -0500, Jerome Glisse wrote:
> > On Tue, Jan 22, 2019 at 04:22:38PM +0800, Peter Xu wrote:
> > > On Mon, Jan 21, 2019 at 10:55:36AM -0500, Jerome Glisse wrote:
> > > > On Mon, Jan 21, 2019 at 03:57:01PM +0800, Peter Xu wrote:
> > > > > The idea comes from a discussion between Linus and Andrea [1].
> > > > > 
> > > > > Before this patch we only allow a page fault to retry once.  We achieved
> > > > > this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> > > > > handle_mm_fault() the second time.  This was majorly used to avoid
> > > > > unexpected starvation of the system by looping over forever to handle
> > > > > the page fault on a single page.  However that should hardly happen, and
> > > > > after all for each code path to return a VM_FAULT_RETRY we'll first wait
> > > > > for a condition (during which time we should possibly yield the cpu) to
> > > > > happen before VM_FAULT_RETRY is really returned.
> > > > > 
> > > > > This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
> > > > > flag when we receive VM_FAULT_RETRY.  It means that the page fault
> > > > > handler now can retry the page fault for multiple times if necessary
> > > > > without the need to generate another page fault event. Meanwhile we
> > > > > still keep the FAULT_FLAG_TRIED flag so page fault handler can still
> > > > > identify whether a page fault is the first attempt or not.
> > > > 
> > > > So there is nothing protecting starvation after this patch ? AFAICT.
> > > > Do we sufficient proof that we never have a scenario where one process
> > > > might starve fault another ?
> > > > 
> > > > For instance some page locking could starve one process.
> > > 
> > > Hi, Jerome,
> > > 
> > > Do you mean lock_page()?
> > > 
> > > AFAIU lock_page() will only yield the process itself until the lock is
> > > released, so IMHO it's not really starving the process but a natural
> > > behavior.  After all the process may not continue without handling the
> > > page fault correctly.
> > > 
> > > Or when you say "starvation" do you mean that we might return
> > > VM_FAULT_RETRY from handle_mm_fault() continuously so we'll looping
> > > over and over inside the page fault handler?
> > 
> > That one ie every time we retry someone else is holding the lock and
> > thus lock_page_or_retry() will continuously retry. Some process just
> > get unlucky ;)
> > 
> > With existing code because we remove the retry flag then on the second
> > try we end up waiting for the page lock while holding the mmap_sem so
> > we know that we are in line for the page lock and we will get it once
> > it is our turn.
> 
> Ah I see. :)  It's indeed a valid questioning.
> 
> Firstly note that even after this patch we can still identify whether
> we're at the first attempt or not by checking against FAULT_FLAG_TRIED
> (it will be applied to the fault flag in all the retries but not in
> the first atttempt). So IMHO this change might suite if we want to
> keep the old behavior [1]:
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9f5e323e883e..44942c78bb92 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1351,7 +1351,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>                          unsigned int flags)
>  {
> -       if (flags & FAULT_FLAG_ALLOW_RETRY) {
> +       if (!flags & FAULT_FLAG_TRIED) {
>                 /*
>                  * CAUTION! In this case, mmap_sem is not released
>                  * even though return 0.

I need to check how FAULT_FLAG_TRIED have been use so far, but yes
it looks like this would keep the existing behavior intact.

> 
> But at the same time I'm stepping back trying to see the whole
> picture... My understanding is that this is really a policy that we
> can decide, and a trade off between "being polite or not on the
> mmap_sem", that when taking the page lock in slow path we either:
> 
>   (1) release mmap_sem before waiting, polite enough but uncertain to
>       finally have the lock, or,
> 
>   (2) keep mmap_sem before waiting, not polite enough but certain to
>       take the lock.
> 
> We did (2) before on the reties because in existing code we only allow
> to retry once, so we can't fail on the 2nd attempt.  That seems to be
> a good reason to being "unpolite" - we took the mmap_sem without
> considering others because we've been "polite" once.  I'm not that
> experienced in mm development but AFAIU solution 2 is only reducing
> our chance of starvation but adding that chance of starvation to other
> processes that want the mmap_sem instead.  So IMHO the starvation
> issue always existed even before this patch, and it looks natural and
> sane to me so far...  And if with that in mind, I can't say that above
> change at [1] would be better, and maybe, it'll be even more fair that
> we should always release the mmap_sem first in this case (assuming
> that we'll after all have that lock though we might pay more times of
> retries)?

Existing code does not starves anyone, the mmap_sem is rw_semaphore
so if there is no writter waiting then no ones wait, if there is a
writter waiting then everyone wait in line so that it is fair to
writter. So with existing code we have a "fair" behavior where every-
ones wait in line their turn. After this patch we can end up in unfair
situation were one thread might be continuously starve because it is
only doing try_lock and thus it is never added to wait line.


> Or, is there a way to constantly starve the process that handles the
> page fault that I've totally missed?

That's the discussion, with your change a process can constantly
retry page fault because it never get a lock on a page, so it can
end up in an infinite fault retry.

Yes it is unlikely to be infinite, but it can change how kernel
behave to some workload and thus impact existing user.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-23  2:17         ` Peter Xu
@ 2019-01-23  2:43           ` Jerome Glisse
  2019-01-24  5:47             ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-23  2:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Wed, Jan 23, 2019 at 10:17:45AM +0800, Peter Xu wrote:
> On Tue, Jan 22, 2019 at 12:02:24PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 22, 2019 at 05:39:35PM +0800, Peter Xu wrote:
> > > On Mon, Jan 21, 2019 at 09:05:35AM -0500, Jerome Glisse wrote:
> > > 
> > > [...]
> > > 
> > > > > +	change_protection(dst_vma, start, start + len, newprot,
> > > > > +				!enable_wp, 0);
> > > > 
> > > > So setting dirty_accountable bring us to that code in mprotect.c:
> > > > 
> > > >     if (dirty_accountable && pte_dirty(ptent) &&
> > > >             (pte_soft_dirty(ptent) ||
> > > >              !(vma->vm_flags & VM_SOFTDIRTY))) {
> > > >         ptent = pte_mkwrite(ptent);
> > > >     }
> > > > 
> > > > My understanding is that you want to set write flag when enable_wp
> > > > is false and you want to set the write flag unconditionaly, right ?
> > > 
> > > Right.
> > > 
> > > > 
> > > > If so then you should really move the change_protection() flags
> > > > patch before this patch and add a flag for setting pte write flags.
> > > > 
> > > > Otherwise the above is broken at it will only set the write flag
> > > > for pte that were dirty and i am guessing so far you always were
> > > > lucky because pte were all dirty (change_protection will preserve
> > > > dirtyness) when you write protected them.
> > > > 
> > > > So i believe the above is broken or at very least unclear if what
> > > > you really want is to only set write flag to pte that have the
> > > > dirty flag set.
> > > 
> > > You are right, if we build the tree until this patch it won't work for
> > > all the cases.  It'll only work if the page was at least writable
> > > before and also it's dirty (as you explained).  Sorry to be unclear
> > > about this, maybe I should at least mention that in the commit message
> > > but I totally forgot it.
> > > 
> > > All these problems are solved in later on patches, please feel free to
> > > have a look at:
> > > 
> > >   mm: merge parameters for change_protection()
> > >   userfaultfd: wp: apply _PAGE_UFFD_WP bit
> > >   userfaultfd: wp: handle COW properly for uffd-wp
> > > 
> > > Note that even in the follow up patches IMHO we can't directly change
> > > the write permission since the page can be shared by other processes
> > > (e.g., the zero page or COW pages).  But the general idea is the same
> > > as you explained.
> > > 
> > > I tried to avoid squashing these stuff altogether as explained
> > > previously.  Also, this patch can be seen as a standalone patch to
> > > introduce the new interface which seems to make sense too, and it is
> > > indeed still working in many cases so I see the latter patches as
> > > enhancement of this one.  Please let me know if you still want me to
> > > have all these stuff squashed, or if you'd like me to squash some of
> > > them.
> > 
> > Yeah i have look at those after looking at this one. You should just
> > re-order the patch this one first and then one that add new flag,
> > then ones that add the new userfaultfd feature. Otherwise you are
> > adding a userfaultfd feature that is broken midway ie it is added
> > broken and then you fix it. Some one bisecting thing might get hurt
> > by that. It is better to add and change everything you need and then
> > add the new feature so that the new feature will work as intended.
> > 
> > So no squashing just change the order ie add the userfaultfd code
> > last.
> 
> Yes this makes sense, I'll do that in v2.  Thanks for the suggestion!

Note before doing a v2 i would really like to see some proof of why
you need new page table flag see my reply to:
    userfaultfd: wp: add WP pagetable tracking to x86

As i believe you can identify COW or KSM from UFD write protect with-
out a pte flag.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-01-21 10:42   ` Mike Rapoport
@ 2019-01-24  4:56     ` Peter Xu
  2019-01-24  7:27       ` Mike Rapoport
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-24  4:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 12:42:33PM +0200, Mike Rapoport wrote:

[...]

> > @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > 
> >  		/* check not compatible vmas */
> >  		ret = -EINVAL;
> > -		if (!vma_can_userfault(cur))
> > +		if (!vma_can_userfault(cur, vm_flags))
> >  			goto out_unlock;
> > 
> >  		/*
> > @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >  			if (end & (vma_hpagesize - 1))
> >  				goto out_unlock;
> >  		}
> > +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> > +			goto out_unlock;
> 
> This is problematic for the non-cooperative use-case. Way may still want to
> monitor a read-only area because it may eventually become writable, e.g. if
> the monitored process runs mprotect().

Firstly I think I should be able to change it to VM_MAYWRITE which
seems to suite more.

Meanwhile, frankly speaking I didn't think a lot about how to nest the
usages of uffd-wp and mprotect(), so far I was only considering it as
a replacement of mprotect().  But indeed it can happen that the
monitored process calls mprotect().  Is there an existing scenario of
such usage?

The problem is I'm uncertain about whether this scenario can work
after all.  Say, the monitor process A write protected process B's
page P, so logically A will definitely receive a message before B
writes to page P.  However here if we allow process B to do
mprotect(PROT_WRITE) upon page P and grant write permission to it on
its own, then A will not be able to capture the write operation at
all?  Then I don't know how it can work here... or whether we should
fail the mprotect() at least upon uffd-wp ranges?

> Particularity, for using uffd-wp as a replacement for soft-dirty would
> require it.
> 
> > 
> >  		/*
> >  		 * Check that this vma isn't already owned by a
> > @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >  	do {
> >  		cond_resched();
> > 
> > -		BUG_ON(!vma_can_userfault(vma));
> > +		BUG_ON(!vma_can_userfault(vma, vm_flags));
> >  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> >  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > @@ -1535,7 +1538,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> >  		 * provides for more strict behavior to notice
> >  		 * unregistration errors.
> >  		 */
> > -		if (!vma_can_userfault(cur))
> > +		if (!vma_can_userfault(cur, cur->vm_flags))
> >  			goto out_unlock;
> > 
> >  		found = true;
> > @@ -1549,7 +1552,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> >  	do {
> >  		cond_resched();
> > 
> > -		BUG_ON(!vma_can_userfault(vma));
> > +		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
> >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > 
> >  		/*
> > @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> >  	return ret;
> >  }
> > 
> > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > +				    unsigned long arg)
> > +{
> > +	int ret;
> > +	struct uffdio_writeprotect uffdio_wp;
> > +	struct uffdio_writeprotect __user *user_uffdio_wp;
> > +	struct userfaultfd_wake_range range;
> > +
> 
> In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
> layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
> was to return -EAGAIN if such race is encountered. I think the same would
> apply here.

I tried to understand the problem at [1] but failed... could you help
to clarify it a bit more?

I'm quoting some of the discussions from [1] here directly between you
and Pavel:

  > Since the monitor cannot assume that the process will access all its memory
  > it has to copy some pages "in the background". A simple monitor may look
  > like:
  > 
  > 	for (;;) {
  > 		wait_for_uffd_events(timeout);
  > 		handle_uffd_events();
  > 		uffd_copy(some not faulted pages);
  > 	}
  > 
  > Then, if the "background" uffd_copy() races with fork, the pages we've
  > copied may be already present in parent's mappings before the call to
  > copy_page_range() and may be not.
  > 
  > If the pages were not present, uffd_copy'ing them again to the child's
  > memory would be ok.
  >
  > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
  > again, child process will get memory corruption.

Here I don't understand why the child process will get memory
corruption if uffd_copy() caught the mmap_sem first.

If it did it, then IMHO when uffd_copy() copies the page again it'll
simply get a -EEXIST showing that the page has already been copied.
Could you explain on why there will be a data corruption?

Thanks in advance,

>  
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df2cc96e77011cf7989208b206da9817e0321028
>

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86
  2019-01-21 15:09   ` Jerome Glisse
@ 2019-01-24  5:16     ` Peter Xu
  2019-01-24 15:40       ` Jerome Glisse
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-24  5:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 10:09:38AM -0500, Jerome Glisse wrote:
> On Mon, Jan 21, 2019 at 03:57:08PM +0800, Peter Xu wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Accurate userfaultfd WP tracking is possible by tracking exactly which
> > virtual memory ranges were writeprotected by userland. We can't relay
> > only on the RW bit of the mapped pagetable because that information is
> > destroyed by fork() or KSM or swap. If we were to relay on that, we'd
> > need to stay on the safe side and generate false positive wp faults
> > for every swapped out page.

(I'm trying to leave comments with my own understanding here; they
 might not be the original purposes when Andrea proposed the idea.
 Andrea, please feel free to chim in anytime especially if I am
 wrong... :-)

> 
> So you want to forward write fault (of a protected range) to user space
> only if page is not write protected because of fork(), KSM or swap.
> 
> This write protection feature is only for anonymous page right ? Other-
> wise how would you protect a share page (ie anyone can look it up and
> call page_mkwrite on it and start writting to it) ?

AFAIU we want to support shared memory too in the future.  One example
I can think of is current QEMU usage with DPDK: we have two processes
sharing the guest memory range.  So indeed this might not work if
there are unknown/malicious users of the shared memory, however in
many use cases the users are all known and AFAIU we should just write
protect all these users then we'll still get notified when any of them
write to a page.

> 
> So for anonymous page for fork() the mapcount will tell you if page is
> write protected for COW. For KSM it is easy check the page flag.

Yes I agree that KSM should be easy.  But for COW, please consider
when we write protect a page that was shared and RW removed due to
COW.  Then when we page fault on this page should we report to the
monitor?  IMHO we can't know if without a specific bit in the PTE.

> 
> For swap you can use the page lock to synchronize. A page that is
> write protected because of swap is write protected because it is being
> write to disk thus either under page lock, or with PageWriteback()
> returning true while write is on going.

For swap I think the major problem is when the page was swapped out of
main memory and then we write to the page (which was already a swap
entry now).  Then we'll first swap in the page into main memory again,
but then IMHO we will face the similar issue like COW above - we can't
judge whether this page was write protected by uffd-wp at all.  Of
course here we can detect the VMA flags and assuming it's write
protected if the UFFD_WP flag was set on the VMA flag, however we'll
also mark those pages which were not write protected at all hence
it'll generate false positives of write protection messages.  This
idea can apply too to above COW use case.  As a conclusion, in these
use cases we should not be able to identify explicitly on page
granularity write protection if without a specific _PAGE_UFFD_WP bit
in the PTE entries.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 13/24] mm: merge parameters for change_protection()
  2019-01-21 13:54   ` Jerome Glisse
@ 2019-01-24  5:22     ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-24  5:22 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 08:54:46AM -0500, Jerome Glisse wrote:
> On Mon, Jan 21, 2019 at 03:57:11PM +0800, Peter Xu wrote:
> > change_protection() was used by either the NUMA or mprotect() code,
> > there's one parameter for each of the callers (dirty_accountable and
> > prot_numa).  Further, these parameters are passed along the calls:
> > 
> >   - change_protection_range()
> >   - change_p4d_range()
> >   - change_pud_range()
> >   - change_pmd_range()
> >   - ...
> > 
> > Now we introduce a flag for change_protect() and all these helpers to
> > replace these parameters.  Then we can avoid passing multiple parameters
> > multiple times along the way.
> > 
> > More importantly, it'll greatly simplify the work if we want to
> > introduce any new parameters to change_protection().  In the follow up
> > patches, a new parameter for userfaultfd write protection will be
> > introduced.
> > 
> > No functional change at all.
> 
> There is one change i could spot and also something that looks wrong.
> 
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> 
> [...]
> 
> > @@ -428,8 +431,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> >  	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
> >  	vma_set_page_prot(vma);
> >  
> > -	change_protection(vma, start, end, vma->vm_page_prot,
> > -			  dirty_accountable, 0);
> > +	change_protection(vma, start, end, vma->vm_page_prot, MM_CP_DIRTY_ACCT);
> 
> Here you unconditionaly see the DIRTY_ACCT flag instead it should be
> something like:
> 
>     s/dirty_accountable/cp_flags
>     if (vma_wants_writenotify(vma, vma->vm_page_prot))
>         cp_flags = MM_CP_DIRTY_ACCT;
>     else
>         cp_flags = 0;
> 
>     change_protection(vma, start, end, vma->vm_page_prot, cp_flags);
> 
> Or any equivalent construct.

Oops, thanks for spotting this... it was definitely wrong.  I'll fix.

> 
> >  	/*
> >  	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 005291b9b62f..23d4bbd117ee 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -674,7 +674,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> >  		newprot = vm_get_page_prot(dst_vma->vm_flags);
> >  
> >  	change_protection(dst_vma, start, start + len, newprot,
> > -				!enable_wp, 0);
> > +			  enable_wp ? 0 : MM_CP_DIRTY_ACCT);
> 
> We had a discussion in the past on that, i have not look at other
> patches but this seems wrong to me. MM_CP_DIRTY_ACCT is an
> optimization to keep a pte with write permission if it is dirty
> while my understanding is that you want to set write flag for pte
> unconditionaly.
> 
> So maybe this patch that adds flag should be earlier in the serie
> so that you can add a flag to do that before introducing the UFD
> mwriteprotect_range() function.

I agree.  I'm going to move the UFFDIO_WRITEPROTECT patch to the last
so I'll rearrange this part too so these lines will be removed in my
next version.

Thanks!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 20/24] userfaultfd: wp: don't wake up when doing write protect
  2019-01-21 11:10   ` Mike Rapoport
@ 2019-01-24  5:36     ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-24  5:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 01:10:39PM +0200, Mike Rapoport wrote:
> On Mon, Jan 21, 2019 at 03:57:18PM +0800, Peter Xu wrote:
> > It does not make sense to try to wake up any waiting thread when we're
> > write-protecting a memory region.  Only wake up when resolving a write
> > protected page fault.
> 
> Probably it would be better to make it default to wake up only when
> requested explicitly?

Yes, I think that's what this series does?

Now when we do UFFDIO_WRITEPROTECT with !WP and !DONTWAKE then we'll
first resolve the page fault, then wake up the process properly.  And
we request that explicity using !WP and DONTWAKE.

Or did I misunderstood the question?

> Then we can simply disallow _DONTWAKE for uffd_wp and only use
> UFFDIO_WRITEPROTECT_MODE_WP as possible mode.

I'd admit I don't know the major usage of DONTWAKE (and I'd be glad to
know...), however since we have this flag for both UFFDIO_COPY and
UFFDIO_ZEROCOPY, then it seems sane to have DONTWAKE for WRITEPROTECT
too?  Or is there any other explicit reason to omit it?

Thanks!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times
  2019-01-23  2:39           ` Jerome Glisse
@ 2019-01-24  5:45             ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-24  5:45 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Jan 22, 2019 at 09:39:47PM -0500, Jerome Glisse wrote:
> On Wed, Jan 23, 2019 at 10:12:41AM +0800, Peter Xu wrote:
> > On Tue, Jan 22, 2019 at 11:53:10AM -0500, Jerome Glisse wrote:
> > > On Tue, Jan 22, 2019 at 04:22:38PM +0800, Peter Xu wrote:
> > > > On Mon, Jan 21, 2019 at 10:55:36AM -0500, Jerome Glisse wrote:
> > > > > On Mon, Jan 21, 2019 at 03:57:01PM +0800, Peter Xu wrote:
> > > > > > The idea comes from a discussion between Linus and Andrea [1].
> > > > > > 
> > > > > > Before this patch we only allow a page fault to retry once.  We achieved
> > > > > > this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> > > > > > handle_mm_fault() the second time.  This was majorly used to avoid
> > > > > > unexpected starvation of the system by looping over forever to handle
> > > > > > the page fault on a single page.  However that should hardly happen, and
> > > > > > after all for each code path to return a VM_FAULT_RETRY we'll first wait
> > > > > > for a condition (during which time we should possibly yield the cpu) to
> > > > > > happen before VM_FAULT_RETRY is really returned.
> > > > > > 
> > > > > > This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
> > > > > > flag when we receive VM_FAULT_RETRY.  It means that the page fault
> > > > > > handler now can retry the page fault for multiple times if necessary
> > > > > > without the need to generate another page fault event. Meanwhile we
> > > > > > still keep the FAULT_FLAG_TRIED flag so page fault handler can still
> > > > > > identify whether a page fault is the first attempt or not.
> > > > > 
> > > > > So there is nothing protecting starvation after this patch ? AFAICT.
> > > > > Do we sufficient proof that we never have a scenario where one process
> > > > > might starve fault another ?
> > > > > 
> > > > > For instance some page locking could starve one process.
> > > > 
> > > > Hi, Jerome,
> > > > 
> > > > Do you mean lock_page()?
> > > > 
> > > > AFAIU lock_page() will only yield the process itself until the lock is
> > > > released, so IMHO it's not really starving the process but a natural
> > > > behavior.  After all the process may not continue without handling the
> > > > page fault correctly.
> > > > 
> > > > Or when you say "starvation" do you mean that we might return
> > > > VM_FAULT_RETRY from handle_mm_fault() continuously so we'll looping
> > > > over and over inside the page fault handler?
> > > 
> > > That one ie every time we retry someone else is holding the lock and
> > > thus lock_page_or_retry() will continuously retry. Some process just
> > > get unlucky ;)
> > > 
> > > With existing code because we remove the retry flag then on the second
> > > try we end up waiting for the page lock while holding the mmap_sem so
> > > we know that we are in line for the page lock and we will get it once
> > > it is our turn.
> > 
> > Ah I see. :)  It's indeed a valid questioning.
> > 
> > Firstly note that even after this patch we can still identify whether
> > we're at the first attempt or not by checking against FAULT_FLAG_TRIED
> > (it will be applied to the fault flag in all the retries but not in
> > the first atttempt). So IMHO this change might suite if we want to
> > keep the old behavior [1]:
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 9f5e323e883e..44942c78bb92 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1351,7 +1351,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
> >  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> >                          unsigned int flags)
> >  {
> > -       if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > +       if (!flags & FAULT_FLAG_TRIED) {
> >                 /*
> >                  * CAUTION! In this case, mmap_sem is not released
> >                  * even though return 0.
> 
> I need to check how FAULT_FLAG_TRIED have been use so far, but yes
> it looks like this would keep the existing behavior intact.
> 
> > 
> > But at the same time I'm stepping back trying to see the whole
> > picture... My understanding is that this is really a policy that we
> > can decide, and a trade off between "being polite or not on the
> > mmap_sem", that when taking the page lock in slow path we either:
> > 
> >   (1) release mmap_sem before waiting, polite enough but uncertain to
> >       finally have the lock, or,
> > 
> >   (2) keep mmap_sem before waiting, not polite enough but certain to
> >       take the lock.
> > 
> > We did (2) before on the reties because in existing code we only allow
> > to retry once, so we can't fail on the 2nd attempt.  That seems to be
> > a good reason to being "unpolite" - we took the mmap_sem without
> > considering others because we've been "polite" once.  I'm not that
> > experienced in mm development but AFAIU solution 2 is only reducing
> > our chance of starvation but adding that chance of starvation to other
> > processes that want the mmap_sem instead.  So IMHO the starvation
> > issue always existed even before this patch, and it looks natural and
> > sane to me so far...  And if with that in mind, I can't say that above
> > change at [1] would be better, and maybe, it'll be even more fair that
> > we should always release the mmap_sem first in this case (assuming
> > that we'll after all have that lock though we might pay more times of
> > retries)?
> 
> Existing code does not starves anyone, the mmap_sem is rw_semaphore
> so if there is no writter waiting then no ones wait, if there is a
> writter waiting then everyone wait in line so that it is fair to
> writter. So with existing code we have a "fair" behavior where every-
> ones wait in line their turn. After this patch we can end up in unfair
> situation were one thread might be continuously starve because it is
> only doing try_lock and thus it is never added to wait line.

I see the point.  Thanks for explaining it.

> 
> 
> > Or, is there a way to constantly starve the process that handles the
> > page fault that I've totally missed?
> 
> That's the discussion, with your change a process can constantly
> retry page fault because it never get a lock on a page, so it can
> end up in an infinite fault retry.
> 
> Yes it is unlikely to be infinite, but it can change how kernel
> behave to some workload and thus impact existing user.

Yes and even if anyone wants to change the behavior maybe it can be
changed after a proper justification, then it makes sense to me that I
squash above oneliner into this patch to keep the existing page
locking behavior.

Thanks again,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range
  2019-01-23  2:43           ` Jerome Glisse
@ 2019-01-24  5:47             ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-24  5:47 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Tue, Jan 22, 2019 at 09:43:38PM -0500, Jerome Glisse wrote:
> On Wed, Jan 23, 2019 at 10:17:45AM +0800, Peter Xu wrote:
> > On Tue, Jan 22, 2019 at 12:02:24PM -0500, Jerome Glisse wrote:
> > > On Tue, Jan 22, 2019 at 05:39:35PM +0800, Peter Xu wrote:
> > > > On Mon, Jan 21, 2019 at 09:05:35AM -0500, Jerome Glisse wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > +	change_protection(dst_vma, start, start + len, newprot,
> > > > > > +				!enable_wp, 0);
> > > > > 
> > > > > So setting dirty_accountable bring us to that code in mprotect.c:
> > > > > 
> > > > >     if (dirty_accountable && pte_dirty(ptent) &&
> > > > >             (pte_soft_dirty(ptent) ||
> > > > >              !(vma->vm_flags & VM_SOFTDIRTY))) {
> > > > >         ptent = pte_mkwrite(ptent);
> > > > >     }
> > > > > 
> > > > > My understanding is that you want to set write flag when enable_wp
> > > > > is false and you want to set the write flag unconditionaly, right ?
> > > > 
> > > > Right.
> > > > 
> > > > > 
> > > > > If so then you should really move the change_protection() flags
> > > > > patch before this patch and add a flag for setting pte write flags.
> > > > > 
> > > > > Otherwise the above is broken at it will only set the write flag
> > > > > for pte that were dirty and i am guessing so far you always were
> > > > > lucky because pte were all dirty (change_protection will preserve
> > > > > dirtyness) when you write protected them.
> > > > > 
> > > > > So i believe the above is broken or at very least unclear if what
> > > > > you really want is to only set write flag to pte that have the
> > > > > dirty flag set.
> > > > 
> > > > You are right, if we build the tree until this patch it won't work for
> > > > all the cases.  It'll only work if the page was at least writable
> > > > before and also it's dirty (as you explained).  Sorry to be unclear
> > > > about this, maybe I should at least mention that in the commit message
> > > > but I totally forgot it.
> > > > 
> > > > All these problems are solved in later on patches, please feel free to
> > > > have a look at:
> > > > 
> > > >   mm: merge parameters for change_protection()
> > > >   userfaultfd: wp: apply _PAGE_UFFD_WP bit
> > > >   userfaultfd: wp: handle COW properly for uffd-wp
> > > > 
> > > > Note that even in the follow up patches IMHO we can't directly change
> > > > the write permission since the page can be shared by other processes
> > > > (e.g., the zero page or COW pages).  But the general idea is the same
> > > > as you explained.
> > > > 
> > > > I tried to avoid squashing these stuff altogether as explained
> > > > previously.  Also, this patch can be seen as a standalone patch to
> > > > introduce the new interface which seems to make sense too, and it is
> > > > indeed still working in many cases so I see the latter patches as
> > > > enhancement of this one.  Please let me know if you still want me to
> > > > have all these stuff squashed, or if you'd like me to squash some of
> > > > them.
> > > 
> > > Yeah i have look at those after looking at this one. You should just
> > > re-order the patch this one first and then one that add new flag,
> > > then ones that add the new userfaultfd feature. Otherwise you are
> > > adding a userfaultfd feature that is broken midway ie it is added
> > > broken and then you fix it. Some one bisecting thing might get hurt
> > > by that. It is better to add and change everything you need and then
> > > add the new feature so that the new feature will work as intended.
> > > 
> > > So no squashing just change the order ie add the userfaultfd code
> > > last.
> > 
> > Yes this makes sense, I'll do that in v2.  Thanks for the suggestion!
> 
> Note before doing a v2 i would really like to see some proof of why
> you need new page table flag see my reply to:
>     userfaultfd: wp: add WP pagetable tracking to x86
> 
> As i believe you can identify COW or KSM from UFD write protect with-
> out a pte flag.

Yes.  I replied in that thread with my understanding on why the new
bit is required in the PTE (and also another new bit in the swap
entry).  We can discuss there.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 04/24] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-01-21 16:24   ` Jerome Glisse
@ 2019-01-24  7:05     ` Peter Xu
  2019-01-24 15:34       ` Jerome Glisse
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-24  7:05 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Jan 21, 2019 at 11:24:55AM -0500, Jerome Glisse wrote:
> On Mon, Jan 21, 2019 at 03:57:02PM +0800, Peter Xu wrote:
> > This is the gup counterpart of the change that allows the VM_FAULT_RETRY
> > to happen for more than once.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> So it would be nice to add a comment in the code and in the commit message
> about possible fault starvation (mostly due to previous patch changes) as
> if some one experience that and try to bisect it might overlook the commit.
> 
> Otherwise:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Jerome, can I still keep this r-b if I'm going to fix the starvation
issue you mentioned in previous patch about lock page?

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-01-24  4:56     ` Peter Xu
@ 2019-01-24  7:27       ` Mike Rapoport
  2019-01-24  9:28         ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2019-01-24  7:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Jan 24, 2019 at 12:56:15PM +0800, Peter Xu wrote:
> On Mon, Jan 21, 2019 at 12:42:33PM +0200, Mike Rapoport wrote:
> 
> [...]
> 
> > > @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > 
> > >  		/* check not compatible vmas */
> > >  		ret = -EINVAL;
> > > -		if (!vma_can_userfault(cur))
> > > +		if (!vma_can_userfault(cur, vm_flags))
> > >  			goto out_unlock;
> > > 
> > >  		/*
> > > @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > >  			if (end & (vma_hpagesize - 1))
> > >  				goto out_unlock;
> > >  		}
> > > +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> > > +			goto out_unlock;
> > 
> > This is problematic for the non-cooperative use-case. Way may still want to
> > monitor a read-only area because it may eventually become writable, e.g. if
> > the monitored process runs mprotect().
> 
> Firstly I think I should be able to change it to VM_MAYWRITE which
> seems to suite more.
> 
> Meanwhile, frankly speaking I didn't think a lot about how to nest the
> usages of uffd-wp and mprotect(), so far I was only considering it as
> a replacement of mprotect().  But indeed it can happen that the
> monitored process calls mprotect().  Is there an existing scenario of
> such usage?
> 
> The problem is I'm uncertain about whether this scenario can work
> after all.  Say, the monitor process A write protected process B's
> page P, so logically A will definitely receive a message before B
> writes to page P.  However here if we allow process B to do
> mprotect(PROT_WRITE) upon page P and grant write permission to it on
> its own, then A will not be able to capture the write operation at
> all?  Then I don't know how it can work here... or whether we should
> fail the mprotect() at least upon uffd-wp ranges?

The use-case we've discussed a while ago was to use uffd-wp instead of
soft-dirty for tracking memory changes in CRIU for pre-copy migration.
Currently, we enable soft-dirty for the migrated process and monitor
/proc/pid/pagemap between memory dump iterations to see what memory pages
have been changed.
With uffd-wp we thought to register all the process memory with uffd-wp and
then track changes with uffd-wp notifications. Back then it was considered
only at the very general level without paying much attention to details.

So my initial thought was that we do register the entire memory with
uffd-wp. If an area changes from RO to RW at some point, uffd-wp will
generate notifications to the monitor, it would be able to notice the
change and the write will continue normally.

If we are to limit uffd-wp register only to VMAs with VM_WRITE and even
VM_MAYWRITE, we'd need a way to handle the possible changes of VMA
protection and an ability to add monitoring for areas that changed from RO
to RW.

Can't say I have a clear picture in mind at the moment, will continue to
think about it.

> > Particularity, for using uffd-wp as a replacement for soft-dirty would
> > require it.
> > 
> > > 
> > >  		/*
> > >  		 * Check that this vma isn't already owned by a
> > > @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > >  	do {
> > >  		cond_resched();
> > > 
> > > -		BUG_ON(!vma_can_userfault(vma));
> > > +		BUG_ON(!vma_can_userfault(vma, vm_flags));
> > >  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> > >  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> > >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > @@ -1535,7 +1538,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > >  		 * provides for more strict behavior to notice
> > >  		 * unregistration errors.
> > >  		 */
> > > -		if (!vma_can_userfault(cur))
> > > +		if (!vma_can_userfault(cur, cur->vm_flags))
> > >  			goto out_unlock;
> > > 
> > >  		found = true;
> > > @@ -1549,7 +1552,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > >  	do {
> > >  		cond_resched();
> > > 
> > > -		BUG_ON(!vma_can_userfault(vma));
> > > +		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
> > >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > 
> > >  		/*
> > > @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> > >  	return ret;
> > >  }
> > > 
> > > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > +				    unsigned long arg)
> > > +{
> > > +	int ret;
> > > +	struct uffdio_writeprotect uffdio_wp;
> > > +	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > +	struct userfaultfd_wake_range range;
> > > +
> > 
> > In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
> > layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
> > was to return -EAGAIN if such race is encountered. I think the same would
> > apply here.
> 
> I tried to understand the problem at [1] but failed... could you help
> to clarify it a bit more?
> 
> I'm quoting some of the discussions from [1] here directly between you
> and Pavel:
> 
>   > Since the monitor cannot assume that the process will access all its memory
>   > it has to copy some pages "in the background". A simple monitor may look
>   > like:
>   > 
>   > 	for (;;) {
>   > 		wait_for_uffd_events(timeout);
>   > 		handle_uffd_events();
>   > 		uffd_copy(some not faulted pages);
>   > 	}
>   > 
>   > Then, if the "background" uffd_copy() races with fork, the pages we've
>   > copied may be already present in parent's mappings before the call to
>   > copy_page_range() and may be not.
>   > 
>   > If the pages were not present, uffd_copy'ing them again to the child's
>   > memory would be ok.
>   >
>   > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
>   > again, child process will get memory corruption.
> 
> Here I don't understand why the child process will get memory
> corruption if uffd_copy() caught the mmap_sem first.
> 
> If it did it, then IMHO when uffd_copy() copies the page again it'll
> simply get a -EEXIST showing that the page has already been copied.
> Could you explain on why there will be a data corruption?

Let's say we do post-copy migration of a process A with CRIU and its page at
address 0x1000 is already copied. Now it modifies the contents of this
page. At this point the contents of the page at 0x1000 is different on the
source and the destination.
Next, process A forks process B. The CRIU's uffd monitor gets
UFFD_EVENT_FORK, and starts filling process B memory with UFFDIO_COPY.
It may happen, that UFFDIO_COPY to 0x1000 of the process B will occur
*before* fork() completes and it may race with copy_page_range().
If UFFDIO_COPY wins the race, it will fill the page with the contents from
the source, although the correct data is what process A set in that page.

Hope it helps.

> Thanks in advance,
> 
> >  
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df2cc96e77011cf7989208b206da9817e0321028
> >
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-01-24  7:27       ` Mike Rapoport
@ 2019-01-24  9:28         ` Peter Xu
  2019-01-25  7:54           ` Mike Rapoport
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2019-01-24  9:28 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Jan 24, 2019 at 09:27:07AM +0200, Mike Rapoport wrote:
> On Thu, Jan 24, 2019 at 12:56:15PM +0800, Peter Xu wrote:
> > On Mon, Jan 21, 2019 at 12:42:33PM +0200, Mike Rapoport wrote:
> > 
> > [...]
> > 
> > > > @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > 
> > > >  		/* check not compatible vmas */
> > > >  		ret = -EINVAL;
> > > > -		if (!vma_can_userfault(cur))
> > > > +		if (!vma_can_userfault(cur, vm_flags))
> > > >  			goto out_unlock;
> > > > 
> > > >  		/*
> > > > @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > >  			if (end & (vma_hpagesize - 1))
> > > >  				goto out_unlock;
> > > >  		}
> > > > +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> > > > +			goto out_unlock;
> > > 
> > > This is problematic for the non-cooperative use-case. Way may still want to
> > > monitor a read-only area because it may eventually become writable, e.g. if
> > > the monitored process runs mprotect().
> > 
> > Firstly I think I should be able to change it to VM_MAYWRITE which
> > seems to suite more.
> > 
> > Meanwhile, frankly speaking I didn't think a lot about how to nest the
> > usages of uffd-wp and mprotect(), so far I was only considering it as
> > a replacement of mprotect().  But indeed it can happen that the
> > monitored process calls mprotect().  Is there an existing scenario of
> > such usage?
> > 
> > The problem is I'm uncertain about whether this scenario can work
> > after all.  Say, the monitor process A write protected process B's
> > page P, so logically A will definitely receive a message before B
> > writes to page P.  However here if we allow process B to do
> > mprotect(PROT_WRITE) upon page P and grant write permission to it on
> > its own, then A will not be able to capture the write operation at
> > all?  Then I don't know how it can work here... or whether we should
> > fail the mprotect() at least upon uffd-wp ranges?
> 
> The use-case we've discussed a while ago was to use uffd-wp instead of
> soft-dirty for tracking memory changes in CRIU for pre-copy migration.
> Currently, we enable soft-dirty for the migrated process and monitor
> /proc/pid/pagemap between memory dump iterations to see what memory pages
> have been changed.
> With uffd-wp we thought to register all the process memory with uffd-wp and
> then track changes with uffd-wp notifications. Back then it was considered
> only at the very general level without paying much attention to details.
> 
> So my initial thought was that we do register the entire memory with
> uffd-wp. If an area changes from RO to RW at some point, uffd-wp will
> generate notifications to the monitor, it would be able to notice the
> change and the write will continue normally.
> 
> If we are to limit uffd-wp register only to VMAs with VM_WRITE and even
> VM_MAYWRITE, we'd need a way to handle the possible changes of VMA
> protection and an ability to add monitoring for areas that changed from RO
> to RW.
> 
> Can't say I have a clear picture in mind at the moment, will continue to
> think about it.

Thanks for these details.  Though I have a question about how it's
used.

Since we're talking about replacing soft dirty with uffd-wp here, I
noticed that there's a major interface difference between soft-dirty
and uffd-wp: the soft-dirty was all about /proc operations so a
monitor process can easily monitor mostly any process on the system as
long as knowing its PID.  However I'm unsure about uffd-wp since
userfaultfd was always bound to a mm_struct.  For example, the syscall
userfaultfd() will always attach the current process mm_struct to the
newly created userfaultfd but it cannot be attached to another random
mm_struct of other processes.  Or is there any way that the CRIU
monitor process can gain an userfaultfd of any process of the system
somehow?

> 
> > > Particularity, for using uffd-wp as a replacement for soft-dirty would
> > > require it.
> > > 
> > > > 
> > > >  		/*
> > > >  		 * Check that this vma isn't already owned by a
> > > > @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > >  	do {
> > > >  		cond_resched();
> > > > 
> > > > -		BUG_ON(!vma_can_userfault(vma));
> > > > +		BUG_ON(!vma_can_userfault(vma, vm_flags));
> > > >  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> > > >  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> > > >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > > @@ -1535,7 +1538,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > >  		 * provides for more strict behavior to notice
> > > >  		 * unregistration errors.
> > > >  		 */
> > > > -		if (!vma_can_userfault(cur))
> > > > +		if (!vma_can_userfault(cur, cur->vm_flags))
> > > >  			goto out_unlock;
> > > > 
> > > >  		found = true;
> > > > @@ -1549,7 +1552,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > >  	do {
> > > >  		cond_resched();
> > > > 
> > > > -		BUG_ON(!vma_can_userfault(vma));
> > > > +		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
> > > >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > > 
> > > >  		/*
> > > > @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> > > >  	return ret;
> > > >  }
> > > > 
> > > > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > +				    unsigned long arg)
> > > > +{
> > > > +	int ret;
> > > > +	struct uffdio_writeprotect uffdio_wp;
> > > > +	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > > +	struct userfaultfd_wake_range range;
> > > > +
> > > 
> > > In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
> > > layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
> > > was to return -EAGAIN if such race is encountered. I think the same would
> > > apply here.
> > 
> > I tried to understand the problem at [1] but failed... could you help
> > to clarify it a bit more?
> > 
> > I'm quoting some of the discussions from [1] here directly between you
> > and Pavel:
> > 
> >   > Since the monitor cannot assume that the process will access all its memory
> >   > it has to copy some pages "in the background". A simple monitor may look
> >   > like:
> >   > 
> >   > 	for (;;) {
> >   > 		wait_for_uffd_events(timeout);
> >   > 		handle_uffd_events();
> >   > 		uffd_copy(some not faulted pages);
> >   > 	}
> >   > 
> >   > Then, if the "background" uffd_copy() races with fork, the pages we've
> >   > copied may be already present in parent's mappings before the call to
> >   > copy_page_range() and may be not.
> >   > 
> >   > If the pages were not present, uffd_copy'ing them again to the child's
> >   > memory would be ok.
> >   >
> >   > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
> >   > again, child process will get memory corruption.
> > 
> > Here I don't understand why the child process will get memory
> > corruption if uffd_copy() caught the mmap_sem first.
> > 
> > If it did it, then IMHO when uffd_copy() copies the page again it'll
> > simply get a -EEXIST showing that the page has already been copied.
> > Could you explain on why there will be a data corruption?
> 
> Let's say we do post-copy migration of a process A with CRIU and its page at
> address 0x1000 is already copied. Now it modifies the contents of this
> page. At this point the contents of the page at 0x1000 is different on the
> source and the destination.
> Next, process A forks process B. The CRIU's uffd monitor gets
> UFFD_EVENT_FORK, and starts filling process B memory with UFFDIO_COPY.
> It may happen, that UFFDIO_COPY to 0x1000 of the process B will occur

I think this is the place I started to get confused...

The mmap copy phase and the FORK event path is in dup_mmap() as
mentioned in the patch too:

     dup_mmap()
        down_write(old_mm)
        down_write(new_mm)
        foreach(vma)
            copy_page_range()            (a)
        up_write(new_mm)
        up_write(old_mm)
        dup_userfaultfd_complete()       (b)

Here if we already received UFFD_EVENT_FORK and started to copy pages
to process B in the background, then we should have at least passed
(b) above since otherwise we won't even know the existance of process
B.  However if so, we should have already passed the point to copy
data at (a) too, then how could copy_page_range() race?  It seems that
I might have missed something important out there but it's not easy
for me to figure out myself...

Thanks,

> *before* fork() completes and it may race with copy_page_range().
> If UFFDIO_COPY wins the race, it will fill the page with the contents from
> the source, although the correct data is what process A set in that page.
> 
> Hope it helps.

> > >  
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df2cc96e77011cf7989208b206da9817e0321028
> > >

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 04/24] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-01-24  7:05     ` Peter Xu
@ 2019-01-24 15:34       ` Jerome Glisse
  2019-01-25  2:49         ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-24 15:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Thu, Jan 24, 2019 at 03:05:03PM +0800, Peter Xu wrote:
> On Mon, Jan 21, 2019 at 11:24:55AM -0500, Jerome Glisse wrote:
> > On Mon, Jan 21, 2019 at 03:57:02PM +0800, Peter Xu wrote:
> > > This is the gup counterpart of the change that allows the VM_FAULT_RETRY
> > > to happen for more than once.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > So it would be nice to add a comment in the code and in the commit message
> > about possible fault starvation (mostly due to previous patch changes) as
> > if some one experience that and try to bisect it might overlook the commit.
> > 
> > Otherwise:
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> Jerome, can I still keep this r-b if I'm going to fix the starvation
> issue you mentioned in previous patch about lock page?
> 

No please, i still want to review properly the oneline ie making sure
that it will not change any of the existing use of FAULT_FLAG_TRIED
I am finishing a bunch of patches myself so i am bit short on time right
now to take a deeper look but i will try to do that in next few days :)

In anycase i will review again your next posting.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86
  2019-01-24  5:16     ` Peter Xu
@ 2019-01-24 15:40       ` Jerome Glisse
  2019-01-25  3:30         ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Jerome Glisse @ 2019-01-24 15:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Jan 24, 2019 at 01:16:16PM +0800, Peter Xu wrote:
> On Mon, Jan 21, 2019 at 10:09:38AM -0500, Jerome Glisse wrote:
> > On Mon, Jan 21, 2019 at 03:57:08PM +0800, Peter Xu wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > Accurate userfaultfd WP tracking is possible by tracking exactly which
> > > virtual memory ranges were writeprotected by userland. We can't relay
> > > only on the RW bit of the mapped pagetable because that information is
> > > destroyed by fork() or KSM or swap. If we were to relay on that, we'd
> > > need to stay on the safe side and generate false positive wp faults
> > > for every swapped out page.
> 
> (I'm trying to leave comments with my own understanding here; they
>  might not be the original purposes when Andrea proposed the idea.
>  Andrea, please feel free to chim in anytime especially if I am
>  wrong... :-)
> 
> > 
> > So you want to forward write fault (of a protected range) to user space
> > only if page is not write protected because of fork(), KSM or swap.
> > 
> > This write protection feature is only for anonymous page right ? Other-
> > wise how would you protect a share page (ie anyone can look it up and
> > call page_mkwrite on it and start writting to it) ?
> 
> AFAIU we want to support shared memory too in the future.  One example
> I can think of is current QEMU usage with DPDK: we have two processes
> sharing the guest memory range.  So indeed this might not work if
> there are unknown/malicious users of the shared memory, however in
> many use cases the users are all known and AFAIU we should just write
> protect all these users then we'll still get notified when any of them
> write to a page.
> 
> > 
> > So for anonymous page for fork() the mapcount will tell you if page is
> > write protected for COW. For KSM it is easy check the page flag.
> 
> Yes I agree that KSM should be easy.  But for COW, please consider
> when we write protect a page that was shared and RW removed due to
> COW.  Then when we page fault on this page should we report to the
> monitor?  IMHO we can't know if without a specific bit in the PTE.
> 
> > 
> > For swap you can use the page lock to synchronize. A page that is
> > write protected because of swap is write protected because it is being
> > write to disk thus either under page lock, or with PageWriteback()
> > returning true while write is on going.
> 
> For swap I think the major problem is when the page was swapped out of
> main memory and then we write to the page (which was already a swap
> entry now).  Then we'll first swap in the page into main memory again,
> but then IMHO we will face the similar issue like COW above - we can't
> judge whether this page was write protected by uffd-wp at all.  Of
> course here we can detect the VMA flags and assuming it's write
> protected if the UFFD_WP flag was set on the VMA flag, however we'll
> also mark those pages which were not write protected at all hence
> it'll generate false positives of write protection messages.  This
> idea can apply too to above COW use case.  As a conclusion, in these
> use cases we should not be able to identify explicitly on page
> granularity write protection if without a specific _PAGE_UFFD_WP bit
> in the PTE entries.

So i need to think a bit more on this, probably not right now
but just so i get the chain of event properly:
  1 - user space ioctl UFD to write protect a range
  2 - UFD set a flag on the vma and update CPU page table
  3 - page can be individualy write faulted and it sends a
      signal to UFD listener and they handle the fault
  4 - UFD kernel update the page table once userspace have
      handled the fault and sent result to UFD. At this point
      the vma still has the UFD write protect flag set.

So at any point in time in a range you might have writeable
pte that correspond to already handled UFD write fault. Now
if COW,KSM or swap happens on those then on the next write
fault you do not want to send a signal to userspace but handle
the fault just as usual ?

I believe this is the event flow, so i will ponder on this some
more :)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 04/24] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-01-24 15:34       ` Jerome Glisse
@ 2019-01-25  2:49         ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-25  2:49 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Marty McFadden, Mike Rapoport,
	Mel Gorman, Kirill A . Shutemov, Dr . David Alan Gilbert

On Thu, Jan 24, 2019 at 10:34:32AM -0500, Jerome Glisse wrote:
> On Thu, Jan 24, 2019 at 03:05:03PM +0800, Peter Xu wrote:
> > On Mon, Jan 21, 2019 at 11:24:55AM -0500, Jerome Glisse wrote:
> > > On Mon, Jan 21, 2019 at 03:57:02PM +0800, Peter Xu wrote:
> > > > This is the gup counterpart of the change that allows the VM_FAULT_RETRY
> > > > to happen for more than once.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > So it would be nice to add a comment in the code and in the commit message
> > > about possible fault starvation (mostly due to previous patch changes) as
> > > if some one experience that and try to bisect it might overlook the commit.
> > > 
> > > Otherwise:
> > > 
> > > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > 
> > Jerome, can I still keep this r-b if I'm going to fix the starvation
> > issue you mentioned in previous patch about lock page?
> > 
> 
> No please, i still want to review properly the oneline ie making sure
> that it will not change any of the existing use of FAULT_FLAG_TRIED
> I am finishing a bunch of patches myself so i am bit short on time right
> now to take a deeper look but i will try to do that in next few days :)
> 
> In anycase i will review again your next posting.

Sure thing.  Thank you Jerome!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86
  2019-01-24 15:40       ` Jerome Glisse
@ 2019-01-25  3:30         ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-25  3:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Johannes Weiner, Martin Cracauer, Denis Plotnikov, Shaohua Li,
	Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz, Marty McFadden,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Jan 24, 2019 at 10:40:50AM -0500, Jerome Glisse wrote:
> On Thu, Jan 24, 2019 at 01:16:16PM +0800, Peter Xu wrote:
> > On Mon, Jan 21, 2019 at 10:09:38AM -0500, Jerome Glisse wrote:
> > > On Mon, Jan 21, 2019 at 03:57:08PM +0800, Peter Xu wrote:
> > > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > > 
> > > > Accurate userfaultfd WP tracking is possible by tracking exactly which
> > > > virtual memory ranges were writeprotected by userland. We can't relay
> > > > only on the RW bit of the mapped pagetable because that information is
> > > > destroyed by fork() or KSM or swap. If we were to relay on that, we'd
> > > > need to stay on the safe side and generate false positive wp faults
> > > > for every swapped out page.
> > 
> > (I'm trying to leave comments with my own understanding here; they
> >  might not be the original purposes when Andrea proposed the idea.
> >  Andrea, please feel free to chim in anytime especially if I am
> >  wrong... :-)
> > 
> > > 
> > > So you want to forward write fault (of a protected range) to user space
> > > only if page is not write protected because of fork(), KSM or swap.
> > > 
> > > This write protection feature is only for anonymous page right ? Other-
> > > wise how would you protect a share page (ie anyone can look it up and
> > > call page_mkwrite on it and start writting to it) ?
> > 
> > AFAIU we want to support shared memory too in the future.  One example
> > I can think of is current QEMU usage with DPDK: we have two processes
> > sharing the guest memory range.  So indeed this might not work if
> > there are unknown/malicious users of the shared memory, however in
> > many use cases the users are all known and AFAIU we should just write
> > protect all these users then we'll still get notified when any of them
> > write to a page.
> > 
> > > 
> > > So for anonymous page for fork() the mapcount will tell you if page is
> > > write protected for COW. For KSM it is easy check the page flag.
> > 
> > Yes I agree that KSM should be easy.  But for COW, please consider
> > when we write protect a page that was shared and RW removed due to
> > COW.  Then when we page fault on this page should we report to the
> > monitor?  IMHO we can't know if without a specific bit in the PTE.
> > 
> > > 
> > > For swap you can use the page lock to synchronize. A page that is
> > > write protected because of swap is write protected because it is being
> > > write to disk thus either under page lock, or with PageWriteback()
> > > returning true while write is on going.
> > 
> > For swap I think the major problem is when the page was swapped out of
> > main memory and then we write to the page (which was already a swap
> > entry now).  Then we'll first swap in the page into main memory again,
> > but then IMHO we will face the similar issue like COW above - we can't
> > judge whether this page was write protected by uffd-wp at all.  Of
> > course here we can detect the VMA flags and assuming it's write
> > protected if the UFFD_WP flag was set on the VMA flag, however we'll
> > also mark those pages which were not write protected at all hence
> > it'll generate false positives of write protection messages.  This
> > idea can apply too to above COW use case.  As a conclusion, in these
> > use cases we should not be able to identify explicitly on page
> > granularity write protection if without a specific _PAGE_UFFD_WP bit
> > in the PTE entries.
> 
> So i need to think a bit more on this, probably not right now
> but just so i get the chain of event properly:
>   1 - user space ioctl UFD to write protect a range
>   2 - UFD set a flag on the vma and update CPU page table

A trivial supplement to these two steps to be clear: the change to VMA
flags and PTE permissions are different steps.  Say, to write protect
a newly mmap()ed region, we need to do:

  (a) ioctl UFFDIO_REGISTER upon the range: this will properly attach
      the VM_UFFD_WP flag upon the VMA object, and...

  (b) ioctl UFFDIO_WRITEPROTECT upon the range again: this will
      properly apply the new uffd-wp bit and write protect the
      PTEs/PMDs.

Note that the range specified in step (b) could also be part of the
buffer, so it does not need to cover the whole VMA, and it's in page
granularity.

>   3 - page can be individualy write faulted and it sends a
>       signal to UFD listener and they handle the fault
>   4 - UFD kernel update the page table once userspace have
>       handled the fault and sent result to UFD. At this point
>       the vma still has the UFD write protect flag set.

Yes. As explained above, the VMA can have the VM_UFFD_WP flag even if
none of the PTEs underneath was write protected.

> 
> So at any point in time in a range you might have writeable
> pte that correspond to already handled UFD write fault. Now
> if COW,KSM or swap happens on those then on the next write
> fault you do not want to send a signal to userspace but handle
> the fault just as usual ?

Yes, if the PTE has already resolved the uffd write protection and
then it will be just like a normal PTE, because when resolving the
uffd-wp page fault we'll also remove the special uffd-wp bit on the
PTE/PMD.

And IMHO actually what's more special here is when we write protect a
shared private page that is for COW (I'll skip KSM since it looks very
like this case IIUC): here due to COW the PTE already lost the RW bit,
and here when we do uffd-wp upon this page we'll simply apply the
uffd-wp bit only to mark that this PTE was especially write protected
by userfaults.  And when we want to resolve the uffd-wp for such a PTE
we'll first try to do COW if it is shared by others by checking
against page_mapcount().

> 
> I believe this is the event flow, so i will ponder on this some
> more :)

Yes please. :) The workflow of the new ioctl()s was also mentioned in
the cover letter.  Please feel free to have a look too.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-01-24  9:28         ` Peter Xu
@ 2019-01-25  7:54           ` Mike Rapoport
  2019-01-25 10:12             ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2019-01-25  7:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Jan 24, 2019 at 05:28:48PM +0800, Peter Xu wrote:
> On Thu, Jan 24, 2019 at 09:27:07AM +0200, Mike Rapoport wrote:
> > On Thu, Jan 24, 2019 at 12:56:15PM +0800, Peter Xu wrote:
> > > On Mon, Jan 21, 2019 at 12:42:33PM +0200, Mike Rapoport wrote:
> > > 
> > > [...]
> > > 
> > > > > @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > > 
> > > > >  		/* check not compatible vmas */
> > > > >  		ret = -EINVAL;
> > > > > -		if (!vma_can_userfault(cur))
> > > > > +		if (!vma_can_userfault(cur, vm_flags))
> > > > >  			goto out_unlock;
> > > > > 
> > > > >  		/*
> > > > > @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  			if (end & (vma_hpagesize - 1))
> > > > >  				goto out_unlock;
> > > > >  		}
> > > > > +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> > > > > +			goto out_unlock;
> > > > 
> > > > This is problematic for the non-cooperative use-case. Way may still want to
> > > > monitor a read-only area because it may eventually become writable, e.g. if
> > > > the monitored process runs mprotect().
> > > 
> > > Firstly I think I should be able to change it to VM_MAYWRITE which
> > > seems to suite more.
> > > 
> > > Meanwhile, frankly speaking I didn't think a lot about how to nest the
> > > usages of uffd-wp and mprotect(), so far I was only considering it as
> > > a replacement of mprotect().  But indeed it can happen that the
> > > monitored process calls mprotect().  Is there an existing scenario of
> > > such usage?
> > > 
> > > The problem is I'm uncertain about whether this scenario can work
> > > after all.  Say, the monitor process A write protected process B's
> > > page P, so logically A will definitely receive a message before B
> > > writes to page P.  However here if we allow process B to do
> > > mprotect(PROT_WRITE) upon page P and grant write permission to it on
> > > its own, then A will not be able to capture the write operation at
> > > all?  Then I don't know how it can work here... or whether we should
> > > fail the mprotect() at least upon uffd-wp ranges?
> > 
> > The use-case we've discussed a while ago was to use uffd-wp instead of
> > soft-dirty for tracking memory changes in CRIU for pre-copy migration.
> > Currently, we enable soft-dirty for the migrated process and monitor
> > /proc/pid/pagemap between memory dump iterations to see what memory pages
> > have been changed.
> > With uffd-wp we thought to register all the process memory with uffd-wp and
> > then track changes with uffd-wp notifications. Back then it was considered
> > only at the very general level without paying much attention to details.
> > 
> > So my initial thought was that we do register the entire memory with
> > uffd-wp. If an area changes from RO to RW at some point, uffd-wp will
> > generate notifications to the monitor, it would be able to notice the
> > change and the write will continue normally.
> > 
> > If we are to limit uffd-wp register only to VMAs with VM_WRITE and even
> > VM_MAYWRITE, we'd need a way to handle the possible changes of VMA
> > protection and an ability to add monitoring for areas that changed from RO
> > to RW.
> > 
> > Can't say I have a clear picture in mind at the moment, will continue to
> > think about it.
> 
> Thanks for these details.  Though I have a question about how it's
> used.
> 
> Since we're talking about replacing soft dirty with uffd-wp here, I
> noticed that there's a major interface difference between soft-dirty
> and uffd-wp: the soft-dirty was all about /proc operations so a
> monitor process can easily monitor mostly any process on the system as
> long as knowing its PID.  However I'm unsure about uffd-wp since
> userfaultfd was always bound to a mm_struct.  For example, the syscall
> userfaultfd() will always attach the current process mm_struct to the
> newly created userfaultfd but it cannot be attached to another random
> mm_struct of other processes.  Or is there any way that the CRIU
> monitor process can gain an userfaultfd of any process of the system
> somehow?
 
Yes, there is. For CRIU to read the process state during snapshot (or one
the source in case of the migration) we inject a parasite code into the
victim process. The parasite code communicates with the "main" CRIU monitor
via UNIX socket to pass information that cannot be obtained from outside.
For uffd-wp usage we thought about creating the uffd context in the
parasite code, registering the memory and passing the userfault file
descriptor to the CRIU core via that UNIX socket.

> > 
> > > > Particularity, for using uffd-wp as a replacement for soft-dirty would
> > > > require it.
> > > > 
> > > > > 
> > > > >  		/*
> > > > >  		 * Check that this vma isn't already owned by a
> > > > > @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  	do {
> > > > >  		cond_resched();
> > > > > 
> > > > > -		BUG_ON(!vma_can_userfault(vma));
> > > > > +		BUG_ON(!vma_can_userfault(vma, vm_flags));
> > > > >  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> > > > >  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> > > > >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > > > @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> > > > >  	return ret;
> > > > >  }
> > > > > 
> > > > > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > > +				    unsigned long arg)
> > > > > +{
> > > > > +	int ret;
> > > > > +	struct uffdio_writeprotect uffdio_wp;
> > > > > +	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > > > +	struct userfaultfd_wake_range range;
> > > > > +
> > > > 
> > > > In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
> > > > layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
> > > > was to return -EAGAIN if such race is encountered. I think the same would
> > > > apply here.
> > > 
> > > I tried to understand the problem at [1] but failed... could you help
> > > to clarify it a bit more?
> > > 
> > > I'm quoting some of the discussions from [1] here directly between you
> > > and Pavel:
> > > 
> > >   > Since the monitor cannot assume that the process will access all its memory
> > >   > it has to copy some pages "in the background". A simple monitor may look
> > >   > like:
> > >   > 
> > >   > 	for (;;) {
> > >   > 		wait_for_uffd_events(timeout);
> > >   > 		handle_uffd_events();
> > >   > 		uffd_copy(some not faulted pages);
> > >   > 	}
> > >   > 
> > >   > Then, if the "background" uffd_copy() races with fork, the pages we've
> > >   > copied may be already present in parent's mappings before the call to
> > >   > copy_page_range() and may be not.
> > >   > 
> > >   > If the pages were not present, uffd_copy'ing them again to the child's
> > >   > memory would be ok.
> > >   >
> > >   > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
> > >   > again, child process will get memory corruption.
> > > 
> > > Here I don't understand why the child process will get memory
> > > corruption if uffd_copy() caught the mmap_sem first.
> > > 
> > > If it did it, then IMHO when uffd_copy() copies the page again it'll
> > > simply get a -EEXIST showing that the page has already been copied.
> > > Could you explain on why there will be a data corruption?
> > 
> > Let's say we do post-copy migration of a process A with CRIU and its page at
> > address 0x1000 is already copied. Now it modifies the contents of this
> > page. At this point the contents of the page at 0x1000 is different on the
> > source and the destination.
> > Next, process A forks process B. The CRIU's uffd monitor gets
> > UFFD_EVENT_FORK, and starts filling process B memory with UFFDIO_COPY.
> > It may happen, that UFFDIO_COPY to 0x1000 of the process B will occur
> 
> I think this is the place I started to get confused...
> 
> The mmap copy phase and the FORK event path is in dup_mmap() as
> mentioned in the patch too:
> 
>      dup_mmap()
>         down_write(old_mm)
>         down_write(new_mm)
>         foreach(vma)
>             copy_page_range()            (a)
>         up_write(new_mm)
>         up_write(old_mm)
>         dup_userfaultfd_complete()       (b)
> 
> Here if we already received UFFD_EVENT_FORK and started to copy pages
> to process B in the background, then we should have at least passed
> (b) above since otherwise we won't even know the existance of process
> B.  However if so, we should have already passed the point to copy
> data at (a) too, then how could copy_page_range() race?  It seems that
> I might have missed something important out there but it's not easy
> for me to figure out myself...

Apparently, I confused myself as well...
I clearly remember that there was a problem with fork() but the sequence
the causes it keeps evading me :(

Anyway, some mean of synchronization between uffd_copy and the
non-cooperative events is required. Take, for example, MADV_DONTNEED. When
it races with uffdio_copy() a process may end reading non zero values right
after MADV_DONTNEED call.

uffd monitor           | process
-----------------------+-------------------------------------------
uffdio_copy(0x1000)    | madvise(MADV_DONTNEED, 0x1000)
                       |    down_read(mmap_sem)
                       |    zap_pte_range(0x1000)
                       |    up_read(mmap_sem)
   down_read(mmap_sem) |
   copy()              |
   up_read(mmap_sem)   |
                       |  read(0x1000) != 0

Similar issues happen with mpremap() and munmap().

> Thanks,
> 
> > *before* fork() completes and it may race with copy_page_range().
> > If UFFDIO_COPY wins the race, it will fill the page with the contents from
> > the source, although the correct data is what process A set in that page.
> > 
> > Hope it helps.
> 
> > > >  
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df2cc96e77011cf7989208b206da9817e0321028
> > > >
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-01-25  7:54           ` Mike Rapoport
@ 2019-01-25 10:12             ` Peter Xu
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Xu @ 2019-01-25 10:12 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Johannes Weiner, Martin Cracauer, Denis Plotnikov,
	Shaohua Li, Andrea Arcangeli, Pavel Emelyanov, Mike Kravetz,
	Marty McFadden, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Jan 25, 2019 at 09:54:53AM +0200, Mike Rapoport wrote:
> On Thu, Jan 24, 2019 at 05:28:48PM +0800, Peter Xu wrote:
> > On Thu, Jan 24, 2019 at 09:27:07AM +0200, Mike Rapoport wrote:
> > > On Thu, Jan 24, 2019 at 12:56:15PM +0800, Peter Xu wrote:
> > > > On Mon, Jan 21, 2019 at 12:42:33PM +0200, Mike Rapoport wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > > > 
> > > > > >  		/* check not compatible vmas */
> > > > > >  		ret = -EINVAL;
> > > > > > -		if (!vma_can_userfault(cur))
> > > > > > +		if (!vma_can_userfault(cur, vm_flags))
> > > > > >  			goto out_unlock;
> > > > > > 
> > > > > >  		/*
> > > > > > @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > > >  			if (end & (vma_hpagesize - 1))
> > > > > >  				goto out_unlock;
> > > > > >  		}
> > > > > > +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> > > > > > +			goto out_unlock;
> > > > > 
> > > > > This is problematic for the non-cooperative use-case. Way may still want to
> > > > > monitor a read-only area because it may eventually become writable, e.g. if
> > > > > the monitored process runs mprotect().
> > > > 
> > > > Firstly I think I should be able to change it to VM_MAYWRITE which
> > > > seems to suite more.
> > > > 
> > > > Meanwhile, frankly speaking I didn't think a lot about how to nest the
> > > > usages of uffd-wp and mprotect(), so far I was only considering it as
> > > > a replacement of mprotect().  But indeed it can happen that the
> > > > monitored process calls mprotect().  Is there an existing scenario of
> > > > such usage?
> > > > 
> > > > The problem is I'm uncertain about whether this scenario can work
> > > > after all.  Say, the monitor process A write protected process B's
> > > > page P, so logically A will definitely receive a message before B
> > > > writes to page P.  However here if we allow process B to do
> > > > mprotect(PROT_WRITE) upon page P and grant write permission to it on
> > > > its own, then A will not be able to capture the write operation at
> > > > all?  Then I don't know how it can work here... or whether we should
> > > > fail the mprotect() at least upon uffd-wp ranges?
> > > 
> > > The use-case we've discussed a while ago was to use uffd-wp instead of
> > > soft-dirty for tracking memory changes in CRIU for pre-copy migration.
> > > Currently, we enable soft-dirty for the migrated process and monitor
> > > /proc/pid/pagemap between memory dump iterations to see what memory pages
> > > have been changed.
> > > With uffd-wp we thought to register all the process memory with uffd-wp and
> > > then track changes with uffd-wp notifications. Back then it was considered
> > > only at the very general level without paying much attention to details.
> > > 
> > > So my initial thought was that we do register the entire memory with
> > > uffd-wp. If an area changes from RO to RW at some point, uffd-wp will
> > > generate notifications to the monitor, it would be able to notice the
> > > change and the write will continue normally.
> > > 
> > > If we are to limit uffd-wp register only to VMAs with VM_WRITE and even
> > > VM_MAYWRITE, we'd need a way to handle the possible changes of VMA
> > > protection and an ability to add monitoring for areas that changed from RO
> > > to RW.
> > > 
> > > Can't say I have a clear picture in mind at the moment, will continue to
> > > think about it.
> > 
> > Thanks for these details.  Though I have a question about how it's
> > used.
> > 
> > Since we're talking about replacing soft dirty with uffd-wp here, I
> > noticed that there's a major interface difference between soft-dirty
> > and uffd-wp: the soft-dirty was all about /proc operations so a
> > monitor process can easily monitor mostly any process on the system as
> > long as knowing its PID.  However I'm unsure about uffd-wp since
> > userfaultfd was always bound to a mm_struct.  For example, the syscall
> > userfaultfd() will always attach the current process mm_struct to the
> > newly created userfaultfd but it cannot be attached to another random
> > mm_struct of other processes.  Or is there any way that the CRIU
> > monitor process can gain an userfaultfd of any process of the system
> > somehow?
>  
> Yes, there is. For CRIU to read the process state during snapshot (or one
> the source in case of the migration) we inject a parasite code into the
> victim process. The parasite code communicates with the "main" CRIU monitor
> via UNIX socket to pass information that cannot be obtained from outside.
> For uffd-wp usage we thought about creating the uffd context in the
> parasite code, registering the memory and passing the userfault file
> descriptor to the CRIU core via that UNIX socket.

Glad to know the black magic behind it...

> 
> > > 
> > > > > Particularity, for using uffd-wp as a replacement for soft-dirty would
> > > > > require it.
> > > > > 
> > > > > > 
> > > > > >  		/*
> > > > > >  		 * Check that this vma isn't already owned by a
> > > > > > @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > > >  	do {
> > > > > >  		cond_resched();
> > > > > > 
> > > > > > -		BUG_ON(!vma_can_userfault(vma));
> > > > > > +		BUG_ON(!vma_can_userfault(vma, vm_flags));
> > > > > >  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> > > > > >  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> > > > > >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > > > > @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> > > > > >  	return ret;
> > > > > >  }
> > > > > > 
> > > > > > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > > > +				    unsigned long arg)
> > > > > > +{
> > > > > > +	int ret;
> > > > > > +	struct uffdio_writeprotect uffdio_wp;
> > > > > > +	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > > > > +	struct userfaultfd_wake_range range;
> > > > > > +
> > > > > 
> > > > > In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
> > > > > layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
> > > > > was to return -EAGAIN if such race is encountered. I think the same would
> > > > > apply here.
> > > > 
> > > > I tried to understand the problem at [1] but failed... could you help
> > > > to clarify it a bit more?
> > > > 
> > > > I'm quoting some of the discussions from [1] here directly between you
> > > > and Pavel:
> > > > 
> > > >   > Since the monitor cannot assume that the process will access all its memory
> > > >   > it has to copy some pages "in the background". A simple monitor may look
> > > >   > like:
> > > >   > 
> > > >   > 	for (;;) {
> > > >   > 		wait_for_uffd_events(timeout);
> > > >   > 		handle_uffd_events();
> > > >   > 		uffd_copy(some not faulted pages);
> > > >   > 	}
> > > >   > 
> > > >   > Then, if the "background" uffd_copy() races with fork, the pages we've
> > > >   > copied may be already present in parent's mappings before the call to
> > > >   > copy_page_range() and may be not.
> > > >   > 
> > > >   > If the pages were not present, uffd_copy'ing them again to the child's
> > > >   > memory would be ok.
> > > >   >
> > > >   > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
> > > >   > again, child process will get memory corruption.
> > > > 
> > > > Here I don't understand why the child process will get memory
> > > > corruption if uffd_copy() caught the mmap_sem first.
> > > > 
> > > > If it did it, then IMHO when uffd_copy() copies the page again it'll
> > > > simply get a -EEXIST showing that the page has already been copied.
> > > > Could you explain on why there will be a data corruption?
> > > 
> > > Let's say we do post-copy migration of a process A with CRIU and its page at
> > > address 0x1000 is already copied. Now it modifies the contents of this
> > > page. At this point the contents of the page at 0x1000 is different on the
> > > source and the destination.
> > > Next, process A forks process B. The CRIU's uffd monitor gets
> > > UFFD_EVENT_FORK, and starts filling process B memory with UFFDIO_COPY.
> > > It may happen, that UFFDIO_COPY to 0x1000 of the process B will occur
> > 
> > I think this is the place I started to get confused...
> > 
> > The mmap copy phase and the FORK event path is in dup_mmap() as
> > mentioned in the patch too:
> > 
> >      dup_mmap()
> >         down_write(old_mm)
> >         down_write(new_mm)
> >         foreach(vma)
> >             copy_page_range()            (a)
> >         up_write(new_mm)
> >         up_write(old_mm)
> >         dup_userfaultfd_complete()       (b)
> > 
> > Here if we already received UFFD_EVENT_FORK and started to copy pages
> > to process B in the background, then we should have at least passed
> > (b) above since otherwise we won't even know the existance of process
> > B.  However if so, we should have already passed the point to copy
> > data at (a) too, then how could copy_page_range() race?  It seems that
> > I might have missed something important out there but it's not easy
> > for me to figure out myself...
> 
> Apparently, I confused myself as well...
> I clearly remember that there was a problem with fork() but the sequence
> the causes it keeps evading me :(
> 
> Anyway, some mean of synchronization between uffd_copy and the
> non-cooperative events is required. Take, for example, MADV_DONTNEED. When
> it races with uffdio_copy() a process may end reading non zero values right
> after MADV_DONTNEED call.
> 
> uffd monitor           | process
> -----------------------+-------------------------------------------
> uffdio_copy(0x1000)    | madvise(MADV_DONTNEED, 0x1000)
>                        |    down_read(mmap_sem)
>                        |    zap_pte_range(0x1000)
>                        |    up_read(mmap_sem)
>    down_read(mmap_sem) |
>    copy()              |
>    up_read(mmap_sem)   |
>                        |  read(0x1000) != 0
> 
> Similar issues happen with mpremap() and munmap().

I think I get the point this time especially with the context of CRIU
process postcopy migration in mind.

If my understanding is correct, here if UFFDIO_COPY returned -EAGAIN
due to this, continuous UFFDIO_COPY upon the same page will fail too
(which could be slightly confusing at the first glance since normally
that's what -EAGAIN means: it should let the caller to simply
retry...), and IMHO the only correct way to solve this to break out of
the copy_page_background() loop and receive uffd messages instead.
And then we'll notice that there's a UNMAP event of the page, then we
know that we don't need to UFFDIO_COPY this page, so instead of a real
"retry" we just don't do it at all.

Tricky... but it does make sense if considering this under the
postcopy scenario for either CRIU or even QEMU when migrating VMs.

Then I'll just simply follow df2cc96e77011cf79892 in
UFFDIO_WRITEPROTECT() too then.  Thanks for all these explanations!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2019-01-25 10:12 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-21  7:56 [PATCH RFC 00/24] userfaultfd: write protection support Peter Xu
2019-01-21  7:56 ` [PATCH RFC 01/24] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
2019-01-21 10:20   ` Mike Rapoport
2019-01-21  7:57 ` [PATCH RFC 02/24] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
2019-01-21 15:40   ` Jerome Glisse
2019-01-22  6:10     ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 03/24] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
2019-01-21 15:55   ` Jerome Glisse
2019-01-22  8:22     ` Peter Xu
2019-01-22 16:53       ` Jerome Glisse
2019-01-23  2:12         ` Peter Xu
2019-01-23  2:39           ` Jerome Glisse
2019-01-24  5:45             ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 04/24] mm: gup: " Peter Xu
2019-01-21 16:24   ` Jerome Glisse
2019-01-24  7:05     ` Peter Xu
2019-01-24 15:34       ` Jerome Glisse
2019-01-25  2:49         ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 05/24] userfaultfd: wp: add helper for writeprotect check Peter Xu
2019-01-21 10:23   ` Mike Rapoport
2019-01-22  8:31     ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 06/24] userfaultfd: wp: support write protection for userfault vma range Peter Xu
2019-01-21 10:20   ` Mike Rapoport
2019-01-22  8:55     ` Peter Xu
2019-01-21 14:05   ` Jerome Glisse
2019-01-22  9:39     ` Peter Xu
2019-01-22 17:02       ` Jerome Glisse
2019-01-23  2:17         ` Peter Xu
2019-01-23  2:43           ` Jerome Glisse
2019-01-24  5:47             ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
2019-01-21 10:42   ` Mike Rapoport
2019-01-24  4:56     ` Peter Xu
2019-01-24  7:27       ` Mike Rapoport
2019-01-24  9:28         ` Peter Xu
2019-01-25  7:54           ` Mike Rapoport
2019-01-25 10:12             ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 08/24] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
2019-01-21  7:57 ` [PATCH RFC 09/24] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
2019-01-21  7:57 ` [PATCH RFC 10/24] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
2019-01-21 15:09   ` Jerome Glisse
2019-01-24  5:16     ` Peter Xu
2019-01-24 15:40       ` Jerome Glisse
2019-01-25  3:30         ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 11/24] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
2019-01-21  7:57 ` [PATCH RFC 12/24] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
2019-01-21  7:57 ` [PATCH RFC 13/24] mm: merge parameters for change_protection() Peter Xu
2019-01-21 13:54   ` Jerome Glisse
2019-01-24  5:22     ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 14/24] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
2019-01-21  7:57 ` [PATCH RFC 15/24] mm: export wp_page_copy() Peter Xu
2019-01-21  7:57 ` [PATCH RFC 16/24] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
2019-01-21  7:57 ` [PATCH RFC 17/24] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
2019-01-21  7:57 ` [PATCH RFC 18/24] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
2019-01-21  7:57 ` [PATCH RFC 19/24] userfaultfd: wp: support swap and page migration Peter Xu
2019-01-21  7:57 ` [PATCH RFC 20/24] userfaultfd: wp: don't wake up when doing write protect Peter Xu
2019-01-21 11:10   ` Mike Rapoport
2019-01-24  5:36     ` Peter Xu
2019-01-21  7:57 ` [PATCH RFC 21/24] khugepaged: skip collapse if uffd-wp detected Peter Xu
2019-01-21  7:57 ` [PATCH RFC 22/24] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
2019-01-21  7:57 ` [PATCH RFC 23/24] userfaultfd: selftests: refactor statistics Peter Xu
2019-01-21  7:57 ` [PATCH RFC 24/24] userfaultfd: selftests: add write-protect test Peter Xu
2019-01-21 14:33 ` [PATCH RFC 00/24] userfaultfd: write protection support David Hildenbrand
2019-01-22  3:18   ` Peter Xu
2019-01-22  8:59     ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).