* [PATCH v3 00/28] userfaultfd: write protection support
@ 2019-03-20  2:06 Peter Xu
From: Peter Xu @ 2019-03-20  2:06 UTC
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A. Shutemov,
	Dr. David Alan Gilbert

This series implements initial write protection support for
userfaultfd.  Currently only anonymous memory is supported; shmem and
hugetlbfs are not yet.  This is the third version of the series.

The latest code can also be found at:

  https://github.com/xzpeter/linux/tree/uffd-wp-merged

Note again that the first 5 patches in the series can be seen as
isolated work on the page fault mechanism.  I would hope that they can
be reviewed/picked even earlier than the rest of the series, since
they are useful even for the existing userfaultfd MISSING case [8].

v3 changelog:
- take r-bs
- patch 1: fix typo [Jerome]
- patch 2: add parentheses where appropriate around (flags & VM_FAULT_RETRY)
  (there are three places to change, not four...) [Jerome]
- patch 4: make sure TRIED is applied correctly on all archs, add more
  comment to explain the new page fault mechanism [Jerome]
- patch 7: in do_swap_page() drop the two lines that clear the
  FAULT_FLAG_WRITE flag [Jerome]
- patch 10: another parentheses change like the above; also make
  mfill_atomic_pte() return -EINVAL when wp_copy==1 is detected on
  shared memory [Jerome]
- patch 12: move _PAGE_CHG_MASK change to patch 8 [Jerome]
- patch 14: wp_page_copy() - fix write bit; change_pte_range() -
  detect PTE change after COW [Jerome]
- patch 17: remove the last paragraph of the commit message; no need
  to drop the two lines in do_swap_page() since they've been directly
  dropped in patch 7; touch up remove_migration_pte() to only detect
  the uffd-wp bit if it's a read migration entry [Jerome]
- add patch: "userfaultfd: wp: declare _UFFDIO_WRITEPROTECT
  conditionally", which remove _UFFDIO_WRITEPROTECT bit if detected
  non-anonymous memory during REGISTER; meanwhile fixup the test case
  for shmem too for expected ioctls returned from REGISTER [Mike]
- add patch: "userfaultfd: wp: fixup swap entries in
  change_pte_range", the new patch will allow to apply the uffd-wp
  bits upon swap entries directly (e.g., when the page is during
  migration or the page was swapped out).  Please see the patch for
  detail information.

v2 changelog:
- add some r-bs
- split the patch "mm: userfault: return VM_FAULT_RETRY on signals"
  into two: one focusing on the signal behavior change, the other
  removing the NOPAGE special path in handle_userfault().  Remove the
  ARC-specific change and that part of the commit message, since it's
  already fixed in 4d447455e73b [Jerome]
- return -ENOENT when VMA is invalid for UFFDIO_WRITEPROTECT to match
  UFFDIO_COPY errno [Mike]
- add a new patch to introduce helper to find valid VMA for uffd
  [Mike]
- check against VM_MAYWRITE instead of VM_WRITE when registering UFFD
  WP [Mike]
- MM_CP_DIRTY_ACCT is used incorrectly, fix it up [Jerome]
- make sure the lock_page behavior will not be changed [Jerome]
- reorder the whole series, introduce the new ioctl last. [Jerome]
- fix up uffdio_writeprotect() following commit df2cc96e77011cf79 to
  return -EAGAIN when mm layout changes are detected [Mike]

v1 can be found at: https://lkml.org/lkml/2019/1/21/130

Any comments would be greatly welcomed.  Thanks.

Overview
========

The uffd-wp work was initiated by Shaohua Li [1] and later continued
by Andrea [2].  This series is based upon Andrea's latest userfaultfd
tree, and it is a continuation of work from both Shaohua and Andrea.
Many of the follow-up ideas come from Andrea too.

Besides the old MISSING register mode of userfaultfd, the new uffd-wp
support provides an alternative register mode called
UFFDIO_REGISTER_MODE_WP that can be used to listen not only to missing
page faults but also to write protection page faults; the two modes
can even be registered together.  At the same time, the new feature
provides a new userfaultfd ioctl called UFFDIO_WRITEPROTECT which
allows userspace to write protect a range of memory or fix up the
write permission of faulted pages.

Please refer to the document patch "userfaultfd: wp:
UFFDIO_REGISTER_MODE_WP documentation update" for more information on
the new interface and what it can do.

The major workflow of an uffd-wp program should be (a minimal C
sketch follows the list):

  1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP

  2. Write protect part of the whole registered region using
     UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
     show that we want to write protect the range.

  3. Start a worker thread that modifies the protected pages, while
     another thread listens to UFFD messages.

  4. When a write hits the protected range, a page fault happens, a
     UFFD message is generated, and it is delivered to the page fault
     handling thread.

  5. The page fault handler thread resolves the page fault using the
     new UFFDIO_WRITEPROTECT ioctl, this time without
     UFFDIO_WRITEPROTECT_MODE_WP set, showing that we want to restore
     the write permission.  Before this operation, the fault handler
     thread can do anything it wants, e.g., dump the page to
     persistent storage.

  6. The worker thread will continue running with the correctly
     applied write permission from step 5.
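
A minimal C sketch of steps 1, 2 and 5 (illustrative only: it assumes
the uapi definitions introduced by this series, that addr/len describe
the registered area, and it omits error handling plus the message
reading loop of steps 3-4):

  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>

  static void uffd_wp_demo(void *addr, size_t len)
  {
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
          struct uffdio_api api = { .api = UFFD_API };
          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode  = UFFDIO_REGISTER_MODE_WP,
          };
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
          };

          ioctl(uffd, UFFDIO_API, &api);

          /* Step 1: register the region in WP mode */
          ioctl(uffd, UFFDIO_REGISTER, &reg);

          /* Step 2: write protect the range */
          ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

          /* ... worker writes; handler thread reads uffd messages ... */

          /* Step 5: resolve a fault by dropping the write protection */
          wp.mode = 0;    /* i.e. !UFFDIO_WRITEPROTECT_MODE_WP */
          ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
  }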

Currently there are already two projects based on this new
userfaultfd feature.

QEMU Live Snapshot: The project provides a way to allow the QEMU
                    hypervisor to take snapshots of VMs without
                    stopping the VM [3].

LLNL umap library:  The project provides a mmap-like interface and
                    "allow to have an application specific buffer of
                    pages cached from a large file, i.e. out-of-core
                    execution using memory map" [4][5].

Before posting, this series was smoke tested against QEMU live
snapshot and the LLNL umap library (by doing parallel quicksort using
128 sorting threads + 80 uffd servicing threads).  My sincere thanks
to Marty McFadden and Denis Plotnikov for the help along the way.

TODO
====

- hugetlbfs/shmem support
- performance
- more architectures
- cooperate with mprotect()-allowed processes (???)
- ...

References
==========

[1] https://lwn.net/Articles/666187/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
[3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
[4] https://github.com/LLNL/umap
[5] https://llnl-umap.readthedocs.io/en/develop/
[6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
[7] https://lkml.org/lkml/2018/11/21/370
[8] https://lkml.org/lkml/2018/12/30/64

Andrea Arcangeli (5):
  userfaultfd: wp: hook userfault handler to write protection fault
  userfaultfd: wp: add WP pagetable tracking to x86
  userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
  userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  userfaultfd: wp: add the writeprotect API to userfaultfd ioctl

Martin Cracauer (1):
  userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update

Peter Xu (19):
  mm: gup: rename "nonblocking" to "locked" where proper
  mm: userfault: return VM_FAULT_RETRY on signals
  userfaultfd: don't retake mmap_sem to emulate NOPAGE
  mm: allow VM_FAULT_RETRY for multiple times
  mm: gup: allow VM_FAULT_RETRY for multiple times
  mm: merge parameters for change_protection()
  userfaultfd: wp: apply _PAGE_UFFD_WP bit
  mm: export wp_page_copy()
  userfaultfd: wp: handle COW properly for uffd-wp
  userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
  userfaultfd: wp: support swap and page migration
  khugepaged: skip collapse if uffd-wp detected
  userfaultfd: introduce helper vma_find_uffd
  userfaultfd: wp: don't wake up when doing write protect
  userfaultfd: wp: fixup swap entries in change_pte_range
  userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally
  userfaultfd: selftests: refactor statistics
  userfaultfd: selftests: add write-protect test

Shaohua Li (3):
  userfaultfd: wp: add helper for writeprotect check
  userfaultfd: wp: support write protection for userfault vma range
  userfaultfd: wp: enabled write protection in userfaultfd API

 Documentation/admin-guide/mm/userfaultfd.rst |  51 +++++
 arch/alpha/mm/fault.c                        |   4 +-
 arch/arc/mm/fault.c                          |  12 +-
 arch/arm/mm/fault.c                          |   9 +-
 arch/arm64/mm/fault.c                        |  11 +-
 arch/hexagon/mm/vm_fault.c                   |   3 +-
 arch/ia64/mm/fault.c                         |   3 +-
 arch/m68k/mm/fault.c                         |   5 +-
 arch/microblaze/mm/fault.c                   |   3 +-
 arch/mips/mm/fault.c                         |   3 +-
 arch/nds32/mm/fault.c                        |   7 +-
 arch/nios2/mm/fault.c                        |   5 +-
 arch/openrisc/mm/fault.c                     |   3 +-
 arch/parisc/mm/fault.c                       |   6 +-
 arch/powerpc/mm/fault.c                      |   8 +-
 arch/riscv/mm/fault.c                        |   9 +-
 arch/s390/mm/fault.c                         |  14 +-
 arch/sh/mm/fault.c                           |   5 +-
 arch/sparc/mm/fault_32.c                     |   4 +-
 arch/sparc/mm/fault_64.c                     |   4 +-
 arch/um/kernel/trap.c                        |   6 +-
 arch/unicore32/mm/fault.c                    |   8 +-
 arch/x86/Kconfig                             |   1 +
 arch/x86/include/asm/pgtable.h               |  67 ++++++
 arch/x86/include/asm/pgtable_64.h            |   8 +-
 arch/x86/include/asm/pgtable_types.h         |  11 +-
 arch/x86/mm/fault.c                          |   8 +-
 arch/xtensa/mm/fault.c                       |   4 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c              |  12 +-
 fs/userfaultfd.c                             | 130 +++++++----
 include/asm-generic/pgtable.h                |   1 +
 include/asm-generic/pgtable_uffd.h           |  66 ++++++
 include/linux/huge_mm.h                      |   2 +-
 include/linux/mm.h                           |  59 ++++-
 include/linux/swapops.h                      |   2 +
 include/linux/userfaultfd_k.h                |  42 +++-
 include/trace/events/huge_memory.h           |   1 +
 include/uapi/linux/userfaultfd.h             |  40 +++-
 init/Kconfig                                 |   5 +
 mm/filemap.c                                 |   2 +-
 mm/gup.c                                     |  61 ++---
 mm/huge_memory.c                             |  28 ++-
 mm/hugetlb.c                                 |  14 +-
 mm/khugepaged.c                              |  23 ++
 mm/memory.c                                  |  31 ++-
 mm/mempolicy.c                               |   2 +-
 mm/migrate.c                                 |   4 +
 mm/mprotect.c                                | 131 ++++++++---
 mm/rmap.c                                    |   6 +
 mm/shmem.c                                   |   2 +-
 mm/userfaultfd.c                             | 148 +++++++++---
 tools/testing/selftests/vm/userfaultfd.c     | 225 +++++++++++++++----
 52 files changed, 1024 insertions(+), 295 deletions(-)
 create mode 100644 include/asm-generic/pgtable_uffd.h

-- 
2.17.1



* [PATCH v3 01/28] mm: gup: rename "nonblocking" to "locked" where proper
@ 2019-03-20  2:06 Peter Xu
From: Peter Xu @ 2019-03-20  2:06 UTC
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A. Shutemov,
	Dr. David Alan Gilbert

There are plenty of places around __get_user_pages() that have a
parameter "nonblocking" which does not really mean "it won't block"
(it can block) but instead indicates whether the mmap_sem is released
by up_read() during the page fault handling, mostly when
VM_FAULT_RETRY is returned.

We have the correct naming in e.g. get_user_pages_locked() or
get_user_pages_remote() as "locked", however there are still many
places that use "nonblocking" as the name.

Rename those places to "locked" where proper to better suit the
functionality of the variable.  While at it, fix up some of the
comments accordingly.
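
For context, the caller convention that the "locked" name reflects
looks roughly like below (an illustrative sketch, not part of the
patch; mm, start, nr_pages, gup_flags and pages are assumed to be set
up by the caller):

  int locked = 1;   /* we enter with mmap_sem held for reading */
  long ret;

  down_read(&mm->mmap_sem);
  ret = get_user_pages_locked(start, nr_pages, gup_flags, pages,
                              &locked);
  if (locked)       /* the fault path kept the lock */
          up_read(&mm->mmap_sem);
  /* else mmap_sem was already dropped via up_read() on retry */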

Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 44 +++++++++++++++++++++-----------------------
 mm/hugetlb.c |  8 ++++----
 2 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 75029649baca..9bb3bed68ee3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -506,12 +506,12 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 }
 
 /*
- * mmap_sem must be held on entry.  If @nonblocking != NULL and
- * *@flags does not include FOLL_NOWAIT, the mmap_sem may be released.
- * If it is, *@nonblocking will be set to 0 and -EBUSY returned.
+ * mmap_sem must be held on entry.  If @locked != NULL and *@flags
+ * does not include FOLL_NOWAIT, the mmap_sem may be released.  If it
+ * is, *@locked will be set to 0 and -EBUSY returned.
  */
 static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
-		unsigned long address, unsigned int *flags, int *nonblocking)
+		unsigned long address, unsigned int *flags, int *locked)
 {
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
@@ -523,7 +523,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_WRITE;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
-	if (nonblocking)
+	if (locked)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
@@ -549,8 +549,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	}
 
 	if (ret & VM_FAULT_RETRY) {
-		if (nonblocking && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-			*nonblocking = 0;
+		if (locked && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
+			*locked = 0;
 		return -EBUSY;
 	}
 
@@ -627,7 +627,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  *		only intends to ensure the pages are faulted in.
  * @vmas:	array of pointers to vmas corresponding to each page.
  *		Or NULL if the caller does not require them.
- * @nonblocking: whether waiting for disk IO or mmap_sem contention
+ * @locked:     whether we're still with the mmap_sem held
  *
  * Returns number of pages pinned. This may be fewer than the number
  * requested. If nr_pages is 0 or negative, returns 0. If no pages
@@ -656,13 +656,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  * appropriate) must be called after the page is finished with, and
  * before put_page is called.
  *
- * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
- * or mmap_sem contention, and if waiting is needed to pin all pages,
- * *@nonblocking will be set to 0.  Further, if @gup_flags does not
- * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
- * this case.
+ * If @locked != NULL, *@locked will be set to 0 when mmap_sem is
+ * released by an up_read().  That can happen if @gup_flags does not
+ * have FOLL_NOWAIT.
  *
- * A caller using such a combination of @nonblocking and @gup_flags
+ * A caller using such a combination of @locked and @gup_flags
  * must therefore hold the mmap_sem for reading only, and recognize
  * when it's been released.  Otherwise, it must be held for either
  * reading or writing and will not be released.
@@ -674,7 +672,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
-		struct vm_area_struct **vmas, int *nonblocking)
+		struct vm_area_struct **vmas, int *locked)
 {
 	long ret = 0, i = 0;
 	struct vm_area_struct *vma = NULL;
@@ -718,7 +716,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (is_vm_hugetlb_page(vma)) {
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags, nonblocking);
+						gup_flags, locked);
 				continue;
 			}
 		}
@@ -736,7 +734,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,
-					nonblocking);
+					   locked);
 			switch (ret) {
 			case 0:
 				goto retry;
@@ -1195,7 +1193,7 @@ EXPORT_SYMBOL(get_user_pages_longterm);
  * @vma:   target vma
  * @start: start address
  * @end:   end address
- * @nonblocking:
+ * @locked: whether the mmap_sem is still held
  *
  * This takes care of mlocking the pages too if VM_LOCKED is set.
  *
@@ -1203,14 +1201,14 @@ EXPORT_SYMBOL(get_user_pages_longterm);
  *
  * vma->vm_mm->mmap_sem must be held.
  *
- * If @nonblocking is NULL, it may be held for read or write and will
+ * If @locked is NULL, it may be held for read or write and will
  * be unperturbed.
  *
- * If @nonblocking is non-NULL, it must held for read only and may be
- * released.  If it's released, *@nonblocking will be set to 0.
+ * If @locked is non-NULL, it must held for read only and may be
+ * released.  If it's released, *@locked will be set to 0.
  */
 long populate_vma_page_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, int *nonblocking)
+		unsigned long start, unsigned long end, int *locked)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long nr_pages = (end - start) / PAGE_SIZE;
@@ -1245,7 +1243,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	 * not result in a stack expansion that recurses back here.
 	 */
 	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
-				NULL, NULL, nonblocking);
+				NULL, NULL, locked);
 }
 
 /*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8dfdffc34a99..52296ce4025a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4190,7 +4190,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
 			 unsigned long *position, unsigned long *nr_pages,
-			 long i, unsigned int flags, int *nonblocking)
+			 long i, unsigned int flags, int *locked)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
@@ -4261,7 +4261,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				spin_unlock(ptl);
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
-			if (nonblocking)
+			if (locked)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 			if (flags & FOLL_NOWAIT)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
@@ -4278,9 +4278,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				break;
 			}
 			if (ret & VM_FAULT_RETRY) {
-				if (nonblocking &&
+				if (locked &&
 				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-					*nonblocking = 0;
+					*locked = 0;
 				*nr_pages = 0;
 				/*
 				 * VM_FAULT_RETRY must not return an
-- 
2.17.1



* [PATCH v3 02/28] mm: userfault: return VM_FAULT_RETRY on signals
@ 2019-03-20  2:06 Peter Xu
From: Peter Xu @ 2019-03-20  2:06 UTC
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A. Shutemov,
	Dr. David Alan Gilbert

The idea comes from the upstream discussion between Linus and Andrea:

  https://lkml.org/lkml/2017/10/30/560

A summary of the issue: in the past there was a special path in
handle_userfault() where we would return VM_FAULT_NOPAGE when we
detected non-fatal signals while waiting for userfault handling.  We
did that by reacquiring the mmap_sem before returning.  However, that
brings a risk in that the vmas might have changed when we retake the
mmap_sem, and we could even be holding an invalid vma structure.

With this patch we will return VM_FAULT_RETRY via the common path even
if we have got such signals.  Then, for all the architectures that
pass FAULT_FLAG_ALLOW_RETRY into handle_mm_fault(), we check not only
for SIGKILL but for all pending userspace signals right after we
return from handle_mm_fault().  This allows userspace to handle
non-fatal signals faster than before.

This patch is preparation work for the next patch, which finally
removes the special code path mentioned above in handle_userfault().
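
For illustration, the common arch-side pattern after this change looks
like the sketch below (condensed from the diffs that follow; regs,
vma, address and flags are the usual fault-handler locals):

  fault = handle_mm_fault(vma, address, flags);
  if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
          /* Kernel-mode faults only bail out on fatal signals */
          if (fatal_signal_pending(current) && !user_mode(regs))
                  goto no_context;
          /* Return so the (possibly non-fatal) signal is handled */
          return;
  }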

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c      |  2 +-
 arch/arc/mm/fault.c        | 11 ++++-------
 arch/arm/mm/fault.c        |  6 +++---
 arch/arm64/mm/fault.c      |  6 +++---
 arch/hexagon/mm/vm_fault.c |  2 +-
 arch/ia64/mm/fault.c       |  2 +-
 arch/m68k/mm/fault.c       |  2 +-
 arch/microblaze/mm/fault.c |  2 +-
 arch/mips/mm/fault.c       |  2 +-
 arch/nds32/mm/fault.c      |  6 +++---
 arch/nios2/mm/fault.c      |  2 +-
 arch/openrisc/mm/fault.c   |  2 +-
 arch/parisc/mm/fault.c     |  2 +-
 arch/powerpc/mm/fault.c    |  2 ++
 arch/riscv/mm/fault.c      |  4 ++--
 arch/s390/mm/fault.c       |  9 ++++++---
 arch/sh/mm/fault.c         |  4 ++++
 arch/sparc/mm/fault_32.c   |  3 +++
 arch/sparc/mm/fault_64.c   |  3 +++
 arch/um/kernel/trap.c      |  5 ++++-
 arch/unicore32/mm/fault.c  |  4 ++--
 arch/x86/mm/fault.c        |  6 +++++-
 arch/xtensa/mm/fault.c     |  3 +++
 23 files changed, 56 insertions(+), 34 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 188fc9256baf..8a2ef90b4bfc 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	   the fault.  */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 8df1638259f3..9e9e6eb1f7d0 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -141,17 +141,14 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if (fatal_signal_pending(current)) {
-
+	if (unlikely((fault & VM_FAULT_RETRY) && signal_pending(current))) {
+		if (fatal_signal_pending(current) && !user_mode(regs))
+			goto no_context;
 		/*
 		 * if fault retry, mmap_sem already relinquished by core mm
 		 * so OK to return to user mode (with signal handled first)
 		 */
-		if (fault & VM_FAULT_RETRY) {
-			if (!user_mode(regs))
-				goto no_context;
-			return;
-		}
+		return;
 	}
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 58f69fa07df9..c41c021bbe40 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -314,12 +314,12 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 
 	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
 
-	/* If we need to retry but a fatal signal is pending, handle the
+	/* If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		if (!user_mode(regs))
+	if (unlikely(fault & VM_FAULT_RETRY && signal_pending(current))) {
+		if (fatal_signal_pending(current) && !user_mode(regs))
 			goto no_context;
 		return 0;
 	}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index efb7b2cbead5..a38ff8c49a66 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -512,13 +512,13 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 
 	if (fault & VM_FAULT_RETRY) {
 		/*
-		 * If we need to retry but a fatal signal is pending,
+		 * If we need to retry but a signal is pending,
 		 * handle the signal first. We do not need to release
 		 * the mmap_sem because it would already be released
 		 * in __lock_page_or_retry in mm/filemap.c.
 		 */
-		if (fatal_signal_pending(current)) {
-			if (!user_mode(regs))
+		if (signal_pending(current)) {
+			if (fatal_signal_pending(current) && !user_mode(regs))
 				goto no_context;
 			return 0;
 		}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index eb263e61daf4..be10b441d9cc 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -104,7 +104,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	/* The most common case -- we are done. */
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 5baeb022f474..62c2d39d2bed 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -163,7 +163,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 9b6163c05a75..d9808a807ab8 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,7 +138,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	fault = handle_mm_fault(vma, address, flags);
 	pr_debug("handle_mm_fault returns %x\n", fault);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return 0;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 202ad6a494f5..4fd2dbd0c5ca 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -217,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 73d8a0f0b810..92374fd091d2 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -154,7 +154,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 68d5f2a27f38..da777de8a62e 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -206,12 +206,12 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	fault = handle_mm_fault(vma, addr, flags);
 
 	/*
-	 * If we need to retry but a fatal signal is pending, handle the
+	 * If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		if (!user_mode(regs))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
+		if (fatal_signal_pending(current) && !user_mode(regs))
 			goto no_context;
 		return;
 	}
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 24fd84cf6006..5939434a31ae 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -134,7 +134,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index dc4dbafc1d83..873ecb5d82d7 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -165,7 +165,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index c8e8b7c05558..29422eec329d 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -303,7 +303,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 887f11bcf330..aaa853e6592f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -591,6 +591,8 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 			 */
 			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
+			if (is_user && signal_pending(current))
+				return 0;
 			if (!fatal_signal_pending(current))
 				goto retry;
 		}
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 88401d5125bc..4fc8d746bec3 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -123,11 +123,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	fault = handle_mm_fault(vma, addr, flags);
 
 	/*
-	 * If we need to retry but a fatal signal is pending, handle the
+	 * If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(tsk))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 11613362c4e7..aba1dad1efcd 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -476,9 +476,12 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
-	/* No reason to continue if interrupted by SIGKILL. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		fault = VM_FAULT_SIGNAL;
+	/* Do not continue if interrupted by signals. */
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
+		if (fatal_signal_pending(current))
+			fault = VM_FAULT_SIGNAL;
+		else
+			fault = 0;
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			goto out_up;
 		goto out;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 6defd2c6d9b1..baf5d73df40c 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -506,6 +506,10 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 			 * have already released it in __lock_page_or_retry
 			 * in mm/filemap.c.
 			 */
+
+			if (user_mode(regs) && signal_pending(tsk))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index b0440b0edd97..a2c83104fe35 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -269,6 +269,9 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(tsk))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 8f8a604c1300..cad71ec5c7b3 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -467,6 +467,9 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(current))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 0e8b6158f224..05dcd4c5f0d5 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -76,8 +76,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 
 		fault = handle_mm_fault(vma, address, flags);
 
-		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+		if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
+			if (is_user && !fatal_signal_pending(current))
+				err = 0;
 			goto out_nosemaphore;
+		}
 
 		if (unlikely(fault & VM_FAULT_ERROR)) {
 			if (fault & VM_FAULT_OOM) {
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index b9a3a50644c1..3611f19234a1 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -248,11 +248,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 
 	fault = __do_pf(mm, addr, fsr, flags, tsk);
 
-	/* If we need to retry but a fatal signal is pending, handle the
+	/* If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return 0;
 
 	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9d5c75f02295..248ff0a28ecd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1481,16 +1481,20 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * that we made any progress. Handle this case first.
 	 */
 	if (unlikely(fault & VM_FAULT_RETRY)) {
+		bool is_user = flags & FAULT_FLAG_USER;
+
 		/* Retry at most once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
 			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
+			if (is_user && signal_pending(tsk))
+				return;
 			if (!fatal_signal_pending(tsk))
 				goto retry;
 		}
 
 		/* User mode? Just return to handle the fatal exception */
-		if (flags & FAULT_FLAG_USER)
+		if (is_user)
 			return;
 
 		/* Not returning to user mode? Handle exceptions or die: */
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 2ab0e0dcd166..792dad5e2f12 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -136,6 +136,9 @@ void do_page_fault(struct pt_regs *regs)
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(current))
+				return;
+
 			goto retry;
 		}
 	}
-- 
2.17.1



* [PATCH v3 03/28] userfaultfd: don't retake mmap_sem to emulate NOPAGE
@ 2019-03-20  2:06 Peter Xu
From: Peter Xu @ 2019-03-20  2:06 UTC
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A. Shutemov,
	Dr. David Alan Gilbert

The idea comes from the upstream discussion between Linus and Andrea:

https://lkml.org/lkml/2017/10/30/560

A summary of the issue: in the past there was a special path in
handle_userfault() where we would return VM_FAULT_NOPAGE when we
detected non-fatal signals while waiting for userfault handling.  We
did that by reacquiring the mmap_sem before returning.  However, that
brings a risk in that the vmas might have changed when we retake the
mmap_sem, and we could even be holding an invalid vma structure.

This patch removes the risky path in handle_userfault(); then we can
be sure that the callers of handle_mm_fault() will know that the VMAs
might have changed.  Meanwhile we don't lose responsiveness either,
since with the previous patch the core mm code can now handle
non-fatal userspace signals quickly even if we return VM_FAULT_RETRY.

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 89800fc7dc9d..b397bc3b954d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -514,30 +514,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	__set_current_state(TASK_RUNNING);
 
-	if (return_to_userland) {
-		if (signal_pending(current) &&
-		    !fatal_signal_pending(current)) {
-			/*
-			 * If we got a SIGSTOP or SIGCONT and this is
-			 * a normal userland page fault, just let
-			 * userland return so the signal will be
-			 * handled and gdb debugging works.  The page
-			 * fault code immediately after we return from
-			 * this function is going to release the
-			 * mmap_sem and it's not depending on it
-			 * (unlike gup would if we were not to return
-			 * VM_FAULT_RETRY).
-			 *
-			 * If a fatal signal is pending we still take
-			 * the streamlined VM_FAULT_RETRY failure path
-			 * and there's no need to retake the mmap_sem
-			 * in such case.
-			 */
-			down_read(&mm->mmap_sem);
-			ret = VM_FAULT_NOPAGE;
-		}
-	}
-
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run
-- 
2.17.1



* [PATCH v3 04/28] mm: allow VM_FAULT_RETRY for multiple times
@ 2019-03-20  2:06 Peter Xu
From: Peter Xu @ 2019-03-20  2:06 UTC
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A. Shutemov,
	Dr. David Alan Gilbert

The idea comes from a discussion between Linus and Andrea [1].

Before this patch we only allow a page fault to retry once.  We
achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when calling
handle_mm_fault() the second time.  This was mainly used to avoid
unexpected starvation of the system by looping forever on the page
fault of a single page.  However, that should hardly happen; after
all, each code path that returns VM_FAULT_RETRY first waits for a
condition to happen (during which time it should possibly yield the
cpu) before VM_FAULT_RETRY is really returned.

This patch removes the restriction by keeping the
FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
that the page fault handler can now retry the page fault multiple
times if necessary without the need to generate another page fault
event.  Meanwhile we still keep the FAULT_FLAG_TRIED flag so the page
fault handler can still identify whether a page fault is the first
attempt or not.

Then we'll have these combinations of fault flags (only considering
the ALLOW_RETRY flag and the TRIED flag):

  - ALLOW_RETRY and !TRIED:  this means the page fault allows retry,
                             and this is the first try

  - ALLOW_RETRY and TRIED:   this means the page fault allows retry,
                             and this is not the first try

  - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
                             retry at all

  - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used

Existing code has multiple places that take special care of the first
condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to
detect the first attempt of a page fault by checking against
both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flags &
FAULT_FLAG_TRIED), because now even the 2nd try will have ALLOW_RETRY
set, then uses that helper in all existing special paths.  One example
is __lock_page_or_retry(): now we'll drop the mmap_sem only in the
first attempt of the page fault and keep it in follow-up retries, so
the old locking behavior is retained.
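
For illustration, a typical arch fault handler retry loop after this
change looks like the sketch below (condensed from the arch diffs that
follow; vma, address, flags and regs are the usual fault-handler
locals):

  retry:
          fault = handle_mm_fault(vma, address, flags);
          if (fault & VM_FAULT_RETRY) {
                  /* ALLOW_RETRY stays set; just mark that we tried */
                  flags |= FAULT_FLAG_TRIED;
                  /* give pending signals a chance before retrying */
                  if (user_mode(regs) && signal_pending(current))
                          return;
                  goto retry;
          }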

This will be a nice enhancement for the current code [2] and at the
same time supporting material for the future userfaultfd-writeprotect
work, since in that work there will always be an explicit userfault
writeprotect retry for protected pages, and if that cannot resolve the
page fault (e.g., when userfaultfd-writeprotect is used in conjunction
with swapped pages) then we'll possibly need a 3rd retry of the page
fault.  It might also benefit other potential users who will have
similar requirements, like userfault write-protection.

GUP code is not touched yet and will be covered in a follow-up patch.

Please read the thread below for more information.

[1] https://lkml.org/lkml/2017/11/2/833
[2] https://lkml.org/lkml/2018/12/30/64

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c           |  2 +-
 arch/arc/mm/fault.c             |  1 -
 arch/arm/mm/fault.c             |  3 ---
 arch/arm64/mm/fault.c           |  5 -----
 arch/hexagon/mm/vm_fault.c      |  1 -
 arch/ia64/mm/fault.c            |  1 -
 arch/m68k/mm/fault.c            |  3 ---
 arch/microblaze/mm/fault.c      |  1 -
 arch/mips/mm/fault.c            |  1 -
 arch/nds32/mm/fault.c           |  1 -
 arch/nios2/mm/fault.c           |  3 ---
 arch/openrisc/mm/fault.c        |  1 -
 arch/parisc/mm/fault.c          |  4 +---
 arch/powerpc/mm/fault.c         |  6 ------
 arch/riscv/mm/fault.c           |  5 -----
 arch/s390/mm/fault.c            |  5 +----
 arch/sh/mm/fault.c              |  1 -
 arch/sparc/mm/fault_32.c        |  1 -
 arch/sparc/mm/fault_64.c        |  1 -
 arch/um/kernel/trap.c           |  1 -
 arch/unicore32/mm/fault.c       |  4 +---
 arch/x86/mm/fault.c             |  2 --
 arch/xtensa/mm/fault.c          |  1 -
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 ++++++++---
 include/linux/mm.h              | 38 ++++++++++++++++++++++++++++++++-
 mm/filemap.c                    |  2 +-
 mm/shmem.c                      |  2 +-
 27 files changed, 52 insertions(+), 56 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 8a2ef90b4bfc..6a02c0fb36b9 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -169,7 +169,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 9e9e6eb1f7d0..e7d2947ba72c 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -167,7 +167,6 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 			}
 
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c41c021bbe40..7910b4b5205d 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -342,9 +342,6 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 					regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index a38ff8c49a66..d1d3c98f9ffb 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -523,12 +523,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 			return 0;
 		}
 
-		/*
-		 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
-		 * starvation.
-		 */
 		if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
-			mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			mm_flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index be10b441d9cc..576751597e77 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -115,7 +115,6 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 62c2d39d2bed..9de95d39935e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -189,7 +189,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index d9808a807ab8..b1b2109e4ab4 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -162,9 +162,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 4fd2dbd0c5ca..05a4847ac0bf 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -236,7 +236,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 92374fd091d2..9953b5b571df 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -178,7 +178,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 			tsk->min_flt++;
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index da777de8a62e..3642bdd7909d 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -242,7 +242,6 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 5939434a31ae..9dd1c51acc22 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -158,9 +158,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 873ecb5d82d7..ff92c5674781 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -185,7 +185,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 29422eec329d..675b221af198 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -327,14 +327,12 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-
 			/*
 			 * No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
 			 * in mm/filemap.c.
 			 */
-
+			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
 	}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index aaa853e6592f..c831cb3ce03f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -583,13 +583,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * case.
 	 */
 	if (unlikely(fault & VM_FAULT_RETRY)) {
-		/* We retry only once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			if (is_user && signal_pending(current))
 				return 0;
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 4fc8d746bec3..aad2c0557d2f 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -154,11 +154,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY);
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index aba1dad1efcd..4e8c066964a9 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -513,10 +513,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 				fault = VM_FAULT_PFAULT;
 				goto out_up;
 			}
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY |
-				   FAULT_FLAG_RETRY_NOWAIT);
+			flags &= ~FAULT_FLAG_RETRY_NOWAIT;
 			flags |= FAULT_FLAG_TRIED;
 			down_read(&mm->mmap_sem);
 			goto retry;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index baf5d73df40c..cd710e2d7c57 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -498,7 +498,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 				      regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index a2c83104fe35..6735cd1c09b9 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -261,7 +261,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cad71ec5c7b3..28d5b4d012c6 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -459,7 +459,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 05dcd4c5f0d5..e7723c133c7f 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -99,7 +99,6 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 
 				goto retry;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 3611f19234a1..efca122b5ef7 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -261,9 +261,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 		else
 			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
 	}
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 248ff0a28ecd..d842c3e02a50 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1483,9 +1483,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (unlikely(fault & VM_FAULT_RETRY)) {
 		bool is_user = flags & FAULT_FLAG_USER;
 
-		/* Retry at most once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			if (is_user && signal_pending(tsk))
 				return;
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 792dad5e2f12..7cd55f2d66c9 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -128,7 +128,6 @@ void do_page_fault(struct pt_regs *regs)
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index a1d977fbade5..5fac635f72a5 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -61,9 +61,10 @@ static vm_fault_t ttm_bo_vm_fault_idle(struct ttm_buffer_object *bo,
 
 	/*
 	 * If possible, avoid waiting for GPU with mmap_sem
-	 * held.
+	 * held.  We only do this if the fault allows retry and this
+	 * is the first attempt.
 	 */
-	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+	if (fault_flag_allow_retry_first(vmf->flags)) {
 		ret = VM_FAULT_RETRY;
 		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
 			goto out_unlock;
@@ -136,7 +137,12 @@ static vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
 		if (err != -EBUSY)
 			return VM_FAULT_NOPAGE;
 
-		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+		/*
+		 * If the fault allows retry and this is the first
+		 * fault attempt, we try to release the mmap_sem
+		 * before waiting
+		 */
+		if (fault_flag_allow_retry_first(vmf->flags)) {
 			if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
 				ttm_bo_get(bo);
 				up_read(&vmf->vma->vm_mm->mmap_sem);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..f73dbc4a1957 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -336,16 +336,52 @@ extern unsigned int kobjsize(const void *objp);
  */
 extern pgprot_t protection_map[16];
 
+/*
+ * About FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED: we control whether a
+ * page fault is allowed to retry by setting these two fault flags
+ * correctly.  Currently there can be three legal combinations:
+ *
+ * (a) ALLOW_RETRY and !TRIED:  this means the page fault allows retry, and
+ *                              this is the first try
+ *
+ * (b) ALLOW_RETRY and TRIED:   this means the page fault allows retry, and
+ *                              we've already tried at least once
+ *
+ * (c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry
+ *
+ * The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should never
+ * be used.  Note that page faults can be allowed to retry multiple times,
+ * in which case we'll have an initial fault with flags (a), then later on
+ * continuous faults with flags (b).  We should always try to detect pending
+ * signals before a retry to make sure the continuous page faults can still be
+ * interrupted if necessary.
+ */
+
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_MKWRITE	0x02	/* Fault was mkwrite of existing pte */
 #define FAULT_FLAG_ALLOW_RETRY	0x04	/* Retry fault if blocking */
 #define FAULT_FLAG_RETRY_NOWAIT	0x08	/* Don't drop mmap_sem and wait when retrying */
 #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
-#define FAULT_FLAG_TRIED	0x20	/* Second try */
+#define FAULT_FLAG_TRIED	0x20	/* We've tried once */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
 
+/*
+ * Returns true if the page fault allows retry and this is the first
+ * attempt of the fault handling; false otherwise.  This is mostly
+ * used for places where we want to try to avoid taking the mmap_sem
+ * for too long a time when waiting for another condition to change,
+ * in which case we can try to be polite and release the mmap_sem in
+ * the first round to avoid potential starvation of other processes
+ * that would also want the mmap_sem.
+ */
+static inline bool fault_flag_allow_retry_first(unsigned int flags)
+{
+	return (flags & FAULT_FLAG_ALLOW_RETRY) &&
+	    (!(flags & FAULT_FLAG_TRIED));
+}
+
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
 	{ FAULT_FLAG_MKWRITE,		"MKWRITE" }, \
diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..a2b5c53166de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1351,7 +1351,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
-	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_sem is not released
 		 * even though return 0.
diff --git a/mm/shmem.c b/mm/shmem.c
index 2c012eee133d..ac875b79281c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1949,7 +1949,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
 			ret = VM_FAULT_NOPAGE;
-			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
+			if (fault_flag_allow_retry_first(vmf->flags) &&
 			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
 				/* It's polite to up mmap_sem if we can */
 				up_read(&vma->vm_mm->mmap_sem);
-- 
2.17.1
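
For reference, a minimal sketch (not part of the patch) of how an arch
fault handler is expected to drive the retry flags after this change,
following the pattern the arch hunks above converge on:

	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

retry:
	fault = handle_mm_fault(vma, address, flags);
	if (fault & VM_FAULT_RETRY) {
		/* mmap_sem has already been released by the core mm */
		if (fatal_signal_pending(current))
			return;
		/*
		 * Keep ALLOW_RETRY set so the fault can now be retried
		 * more than once; only add TRIED.
		 */
		flags |= FAULT_FLAG_TRIED;
		down_read(&mm->mmap_sem);
		goto retry;
	}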


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 05/28] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (3 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 04/28] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 06/28] userfaultfd: wp: add helper for writeprotect check Peter Xu
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This is the gup counterpart of the change that allows VM_FAULT_RETRY to
happen more than once.
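
As a sketch of what this means for a gup caller (a hypothetical call
site; the existing get_user_pages_locked() signature is unchanged):

	int locked = 1;
	long nr;

	down_read(&mm->mmap_sem);
	nr = get_user_pages_locked(start, 1, FOLL_WRITE, pages, &locked);
	if (locked)
		up_read(&mm->mmap_sem);
	/*
	 * Before this patch, an internal VM_FAULT_RETRY could only be
	 * serviced once (the retry went through __get_user_pages()
	 * with FOLL_TRIED and no further retry was possible).  With
	 * it, the retry loop keeps going, using FAULT_FLAG_ALLOW_RETRY
	 * | FAULT_FLAG_TRIED, until the fault succeeds or fails for
	 * good.
	 */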

Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 17 +++++++++++++----
 mm/hugetlb.c |  6 ++++--
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 9bb3bed68ee3..f56dee055f26 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -528,7 +528,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
 	if (*flags & FOLL_TRIED) {
-		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		/*
+		 * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
+		 * can co-exist
+		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
 
@@ -943,17 +946,23 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
 		pages += ret;
 		start += ret << PAGE_SHIFT;
+		lock_dropped = true;
 
+retry:
 		/*
 		 * Repeat on the address that fired VM_FAULT_RETRY
-		 * without FAULT_FLAG_ALLOW_RETRY but with
+		 * with both FAULT_FLAG_ALLOW_RETRY and
 		 * FAULT_FLAG_TRIED.
 		 */
 		*locked = 1;
-		lock_dropped = true;
 		down_read(&mm->mmap_sem);
 		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
-				       pages, NULL, NULL);
+				       pages, NULL, locked);
+		if (!*locked) {
+			/* Continue to retry until we succeed */
+			BUG_ON(ret != 0);
+			goto retry;
+		}
 		if (ret != 1) {
 			BUG_ON(ret > 1);
 			if (!pages_done)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 52296ce4025a..040779a7b906 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4267,8 +4267,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
 					FAULT_FLAG_RETRY_NOWAIT;
 			if (flags & FOLL_TRIED) {
-				VM_WARN_ON_ONCE(fault_flags &
-						FAULT_FLAG_ALLOW_RETRY);
+				/*
+				 * Note: FAULT_FLAG_ALLOW_RETRY and
+				 * FAULT_FLAG_TRIED can co-exist
+				 */
 				fault_flags |= FAULT_FLAG_TRIED;
 			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 06/28] userfaultfd: wp: add helper for writeprotect check
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (4 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 05/28] mm: gup: " Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 07/28] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Pavel Emelyanov, Rik van Riel

From: Shaohua Li <shli@fb.com>

Add a helper for the writeprotect check. It will be used later.
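
For context, this is how the helper ends up being used by a later patch
in this series (the do_wp_page() hook in patch 07):

	if (userfaultfd_wp(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return handle_userfault(vmf, VM_UFFD_WP);
	}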

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 37c9eba75c98..38f748e7186e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -50,6 +50,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -94,6 +99,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 07/28] userfaultfd: wp: hook userfault handler to write protection fault
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (5 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 06/28] userfaultfd: wp: add helper for writeprotect check Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-04-18 20:03   ` Jerome Glisse
  2019-03-20  2:06 ` [PATCH v3 08/28] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

There are several cases in which a write protection fault happens. It
could be a write to the zero page, to a swapped page, or to a userfault
write protected page. When the fault happens, there is no way to know
whether userfaultfd write protected the page beforehand. Here we just
blindly issue a userfault notification for VMAs with VM_UFFD_WP set,
regardless of whether the application has write protected the page yet.
The application should be ready to handle such wp faults.

v1: From: Shaohua Li <shli@fb.com>

v2: Handle the userfault in the common do_wp_page. If we get there, a
pagetable is present and readonly, so there is no need to do further
processing until we solve the userfault.

In the swapin case, always swap in as readonly. This will cause false
positive userfaults. We need to decide later whether to eliminate them
with a flag like soft-dirty in the swap entry (see
_PAGE_SWP_SOFT_DIRTY).

hugetlbfs wouldn't need to worry about swapouts, and tmpfs would be
handled by a swap entry bit like anonymous memory.

The main problem, with no easy solution for eliminating the false
positives, will arise if/when userfaultfd is extended to real
filesystem pagecache: when the pagecache is freed by reclaim, we can't
leave the radix tree pinned if the inode, and in turn the radix tree,
is reclaimed as well.

The estimation is that full accuracy and lack of false positives could
easily be provided only for anonymous memory (as long as there's no
fork, or as long as MADV_DONTFORK is used on the userfaultfd anonymous
range), tmpfs and hugetlbfs; it's most certainly worth achieving, but
in a later incremental patch.

v3: Add hooking point for THP wrprotect faults.
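
To illustrate the userspace side, here is a rough sketch (hypothetical;
it assumes the UFFDIO_WRITEPROTECT ioctl and uapi structures introduced
later in this series, and a page_size variable known to the monitor):

	struct uffd_msg msg;
	struct uffdio_writeprotect wp;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
		return;
	if (msg.event == UFFD_EVENT_PAGEFAULT &&
	    (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
		/* Record the write, then un-protect the page to resolve */
		wp.range.start = msg.arg.pagefault.address & ~(page_size - 1);
		wp.range.len = page_size;
		wp.mode = 0;	/* clear the write protection */
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	}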

CC: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..567686ec086d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2483,6 +2483,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
+	if (userfaultfd_wp(vma)) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return handle_userfault(vmf, VM_UFFD_WP);
+	}
+
 	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
 	if (!vmf->page) {
 		/*
@@ -3684,8 +3689,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 /* `inline' is required to avoid gcc 4.1.2 build error */
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
-	if (vma_is_anonymous(vmf->vma))
+	if (vma_is_anonymous(vmf->vma)) {
+		if (userfaultfd_wp(vmf->vma))
+			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
+	}
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 08/28] userfaultfd: wp: add WP pagetable tracking to x86
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (6 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 07/28] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 09/28] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

Accurate userfaultfd WP tracking is possible by tracking exactly which
virtual memory ranges were writeprotected by userland. We can't rely
only on the RW bit of the mapped pagetable because that information is
destroyed by fork() or KSM or swap. If we were to rely on that, we'd
need to stay on the safe side and generate false positive wp faults
for every swapped out page.
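
In sketch form (kernel-internal, assuming a pte_t pte and a pgprot_t
newprot in scope), the invariant this patch establishes is that the
marker travels with the pte independently of the RW bit:

	pte_t marked = pte_mkuffd_wp(pte_wrprotect(pte));

	/* The marker must survive pte_modify(), which is why this
	 * patch also adds _PAGE_UFFD_WP to _PAGE_CHG_MASK: */
	WARN_ON(!pte_uffd_wp(pte_modify(marked, newprot)));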

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: append _PAGE_UFD_WP to _PAGE_CHG_MASK]
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/pgtable.h       | 52 ++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_64.h    |  8 ++++-
 arch/x86/include/asm/pgtable_types.h | 11 +++++-
 include/asm-generic/pgtable.h        |  1 +
 include/asm-generic/pgtable_uffd.h   | 51 +++++++++++++++++++++++++++
 init/Kconfig                         |  5 +++
 7 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100644 include/asm-generic/pgtable_uffd.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5a02dd608f74..d2947525907f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -209,6 +209,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select HAVE_ARCH_USERFAULTFD_WP		if USERFAULTFD
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..6863236e8484 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -23,6 +23,7 @@
 
 #ifndef __ASSEMBLY__
 #include <asm/x86_init.h>
+#include <asm-generic/pgtable_uffd.h>
 
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
@@ -293,6 +294,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pte_uffd_wp(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_UFFD_WP;
+}
+
+static inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_UFFD_WP);
+}
+
+static inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 static inline pte_t pte_mkclean(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_DIRTY);
@@ -372,6 +390,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pmd_uffd_wp(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_UFFD_WP;
+}
+
+static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_UFFD_WP);
+}
+
+static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 static inline pmd_t pmd_mkold(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
@@ -1351,6 +1386,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 #endif
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pte_swp_uffd_wp(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
 #define PKRU_BITS_PER_PKEY 2
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 9c85b54bf03c..e0c5d29b8685 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
  *
  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
  * erratum where they can be incorrectly set by hardware on
  * non-present PTEs.
  *
+ * Bits 1-4 are not used in the non-present format and are available
+ * for special use described below:
+ *
  * SD (1) in swp entry is used to store soft dirty bit, which helps us
  * remember soft dirty over page migration
  *
+ * F (2) in swp entry is used to record when a pagetable is
+ * writeprotected by userfaultfd WP support.
+ *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
  *
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index d6ff0bbdb394..dd9c6295d610 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -32,6 +32,7 @@
 
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
+#define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
@@ -100,6 +101,14 @@
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+#define _PAGE_UFFD_WP		(_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
+#define _PAGE_SWP_UFFD_WP	_PAGE_USER
+#else
+#define _PAGE_UFFD_WP		(_AT(pteval_t, 0))
+#define _PAGE_SWP_UFFD_WP	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
@@ -124,7 +133,7 @@
  */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_UFFD_WP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 /*
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 05e61e6c843f..f49afe951711 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -10,6 +10,7 @@
 #include <linux/mm_types.h>
 #include <linux/bug.h>
 #include <linux/errno.h>
+#include <asm-generic/pgtable_uffd.h>
 
 #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
 	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
new file mode 100644
index 000000000000..643d1bf559c2
--- /dev/null
+++ b/include/asm-generic/pgtable_uffd.h
@@ -0,0 +1,51 @@
+#ifndef _ASM_GENERIC_PGTABLE_UFFD_H
+#define _ASM_GENERIC_PGTABLE_UFFD_H
+
+#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static __always_inline int pte_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
+static __always_inline int pmd_uffd_wp(pmd_t pmd)
+{
+	return 0;
+}
+
+static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline int pte_swp_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
+static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
+#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index c9386a365eea..892d61ddf2eb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1424,6 +1424,11 @@ config ADVISE_SYSCALLS
 	  applications use these syscalls, you can disable this option to save
 	  space.
 
+config HAVE_ARCH_USERFAULTFD_WP
+	bool
+	help
+	  Arch has userfaultfd write protection support
+
 config MEMBARRIER
 	bool "Enable membarrier() system call" if EXPERT
 	default y
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 09/28] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (7 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 08/28] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 10/28] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

Implement helper methods to invoke userfaultfd wp faults more
selectively: not just whenever a wp fault triggers on a vma with
VM_UFFD_WP set in vma->vm_flags, but only if the _PAGE_UFFD_WP bit is
also set in the pagetable.
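
In sketch form, the wp fault check becomes two-level: the vma must be
registered for wp tracking and the individual pte must carry the
marker.  This is exactly how a later patch in the series rewires
do_wp_page():

	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return handle_userfault(vmf, VM_UFFD_WP);
	}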

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 38f748e7186e..c6590c58ce28 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -14,6 +14,8 @@
 #include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
 
 #include <linux/fcntl.h>
+#include <linux/mm.h>
+#include <asm-generic/pgtable_uffd.h>
 
 /*
  * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
@@ -55,6 +57,18 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_WP;
 }
 
+static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
+				      pte_t pte)
+{
+	return userfaultfd_wp(vma) && pte_uffd_wp(pte);
+}
+
+static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
+					   pmd_t pmd)
+{
+	return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -104,6 +118,19 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
+				      pte_t pte)
+{
+	return false;
+}
+
+static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
+					   pmd_t pmd)
+{
+	return false;
+}
+
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 10/28] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (8 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 09/28] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 11/28] mm: merge parameters for change_protection() Peter Xu
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

This allows UFFDIO_COPY to map pages write-protected.
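
A userspace usage sketch (hypothetical addresses and fd; the range must
have been registered with the wp mode for the new flag to be accepted):

	struct uffdio_copy copy = {
		.dst = dst_addr,
		.src = (unsigned long)src_buf,
		.len = page_size,
		.mode = UFFDIO_COPY_MODE_WP,	/* map it write protected */
	};

	if (ioctl(uffd, UFFDIO_COPY, &copy))
		err(1, "UFFDIO_COPY");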

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
 around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
 commit messages]
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                 |  5 +++--
 include/linux/userfaultfd_k.h    |  2 +-
 include/uapi/linux/userfaultfd.h | 11 +++++-----
 mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
 4 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b397bc3b954d..3092885c9d2c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1683,11 +1683,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
-	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
 		goto out;
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
-				   uffdio_copy.len, &ctx->mmap_changing);
+				   uffdio_copy.len, &ctx->mmap_changing,
+				   uffdio_copy.mode);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c6590c58ce28..765ce884cec0 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -34,7 +34,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
-			    bool *mmap_changing);
+			    bool *mmap_changing, __u64 mode);
 extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..340f23bc251d 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -203,13 +203,14 @@ struct uffdio_copy {
 	__u64 dst;
 	__u64 src;
 	__u64 len;
+#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
 	/*
-	 * There will be a wrprotection flag later that allows to map
-	 * pages wrprotected on the fly. And such a flag will be
-	 * available if the wrprotection ioctl are implemented for the
-	 * range according to the uffdio_register.ioctls.
+	 * UFFDIO_COPY_MODE_WP will map the page write protected on
+	 * the fly.  UFFDIO_COPY_MODE_WP is available only if the
+	 * write protected ioctl is implemented for the range
+	 * according to the uffdio_register.ioctls.
 	 */
-#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
 	__u64 mode;
 
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d59b5a73dfb3..eaecc21806da 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct vm_area_struct *dst_vma,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
-			    struct page **pagep)
+			    struct page **pagep,
+			    bool wp_copy)
 {
 	struct mem_cgroup *memcg;
 	pte_t _dst_pte, *dst_pte;
@@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
 		goto out_release;
 
-	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	if (dst_vma->vm_flags & VM_WRITE)
-		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
+	if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy)
+		_dst_pte = pte_mkwrite(_dst_pte);
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 	if (dst_vma->vm_file) {
@@ -399,7 +400,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						struct page **page,
-						bool zeropage)
+						bool zeropage,
+						bool wp_copy)
 {
 	ssize_t err;
 
@@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 	if (!(dst_vma->vm_flags & VM_SHARED)) {
 		if (!zeropage)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					       dst_addr, src_addr, page);
+					       dst_addr, src_addr, page,
+					       wp_copy);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr);
 	} else {
+		VM_WARN_ON_ONCE(wp_copy);
 		if (!zeropage)
 			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
 						     dst_vma, dst_addr,
@@ -438,7 +442,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long src_start,
 					      unsigned long len,
 					      bool zeropage,
-					      bool *mmap_changing)
+					      bool *mmap_changing,
+					      __u64 mode)
 {
 	struct vm_area_struct *dst_vma;
 	ssize_t err;
@@ -446,6 +451,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct page *page;
+	bool wp_copy;
 
 	/*
 	 * Sanitize the command parameters:
@@ -502,6 +508,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	    dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;
 
+	/*
+	 * validate 'mode' now that we know the dst_vma: don't allow
+	 * a wrprotect copy if the userfaultfd didn't register as WP.
+	 */
+	wp_copy = mode & UFFDIO_COPY_MODE_WP;
+	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
+		goto out_unlock;
+
 	/*
 	 * If this is a HUGETLB vma, pass off to appropriate routine
 	 */
@@ -557,7 +571,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, zeropage);
+				       src_addr, &page, zeropage, wp_copy);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
@@ -604,14 +618,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 		     unsigned long src_start, unsigned long len,
-		     bool *mmap_changing)
+		     bool *mmap_changing, __u64 mode)
 {
 	return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
-			      mmap_changing);
+			      mmap_changing, mode);
 }
 
 ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 		       unsigned long len, bool *mmap_changing)
 {
-	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
+	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 11/28] mm: merge parameters for change_protection()
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (9 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 10/28] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 12/28] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

change_protection() was used by either the NUMA or the mprotect() code;
there's one parameter for each of the callers (dirty_accountable and
prot_numa).  Further, these parameters are passed along the call chain:

  - change_protection_range()
  - change_p4d_range()
  - change_pud_range()
  - change_pmd_range()
  - ...

Now we introduce a flag for change_protect() and all these helpers to
replace these parameters.  Then we can avoid passing multiple parameters
multiple times along the way.

More importantly, it'll greatly simplify the work if we want to
introduce any new parameters to change_protection().  In the follow up
patches, a new parameter for userfaultfd write protection will be
introduced.

No functional change at all.
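
For example, the NUMA hinting caller goes from:

	change_protection(vma, addr, end, PAGE_NONE, 0, 1);

to:

	change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA);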

Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h |  2 +-
 include/linux/mm.h      | 14 +++++++++++++-
 mm/huge_memory.c        |  3 ++-
 mm/mempolicy.c          |  2 +-
 mm/mprotect.c           | 29 ++++++++++++++++-------------
 5 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 381e872bfde0..1550fb12dbd4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -46,7 +46,7 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
-			int prot_numa);
+			unsigned long cp_flags);
 vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f73dbc4a1957..937559a74dc4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1682,9 +1682,21 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
 		bool need_rmap_locks);
+
+/*
+ * Flags used by change_protection().  For now we make it a bitmap so
+ * that we can pass in multiple flags just like parameters.  However,
+ * for now, all the callers only use one of the flags at the same
+ * time.
+ */
+/* Whether we should allow dirty bit accounting */
+#define  MM_CP_DIRTY_ACCT                  (1UL << 0)
+/* Whether this protection change is for NUMA hints */
+#define  MM_CP_PROT_NUMA                   (1UL << 1)
+
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
-			      int dirty_accountable, int prot_numa);
+			      unsigned long cp_flags);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faf357eaf0ce..8d65b0f041f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1860,13 +1860,14 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot, int prot_numa)
+		unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
 	pmd_t entry;
 	bool preserve_write;
 	int ret;
+	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ee2bce59d2bf..55aed31b4f04 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -554,7 +554,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
-	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
+	nr_updated = change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 36cb358db170..a6ba448c8565 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,13 +37,15 @@
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
+	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
+	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 
 	/*
 	 * Can be called with only the mmap_sem for reading by
@@ -164,7 +166,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -194,7 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-						newprot, prot_numa);
+							      newprot, cp_flags);
 
 				if (nr_ptes) {
 					if (nr_ptes == HPAGE_PMD_NR) {
@@ -209,7 +211,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			/* fall through, the trans huge pmd just split */
 		}
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					      cp_flags);
 		pages += this_pages;
 next:
 		cond_resched();
@@ -225,7 +227,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		p4d_t *p4d, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -237,7 +239,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -245,7 +247,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -257,7 +259,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (p4d++, addr = next, addr != end);
 
 	return pages;
@@ -265,7 +267,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -282,7 +284,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -295,14 +297,15 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 
 unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
-		       int dirty_accountable, int prot_numa)
+		       unsigned long cp_flags)
 {
 	unsigned long pages;
 
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot,
+						cp_flags);
 
 	return pages;
 }
@@ -430,7 +433,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	vma_set_page_prot(vma);
 
 	change_protection(vma, start, end, vma->vm_page_prot,
-			  dirty_accountable, 0);
+			  dirty_accountable ? MM_CP_DIRTY_ACCT : 0);
 
 	/*
 	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 12/28] userfaultfd: wp: apply _PAGE_UFFD_WP bit
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (10 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 11/28] mm: merge parameters for change_protection() Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 13/28] mm: export wp_page_copy() Peter Xu
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
change_protection() when used with uffd-wp, and make sure the two new
flags are never used at the same time.  Then,

  - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

  - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

And use this new interface in mwriteprotect_range() to replace the old
MM_CP_DIRTY_ACCT.

Do this change for both PTEs and huge PMDs.  Then we can start to
identify which PTE/PMD is write protected by general means (e.g., COW
or soft dirty tracking), and which is for userfaultfd-wp.

Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
can be even more strict when detecting uffd-wp page faults in either
do_wp_page() or wp_huge_pmd().
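
In sketch form, a caller (such as the UFFDIO_WRITEPROTECT path added
later in this series) drives the two transitions like this:

	change_protection(dst_vma, start, start + len, newprot,
			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);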

Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  5 +++++
 mm/huge_memory.c   | 14 +++++++++++++-
 mm/memory.c        |  4 ++--
 mm/mprotect.c      | 12 ++++++++++++
 mm/userfaultfd.c   |  8 ++++++--
 5 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 937559a74dc4..b39efe5ca7f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1693,6 +1693,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_DIRTY_ACCT                  (1UL << 0)
 /* Whether this protection change is for NUMA hints */
 #define  MM_CP_PROT_NUMA                   (1UL << 1)
+/* Whether this change is for write protecting */
+#define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
+#define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
+#define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
+					    MM_CP_UFFD_WP_RESOLVE)
 
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8d65b0f041f9..817335b443c2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1868,6 +1868,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	bool preserve_write;
 	int ret;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
@@ -1934,6 +1936,13 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
 		entry = pmd_mk_savedwrite(entry);
+	if (uffd_wp) {
+		entry = pmd_wrprotect(entry);
+		entry = pmd_mkuffd_wp(entry);
+	} else if (uffd_wp_resolve) {
+		entry = pmd_mkwrite(entry);
+		entry = pmd_clear_uffd_wp(entry);
+	}
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
@@ -2083,7 +2092,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
-	bool young, write, soft_dirty, pmd_migration = false;
+	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
 	unsigned long addr;
 	int i;
 
@@ -2165,6 +2174,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = pmd_write(old_pmd);
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
+		uffd_wp = pmd_uffd_wp(old_pmd);
 	}
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
@@ -2198,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_mkold(entry);
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
+			if (uffd_wp)
+				entry = pte_mkuffd_wp(entry);
 		}
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
diff --git a/mm/memory.c b/mm/memory.c
index 567686ec086d..50c2990648ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2483,7 +2483,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_wp(vma)) {
+	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3690,7 +3690,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_wp(vmf->vma))
+		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a6ba448c8565..9d4433044c21 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -46,6 +46,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	int target_node = NUMA_NO_NODE;
 	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	/*
 	 * Can be called with only the mmap_sem for reading by
@@ -117,6 +119,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (preserve_write)
 				ptent = pte_mk_savedwrite(ptent);
 
+			if (uffd_wp) {
+				ptent = pte_wrprotect(ptent);
+				ptent = pte_mkuffd_wp(ptent);
+			} else if (uffd_wp_resolve) {
+				ptent = pte_mkwrite(ptent);
+				ptent = pte_clear_uffd_wp(ptent);
+			}
+
 			/* Avoid taking write faults for known dirty pages */
 			if (dirty_accountable && pte_dirty(ptent) &&
 					(pte_soft_dirty(ptent) ||
@@ -301,6 +311,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 {
 	unsigned long pages;
 
+	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
+
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index eaecc21806da..240de2a8492d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -73,8 +73,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
-	if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy)
-		_dst_pte = pte_mkwrite(_dst_pte);
+	if (dst_vma->vm_flags & VM_WRITE) {
+		if (wp_copy)
+			_dst_pte = pte_mkuffd_wp(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	}
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 	if (dst_vma->vm_file) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 13/28] mm: export wp_page_copy()
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (11 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 12/28] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Export this function so that it can be used outside of page fault
handlers.

Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 2 ++
 mm/memory.c        | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b39efe5ca7f6..00b040e0358d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -441,6 +441,8 @@ struct vm_fault {
 					 */
 };
 
+vm_fault_t wp_page_copy(struct vm_fault *vmf);
+
 /* page entry size for vm->huge_fault() */
 enum page_entry_size {
 	PE_SIZE_PTE = 0,
diff --git a/mm/memory.c b/mm/memory.c
index 50c2990648ab..e7a4b9650225 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2239,7 +2239,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
  *   held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old page.
  */
-static vm_fault_t wp_page_copy(struct vm_fault *vmf)
+vm_fault_t wp_page_copy(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (12 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 13/28] mm: export wp_page_copy() Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-04-18 20:51   ` Jerome Glisse
  2019-03-20  2:06 ` [PATCH v3 15/28] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
                   ` (14 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This allows uffd-wp to support write-protected pages for COW.

For example, a uffd write-protected PTE could also be write-protected
by other usages like COW or the zero page.  When that happens, we can't
simply set the write bit in the PTE, since otherwise writes through it
would change the content seen by every other reference to the page.
Instead, we should do the COW first if necessary, then handle the
uffd-wp fault.

To correctly copy the page, we'll also need to carry over the
_PAGE_UFFD_WP bit if it was set in the original PTE.

For huge PMDs, we always split the huge PMD where we want to resolve
an uffd-wp page fault.  That matches what we do for general huge PMD
write protection.  In that way, we reduce the huge PMD copy-on-write
problem to the PTE copy-on-write case.
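
The core test at resolve time, in sketch form (mirroring the hunk
below): a NULL page from vm_normal_page() means the zero page, and a
mapcount above one means the page is shared, so in both cases the write
bit may only be restored on a private copy:

	page = vm_normal_page(vma, addr, oldpte);
	if (!page || page_mapcount(page) > 1) {
		/* COW first via wp_page_copy(), then retry this pte */
	}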

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c   |  5 +++-
 mm/mprotect.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e7a4b9650225..b8a4c0bab461 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2291,7 +2291,10 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (pte_uffd_wp(vmf->orig_pte))
+			entry = pte_mkuffd_wp(entry);
+		else
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9d4433044c21..855dddb07ff2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -73,18 +73,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 	do {
+retry_pte:
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
 			bool preserve_write = prot_numa && pte_write(oldpte);
+			struct page *page;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct page *page;
-
 				page = vm_normal_page(vma, addr, oldpte);
 				if (!page || PageKsm(page))
 					continue;
@@ -114,6 +114,54 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					continue;
 			}
 
+			/*
+			 * Detect whether we'll need to COW before
+			 * resolving an uffd-wp fault.  Note that this
+			 * includes detection of the zero page (where
+			 * page==NULL)
+			 */
+			if (uffd_wp_resolve) {
+				/* If the fault is resolved already, skip */
+				if (!pte_uffd_wp(*pte))
+					continue;
+				page = vm_normal_page(vma, addr, oldpte);
+				if (!page || page_mapcount(page) > 1) {
+					struct vm_fault vmf = {
+						.vma = vma,
+						.address = addr & PAGE_MASK,
+						.page = page,
+						.orig_pte = oldpte,
+						.pmd = pmd,
+						/* pte and ptl not needed */
+					};
+					vm_fault_t ret;
+
+					if (page)
+						get_page(page);
+					arch_leave_lazy_mmu_mode();
+					pte_unmap_unlock(pte, ptl);
+					ret = wp_page_copy(&vmf);
+					/* PTE is changed, or OOM */
+					if (ret == 0)
+						/* It's done by others */
+						continue;
+					else if (WARN_ON(ret != VM_FAULT_WRITE))
+						return pages;
+					pte = pte_offset_map_lock(vma->vm_mm,
+								  pmd, addr,
+								  &ptl);
+					arch_enter_lazy_mmu_mode();
+					if (!pte_present(*pte))
+						/*
+						 * This PTE could have been
+						 * modified after COW
+						 * before we have taken the
+						 * lock; retry this PTE
+						 */
+						goto retry_pte;
+				}
+			}
+
 			ptent = ptep_modify_prot_start(mm, addr, pte);
 			ptent = pte_modify(ptent, newprot);
 			if (preserve_write)
@@ -183,6 +231,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	range.start = 0;
 
@@ -202,7 +251,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE) {
+			/*
+			 * When resolving a userfaultfd write
+			 * protection fault, it's not easy to identify
+			 * whether a THP is shared with others and
+			 * whether we'll need to do copy-on-write, so
+			 * just split it always for now to simplify
+			 * the procedure.  That's also the policy for
+			 * general THP write-protect in af9e4d5f2de2.
+			 */
+			if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 15/28] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (13 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 16/28] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

UFFD_EVENT_FORK support for uffd-wp is already in place, except that
we should clear the uffd-wp bit if the uffd fork event is not
enabled.  Detect that case to avoid _PAGE_UFFD_WP being set even when
the VMA is not being tracked by VM_UFFD_WP.  Do this for both small
PTEs and huge PMDs.

Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 8 ++++++++
 mm/memory.c      | 8 ++++++++
 2 files changed, 16 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 817335b443c2..fb2234cb595a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -938,6 +938,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	ret = -EAGAIN;
 	pmd = *src_pmd;
 
+	/*
+	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+	 * does not have VM_UFFD_WP set, which means that the uffd
+	 * fork event is not enabled.
+	 */
+	if (!(vma->vm_flags & VM_UFFD_WP))
+		pmd = pmd_clear_uffd_wp(pmd);
+
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
diff --git a/mm/memory.c b/mm/memory.c
index b8a4c0bab461..6405d56debee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -788,6 +788,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
+	/*
+	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+	 * does not have VM_UFFD_WP set, which means that the uffd
+	 * fork event is not enabled.
+	 */
+	if (!(vm_flags & VM_UFFD_WP))
+		pte = pte_clear_uffd_wp(pte);
+
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 16/28] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (14 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 15/28] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 17/28] userfaultfd: wp: support swap and page migration Peter Xu
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Add the missing helpers for uffd-wp operations on PMD swap/migration
entries.

Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h     | 15 +++++++++++++++
 include/asm-generic/pgtable_uffd.h | 15 +++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6863236e8484..18a815d6f4ea 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1401,6 +1401,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
 }
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #define PKRU_AD_BIT 0x1
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 643d1bf559c2..828966d4c281 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -46,6 +46,21 @@ static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
 	return pte;
 }
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 17/28] userfaultfd: wp: support swap and page migration
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (15 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 16/28] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-04-18 20:59   ` Jerome Glisse
  2019-03-20  2:06 ` [PATCH v3 18/28] khugepaged: skip collapse if uffd-wp detected Peter Xu
                   ` (11 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

For both swap and page migration, we use bit 2 of the entry to
identify whether the entry is uffd write-protected.  It plays a role
similar to the existing soft-dirty bit in swap entries, but it only
keeps the uffd-wp tracking for a specific PTE/PMD.

Something special here is that when we want to recover the uffd-wp
bit from a swap/migration entry back into the PTE bit, we'll also
need to make sure the _PAGE_RW bit is cleared; otherwise, even with
the _PAGE_UFFD_WP bit set, we can't trap the write at all.

Note that this patch removes two lines from "userfaultfd: wp: hook
userfault handler to write protection fault", where we tried to
remove FAULT_FLAG_WRITE from vmf->flags when uffd-wp was set for the
VMA.  This patch keeps the write flag there.
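
For reference, a condensed sketch of the round trip this patch
implements (collapsed from the hunks below, not literal code):

	/* unmap/migrate: carry the uffd-wp bit into the swap entry */
	if (pte_uffd_wp(pteval))
		swp_pte = pte_swp_mkuffd_wp(swp_pte);

	/* swap in: restore the bit and clear _PAGE_RW, otherwise the
	 * next write would pass through without being trapped */
	if (pte_swp_uffd_wp(vmf->orig_pte)) {
		pte = pte_mkuffd_wp(pte);
		pte = pte_wrprotect(pte);
	}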

Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 2 ++
 mm/huge_memory.c        | 3 +++
 mm/memory.c             | 6 ++++++
 mm/migrate.c            | 4 ++++
 mm/mprotect.c           | 2 ++
 mm/rmap.c               | 6 ++++++
 6 files changed, 23 insertions(+)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4d961668e5fc..0c2923b1cdb7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -68,6 +68,8 @@ static inline swp_entry_t pte_to_swp_entry(pte_t pte)
 
 	if (pte_swp_soft_dirty(pte))
 		pte = pte_swp_clear_soft_dirty(pte);
+	if (pte_swp_uffd_wp(pte))
+		pte = pte_swp_clear_uffd_wp(pte);
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fb2234cb595a..75de07141801 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2175,6 +2175,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_write_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
+		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
 		page = pmd_page(old_pmd);
 		if (pmd_dirty(old_pmd))
@@ -2207,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
+			if (uffd_wp)
+				entry = pte_swp_mkuffd_wp(entry);
 		} else {
 			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
 			entry = maybe_mkwrite(entry, vma);
diff --git a/mm/memory.c b/mm/memory.c
index 6405d56debee..c3d57fa890f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(*src_pte))
 					pte = pte_swp_mksoft_dirty(pte);
+				if (pte_swp_uffd_wp(*src_pte))
+					pte = pte_swp_mkuffd_wp(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
 		} else if (is_device_private_entry(entry)) {
@@ -2825,6 +2827,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
+	if (pte_swp_uffd_wp(vmf->orig_pte)) {
+		pte = pte_mkuffd_wp(pte);
+		pte = pte_wrprotect(pte);
+	}
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 	vmf->orig_pte = pte;
diff --git a/mm/migrate.c b/mm/migrate.c
index 181f5d2718a9..72cde187d4a1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -241,6 +241,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		entry = pte_to_swp_entry(*pvmw.pte);
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
+		else if (pte_swp_uffd_wp(*pvmw.pte))
+			pte = pte_mkuffd_wp(pte);
 
 		if (unlikely(is_zone_device_page(new))) {
 			if (is_device_private_page(new)) {
@@ -2301,6 +2303,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pte))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pte))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, addr, ptep, swp_pte);
 
 			/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 855dddb07ff2..96c0f521099d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -196,6 +196,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(oldpte))
 					newpte = pte_swp_mksoft_dirty(newpte);
+				if (pte_swp_uffd_wp(oldpte))
+					newpte = pte_swp_mkuffd_wp(newpte);
 				set_pte_at(mm, addr, pte, newpte);
 
 				pages++;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..3750d5a5283c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1469,6 +1469,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
 			/*
 			 * No need to invalidate here it will synchronize on
@@ -1561,6 +1563,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/*
 			 * No need to invalidate here it will synchronize on
@@ -1627,6 +1631,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/* Invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 18/28] khugepaged: skip collapse if uffd-wp detected
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (16 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 17/28] userfaultfd: wp: support swap and page migration Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 19/28] userfaultfd: introduce helper vma_find_uffd Peter Xu
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Don't collapse the huge PMD if any of the small PTEs is userfault
write-protected.  The problem is that write protection is tracked at
small-page granularity, and there's no way to keep all of that
information if the small pages are merged into a huge PMD.

The same consideration applies to swap entries and migration entries,
so do the check there as well, disregarding khugepaged_max_ptes_swap.

Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index dd4db334bd63..2d7bad9cb976 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -13,6 +13,7 @@
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
+	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
 	EM( SCAN_PAGE_RO,		"no_writable_page")		\
 	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
 	EM( SCAN_PAGE_NULL,		"page_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4f017339ddb2..396c7e4da83e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -29,6 +29,7 @@ enum scan_result {
 	SCAN_PMD_NULL,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_PTE_NON_PRESENT,
+	SCAN_PTE_UFFD_WP,
 	SCAN_PAGE_RO,
 	SCAN_LACK_REFERENCED_PAGE,
 	SCAN_PAGE_NULL,
@@ -1123,6 +1124,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
 			if (++unmapped <= khugepaged_max_ptes_swap) {
+				/*
+				 * Always be strict with uffd-wp
+				 * enabled swap entries.  Please see
+				 * comment below for pte_uffd_wp().
+				 */
+				if (pte_swp_uffd_wp(pteval)) {
+					result = SCAN_PTE_UFFD_WP;
+					goto out_unmap;
+				}
 				continue;
 			} else {
 				result = SCAN_EXCEED_SWAP_PTE;
@@ -1142,6 +1152,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			result = SCAN_PTE_NON_PRESENT;
 			goto out_unmap;
 		}
+		if (pte_uffd_wp(pteval)) {
+			/*
+			 * Don't collapse the page if any of the small
+			 * PTEs are armed with uffd write protection.
+			 * Here we could also mark the new huge pmd as
+			 * write protected if any of the small ones is
+			 * marked, but that could bring unknown
+			 * userfault messages that fall outside of
+			 * the registered range.  So, just be simple.
+			 */
+			result = SCAN_PTE_UFFD_WP;
+			goto out_unmap;
+		}
 		if (pte_write(pteval))
 			writable = true;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 19/28] userfaultfd: introduce helper vma_find_uffd
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (17 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 18/28] khugepaged: skip collapse if uffd-wp detected Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 20/28] userfaultfd: wp: support write protection for userfault vma range Peter Xu
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

We have multiple places (and more coming) that would like to find a
userfault-enabled VMA from an mm struct that covers a specific memory
range.  This patch introduces a helper for it and meanwhile applies
it to the existing code.

Suggested-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/userfaultfd.c | 54 +++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 240de2a8492d..2606409572b2 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -20,6 +20,34 @@
 #include <asm/tlbflush.h>
 #include "internal.h"
 
+/*
+ * Find a valid userfault enabled VMA region that covers the whole
+ * address range, or NULL on failure.  Must be called with mmap_sem
+ * held.
+ */
+static struct vm_area_struct *vma_find_uffd(struct mm_struct *mm,
+					    unsigned long start,
+					    unsigned long len)
+{
+	struct vm_area_struct *vma = find_vma(mm, start);
+
+	if (!vma)
+		return NULL;
+
+	/*
+	 * Check the vma is registered in uffd, this is required to
+	 * enforce the VM_MAYWRITE check done at uffd registration
+	 * time.
+	 */
+	if (!vma->vm_userfaultfd_ctx.ctx)
+		return NULL;
+
+	if (start < vma->vm_start || start + len > vma->vm_end)
+		return NULL;
+
+	return vma;
+}
+
 static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    pmd_t *dst_pmd,
 			    struct vm_area_struct *dst_vma,
@@ -228,20 +256,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	 */
 	if (!dst_vma) {
 		err = -ENOENT;
-		dst_vma = find_vma(dst_mm, dst_start);
+		dst_vma = vma_find_uffd(dst_mm, dst_start, len);
 		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
 			goto out_unlock;
-		/*
-		 * Check the vma is registered in uffd, this is
-		 * required to enforce the VM_MAYWRITE check done at
-		 * uffd registration time.
-		 */
-		if (!dst_vma->vm_userfaultfd_ctx.ctx)
-			goto out_unlock;
-
-		if (dst_start < dst_vma->vm_start ||
-		    dst_start + len > dst_vma->vm_end)
-			goto out_unlock;
 
 		err = -EINVAL;
 		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
@@ -488,20 +505,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 * both valid and fully within a single existing vma.
 	 */
 	err = -ENOENT;
-	dst_vma = find_vma(dst_mm, dst_start);
+	dst_vma = vma_find_uffd(dst_mm, dst_start, len);
 	if (!dst_vma)
 		goto out_unlock;
-	/*
-	 * Check the vma is registered in uffd, this is required to
-	 * enforce the VM_MAYWRITE check done at uffd registration
-	 * time.
-	 */
-	if (!dst_vma->vm_userfaultfd_ctx.ctx)
-		goto out_unlock;
-
-	if (dst_start < dst_vma->vm_start ||
-	    dst_start + len > dst_vma->vm_end)
-		goto out_unlock;
 
 	err = -EINVAL;
 	/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 20/28] userfaultfd: wp: support write protection for userfault vma range
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (18 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 19/28] userfaultfd: introduce helper vma_find_uffd Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 21/28] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

From: Shaohua Li <shli@fb.com>

Add an API to enable/disable write protection for a vma range.
Unlike mprotect, this doesn't split/merge vmas.
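
A minimal sketch of the calling convention (hypothetical caller; the
real user is the UFFDIO_WRITEPROTECT ioctl path added later in this
series):

	/* write-protect [start, start + len); can fail with -EAGAIN
	 * (mmap_changing) or -ENOENT (no valid uffd VMA found) */
	err = mwriteprotect_range(mm, start, len, true, &mmap_changing);

	/* ...and later drop the protection on the same range */
	err = mwriteprotect_range(mm, start, len, false, &mmap_changing);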

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx:
 - use the helper to find VMA;
 - return -ENOENT if not found to match mcopy case;
 - use the new MM_CP_UFFD_WP* flags for change_protection
 - check against mmap_changing for failures]
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h |  3 ++
 mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 765ce884cec0..8f6e6ed544fb 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len,
 			      bool *mmap_changing);
+extern int mwriteprotect_range(struct mm_struct *dst_mm,
+			       unsigned long start, unsigned long len,
+			       bool enable_wp, bool *mmap_changing);
 
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 2606409572b2..70cea2ff3960 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
 }
+
+int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
+			unsigned long len, bool enable_wp, bool *mmap_changing)
+{
+	struct vm_area_struct *dst_vma;
+	pgprot_t newprot;
+	int err;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(start + len <= start);
+
+	down_read(&dst_mm->mmap_sem);
+
+	/*
+	 * If memory mappings are changing because of non-cooperative
+	 * operation (e.g. mremap) running in parallel, bail out and
+	 * request the user to retry later
+	 */
+	err = -EAGAIN;
+	if (mmap_changing && READ_ONCE(*mmap_changing))
+		goto out_unlock;
+
+	err = -ENOENT;
+	dst_vma = vma_find_uffd(dst_mm, start, len);
+	/*
+	 * Make sure the vma is not shared, that the dst range is
+	 * both valid and fully within a single existing vma.
+	 */
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out_unlock;
+	if (!userfaultfd_wp(dst_vma))
+		goto out_unlock;
+	if (!vma_is_anonymous(dst_vma))
+		goto out_unlock;
+
+	if (enable_wp)
+		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
+	else
+		newprot = vm_get_page_prot(dst_vma->vm_flags);
+
+	change_protection(dst_vma, start, start + len, newprot,
+			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+
+	err = 0;
+out_unlock:
+	up_read(&dst_mm->mmap_sem);
+	return err;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 21/28] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (19 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 20/28] userfaultfd: wp: support write protection for userfault vma range Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 22/28] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

v1: From: Shaohua Li <shli@fb.com>

v2: cleanups, remove a branch.

[peterx writes up the commit message, as below...]

This patch introduces the new uffd-wp APIs for userspace.

Firstly, we'll allow UFFDIO_REGISTER with write protection tracking
using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this flag can
co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case
the userspace program can not only resolve missing page faults but
also track page data changes along the way.

Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do
page-level write protection tracking.  Note that the memory region
must be registered with UFFDIO_REGISTER_MODE_WP before that.
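
A minimal userspace sketch of the two steps (hypothetical, error
handling omitted):

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING |
			 UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);	/* start trapping writes */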

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: remove useless block, write commit message, check against
 VM_MAYWRITE rather than VM_WRITE when register]
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                 | 82 +++++++++++++++++++++++++-------
 include/uapi/linux/userfaultfd.h | 23 +++++++++
 2 files changed, 89 insertions(+), 16 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3092885c9d2c..81962d62520c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -304,8 +304,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	if (!pmd_present(_pmd))
 		goto out;
 
-	if (pmd_trans_huge(_pmd))
+	if (pmd_trans_huge(_pmd)) {
+		if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
+			ret = true;
 		goto out;
+	}
 
 	/*
 	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
@@ -318,6 +321,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 */
 	if (pte_none(*pte))
 		ret = true;
+	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
+		ret = true;
 	pte_unmap(pte);
 
 out:
@@ -1251,10 +1256,13 @@ static __always_inline int validate_range(struct mm_struct *mm,
 	return 0;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma)
+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+				     unsigned long vm_flags)
 {
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-		vma_is_shmem(vma);
+	/* FIXME: add WP support to hugetlbfs and shmem */
+	return vma_is_anonymous(vma) ||
+		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
+		 !(vm_flags & VM_UFFD_WP));
 }
 
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1286,15 +1294,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	vm_flags = 0;
 	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
 		vm_flags |= VM_UFFD_MISSING;
-	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
 		vm_flags |= VM_UFFD_WP;
-		/*
-		 * FIXME: remove the below error constraint by
-		 * implementing the wprotect tracking mode.
-		 */
-		ret = -EINVAL;
-		goto out;
-	}
 
 	ret = validate_range(mm, uffdio_register.range.start,
 			     uffdio_register.range.len);
@@ -1342,7 +1343,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
-		if (!vma_can_userfault(cur))
+		if (!vma_can_userfault(cur, vm_flags))
 			goto out_unlock;
 
 		/*
@@ -1370,6 +1371,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 			if (end & (vma_hpagesize - 1))
 				goto out_unlock;
 		}
+		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_MAYWRITE))
+			goto out_unlock;
 
 		/*
 		 * Check that this vma isn't already owned by a
@@ -1399,7 +1402,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma));
+		BUG_ON(!vma_can_userfault(vma, vm_flags));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
 		       vma->vm_userfaultfd_ctx.ctx != ctx);
 		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
@@ -1534,7 +1537,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * provides for more strict behavior to notice
 		 * unregistration errors.
 		 */
-		if (!vma_can_userfault(cur))
+		if (!vma_can_userfault(cur, cur->vm_flags))
 			goto out_unlock;
 
 		found = true;
@@ -1548,7 +1551,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma));
+		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
 
 		/*
 		 * Nothing to do: this vma is already registered into this
@@ -1761,6 +1764,50 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
+				    unsigned long arg)
+{
+	int ret;
+	struct uffdio_writeprotect uffdio_wp;
+	struct uffdio_writeprotect __user *user_uffdio_wp;
+	struct userfaultfd_wake_range range;
+
+	if (READ_ONCE(ctx->mmap_changing))
+		return -EAGAIN;
+
+	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
+
+	if (copy_from_user(&uffdio_wp, user_uffdio_wp,
+			   sizeof(struct uffdio_writeprotect)))
+		return -EFAULT;
+
+	ret = validate_range(ctx->mm, uffdio_wp.range.start,
+			     uffdio_wp.range.len);
+	if (ret)
+		return ret;
+
+	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
+			       UFFDIO_WRITEPROTECT_MODE_WP))
+		return -EINVAL;
+	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
+	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+		return -EINVAL;
+
+	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
+				  uffdio_wp.range.len, uffdio_wp.mode &
+				  UFFDIO_WRITEPROTECT_MODE_WP,
+				  &ctx->mmap_changing);
+	if (ret)
+		return ret;
+
+	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+		range.start = uffdio_wp.range.start;
+		range.len = uffdio_wp.range.len;
+		wake_userfault(ctx, &range);
+	}
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -1838,6 +1885,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_ZEROPAGE:
 		ret = userfaultfd_zeropage(ctx, arg);
 		break;
+	case UFFDIO_WRITEPROTECT:
+		ret = userfaultfd_writeprotect(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 340f23bc251d..95c4a160e5f8 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -52,6 +52,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_WRITEPROTECT		(0x06)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -68,6 +69,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
+				      struct uffdio_writeprotect)
 
 /* read() structure */
 struct uffd_msg {
@@ -232,4 +235,24 @@ struct uffdio_zeropage {
 	__s64 zeropage;
 };
 
+struct uffdio_writeprotect {
+	struct uffdio_range range;
+/*
+ * UFFDIO_WRITEPROTECT_MODE_WP: set the flag to write protect a range,
+ * unset the flag to undo protection of a range which was previously
+ * write protected.
+ *
+ * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
+ * any wait thread after the operation succeeds.
+ *
+ * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
+ * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
+ * protection (WP=0) in response to a page fault wakes the faulting
+ * task unless DONTWAKE is set.
+ */
+#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
+#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+	__u64 mode;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 22/28] userfaultfd: wp: enabled write protection in userfaultfd API
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (20 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 21/28] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-22 21:37   ` Mike Rapoport
  2019-03-20  2:06 ` [PATCH v3 23/28] userfaultfd: wp: don't wake up when doing write protect Peter Xu
                   ` (6 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Pavel Emelyanov, Rik van Riel

From: Shaohua Li <shli@fb.com>

Now it's safe to enable write protection in the userfaultfd API.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 95c4a160e5f8..e7e98bde221f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
  * means the userland is reading).
  */
 #define UFFD_API ((__u64)0xAA)
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |		\
+#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |	\
+			   UFFD_FEATURE_EVENT_FORK |		\
 			   UFFD_FEATURE_EVENT_REMAP |		\
 			   UFFD_FEATURE_EVENT_REMOVE |	\
 			   UFFD_FEATURE_EVENT_UNMAP |		\
@@ -34,7 +35,8 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 23/28] userfaultfd: wp: don't wake up when doing write protect
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (21 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 22/28] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 24/28] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

It does not make sense to try to wake up any waiting thread when we're
write-protecting a memory region.  Only wake up threads when we're
resolving a write-protected page fault.
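
A hypothetical sketch of the resulting semantics as seen from
userspace:

	/* WP=1: pure protection change, never wakes anyone */
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/* WP=0: resolving the fault wakes the waiter, unless DONTWAKE
	 * is set, in which case an explicit UFFDIO_WAKE is needed */
	wp.mode = UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	/* ...do bookkeeping for the write... */
	ioctl(uffd, UFFDIO_WAKE, &wp.range);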

Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 81962d62520c..f1f61a0278c2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect uffdio_wp;
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
+	bool mode_wp, mode_dontwake;
 
 	if (READ_ONCE(ctx->mmap_changing))
 		return -EAGAIN;
@@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP))
 		return -EINVAL;
-	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
-	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+
+	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
+	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
+
+	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
 	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
-				  uffdio_wp.range.len, uffdio_wp.mode &
-				  UFFDIO_WRITEPROTECT_MODE_WP,
+				  uffdio_wp.range.len, mode_wp,
 				  &ctx->mmap_changing);
 	if (ret)
 		return ret;
 
-	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+	if (!mode_wp && !mode_dontwake) {
 		range.start = uffdio_wp.range.start;
 		range.len = uffdio_wp.range.len;
 		wake_userfault(ctx, &range);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 24/28] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (22 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 23/28] userfaultfd: wp: don't wake up when doing write protect Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-22 21:46   ` Mike Rapoport
  2019-03-20  2:06 ` [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range Peter Xu
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Martin Cracauer <cracauer@cons.org>

Adds documentation about the write protection support.

Signed-off-by: Martin Cracauer <cracauer@cons.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: rewrite in rst format; fixups here and there]
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/admin-guide/mm/userfaultfd.rst | 51 ++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 5048cf661a8a..c30176e67900 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
 half copied page since it'll keep userfaulting until the copy has
 finished.
 
+Notes:
+
+- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
+  you must provide some kind of page in your thread after reading from
+  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
+  The normal behavior of the OS automatically providing a zero page on
+  an anonymous mapping is not in place.
+
+- None of the page-delivering ioctls default to the range that you
+  registered with.  You must fill in all fields for the appropriate
+  ioctl struct including the range.
+
+- You get the address of the access that triggered the missing page
+  event out of a struct uffd_msg that you read in the thread from the
+  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
+  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
+  the first of any of those IOCTLs wakes up the faulting thread.
+
+- Be sure to test for all errors including (pollfd[0].revents &
+  POLLERR).  This can happen, e.g. when ranges supplied were
+  incorrect.
+
+Write Protect Notifications
+---------------------------
+
+This is equivalent to (but faster than) using mprotect and a SIGSEGV
+signal handler.
+
+Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
+Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
+struct *uffdio_writeprotect) with mode = UFFDIO_WRITEPROTECT_MODE_WP
+in the struct passed in.  The range does not default to and does not
+have to be identical to the range you registered with.  You can write
+protect as many ranges as you like (inside the registered range).
+Then, in the thread reading from uffd the struct will have
+msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set.  Now you send
+ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again,
+this time with the mode field not having UFFDIO_WRITEPROTECT_MODE_WP
+set.  This wakes up the thread, which will continue to run with
+writes allowed.  This lets you do the bookkeeping about the write in
+the uffd reading thread before the ioctl.
+
+If you registered with both UFFDIO_REGISTER_MODE_MISSING and
+UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
+which you supply a page and undo write protect.  Note that there is a
+difference between writes into a WP area and into a !WP area.  The
+former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
+UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
+you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
+used.
+
 QEMU/KVM
 ========
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (23 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 24/28] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-04-18 21:01   ` Jerome Glisse
  2019-03-20  2:06 ` [PATCH v3 26/28] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally Peter Xu
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

In change_pte_range() we do nothing for uffd if the PTE is a swap
entry.  That can lead to a data mismatch if the page that we are going
to write protect is swapped out when the UFFDIO_WRITEPROTECT is sent.
This patch applies/removes the uffd-wp bit even for swap entries.

Signed-off-by: Peter Xu <peterx@redhat.com>
---

I kept this patch standalone mainly to make review easier.  The patch
can be considered standalone, or it can be squashed into the patch
"userfaultfd: wp: support swap and page migration".
---
 mm/mprotect.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 96c0f521099d..a23e03053787 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -183,11 +183,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			}
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
 			pages++;
-		} else if (IS_ENABLED(CONFIG_MIGRATION)) {
+		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
+			pte_t newpte;
 
 			if (is_write_migration_entry(entry)) {
-				pte_t newpte;
 				/*
 				 * A protection check is difficult so
 				 * just be safe and disable write
@@ -198,22 +198,24 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					newpte = pte_swp_mksoft_dirty(newpte);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
-				set_pte_at(mm, addr, pte, newpte);
-
-				pages++;
-			}
-
-			if (is_write_device_private_entry(entry)) {
-				pte_t newpte;
-
+			} else if (is_write_device_private_entry(entry)) {
 				/*
 				 * We do not preserve soft-dirtiness. See
 				 * copy_one_pte() for explanation.
 				 */
 				make_device_private_entry_read(&entry);
 				newpte = swp_entry_to_pte(entry);
-				set_pte_at(mm, addr, pte, newpte);
+			} else {
+				newpte = oldpte;
+			}
 
+			if (uffd_wp)
+				newpte = pte_swp_mkuffd_wp(newpte);
+			else if (uffd_wp_resolve)
+				newpte = pte_swp_clear_uffd_wp(newpte);
+
+			if (!pte_same(oldpte, newpte)) {
+				set_pte_at(mm, addr, pte, newpte);
 				pages++;
 			}
 		}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 26/28] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (24 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-22 21:43   ` Mike Rapoport
  2019-03-20  2:06 ` [PATCH v3 27/28] userfaultfd: selftests: refactor statistics Peter Xu
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Only declare _UFFDIO_WRITEPROTECT if the user specified
UFFDIO_REGISTER_MODE_WP and all the checks passed.  Then when the user
registers regions backed by shmem/hugetlbfs we won't expose the new
ioctl to them.  Even for a completely anonymous memory range, we'll
only expose the new WP ioctl bit if the register mode has MODE_WP.
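
From userspace the effect can be observed in the ioctls bitmap that
UFFDIO_REGISTER returns (hypothetical snippet):

	ioctl(uffd, UFFDIO_REGISTER, &reg);
	if (!(reg.ioctls & ((__u64)1 << _UFFDIO_WRITEPROTECT)))
		/* the kernel won't accept UFFDIO_WRITEPROTECT here */
		fprintf(stderr, "uffd-wp not supported on this range\n");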

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f1f61a0278c2..7f87e9e4fb9b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1456,14 +1456,24 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	up_write(&mm->mmap_sem);
 	mmput(mm);
 	if (!ret) {
+		__u64 ioctls_out;
+
+		ioctls_out = basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
+		    UFFD_API_RANGE_IOCTLS;
+
+		/*
+		 * Declare the WP ioctl only if the WP mode is
+		 * specified and all checks passed with the range
+		 */
+		if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
+			ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);
+
 		/*
 		 * Now that we scanned all vmas we can already tell
 		 * userland which ioctls methods are guaranteed to
 		 * succeed on this range.
 		 */
-		if (put_user(basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
-			     UFFD_API_RANGE_IOCTLS,
-			     &user_uffdio_register->ioctls))
+		if (put_user(ioctls_out, &user_uffdio_register->ioctls))
 			ret = -EFAULT;
 	}
 out:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 27/28] userfaultfd: selftests: refactor statistics
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (25 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 26/28] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-03-20  2:06 ` [PATCH v3 28/28] userfaultfd: selftests: add write-protect test Peter Xu
  2019-04-09  6:08 ` [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Introduce a uffd_stats structure for the selftest statistics.  At the
same time, refactor the code so that both the read()- and poll()-based
fault handling threads take a uffd_stats pointer, instead of using two
different ways to return the statistic results.  No functional change.

With the new structure, it's very easy to introduce new statistics.

Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 76 +++++++++++++++---------
 1 file changed, 49 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 5d1db824f73a..e5d12c209e09 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,6 +88,12 @@ static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
 static char *zeropage;
 pthread_attr_t attr;
 
+/* Userfaultfd test statistics */
+struct uffd_stats {
+	int cpu;
+	unsigned long missing_faults;
+};
+
 /* pthread_mutex_t starts at page offset 0 */
 #define area_mutex(___area, ___nr)					\
 	((pthread_mutex_t *) ((___area) + (___nr)*page_size))
@@ -127,6 +133,17 @@ static void usage(void)
 	exit(1);
 }
 
+static void uffd_stats_reset(struct uffd_stats *uffd_stats,
+			     unsigned long n_cpus)
+{
+	int i;
+
+	for (i = 0; i < n_cpus; i++) {
+		uffd_stats[i].cpu = i;
+		uffd_stats[i].missing_faults = 0;
+	}
+}
+
 static int anon_release_pages(char *rel_area)
 {
 	int ret = 0;
@@ -469,8 +486,8 @@ static int uffd_read_msg(int ufd, struct uffd_msg *msg)
 	return 0;
 }
 
-/* Return 1 if page fault handled by us; otherwise 0 */
-static int uffd_handle_page_fault(struct uffd_msg *msg)
+static void uffd_handle_page_fault(struct uffd_msg *msg,
+				   struct uffd_stats *stats)
 {
 	unsigned long offset;
 
@@ -485,18 +502,19 @@ static int uffd_handle_page_fault(struct uffd_msg *msg)
 	offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
 	offset &= ~(page_size-1);
 
-	return copy_page(uffd, offset);
+	if (copy_page(uffd, offset))
+		stats->missing_faults++;
 }
 
 static void *uffd_poll_thread(void *arg)
 {
-	unsigned long cpu = (unsigned long) arg;
+	struct uffd_stats *stats = (struct uffd_stats *)arg;
+	unsigned long cpu = stats->cpu;
 	struct pollfd pollfd[2];
 	struct uffd_msg msg;
 	struct uffdio_register uffd_reg;
 	int ret;
 	char tmp_chr;
-	unsigned long userfaults = 0;
 
 	pollfd[0].fd = uffd;
 	pollfd[0].events = POLLIN;
@@ -526,7 +544,7 @@ static void *uffd_poll_thread(void *arg)
 				msg.event), exit(1);
 			break;
 		case UFFD_EVENT_PAGEFAULT:
-			userfaults += uffd_handle_page_fault(&msg);
+			uffd_handle_page_fault(&msg, stats);
 			break;
 		case UFFD_EVENT_FORK:
 			close(uffd);
@@ -545,28 +563,27 @@ static void *uffd_poll_thread(void *arg)
 			break;
 		}
 	}
-	return (void *)userfaults;
+
+	return NULL;
 }
 
 pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER;
 
 static void *uffd_read_thread(void *arg)
 {
-	unsigned long *this_cpu_userfaults;
+	struct uffd_stats *stats = (struct uffd_stats *)arg;
 	struct uffd_msg msg;
 
-	this_cpu_userfaults = (unsigned long *) arg;
-	*this_cpu_userfaults = 0;
-
 	pthread_mutex_unlock(&uffd_read_mutex);
 	/* from here cancellation is ok */
 
 	for (;;) {
 		if (uffd_read_msg(uffd, &msg))
 			continue;
-		(*this_cpu_userfaults) += uffd_handle_page_fault(&msg);
+		uffd_handle_page_fault(&msg, stats);
 	}
-	return (void *)NULL;
+
+	return NULL;
 }
 
 static void *background_thread(void *arg)
@@ -582,13 +599,12 @@ static void *background_thread(void *arg)
 	return NULL;
 }
 
-static int stress(unsigned long *userfaults)
+static int stress(struct uffd_stats *uffd_stats)
 {
 	unsigned long cpu;
 	pthread_t locking_threads[nr_cpus];
 	pthread_t uffd_threads[nr_cpus];
 	pthread_t background_threads[nr_cpus];
-	void **_userfaults = (void **) userfaults;
 
 	finished = 0;
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
@@ -597,12 +613,13 @@ static int stress(unsigned long *userfaults)
 			return 1;
 		if (bounces & BOUNCE_POLL) {
 			if (pthread_create(&uffd_threads[cpu], &attr,
-					   uffd_poll_thread, (void *)cpu))
+					   uffd_poll_thread,
+					   (void *)&uffd_stats[cpu]))
 				return 1;
 		} else {
 			if (pthread_create(&uffd_threads[cpu], &attr,
 					   uffd_read_thread,
-					   &_userfaults[cpu]))
+					   (void *)&uffd_stats[cpu]))
 				return 1;
 			pthread_mutex_lock(&uffd_read_mutex);
 		}
@@ -639,7 +656,8 @@ static int stress(unsigned long *userfaults)
 				fprintf(stderr, "pipefd write error\n");
 				return 1;
 			}
-			if (pthread_join(uffd_threads[cpu], &_userfaults[cpu]))
+			if (pthread_join(uffd_threads[cpu],
+					 (void *)&uffd_stats[cpu]))
 				return 1;
 		} else {
 			if (pthread_cancel(uffd_threads[cpu]))
@@ -910,11 +928,11 @@ static int userfaultfd_events_test(void)
 {
 	struct uffdio_register uffdio_register;
 	unsigned long expected_ioctls;
-	unsigned long userfaults;
 	pthread_t uffd_mon;
 	int err, features;
 	pid_t pid;
 	char c;
+	struct uffd_stats stats = { 0 };
 
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
@@ -941,7 +959,7 @@ static int userfaultfd_events_test(void)
 			"unexpected missing ioctl for anon memory\n"),
 			exit(1);
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
 		perror("uffd_poll_thread create"), exit(1);
 
 	pid = fork();
@@ -957,13 +975,13 @@ static int userfaultfd_events_test(void)
 
 	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
 		perror("pipe write"), exit(1);
-	if (pthread_join(uffd_mon, (void **)&userfaults))
+	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
 	close(uffd);
-	printf("userfaults: %ld\n", userfaults);
+	printf("userfaults: %ld\n", stats.missing_faults);
 
-	return userfaults != nr_pages;
+	return stats.missing_faults != nr_pages;
 }
 
 static int userfaultfd_sig_test(void)
@@ -975,6 +993,7 @@ static int userfaultfd_sig_test(void)
 	int err, features;
 	pid_t pid;
 	char c;
+	struct uffd_stats stats = { 0 };
 
 	printf("testing signal delivery: ");
 	fflush(stdout);
@@ -1006,7 +1025,7 @@ static int userfaultfd_sig_test(void)
 	if (uffd_test_ops->release_pages(area_dst))
 		return 1;
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
 		perror("uffd_poll_thread create"), exit(1);
 
 	pid = fork();
@@ -1032,6 +1051,7 @@ static int userfaultfd_sig_test(void)
 	close(uffd);
 	return userfaults != 0;
 }
+
 static int userfaultfd_stress(void)
 {
 	void *area;
@@ -1040,7 +1060,7 @@ static int userfaultfd_stress(void)
 	struct uffdio_register uffdio_register;
 	unsigned long cpu;
 	int err;
-	unsigned long userfaults[nr_cpus];
+	struct uffd_stats uffd_stats[nr_cpus];
 
 	uffd_test_ops->allocate_area((void **)&area_src);
 	if (!area_src)
@@ -1169,8 +1189,10 @@ static int userfaultfd_stress(void)
 		if (uffd_test_ops->release_pages(area_dst))
 			return 1;
 
+		uffd_stats_reset(uffd_stats, nr_cpus);
+
 		/* bounce pass */
-		if (stress(userfaults))
+		if (stress(uffd_stats))
 			return 1;
 
 		/* unregister */
@@ -1213,7 +1235,7 @@ static int userfaultfd_stress(void)
 
 		printf("userfaults:");
 		for (cpu = 0; cpu < nr_cpus; cpu++)
-			printf(" %lu", userfaults[cpu]);
+			printf(" %lu", uffd_stats[cpu].missing_faults);
 		printf("\n");
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 28/28] userfaultfd: selftests: add write-protect test
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (26 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 27/28] userfaultfd: selftests: refactor statistics Peter Xu
@ 2019-03-20  2:06 ` Peter Xu
  2019-04-09  6:08 ` [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
  28 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-03-20  2:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This patch adds uffd tests for write protection.

Instead of introducing new tests for it, let's simply squash uffd-wp
tests into the existing uffd-missing test cases.  Changes are:

(1) Bouncing tests

  We do the write-protection in two ways during the bouncing test:

  - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: this
    makes sure that for each bounce process every single page will
    fault at least twice: once for MISSING, once for WP.

  - By calling UFFDIO_WRITEPROTECT directly on already-faulted memory:
    to further torture the explicit page protection procedures of
    uffd-wp, we split each bounce procedure into two halves (in the
    background thread): the first half will be MISSING+WP for each
    page as explained above.  After the first half, we write protect
    the faulted region in the background thread so that at least half
    of the pages will be write protected again; this exercises the new
    UFFDIO_WRITEPROTECT call.  Then we continue with the 2nd half,
    which will see both MISSING and WP faults for its own pages plus
    WP-only faults on pages from the 1st half.

(2) Event/Signal test

  Mostly the same as the previous tests, but now doing MISSING+WP for
  each page.  For the sigbus-mode test we'll need to provide a
  standalone path to handle the write protection faults.

For all tests, statistics are gathered for the uffd-wp pages as well;
a minimal sketch of the userspace resolution flow follows.
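
For illustration only (a sketch, not part of the patch; "src_page" is
a hypothetical page-sized source buffer and error handling is
omitted), this is roughly how a handler resolves the two fault types
with the uffd-wp API:

	unsigned long addr = msg.arg.pagefault.address & ~(page_size - 1);

	if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
		/* wp fault: remove the protection, waking the faulter */
		struct uffdio_writeprotect wp = { 0 };

		wp.range.start = addr;
		wp.range.len = page_size;
		wp.mode = 0;	/* clear WP; wakeup is implied */
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	} else {
		/* missing fault: copy the page in, still write-protected */
		struct uffdio_copy copy = { 0 };

		copy.dst = addr;
		copy.src = (unsigned long) src_page;
		copy.len = page_size;
		copy.mode = UFFDIO_COPY_MODE_WP;
		ioctl(uffd, UFFDIO_COPY, &copy);
	}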

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 157 +++++++++++++++++++----
 1 file changed, 133 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index e5d12c209e09..bf1e10db72f5 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -56,6 +56,7 @@
 #include <linux/userfaultfd.h>
 #include <setjmp.h>
 #include <stdbool.h>
+#include <assert.h>
 
 #include "../kselftest.h"
 
@@ -78,6 +79,8 @@ static int test_type;
 #define ALARM_INTERVAL_SECS 10
 static volatile bool test_uffdio_copy_eexist = true;
 static volatile bool test_uffdio_zeropage_eexist = true;
+/* Whether to test uffd write-protection */
+static bool test_uffdio_wp = false;
 
 static bool map_shared;
 static int huge_fd;
@@ -92,6 +95,7 @@ pthread_attr_t attr;
 struct uffd_stats {
 	int cpu;
 	unsigned long missing_faults;
+	unsigned long wp_faults;
 };
 
 /* pthread_mutex_t starts at page offset 0 */
@@ -141,9 +145,29 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
 	for (i = 0; i < n_cpus; i++) {
 		uffd_stats[i].cpu = i;
 		uffd_stats[i].missing_faults = 0;
+		uffd_stats[i].wp_faults = 0;
 	}
 }
 
+static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
+{
+	int i;
+	unsigned long long miss_total = 0, wp_total = 0;
+
+	for (i = 0; i < n_cpus; i++) {
+		miss_total += stats[i].missing_faults;
+		wp_total += stats[i].wp_faults;
+	}
+
+	printf("userfaults: %llu missing (", miss_total);
+	for (i = 0; i < n_cpus; i++)
+		printf("%lu+", stats[i].missing_faults);
+	printf("\b), %llu wp (", wp_total);
+	for (i = 0; i < n_cpus; i++)
+		printf("%lu+", stats[i].wp_faults);
+	printf("\b)\n");
+}
+
 static int anon_release_pages(char *rel_area)
 {
 	int ret = 0;
@@ -264,10 +288,15 @@ struct uffd_test_ops {
 	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
 };
 
-#define ANON_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
+#define SHMEM_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
 					 (1 << _UFFDIO_COPY) | \
 					 (1 << _UFFDIO_ZEROPAGE))
 
+#define ANON_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
+					 (1 << _UFFDIO_COPY) | \
+					 (1 << _UFFDIO_ZEROPAGE) | \
+					 (1 << _UFFDIO_WRITEPROTECT))
+
 static struct uffd_test_ops anon_uffd_test_ops = {
 	.expected_ioctls = ANON_EXPECTED_IOCTLS,
 	.allocate_area	= anon_allocate_area,
@@ -276,7 +305,7 @@ static struct uffd_test_ops anon_uffd_test_ops = {
 };
 
 static struct uffd_test_ops shmem_uffd_test_ops = {
-	.expected_ioctls = ANON_EXPECTED_IOCTLS,
+	.expected_ioctls = SHMEM_EXPECTED_IOCTLS,
 	.allocate_area	= shmem_allocate_area,
 	.release_pages	= shmem_release_pages,
 	.alias_mapping = noop_alias_mapping,
@@ -300,6 +329,21 @@ static int my_bcmp(char *str1, char *str2, size_t n)
 	return 0;
 }
 
+static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
+{
+	struct uffdio_writeprotect prms = { 0 };
+
+	/* Write protection page faults */
+	prms.range.start = start;
+	prms.range.len = len;
+	/* Undo write-protect, do wakeup after that */
+	prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
+
+	if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms))
+		fprintf(stderr, "clear WP failed for address 0x%Lx\n",
+			start), exit(1);
+}
+
 static void *locking_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
@@ -438,7 +482,10 @@ static int __copy_page(int ufd, unsigned long offset, bool retry)
 	uffdio_copy.dst = (unsigned long) area_dst + offset;
 	uffdio_copy.src = (unsigned long) area_src + offset;
 	uffdio_copy.len = page_size;
-	uffdio_copy.mode = 0;
+	if (test_uffdio_wp)
+		uffdio_copy.mode = UFFDIO_COPY_MODE_WP;
+	else
+		uffdio_copy.mode = 0;
 	uffdio_copy.copy = 0;
 	if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
@@ -495,15 +542,21 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 		fprintf(stderr, "unexpected msg event %u\n",
 			msg->event), exit(1);
 
-	if (bounces & BOUNCE_VERIFY &&
-	    msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
-		fprintf(stderr, "unexpected write fault\n"), exit(1);
+	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
+		wp_range(uffd, msg->arg.pagefault.address, page_size, false);
+		stats->wp_faults++;
+	} else {
+		/* Missing page faults */
+		if (bounces & BOUNCE_VERIFY &&
+		    msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+			fprintf(stderr, "unexpected write fault\n"), exit(1);
 
-	offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
-	offset &= ~(page_size-1);
+		offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+		offset &= ~(page_size-1);
 
-	if (copy_page(uffd, offset))
-		stats->missing_faults++;
+		if (copy_page(uffd, offset))
+			stats->missing_faults++;
+	}
 }
 
 static void *uffd_poll_thread(void *arg)
@@ -589,11 +642,30 @@ static void *uffd_read_thread(void *arg)
 static void *background_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
-	unsigned long page_nr;
+	unsigned long page_nr, start_nr, mid_nr, end_nr;
+
+	start_nr = cpu * nr_pages_per_cpu;
+	end_nr = (cpu+1) * nr_pages_per_cpu;
+	mid_nr = (start_nr + end_nr) / 2;
+
+	/* Copy the first half of the pages */
+	for (page_nr = start_nr; page_nr < mid_nr; page_nr++)
+		copy_page_retry(uffd, page_nr * page_size);
 
-	for (page_nr = cpu * nr_pages_per_cpu;
-	     page_nr < (cpu+1) * nr_pages_per_cpu;
-	     page_nr++)
+	/*
+	 * If we need to test uffd-wp, set it up now.  Then we'll have
+	 * at least the first half of the pages mapped already which
+	 * can be write-protected for testing
+	 */
+	if (test_uffdio_wp)
+		wp_range(uffd, (unsigned long)area_dst + start_nr * page_size,
+			nr_pages_per_cpu * page_size, true);
+
+	/*
+	 * Continue the 2nd half of the page copying, handling write
+	 * protection faults if any
+	 */
+	for (page_nr = mid_nr; page_nr < end_nr; page_nr++)
 		copy_page_retry(uffd, page_nr * page_size);
 
 	return NULL;
@@ -755,17 +827,31 @@ static int faulting_process(int signal_test)
 	}
 
 	for (nr = 0; nr < split_nr_pages; nr++) {
+		int steps = 1;
+		unsigned long offset = nr * page_size;
+
 		if (signal_test) {
 			if (sigsetjmp(*sigbuf, 1) != 0) {
-				if (nr == lastnr) {
+				if (steps == 1 && nr == lastnr) {
 					fprintf(stderr, "Signal repeated\n");
 					return 1;
 				}
 
 				lastnr = nr;
 				if (signal_test == 1) {
-					if (copy_page(uffd, nr * page_size))
-						signalled++;
+					if (steps == 1) {
+						/* This is a MISSING request */
+						steps++;
+						if (copy_page(uffd, offset))
+							signalled++;
+					} else {
+						/* This is a WP request */
+						assert(steps == 2);
+						wp_range(uffd,
+							 (__u64)area_dst +
+							 offset,
+							 page_size, false);
+					}
 				} else {
 					signalled++;
 					continue;
@@ -778,8 +864,13 @@ static int faulting_process(int signal_test)
 			fprintf(stderr,
 				"nr %lu memory corruption %Lu %Lu\n",
 				nr, count,
-				count_verify[nr]), exit(1);
-		}
+				count_verify[nr]);
+		}
+		/*
+		 * Trigger the write protection fault, if there is
+		 * one, by writing the same value back.
+		 */
+		*area_count(area_dst, nr) = count;
 	}
 
 	if (signal_test)
@@ -801,6 +892,11 @@ static int faulting_process(int signal_test)
 				nr, count,
 				count_verify[nr]), exit(1);
 		}
+		/*
+		 * Trigger the write protection fault, if there is
+		 * one, by writing the same value back.
+		 */
+		*area_count(area_dst, nr) = count;
 	}
 
 	if (uffd_test_ops->release_pages(area_dst))
@@ -904,6 +1000,8 @@ static int userfaultfd_zeropage_test(void)
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (test_uffdio_wp)
+		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		fprintf(stderr, "register failure\n"), exit(1);
 
@@ -949,6 +1047,8 @@ static int userfaultfd_events_test(void)
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (test_uffdio_wp)
+		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		fprintf(stderr, "register failure\n"), exit(1);
 
@@ -979,7 +1079,8 @@ static int userfaultfd_events_test(void)
 		return 1;
 
 	close(uffd);
-	printf("userfaults: %ld\n", stats.missing_faults);
+
+	uffd_stats_report(&stats, 1);
 
 	return stats.missing_faults != nr_pages;
 }
@@ -1009,6 +1110,8 @@ static int userfaultfd_sig_test(void)
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (test_uffdio_wp)
+		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		fprintf(stderr, "register failure\n"), exit(1);
 
@@ -1141,6 +1244,8 @@ static int userfaultfd_stress(void)
 		uffdio_register.range.start = (unsigned long) area_dst;
 		uffdio_register.range.len = nr_pages * page_size;
 		uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+		if (test_uffdio_wp)
+			uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
 			fprintf(stderr, "register failure\n");
 			return 1;
@@ -1195,6 +1300,11 @@ static int userfaultfd_stress(void)
 		if (stress(uffd_stats))
 			return 1;
 
+		/* Clear all the write protections if there is any */
+		if (test_uffdio_wp)
+			wp_range(uffd, (unsigned long)area_dst,
+				 nr_pages * page_size, false);
+
 		/* unregister */
 		if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) {
 			fprintf(stderr, "unregister failure\n");
@@ -1233,10 +1343,7 @@ static int userfaultfd_stress(void)
 		area_src_alias = area_dst_alias;
 		area_dst_alias = tmp_area;
 
-		printf("userfaults:");
-		for (cpu = 0; cpu < nr_cpus; cpu++)
-			printf(" %lu", uffd_stats[cpu].missing_faults);
-		printf("\n");
+		uffd_stats_report(uffd_stats, nr_cpus);
 	}
 
 	if (err)
@@ -1276,6 +1383,8 @@ static void set_test_type(const char *type)
 	if (!strcmp(type, "anon")) {
 		test_type = TEST_ANON;
 		uffd_test_ops = &anon_uffd_test_ops;
+		/* Only enable write-protect test for anonymous test */
+		test_uffdio_wp = true;
 	} else if (!strcmp(type, "hugetlb")) {
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 22/28] userfaultfd: wp: enabled write protection in userfaultfd API
  2019-03-20  2:06 ` [PATCH v3 22/28] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
@ 2019-03-22 21:37   ` Mike Rapoport
  0 siblings, 0 replies; 51+ messages in thread
From: Mike Rapoport @ 2019-03-22 21:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Marty McFadden, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert, Pavel Emelyanov,
	Rik van Riel

On Wed, Mar 20, 2019 at 10:06:36AM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Now it's safe to enable write protection in userfaultfd API
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Jerome Glisse <jglisse@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  include/uapi/linux/userfaultfd.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 95c4a160e5f8..e7e98bde221f 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -19,7 +19,8 @@
>   * means the userland is reading).
>   */
>  #define UFFD_API ((__u64)0xAA)
> -#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |		\
> +#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |	\
> +			   UFFD_FEATURE_EVENT_FORK |		\
>  			   UFFD_FEATURE_EVENT_REMAP |		\
>  			   UFFD_FEATURE_EVENT_REMOVE |	\
>  			   UFFD_FEATURE_EVENT_UNMAP |		\
> @@ -34,7 +35,8 @@
>  #define UFFD_API_RANGE_IOCTLS			\
>  	((__u64)1 << _UFFDIO_WAKE |		\
>  	 (__u64)1 << _UFFDIO_COPY |		\
> -	 (__u64)1 << _UFFDIO_ZEROPAGE)
> +	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
> +	 (__u64)1 << _UFFDIO_WRITEPROTECT)
>  #define UFFD_API_RANGE_IOCTLS_BASIC		\
>  	((__u64)1 << _UFFDIO_WAKE |		\
>  	 (__u64)1 << _UFFDIO_COPY)
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 26/28] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally
  2019-03-20  2:06 ` [PATCH v3 26/28] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally Peter Xu
@ 2019-03-22 21:43   ` Mike Rapoport
  0 siblings, 0 replies; 51+ messages in thread
From: Mike Rapoport @ 2019-03-22 21:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Marty McFadden, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Wed, Mar 20, 2019 at 10:06:40AM +0800, Peter Xu wrote:
> Only declare _UFFDIO_WRITEPROTECT if the user specified
> UFFDIO_REGISTER_MODE_WP and if all the checks passed.  Then when the
> user registers regions with shmem/hugetlbfs we won't expose the new
> ioctl to them.  Even with a complete anonymous memory range, we'll only
> expose the new WP ioctl bit if the register mode has MODE_WP.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
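
As a usage note (an illustrative sketch only, not from the patch;
"addr" and "len" are assumed): after UFFDIO_REGISTER, userspace can
test the returned ioctls mask to learn whether the WP ioctl is
available on the range:

	struct uffdio_register reg = { 0 };

	reg.range.start = (unsigned long) addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;
	if (!(reg.ioctls & ((__u64)1 << _UFFDIO_WRITEPROTECT)))
		return -1;	/* wp not granted on this range */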

> ---
>  fs/userfaultfd.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index f1f61a0278c2..7f87e9e4fb9b 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1456,14 +1456,24 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	up_write(&mm->mmap_sem);
>  	mmput(mm);
>  	if (!ret) {
> +		__u64 ioctls_out;
> +
> +		ioctls_out = basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
> +		    UFFD_API_RANGE_IOCTLS;
> +
> +		/*
> +		 * Declare the WP ioctl only if the WP mode is
> +		 * specified and all checks passed with the range
> +		 */
> +		if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
> +			ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);
> +
>  		/*
>  		 * Now that we scanned all vmas we can already tell
>  		 * userland which ioctls methods are guaranteed to
>  		 * succeed on this range.
>  		 */
> -		if (put_user(basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
> -			     UFFD_API_RANGE_IOCTLS,
> -			     &user_uffdio_register->ioctls))
> +		if (put_user(ioctls_out, &user_uffdio_register->ioctls))
>  			ret = -EFAULT;
>  	}
>  out:
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 24/28] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-03-20  2:06 ` [PATCH v3 24/28] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
@ 2019-03-22 21:46   ` Mike Rapoport
  0 siblings, 0 replies; 51+ messages in thread
From: Mike Rapoport @ 2019-03-22 21:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Marty McFadden, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Wed, Mar 20, 2019 at 10:06:38AM +0800, Peter Xu wrote:
> From: Martin Cracauer <cracauer@cons.org>
> 
> Adds documentation about the write protection support.
> 
> Signed-off-by: Martin Cracauer <cracauer@cons.org>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx: rewrite in rst format; fixups here and there]
> Reviewed-by: Jerome Glisse <jglisse@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  Documentation/admin-guide/mm/userfaultfd.rst | 51 ++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> index 5048cf661a8a..c30176e67900 100644
> --- a/Documentation/admin-guide/mm/userfaultfd.rst
> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> @@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
>  half copied page since it'll keep userfaulting until the copy has
>  finished.
> 
> +Notes:
> +
> +- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
> +  you must provide some kind of page in your thread after reading from
> +  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
> +  The normal behavior of the OS automatically providing a zero page on
> +  an anonymous mmapping is not in place.
> +
> +- None of the page-delivering ioctls default to the range that you
> +  registered with.  You must fill in all fields for the appropriate
> +  ioctl struct including the range.
> +
> +- You get the address of the access that triggered the missing page
> +  event out of a struct uffd_msg that you read in the thread from the
> +  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
> +  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
> +  the first of any of those IOCTLs wakes up the faulting thread.
> +
> +- Be sure to test for all errors including (pollfd[0].revents &
> +  POLLERR).  This can happen, e.g. when ranges supplied were
> +  incorrect.
> +
> +Write Protect Notifications
> +---------------------------
> +
> +This is equivalent to (but faster than) using mprotect and a SIGSEGV
> +signal handler.
> +
> +Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
> +Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
> +struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
> +in the struct passed in.  The range does not default to and does not
> +have to be identical to the range you registered with.  You can write
> +protect as many ranges as you like (inside the registered range).
> +Then, in the thread reading from uffd the struct will have
> +msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
> +ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again
> +while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set.
> +This wakes up the thread which will continue to run with writes. This
> +allows you to do the bookkeeping about the write in the uffd reading
> +thread before the ioctl.
> +
> +If you registered with both UFFDIO_REGISTER_MODE_MISSING and
> +UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
> +which you supply a page and undo write protect.  Note that there is a
> +difference between writes into a WP area and into a !WP area.  The
> +former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
> +UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
> +you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
> +used.
> +
>  QEMU/KVM
>  ========
> 
> -- 
> 2.17.1
> 
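
For reference, the arming side described above looks roughly like this
(an illustrative sketch, not from the patch; "addr" and "len" are
assumed to fall inside a range registered with
UFFDIO_REGISTER_MODE_WP):

	struct uffdio_writeprotect wp = { 0 };

	wp.range.start = (unsigned long) addr;
	wp.range.len = len;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;	/* arm write protection */
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		perror("UFFDIO_WRITEPROTECT");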

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/28] userfaultfd: write protection support
  2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
                   ` (27 preceding siblings ...)
  2019-03-20  2:06 ` [PATCH v3 28/28] userfaultfd: selftests: add write-protect test Peter Xu
@ 2019-04-09  6:08 ` Peter Xu
  2019-04-18 21:07   ` Jerome Glisse
  28 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-04-09  6:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, Martin Cracauer, Shaohua Li,
	Andrea Arcangeli, Mike Kravetz, Denis Plotnikov, Mike Rapoport,
	Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Wed, Mar 20, 2019 at 10:06:14AM +0800, Peter Xu wrote:
> This series implements initial write protection support for
> userfaultfd.  Currently both shmem and hugetlbfs are not supported
> yet, but only anonymous memory.  This is the 3rd version of it.
> 
> The latest code can also be found at:
> 
>   https://github.com/xzpeter/linux/tree/uffd-wp-merged
> 
> Note again that the first 5 patches in the series can be seen as
> isolated work on page fault mechanism.  I would hope that they can be
> considered to be reviewed/picked even earlier than the rest of the
> series since it's even useful for existing userfaultfd MISSING case
> [8].

Ping - any further comments for v3?  Is there any chance to have this
series (or the first 5 patches) for 5.2?

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 07/28] userfaultfd: wp: hook userfault handler to write protection fault
  2019-03-20  2:06 ` [PATCH v3 07/28] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
@ 2019-04-18 20:03   ` Jerome Glisse
  0 siblings, 0 replies; 51+ messages in thread
From: Jerome Glisse @ 2019-04-18 20:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Wed, Mar 20, 2019 at 10:06:21AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> There are several cases where a write protection fault can happen. It
> could be a write to the zero page, to a swapped page, or to a userfault
> write-protected page. When the fault happens, there is no way to know
> whether userfaultfd write-protected the page before. Here we just
> blindly issue a userfault notification for any vma with VM_UFFD_WP,
> regardless of whether the app has write-protected it yet. The
> application should be ready to handle such wp faults.
> 
> v1: From: Shaohua Li <shli@fb.com>
> 
> v2: Handle the userfault in the common do_wp_page. If we get there a
> pagetable is present and readonly so no need to do further processing
> until we solve the userfault.
> 
> In the swapin case, always swapin as readonly. This will cause false
> positive userfaults. We need to decide later if to eliminate them with
> a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).
> 
> hugetlbfs wouldn't need to worry about swapouts, and tmpfs would
> be handled by a swap entry bit like anonymous memory.
> 
> The main problem with no easy solution to eliminate the false
> positives, will be if/when userfaultfd is extended to real filesystem
> pagecache. When the pagecache is freed by reclaim we can't leave the
> radix tree pinned if the inode and in turn the radix tree is reclaimed
> as well.
> 
> The estimation is that full accuracy and lack of false positives could
> be easily provided only to anonymous memory (as long as there's no
> fork or as long as MADV_DONTFORK is used on the userfaultfd anonymous
> range) tmpfs and hugetlbfs, it's most certainly worth to achieve it
> but in a later incremental patch.
> 
> v3: Add hooking point for THP wrprotect faults.
> 
> CC: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>


Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  mm/memory.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e11ca9dd823f..567686ec086d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2483,6 +2483,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  
> +	if (userfaultfd_wp(vma)) {
> +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> +		return handle_userfault(vmf, VM_UFFD_WP);
> +	}
> +
>  	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>  	if (!vmf->page) {
>  		/*
> @@ -3684,8 +3689,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  /* `inline' is required to avoid gcc 4.1.2 build error */
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
> -	if (vma_is_anonymous(vmf->vma))
> +	if (vma_is_anonymous(vmf->vma)) {
> +		if (userfaultfd_wp(vmf->vma))
> +			return handle_userfault(vmf, VM_UFFD_WP);
>  		return do_huge_pmd_wp_page(vmf, orig_pmd);
> +	}
>  	if (vmf->vma->vm_ops->huge_fault)
>  		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 04/28] mm: allow VM_FAULT_RETRY for multiple times
  2019-03-20  2:06 ` [PATCH v3 04/28] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
@ 2019-04-18 20:11   ` Jerome Glisse
  2019-04-19  6:00     ` Peter Xu
  0 siblings, 1 reply; 51+ messages in thread
From: Jerome Glisse @ 2019-04-18 20:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Wed, Mar 20, 2019 at 10:06:18AM +0800, Peter Xu wrote:
> The idea comes from a discussion between Linus and Andrea [1].
> 
> Before this patch we only allow a page fault to retry once.  We
> achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> handle_mm_fault() the second time.  This was majorly used to avoid
> unexpected starvation of the system by looping over forever to handle
> the page fault on a single page.  However that should hardly happen,
> and after all for each code path to return a VM_FAULT_RETRY we'll
> first wait for a condition (during which time we should possibly yield
> the cpu) to happen before VM_FAULT_RETRY is really returned.
> 
> This patch removes the restriction by keeping the
> FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
> that the page fault handler now can retry the page fault for multiple
> times if necessary without the need to generate another page fault
> event.  Meanwhile we still keep the FAULT_FLAG_TRIED flag so page
> fault handler can still identify whether a page fault is the first
> attempt or not.
> 
> Then we'll have these combinations of fault flags (only considering
> ALLOW_RETRY flag and TRIED flag):
> 
>   - ALLOW_RETRY and !TRIED:  this means the page fault allows to
>                              retry, and this is the first try
> 
>   - ALLOW_RETRY and TRIED:   this means the page fault allows to
>                              retry, and this is not the first try
> 
>   - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
>                              to retry at all
> 
>   - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
> 
> In existing code we have multiple places that have taken special care
> of the first condition above by checking against (fault_flags &
> FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to
> detect the first retry of a page fault by checking against
> both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flag &
> FAULT_FLAG_TRIED) because now even the 2nd try will have the
> ALLOW_RETRY set, then use that helper in all existing special paths.
> One example is in __lock_page_or_retry(), now we'll drop the mmap_sem
> only in the first attempt of page fault and we'll keep it in follow up
> retries, so old locking behavior will be retained.
> 
> This will be a nice enhancement for the current code [2] and at the
> same time supporting material for the future userfaultfd-writeprotect
> work,
> since in that work there will always be an explicit userfault
> writeprotect retry for protected pages, and if that cannot resolve the
> page fault (e.g., when userfaultfd-writeprotect is used in conjunction
> with swapped pages) then we'll possibly need a 3rd retry of the page
> fault.  It might also benefit other potential users who will have
> similar requirement like userfault write-protection.
> 
> GUP code is not touched yet and will be covered in follow up patch.
> 
> Please read the thread below for more information.
> 
> [1] https://lkml.org/lkml/2017/11/2/833
> [2] https://lkml.org/lkml/2018/12/30/64
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

A minor comment suggestion below but it can be fix in a followup patch.

[...]

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..f73dbc4a1957 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -336,16 +336,52 @@ extern unsigned int kobjsize(const void *objp);
>   */
>  extern pgprot_t protection_map[16];
>  
> +/*
> + * About FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED: we can specify whether we
> + * would allow page faults to retry by specifying these two fault flags
> + * correctly.  Currently there can be three legal combinations:
> + *
> + * (a) ALLOW_RETRY and !TRIED:  this means the page fault allows retry, and
> + *                              this is the first try
> + *
> + * (b) ALLOW_RETRY and TRIED:   this means the page fault allows retry, and
> + *                              we've already tried at least once
> + *
> + * (c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry
> + *
> + * The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should never
> + * be used.  Note that page faults can be allowed to retry for multiple times,
> + * in which case we'll have an initial fault with flags (a) then later on
> + * continuous faults with flags (b).  We should always try to detect pending
> + * signals before a retry to make sure the continuous page faults can still be
> + * interrupted if necessary.
> + */
> +
>  #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
>  #define FAULT_FLAG_MKWRITE	0x02	/* Fault was mkwrite of existing pte */
>  #define FAULT_FLAG_ALLOW_RETRY	0x04	/* Retry fault if blocking */
>  #define FAULT_FLAG_RETRY_NOWAIT	0x08	/* Don't drop mmap_sem and wait when retrying */
>  #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
> -#define FAULT_FLAG_TRIED	0x20	/* Second try */
> +#define FAULT_FLAG_TRIED	0x20	/* We've tried once */
>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
>  
> +/*
> + * Returns true if the page fault allows retry and this is the first
> + * attempt of the fault handling; false otherwise.  This is mostly
> + * used for places where we want to try to avoid taking the mmap_sem
> + * for too long a time when waiting for another condition to change,
> + * in which case we can try to be polite to release the mmap_sem in
> + * the first round to avoid potential starvation of other processes
> + * that would also want the mmap_sem.
> + */

You should be using kernel function documentation style above.

> +static inline bool fault_flag_allow_retry_first(unsigned int flags)
> +{
> +	return (flags & FAULT_FLAG_ALLOW_RETRY) &&
> +	    (!(flags & FAULT_FLAG_TRIED));
> +}
> +

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-03-20  2:06 ` [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
@ 2019-04-18 20:51   ` Jerome Glisse
  2019-04-19  6:26     ` Peter Xu
  0 siblings, 1 reply; 51+ messages in thread
From: Jerome Glisse @ 2019-04-18 20:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Wed, Mar 20, 2019 at 10:06:28AM +0800, Peter Xu wrote:
> This allows uffd-wp to support write-protected pages for COW.
> 
> For example, the uffd write-protected PTE could also be write-protected
> by other usages like COW or zero pages.  When that happens, we can't
> simply set the write bit in the PTE since otherwise it'll change the
> content of every single reference to the page.  Instead, we should do
> the COW first if necessary, then handle the uffd-wp fault.
> 
> To correctly copy the page, we'll also need to carry over the
> _PAGE_UFFD_WP bit if it was set in the original PTE.
> 
> For huge PMDs, we simply split them whenever we want to resolve an
> uffd-wp page fault.  That matches what we do with general huge PMD
> write protections.  In that way, we reduce the huge PMD copy-on-write
> problem to the PTE copy-on-write case.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

This one has a bug see below.


> ---
>  mm/memory.c   |  5 +++-
>  mm/mprotect.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 65 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e7a4b9650225..b8a4c0bab461 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2291,7 +2291,10 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  		}
>  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		if (pte_uffd_wp(vmf->orig_pte))
> +			entry = pte_mkuffd_wp(entry);
> +		else
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		/*
>  		 * Clear the pte entry and flush it first, before updating the
>  		 * pte with the new entry. This will avoid a race condition
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 9d4433044c21..855dddb07ff2 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -73,18 +73,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> +retry_pte:
>  		oldpte = *pte;
>  		if (pte_present(oldpte)) {
>  			pte_t ptent;
>  			bool preserve_write = prot_numa && pte_write(oldpte);
> +			struct page *page;
>  
>  			/*
>  			 * Avoid trapping faults against the zero or KSM
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				struct page *page;
> -
>  				page = vm_normal_page(vma, addr, oldpte);
>  				if (!page || PageKsm(page))
>  					continue;
> @@ -114,6 +114,54 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  					continue;
>  			}
>  
> +			/*
> +			 * Detect whether we'll need to COW before
> +			 * resolving an uffd-wp fault.  Note that this
> +			 * includes detection of the zero page (where
> +			 * page==NULL)
> +			 */
> +			if (uffd_wp_resolve) {
> +				/* If the fault is resolved already, skip */
> +				if (!pte_uffd_wp(*pte))
> +					continue;
> +				page = vm_normal_page(vma, addr, oldpte);
> +				if (!page || page_mapcount(page) > 1) {
> +					struct vm_fault vmf = {
> +						.vma = vma,
> +						.address = addr & PAGE_MASK,
> +						.page = page,
> +						.orig_pte = oldpte,
> +						.pmd = pmd,
> +						/* pte and ptl not needed */
> +					};
> +					vm_fault_t ret;
> +
> +					if (page)
> +						get_page(page);
> +					arch_leave_lazy_mmu_mode();
> +					pte_unmap_unlock(pte, ptl);
> +					ret = wp_page_copy(&vmf);
> +					/* PTE is changed, or OOM */
> +					if (ret == 0)
> +						/* It's done by others */
> +						continue;

This is wrong: if ret == 0 you still need to remap the pte before
continuing, as otherwise you will move on to the next pte without
holding the page table lock for the directory. So the ret == 0 case
must be handled after arch_enter_lazy_mmu_mode() below.

Sorry, I should have caught that in the previous review.


> +					else if (WARN_ON(ret != VM_FAULT_WRITE))
> +						return pages;
> +					pte = pte_offset_map_lock(vma->vm_mm,
> +								  pmd, addr,
> +								  &ptl);
> +					arch_enter_lazy_mmu_mode();
> +					if (!pte_present(*pte))
> +						/*
> +						 * This PTE could have been
> +						 * modified after COW
> +						 * before we have taken the
> +						 * lock; retry this PTE
> +						 */
> +						goto retry_pte;
> +				}
> +			}
> +
>  			ptent = ptep_modify_prot_start(mm, addr, pte);
>  			ptent = pte_modify(ptent, newprot);
>  			if (preserve_write)

>  	unsigned long pages = 0;
>  	unsigned long nr_huge_updates = 0;
>  	struct mmu_notifier_range range;
> +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>  
>  	range.start = 0;
>  
> @@ -202,7 +251,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		}
>  
>  		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> -			if (next - addr != HPAGE_PMD_SIZE) {
> +			/*
> +			 * When resolving an userfaultfd write
> +			 * protection fault, it's not easy to identify
> +			 * whether a THP is shared with others and
> +			 * whether we'll need to do copy-on-write, so
> +			 * just split it always for now to simplify the
> +			 * procedure.  And that's the policy too for
> +			 * general THP write-protect in af9e4d5f2de2.
> +			 */
> +			if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {

Just a nitpick: can you please add () around next - addr, i.e.:
if ((next - addr) != HPAGE_PMD_SIZE || uffd_wp_resolve) {

I know it is not needed, but each time I bump into this I
have to scratch my head for a second to remember the operator
precedence rules :)

>  				__split_huge_pmd(vma, pmd, addr, false, NULL);
>  			} else {
>  				int nr_ptes = change_huge_pmd(vma, pmd, addr,
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 17/28] userfaultfd: wp: support swap and page migration
  2019-03-20  2:06 ` [PATCH v3 17/28] userfaultfd: wp: support swap and page migration Peter Xu
@ 2019-04-18 20:59   ` Jerome Glisse
  2019-04-19  7:42     ` Peter Xu
  0 siblings, 1 reply; 51+ messages in thread
From: Jerome Glisse @ 2019-04-18 20:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Wed, Mar 20, 2019 at 10:06:31AM +0800, Peter Xu wrote:
> For both swap and page migration, we use bit 2 of the entry to
> identify whether this entry is uffd write-protected.  It plays a similar
> role as the existing soft dirty bit in swap entries but only for keeping
> the uffd-wp tracking for a specific PTE/PMD.
> 
> Something special here is that when we want to recover the uffd-wp bit
> from a swap/migration entry to the PTE bit we'll also need to take care
> of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> _PAGE_UFFD_WP bit we can't trap it at all.
> 
> Note that this patch removed two lines from "userfaultfd: wp: hook
> userfault handler to write protection fault" where we try to remove the
> VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> patch will still keep the write flag there.
> 
> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Some missing thing see below.

[...]

> diff --git a/mm/memory.c b/mm/memory.c
> index 6405d56debee..c3d57fa890f2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  				pte = swp_entry_to_pte(entry);
>  				if (pte_swp_soft_dirty(*src_pte))
>  					pte = pte_swp_mksoft_dirty(pte);
> +				if (pte_swp_uffd_wp(*src_pte))
> +					pte = pte_swp_mkuffd_wp(pte);
>  				set_pte_at(src_mm, addr, src_pte, pte);
>  			}
>  		} else if (is_device_private_entry(entry)) {

You need to handle the is_device_private_entry() case like the
migration case too; e.g. something along these lines (an untested
sketch, mirroring the swap-entry handling above):
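
	} else if (is_device_private_entry(entry)) {
		...
		/* carry over the uffd-wp bit, as in the swap case above */
		if (pte_swp_uffd_wp(*src_pte))
			pte = pte_swp_mkuffd_wp(pte);
		set_pte_at(src_mm, addr, src_pte, pte);
	}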



> @@ -2825,6 +2827,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	flush_icache_page(vma, page);
>  	if (pte_swp_soft_dirty(vmf->orig_pte))
>  		pte = pte_mksoft_dirty(pte);
> +	if (pte_swp_uffd_wp(vmf->orig_pte)) {
> +		pte = pte_mkuffd_wp(pte);
> +		pte = pte_wrprotect(pte);
> +	}
>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>  	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>  	vmf->orig_pte = pte;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 181f5d2718a9..72cde187d4a1 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -241,6 +241,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		entry = pte_to_swp_entry(*pvmw.pte);
>  		if (is_write_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
> +		else if (pte_swp_uffd_wp(*pvmw.pte))
> +			pte = pte_mkuffd_wp(pte);
>  
>  		if (unlikely(is_zone_device_page(new))) {
>  			if (is_device_private_page(new)) {

You need to handle the is_device_private_page() case too, i.e. mark
its swap entry as uffd-wp as well.

> @@ -2301,6 +2303,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_soft_dirty(pte))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			if (pte_uffd_wp(pte))
> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
>  			set_pte_at(mm, addr, ptep, swp_pte);
>
>  			/*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 855dddb07ff2..96c0f521099d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -196,6 +196,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				newpte = swp_entry_to_pte(entry);
>  				if (pte_swp_soft_dirty(oldpte))
>  					newpte = pte_swp_mksoft_dirty(newpte);
> +				if (pte_swp_uffd_wp(oldpte))
> +					newpte = pte_swp_mkuffd_wp(newpte);
>  				set_pte_at(mm, addr, pte, newpte);
>  
>  				pages++;

You also need to handle the is_write_device_private_entry() case just
below that chunk.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range
  2019-03-20  2:06 ` [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range Peter Xu
@ 2019-04-18 21:01   ` Jerome Glisse
  0 siblings, 0 replies; 51+ messages in thread
From: Jerome Glisse @ 2019-04-18 21:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Wed, Mar 20, 2019 at 10:06:39AM +0800, Peter Xu wrote:
> In change_pte_range() we do nothing for uffd if the PTE is a swap
> entry.  That can lead to data mismatch if the page that we are going
> to write protect is swapped out when sending the UFFDIO_WRITEPROTECT.
> This patch applies/removes the uffd-wp bit even for the swap entries.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

This one seems to address some of the comments I made on patch 17,
though not all of them. Maybe squash them together?

> ---
> 
> I kept this patch a standalone one majorly to make review easier.  The
> patch can be considered as standalone or to squash into the patch
> "userfaultfd: wp: support swap and page migration".
> ---
>  mm/mprotect.c | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 96c0f521099d..a23e03053787 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -183,11 +183,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			}
>  			ptep_modify_prot_commit(mm, addr, pte, ptent);
>  			pages++;
> -		} else if (IS_ENABLED(CONFIG_MIGRATION)) {
> +		} else if (is_swap_pte(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);
> +			pte_t newpte;
>  
>  			if (is_write_migration_entry(entry)) {
> -				pte_t newpte;
>  				/*
>  				 * A protection check is difficult so
>  				 * just be safe and disable write
> @@ -198,22 +198,24 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  					newpte = pte_swp_mksoft_dirty(newpte);
>  				if (pte_swp_uffd_wp(oldpte))
>  					newpte = pte_swp_mkuffd_wp(newpte);
> -				set_pte_at(mm, addr, pte, newpte);
> -
> -				pages++;
> -			}
> -
> -			if (is_write_device_private_entry(entry)) {
> -				pte_t newpte;
> -
> +			} else if (is_write_device_private_entry(entry)) {
>  				/*
>  				 * We do not preserve soft-dirtiness. See
>  				 * copy_one_pte() for explanation.
>  				 */
>  				make_device_private_entry_read(&entry);
>  				newpte = swp_entry_to_pte(entry);
> -				set_pte_at(mm, addr, pte, newpte);
> +			} else {
> +				newpte = oldpte;
> +			}
>  
> +			if (uffd_wp)
> +				newpte = pte_swp_mkuffd_wp(newpte);
> +			else if (uffd_wp_resolve)
> +				newpte = pte_swp_clear_uffd_wp(newpte);
> +
> +			if (!pte_same(oldpte, newpte)) {
> +				set_pte_at(mm, addr, pte, newpte);
>  				pages++;
>  			}
>  		}
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/28] userfaultfd: write protection support
  2019-04-09  6:08 ` [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
@ 2019-04-18 21:07   ` Jerome Glisse
  2019-04-19  7:53     ` Peter Xu
  0 siblings, 1 reply; 51+ messages in thread
From: Jerome Glisse @ 2019-04-18 21:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Apr 09, 2019 at 02:08:39PM +0800, Peter Xu wrote:
> On Wed, Mar 20, 2019 at 10:06:14AM +0800, Peter Xu wrote:
> > This series implements initial write protection support for
> > userfaultfd.  Currently both shmem and hugetlbfs are not supported
> > yet, but only anonymous memory.  This is the 3rd version of it.
> > 
> > The latest code can also be found at:
> > 
> >   https://github.com/xzpeter/linux/tree/uffd-wp-merged
> > 
> > Note again that the first 5 patches in the series can be seen as
> > isolated work on page fault mechanism.  I would hope that they can be
> > considered to be reviewed/picked even earlier than the rest of the
> > series since it's even useful for existing userfaultfd MISSING case
> > [8].
> 
> Ping - any further comments for v3?  Is there any chance to have this
> series (or the first 5 patches) for 5.2?

A few issues are left; sorry for taking so long to get to the review,
sometimes it sinks to the bottom of my stack.

I am guessing this should be merged through Andrew? Unless Andrea has
a tree for userfaultfd (I am not following it all that closely).

From my point of view it almost all looks good. I sent reviews before
this email. Maybe we need some review from the x86 folks on the x86
arch changes for the feature?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 04/28] mm: allow VM_FAULT_RETRY for multiple times
  2019-04-18 20:11   ` Jerome Glisse
@ 2019-04-19  6:00     ` Peter Xu
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-04-19  6:00 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Apr 18, 2019 at 04:11:08PM -0400, Jerome Glisse wrote:

[...]

> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> A minor comment suggestion below but it can be fix in a followup patch.

[...]

> > +/*
> > + * Returns true if the page fault allows retry and this is the first
> > + * attempt of the fault handling; false otherwise.  This is mostly
> > + * used for places where we want to try to avoid taking the mmap_sem
> > + * for too long a time when waiting for another condition to change,
> > + * in which case we can try to be polite to release the mmap_sem in
> > + * the first round to avoid potential starvation of other processes
> > + * that would also want the mmap_sem.
> > + */
> 
> You should be using kernel function documentation style above.

I'm switching to this:

/**
 * fault_flag_allow_retry_first - check ALLOW_RETRY the first time
 *
 * This is mostly used for places where we want to try to avoid taking
 * the mmap_sem for too long a time when waiting for another condition
 * to change, in which case we can try to be polite to release the
 * mmap_sem in the first round to avoid potential starvation of other
 * processes that would also want the mmap_sem.
 *
 * Return: true if the page fault allows retry and this is the first
 * attempt of the fault handling; false otherwise.
 */
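
For illustration, the intended call pattern (a sketch based on the
__lock_page_or_retry() example mentioned in the commit message):

	if (fault_flag_allow_retry_first(flags)) {
		/*
		 * First attempt: politely drop the mmap_sem while
		 * waiting, so other mmap_sem waiters are not starved.
		 */
		up_read(&mm->mmap_sem);
		/* ... wait for the condition, then let the fault retry */
	}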

I'm still keeping the r-b, assuming that's ok.

Thanks!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-04-18 20:51   ` Jerome Glisse
@ 2019-04-19  6:26     ` Peter Xu
  2019-04-19 15:02       ` Jerome Glisse
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-04-19  6:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Apr 18, 2019 at 04:51:15PM -0400, Jerome Glisse wrote:
> On Wed, Mar 20, 2019 at 10:06:28AM +0800, Peter Xu wrote:
> > This allows uffd-wp to support write-protected pages for COW.
> > 
> > For example, the uffd write-protected PTE could also be write-protected
> > by other usages like COW or zero pages.  When that happens, we can't
> > simply set the write bit in the PTE since otherwise it'll change the
> > content of every single reference to the page.  Instead, we should do
> > the COW first if necessary, then handle the uffd-wp fault.
> > 
> > To correctly copy the page, we'll also need to carry over the
> > _PAGE_UFFD_WP bit if it was set in the original PTE.
> > 
> > For huge PMDs, we simply split them whenever we want to resolve an
> > uffd-wp page fault.  That matches what we do with general huge PMD
> > write protections.  In that way, we reduce the huge PMD copy-on-write
> > problem to the PTE copy-on-write case.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This one has a bug see below.
> 
> 
> > ---
> >  mm/memory.c   |  5 +++-
> >  mm/mprotect.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 65 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e7a4b9650225..b8a4c0bab461 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2291,7 +2291,10 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
> >  		}
> >  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> >  		entry = mk_pte(new_page, vma->vm_page_prot);
> > -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > +		if (pte_uffd_wp(vmf->orig_pte))
> > +			entry = pte_mkuffd_wp(entry);
> > +		else
> > +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> >  		/*
> >  		 * Clear the pte entry and flush it first, before updating the
> >  		 * pte with the new entry. This will avoid a race condition
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 9d4433044c21..855dddb07ff2 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -73,18 +73,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  	flush_tlb_batched_pending(vma->vm_mm);
> >  	arch_enter_lazy_mmu_mode();
> >  	do {
> > +retry_pte:
> >  		oldpte = *pte;
> >  		if (pte_present(oldpte)) {
> >  			pte_t ptent;
> >  			bool preserve_write = prot_numa && pte_write(oldpte);
> > +			struct page *page;
> >  
> >  			/*
> >  			 * Avoid trapping faults against the zero or KSM
> >  			 * pages. See similar comment in change_huge_pmd.
> >  			 */
> >  			if (prot_numa) {
> > -				struct page *page;
> > -
> >  				page = vm_normal_page(vma, addr, oldpte);
> >  				if (!page || PageKsm(page))
> >  					continue;
> > @@ -114,6 +114,54 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  					continue;
> >  			}
> >  
> > +			/*
> > +			 * Detect whether we'll need to COW before
> > +			 * resolving an uffd-wp fault.  Note that this
> > +			 * includes detection of the zero page (where
> > +			 * page==NULL)
> > +			 */
> > +			if (uffd_wp_resolve) {
> > +				/* If the fault is resolved already, skip */
> > +				if (!pte_uffd_wp(*pte))
> > +					continue;
> > +				page = vm_normal_page(vma, addr, oldpte);
> > +				if (!page || page_mapcount(page) > 1) {
> > +					struct vm_fault vmf = {
> > +						.vma = vma,
> > +						.address = addr & PAGE_MASK,
> > +						.page = page,
> > +						.orig_pte = oldpte,
> > +						.pmd = pmd,
> > +						/* pte and ptl not needed */
> > +					};
> > +					vm_fault_t ret;
> > +
> > +					if (page)
> > +						get_page(page);
> > +					arch_leave_lazy_mmu_mode();
> > +					pte_unmap_unlock(pte, ptl);
> > +					ret = wp_page_copy(&vmf);
> > +					/* PTE is changed, or OOM */
> > +					if (ret == 0)
> > +						/* It's done by others */
> > +						continue;
> 
> This is wrong: if ret == 0 you still need to remap the pte before
> continuing, as otherwise you will move to the next pte without holding
> the page table lock for the directory. So the 0 case must be handled
> after arch_enter_lazy_mmu_mode() below.
> 
> Sorry, I should have caught that in the previous review.

My fault for not having noticed it from the very beginning... thanks for
spotting that.

I'm squashing the changes below into the patch:

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3cddfd6627b8..13d493b836bb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -141,22 +141,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                                        arch_leave_lazy_mmu_mode();
                                        pte_unmap_unlock(pte, ptl);
                                        ret = wp_page_copy(&vmf);
-                                       /* PTE is changed, or OOM */
-                                       if (ret == 0)
-                                               /* It's done by others */
-                                               continue;
-                                       else if (WARN_ON(ret != VM_FAULT_WRITE))
+                                       if (ret != VM_FAULT_WRITE && ret != 0)
+                                               /* Probably OOM */
                                                return pages;
                                        pte = pte_offset_map_lock(vma->vm_mm,
                                                                  pmd, addr,
                                                                  &ptl);
                                        arch_enter_lazy_mmu_mode();
-                                       if (!pte_present(*pte))
+                                       if (ret == 0 || !pte_present(*pte))
                                                /*
                                                 * This PTE could have been
-                                                * modified after COW
-                                                * before we have taken the
-                                                * lock; retry this PTE
+                                                * modified during or after
+                                                * COW before take the lock;
+                                                * retry.
                                                 */
                                                goto retry_pte;
                                }

[...]

> >  		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> > -			if (next - addr != HPAGE_PMD_SIZE) {
> > +			/*
> > +			 * When resolving a userfaultfd write
> > +			 * protection fault, it's not easy to identify
> > +			 * whether a THP is shared with others and
> > +			 * whether we'll need to do copy-on-write, so
> > +			 * just always split it for now to simplify
> > +			 * the procedure.  And that's the policy too for
> > +			 * general THP write-protect in af9e4d5f2de2.
> > +			 */
> > +			if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {
> 
> Just a nitpick: can you please add () around next - addr, ie:
> if ((next - addr) != HPAGE_PMD_SIZE || uffd_wp_resolve) {
> 
> I know it is not needed, but each time I bump into this I
> have to scratch my head for a second to remember the operator
> precedence rules :)

Sure, as usual. :) And I tend to agree it's a good habit.  It's just
me who always forgets about it.

Thanks,

-- 
Peter Xu

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 17/28] userfaultfd: wp: support swap and page migration
  2019-04-18 20:59   ` Jerome Glisse
@ 2019-04-19  7:42     ` Peter Xu
  2019-04-19 15:08       ` Jerome Glisse
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-04-19  7:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Apr 18, 2019 at 04:59:07PM -0400, Jerome Glisse wrote:
> On Wed, Mar 20, 2019 at 10:06:31AM +0800, Peter Xu wrote:
> > For both swap and page migration, we use bit 2 of the entry to
> > identify whether the entry is uffd write-protected.  It plays a similar
> > role as the existing soft dirty bit in swap entries but only for keeping
> > the uffd-wp tracking for a specific PTE/PMD.
> > 
> > Something special here is that when we want to recover the uffd-wp bit
> > from a swap/migration entry to the PTE bit we'll also need to take care
> > of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> > _PAGE_UFFD_WP bit we can't trap it at all.
> > 
> > Note that this patch removed two lines from "userfaultfd: wp: hook
> > userfault handler to write protection fault" where we try to remove the
> > VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> > patch will still keep the write flag there.
> > 
> > Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Some missing things, see below.
> 
> [...]
> 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6405d56debee..c3d57fa890f2 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  				pte = swp_entry_to_pte(entry);
> >  				if (pte_swp_soft_dirty(*src_pte))
> >  					pte = pte_swp_mksoft_dirty(pte);
> > +				if (pte_swp_uffd_wp(*src_pte))
> > +					pte = pte_swp_mkuffd_wp(pte);
> >  				set_pte_at(src_mm, addr, src_pte, pte);
> >  			}
> >  		} else if (is_device_private_entry(entry)) {
> 
> You need to handle the is_device_private_entry() case like the
> migration case too.

Hi, Jerome,

Yes I can simply add the handling, but I'll confess I haven't thought
clearly yet about how userfault-wp will be used with HMM (and that's
mostly because of my unfamiliarity so far with HMM).  Could you give me
a hint on the most general and likely scenario?

> 
> 
> 
> > @@ -2825,6 +2827,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	flush_icache_page(vma, page);
> >  	if (pte_swp_soft_dirty(vmf->orig_pte))
> >  		pte = pte_mksoft_dirty(pte);
> > +	if (pte_swp_uffd_wp(vmf->orig_pte)) {
> > +		pte = pte_mkuffd_wp(pte);
> > +		pte = pte_wrprotect(pte);
> > +	}
> >  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >  	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >  	vmf->orig_pte = pte;
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 181f5d2718a9..72cde187d4a1 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -241,6 +241,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
> >  		entry = pte_to_swp_entry(*pvmw.pte);
> >  		if (is_write_migration_entry(entry))
> >  			pte = maybe_mkwrite(pte, vma);
> > +		else if (pte_swp_uffd_wp(*pvmw.pte))
> > +			pte = pte_mkuffd_wp(pte);
> >  
> >  		if (unlikely(is_zone_device_page(new))) {
> >  			if (is_device_private_page(new)) {
> 
> You need to handle the is_device_private_page() case too, ie mark
> its swap entry as uffd_wp

Yes I can do this too.

> 
> > @@ -2301,6 +2303,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  			swp_pte = swp_entry_to_pte(entry);
> >  			if (pte_soft_dirty(pte))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > +			if (pte_uffd_wp(pte))
> > +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
> >  			set_pte_at(mm, addr, ptep, swp_pte);
> >
> >  			/*
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 855dddb07ff2..96c0f521099d 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -196,6 +196,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  				newpte = swp_entry_to_pte(entry);
> >  				if (pte_swp_soft_dirty(oldpte))
> >  					newpte = pte_swp_mksoft_dirty(newpte);
> > +				if (pte_swp_uffd_wp(oldpte))
> > +					newpte = pte_swp_mkuffd_wp(newpte);
> >  				set_pte_at(mm, addr, pte, newpte);
> >  
> >  				pages++;
> 
> You need to handle the is_write_device_private_entry() case just below
> that chunk.

This one is a bit special, because it's not only the private entries
that are missing but all swap/migration entries, which are
explicitly handled by patch 25.  But I think I can just squash it into
this patch as you suggested.
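
For what it's worth, the extra branch below that chunk might look like
this (a sketch only, mirroring the write-migration-entry handling above;
exact context and placement are my assumptions):

		} else if (is_write_device_private_entry(entry)) {
			/*
			 * Downgrade the writable device-private entry
			 * and carry over the uffd-wp bit, as done for
			 * migration entries above.
			 */
			make_device_private_entry_read(&entry);
			newpte = swp_entry_to_pte(entry);
			if (pte_swp_uffd_wp(oldpte))
				newpte = pte_swp_mkuffd_wp(newpte);
			set_pte_at(mm, addr, pte, newpte);
			pages++;
		}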

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/28] userfaultfd: write protection support
  2019-04-18 21:07   ` Jerome Glisse
@ 2019-04-19  7:53     ` Peter Xu
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-04-19  7:53 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Apr 18, 2019 at 05:07:02PM -0400, Jerome Glisse wrote:
> On Tue, Apr 09, 2019 at 02:08:39PM +0800, Peter Xu wrote:
> > On Wed, Mar 20, 2019 at 10:06:14AM +0800, Peter Xu wrote:
> > > This series implements initial write protection support for
> > > userfaultfd.  Currently shmem and hugetlbfs are not yet supported;
> > > only anonymous memory is.  This is the 3rd version of it.
> > > 
> > > The latest code can also be found at:
> > > 
> > >   https://github.com/xzpeter/linux/tree/uffd-wp-merged
> > > 
> > > Note again that the first 5 patches in the series can be seen as
> > > isolated work on page fault mechanism.  I would hope that they can be
> > > considered to be reviewed/picked even earlier than the rest of the
> > > series since it's even useful for existing userfaultfd MISSING case
> > > [8].
> > 
> > Ping - any further comments for v3?  Is there any chance to have this
> > series (or the first 5 patches) merged in 5.2?
> 
> A few issues left; sorry for taking so long to get to the review,
> sometimes it goes to the bottom of my stack.
> 
> I am guessing this should be merged through Andrew?  Unless Andrea has
> a tree for userfaultfd (I am not following that closely).
> 
> From my point of view it almost all looks good.  I sent reviews before
> this email.  Maybe we need some review from the x86 folks on the x86
> arch changes for the feature?

Thank you for your time reviewing the series (my thanks to Mike
too!).  I have no idea whom else I should ask for further review
comments, but anyway I'd be more than glad to discuss any further
concerns or do anything to move this series forward, because AFAIK
multiple userspace projects are waiting for this series to settle.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-04-19  6:26     ` Peter Xu
@ 2019-04-19 15:02       ` Jerome Glisse
  2019-04-22 12:20         ` Peter Xu
  0 siblings, 1 reply; 51+ messages in thread
From: Jerome Glisse @ 2019-04-19 15:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Apr 19, 2019 at 02:26:50PM +0800, Peter Xu wrote:
> On Thu, Apr 18, 2019 at 04:51:15PM -0400, Jerome Glisse wrote:
> > On Wed, Mar 20, 2019 at 10:06:28AM +0800, Peter Xu wrote:
> > > This allows uffd-wp to support write-protected pages for COW.
> > > 
> > > For example, the uffd write-protected PTE could also be write-protected
> > > by other usages like COW or zero pages.  When that happens, we can't
> > > simply set the write bit in the PTE since otherwise it'll change the
> > > content of every single reference to the page.  Instead, we should do
> > > the COW first if necessary, then handle the uffd-wp fault.
> > > 
> > > To correctly copy the page, we'll also need to carry over the
> > > _PAGE_UFFD_WP bit if it was set in the original PTE.
> > > 
> > > For huge PMDs, we simply split the huge PMD wherever we need to
> > > resolve an uffd-wp page fault.  That matches what we do with
> > > general huge PMD write protections.  In that way, we reduce the huge
> > > PMD copy-on-write problem to PTE copy-on-write.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > This one has a bug, see below.
> > 
> > 
> > > ---
> > >  mm/memory.c   |  5 +++-
> > >  mm/mprotect.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > >  2 files changed, 65 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index e7a4b9650225..b8a4c0bab461 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2291,7 +2291,10 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
> > >  		}
> > >  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> > >  		entry = mk_pte(new_page, vma->vm_page_prot);
> > > -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > > +		if (pte_uffd_wp(vmf->orig_pte))
> > > +			entry = pte_mkuffd_wp(entry);
> > > +		else
> > > +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > >  		/*
> > >  		 * Clear the pte entry and flush it first, before updating the
> > >  		 * pte with the new entry. This will avoid a race condition
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 9d4433044c21..855dddb07ff2 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -73,18 +73,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  	flush_tlb_batched_pending(vma->vm_mm);
> > >  	arch_enter_lazy_mmu_mode();
> > >  	do {
> > > +retry_pte:
> > >  		oldpte = *pte;
> > >  		if (pte_present(oldpte)) {
> > >  			pte_t ptent;
> > >  			bool preserve_write = prot_numa && pte_write(oldpte);
> > > +			struct page *page;
> > >  
> > >  			/*
> > >  			 * Avoid trapping faults against the zero or KSM
> > >  			 * pages. See similar comment in change_huge_pmd.
> > >  			 */
> > >  			if (prot_numa) {
> > > -				struct page *page;
> > > -
> > >  				page = vm_normal_page(vma, addr, oldpte);
> > >  				if (!page || PageKsm(page))
> > >  					continue;
> > > @@ -114,6 +114,54 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  					continue;
> > >  			}
> > >  
> > > +			/*
> > > +			 * Detect whether we'll need to COW before
> > > +			 * resolving an uffd-wp fault.  Note that this
> > > +			 * includes detection of the zero page (where
> > > +			 * page==NULL)
> > > +			 */
> > > +			if (uffd_wp_resolve) {
> > > +				/* If the fault is resolved already, skip */
> > > +				if (!pte_uffd_wp(*pte))
> > > +					continue;
> > > +				page = vm_normal_page(vma, addr, oldpte);
> > > +				if (!page || page_mapcount(page) > 1) {
> > > +					struct vm_fault vmf = {
> > > +						.vma = vma,
> > > +						.address = addr & PAGE_MASK,
> > > +						.page = page,
> > > +						.orig_pte = oldpte,
> > > +						.pmd = pmd,
> > > +						/* pte and ptl not needed */
> > > +					};
> > > +					vm_fault_t ret;
> > > +
> > > +					if (page)
> > > +						get_page(page);
> > > +					arch_leave_lazy_mmu_mode();
> > > +					pte_unmap_unlock(pte, ptl);
> > > +					ret = wp_page_copy(&vmf);
> > > +					/* PTE is changed, or OOM */
> > > +					if (ret == 0)
> > > +						/* It's done by others */
> > > +						continue;
> > 
> > This is wrong: if ret == 0 you still need to remap the pte before
> > continuing, as otherwise you will move to the next pte without holding
> > the page table lock for the directory. So the 0 case must be handled
> > after arch_enter_lazy_mmu_mode() below.
> > 
> > Sorry, I should have caught that in the previous review.
> 
> My fault for not having noticed it from the very beginning... thanks for
> spotting that.
> 
> I'm squashing the changes below into the patch:


Well, thinking about this some more, I think you should use do_wp_page()
and not wp_page_copy(): it would avoid a bunch of the code above, and
you are also not properly handling KSM pages or pages in the swap cache.
Instead of duplicating the code that is in do_wp_page() it would be
better to call it here.
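
For reference, the dispatch in do_wp_page() looks roughly like this
(a simplified outline with locking and a few details elided, so not a
verbatim copy of mm/memory.c):

	int total_map_swapcount;

	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
	if (!vmf->page) {
		/* special pte, e.g. the zero page: shared pfn reuse or COW */
		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
		    (VM_WRITE|VM_SHARED))
			return wp_pfn_shared(vmf);
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return wp_page_copy(vmf);
	}
	if (PageAnon(vmf->page)) {
		/*
		 * The reuse path only fires when the page is provably
		 * exclusive; KSM pages and pages still in the swap
		 * cache fall back to the copy below.
		 */
		if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
			wp_page_reuse(vmf);
			return VM_FAULT_WRITE;
		}
	} else if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
		   (VM_WRITE|VM_SHARED)) {
		return wp_page_shared(vmf);
	}
	get_page(vmf->page);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return wp_page_copy(vmf);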


> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 3cddfd6627b8..13d493b836bb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -141,22 +141,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>                                         arch_leave_lazy_mmu_mode();
>                                         pte_unmap_unlock(pte, ptl);
>                                         ret = wp_page_copy(&vmf);
> -                                       /* PTE is changed, or OOM */
> -                                       if (ret == 0)
> -                                               /* It's done by others */
> -                                               continue;
> -                                       else if (WARN_ON(ret != VM_FAULT_WRITE))
> +                                       if (ret != VM_FAULT_WRITE && ret != 0)
> +                                               /* Probably OOM */
>                                                 return pages;
>                                         pte = pte_offset_map_lock(vma->vm_mm,
>                                                                   pmd, addr,
>                                                                   &ptl);
>                                         arch_enter_lazy_mmu_mode();
> -                                       if (!pte_present(*pte))
> +                                       if (ret == 0 || !pte_present(*pte))
>                                                 /*
>                                                  * This PTE could have been
> -                                                * modified after COW
> -                                                * before we have taken the
> -                                                * lock; retry this PTE
> +                                                * modified during or after
> +                                                * COW before take the lock;
> +                                                * retry.
>                                                  */
>                                                 goto retry_pte;
>                                 }
> 
> [...]
> 
> > >  		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> > > -			if (next - addr != HPAGE_PMD_SIZE) {
> > > +			/*
> > > +			 * When resolving a userfaultfd write
> > > +			 * protection fault, it's not easy to identify
> > > +			 * whether a THP is shared with others and
> > > +			 * whether we'll need to do copy-on-write, so
> > > +			 * just always split it for now to simplify
> > > +			 * the procedure.  And that's the policy too for
> > > +			 * general THP write-protect in af9e4d5f2de2.
> > > +			 */
> > > +			if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {
> > 
> > Just a nitpick: can you please add () around next - addr, ie:
> > if ((next - addr) != HPAGE_PMD_SIZE || uffd_wp_resolve) {
> > 
> > I know it is not needed, but each time I bump into this I
> > have to scratch my head for a second to remember the operator
> > precedence rules :)
> 
> Sure, as usual. :) And I tend to agree it's a good habit.  It's just
> me who always forgets about it.
> 
> Thanks,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 17/28] userfaultfd: wp: support swap and page migration
  2019-04-19  7:42     ` Peter Xu
@ 2019-04-19 15:08       ` Jerome Glisse
  2019-04-22 12:23         ` Peter Xu
  0 siblings, 1 reply; 51+ messages in thread
From: Jerome Glisse @ 2019-04-19 15:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Apr 19, 2019 at 03:42:20PM +0800, Peter Xu wrote:
> On Thu, Apr 18, 2019 at 04:59:07PM -0400, Jerome Glisse wrote:
> > On Wed, Mar 20, 2019 at 10:06:31AM +0800, Peter Xu wrote:
> > > For both swap and page migration, we use bit 2 of the entry to
> > > identify whether the entry is uffd write-protected.  It plays a similar
> > > role as the existing soft dirty bit in swap entries but only for keeping
> > > the uffd-wp tracking for a specific PTE/PMD.
> > > 
> > > Something special here is that when we want to recover the uffd-wp bit
> > > from a swap/migration entry to the PTE bit we'll also need to take care
> > > of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> > > _PAGE_UFFD_WP bit we can't trap it at all.
> > > 
> > > Note that this patch removed two lines from "userfaultfd: wp: hook
> > > userfault handler to write protection fault" where we try to remove the
> > > VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> > > patch will still keep the write flag there.
> > > 
> > > Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Some missing things, see below.
> > 
> > [...]
> > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6405d56debee..c3d57fa890f2 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >  				pte = swp_entry_to_pte(entry);
> > >  				if (pte_swp_soft_dirty(*src_pte))
> > >  					pte = pte_swp_mksoft_dirty(pte);
> > > +				if (pte_swp_uffd_wp(*src_pte))
> > > +					pte = pte_swp_mkuffd_wp(pte);
> > >  				set_pte_at(src_mm, addr, src_pte, pte);
> > >  			}
> > >  		} else if (is_device_private_entry(entry)) {
> > 
> > You need to handle the is_device_private_entry() case like the
> > migration case too.
> 
> Hi, Jerome,
> 
> Yes I can simply add the handling, but I'll confess I haven't thought
> clearly yet about how userfault-wp will be used with HMM (and that's
> mostly because of my unfamiliarity so far with HMM).  Could you give me
> a hint on the most general and likely scenario?

Device private is just a temporary state with HMM: you can have things
like a GPU or FPGA migrate some anonymous page to their local memory
because it is in use by the GPU or the FPGA. The GPU or FPGA behaves
like a CPU from the mm POV, so if it wants to write it will fault and
go through the regular CPU page fault path.

That said, it can still migrate a page that is UFD write-protected just
because the device only cares about reading. So if you have a UFD pte
to a regular page that gets migrated to some device memory, you want to
keep the UFD WP flag after the migration (in both directions: when
going to device memory and when coming back from it).

As far as UFD is concerned this is just another page; it just does not
have a valid pte entry because the CPU cannot access such memory. But
from the mm point of view it is just another page.
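
Concretely, on the fork path the device-private branch would mirror the
migration hunk above; a sketch (existing context abbreviated, exact
placement assumed):

		} else if (is_device_private_entry(entry)) {
			page = device_private_entry_to_page(entry);
			/* ... existing refcount/rmap handling ... */
			pte = swp_entry_to_pte(entry);
			/* Carry the uffd-wp bit across fork, as for
			 * migration entries. */
			if (pte_swp_uffd_wp(*src_pte))
				pte = pte_swp_mkuffd_wp(pte);
			set_pte_at(src_mm, addr, src_pte, pte);
		}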

> 
> > 
> > 
> > 
> > > @@ -2825,6 +2827,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >  	flush_icache_page(vma, page);
> > >  	if (pte_swp_soft_dirty(vmf->orig_pte))
> > >  		pte = pte_mksoft_dirty(pte);
> > > +	if (pte_swp_uffd_wp(vmf->orig_pte)) {
> > > +		pte = pte_mkuffd_wp(pte);
> > > +		pte = pte_wrprotect(pte);
> > > +	}
> > >  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > >  	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > >  	vmf->orig_pte = pte;
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 181f5d2718a9..72cde187d4a1 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -241,6 +241,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
> > >  		entry = pte_to_swp_entry(*pvmw.pte);
> > >  		if (is_write_migration_entry(entry))
> > >  			pte = maybe_mkwrite(pte, vma);
> > > +		else if (pte_swp_uffd_wp(*pvmw.pte))
> > > +			pte = pte_mkuffd_wp(pte);
> > >  
> > >  		if (unlikely(is_zone_device_page(new))) {
> > >  			if (is_device_private_page(new)) {
> > 
> > You need to handle the is_device_private_page() case too, ie mark
> > its swap entry as uffd_wp
> 
> Yes I can do this too.
> 
> > 
> > > @@ -2301,6 +2303,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >  			swp_pte = swp_entry_to_pte(entry);
> > >  			if (pte_soft_dirty(pte))
> > >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > +			if (pte_uffd_wp(pte))
> > > +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > >  			set_pte_at(mm, addr, ptep, swp_pte);
> > >
> > >  			/*
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 855dddb07ff2..96c0f521099d 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -196,6 +196,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  				newpte = swp_entry_to_pte(entry);
> > >  				if (pte_swp_soft_dirty(oldpte))
> > >  					newpte = pte_swp_mksoft_dirty(newpte);
> > > +				if (pte_swp_uffd_wp(oldpte))
> > > +					newpte = pte_swp_mkuffd_wp(newpte);
> > >  				set_pte_at(mm, addr, pte, newpte);
> > >  
> > >  				pages++;
> > 
> > You need to handle the is_write_device_private_entry() case just below
> > that chunk.
> 
> This one is a bit special, because it's not only the private entries
> that are missing but all swap/migration entries, which are
> explicitly handled by patch 25.  But I think I can just squash it into
> this patch as you suggested.

Yeah, I was reading things in order; you can do that in patch 25.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-04-19 15:02       ` Jerome Glisse
@ 2019-04-22 12:20         ` Peter Xu
  2019-04-22 14:54           ` Jerome Glisse
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-04-22 12:20 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Apr 19, 2019 at 11:02:53AM -0400, Jerome Glisse wrote:

[...]

> > > > +			if (uffd_wp_resolve) {
> > > > +				/* If the fault is resolved already, skip */
> > > > +				if (!pte_uffd_wp(*pte))
> > > > +					continue;
> > > > +				page = vm_normal_page(vma, addr, oldpte);
> > > > +				if (!page || page_mapcount(page) > 1) {
> > > > +					struct vm_fault vmf = {
> > > > +						.vma = vma,
> > > > +						.address = addr & PAGE_MASK,
> > > > +						.page = page,
> > > > +						.orig_pte = oldpte,
> > > > +						.pmd = pmd,
> > > > +						/* pte and ptl not needed */
> > > > +					};
> > > > +					vm_fault_t ret;
> > > > +
> > > > +					if (page)
> > > > +						get_page(page);
> > > > +					arch_leave_lazy_mmu_mode();
> > > > +					pte_unmap_unlock(pte, ptl);
> > > > +					ret = wp_page_copy(&vmf);
> > > > +					/* PTE is changed, or OOM */
> > > > +					if (ret == 0)
> > > > +						/* It's done by others */
> > > > +						continue;
> > > 
> > > This is wrong: if ret == 0 you still need to remap the pte before
> > > continuing, as otherwise you will move to the next pte without holding
> > > the page table lock for the directory. So the 0 case must be handled
> > > after arch_enter_lazy_mmu_mode() below.
> > > 
> > > Sorry, I should have caught that in the previous review.
> > 
> > My fault for not having noticed it from the very beginning... thanks for
> > spotting that.
> > 
> > I'm squashing the changes below into the patch:
> 
> 
> Well, thinking about this some more, I think you should use do_wp_page()
> and not wp_page_copy(): it would avoid a bunch of the code above, and
> you are also not properly handling KSM pages or pages in the swap cache.
> Instead of duplicating the code that is in do_wp_page() it would be
> better to call it here.

Yeah, it makes sense to me.  Then here's my plan:

- I'll need to drop the previous patch "export wp_page_copy" since it
  will no longer be needed

- I'll introduce another patch to split the current do_wp_page() and
  introduce a function "wp_page_copy_cont" (better suggestions on the
  naming would be welcome) which contains most of the wp handling
  that'll be needed for change_pte_range() in this patch and isolates
  the uffd handling:

static vm_fault_t do_wp_page(struct vm_fault *vmf)
	__releases(vmf->ptl)
{
	struct vm_area_struct *vma = vmf->vma;

	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return handle_userfault(vmf, VM_UFFD_WP);
	}

	return do_wp_page_cont(vmf);
}

Then I can probably use do_wp_page_cont() in this patch.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 17/28] userfaultfd: wp: support swap and page migration
  2019-04-19 15:08       ` Jerome Glisse
@ 2019-04-22 12:23         ` Peter Xu
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-04-22 12:23 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Apr 19, 2019 at 11:08:02AM -0400, Jerome Glisse wrote:
> On Fri, Apr 19, 2019 at 03:42:20PM +0800, Peter Xu wrote:
> > On Thu, Apr 18, 2019 at 04:59:07PM -0400, Jerome Glisse wrote:
> > > On Wed, Mar 20, 2019 at 10:06:31AM +0800, Peter Xu wrote:
> > > > For both swap and page migration, we use bit 2 of the entry to
> > > > identify whether the entry is uffd write-protected.  It plays a similar
> > > > role as the existing soft dirty bit in swap entries but only for keeping
> > > > the uffd-wp tracking for a specific PTE/PMD.
> > > > 
> > > > Something special here is that when we want to recover the uffd-wp bit
> > > > from a swap/migration entry to the PTE bit we'll also need to take care
> > > > of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> > > > _PAGE_UFFD_WP bit we can't trap it at all.
> > > > 
> > > > Note that this patch removed two lines from "userfaultfd: wp: hook
> > > > userfault handler to write protection fault" where we try to remove the
> > > > VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> > > > patch will still keep the write flag there.
> > > > 
> > > > Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > Some missing things, see below.
> > > 
> > > [...]
> > > 
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 6405d56debee..c3d57fa890f2 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >  				pte = swp_entry_to_pte(entry);
> > > >  				if (pte_swp_soft_dirty(*src_pte))
> > > >  					pte = pte_swp_mksoft_dirty(pte);
> > > > +				if (pte_swp_uffd_wp(*src_pte))
> > > > +					pte = pte_swp_mkuffd_wp(pte);
> > > >  				set_pte_at(src_mm, addr, src_pte, pte);
> > > >  			}
> > > >  		} else if (is_device_private_entry(entry)) {
> > > 
> > > You need to handle the is_device_private_entry() case like the
> > > migration case too.
> > 
> > Hi, Jerome,
> > 
> > Yes I can simply add the handling, but I'll confess I haven't thought
> > clearly yet about how userfault-wp will be used with HMM (and that's
> > mostly because of my unfamiliarity so far with HMM).  Could you give me
> > a hint on the most general and likely scenario?
> 
> Device private is just a temporary state with HMM: you can have things
> like a GPU or FPGA migrate some anonymous page to their local memory
> because it is in use by the GPU or the FPGA. The GPU or FPGA behaves
> like a CPU from the mm POV, so if it wants to write it will fault and
> go through the regular CPU page fault path.
> 
> That said, it can still migrate a page that is UFD write-protected just
> because the device only cares about reading. So if you have a UFD pte
> to a regular page that gets migrated to some device memory, you want to
> keep the UFD WP flag after the migration (in both directions: when
> going to device memory and when coming back from it).
> 
> As far as UFD is concerned this is just another page; it just does not
> have a valid pte entry because the CPU cannot access such memory. But
> from the mm point of view it is just another page.

I see the point.  Thanks for explaining that!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-04-22 12:20         ` Peter Xu
@ 2019-04-22 14:54           ` Jerome Glisse
  2019-04-23  3:00             ` Peter Xu
  0 siblings, 1 reply; 51+ messages in thread
From: Jerome Glisse @ 2019-04-22 14:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Apr 22, 2019 at 08:20:10PM +0800, Peter Xu wrote:
> On Fri, Apr 19, 2019 at 11:02:53AM -0400, Jerome Glisse wrote:
> 
> [...]
> 
> > > > > +			if (uffd_wp_resolve) {
> > > > > +				/* If the fault is resolved already, skip */
> > > > > +				if (!pte_uffd_wp(*pte))
> > > > > +					continue;
> > > > > +				page = vm_normal_page(vma, addr, oldpte);
> > > > > +				if (!page || page_mapcount(page) > 1) {
> > > > > +					struct vm_fault vmf = {
> > > > > +						.vma = vma,
> > > > > +						.address = addr & PAGE_MASK,
> > > > > +						.page = page,
> > > > > +						.orig_pte = oldpte,
> > > > > +						.pmd = pmd,
> > > > > +						/* pte and ptl not needed */
> > > > > +					};
> > > > > +					vm_fault_t ret;
> > > > > +
> > > > > +					if (page)
> > > > > +						get_page(page);
> > > > > +					arch_leave_lazy_mmu_mode();
> > > > > +					pte_unmap_unlock(pte, ptl);
> > > > > +					ret = wp_page_copy(&vmf);
> > > > > +					/* PTE is changed, or OOM */
> > > > > +					if (ret == 0)
> > > > > +						/* It's done by others */
> > > > > +						continue;
> > > > 
> > > > This is wrong: if ret == 0 you still need to remap the pte before
> > > > continuing, as otherwise you will move to the next pte without holding
> > > > the page table lock for the directory. So the 0 case must be handled
> > > > after arch_enter_lazy_mmu_mode() below.
> > > > 
> > > > Sorry, I should have caught that in the previous review.
> > > 
> > > My fault for not having noticed it from the very beginning... thanks for
> > > spotting that.
> > > 
> > > I'm squashing the changes below into the patch:
> > 
> > 
> > Well, thinking about this some more, I think you should use do_wp_page()
> > and not wp_page_copy(): it would avoid a bunch of the code above, and
> > you are also not properly handling KSM pages or pages in the swap cache.
> > Instead of duplicating the code that is in do_wp_page() it would be
> > better to call it here.
> 
> Yeah, it makes sense to me.  Then here's my plan:
> 
> - I'll need to drop the previous patch "export wp_page_copy" since it
>   will no longer be needed
> 
> - I'll introduce another patch to split the current do_wp_page() and
>   introduce a function "wp_page_copy_cont" (better suggestions on the
>   naming would be welcome) which contains most of the wp handling
>   that'll be needed for change_pte_range() in this patch and isolates
>   the uffd handling:
> 
> static vm_fault_t do_wp_page(struct vm_fault *vmf)
> 	__releases(vmf->ptl)
> {
> 	struct vm_area_struct *vma = vmf->vma;
> 
> 	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> 		pte_unmap_unlock(vmf->pte, vmf->ptl);
> 		return handle_userfault(vmf, VM_UFFD_WP);
> 	}
> 
> 	return do_wp_page_cont(vmf);
> }
> 
> Then I can probably use do_wp_page_cont() in this patch.

Instead I would keep the do_wp_page name and do:
    static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf) {
        ... // what you have above
        return do_wp_page(vmf);
    }

Naming-wise I think it would be better to keep do_wp_page() as
is.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-04-22 14:54           ` Jerome Glisse
@ 2019-04-23  3:00             ` Peter Xu
  2019-04-23 15:34               ` Jerome Glisse
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Xu @ 2019-04-23  3:00 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Apr 22, 2019 at 10:54:02AM -0400, Jerome Glisse wrote:
> On Mon, Apr 22, 2019 at 08:20:10PM +0800, Peter Xu wrote:
> > On Fri, Apr 19, 2019 at 11:02:53AM -0400, Jerome Glisse wrote:
> > 
> > [...]
> > 
> > > > > > +			if (uffd_wp_resolve) {
> > > > > > +				/* If the fault is resolved already, skip */
> > > > > > +				if (!pte_uffd_wp(*pte))
> > > > > > +					continue;
> > > > > > +				page = vm_normal_page(vma, addr, oldpte);
> > > > > > +				if (!page || page_mapcount(page) > 1) {
> > > > > > +					struct vm_fault vmf = {
> > > > > > +						.vma = vma,
> > > > > > +						.address = addr & PAGE_MASK,
> > > > > > +						.page = page,
> > > > > > +						.orig_pte = oldpte,
> > > > > > +						.pmd = pmd,
> > > > > > +						/* pte and ptl not needed */
> > > > > > +					};
> > > > > > +					vm_fault_t ret;
> > > > > > +
> > > > > > +					if (page)
> > > > > > +						get_page(page);
> > > > > > +					arch_leave_lazy_mmu_mode();
> > > > > > +					pte_unmap_unlock(pte, ptl);
> > > > > > +					ret = wp_page_copy(&vmf);
> > > > > > +					/* PTE is changed, or OOM */
> > > > > > +					if (ret == 0)
> > > > > > +						/* It's done by others */
> > > > > > +						continue;
> > > > > 
> > > > > This is wrong: if ret == 0 you still need to remap the pte before
> > > > > continuing, as otherwise you will move to the next pte without holding
> > > > > the page table lock for the directory. So the 0 case must be handled
> > > > > after arch_enter_lazy_mmu_mode() below.
> > > > > 
> > > > > Sorry, I should have caught that in the previous review.
> > > > 
> > > > My fault for not having noticed it from the very beginning... thanks for
> > > > spotting that.
> > > > 
> > > > I'm squashing the changes below into the patch:
> > > 
> > > 
> > > Well, thinking about this some more, I think you should use do_wp_page()
> > > and not wp_page_copy(): it would avoid a bunch of the code above, and
> > > you are also not properly handling KSM pages or pages in the swap cache.
> > > Instead of duplicating the code that is in do_wp_page() it would be
> > > better to call it here.
> > 
> > Yeah, it makes sense to me.  Then here's my plan:
> > 
> > - I'll need to drop the previous patch "export wp_page_copy" since it
> >   will no longer be needed
> > 
> > - I'll introduce another patch to split the current do_wp_page() and
> >   introduce a function "wp_page_copy_cont" (better suggestions on the
> >   naming would be welcome) which contains most of the wp handling
> >   that'll be needed for change_pte_range() in this patch and isolates
> >   the uffd handling:
> > 
> > static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > 	__releases(vmf->ptl)
> > {
> > 	struct vm_area_struct *vma = vmf->vma;
> > 
> > 	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> > 		pte_unmap_unlock(vmf->pte, vmf->ptl);
> > 		return handle_userfault(vmf, VM_UFFD_WP);
> > 	}
> > 
> > 	return do_wp_page_cont(vmf);
> > }
> > 
> > Then I can probably use do_wp_page_cont() in this patch.
> 
> Instead I would keep the do_wp_page name and do:
>     static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf) {
>         ... // what you have above
>         return do_wp_page(vmf);
>     }
> 
> Naming-wise I think it would be better to keep do_wp_page() as
> is.

In case I misunderstood... what I've proposed will be simply:

diff --git a/mm/memory.c b/mm/memory.c
index 64bd8075f054..ab98a1eb4702 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2497,6 +2497,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
                return handle_userfault(vmf, VM_UFFD_WP);
        }

+       return do_wp_page_cont(vmf);
+}
+
+vm_fault_t do_wp_page_cont(struct vm_fault *vmf)
+       __releases(vmf->ptl)
+{
+       struct vm_area_struct *vma = vmf->vma;
+
        vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
        if (!vmf->page) {
                /*

And the other proposal is:

diff --git a/mm/memory.c b/mm/memory.c
index 64bd8075f054..a73792127553 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2469,6 +2469,8 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
        return VM_FAULT_WRITE;
 }

+static vm_fault_t do_wp_page(struct vm_fault *vmf);
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2487,7 +2489,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
  * but allow concurrent faults), with pte both mapped and locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static vm_fault_t do_wp_page(struct vm_fault *vmf)
+static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf)
        __releases(vmf->ptl)
 {
        struct vm_area_struct *vma = vmf->vma;
@@ -2497,6 +2499,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
                return handle_userfault(vmf, VM_UFFD_WP);
        }

+       return do_wp_page(vmf);
+}
+
+static vm_fault_t do_wp_page(struct vm_fault *vmf)
+       __releases(vmf->ptl)
+{
+       struct vm_area_struct *vma = vmf->vma;
+
        vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
        if (!vmf->page) {
                /*
@@ -2869,7 +2879,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        }

        if (vmf->flags & FAULT_FLAG_WRITE) {
-               ret |= do_wp_page(vmf);
+               ret |= do_userfaultfd_wp_page(vmf);
                if (ret & VM_FAULT_ERROR)
                        ret &= VM_FAULT_ERROR;
                goto out;
@@ -3831,7 +3841,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
                goto unlock;
        if (vmf->flags & FAULT_FLAG_WRITE) {
                if (!pte_write(entry))
-                       return do_wp_page(vmf);
+                       return do_userfaultfd_wp_page(vmf);
                entry = pte_mkdirty(entry);
        }
        entry = pte_mkyoung(entry);

I would prefer the 1st approach since it contains fewer lines of
changes (it does not touch the callers), and the naming in the 2nd
approach can be a bit confusing (calling do_userfaultfd_wp_page in
handle_pte_fault may let people think of a userfault-only path, but
actually it covers the general path).  But if you really like the 2nd
one I can use that too.

Thanks,

-- 
Peter Xu

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-04-23  3:00             ` Peter Xu
@ 2019-04-23 15:34               ` Jerome Glisse
  2019-04-24  8:38                 ` Peter Xu
  0 siblings, 1 reply; 51+ messages in thread
From: Jerome Glisse @ 2019-04-23 15:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Apr 23, 2019 at 11:00:30AM +0800, Peter Xu wrote:
> On Mon, Apr 22, 2019 at 10:54:02AM -0400, Jerome Glisse wrote:
> > On Mon, Apr 22, 2019 at 08:20:10PM +0800, Peter Xu wrote:
> > > On Fri, Apr 19, 2019 at 11:02:53AM -0400, Jerome Glisse wrote:
> > > 
> > > [...]
> > > 
> > > > > > > +			if (uffd_wp_resolve) {
> > > > > > > +				/* If the fault is resolved already, skip */
> > > > > > > +				if (!pte_uffd_wp(*pte))
> > > > > > > +					continue;
> > > > > > > +				page = vm_normal_page(vma, addr, oldpte);
> > > > > > > +				if (!page || page_mapcount(page) > 1) {
> > > > > > > +					struct vm_fault vmf = {
> > > > > > > +						.vma = vma,
> > > > > > > +						.address = addr & PAGE_MASK,
> > > > > > > +						.page = page,
> > > > > > > +						.orig_pte = oldpte,
> > > > > > > +						.pmd = pmd,
> > > > > > > +						/* pte and ptl not needed */
> > > > > > > +					};
> > > > > > > +					vm_fault_t ret;
> > > > > > > +
> > > > > > > +					if (page)
> > > > > > > +						get_page(page);
> > > > > > > +					arch_leave_lazy_mmu_mode();
> > > > > > > +					pte_unmap_unlock(pte, ptl);
> > > > > > > +					ret = wp_page_copy(&vmf);
> > > > > > > +					/* PTE is changed, or OOM */
> > > > > > > +					if (ret == 0)
> > > > > > > +						/* It's done by others */
> > > > > > > +						continue;
> > > > > > 
> > > > > > This is wrong: if ret == 0 you still need to remap the pte before
> > > > > > continuing, as otherwise you will move to the next pte without holding
> > > > > > the page table lock for the directory. So the 0 case must be handled
> > > > > > after arch_enter_lazy_mmu_mode() below.
> > > > > > 
> > > > > > Sorry, I should have caught that in the previous review.
> > > > > 
> > > > > My fault for not having noticed it from the very beginning... thanks for
> > > > > spotting that.
> > > > > 
> > > > > I'm squashing the changes below into the patch:
> > > > 
> > > > 
> > > > Well, thinking about this some more, I think you should use do_wp_page()
> > > > and not wp_page_copy(): it would avoid a bunch of the code above, and
> > > > you are also not properly handling KSM pages or pages in the swap cache.
> > > > Instead of duplicating the code that is in do_wp_page() it would be
> > > > better to call it here.
> > > 
> > > Yeah, it makes sense to me.  Then here's my plan:
> > > 
> > > - I'll need to drop the previous patch "export wp_page_copy" since it
> > >   will no longer be needed
> > > 
> > > - I'll introduce another patch to split the current do_wp_page() and
> > >   introduce a function "wp_page_copy_cont" (better suggestions on the
> > >   naming would be welcome) which contains most of the wp handling
> > >   that'll be needed for change_pte_range() in this patch and isolates
> > >   the uffd handling:
> > > 
> > > static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > > 	__releases(vmf->ptl)
> > > {
> > > 	struct vm_area_struct *vma = vmf->vma;
> > > 
> > > 	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> > > 		pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > 		return handle_userfault(vmf, VM_UFFD_WP);
> > > 	}
> > > 
> > > 	return do_wp_page_cont(vmf);
> > > }
> > > 
> > > Then I can probably use do_wp_page_cont() in this patch.
> > 
> > Instead I would keep the do_wp_page name and do:
> >     static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf) {
> >         ... // what you have above
> >         return do_wp_page(vmf);
> >     }
> > 
> > Naming-wise I think it would be better to keep do_wp_page() as
> > is.
> 
> In case I misunderstood... what I've proposed will be simply:
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 64bd8075f054..ab98a1eb4702 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2497,6 +2497,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>                 return handle_userfault(vmf, VM_UFFD_WP);
>         }
> 
> +       return do_wp_page_cont(vmf);
> +}
> +
> +vm_fault_t do_wp_page_cont(struct vm_fault *vmf)
> +       __releases(vmf->ptl)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +
>         vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>         if (!vmf->page) {
>                 /*
> 
> And the other proposal is:
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 64bd8075f054..a73792127553 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2469,6 +2469,8 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
>         return VM_FAULT_WRITE;
>  }
> 
> +static vm_fault_t do_wp_page(struct vm_fault *vmf);
> +
>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2487,7 +2489,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
>   * but allow concurrent faults), with pte both mapped and locked.
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
> -static vm_fault_t do_wp_page(struct vm_fault *vmf)
> +static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf)
>         __releases(vmf->ptl)
>  {
>         struct vm_area_struct *vma = vmf->vma;
> @@ -2497,6 +2499,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>                 return handle_userfault(vmf, VM_UFFD_WP);
>         }
> 
> +       return do_wp_page(vmf);
> +}
> +
> +static vm_fault_t do_wp_page(struct vm_fault *vmf)
> +       __releases(vmf->ptl)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +
>         vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>         if (!vmf->page) {
>                 /*
> @@ -2869,7 +2879,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         }
> 
>         if (vmf->flags & FAULT_FLAG_WRITE) {
> -               ret |= do_wp_page(vmf);
> +               ret |= do_userfaultfd_wp_page(vmf);
>                 if (ret & VM_FAULT_ERROR)
>                         ret &= VM_FAULT_ERROR;
>                 goto out;
> @@ -3831,7 +3841,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>                 goto unlock;
>         if (vmf->flags & FAULT_FLAG_WRITE) {
>                 if (!pte_write(entry))
> -                       return do_wp_page(vmf);
> +                       return do_userfaultfd_wp_page(vmf);
>                 entry = pte_mkdirty(entry);
>         }
>         entry = pte_mkyoung(entry);
> 
> I would prefer the 1st approach since it contains fewer lines of
> changes (it does not touch the callers), and the naming in the 2nd
> approach can be a bit confusing (calling do_userfaultfd_wp_page in
> handle_pte_fault may let people think of a userfault-only path, but
> actually it covers the general path).  But if you really like the 2nd
> one I can use that too.

Maybe move the userfaultfd code to a small helper, call it first at the
call sites of do_wp_page(), and call do_wp_page() if it does not fire, ie:

/* Returns true (and fills *ret) when the userfaultfd path fired */
bool do_userfaultfd_wp(struct vm_fault *vmf, vm_fault_t *ret)
{
    if (userfaultfd_pte_wp(vmf->vma, *vmf->pte)) {
        pte_unmap_unlock(vmf->pte, vmf->ptl);
        *ret = handle_userfault(vmf, VM_UFFD_WP);
        return true;
    }
    return false;
}

then
     if (vmf->flags & FAULT_FLAG_WRITE) {
            if (do_userfaultfd_wp(vmf, &tmp))
                ret |= tmp;
            else
                ret |= do_wp_page(vmf);
            if (ret & VM_FAULT_ERROR)
                ret &= VM_FAULT_ERROR;
            goto out;

and:
    if (vmf->flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry)) {
            if (do_userfaultfd_wp(vmf, &ret))
                return ret;
            else
                return do_wp_page(vmf);
        }
Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp
  2019-04-23 15:34               ` Jerome Glisse
@ 2019-04-24  8:38                 ` Peter Xu
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Xu @ 2019-04-24  8:38 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Marty McFadden, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Apr 23, 2019 at 11:34:56AM -0400, Jerome Glisse wrote:
> On Tue, Apr 23, 2019 at 11:00:30AM +0800, Peter Xu wrote:
> > On Mon, Apr 22, 2019 at 10:54:02AM -0400, Jerome Glisse wrote:
> > > On Mon, Apr 22, 2019 at 08:20:10PM +0800, Peter Xu wrote:
> > > > On Fri, Apr 19, 2019 at 11:02:53AM -0400, Jerome Glisse wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > > > +			if (uffd_wp_resolve) {
> > > > > > > > +				/* If the fault is resolved already, skip */
> > > > > > > > +				if (!pte_uffd_wp(*pte))
> > > > > > > > +					continue;
> > > > > > > > +				page = vm_normal_page(vma, addr, oldpte);
> > > > > > > > +				if (!page || page_mapcount(page) > 1) {
> > > > > > > > +					struct vm_fault vmf = {
> > > > > > > > +						.vma = vma,
> > > > > > > > +						.address = addr & PAGE_MASK,
> > > > > > > > +						.page = page,
> > > > > > > > +						.orig_pte = oldpte,
> > > > > > > > +						.pmd = pmd,
> > > > > > > > +						/* pte and ptl not needed */
> > > > > > > > +					};
> > > > > > > > +					vm_fault_t ret;
> > > > > > > > +
> > > > > > > > +					if (page)
> > > > > > > > +						get_page(page);
> > > > > > > > +					arch_leave_lazy_mmu_mode();
> > > > > > > > +					pte_unmap_unlock(pte, ptl);
> > > > > > > > +					ret = wp_page_copy(&vmf);
> > > > > > > > +					/* PTE is changed, or OOM */
> > > > > > > > +					if (ret == 0)
> > > > > > > > +						/* It's done by others */
> > > > > > > > +						continue;
> > > > > > > 
> > > > > > > This is wrong: if ret == 0 you still need to remap the pte before
> > > > > > > continuing, as otherwise you will move to the next pte without holding
> > > > > > > the page table lock for the directory. So the 0 case must be handled
> > > > > > > after arch_enter_lazy_mmu_mode() below.
> > > > > > > 
> > > > > > > Sorry, I should have caught that in the previous review.
> > > > > > 
> > > > > > My fault for not having noticed it from the very beginning... thanks for
> > > > > > spotting that.
> > > > > > 
> > > > > > I'm squashing the changes below into the patch:
> > > > > 
> > > > > 
> > > > > Well, thinking about this some more, I think you should use do_wp_page()
> > > > > and not wp_page_copy(): it would avoid a bunch of the code above, and
> > > > > you are also not properly handling KSM pages or pages in the swap cache.
> > > > > Instead of duplicating the code that is in do_wp_page() it would be
> > > > > better to call it here.
> > > > 
> > > > Yeah, it makes sense to me.  Then here's my plan:
> > > > 
> > > > - I'll need to drop the previous patch "export wp_page_copy" since it
> > > >   will no longer be needed
> > > > 
> > > > - I'll introduce another patch to split the current do_wp_page() and
> > > >   introduce a function "wp_page_copy_cont" (better suggestions on the
> > > >   naming would be welcome) which contains most of the wp handling
> > > >   that'll be needed for change_pte_range() in this patch and isolates
> > > >   the uffd handling:
> > > > 
> > > > static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > > > 	__releases(vmf->ptl)
> > > > {
> > > > 	struct vm_area_struct *vma = vmf->vma;
> > > > 
> > > > 	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> > > > 		pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > > 		return handle_userfault(vmf, VM_UFFD_WP);
> > > > 	}
> > > > 
> > > > 	return do_wp_page_cont(vmf);
> > > > }
> > > > 
> > > > Then I can probably use do_wp_page_cont() in this patch.
> > > 
> > > Instead i would keep the do_wp_page name and do:
> > >     static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf) {
> > >         ... // what you have above
> > >         return do_wp_page(vmf);
> > >     }
> > > 
> > > Naming-wise I think it would be better to keep do_wp_page() as
> > > is.
> > 
> > In case I misunderstood... what I've proposed is simply:
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 64bd8075f054..ab98a1eb4702 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2497,6 +2497,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> >                 return handle_userfault(vmf, VM_UFFD_WP);
> >         }
> > 
> > +       return do_wp_page_cont(vmf);
> > +}
> > +
> > +vm_fault_t do_wp_page_cont(struct vm_fault *vmf)
> > +       __releases(vmf->ptl)
> > +{
> > +       struct vm_area_struct *vma = vmf->vma;
> > +
> >         vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
> >         if (!vmf->page) {
> >                 /*
> > 
> > And the other proposal is:
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 64bd8075f054..a73792127553 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2469,6 +2469,8 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
> >         return VM_FAULT_WRITE;
> >  }
> > 
> > +static vm_fault_t do_wp_page(struct vm_fault *vmf);
> > +
> >  /*
> >   * This routine handles present pages, when users try to write
> >   * to a shared page. It is done by copying the page to a new address
> > @@ -2487,7 +2489,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
> >   * but allow concurrent faults), with pte both mapped and locked.
> >   * We return with mmap_sem still held, but pte unmapped and unlocked.
> >   */
> > -static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > +static vm_fault_t do_userfaultfd_wp_page(struct vm_fault *vmf)
> >         __releases(vmf->ptl)
> >  {
> >         struct vm_area_struct *vma = vmf->vma;
> > @@ -2497,6 +2499,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> >                 return handle_userfault(vmf, VM_UFFD_WP);
> >         }
> > 
> > +       return do_wp_page(vmf);
> > +}
> > +
> > +static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > +       __releases(vmf->ptl)
> > +{
> > +       struct vm_area_struct *vma = vmf->vma;
> > +
> >         vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
> >         if (!vmf->page) {
> >                 /*
> > @@ -2869,7 +2879,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         }
> > 
> >         if (vmf->flags & FAULT_FLAG_WRITE) {
> > -               ret |= do_wp_page(vmf);
> > +               ret |= do_userfaultfd_wp_page(vmf);
> >                 if (ret & VM_FAULT_ERROR)
> >                         ret &= VM_FAULT_ERROR;
> >                 goto out;
> > @@ -3831,7 +3841,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> >                 goto unlock;
> >         if (vmf->flags & FAULT_FLAG_WRITE) {
> >                 if (!pte_write(entry))
> > -                       return do_wp_page(vmf);
> > +                       return do_userfaultfd_wp_page(vmf);
> >                 entry = pte_mkdirty(entry);
> >         }
> >         entry = pte_mkyoung(entry);
> > 
> > I would prefer the 1st approach: not only does it contain fewer changed
> > lines, since it does not touch the callers, but the naming in the 2nd
> > approach can also be a bit confusing (calling do_userfaultfd_wp_page()
> > in handle_pte_fault() may make people think of a userfault-only path,
> > when it actually covers the general path).  But if you really prefer
> > the 2nd one I can use that too.
> 
> Maybe move the userfaultfd code to a small helper, call it first at each
> call site of do_wp_page(), and fall back to do_wp_page() if it does not
> fire, ie:
> 
> bool do_userfaultfd_wp(struct vm_fault *vmf, int ret)
> {
>     if (handleuserfault) return true;
>     return false;
> }
> 
> then
>      if (vmf->flags & FAULT_FLAG_WRITE) {
>             if (do_userfaultfd_wp(vmf, tmp)) {
>                 ret |= tmp;
>             } else
>                 ret |= do_wp_page(vmf);
>             if (ret & VM_FAULT_ERROR)
>                 ret &= VM_FAULT_ERROR;
>             goto out;
> 
> and:
>     if (vmf->flags & FAULT_FLAG_WRITE) {
>         if (!pte_write(entry)) {
>             if (do_userfaultfd_wp(vmf, ret))
>                 return ret;
>             else
>                 return do_wp_page(vmf);
>         }
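
For what it's worth, a compilable shape for that helper could be (an
editor's sketch: the userfaultfd_pte_wp() test is lifted from the
do_wp_page() prologue quoted earlier, and the vm_fault_t out-parameter
is an assumption, since the snippet above passes ret by value; the call
sites would then pass &tmp and &ret instead):

	/*
	 * Returns true, with *ret filled in and the ptl dropped, when the
	 * wp fault must be forwarded to userspace instead of resolved here.
	 */
	static bool do_userfaultfd_wp(struct vm_fault *vmf, vm_fault_t *ret)
	{
		if (userfaultfd_pte_wp(vmf->vma, *vmf->pte)) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			*ret = handle_userfault(vmf, VM_UFFD_WP);
			return true;
		}
		return false;
	}

Both call sites then reduce to the same try-userfault-else-do_wp_page()
pattern, which is the duplication Peter notes below.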

But then we would be duplicating the same code pattern at both call sites, no? :-/

I'll think them over...  Thanks for all these suggestions!

-- 
Peter Xu



Thread overview: 51+ messages
2019-03-20  2:06 [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
2019-03-20  2:06 ` [PATCH v3 01/28] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
2019-03-20  2:06 ` [PATCH v3 02/28] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
2019-03-20  2:06 ` [PATCH v3 03/28] userfaultfd: don't retake mmap_sem to emulate NOPAGE Peter Xu
2019-03-20  2:06 ` [PATCH v3 04/28] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
2019-04-18 20:11   ` Jerome Glisse
2019-04-19  6:00     ` Peter Xu
2019-03-20  2:06 ` [PATCH v3 05/28] mm: gup: " Peter Xu
2019-03-20  2:06 ` [PATCH v3 06/28] userfaultfd: wp: add helper for writeprotect check Peter Xu
2019-03-20  2:06 ` [PATCH v3 07/28] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
2019-04-18 20:03   ` Jerome Glisse
2019-03-20  2:06 ` [PATCH v3 08/28] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
2019-03-20  2:06 ` [PATCH v3 09/28] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
2019-03-20  2:06 ` [PATCH v3 10/28] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
2019-03-20  2:06 ` [PATCH v3 11/28] mm: merge parameters for change_protection() Peter Xu
2019-03-20  2:06 ` [PATCH v3 12/28] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
2019-03-20  2:06 ` [PATCH v3 13/28] mm: export wp_page_copy() Peter Xu
2019-03-20  2:06 ` [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
2019-04-18 20:51   ` Jerome Glisse
2019-04-19  6:26     ` Peter Xu
2019-04-19 15:02       ` Jerome Glisse
2019-04-22 12:20         ` Peter Xu
2019-04-22 14:54           ` Jerome Glisse
2019-04-23  3:00             ` Peter Xu
2019-04-23 15:34               ` Jerome Glisse
2019-04-24  8:38                 ` Peter Xu
2019-03-20  2:06 ` [PATCH v3 15/28] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
2019-03-20  2:06 ` [PATCH v3 16/28] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
2019-03-20  2:06 ` [PATCH v3 17/28] userfaultfd: wp: support swap and page migration Peter Xu
2019-04-18 20:59   ` Jerome Glisse
2019-04-19  7:42     ` Peter Xu
2019-04-19 15:08       ` Jerome Glisse
2019-04-22 12:23         ` Peter Xu
2019-03-20  2:06 ` [PATCH v3 18/28] khugepaged: skip collapse if uffd-wp detected Peter Xu
2019-03-20  2:06 ` [PATCH v3 19/28] userfaultfd: introduce helper vma_find_uffd Peter Xu
2019-03-20  2:06 ` [PATCH v3 20/28] userfaultfd: wp: support write protection for userfault vma range Peter Xu
2019-03-20  2:06 ` [PATCH v3 21/28] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
2019-03-20  2:06 ` [PATCH v3 22/28] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
2019-03-22 21:37   ` Mike Rapoport
2019-03-20  2:06 ` [PATCH v3 23/28] userfaultfd: wp: don't wake up when doing write protect Peter Xu
2019-03-20  2:06 ` [PATCH v3 24/28] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
2019-03-22 21:46   ` Mike Rapoport
2019-03-20  2:06 ` [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range Peter Xu
2019-04-18 21:01   ` Jerome Glisse
2019-03-20  2:06 ` [PATCH v3 26/28] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally Peter Xu
2019-03-22 21:43   ` Mike Rapoport
2019-03-20  2:06 ` [PATCH v3 27/28] userfaultfd: selftests: refactor statistics Peter Xu
2019-03-20  2:06 ` [PATCH v3 28/28] userfaultfd: selftests: add write-protect test Peter Xu
2019-04-09  6:08 ` [PATCH v3 00/28] userfaultfd: write protection support Peter Xu
2019-04-18 21:07   ` Jerome Glisse
2019-04-19  7:53     ` Peter Xu
