linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/26] userfaultfd: write protection support
@ 2019-02-12  2:56 Peter Xu
  2019-02-12  2:56 ` [PATCH v2 01/26] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
                   ` (25 more replies)
  0 siblings, 26 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This series implements initial write protection support for
userfaultfd.  Currently only anonymous memory is supported; shmem and
hugetlbfs are not yet.  This is the 2nd version of the series.

The latest code can also be found at:

  https://github.com/xzpeter/linux/tree/uffd-wp-merged

Since there was no objection to the design in the previous RFC series,
and the tree has already been run through various tests, I'm dropping
the RFC tag starting from this version.

During the previous v1 discussion, Mike asked about using userfaultfd
to track mprotect()-allowed processes.  So far I don't have a good
idea of how that could work easily, so I'll assume it's not an
initial goal of the current uffd-wp work.

Note again that the first 5 patches in the series can be seen as
isolated work on the page fault mechanism.  I would hope that they
can be reviewed/picked even earlier than the rest of the series,
since they are useful even for the existing userfaultfd MISSING case
[8].

v2 changelog:
- add some r-bs
- split the patch "mm: userfault: return VM_FAULT_RETRY on signals"
  into two: one focusing on the signal behavior change, the other
  removing the NOPAGE special path in handle_userfault().  Removed
  the ARC-specific change and the corresponding part of the commit
  message, since it's already fixed in 4d447455e73b [Jerome]
- return -ENOENT when VMA is invalid for UFFDIO_WRITEPROTECT to match
  UFFDIO_COPY errno [Mike]
- add a new patch to introduce helper to find valid VMA for uffd
  [Mike]
- check against VM_MAYWRITE instead of VM_WRITE when registering UFFD
  WP [Mike]
- MM_CP_DIRTY_ACCT is used incorrectly, fix it up [Jerome]
- make sure the lock_page behavior will not be changed [Jerome]
- reorder the whole series, introduce the new ioctl last. [Jerome]
- fix up uffdio_writeprotect() following commit df2cc96e77011cf79
  to return -EAGAIN when mm layout changes are detected [Mike]

v1 can be found at: https://lkml.org/lkml/2019/1/21/130

Any comments would be greatly welcomed.  Thanks.

Overview
====================

The uffd-wp work was initiated by Shaohua Li [1] and later continued
by Andrea [2].  This series is based upon Andrea's latest userfaultfd
tree and is a continuation of the work from both Shaohua and Andrea.
Many of the follow-up ideas come from Andrea too.

Besides the old MISSING register mode of userfaultfd, the new uffd-wp
support provides an alternative register mode called
UFFDIO_REGISTER_MODE_WP that can be used to listen not only for
missing page faults but also for write protection page faults; the
two modes can also be registered together.  At the same time, the new
feature provides a new userfaultfd ioctl called UFFDIO_WRITEPROTECT
which allows userspace to write protect a range of memory or fix up
the write permission of faulted pages.

Please refer to the document patch "userfaultfd: wp:
UFFDIO_REGISTER_MODE_WP documentation update" for more information on
the new interface and what it can do.

The major workflow of an uffd-wp program should be:

  1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP

  2. Write protect part (or all) of the registered region using
     UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
     show that we want to write protect the range.

  3. Start a working thread that modifies the protected pages,
     meanwhile listening to UFFD messages.

  4. When a write is detected upon the protected range, a page fault
     happens, and a UFFD message will be generated and reported to
     the page fault handling thread.

  5. The page fault handler thread resolves the page fault using the
     new UFFDIO_WRITEPROTECT ioctl, this time without
     UFFDIO_WRITEPROTECT_MODE_WP set, showing that we want to restore
     the write permission.  Before this operation, the fault handler
     thread can do anything it wants, e.g., dump the page to
     persistent storage.

  6. The worker thread will continue running with the correctly
     applied write permission from step 5.

There are already two projects based on this new userfaultfd feature.

QEMU Live Snapshot: The project provides a way to allow the QEMU
                    hypervisor to take snapshots of VMs without
                    stopping them [3].

LLNL umap library:  The project provides a mmap-like interface and
                    "allow to have an application specific buffer of
                    pages cached from a large file, i.e. out-of-core
                    execution using memory map" [4][5].

Before posting the patchset, this series was smoke tested against QEMU
live snapshot and the LLNL umap library (by doing parallel quicksort
using 128 sorting threads + 80 uffd servicing threads).  My sincere
thanks to Marty McFadden and Denis Plotnikov for the help along the
way.

TODO
=============

- hugetlbfs/shmem support
- performance
- more architectures
- cooperate with mprotect()-allowed processes (???)
- ...

References
==========

[1] https://lwn.net/Articles/666187/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
[3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
[4] https://github.com/LLNL/umap
[5] https://llnl-umap.readthedocs.io/en/develop/
[6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
[7] https://lkml.org/lkml/2018/11/21/370
[8] https://lkml.org/lkml/2018/12/30/64

Andrea Arcangeli (5):
  userfaultfd: wp: hook userfault handler to write protection fault
  userfaultfd: wp: add WP pagetable tracking to x86
  userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
  userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  userfaultfd: wp: add the writeprotect API to userfaultfd ioctl

Martin Cracauer (1):
  userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update

Peter Xu (17):
  mm: gup: rename "nonblocking" to "locked" where proper
  mm: userfault: return VM_FAULT_RETRY on signals
  userfaultfd: don't retake mmap_sem to emulate NOPAGE
  mm: allow VM_FAULT_RETRY for multiple times
  mm: gup: allow VM_FAULT_RETRY for multiple times
  mm: merge parameters for change_protection()
  userfaultfd: wp: apply _PAGE_UFFD_WP bit
  mm: export wp_page_copy()
  userfaultfd: wp: handle COW properly for uffd-wp
  userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
  userfaultfd: wp: support swap and page migration
  khugepaged: skip collapse if uffd-wp detected
  userfaultfd: introduce helper vma_find_uffd
  userfaultfd: wp: don't wake up when doing write protect
  userfaultfd: selftests: refactor statistics
  userfaultfd: selftests: add write-protect test

Shaohua Li (3):
  userfaultfd: wp: add helper for writeprotect check
  userfaultfd: wp: support write protection for userfault vma range
  userfaultfd: wp: enabled write protection in userfaultfd API

 Documentation/admin-guide/mm/userfaultfd.rst |  51 +++++
 arch/alpha/mm/fault.c                        |   4 +-
 arch/arc/mm/fault.c                          |  12 +-
 arch/arm/mm/fault.c                          |   9 +-
 arch/arm64/mm/fault.c                        |  11 +-
 arch/hexagon/mm/vm_fault.c                   |   3 +-
 arch/ia64/mm/fault.c                         |   3 +-
 arch/m68k/mm/fault.c                         |   5 +-
 arch/microblaze/mm/fault.c                   |   3 +-
 arch/mips/mm/fault.c                         |   3 +-
 arch/nds32/mm/fault.c                        |   7 +-
 arch/nios2/mm/fault.c                        |   5 +-
 arch/openrisc/mm/fault.c                     |   3 +-
 arch/parisc/mm/fault.c                       |   4 +-
 arch/powerpc/mm/fault.c                      |   7 +-
 arch/riscv/mm/fault.c                        |   9 +-
 arch/s390/mm/fault.c                         |  14 +-
 arch/sh/mm/fault.c                           |   5 +-
 arch/sparc/mm/fault_32.c                     |   4 +-
 arch/sparc/mm/fault_64.c                     |   4 +-
 arch/um/kernel/trap.c                        |   6 +-
 arch/unicore32/mm/fault.c                    |  10 +-
 arch/x86/Kconfig                             |   1 +
 arch/x86/include/asm/pgtable.h               |  67 ++++++
 arch/x86/include/asm/pgtable_64.h            |   8 +-
 arch/x86/include/asm/pgtable_types.h         |  11 +-
 arch/x86/mm/fault.c                          |   7 +-
 arch/xtensa/mm/fault.c                       |   4 +-
 fs/userfaultfd.c                             | 114 ++++++----
 include/asm-generic/pgtable.h                |   1 +
 include/asm-generic/pgtable_uffd.h           |  66 ++++++
 include/linux/huge_mm.h                      |   2 +-
 include/linux/mm.h                           |  21 +-
 include/linux/swapops.h                      |   2 +
 include/linux/userfaultfd_k.h                |  42 +++-
 include/trace/events/huge_memory.h           |   1 +
 include/uapi/linux/userfaultfd.h             |  28 ++-
 init/Kconfig                                 |   5 +
 mm/filemap.c                                 |   2 +-
 mm/gup.c                                     |  61 ++---
 mm/huge_memory.c                             |  28 ++-
 mm/hugetlb.c                                 |   8 +-
 mm/khugepaged.c                              |  23 ++
 mm/memory.c                                  |  28 ++-
 mm/mempolicy.c                               |   2 +-
 mm/migrate.c                                 |   7 +
 mm/mprotect.c                                |  98 ++++++--
 mm/rmap.c                                    |   6 +
 mm/userfaultfd.c                             | 148 ++++++++++---
 tools/testing/selftests/vm/userfaultfd.c     | 222 ++++++++++++++-----
 50 files changed, 919 insertions(+), 276 deletions(-)
 create mode 100644 include/asm-generic/pgtable_uffd.h

-- 
2.17.1


* [PATCH v2 01/26] mm: gup: rename "nonblocking" to "locked" where proper
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 15:17   ` Jerome Glisse
  2019-02-12  2:56 ` [PATCH v2 02/26] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
                   ` (24 subsequent siblings)
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

There are plenty of places around __get_user_pages() that have a
parameter "nonblocking" which does not really mean "it won't block"
(because it really can block); instead it indicates whether the
mmap_sem is released by up_read() during the page fault handling,
mostly when VM_FAULT_RETRY is returned.

We have the correct naming in e.g. get_user_pages_locked() and
get_user_pages_remote() as "locked", however there are still many
places that use "nonblocking" as the name.

Rename those places to "locked" where proper to better suit the
functionality of the variable.  While at it, fix up some of the
comments accordingly.

Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 44 +++++++++++++++++++++-----------------------
 mm/hugetlb.c |  8 ++++----
 2 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 05acd7e2eb22..fa75a03204c1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -506,12 +506,12 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 }
 
 /*
- * mmap_sem must be held on entry.  If @nonblocking != NULL and
- * *@flags does not include FOLL_NOWAIT, the mmap_sem may be released.
- * If it is, *@nonblocking will be set to 0 and -EBUSY returned.
+ * mmap_sem must be held on entry.  If @locked != NULL and *@flags
+ * does not include FOLL_NOWAIT, the mmap_sem may be released.  If it
+ * is, *@locked will be set to 0 and -EBUSY returned.
  */
 static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
-		unsigned long address, unsigned int *flags, int *nonblocking)
+		unsigned long address, unsigned int *flags, int *locked)
 {
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
@@ -523,7 +523,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_WRITE;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
-	if (nonblocking)
+	if (locked)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
@@ -549,8 +549,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	}
 
 	if (ret & VM_FAULT_RETRY) {
-		if (nonblocking && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-			*nonblocking = 0;
+		if (locked && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
+			*locked = 0;
 		return -EBUSY;
 	}
 
@@ -627,7 +627,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  *		only intends to ensure the pages are faulted in.
  * @vmas:	array of pointers to vmas corresponding to each page.
  *		Or NULL if the caller does not require them.
- * @nonblocking: whether waiting for disk IO or mmap_sem contention
+ * @locked:     whether we're still with the mmap_sem held
  *
  * Returns number of pages pinned. This may be fewer than the number
  * requested. If nr_pages is 0 or negative, returns 0. If no pages
@@ -656,13 +656,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  * appropriate) must be called after the page is finished with, and
  * before put_page is called.
  *
- * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
- * or mmap_sem contention, and if waiting is needed to pin all pages,
- * *@nonblocking will be set to 0.  Further, if @gup_flags does not
- * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
- * this case.
+ * If @locked != NULL, *@locked will be set to 0 when mmap_sem is
+ * released by an up_read().  That can happen if @gup_flags does not
+ * have FOLL_NOWAIT.
  *
- * A caller using such a combination of @nonblocking and @gup_flags
+ * A caller using such a combination of @locked and @gup_flags
  * must therefore hold the mmap_sem for reading only, and recognize
  * when it's been released.  Otherwise, it must be held for either
  * reading or writing and will not be released.
@@ -674,7 +672,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
-		struct vm_area_struct **vmas, int *nonblocking)
+		struct vm_area_struct **vmas, int *locked)
 {
 	long ret = 0, i = 0;
 	struct vm_area_struct *vma = NULL;
@@ -718,7 +716,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (is_vm_hugetlb_page(vma)) {
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags, nonblocking);
+						gup_flags, locked);
 				continue;
 			}
 		}
@@ -736,7 +734,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,
-					nonblocking);
+					   locked);
 			switch (ret) {
 			case 0:
 				goto retry;
@@ -1195,7 +1193,7 @@ EXPORT_SYMBOL(get_user_pages_longterm);
  * @vma:   target vma
  * @start: start address
  * @end:   end address
- * @nonblocking:
+ * @locked: whether the mmap_sem is still held
  *
  * This takes care of mlocking the pages too if VM_LOCKED is set.
  *
@@ -1203,14 +1201,14 @@ EXPORT_SYMBOL(get_user_pages_longterm);
  *
  * vma->vm_mm->mmap_sem must be held.
  *
- * If @nonblocking is NULL, it may be held for read or write and will
+ * If @locked is NULL, it may be held for read or write and will
  * be unperturbed.
  *
- * If @nonblocking is non-NULL, it must held for read only and may be
- * released.  If it's released, *@nonblocking will be set to 0.
+ * If @locked is non-NULL, it must be held for read only and may be
+ * released.  If it's released, *@locked will be set to 0.
  */
 long populate_vma_page_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, int *nonblocking)
+		unsigned long start, unsigned long end, int *locked)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long nr_pages = (end - start) / PAGE_SIZE;
@@ -1245,7 +1243,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	 * not result in a stack expansion that recurses back here.
 	 */
 	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
-				NULL, NULL, nonblocking);
+				NULL, NULL, locked);
 }
 
 /*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index afef61656c1e..e3c738bde72e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4180,7 +4180,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
 			 unsigned long *position, unsigned long *nr_pages,
-			 long i, unsigned int flags, int *nonblocking)
+			 long i, unsigned int flags, int *locked)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
@@ -4251,7 +4251,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				spin_unlock(ptl);
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
-			if (nonblocking)
+			if (locked)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 			if (flags & FOLL_NOWAIT)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
@@ -4268,9 +4268,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				break;
 			}
 			if (ret & VM_FAULT_RETRY) {
-				if (nonblocking &&
+				if (locked &&
 				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-					*nonblocking = 0;
+					*locked = 0;
 				*nr_pages = 0;
 				/*
 				 * VM_FAULT_RETRY must not return an
-- 
2.17.1


* [PATCH v2 02/26] mm: userfault: return VM_FAULT_RETRY on signals
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
  2019-02-12  2:56 ` [PATCH v2 01/26] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 15:29   ` Jerome Glisse
  2019-02-12  2:56 ` [PATCH v2 03/26] userfaultfd: don't retake mmap_sem to emulate NOPAGE Peter Xu
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

The idea comes from the upstream discussion between Linus and Andrea:

  https://lkml.org/lkml/2017/10/30/560

A summary of the issue: in the past there was a special path in
handle_userfault() where we would return VM_FAULT_NOPAGE when we
detected non-fatal signals while waiting for userfault handling.  We
did that by reacquiring the mmap_sem before returning.  However, that
brings a risk: the vmas might have changed when we retake the
mmap_sem, and we could even be holding an invalid vma structure.

This patch removes the special path, so we'll return VM_FAULT_RETRY
via the common path even if we have got such signals.  Then, for all
the architectures that pass FAULT_FLAG_ALLOW_RETRY into
handle_mm_fault(), we check not only for SIGKILL but for all pending
userspace signals right after returning from handle_mm_fault().  This
allows userspace to handle non-fatal signals faster than before.

This patch is preparation for the next patch, which finally removes
the special code path mentioned above in handle_userfault().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c      |  2 +-
 arch/arc/mm/fault.c        | 11 ++++-------
 arch/arm/mm/fault.c        |  6 +++---
 arch/arm64/mm/fault.c      |  6 +++---
 arch/hexagon/mm/vm_fault.c |  2 +-
 arch/ia64/mm/fault.c       |  2 +-
 arch/m68k/mm/fault.c       |  2 +-
 arch/microblaze/mm/fault.c |  2 +-
 arch/mips/mm/fault.c       |  2 +-
 arch/nds32/mm/fault.c      |  6 +++---
 arch/nios2/mm/fault.c      |  2 +-
 arch/openrisc/mm/fault.c   |  2 +-
 arch/parisc/mm/fault.c     |  2 +-
 arch/powerpc/mm/fault.c    |  2 ++
 arch/riscv/mm/fault.c      |  4 ++--
 arch/s390/mm/fault.c       |  9 ++++++---
 arch/sh/mm/fault.c         |  4 ++++
 arch/sparc/mm/fault_32.c   |  3 +++
 arch/sparc/mm/fault_64.c   |  3 +++
 arch/um/kernel/trap.c      |  5 ++++-
 arch/unicore32/mm/fault.c  |  4 ++--
 arch/x86/mm/fault.c        |  6 +++++-
 arch/xtensa/mm/fault.c     |  3 +++
 23 files changed, 56 insertions(+), 34 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index d73dc473fbb9..46e5e420ad2a 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	   the fault.  */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 8df1638259f3..dc5f1b8859d2 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -141,17 +141,14 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if (fatal_signal_pending(current)) {
-
+	if (unlikely(fault & VM_FAULT_RETRY && signal_pending(current))) {
+		if (fatal_signal_pending(current) && !user_mode(regs))
+			goto no_context;
 		/*
 		 * if fault retry, mmap_sem already relinquished by core mm
 		 * so OK to return to user mode (with signal handled first)
 		 */
-		if (fault & VM_FAULT_RETRY) {
-			if (!user_mode(regs))
-				goto no_context;
-			return;
-		}
+		return;
 	}
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 58f69fa07df9..c41c021bbe40 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -314,12 +314,12 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 
 	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
 
-	/* If we need to retry but a fatal signal is pending, handle the
+	/* If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		if (!user_mode(regs))
+	if (unlikely(fault & VM_FAULT_RETRY && signal_pending(current))) {
+		if (fatal_signal_pending(current) && !user_mode(regs))
 			goto no_context;
 		return 0;
 	}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index efb7b2cbead5..a38ff8c49a66 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -512,13 +512,13 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 
 	if (fault & VM_FAULT_RETRY) {
 		/*
-		 * If we need to retry but a fatal signal is pending,
+		 * If we need to retry but a signal is pending,
 		 * handle the signal first. We do not need to release
 		 * the mmap_sem because it would already be released
 		 * in __lock_page_or_retry in mm/filemap.c.
 		 */
-		if (fatal_signal_pending(current)) {
-			if (!user_mode(regs))
+		if (signal_pending(current)) {
+			if (fatal_signal_pending(current) && !user_mode(regs))
 				goto no_context;
 			return 0;
 		}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index eb263e61daf4..be10b441d9cc 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -104,7 +104,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	/* The most common case -- we are done. */
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 5baeb022f474..62c2d39d2bed 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -163,7 +163,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 9b6163c05a75..d9808a807ab8 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,7 +138,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	fault = handle_mm_fault(vma, address, flags);
 	pr_debug("handle_mm_fault returns %x\n", fault);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return 0;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 202ad6a494f5..4fd2dbd0c5ca 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -217,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 73d8a0f0b810..92374fd091d2 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -154,7 +154,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 68d5f2a27f38..9f6e477b9e30 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -206,12 +206,12 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	fault = handle_mm_fault(vma, addr, flags);
 
 	/*
-	 * If we need to retry but a fatal signal is pending, handle the
+	 * If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		if (!user_mode(regs))
+	if (fault & VM_FAULT_RETRY && signal_pending(current)) {
+		if (fatal_signal_pending(current) && !user_mode(regs))
 			goto no_context;
 		return;
 	}
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 24fd84cf6006..5939434a31ae 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -134,7 +134,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index dc4dbafc1d83..873ecb5d82d7 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -165,7 +165,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index c8e8b7c05558..29422eec329d 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -303,7 +303,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 887f11bcf330..aaa853e6592f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -591,6 +591,8 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 			 */
 			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
+			if (is_user && signal_pending(current))
+				return 0;
 			if (!fatal_signal_pending(current))
 				goto retry;
 		}
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 88401d5125bc..4fc8d746bec3 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -123,11 +123,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	fault = handle_mm_fault(vma, addr, flags);
 
 	/*
-	 * If we need to retry but a fatal signal is pending, handle the
+	 * If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(tsk))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 11613362c4e7..aba1dad1efcd 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -476,9 +476,12 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
-	/* No reason to continue if interrupted by SIGKILL. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
-		fault = VM_FAULT_SIGNAL;
+	/* Do not continue if interrupted by signals. */
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
+		if (fatal_signal_pending(current))
+			fault = VM_FAULT_SIGNAL;
+		else
+			fault = 0;
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			goto out_up;
 		goto out;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 6defd2c6d9b1..baf5d73df40c 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -506,6 +506,10 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 			 * have already released it in __lock_page_or_retry
 			 * in mm/filemap.c.
 			 */
+
+			if (user_mode(regs) && signal_pending(tsk))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index b0440b0edd97..a2c83104fe35 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -269,6 +269,9 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(tsk))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 8f8a604c1300..cad71ec5c7b3 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -467,6 +467,9 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(current))
+				return;
+
 			goto retry;
 		}
 	}
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 0e8b6158f224..09baf37b65b9 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -76,8 +76,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 
 		fault = handle_mm_fault(vma, address, flags);
 
-		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+		if (fault & VM_FAULT_RETRY && signal_pending(current)) {
+			if (is_user && !fatal_signal_pending(current))
+				err = 0;
 			goto out_nosemaphore;
+		}
 
 		if (unlikely(fault & VM_FAULT_ERROR)) {
 			if (fault & VM_FAULT_OOM) {
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index b9a3a50644c1..3611f19234a1 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -248,11 +248,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 
 	fault = __do_pf(mm, addr, fsr, flags, tsk);
 
-	/* If we need to retry but a fatal signal is pending, handle the
+	/* If we need to retry but a signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if ((fault & VM_FAULT_RETRY) && signal_pending(current))
 		return 0;
 
 	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9d5c75f02295..248ff0a28ecd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1481,16 +1481,20 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * that we made any progress. Handle this case first.
 	 */
 	if (unlikely(fault & VM_FAULT_RETRY)) {
+		bool is_user = flags & FAULT_FLAG_USER;
+
 		/* Retry at most once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
 			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
+			if (is_user && signal_pending(tsk))
+				return;
 			if (!fatal_signal_pending(tsk))
 				goto retry;
 		}
 
 		/* User mode? Just return to handle the fatal exception */
-		if (flags & FAULT_FLAG_USER)
+		if (is_user)
 			return;
 
 		/* Not returning to user mode? Handle exceptions or die: */
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 2ab0e0dcd166..792dad5e2f12 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -136,6 +136,9 @@ void do_page_fault(struct pt_regs *regs)
 			 * in mm/filemap.c.
 			 */
 
+			if (user_mode(regs) && signal_pending(current))
+				return;
+
 			goto retry;
 		}
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH v2 03/26] userfaultfd: don't retake mmap_sem to emulate NOPAGE
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
  2019-02-12  2:56 ` [PATCH v2 01/26] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
  2019-02-12  2:56 ` [PATCH v2 02/26] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 15:34   ` Jerome Glisse
  2019-02-12  2:56 ` [PATCH v2 04/26] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
                   ` (22 subsequent siblings)
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

The idea comes from the upstream discussion between Linus and Andrea:

https://lkml.org/lkml/2017/10/30/560

A summary of the issue: handle_userfault() used to have a special path
that returned VM_FAULT_NOPAGE when a non-fatal signal was detected
while waiting for userfault handling.  We did that by reacquiring the
mmap_sem before returning.  However, that is risky: the vmas might
have changed by the time we retake the mmap_sem, and we could even be
holding an invalid vma structure.

This patch removes that risky path from handle_userfault(), so the
callers of handle_mm_fault() will know that the VMAs might have
changed.  Meanwhile, with the previous patch, we don't lose
responsiveness either, since the core mm code can now handle non-fatal
userspace signals quickly even if we return VM_FAULT_RETRY.
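
A toy model of the hazard, with mock structures (nothing here is the
kernel's real API): caching a vma pointer across a drop/retake of the
mmap_sem is unsafe, and the only safe move after relocking is to
re-walk the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Mock vma table; 'valid' stands in for "still present in the tree". */
struct toy_vma { unsigned long start, end; int valid; };

/* Simulates another thread changing the address space (e.g. via
 * munmap()) while the mmap_sem was dropped. */
static void concurrent_unmap(struct toy_vma *vma)
{
	vma->valid = 0;
}

/* After retaking the lock, the address must be looked up again; a
 * NULL result means a cached pointer would have been dangling. */
static struct toy_vma *refind_vma(struct toy_vma *tbl, size_t n,
				  unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].valid && addr >= tbl[i].start && addr < tbl[i].end)
			return &tbl[i];
	return NULL;
}
```

Returning VM_FAULT_RETRY instead forces exactly this re-walk on the
caller, which is why dropping the NOPAGE shortcut is safe here.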

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 89800fc7dc9d..b397bc3b954d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -514,30 +514,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	__set_current_state(TASK_RUNNING);
 
-	if (return_to_userland) {
-		if (signal_pending(current) &&
-		    !fatal_signal_pending(current)) {
-			/*
-			 * If we got a SIGSTOP or SIGCONT and this is
-			 * a normal userland page fault, just let
-			 * userland return so the signal will be
-			 * handled and gdb debugging works.  The page
-			 * fault code immediately after we return from
-			 * this function is going to release the
-			 * mmap_sem and it's not depending on it
-			 * (unlike gup would if we were not to return
-			 * VM_FAULT_RETRY).
-			 *
-			 * If a fatal signal is pending we still take
-			 * the streamlined VM_FAULT_RETRY failure path
-			 * and there's no need to retake the mmap_sem
-			 * in such case.
-			 */
-			down_read(&mm->mmap_sem);
-			ret = VM_FAULT_NOPAGE;
-		}
-	}
-
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run
-- 
2.17.1



* [PATCH v2 04/26] mm: allow VM_FAULT_RETRY for multiple times
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (2 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 03/26] userfaultfd: don't retake mmap_sem to emulate NOPAGE Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-13  3:34   ` Peter Xu
  2019-02-21  8:56   ` [PATCH v2.1 " Peter Xu
  2019-02-12  2:56 ` [PATCH v2 05/26] mm: gup: " Peter Xu
                   ` (21 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

The idea comes from a discussion between Linus and Andrea [1].

Before this patch we only allowed a page fault to retry once.  We
achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when calling
handle_mm_fault() the second time.  This was mainly meant to avoid
unexpected starvation of the system by looping forever on the page
fault of a single page.  However, that should hardly happen: in every
code path that returns VM_FAULT_RETRY, we first wait for a condition
(during which time we should possibly yield the cpu) before
VM_FAULT_RETRY is really returned.

This patch removes the restriction by keeping the
FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
that the page fault handler can now retry the page fault multiple
times if necessary without generating another page fault
event. Meanwhile we still keep the FAULT_FLAG_TRIED flag so the page
fault handler can still tell whether a page fault is the first attempt
or not.  One example is __lock_page_or_retry(): now we drop the
mmap_sem only on the first attempt of the page fault and keep it
across follow-up retries, so the old locking behavior is retained.

GUP code is not touched yet and will be covered in a follow-up patch.

This is a nice enhancement for the current code [2] and at the same
time supporting material for the future userfaultfd-writeprotect work,
since in that work there will always be an explicit userfault
writeprotect retry for protected pages, and if that cannot resolve the
page fault (e.g., when userfaultfd-writeprotect is used in conjunction
with swapped pages) then we'll possibly need a third retry of the page
fault.  It might also benefit other potential users with similar
requirements, like userfault write-protection.

Please read the thread below for more information.

[1] https://lkml.org/lkml/2017/11/2/833
[2] https://lkml.org/lkml/2018/12/30/64

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c      | 2 +-
 arch/arc/mm/fault.c        | 1 -
 arch/arm/mm/fault.c        | 3 ---
 arch/arm64/mm/fault.c      | 5 -----
 arch/hexagon/mm/vm_fault.c | 1 -
 arch/ia64/mm/fault.c       | 1 -
 arch/m68k/mm/fault.c       | 3 ---
 arch/microblaze/mm/fault.c | 1 -
 arch/mips/mm/fault.c       | 1 -
 arch/nds32/mm/fault.c      | 1 -
 arch/nios2/mm/fault.c      | 3 ---
 arch/openrisc/mm/fault.c   | 1 -
 arch/parisc/mm/fault.c     | 2 --
 arch/powerpc/mm/fault.c    | 5 -----
 arch/riscv/mm/fault.c      | 5 -----
 arch/s390/mm/fault.c       | 5 +----
 arch/sh/mm/fault.c         | 1 -
 arch/sparc/mm/fault_32.c   | 1 -
 arch/sparc/mm/fault_64.c   | 1 -
 arch/um/kernel/trap.c      | 1 -
 arch/unicore32/mm/fault.c  | 6 +-----
 arch/x86/mm/fault.c        | 1 -
 arch/xtensa/mm/fault.c     | 1 -
 mm/filemap.c               | 2 +-
 24 files changed, 4 insertions(+), 50 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 46e5e420ad2a..deae82bb83c1 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -169,7 +169,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index dc5f1b8859d2..664e18a8749f 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -167,7 +167,6 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 			}
 
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c41c021bbe40..7910b4b5205d 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -342,9 +342,6 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 					regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index a38ff8c49a66..d1d3c98f9ffb 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -523,12 +523,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 			return 0;
 		}
 
-		/*
-		 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
-		 * starvation.
-		 */
 		if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
-			mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			mm_flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index be10b441d9cc..576751597e77 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -115,7 +115,6 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 62c2d39d2bed..9de95d39935e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -189,7 +189,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index d9808a807ab8..b1b2109e4ab4 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -162,9 +162,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 4fd2dbd0c5ca..05a4847ac0bf 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -236,7 +236,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 92374fd091d2..9953b5b571df 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -178,7 +178,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 			tsk->min_flt++;
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 9f6e477b9e30..32259afc751a 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -242,7 +242,6 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 5939434a31ae..9dd1c51acc22 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -158,9 +158,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 873ecb5d82d7..ff92c5674781 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -185,7 +185,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 29422eec329d..7d3e96a9a7ab 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -327,8 +327,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-
 			/*
 			 * No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index aaa853e6592f..becebfe67e32 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -585,11 +585,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (unlikely(fault & VM_FAULT_RETRY)) {
 		/* We retry only once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			if (is_user && signal_pending(current))
 				return 0;
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 4fc8d746bec3..aad2c0557d2f 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -154,11 +154,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY);
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index aba1dad1efcd..4e8c066964a9 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -513,10 +513,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 				fault = VM_FAULT_PFAULT;
 				goto out_up;
 			}
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY |
-				   FAULT_FLAG_RETRY_NOWAIT);
+			flags &= ~FAULT_FLAG_RETRY_NOWAIT;
 			flags |= FAULT_FLAG_TRIED;
 			down_read(&mm->mmap_sem);
 			goto retry;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index baf5d73df40c..cd710e2d7c57 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -498,7 +498,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 				      regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index a2c83104fe35..6735cd1c09b9 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -261,7 +261,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cad71ec5c7b3..28d5b4d012c6 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -459,7 +459,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 09baf37b65b9..c63fc292aea0 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -99,7 +99,6 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 
 				goto retry;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 3611f19234a1..fdf577956f5f 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -260,12 +260,8 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 			tsk->maj_flt++;
 		else
 			tsk->min_flt++;
-		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		if (fault & VM_FAULT_RETRY)
 			goto retry;
-		}
 	}
 
 	up_read(&mm->mmap_sem);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 248ff0a28ecd..71d68aa03e43 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1485,7 +1485,6 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 		/* Retry at most once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			if (is_user && signal_pending(tsk))
 				return;
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 792dad5e2f12..7cd55f2d66c9 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -128,7 +128,6 @@ void do_page_fault(struct pt_regs *regs)
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..44942c78bb92 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1351,7 +1351,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
-	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+	if (!(flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * CAUTION! In this case, mmap_sem is not released
 		 * even though return 0.
-- 
2.17.1



* [PATCH v2 05/26] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (3 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 04/26] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 16:06   ` Jerome Glisse
  2019-02-12  2:56 ` [PATCH v2 06/26] userfaultfd: wp: add helper for writeprotect check Peter Xu
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This is the gup counterpart of the change that allows VM_FAULT_RETRY
to happen more than once.
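
The new loop can be sketched as below; fake_fault_once() and the
return convention are mock-ups of the real __get_user_pages() call,
kept only to show the "retry until the fault completes with the lock
still held" shape:

```c
#include <assert.h>

/* Pretend fault: drops the lock 'drops' times before finally pinning
 * one page with the lock still held. */
static int fake_fault_once(int *locked, int *drops)
{
	if (*drops > 0) {
		(*drops)--;
		*locked = 0;	/* handler released mmap_sem again */
		return 0;	/* no page pinned yet */
	}
	return 1;		/* one page pinned, lock held */
}

/* Mirrors the retry: label added in the patch; returns how many times
 * we looped before the fault completed with the lock held. */
static int gup_retry_loop(int drops)
{
	int locked, ret, retries = 0;
retry:
	locked = 1;			/* down_read(&mm->mmap_sem) */
	ret = fake_fault_once(&locked, &drops);
	if (!locked) {
		retries++;
		goto retry;		/* continue until we succeed */
	}
	return ret == 1 ? retries : -1;
}
```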

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index fa75a03204c1..ba387aec0d80 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -528,7 +528,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
 	if (*flags & FOLL_TRIED) {
-		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		/*
+		 * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
+		 * can co-exist
+		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
 
@@ -943,17 +946,23 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
 		pages += ret;
 		start += ret << PAGE_SHIFT;
+		lock_dropped = true;
 
+retry:
 		/*
 		 * Repeat on the address that fired VM_FAULT_RETRY
-		 * without FAULT_FLAG_ALLOW_RETRY but with
+		 * with both FAULT_FLAG_ALLOW_RETRY and
 		 * FAULT_FLAG_TRIED.
 		 */
 		*locked = 1;
-		lock_dropped = true;
 		down_read(&mm->mmap_sem);
 		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
-				       pages, NULL, NULL);
+				       pages, NULL, locked);
+		if (!*locked) {
+			/* Continue to retry until we succeeded */
+			BUG_ON(ret != 0);
+			goto retry;
+		}
 		if (ret != 1) {
 			BUG_ON(ret > 1);
 			if (!pages_done)
-- 
2.17.1



* [PATCH v2 06/26] userfaultfd: wp: add helper for writeprotect check
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (4 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 05/26] mm: gup: " Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 16:07   ` Jerome Glisse
  2019-02-25 15:41   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 07/26] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
                   ` (19 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Pavel Emelyanov, Rik van Riel

From: Shaohua Li <shli@fb.com>

Add a helper for the write-protect check.  It will be used later.
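
The helper pattern in the diff (a real flag test plus a stub when the
feature is compiled out) can be mimicked in userspace like this;
CONFIG_USERFAULTFD and the VM_UFFD_WP value are mocked here, not the
kernel's:

```c
#include <assert.h>

#define CONFIG_USERFAULTFD 1	/* pretend the feature is enabled */
#define VM_UFFD_WP 0x1000UL	/* illustrative value only */

struct toy_vma { unsigned long vm_flags; };

#ifdef CONFIG_USERFAULTFD
static int toy_userfaultfd_wp(const struct toy_vma *vma)
{
	return !!(vma->vm_flags & VM_UFFD_WP);
}
#else
/* Stub when the feature is compiled out: never report wp. */
static int toy_userfaultfd_wp(const struct toy_vma *vma)
{
	return 0;
}
#endif
```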

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 37c9eba75c98..38f748e7186e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -50,6 +50,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -94,6 +99,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;
-- 
2.17.1



* [PATCH v2 07/26] userfaultfd: wp: hook userfault handler to write protection fault
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (5 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 06/26] userfaultfd: wp: add helper for writeprotect check Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 16:25   ` Jerome Glisse
  2019-02-25 15:43   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 08/26] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
                   ` (18 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

There are several cases where a write protection fault can happen. It
could be a write to the zero page, to a swapped page, or to a
userfault write-protected page. When the fault happens, there is no
way to know whether userfaultfd write-protected the page before. Here
we just blindly issue a userfault notification for any vma with
VM_UFFD_WP, regardless of whether the app has write-protected the page
yet. The application should be ready to handle such wp faults.

v1: From: Shaohua Li <shli@fb.com>

v2: Handle the userfault in the common do_wp_page. If we get there a
pagetable is present and readonly so no need to do further processing
until we solve the userfault.

In the swapin case, always swap in as readonly. This will cause false
positive userfaults. We need to decide later whether to eliminate them
with a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).

hugetlbfs wouldn't need to worry about swapouts, and tmpfs would be
handled by a swap entry bit like anonymous memory.

The main problem, with no easy solution for eliminating the false
positives, will come if/when userfaultfd is extended to real
filesystem pagecache: when the pagecache is freed by reclaim, we can't
leave the radix tree pinned if the inode, and in turn the radix tree,
is reclaimed as well.

The estimation is that full accuracy and lack of false positives could
easily be provided only for anonymous memory (as long as there's no
fork, or as long as MADV_DONTFORK is used on the userfaultfd anonymous
range), tmpfs, and hugetlbfs.  It's most certainly worth achieving,
but in a later incremental patch.

v3: Add hooking point for THP wrprotect faults.
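
The routing decision described above boils down to a one-line check;
the constants below are illustrative stand-ins for handle_userfault()
and the normal COW path, not kernel values:

```c
#include <assert.h>

#define VM_UFFD_WP        0x1000UL	/* illustrative value only */
#define ROUTE_USERFAULT   1		/* stand-in for handle_userfault() */
#define ROUTE_DO_WP_PAGE  2		/* stand-in for the COW path */

/* Blind policy: any write fault in a VM_UFFD_WP vma goes to
 * userspace, even if this particular page was never write-protected
 * (a possible false positive the application must tolerate). */
static int route_wp_fault(unsigned long vm_flags)
{
	return (vm_flags & VM_UFFD_WP) ? ROUTE_USERFAULT : ROUTE_DO_WP_PAGE;
}
```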

CC: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..00781c43407b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2483,6 +2483,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
+	if (userfaultfd_wp(vma)) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return handle_userfault(vmf, VM_UFFD_WP);
+	}
+
 	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
 	if (!vmf->page) {
 		/*
@@ -2800,6 +2805,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
+	if (userfaultfd_wp(vma))
+		vmf->flags &= ~FAULT_FLAG_WRITE;
 	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -3684,8 +3691,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 /* `inline' is required to avoid gcc 4.1.2 build error */
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
-	if (vma_is_anonymous(vmf->vma))
+	if (vma_is_anonymous(vmf->vma)) {
+		if (userfaultfd_wp(vmf->vma))
+			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
+	}
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
 
-- 
2.17.1



* [PATCH v2 08/26] userfaultfd: wp: add WP pagetable tracking to x86
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (6 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 07/26] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 17:20   ` Jerome Glisse
  2019-02-25 15:48   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 09/26] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
                   ` (17 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

Accurate userfaultfd WP tracking is possible by tracking exactly which
virtual memory ranges were writeprotected by userland. We can't rely
only on the RW bit of the mapped pagetable because that information is
destroyed by fork(), KSM, or swap. If we were to rely on it, we'd need
to stay on the safe side and generate false positive wp faults for
every swapped-out page.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/pgtable.h       | 52 ++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_64.h    |  8 ++++-
 arch/x86/include/asm/pgtable_types.h |  9 +++++
 include/asm-generic/pgtable.h        |  1 +
 include/asm-generic/pgtable_uffd.h   | 51 +++++++++++++++++++++++++++
 init/Kconfig                         |  5 +++
 7 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 include/asm-generic/pgtable_uffd.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68261430fe6e..cb43bc008675 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -209,6 +209,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select HAVE_ARCH_USERFAULTFD_WP		if USERFAULTFD
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..6863236e8484 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -23,6 +23,7 @@
 
 #ifndef __ASSEMBLY__
 #include <asm/x86_init.h>
+#include <asm-generic/pgtable_uffd.h>
 
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
@@ -293,6 +294,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pte_uffd_wp(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_UFFD_WP;
+}
+
+static inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_UFFD_WP);
+}
+
+static inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 static inline pte_t pte_mkclean(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_DIRTY);
@@ -372,6 +390,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pmd_uffd_wp(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_UFFD_WP;
+}
+
+static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_UFFD_WP);
+}
+
+static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 static inline pmd_t pmd_mkold(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
@@ -1351,6 +1386,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 #endif
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pte_swp_uffd_wp(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
 #define PKRU_BITS_PER_PKEY 2
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 9c85b54bf03c..e0c5d29b8685 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
  *
  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
  * erratum where they can be incorrectly set by hardware on
  * non-present PTEs.
  *
+ * Bits 1-4 are not used in the non-present format and are available
+ * for the special uses described below:
+ *
  * SD (1) in swp entry is used to store soft dirty bit, which helps us
  * remember soft dirty over page migration
  *
+ * F (2) in swp entry is used to record when a pagetable is
+ * writeprotected by userfaultfd WP support.
+ *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
  *
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index d6ff0bbdb394..8cebcff91e57 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -32,6 +32,7 @@
 
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
+#define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
@@ -100,6 +101,14 @@
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+#define _PAGE_UFFD_WP		(_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
+#define _PAGE_SWP_UFFD_WP	_PAGE_USER
+#else
+#define _PAGE_UFFD_WP		(_AT(pteval_t, 0))
+#define _PAGE_SWP_UFFD_WP	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 05e61e6c843f..f49afe951711 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -10,6 +10,7 @@
 #include <linux/mm_types.h>
 #include <linux/bug.h>
 #include <linux/errno.h>
+#include <asm-generic/pgtable_uffd.h>
 
 #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
 	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
new file mode 100644
index 000000000000..643d1bf559c2
--- /dev/null
+++ b/include/asm-generic/pgtable_uffd.h
@@ -0,0 +1,51 @@
+#ifndef _ASM_GENERIC_PGTABLE_UFFD_H
+#define _ASM_GENERIC_PGTABLE_UFFD_H
+
+#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static __always_inline int pte_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
+static __always_inline int pmd_uffd_wp(pmd_t pmd)
+{
+	return 0;
+}
+
+static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static __always_inline int pte_swp_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
+static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
+#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index c9386a365eea..892d61ddf2eb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1424,6 +1424,11 @@ config ADVISE_SYSCALLS
 	  applications use these syscalls, you can disable this option to save
 	  space.
 
+config HAVE_ARCH_USERFAULTFD_WP
+	bool
+	help
+	  Arch has userfaultfd write protection support
+
 config MEMBARRIER
 	bool "Enable membarrier() system call" if EXPERT
 	default y
-- 
2.17.1



* [PATCH v2 09/26] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (7 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 08/26] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 17:21   ` Jerome Glisse
  2019-02-25 17:12   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
                   ` (16 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

Implement helper methods to invoke userfaultfd wp faults more
selectively: not only when a wp fault triggers on a vma with
VM_UFFD_WP set in vma->vm_flags, but only if the _PAGE_UFFD_WP bit is
also set in the pagetable.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 38f748e7186e..c6590c58ce28 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -14,6 +14,8 @@
 #include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
 
 #include <linux/fcntl.h>
+#include <linux/mm.h>
+#include <asm-generic/pgtable_uffd.h>
 
 /*
  * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
@@ -55,6 +57,18 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_WP;
 }
 
+static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
+				      pte_t pte)
+{
+	return userfaultfd_wp(vma) && pte_uffd_wp(pte);
+}
+
+static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
+					   pmd_t pmd)
+{
+	return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -104,6 +118,19 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
+				      pte_t pte)
+{
+	return false;
+}
+
+static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
+					   pmd_t pmd)
+{
+	return false;
+}
+
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;
-- 
2.17.1



* [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (8 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 09/26] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 17:29   ` Jerome Glisse
  2019-02-25 15:58   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 11/26] mm: merge parameters for change_protection() Peter Xu
                   ` (15 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

This allows UFFDIO_COPY to map pages wrprotected.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                 |  5 +++--
 include/linux/userfaultfd_k.h    |  2 +-
 include/uapi/linux/userfaultfd.h | 11 +++++-----
 mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
 4 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b397bc3b954d..3092885c9d2c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1683,11 +1683,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
-	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
 		goto out;
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
-				   uffdio_copy.len, &ctx->mmap_changing);
+				   uffdio_copy.len, &ctx->mmap_changing,
+				   uffdio_copy.mode);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c6590c58ce28..765ce884cec0 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -34,7 +34,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
-			    bool *mmap_changing);
+			    bool *mmap_changing, __u64 mode);
 extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..297cb044c03f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -203,13 +203,14 @@ struct uffdio_copy {
 	__u64 dst;
 	__u64 src;
 	__u64 len;
+#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
 	/*
-	 * There will be a wrprotection flag later that allows to map
-	 * pages wrprotected on the fly. And such a flag will be
-	 * available if the wrprotection ioctl are implemented for the
-	 * range according to the uffdio_register.ioctls.
+	 * UFFDIO_COPY_MODE_WP will map the page wrprotected on the
+	 * fly. UFFDIO_COPY_MODE_WP is available only if the
+	 * wrprotection ioctl is implemented for the range according
+	 * to the uffdio_register.ioctls.
 	 */
-#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
 	__u64 mode;
 
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d59b5a73dfb3..73a208c5c1e7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct vm_area_struct *dst_vma,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
-			    struct page **pagep)
+			    struct page **pagep,
+			    bool wp_copy)
 {
 	struct mem_cgroup *memcg;
 	pte_t _dst_pte, *dst_pte;
@@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
 		goto out_release;
 
-	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	if (dst_vma->vm_flags & VM_WRITE)
-		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
+	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
+		_dst_pte = pte_mkwrite(_dst_pte);
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 	if (dst_vma->vm_file) {
@@ -399,7 +400,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						struct page **page,
-						bool zeropage)
+						bool zeropage,
+						bool wp_copy)
 {
 	ssize_t err;
 
@@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 	if (!(dst_vma->vm_flags & VM_SHARED)) {
 		if (!zeropage)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					       dst_addr, src_addr, page);
+					       dst_addr, src_addr, page,
+					       wp_copy);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr);
 	} else {
+		VM_WARN_ON(wp_copy); /* WP only available for anon */
 		if (!zeropage)
 			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
 						     dst_vma, dst_addr,
@@ -438,7 +442,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long src_start,
 					      unsigned long len,
 					      bool zeropage,
-					      bool *mmap_changing)
+					      bool *mmap_changing,
+					      __u64 mode)
 {
 	struct vm_area_struct *dst_vma;
 	ssize_t err;
@@ -446,6 +451,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct page *page;
+	bool wp_copy;
 
 	/*
 	 * Sanitize the command parameters:
@@ -502,6 +508,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	    dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;
 
+	/*
+	 * validate 'mode' now that we know the dst_vma: don't allow
+	 * a wrprotect copy if the userfaultfd didn't register as WP.
+	 */
+	wp_copy = mode & UFFDIO_COPY_MODE_WP;
+	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
+		goto out_unlock;
+
 	/*
 	 * If this is a HUGETLB vma, pass off to appropriate routine
 	 */
@@ -557,7 +571,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, zeropage);
+				       src_addr, &page, zeropage, wp_copy);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
@@ -604,14 +618,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 		     unsigned long src_start, unsigned long len,
-		     bool *mmap_changing)
+		     bool *mmap_changing, __u64 mode)
 {
 	return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
-			      mmap_changing);
+			      mmap_changing, mode);
 }
 
 ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 		       unsigned long len, bool *mmap_changing)
 {
-	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
+	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
 }
-- 
2.17.1



* [PATCH v2 11/26] mm: merge parameters for change_protection()
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (9 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 17:32   ` Jerome Glisse
  2019-02-12  2:56 ` [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

change_protection() is used by both the NUMA balancing code and
mprotect(), with one parameter for each of those callers
(dirty_accountable and prot_numa).  Further, these parameters are
passed all along the call chain:

  - change_protection_range()
  - change_p4d_range()
  - change_pud_range()
  - change_pmd_range()
  - ...

Now introduce a single flags argument for change_protection() and all
these helpers to replace those parameters, so that we no longer need
to pass multiple parameters down every level of the call chain.

More importantly, this greatly simplifies the work whenever we want to
introduce a new parameter to change_protection().  In the follow-up
patches, a new flag for userfaultfd write protection will be
introduced.

No functional change at all.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h |  2 +-
 include/linux/mm.h      | 14 +++++++++++++-
 mm/huge_memory.c        |  3 ++-
 mm/mempolicy.c          |  2 +-
 mm/mprotect.c           | 29 ++++++++++++++++-------------
 5 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 381e872bfde0..1550fb12dbd4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -46,7 +46,7 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
-			int prot_numa);
+			unsigned long cp_flags);
 vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmd, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..9fe3b0066324 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1646,9 +1646,21 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
 		bool need_rmap_locks);
+
+/*
+ * Flags used by change_protection().  For now we make it a bitmap so
+ * that we can pass in multiple flags just like parameters.  However,
+ * for now all the callers only use one of the flags at a
+ * time.
+ */
+/* Whether we should allow dirty bit accounting */
+#define  MM_CP_DIRTY_ACCT                  (1UL << 0)
+/* Whether this protection change is for NUMA hints */
+#define  MM_CP_PROT_NUMA                   (1UL << 1)
+
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
-			      int dirty_accountable, int prot_numa);
+			      unsigned long cp_flags);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faf357eaf0ce..8d65b0f041f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1860,13 +1860,14 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot, int prot_numa)
+		unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
 	pmd_t entry;
 	bool preserve_write;
 	int ret;
+	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d4496d9d34f5..233194f3d69a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -554,7 +554,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
-	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
+	nr_updated = change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 36cb358db170..a6ba448c8565 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,13 +37,15 @@
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
+	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
+	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 
 	/*
 	 * Can be called with only the mmap_sem for reading by
@@ -164,7 +166,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -194,7 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-						newprot, prot_numa);
+							      newprot, cp_flags);
 
 				if (nr_ptes) {
 					if (nr_ptes == HPAGE_PMD_NR) {
@@ -209,7 +211,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			/* fall through, the trans huge pmd just split */
 		}
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					      cp_flags);
 		pages += this_pages;
 next:
 		cond_resched();
@@ -225,7 +227,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		p4d_t *p4d, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -237,7 +239,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -245,7 +247,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -257,7 +259,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (p4d++, addr = next, addr != end);
 
 	return pages;
@@ -265,7 +267,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -282,7 +284,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+					  cp_flags);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -295,14 +297,15 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 
 unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
-		       int dirty_accountable, int prot_numa)
+		       unsigned long cp_flags)
 {
 	unsigned long pages;
 
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot,
+						cp_flags);
 
 	return pages;
 }
@@ -430,7 +433,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	vma_set_page_prot(vma);
 
 	change_protection(vma, start, end, vma->vm_page_prot,
-			  dirty_accountable, 0);
+			  dirty_accountable ? MM_CP_DIRTY_ACCT : 0);
 
 	/*
 	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
-- 
2.17.1



* [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (10 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 11/26] mm: merge parameters for change_protection() Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 17:44   ` Jerome Glisse
  2019-02-25 18:00   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 13/26] mm: export wp_page_copy() Peter Xu
                   ` (13 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
change_protection() when used with uffd-wp, and make sure the two new
flags are mutually exclusive.  Then,

  - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

  - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

And use this new interface in mwriteprotect_range() to replace the old
MM_CP_DIRTY_ACCT.

Do this change for both PTEs and huge PMDs.  Then we can start to
identify which PTE/PMD is write protected for general reasons (e.g.,
COW or soft dirty tracking), and which is write protected for
userfaultfd-wp.

Since we should keep the _PAGE_UFFD_WP bit across pte_modify(), add it
into _PAGE_CHG_MASK as well.  Meanwhile, since we now have this new
bit, we can be even more strict when detecting uffd-wp page faults in
either do_wp_page() or wp_huge_pmd().

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |  2 +-
 include/linux/mm.h                   |  5 +++++
 mm/huge_memory.c                     | 14 +++++++++++++-
 mm/memory.c                          |  4 ++--
 mm/mprotect.c                        | 12 ++++++++++++
 mm/userfaultfd.c                     |  8 ++++++--
 6 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8cebcff91e57..dd9c6295d610 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -133,7 +133,7 @@
  */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_UFFD_WP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9fe3b0066324..f38fbe9c8bc9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1657,6 +1657,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_DIRTY_ACCT                  (1UL << 0)
 /* Whether this protection change is for NUMA hints */
 #define  MM_CP_PROT_NUMA                   (1UL << 1)
+/* Whether this change is for write protecting */
+#define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
+#define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
+#define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
+					    MM_CP_UFFD_WP_RESOLVE)
 
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8d65b0f041f9..817335b443c2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1868,6 +1868,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	bool preserve_write;
 	int ret;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
@@ -1934,6 +1936,13 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
 		entry = pmd_mk_savedwrite(entry);
+	if (uffd_wp) {
+		entry = pmd_wrprotect(entry);
+		entry = pmd_mkuffd_wp(entry);
+	} else if (uffd_wp_resolve) {
+		entry = pmd_mkwrite(entry);
+		entry = pmd_clear_uffd_wp(entry);
+	}
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
@@ -2083,7 +2092,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
-	bool young, write, soft_dirty, pmd_migration = false;
+	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
 	unsigned long addr;
 	int i;
 
@@ -2165,6 +2174,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = pmd_write(old_pmd);
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
+		uffd_wp = pmd_uffd_wp(old_pmd);
 	}
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
@@ -2198,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_mkold(entry);
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
+			if (uffd_wp)
+				entry = pte_mkuffd_wp(entry);
 		}
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
diff --git a/mm/memory.c b/mm/memory.c
index 00781c43407b..f8d83ae16eff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2483,7 +2483,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_wp(vma)) {
+	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3692,7 +3692,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_wp(vmf->vma))
+		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a6ba448c8565..9d4433044c21 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -46,6 +46,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	int target_node = NUMA_NO_NODE;
 	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	/*
 	 * Can be called with only the mmap_sem for reading by
@@ -117,6 +119,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (preserve_write)
 				ptent = pte_mk_savedwrite(ptent);
 
+			if (uffd_wp) {
+				ptent = pte_wrprotect(ptent);
+				ptent = pte_mkuffd_wp(ptent);
+			} else if (uffd_wp_resolve) {
+				ptent = pte_mkwrite(ptent);
+				ptent = pte_clear_uffd_wp(ptent);
+			}
+
 			/* Avoid taking write faults for known dirty pages */
 			if (dirty_accountable && pte_dirty(ptent) &&
 					(pte_soft_dirty(ptent) ||
@@ -301,6 +311,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 {
 	unsigned long pages;
 
+	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
+
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 73a208c5c1e7..80bcd642911d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -73,8 +73,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
-	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
-		_dst_pte = pte_mkwrite(_dst_pte);
+	if (dst_vma->vm_flags & VM_WRITE) {
+		if (wp_copy)
+			_dst_pte = pte_mkuffd_wp(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	}
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 	if (dst_vma->vm_file) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH v2 13/26] mm: export wp_page_copy()
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (11 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 17:44   ` Jerome Glisse
  2019-02-12  2:56 ` [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Export this function for use outside of the page fault handlers.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 2 ++
 mm/memory.c        | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f38fbe9c8bc9..2fd14a62324b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -405,6 +405,8 @@ struct vm_fault {
 					 */
 };
 
+vm_fault_t wp_page_copy(struct vm_fault *vmf);
+
 /* page entry size for vm->huge_fault() */
 enum page_entry_size {
 	PE_SIZE_PTE = 0,
diff --git a/mm/memory.c b/mm/memory.c
index f8d83ae16eff..32d32b6e6339 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2239,7 +2239,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
  *   held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old page.
  */
-static vm_fault_t wp_page_copy(struct vm_fault *vmf)
+vm_fault_t wp_page_copy(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
-- 
2.17.1



* [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (12 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 13/26] mm: export wp_page_copy() Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:04   ` Jerome Glisse
  2019-02-12  2:56 ` [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This allows uffd-wp to support write-protected pages for COW.

For example, a uffd write-protected PTE could also be write-protected
by other mechanisms, such as COW or the zero page.  When that happens,
we can't simply set the write bit in the PTE, since otherwise it would
change the content seen by every reference to the page.  Instead, we
should do the COW first if necessary, then handle the uffd-wp fault.

To correctly copy the page, we'll also need to carry over the
_PAGE_UFFD_WP bit if it was set in the original PTE.

For huge PMDs, we always split the huge PMD when we want to resolve an
uffd-wp page fault.  That matches what we do for general huge PMD write
protection.  This way, the huge PMD copy-on-write problem is reduced to
PTE copy-on-write.
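
The per-PTE decision described above can be sketched as a small
standalone model (illustrative only: the enum, function name, and
boolean inputs are stand-ins, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the per-PTE decision in change_pte_range() when
 * resolving uffd write protection: zero pages (no normal page) and pages
 * mapped more than once must be COWed first via wp_page_copy(); an
 * exclusive anonymous page can get the write bit back in place. */
enum resolve_action { SKIP, COPY_THEN_UNPROTECT, UNPROTECT_IN_PLACE };

static enum resolve_action resolve_uffd_wp(bool pte_is_uffd_wp,
					   bool has_normal_page,
					   int mapcount)
{
	if (!pte_is_uffd_wp)
		return SKIP;                /* fault already resolved elsewhere */
	if (!has_normal_page || mapcount > 1)
		return COPY_THEN_UNPROTECT; /* shared or zero page: COW first */
	return UNPROTECT_IN_PLACE;          /* exclusive page: restore write */
}
```

This is only a sketch of the control flow; the real code additionally
drops the page table lock around wp_page_copy() and re-takes it
afterwards, as the diff below shows.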

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c   |  2 ++
 mm/mprotect.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 32d32b6e6339..b5d67bafae35 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2291,6 +2291,8 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
+		if (pte_uffd_wp(vmf->orig_pte))
+			entry = pte_mkuffd_wp(entry);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9d4433044c21..ae93721f3795 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,14 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		if (pte_present(oldpte)) {
 			pte_t ptent;
 			bool preserve_write = prot_numa && pte_write(oldpte);
+			struct page *page;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct page *page;
-
 				page = vm_normal_page(vma, addr, oldpte);
 				if (!page || PageKsm(page))
 					continue;
@@ -114,6 +113,46 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					continue;
 			}
 
+			/*
+			 * Detect whether we'll need to COW before
+			 * resolving an uffd-wp fault.  Note that this
+			 * includes detection of the zero page (where
+			 * page==NULL)
+			 */
+			if (uffd_wp_resolve) {
+				/* If the fault is resolved already, skip */
+				if (!pte_uffd_wp(*pte))
+					continue;
+				page = vm_normal_page(vma, addr, oldpte);
+				if (!page || page_mapcount(page) > 1) {
+					struct vm_fault vmf = {
+						.vma = vma,
+						.address = addr & PAGE_MASK,
+						.page = page,
+						.orig_pte = oldpte,
+						.pmd = pmd,
+						/* pte and ptl not needed */
+					};
+					vm_fault_t ret;
+
+					if (page)
+						get_page(page);
+					arch_leave_lazy_mmu_mode();
+					pte_unmap_unlock(pte, ptl);
+					ret = wp_page_copy(&vmf);
+					/* PTE is changed, or OOM */
+					if (ret == 0)
+						/* It's done by others */
+						continue;
+					else if (WARN_ON(ret != VM_FAULT_WRITE))
+						return pages;
+					pte = pte_offset_map_lock(vma->vm_mm,
+								  pmd, addr,
+								  &ptl);
+					arch_enter_lazy_mmu_mode();
+				}
+			}
+
 			ptent = ptep_modify_prot_start(mm, addr, pte);
 			ptent = pte_modify(ptent, newprot);
 			if (preserve_write)
@@ -183,6 +222,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	range.start = 0;
 
@@ -202,7 +242,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE) {
+			/*
+			 * When resolving a userfaultfd write
+			 * protection fault, it's not easy to identify
+			 * whether a THP is shared with others and
+			 * whether we'll need to do copy-on-write, so
+			 * just split it always for now to simplify the
+			 * procedure.  And that's the policy too for
+			 * general THP write-protect in af9e4d5f2de2.
+			 */
+			if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-- 
2.17.1



* [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (13 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:06   ` Jerome Glisse
  2019-02-25 18:19   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 16/26] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
                   ` (10 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

UFFD_EVENT_FORK support for uffd-wp is mostly already in place, except
that we must clear the uffd-wp bit when the uffd fork event is not
enabled.  Detect that case to avoid _PAGE_UFFD_WP being set even when
the VMA is not being tracked by VM_UFFD_WP.  Do this for both small
PTEs and huge PMDs.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 8 ++++++++
 mm/memory.c      | 8 ++++++++
 2 files changed, 16 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 817335b443c2..fb2234cb595a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -938,6 +938,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	ret = -EAGAIN;
 	pmd = *src_pmd;
 
+	/*
+	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+	 * does not have VM_UFFD_WP set, which means that the uffd
+	 * fork event is not enabled.
+	 */
+	if (!(vma->vm_flags & VM_UFFD_WP))
+		pmd = pmd_clear_uffd_wp(pmd);
+
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
diff --git a/mm/memory.c b/mm/memory.c
index b5d67bafae35..c2035539e9fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -788,6 +788,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
+	/*
+	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+	 * does not have VM_UFFD_WP set, which means that the uffd
+	 * fork event is not enabled.
+	 */
+	if (!(vm_flags & VM_UFFD_WP))
+		pte = pte_clear_uffd_wp(pte);
+
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-- 
2.17.1



* [PATCH v2 16/26] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (14 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:07   ` Jerome Glisse
  2019-02-25 18:20   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 17/26] userfaultfd: wp: support swap and page migration Peter Xu
                   ` (9 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Add the missing helpers for uffd-wp operations on pmd swap/migration
entries.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h     | 15 +++++++++++++++
 include/asm-generic/pgtable_uffd.h | 15 +++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6863236e8484..18a815d6f4ea 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1401,6 +1401,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
 }
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #define PKRU_AD_BIT 0x1
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 643d1bf559c2..828966d4c281 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -46,6 +46,21 @@ static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
 	return pte;
 }
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd;
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
-- 
2.17.1



* [PATCH v2 17/26] userfaultfd: wp: support swap and page migration
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (15 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 16/26] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:16   ` Jerome Glisse
  2019-02-25 18:28   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 18/26] khugepaged: skip collapse if uffd-wp detected Peter Xu
                   ` (8 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

For both swap and page migration, we use bit 2 of the entry to
identify whether the entry is uffd write-protected.  It plays a role
similar to the existing soft dirty bit in swap entries, but only keeps
the uffd-wp tracking for a specific PTE/PMD.

One special point: when recovering the uffd-wp bit from a
swap/migration entry back into the PTE, we also need to take care of
the _PAGE_RW bit and make sure it's cleared, otherwise even with the
_PAGE_UFFD_WP bit set we can't trap the write at all.

Note that this patch removes two lines from "userfaultfd: wp: hook
userfault handler to write protection fault", where we tried to remove
FAULT_FLAG_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
patch keeps the write flag there.
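
The swap-in restore rule described above can be modeled standalone (bit
positions are illustrative stand-ins, not the real x86 layout):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the do_swap_page()/remove_migration_pte() hunks: when the
 * swap/migration entry carried the uffd-wp marker, the rebuilt PTE must
 * both gain _PAGE_UFFD_WP and lose _PAGE_RW, otherwise a write would
 * not fault and the protection would be silently lost. */
#define MODEL_PAGE_RW       (1ULL << 1)
#define MODEL_PAGE_UFFD_WP  (1ULL << 58)
#define MODEL_SWP_UFFD_WP   (1ULL << 2)

static uint64_t restore_pte_from_swap(uint64_t swp_entry, uint64_t pte)
{
	if (swp_entry & MODEL_SWP_UFFD_WP) {
		pte |= MODEL_PAGE_UFFD_WP;  /* re-arm uffd-wp tracking */
		pte &= ~MODEL_PAGE_RW;      /* writes must trap again */
	}
	return pte;
}
```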

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 2 ++
 mm/huge_memory.c        | 3 +++
 mm/memory.c             | 8 ++++++--
 mm/migrate.c            | 7 +++++++
 mm/mprotect.c           | 2 ++
 mm/rmap.c               | 6 ++++++
 6 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4d961668e5fc..0c2923b1cdb7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -68,6 +68,8 @@ static inline swp_entry_t pte_to_swp_entry(pte_t pte)
 
 	if (pte_swp_soft_dirty(pte))
 		pte = pte_swp_clear_soft_dirty(pte);
+	if (pte_swp_uffd_wp(pte))
+		pte = pte_swp_clear_uffd_wp(pte);
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fb2234cb595a..75de07141801 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2175,6 +2175,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_write_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
+		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
 		page = pmd_page(old_pmd);
 		if (pmd_dirty(old_pmd))
@@ -2207,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
+			if (uffd_wp)
+				entry = pte_swp_mkuffd_wp(entry);
 		} else {
 			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
 			entry = maybe_mkwrite(entry, vma);
diff --git a/mm/memory.c b/mm/memory.c
index c2035539e9fd..7cee990d67cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(*src_pte))
 					pte = pte_swp_mksoft_dirty(pte);
+				if (pte_swp_uffd_wp(*src_pte))
+					pte = pte_swp_mkuffd_wp(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
 		} else if (is_device_private_entry(entry)) {
@@ -2815,8 +2817,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if (userfaultfd_wp(vma))
-		vmf->flags &= ~FAULT_FLAG_WRITE;
 	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -2826,6 +2826,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
+	if (pte_swp_uffd_wp(vmf->orig_pte)) {
+		pte = pte_mkuffd_wp(pte);
+		pte = pte_wrprotect(pte);
+	}
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 	vmf->orig_pte = pte;
diff --git a/mm/migrate.c b/mm/migrate.c
index d4fd680be3b0..605ccd1f5c64 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -242,6 +242,11 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 
+		if (pte_swp_uffd_wp(*pvmw.pte)) {
+			pte = pte_mkuffd_wp(pte);
+			pte = pte_wrprotect(pte);
+		}
+
 		if (unlikely(is_zone_device_page(new))) {
 			if (is_device_private_page(new)) {
 				entry = make_device_private_entry(new, pte_write(pte));
@@ -2290,6 +2295,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pte))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pte))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, addr, ptep, swp_pte);
 
 			/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ae93721f3795..73a65f07fe41 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -187,6 +187,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(oldpte))
 					newpte = pte_swp_mksoft_dirty(newpte);
+				if (pte_swp_uffd_wp(oldpte))
+					newpte = pte_swp_mkuffd_wp(newpte);
 				set_pte_at(mm, addr, pte, newpte);
 
 				pages++;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..3750d5a5283c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1469,6 +1469,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
 			/*
 			 * No need to invalidate here it will synchronize on
@@ -1561,6 +1563,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/*
 			 * No need to invalidate here it will synchronize on
@@ -1627,6 +1631,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/* Invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
-- 
2.17.1



* [PATCH v2 18/26] khugepaged: skip collapse if uffd-wp detected
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (16 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 17/26] userfaultfd: wp: support swap and page migration Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:17   ` Jerome Glisse
  2019-02-25 18:50   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 19/26] userfaultfd: introduce helper vma_find_uffd Peter Xu
                   ` (7 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Don't collapse into a huge PMD if any of the small PTEs is userfault
write-protected.  The problem is that the write protection has small
page granularity, and there's no way to keep all that write protection
information if the small pages are merged into a huge PMD.

The same consideration applies to swap entries and migration entries,
so do the check for them as well, disregarding khugepaged_max_ptes_swap.
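
The scan rule can be sketched as a standalone model (PTEs are modeled
as plain integers; the bit positions and result codes are illustrative
stand-ins, not kernel values):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the khugepaged_scan_pmd() checks: abort the collapse if any
 * small PTE in the range carries the uffd-wp bit, whether in its
 * present form or in its swap-entry form. */
#define MODEL_PTE_IS_SWAP     (1ULL << 0)
#define MODEL_PTE_SWP_UFFD_WP (1ULL << 2)
#define MODEL_PTE_UFFD_WP     (1ULL << 58)

enum model_scan_result { MODEL_SCAN_OK, MODEL_SCAN_PTE_UFFD_WP };

static enum model_scan_result scan_ptes(const uint64_t *ptes, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t wp_bit = (ptes[i] & MODEL_PTE_IS_SWAP) ?
				  MODEL_PTE_SWP_UFFD_WP : MODEL_PTE_UFFD_WP;
		if (ptes[i] & wp_bit)
			return MODEL_SCAN_PTE_UFFD_WP; /* refuse to collapse */
	}
	return MODEL_SCAN_OK;
}
```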

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index dd4db334bd63..2d7bad9cb976 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -13,6 +13,7 @@
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
+	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
 	EM( SCAN_PAGE_RO,		"no_writable_page")		\
 	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
 	EM( SCAN_PAGE_NULL,		"page_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4f017339ddb2..396c7e4da83e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -29,6 +29,7 @@ enum scan_result {
 	SCAN_PMD_NULL,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_PTE_NON_PRESENT,
+	SCAN_PTE_UFFD_WP,
 	SCAN_PAGE_RO,
 	SCAN_LACK_REFERENCED_PAGE,
 	SCAN_PAGE_NULL,
@@ -1123,6 +1124,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
 			if (++unmapped <= khugepaged_max_ptes_swap) {
+				/*
+				 * Always be strict with uffd-wp
+				 * enabled swap entries.  Please see
+				 * comment below for pte_uffd_wp().
+				 */
+				if (pte_swp_uffd_wp(pteval)) {
+					result = SCAN_PTE_UFFD_WP;
+					goto out_unmap;
+				}
 				continue;
 			} else {
 				result = SCAN_EXCEED_SWAP_PTE;
@@ -1142,6 +1152,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			result = SCAN_PTE_NON_PRESENT;
 			goto out_unmap;
 		}
+		if (pte_uffd_wp(pteval)) {
+			/*
+			 * Don't collapse the page if any of the small
+			 * PTEs are armed with uffd write protection.
+			 * Here we can also mark the new huge pmd as
+			 * write protected if any of the small ones is
+			 * marked but that could bring unknown
+			 * userfault messages that fall outside of
+			 * the registered range.  So, just be simple.
+			 */
+			result = SCAN_PTE_UFFD_WP;
+			goto out_unmap;
+		}
 		if (pte_write(pteval))
 			writable = true;
 
-- 
2.17.1



* [PATCH v2 19/26] userfaultfd: introduce helper vma_find_uffd
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (17 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 18/26] khugepaged: skip collapse if uffd-wp detected Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:19   ` Jerome Glisse
  2019-02-25 20:48   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range Peter Xu
                   ` (6 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

We have multiple places (and more coming) that want to find a
userfault-enabled VMA from an mm struct that covers a specific memory
range.  This patch introduces a helper for it and applies it to the
existing code.
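
The containment rule the helper enforces can be modeled standalone (the
struct below is a stand-in for the kernel's vm_area_struct, not the
real layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the checks in vma_find_uffd(): the VMA must be registered
 * with a userfaultfd context, and the requested [start, start + len)
 * range must lie entirely within [vm_start, vm_end). */
struct vma_model {
	uint64_t vm_start;
	uint64_t vm_end;
	bool     has_uffd_ctx;
};

static bool vma_covers_uffd_range(const struct vma_model *vma,
				  uint64_t start, uint64_t len)
{
	if (!vma->has_uffd_ctx)
		return false;  /* not registered with userfaultfd */
	return start >= vma->vm_start && start + len <= vma->vm_end;
}
```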

Suggested-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/userfaultfd.c | 54 +++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 80bcd642911d..fefa81c301b7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -20,6 +20,34 @@
 #include <asm/tlbflush.h>
 #include "internal.h"
 
+/*
+ * Find a valid userfault enabled VMA region that covers the whole
+ * address range, or NULL on failure.  Must be called with mmap_sem
+ * held.
+ */
+static struct vm_area_struct *vma_find_uffd(struct mm_struct *mm,
+					    unsigned long start,
+					    unsigned long len)
+{
+	struct vm_area_struct *vma = find_vma(mm, start);
+
+	if (!vma)
+		return NULL;
+
+	/*
+	 * Check the vma is registered in uffd, this is required to
+	 * enforce the VM_MAYWRITE check done at uffd registration
+	 * time.
+	 */
+	if (!vma->vm_userfaultfd_ctx.ctx)
+		return NULL;
+
+	if (start < vma->vm_start || start + len > vma->vm_end)
+		return NULL;
+
+	return vma;
+}
+
 static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    pmd_t *dst_pmd,
 			    struct vm_area_struct *dst_vma,
@@ -228,20 +256,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	 */
 	if (!dst_vma) {
 		err = -ENOENT;
-		dst_vma = find_vma(dst_mm, dst_start);
+		dst_vma = vma_find_uffd(dst_mm, dst_start, len);
 		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
 			goto out_unlock;
-		/*
-		 * Check the vma is registered in uffd, this is
-		 * required to enforce the VM_MAYWRITE check done at
-		 * uffd registration time.
-		 */
-		if (!dst_vma->vm_userfaultfd_ctx.ctx)
-			goto out_unlock;
-
-		if (dst_start < dst_vma->vm_start ||
-		    dst_start + len > dst_vma->vm_end)
-			goto out_unlock;
 
 		err = -EINVAL;
 		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
@@ -488,20 +505,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 * both valid and fully within a single existing vma.
 	 */
 	err = -ENOENT;
-	dst_vma = find_vma(dst_mm, dst_start);
+	dst_vma = vma_find_uffd(dst_mm, dst_start, len);
 	if (!dst_vma)
 		goto out_unlock;
-	/*
-	 * Check the vma is registered in uffd, this is required to
-	 * enforce the VM_MAYWRITE check done at uffd registration
-	 * time.
-	 */
-	if (!dst_vma->vm_userfaultfd_ctx.ctx)
-		goto out_unlock;
-
-	if (dst_start < dst_vma->vm_start ||
-	    dst_start + len > dst_vma->vm_end)
-		goto out_unlock;
 
 	err = -EINVAL;
 	/*
-- 
2.17.1



* [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (18 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 19/26] userfaultfd: introduce helper vma_find_uffd Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:23   ` Jerome Glisse
  2019-02-25 20:52   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
                   ` (5 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

From: Shaohua Li <shli@fb.com>

Add an API to enable/disable write protection for a vma range. Unlike
mprotect, this doesn't split/merge vmas.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx:
 - use the helper to find VMA;
 - return -ENOENT if not found to match mcopy case;
 - use the new MM_CP_UFFD_WP* flags for change_protection
 - check against mmap_changing for failures]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h |  3 ++
 mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 765ce884cec0..8f6e6ed544fb 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len,
 			      bool *mmap_changing);
+extern int mwriteprotect_range(struct mm_struct *dst_mm,
+			       unsigned long start, unsigned long len,
+			       bool enable_wp, bool *mmap_changing);
 
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index fefa81c301b7..529d180bb4d7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
 }
+
+int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
+			unsigned long len, bool enable_wp, bool *mmap_changing)
+{
+	struct vm_area_struct *dst_vma;
+	pgprot_t newprot;
+	int err;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(start + len <= start);
+
+	down_read(&dst_mm->mmap_sem);
+
+	/*
+	 * If memory mappings are changing because of non-cooperative
+	 * operation (e.g. mremap) running in parallel, bail out and
+	 * request the user to retry later
+	 */
+	err = -EAGAIN;
+	if (mmap_changing && READ_ONCE(*mmap_changing))
+		goto out_unlock;
+
+	err = -ENOENT;
+	dst_vma = vma_find_uffd(dst_mm, start, len);
+	/*
+	 * Make sure the vma is not shared, that the dst range is
+	 * both valid and fully within a single existing vma.
+	 */
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out_unlock;
+	if (!userfaultfd_wp(dst_vma))
+		goto out_unlock;
+	if (!vma_is_anonymous(dst_vma))
+		goto out_unlock;
+
+	if (enable_wp)
+		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
+	else
+		newprot = vm_get_page_prot(dst_vma->vm_flags);
+
+	change_protection(dst_vma, start, start + len, newprot,
+			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+
+	err = 0;
+out_unlock:
+	up_read(&dst_mm->mmap_sem);
+	return err;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread
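mwriteprotect_range() above sanitizes its parameters with BUG_ON(): both start and len must be page aligned, and the range must be non-empty and must not wrap. A small standalone sketch of that predicate (PAGE_SIZE hardcoded to 4 KiB here for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/*
 * Mirrors the BUG_ON() sanity checks at the top of
 * mwriteprotect_range(): start and len page aligned, and the span
 * neither zero-sized nor wrapping around the address space.
 */
static bool uffd_wp_range_ok(unsigned long start, unsigned long len)
{
	if (start & ~PAGE_MASK)
		return false;		/* unaligned start */
	if (len & ~PAGE_MASK)
		return false;		/* unaligned length */
	if (start + len <= start)
		return false;		/* zero-sized or wrapping range */
	return true;
}
```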

* [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (19 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:28   ` Jerome Glisse
  2019-02-25 21:03   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 22/26] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
                   ` (4 subsequent siblings)
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Andrea Arcangeli <aarcange@redhat.com>

v1: From: Shaohua Li <shli@fb.com>

v2: cleanups, remove a branch.

[peterx writes up the commit message, as below...]

This patch introduces the new uffd-wp APIs for userspace.

Firstly, we allow UFFDIO_REGISTER with write protection tracking using
the new UFFDIO_REGISTER_MODE_WP flag.  Note that this flag can
co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case
the userspace program can not only resolve missing page faults but
also track page data changes along the way.

Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page
level write protection tracking.  Note that the memory region must
be registered with UFFDIO_REGISTER_MODE_WP before this API can be
used on it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: remove useless block, write commit message, check against
 VM_MAYWRITE rather than VM_WRITE when register]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                 | 82 +++++++++++++++++++++++++-------
 include/uapi/linux/userfaultfd.h | 11 +++++
 2 files changed, 77 insertions(+), 16 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3092885c9d2c..81962d62520c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -304,8 +304,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	if (!pmd_present(_pmd))
 		goto out;
 
-	if (pmd_trans_huge(_pmd))
+	if (pmd_trans_huge(_pmd)) {
+		if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
+			ret = true;
 		goto out;
+	}
 
 	/*
 	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
@@ -318,6 +321,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 */
 	if (pte_none(*pte))
 		ret = true;
+	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
+		ret = true;
 	pte_unmap(pte);
 
 out:
@@ -1251,10 +1256,13 @@ static __always_inline int validate_range(struct mm_struct *mm,
 	return 0;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma)
+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+				     unsigned long vm_flags)
 {
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-		vma_is_shmem(vma);
+	/* FIXME: add WP support to hugetlbfs and shmem */
+	return vma_is_anonymous(vma) ||
+		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
+		 !(vm_flags & VM_UFFD_WP));
 }
 
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1286,15 +1294,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	vm_flags = 0;
 	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
 		vm_flags |= VM_UFFD_MISSING;
-	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
 		vm_flags |= VM_UFFD_WP;
-		/*
-		 * FIXME: remove the below error constraint by
-		 * implementing the wprotect tracking mode.
-		 */
-		ret = -EINVAL;
-		goto out;
-	}
 
 	ret = validate_range(mm, uffdio_register.range.start,
 			     uffdio_register.range.len);
@@ -1342,7 +1343,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
-		if (!vma_can_userfault(cur))
+		if (!vma_can_userfault(cur, vm_flags))
 			goto out_unlock;
 
 		/*
@@ -1370,6 +1371,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 			if (end & (vma_hpagesize - 1))
 				goto out_unlock;
 		}
+		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_MAYWRITE))
+			goto out_unlock;
 
 		/*
 		 * Check that this vma isn't already owned by a
@@ -1399,7 +1402,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma));
+		BUG_ON(!vma_can_userfault(vma, vm_flags));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
 		       vma->vm_userfaultfd_ctx.ctx != ctx);
 		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
@@ -1534,7 +1537,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * provides for more strict behavior to notice
 		 * unregistration errors.
 		 */
-		if (!vma_can_userfault(cur))
+		if (!vma_can_userfault(cur, cur->vm_flags))
 			goto out_unlock;
 
 		found = true;
@@ -1548,7 +1551,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma));
+		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
 
 		/*
 		 * Nothing to do: this vma is already registered into this
@@ -1761,6 +1764,50 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
+				    unsigned long arg)
+{
+	int ret;
+	struct uffdio_writeprotect uffdio_wp;
+	struct uffdio_writeprotect __user *user_uffdio_wp;
+	struct userfaultfd_wake_range range;
+
+	if (READ_ONCE(ctx->mmap_changing))
+		return -EAGAIN;
+
+	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
+
+	if (copy_from_user(&uffdio_wp, user_uffdio_wp,
+			   sizeof(struct uffdio_writeprotect)))
+		return -EFAULT;
+
+	ret = validate_range(ctx->mm, uffdio_wp.range.start,
+			     uffdio_wp.range.len);
+	if (ret)
+		return ret;
+
+	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
+			       UFFDIO_WRITEPROTECT_MODE_WP))
+		return -EINVAL;
+	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
+	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+		return -EINVAL;
+
+	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
+				  uffdio_wp.range.len, uffdio_wp.mode &
+				  UFFDIO_WRITEPROTECT_MODE_WP,
+				  &ctx->mmap_changing);
+	if (ret)
+		return ret;
+
+	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+		range.start = uffdio_wp.range.start;
+		range.len = uffdio_wp.range.len;
+		wake_userfault(ctx, &range);
+	}
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -1838,6 +1885,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_ZEROPAGE:
 		ret = userfaultfd_zeropage(ctx, arg);
 		break;
+	case UFFDIO_WRITEPROTECT:
+		ret = userfaultfd_writeprotect(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 297cb044c03f..1b977a7a4435 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -52,6 +52,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_WRITEPROTECT		(0x06)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -68,6 +69,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
+				      struct uffdio_writeprotect)
 
 /* read() structure */
 struct uffd_msg {
@@ -232,4 +235,12 @@ struct uffdio_zeropage {
 	__s64 zeropage;
 };
 
+struct uffdio_writeprotect {
+	struct uffdio_range range;
+	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
+#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
+#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+	__u64 mode;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread
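userfaultfd_writeprotect() above rejects unknown mode bits and the WP+DONTWAKE combination (DONTWAKE makes no sense when setting protection, since nobody is waiting yet). Below is a standalone sketch of that validation; the flag values are copied from the struct uffdio_writeprotect definition in this patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag values as defined in struct uffdio_writeprotect above. */
#define UFFDIO_WRITEPROTECT_MODE_WP		((uint64_t)1 << 0)
#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((uint64_t)1 << 1)

/*
 * Mirrors the mode checks in userfaultfd_writeprotect(): unknown
 * bits are rejected, and DONTWAKE is only valid when resolving
 * (i.e. clearing) write protection, not when setting it.
 */
static int validate_wp_mode(uint64_t mode)
{
	if (mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
		     UFFDIO_WRITEPROTECT_MODE_WP))
		return -EINVAL;
	if ((mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
	    (mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
		return -EINVAL;
	return 0;
}
```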

* [PATCH v2 22/26] userfaultfd: wp: enabled write protection in userfaultfd API
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (20 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:29   ` Jerome Glisse
  2019-02-12  2:56 ` [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect Peter Xu
                   ` (3 subsequent siblings)
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Pavel Emelyanov, Rik van Riel

From: Shaohua Li <shli@fb.com>

Now it's safe to enable write protection in the userfaultfd API.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 1b977a7a4435..a50f1ed24d23 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
  * means the userland is reading).
  */
 #define UFFD_API ((__u64)0xAA)
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |		\
+#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |	\
+			   UFFD_FEATURE_EVENT_FORK |		\
 			   UFFD_FEATURE_EVENT_REMAP |		\
 			   UFFD_FEATURE_EVENT_REMOVE |	\
 			   UFFD_FEATURE_EVENT_UNMAP |		\
@@ -34,7 +35,8 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread
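The UFFD_API_RANGE_IOCTLS change above simply adds bit _UFFDIO_WRITEPROTECT (0x06) to the per-range ioctl mask advertised to userspace. A quick sketch of how userspace could test the uffdio_register.ioctls mask returned by UFFDIO_REGISTER, with the ioctl ids copied from the uapi header in this series:

```c
#include <assert.h>
#include <stdint.h>

/* ioctl ids as in include/uapi/linux/userfaultfd.h in this series */
#define _UFFDIO_WAKE		(0x02)
#define _UFFDIO_COPY		(0x03)
#define _UFFDIO_ZEROPAGE	(0x04)
#define _UFFDIO_WRITEPROTECT	(0x06)

/* The full per-range mask after this patch. */
#define UFFD_API_RANGE_IOCTLS			\
	((uint64_t)1 << _UFFDIO_WAKE |		\
	 (uint64_t)1 << _UFFDIO_COPY |		\
	 (uint64_t)1 << _UFFDIO_ZEROPAGE |	\
	 (uint64_t)1 << _UFFDIO_WRITEPROTECT)

/*
 * After UFFDIO_REGISTER, userspace should check the returned ioctls
 * mask before relying on UFFDIO_WRITEPROTECT being available.
 */
static int supports_writeprotect(uint64_t ioctls)
{
	return !!(ioctls & ((uint64_t)1 << _UFFDIO_WRITEPROTECT));
}
```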

* [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (21 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 22/26] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:36   ` Jerome Glisse
                     ` (2 more replies)
  2019-02-12  2:56 ` [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
                   ` (2 subsequent siblings)
  25 siblings, 3 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

It does not make sense to try to wake up any waiting thread when we're
write-protecting a memory region.  Only wake up when resolving a
write-protected page fault.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 81962d62520c..f1f61a0278c2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect uffdio_wp;
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
+	bool mode_wp, mode_dontwake;
 
 	if (READ_ONCE(ctx->mmap_changing))
 		return -EAGAIN;
@@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP))
 		return -EINVAL;
-	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
-	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+
+	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
+	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
+
+	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
 	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
-				  uffdio_wp.range.len, uffdio_wp.mode &
-				  UFFDIO_WRITEPROTECT_MODE_WP,
+				  uffdio_wp.range.len, mode_wp,
 				  &ctx->mmap_changing);
 	if (ret)
 		return ret;
 
-	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+	if (!mode_wp && !mode_dontwake) {
 		range.start = uffdio_wp.range.start;
 		range.len = uffdio_wp.range.len;
 		wake_userfault(ctx, &range);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread
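With this patch the wake-up condition becomes `!mode_wp && !mode_dontwake`: waiters are woken only when write protection is being resolved and the caller did not pass DONTWAKE. A minimal sketch of that decision:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mirrors the post-patch wake logic in userfaultfd_writeprotect():
 * wake waiters only when un-protecting (resolving a WP fault) and
 * the caller did not ask to skip the wakeup.  Note that the
 * mode_wp && mode_dontwake combination is rejected earlier with
 * -EINVAL, so it can never reach this decision.
 */
static bool should_wake(bool mode_wp, bool mode_dontwake)
{
	return !mode_wp && !mode_dontwake;
}
```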

* [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (22 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-21 18:38   ` Jerome Glisse
  2019-02-25 21:19   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 25/26] userfaultfd: selftests: refactor statistics Peter Xu
  2019-02-12  2:56 ` [PATCH v2 26/26] userfaultfd: selftests: add write-protect test Peter Xu
  25 siblings, 2 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

From: Martin Cracauer <cracauer@cons.org>

Add documentation about the write protection support.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: rewrite in rst format; fixups here and there]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/admin-guide/mm/userfaultfd.rst | 51 ++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 5048cf661a8a..c30176e67900 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
 half copied page since it'll keep userfaulting until the copy has
 finished.
 
+Notes:
+
+- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
+  you must provide some kind of page in your thread after reading from
+  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
+  The normal behavior of the OS automatically providing a zero page on
+  an anonymous mmap is not in place.
+
+- None of the page-delivering ioctls default to the range that you
+  registered with.  You must fill in all fields for the appropriate
+  ioctl struct including the range.
+
+- You get the address of the access that triggered the missing page
+  event out of a struct uffd_msg that you read in the thread from the
+  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
+  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
+  the first of any of those IOCTLs wakes up the faulting thread.
+
+- Be sure to test for all errors including (pollfd[0].revents &
+  POLLERR).  This can happen, e.g. when ranges supplied were
+  incorrect.
+
+Write Protect Notifications
+---------------------------
+
+This is equivalent to (but faster than) using mprotect and a SIGSEGV
+signal handler.
+
+Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
+Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
+struct uffdio_writeprotect *) with mode = UFFDIO_WRITEPROTECT_MODE_WP
+set in the struct passed in.  The range does not default to and does not
+have to be identical to the range you registered with.  You can write
+protect as many ranges as you like (inside the registered range).
+Then, in the thread reading from uffd the struct will have
+msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
+ioctl(uffd, UFFDIO_WRITEPROTECT, struct uffdio_writeprotect *) again,
+this time with the mode field not having UFFDIO_WRITEPROTECT_MODE_WP set.
+This wakes up the faulting thread, which will continue to run with
+writes allowed.  This lets you do the bookkeeping about the write in
+the uffd-reading thread before the ioctl.
+
+If you registered with both UFFDIO_REGISTER_MODE_MISSING and
+UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
+which you supply a page and undo the write protection.  Note that there is a
+difference between writes into a WP area and into a !WP area.  The
+former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
+UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
+you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
+used.
+
 QEMU/KVM
 ========
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread
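When a region is registered with both MISSING and WP modes, the documentation above distinguishes the two fault kinds by msg.arg.pagefault.flags. A sketch of the dispatch in the uffd-reading thread; the flag values are illustrative copies of the uapi definitions (a real program should include linux/userfaultfd.h instead):

```c
#include <assert.h>
#include <stdint.h>

/* Pagefault flag bits mirroring include/uapi/linux/userfaultfd.h
 * (illustrative copies; include the real uapi header in production). */
#define UFFD_PAGEFAULT_FLAG_WRITE	((uint64_t)1 << 0)
#define UFFD_PAGEFAULT_FLAG_WP		((uint64_t)1 << 1)

enum fault_kind { FAULT_MISSING, FAULT_WP };

/*
 * A write into a write-protected area sets FLAG_WP; a write into a
 * not-yet-populated (!WP) area sets only FLAG_WRITE and still needs
 * a page via UFFDIO_COPY/UFFDIO_ZEROPAGE when MISSING mode is used.
 */
static enum fault_kind classify_fault(uint64_t flags)
{
	if (flags & UFFD_PAGEFAULT_FLAG_WP)
		return FAULT_WP;	/* resolve with UFFDIO_WRITEPROTECT, !WP */
	return FAULT_MISSING;		/* resolve with UFFDIO_COPY/ZEROPAGE */
}
```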

* [PATCH v2 25/26] userfaultfd: selftests: refactor statistics
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (23 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-26  6:50   ` Mike Rapoport
  2019-02-12  2:56 ` [PATCH v2 26/26] userfaultfd: selftests: add write-protect test Peter Xu
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

Introduce a uffd_stats structure for the selftest statistics, and at
the same time refactor the code to always pass the uffd_stats into
both the read()- and poll()-based fault handling threads, instead of
using two different ways to return the statistics.  No functional
change.

With the new structure, it's very easy to introduce new statistics.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 76 +++++++++++++++---------
 1 file changed, 49 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 5d1db824f73a..e5d12c209e09 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,6 +88,12 @@ static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
 static char *zeropage;
 pthread_attr_t attr;
 
+/* Userfaultfd test statistics */
+struct uffd_stats {
+	int cpu;
+	unsigned long missing_faults;
+};
+
 /* pthread_mutex_t starts at page offset 0 */
 #define area_mutex(___area, ___nr)					\
 	((pthread_mutex_t *) ((___area) + (___nr)*page_size))
@@ -127,6 +133,17 @@ static void usage(void)
 	exit(1);
 }
 
+static void uffd_stats_reset(struct uffd_stats *uffd_stats,
+			     unsigned long n_cpus)
+{
+	int i;
+
+	for (i = 0; i < n_cpus; i++) {
+		uffd_stats[i].cpu = i;
+		uffd_stats[i].missing_faults = 0;
+	}
+}
+
 static int anon_release_pages(char *rel_area)
 {
 	int ret = 0;
@@ -469,8 +486,8 @@ static int uffd_read_msg(int ufd, struct uffd_msg *msg)
 	return 0;
 }
 
-/* Return 1 if page fault handled by us; otherwise 0 */
-static int uffd_handle_page_fault(struct uffd_msg *msg)
+static void uffd_handle_page_fault(struct uffd_msg *msg,
+				   struct uffd_stats *stats)
 {
 	unsigned long offset;
 
@@ -485,18 +502,19 @@ static int uffd_handle_page_fault(struct uffd_msg *msg)
 	offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
 	offset &= ~(page_size-1);
 
-	return copy_page(uffd, offset);
+	if (copy_page(uffd, offset))
+		stats->missing_faults++;
 }
 
 static void *uffd_poll_thread(void *arg)
 {
-	unsigned long cpu = (unsigned long) arg;
+	struct uffd_stats *stats = (struct uffd_stats *)arg;
+	unsigned long cpu = stats->cpu;
 	struct pollfd pollfd[2];
 	struct uffd_msg msg;
 	struct uffdio_register uffd_reg;
 	int ret;
 	char tmp_chr;
-	unsigned long userfaults = 0;
 
 	pollfd[0].fd = uffd;
 	pollfd[0].events = POLLIN;
@@ -526,7 +544,7 @@ static void *uffd_poll_thread(void *arg)
 				msg.event), exit(1);
 			break;
 		case UFFD_EVENT_PAGEFAULT:
-			userfaults += uffd_handle_page_fault(&msg);
+			uffd_handle_page_fault(&msg, stats);
 			break;
 		case UFFD_EVENT_FORK:
 			close(uffd);
@@ -545,28 +563,27 @@ static void *uffd_poll_thread(void *arg)
 			break;
 		}
 	}
-	return (void *)userfaults;
+
+	return NULL;
 }
 
 pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER;
 
 static void *uffd_read_thread(void *arg)
 {
-	unsigned long *this_cpu_userfaults;
+	struct uffd_stats *stats = (struct uffd_stats *)arg;
 	struct uffd_msg msg;
 
-	this_cpu_userfaults = (unsigned long *) arg;
-	*this_cpu_userfaults = 0;
-
 	pthread_mutex_unlock(&uffd_read_mutex);
 	/* from here cancellation is ok */
 
 	for (;;) {
 		if (uffd_read_msg(uffd, &msg))
 			continue;
-		(*this_cpu_userfaults) += uffd_handle_page_fault(&msg);
+		uffd_handle_page_fault(&msg, stats);
 	}
-	return (void *)NULL;
+
+	return NULL;
 }
 
 static void *background_thread(void *arg)
@@ -582,13 +599,12 @@ static void *background_thread(void *arg)
 	return NULL;
 }
 
-static int stress(unsigned long *userfaults)
+static int stress(struct uffd_stats *uffd_stats)
 {
 	unsigned long cpu;
 	pthread_t locking_threads[nr_cpus];
 	pthread_t uffd_threads[nr_cpus];
 	pthread_t background_threads[nr_cpus];
-	void **_userfaults = (void **) userfaults;
 
 	finished = 0;
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
@@ -597,12 +613,13 @@ static int stress(unsigned long *userfaults)
 			return 1;
 		if (bounces & BOUNCE_POLL) {
 			if (pthread_create(&uffd_threads[cpu], &attr,
-					   uffd_poll_thread, (void *)cpu))
+					   uffd_poll_thread,
+					   (void *)&uffd_stats[cpu]))
 				return 1;
 		} else {
 			if (pthread_create(&uffd_threads[cpu], &attr,
 					   uffd_read_thread,
-					   &_userfaults[cpu]))
+					   (void *)&uffd_stats[cpu]))
 				return 1;
 			pthread_mutex_lock(&uffd_read_mutex);
 		}
@@ -639,7 +656,8 @@ static int stress(unsigned long *userfaults)
 				fprintf(stderr, "pipefd write error\n");
 				return 1;
 			}
-			if (pthread_join(uffd_threads[cpu], &_userfaults[cpu]))
+			if (pthread_join(uffd_threads[cpu],
+					 (void *)&uffd_stats[cpu]))
 				return 1;
 		} else {
 			if (pthread_cancel(uffd_threads[cpu]))
@@ -910,11 +928,11 @@ static int userfaultfd_events_test(void)
 {
 	struct uffdio_register uffdio_register;
 	unsigned long expected_ioctls;
-	unsigned long userfaults;
 	pthread_t uffd_mon;
 	int err, features;
 	pid_t pid;
 	char c;
+	struct uffd_stats stats = { 0 };
 
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
@@ -941,7 +959,7 @@ static int userfaultfd_events_test(void)
 			"unexpected missing ioctl for anon memory\n"),
 			exit(1);
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
 		perror("uffd_poll_thread create"), exit(1);
 
 	pid = fork();
@@ -957,13 +975,13 @@ static int userfaultfd_events_test(void)
 
 	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
 		perror("pipe write"), exit(1);
-	if (pthread_join(uffd_mon, (void **)&userfaults))
+	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
 	close(uffd);
-	printf("userfaults: %ld\n", userfaults);
+	printf("userfaults: %ld\n", stats.missing_faults);
 
-	return userfaults != nr_pages;
+	return stats.missing_faults != nr_pages;
 }
 
 static int userfaultfd_sig_test(void)
@@ -975,6 +993,7 @@ static int userfaultfd_sig_test(void)
 	int err, features;
 	pid_t pid;
 	char c;
+	struct uffd_stats stats = { 0 };
 
 	printf("testing signal delivery: ");
 	fflush(stdout);
@@ -1006,7 +1025,7 @@ static int userfaultfd_sig_test(void)
 	if (uffd_test_ops->release_pages(area_dst))
 		return 1;
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
 		perror("uffd_poll_thread create"), exit(1);
 
 	pid = fork();
@@ -1032,6 +1051,7 @@ static int userfaultfd_sig_test(void)
 	close(uffd);
 	return userfaults != 0;
 }
+
 static int userfaultfd_stress(void)
 {
 	void *area;
@@ -1040,7 +1060,7 @@ static int userfaultfd_stress(void)
 	struct uffdio_register uffdio_register;
 	unsigned long cpu;
 	int err;
-	unsigned long userfaults[nr_cpus];
+	struct uffd_stats uffd_stats[nr_cpus];
 
 	uffd_test_ops->allocate_area((void **)&area_src);
 	if (!area_src)
@@ -1169,8 +1189,10 @@ static int userfaultfd_stress(void)
 		if (uffd_test_ops->release_pages(area_dst))
 			return 1;
 
+		uffd_stats_reset(uffd_stats, nr_cpus);
+
 		/* bounce pass */
-		if (stress(userfaults))
+		if (stress(uffd_stats))
 			return 1;
 
 		/* unregister */
@@ -1213,7 +1235,7 @@ static int userfaultfd_stress(void)
 
 		printf("userfaults:");
 		for (cpu = 0; cpu < nr_cpus; cpu++)
-			printf(" %lu", userfaults[cpu]);
+			printf(" %lu", uffd_stats[cpu].missing_faults);
 		printf("\n");
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH v2 26/26] userfaultfd: selftests: add write-protect test
  2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
                   ` (24 preceding siblings ...)
  2019-02-12  2:56 ` [PATCH v2 25/26] userfaultfd: selftests: refactor statistics Peter Xu
@ 2019-02-12  2:56 ` Peter Xu
  2019-02-26  6:58   ` Mike Rapoport
  25 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-12  2:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, peterx, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

This patch adds uffd tests for write protection.

Instead of introducing new tests for it, let's simply squash the
uffd-wp tests into the existing uffd-missing test cases.  The changes
are:

(1) Bouncing tests

  We do the write-protection in two ways during the bouncing test:

  - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: this
    makes sure that during each bounce process every single page will
    fault at least twice: once for MISSING, once for WP.

  - By directly calling UFFDIO_WRITEPROTECT on already-faulted memory:
    To further torture the explicit page protection procedures of
    uffd-wp, we split each bounce procedure into two halves (in the
    background thread): the first half will be MISSING+WP for each
    page as explained above.  After the first half, we write protect
    the faulted region in the background thread, so that at least
    half of the pages will be write protected again; this is the
    first test of the new UFFDIO_WRITEPROTECT call.  Then we continue
    with the 2nd half, which will contain both MISSING and WP faults
    for the 2nd half, plus WP-only faults from the 1st half.

(2) Event/Signal test

  Mostly the same as the previous tests, but now doing MISSING+WP for
  each page.  For the sigbus-mode test we need to provide a standalone
  path to handle the write protection faults.

For all tests, collect statistics for uffd-wp pages as well.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 154 ++++++++++++++++++-----
 1 file changed, 126 insertions(+), 28 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index e5d12c209e09..57b5ac02080a 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -56,6 +56,7 @@
 #include <linux/userfaultfd.h>
 #include <setjmp.h>
 #include <stdbool.h>
+#include <assert.h>
 
 #include "../kselftest.h"
 
@@ -78,6 +79,8 @@ static int test_type;
 #define ALARM_INTERVAL_SECS 10
 static volatile bool test_uffdio_copy_eexist = true;
 static volatile bool test_uffdio_zeropage_eexist = true;
+/* Whether to test uffd write-protection */
+static bool test_uffdio_wp = false;
 
 static bool map_shared;
 static int huge_fd;
@@ -92,6 +95,7 @@ pthread_attr_t attr;
 struct uffd_stats {
 	int cpu;
 	unsigned long missing_faults;
+	unsigned long wp_faults;
 };
 
 /* pthread_mutex_t starts at page offset 0 */
@@ -141,9 +145,29 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
 	for (i = 0; i < n_cpus; i++) {
 		uffd_stats[i].cpu = i;
 		uffd_stats[i].missing_faults = 0;
+		uffd_stats[i].wp_faults = 0;
 	}
 }
 
+static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
+{
+	int i;
+	unsigned long long miss_total = 0, wp_total = 0;
+
+	for (i = 0; i < n_cpus; i++) {
+		miss_total += stats[i].missing_faults;
+		wp_total += stats[i].wp_faults;
+	}
+
+	printf("userfaults: %llu missing (", miss_total);
+	for (i = 0; i < n_cpus; i++)
+		printf("%lu+", stats[i].missing_faults);
+	printf("\b), %llu wp (", wp_total);
+	for (i = 0; i < n_cpus; i++)
+		printf("%lu+", stats[i].wp_faults);
+	printf("\b)\n");
+}
+
 static int anon_release_pages(char *rel_area)
 {
 	int ret = 0;
@@ -264,19 +288,15 @@ struct uffd_test_ops {
 	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
 };
 
-#define ANON_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
-					 (1 << _UFFDIO_COPY) | \
-					 (1 << _UFFDIO_ZEROPAGE))
-
 static struct uffd_test_ops anon_uffd_test_ops = {
-	.expected_ioctls = ANON_EXPECTED_IOCTLS,
+	.expected_ioctls = UFFD_API_RANGE_IOCTLS,
 	.allocate_area	= anon_allocate_area,
 	.release_pages	= anon_release_pages,
 	.alias_mapping = noop_alias_mapping,
 };
 
 static struct uffd_test_ops shmem_uffd_test_ops = {
-	.expected_ioctls = ANON_EXPECTED_IOCTLS,
+	.expected_ioctls = UFFD_API_RANGE_IOCTLS,
 	.allocate_area	= shmem_allocate_area,
 	.release_pages	= shmem_release_pages,
 	.alias_mapping = noop_alias_mapping,
@@ -300,6 +320,21 @@ static int my_bcmp(char *str1, char *str2, size_t n)
 	return 0;
 }
 
+static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
+{
+	struct uffdio_writeprotect prms = { 0 };
+
+	/* Write protection page faults */
+	prms.range.start = start;
+	prms.range.len = len;
+	/* Set or clear write protection, then wake up the waiters */
+	prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
+
+	if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms))
+		fprintf(stderr, "UFFDIO_WRITEPROTECT failed for address 0x%Lx\n",
+			start), exit(1);
+}
+
 static void *locking_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
@@ -438,7 +473,10 @@ static int __copy_page(int ufd, unsigned long offset, bool retry)
 	uffdio_copy.dst = (unsigned long) area_dst + offset;
 	uffdio_copy.src = (unsigned long) area_src + offset;
 	uffdio_copy.len = page_size;
-	uffdio_copy.mode = 0;
+	if (test_uffdio_wp)
+		uffdio_copy.mode = UFFDIO_COPY_MODE_WP;
+	else
+		uffdio_copy.mode = 0;
 	uffdio_copy.copy = 0;
 	if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
@@ -495,15 +533,21 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 		fprintf(stderr, "unexpected msg event %u\n",
 			msg->event), exit(1);
 
-	if (bounces & BOUNCE_VERIFY &&
-	    msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
-		fprintf(stderr, "unexpected write fault\n"), exit(1);
+	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
+		wp_range(uffd, msg->arg.pagefault.address, page_size, false);
+		stats->wp_faults++;
+	} else {
+		/* Missing page faults */
+		if (bounces & BOUNCE_VERIFY &&
+		    msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+			fprintf(stderr, "unexpected write fault\n"), exit(1);
 
-	offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
-	offset &= ~(page_size-1);
+		offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+		offset &= ~(page_size-1);
 
-	if (copy_page(uffd, offset))
-		stats->missing_faults++;
+		if (copy_page(uffd, offset))
+			stats->missing_faults++;
+	}
 }
 
 static void *uffd_poll_thread(void *arg)
@@ -589,11 +633,30 @@ static void *uffd_read_thread(void *arg)
 static void *background_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
-	unsigned long page_nr;
+	unsigned long page_nr, start_nr, mid_nr, end_nr;
 
-	for (page_nr = cpu * nr_pages_per_cpu;
-	     page_nr < (cpu+1) * nr_pages_per_cpu;
-	     page_nr++)
+	start_nr = cpu * nr_pages_per_cpu;
+	end_nr = (cpu+1) * nr_pages_per_cpu;
+	mid_nr = (start_nr + end_nr) / 2;
+
+	/* Copy the first half of the pages */
+	for (page_nr = start_nr; page_nr < mid_nr; page_nr++)
+		copy_page_retry(uffd, page_nr * page_size);
+
+	/*
+	 * If we need to test uffd-wp, set it up now.  Then we'll have
+	 * at least the first half of the pages mapped already which
+	 * can be write-protected for testing
+	 */
+	if (test_uffdio_wp)
+		wp_range(uffd, (unsigned long)area_dst + start_nr * page_size,
+			nr_pages_per_cpu * page_size, true);
+
+	/*
+	 * Continue the 2nd half of the page copying, handling write
+	 * protection faults if any
+	 */
+	for (page_nr = mid_nr; page_nr < end_nr; page_nr++)
 		copy_page_retry(uffd, page_nr * page_size);
 
 	return NULL;
@@ -755,17 +818,31 @@ static int faulting_process(int signal_test)
 	}
 
 	for (nr = 0; nr < split_nr_pages; nr++) {
+		int steps = 1;
+		unsigned long offset = nr * page_size;
+
 		if (signal_test) {
 			if (sigsetjmp(*sigbuf, 1) != 0) {
-				if (nr == lastnr) {
+				if (steps == 1 && nr == lastnr) {
 					fprintf(stderr, "Signal repeated\n");
 					return 1;
 				}
 
 				lastnr = nr;
 				if (signal_test == 1) {
-					if (copy_page(uffd, nr * page_size))
-						signalled++;
+					if (steps == 1) {
+						/* This is a MISSING request */
+						steps++;
+						if (copy_page(uffd, offset))
+							signalled++;
+					} else {
+						/* This is a WP request */
+						assert(steps == 2);
+						wp_range(uffd,
+							 (__u64)area_dst +
+							 offset,
+							 page_size, false);
+					}
 				} else {
 					signalled++;
 					continue;
@@ -778,8 +855,13 @@ static int faulting_process(int signal_test)
 			fprintf(stderr,
 				"nr %lu memory corruption %Lu %Lu\n",
 				nr, count,
-				count_verify[nr]), exit(1);
-		}
+				count_verify[nr]);
+		}
+		/*
+		 * Trigger a write protection fault, if any, by writing
+		 * the same value back.
+		 */
+		*area_count(area_dst, nr) = count;
 	}
 
 	if (signal_test)
@@ -801,6 +883,11 @@ static int faulting_process(int signal_test)
 				nr, count,
 				count_verify[nr]), exit(1);
 		}
+		/*
+		 * Trigger a write protection fault, if any, by writing
+		 * the same value back.
+		 */
+		*area_count(area_dst, nr) = count;
 	}
 
 	if (uffd_test_ops->release_pages(area_dst))
@@ -949,6 +1036,8 @@ static int userfaultfd_events_test(void)
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (test_uffdio_wp)
+		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		fprintf(stderr, "register failure\n"), exit(1);
 
@@ -979,7 +1068,8 @@ static int userfaultfd_events_test(void)
 		return 1;
 
 	close(uffd);
-	printf("userfaults: %ld\n", stats.missing_faults);
+
+	uffd_stats_report(&stats, 1);
 
 	return stats.missing_faults != nr_pages;
 }
@@ -1009,6 +1099,8 @@ static int userfaultfd_sig_test(void)
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (test_uffdio_wp)
+		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		fprintf(stderr, "register failure\n"), exit(1);
 
@@ -1141,6 +1233,8 @@ static int userfaultfd_stress(void)
 		uffdio_register.range.start = (unsigned long) area_dst;
 		uffdio_register.range.len = nr_pages * page_size;
 		uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+		if (test_uffdio_wp)
+			uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
 			fprintf(stderr, "register failure\n");
 			return 1;
@@ -1195,6 +1289,11 @@ static int userfaultfd_stress(void)
 		if (stress(uffd_stats))
 			return 1;
 
+		/* Clear all the write protections if there is any */
+		if (test_uffdio_wp)
+			wp_range(uffd, (unsigned long)area_dst,
+				 nr_pages * page_size, false);
+
 		/* unregister */
 		if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) {
 			fprintf(stderr, "unregister failure\n");
@@ -1233,10 +1332,7 @@ static int userfaultfd_stress(void)
 		area_src_alias = area_dst_alias;
 		area_dst_alias = tmp_area;
 
-		printf("userfaults:");
-		for (cpu = 0; cpu < nr_cpus; cpu++)
-			printf(" %lu", uffd_stats[cpu].missing_faults);
-		printf("\n");
+		uffd_stats_report(uffd_stats, nr_cpus);
 	}
 
 	if (err)
@@ -1276,6 +1372,8 @@ static void set_test_type(const char *type)
 	if (!strcmp(type, "anon")) {
 		test_type = TEST_ANON;
 		uffd_test_ops = &anon_uffd_test_ops;
+		/* Only enable write-protect test for anonymous test */
+		test_uffdio_wp = true;
 	} else if (!strcmp(type, "hugetlb")) {
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 04/26] mm: allow VM_FAULT_RETRY for multiple times
  2019-02-12  2:56 ` [PATCH v2 04/26] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
@ 2019-02-13  3:34   ` Peter Xu
  2019-02-20 11:48     ` Peter Xu
  2019-02-21  8:56   ` [PATCH v2.1 " Peter Xu
  1 sibling, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-13  3:34 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, Martin Cracauer, Shaohua Li,
	Marty McFadden, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:10AM +0800, Peter Xu wrote:

[...]

> @@ -1351,7 +1351,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>  			 unsigned int flags)
>  {
> -	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> +	if (!flags & FAULT_FLAG_TRIED) {

Sorry, this should be:

        if (!(flags & FAULT_FLAG_TRIED))

It escaped the tests, but I spotted it when compiling the tree on
another host.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 04/26] mm: allow VM_FAULT_RETRY for multiple times
  2019-02-13  3:34   ` Peter Xu
@ 2019-02-20 11:48     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-20 11:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Hugh Dickins, Maya Gokhale, Jerome Glisse,
	Pavel Emelyanov, Johannes Weiner, Martin Cracauer, Shaohua Li,
	Marty McFadden, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Wed, Feb 13, 2019 at 11:34:44AM +0800, Peter Xu wrote:
> On Tue, Feb 12, 2019 at 10:56:10AM +0800, Peter Xu wrote:
> 
> [...]
> 
> > @@ -1351,7 +1351,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
> >  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> >  			 unsigned int flags)
> >  {
> > -	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > +	if (!flags & FAULT_FLAG_TRIED) {
> 
> Sorry, this should be:
> 
>         if (!(flags & FAULT_FLAG_TRIED))

Ok this is problematic too...  Because we do allow the page fault
flags to be both !ALLOW_RETRY and !TRIED (e.g., when doing GUP and
when __get_user_pages() is called with locked==NULL and !FOLL_NOWAIT).
So the current code will fall through the if condition and call
up_read(mmap_sem) even when the above happens (while it shouldn't,
because the GUP caller would assume the mmap_sem is still held).  So
the correct check should be:

  if ((flags & FAULT_FLAG_ALLOW_RETRY) && !(flags & FAULT_FLAG_TRIED))

To make things easier, I'll just repost this single patch later.
Sorry for the noise.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH v2.1 04/26] mm: allow VM_FAULT_RETRY for multiple times
  2019-02-12  2:56 ` [PATCH v2 04/26] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
  2019-02-13  3:34   ` Peter Xu
@ 2019-02-21  8:56   ` Peter Xu
  2019-02-21 15:53     ` Jerome Glisse
  1 sibling, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-21  8:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, David Hildenbrand, Hugh Dickins, Maya Gokhale,
	Jerome Glisse, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

The idea comes from a discussion between Linus and Andrea [1].

Before this patch we only allowed a page fault to retry once.  We
achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when calling
handle_mm_fault() the second time.  This was mainly used to avoid
unexpected starvation of the system by looping forever to handle the
page fault on a single page.  However that should hardly happen, since
after all each code path that returns VM_FAULT_RETRY first waits for a
condition (during which time we should possibly yield the cpu) before
VM_FAULT_RETRY is really returned.

This patch removes the restriction by keeping the
FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
that the page fault handler now can retry the page fault for multiple
times if necessary without the need to generate another page fault
event.  Meanwhile we still keep the FAULT_FLAG_TRIED flag so page
fault handler can still identify whether a page fault is the first
attempt or not.

Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):

  - ALLOW_RETRY and !TRIED:  the page fault is allowed to retry, and
                             this is the first try

  - ALLOW_RETRY and TRIED:   the page fault is allowed to retry, and
                             this is not the first try

  - !ALLOW_RETRY and !TRIED: the page fault is not allowed to retry
                             at all

  - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used

In the existing code we have multiple places that have taken special
care of the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to
detect the first attempt of a page fault by checking both (fault_flags
& FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED),
because now even the 2nd try will have ALLOW_RETRY set, and then uses
that helper in all the existing special paths.  One example is
__lock_page_or_retry(): now we'll drop the mmap_sem only on the first
attempt of the page fault and keep it in follow-up retries, so the old
locking behavior is retained.

This will be a nice enhancement for the current code [2] and at the
same time supporting material for the future userfaultfd-writeprotect
work, since in that work there will always be an explicit userfault
write-protect retry for protected pages, and if that cannot resolve
the page fault (e.g., when userfaultfd-writeprotect is used in
conjunction with swapped pages) then we'll possibly need a 3rd retry
of the page fault.  It might also benefit other potential users with
similar requirements, like userfault write-protection.

GUP code is not touched yet and will be covered in a follow-up patch.

Please read the thread below for more information.

[1] https://lkml.org/lkml/2017/11/2/833
[2] https://lkml.org/lkml/2018/12/30/64

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---

 arch/alpha/mm/fault.c           |  2 +-
 arch/arc/mm/fault.c             |  1 -
 arch/arm/mm/fault.c             |  3 ---
 arch/arm64/mm/fault.c           |  5 -----
 arch/hexagon/mm/vm_fault.c      |  1 -
 arch/ia64/mm/fault.c            |  1 -
 arch/m68k/mm/fault.c            |  3 ---
 arch/microblaze/mm/fault.c      |  1 -
 arch/mips/mm/fault.c            |  1 -
 arch/nds32/mm/fault.c           |  1 -
 arch/nios2/mm/fault.c           |  3 ---
 arch/openrisc/mm/fault.c        |  1 -
 arch/parisc/mm/fault.c          |  2 --
 arch/powerpc/mm/fault.c         |  6 ------
 arch/riscv/mm/fault.c           |  5 -----
 arch/s390/mm/fault.c            |  5 +----
 arch/sh/mm/fault.c              |  1 -
 arch/sparc/mm/fault_32.c        |  1 -
 arch/sparc/mm/fault_64.c        |  1 -
 arch/um/kernel/trap.c           |  1 -
 arch/unicore32/mm/fault.c       |  6 +-----
 arch/x86/mm/fault.c             |  2 --
 arch/xtensa/mm/fault.c          |  1 -
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +++++++++---
 include/linux/mm.h              | 12 +++++++++++-
 mm/filemap.c                    |  2 +-
 mm/shmem.c                      |  2 +-
 27 files changed, 25 insertions(+), 57 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 8a2ef90b4bfc..6a02c0fb36b9 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -169,7 +169,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index dc5f1b8859d2..664e18a8749f 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -167,7 +167,6 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 			}
 
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c41c021bbe40..7910b4b5205d 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -342,9 +342,6 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 					regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index a38ff8c49a66..d1d3c98f9ffb 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -523,12 +523,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 			return 0;
 		}
 
-		/*
-		 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
-		 * starvation.
-		 */
 		if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
-			mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			mm_flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index be10b441d9cc..576751597e77 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -115,7 +115,6 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 62c2d39d2bed..9de95d39935e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -189,7 +189,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index d9808a807ab8..b1b2109e4ab4 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -162,9 +162,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 4fd2dbd0c5ca..05a4847ac0bf 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -236,7 +236,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 92374fd091d2..9953b5b571df 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -178,7 +178,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 			tsk->min_flt++;
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 9f6e477b9e30..32259afc751a 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -242,7 +242,6 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 5939434a31ae..9dd1c51acc22 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -158,9 +158,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 873ecb5d82d7..ff92c5674781 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -185,7 +185,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 29422eec329d..7d3e96a9a7ab 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -327,8 +327,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-
 			/*
 			 * No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index aaa853e6592f..c831cb3ce03f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -583,13 +583,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * case.
 	 */
 	if (unlikely(fault & VM_FAULT_RETRY)) {
-		/* We retry only once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			if (is_user && signal_pending(current))
 				return 0;
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 4fc8d746bec3..aad2c0557d2f 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -154,11 +154,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY);
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index aba1dad1efcd..4e8c066964a9 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -513,10 +513,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 				fault = VM_FAULT_PFAULT;
 				goto out_up;
 			}
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY |
-				   FAULT_FLAG_RETRY_NOWAIT);
+			flags &= ~FAULT_FLAG_RETRY_NOWAIT;
 			flags |= FAULT_FLAG_TRIED;
 			down_read(&mm->mmap_sem);
 			goto retry;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index baf5d73df40c..cd710e2d7c57 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -498,7 +498,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 				      regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index a2c83104fe35..6735cd1c09b9 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -261,7 +261,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cad71ec5c7b3..28d5b4d012c6 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -459,7 +459,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 09baf37b65b9..c63fc292aea0 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -99,7 +99,6 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 
 				goto retry;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 3611f19234a1..fdf577956f5f 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -260,12 +260,8 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 			tsk->maj_flt++;
 		else
 			tsk->min_flt++;
-		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		if (fault & VM_FAULT_RETRY)
 			goto retry;
-		}
 	}
 
 	up_read(&mm->mmap_sem);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 248ff0a28ecd..d842c3e02a50 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1483,9 +1483,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (unlikely(fault & VM_FAULT_RETRY)) {
 		bool is_user = flags & FAULT_FLAG_USER;
 
-		/* Retry at most once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			if (is_user && signal_pending(tsk))
 				return;
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 792dad5e2f12..7cd55f2d66c9 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -128,7 +128,6 @@ void do_page_fault(struct pt_regs *regs)
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index a1d977fbade5..5fac635f72a5 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -61,9 +61,10 @@ static vm_fault_t ttm_bo_vm_fault_idle(struct ttm_buffer_object *bo,
 
 	/*
 	 * If possible, avoid waiting for GPU with mmap_sem
-	 * held.
+	 * held.  We only do this if the fault allows retry and this
+	 * is the first attempt.
 	 */
-	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+	if (fault_flag_allow_retry_first(vmf->flags)) {
 		ret = VM_FAULT_RETRY;
 		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
 			goto out_unlock;
@@ -136,7 +137,12 @@ static vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
 		if (err != -EBUSY)
 			return VM_FAULT_NOPAGE;
 
-		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+		/*
+		 * If the fault allows retry and this is the first
+		 * fault attempt, we try to release the mmap_sem
+		 * before waiting
+		 */
+		if (fault_flag_allow_retry_first(vmf->flags)) {
 			if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
 				ttm_bo_get(bo);
 				up_read(&vmf->vma->vm_mm->mmap_sem);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..4e11c9639f1b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -341,11 +341,21 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_ALLOW_RETRY	0x04	/* Retry fault if blocking */
 #define FAULT_FLAG_RETRY_NOWAIT	0x08	/* Don't drop mmap_sem and wait when retrying */
 #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
-#define FAULT_FLAG_TRIED	0x20	/* Second try */
+#define FAULT_FLAG_TRIED	0x20	/* We've tried once */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
 
+/*
+ * Returns true if the page fault allows retry and this is the first
+ * attempt of the fault handling; false otherwise.
+ */
+static inline bool fault_flag_allow_retry_first(unsigned int flags)
+{
+	return (flags & FAULT_FLAG_ALLOW_RETRY) &&
+	    (!(flags & FAULT_FLAG_TRIED));
+}
+
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
 	{ FAULT_FLAG_MKWRITE,		"MKWRITE" }, \
diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..a2b5c53166de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1351,7 +1351,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
-	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_sem is not released
 		 * even though return 0.
diff --git a/mm/shmem.c b/mm/shmem.c
index 6ece1e2fe76e..06fd5e79e1c9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1949,7 +1949,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
 			ret = VM_FAULT_NOPAGE;
-			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
+			if (fault_flag_allow_retry_first(vmf->flags) &&
 			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
 				/* It's polite to up mmap_sem if we can */
 				up_read(&vma->vm_mm->mmap_sem);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 01/26] mm: gup: rename "nonblocking" to "locked" where proper
  2019-02-12  2:56 ` [PATCH v2 01/26] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
@ 2019-02-21 15:17   ` Jerome Glisse
  2019-02-22  3:42     ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 15:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:07AM +0800, Peter Xu wrote:
> There are plenty of places around __get_user_pages() that have a
> parameter "nonblocking" which does not really mean "it won't block"
> (because it can block); instead it indicates whether the mmap_sem is
> released by up_read() during the page fault handling, mostly when
> VM_FAULT_RETRY is returned.
> 
> We have the correct naming in e.g. get_user_pages_locked() or
> get_user_pages_remote() as "locked", however there are still many
> places that use "nonblocking" as the name.
> 
> Rename those places to "locked" where proper to better suit the
> functionality of the variable.  While at it, fix up some of the
> comments accordingly.
> 
> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Minor issue see below

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

[...]

> @@ -656,13 +656,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   * appropriate) must be called after the page is finished with, and
>   * before put_page is called.
>   *
> - * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
> - * or mmap_sem contention, and if waiting is needed to pin all pages,
> - * *@nonblocking will be set to 0.  Further, if @gup_flags does not
> - * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
> - * this case.
> + * If @locked != NULL, *@locked will be set to 0 when mmap_sem is
> + * released by an up_read().  That can happen if @gup_flags does not
> + * has FOLL_NOWAIT.

I am not a native speaker but I believe the correct wording is:
     @gup_flags does not have FOLL_NOWAIT

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 02/26] mm: userfault: return VM_FAULT_RETRY on signals
  2019-02-12  2:56 ` [PATCH v2 02/26] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
@ 2019-02-21 15:29   ` Jerome Glisse
  2019-02-22  3:51     ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 15:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:08AM +0800, Peter Xu wrote:
> The idea comes from the upstream discussion between Linus and Andrea:
> 
>   https://lkml.org/lkml/2017/10/30/560
> 
> A summary of the issue: in the past there was a special path in
> handle_userfault() that returned VM_FAULT_NOPAGE when we detected
> non-fatal signals while waiting for userfault handling.  We did that
> by reacquiring the mmap_sem before returning.  However that brings a
> risk: the vmas might have changed when we retake the mmap_sem, and we
> could even be holding an invalid vma structure.
> 
> This patch removes the special path: we now return VM_FAULT_RETRY
> through the common path even if we have received such signals.  Then,
> for all the architectures that pass FAULT_FLAG_ALLOW_RETRY into
> handle_mm_fault(), we check not only for SIGKILL but for all pending
> userspace signals right after we return from handle_mm_fault().  This
> allows userspace to handle non-fatal signals faster than before.
> 
> This patch is preparation for the next patch, which finally removes
> the special code path mentioned above from handle_userfault().
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

See maybe minor improvement

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

[...]

> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 58f69fa07df9..c41c021bbe40 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -314,12 +314,12 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  
>  	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
>  
> -	/* If we need to retry but a fatal signal is pending, handle the
> +	/* If we need to retry but a signal is pending, handle the
>  	 * signal first. We do not need to release the mmap_sem because
>  	 * it would already be released in __lock_page_or_retry in
>  	 * mm/filemap.c. */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> -		if (!user_mode(regs))
> +	if (unlikely(fault & VM_FAULT_RETRY && signal_pending(current))) {

I'd rather see (fault & VM_FAULT_RETRY), i.e. with the parentheses, as
it avoids the need to remember operator precedence rules :)

[...]

> diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
> index 68d5f2a27f38..9f6e477b9e30 100644
> --- a/arch/nds32/mm/fault.c
> +++ b/arch/nds32/mm/fault.c
> @@ -206,12 +206,12 @@ void do_page_fault(unsigned long entry, unsigned long addr,
>  	fault = handle_mm_fault(vma, addr, flags);
>  
>  	/*
> -	 * If we need to retry but a fatal signal is pending, handle the
> +	 * If we need to retry but a signal is pending, handle the
>  	 * signal first. We do not need to release the mmap_sem because it
>  	 * would already be released in __lock_page_or_retry in mm/filemap.c.
>  	 */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> -		if (!user_mode(regs))
> +	if (fault & VM_FAULT_RETRY && signal_pending(current)) {

Same as above: parentheses, maybe.

[...]

> diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
> index 0e8b6158f224..09baf37b65b9 100644
> --- a/arch/um/kernel/trap.c
> +++ b/arch/um/kernel/trap.c
> @@ -76,8 +76,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
>  
>  		fault = handle_mm_fault(vma, address, flags);
>  
> -		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +		if (fault & VM_FAULT_RETRY && signal_pending(current)) {

Same as above: parentheses, maybe.

[...]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 03/26] userfaultfd: don't retake mmap_sem to emulate NOPAGE
  2019-02-12  2:56 ` [PATCH v2 03/26] userfaultfd: don't retake mmap_sem to emulate NOPAGE Peter Xu
@ 2019-02-21 15:34   ` Jerome Glisse
  0 siblings, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 15:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:09AM +0800, Peter Xu wrote:
> The idea comes from the upstream discussion between Linus and Andrea:
> 
> https://lkml.org/lkml/2017/10/30/560
> 
> A summary of the issue: in the past there was a special path in
> handle_userfault() that returned VM_FAULT_NOPAGE when we detected
> non-fatal signals while waiting for userfault handling.  We did that
> by reacquiring the mmap_sem before returning.  However that brings a
> risk: the vmas might have changed when we retake the mmap_sem, and we
> could even be holding an invalid vma structure.
> 
> This patch removes the risky path in handle_userfault(), so that the
> callers of handle_mm_fault() will know that the VMAs might have
> changed.  Meanwhile, thanks to the previous patch, we don't lose
> responsiveness either, since the core mm code can now handle
> non-fatal userspace signals quickly even if we return VM_FAULT_RETRY.
> 
> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  fs/userfaultfd.c | 24 ------------------------
>  1 file changed, 24 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 89800fc7dc9d..b397bc3b954d 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -514,30 +514,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>  
>  	__set_current_state(TASK_RUNNING);
>  
> -	if (return_to_userland) {
> -		if (signal_pending(current) &&
> -		    !fatal_signal_pending(current)) {
> -			/*
> -			 * If we got a SIGSTOP or SIGCONT and this is
> -			 * a normal userland page fault, just let
> -			 * userland return so the signal will be
> -			 * handled and gdb debugging works.  The page
> -			 * fault code immediately after we return from
> -			 * this function is going to release the
> -			 * mmap_sem and it's not depending on it
> -			 * (unlike gup would if we were not to return
> -			 * VM_FAULT_RETRY).
> -			 *
> -			 * If a fatal signal is pending we still take
> -			 * the streamlined VM_FAULT_RETRY failure path
> -			 * and there's no need to retake the mmap_sem
> -			 * in such case.
> -			 */
> -			down_read(&mm->mmap_sem);
> -			ret = VM_FAULT_NOPAGE;
> -		}
> -	}
> -
>  	/*
>  	 * Here we race with the list_del; list_add in
>  	 * userfaultfd_ctx_read(), however because we don't ever run
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2.1 04/26] mm: allow VM_FAULT_RETRY for multiple times
  2019-02-21  8:56   ` [PATCH v2.1 " Peter Xu
@ 2019-02-21 15:53     ` Jerome Glisse
  2019-02-22  4:25       ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 15:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 04:56:56PM +0800, Peter Xu wrote:
> The idea comes from a discussion between Linus and Andrea [1].
> 
> Before this patch we only allowed a page fault to be retried once.  We
> achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> handle_mm_fault() the second time.  This was mainly used to avoid
> unexpected starvation of the system by looping forever over the page
> fault of a single page.  However that should hardly happen: after all,
> each code path that returns VM_FAULT_RETRY first waits for a condition
> (during which time it should possibly yield the cpu) before
> VM_FAULT_RETRY is really returned.
> 
> This patch removes the restriction by keeping the
> FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
> that the page fault handler now can retry the page fault for multiple
> times if necessary without the need to generate another page fault
> event.  Meanwhile we still keep the FAULT_FLAG_TRIED flag so page
> fault handler can still identify whether a page fault is the first
> attempt or not.
> 
> Then we'll have these combinations of fault flags (only considering
> ALLOW_RETRY flag and TRIED flag):
> 
>   - ALLOW_RETRY and !TRIED:  the page fault is allowed to retry, and
>                              this is the first try
> 
>   - ALLOW_RETRY and TRIED:   the page fault is allowed to retry, and
>                              this is not the first try
> 
>   - !ALLOW_RETRY and !TRIED: the page fault is not allowed to retry at
>                              all
> 
>   - !ALLOW_RETRY and TRIED:  this combination is forbidden and should
>                              never be used
> 
> In the existing code multiple places have taken special care of the
> first condition above by checking against (fault_flags &
> FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper that
> detects the first attempt of a page fault by checking against both
> (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flags &
> FAULT_FLAG_TRIED), because now even the 2nd try will have ALLOW_RETRY
> set, and then uses that helper in all the existing special paths.  One
> example is __lock_page_or_retry(): we now drop the mmap_sem only on
> the first attempt of the page fault and keep it in follow-up retries,
> so the old locking behavior is retained.
> 
> This will be a nice enhancement for the current code [2] and at the
> same time supporting material for the future userfaultfd-writeprotect
> work, since in that work there will always be an explicit userfault
> writeprotect retry for protected pages, and if that cannot resolve the
> page fault (e.g., when userfaultfd-writeprotect is used in conjunction
> with swapped pages) then we'll possibly need a 3rd retry of the page
> fault.  It might also benefit other potential users with similar
> requirements, such as userfault write-protection.
> 
> GUP code is not touched yet and will be covered in a follow-up patch.
> 
> Please read the thread below for more information.
> 
> [1] https://lkml.org/lkml/2017/11/2/833
> [2] https://lkml.org/lkml/2018/12/30/64

I have few comments on this one. See below.


> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> 
>  arch/alpha/mm/fault.c           |  2 +-
>  arch/arc/mm/fault.c             |  1 -
>  arch/arm/mm/fault.c             |  3 ---
>  arch/arm64/mm/fault.c           |  5 -----
>  arch/hexagon/mm/vm_fault.c      |  1 -
>  arch/ia64/mm/fault.c            |  1 -
>  arch/m68k/mm/fault.c            |  3 ---
>  arch/microblaze/mm/fault.c      |  1 -
>  arch/mips/mm/fault.c            |  1 -
>  arch/nds32/mm/fault.c           |  1 -
>  arch/nios2/mm/fault.c           |  3 ---
>  arch/openrisc/mm/fault.c        |  1 -
>  arch/parisc/mm/fault.c          |  2 --
>  arch/powerpc/mm/fault.c         |  6 ------
>  arch/riscv/mm/fault.c           |  5 -----
>  arch/s390/mm/fault.c            |  5 +----
>  arch/sh/mm/fault.c              |  1 -
>  arch/sparc/mm/fault_32.c        |  1 -
>  arch/sparc/mm/fault_64.c        |  1 -
>  arch/um/kernel/trap.c           |  1 -
>  arch/unicore32/mm/fault.c       |  6 +-----
>  arch/x86/mm/fault.c             |  2 --
>  arch/xtensa/mm/fault.c          |  1 -
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +++++++++---
>  include/linux/mm.h              | 12 +++++++++++-
>  mm/filemap.c                    |  2 +-
>  mm/shmem.c                      |  2 +-
>  27 files changed, 25 insertions(+), 57 deletions(-)
> 

[...]

> diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
> index 29422eec329d..7d3e96a9a7ab 100644
> --- a/arch/parisc/mm/fault.c
> +++ b/arch/parisc/mm/fault.c
> @@ -327,8 +327,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
>  		else
>  			current->min_flt++;
>  		if (fault & VM_FAULT_RETRY) {
> -			flags &= ~FAULT_FLAG_ALLOW_RETRY;

Don't you need to also add:
     flags |= FAULT_FLAG_TRIED;

Like the other architectures do?


[...]

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 248ff0a28ecd..d842c3e02a50 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1483,9 +1483,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	if (unlikely(fault & VM_FAULT_RETRY)) {
>  		bool is_user = flags & FAULT_FLAG_USER;
>  
> -		/* Retry at most once */
>  		if (flags & FAULT_FLAG_ALLOW_RETRY) {
> -			flags &= ~FAULT_FLAG_ALLOW_RETRY;
>  			flags |= FAULT_FLAG_TRIED;
>  			if (is_user && signal_pending(tsk))
>  				return;

So here you have a change in behavior: it can retry indefinitely for as
long as there is no signal.  Don't you want to test for FAULT_FLAG_TRIED?

[...]

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..4e11c9639f1b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -341,11 +341,21 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_ALLOW_RETRY	0x04	/* Retry fault if blocking */
>  #define FAULT_FLAG_RETRY_NOWAIT	0x08	/* Don't drop mmap_sem and wait when retrying */
>  #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
> -#define FAULT_FLAG_TRIED	0x20	/* Second try */
> +#define FAULT_FLAG_TRIED	0x20	/* We've tried once */
>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
>  
> +/*
> + * Returns true if the page fault allows retry and this is the first
> + * attempt of the fault handling; false otherwise.
> + */

You should add why it returns false if it is not the first try, i.e. to
avoid starvation.

> +static inline bool fault_flag_allow_retry_first(unsigned int flags)
> +{
> +	return (flags & FAULT_FLAG_ALLOW_RETRY) &&
> +	    (!(flags & FAULT_FLAG_TRIED));
> +}
> +
>  #define FAULT_FLAG_TRACE \
>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
>  	{ FAULT_FLAG_MKWRITE,		"MKWRITE" }, \

[...]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 05/26] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-02-12  2:56 ` [PATCH v2 05/26] mm: gup: " Peter Xu
@ 2019-02-21 16:06   ` Jerome Glisse
  2019-02-22  4:41     ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 16:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:11AM +0800, Peter Xu wrote:
> This is the gup counterpart of the change that allows VM_FAULT_RETRY
> to happen more than once.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  mm/gup.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index fa75a03204c1..ba387aec0d80 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -528,7 +528,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
>  	if (*flags & FOLL_TRIED) {
> -		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +		/*
> +		 * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
> +		 * can co-exist
> +		 */
>  		fault_flags |= FAULT_FLAG_TRIED;
>  	}
>  
> @@ -943,17 +946,23 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
>  		pages += ret;
>  		start += ret << PAGE_SHIFT;
> +		lock_dropped = true;
>  
> +retry:
>  		/*
>  		 * Repeat on the address that fired VM_FAULT_RETRY
> -		 * without FAULT_FLAG_ALLOW_RETRY but with
> +		 * with both FAULT_FLAG_ALLOW_RETRY and
>  		 * FAULT_FLAG_TRIED.
>  		 */
>  		*locked = 1;
> -		lock_dropped = true;
>  		down_read(&mm->mmap_sem);
>  		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
> -				       pages, NULL, NULL);
> +				       pages, NULL, locked);
> +		if (!*locked) {
> +			/* Continue to retry until we succeed */
> +			BUG_ON(ret != 0);
> +			goto retry;
> +		}
>  		if (ret != 1) {
>  			BUG_ON(ret > 1);
>  			if (!pages_done)
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 06/26] userfaultfd: wp: add helper for writeprotect check
  2019-02-12  2:56 ` [PATCH v2 06/26] userfaultfd: wp: add helper for writeprotect check Peter Xu
@ 2019-02-21 16:07   ` Jerome Glisse
  2019-02-25 15:41   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 16:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Pavel Emelyanov, Rik van Riel

On Tue, Feb 12, 2019 at 10:56:12AM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add a helper for the write-protect check.  It will be used later.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/linux/userfaultfd_k.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 37c9eba75c98..38f748e7186e 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -50,6 +50,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_UFFD_MISSING;
>  }
>  
> +static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> +{
> +	return vma->vm_flags & VM_UFFD_WP;
> +}
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> @@ -94,6 +99,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
>  	return false;
>  }
>  
> +static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return false;
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 07/26] userfaultfd: wp: hook userfault handler to write protection fault
  2019-02-12  2:56 ` [PATCH v2 07/26] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
@ 2019-02-21 16:25   ` Jerome Glisse
  2019-02-25 15:43   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 16:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:13AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> There are several cases in which a write protection fault happens.  It
> could be a write to the zero page, to a swapped page, or to a
> userfault write-protected page.  When the fault happens, there is no
> way to know if userfaultfd write-protected the page before.  Here we
> just blindly issue a userfault notification for vmas with VM_UFFD_WP,
> regardless of whether the app has write-protected the page yet.
> Applications should be ready to handle such wp faults.
> 
> v1: From: Shaohua Li <shli@fb.com>
> 
> v2: Handle the userfault in the common do_wp_page.  If we get there,
> a pagetable is present and read-only, so no further processing is
> needed until we resolve the userfault.
> 
> In the swapin case, always swapin as readonly. This will cause false
> positive userfaults. We need to decide later if to eliminate them with
> a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).
> 
> hugetlbfs wouldn't need to worry about swapouts, and tmpfs would
> be handled by a swap entry bit like anonymous memory.
> 
> The main problem, with no easy solution for eliminating the false
> positives, will arise if/when userfaultfd is extended to real
> filesystem pagecache.  When the pagecache is freed by reclaim we
> can't leave the radix tree pinned if the inode, and in turn the
> radix tree, is reclaimed as well.

For real filesystems my generic page write protection patchset might
be of use.  See my posting of it from last year.  I intend to repost
it in the next few weeks, as I am making steady progress on a
cleaned-up and updated version of it.

> 
> The estimate is that full accuracy and lack of false positives could
> easily be provided only for anonymous memory (as long as there's no
> fork, or as long as MADV_DONTFORK is used on the userfaultfd anonymous
> range), tmpfs and hugetlbfs; it's most certainly worth achieving, but
> in a later incremental patch.
> 
> v3: Add hooking point for THP wrprotect faults.
> 
> CC: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

I have some comments on this patch.

> ---
>  mm/memory.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e11ca9dd823f..00781c43407b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2483,6 +2483,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  
> +	if (userfaultfd_wp(vma)) {
> +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> +		return handle_userfault(vmf, VM_UFFD_WP);
> +	}
> +
>  	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>  	if (!vmf->page) {
>  		/*
> @@ -2800,6 +2805,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
> +	if (userfaultfd_wp(vma))
> +		vmf->flags &= ~FAULT_FLAG_WRITE;

This looks wrong to me: by clearing FAULT_FLAG_WRITE you disable the
call to do_wp_page(), which would have handled the userfault
write-protect fault.  It seems to me that you want to prevent the code
path below from happening, so it would be better to change the check
below

From
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {

To
>  	if (!userfaultfd_wp(vma) && (vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {


>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
> @@ -3684,8 +3691,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  /* `inline' is required to avoid gcc 4.1.2 build error */
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
> -	if (vma_is_anonymous(vmf->vma))
> +	if (vma_is_anonymous(vmf->vma)) {
> +		if (userfaultfd_wp(vmf->vma))
> +			return handle_userfault(vmf, VM_UFFD_WP);
>  		return do_huge_pmd_wp_page(vmf, orig_pmd);
> +	}
>  	if (vmf->vma->vm_ops->huge_fault)
>  		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 08/26] userfaultfd: wp: add WP pagetable tracking to x86
  2019-02-12  2:56 ` [PATCH v2 08/26] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
@ 2019-02-21 17:20   ` Jerome Glisse
  2019-02-25 15:48   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 17:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:14AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Accurate userfaultfd WP tracking is possible by tracking exactly which
> virtual memory ranges were writeprotected by userland.  We can't rely
> only on the RW bit of the mapped pagetable because that information
> is destroyed by fork() or KSM or swap.  If we were to rely on that,
> we'd need to stay on the safe side and generate false positive wp
> faults for every swapped out page.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

So I thought about this some more, and the only alternative I see is
defining a new swap type to preserve the pte write bit when swapping,
and storing the original pte write bit within the ksm stable_node.
This would solve the false positives for swap and ksm.

But I do not see this as a better alternative to storing the wp status
as a bit in the pte. So:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  arch/x86/Kconfig                     |  1 +
>  arch/x86/include/asm/pgtable.h       | 52 ++++++++++++++++++++++++++++
>  arch/x86/include/asm/pgtable_64.h    |  8 ++++-
>  arch/x86/include/asm/pgtable_types.h |  9 +++++
>  include/asm-generic/pgtable.h        |  1 +
>  include/asm-generic/pgtable_uffd.h   | 51 +++++++++++++++++++++++++++
>  init/Kconfig                         |  5 +++
>  7 files changed, 126 insertions(+), 1 deletion(-)
>  create mode 100644 include/asm-generic/pgtable_uffd.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 68261430fe6e..cb43bc008675 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -209,6 +209,7 @@ config X86
>  	select USER_STACKTRACE_SUPPORT
>  	select VIRT_TO_BUS
>  	select X86_FEATURE_NAMES		if PROC_FS
> +	select HAVE_ARCH_USERFAULTFD_WP		if USERFAULTFD
>  
>  config INSTRUCTION_DECODER
>  	def_bool y
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2779ace16d23..6863236e8484 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -23,6 +23,7 @@
>  
>  #ifndef __ASSEMBLY__
>  #include <asm/x86_init.h>
> +#include <asm-generic/pgtable_uffd.h>
>  
>  extern pgd_t early_top_pgt[PTRS_PER_PGD];
>  int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
> @@ -293,6 +294,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  	return native_make_pte(v & ~clear);
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline int pte_uffd_wp(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_UFFD_WP;
> +}
> +
> +static inline pte_t pte_mkuffd_wp(pte_t pte)
> +{
> +	return pte_set_flags(pte, _PAGE_UFFD_WP);
> +}
> +
> +static inline pte_t pte_clear_uffd_wp(pte_t pte)
> +{
> +	return pte_clear_flags(pte, _PAGE_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
>  	return pte_clear_flags(pte, _PAGE_DIRTY);
> @@ -372,6 +390,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
>  	return native_make_pmd(v & ~clear);
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline int pmd_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_UFFD_WP;
> +}
> +
> +static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_UFFD_WP);
> +}
> +
> +static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  static inline pmd_t pmd_mkold(pmd_t pmd)
>  {
>  	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
> @@ -1351,6 +1386,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
>  #endif
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
> +{
> +	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
> +}
> +
> +static inline int pte_swp_uffd_wp(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
> +}
> +
> +static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
> +{
> +	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  #define PKRU_AD_BIT 0x1
>  #define PKRU_WD_BIT 0x2
>  #define PKRU_BITS_PER_PKEY 2
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 9c85b54bf03c..e0c5d29b8685 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
>   *
>   * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
>   * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
> - * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
> + * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
>   *
>   * G (8) is aliased and used as a PROT_NONE indicator for
>   * !present ptes.  We need to start storing swap entries above
> @@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
>   * erratum where they can be incorrectly set by hardware on
>   * non-present PTEs.
>   *
> + * SD Bits 1-4 are not used in non-present format and available for
> + * special use described below:
> + *
>   * SD (1) in swp entry is used to store soft dirty bit, which helps us
>   * remember soft dirty over page migration
>   *
> + * F (2) in swp entry is used to record when a pagetable is
> + * writeprotected by userfaultfd WP support.
> + *
>   * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
>   * but also L and G.
>   *
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index d6ff0bbdb394..8cebcff91e57 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -32,6 +32,7 @@
>  
>  #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
>  #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
> +#define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
>  #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
>  #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
>  
> @@ -100,6 +101,14 @@
>  #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +#define _PAGE_UFFD_WP		(_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
> +#define _PAGE_SWP_UFFD_WP	_PAGE_USER
> +#else
> +#define _PAGE_UFFD_WP		(_AT(pteval_t, 0))
> +#define _PAGE_SWP_UFFD_WP	(_AT(pteval_t, 0))
> +#endif
> +
>  #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
>  #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
>  #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 05e61e6c843f..f49afe951711 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -10,6 +10,7 @@
>  #include <linux/mm_types.h>
>  #include <linux/bug.h>
>  #include <linux/errno.h>
> +#include <asm-generic/pgtable_uffd.h>
>  
>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>  	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> new file mode 100644
> index 000000000000..643d1bf559c2
> --- /dev/null
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -0,0 +1,51 @@
> +#ifndef _ASM_GENERIC_PGTABLE_UFFD_H
> +#define _ASM_GENERIC_PGTABLE_UFFD_H
> +
> +#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static __always_inline int pte_uffd_wp(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static __always_inline int pmd_uffd_wp(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
> +
> +static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
> +
> +static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline int pte_swp_uffd_wp(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
> +#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index c9386a365eea..892d61ddf2eb 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1424,6 +1424,11 @@ config ADVISE_SYSCALLS
>  	  applications use these syscalls, you can disable this option to save
>  	  space.
>  
> +config HAVE_ARCH_USERFAULTFD_WP
> +	bool
> +	help
> +	  Arch has userfaultfd write protection support
> +
>  config MEMBARRIER
>  	bool "Enable membarrier() system call" if EXPERT
>  	default y
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 09/26] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
  2019-02-12  2:56 ` [PATCH v2 09/26] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
@ 2019-02-21 17:21   ` Jerome Glisse
  2019-02-25 17:12   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 17:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:15AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Implement helpers methods to invoke userfaultfd wp faults more
> selectively: not only when a wp fault triggers on a vma with
> vma->vm_flags VM_UFFD_WP set, but only if the _PAGE_UFFD_WP bit is set
> in the pagetable too.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/linux/userfaultfd_k.h | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 38f748e7186e..c6590c58ce28 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -14,6 +14,8 @@
>  #include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
>  
>  #include <linux/fcntl.h>
> +#include <linux/mm.h>
> +#include <asm-generic/pgtable_uffd.h>
>  
>  /*
>   * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
> @@ -55,6 +57,18 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_UFFD_WP;
>  }
>  
> +static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> +				      pte_t pte)
> +{
> +	return userfaultfd_wp(vma) && pte_uffd_wp(pte);
> +}
> +
> +static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
> +					   pmd_t pmd)
> +{
> +	return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
> +}
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> @@ -104,6 +118,19 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
>  	return false;
>  }
>  
> +static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> +				      pte_t pte)
> +{
> +	return false;
> +}
> +
> +static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
> +					   pmd_t pmd)
> +{
> +	return false;
> +}
> +
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return false;
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-02-12  2:56 ` [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
@ 2019-02-21 17:29   ` Jerome Glisse
  2019-02-22  7:11     ` Peter Xu
  2019-02-25 15:58   ` Mike Rapoport
  1 sibling, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 17:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:16AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> This allows UFFDIO_COPY to map pages wrprotected.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Minor nitpick down below, but in any case:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  fs/userfaultfd.c                 |  5 +++--
>  include/linux/userfaultfd_k.h    |  2 +-
>  include/uapi/linux/userfaultfd.h | 11 +++++-----
>  mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
>  4 files changed, 35 insertions(+), 19 deletions(-)
> 

[...]

> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index d59b5a73dfb3..73a208c5c1e7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			    struct vm_area_struct *dst_vma,
>  			    unsigned long dst_addr,
>  			    unsigned long src_addr,
> -			    struct page **pagep)
> +			    struct page **pagep,
> +			    bool wp_copy)
>  {
>  	struct mem_cgroup *memcg;
>  	pte_t _dst_pte, *dst_pte;
> @@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
>  		goto out_release;
>  
> -	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> -	if (dst_vma->vm_flags & VM_WRITE)
> -		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> +	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> +	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> +		_dst_pte = pte_mkwrite(_dst_pte);

I would like parentheses around the and :) ie:
    (dst_vma->vm_flags & VM_WRITE) && !wp_copy

I feel it is easier to read.

[...]

> @@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
>  	if (!(dst_vma->vm_flags & VM_SHARED)) {
>  		if (!zeropage)
>  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> -					       dst_addr, src_addr, page);
> +					       dst_addr, src_addr, page,
> +					       wp_copy);
>  		else
>  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
>  						 dst_vma, dst_addr);
>  	} else {
> +		VM_WARN_ON(wp_copy); /* WP only available for anon */

Don't you want to return an error here?

>  		if (!zeropage)
>  			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
>  						     dst_vma, dst_addr,

[...]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 11/26] mm: merge parameters for change_protection()
  2019-02-12  2:56 ` [PATCH v2 11/26] mm: merge parameters for change_protection() Peter Xu
@ 2019-02-21 17:32   ` Jerome Glisse
  0 siblings, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 17:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:17AM +0800, Peter Xu wrote:
> change_protection() was used by either the NUMA or mprotect() code,
> there's one parameter for each of the callers (dirty_accountable and
> prot_numa).  Further, these parameters are passed along the calls:
> 
>   - change_protection_range()
>   - change_p4d_range()
>   - change_pud_range()
>   - change_pmd_range()
>   - ...
> 
> Now we introduce a flag for change_protect() and all these helpers to
> replace these parameters.  Then we can avoid passing multiple parameters
> multiple times along the way.
> 
> More importantly, it'll greatly simplify the work if we want to
> introduce any new parameters to change_protection().  In the follow up
> patches, a new parameter for userfaultfd write protection will be
> introduced.
> 
> No functional change at all.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

It would have been nice if this were a Coccinelle patch; that would be
easier to review.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/linux/huge_mm.h |  2 +-
>  include/linux/mm.h      | 14 +++++++++++++-
>  mm/huge_memory.c        |  3 ++-
>  mm/mempolicy.c          |  2 +-
>  mm/mprotect.c           | 29 ++++++++++++++++-------------
>  5 files changed, 33 insertions(+), 17 deletions(-)

[...]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit
  2019-02-12  2:56 ` [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
@ 2019-02-21 17:44   ` Jerome Glisse
  2019-02-22  7:31     ` Peter Xu
  2019-02-25 18:00   ` Mike Rapoport
  1 sibling, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 17:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:18AM +0800, Peter Xu wrote:
> Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
> change_protection() when used with uffd-wp and make sure the two new
> flags are exclusively used.  Then,
> 
>   - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
>     when a range of memory is write protected by uffd
> 
>   - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
>     _PAGE_RW when write protection is resolved from userspace
> 
> And use this new interface in mwriteprotect_range() to replace the old
> MM_CP_DIRTY_ACCT.
> 
> Do this change for both PTEs and huge PMDs.  Then we can start to
> identify which PTE/PMD is write protected by general (e.g., COW or soft
> dirty tracking), and which is for userfaultfd-wp.
> 
> Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
> into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
> can be even more strict when detecting uffd-wp page faults in either
> do_wp_page() or wp_huge_pmd().
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

A few comments, but still:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  arch/x86/include/asm/pgtable_types.h |  2 +-
>  include/linux/mm.h                   |  5 +++++
>  mm/huge_memory.c                     | 14 +++++++++++++-
>  mm/memory.c                          |  4 ++--
>  mm/mprotect.c                        | 12 ++++++++++++
>  mm/userfaultfd.c                     |  8 ++++++--
>  6 files changed, 39 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 8cebcff91e57..dd9c6295d610 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -133,7 +133,7 @@
>   */
>  #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
>  			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
> -			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
> +			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_UFFD_WP)
>  #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)

This chunk needs to be in the earlier arch specific patch.

[...]

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8d65b0f041f9..817335b443c2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c

[...]

> @@ -2198,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  				entry = pte_mkold(entry);
>  			if (soft_dirty)
>  				entry = pte_mksoft_dirty(entry);
> +			if (uffd_wp)
> +				entry = pte_mkuffd_wp(entry);
>  		}
>  		pte = pte_offset_map(&_pmd, addr);
>  		BUG_ON(!pte_none(*pte));

Reading that code, I thought it would be nice if we could define a
pte mask that we could OR in once instead of all those if () entry |= ...
branches, but that is just a small optimization and has no bearing on
the present patch.  Just wanted to say that out loud.


> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index a6ba448c8565..9d4433044c21 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -46,6 +46,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	int target_node = NUMA_NO_NODE;
>  	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> +	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>  
>  	/*
>  	 * Can be called with only the mmap_sem for reading by
> @@ -117,6 +119,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			if (preserve_write)
>  				ptent = pte_mk_savedwrite(ptent);
>  
> +			if (uffd_wp) {
> +				ptent = pte_wrprotect(ptent);
> +				ptent = pte_mkuffd_wp(ptent);
> +			} else if (uffd_wp_resolve) {
> +				ptent = pte_mkwrite(ptent);
> +				ptent = pte_clear_uffd_wp(ptent);
> +			}
> +
>  			/* Avoid taking write faults for known dirty pages */
>  			if (dirty_accountable && pte_dirty(ptent) &&
>  					(pte_soft_dirty(ptent) ||
> @@ -301,6 +311,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
>  {
>  	unsigned long pages;
>  
> +	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);

Don't you want to abort and return an error here if both flags are set?

[...]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 13/26] mm: export wp_page_copy()
  2019-02-12  2:56 ` [PATCH v2 13/26] mm: export wp_page_copy() Peter Xu
@ 2019-02-21 17:44   ` Jerome Glisse
  0 siblings, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 17:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:19AM +0800, Peter Xu wrote:
> Export this function for usages outside page fault handlers.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/linux/mm.h | 2 ++
>  mm/memory.c        | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f38fbe9c8bc9..2fd14a62324b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -405,6 +405,8 @@ struct vm_fault {
>  					 */
>  };
>  
> +vm_fault_t wp_page_copy(struct vm_fault *vmf);
> +
>  /* page entry size for vm->huge_fault() */
>  enum page_entry_size {
>  	PE_SIZE_PTE = 0,
> diff --git a/mm/memory.c b/mm/memory.c
> index f8d83ae16eff..32d32b6e6339 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2239,7 +2239,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>   *   held to the old page, as well as updating the rmap.
>   * - In any case, unlock the PTL and drop the reference we took to the old page.
>   */
> -static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> +vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct mm_struct *mm = vma->vm_mm;
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp
  2019-02-12  2:56 ` [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
@ 2019-02-21 18:04   ` Jerome Glisse
  2019-02-22  8:46     ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:20AM +0800, Peter Xu wrote:
> This allows uffd-wp to support write-protected pages for COW.
> 
> For example, the uffd write-protected PTE could also be write-protected
> by other usages like COW or zero pages.  When that happens, we can't
> simply set the write bit in the PTE since otherwise it'll change the
> content of every single reference to the page.  Instead, we should do
> the COW first if necessary, then handle the uffd-wp fault.
> 
> To correctly copy the page, we'll also need to carry over the
> _PAGE_UFFD_WP bit if it was set in the original PTE.
> 
> For huge PMDs, we just simply split the huge PMDs where we want to
> resolve an uffd-wp page fault always.  That matches what we do with
> general huge PMD write protections.  In that way, we resolved the huge
> PMD copy-on-write issue into PTE copy-on-write.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

A few comments, see below.

> ---
>  mm/memory.c   |  2 ++
>  mm/mprotect.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 54 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 32d32b6e6339..b5d67bafae35 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2291,6 +2291,8 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  		}
>  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> +		if (pte_uffd_wp(vmf->orig_pte))
> +			entry = pte_mkuffd_wp(entry);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);

This looks wrong to me: isn't the uffd_wp flag cleared on a writable pte?
If so, it would be clearer to have something like:

 +		if (pte_uffd_wp(vmf->orig_pte))
 +			entry = pte_mkuffd_wp(entry);
 +		else
 + 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);

>  		/*
>  		 * Clear the pte entry and flush it first, before updating the
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 9d4433044c21..ae93721f3795 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -77,14 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		if (pte_present(oldpte)) {
>  			pte_t ptent;
>  			bool preserve_write = prot_numa && pte_write(oldpte);
> +			struct page *page;
>  
>  			/*
>  			 * Avoid trapping faults against the zero or KSM
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				struct page *page;
> -
>  				page = vm_normal_page(vma, addr, oldpte);
>  				if (!page || PageKsm(page))
>  					continue;
> @@ -114,6 +113,46 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  					continue;
>  			}
>  
> +			/*
> +			 * Detect whether we'll need to COW before
> +			 * resolving an uffd-wp fault.  Note that this
> +			 * includes detection of the zero page (where
> +			 * page==NULL)
> +			 */
> +			if (uffd_wp_resolve) {
> +				/* If the fault is resolved already, skip */
> +				if (!pte_uffd_wp(*pte))
> +					continue;
> +				page = vm_normal_page(vma, addr, oldpte);
> +				if (!page || page_mapcount(page) > 1) {

This is wrong: if you allow page to be NULL then you are going to crash
in wp_page_copy() down below.  Are you sure you want to test for the
special page?  For anonymous memory this should never happen, ie anon
pages are always regular pages.  So if you allow userfaultfd to
write-protect only anonymous vmas then there is no point in testing
here, besides maybe a BUG_ON() just in case ...

> +					struct vm_fault vmf = {
> +						.vma = vma,
> +						.address = addr & PAGE_MASK,
> +						.page = page,
> +						.orig_pte = oldpte,
> +						.pmd = pmd,
> +						/* pte and ptl not needed */
> +					};
> +					vm_fault_t ret;
> +
> +					if (page)
> +						get_page(page);
> +					arch_leave_lazy_mmu_mode();
> +					pte_unmap_unlock(pte, ptl);
> +					ret = wp_page_copy(&vmf);
> +					/* PTE is changed, or OOM */
> +					if (ret == 0)
> +						/* It's done by others */
> +						continue;
> +					else if (WARN_ON(ret != VM_FAULT_WRITE))
> +						return pages;
> +					pte = pte_offset_map_lock(vma->vm_mm,
> +								  pmd, addr,
> +								  &ptl);

Here you remap the pte with the lock held, but you are not checking
whether the pte is the one you expect, ie whether it points to the
copied page and has the expected uffd_wp flag.  Another thread might
have raced between the time you called wp_page_copy() and the time you
called pte_offset_map_lock().  I have not checked the mmap_sem, so
maybe you are protected by it since mprotect takes it in write mode
IIRC; if so, you should at least add a comment so people do not read
this as a bug.


> +					arch_enter_lazy_mmu_mode();
> +				}
> +			}
> +
>  			ptent = ptep_modify_prot_start(mm, addr, pte);
>  			ptent = pte_modify(ptent, newprot);
>  			if (preserve_write)
> @@ -183,6 +222,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  	unsigned long pages = 0;
>  	unsigned long nr_huge_updates = 0;
>  	struct mmu_notifier_range range;
> +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>  
>  	range.start = 0;
>  
> @@ -202,7 +242,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		}
>  
>  		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> -			if (next - addr != HPAGE_PMD_SIZE) {
> +			/*
> +			 * When resolving an userfaultfd write
> +			 * protection fault, it's not easy to identify
> +			 * whether a THP is shared with others and
> +			 * whether we'll need to do copy-on-write, so
> +			 * just split it always for now to simply the
> +			 * procedure.  And that's the policy too for
> +			 * general THP write-protect in af9e4d5f2de2.
> +			 */
> +			if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {

Using parentheses maybe? :)
            if ((next - addr != HPAGE_PMD_SIZE) || uffd_wp_resolve) {

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  2019-02-12  2:56 ` [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
@ 2019-02-21 18:06   ` Jerome Glisse
  2019-02-22  9:09     ` Peter Xu
  2019-02-25 18:19   ` Mike Rapoport
  1 sibling, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:21AM +0800, Peter Xu wrote:
> UFFD_EVENT_FORK support for uffd-wp should be already there, except
> that we should clean the uffd-wp bit if uffd fork event is not
> enabled.  Detect that to avoid _PAGE_UFFD_WP being set even if the VMA
> is not being tracked by VM_UFFD_WP.  Do this for both small PTEs and
> huge PMDs.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

This patch should come earlier in the series, before the patch that
introduces the userfaultfd API, so that a bisect cannot end up on a
version where this can happen.

Otherwise the patch itself is:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  mm/huge_memory.c | 8 ++++++++
>  mm/memory.c      | 8 ++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817335b443c2..fb2234cb595a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -938,6 +938,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	ret = -EAGAIN;
>  	pmd = *src_pmd;
>  
> +	/*
> +	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
> +	 * does not have the VM_UFFD_WP, which means that the uffd
> +	 * fork event is not enabled.
> +	 */
> +	if (!(vma->vm_flags & VM_UFFD_WP))
> +		pmd = pmd_clear_uffd_wp(pmd);
> +
>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>  	if (unlikely(is_swap_pmd(pmd))) {
>  		swp_entry_t entry = pmd_to_swp_entry(pmd);
> diff --git a/mm/memory.c b/mm/memory.c
> index b5d67bafae35..c2035539e9fd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -788,6 +788,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		pte = pte_mkclean(pte);
>  	pte = pte_mkold(pte);
>  
> +	/*
> +	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
> +	 * does not have the VM_UFFD_WP, which means that the uffd
> +	 * fork event is not enabled.
> +	 */
> +	if (!(vm_flags & VM_UFFD_WP))
> +		pte = pte_clear_uffd_wp(pte);
> +
>  	page = vm_normal_page(vma, addr, pte);
>  	if (page) {
>  		get_page(page);
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 16/26] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
  2019-02-12  2:56 ` [PATCH v2 16/26] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
@ 2019-02-21 18:07   ` Jerome Glisse
  2019-02-25 18:20   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:22AM +0800, Peter Xu wrote:
> Adding these missing helpers for uffd-wp operations with pmd
> swap/migration entries.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  arch/x86/include/asm/pgtable.h     | 15 +++++++++++++++
>  include/asm-generic/pgtable_uffd.h | 15 +++++++++++++++
>  2 files changed, 30 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 6863236e8484..18a815d6f4ea 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1401,6 +1401,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
>  {
>  	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
>  }
> +
> +static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
> +}
> +
> +static inline int pmd_swp_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
> +}
> +
> +static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
> +}
>  #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
>  
>  #define PKRU_AD_BIT 0x1
> diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> index 643d1bf559c2..828966d4c281 100644
> --- a/include/asm-generic/pgtable_uffd.h
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -46,6 +46,21 @@ static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
>  {
>  	return pte;
>  }
> +
> +static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
> +
> +static inline int pmd_swp_uffd_wp(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
>  #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
>  
>  #endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 17/26] userfaultfd: wp: support swap and page migration
  2019-02-12  2:56 ` [PATCH v2 17/26] userfaultfd: wp: support swap and page migration Peter Xu
@ 2019-02-21 18:16   ` Jerome Glisse
  2019-02-25  7:48     ` Peter Xu
  2019-02-25 18:28   ` Mike Rapoport
  1 sibling, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:23AM +0800, Peter Xu wrote:
> For both swap and page migration, we use bit 2 of the entry to
> identify whether the entry is uffd write-protected.  It plays a role
> similar to the existing soft-dirty bit in swap entries, but only for
> keeping the uffd-wp tracking for a specific PTE/PMD.
> 
> One special point here: when we recover the uffd-wp bit from a
> swap/migration entry back into the PTE, we also need to take care of
> the _PAGE_RW bit and make sure it's cleared; otherwise, even with the
> _PAGE_UFFD_WP bit set, the write will not trap at all.
> 
> Note that this patch removes two lines from "userfaultfd: wp: hook
> userfault handler to write protection fault" where we tried to remove
> FAULT_FLAG_WRITE from vmf->flags when uffd-wp was set for the VMA.
> This patch keeps the write flag there.

That part is confusing; you probably want to remove that code from the
previous patch, or at least address my comment in the review of the
previous patch.

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/swapops.h | 2 ++
>  mm/huge_memory.c        | 3 +++
>  mm/memory.c             | 8 ++++++--
>  mm/migrate.c            | 7 +++++++
>  mm/mprotect.c           | 2 ++
>  mm/rmap.c               | 6 ++++++
>  6 files changed, 26 insertions(+), 2 deletions(-)
> 

[...]

> diff --git a/mm/memory.c b/mm/memory.c
> index c2035539e9fd..7cee990d67cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  				pte = swp_entry_to_pte(entry);
>  				if (pte_swp_soft_dirty(*src_pte))
>  					pte = pte_swp_mksoft_dirty(pte);
> +				if (pte_swp_uffd_wp(*src_pte))
> +					pte = pte_swp_mkuffd_wp(pte);
>  				set_pte_at(src_mm, addr, src_pte, pte);
>  			}
>  		} else if (is_device_private_entry(entry)) {
> @@ -2815,8 +2817,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
> -	if (userfaultfd_wp(vma))
> -		vmf->flags &= ~FAULT_FLAG_WRITE;

So this is the confusing part with the previous patch that introduced
that code.  It feels like you should just remove that code entirely
from the previous patch.

>  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
> @@ -2826,6 +2826,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	flush_icache_page(vma, page);
>  	if (pte_swp_soft_dirty(vmf->orig_pte))
>  		pte = pte_mksoft_dirty(pte);
> +	if (pte_swp_uffd_wp(vmf->orig_pte)) {
> +		pte = pte_mkuffd_wp(pte);
> +		pte = pte_wrprotect(pte);
> +	}
>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>  	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>  	vmf->orig_pte = pte;

> diff --git a/mm/migrate.c b/mm/migrate.c
> index d4fd680be3b0..605ccd1f5c64 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -242,6 +242,11 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		if (is_write_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
>  
> +		if (pte_swp_uffd_wp(*pvmw.pte)) {
> +			pte = pte_mkuffd_wp(pte);
> +			pte = pte_wrprotect(pte);
> +		}

If the page was write-protected prior to migration, then it should
never end up as a write migration entry, and thus the above should be
something like:
		if (is_write_migration_entry(entry)) {
			pte = maybe_mkwrite(pte, vma);
		} else if (pte_swp_uffd_wp(*pvmw.pte)) {
			pte = pte_mkuffd_wp(pte);
		}

[...]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 18/26] khugepaged: skip collapse if uffd-wp detected
  2019-02-12  2:56 ` [PATCH v2 18/26] khugepaged: skip collapse if uffd-wp detected Peter Xu
@ 2019-02-21 18:17   ` Jerome Glisse
  2019-02-25 18:50   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:24AM +0800, Peter Xu wrote:
> Don't collapse the huge PMD if any of the small PTEs are userfault
> write protected.  The problem is that write protection is tracked at
> small-page granularity, and there is no way to keep all this
> write-protection information if the small pages are merged into a
> huge PMD.
> 
> The same thing needs to be considered for swap entries and migration
> entries, so do the check for those as well, disregarding
> khugepaged_max_ptes_swap.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/trace/events/huge_memory.h |  1 +
>  mm/khugepaged.c                    | 23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index dd4db334bd63..2d7bad9cb976 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -13,6 +13,7 @@
>  	EM( SCAN_PMD_NULL,		"pmd_null")			\
>  	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
>  	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
> +	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
>  	EM( SCAN_PAGE_RO,		"no_writable_page")		\
>  	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
>  	EM( SCAN_PAGE_NULL,		"page_null")			\
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4f017339ddb2..396c7e4da83e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -29,6 +29,7 @@ enum scan_result {
>  	SCAN_PMD_NULL,
>  	SCAN_EXCEED_NONE_PTE,
>  	SCAN_PTE_NON_PRESENT,
> +	SCAN_PTE_UFFD_WP,
>  	SCAN_PAGE_RO,
>  	SCAN_LACK_REFERENCED_PAGE,
>  	SCAN_PAGE_NULL,
> @@ -1123,6 +1124,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		pte_t pteval = *_pte;
>  		if (is_swap_pte(pteval)) {
>  			if (++unmapped <= khugepaged_max_ptes_swap) {
> +				/*
> +				 * Always be strict with uffd-wp
> +				 * enabled swap entries.  Please see
> +				 * comment below for pte_uffd_wp().
> +				 */
> +				if (pte_swp_uffd_wp(pteval)) {
> +					result = SCAN_PTE_UFFD_WP;
> +					goto out_unmap;
> +				}
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_SWAP_PTE;
> @@ -1142,6 +1152,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			result = SCAN_PTE_NON_PRESENT;
>  			goto out_unmap;
>  		}
> +		if (pte_uffd_wp(pteval)) {
> +			/*
> +			 * Don't collapse the page if any of the small
> +			 * PTEs are armed with uffd write protection.
> +			 * Here we can also mark the new huge pmd as
> +			 * write protected if any of the small ones is
> +			 * marked, but that could bring unknown
> +			 * userfault messages that fall outside of
> +			 * the registered range.  So, just be simple.
> +			 */
> +			result = SCAN_PTE_UFFD_WP;
> +			goto out_unmap;
> +		}
>  		if (pte_write(pteval))
>  			writable = true;
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 19/26] userfaultfd: introduce helper vma_find_uffd
  2019-02-12  2:56 ` [PATCH v2 19/26] userfaultfd: introduce helper vma_find_uffd Peter Xu
@ 2019-02-21 18:19   ` Jerome Glisse
  2019-02-25 20:48   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:25AM +0800, Peter Xu wrote:
> We have multiple places (and more coming) that need to find a
> userfault enabled VMA from an mm struct that covers a specific memory
> range.  This patch introduces a helper for it and applies it to the
> existing code.
> 
> Suggested-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  mm/userfaultfd.c | 54 +++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 80bcd642911d..fefa81c301b7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -20,6 +20,34 @@
>  #include <asm/tlbflush.h>
>  #include "internal.h"
>  
> +/*
> + * Find a valid userfault enabled VMA region that covers the whole
> + * address range, or NULL on failure.  Must be called with mmap_sem
> + * held.
> + */
> +static struct vm_area_struct *vma_find_uffd(struct mm_struct *mm,
> +					    unsigned long start,
> +					    unsigned long len)
> +{
> +	struct vm_area_struct *vma = find_vma(mm, start);
> +
> +	if (!vma)
> +		return NULL;
> +
> +	/*
> +	 * Check the vma is registered in uffd, this is required to
> +	 * enforce the VM_MAYWRITE check done at uffd registration
> +	 * time.
> +	 */
> +	if (!vma->vm_userfaultfd_ctx.ctx)
> +		return NULL;
> +
> +	if (start < vma->vm_start || start + len > vma->vm_end)
> +		return NULL;
> +
> +	return vma;
> +}
> +
>  static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			    pmd_t *dst_pmd,
>  			    struct vm_area_struct *dst_vma,
> @@ -228,20 +256,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	 */
>  	if (!dst_vma) {
>  		err = -ENOENT;
> -		dst_vma = find_vma(dst_mm, dst_start);
> +		dst_vma = vma_find_uffd(dst_mm, dst_start, len);
>  		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
>  			goto out_unlock;
> -		/*
> -		 * Check the vma is registered in uffd, this is
> -		 * required to enforce the VM_MAYWRITE check done at
> -		 * uffd registration time.
> -		 */
> -		if (!dst_vma->vm_userfaultfd_ctx.ctx)
> -			goto out_unlock;
> -
> -		if (dst_start < dst_vma->vm_start ||
> -		    dst_start + len > dst_vma->vm_end)
> -			goto out_unlock;
>  
>  		err = -EINVAL;
>  		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
> @@ -488,20 +505,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  	 * both valid and fully within a single existing vma.
>  	 */
>  	err = -ENOENT;
> -	dst_vma = find_vma(dst_mm, dst_start);
> +	dst_vma = vma_find_uffd(dst_mm, dst_start, len);
>  	if (!dst_vma)
>  		goto out_unlock;
> -	/*
> -	 * Check the vma is registered in uffd, this is required to
> -	 * enforce the VM_MAYWRITE check done at uffd registration
> -	 * time.
> -	 */
> -	if (!dst_vma->vm_userfaultfd_ctx.ctx)
> -		goto out_unlock;
> -
> -	if (dst_start < dst_vma->vm_start ||
> -	    dst_start + len > dst_vma->vm_end)
> -		goto out_unlock;
>  
>  	err = -EINVAL;
>  	/*
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-12  2:56 ` [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range Peter Xu
@ 2019-02-21 18:23   ` Jerome Glisse
  2019-02-25  8:16     ` Peter Xu
  2019-02-25 20:52   ` Mike Rapoport
  1 sibling, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Tue, Feb 12, 2019 at 10:56:26AM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> this doesn't split/merge vmas.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx:
>  - use the helper to find VMA;
>  - return -ENOENT if not found to match mcopy case;
>  - use the new MM_CP_UFFD_WP* flags for change_protection
>  - check against mmap_changing for failures]
> Signed-off-by: Peter Xu <peterx@redhat.com>

I have a question, see below, but anyway:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/linux/userfaultfd_k.h |  3 ++
>  mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
>  2 files changed, 57 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 765ce884cec0..8f6e6ed544fb 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
>  			      unsigned long dst_start,
>  			      unsigned long len,
>  			      bool *mmap_changing);
> +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> +			       unsigned long start, unsigned long len,
> +			       bool enable_wp, bool *mmap_changing);
>  
>  /* mm helpers */
>  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index fefa81c301b7..529d180bb4d7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>  {
>  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
>  }
> +
> +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> +			unsigned long len, bool enable_wp, bool *mmap_changing)
> +{
> +	struct vm_area_struct *dst_vma;
> +	pgprot_t newprot;
> +	int err;
> +
> +	/*
> +	 * Sanitize the command parameters:
> +	 */
> +	BUG_ON(start & ~PAGE_MASK);
> +	BUG_ON(len & ~PAGE_MASK);
> +
> +	/* Does the address range wrap, or is the span zero-sized? */
> +	BUG_ON(start + len <= start);
> +
> +	down_read(&dst_mm->mmap_sem);
> +
> +	/*
> +	 * If memory mappings are changing because of non-cooperative
> +	 * operation (e.g. mremap) running in parallel, bail out and
> +	 * request the user to retry later
> +	 */
> +	err = -EAGAIN;
> +	if (mmap_changing && READ_ONCE(*mmap_changing))
> +		goto out_unlock;
> +
> +	err = -ENOENT;
> +	dst_vma = vma_find_uffd(dst_mm, start, len);
> +	/*
> +	 * Make sure the vma is not shared, that the dst range is
> +	 * both valid and fully within a single existing vma.
> +	 */
> +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +		goto out_unlock;
> +	if (!userfaultfd_wp(dst_vma))
> +		goto out_unlock;
> +	if (!vma_is_anonymous(dst_vma))
> +		goto out_unlock;

Don't you want to distinguish between no VMA (i.e. -ENOENT) and a VMA
that cannot be write protected (VM_SHARED, not registered with
userfaultfd, not anonymous)?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-02-12  2:56 ` [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
@ 2019-02-21 18:28   ` Jerome Glisse
  2019-02-25  8:31     ` Peter Xu
  2019-02-25 21:03   ` Mike Rapoport
  1 sibling, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:27AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> v1: From: Shaohua Li <shli@fb.com>
> 
> v2: cleanups, remove a branch.
> 
> [peterx writes up the commit message, as below...]
> 
> This patch introduces the new uffd-wp APIs for userspace.
> 
> Firstly, we'll allow to do UFFDIO_REGISTER with write protection
> tracking using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this
> flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in
> which case the userspace program can not only resolve missing page
> faults but also track page data changes along the way.
> 
> Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page
> level write protection tracking.  Note that we will need to register
> the memory region with UFFDIO_REGISTER_MODE_WP before that.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx: remove useless block, write commit message, check against
>  VM_MAYWRITE rather than VM_WRITE when register]
> Signed-off-by: Peter Xu <peterx@redhat.com>

I am not an expert on the userfaultfd code, but it looks good to me, so:

Also see my question down below, just a minor one.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  fs/userfaultfd.c                 | 82 +++++++++++++++++++++++++-------
>  include/uapi/linux/userfaultfd.h | 11 +++++
>  2 files changed, 77 insertions(+), 16 deletions(-)
> 

[...]

> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 297cb044c03f..1b977a7a4435 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -52,6 +52,7 @@
>  #define _UFFDIO_WAKE			(0x02)
>  #define _UFFDIO_COPY			(0x03)
>  #define _UFFDIO_ZEROPAGE		(0x04)
> +#define _UFFDIO_WRITEPROTECT		(0x06)
>  #define _UFFDIO_API			(0x3F)

What happened to ioctl 0x05? :)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 22/26] userfaultfd: wp: enabled write protection in userfaultfd API
  2019-02-12  2:56 ` [PATCH v2 22/26] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
@ 2019-02-21 18:29   ` Jerome Glisse
  2019-02-25  8:34     ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Pavel Emelyanov, Rik van Riel

On Tue, Feb 12, 2019 at 10:56:28AM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Now it's safe to enable write protection in userfaultfd API
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Maybe fold this patch into the previous one?  In any case:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/uapi/linux/userfaultfd.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 1b977a7a4435..a50f1ed24d23 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -19,7 +19,8 @@
>   * means the userland is reading).
>   */
>  #define UFFD_API ((__u64)0xAA)
> -#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |		\
> +#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |	\
> +			   UFFD_FEATURE_EVENT_FORK |		\
>  			   UFFD_FEATURE_EVENT_REMAP |		\
>  			   UFFD_FEATURE_EVENT_REMOVE |	\
>  			   UFFD_FEATURE_EVENT_UNMAP |		\
> @@ -34,7 +35,8 @@
>  #define UFFD_API_RANGE_IOCTLS			\
>  	((__u64)1 << _UFFDIO_WAKE |		\
>  	 (__u64)1 << _UFFDIO_COPY |		\
> -	 (__u64)1 << _UFFDIO_ZEROPAGE)
> +	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
> +	 (__u64)1 << _UFFDIO_WRITEPROTECT)
>  #define UFFD_API_RANGE_IOCTLS_BASIC		\
>  	((__u64)1 << _UFFDIO_WAKE |		\
>  	 (__u64)1 << _UFFDIO_COPY)
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-12  2:56 ` [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect Peter Xu
@ 2019-02-21 18:36   ` Jerome Glisse
  2019-02-25  8:58     ` Peter Xu
  2019-02-25 21:09   ` Mike Rapoport
  2019-02-26  8:00   ` Mike Rapoport
  2 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> It does not make sense to try to wake up any waiting thread when we're
> write-protecting a memory region.  Only wake up when resolving a write
> protected page fault.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

I am a bit confused here, see below.

> ---
>  fs/userfaultfd.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 81962d62520c..f1f61a0278c2 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	struct uffdio_writeprotect uffdio_wp;
>  	struct uffdio_writeprotect __user *user_uffdio_wp;
>  	struct userfaultfd_wake_range range;
> +	bool mode_wp, mode_dontwake;
>  
>  	if (READ_ONCE(ctx->mmap_changing))
>  		return -EAGAIN;
> @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>  			       UFFDIO_WRITEPROTECT_MODE_WP))
>  		return -EINVAL;
> -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> +
> +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> +
> +	if (mode_wp && mode_dontwake)
>  		return -EINVAL;

I am confused by the logic here.  DONTWAKE means do not wake any
waiting thread, right?  So going by the patch header, it seems to me
the logic should be:
    if (mode_wp && !mode_dontwake)
        return -EINVAL;

At the very least this part seems to mean the opposite of what the
commit message says.

>  
>  	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> -				  uffdio_wp.range.len, uffdio_wp.mode &
> -				  UFFDIO_WRITEPROTECT_MODE_WP,
> +				  uffdio_wp.range.len, mode_wp,
>  				  &ctx->mmap_changing);
>  	if (ret)
>  		return ret;
>  
> -	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
> +	if (!mode_wp && !mode_dontwake) {

This part matches the commit message :)

>  		range.start = uffdio_wp.range.start;
>  		range.len = uffdio_wp.range.len;
>  		wake_userfault(ctx, &range);

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-02-12  2:56 ` [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
@ 2019-02-21 18:38   ` Jerome Glisse
  2019-02-25 21:19   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-21 18:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:30AM +0800, Peter Xu wrote:
> From: Martin Cracauer <cracauer@cons.org>
> 
> Adds documentation about the write protection support.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx: rewrite in rst format; fixups here and there]
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  Documentation/admin-guide/mm/userfaultfd.rst | 51 ++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> index 5048cf661a8a..c30176e67900 100644
> --- a/Documentation/admin-guide/mm/userfaultfd.rst
> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> @@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
>  half copied page since it'll keep userfaulting until the copy has
>  finished.
>  
> +Notes:
> +
> +- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
> +  you must provide some kind of page in your thread after reading from
> +  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
> +  The normal behavior of the OS automatically providing a zero page on
> +  an anonymous mapping is not in place.
> +
> +- None of the page-delivering ioctls default to the range that you
> +  registered with.  You must fill in all fields for the appropriate
> +  ioctl struct including the range.
> +
> +- You get the address of the access that triggered the missing page
> +  event out of a struct uffd_msg that you read in the thread from the
> +  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
> +  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
> +  the first of any of those IOCTLs wakes up the faulting thread.
> +
> +- Be sure to test for all errors including (pollfd[0].revents &
> +  POLLERR).  This can happen, e.g. when ranges supplied were
> +  incorrect.
> +
> +Write Protect Notifications
> +---------------------------
> +
> +This is equivalent to (but faster than) using mprotect and a SIGSEGV
> +signal handler.
> +
> +Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
> +Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
> +struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
> +in the struct passed in.  The range does not default to and does not
> +have to be identical to the range you registered with.  You can write
> +protect as many ranges as you like (inside the registered range).
> +Then, in the thread reading from uffd the struct will have
> +msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
> +ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again,
> +this time with mode not having UFFDIO_WRITEPROTECT_MODE_WP set.
> +This wakes up the thread which will continue to run with writes. This
> +allows you to do the bookkeeping about the write in the uffd reading
> +thread before the ioctl.
> +
> +If you registered with both UFFDIO_REGISTER_MODE_MISSING and
> +UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
> +which you supply a page and undo write protect.  Note that there is a
> +difference between writes into a WP area and into a !WP area.  The
> +former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
> +UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
> +you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
> +used.
> +
>  QEMU/KVM
>  ========
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 01/26] mm: gup: rename "nonblocking" to "locked" where proper
  2019-02-21 15:17   ` Jerome Glisse
@ 2019-02-22  3:42     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-22  3:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 10:17:42AM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:07AM +0800, Peter Xu wrote:
> > There are plenty of places around __get_user_pages() that have a
> > parameter "nonblocking" which does not really mean "it won't block"
> > (because it can really block); instead it shows whether the mmap_sem
> > is released by up_read() during the page fault handling, mostly when
> > VM_FAULT_RETRY is returned.
> > 
> > We have the correct naming in e.g. get_user_pages_locked() or
> > get_user_pages_remote() as "locked", however there are still many
> > places that use "nonblocking" as the name.
> > 
> > Rename the places to "locked" where proper to better suit the
> > functionality of the variable.  While at it, fix up some of the
> > comments accordingly.
> > 
> > Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Minor issue see below
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> [...]
> 
> > @@ -656,13 +656,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >   * appropriate) must be called after the page is finished with, and
> >   * before put_page is called.
> >   *
> > - * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
> > - * or mmap_sem contention, and if waiting is needed to pin all pages,
> > - * *@nonblocking will be set to 0.  Further, if @gup_flags does not
> > - * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
> > - * this case.
> > + * If @locked != NULL, *@locked will be set to 0 when mmap_sem is
> > + * released by an up_read().  That can happen if @gup_flags does not
> > + * has FOLL_NOWAIT.
> 
> I am not a native speaker but I believe the correct wording is:
>      @gup_flags does not have FOLL_NOWAIT

Yes I agree.

(r-b taken, and I kept Mike's too assuming this is a trivial change)

Thanks!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 02/26] mm: userfault: return VM_FAULT_RETRY on signals
  2019-02-21 15:29   ` Jerome Glisse
@ 2019-02-22  3:51     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-22  3:51 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 10:29:56AM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:08AM +0800, Peter Xu wrote:
> > The idea comes from the upstream discussion between Linus and Andrea:
> > 
> >   https://lkml.org/lkml/2017/10/30/560
> > 
> > A summary of the issue: in the past there was a special path in
> > handle_userfault() that returned VM_FAULT_NOPAGE when we detected
> > non-fatal signals while waiting for userfault handling.  We did that
> > by reacquiring the mmap_sem before returning.  However that brings a
> > risk in that the vmas might have changed when we retake the
> > mmap_sem, and we could even be holding an invalid vma structure.
> > 
> > This patch removes the special path so we'll return VM_FAULT_RETRY
> > via the common path even if we have got such signals.  Then for all
> > the architectures that pass FAULT_FLAG_ALLOW_RETRY into
> > handle_mm_fault(), we check not only for SIGKILL but for all pending
> > userspace signals right after we return from handle_mm_fault().
> > This allows userspace to handle non-fatal signals faster than
> > before.
> > 
> > This patch is a preparation work for the next patch to finally remove
> > the special code path mentioned above in handle_userfault().
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> See maybe minor improvement
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> [...]
> 
> > diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> > index 58f69fa07df9..c41c021bbe40 100644
> > --- a/arch/arm/mm/fault.c
> > +++ b/arch/arm/mm/fault.c
> > @@ -314,12 +314,12 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> >  
> >  	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
> >  
> > -	/* If we need to retry but a fatal signal is pending, handle the
> > +	/* If we need to retry but a signal is pending, handle the
> >  	 * signal first. We do not need to release the mmap_sem because
> >  	 * it would already be released in __lock_page_or_retry in
> >  	 * mm/filemap.c. */
> > -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> > -		if (!user_mode(regs))
> > +	if (unlikely(fault & VM_FAULT_RETRY && signal_pending(current))) {
> 
> I rather see (fault & VM_FAULT_RETRY) ie with the parenthesis as it
> avoids the need to remember operator precedence rules :)

Yes, it's good practice.  I was bitten by the lock_page() case a few
days ago, so I think I'll remember (though this patch came earlier :)

I'll fix all such places in the patch; I noticed that there are four
of them.  I've taken the r-b after the changes.  Thanks,

> 
> [...]
> 
> > diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
> > index 68d5f2a27f38..9f6e477b9e30 100644
> > --- a/arch/nds32/mm/fault.c
> > +++ b/arch/nds32/mm/fault.c
> > @@ -206,12 +206,12 @@ void do_page_fault(unsigned long entry, unsigned long addr,
> >  	fault = handle_mm_fault(vma, addr, flags);
> >  
> >  	/*
> > -	 * If we need to retry but a fatal signal is pending, handle the
> > +	 * If we need to retry but a signal is pending, handle the
> >  	 * signal first. We do not need to release the mmap_sem because it
> >  	 * would already be released in __lock_page_or_retry in mm/filemap.c.
> >  	 */
> > -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> > -		if (!user_mode(regs))
> > +	if (fault & VM_FAULT_RETRY && signal_pending(current)) {
> 
> Same as above parenthesis maybe.
> 
> [...]
> 
> > diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
> > index 0e8b6158f224..09baf37b65b9 100644
> > --- a/arch/um/kernel/trap.c
> > +++ b/arch/um/kernel/trap.c
> > @@ -76,8 +76,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
> >  
> >  		fault = handle_mm_fault(vma, address, flags);
> >  
> > -		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> > +		if (fault & VM_FAULT_RETRY && signal_pending(current)) {
> 
> Same as above parenthesis maybe.
> 
> [...]

Regards,

-- 
Peter Xu


* Re: [PATCH v2.1 04/26] mm: allow VM_FAULT_RETRY for multiple times
  2019-02-21 15:53     ` Jerome Glisse
@ 2019-02-22  4:25       ` Peter Xu
  2019-02-22 15:11         ` Jerome Glisse
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-22  4:25 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 10:53:11AM -0500, Jerome Glisse wrote:
> On Thu, Feb 21, 2019 at 04:56:56PM +0800, Peter Xu wrote:
> > The idea comes from a discussion between Linus and Andrea [1].
> > 
> > Before this patch we only allow a page fault to retry once.  We
> > achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
> > handle_mm_fault() the second time.  This was majorly used to avoid
> > unexpected starvation of the system by looping over forever to handle
> > the page fault on a single page.  However that should hardly happen,
> > and after all for each code path to return a VM_FAULT_RETRY we'll
> > first wait for a condition (during which time we should possibly yield
> > the cpu) to happen before VM_FAULT_RETRY is really returned.
> > 
> > This patch removes the restriction by keeping the
> > FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
> > that the page fault handler now can retry the page fault for multiple
> > times if necessary without the need to generate another page fault
> > event.  Meanwhile we still keep the FAULT_FLAG_TRIED flag so page
> > fault handler can still identify whether a page fault is the first
> > attempt or not.
> > 
> > Then we'll have these combinations of fault flags (only considering
> > ALLOW_RETRY flag and TRIED flag):
> > 
> >   - ALLOW_RETRY and !TRIED:  this means the page fault allows to
> >                              retry, and this is the first try
> > 
> >   - ALLOW_RETRY and TRIED:   this means the page fault allows to
> >                              retry, and this is not the first try
> > 
> >   - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
> >                              to retry at all
> > 
> >   - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
> > 
> > In existing code we have multiple places that has taken special care
> > of the first condition above by checking against (fault_flags &
> > FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to
> > detect the first retry of a page fault by checking against
> > both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flag &
> > FAULT_FLAG_TRIED) because now even the 2nd try will have the
> > ALLOW_RETRY set, then use that helper in all existing special paths.
> > One example is in __lock_page_or_retry(), now we'll drop the mmap_sem
> > only in the first attempt of page fault and we'll keep it in follow up
> > retries, so old locking behavior will be retained.
> > 
> > This will be a nice enhancement for current code [2] at the same time
> > a supporting material for the future userfaultfd-writeprotect work,
> > since in that work there will always be an explicit userfault
> > writeprotect retry for protected pages, and if that cannot resolve the
> > page fault (e.g., when userfaultfd-writeprotect is used in conjunction
> > with swapped pages) then we'll possibly need a 3rd retry of the page
> > fault.  It might also benefit other potential users who will have
> > similar requirement like userfault write-protection.
> > 
> > GUP code is not touched yet and will be covered in follow up patch.
> > 
> > Please read the thread below for more information.
> > 
> > [1] https://lkml.org/lkml/2017/11/2/833
> > [2] https://lkml.org/lkml/2018/12/30/64
> 
> I have few comments on this one. See below.
> 
> 
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > 
> >  arch/alpha/mm/fault.c           |  2 +-
> >  arch/arc/mm/fault.c             |  1 -
> >  arch/arm/mm/fault.c             |  3 ---
> >  arch/arm64/mm/fault.c           |  5 -----
> >  arch/hexagon/mm/vm_fault.c      |  1 -
> >  arch/ia64/mm/fault.c            |  1 -
> >  arch/m68k/mm/fault.c            |  3 ---
> >  arch/microblaze/mm/fault.c      |  1 -
> >  arch/mips/mm/fault.c            |  1 -
> >  arch/nds32/mm/fault.c           |  1 -
> >  arch/nios2/mm/fault.c           |  3 ---
> >  arch/openrisc/mm/fault.c        |  1 -
> >  arch/parisc/mm/fault.c          |  2 --
> >  arch/powerpc/mm/fault.c         |  6 ------
> >  arch/riscv/mm/fault.c           |  5 -----
> >  arch/s390/mm/fault.c            |  5 +----
> >  arch/sh/mm/fault.c              |  1 -
> >  arch/sparc/mm/fault_32.c        |  1 -
> >  arch/sparc/mm/fault_64.c        |  1 -
> >  arch/um/kernel/trap.c           |  1 -
> >  arch/unicore32/mm/fault.c       |  6 +-----
> >  arch/x86/mm/fault.c             |  2 --
> >  arch/xtensa/mm/fault.c          |  1 -
> >  drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +++++++++---
> >  include/linux/mm.h              | 12 +++++++++++-
> >  mm/filemap.c                    |  2 +-
> >  mm/shmem.c                      |  2 +-
> >  27 files changed, 25 insertions(+), 57 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
> > index 29422eec329d..7d3e96a9a7ab 100644
> > --- a/arch/parisc/mm/fault.c
> > +++ b/arch/parisc/mm/fault.c
> > @@ -327,8 +327,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
> >  		else
> >  			current->min_flt++;
> >  		if (fault & VM_FAULT_RETRY) {
> > -			flags &= ~FAULT_FLAG_ALLOW_RETRY;
> 
> Don't you need to also add:
>      flags |= FAULT_FLAG_TRIED;
> 
> Like other arch.

Yes, I can add that; thanks for noticing.  Actually I only changed one
of the similar cases in the current patch (alpha, parisc and unicore32
are special cases here where TRIED was never used).  I think it would
even be fine without the TRIED flag here, because passing in fault
flags with !ALLOW_RETRY and !TRIED is simply the synchronous case, so
we'd probably still be safe, just like a normal second fault retry,
and we'd wait until the page fault is resolved.  On second thought,
though, this is also a good chance to clean the whole thing up and
make sure all the archs use the same pattern to pass fault flags, so
I'll touch up the other two places as well to make sure TRIED is set
on the second retry and beyond.

> 
> 
> [...]
> 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 248ff0a28ecd..d842c3e02a50 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1483,9 +1483,7 @@ void do_user_addr_fault(struct pt_regs *regs,
> >  	if (unlikely(fault & VM_FAULT_RETRY)) {
> >  		bool is_user = flags & FAULT_FLAG_USER;
> >  
> > -		/* Retry at most once */
> >  		if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > -			flags &= ~FAULT_FLAG_ALLOW_RETRY;
> >  			flags |= FAULT_FLAG_TRIED;
> >  			if (is_user && signal_pending(tsk))
> >  				return;
> 
> So here you have a change in behavior, it can retry indefinitly for as
> long as they are no signal. Don't you want so test for FAULT_FLAG_TRIED ?

These first five patches do want to allow the page fault to retry as
many times as needed.  "Indefinitely" sounds scary, but IMHO this is
fine for page faults, since the alternative is simply crashing the
program, or even the system, depending on the fault context, so this
is no worse.

For userspace programs, if anything really goes wrong (so far I cannot
think of a valid scenario in a bug-free system, but just assuming...)
and a fault loops indefinitely, IMHO it'll just hang the buggy process
itself rather than coredump, and the admin can simply kill the process
to reclaim its resources, since we still detect signals.

Or did I misunderstand the question?

> 
> [...]
> 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 80bb6408fe73..4e11c9639f1b 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -341,11 +341,21 @@ extern pgprot_t protection_map[16];
> >  #define FAULT_FLAG_ALLOW_RETRY	0x04	/* Retry fault if blocking */
> >  #define FAULT_FLAG_RETRY_NOWAIT	0x08	/* Don't drop mmap_sem and wait when retrying */
> >  #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
> > -#define FAULT_FLAG_TRIED	0x20	/* Second try */
> > +#define FAULT_FLAG_TRIED	0x20	/* We've tried once */
> >  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
> >  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
> >  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
> >  
> > +/*
> > + * Returns true if the page fault allows retry and this is the first
> > + * attempt of the fault handling; false otherwise.
> > + */
> 
> You should add why it returns false if it is not the first try ie to
> avoid starvation.

How about:

        Returns true if the page fault allows retry and this is the
        first attempt of the fault handling; false otherwise.  This is
        mostly used for places where we want to avoid holding the
        mmap_sem for too long while waiting for another condition to
        change, in which case we can be polite and release the
        mmap_sem on the first attempt, to avoid potentially starving
        other processes that also want the mmap_sem.

?

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 05/26] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-02-21 16:06   ` Jerome Glisse
@ 2019-02-22  4:41     ` Peter Xu
  2019-02-22 15:13       ` Jerome Glisse
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-22  4:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 11:06:55AM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:11AM +0800, Peter Xu wrote:
> > This is the gup counterpart of the change that allows the VM_FAULT_RETRY
> > to happen for more than once.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Thanks for the r-b, Jerome!

Though I plan to change this patch a bit: I just noticed that I didn't
touch up the hugetlbfs path for GUP.  It isn't strictly needed for
now, since hugetlbfs is not yet supported, but I think I'd better do
it in this same patch to make follow-up work on hugetlb easier and
keep the patch more self-contained.  The new version will simply
squash the change below into the current patch:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e3c738bde72e..a8eace2d5296 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4257,8 +4257,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                                fault_flags |= FAULT_FLAG_ALLOW_RETRY |
                                        FAULT_FLAG_RETRY_NOWAIT;
                        if (flags & FOLL_TRIED) {
-                               VM_WARN_ON_ONCE(fault_flags &
-                                               FAULT_FLAG_ALLOW_RETRY);
+                               /*
+                                * Note: FAULT_FLAG_ALLOW_RETRY and
+                                * FAULT_FLAG_TRIED can co-exist
+                                */
                                fault_flags |= FAULT_FLAG_TRIED;
                        }
                        ret = hugetlb_fault(mm, vma, vaddr, fault_flags);

I'd say this change is straightforward (it's the same as the
faultin_page() change below, just for hugetlbfs).  Please let me know
if you still want to offer the r-b with the above change squashed
(I'll be more than glad to take it!), or I'll just wait for your
review comments when I post the next version.

Thanks,

> 
> > ---
> >  mm/gup.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index fa75a03204c1..ba387aec0d80 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -528,7 +528,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
> >  	if (*flags & FOLL_NOWAIT)
> >  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> >  	if (*flags & FOLL_TRIED) {
> > -		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> > +		/*
> > +		 * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
> > +		 * can co-exist
> > +		 */
> >  		fault_flags |= FAULT_FLAG_TRIED;
> >  	}
> >  
> > @@ -943,17 +946,23 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >  		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> >  		pages += ret;
> >  		start += ret << PAGE_SHIFT;
> > +		lock_dropped = true;
> >  
> > +retry:
> >  		/*
> >  		 * Repeat on the address that fired VM_FAULT_RETRY
> > -		 * without FAULT_FLAG_ALLOW_RETRY but with
> > +		 * with both FAULT_FLAG_ALLOW_RETRY and
> >  		 * FAULT_FLAG_TRIED.
> >  		 */
> >  		*locked = 1;
> > -		lock_dropped = true;
> >  		down_read(&mm->mmap_sem);
> >  		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
> > -				       pages, NULL, NULL);
> > +				       pages, NULL, locked);
> > +		if (!*locked) {
> > +			/* Continue to retry until we succeeded */
> > +			BUG_ON(ret != 0);
> > +			goto retry;
> > +		}
> >  		if (ret != 1) {
> >  			BUG_ON(ret > 1);
> >  			if (!pages_done)
> > -- 
> > 2.17.1
> > 

-- 
Peter Xu


* Re: [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-02-21 17:29   ` Jerome Glisse
@ 2019-02-22  7:11     ` Peter Xu
  2019-02-22 15:15       ` Jerome Glisse
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-22  7:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 12:29:19PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:16AM +0800, Peter Xu wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > This allows UFFDIO_COPY to map pages wrprotected.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Minor nitpick down below, but in any case:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> > ---
> >  fs/userfaultfd.c                 |  5 +++--
> >  include/linux/userfaultfd_k.h    |  2 +-
> >  include/uapi/linux/userfaultfd.h | 11 +++++-----
> >  mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
> >  4 files changed, 35 insertions(+), 19 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index d59b5a73dfb3..73a208c5c1e7 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> >  			    struct vm_area_struct *dst_vma,
> >  			    unsigned long dst_addr,
> >  			    unsigned long src_addr,
> > -			    struct page **pagep)
> > +			    struct page **pagep,
> > +			    bool wp_copy)
> >  {
> >  	struct mem_cgroup *memcg;
> >  	pte_t _dst_pte, *dst_pte;
> > @@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> >  	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
> >  		goto out_release;
> >  
> > -	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > -	if (dst_vma->vm_flags & VM_WRITE)
> > -		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> > +	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> > +	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> > +		_dst_pte = pte_mkwrite(_dst_pte);
> 
> I like parenthesis around around and :) ie:
>     (dst_vma->vm_flags & VM_WRITE) && !wp_copy
> 
> I feel it is easier to read.

Yeah, another one.  This line will be changed in follow-up patches,
but I'll fix it anyway.

> 
> [...]
> 
> > @@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> >  	if (!(dst_vma->vm_flags & VM_SHARED)) {
> >  		if (!zeropage)
> >  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> > -					       dst_addr, src_addr, page);
> > +					       dst_addr, src_addr, page,
> > +					       wp_copy);
> >  		else
> >  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
> >  						 dst_vma, dst_addr);
> >  	} else {
> > +		VM_WARN_ON(wp_copy); /* WP only available for anon */
> 
> Don't you want to return with error here ?

Makes sense to me.  Does this look good to you, to be squashed into
the current patch?

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 73a208c5c1e7..f3ea09f412d4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -73,7 +73,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
                goto out_release;
 
        _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
-       if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
+       if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy)
                _dst_pte = pte_mkwrite(_dst_pte);
 
        dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
@@ -424,7 +424,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
                        err = mfill_zeropage_pte(dst_mm, dst_pmd,
                                                 dst_vma, dst_addr);
        } else {
-               VM_WARN_ON(wp_copy); /* WP only available for anon */
+               if (unlikely(wp_copy))
+                       /* TODO: WP currently only available for anon */
+                       return -EINVAL;
+
                if (!zeropage)
                        err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
                                                     dst_vma, dst_addr,

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit
  2019-02-21 17:44   ` Jerome Glisse
@ 2019-02-22  7:31     ` Peter Xu
  2019-02-22 15:17       ` Jerome Glisse
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-22  7:31 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 12:44:02PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:18AM +0800, Peter Xu wrote:
> > Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
> > change_protection() when used with uffd-wp and make sure the two new
> > flags are exclusively used.  Then,
> > 
> >   - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
> >     when a range of memory is write protected by uffd
> > 
> >   - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
> >     _PAGE_RW when write protection is resolved from userspace
> > 
> > And use this new interface in mwriteprotect_range() to replace the old
> > MM_CP_DIRTY_ACCT.
> > 
> > Do this change for both PTEs and huge PMDs.  Then we can start to
> > identify which PTE/PMD is write protected by general (e.g., COW or soft
> > dirty tracking), and which is for userfaultfd-wp.
> > 
> > Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
> > into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
> > can be even more strict when detecting uffd-wp page faults in either
> > do_wp_page() or wp_huge_pmd().
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Few comments but still:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Thanks!

> 
> > ---
> >  arch/x86/include/asm/pgtable_types.h |  2 +-
> >  include/linux/mm.h                   |  5 +++++
> >  mm/huge_memory.c                     | 14 +++++++++++++-
> >  mm/memory.c                          |  4 ++--
> >  mm/mprotect.c                        | 12 ++++++++++++
> >  mm/userfaultfd.c                     |  8 ++++++--
> >  6 files changed, 39 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> > index 8cebcff91e57..dd9c6295d610 100644
> > --- a/arch/x86/include/asm/pgtable_types.h
> > +++ b/arch/x86/include/asm/pgtable_types.h
> > @@ -133,7 +133,7 @@
> >   */
> >  #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
> >  			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
> > -			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
> > +			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_UFFD_WP)
> >  #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
> 
> This chunk needs to be in the earlier arch specific patch.

Indeed.  I'll move it over.

> 
> [...]
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 8d65b0f041f9..817335b443c2 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> 
> [...]
> 
> > @@ -2198,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  				entry = pte_mkold(entry);
> >  			if (soft_dirty)
> >  				entry = pte_mksoft_dirty(entry);
> > +			if (uffd_wp)
> > +				entry = pte_mkuffd_wp(entry);
> >  		}
> >  		pte = pte_offset_map(&_pmd, addr);
> >  		BUG_ON(!pte_none(*pte));
> 
> Reading that code and i thought i would be nice if we could define a
> pte_mask that we can or instead of all those if () entry |= ... but
> that is just some dumb optimization and does not have any bearing on
> the present patch. Just wanted to say that outloud.

(I agree; though I'll just concentrate on the series for now)

> 
> 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index a6ba448c8565..9d4433044c21 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -46,6 +46,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  	int target_node = NUMA_NO_NODE;
> >  	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
> >  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> > +	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> > +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> >  
> >  	/*
> >  	 * Can be called with only the mmap_sem for reading by
> > @@ -117,6 +119,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  			if (preserve_write)
> >  				ptent = pte_mk_savedwrite(ptent);
> >  
> > +			if (uffd_wp) {
> > +				ptent = pte_wrprotect(ptent);
> > +				ptent = pte_mkuffd_wp(ptent);
> > +			} else if (uffd_wp_resolve) {
> > +				ptent = pte_mkwrite(ptent);
> > +				ptent = pte_clear_uffd_wp(ptent);
> > +			}
> > +
> >  			/* Avoid taking write faults for known dirty pages */
> >  			if (dirty_accountable && pte_dirty(ptent) &&
> >  					(pte_soft_dirty(ptent) ||
> > @@ -301,6 +311,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
> >  {
> >  	unsigned long pages;
> >  
> > +	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
> 
> Don't you want to abort and return here if both flags are set ?

Here I would slightly prefer BUG_ON(), because current code (any
userspace syscall) cannot trigger this without changing the kernel
(currently the only kernel user of these two flags is
mwriteprotect_range(), and it only ever passes one flag in).  This
line will only be useful when we add new kernel code (or write new
kernel drivers), where it can detect programming errors, and in that
case IMHO BUG_ON() is more straightforward.

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp
  2019-02-21 18:04   ` Jerome Glisse
@ 2019-02-22  8:46     ` Peter Xu
  2019-02-22 15:35       ` Jerome Glisse
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-22  8:46 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 01:04:24PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:20AM +0800, Peter Xu wrote:
> > This allows uffd-wp to support write-protected pages for COW.
> > 
> > For example, the uffd write-protected PTE could also be write-protected
> > by other usages like COW or zero pages.  When that happens, we can't
> > simply set the write bit in the PTE since otherwise it'll change the
> > content of every single reference to the page.  Instead, we should do
> > the COW first if necessary, then handle the uffd-wp fault.
> > 
> > To correctly copy the page, we'll also need to carry over the
> > _PAGE_UFFD_WP bit if it was set in the original PTE.
> > 
> > For huge PMDs, we just simply split the huge PMDs where we want to
> > resolve an uffd-wp page fault always.  That matches what we do with
> > general huge PMD write protections.  In that way, we resolved the huge
> > PMD copy-on-write issue into PTE copy-on-write.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Few comments see below.
> 
> > ---
> >  mm/memory.c   |  2 ++
> >  mm/mprotect.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 54 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 32d32b6e6339..b5d67bafae35 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2291,6 +2291,8 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
> >  		}
> >  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> >  		entry = mk_pte(new_page, vma->vm_page_prot);
> > +		if (pte_uffd_wp(vmf->orig_pte))
> > +			entry = pte_mkuffd_wp(entry);
> >  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> 
> This looks wrong to me, isn't the uffd_wp flag clear on writeable pte ?
> If so it would be clearer to have something like:
> 
>  +		if (pte_uffd_wp(vmf->orig_pte))
>  +			entry = pte_mkuffd_wp(entry);
>  +		else
>  + 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);

Yeah, this seems clearer indeed.  The thing is that no matter whether
we set the write bit here or not, we'll always set it again later on,
simply because COW of uffd-wp pages only happens when resolving the wp
page fault (where we do want to set the write bit in all cases).
Anyway, I do like your suggestion and I'll fix it.

> 
> >  		/*
> >  		 * Clear the pte entry and flush it first, before updating the
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 9d4433044c21..ae93721f3795 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -77,14 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  		if (pte_present(oldpte)) {
> >  			pte_t ptent;
> >  			bool preserve_write = prot_numa && pte_write(oldpte);
> > +			struct page *page;
> >  
> >  			/*
> >  			 * Avoid trapping faults against the zero or KSM
> >  			 * pages. See similar comment in change_huge_pmd.
> >  			 */
> >  			if (prot_numa) {
> > -				struct page *page;
> > -
> >  				page = vm_normal_page(vma, addr, oldpte);
> >  				if (!page || PageKsm(page))
> >  					continue;
> > @@ -114,6 +113,46 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  					continue;
> >  			}
> >  
> > +			/*
> > +			 * Detect whether we'll need to COW before
> > +			 * resolving an uffd-wp fault.  Note that this
> > +			 * includes detection of the zero page (where
> > +			 * page==NULL)
> > +			 */
> > +			if (uffd_wp_resolve) {
> > +				/* If the fault is resolved already, skip */
> > +				if (!pte_uffd_wp(*pte))
> > +					continue;
> > +				page = vm_normal_page(vma, addr, oldpte);
> > +				if (!page || page_mapcount(page) > 1) {
> 
> This is wrong: if you allow page to be NULL then you are going to
> segfault in wp_page_copy() down below. Are you sure you want to test
> for a special page?  For anonymous memory this should never happen,
> i.e. anon pages are always regular pages. So if you allow userfaultfd
> to write protect only anonymous VMAs then there is no point in testing
> here, besides maybe a BUG_ON() just in case ...

It's mainly for zero pages, where page can be NULL.  Would this be
clearer:

  if (is_zero_pfn(pte_pfn(old_pte)) || (page && page_mapcount(page) > 1))

?

Now we treat zero pages as normal COW pages, so we'll do COW here even
for zero pages.  I think maybe we could add special handling all over
the place for zero pages (e.g., not write protecting a PTE if we
detect that it maps the zero PFN), but I'm uncertain whether that's
what we want, so I chose to start with the current solution, at least
to achieve functionality first.

> 
> > +					struct vm_fault vmf = {
> > +						.vma = vma,
> > +						.address = addr & PAGE_MASK,
> > +						.page = page,
> > +						.orig_pte = oldpte,
> > +						.pmd = pmd,
> > +						/* pte and ptl not needed */
> > +					};
> > +					vm_fault_t ret;
> > +
> > +					if (page)
> > +						get_page(page);
> > +					arch_leave_lazy_mmu_mode();
> > +					pte_unmap_unlock(pte, ptl);
> > +					ret = wp_page_copy(&vmf);
> > +					/* PTE is changed, or OOM */
> > +					if (ret == 0)
> > +						/* It's done by others */
> > +						continue;
> > +					else if (WARN_ON(ret != VM_FAULT_WRITE))
> > +						return pages;
> > +					pte = pte_offset_map_lock(vma->vm_mm,
> > +								  pmd, addr,
> > +								  &ptl);
> 
> Here you remap the pte locked, but you are not checking whether the
> pte is the one you expect, i.e. is it pointing to the copied page and
> does it have the expected uffd_wp flag?  Another thread might have
> raced between the time you called wp_page_copy() and the time you
> called pte_offset_map_lock().  I have not checked the mmap_sem, so
> maybe you are protected by it, as mprotect takes it in write mode
> IIRC; if so you should add a comment at the very least so people do
> not see this as a bug.

Thanks for spotting this.  With the normal uffd-wp page fault handling
path we only hold the read lock (and I would suspect it's racy even
with the write lock...).  I agree that there can be a race right after
the COW has been done.

Here IMHO we'll be fine as long as it's still a present PTE; in other
words, we should be able to tolerate PTE changes as long as the PTE is
still present.  Otherwise we'll need to retry this single PTE (e.g.,
the page could quickly be turned into a migration swap entry, or the
page could even be freed beneath us).  Does the change below look good
to you to be squashed into this patch?

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 73a65f07fe41..3423f9692838 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -73,6 +73,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                              
        flush_tlb_batched_pending(vma->vm_mm);
        arch_enter_lazy_mmu_mode();
        do {
+retry_pte:
                oldpte = *pte;
                if (pte_present(oldpte)) {
                        pte_t ptent;
@@ -149,6 +150,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                           
                                        pte = pte_offset_map_lock(vma->vm_mm,
                                                                  pmd, addr,
                                                                  &ptl);
+                                       if (!pte_present(*pte))
+                                               /*
+                                                * This PTE could have
+                                                * been modified when COW;
+                                                * retry it
+                                                */
+                                               goto retry_pte;
                                        arch_enter_lazy_mmu_mode();
                                }
                        }

[...]

> > @@ -202,7 +242,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  		}
> >  
> >  		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> > -			if (next - addr != HPAGE_PMD_SIZE) {
> > +			/*
> > +			 * When resolving an userfaultfd write
> > +			 * protection fault, it's not easy to identify
> > +			 * whether a THP is shared with others and
> > +			 * whether we'll need to do copy-on-write, so
> > +			 * just split it always for now to simply the
> > +			 * procedure.  And that's the policy too for
> > +			 * general THP write-protect in af9e4d5f2de2.
> > +			 */
> > +			if (next - addr != HPAGE_PMD_SIZE || uffd_wp_resolve) {
> 
> Using parentheses maybe? :)
>             if ((next - addr != HPAGE_PMD_SIZE) || uffd_wp_resolve) {

Sure, will fix it.

Thanks,

-- 
Peter Xu

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  2019-02-21 18:06   ` Jerome Glisse
@ 2019-02-22  9:09     ` Peter Xu
  2019-02-22 15:36       ` Jerome Glisse
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-22  9:09 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 01:06:31PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:21AM +0800, Peter Xu wrote:
> > UFFD_EVENT_FORK support for uffd-wp should be already there, except
> > that we should clean the uffd-wp bit if uffd fork event is not
> > enabled.  Detect that to avoid _PAGE_UFFD_WP being set even if the VMA
> > is not being tracked by VM_UFFD_WP.  Do this for both small PTEs and
> > huge PMDs.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This patch must be earlier in the serie, before the patch that introduce
> the userfaultfd API so that bisect can not end up on version where this
> can happen.

Yes, and that should already be the case now?  The API won't be
introduced until patch 21/26 ("userfaultfd: wp: add the writeprotect
API to userfaultfd ioctl").

> 
> Otherwise the patch itself is:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Unless I've missed anything above... I'll tentatively pick up this
R-b for now then.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2.1 04/26] mm: allow VM_FAULT_RETRY for multiple times
  2019-02-22  4:25       ` Peter Xu
@ 2019-02-22 15:11         ` Jerome Glisse
  2019-02-25  6:19           ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-22 15:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Martin Cracauer, Shaohua Li,
	Marty McFadden, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 12:25:44PM +0800, Peter Xu wrote:
> On Thu, Feb 21, 2019 at 10:53:11AM -0500, Jerome Glisse wrote:
> > On Thu, Feb 21, 2019 at 04:56:56PM +0800, Peter Xu wrote:
> > > The idea comes from a discussion between Linus and Andrea [1].

[...]

> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index 248ff0a28ecd..d842c3e02a50 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1483,9 +1483,7 @@ void do_user_addr_fault(struct pt_regs *regs,
> > >  	if (unlikely(fault & VM_FAULT_RETRY)) {
> > >  		bool is_user = flags & FAULT_FLAG_USER;
> > >  
> > > -		/* Retry at most once */
> > >  		if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > > -			flags &= ~FAULT_FLAG_ALLOW_RETRY;
> > >  			flags |= FAULT_FLAG_TRIED;
> > >  			if (is_user && signal_pending(tsk))
> > >  				return;
> > 
> > So here you have a change in behavior: it can retry indefinitely for
> > as long as there is no signal. Don't you want to test for FAULT_FLAG_TRIED ?
> 
> These first five patches do want to allow the page fault to retry as
> much as needed.  "Indefinitely" seems to be a scary word, but IMHO
> this is fine for page faults, since otherwise we'll simply crash the
> program or even crash the system depending on the fault context, so
> it seems to be nowhere worse.
> 
> For userspace programs, if anything really goes wrong (so far I
> still cannot think of a valid scenario in a bug-free system, but just
> assuming...) and it loops indefinitely, IMHO it'll just hang the buggy
> process itself rather than coredump, and the admin can simply kill the
> process to reclaim the resources since we'll still detect signals.
> 
> Or did I misunderstand the question?

No, I think you are right; it is fine to keep retrying while there is
no signal.  Maybe just add a comment that says so in so many words :)
so people do not see that as a potential issue.

> > [...]
> > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 80bb6408fe73..4e11c9639f1b 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -341,11 +341,21 @@ extern pgprot_t protection_map[16];
> > >  #define FAULT_FLAG_ALLOW_RETRY	0x04	/* Retry fault if blocking */
> > >  #define FAULT_FLAG_RETRY_NOWAIT	0x08	/* Don't drop mmap_sem and wait when retrying */
> > >  #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
> > > -#define FAULT_FLAG_TRIED	0x20	/* Second try */
> > > +#define FAULT_FLAG_TRIED	0x20	/* We've tried once */
> > >  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
> > >  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
> > >  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
> > >  
> > > +/*
> > > + * Returns true if the page fault allows retry and this is the first
> > > + * attempt of the fault handling; false otherwise.
> > > + */
> > 
> > You should add why it returns false if it is not the first try, i.e. to
> > avoid starvation.
> 
> How about:
> 
>         Returns true if the page fault allows retry and this is the
>         first attempt of the fault handling; false otherwise.  This is
>         mostly used for places where we want to try to avoid taking
>         the mmap_sem for too long a time when waiting for another
>         condition to change, in which case we can try to be polite to
>         release the mmap_sem in the first round to avoid potential
>         starvation of other processes that would also want the
>         mmap_sem.
> 
> ?

Looks perfect to me.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 05/26] mm: gup: allow VM_FAULT_RETRY for multiple times
  2019-02-22  4:41     ` Peter Xu
@ 2019-02-22 15:13       ` Jerome Glisse
  0 siblings, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-22 15:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 12:41:05PM +0800, Peter Xu wrote:
> On Thu, Feb 21, 2019 at 11:06:55AM -0500, Jerome Glisse wrote:
> > On Tue, Feb 12, 2019 at 10:56:11AM +0800, Peter Xu wrote:
> > > This is the gup counterpart of the change that allows the VM_FAULT_RETRY
> > > to happen for more than once.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> Thanks for the r-b, Jerome!
> 
> Though I plan to change this patch a bit, because I just noticed that
> I didn't touch up the hugetlbfs path for GUP.  It's not needed for now
> because hugetlbfs is not yet supported, but I think I'd better do that
> as well in this same patch to make follow-up work on hugetlb easier,
> and to make the patch more self-contained.  The new version will
> simply squash the change below into the current patch:
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e3c738bde72e..a8eace2d5296 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4257,8 +4257,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>                                 fault_flags |= FAULT_FLAG_ALLOW_RETRY |
>                                         FAULT_FLAG_RETRY_NOWAIT;
>                         if (flags & FOLL_TRIED) {
> -                               VM_WARN_ON_ONCE(fault_flags &
> -                                               FAULT_FLAG_ALLOW_RETRY);
> +                               /*
> +                                * Note: FAULT_FLAG_ALLOW_RETRY and
> +                                * FAULT_FLAG_TRIED can co-exist
> +                                */
>                                 fault_flags |= FAULT_FLAG_TRIED;
>                         }
>                         ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> 
> I'd say this change is straightforward (it's the same as the
> faultin_page change below, just for hugetlbfs).  Please let me know if
> you still want to offer the r-b with the above change squashed (I'll
> be more than glad to take it!), or I'll just wait for your review
> comments when I post the next version.

Looks good; I should have thought of hugetlbfs.  You can keep my r-b.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-02-22  7:11     ` Peter Xu
@ 2019-02-22 15:15       ` Jerome Glisse
  2019-02-25  6:45         ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-22 15:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 03:11:06PM +0800, Peter Xu wrote:
> On Thu, Feb 21, 2019 at 12:29:19PM -0500, Jerome Glisse wrote:
> > On Tue, Feb 12, 2019 at 10:56:16AM +0800, Peter Xu wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > This allows UFFDIO_COPY to map pages wrprotected.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Minor nitpick down below, but in any case:
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > 
> > > ---
> > >  fs/userfaultfd.c                 |  5 +++--
> > >  include/linux/userfaultfd_k.h    |  2 +-
> > >  include/uapi/linux/userfaultfd.h | 11 +++++-----
> > >  mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
> > >  4 files changed, 35 insertions(+), 19 deletions(-)
> > > 
> > 
> > [...]
> > 
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index d59b5a73dfb3..73a208c5c1e7 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >  			    struct vm_area_struct *dst_vma,
> > >  			    unsigned long dst_addr,
> > >  			    unsigned long src_addr,
> > > -			    struct page **pagep)
> > > +			    struct page **pagep,
> > > +			    bool wp_copy)
> > >  {
> > >  	struct mem_cgroup *memcg;
> > >  	pte_t _dst_pte, *dst_pte;
> > > @@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >  	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
> > >  		goto out_release;
> > >  
> > > -	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > > -	if (dst_vma->vm_flags & VM_WRITE)
> > > -		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> > > +	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> > > +	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> > > +		_dst_pte = pte_mkwrite(_dst_pte);
> > 
> > I like parentheses around the bitwise and :) ie:
> >     (dst_vma->vm_flags & VM_WRITE) && !wp_copy
> > 
> > I feel it is easier to read.
> 
> Yeah, another one.  Though this line will be changed in follow-up
> patches, I will fix it anyway.
> 
> > 
> > [...]
> > 
> > > @@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> > >  	if (!(dst_vma->vm_flags & VM_SHARED)) {
> > >  		if (!zeropage)
> > >  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> > > -					       dst_addr, src_addr, page);
> > > +					       dst_addr, src_addr, page,
> > > +					       wp_copy);
> > >  		else
> > >  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
> > >  						 dst_vma, dst_addr);
> > >  	} else {
> > > +		VM_WARN_ON(wp_copy); /* WP only available for anon */
> > 
> > Don't you want to return with error here ?
> 
> Makes sense to me.  Does this looks good to you to be squashed into
> current patch?
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 73a208c5c1e7..f3ea09f412d4 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -73,7 +73,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>                 goto out_release;
>  
>         _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> -       if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> +       if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy)
>                 _dst_pte = pte_mkwrite(_dst_pte);
>  
>         dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> @@ -424,7 +424,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
>                         err = mfill_zeropage_pte(dst_mm, dst_pmd,
>                                                  dst_vma, dst_addr);
>         } else {
> -               VM_WARN_ON(wp_copy); /* WP only available for anon */
> +               if (unlikely(wp_copy))
> +                       /* TODO: WP currently only available for anon */
> +                       return -EINVAL;
> +
>                 if (!zeropage)
>                         err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
>                                                      dst_vma, dst_addr,

I would keep the VM_WARN_ON, or maybe a ONCE variant, so that we at
least have a chance to be informed if for some reason that code path
is taken.  With that my r-b stands.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit
  2019-02-22  7:31     ` Peter Xu
@ 2019-02-22 15:17       ` Jerome Glisse
  0 siblings, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-22 15:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 03:31:35PM +0800, Peter Xu wrote:
> On Thu, Feb 21, 2019 at 12:44:02PM -0500, Jerome Glisse wrote:
> > On Tue, Feb 12, 2019 at 10:56:18AM +0800, Peter Xu wrote:
> > > Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
> > > change_protection() when used with uffd-wp and make sure the two new
> > > flags are exclusively used.  Then,
> > > 
> > >   - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
> > >     when a range of memory is write protected by uffd
> > > 
> > >   - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
> > >     _PAGE_RW when write protection is resolved from userspace
> > > 
> > > And use this new interface in mwriteprotect_range() to replace the old
> > > MM_CP_DIRTY_ACCT.
> > > 
> > > Do this change for both PTEs and huge PMDs.  Then we can start to
> > > identify which PTE/PMD is write protected by general (e.g., COW or soft
> > > dirty tracking), and which is for userfaultfd-wp.
> > > 
> > > Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
> > > into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
> > > can be even more strict when detecting uffd-wp page faults in either
> > > do_wp_page() or wp_huge_pmd().
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Few comments but still:
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> Thanks!
> 
> > 
> > > ---
> > >  arch/x86/include/asm/pgtable_types.h |  2 +-
> > >  include/linux/mm.h                   |  5 +++++
> > >  mm/huge_memory.c                     | 14 +++++++++++++-
> > >  mm/memory.c                          |  4 ++--
> > >  mm/mprotect.c                        | 12 ++++++++++++
> > >  mm/userfaultfd.c                     |  8 ++++++--
> > >  6 files changed, 39 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> > > index 8cebcff91e57..dd9c6295d610 100644
> > > --- a/arch/x86/include/asm/pgtable_types.h
> > > +++ b/arch/x86/include/asm/pgtable_types.h
> > > @@ -133,7 +133,7 @@
> > >   */
> > >  #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
> > >  			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
> > > -			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
> > > +			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_UFFD_WP)
> > >  #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
> > 
> > This chunk needs to be in the earlier arch specific patch.
> 
> Indeed.  I'll move it over.
> 
> > 
> > [...]
> > 
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 8d65b0f041f9..817335b443c2 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > 
> > [...]
> > 
> > > @@ -2198,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > >  				entry = pte_mkold(entry);
> > >  			if (soft_dirty)
> > >  				entry = pte_mksoft_dirty(entry);
> > > +			if (uffd_wp)
> > > +				entry = pte_mkuffd_wp(entry);
> > >  		}
> > >  		pte = pte_offset_map(&_pmd, addr);
> > >  		BUG_ON(!pte_none(*pte));
> > 
> > Reading that code, I thought it would be nice if we could define a
> > pte mask that we could OR in, instead of all those if () entry |= ...,
> > but that is just a minor optimization and does not have any bearing
> > on the present patch.  Just wanted to say that out loud.
> 
> (I agree; though I'll just concentrate on the series for now)
> 
> > 
> > 
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index a6ba448c8565..9d4433044c21 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -46,6 +46,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  	int target_node = NUMA_NO_NODE;
> > >  	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
> > >  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> > > +	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> > > +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> > >  
> > >  	/*
> > >  	 * Can be called with only the mmap_sem for reading by
> > > @@ -117,6 +119,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  			if (preserve_write)
> > >  				ptent = pte_mk_savedwrite(ptent);
> > >  
> > > +			if (uffd_wp) {
> > > +				ptent = pte_wrprotect(ptent);
> > > +				ptent = pte_mkuffd_wp(ptent);
> > > +			} else if (uffd_wp_resolve) {
> > > +				ptent = pte_mkwrite(ptent);
> > > +				ptent = pte_clear_uffd_wp(ptent);
> > > +			}
> > > +
> > >  			/* Avoid taking write faults for known dirty pages */
> > >  			if (dirty_accountable && pte_dirty(ptent) &&
> > >  					(pte_soft_dirty(ptent) ||
> > > @@ -301,6 +311,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
> > >  {
> > >  	unsigned long pages;
> > >  
> > > +	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
> > 
> > Don't you want to abort and return here if both flags are set ?
> 
> Here I would slightly prefer BUG_ON(), because the current code (any
> userspace syscall) cannot trigger this without changing the kernel
> (currently the only kernel user of these two flags will be
> mwriteprotect_range(), and it will definitely only pass one flag in).
> This line will only be useful when we add new kernel code (or write
> new kernel drivers), where it can be used to detect programming
> errors.  In that case IMHO BUG_ON() would be more straightforward.
> 

OK, I agree.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp
  2019-02-22  8:46     ` Peter Xu
@ 2019-02-22 15:35       ` Jerome Glisse
  2019-02-25  7:13         ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Jerome Glisse @ 2019-02-22 15:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 04:46:03PM +0800, Peter Xu wrote:
> On Thu, Feb 21, 2019 at 01:04:24PM -0500, Jerome Glisse wrote:
> > On Tue, Feb 12, 2019 at 10:56:20AM +0800, Peter Xu wrote:
> > > This allows uffd-wp to support write-protected pages for COW.

[...]

> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 9d4433044c21..ae93721f3795 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -77,14 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  		if (pte_present(oldpte)) {
> > >  			pte_t ptent;
> > >  			bool preserve_write = prot_numa && pte_write(oldpte);
> > > +			struct page *page;
> > >  
> > >  			/*
> > >  			 * Avoid trapping faults against the zero or KSM
> > >  			 * pages. See similar comment in change_huge_pmd.
> > >  			 */
> > >  			if (prot_numa) {
> > > -				struct page *page;
> > > -
> > >  				page = vm_normal_page(vma, addr, oldpte);
> > >  				if (!page || PageKsm(page))
> > >  					continue;
> > > @@ -114,6 +113,46 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  					continue;
> > >  			}
> > >  
> > > +			/*
> > > +			 * Detect whether we'll need to COW before
> > > +			 * resolving an uffd-wp fault.  Note that this
> > > +			 * includes detection of the zero page (where
> > > +			 * page==NULL)
> > > +			 */
> > > +			if (uffd_wp_resolve) {
> > > +				/* If the fault is resolved already, skip */
> > > +				if (!pte_uffd_wp(*pte))
> > > +					continue;
> > > +				page = vm_normal_page(vma, addr, oldpte);
> > > +				if (!page || page_mapcount(page) > 1) {
> > 
> > This is wrong: if you allow page to be NULL then you are going to
> > segfault in wp_page_copy() down below. Are you sure you want to test
> > for a special page?  For anonymous memory this should never happen,
> > i.e. anon pages are always regular pages. So if you allow userfaultfd
> > to write protect only anonymous VMAs then there is no point in testing
> > here, besides maybe a BUG_ON() just in case ...
> 
> It's mainly for zero pages, where page can be NULL.  Would this be
> clearer:
> 
>   if (is_zero_pfn(pte_pfn(old_pte)) || (page && page_mapcount(page) > 1))
> 
> ?
> 
> Now we treat zero pages as normal COW pages, so we'll do COW here even
> for zero pages.  I think maybe we could add special handling all over
> the place for zero pages (e.g., not write protecting a PTE if we
> detect that it maps the zero PFN), but I'm uncertain whether that's
> what we want, so I chose to start with the current solution, at least
> to achieve functionality first.

You can keep the vm_normal_page() in that case, but split the if
between page == NULL and page != NULL with mapcount > 1, as otherwise
you will segfault below.


> 
> > 
> > > +					struct vm_fault vmf = {
> > > +						.vma = vma,
> > > +						.address = addr & PAGE_MASK,
> > > +						.page = page,
> > > +						.orig_pte = oldpte,
> > > +						.pmd = pmd,
> > > +						/* pte and ptl not needed */
> > > +					};
> > > +					vm_fault_t ret;
> > > +
> > > +					if (page)
> > > +						get_page(page);
> > > +					arch_leave_lazy_mmu_mode();
> > > +					pte_unmap_unlock(pte, ptl);
> > > +					ret = wp_page_copy(&vmf);
> > > +					/* PTE is changed, or OOM */
> > > +					if (ret == 0)
> > > +						/* It's done by others */
> > > +						continue;
> > > +					else if (WARN_ON(ret != VM_FAULT_WRITE))
> > > +						return pages;
> > > +					pte = pte_offset_map_lock(vma->vm_mm,
> > > +								  pmd, addr,
> > > +								  &ptl);
> > 
> > Here you remap the pte locked, but you are not checking whether the
> > pte is the one you expect, i.e. is it pointing to the copied page and
> > does it have the expected uffd_wp flag?  Another thread might have
> > raced between the time you called wp_page_copy() and the time you
> > called pte_offset_map_lock().  I have not checked the mmap_sem, so
> > maybe you are protected by it, as mprotect takes it in write mode
> > IIRC; if so you should add a comment at the very least so people do
> > not see this as a bug.
> 
> Thanks for spotting this.  With the normal uffd-wp page fault handling
> path we only hold the read lock (and I would suspect it's racy even
> with the write lock...).  I agree that there can be a race right after
> the COW has been done.
> 
> Here IMHO we'll be fine as long as it's still a present PTE; in other
> words, we should be able to tolerate PTE changes as long as the PTE is
> still present.  Otherwise we'll need to retry this single PTE (e.g.,
> the page could quickly be turned into a migration swap entry, or the
> page could even be freed beneath us).  Does the change below look good
> to you to be squashed into this patch?

OK, but the if below must come after arch_enter_lazy_mmu_mode(), not before.

> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 73a65f07fe41..3423f9692838 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -73,6 +73,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                              
>         flush_tlb_batched_pending(vma->vm_mm);
>         arch_enter_lazy_mmu_mode();
>         do {
> +retry_pte:
>                 oldpte = *pte;
>                 if (pte_present(oldpte)) {
>                         pte_t ptent;
> @@ -149,6 +150,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                           
>                                         pte = pte_offset_map_lock(vma->vm_mm,
>                                                                   pmd, addr,
>                                                                   &ptl);
> +                                       if (!pte_present(*pte))
> +                                               /*
> +                                                * This PTE could have
> +                                                * been modified when COW;
> +                                                * retry it
> +                                                */
> +                                               goto retry_pte;
>                                         arch_enter_lazy_mmu_mode();
>                                 }
>                         }

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  2019-02-22  9:09     ` Peter Xu
@ 2019-02-22 15:36       ` Jerome Glisse
  0 siblings, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-22 15:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 05:09:19PM +0800, Peter Xu wrote:
> On Thu, Feb 21, 2019 at 01:06:31PM -0500, Jerome Glisse wrote:
> > On Tue, Feb 12, 2019 at 10:56:21AM +0800, Peter Xu wrote:
> > > UFFD_EVENT_FORK support for uffd-wp should be already there, except
> > > that we should clean the uffd-wp bit if uffd fork event is not
> > > enabled.  Detect that to avoid _PAGE_UFFD_WP being set even if the VMA
> > > is not being tracked by VM_UFFD_WP.  Do this for both small PTEs and
> > > huge PMDs.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > This patch must be earlier in the serie, before the patch that introduce
> > the userfaultfd API so that bisect can not end up on version where this
> > can happen.
> 
> Yes, it should be now?  The API won't be introduced until patch
> 21/26 ("userfaultfd: wp: add the writeprotect API to userfaultfd
> ioctl").

No, I was confused when reading this patch; I had the feeling it came
after the ioctl patch.  Ignore my comment.

> 
> > 
> > Otherwise the patch itself is:
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> Unless I've missed anything above... I'll tentatively take this R-b
> for now then.

It is fine, the patch ordering was my confusion.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2.1 04/26] mm: allow VM_FAULT_RETRY for multiple times
  2019-02-22 15:11         ` Jerome Glisse
@ 2019-02-25  6:19           ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-25  6:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Martin Cracauer, Shaohua Li,
	Marty McFadden, Andrea Arcangeli, Mike Kravetz, Denis Plotnikov,
	Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 10:11:58AM -0500, Jerome Glisse wrote:
> On Fri, Feb 22, 2019 at 12:25:44PM +0800, Peter Xu wrote:
> > On Thu, Feb 21, 2019 at 10:53:11AM -0500, Jerome Glisse wrote:
> > > On Thu, Feb 21, 2019 at 04:56:56PM +0800, Peter Xu wrote:
> > > > The idea comes from a discussion between Linus and Andrea [1].
> 
> [...]
> 
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index 248ff0a28ecd..d842c3e02a50 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -1483,9 +1483,7 @@ void do_user_addr_fault(struct pt_regs *regs,
> > > >  	if (unlikely(fault & VM_FAULT_RETRY)) {
> > > >  		bool is_user = flags & FAULT_FLAG_USER;
> > > >  
> > > > -		/* Retry at most once */
> > > >  		if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > > > -			flags &= ~FAULT_FLAG_ALLOW_RETRY;
> > > >  			flags |= FAULT_FLAG_TRIED;
> > > >  			if (is_user && signal_pending(tsk))
> > > >  				return;
> > > 
> > > So here you have a change in behavior: it can retry indefinitely for as
> > > long as there is no signal.  Don't you want to test for FAULT_FLAG_TRIED?
> > 
> > These first five patches do want to allow the page fault to retry as
> > much as needed.  "Indefinitely" sounds scary, but IMHO this is fine
> > for page faults, since the alternative is to crash the program or
> > even the system depending on the fault context, so it's nowhere
> > worse.
> > 
> > For userspace programs, if anything really goes wrong (so far I
> > still cannot think of a valid scenario in a bug-free system, but
> > just assuming...) and a fault loops indefinitely, IMHO it'll just
> > hang the buggy process rather than coredump, and the admin can
> > simply kill the process to reclaim the resources since we'll still
> > detect signals.
> > 
> > Or did I misunderstand the question?
> 
> No, I think you are right; it is fine to keep retrying while there is
> no signal.  Maybe just add a comment that says so in so many words :)
> so people do not see it as a potential issue.

Sure thing.  I don't know whether commenting on this in all the
architectures is good...  I'll try to add some comments above the
FAULT_FLAG_* definitions to explain this.

Thanks!

-- 
Peter Xu
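[Editor's note: to make the discussed behavior concrete, here is a hedged userspace sketch of a fault loop that retries without an upper bound but still bails out on a pending signal.  fake_fault_wants_retry() and fake_signal_pending() are simulation stand-ins, not the real kernel helpers.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the x86 hunk above after the "Retry at most once"
 * logic is removed: keep retrying for as long as the fault handler
 * asks for VM_FAULT_RETRY, and give up only when a signal is
 * pending.  The two globals are knobs for the simulation only.
 */
static int fake_fault_retries;	/* fake handler retries this many times */
static int fake_signal_after;	/* fake signal arrives after N attempts */

static bool fake_fault_wants_retry(void)
{
	return fake_fault_retries-- > 0;	/* true = VM_FAULT_RETRY */
}

static bool fake_signal_pending(int attempts)
{
	return attempts >= fake_signal_after;
}

/* Returns the number of attempts taken, or -1 if interrupted by a signal. */
static int toy_fault_loop(void)
{
	int attempts = 0;

	for (;;) {
		attempts++;
		if (!fake_fault_wants_retry())
			return attempts;	/* fault resolved */
		if (fake_signal_pending(attempts))
			return -1;		/* return to handle the signal */
		/* no upper bound on retries: loop again */
	}
}
```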

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-02-22 15:15       ` Jerome Glisse
@ 2019-02-25  6:45         ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-25  6:45 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 10:15:47AM -0500, Jerome Glisse wrote:
> On Fri, Feb 22, 2019 at 03:11:06PM +0800, Peter Xu wrote:
> > On Thu, Feb 21, 2019 at 12:29:19PM -0500, Jerome Glisse wrote:
> > > On Tue, Feb 12, 2019 at 10:56:16AM +0800, Peter Xu wrote:
> > > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > > 
> > > > This allows UFFDIO_COPY to map pages wrprotected.
> > > > 
> > > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > Minor nitpick down below, but in any case:
> > > 
> > > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > > 
> > > > ---
> > > >  fs/userfaultfd.c                 |  5 +++--
> > > >  include/linux/userfaultfd_k.h    |  2 +-
> > > >  include/uapi/linux/userfaultfd.h | 11 +++++-----
> > > >  mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
> > > >  4 files changed, 35 insertions(+), 19 deletions(-)
> > > > 
> > > 
> > > [...]
> > > 
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index d59b5a73dfb3..73a208c5c1e7 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > >  			    struct vm_area_struct *dst_vma,
> > > >  			    unsigned long dst_addr,
> > > >  			    unsigned long src_addr,
> > > > -			    struct page **pagep)
> > > > +			    struct page **pagep,
> > > > +			    bool wp_copy)
> > > >  {
> > > >  	struct mem_cgroup *memcg;
> > > >  	pte_t _dst_pte, *dst_pte;
> > > > @@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > >  	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
> > > >  		goto out_release;
> > > >  
> > > > -	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > > > -	if (dst_vma->vm_flags & VM_WRITE)
> > > > -		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> > > > +	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> > > > +	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> > > > +		_dst_pte = pte_mkwrite(_dst_pte);
> > > 
> > > I'd like parentheses around the & :) ie:
> > >     (dst_vma->vm_flags & VM_WRITE) && !wp_copy
> > > 
> > > I feel it is easier to read.
> > 
> > Yeah, another one.  Though this line will be changed in follow-up
> > patches, I'll fix it anyway.
> > 
> > > 
> > > [...]
> > > 
> > > > @@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> > > >  	if (!(dst_vma->vm_flags & VM_SHARED)) {
> > > >  		if (!zeropage)
> > > >  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> > > > -					       dst_addr, src_addr, page);
> > > > +					       dst_addr, src_addr, page,
> > > > +					       wp_copy);
> > > >  		else
> > > >  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
> > > >  						 dst_vma, dst_addr);
> > > >  	} else {
> > > > +		VM_WARN_ON(wp_copy); /* WP only available for anon */
> > > 
> > > Don't you want to return with error here ?
> > 
> > Makes sense to me.  Does this looks good to you to be squashed into
> > current patch?
> > 
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 73a208c5c1e7..f3ea09f412d4 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -73,7 +73,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> >                 goto out_release;
> >  
> >         _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> > -       if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> > +       if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy)
> >                 _dst_pte = pte_mkwrite(_dst_pte);
> >  
> >         dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > @@ -424,7 +424,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> >                         err = mfill_zeropage_pte(dst_mm, dst_pmd,
> >                                                  dst_vma, dst_addr);
> >         } else {
> > -               VM_WARN_ON(wp_copy); /* WP only available for anon */
> > +               if (unlikely(wp_copy))
> > +                       /* TODO: WP currently only available for anon */
> > +                       return -EINVAL;
> > +
> >                 if (!zeropage)
> >                         err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
> >                                                      dst_vma, dst_addr,
> 
> I would keep the VM_WARN_ON, or maybe a *_ONCE variant, so that we at
> least have a chance to be informed if for some reason that code path
> is taken.  With that my r-b stands.

Yeah, *_ONCE() is good with me too (both avoid a DoS from userspace),
and I don't have a strong opinion on whether we should fail this
specific ioctl if it happens.  For now I'll just take the advice and
the r-b together.  Thanks,

-- 
Peter Xu
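[Editor's note: the agreed outcome — warn once so the unexpected path stays visible without letting userspace flood the log, and still fail the request — can be sketched in plain C.  toy_warn_once() and toy_mfill_shared() are illustrative stand-ins; the real code would use the kernel's WARN_ON_ONCE().]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace sketch of the combination discussed above: a one-shot
 * warning (so repeated ioctls from userspace cannot spam the log)
 * plus an -EINVAL failure.  toy_mfill_shared() stands in for the
 * shared-VMA branch of mfill_atomic_pte(); it is not kernel code.
 */
static bool warned_once;

static void toy_warn_once(const char *msg)
{
	if (!warned_once) {
		warned_once = true;
		fprintf(stderr, "WARNING (once): %s\n", msg);
	}
}

static int toy_mfill_shared(bool wp_copy)
{
	if (wp_copy) {
		/* WP currently only available for anon memory */
		toy_warn_once("UFFDIO_COPY_MODE_WP on shared VMA");
		return -EINVAL;
	}
	return 0;	/* proceed with the shared-memory copy */
}
```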

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp
  2019-02-22 15:35       ` Jerome Glisse
@ 2019-02-25  7:13         ` Peter Xu
  2019-02-25 15:32           ` Jerome Glisse
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-25  7:13 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Fri, Feb 22, 2019 at 10:35:09AM -0500, Jerome Glisse wrote:
> On Fri, Feb 22, 2019 at 04:46:03PM +0800, Peter Xu wrote:
> > On Thu, Feb 21, 2019 at 01:04:24PM -0500, Jerome Glisse wrote:
> > > On Tue, Feb 12, 2019 at 10:56:20AM +0800, Peter Xu wrote:
> > > > This allows uffd-wp to support write-protected pages for COW.
> 
> [...]
> 
> > > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > > index 9d4433044c21..ae93721f3795 100644
> > > > --- a/mm/mprotect.c
> > > > +++ b/mm/mprotect.c
> > > > @@ -77,14 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > > >  		if (pte_present(oldpte)) {
> > > >  			pte_t ptent;
> > > >  			bool preserve_write = prot_numa && pte_write(oldpte);
> > > > +			struct page *page;
> > > >  
> > > >  			/*
> > > >  			 * Avoid trapping faults against the zero or KSM
> > > >  			 * pages. See similar comment in change_huge_pmd.
> > > >  			 */
> > > >  			if (prot_numa) {
> > > > -				struct page *page;
> > > > -
> > > >  				page = vm_normal_page(vma, addr, oldpte);
> > > >  				if (!page || PageKsm(page))
> > > >  					continue;
> > > > @@ -114,6 +113,46 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > > >  					continue;
> > > >  			}
> > > >  
> > > > +			/*
> > > > +			 * Detect whether we'll need to COW before
> > > > +			 * resolving an uffd-wp fault.  Note that this
> > > > +			 * includes detection of the zero page (where
> > > > +			 * page==NULL)
> > > > +			 */
> > > > +			if (uffd_wp_resolve) {
> > > > +				/* If the fault is resolved already, skip */
> > > > +				if (!pte_uffd_wp(*pte))
> > > > +					continue;
> > > > +				page = vm_normal_page(vma, addr, oldpte);
> > > > +				if (!page || page_mapcount(page) > 1) {
> > > 
> > > This is wrong: if you allow page to be NULL then you are going to
> > > segfault in wp_page_copy() down below.  Are you sure you want to
> > > test for special pages?  For anonymous memory this should never
> > > happen, ie anon pages are always regular pages.  So if you allow
> > > userfaultfd to write protect only anonymous vmas then there is no
> > > point in testing here, besides maybe a BUG_ON() just in case...
> > 
> > It's majorly for zero pages where page can be NULL.  Would this be
> > clearer:
> > 
> >   if (is_zero_pfn(pte_pfn(oldpte)) || (page && page_mapcount(page) > 1))
> > 
> > ?
> > 
> > Now we treat zero pages as normal COW pages, so we'll do COW here
> > even for zero pages.  Maybe we could special-case zero pages all
> > over the place (e.g., not write protect a PTE at all if we detect
> > the zero PFN), but I'm uncertain whether that's what we want, so I
> > chose to start with the current solution, at least to achieve
> > functionality first.
> 
> You can keep the vm_normal_page() in that case but split the if
> between page == NULL and page != NULL with mapcount > 1, as
> otherwise you will segfault below.

Could I ask what segfault you mean?  My understanding is that the
code below already takes page==NULL into consideration, e.g., we only
do get_page() if page!=NULL, and wp_page_copy() has similar
considerations inside.

> 
> 
> > 
> > > 
> > > > +					struct vm_fault vmf = {
> > > > +						.vma = vma,
> > > > +						.address = addr & PAGE_MASK,
> > > > +						.page = page,
> > > > +						.orig_pte = oldpte,
> > > > +						.pmd = pmd,
> > > > +						/* pte and ptl not needed */
> > > > +					};
> > > > +					vm_fault_t ret;
> > > > +
> > > > +					if (page)
> > > > +						get_page(page);
> > > > +					arch_leave_lazy_mmu_mode();
> > > > +					pte_unmap_unlock(pte, ptl);
> > > > +					ret = wp_page_copy(&vmf);
> > > > +					/* PTE is changed, or OOM */
> > > > +					if (ret == 0)
> > > > +						/* It's done by others */
> > > > +						continue;
> > > > +					else if (WARN_ON(ret != VM_FAULT_WRITE))
> > > > +						return pages;
> > > > +					pte = pte_offset_map_lock(vma->vm_mm,
> > > > +								  pmd, addr,
> > > > +								  &ptl);
> > > 
> > > Here you remap the pte locked, but you are not checking that the
> > > pte is the one you expect, ie that it points to the copied page
> > > and has the expected uffd_wp flag.  Another thread might have
> > > raced between the time you called wp_page_copy() and the time you
> > > called pte_offset_map_lock().  I have not checked the mmap_sem, so
> > > maybe you are protected by it, as mprotect takes it in write mode
> > > IIRC; if so you should at the very least add a comment so people
> > > do not see this as a bug.
> > 
> > Thanks for spotting this.  With the normal uffd-wp page fault
> > handling path we hold only the read lock (and I would suspect it's
> > racy even with the write lock...).  I agree that there can be a
> > race right after the COW has been done.
> > 
> > Here IMHO we'll be fine as long as it's still a present PTE; in
> > other words, we can tolerate PTE changes as long as it stays
> > present, otherwise we'll need to retry this single PTE (e.g., the
> > page could quickly be turned into a migration swap entry, or even
> > be freed beneath us).  Does the change below look good to you to
> > be squashed into this patch?
> 
> Ok, but the if below must come after arch_enter_lazy_mmu_mode(), not before.

Oops... you are right. :)

Thanks,

> 
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 73a65f07fe41..3423f9692838 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -73,6 +73,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                              
> >         flush_tlb_batched_pending(vma->vm_mm);
> >         arch_enter_lazy_mmu_mode();
> >         do {
> > +retry_pte:
> >                 oldpte = *pte;
> >                 if (pte_present(oldpte)) {
> >                         pte_t ptent;
> > @@ -149,6 +150,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                           
> >                                         pte = pte_offset_map_lock(vma->vm_mm,
> >                                                                   pmd, addr,
> >                                                                   &ptl);
> > +                                       if (!pte_present(*pte))
> > +                                               /*
> > +                                                * This PTE could have
> > +                                                * been modified when COW;
> > +                                                * retry it
> > +                                                */
> > +                                               goto retry_pte;
> >                                         arch_enter_lazy_mmu_mode();
> >                                 }
> >                         }

-- 
Peter Xu
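[Editor's note: the condition being debated above boils down to a small predicate — COW is needed before removing write protection when the PTE maps the zero page (vm_normal_page() returns NULL) or a page shared with other mappings.  A hedged userspace restatement, with a toy type standing in for struct page:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy restatement of the uffd_wp_resolve check discussed above:
 * a private copy is required before clearing write protection when
 * the PTE maps the zero page (modeled as page == NULL, as returned
 * by vm_normal_page()) or a page with more than one mapping.
 * struct toy_page is illustrative, not the kernel's struct page.
 */
struct toy_page {
	int mapcount;
};

static bool need_cow_before_unprotect(const struct toy_page *page)
{
	/* page == NULL models the zero page / special mappings */
	return page == NULL || page->mapcount > 1;
}
```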

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 17/26] userfaultfd: wp: support swap and page migration
  2019-02-21 18:16   ` Jerome Glisse
@ 2019-02-25  7:48     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-25  7:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 01:16:19PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:23AM +0800, Peter Xu wrote:
> > For either swap and page migration, we all use the bit 2 of the entry to
> > identify whether this entry is uffd write-protected.  It plays a similar
> > role as the existing soft dirty bit in swap entries but only for keeping
> > the uffd-wp tracking for a specific PTE/PMD.
> > 
> > Something special here is that when we want to recover the uffd-wp bit
> > from a swap/migration entry to the PTE bit we'll also need to take care
> > of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> > _PAGE_UFFD_WP bit we can't trap it at all.
> > 
> > Note that this patch removed two lines from "userfaultfd: wp: hook
> > userfault handler to write protection fault" where we try to remove the
> > VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> > patch will still keep the write flag there.
> 
> That part is confusing; you probably want to remove that code from
> the previous patch, or at least address my comment in the previous
> patch's review.

(please see below...)

> 
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/swapops.h | 2 ++
> >  mm/huge_memory.c        | 3 +++
> >  mm/memory.c             | 8 ++++++--
> >  mm/migrate.c            | 7 +++++++
> >  mm/mprotect.c           | 2 ++
> >  mm/rmap.c               | 6 ++++++
> >  6 files changed, 26 insertions(+), 2 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c2035539e9fd..7cee990d67cf 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  				pte = swp_entry_to_pte(entry);
> >  				if (pte_swp_soft_dirty(*src_pte))
> >  					pte = pte_swp_mksoft_dirty(pte);
> > +				if (pte_swp_uffd_wp(*src_pte))
> > +					pte = pte_swp_mkuffd_wp(pte);
> >  				set_pte_at(src_mm, addr, src_pte, pte);
> >  			}
> >  		} else if (is_device_private_entry(entry)) {
> > @@ -2815,8 +2817,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> >  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> >  	pte = mk_pte(page, vma->vm_page_prot);
> > -	if (userfaultfd_wp(vma))
> > -		vmf->flags &= ~FAULT_FLAG_WRITE;
> 
> So this is the confusing part with the previous patch that introduces
> that code.  It feels like you should just remove that code entirely
> from the previous patch.

When I wrote the other part I didn't completely understand those two
lines, so I kept them to make sure I wouldn't throw away anything that
could actually be useful.  If you also agree that we can drop these
lines, I'll simply do that in the next version (and drop those
comments from the commit message too).  Andrea, please correct me if I
am wrong on that...

> 
> >  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
> >  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >  		vmf->flags &= ~FAULT_FLAG_WRITE;
> > @@ -2826,6 +2826,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	flush_icache_page(vma, page);
> >  	if (pte_swp_soft_dirty(vmf->orig_pte))
> >  		pte = pte_mksoft_dirty(pte);
> > +	if (pte_swp_uffd_wp(vmf->orig_pte)) {
> > +		pte = pte_mkuffd_wp(pte);
> > +		pte = pte_wrprotect(pte);
> > +	}
> >  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >  	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >  	vmf->orig_pte = pte;
> 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index d4fd680be3b0..605ccd1f5c64 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -242,6 +242,11 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
> >  		if (is_write_migration_entry(entry))
> >  			pte = maybe_mkwrite(pte, vma);
> >  
> > +		if (pte_swp_uffd_wp(*pvmw.pte)) {
> > +			pte = pte_mkuffd_wp(pte);
> > +			pte = pte_wrprotect(pte);
> > +		}
> 
> If the page was write protected prior to migration then it should never
> end up as a write migration entry and thus the above should be something
> like:
> 		if (is_write_migration_entry(entry)) {
> 			pte = maybe_mkwrite(pte, vma);
> 		} else if (pte_swp_uffd_wp(*pvmw.pte)) {
> 			pte = pte_mkuffd_wp(pte);
> 		}

Yeah, I agree; I can't think of another case that would violate the
rule, so I'm taking your advice since it should be cleaner.

Thanks!

-- 
Peter Xu
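[Editor's note: the commit-message point — restoring the uffd-wp bit from a swap/migration entry must also clear the write bit, or the write never faults and the protection is never noticed — can be sketched with toy flag bits.  The names mirror, but are not, the real _PAGE_* definitions.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PTE bits; illustrative, not the real x86 _PAGE_* values. */
#define TOY_PAGE_RW		0x1u
#define TOY_PAGE_UFFD_WP	0x2u

/*
 * Sketch of the restore paths above (do_swap_page() /
 * remove_migration_pte()): when the swap/migration entry carried the
 * uffd-wp bit, set it on the new present PTE *and* clear RW, so the
 * next write actually faults and can be trapped by userfaultfd.
 */
static unsigned int toy_restore_pte(unsigned int pte, bool swp_uffd_wp)
{
	if (swp_uffd_wp) {
		pte |= TOY_PAGE_UFFD_WP;
		pte &= ~TOY_PAGE_RW;	/* pte_wrprotect() equivalent */
	}
	return pte;
}
```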

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-21 18:23   ` Jerome Glisse
@ 2019-02-25  8:16     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-25  8:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Rik van Riel

On Thu, Feb 21, 2019 at 01:23:59PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:26AM +0800, Peter Xu wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> > this doesn't split/merge vmas.
> > 
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > [peterx:
> >  - use the helper to find VMA;
> >  - return -ENOENT if not found to match mcopy case;
> >  - use the new MM_CP_UFFD_WP* flags for change_protection
> >  - check against mmap_changing for failures]
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> I have a question see below but anyway:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Thanks!

> 
> > ---
> >  include/linux/userfaultfd_k.h |  3 ++
> >  mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 57 insertions(+)
> > 
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 765ce884cec0..8f6e6ed544fb 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> >  			      unsigned long dst_start,
> >  			      unsigned long len,
> >  			      bool *mmap_changing);
> > +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> > +			       unsigned long start, unsigned long len,
> > +			       bool enable_wp, bool *mmap_changing);
> >  
> >  /* mm helpers */
> >  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index fefa81c301b7..529d180bb4d7 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> >  {
> >  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
> >  }
> > +
> > +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > +			unsigned long len, bool enable_wp, bool *mmap_changing)
> > +{
> > +	struct vm_area_struct *dst_vma;
> > +	pgprot_t newprot;
> > +	int err;
> > +
> > +	/*
> > +	 * Sanitize the command parameters:
> > +	 */
> > +	BUG_ON(start & ~PAGE_MASK);
> > +	BUG_ON(len & ~PAGE_MASK);
> > +
> > +	/* Does the address range wrap, or is the span zero-sized? */
> > +	BUG_ON(start + len <= start);
> > +
> > +	down_read(&dst_mm->mmap_sem);
> > +
> > +	/*
> > +	 * If memory mappings are changing because of non-cooperative
> > +	 * operation (e.g. mremap) running in parallel, bail out and
> > +	 * request the user to retry later
> > +	 */
> > +	err = -EAGAIN;
> > +	if (mmap_changing && READ_ONCE(*mmap_changing))
> > +		goto out_unlock;
> > +
> > +	err = -ENOENT;
> > +	dst_vma = vma_find_uffd(dst_mm, start, len);
> > +	/*
> > +	 * Make sure the vma is not shared, that the dst range is
> > +	 * both valid and fully within a single existing vma.
> > +	 */
> > +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> > +		goto out_unlock;
> > +	if (!userfaultfd_wp(dst_vma))
> > +		goto out_unlock;
> > +	if (!vma_is_anonymous(dst_vma))
> > +		goto out_unlock;
> 
> Don't you want to distinguish between no VMA (ie ENOENT) and a VMA
> that cannot be write protected (VM_SHARED, not userfaultfd, not
> anonymous)?

Here we'll return ENOENT for all these errors, which actually tries to
follow the existing MISSING error codes.  Mike noticed some errno
issues while reviewing the first version and suggested that we'd
better follow the old rules, which makes perfect sense to me.  E.g.,
in __mcopy_atomic() we'll return ENOENT for either (1) VMA not found,
(2) not a UFFD VMA, or (3) range check failures.  The checks against
anonymous and VM_SHARED are special to uffd-wp, but I'm simply reusing
the same errno since, after all, each of these errors stops us from
finding a valid VMA before going anywhere further.

Thanks,

-- 
Peter Xu
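[Editor's note: the validation order described above — one -ENOENT for every way of failing to find a usable VMA — can be restated as a hedged sketch.  The toy struct and its three flags are illustrative; the real code checks struct vm_area_struct and its vm_flags.]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy restatement of the dst_vma checks in mwriteprotect_range()
 * quoted above: every way of not finding a usable VMA collapses to
 * -ENOENT, matching the existing MISSING-mode convention.
 */
struct toy_vma {
	bool shared;		/* VM_SHARED set */
	bool uffd_wp;		/* registered with UFFDIO_REGISTER_MODE_WP */
	bool anonymous;		/* vma_is_anonymous() */
};

static int toy_validate_dst_vma(const struct toy_vma *vma)
{
	if (vma == NULL)	/* range not fully within a single VMA */
		return -ENOENT;
	if (vma->shared || !vma->uffd_wp || !vma->anonymous)
		return -ENOENT;
	return 0;
}
```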

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-02-21 18:28   ` Jerome Glisse
@ 2019-02-25  8:31     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-25  8:31 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 01:28:25PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:27AM +0800, Peter Xu wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > v1: From: Shaohua Li <shli@fb.com>
> > 
> > v2: cleanups, remove a branch.
> > 
> > [peterx writes up the commit message, as below...]
> > 
> > This patch introduces the new uffd-wp APIs for userspace.
> > 
> > Firstly, we'll allow to do UFFDIO_REGISTER with write protection
> > tracking using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this
> > flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in
> > which case the userspace program can not only resolve missing page
> > faults, and at the same time tracking page data changes along the way.
> > 
> > Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page
> > level write protection tracking.  Note that we will need to register
> > the memory region with UFFDIO_REGISTER_MODE_WP before that.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > [peterx: remove useless block, write commit message, check against
> >  VM_MAYWRITE rather than VM_WRITE when register]
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> I am not an expert with userfaultfd code but it looks good to me so:
> 
> Also see my question down below, just a minor one.
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> > ---
> >  fs/userfaultfd.c                 | 82 +++++++++++++++++++++++++-------
> >  include/uapi/linux/userfaultfd.h | 11 +++++
> >  2 files changed, 77 insertions(+), 16 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > index 297cb044c03f..1b977a7a4435 100644
> > --- a/include/uapi/linux/userfaultfd.h
> > +++ b/include/uapi/linux/userfaultfd.h
> > @@ -52,6 +52,7 @@
> >  #define _UFFDIO_WAKE			(0x02)
> >  #define _UFFDIO_COPY			(0x03)
> >  #define _UFFDIO_ZEROPAGE		(0x04)
> > +#define _UFFDIO_WRITEPROTECT		(0x06)
> >  #define _UFFDIO_API			(0x3F)
> 
> What happened to ioctl 0x05? :)

It's simply because it was 0x06 in Andrea's tree. :-)

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=ad0c3bec9897d8c8617ecaeb3110d3bdf884b15c

Andrea introduced _UFFDIO_REMAP first in his original work, which took
0x05 (hmm... not really the "very" original, but the one after
Shaohua's work), and then _UFFDIO_WRITEPROTECT, which took 0x06.  I'm
afraid there are already userspace programs built against that tree
and those numbers (I believe LLNL's umap is one of them; people may
not be using it very seriously yet, but such programs can still be
distributed and start doing real work...).  I'm using the same number
here on the theory that it's good not to break any of the experimental
programs when that's easy to achieve (for both existing uffd-wp users
and any users of the new remap interface), since after all these
numbers are really ad hoc for us.  If anyone doesn't like this I can
for sure switch back to 0x05 if that looks cuter.

Thanks,

-- 
Peter Xu
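[Editor's note: the reason the exact value (0x05 vs 0x06) is ABI-visible is that these constants are bit numbers — the ioctls supported for a registered range are reported back to userspace as a bitmask of 1ULL << _UFFDIO_*.  A hedged sketch of that consumer-side check; the constants are copied from the quoted uapi header, but the mask value in the test is a made-up example, not the kernel's actual reply.]

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers as in the quoted uapi header. */
#define TOY_UFFDIO_WAKE		0x02
#define TOY_UFFDIO_COPY		0x03
#define TOY_UFFDIO_ZEROPAGE	0x04
#define TOY_UFFDIO_WRITEPROTECT	0x06

/*
 * Sketch of how userspace consumes these numbers: after
 * UFFDIO_REGISTER the kernel reports the supported ioctls as a
 * bitmask, and applications test bits like
 * 1ULL << _UFFDIO_WRITEPROTECT.  Changing 0x06 to 0x05 would
 * therefore move the bit that existing experimental binaries check.
 */
static bool range_supports_wp(unsigned long long ioctls)
{
	return ioctls & (1ULL << TOY_UFFDIO_WRITEPROTECT);
}
```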

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 22/26] userfaultfd: wp: enabled write protection in userfaultfd API
  2019-02-21 18:29   ` Jerome Glisse
@ 2019-02-25  8:34     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-25  8:34 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert, Pavel Emelyanov, Rik van Riel

On Thu, Feb 21, 2019 at 01:29:26PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:28AM +0800, Peter Xu wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > Now it's safe to enable write protection in userfaultfd API
> > 
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Maybe fold this patch into the previous one?  In any case:

The authorship differs (the previous patch was from Andrea, and this
one is from Shaohua), so I'll try to keep the current split if
possible.

> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Thanks!

-- 
Peter Xu


* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-21 18:36   ` Jerome Glisse
@ 2019-02-25  8:58     ` Peter Xu
  2019-02-25 21:15       ` Mike Rapoport
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-25  8:58 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Thu, Feb 21, 2019 at 01:36:54PM -0500, Jerome Glisse wrote:
> On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> > It does not make sense to try to wake up any waiting thread when we're
> > write-protecting a memory region.  Only wake up when resolving a write
> > protected page fault.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> I am a bit confused here, see below.
> 
> > ---
> >  fs/userfaultfd.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 81962d62520c..f1f61a0278c2 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> >  	struct uffdio_writeprotect uffdio_wp;
> >  	struct uffdio_writeprotect __user *user_uffdio_wp;
> >  	struct userfaultfd_wake_range range;
> > +	bool mode_wp, mode_dontwake;
> >  
> >  	if (READ_ONCE(ctx->mmap_changing))
> >  		return -EAGAIN;
> > @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> >  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> >  			       UFFDIO_WRITEPROTECT_MODE_WP))
> >  		return -EINVAL;
> > -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> > -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))

[1]

> > +
> > +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> > +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> > +
> > +	if (mode_wp && mode_dontwake)

[2]

> >  		return -EINVAL;
> 
> I am confused by the logic here. DONTWAKE means do not wake any waiting
> thread, right ? So from the patch header it seems to me the logic should
> be:
>     if (mode_wp && !mode_dontwake)
>         return -EINVAL;

This should be the most common case when we want to write protect a
page (or a set of pages).  I'll explain in more detail below...

> 
> At the very least this part does seem to mean the opposite of what the
> commit message says.

Let me paste the matrix to be clear on these flags:

  |------+-------------------------+------------------------------|
  |      | dontwake=0              | dontwake=1                   |
  |------+-------------------------+------------------------------|
  | wp=0 | (a) resolve pf, do wake | (b) resolve pf only, no wake |
  | wp=1 | (c) wp page range       | (d) invalid                  |
  |------+-------------------------+------------------------------|

The check at [1] above was verifying against case (d) in the matrix.
It is indeed an invalid condition: when we want to write protect a
page we should never try to wake up any thread, so the dontwake
parameter is useless there (we always behave as if it were set).  And
[2] above simply rewrites [1] with the new variables.
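
To make the matrix concrete, here is a hedged userspace sketch of the
same checks (plain C, not kernel code; the flag names and values are
modeled on UFFDIO_WRITEPROTECT_MODE_* from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the mode checks in userfaultfd_writeprotect().
 * Flag values mirror the uapi definitions in the patch; everything else
 * is a sketch, not kernel code. */
#define MODE_WP        ((uint64_t)1 << 0)   /* UFFDIO_WRITEPROTECT_MODE_WP */
#define MODE_DONTWAKE  ((uint64_t)1 << 1)   /* ..._MODE_DONTWAKE */

/* Returns -1 for the invalid case (d) in the matrix; otherwise reports
 * via *do_wake whether waiters should be woken, which after this patch
 * happens only in case (a). */
static int wp_mode_check(uint64_t mode, int *do_wake)
{
    int wp = !!(mode & MODE_WP);
    int dontwake = !!(mode & MODE_DONTWAKE);

    if (wp && dontwake)           /* case (d): -EINVAL in the kernel */
        return -1;
    *do_wake = !wp && !dontwake;  /* case (a) wakes; (b) and (c) do not */
    return 0;
}
```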

> 
> >  
> >  	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> > -				  uffdio_wp.range.len, uffdio_wp.mode &
> > -				  UFFDIO_WRITEPROTECT_MODE_WP,
> > +				  uffdio_wp.range.len, mode_wp,
> >  				  &ctx->mmap_changing);
> >  	if (ret)
> >  		return ret;
> >  
> > -	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
> > +	if (!mode_wp && !mode_dontwake) {
> 
> This part match the commit message :)

Here is what the patch really wants to change: before this patch we
would call wake_userfault() below even for case (c), which doesn't
really make much sense IMHO.  After this patch we'll only do the
wakeup for case (a).

> 
> >  		range.start = uffdio_wp.range.start;
> >  		range.len = uffdio_wp.range.len;
> >  		wake_userfault(ctx, &range);

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp
  2019-02-25  7:13         ` Peter Xu
@ 2019-02-25 15:32           ` Jerome Glisse
  0 siblings, 0 replies; 113+ messages in thread
From: Jerome Glisse @ 2019-02-25 15:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Pavel Emelyanov, Johannes Weiner, Martin Cracauer,
	Shaohua Li, Marty McFadden, Andrea Arcangeli, Mike Kravetz,
	Denis Plotnikov, Mike Rapoport, Mel Gorman, Kirill A . Shutemov,
	Dr . David Alan Gilbert

On Mon, Feb 25, 2019 at 03:13:36PM +0800, Peter Xu wrote:
> On Fri, Feb 22, 2019 at 10:35:09AM -0500, Jerome Glisse wrote:
> > On Fri, Feb 22, 2019 at 04:46:03PM +0800, Peter Xu wrote:
> > > On Thu, Feb 21, 2019 at 01:04:24PM -0500, Jerome Glisse wrote:
> > > > On Tue, Feb 12, 2019 at 10:56:20AM +0800, Peter Xu wrote:
> > > > > This allows uffd-wp to support write-protected pages for COW.
> > 
> > [...]
> > 
> > > > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > > > index 9d4433044c21..ae93721f3795 100644
> > > > > --- a/mm/mprotect.c
> > > > > +++ b/mm/mprotect.c
> > > > > @@ -77,14 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > > > >  		if (pte_present(oldpte)) {
> > > > >  			pte_t ptent;
> > > > >  			bool preserve_write = prot_numa && pte_write(oldpte);
> > > > > +			struct page *page;
> > > > >  
> > > > >  			/*
> > > > >  			 * Avoid trapping faults against the zero or KSM
> > > > >  			 * pages. See similar comment in change_huge_pmd.
> > > > >  			 */
> > > > >  			if (prot_numa) {
> > > > > -				struct page *page;
> > > > > -
> > > > >  				page = vm_normal_page(vma, addr, oldpte);
> > > > >  				if (!page || PageKsm(page))
> > > > >  					continue;
> > > > > @@ -114,6 +113,46 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > > > >  					continue;
> > > > >  			}
> > > > >  
> > > > > +			/*
> > > > > +			 * Detect whether we'll need to COW before
> > > > > +			 * resolving an uffd-wp fault.  Note that this
> > > > > +			 * includes detection of the zero page (where
> > > > > +			 * page==NULL)
> > > > > +			 */
> > > > > +			if (uffd_wp_resolve) {
> > > > > +				/* If the fault is resolved already, skip */
> > > > > +				if (!pte_uffd_wp(*pte))
> > > > > +					continue;
> > > > > +				page = vm_normal_page(vma, addr, oldpte);
> > > > > +				if (!page || page_mapcount(page) > 1) {
> > > > 
> > > > This is wrong: if you allow page to be NULL then you are going to
> > > > segfault in wp_page_copy() down below. Are you sure you want to
> > > > test for special pages ? For anonymous memory this should never
> > > > happen, ie anon pages are always regular pages. So if you allow
> > > > userfaultfd to write protect only anonymous vmas then there is no
> > > > point in testing here besides maybe a BUG_ON() just in case ...
> > > 
> > > It's mainly for zero pages, where page can be NULL.  Would this be
> > > clearer:
> > > 
> > >   if (is_zero_pfn(pte_pfn(old_pte)) || (page && page_mapcount(page)))
> > > 
> > > ?
> > > 
> > > Now we treat zero pages as normal COW pages, so we'll do COW here even
> > > for zero pages.  I think maybe we could add special handling for zero
> > > pages all over the place (e.g., not write protecting a PTE if we
> > > detect that it is the zero PFN), but I'm uncertain whether that's
> > > what we want, so I chose to start with the current solution, at
> > > least to achieve the functionality first.
> > 
> > You can keep the vm_normal_page() in that case, but split the if
> > between page == NULL and page != NULL with mapcount > 1, as
> > otherwise you will segfault below.
> 
> Could I ask which segfault you mean?  My understanding is that the
> code below has taken page==NULL into consideration already, e.g.,
> we only do get_page() if page!=NULL, and wp_page_copy() has similar
> checks inside.

In my memory wp_page_copy() would have freaked out on a NULL page, but
I checked that code again and it is fine. So yes, you can take that
branch for a NULL page too. Sorry, I trusted my memory too much.


> > > > > +					struct vm_fault vmf = {
> > > > > +						.vma = vma,
> > > > > +						.address = addr & PAGE_MASK,
> > > > > +						.page = page,
> > > > > +						.orig_pte = oldpte,
> > > > > +						.pmd = pmd,
> > > > > +						/* pte and ptl not needed */
> > > > > +					};
> > > > > +					vm_fault_t ret;
> > > > > +
> > > > > +					if (page)
> > > > > +						get_page(page);
> > > > > +					arch_leave_lazy_mmu_mode();
> > > > > +					pte_unmap_unlock(pte, ptl);
> > > > > +					ret = wp_page_copy(&vmf);
> > > > > +					/* PTE is changed, or OOM */
> > > > > +					if (ret == 0)
> > > > > +						/* It's done by others */
> > > > > +						continue;
> > > > > +					else if (WARN_ON(ret != VM_FAULT_WRITE))
> > > > > +						return pages;
> > > > > +					pte = pte_offset_map_lock(vma->vm_mm,
> > > > > +								  pmd, addr,
> > > > > +								  &ptl);
> > > > 
> > > > Here you remap the pte locked, but you are not checking whether the
> > > > pte is the one you expect, ie whether it points to the copied page
> > > > and has the expected uffd_wp flag. Another thread might have raced
> > > > between the time you called wp_page_copy() and the time you called
> > > > pte_offset_map_lock(). I have not checked the mmap_sem, so maybe
> > > > you are protected by it since mprotect takes it in write mode IIRC;
> > > > if so you should add a comment at the very least so people do not
> > > > see this as a bug.
> > > 
> > > Thanks for spotting this.  With the normal uffd-wp page fault
> > > handling path we only hold the read lock (and I would suspect it's
> > > racy even with the write lock...).  I agree that there can be a race
> > > right after the COW is done.
> > > 
> > > Here IMHO we'll be fine as long as the PTE is still present; in other
> > > words, we should be able to tolerate PTE changes as long as the entry
> > > stays present, otherwise we'll need to retry this single PTE (e.g.,
> > > the page could quickly be marked as a migration swap entry, or the
> > > page could even be freed beneath us).  Does the change below look
> > > good to you to be squashed into this patch?
> > 
> > Ok, but the if below must be after arch_enter_lazy_mmu_mode(), not before.
> 
> Oops... you are right. :)
> 
> Thanks,
> 
> > 
> > > 
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 73a65f07fe41..3423f9692838 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -73,6 +73,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                              
> > >         flush_tlb_batched_pending(vma->vm_mm);
> > >         arch_enter_lazy_mmu_mode();
> > >         do {
> > > +retry_pte:
> > >                 oldpte = *pte;
> > >                 if (pte_present(oldpte)) {
> > >                         pte_t ptent;
> > > @@ -149,6 +150,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,                                                           
> > >                                         pte = pte_offset_map_lock(vma->vm_mm,
> > >                                                                   pmd, addr,
> > >                                                                   &ptl);
> > > +                                       if (!pte_present(*pte))
> > > +                                               /*
> > > +                                                * This PTE could have
> > > +                                                * been modified when COW;
> > > +                                                * retry it
> > > +                                                */
> > > +                                               goto retry_pte;
> > >                                         arch_enter_lazy_mmu_mode();
> > >                                 }
> > >                         }
> 
> -- 
> Peter Xu

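
The fix discussed in this thread follows a common drop-lock/revalidate
pattern: release the page-table lock for the expensive COW, re-take it,
and retry if the entry changed meanwhile.  A minimal userspace sketch of
that pattern (plain pthreads; all names are hypothetical, nothing here
is kernel code):

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative sketch of the pattern above: drop a lock to do expensive
 * work, re-take it, and revalidate state that may have changed while
 * the lock was not held. */
struct slot {
    pthread_mutex_t lock;
    int present;    /* stand-in for pte_present() */
};

static void clear_present(struct slot *s)   /* simulates a racing thread */
{
    pthread_mutex_lock(&s->lock);
    s->present = 0;
    pthread_mutex_unlock(&s->lock);
}

/* Returns how many retries were needed; racer (if any) fires once while
 * the lock is dropped, mimicking a concurrent PTE change. */
static int resolve_slot(struct slot *s, void (*racer)(struct slot *))
{
    int retries = 0;

    pthread_mutex_lock(&s->lock);
retry:
    if (s->present) {
        pthread_mutex_unlock(&s->lock);   /* drop the lock for the "copy" */
        if (racer) {
            racer(s);
            racer = NULL;                 /* fire only once in this sketch */
        }
        pthread_mutex_lock(&s->lock);
        if (!s->present) {                /* state changed: must retry */
            s->present = 1;               /* pretend it was faulted back in */
            retries++;
            goto retry;
        }
    }
    pthread_mutex_unlock(&s->lock);
    return retries;
}
```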

* Re: [PATCH v2 06/26] userfaultfd: wp: add helper for writeprotect check
  2019-02-12  2:56 ` [PATCH v2 06/26] userfaultfd: wp: add helper for writeprotect check Peter Xu
  2019-02-21 16:07   ` Jerome Glisse
@ 2019-02-25 15:41   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 15:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert, Pavel Emelyanov,
	Rik van Riel

On Tue, Feb 12, 2019 at 10:56:12AM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add a helper for the writeprotect check. It will be used later.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  include/linux/userfaultfd_k.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 37c9eba75c98..38f748e7186e 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -50,6 +50,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_UFFD_MISSING;
>  }
> 
> +static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> +{
> +	return vma->vm_flags & VM_UFFD_WP;
> +}
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> @@ -94,6 +99,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
>  	return false;
>  }
> 
> +static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return false;
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 07/26] userfaultfd: wp: hook userfault handler to write protection fault
  2019-02-12  2:56 ` [PATCH v2 07/26] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
  2019-02-21 16:25   ` Jerome Glisse
@ 2019-02-25 15:43   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 15:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:13AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> There are several cases where a write protection fault can happen. It
> could be a write to the zero page, to a swapped page, or to a userfault
> write protected page. When the fault happens, there is no way to know
> whether userfault write protected the page before. Here we just blindly
> issue a userfault notification for vmas with VM_UFFD_WP, regardless of
> whether the app has write protected the page yet. The application
> should be ready to handle such wp faults.
> 
> v1: From: Shaohua Li <shli@fb.com>
> 
> v2: Handle the userfault in the common do_wp_page. If we get there a
> pagetable is present and readonly so no need to do further processing
> until we solve the userfault.
> 
> In the swapin case, always swapin as readonly. This will cause false
> positive userfaults. We need to decide later if to eliminate them with
> a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).
> 
> hugetlbfs wouldn't need to worry about swapouts, and tmpfs would
> be handled by a swap entry bit like anonymous memory.
> 
> The main problem with no easy solution to eliminate the false
> positives, will be if/when userfaultfd is extended to real filesystem
> pagecache. When the pagecache is freed by reclaim we can't leave the
> radix tree pinned if the inode and in turn the radix tree is reclaimed
> as well.
> 
> The estimation is that full accuracy and lack of false positives could
> easily be provided only for anonymous memory (as long as there's no
> fork, or as long as MADV_DONTFORK is used on the userfaultfd anonymous
> range), tmpfs and hugetlbfs; it's most certainly worth achieving, but
> in a later incremental patch.
> 
> v3: Add hooking point for THP wrprotect faults.
> 
> CC: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  mm/memory.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e11ca9dd823f..00781c43407b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2483,6 +2483,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> 
> +	if (userfaultfd_wp(vma)) {
> +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> +		return handle_userfault(vmf, VM_UFFD_WP);
> +	}
> +
>  	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>  	if (!vmf->page) {
>  		/*
> @@ -2800,6 +2805,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
> +	if (userfaultfd_wp(vma))
> +		vmf->flags &= ~FAULT_FLAG_WRITE;
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
> @@ -3684,8 +3691,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  /* `inline' is required to avoid gcc 4.1.2 build error */
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
> -	if (vma_is_anonymous(vmf->vma))
> +	if (vma_is_anonymous(vmf->vma)) {
> +		if (userfaultfd_wp(vmf->vma))
> +			return handle_userfault(vmf, VM_UFFD_WP);
>  		return do_huge_pmd_wp_page(vmf, orig_pmd);
> +	}
>  	if (vmf->vma->vm_ops->huge_fault)
>  		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
> 
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


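
The hook reviewed above turns do_wp_page() into a dispatch: a write
fault on a uffd-wp armed vma is delivered to the userspace handler
instead of taking the usual COW path.  A hedged userspace model of that
dispatch (the flag value and all names here are illustrative, not the
kernel's):

```c
#include <assert.h>

/* Illustrative model of the dispatch added to do_wp_page(): wp-armed
 * vmas deliver the write fault to the userspace handler, everything
 * else takes the normal COW path. */
enum outcome { HANDLED_BY_UFFD, DID_COW };

#define VM_UFFD_WP 0x1000   /* value is illustrative only */

struct vma_model { unsigned long vm_flags; };

static enum outcome wp_fault(const struct vma_model *vma)
{
    if (vma->vm_flags & VM_UFFD_WP)
        return HANDLED_BY_UFFD;   /* ~ handle_userfault(vmf, VM_UFFD_WP) */
    return DID_COW;               /* ~ normal do_wp_page() processing */
}
```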

* Re: [PATCH v2 08/26] userfaultfd: wp: add WP pagetable tracking to x86
  2019-02-12  2:56 ` [PATCH v2 08/26] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
  2019-02-21 17:20   ` Jerome Glisse
@ 2019-02-25 15:48   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 15:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:14AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Accurate userfaultfd WP tracking is possible by tracking exactly which
> virtual memory ranges were writeprotected by userland. We can't rely
> only on the RW bit of the mapped pagetable because that information is
> destroyed by fork() or KSM or swap. If we were to rely on that, we'd
> need to stay on the safe side and generate false positive wp faults
> for every swapped out page.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  arch/x86/Kconfig                     |  1 +
>  arch/x86/include/asm/pgtable.h       | 52 ++++++++++++++++++++++++++++
>  arch/x86/include/asm/pgtable_64.h    |  8 ++++-
>  arch/x86/include/asm/pgtable_types.h |  9 +++++
>  include/asm-generic/pgtable.h        |  1 +
>  include/asm-generic/pgtable_uffd.h   | 51 +++++++++++++++++++++++++++
>  init/Kconfig                         |  5 +++
>  7 files changed, 126 insertions(+), 1 deletion(-)
>  create mode 100644 include/asm-generic/pgtable_uffd.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 68261430fe6e..cb43bc008675 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -209,6 +209,7 @@ config X86
>  	select USER_STACKTRACE_SUPPORT
>  	select VIRT_TO_BUS
>  	select X86_FEATURE_NAMES		if PROC_FS
> +	select HAVE_ARCH_USERFAULTFD_WP		if USERFAULTFD
> 
>  config INSTRUCTION_DECODER
>  	def_bool y
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2779ace16d23..6863236e8484 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -23,6 +23,7 @@
> 
>  #ifndef __ASSEMBLY__
>  #include <asm/x86_init.h>
> +#include <asm-generic/pgtable_uffd.h>
> 
>  extern pgd_t early_top_pgt[PTRS_PER_PGD];
>  int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
> @@ -293,6 +294,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  	return native_make_pte(v & ~clear);
>  }
> 
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline int pte_uffd_wp(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_UFFD_WP;
> +}
> +
> +static inline pte_t pte_mkuffd_wp(pte_t pte)
> +{
> +	return pte_set_flags(pte, _PAGE_UFFD_WP);
> +}
> +
> +static inline pte_t pte_clear_uffd_wp(pte_t pte)
> +{
> +	return pte_clear_flags(pte, _PAGE_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
>  	return pte_clear_flags(pte, _PAGE_DIRTY);
> @@ -372,6 +390,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
>  	return native_make_pmd(v & ~clear);
>  }
> 
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline int pmd_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_UFFD_WP;
> +}
> +
> +static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_UFFD_WP);
> +}
> +
> +static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  static inline pmd_t pmd_mkold(pmd_t pmd)
>  {
>  	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
> @@ -1351,6 +1386,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
>  #endif
>  #endif
> 
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
> +{
> +	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
> +}
> +
> +static inline int pte_swp_uffd_wp(pte_t pte)
> +{
> +	return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
> +}
> +
> +static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
> +{
> +	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
>  #define PKRU_AD_BIT 0x1
>  #define PKRU_WD_BIT 0x2
>  #define PKRU_BITS_PER_PKEY 2
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 9c85b54bf03c..e0c5d29b8685 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
>   *
>   * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
>   * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
> - * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
> + * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
>   *
>   * G (8) is aliased and used as a PROT_NONE indicator for
>   * !present ptes.  We need to start storing swap entries above
> @@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
>   * erratum where they can be incorrectly set by hardware on
>   * non-present PTEs.
>   *
> + * SD Bits 1-4 are not used in non-present format and available for
> + * special use described below:
> + *
>   * SD (1) in swp entry is used to store soft dirty bit, which helps us
>   * remember soft dirty over page migration
>   *
> + * F (2) in swp entry is used to record when a pagetable is
> + * writeprotected by userfaultfd WP support.
> + *
>   * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
>   * but also L and G.
>   *
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index d6ff0bbdb394..8cebcff91e57 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -32,6 +32,7 @@
> 
>  #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
>  #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
> +#define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
>  #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
>  #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
> 
> @@ -100,6 +101,14 @@
>  #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
>  #endif
> 
> +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +#define _PAGE_UFFD_WP		(_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
> +#define _PAGE_SWP_UFFD_WP	_PAGE_USER
> +#else
> +#define _PAGE_UFFD_WP		(_AT(pteval_t, 0))
> +#define _PAGE_SWP_UFFD_WP	(_AT(pteval_t, 0))
> +#endif
> +
>  #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
>  #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
>  #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 05e61e6c843f..f49afe951711 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -10,6 +10,7 @@
>  #include <linux/mm_types.h>
>  #include <linux/bug.h>
>  #include <linux/errno.h>
> +#include <asm-generic/pgtable_uffd.h>
> 
>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>  	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> new file mode 100644
> index 000000000000..643d1bf559c2
> --- /dev/null
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -0,0 +1,51 @@
> +#ifndef _ASM_GENERIC_PGTABLE_UFFD_H
> +#define _ASM_GENERIC_PGTABLE_UFFD_H
> +
> +#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +static __always_inline int pte_uffd_wp(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static __always_inline int pmd_uffd_wp(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
> +
> +static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
> +
> +static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +
> +static __always_inline int pte_swp_uffd_wp(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
> +{
> +	return pte;
> +}
> +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> +
> +#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index c9386a365eea..892d61ddf2eb 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1424,6 +1424,11 @@ config ADVISE_SYSCALLS
>  	  applications use these syscalls, you can disable this option to save
>  	  space.
> 
> +config HAVE_ARCH_USERFAULTFD_WP
> +	bool
> +	help
> +	  Arch has userfaultfd write protection support
> +
>  config MEMBARRIER
>  	bool "Enable membarrier() system call" if EXPERT
>  	default y
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


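
The helpers reviewed above are plain bit operations on the pte value.
A hedged userspace model (bit positions are illustrative, only loosely
based on the x86 layout) shows the property the commit message relies
on for present ptes: the uffd-wp software bit is tracked independently
of the hardware RW bit, so losing RW does not lose the write-protect
tracking:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the x86 helpers above.  Bit positions are only
 * examples (_PAGE_BIT_SOFTW2 is a software bit on x86; RW is bit 1). */
typedef struct { uint64_t val; } pte_t;

#define PAGE_RW       ((uint64_t)1 << 1)
#define PAGE_UFFD_WP  ((uint64_t)1 << 10)

static pte_t pte_mkuffd_wp(pte_t p)     { p.val |= PAGE_UFFD_WP;  return p; }
static pte_t pte_clear_uffd_wp(pte_t p) { p.val &= ~PAGE_UFFD_WP; return p; }
static int   pte_uffd_wp(pte_t p)       { return !!(p.val & PAGE_UFFD_WP); }
static pte_t pte_wrprotect(pte_t p)     { p.val &= ~PAGE_RW;      return p; }
```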

* Re: [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-02-12  2:56 ` [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
  2019-02-21 17:29   ` Jerome Glisse
@ 2019-02-25 15:58   ` Mike Rapoport
  2019-02-26  5:09     ` Peter Xu
  1 sibling, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 15:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:16AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> This allows UFFDIO_COPY to map pages wrprotected.
                                       write protected please :)
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Except for two additional nits below

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  fs/userfaultfd.c                 |  5 +++--
>  include/linux/userfaultfd_k.h    |  2 +-
>  include/uapi/linux/userfaultfd.h | 11 +++++-----
>  mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
>  4 files changed, 35 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index b397bc3b954d..3092885c9d2c 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1683,11 +1683,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	ret = -EINVAL;
>  	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>  		goto out;
> -	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
> +	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
>  		goto out;
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
> -				   uffdio_copy.len, &ctx->mmap_changing);
> +				   uffdio_copy.len, &ctx->mmap_changing,
> +				   uffdio_copy.mode);
>  		mmput(ctx->mm);
>  	} else {
>  		return -ESRCH;
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index c6590c58ce28..765ce884cec0 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -34,7 +34,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
> 
>  extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
>  			    unsigned long src_start, unsigned long len,
> -			    bool *mmap_changing);
> +			    bool *mmap_changing, __u64 mode);
>  extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
>  			      unsigned long dst_start,
>  			      unsigned long len,
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 48f1a7c2f1f0..297cb044c03f 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -203,13 +203,14 @@ struct uffdio_copy {
>  	__u64 dst;
>  	__u64 src;
>  	__u64 len;
> +#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
>  	/*
> -	 * There will be a wrprotection flag later that allows to map
> -	 * pages wrprotected on the fly. And such a flag will be
> -	 * available if the wrprotection ioctl are implemented for the
> -	 * range according to the uffdio_register.ioctls.
> +	 * UFFDIO_COPY_MODE_WP will map the page wrprotected on the
> +	 * fly. UFFDIO_COPY_MODE_WP is available only if the
> +	 * wrprotection ioctl are implemented for the range according

                             ^ is

> +	 * to the uffdio_register.ioctls.
>  	 */
> -#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
> +#define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
>  	__u64 mode;
> 
>  	/*
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index d59b5a73dfb3..73a208c5c1e7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			    struct vm_area_struct *dst_vma,
>  			    unsigned long dst_addr,
>  			    unsigned long src_addr,
> -			    struct page **pagep)
> +			    struct page **pagep,
> +			    bool wp_copy)
>  {
>  	struct mem_cgroup *memcg;
>  	pte_t _dst_pte, *dst_pte;
> @@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
>  		goto out_release;
> 
> -	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> -	if (dst_vma->vm_flags & VM_WRITE)
> -		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> +	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> +	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> +		_dst_pte = pte_mkwrite(_dst_pte);
> 
>  	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
>  	if (dst_vma->vm_file) {
> @@ -399,7 +400,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
>  						unsigned long dst_addr,
>  						unsigned long src_addr,
>  						struct page **page,
> -						bool zeropage)
> +						bool zeropage,
> +						bool wp_copy)
>  {
>  	ssize_t err;
> 
> @@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
>  	if (!(dst_vma->vm_flags & VM_SHARED)) {
>  		if (!zeropage)
>  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> -					       dst_addr, src_addr, page);
> +					       dst_addr, src_addr, page,
> +					       wp_copy);
>  		else
>  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
>  						 dst_vma, dst_addr);
>  	} else {
> +		VM_WARN_ON(wp_copy); /* WP only available for anon */
>  		if (!zeropage)
>  			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
>  						     dst_vma, dst_addr,
> @@ -438,7 +442,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  					      unsigned long src_start,
>  					      unsigned long len,
>  					      bool zeropage,
> -					      bool *mmap_changing)
> +					      bool *mmap_changing,
> +					      __u64 mode)
>  {
>  	struct vm_area_struct *dst_vma;
>  	ssize_t err;
> @@ -446,6 +451,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  	unsigned long src_addr, dst_addr;
>  	long copied;
>  	struct page *page;
> +	bool wp_copy;
> 
>  	/*
>  	 * Sanitize the command parameters:
> @@ -502,6 +508,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  	    dst_vma->vm_flags & VM_SHARED))
>  		goto out_unlock;
> 
> +	/*
> +	 * validate 'mode' now that we know the dst_vma: don't allow
> +	 * a wrprotect copy if the userfaultfd didn't register as WP.
> +	 */
> +	wp_copy = mode & UFFDIO_COPY_MODE_WP;
> +	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
> +		goto out_unlock;
> +
>  	/*
>  	 * If this is a HUGETLB vma, pass off to appropriate routine
>  	 */

I think for hugetlb we should return an error if wp_copy==true.
It might be worth adding a wp_copy parameter to __mcopy_atomic_hugetlb() in
advance and returning the error from there, in the hope that it will also
support UFFD_WP some day :)

> @@ -557,7 +571,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  		BUG_ON(pmd_trans_huge(*dst_pmd));
> 
>  		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> -				       src_addr, &page, zeropage);
> +				       src_addr, &page, zeropage, wp_copy);
>  		cond_resched();
> 
>  		if (unlikely(err == -ENOENT)) {
> @@ -604,14 +618,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> 
>  ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
>  		     unsigned long src_start, unsigned long len,
> -		     bool *mmap_changing)
> +		     bool *mmap_changing, __u64 mode)
>  {
>  	return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
> -			      mmap_changing);
> +			      mmap_changing, mode);
>  }
> 
>  ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>  		       unsigned long len, bool *mmap_changing)
>  {
> -	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
> +	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
>  }
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 09/26] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
  2019-02-12  2:56 ` [PATCH v2 09/26] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
  2019-02-21 17:21   ` Jerome Glisse
@ 2019-02-25 17:12   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 17:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:15AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Implement helper methods to invoke userfaultfd wp faults more
> selectively: not only when a wp fault triggers on a vma with
> vma->vm_flags VM_UFFD_WP set, but only if the _PAGE_UFFD_WP bit is set
> in the pagetable too.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  include/linux/userfaultfd_k.h | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 38f748e7186e..c6590c58ce28 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -14,6 +14,8 @@
>  #include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
> 
>  #include <linux/fcntl.h>
> +#include <linux/mm.h>
> +#include <asm-generic/pgtable_uffd.h>
> 
>  /*
>   * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
> @@ -55,6 +57,18 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_UFFD_WP;
>  }
> 
> +static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> +				      pte_t pte)
> +{
> +	return userfaultfd_wp(vma) && pte_uffd_wp(pte);
> +}
> +
> +static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
> +					   pmd_t pmd)
> +{
> +	return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
> +}
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> @@ -104,6 +118,19 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
>  	return false;
>  }
> 
> +static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> +				      pte_t pte)
> +{
> +	return false;
> +}
> +
> +static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
> +					   pmd_t pmd)
> +{
> +	return false;
> +}
> +
> +
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
>  	return false;
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit
  2019-02-12  2:56 ` [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
  2019-02-21 17:44   ` Jerome Glisse
@ 2019-02-25 18:00   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 18:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:18AM +0800, Peter Xu wrote:
> Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
> change_protection() when used with uffd-wp and make sure the two new
> flags are exclusively used.  Then,
> 
>   - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
>     when a range of memory is write protected by uffd
> 
>   - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
>     _PAGE_RW when write protection is resolved from userspace
> 
> And use this new interface in mwriteprotect_range() to replace the old
> MM_CP_DIRTY_ACCT.
> 
> Do this change for both PTEs and huge PMDs.  Then we can start to
> identify which PTE/PMD is write protected by general (e.g., COW or soft
> dirty tracking), and which is for userfaultfd-wp.
> 
> Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
> into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
> can be even more strict when detecting uffd-wp page faults in either
> do_wp_page() or wp_huge_pmd().
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  arch/x86/include/asm/pgtable_types.h |  2 +-
>  include/linux/mm.h                   |  5 +++++
>  mm/huge_memory.c                     | 14 +++++++++++++-
>  mm/memory.c                          |  4 ++--
>  mm/mprotect.c                        | 12 ++++++++++++
>  mm/userfaultfd.c                     |  8 ++++++--
>  6 files changed, 39 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 8cebcff91e57..dd9c6295d610 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -133,7 +133,7 @@
>   */
>  #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
>  			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
> -			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
> +			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_UFFD_WP)
>  #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
> 
>  /*
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9fe3b0066324..f38fbe9c8bc9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1657,6 +1657,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  #define  MM_CP_DIRTY_ACCT                  (1UL << 0)
>  /* Whether this protection change is for NUMA hints */
>  #define  MM_CP_PROT_NUMA                   (1UL << 1)
> +/* Whether this change is for write protecting */
> +#define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
> +#define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> +#define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
> +					    MM_CP_UFFD_WP_RESOLVE)
> 
>  extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
>  			      unsigned long end, pgprot_t newprot,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8d65b0f041f9..817335b443c2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1868,6 +1868,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	bool preserve_write;
>  	int ret;
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> +	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> 
>  	ptl = __pmd_trans_huge_lock(pmd, vma);
>  	if (!ptl)
> @@ -1934,6 +1936,13 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	entry = pmd_modify(entry, newprot);
>  	if (preserve_write)
>  		entry = pmd_mk_savedwrite(entry);
> +	if (uffd_wp) {
> +		entry = pmd_wrprotect(entry);
> +		entry = pmd_mkuffd_wp(entry);
> +	} else if (uffd_wp_resolve) {
> +		entry = pmd_mkwrite(entry);
> +		entry = pmd_clear_uffd_wp(entry);
> +	}
>  	ret = HPAGE_PMD_NR;
>  	set_pmd_at(mm, addr, pmd, entry);
>  	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
> @@ -2083,7 +2092,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	struct page *page;
>  	pgtable_t pgtable;
>  	pmd_t old_pmd, _pmd;
> -	bool young, write, soft_dirty, pmd_migration = false;
> +	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>  	unsigned long addr;
>  	int i;
> 
> @@ -2165,6 +2174,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		write = pmd_write(old_pmd);
>  		young = pmd_young(old_pmd);
>  		soft_dirty = pmd_soft_dirty(old_pmd);
> +		uffd_wp = pmd_uffd_wp(old_pmd);
>  	}
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	page_ref_add(page, HPAGE_PMD_NR - 1);
> @@ -2198,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  				entry = pte_mkold(entry);
>  			if (soft_dirty)
>  				entry = pte_mksoft_dirty(entry);
> +			if (uffd_wp)
> +				entry = pte_mkuffd_wp(entry);
>  		}
>  		pte = pte_offset_map(&_pmd, addr);
>  		BUG_ON(!pte_none(*pte));
> diff --git a/mm/memory.c b/mm/memory.c
> index 00781c43407b..f8d83ae16eff 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2483,7 +2483,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> 
> -	if (userfaultfd_wp(vma)) {
> +	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  		return handle_userfault(vmf, VM_UFFD_WP);
>  	}
> @@ -3692,7 +3692,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
>  	if (vma_is_anonymous(vmf->vma)) {
> -		if (userfaultfd_wp(vmf->vma))
> +		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
>  			return handle_userfault(vmf, VM_UFFD_WP);
>  		return do_huge_pmd_wp_page(vmf, orig_pmd);
>  	}
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index a6ba448c8565..9d4433044c21 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -46,6 +46,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	int target_node = NUMA_NO_NODE;
>  	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> +	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> 
>  	/*
>  	 * Can be called with only the mmap_sem for reading by
> @@ -117,6 +119,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			if (preserve_write)
>  				ptent = pte_mk_savedwrite(ptent);
> 
> +			if (uffd_wp) {
> +				ptent = pte_wrprotect(ptent);
> +				ptent = pte_mkuffd_wp(ptent);
> +			} else if (uffd_wp_resolve) {
> +				ptent = pte_mkwrite(ptent);
> +				ptent = pte_clear_uffd_wp(ptent);
> +			}
> +
>  			/* Avoid taking write faults for known dirty pages */
>  			if (dirty_accountable && pte_dirty(ptent) &&
>  					(pte_soft_dirty(ptent) ||
> @@ -301,6 +311,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
>  {
>  	unsigned long pages;
> 
> +	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
> +
>  	if (is_vm_hugetlb_page(vma))
>  		pages = hugetlb_change_protection(vma, start, end, newprot);
>  	else
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 73a208c5c1e7..80bcd642911d 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -73,8 +73,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		goto out_release;
> 
>  	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> -	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> -		_dst_pte = pte_mkwrite(_dst_pte);
> +	if (dst_vma->vm_flags & VM_WRITE) {
> +		if (wp_copy)
> +			_dst_pte = pte_mkuffd_wp(_dst_pte);
> +		else
> +			_dst_pte = pte_mkwrite(_dst_pte);
> +	}
> 
>  	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
>  	if (dst_vma->vm_file) {
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
  2019-02-12  2:56 ` [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
  2019-02-21 18:06   ` Jerome Glisse
@ 2019-02-25 18:19   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 18:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:21AM +0800, Peter Xu wrote:
> UFFD_EVENT_FORK support for uffd-wp should be already there, except
> that we should clean the uffd-wp bit if uffd fork event is not
> enabled.  Detect that to avoid _PAGE_UFFD_WP being set even if the VMA
> is not being tracked by VM_UFFD_WP.  Do this for both small PTEs and
> huge PMDs.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  mm/huge_memory.c | 8 ++++++++
>  mm/memory.c      | 8 ++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817335b443c2..fb2234cb595a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -938,6 +938,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	ret = -EAGAIN;
>  	pmd = *src_pmd;
> 
> +	/*
> +	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
> +	 * does not have the VM_UFFD_WP, which means that the uffd
> +	 * fork event is not enabled.
> +	 */
> +	if (!(vma->vm_flags & VM_UFFD_WP))
> +		pmd = pmd_clear_uffd_wp(pmd);
> +
>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>  	if (unlikely(is_swap_pmd(pmd))) {
>  		swp_entry_t entry = pmd_to_swp_entry(pmd);
> diff --git a/mm/memory.c b/mm/memory.c
> index b5d67bafae35..c2035539e9fd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -788,6 +788,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		pte = pte_mkclean(pte);
>  	pte = pte_mkold(pte);
> 
> +	/*
> +	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
> +	 * does not have the VM_UFFD_WP, which means that the uffd
> +	 * fork event is not enabled.
> +	 */
> +	if (!(vm_flags & VM_UFFD_WP))
> +		pte = pte_clear_uffd_wp(pte);
> +
>  	page = vm_normal_page(vma, addr, pte);
>  	if (page) {
>  		get_page(page);
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 16/26] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
  2019-02-12  2:56 ` [PATCH v2 16/26] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
  2019-02-21 18:07   ` Jerome Glisse
@ 2019-02-25 18:20   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 18:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:22AM +0800, Peter Xu wrote:
> Adding these missing helpers for uffd-wp operations with pmd
> swap/migration entries.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  arch/x86/include/asm/pgtable.h     | 15 +++++++++++++++
>  include/asm-generic/pgtable_uffd.h | 15 +++++++++++++++
>  2 files changed, 30 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 6863236e8484..18a815d6f4ea 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1401,6 +1401,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
>  {
>  	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
>  }
> +
> +static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
> +}
> +
> +static inline int pmd_swp_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
> +}
> +
> +static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
> +}
>  #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> 
>  #define PKRU_AD_BIT 0x1
> diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> index 643d1bf559c2..828966d4c281 100644
> --- a/include/asm-generic/pgtable_uffd.h
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -46,6 +46,21 @@ static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
>  {
>  	return pte;
>  }
> +
> +static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
> +
> +static inline int pmd_swp_uffd_wp(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
> +{
> +	return pmd;
> +}
>  #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
> 
>  #endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 17/26] userfaultfd: wp: support swap and page migration
  2019-02-12  2:56 ` [PATCH v2 17/26] userfaultfd: wp: support swap and page migration Peter Xu
  2019-02-21 18:16   ` Jerome Glisse
@ 2019-02-25 18:28   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 18:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:23AM +0800, Peter Xu wrote:
> For both swap and page migration, we use bit 2 of the entry to
> identify whether this entry is uffd write-protected.  It plays a similar
> role as the existing soft dirty bit in swap entries but only for keeping
> the uffd-wp tracking for a specific PTE/PMD.
> 
> Something special here is that when we want to recover the uffd-wp bit
> from a swap/migration entry to the PTE bit we'll also need to take care
> of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> _PAGE_UFFD_WP bit we can't trap it at all.
> 
> Note that this patch removed two lines from "userfaultfd: wp: hook
> userfault handler to write protection fault" where we try to remove the
> VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> patch will still keep the write flag there.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  include/linux/swapops.h | 2 ++
>  mm/huge_memory.c        | 3 +++
>  mm/memory.c             | 8 ++++++--
>  mm/migrate.c            | 7 +++++++
>  mm/mprotect.c           | 2 ++
>  mm/rmap.c               | 6 ++++++
>  6 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 4d961668e5fc..0c2923b1cdb7 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -68,6 +68,8 @@ static inline swp_entry_t pte_to_swp_entry(pte_t pte)
> 
>  	if (pte_swp_soft_dirty(pte))
>  		pte = pte_swp_clear_soft_dirty(pte);
> +	if (pte_swp_uffd_wp(pte))
> +		pte = pte_swp_clear_uffd_wp(pte);
>  	arch_entry = __pte_to_swp_entry(pte);
>  	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fb2234cb595a..75de07141801 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2175,6 +2175,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		write = is_write_migration_entry(entry);
>  		young = false;
>  		soft_dirty = pmd_swp_soft_dirty(old_pmd);
> +		uffd_wp = pmd_swp_uffd_wp(old_pmd);
>  	} else {
>  		page = pmd_page(old_pmd);
>  		if (pmd_dirty(old_pmd))
> @@ -2207,6 +2208,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			entry = swp_entry_to_pte(swp_entry);
>  			if (soft_dirty)
>  				entry = pte_swp_mksoft_dirty(entry);
> +			if (uffd_wp)
> +				entry = pte_swp_mkuffd_wp(entry);
>  		} else {
>  			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
>  			entry = maybe_mkwrite(entry, vma);
> diff --git a/mm/memory.c b/mm/memory.c
> index c2035539e9fd..7cee990d67cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  				pte = swp_entry_to_pte(entry);
>  				if (pte_swp_soft_dirty(*src_pte))
>  					pte = pte_swp_mksoft_dirty(pte);
> +				if (pte_swp_uffd_wp(*src_pte))
> +					pte = pte_swp_mkuffd_wp(pte);
>  				set_pte_at(src_mm, addr, src_pte, pte);
>  			}
>  		} else if (is_device_private_entry(entry)) {
> @@ -2815,8 +2817,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
> -	if (userfaultfd_wp(vma))
> -		vmf->flags &= ~FAULT_FLAG_WRITE;
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
> @@ -2826,6 +2826,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	flush_icache_page(vma, page);
>  	if (pte_swp_soft_dirty(vmf->orig_pte))
>  		pte = pte_mksoft_dirty(pte);
> +	if (pte_swp_uffd_wp(vmf->orig_pte)) {
> +		pte = pte_mkuffd_wp(pte);
> +		pte = pte_wrprotect(pte);
> +	}
>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>  	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>  	vmf->orig_pte = pte;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d4fd680be3b0..605ccd1f5c64 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -242,6 +242,11 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		if (is_write_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
> 
> +		if (pte_swp_uffd_wp(*pvmw.pte)) {
> +			pte = pte_mkuffd_wp(pte);
> +			pte = pte_wrprotect(pte);
> +		}
> +
>  		if (unlikely(is_zone_device_page(new))) {
>  			if (is_device_private_page(new)) {
>  				entry = make_device_private_entry(new, pte_write(pte));
> @@ -2290,6 +2295,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_soft_dirty(pte))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			if (pte_uffd_wp(pte))
> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
>  			set_pte_at(mm, addr, ptep, swp_pte);
> 
>  			/*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ae93721f3795..73a65f07fe41 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -187,6 +187,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				newpte = swp_entry_to_pte(entry);
>  				if (pte_swp_soft_dirty(oldpte))
>  					newpte = pte_swp_mksoft_dirty(newpte);
> +				if (pte_swp_uffd_wp(oldpte))
> +					newpte = pte_swp_mkuffd_wp(newpte);
>  				set_pte_at(mm, addr, pte, newpte);
> 
>  				pages++;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0454ecc29537..3750d5a5283c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1469,6 +1469,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			if (pte_uffd_wp(pteval))
> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
>  			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
>  			/*
>  			 * No need to invalidate here it will synchronize on
> @@ -1561,6 +1563,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			if (pte_uffd_wp(pteval))
> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
>  			/*
>  			 * No need to invalidate here it will synchronize on
> @@ -1627,6 +1631,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			if (pte_uffd_wp(pteval))
> +				swp_pte = pte_swp_mkuffd_wp(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
>  			/* Invalidate as we cleared the pte */
>  			mmu_notifier_invalidate_range(mm, address,
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 18/26] khugepaged: skip collapse if uffd-wp detected
  2019-02-12  2:56 ` [PATCH v2 18/26] khugepaged: skip collapse if uffd-wp detected Peter Xu
  2019-02-21 18:17   ` Jerome Glisse
@ 2019-02-25 18:50   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 18:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:24AM +0800, Peter Xu wrote:
> Don't collapse the huge PMD if there are any userfault write protected
> small PTEs.  The problem is that the write protection is in small page
> granularity and there's no way to keep all these write protection
> information if the small pages are going to be merged into a huge PMD.
> 
> The same thing needs to be considered for swap entries and migration
> entries.  So do the check as well disregarding khugepaged_max_ptes_swap.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  include/trace/events/huge_memory.h |  1 +
>  mm/khugepaged.c                    | 23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index dd4db334bd63..2d7bad9cb976 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -13,6 +13,7 @@
>  	EM( SCAN_PMD_NULL,		"pmd_null")			\
>  	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
>  	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
> +	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
>  	EM( SCAN_PAGE_RO,		"no_writable_page")		\
>  	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
>  	EM( SCAN_PAGE_NULL,		"page_null")			\
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4f017339ddb2..396c7e4da83e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -29,6 +29,7 @@ enum scan_result {
>  	SCAN_PMD_NULL,
>  	SCAN_EXCEED_NONE_PTE,
>  	SCAN_PTE_NON_PRESENT,
> +	SCAN_PTE_UFFD_WP,
>  	SCAN_PAGE_RO,
>  	SCAN_LACK_REFERENCED_PAGE,
>  	SCAN_PAGE_NULL,
> @@ -1123,6 +1124,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		pte_t pteval = *_pte;
>  		if (is_swap_pte(pteval)) {
>  			if (++unmapped <= khugepaged_max_ptes_swap) {
> +				/*
> +				 * Always be strict with uffd-wp
> +				 * enabled swap entries.  Please see
> +				 * comment below for pte_uffd_wp().
> +				 */
> +				if (pte_swp_uffd_wp(pteval)) {
> +					result = SCAN_PTE_UFFD_WP;
> +					goto out_unmap;
> +				}
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_SWAP_PTE;
> @@ -1142,6 +1152,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			result = SCAN_PTE_NON_PRESENT;
>  			goto out_unmap;
>  		}
> +		if (pte_uffd_wp(pteval)) {
> +			/*
> +			 * Don't collapse the page if any of the small
> +			 * PTEs are armed with uffd write protection.
> +			 * Here we can also mark the new huge pmd as
> +			 * write protected if any of the small ones is
> +			 * marked but that could bring unknown
> +			 * userfault messages that fall outside of
> +			 * the registered range.  So, just be simple.
> +			 */
> +			result = SCAN_PTE_UFFD_WP;
> +			goto out_unmap;
> +		}
>  		if (pte_write(pteval))
>  			writable = true;
> 
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 19/26] userfaultfd: introduce helper vma_find_uffd
  2019-02-12  2:56 ` [PATCH v2 19/26] userfaultfd: introduce helper vma_find_uffd Peter Xu
  2019-02-21 18:19   ` Jerome Glisse
@ 2019-02-25 20:48   ` Mike Rapoport
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 20:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:25AM +0800, Peter Xu wrote:
> We have multiple (and more coming) places that would like to find a
> userfault enabled VMA from a mm struct that covers a specific memory
> range.  This patch introduces the helper for it, and meanwhile applies
> it to the code.
> 
> Suggested-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  mm/userfaultfd.c | 54 +++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 80bcd642911d..fefa81c301b7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -20,6 +20,34 @@
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> 
> +/*
> + * Find a valid userfault enabled VMA region that covers the whole
> + * address range, or NULL on failure.  Must be called with mmap_sem
> + * held.
> + */
> +static struct vm_area_struct *vma_find_uffd(struct mm_struct *mm,
> +					    unsigned long start,
> +					    unsigned long len)
> +{
> +	struct vm_area_struct *vma = find_vma(mm, start);
> +
> +	if (!vma)
> +		return NULL;
> +
> +	/*
> +	 * Check the vma is registered in uffd, this is required to
> +	 * enforce the VM_MAYWRITE check done at uffd registration
> +	 * time.
> +	 */
> +	if (!vma->vm_userfaultfd_ctx.ctx)
> +		return NULL;
> +
> +	if (start < vma->vm_start || start + len > vma->vm_end)
> +		return NULL;
> +
> +	return vma;
> +}
> +
>  static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			    pmd_t *dst_pmd,
>  			    struct vm_area_struct *dst_vma,
> @@ -228,20 +256,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	 */
>  	if (!dst_vma) {
>  		err = -ENOENT;
> -		dst_vma = find_vma(dst_mm, dst_start);
> +		dst_vma = vma_find_uffd(dst_mm, dst_start, len);
>  		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
>  			goto out_unlock;
> -		/*
> -		 * Check the vma is registered in uffd, this is
> -		 * required to enforce the VM_MAYWRITE check done at
> -		 * uffd registration time.
> -		 */
> -		if (!dst_vma->vm_userfaultfd_ctx.ctx)
> -			goto out_unlock;
> -
> -		if (dst_start < dst_vma->vm_start ||
> -		    dst_start + len > dst_vma->vm_end)
> -			goto out_unlock;
> 
>  		err = -EINVAL;
>  		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
> @@ -488,20 +505,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  	 * both valid and fully within a single existing vma.
>  	 */
>  	err = -ENOENT;
> -	dst_vma = find_vma(dst_mm, dst_start);
> +	dst_vma = vma_find_uffd(dst_mm, dst_start, len);
>  	if (!dst_vma)
>  		goto out_unlock;
> -	/*
> -	 * Check the vma is registered in uffd, this is required to
> -	 * enforce the VM_MAYWRITE check done at uffd registration
> -	 * time.
> -	 */
> -	if (!dst_vma->vm_userfaultfd_ctx.ctx)
> -		goto out_unlock;
> -
> -	if (dst_start < dst_vma->vm_start ||
> -	    dst_start + len > dst_vma->vm_end)
> -		goto out_unlock;
> 
>  	err = -EINVAL;
>  	/*
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-12  2:56 ` [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range Peter Xu
  2019-02-21 18:23   ` Jerome Glisse
@ 2019-02-25 20:52   ` Mike Rapoport
  2019-02-26  6:06     ` Peter Xu
  1 sibling, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 20:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert, Rik van Riel

On Tue, Feb 12, 2019 at 10:56:26AM +0800, Peter Xu wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Add an API to enable/disable write protection for a vma range.  Unlike
> mprotect, this doesn't split/merge vmas.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx:
>  - use the helper to find VMA;
>  - return -ENOENT if not found to match mcopy case;
>  - use the new MM_CP_UFFD_WP* flags for change_protection
>  - check against mmap_changing for failures]
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/userfaultfd_k.h |  3 ++
>  mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
>  2 files changed, 57 insertions(+)
> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 765ce884cec0..8f6e6ed544fb 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
>  			      unsigned long dst_start,
>  			      unsigned long len,
>  			      bool *mmap_changing);
> +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> +			       unsigned long start, unsigned long len,
> +			       bool enable_wp, bool *mmap_changing);
> 
>  /* mm helpers */
>  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index fefa81c301b7..529d180bb4d7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>  {
>  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
>  }
> +
> +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> +			unsigned long len, bool enable_wp, bool *mmap_changing)
> +{
> +	struct vm_area_struct *dst_vma;
> +	pgprot_t newprot;
> +	int err;
> +
> +	/*
> +	 * Sanitize the command parameters:
> +	 */
> +	BUG_ON(start & ~PAGE_MASK);
> +	BUG_ON(len & ~PAGE_MASK);
> +
> +	/* Does the address range wrap, or is the span zero-sized? */
> +	BUG_ON(start + len <= start);

I'd replace these BUG_ON()s with

	if (WARN_ON())
		 return -EINVAL;

> +
> +	down_read(&dst_mm->mmap_sem);
> +
> +	/*
> +	 * If memory mappings are changing because of non-cooperative
> +	 * operation (e.g. mremap) running in parallel, bail out and
> +	 * request the user to retry later
> +	 */
> +	err = -EAGAIN;
> +	if (mmap_changing && READ_ONCE(*mmap_changing))
> +		goto out_unlock;
> +
> +	err = -ENOENT;
> +	dst_vma = vma_find_uffd(dst_mm, start, len);
> +	/*
> +	 * Make sure the vma is not shared, that the dst range is
> +	 * both valid and fully within a single existing vma.
> +	 */
> +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +		goto out_unlock;
> +	if (!userfaultfd_wp(dst_vma))
> +		goto out_unlock;
> +	if (!vma_is_anonymous(dst_vma))
> +		goto out_unlock;
> +
> +	if (enable_wp)
> +		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
> +	else
> +		newprot = vm_get_page_prot(dst_vma->vm_flags);
> +
> +	change_protection(dst_vma, start, start + len, newprot,
> +			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
> +
> +	err = 0;
> +out_unlock:
> +	up_read(&dst_mm->mmap_sem);
> +	return err;
> +}
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-02-12  2:56 ` [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
  2019-02-21 18:28   ` Jerome Glisse
@ 2019-02-25 21:03   ` Mike Rapoport
  2019-02-26  6:30     ` Peter Xu
  1 sibling, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 21:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:27AM +0800, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> v1: From: Shaohua Li <shli@fb.com>
> 
> v2: cleanups, remove a branch.
> 
> [peterx writes up the commit message, as below...]
> 
> This patch introduces the new uffd-wp APIs for userspace.
> 
> Firstly, we'll allow to do UFFDIO_REGISTER with write protection
> tracking using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this
> flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in
> which case the userspace program can not only resolve missing page
> faults but also track page data changes along the way.
> 
> Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page
> level write protection tracking.  Note that we will need to register
> the memory region with UFFDIO_REGISTER_MODE_WP before that.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx: remove useless block, write commit message, check against
>  VM_MAYWRITE rather than VM_WRITE when register]
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  fs/userfaultfd.c                 | 82 +++++++++++++++++++++++++-------
>  include/uapi/linux/userfaultfd.h | 11 +++++
>  2 files changed, 77 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 3092885c9d2c..81962d62520c 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -304,8 +304,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>  	if (!pmd_present(_pmd))
>  		goto out;
> 
> -	if (pmd_trans_huge(_pmd))
> +	if (pmd_trans_huge(_pmd)) {
> +		if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
> +			ret = true;
>  		goto out;
> +	}
> 
>  	/*
>  	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
> @@ -318,6 +321,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>  	 */
>  	if (pte_none(*pte))
>  		ret = true;
> +	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
> +		ret = true;
>  	pte_unmap(pte);
> 
>  out:
> @@ -1251,10 +1256,13 @@ static __always_inline int validate_range(struct mm_struct *mm,
>  	return 0;
>  }
> 
> -static inline bool vma_can_userfault(struct vm_area_struct *vma)
> +static inline bool vma_can_userfault(struct vm_area_struct *vma,
> +				     unsigned long vm_flags)
>  {
> -	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> -		vma_is_shmem(vma);
> +	/* FIXME: add WP support to hugetlbfs and shmem */
> +	return vma_is_anonymous(vma) ||
> +		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> +		 !(vm_flags & VM_UFFD_WP));
>  }
> 
>  static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> @@ -1286,15 +1294,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	vm_flags = 0;
>  	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
>  		vm_flags |= VM_UFFD_MISSING;
> -	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
> +	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
>  		vm_flags |= VM_UFFD_WP;
> -		/*
> -		 * FIXME: remove the below error constraint by
> -		 * implementing the wprotect tracking mode.
> -		 */
> -		ret = -EINVAL;
> -		goto out;
> -	}
> 
>  	ret = validate_range(mm, uffdio_register.range.start,
>  			     uffdio_register.range.len);
> @@ -1342,7 +1343,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> 
>  		/* check not compatible vmas */
>  		ret = -EINVAL;
> -		if (!vma_can_userfault(cur))
> +		if (!vma_can_userfault(cur, vm_flags))
>  			goto out_unlock;
> 
>  		/*
> @@ -1370,6 +1371,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  			if (end & (vma_hpagesize - 1))
>  				goto out_unlock;
>  		}
> +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_MAYWRITE))
> +			goto out_unlock;
> 
>  		/*
>  		 * Check that this vma isn't already owned by a
> @@ -1399,7 +1402,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	do {
>  		cond_resched();
> 
> -		BUG_ON(!vma_can_userfault(vma));
> +		BUG_ON(!vma_can_userfault(vma, vm_flags));
>  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
>  		       vma->vm_userfaultfd_ctx.ctx != ctx);
>  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> @@ -1534,7 +1537,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  		 * provides for more strict behavior to notice
>  		 * unregistration errors.
>  		 */
> -		if (!vma_can_userfault(cur))
> +		if (!vma_can_userfault(cur, cur->vm_flags))
>  			goto out_unlock;
> 
>  		found = true;
> @@ -1548,7 +1551,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  	do {
>  		cond_resched();
> 
> -		BUG_ON(!vma_can_userfault(vma));
> +		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
> 
>  		/*
>  		 * Nothing to do: this vma is already registered into this
> @@ -1761,6 +1764,50 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>  	return ret;
>  }
> 
> +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> +				    unsigned long arg)
> +{
> +	int ret;
> +	struct uffdio_writeprotect uffdio_wp;
> +	struct uffdio_writeprotect __user *user_uffdio_wp;
> +	struct userfaultfd_wake_range range;
> +
> +	if (READ_ONCE(ctx->mmap_changing))
> +		return -EAGAIN;
> +
> +	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
> +
> +	if (copy_from_user(&uffdio_wp, user_uffdio_wp,
> +			   sizeof(struct uffdio_writeprotect)))
> +		return -EFAULT;
> +
> +	ret = validate_range(ctx->mm, uffdio_wp.range.start,
> +			     uffdio_wp.range.len);
> +	if (ret)
> +		return ret;
> +
> +	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> +			       UFFDIO_WRITEPROTECT_MODE_WP))
> +		return -EINVAL;
> +	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> +	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> +		return -EINVAL;

Why can't _DONTWAKE be used when setting write-protection?
I can imagine a use case where you'd want to freeze an application,
write-protect several regions and then let the application continue.

> +
> +	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> +				  uffdio_wp.range.len, uffdio_wp.mode &
> +				  UFFDIO_WRITEPROTECT_MODE_WP,
> +				  &ctx->mmap_changing);
> +	if (ret)
> +		return ret;
> +
> +	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
> +		range.start = uffdio_wp.range.start;
> +		range.len = uffdio_wp.range.len;
> +		wake_userfault(ctx, &range);
> +	}
> +	return ret;
> +}
> +
>  static inline unsigned int uffd_ctx_features(__u64 user_features)
>  {
>  	/*
> @@ -1838,6 +1885,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
>  	case UFFDIO_ZEROPAGE:
>  		ret = userfaultfd_zeropage(ctx, arg);
>  		break;
> +	case UFFDIO_WRITEPROTECT:
> +		ret = userfaultfd_writeprotect(ctx, arg);
> +		break;
>  	}
>  	return ret;
>  }
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 297cb044c03f..1b977a7a4435 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -52,6 +52,7 @@
>  #define _UFFDIO_WAKE			(0x02)
>  #define _UFFDIO_COPY			(0x03)
>  #define _UFFDIO_ZEROPAGE		(0x04)
> +#define _UFFDIO_WRITEPROTECT		(0x06)
>  #define _UFFDIO_API			(0x3F)
> 
>  /* userfaultfd ioctl ids */
> @@ -68,6 +69,8 @@
>  				      struct uffdio_copy)
>  #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
>  				      struct uffdio_zeropage)
> +#define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
> +				      struct uffdio_writeprotect)
> 
>  /* read() structure */
>  struct uffd_msg {
> @@ -232,4 +235,12 @@ struct uffdio_zeropage {
>  	__s64 zeropage;
>  };
> 
> +struct uffdio_writeprotect {
> +	struct uffdio_range range;
> +	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
> +#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> +#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> +	__u64 mode;
> +};
> +
>  #endif /* _LINUX_USERFAULTFD_H */
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-12  2:56 ` [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect Peter Xu
  2019-02-21 18:36   ` Jerome Glisse
@ 2019-02-25 21:09   ` Mike Rapoport
  2019-02-26  6:24     ` Peter Xu
  2019-02-26  8:00   ` Mike Rapoport
  2 siblings, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 21:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> It does not make sense to try to wake up any waiting thread when we're
> write-protecting a memory region.  Only wake up when resolving a write
> protected page fault.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  fs/userfaultfd.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 81962d62520c..f1f61a0278c2 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	struct uffdio_writeprotect uffdio_wp;
>  	struct uffdio_writeprotect __user *user_uffdio_wp;
>  	struct userfaultfd_wake_range range;
> +	bool mode_wp, mode_dontwake;
> 
>  	if (READ_ONCE(ctx->mmap_changing))
>  		return -EAGAIN;
> @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>  			       UFFDIO_WRITEPROTECT_MODE_WP))
>  		return -EINVAL;
> -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> +
> +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> +
> +	if (mode_wp && mode_dontwake)
>  		return -EINVAL;

This actually means the opposite of the commit message text ;-)

Is any dependency of _WP and _DONTWAKE needed at all?
 
>  	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> -				  uffdio_wp.range.len, uffdio_wp.mode &
> -				  UFFDIO_WRITEPROTECT_MODE_WP,
> +				  uffdio_wp.range.len, mode_wp,
>  				  &ctx->mmap_changing);
>  	if (ret)
>  		return ret;
> 
> -	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
> +	if (!mode_wp && !mode_dontwake) {
>  		range.start = uffdio_wp.range.start;
>  		range.len = uffdio_wp.range.len;
>  		wake_userfault(ctx, &range);
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-25  8:58     ` Peter Xu
@ 2019-02-25 21:15       ` Mike Rapoport
  0 siblings, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 21:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jerome Glisse, linux-mm, linux-kernel, David Hildenbrand,
	Hugh Dickins, Maya Gokhale, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Feb 25, 2019 at 04:58:46PM +0800, Peter Xu wrote:
> On Thu, Feb 21, 2019 at 01:36:54PM -0500, Jerome Glisse wrote:
> > On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> > > It does not make sense to try to wake up any waiting thread when we're
> > > write-protecting a memory region.  Only wake up when resolving a write
> > > protected page fault.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > I am bit confuse here, see below.
> > 
> > > ---
> > >  fs/userfaultfd.c | 13 ++++++++-----
> > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index 81962d62520c..f1f61a0278c2 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > >  	struct uffdio_writeprotect uffdio_wp;
> > >  	struct uffdio_writeprotect __user *user_uffdio_wp;
> > >  	struct userfaultfd_wake_range range;
> > > +	bool mode_wp, mode_dontwake;
> > >  
> > >  	if (READ_ONCE(ctx->mmap_changing))
> > >  		return -EAGAIN;
> > > @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > >  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> > >  			       UFFDIO_WRITEPROTECT_MODE_WP))
> > >  		return -EINVAL;
> > > -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> > > -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> 
> [1]
> 
> > > +
> > > +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> > > +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> > > +
> > > +	if (mode_wp && mode_dontwake)
> 
> [2]
> 
> > >  		return -EINVAL;
> > 
> > I am confuse by the logic here. DONTWAKE means do not wake any waiting
> > thread right ? So if the patch header it seems to me the logic should
> > be:
> >     if (mode_wp && !mode_dontwake)
> >         return -EINVAL;
> 
> This should be the most common case when we want to write protect a
> page (or a set of pages).  I'll explain more details below...
> 
> > 
> > At very least this part does seems to mean the opposite of what the
> > commit message says.
> 
> Let me paste the matrix to be clear on these flags:
> 
>   |------+-------------------------+------------------------------|
>   |      | dontwake=0              | dontwake=1                   |
>   |------+-------------------------+------------------------------|
>   | wp=0 | (a) resolve pf, do wake | (b) resolve pf only, no wake |
>   | wp=1 | (c) wp page range       | (d) invalid                  |
>   |------+-------------------------+------------------------------|
> 
> The check at [1] was checking against case (d) in the matrix.  It is
> indeed an invalid condition because when we want to write protect a
> page we should not try to wake up any thread, so the dontwake
> parameter is actually useless (we'll never wake up in that case
> anyway).  And above [2] simply rewrites [1] with the new variables.

I think (c) is "wp range and wake the thread", and (d) is "wp and DONT
wake".

 
> > 
> > >  
> > >  	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> > > -				  uffdio_wp.range.len, uffdio_wp.mode &
> > > -				  UFFDIO_WRITEPROTECT_MODE_WP,
> > > +				  uffdio_wp.range.len, mode_wp,
> > >  				  &ctx->mmap_changing);
> > >  	if (ret)
> > >  		return ret;
> > >  
> > > -	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
> > > +	if (!mode_wp && !mode_dontwake) {
> > 
> > This part match the commit message :)
> 
> Here is what the patch really wants to change: before this patch we'd
> call wake_userfault() below even for case (c), which doesn't really
> make much sense IMHO.  After this patch we'll only do the wakeup
> for (a,b).

Waking up the thread after the last region is write-protected would make
sense. Not much savings for lots of ranges, though.
 
> > 
> > >  		range.start = uffdio_wp.range.start;
> > >  		range.len = uffdio_wp.range.len;
> > >  		wake_userfault(ctx, &range);
> 
> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-02-12  2:56 ` [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
  2019-02-21 18:38   ` Jerome Glisse
@ 2019-02-25 21:19   ` Mike Rapoport
  2019-02-26  6:53     ` Peter Xu
  1 sibling, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-25 21:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:30AM +0800, Peter Xu wrote:
> From: Martin Cracauer <cracauer@cons.org>
> 
> Adds documentation about the write protection support.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx: rewrite in rst format; fixups here and there]
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

Peter, can you please also update the man pages (1, 2)?

[1] http://man7.org/linux/man-pages/man2/userfaultfd.2.html
[2] http://man7.org/linux/man-pages/man2/ioctl_userfaultfd.2.html

> ---
>  Documentation/admin-guide/mm/userfaultfd.rst | 51 ++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> index 5048cf661a8a..c30176e67900 100644
> --- a/Documentation/admin-guide/mm/userfaultfd.rst
> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> @@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
>  half copied page since it'll keep userfaulting until the copy has
>  finished.
> 
> +Notes:
> +
> +- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
> +  you must provide some kind of page in your thread after reading from
> +  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
> +  The normal behavior of the OS automatically providing a zero page on
> +  an anonymous mmapping is not in place.
> +
> +- None of the page-delivering ioctls default to the range that you
> +  registered with.  You must fill in all fields for the appropriate
> +  ioctl struct including the range.
> +
> +- You get the address of the access that triggered the missing page
> +  event out of a struct uffd_msg that you read in the thread from the
> +  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
> +  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
> +  the first of any of those IOCTLs wakes up the faulting thread.
> +
> +- Be sure to test for all errors including (pollfd[0].revents &
> +  POLLERR).  This can happen, e.g. when ranges supplied were
> +  incorrect.
> +
> +Write Protect Notifications
> +---------------------------
> +
> +This is equivalent to (but faster than) using mprotect and a SIGSEGV
> +signal handler.
> +
> +Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
> +Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
> +struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
> +in the struct passed in.  The range does not default to and does not
> +have to be identical to the range you registered with.  You can write
> +protect as many ranges as you like (inside the registered range).
> +Then, in the thread reading from uffd the struct will have
> +msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
> +ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again
> +while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set.
> +This wakes up the thread which will continue to run with writes. This
> +allows you to do the bookkeeping about the write in the uffd reading
> +thread before the ioctl.
> +
> +If you registered with both UFFDIO_REGISTER_MODE_MISSING and
> +UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
> +which you supply a page and undo write protect.  Note that there is a
> +difference between writes into a WP area and into a !WP area.  The
> +former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
> +UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
> +you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
> +used.
> +
>  QEMU/KVM
>  ========
> 
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-02-25 15:58   ` Mike Rapoport
@ 2019-02-26  5:09     ` Peter Xu
  2019-02-26  8:28       ` Mike Rapoport
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-26  5:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Feb 25, 2019 at 05:58:37PM +0200, Mike Rapoport wrote:
> On Tue, Feb 12, 2019 at 10:56:16AM +0800, Peter Xu wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > This allows UFFDIO_COPY to map pages wrprotected.
>                                        write protected please :)

Sure!

> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Except for two additional nits below
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> 
> > ---
> >  fs/userfaultfd.c                 |  5 +++--
> >  include/linux/userfaultfd_k.h    |  2 +-
> >  include/uapi/linux/userfaultfd.h | 11 +++++-----
> >  mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
> >  4 files changed, 35 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index b397bc3b954d..3092885c9d2c 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -1683,11 +1683,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
> >  	ret = -EINVAL;
> >  	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
> >  		goto out;
> > -	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
> > +	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
> >  		goto out;
> >  	if (mmget_not_zero(ctx->mm)) {
> >  		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
> > -				   uffdio_copy.len, &ctx->mmap_changing);
> > +				   uffdio_copy.len, &ctx->mmap_changing,
> > +				   uffdio_copy.mode);
> >  		mmput(ctx->mm);
> >  	} else {
> >  		return -ESRCH;
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index c6590c58ce28..765ce884cec0 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -34,7 +34,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
> > 
> >  extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
> >  			    unsigned long src_start, unsigned long len,
> > -			    bool *mmap_changing);
> > +			    bool *mmap_changing, __u64 mode);
> >  extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> >  			      unsigned long dst_start,
> >  			      unsigned long len,
> > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > index 48f1a7c2f1f0..297cb044c03f 100644
> > --- a/include/uapi/linux/userfaultfd.h
> > +++ b/include/uapi/linux/userfaultfd.h
> > @@ -203,13 +203,14 @@ struct uffdio_copy {
> >  	__u64 dst;
> >  	__u64 src;
> >  	__u64 len;
> > +#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
> >  	/*
> > -	 * There will be a wrprotection flag later that allows to map
> > -	 * pages wrprotected on the fly. And such a flag will be
> > -	 * available if the wrprotection ioctl are implemented for the
> > -	 * range according to the uffdio_register.ioctls.
> > +	 * UFFDIO_COPY_MODE_WP will map the page wrprotected on the
> > +	 * fly. UFFDIO_COPY_MODE_WP is available only if the
> > +	 * wrprotection ioctl are implemented for the range according
> 
>                              ^ is

Will fix.

> 
> > +	 * to the uffdio_register.ioctls.
> >  	 */
> > -#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
> > +#define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
> >  	__u64 mode;
> > 
> >  	/*
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index d59b5a73dfb3..73a208c5c1e7 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> >  			    struct vm_area_struct *dst_vma,
> >  			    unsigned long dst_addr,
> >  			    unsigned long src_addr,
> > -			    struct page **pagep)
> > +			    struct page **pagep,
> > +			    bool wp_copy)
> >  {
> >  	struct mem_cgroup *memcg;
> >  	pte_t _dst_pte, *dst_pte;
> > @@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> >  	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
> >  		goto out_release;
> > 
> > -	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > -	if (dst_vma->vm_flags & VM_WRITE)
> > -		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> > +	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> > +	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> > +		_dst_pte = pte_mkwrite(_dst_pte);
> > 
> >  	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> >  	if (dst_vma->vm_file) {
> > @@ -399,7 +400,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> >  						unsigned long dst_addr,
> >  						unsigned long src_addr,
> >  						struct page **page,
> > -						bool zeropage)
> > +						bool zeropage,
> > +						bool wp_copy)
> >  {
> >  	ssize_t err;
> > 
> > @@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> >  	if (!(dst_vma->vm_flags & VM_SHARED)) {
> >  		if (!zeropage)
> >  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> > -					       dst_addr, src_addr, page);
> > +					       dst_addr, src_addr, page,
> > +					       wp_copy);
> >  		else
> >  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
> >  						 dst_vma, dst_addr);
> >  	} else {
> > +		VM_WARN_ON(wp_copy); /* WP only available for anon */
> >  		if (!zeropage)
> >  			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
> >  						     dst_vma, dst_addr,
> > @@ -438,7 +442,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> >  					      unsigned long src_start,
> >  					      unsigned long len,
> >  					      bool zeropage,
> > -					      bool *mmap_changing)
> > +					      bool *mmap_changing,
> > +					      __u64 mode)
> >  {
> >  	struct vm_area_struct *dst_vma;
> >  	ssize_t err;
> > @@ -446,6 +451,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> >  	unsigned long src_addr, dst_addr;
> >  	long copied;
> >  	struct page *page;
> > +	bool wp_copy;
> > 
> > 	/*
> >  	 * Sanitize the command parameters:
> > @@ -502,6 +508,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> >  	    dst_vma->vm_flags & VM_SHARED))
> >  		goto out_unlock;
> > 
> > +	/*
> > +	 * validate 'mode' now that we know the dst_vma: don't allow
> > +	 * a wrprotect copy if the userfaultfd didn't register as WP.
> > +	 */
> > +	wp_copy = mode & UFFDIO_COPY_MODE_WP;
> > +	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
> > +		goto out_unlock;

[1]

> > +
> >  	/*
> >  	 * If this is a HUGETLB vma, pass off to appropriate routine
> >  	 */
> 
> I think for hugetlb we should return an error if wp_copy==true.
> It might be worth adding wp_copy parameter to __mcopy_atomic_hugetlb() in
> advance and return the error from there, in a hope it will also support
> UFFD_WP some day :)

We should now fail even earlier if someone tries to register a
hugetlbfs VMA with UFFD_WP, because vma_can_userfault() only allows
write protection on anonymous memory:

static inline bool vma_can_userfault(struct vm_area_struct *vma,
				     unsigned long vm_flags)
{
	/* FIXME: add WP support to hugetlbfs and shmem */
	return vma_is_anonymous(vma) ||
		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
		 !(vm_flags & VM_UFFD_WP));
}

And, as long as a VMA is not tagged with UFFD_WP, a page copy with the
wp_copy flag set will fail with -EINVAL directly above at [1].  So
IMHO we should have already covered this case.

Considering these, I would think we could simply postpone the changes
to __mcopy_atomic_hugetlb() until adding hugetlbfs support on uffd-wp.
Mike, what do you think?

Thanks!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-25 20:52   ` Mike Rapoport
@ 2019-02-26  6:06     ` Peter Xu
  2019-02-26  6:43       ` Mike Rapoport
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-26  6:06 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert, Rik van Riel

On Mon, Feb 25, 2019 at 10:52:34PM +0200, Mike Rapoport wrote:
> On Tue, Feb 12, 2019 at 10:56:26AM +0800, Peter Xu wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> > this doesn't split/merge vmas.
> > 
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > [peterx:
> >  - use the helper to find VMA;
> >  - return -ENOENT if not found to match mcopy case;
> >  - use the new MM_CP_UFFD_WP* flags for change_protection
> >  - check against mmap_changing for failures]
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/userfaultfd_k.h |  3 ++
> >  mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 57 insertions(+)
> > 
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 765ce884cec0..8f6e6ed544fb 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> >  			      unsigned long dst_start,
> >  			      unsigned long len,
> >  			      bool *mmap_changing);
> > +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> > +			       unsigned long start, unsigned long len,
> > +			       bool enable_wp, bool *mmap_changing);
> > 
> >  /* mm helpers */
> >  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index fefa81c301b7..529d180bb4d7 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> >  {
> >  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
> >  }
> > +
> > +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > +			unsigned long len, bool enable_wp, bool *mmap_changing)
> > +{
> > +	struct vm_area_struct *dst_vma;
> > +	pgprot_t newprot;
> > +	int err;
> > +
> > +	/*
> > +	 * Sanitize the command parameters:
> > +	 */
> > +	BUG_ON(start & ~PAGE_MASK);
> > +	BUG_ON(len & ~PAGE_MASK);
> > +
> > +	/* Does the address range wrap, or is the span zero-sized? */
> > +	BUG_ON(start + len <= start);
> 
> I'd replace these BUG_ON()s with
> 
> 	if (WARN_ON())
> 		 return -EINVAL;

I believe BUG_ON() is used because these parameters should already
have been checked in userfaultfd_writeprotect() by the common
validate_range() even before mwriteprotect_range() is called.  So I'm
fine with the WARN_ON() approach, but I'd slightly prefer to simply
keep the patch as is so we can keep Jerome's r-b, if you don't
disagree. :)

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-25 21:09   ` Mike Rapoport
@ 2019-02-26  6:24     ` Peter Xu
  2019-02-26  7:29       ` Mike Rapoport
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-26  6:24 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Feb 25, 2019 at 11:09:35PM +0200, Mike Rapoport wrote:
> On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> > It does not make sense to try to wake up any waiting thread when we're
> > write-protecting a memory region.  Only wake up when resolving a write
> > protected page fault.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  fs/userfaultfd.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 81962d62520c..f1f61a0278c2 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> >  	struct uffdio_writeprotect uffdio_wp;
> >  	struct uffdio_writeprotect __user *user_uffdio_wp;
> >  	struct userfaultfd_wake_range range;
> > +	bool mode_wp, mode_dontwake;
> > 
> >  	if (READ_ONCE(ctx->mmap_changing))
> >  		return -EAGAIN;
> > @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> >  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> >  			       UFFDIO_WRITEPROTECT_MODE_WP))
> >  		return -EINVAL;
> > -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> > -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> > +
> > +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> > +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> > +
> > +	if (mode_wp && mode_dontwake)
> >  		return -EINVAL;
> 
> This actually means the opposite of the commit message text ;-)
> 
> Is any dependency of _WP and _DONTWAKE needed at all?

This is indeed confusing, at least judging by the fact that both you
and Jerome have asked the same question... :)

My understanding is that there is no reason to wake up any thread
when we are write-protecting a range; in that sense the flag
UFFDIO_WRITEPROTECT_MODE_DONTWAKE is already meaningless in the
UFFDIO_WRITEPROTECT ioctl context.  First of all, here is how these
flags are defined:

struct uffdio_writeprotect {
	struct uffdio_range range;
	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
	__u64 mode;
};

To make it clear, we simply define it as "DONTWAKE is valid only with
!WP".  With that, "mode_wp && mode_dontwake" is indeed a meaningless
flag combination.  Though please note that it does not mean the
operation ("don't wake up the thread") is meaningless - that's what
we'll do no matter what when WP==1.  IMHO it's only about the
interface, not the behavior.

I don't have a good way to make this clearer, because firstly we need
the WP flag to mark whether we're protecting or unprotecting the
pages.  Then, for the page fault handling case, we need DONTWAKE to
mark that we don't want to wake up the waiting thread yet.  So both
flags have a reason to stay so far.  With all that in mind, the only
thing I can think of is to forbid using DONTWAKE in the WP case, and
that's how the above definition came about (I believe so - it was
defined that way even before I started to work on this, and I think
it makes sense).

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
  2019-02-25 21:03   ` Mike Rapoport
@ 2019-02-26  6:30     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-26  6:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Feb 25, 2019 at 11:03:51PM +0200, Mike Rapoport wrote:
> On Tue, Feb 12, 2019 at 10:56:27AM +0800, Peter Xu wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > v1: From: Shaohua Li <shli@fb.com>
> > 
> > v2: cleanups, remove a branch.
> > 
> > [peterx writes up the commit message, as below...]
> > 
> > This patch introduces the new uffd-wp APIs for userspace.
> > 
> > Firstly, we'll allow doing UFFDIO_REGISTER with write protection
> > tracking using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this
> > flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in
> > which case the userspace program can not only resolve missing page
> > faults but also track page data changes along the way.
> > 
> > Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page
> > level write protection tracking.  Note that we will need to register
> > the memory region with UFFDIO_REGISTER_MODE_WP before that.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > [peterx: remove useless block, write commit message, check against
> >  VM_MAYWRITE rather than VM_WRITE when register]
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  fs/userfaultfd.c                 | 82 +++++++++++++++++++++++++-------
> >  include/uapi/linux/userfaultfd.h | 11 +++++
> >  2 files changed, 77 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 3092885c9d2c..81962d62520c 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -304,8 +304,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> >  	if (!pmd_present(_pmd))
> >  		goto out;
> > 
> > -	if (pmd_trans_huge(_pmd))
> > +	if (pmd_trans_huge(_pmd)) {
> > +		if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
> > +			ret = true;
> >  		goto out;
> > +	}
> > 
> >  	/*
> >  	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
> > @@ -318,6 +321,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> >  	 */
> >  	if (pte_none(*pte))
> >  		ret = true;
> > +	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
> > +		ret = true;
> >  	pte_unmap(pte);
> > 
> >  out:
> > @@ -1251,10 +1256,13 @@ static __always_inline int validate_range(struct mm_struct *mm,
> >  	return 0;
> >  }
> > 
> > -static inline bool vma_can_userfault(struct vm_area_struct *vma)
> > +static inline bool vma_can_userfault(struct vm_area_struct *vma,
> > +				     unsigned long vm_flags)
> >  {
> > -	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> > -		vma_is_shmem(vma);
> > +	/* FIXME: add WP support to hugetlbfs and shmem */
> > +	return vma_is_anonymous(vma) ||
> > +		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> > +		 !(vm_flags & VM_UFFD_WP));
> >  }
> > 
> >  static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > @@ -1286,15 +1294,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >  	vm_flags = 0;
> >  	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
> >  		vm_flags |= VM_UFFD_MISSING;
> > -	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
> > +	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
> >  		vm_flags |= VM_UFFD_WP;
> > -		/*
> > -		 * FIXME: remove the below error constraint by
> > -		 * implementing the wprotect tracking mode.
> > -		 */
> > -		ret = -EINVAL;
> > -		goto out;
> > -	}
> > 
> >  	ret = validate_range(mm, uffdio_register.range.start,
> >  			     uffdio_register.range.len);
> > @@ -1342,7 +1343,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > 
> >  		/* check not compatible vmas */
> >  		ret = -EINVAL;
> > -		if (!vma_can_userfault(cur))
> > +		if (!vma_can_userfault(cur, vm_flags))
> >  			goto out_unlock;
> > 
> >  		/*
> > @@ -1370,6 +1371,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >  			if (end & (vma_hpagesize - 1))
> >  				goto out_unlock;
> >  		}
> > +		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_MAYWRITE))
> > +			goto out_unlock;
> > 
> >  		/*
> >  		 * Check that this vma isn't already owned by a
> > @@ -1399,7 +1402,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >  	do {
> >  		cond_resched();
> > 
> > -		BUG_ON(!vma_can_userfault(vma));
> > +		BUG_ON(!vma_can_userfault(vma, vm_flags));
> >  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> >  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> >  		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > @@ -1534,7 +1537,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> >  		 * provides for more strict behavior to notice
> >  		 * unregistration errors.
> >  		 */
> > -		if (!vma_can_userfault(cur))
> > +		if (!vma_can_userfault(cur, cur->vm_flags))
> >  			goto out_unlock;
> > 
> >  		found = true;
> > @@ -1548,7 +1551,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> >  	do {
> >  		cond_resched();
> > 
> > -		BUG_ON(!vma_can_userfault(vma));
> > +		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
> > 
> >  		/*
> >  		 * Nothing to do: this vma is already registered into this
> > @@ -1761,6 +1764,50 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> >  	return ret;
> >  }
> > 
> > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > +				    unsigned long arg)
> > +{
> > +	int ret;
> > +	struct uffdio_writeprotect uffdio_wp;
> > +	struct uffdio_writeprotect __user *user_uffdio_wp;
> > +	struct userfaultfd_wake_range range;
> > +
> > +	if (READ_ONCE(ctx->mmap_changing))
> > +		return -EAGAIN;
> > +
> > +	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
> > +
> > +	if (copy_from_user(&uffdio_wp, user_uffdio_wp,
> > +			   sizeof(struct uffdio_writeprotect)))
> > +		return -EFAULT;
> > +
> > +	ret = validate_range(ctx->mm, uffdio_wp.range.start,
> > +			     uffdio_wp.range.len);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> > +			       UFFDIO_WRITEPROTECT_MODE_WP))
> > +		return -EINVAL;
> > +	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> > +	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> > +		return -EINVAL;
> 
> Why _DONTWAKE cannot be used when setting write-protection?
> I can imagine a use-case when you'd want to freeze an application,
> write-protect several regions and then let the application continue.

This is the same question as the one in the other thread, where I've
given a longer reply; hopefully that one is a bit clearer (sorry for
the confusion in any case!).  I would be more than glad to know if
there is any smarter way to define/rename the flags.

Thanks!

-- 
Peter Xu


* Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-26  6:06     ` Peter Xu
@ 2019-02-26  6:43       ` Mike Rapoport
  2019-02-26  7:20         ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  6:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert, Rik van Riel

On Tue, Feb 26, 2019 at 02:06:27PM +0800, Peter Xu wrote:
> On Mon, Feb 25, 2019 at 10:52:34PM +0200, Mike Rapoport wrote:
> > On Tue, Feb 12, 2019 at 10:56:26AM +0800, Peter Xu wrote:
> > > From: Shaohua Li <shli@fb.com>
> > > 
> > > Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> > > this doesn't split/merge vmas.
> > > 
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Shaohua Li <shli@fb.com>
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > [peterx:
> > >  - use the helper to find VMA;
> > >  - return -ENOENT if not found to match mcopy case;
> > >  - use the new MM_CP_UFFD_WP* flags for change_protection
> > >  - check against mmap_changing for failures]
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  include/linux/userfaultfd_k.h |  3 ++
> > >  mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
> > >  2 files changed, 57 insertions(+)
> > > 
> > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > > index 765ce884cec0..8f6e6ed544fb 100644
> > > --- a/include/linux/userfaultfd_k.h
> > > +++ b/include/linux/userfaultfd_k.h
> > > @@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> > >  			      unsigned long dst_start,
> > >  			      unsigned long len,
> > >  			      bool *mmap_changing);
> > > +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> > > +			       unsigned long start, unsigned long len,
> > > +			       bool enable_wp, bool *mmap_changing);
> > > 
> > >  /* mm helpers */
> > >  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index fefa81c301b7..529d180bb4d7 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> > >  {
> > >  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
> > >  }
> > > +
> > > +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > > +			unsigned long len, bool enable_wp, bool *mmap_changing)
> > > +{
> > > +	struct vm_area_struct *dst_vma;
> > > +	pgprot_t newprot;
> > > +	int err;
> > > +
> > > +	/*
> > > +	 * Sanitize the command parameters:
> > > +	 */
> > > +	BUG_ON(start & ~PAGE_MASK);
> > > +	BUG_ON(len & ~PAGE_MASK);
> > > +
> > > +	/* Does the address range wrap, or is the span zero-sized? */
> > > +	BUG_ON(start + len <= start);
> > 
> > I'd replace these BUG_ON()s with
> > 
> > 	if (WARN_ON())
> > 		 return -EINVAL;
> 
> I believe BUG_ON() is used because these parameters should have been
> checked in userfaultfd_writeprotect() already by the common
> validate_range() even before calling mwriteprotect_range().  So I'm
> fine with the WARN_ON() approach but I'd slightly prefer to simply
> keep the patch as is to keep Jerome's r-b if you won't disagree. :)

Right, userfaultfd_writeprotect() should check these parameters, and if
it didn't, that would indeed be a bug.  But still, it's not severe
enough to crash the kernel.

I hope Jerome wouldn't mind to keep his r-b with s/BUG_ON/WARN_ON ;-)

With this change you can also add 

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
 
> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 25/26] userfaultfd: selftests: refactor statistics
  2019-02-12  2:56 ` [PATCH v2 25/26] userfaultfd: selftests: refactor statistics Peter Xu
@ 2019-02-26  6:50   ` Mike Rapoport
  0 siblings, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  6:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:31AM +0800, Peter Xu wrote:
> Introduce uffd_stats structure for statistics of the self test, at the
> same time refactor the code to always pass in the uffd_stats for either
> read() or poll() typed fault handling threads instead of using two
> different ways to return the statistic results.  No functional change.
> 
> With the new structure, it's very easy to introduce new statistics.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  tools/testing/selftests/vm/userfaultfd.c | 76 +++++++++++++++---------
>  1 file changed, 49 insertions(+), 27 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index 5d1db824f73a..e5d12c209e09 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -88,6 +88,12 @@ static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
>  static char *zeropage;
>  pthread_attr_t attr;
> 
> +/* Userfaultfd test statistics */
> +struct uffd_stats {
> +	int cpu;
> +	unsigned long missing_faults;
> +};
> +
>  /* pthread_mutex_t starts at page offset 0 */
>  #define area_mutex(___area, ___nr)					\
>  	((pthread_mutex_t *) ((___area) + (___nr)*page_size))
> @@ -127,6 +133,17 @@ static void usage(void)
>  	exit(1);
>  }
> 
> +static void uffd_stats_reset(struct uffd_stats *uffd_stats,
> +			     unsigned long n_cpus)
> +{
> +	int i;
> +
> +	for (i = 0; i < n_cpus; i++) {
> +		uffd_stats[i].cpu = i;
> +		uffd_stats[i].missing_faults = 0;
> +	}
> +}
> +
>  static int anon_release_pages(char *rel_area)
>  {
>  	int ret = 0;
> @@ -469,8 +486,8 @@ static int uffd_read_msg(int ufd, struct uffd_msg *msg)
>  	return 0;
>  }
> 
> -/* Return 1 if page fault handled by us; otherwise 0 */
> -static int uffd_handle_page_fault(struct uffd_msg *msg)
> +static void uffd_handle_page_fault(struct uffd_msg *msg,
> +				   struct uffd_stats *stats)
>  {
>  	unsigned long offset;
> 
> @@ -485,18 +502,19 @@ static int uffd_handle_page_fault(struct uffd_msg *msg)
>  	offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
>  	offset &= ~(page_size-1);
> 
> -	return copy_page(uffd, offset);
> +	if (copy_page(uffd, offset))
> +		stats->missing_faults++;
>  }
> 
>  static void *uffd_poll_thread(void *arg)
>  {
> -	unsigned long cpu = (unsigned long) arg;
> +	struct uffd_stats *stats = (struct uffd_stats *)arg;
> +	unsigned long cpu = stats->cpu;
>  	struct pollfd pollfd[2];
>  	struct uffd_msg msg;
>  	struct uffdio_register uffd_reg;
>  	int ret;
>  	char tmp_chr;
> -	unsigned long userfaults = 0;
> 
>  	pollfd[0].fd = uffd;
>  	pollfd[0].events = POLLIN;
> @@ -526,7 +544,7 @@ static void *uffd_poll_thread(void *arg)
>  				msg.event), exit(1);
>  			break;
>  		case UFFD_EVENT_PAGEFAULT:
> -			userfaults += uffd_handle_page_fault(&msg);
> +			uffd_handle_page_fault(&msg, stats);
>  			break;
>  		case UFFD_EVENT_FORK:
>  			close(uffd);
> @@ -545,28 +563,27 @@ static void *uffd_poll_thread(void *arg)
>  			break;
>  		}
>  	}
> -	return (void *)userfaults;
> +
> +	return NULL;
>  }
> 
>  pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER;
> 
>  static void *uffd_read_thread(void *arg)
>  {
> -	unsigned long *this_cpu_userfaults;
> +	struct uffd_stats *stats = (struct uffd_stats *)arg;
>  	struct uffd_msg msg;
> 
> -	this_cpu_userfaults = (unsigned long *) arg;
> -	*this_cpu_userfaults = 0;
> -
>  	pthread_mutex_unlock(&uffd_read_mutex);
>  	/* from here cancellation is ok */
> 
>  	for (;;) {
>  		if (uffd_read_msg(uffd, &msg))
>  			continue;
> -		(*this_cpu_userfaults) += uffd_handle_page_fault(&msg);
> +		uffd_handle_page_fault(&msg, stats);
>  	}
> -	return (void *)NULL;
> +
> +	return NULL;
>  }
> 
>  static void *background_thread(void *arg)
> @@ -582,13 +599,12 @@ static void *background_thread(void *arg)
>  	return NULL;
>  }
> 
> -static int stress(unsigned long *userfaults)
> +static int stress(struct uffd_stats *uffd_stats)
>  {
>  	unsigned long cpu;
>  	pthread_t locking_threads[nr_cpus];
>  	pthread_t uffd_threads[nr_cpus];
>  	pthread_t background_threads[nr_cpus];
> -	void **_userfaults = (void **) userfaults;
> 
>  	finished = 0;
>  	for (cpu = 0; cpu < nr_cpus; cpu++) {
> @@ -597,12 +613,13 @@ static int stress(unsigned long *userfaults)
>  			return 1;
>  		if (bounces & BOUNCE_POLL) {
>  			if (pthread_create(&uffd_threads[cpu], &attr,
> -					   uffd_poll_thread, (void *)cpu))
> +					   uffd_poll_thread,
> +					   (void *)&uffd_stats[cpu]))
>  				return 1;
>  		} else {
>  			if (pthread_create(&uffd_threads[cpu], &attr,
>  					   uffd_read_thread,
> -					   &_userfaults[cpu]))
> +					   (void *)&uffd_stats[cpu]))
>  				return 1;
>  			pthread_mutex_lock(&uffd_read_mutex);
>  		}
> @@ -639,7 +656,8 @@ static int stress(unsigned long *userfaults)
>  				fprintf(stderr, "pipefd write error\n");
>  				return 1;
>  			}
> -			if (pthread_join(uffd_threads[cpu], &_userfaults[cpu]))
> +			if (pthread_join(uffd_threads[cpu],
> +					 (void *)&uffd_stats[cpu]))
>  				return 1;
>  		} else {
>  			if (pthread_cancel(uffd_threads[cpu]))
> @@ -910,11 +928,11 @@ static int userfaultfd_events_test(void)
>  {
>  	struct uffdio_register uffdio_register;
>  	unsigned long expected_ioctls;
> -	unsigned long userfaults;
>  	pthread_t uffd_mon;
>  	int err, features;
>  	pid_t pid;
>  	char c;
> +	struct uffd_stats stats = { 0 };
> 
>  	printf("testing events (fork, remap, remove): ");
>  	fflush(stdout);
> @@ -941,7 +959,7 @@ static int userfaultfd_events_test(void)
>  			"unexpected missing ioctl for anon memory\n"),
>  			exit(1);
> 
> -	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
> +	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
>  		perror("uffd_poll_thread create"), exit(1);
> 
>  	pid = fork();
> @@ -957,13 +975,13 @@ static int userfaultfd_events_test(void)
> 
>  	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
>  		perror("pipe write"), exit(1);
> -	if (pthread_join(uffd_mon, (void **)&userfaults))
> +	if (pthread_join(uffd_mon, NULL))
>  		return 1;
> 
>  	close(uffd);
> -	printf("userfaults: %ld\n", userfaults);
> +	printf("userfaults: %ld\n", stats.missing_faults);
> 
> -	return userfaults != nr_pages;
> +	return stats.missing_faults != nr_pages;
>  }
> 
>  static int userfaultfd_sig_test(void)
> @@ -975,6 +993,7 @@ static int userfaultfd_sig_test(void)
>  	int err, features;
>  	pid_t pid;
>  	char c;
> +	struct uffd_stats stats = { 0 };
> 
>  	printf("testing signal delivery: ");
>  	fflush(stdout);
> @@ -1006,7 +1025,7 @@ static int userfaultfd_sig_test(void)
>  	if (uffd_test_ops->release_pages(area_dst))
>  		return 1;
> 
> -	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
> +	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
>  		perror("uffd_poll_thread create"), exit(1);
> 
>  	pid = fork();
> @@ -1032,6 +1051,7 @@ static int userfaultfd_sig_test(void)
>  	close(uffd);
>  	return userfaults != 0;
>  }
> +
>  static int userfaultfd_stress(void)
>  {
>  	void *area;
> @@ -1040,7 +1060,7 @@ static int userfaultfd_stress(void)
>  	struct uffdio_register uffdio_register;
>  	unsigned long cpu;
>  	int err;
> -	unsigned long userfaults[nr_cpus];
> +	struct uffd_stats uffd_stats[nr_cpus];
> 
>  	uffd_test_ops->allocate_area((void **)&area_src);
>  	if (!area_src)
> @@ -1169,8 +1189,10 @@ static int userfaultfd_stress(void)
>  		if (uffd_test_ops->release_pages(area_dst))
>  			return 1;
> 
> +		uffd_stats_reset(uffd_stats, nr_cpus);
> +
>  		/* bounce pass */
> -		if (stress(userfaults))
> +		if (stress(uffd_stats))
>  			return 1;
> 
>  		/* unregister */
> @@ -1213,7 +1235,7 @@ static int userfaultfd_stress(void)
> 
>  		printf("userfaults:");
>  		for (cpu = 0; cpu < nr_cpus; cpu++)
> -			printf(" %lu", userfaults[cpu]);
> +			printf(" %lu", uffd_stats[cpu].missing_faults);
>  		printf("\n");
>  	}
> 
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-02-25 21:19   ` Mike Rapoport
@ 2019-02-26  6:53     ` Peter Xu
  2019-02-26  7:04       ` Mike Rapoport
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-26  6:53 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Mon, Feb 25, 2019 at 11:19:32PM +0200, Mike Rapoport wrote:
> On Tue, Feb 12, 2019 at 10:56:30AM +0800, Peter Xu wrote:
> > From: Martin Cracauer <cracauer@cons.org>
> > 
> > Adds documentation about the write protection support.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > [peterx: rewrite in rst format; fixups here and there]
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> 
> Peter, can you please also update the man pages (1, 2)?
> 
> [1] http://man7.org/linux/man-pages/man2/userfaultfd.2.html
> [2] http://man7.org/linux/man-pages/man2/ioctl_userfaultfd.2.html

Sure.  Should I post the man patches after the kernel part is merged?

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 26/26] userfaultfd: selftests: add write-protect test
  2019-02-12  2:56 ` [PATCH v2 26/26] userfaultfd: selftests: add write-protect test Peter Xu
@ 2019-02-26  6:58   ` Mike Rapoport
  2019-02-26  7:52     ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  6:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:32AM +0800, Peter Xu wrote:
> This patch adds uffd tests for write protection.
> 
> Instead of introducing new tests for it, let's simply squash uffd-wp
> tests into existing uffd-missing test cases.  Changes are:
> 
> (1) Bouncing tests
> 
>   We do the write-protection in two ways during the bouncing test:
> 
>   - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
>     we'll make sure that for each bounce process every single page
>     will fault at least twice: once for MISSING, once for WP.
> 
>   - By directly calling UFFDIO_WRITEPROTECT on already-faulted memory:
>     To further torture the explicit page protection procedures of
>     uffd-wp, we split each bounce procedure into two halves (in the
>     background thread): the first half will be MISSING+WP for each
>     page as explained above.  After the first half, we write protect
>     the faulted region in the background thread to make sure at least
>     half of the pages (the first half) will be write protected again,
>     which tests the new UFFDIO_WRITEPROTECT call.  Then we continue
>     with the 2nd half, which will contain both MISSING and WP faulting
>     tests for the 2nd half and WP-only faults from the 1st half.
> 
> (2) Event/Signal test
> 
>   Mostly the previous tests, but now doing MISSING+WP for each page.
>   For the sigbus-mode test we'll need to provide a standalone path to
>   handle the write protection faults.
> 
> For all tests, do statistics as well for uffd-wp pages.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  tools/testing/selftests/vm/userfaultfd.c | 154 ++++++++++++++++++-----
>  1 file changed, 126 insertions(+), 28 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index e5d12c209e09..57b5ac02080a 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -56,6 +56,7 @@
>  #include <linux/userfaultfd.h>
>  #include <setjmp.h>
>  #include <stdbool.h>
> +#include <assert.h>
> 
>  #include "../kselftest.h"
> 
> @@ -78,6 +79,8 @@ static int test_type;
>  #define ALARM_INTERVAL_SECS 10
>  static volatile bool test_uffdio_copy_eexist = true;
>  static volatile bool test_uffdio_zeropage_eexist = true;
> +/* Whether to test uffd write-protection */
> +static bool test_uffdio_wp = false;
> 
>  static bool map_shared;
>  static int huge_fd;
> @@ -92,6 +95,7 @@ pthread_attr_t attr;
>  struct uffd_stats {
>  	int cpu;
>  	unsigned long missing_faults;
> +	unsigned long wp_faults;
>  };
> 
>  /* pthread_mutex_t starts at page offset 0 */
> @@ -141,9 +145,29 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
>  	for (i = 0; i < n_cpus; i++) {
>  		uffd_stats[i].cpu = i;
>  		uffd_stats[i].missing_faults = 0;
> +		uffd_stats[i].wp_faults = 0;
>  	}
>  }
> 
> +static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
> +{
> +	int i;
> +	unsigned long long miss_total = 0, wp_total = 0;
> +
> +	for (i = 0; i < n_cpus; i++) {
> +		miss_total += stats[i].missing_faults;
> +		wp_total += stats[i].wp_faults;
> +	}
> +
> +	printf("userfaults: %llu missing (", miss_total);
> +	for (i = 0; i < n_cpus; i++)
> +		printf("%lu+", stats[i].missing_faults);
> +	printf("\b), %llu wp (", wp_total);
> +	for (i = 0; i < n_cpus; i++)
> +		printf("%lu+", stats[i].wp_faults);
> +	printf("\b)\n");
> +}
> +
>  static int anon_release_pages(char *rel_area)
>  {
>  	int ret = 0;
> @@ -264,19 +288,15 @@ struct uffd_test_ops {
>  	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
>  };
> 
> -#define ANON_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
> -					 (1 << _UFFDIO_COPY) | \
> -					 (1 << _UFFDIO_ZEROPAGE))
> -
>  static struct uffd_test_ops anon_uffd_test_ops = {
> -	.expected_ioctls = ANON_EXPECTED_IOCTLS,
> +	.expected_ioctls = UFFD_API_RANGE_IOCTLS,
>  	.allocate_area	= anon_allocate_area,
>  	.release_pages	= anon_release_pages,
>  	.alias_mapping = noop_alias_mapping,
>  };
> 
>  static struct uffd_test_ops shmem_uffd_test_ops = {
> -	.expected_ioctls = ANON_EXPECTED_IOCTLS,
> +	.expected_ioctls = UFFD_API_RANGE_IOCTLS,

Doesn't UFFD_API_RANGE_IOCTLS include UFFDIO_WP, which is not supported
for shmem?

>  	.allocate_area	= shmem_allocate_area,
>  	.release_pages	= shmem_release_pages,
>  	.alias_mapping = noop_alias_mapping,

...

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-02-26  6:53     ` Peter Xu
@ 2019-02-26  7:04       ` Mike Rapoport
  2019-02-26  7:42         ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  7:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 26, 2019 at 02:53:42PM +0800, Peter Xu wrote:
> On Mon, Feb 25, 2019 at 11:19:32PM +0200, Mike Rapoport wrote:
> > On Tue, Feb 12, 2019 at 10:56:30AM +0800, Peter Xu wrote:
> > > From: Martin Cracauer <cracauer@cons.org>
> > > 
> > > Adds documentation about the write protection support.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > [peterx: rewrite in rst format; fixups here and there]
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Peter, can you please also update the man pages (1, 2)?
> > 
> > [1] http://man7.org/linux/man-pages/man2/userfaultfd.2.html
> > [2] http://man7.org/linux/man-pages/man2/ioctl_userfaultfd.2.html
> 
> Sure.  Should I post the man patches after the kernel part is merged?

Yep, once we know for sure what API the kernel will expose.
 
> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-26  6:43       ` Mike Rapoport
@ 2019-02-26  7:20         ` Peter Xu
  2019-02-26  7:46           ` Mike Rapoport
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-26  7:20 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert, Rik van Riel

On Tue, Feb 26, 2019 at 08:43:47AM +0200, Mike Rapoport wrote:
> On Tue, Feb 26, 2019 at 02:06:27PM +0800, Peter Xu wrote:
> > On Mon, Feb 25, 2019 at 10:52:34PM +0200, Mike Rapoport wrote:
> > > On Tue, Feb 12, 2019 at 10:56:26AM +0800, Peter Xu wrote:
> > > > From: Shaohua Li <shli@fb.com>
> > > > 
> > > > Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> > > > this doesn't split/merge vmas.
> > > > 
> > > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > > Cc: Rik van Riel <riel@redhat.com>
> > > > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > > > Cc: Mel Gorman <mgorman@suse.de>
> > > > Cc: Hugh Dickins <hughd@google.com>
> > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > Signed-off-by: Shaohua Li <shli@fb.com>
> > > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > > [peterx:
> > > >  - use the helper to find VMA;
> > > >  - return -ENOENT if not found to match mcopy case;
> > > >  - use the new MM_CP_UFFD_WP* flags for change_protection
> > > >  - check against mmap_changing for failures]
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  include/linux/userfaultfd_k.h |  3 ++
> > > >  mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 57 insertions(+)
> > > > 
> > > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > > > index 765ce884cec0..8f6e6ed544fb 100644
> > > > --- a/include/linux/userfaultfd_k.h
> > > > +++ b/include/linux/userfaultfd_k.h
> > > > @@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> > > >  			      unsigned long dst_start,
> > > >  			      unsigned long len,
> > > >  			      bool *mmap_changing);
> > > > +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> > > > +			       unsigned long start, unsigned long len,
> > > > +			       bool enable_wp, bool *mmap_changing);
> > > > 
> > > >  /* mm helpers */
> > > >  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index fefa81c301b7..529d180bb4d7 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> > > >  {
> > > >  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
> > > >  }
> > > > +
> > > > +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > > > +			unsigned long len, bool enable_wp, bool *mmap_changing)
> > > > +{
> > > > +	struct vm_area_struct *dst_vma;
> > > > +	pgprot_t newprot;
> > > > +	int err;
> > > > +
> > > > +	/*
> > > > +	 * Sanitize the command parameters:
> > > > +	 */
> > > > +	BUG_ON(start & ~PAGE_MASK);
> > > > +	BUG_ON(len & ~PAGE_MASK);
> > > > +
> > > > +	/* Does the address range wrap, or is the span zero-sized? */
> > > > +	BUG_ON(start + len <= start);
> > > 
> > > I'd replace these BUG_ON()s with
> > > 
> > > 	if (WARN_ON())
> > > 		 return -EINVAL;
> > 
> > I believe BUG_ON() is used because these parameters should have been
> > checked in userfaultfd_writeprotect() already by the common
> > validate_range() even before calling mwriteprotect_range().  So I'm
> > fine with the WARN_ON() approach but I'd slightly prefer to simply
> > keep the patch as is to keep Jerome's r-b if you won't disagree. :)
> 
> Right, userfaultfd_writeprotect() should check these parameters and if it
> didn't it was a bug indeed. But still, it's not severe enough to crash the
> kernel.
> 
> I hope Jerome wouldn't mind to keep his r-b with s/BUG_ON/WARN_ON ;-)
> 
> With this change you can also add 
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

Thanks!  Though before I change anything... please note that the
BUG_ON()s are really what we've done in existing MISSING code.  One
example is userfaultfd_copy() which did validate_range() first, then
in __mcopy_atomic() we've used BUG_ON()s.  They make sense to me
because userspace should never be able to trigger it.  And if we
really want to change the BUG_ON()s in this patch, IMHO we probably
want to change the other BUG_ON()s as well, then that can be a
standalone patch or patchset to address another issue...

(and if we really want to use WARN_ON, I would prefer WARN_ON_ONCE, or
 directly return the errors to avoid DOS).

I'll wait to see how you'd prefer I move on with this patch.

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-26  6:24     ` Peter Xu
@ 2019-02-26  7:29       ` Mike Rapoport
  2019-02-26  7:41         ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  7:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 26, 2019 at 02:24:52PM +0800, Peter Xu wrote:
> On Mon, Feb 25, 2019 at 11:09:35PM +0200, Mike Rapoport wrote:
> > On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> > > It does not make sense to try to wake up any waiting thread when we're
> > > write-protecting a memory region.  Only wake up when resolving a write
> > > protected page fault.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  fs/userfaultfd.c | 13 ++++++++-----
> > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index 81962d62520c..f1f61a0278c2 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > >  	struct uffdio_writeprotect uffdio_wp;
> > >  	struct uffdio_writeprotect __user *user_uffdio_wp;
> > >  	struct userfaultfd_wake_range range;
> > > +	bool mode_wp, mode_dontwake;
> > > 
> > >  	if (READ_ONCE(ctx->mmap_changing))
> > >  		return -EAGAIN;
> > > @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > >  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> > >  			       UFFDIO_WRITEPROTECT_MODE_WP))
> > >  		return -EINVAL;
> > > -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> > > -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> > > +
> > > +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> > > +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> > > +
> > > +	if (mode_wp && mode_dontwake)
> > >  		return -EINVAL;
> > 
> > This actually means the opposite of the commit message text ;-)
> > 
> > Is any dependency of _WP and _DONTWAKE needed at all?
> 
> So this is indeed somewhat confusing, since both you and Jerome have
> asked the same question... :)
> 
> My understanding is that we don't have any reason to wake up any
> thread when we are write-protecting a range, in that sense the flag
> UFFDIO_WRITEPROTECT_MODE_DONTWAKE is already meaningless in the
> UFFDIO_WRITEPROTECT ioctl context.  So before everything here's how
> these flags are defined:
> 
> struct uffdio_writeprotect {
> 	struct uffdio_range range;
> 	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
> #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> 	__u64 mode;
> };
> 
> To make it clear, we simply define it as "DONTWAKE is valid only with
> !WP".  When with that, "mode_wp && mode_dontwake" is indeed a
> meaningless flag combination.  Though please note that it does not
> mean that the operation ("don't wake up the thread") is meaningless -
> that's what we'll do no matter what when WP==1.  IMHO it's only about
> the interface not the behavior.
> 
> I don't have a good way to make this clearer because firstly we'll
> need the WP flag to mark whether we're protecting or unprotecting the
> pages.  Later on, we need DONTWAKE for page fault handling case to
> mark that we don't want to wake up the waiting thread now.  So both
> the flags have their reason to stay so far.  Then with all these in
> mind what I can think of is only to forbid using DONTWAKE in WP case,
> and that's how above definition comes (I believe, because it was
> defined that way even before I started to work on it and I think it
> makes sense).

There's no argument about how DONTWAKE can be used with !WP.
userfaultfd_writeprotect() is called by the uffd monitor in response to a
WP page fault: it asks to clear write protection on some range, but it
does not want to wake the faulting thread yet; rather, it will use
uffd_wake() later.

Still, I can't grok the usage of DONTWAKE with WP=1. In my understanding,
in this case userfaultfd_writeprotect() is called unrelated to page faults,
and the monitored thread runs freely, so why should it be woken at all?

And what happens, if the thread is waiting on a missing page fault and we
do userfaultfd_writeprotect(WP=1) at the same time?

> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-26  7:29       ` Mike Rapoport
@ 2019-02-26  7:41         ` Peter Xu
  2019-02-26  8:00           ` Mike Rapoport
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Xu @ 2019-02-26  7:41 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 26, 2019 at 09:29:33AM +0200, Mike Rapoport wrote:
> On Tue, Feb 26, 2019 at 02:24:52PM +0800, Peter Xu wrote:
> > On Mon, Feb 25, 2019 at 11:09:35PM +0200, Mike Rapoport wrote:
> > > On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> > > > It does not make sense to try to wake up any waiting thread when we're
> > > > write-protecting a memory region.  Only wake up when resolving a write
> > > > protected page fault.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  fs/userfaultfd.c | 13 ++++++++-----
> > > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index 81962d62520c..f1f61a0278c2 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > >  	struct uffdio_writeprotect uffdio_wp;
> > > >  	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > >  	struct userfaultfd_wake_range range;
> > > > +	bool mode_wp, mode_dontwake;
> > > > 
> > > >  	if (READ_ONCE(ctx->mmap_changing))
> > > >  		return -EAGAIN;
> > > > @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > >  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> > > >  			       UFFDIO_WRITEPROTECT_MODE_WP))
> > > >  		return -EINVAL;
> > > > -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> > > > -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> > > > +
> > > > +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> > > > +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> > > > +
> > > > +	if (mode_wp && mode_dontwake)
> > > >  		return -EINVAL;
> > > 
> > > This actually means the opposite of the commit message text ;-)
> > > 
> > > Is any dependency of _WP and _DONTWAKE needed at all?
> > 
> > So this is indeed somewhat confusing, since both you and Jerome have
> > asked the same question... :)
> > 
> > My understanding is that we don't have any reason to wake up any
> > thread when we are write-protecting a range, in that sense the flag
> > UFFDIO_WRITEPROTECT_MODE_DONTWAKE is already meaningless in the
> > UFFDIO_WRITEPROTECT ioctl context.  So before everything here's how
> > these flags are defined:
> > 
> > struct uffdio_writeprotect {
> > 	struct uffdio_range range;
> > 	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
> > #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> > #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> > 	__u64 mode;
> > };
> > 
> > To make it clear, we simply define it as "DONTWAKE is valid only with
> > !WP".  When with that, "mode_wp && mode_dontwake" is indeed a
> > meaningless flag combination.  Though please note that it does not
> > mean that the operation ("don't wake up the thread") is meaningless -
> > that's what we'll do no matter what when WP==1.  IMHO it's only about
> > the interface not the behavior.
> > 
> > I don't have a good way to make this clearer because firstly we'll
> > need the WP flag to mark whether we're protecting or unprotecting the
> > pages.  Later on, we need DONTWAKE for page fault handling case to
> > mark that we don't want to wake up the waiting thread now.  So both
> > the flags have their reason to stay so far.  Then with all these in
> > mind what I can think of is only to forbid using DONTWAKE in WP case,
> > and that's how above definition comes (I believe, because it was
> > defined that way even before I started to work on it and I think it
> > makes sense).
> 
> There's no argument about how DONTWAKE can be used with !WP.
> userfaultfd_writeprotect() is called by the uffd monitor in response to a
> WP page fault: it asks to clear write protection on some range, but it
> does not want to wake the faulting thread yet; rather, it will use
> uffd_wake() later.
> 
> Still, I can't grok the usage of DONTWAKE with WP=1. In my understanding,
> in this case userfaultfd_writeprotect() is called unrelated to page faults,
> and the monitored thread runs freely, so why should it be woken at all?

Exactly, this is how I understand it.  And that's why I wrote this
patch to remove the extra wakeup(), since I think it's unnecessary.

> 
> And what happens, if the thread is waiting on a missing page fault and we
> do userfaultfd_writeprotect(WP=1) at the same time?

Then IMHO the userfaultfd_writeprotect() will be a no-op simply because
the page is still missing.  With the old code (before this patch) we
would probably even try to wake up this thread, but it would just fault
again on the same address since the page is missing.  After this patch
the monitored thread will simply continue to wait on the missing page.

Thanks,

-- 
Peter Xu


* Re: [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
  2019-02-26  7:04       ` Mike Rapoport
@ 2019-02-26  7:42         ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-26  7:42 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 26, 2019 at 09:04:25AM +0200, Mike Rapoport wrote:
> On Tue, Feb 26, 2019 at 02:53:42PM +0800, Peter Xu wrote:
> > On Mon, Feb 25, 2019 at 11:19:32PM +0200, Mike Rapoport wrote:
> > > On Tue, Feb 12, 2019 at 10:56:30AM +0800, Peter Xu wrote:
> > > > From: Martin Cracauer <cracauer@cons.org>
> > > > 
> > > > Adds documentation about the write protection support.
> > > > 
> > > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > > [peterx: rewrite in rst format; fixups here and there]
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> > > 
> > > Peter, can you please also update the man pages (1, 2)?
> > > 
> > > [1] http://man7.org/linux/man-pages/man2/userfaultfd.2.html
> > > [2] http://man7.org/linux/man-pages/man2/ioctl_userfaultfd.2.html
> > 
> > Sure.  Should I post the man patches after the kernel part is merged?
> 
> Yep, once we know for sure what API the kernel will expose.

I see, thanks.  Then I'll probably wait until the series gets merged, to
be safe, since so far we still have ongoing discussion on the interfaces
(especially the DONTWAKE flags).

-- 
Peter Xu


* Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-26  7:20         ` Peter Xu
@ 2019-02-26  7:46           ` Mike Rapoport
  2019-02-26  7:54             ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  7:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert, Rik van Riel

On Tue, Feb 26, 2019 at 03:20:28PM +0800, Peter Xu wrote:
> On Tue, Feb 26, 2019 at 08:43:47AM +0200, Mike Rapoport wrote:
> > On Tue, Feb 26, 2019 at 02:06:27PM +0800, Peter Xu wrote:
> > > On Mon, Feb 25, 2019 at 10:52:34PM +0200, Mike Rapoport wrote:
> > > > On Tue, Feb 12, 2019 at 10:56:26AM +0800, Peter Xu wrote:
> > > > > From: Shaohua Li <shli@fb.com>
> > > > > 
> > > > > Add API to enable/disable writeprotect a vma range. Unlike mprotect,
> > > > > this doesn't split/merge vmas.
> > > > > 
> > > > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > > > Cc: Rik van Riel <riel@redhat.com>
> > > > > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > > > > Cc: Mel Gorman <mgorman@suse.de>
> > > > > Cc: Hugh Dickins <hughd@google.com>
> > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > > Signed-off-by: Shaohua Li <shli@fb.com>
> > > > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > > > [peterx:
> > > > >  - use the helper to find VMA;
> > > > >  - return -ENOENT if not found to match mcopy case;
> > > > >  - use the new MM_CP_UFFD_WP* flags for change_protection
> > > > >  - check against mmap_changing for failures]
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > >  include/linux/userfaultfd_k.h |  3 ++
> > > > >  mm/userfaultfd.c              | 54 +++++++++++++++++++++++++++++++++++
> > > > >  2 files changed, 57 insertions(+)
> > > > > 
> > > > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > > > > index 765ce884cec0..8f6e6ed544fb 100644
> > > > > --- a/include/linux/userfaultfd_k.h
> > > > > +++ b/include/linux/userfaultfd_k.h
> > > > > @@ -39,6 +39,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> > > > >  			      unsigned long dst_start,
> > > > >  			      unsigned long len,
> > > > >  			      bool *mmap_changing);
> > > > > +extern int mwriteprotect_range(struct mm_struct *dst_mm,
> > > > > +			       unsigned long start, unsigned long len,
> > > > > +			       bool enable_wp, bool *mmap_changing);
> > > > > 
> > > > >  /* mm helpers */
> > > > >  static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index fefa81c301b7..529d180bb4d7 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -639,3 +639,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> > > > >  {
> > > > >  	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
> > > > >  }
> > > > > +
> > > > > +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > > > > +			unsigned long len, bool enable_wp, bool *mmap_changing)
> > > > > +{
> > > > > +	struct vm_area_struct *dst_vma;
> > > > > +	pgprot_t newprot;
> > > > > +	int err;
> > > > > +
> > > > > +	/*
> > > > > +	 * Sanitize the command parameters:
> > > > > +	 */
> > > > > +	BUG_ON(start & ~PAGE_MASK);
> > > > > +	BUG_ON(len & ~PAGE_MASK);
> > > > > +
> > > > > +	/* Does the address range wrap, or is the span zero-sized? */
> > > > > +	BUG_ON(start + len <= start);
> > > > 
> > > > I'd replace these BUG_ON()s with
> > > > 
> > > > 	if (WARN_ON())
> > > > 		 return -EINVAL;
> > > 
> > > I believe BUG_ON() is used because these parameters should have been
> > > checked in userfaultfd_writeprotect() already by the common
> > > validate_range() even before calling mwriteprotect_range().  So I'm
> > > fine with the WARN_ON() approach but I'd slightly prefer to simply
> > > keep the patch as is to keep Jerome's r-b if you won't disagree. :)
> > 
> > Right, userfaultfd_writeprotect() should check these parameters and if it
> > didn't it was a bug indeed. But still, it's not severe enough to crash the
> > kernel.
> > 
> > I hope Jerome wouldn't mind to keep his r-b with s/BUG_ON/WARN_ON ;-)
> > 
> > With this change you can also add 
> > 
> > Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> 
> Thanks!  Though before I change anything... please note that the
> BUG_ON()s are really what we've done in existing MISSING code.  One
> example is userfaultfd_copy() which did validate_range() first, then
> in __mcopy_atomic() we've used BUG_ON()s.  They make sense to me
> because userspace should never be able to trigger it.  And if we
> really want to change the BUG_ON()s in this patch, IMHO we probably
> want to change the other BUG_ON()s as well, then that can be a
> standalone patch or patchset to address another issue...

Yeah, we have quite a lot of them, so doing the replacement in a separate
patch makes perfect sense.
 
> (and if we really want to use WARN_ON, I would prefer WARN_ON_ONCE, or
>  directly return the errors to avoid DOS).

Agree.

> I'll see how you'd prefer that I move on with this patch.

Let's keep this patch as is and make the replacement on top of the WP
series. Feel free to add r-b.
 
> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2 26/26] userfaultfd: selftests: add write-protect test
  2019-02-26  6:58   ` Mike Rapoport
@ 2019-02-26  7:52     ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-26  7:52 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 26, 2019 at 08:58:36AM +0200, Mike Rapoport wrote:
> On Tue, Feb 12, 2019 at 10:56:32AM +0800, Peter Xu wrote:
> > This patch adds uffd tests for write protection.
> > 
> > Instead of introducing new tests for it, let's simply squash the uffd-wp
> > tests into the existing uffd-missing test cases.  Changes are:
> > 
> > (1) Bouncing tests
> > 
> >   We do the write-protection in two ways during the bouncing test:
> > 
> >   - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
> >     we'll make sure that in each bounce process every single page will
> >     fault at least twice: once for MISSING, once for WP.
> > 
> >   - By directly calling UFFDIO_WRITEPROTECT on already-faulted memory:
> >     To further torture the explicit page protection procedures of
> >     uffd-wp, we split each bounce procedure into two halves (in the
> >     background thread): the first half will be MISSING+WP for each
> >     page as explained above.  After the first half, we write protect
> >     the faulted region in the background thread to make sure at least
> >     half of the pages will be write protected again; this is also the
> >     first test of the new UFFDIO_WRITEPROTECT call.  Then we continue
> >     with the 2nd half, which will contain both MISSING and WP faulting
> >     tests for the 2nd half and WP-only faults from the 1st half.
> > 
> > (2) Event/Signal test
> > 
> >   Mostly the previous tests, but doing MISSING+WP for each page.  For
> >   the sigbus-mode test we'll need to provide a standalone path to handle
> >   the write protection faults.
> > 
> > For all tests, gather statistics for uffd-wp pages as well.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  tools/testing/selftests/vm/userfaultfd.c | 154 ++++++++++++++++++-----
> >  1 file changed, 126 insertions(+), 28 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> > index e5d12c209e09..57b5ac02080a 100644
> > --- a/tools/testing/selftests/vm/userfaultfd.c
> > +++ b/tools/testing/selftests/vm/userfaultfd.c
> > @@ -56,6 +56,7 @@
> >  #include <linux/userfaultfd.h>
> >  #include <setjmp.h>
> >  #include <stdbool.h>
> > +#include <assert.h>
> > 
> >  #include "../kselftest.h"
> > 
> > @@ -78,6 +79,8 @@ static int test_type;
> >  #define ALARM_INTERVAL_SECS 10
> >  static volatile bool test_uffdio_copy_eexist = true;
> >  static volatile bool test_uffdio_zeropage_eexist = true;
> > +/* Whether to test uffd write-protection */
> > +static bool test_uffdio_wp = false;
> > 
> >  static bool map_shared;
> >  static int huge_fd;
> > @@ -92,6 +95,7 @@ pthread_attr_t attr;
> >  struct uffd_stats {
> >  	int cpu;
> >  	unsigned long missing_faults;
> > +	unsigned long wp_faults;
> >  };
> > 
> >  /* pthread_mutex_t starts at page offset 0 */
> > @@ -141,9 +145,29 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
> >  	for (i = 0; i < n_cpus; i++) {
> >  		uffd_stats[i].cpu = i;
> >  		uffd_stats[i].missing_faults = 0;
> > +		uffd_stats[i].wp_faults = 0;
> >  	}
> >  }
> > 
> > +static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
> > +{
> > +	int i;
> > +	unsigned long long miss_total = 0, wp_total = 0;
> > +
> > +	for (i = 0; i < n_cpus; i++) {
> > +		miss_total += stats[i].missing_faults;
> > +		wp_total += stats[i].wp_faults;
> > +	}
> > +
> > +	printf("userfaults: %llu missing (", miss_total);
> > +	for (i = 0; i < n_cpus; i++)
> > +		printf("%lu+", stats[i].missing_faults);
> > +	printf("\b), %llu wp (", wp_total);
> > +	for (i = 0; i < n_cpus; i++)
> > +		printf("%lu+", stats[i].wp_faults);
> > +	printf("\b)\n");
> > +}
> > +
> >  static int anon_release_pages(char *rel_area)
> >  {
> >  	int ret = 0;
> > @@ -264,19 +288,15 @@ struct uffd_test_ops {
> >  	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
> >  };
> > 
> > -#define ANON_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
> > -					 (1 << _UFFDIO_COPY) | \
> > -					 (1 << _UFFDIO_ZEROPAGE))
> > -
> >  static struct uffd_test_ops anon_uffd_test_ops = {
> > -	.expected_ioctls = ANON_EXPECTED_IOCTLS,
> > +	.expected_ioctls = UFFD_API_RANGE_IOCTLS,
> >  	.allocate_area	= anon_allocate_area,
> >  	.release_pages	= anon_release_pages,
> >  	.alias_mapping = noop_alias_mapping,
> >  };
> > 
> >  static struct uffd_test_ops shmem_uffd_test_ops = {
> > -	.expected_ioctls = ANON_EXPECTED_IOCTLS,
> > +	.expected_ioctls = UFFD_API_RANGE_IOCTLS,
> 
> Doesn't UFFD_API_RANGE_IOCTLS include UFFDIO_WP, which is not supported
> for shmem?

Yes.  It probably didn't fail the test case because the test only
registers the shmem region with UFFDIO_REGISTER_MODE_MISSING, and for
now we blindly return the _UFFDIO_WRITEPROTECT capability whenever the
register ioctl succeeds.  However, the UFFDIO_REGISTER ioctl itself
will still fail directly if someone requests
UFFDIO_REGISTER_MODE_WP mode on shmem.

So maybe I should explicitly remove the _UFFDIO_WRITEPROTECT bit in
userfaultfd_register() if I detect any non-anonymous regions?  Then
here I would revert to ANON_EXPECTED_IOCTLS for shmem_uffd_test_ops in
the tests.

> 
> >  	.allocate_area	= shmem_allocate_area,
> >  	.release_pages	= shmem_release_pages,
> >  	.alias_mapping = noop_alias_mapping,
> 
> ...
> 
> -- 
> Sincerely yours,
> Mike.
> 

Regards,

-- 
Peter Xu


* Re: [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range
  2019-02-26  7:46           ` Mike Rapoport
@ 2019-02-26  7:54             ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-26  7:54 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert, Rik van Riel

On Tue, Feb 26, 2019 at 09:46:12AM +0200, Mike Rapoport wrote:

[...]

> > > > > > +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > > > > > +			unsigned long len, bool enable_wp, bool *mmap_changing)
> > > > > > +{
> > > > > > +	struct vm_area_struct *dst_vma;
> > > > > > +	pgprot_t newprot;
> > > > > > +	int err;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Sanitize the command parameters:
> > > > > > +	 */
> > > > > > +	BUG_ON(start & ~PAGE_MASK);
> > > > > > +	BUG_ON(len & ~PAGE_MASK);
> > > > > > +
> > > > > > +	/* Does the address range wrap, or is the span zero-sized? */
> > > > > > +	BUG_ON(start + len <= start);
> > > > > 
> > > > > I'd replace these BUG_ON()s with
> > > > > 
> > > > > 	if (WARN_ON())
> > > > > 		 return -EINVAL;
> > > > 
> > > > I believe BUG_ON() is used because these parameters should have been
> > > > checked in userfaultfd_writeprotect() already by the common
> > > > validate_range() even before calling mwriteprotect_range().  So I'm
> > > > fine with the WARN_ON() approach but I'd slightly prefer to simply
> > > > keep the patch as is so as to keep Jerome's r-b, if you don't mind. :)
> > > 
> > > Right, userfaultfd_writeprotect() should check these parameters and if it
> > > didn't it was a bug indeed. But still, it's not severe enough to crash the
> > > kernel.
> > > 
> > > I hope Jerome wouldn't mind keeping his r-b with s/BUG_ON/WARN_ON ;-)
> > > 
> > > With this change you can also add 
> > > 
> > > Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Thanks!  Though before I change anything... please note that the
> > BUG_ON()s are really what we've done in existing MISSING code.  One
> > example is userfaultfd_copy() which did validate_range() first, then
> > in __mcopy_atomic() we've used BUG_ON()s.  They make sense to me
> > because userspace should never be able to trigger it.  And if we
> > really want to change the BUG_ON()s in this patch, IMHO we probably
> > want to change the other BUG_ON()s as well, then that can be a
> > standalone patch or patchset to address another issue...
> 
> Yeah, we have quite a lot of them, so doing the replacement in a separate
> patch makes perfect sense.
>  
> > (and if we really want to use WARN_ON, I would prefer WARN_ON_ONCE, or
> >  directly return the errors to avoid DOS).
> 
> Agree.
> 
> > I'll see how you'd prefer that I move on with this patch.
> 
> Let's keep this patch as is and make the replacement on top of the WP
> series. Feel free to add r-b.

Great!  I'll do.  Thanks,

-- 
Peter Xu


* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-26  7:41         ` Peter Xu
@ 2019-02-26  8:00           ` Mike Rapoport
  2019-02-28  2:47             ` Peter Xu
  0 siblings, 1 reply; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  8:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 26, 2019 at 03:41:17PM +0800, Peter Xu wrote:
> On Tue, Feb 26, 2019 at 09:29:33AM +0200, Mike Rapoport wrote:
> > On Tue, Feb 26, 2019 at 02:24:52PM +0800, Peter Xu wrote:
> > > On Mon, Feb 25, 2019 at 11:09:35PM +0200, Mike Rapoport wrote:
> > > > On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> > > > > It does not make sense to try to wake up any waiting thread when we're
> > > > > write-protecting a memory region.  Only wake up when resolving a write
> > > > > protected page fault.
> > > > > 
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > >  fs/userfaultfd.c | 13 ++++++++-----
> > > > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > > index 81962d62520c..f1f61a0278c2 100644
> > > > > --- a/fs/userfaultfd.c
> > > > > +++ b/fs/userfaultfd.c
> > > > > @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > >  	struct uffdio_writeprotect uffdio_wp;
> > > > >  	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > > >  	struct userfaultfd_wake_range range;
> > > > > +	bool mode_wp, mode_dontwake;
> > > > > 
> > > > >  	if (READ_ONCE(ctx->mmap_changing))
> > > > >  		return -EAGAIN;
> > > > > @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > >  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> > > > >  			       UFFDIO_WRITEPROTECT_MODE_WP))
> > > > >  		return -EINVAL;
> > > > > -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> > > > > -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> > > > > +
> > > > > +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> > > > > +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> > > > > +
> > > > > +	if (mode_wp && mode_dontwake)
> > > > >  		return -EINVAL;
> > > > 
> > > > This actually means the opposite of the commit message text ;-)
> > > > 
> > > > Is any dependency between _WP and _DONTWAKE needed at all?
> > > 
> > > So this is indeed confusing at least, because both you and Jerome have
> > > asked the same question... :)
> > > 
> > > My understanding is that we don't have any reason to wake up any
> > > thread when we are write-protecting a range, in that sense the flag
> > > UFFDIO_WRITEPROTECT_MODE_DONTWAKE is already meaningless in the
> > > UFFDIO_WRITEPROTECT ioctl context.  So before everything here's how
> > > these flags are defined:
> > > 
> > > struct uffdio_writeprotect {
> > > 	struct uffdio_range range;
> > > 	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
> > > #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> > > #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> > > 	__u64 mode;
> > > };
> > > 
> > > To make it clear, we simply define it as "DONTWAKE is valid only with
> > > !WP".  With that, "mode_wp && mode_dontwake" is indeed a
> > > meaningless flag combination.  Though please note that it does not
> > > mean that the operation ("don't wake up the thread") is meaningless -
> > > that's what we'll do no matter what when WP==1.  IMHO it's only about
> > > the interface not the behavior.
> > > 
> > > I don't have a good way to make this clearer because firstly we'll
> > > need the WP flag to mark whether we're protecting or unprotecting the
> > > pages.  Later on, we need DONTWAKE for page fault handling case to
> > > mark that we don't want to wake up the waiting thread now.  So both
> > > the flags have their reason to stay so far.  Then with all these in
> > > mind what I can think of is only to forbid using DONTWAKE in WP case,
> > > and that's how above definition comes (I believe, because it was
> > > defined that way even before I started to work on it and I think it
> > > makes sense).
> > 
> > There's no argument about how DONTWAKE is used with !WP: the uffd
> > monitor calls userfaultfd_writeprotect() in response to a WP page
> > fault, asking to clear write protection on some range, but it does not
> > want to wake the faulting thread yet because it will use uffd_wake()
> > later.
> > 
> > Still, I can't grok the usage of DONTWAKE with WP=1. In my understanding,
> > in this case userfaultfd_writeprotect() is called unrelated to page faults,
> > and the monitored thread runs freely, so why should it be woken at all?
> 
> Exactly this is how I understand it.  And that's why I wrote this
> patch to remove the extra wakeup() since I think it's unnecessary.
> 
> > 
> > And what happens, if the thread is waiting on a missing page fault and we
> > do userfaultfd_writeprotect(WP=1) at the same time?
> 
> Then IMHO the userfaultfd_writeprotect() will be a noop simply because
> the page is still missing.  Here if with the old code (before this
> patch) we'll probably even try to wake up this thread but this thread
> should just fault again on the same address due to the fact that the
> page is missing.  After this patch the monitored thread should
> continue to wait on the missing page.

So, my understanding of what we have is:

userfaultfd_writeprotect() can be used either to mark a region as write
protected or to resolve WP page fault.
In the first case DONTWAKE does not make sense and we forbid setting it
with WP=1.
In the second case it's the uffd monitor's decision whether to wake up the
faulting thread immediately after #PF is resolved or later, so with WP=0 we
allow DONTWAKE.

I suggest extending the comment in the definition of
'struct uffdio_writeprotect' to something like

/*
 * Write protecting a region (WP=1) is unrelated to page faults, therefore
 * DONTWAKE flag is meaningless with WP=1.
 * Removing write protection (WP=0) in response to a page fault wakes the
 * faulting task unless DONTWAKE is set.
 */
 
And a documentation update along these lines would be appreciated :)

> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-12  2:56 ` [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect Peter Xu
  2019-02-21 18:36   ` Jerome Glisse
  2019-02-25 21:09   ` Mike Rapoport
@ 2019-02-26  8:00   ` Mike Rapoport
  2 siblings, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  8:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> It does not make sense to try to wake up any waiting thread when we're
> write-protecting a memory region.  Only wake up when resolving a write
> protected page fault.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  fs/userfaultfd.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 81962d62520c..f1f61a0278c2 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	struct uffdio_writeprotect uffdio_wp;
>  	struct uffdio_writeprotect __user *user_uffdio_wp;
>  	struct userfaultfd_wake_range range;
> +	bool mode_wp, mode_dontwake;
> 
>  	if (READ_ONCE(ctx->mmap_changing))
>  		return -EAGAIN;
> @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>  			       UFFDIO_WRITEPROTECT_MODE_WP))
>  		return -EINVAL;
> -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> +
> +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> +
> +	if (mode_wp && mode_dontwake)
>  		return -EINVAL;
> 
>  	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> -				  uffdio_wp.range.len, uffdio_wp.mode &
> -				  UFFDIO_WRITEPROTECT_MODE_WP,
> +				  uffdio_wp.range.len, mode_wp,
>  				  &ctx->mmap_changing);
>  	if (ret)
>  		return ret;
> 
> -	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
> +	if (!mode_wp && !mode_dontwake) {
>  		range.start = uffdio_wp.range.start;
>  		range.len = uffdio_wp.range.len;
>  		wake_userfault(ctx, &range);
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP
  2019-02-26  5:09     ` Peter Xu
@ 2019-02-26  8:28       ` Mike Rapoport
  0 siblings, 0 replies; 113+ messages in thread
From: Mike Rapoport @ 2019-02-26  8:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 26, 2019 at 01:09:42PM +0800, Peter Xu wrote:
> On Mon, Feb 25, 2019 at 05:58:37PM +0200, Mike Rapoport wrote:
> > On Tue, Feb 12, 2019 at 10:56:16AM +0800, Peter Xu wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > This allows UFFDIO_COPY to map pages wrprotected.
> >                                        write protected please :)
> 
> Sure!
> 
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Except for two additional nits below
> > 
> > Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > > ---
> > >  fs/userfaultfd.c                 |  5 +++--
> > >  include/linux/userfaultfd_k.h    |  2 +-
> > >  include/uapi/linux/userfaultfd.h | 11 +++++-----
> > >  mm/userfaultfd.c                 | 36 ++++++++++++++++++++++----------
> > >  4 files changed, 35 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index b397bc3b954d..3092885c9d2c 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -1683,11 +1683,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
> > >  	ret = -EINVAL;
> > >  	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
> > >  		goto out;
> > > -	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
> > > +	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
> > >  		goto out;
> > >  	if (mmget_not_zero(ctx->mm)) {
> > >  		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
> > > -				   uffdio_copy.len, &ctx->mmap_changing);
> > > +				   uffdio_copy.len, &ctx->mmap_changing,
> > > +				   uffdio_copy.mode);
> > >  		mmput(ctx->mm);
> > >  	} else {
> > >  		return -ESRCH;
> > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > > index c6590c58ce28..765ce884cec0 100644
> > > --- a/include/linux/userfaultfd_k.h
> > > +++ b/include/linux/userfaultfd_k.h
> > > @@ -34,7 +34,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
> > > 
> > >  extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
> > >  			    unsigned long src_start, unsigned long len,
> > > -			    bool *mmap_changing);
> > > +			    bool *mmap_changing, __u64 mode);
> > >  extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> > >  			      unsigned long dst_start,
> > >  			      unsigned long len,
> > > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > > index 48f1a7c2f1f0..297cb044c03f 100644
> > > --- a/include/uapi/linux/userfaultfd.h
> > > +++ b/include/uapi/linux/userfaultfd.h
> > > @@ -203,13 +203,14 @@ struct uffdio_copy {
> > >  	__u64 dst;
> > >  	__u64 src;
> > >  	__u64 len;
> > > +#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
> > >  	/*
> > > -	 * There will be a wrprotection flag later that allows to map
> > > -	 * pages wrprotected on the fly. And such a flag will be
> > > -	 * available if the wrprotection ioctl are implemented for the
> > > -	 * range according to the uffdio_register.ioctls.
> > > +	 * UFFDIO_COPY_MODE_WP will map the page wrprotected on the
> > > +	 * fly. UFFDIO_COPY_MODE_WP is available only if the
> > > +	 * wrprotection ioctl are implemented for the range according
> > 
> >                              ^ is
> 
> Will fix.
> 
> > 
> > > +	 * to the uffdio_register.ioctls.
> > >  	 */
> > > -#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
> > > +#define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
> > >  	__u64 mode;
> > > 
> > >  	/*
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index d59b5a73dfb3..73a208c5c1e7 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >  			    struct vm_area_struct *dst_vma,
> > >  			    unsigned long dst_addr,
> > >  			    unsigned long src_addr,
> > > -			    struct page **pagep)
> > > +			    struct page **pagep,
> > > +			    bool wp_copy)
> > >  {
> > >  	struct mem_cgroup *memcg;
> > >  	pte_t _dst_pte, *dst_pte;
> > > @@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >  	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
> > >  		goto out_release;
> > > 
> > > -	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > > -	if (dst_vma->vm_flags & VM_WRITE)
> > > -		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> > > +	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> > > +	if (dst_vma->vm_flags & VM_WRITE && !wp_copy)
> > > +		_dst_pte = pte_mkwrite(_dst_pte);
> > > 
> > >  	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > >  	if (dst_vma->vm_file) {
> > > @@ -399,7 +400,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> > >  						unsigned long dst_addr,
> > >  						unsigned long src_addr,
> > >  						struct page **page,
> > > -						bool zeropage)
> > > +						bool zeropage,
> > > +						bool wp_copy)
> > >  {
> > >  	ssize_t err;
> > > 
> > > @@ -416,11 +418,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> > >  	if (!(dst_vma->vm_flags & VM_SHARED)) {
> > >  		if (!zeropage)
> > >  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> > > -					       dst_addr, src_addr, page);
> > > +					       dst_addr, src_addr, page,
> > > +					       wp_copy);
> > >  		else
> > >  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
> > >  						 dst_vma, dst_addr);
> > >  	} else {
> > > +		VM_WARN_ON(wp_copy); /* WP only available for anon */
> > >  		if (!zeropage)
> > >  			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
> > >  						     dst_vma, dst_addr,
> > > @@ -438,7 +442,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> > >  					      unsigned long src_start,
> > >  					      unsigned long len,
> > >  					      bool zeropage,
> > > -					      bool *mmap_changing)
> > > +					      bool *mmap_changing,
> > > +					      __u64 mode)
> > >  {
> > >  	struct vm_area_struct *dst_vma;
> > >  	ssize_t err;
> > > @@ -446,6 +451,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> > >  	unsigned long src_addr, dst_addr;
> > >  	long copied;
> > >  	struct page *page;
> > > +	bool wp_copy;
> > > 
> > >  	/*
> > >  	 * Sanitize the command parameters:
> > > @@ -502,6 +508,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> > >  	    dst_vma->vm_flags & VM_SHARED))
> > >  		goto out_unlock;
> > > 
> > > +	/*
> > > +	 * validate 'mode' now that we know the dst_vma: don't allow
> > > +	 * a wrprotect copy if the userfaultfd didn't register as WP.
> > > +	 */
> > > +	wp_copy = mode & UFFDIO_COPY_MODE_WP;
> > > +	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
> > > +		goto out_unlock;
> 
> [1]
> 
> > > +
> > >  	/*
> > >  	 * If this is a HUGETLB vma, pass off to appropriate routine
> > >  	 */
> > 
> > I think for hugetlb we should return an error if wp_copy==true.
> > It might be worth adding a wp_copy parameter to __mcopy_atomic_hugetlb()
> > in advance and returning the error from there, in the hope it will also
> > support UFFD_WP some day :)
> 
> Note that we should already fail even earlier if someone wants to register
> a hugetlbfs VMA with UFFD_WP, because vma_can_userfault() only allows
> anonymous memory for it:
> 
> static inline bool vma_can_userfault(struct vm_area_struct *vma,
> 				     unsigned long vm_flags)
> {
> 	/* FIXME: add WP support to hugetlbfs and shmem */
> 	return vma_is_anonymous(vma) ||
> 		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> 		 !(vm_flags & VM_UFFD_WP));
> }
> 
> And, as long as a VMA is not tagged with UFFD_WP, the page copy will
> fail with -EINVAL directly above at [1] when setting the wp_copy flag.
> So IMHO we should have already covered the case.
> 
> Considering these, I would think we could simply postpone the changes
> to __mcopy_atomic_hugetlb() until adding hugetlbfs support on uffd-wp.
> Mike, what do you think?

Ok, fair enough.
 
> Thanks!
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
  2019-02-26  8:00           ` Mike Rapoport
@ 2019-02-28  2:47             ` Peter Xu
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Xu @ 2019-02-28  2:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, David Hildenbrand, Hugh Dickins,
	Maya Gokhale, Jerome Glisse, Pavel Emelyanov, Johannes Weiner,
	Martin Cracauer, Shaohua Li, Marty McFadden, Andrea Arcangeli,
	Mike Kravetz, Denis Plotnikov, Mike Rapoport, Mel Gorman,
	Kirill A . Shutemov, Dr . David Alan Gilbert

On Tue, Feb 26, 2019 at 10:00:29AM +0200, Mike Rapoport wrote:
> On Tue, Feb 26, 2019 at 03:41:17PM +0800, Peter Xu wrote:
> > On Tue, Feb 26, 2019 at 09:29:33AM +0200, Mike Rapoport wrote:
> > > On Tue, Feb 26, 2019 at 02:24:52PM +0800, Peter Xu wrote:
> > > > On Mon, Feb 25, 2019 at 11:09:35PM +0200, Mike Rapoport wrote:
> > > > > On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote:
> > > > > > It does not make sense to try to wake up any waiting thread when we're
> > > > > > write-protecting a memory region.  Only wake up when resolving a write
> > > > > > protected page fault.
> > > > > > 
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > >  fs/userfaultfd.c | 13 ++++++++-----
> > > > > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > > > index 81962d62520c..f1f61a0278c2 100644
> > > > > > --- a/fs/userfaultfd.c
> > > > > > +++ b/fs/userfaultfd.c
> > > > > > @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > > >  	struct uffdio_writeprotect uffdio_wp;
> > > > > >  	struct uffdio_writeprotect __user *user_uffdio_wp;
> > > > > >  	struct userfaultfd_wake_range range;
> > > > > > +	bool mode_wp, mode_dontwake;
> > > > > > 
> > > > > >  	if (READ_ONCE(ctx->mmap_changing))
> > > > > >  		return -EAGAIN;
> > > > > > @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > > >  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> > > > > >  			       UFFDIO_WRITEPROTECT_MODE_WP))
> > > > > >  		return -EINVAL;
> > > > > > -	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
> > > > > > -	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
> > > > > > +
> > > > > > +	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> > > > > > +	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> > > > > > +
> > > > > > +	if (mode_wp && mode_dontwake)
> > > > > >  		return -EINVAL;
> > > > > 
> > > > > This actually means the opposite of the commit message text ;-)
> > > > > 
> > > > > Is any dependency between _WP and _DONTWAKE needed at all?
> > > > 
> > > > So this is indeed confusing at least, because both you and Jerome have
> > > > asked the same question... :)
> > > > 
> > > > My understanding is that we don't have any reason to wake up any
> > > > thread when we are write-protecting a range, in that sense the flag
> > > > UFFDIO_WRITEPROTECT_MODE_DONTWAKE is already meaningless in the
> > > > UFFDIO_WRITEPROTECT ioctl context.  So before everything here's how
> > > > these flags are defined:
> > > > 
> > > > struct uffdio_writeprotect {
> > > > 	struct uffdio_range range;
> > > > 	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
> > > > #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> > > > #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> > > > 	__u64 mode;
> > > > };
> > > > 
> > > > To make it clear, we simply define it as "DONTWAKE is valid only with
> > > > !WP".  With that, "mode_wp && mode_dontwake" is indeed a
> > > > meaningless flag combination.  Though please note that it does not
> > > > mean that the operation ("don't wake up the thread") is meaningless -
> > > > that's what we'll do no matter what when WP==1.  IMHO it's only about
> > > > the interface not the behavior.
> > > > 
> > > > I don't have a good way to make this clearer because firstly we'll
> > > > need the WP flag to mark whether we're protecting or unprotecting the
> > > > pages.  Later on, we need DONTWAKE for page fault handling case to
> > > > mark that we don't want to wake up the waiting thread now.  So both
> > > > the flags have their reason to stay so far.  Then with all these in
> > > > mind what I can think of is only to forbid using DONTWAKE in WP case,
> > > > and that's how above definition comes (I believe, because it was
> > > > defined that way even before I started to work on it and I think it
> > > > makes sense).
> > > 
> > > There's no argument about how DONTWAKE can be used with !WP:
> > > userfaultfd_writeprotect() is called when the uffd monitor responds
> > > to a WP page fault; it asks to clear write protection on some
> > > range, but it does not want to wake the faulting thread yet -
> > > rather, it will use uffd_wake() later.
> > > 
> > > Still, I can't grok the usage of DONTWAKE with WP=1.  In my
> > > understanding, in this case userfaultfd_writeprotect() is called
> > > unrelated to page faults and the monitored thread runs freely, so
> > > why should it be woken at all?
> > 
> > Exactly - this is how I understand it too.  And that's why I wrote
> > this patch to remove the extra wakeup, since I think it's unnecessary.
> > 
> > > 
> > > And what happens, if the thread is waiting on a missing page fault and we
> > > do userfaultfd_writeprotect(WP=1) at the same time?
> > 
> > Then IMHO userfaultfd_writeprotect() will be a no-op simply because
> > the page is still missing.  With the old code (before this patch) we
> > would even try to wake up this thread, but it would just fault again
> > on the same address because the page is still missing.  After this
> > patch the monitored thread will simply continue to wait on the
> > missing page.
> 
> So, my understanding of what we have is:
> 
> userfaultfd_writeprotect() can be used either to mark a region as
> write-protected or to resolve a WP page fault.
> In the first case DONTWAKE does not make sense, so we forbid setting it
> with WP=1.
> In the second case it's the uffd monitor's decision whether to wake up
> the faulting thread immediately after the #PF is resolved or later, so
> with WP=0 we allow DONTWAKE.

Yes exactly.
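Put as code, the two cases might look like this from the monitor's side (a hedged sketch: the struct layout is copied from the patch's uapi definition, but the ioctl calls are shown only as comments so the fragment stays self-contained, and the helper names and addresses are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors struct uffdio_range / uffdio_writeprotect from the patch. */
struct uffdio_range {
	uint64_t start;
	uint64_t len;
};

struct uffdio_writeprotect {
	struct uffdio_range range;
#define UFFDIO_WRITEPROTECT_MODE_WP		((uint64_t)1 << 0)
#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((uint64_t)1 << 1)
	uint64_t mode;
};

/* Case 1: write-protect a region.  Unrelated to any fault, so no DONTWAKE. */
static struct uffdio_writeprotect wp_region(uint64_t start, uint64_t len)
{
	struct uffdio_writeprotect req = {
		.range = { .start = start, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	/* ioctl(uffd, UFFDIO_WRITEPROTECT, &req); */
	return req;
}

/*
 * Case 2: resolve a WP fault (WP=0) but defer the wakeup; the monitor
 * would issue UFFDIO_WAKE on the same range later.
 */
static struct uffdio_writeprotect resolve_wp_fault(uint64_t start, uint64_t len)
{
	struct uffdio_writeprotect req = {
		.range = { .start = start, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_DONTWAKE,	/* WP bit clear */
	};
	/* ioctl(uffd, UFFDIO_WRITEPROTECT, &req); */
	/* ... later: ioctl(uffd, UFFDIO_WAKE, &req.range); */
	return req;
}
```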

> 
> I suggest to extend the comment in the definition of 
> 'struct uffdio_writeprotect' to something like
> 
> /*
>  * Write protecting a region (WP=1) is unrelated to page faults, therefore
>  * DONTWAKE flag is meaningless with WP=1.
>  * Removing write protection (WP=0) in response to a page fault wakes the
>  * faulting task unless DONTWAKE is set.
>  */
>  
> And a documentation update along these lines would be appreciated :)

Thanks for the write-up!  I'm stealing the whole paragraph into the
patch where uffdio_writeprotect is introduced.

Regards,

-- 
Peter Xu


Thread overview: 113+ messages
2019-02-12  2:56 [PATCH v2 00/26] userfaultfd: write protection support Peter Xu
2019-02-12  2:56 ` [PATCH v2 01/26] mm: gup: rename "nonblocking" to "locked" where proper Peter Xu
2019-02-21 15:17   ` Jerome Glisse
2019-02-22  3:42     ` Peter Xu
2019-02-12  2:56 ` [PATCH v2 02/26] mm: userfault: return VM_FAULT_RETRY on signals Peter Xu
2019-02-21 15:29   ` Jerome Glisse
2019-02-22  3:51     ` Peter Xu
2019-02-12  2:56 ` [PATCH v2 03/26] userfaultfd: don't retake mmap_sem to emulate NOPAGE Peter Xu
2019-02-21 15:34   ` Jerome Glisse
2019-02-12  2:56 ` [PATCH v2 04/26] mm: allow VM_FAULT_RETRY for multiple times Peter Xu
2019-02-13  3:34   ` Peter Xu
2019-02-20 11:48     ` Peter Xu
2019-02-21  8:56   ` [PATCH v2.1 " Peter Xu
2019-02-21 15:53     ` Jerome Glisse
2019-02-22  4:25       ` Peter Xu
2019-02-22 15:11         ` Jerome Glisse
2019-02-25  6:19           ` Peter Xu
2019-02-12  2:56 ` [PATCH v2 05/26] mm: gup: " Peter Xu
2019-02-21 16:06   ` Jerome Glisse
2019-02-22  4:41     ` Peter Xu
2019-02-22 15:13       ` Jerome Glisse
2019-02-12  2:56 ` [PATCH v2 06/26] userfaultfd: wp: add helper for writeprotect check Peter Xu
2019-02-21 16:07   ` Jerome Glisse
2019-02-25 15:41   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 07/26] userfaultfd: wp: hook userfault handler to write protection fault Peter Xu
2019-02-21 16:25   ` Jerome Glisse
2019-02-25 15:43   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 08/26] userfaultfd: wp: add WP pagetable tracking to x86 Peter Xu
2019-02-21 17:20   ` Jerome Glisse
2019-02-25 15:48   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 09/26] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Peter Xu
2019-02-21 17:21   ` Jerome Glisse
2019-02-25 17:12   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 10/26] userfaultfd: wp: add UFFDIO_COPY_MODE_WP Peter Xu
2019-02-21 17:29   ` Jerome Glisse
2019-02-22  7:11     ` Peter Xu
2019-02-22 15:15       ` Jerome Glisse
2019-02-25  6:45         ` Peter Xu
2019-02-25 15:58   ` Mike Rapoport
2019-02-26  5:09     ` Peter Xu
2019-02-26  8:28       ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 11/26] mm: merge parameters for change_protection() Peter Xu
2019-02-21 17:32   ` Jerome Glisse
2019-02-12  2:56 ` [PATCH v2 12/26] userfaultfd: wp: apply _PAGE_UFFD_WP bit Peter Xu
2019-02-21 17:44   ` Jerome Glisse
2019-02-22  7:31     ` Peter Xu
2019-02-22 15:17       ` Jerome Glisse
2019-02-25 18:00   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 13/26] mm: export wp_page_copy() Peter Xu
2019-02-21 17:44   ` Jerome Glisse
2019-02-12  2:56 ` [PATCH v2 14/26] userfaultfd: wp: handle COW properly for uffd-wp Peter Xu
2019-02-21 18:04   ` Jerome Glisse
2019-02-22  8:46     ` Peter Xu
2019-02-22 15:35       ` Jerome Glisse
2019-02-25  7:13         ` Peter Xu
2019-02-25 15:32           ` Jerome Glisse
2019-02-12  2:56 ` [PATCH v2 15/26] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork Peter Xu
2019-02-21 18:06   ` Jerome Glisse
2019-02-22  9:09     ` Peter Xu
2019-02-22 15:36       ` Jerome Glisse
2019-02-25 18:19   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 16/26] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Peter Xu
2019-02-21 18:07   ` Jerome Glisse
2019-02-25 18:20   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 17/26] userfaultfd: wp: support swap and page migration Peter Xu
2019-02-21 18:16   ` Jerome Glisse
2019-02-25  7:48     ` Peter Xu
2019-02-25 18:28   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 18/26] khugepaged: skip collapse if uffd-wp detected Peter Xu
2019-02-21 18:17   ` Jerome Glisse
2019-02-25 18:50   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 19/26] userfaultfd: introduce helper vma_find_uffd Peter Xu
2019-02-21 18:19   ` Jerome Glisse
2019-02-25 20:48   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 20/26] userfaultfd: wp: support write protection for userfault vma range Peter Xu
2019-02-21 18:23   ` Jerome Glisse
2019-02-25  8:16     ` Peter Xu
2019-02-25 20:52   ` Mike Rapoport
2019-02-26  6:06     ` Peter Xu
2019-02-26  6:43       ` Mike Rapoport
2019-02-26  7:20         ` Peter Xu
2019-02-26  7:46           ` Mike Rapoport
2019-02-26  7:54             ` Peter Xu
2019-02-12  2:56 ` [PATCH v2 21/26] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Peter Xu
2019-02-21 18:28   ` Jerome Glisse
2019-02-25  8:31     ` Peter Xu
2019-02-25 21:03   ` Mike Rapoport
2019-02-26  6:30     ` Peter Xu
2019-02-12  2:56 ` [PATCH v2 22/26] userfaultfd: wp: enabled write protection in userfaultfd API Peter Xu
2019-02-21 18:29   ` Jerome Glisse
2019-02-25  8:34     ` Peter Xu
2019-02-12  2:56 ` [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect Peter Xu
2019-02-21 18:36   ` Jerome Glisse
2019-02-25  8:58     ` Peter Xu
2019-02-25 21:15       ` Mike Rapoport
2019-02-25 21:09   ` Mike Rapoport
2019-02-26  6:24     ` Peter Xu
2019-02-26  7:29       ` Mike Rapoport
2019-02-26  7:41         ` Peter Xu
2019-02-26  8:00           ` Mike Rapoport
2019-02-28  2:47             ` Peter Xu
2019-02-26  8:00   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 24/26] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Peter Xu
2019-02-21 18:38   ` Jerome Glisse
2019-02-25 21:19   ` Mike Rapoport
2019-02-26  6:53     ` Peter Xu
2019-02-26  7:04       ` Mike Rapoport
2019-02-26  7:42         ` Peter Xu
2019-02-12  2:56 ` [PATCH v2 25/26] userfaultfd: selftests: refactor statistics Peter Xu
2019-02-26  6:50   ` Mike Rapoport
2019-02-12  2:56 ` [PATCH v2 26/26] userfaultfd: selftests: add write-protect test Peter Xu
2019-02-26  6:58   ` Mike Rapoport
2019-02-26  7:52     ` Peter Xu
