* [PATCH v5 00/10] userfaultfd: add minor fault handling
From: Axel Rasmussen @ 2021-02-10 21:21 UTC
  To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Axel Rasmussen, Cannon Matthews, Dr . David Alan Gilbert,
	David Rientjes, Mina Almasry, Oliver Upton

Changelog
=========

v4->v5:
- Typo fix in the documentation update.
- Removed comment in vma_can_userfault. The same information is better covered
  in the documentation update, so the comment is unnecessary (and slightly
  confusing as written).
- Reworded comment for MCOPY_ATOMIC_CONTINUE mode.
- For non-shared CONTINUE, only make the PTE(s) non-writable, don't change flags
  on the VMA.
- In hugetlb_mcopy_atomic_pte, always unlock the page in MCOPY_ATOMIC_CONTINUE,
  even if we don't have VM_SHARED.
- In hugetlb_mcopy_atomic_pte, introduce "bool is_continue" to make that kind of
  mode check more terse.
- Merged two nested if()s into a single expression in __mcopy_atomic_hugetlb.
- Moved "return -EINVAL if MCOPY_CONTINUE isn't supported for this vma type" up
  one level, into __mcopy_atomic.
- Rebased onto linux-next/akpm, instead of the latest 5.11 RC. Resolved
  conflicts with Mike's recent hugetlb changes.

v3->v4:
- Relaxed restriction for minor registration to allow any hugetlb VMAs, not
  just those with VM_SHARED. Fixed setting VM_WRITE flag in a CONTINUE ioctl
  for non-VM_SHARED VMAs.
- Reordered if() branches in hugetlb_mcopy_atomic_pte, so the conditions are
  simpler and easier to read.
- Reverted most of the mfill_atomic_pte change (the anon / shmem path). Just
  return -EINVAL for CONTINUE, and set zeropage = (mode ==
  MCOPY_ATOMIC_ZEROPAGE), so we can keep the delta small.
- Split out adding #ifdef CONFIG_USERFAULTFD to a separate patch (instead of
  lumping it together with adding UFFDIO_CONTINUE).
- Fixed signature of hugetlb_mcopy_atomic_pte for !CONFIG_HUGETLB_PAGE
  (signature must be the same in either case).
- Rebased onto a newer version of Peter's patches to disable huge PMD sharing.

v2->v3:
- Added #ifdef CONFIG_USERFAULTFD around hugetlb helper functions, to fix build
  errors when building without CONFIG_USERFAULTFD set.

v1->v2:
- Fixed a bug in the hugetlb_mcopy_atomic_pte retry case. We now plumb in the
  enum mcopy_atomic_mode, so we can differentiate between the three cases this
  function needs to handle:
  1) We're doing a COPY op, and need to allocate a page, add to cache, etc.
  2) We're doing a COPY op, but allocation in this function failed previously;
     we're in the retry path. The page was allocated, but not e.g. added to page
     cache, so that still needs to be done.
  3) We're doing a CONTINUE op, we need to look up an existing page instead of
     allocating a new one.
- Rebased onto a newer version of Peter's patches to disable huge PMD sharing,
  which fixes syzbot complaints on some non-x86 architectures.
- Moved __VM_UFFD_FLAGS into userfaultfd_k.h, so inline helpers can use it.
- Renamed UFFD_FEATURE_MINOR_FAULT_HUGETLBFS to UFFD_FEATURE_MINOR_HUGETLBFS,
  for consistency with other existing feature flags.
- Moved the userfaultfd_minor hook in hugetlb.c into the else block, so we don't
  have to explicitly check for !new_page.

RFC->v1:
- Rebased onto Peter Xu's patches for disabling huge PMD sharing for certain
  userfaultfd-registered areas.
- Added commits which update documentation, and add a self test which exercises
  the new feature.
- Fixed reporting CONTINUE as a supported ioctl even for non-MINOR ranges.

Overview
========

This series adds a new userfaultfd registration mode,
UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
By "minor" fault, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
One of the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what I'm
calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
have huge_pte_none(), but find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
userspace resolves the fault by either a) doing nothing if the contents are
already correct, or b) updating the underlying contents using the second,
non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
etc.). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
"I have ensured the page contents are correct, carry on setting up the mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
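
As a rough sketch of step 4 (not QEMU's actual implementation), a handler loop
might look like the following. page_is_up_to_date() and
fetch_page_from_source() are hypothetical stand-ins for the VMM's dirty
tracking and transport, and struct uffdio_continue is introduced by patch 08
of this series:

#include <linux/userfaultfd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helpers supplied by the surrounding VMM code. */
extern int page_is_up_to_date(unsigned long offset);
extern void fetch_page_from_source(void *dst, size_t len);

static void handle_minor_faults(int uffd, char *alias, char *primary,
				size_t page_size)
{
	for (;;) {
		struct pollfd pfd = { .fd = uffd, .events = POLLIN };
		struct uffd_msg msg;

		poll(&pfd, 1, -1);
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR))
			continue;

		unsigned long addr = msg.arg.pagefault.address &
				     ~((unsigned long)page_size - 1);
		unsigned long off = addr - (unsigned long)alias;

		/* Update the contents through the non-UFFD mapping if our
		 * dirty tracking says this copy is stale. */
		if (!page_is_up_to_date(off))
			fetch_page_from_source(primary + off, page_size);

		/* "Contents are correct, carry on setting up the mapping." */
		struct uffdio_continue cont = {
			.range = { .start = addr, .len = page_size },
		};
		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}
}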

Interaction with Existing APIs
==============================

Because it's possible to combine registration modes (e.g. a single VMA can be
userfaultfd-registered MINOR | MISSING), and because it's up to userspace how to
resolve faults once they are received, I spent some time thinking through how
the existing API interacts with the new feature.

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults; without
modifications, the existing codepath assumes a new page needs to be allocated.
This is okay, since userspace must have a second non-UFFD-registered mapping
anyway, so there isn't much reason to use these ioctls for minor faults in the
first place (userspace can just memcpy or memset via that mapping). A sketch of
dispatching on the fault kind follows the list below.

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).
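
To illustrate the MINOR | MISSING combination mentioned above, a handler can
dispatch on UFFD_PAGEFAULT_FLAG_MINOR. A minimal sketch, where staging_buf is
an illustrative caller-provided buffer and struct uffdio_continue comes from
patch 08 of this series:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Resolve one fault on a VMA registered MISSING | MINOR. */
static void resolve_fault(int uffd, const struct uffd_msg *msg,
			  unsigned long addr, unsigned long len,
			  void *staging_buf)
{
	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
		/* Page already exists: fix its contents via the non-UFFD
		 * mapping first, then tell the kernel to carry on. */
		struct uffdio_continue cont = {
			.range = { .start = addr, .len = len },
		};
		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	} else {
		/* Truly missing page: allocate and fill it with COPY.
		 * Using COPY on a minor fault would fail with EEXIST. */
		struct uffdio_copy copy = {
			.dst = addr,
			.src = (unsigned long)staging_buf,
			.len = len,
		};
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
}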

Dependencies
============

I've included 4 commits from Peter Xu's larger series
(https://lore.kernel.org/patchwork/cover/1366017/) in this series. My changes
depend on his work, to disable huge PMD sharing for MINOR registered userfaultfd
areas. I included the 4 commits directly because a) it lets this series just be
applied and work as-is, and b) they are fairly standalone, and could potentially
be merged even without the rest of the larger series Peter submitted. Thanks
Peter!

Also, although it doesn't affect minor fault handling, I did notice that the
userfaultfd self test sometimes experienced memory corruption
(https://lore.kernel.org/patchwork/cover/1356755/). For anyone testing this
series, it may be useful to apply that series first to fix the selftest
flakiness. That series doesn't have to be merged into mainline / maintainer
branches before mine, though.

Future Work
===========

Currently the patchset only supports hugetlbfs. There is no reason it can't work
with shmem, but I expect hugetlbfs to be much more commonly used since we're
talking about backing guest memory for VMs. I plan to implement shmem support in
a follow-up patch series.

Axel Rasmussen (6):
  userfaultfd: add minor fault registration mode
  userfaultfd: disable huge PMD sharing for MINOR registered VMAs
  userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled
  userfaultfd: add UFFDIO_CONTINUE ioctl
  userfaultfd: update documentation to describe minor fault handling
  userfaultfd/selftests: add test exercising minor fault handling

Peter Xu (4):
  hugetlb: Pass vma into huge_pte_alloc() and huge_pmd_share()
  hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
  hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp

 Documentation/admin-guide/mm/userfaultfd.rst | 107 +++++++----
 arch/arm64/mm/hugetlbpage.c                  |   7 +-
 arch/ia64/mm/hugetlbpage.c                   |   3 +-
 arch/mips/mm/hugetlbpage.c                   |   4 +-
 arch/parisc/mm/hugetlbpage.c                 |   2 +-
 arch/powerpc/mm/hugetlbpage.c                |   3 +-
 arch/s390/mm/hugetlbpage.c                   |   2 +-
 arch/sh/mm/hugetlbpage.c                     |   2 +-
 arch/sparc/mm/hugetlbpage.c                  |   6 +-
 fs/proc/task_mmu.c                           |   1 +
 fs/userfaultfd.c                             | 186 +++++++++++++++----
 include/linux/hugetlb.h                      |  22 ++-
 include/linux/mm.h                           |   1 +
 include/linux/mmu_notifier.h                 |   1 +
 include/linux/userfaultfd_k.h                |  49 ++++-
 include/trace/events/mmflags.h               |   1 +
 include/uapi/linux/userfaultfd.h             |  36 +++-
 mm/hugetlb.c                                 | 116 ++++++++----
 mm/userfaultfd.c                             |  39 ++--
 tools/testing/selftests/vm/userfaultfd.c     | 147 ++++++++++++++-
 20 files changed, 587 insertions(+), 148 deletions(-)

--
2.30.0.478.g8a0d178c01-goog



* [PATCH v5 01/10] hugetlb: Pass vma into huge_pte_alloc() and huge_pmd_share()
From: Axel Rasmussen @ 2021-02-10 21:21 UTC

From: Peter Xu <peterx@redhat.com>

This is preparatory work to allow the per-architecture huge_pte_alloc()
implementations to behave differently according to VMA attributes.

Pass the vma deeper, into huge_pmd_share(), so that we can avoid the find_vma()
call there.

Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 arch/arm64/mm/hugetlbpage.c   |  4 ++--
 arch/ia64/mm/hugetlbpage.c    |  3 ++-
 arch/mips/mm/hugetlbpage.c    |  4 ++--
 arch/parisc/mm/hugetlbpage.c  |  2 +-
 arch/powerpc/mm/hugetlbpage.c |  3 ++-
 arch/s390/mm/hugetlbpage.c    |  2 +-
 arch/sh/mm/hugetlbpage.c      |  2 +-
 arch/sparc/mm/hugetlbpage.c   |  6 +-----
 include/linux/hugetlb.h       |  5 +++--
 mm/hugetlb.c                  | 15 ++++++++-------
 mm/userfaultfd.c              |  2 +-
 11 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 55ecf6de9ff7..6e3bcffe2837 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -252,7 +252,7 @@ void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
 		set_pte(ptep, pte);
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgdp;
@@ -286,7 +286,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	} else if (sz == PMD_SIZE) {
 		if (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
 		    pud_none(READ_ONCE(*pudp)))
-			ptep = huge_pmd_share(mm, addr, pudp);
+			ptep = huge_pmd_share(mm, vma, addr, pudp);
 		else
 			ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
 	} else if (sz == (CONT_PMD_SIZE)) {
diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index b331f94d20ac..f993cb36c062 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
 EXPORT_SYMBOL(hpage_shift);
 
 pte_t *
-huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+	       unsigned long addr, unsigned long sz)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index b9f76f433617..7eaff5b07873 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -21,8 +21,8 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
-		      unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index 30e0a862a0b2..f67f64709f50 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -44,7 +44,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 }
 
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 4e7d9b91f1da..e5caf2455aa4 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -106,7 +106,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
  * At this point we do the placement change only for BOOK3S 64. This would
  * possibly work on other subarchs.
  */
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, unsigned long sz)
 {
 	pgd_t *pg;
 	p4d_t *p4;
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 3b5a4d25ca9b..da36d13ffc16 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -189,7 +189,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 	return pte;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgdp;
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 220d7bc43d2b..999ab5916e69 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -21,7 +21,7 @@
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ad4b42f04988..a487ea2977a2 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -275,11 +275,7 @@ static unsigned long huge_tte_to_size(pte_t pte)
 	return size;
 }
 
-unsigned long pud_leaf_size(pud_t pud) { return 1UL << tte_to_shift(*(pte_t *)&pud); }
-unsigned long pmd_leaf_size(pmd_t pmd) { return 1UL << tte_to_shift(*(pte_t *)&pmd); }
-unsigned long pte_leaf_size(pte_t pte) { return 1UL << tte_to_shift(pte); }
-
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ce6533584eb7..ca6e5ba56f73 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -152,7 +152,8 @@ void hugetlb_fix_reserve_counts(struct inode *inode);
 extern struct mutex *hugetlb_fault_mutex_table;
 u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
 
-pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
+pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, pud_t *pud);
 
 struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
 
@@ -161,7 +162,7 @@ extern struct list_head huge_boot_pages;
 
 /* arch callbacks */
 
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0b7079dd0d35..32d4d2e277ad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3757,7 +3757,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		src_pte = huge_pte_offset(src, addr, sz);
 		if (!src_pte)
 			continue;
-		dst_pte = huge_pte_alloc(dst, addr, sz);
+		dst_pte = huge_pte_alloc(dst, vma, addr, sz);
 		if (!dst_pte) {
 			ret = -ENOMEM;
 			break;
@@ -4494,7 +4494,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	mapping = vma->vm_file->f_mapping;
 	i_mmap_lock_read(mapping);
-	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
 	if (!ptep) {
 		i_mmap_unlock_read(mapping);
 		return VM_FAULT_OOM;
@@ -5289,9 +5289,9 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * if !vma_shareable check at the beginning of the routine. i_mmap_rwsem is
  * only required for subsequent processing.
  */
-pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, pud_t *pud)
 {
-	struct vm_area_struct *vma = find_vma(mm, addr);
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) +
 			vma->vm_pgoff;
@@ -5369,7 +5369,8 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 #define want_pmd_share()	(1)
 #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
-pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, pud_t *pud)
 {
 	return NULL;
 }
@@ -5388,7 +5389,7 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
 #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
@@ -5407,7 +5408,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		} else {
 			BUG_ON(sz != PMD_SIZE);
 			if (want_pmd_share() && pud_none(*pud))
-				pte = huge_pmd_share(mm, addr, pud);
+				pte = huge_pmd_share(mm, vma, addr, pud);
 			else
 				pte = (pte_t *)pmd_alloc(mm, pud, addr);
 		}
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7423808640ef..b2ce61c1b50d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -290,7 +290,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		err = -ENOMEM;
-		dst_pte = huge_pte_alloc(dst_mm, dst_addr, vma_hpagesize);
+		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			i_mmap_unlock_read(mapping);
-- 
2.30.0.478.g8a0d178c01-goog



* [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
From: Axel Rasmussen @ 2021-02-10 21:21 UTC

From: Peter Xu <peterx@redhat.com>

Huge pmd sharing can cause problems for userfaultfd.  Userfaultfd runs its
logic based on special bits in page table entries, while huge pmd sharing can
share page table entries across different address ranges.  That can cause
issues in either of these cases:

  - When sharing huge pmd page tables for an uffd write protected range, the
    newly mapped huge pmd range will also be write protected unexpectedly, or,

  - When we try to write protect a range of huge pmd shared range, we'll first
    do huge_pmd_unshare() in hugetlb_change_protection(), however that also
    means the UFFDIO_WRITEPROTECT could be silently skipped for the shared
    region, which could lead to data loss.

While at it, a few other things are done as well:

  - Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because
    that's definitely something that arch code would like to use too

  - arm64 currently checks CONFIG_ARCH_WANT_HUGE_PMD_SHARE directly when
    trying to share a huge pmd.  Switch it to the want_pmd_share() helper.

While at it, move vma_shareable() from huge_pmd_share() into want_pmd_share().

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 arch/arm64/mm/hugetlbpage.c   |  3 +--
 include/linux/hugetlb.h       |  2 ++
 include/linux/userfaultfd_k.h |  9 +++++++++
 mm/hugetlb.c                  | 20 ++++++++++++++------
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 6e3bcffe2837..58987a98e179 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -284,8 +284,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		ptep = pte_alloc_map(mm, pmdp, addr);
 	} else if (sz == PMD_SIZE) {
-		if (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
-		    pud_none(READ_ONCE(*pudp)))
+		if (want_pmd_share(vma, addr) && pud_none(READ_ONCE(*pudp)))
 			ptep = huge_pmd_share(mm, vma, addr, pudp);
 		else
 			ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ca6e5ba56f73..d971e7efd17d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1030,4 +1030,6 @@ static inline __init void hugetlb_cma_check(void)
 }
 #endif
 
+bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
+
 #endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a8e5f3ea9bb2..c63ccdae3eab 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -52,6 +52,15 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 	return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
 }
 
+/*
+ * Never enable huge pmd sharing on uffd-wp registered vmas, because uffd-wp
+ * protect information is per pgtable entry.
+ */
+static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_UFFD_MISSING;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 32d4d2e277ad..5710286e1984 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5245,6 +5245,18 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	return false;
 }
 
+bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
+{
+#ifndef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+	return false;
+#endif
+#ifdef CONFIG_USERFAULTFD
+	if (uffd_disable_huge_pmd_share(vma))
+		return false;
+#endif
+	return vma_shareable(vma, addr);
+}
+
 /*
  * Determine if start,end range within vma could be mapped by shared pmd.
  * If yes, adjust start and end to cover range associated with possible
@@ -5301,9 +5313,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (!vma_shareable(vma, addr))
-		return (pte_t *)pmd_alloc(mm, pud, addr);
-
 	i_mmap_assert_locked(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
@@ -5367,7 +5376,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 	*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
 	return 1;
 }
-#define want_pmd_share()	(1)
+
 #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
@@ -5385,7 +5394,6 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end)
 {
 }
-#define want_pmd_share()	(0)
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
 #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
@@ -5407,7 +5415,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			pte = (pte_t *)pud;
 		} else {
 			BUG_ON(sz != PMD_SIZE);
-			if (want_pmd_share() && pud_none(*pud))
+			if (want_pmd_share(vma, addr) && pud_none(*pud))
 				pte = huge_pmd_share(mm, vma, addr, pud);
 			else
 				pte = (pte_t *)pmd_alloc(mm, pud, addr);
-- 
2.30.0.478.g8a0d178c01-goog



* [PATCH v5 03/10] mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
From: Axel Rasmussen @ 2021-02-10 21:21 UTC

From: Peter Xu <peterx@redhat.com>

Prepare for it to be called outside of mm/hugetlb.c.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/hugetlb.h | 8 ++++++++
 mm/hugetlb.c            | 8 --------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d971e7efd17d..d740c6fd19ae 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1032,4 +1032,12 @@ static inline __init void hugetlb_cma_check(void)
 
 bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
 
+#ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
+/*
+ * ARCHes with special requirements for evicting HUGETLB backing TLB entries can
+ * implement this.
+ */
+#define flush_hugetlb_tlb_range(vma, addr, end)	flush_tlb_range(vma, addr, end)
+#endif
+
 #endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5710286e1984..e41b77cf6cc2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4927,14 +4927,6 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return i ? i : err;
 }
 
-#ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
-/*
- * ARCHes with special requirements for evicting HUGETLB backing TLB entries can
- * implement this.
- */
-#define flush_hugetlb_tlb_range(vma, addr, end)	flush_tlb_range(vma, addr, end)
-#endif
-
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot)
 {
-- 
2.30.0.478.g8a0d178c01-goog



* [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
From: Axel Rasmussen @ 2021-02-10 21:21 UTC

From: Peter Xu <peterx@redhat.com>

Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because
userfaultfd-wp is always based on pgtable entries, so shared pmds cannot carry
the per-range protection information.

Walk the hugetlb range and unshare all such mappings, if any exist, right
before UFFDIO_REGISTER succeeds and returns to userspace.

This pairs with want_pmd_share() in the hugetlb code so that huge pmd sharing
is completely disabled for any userfaultfd-wp registered range.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 fs/userfaultfd.c             | 48 ++++++++++++++++++++++++++++++++++++
 include/linux/mmu_notifier.h |  1 +
 2 files changed, 49 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0be8cdd4425a..1f4a34b1a1e7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -15,6 +15,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
 #include <linux/mm.h>
+#include <linux/mmu_notifier.h>
 #include <linux/poll.h>
 #include <linux/slab.h>
 #include <linux/seq_file.h>
@@ -1191,6 +1192,50 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 	}
 }
 
+/*
+ * This function will unconditionally remove all the shared pmd pgtable entries
+ * within the specific vma for a hugetlbfs memory range.
+ */
+static void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range;
+	unsigned long address;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return;
+
+	/*
+	 * No need to call adjust_range_if_pmd_sharing_possible(), because
+	 * we're going to operate on the whole vma
+	 */
+	mmu_notifier_range_init(&range, MMU_NOTIFY_HUGETLB_UNSHARE,
+				0, vma, mm, vma->vm_start, vma->vm_end);
+	mmu_notifier_invalidate_range_start(&range);
+	i_mmap_lock_write(vma->vm_file->f_mapping);
+	for (address = vma->vm_start; address < vma->vm_end; address += sz) {
+		ptep = huge_pte_offset(mm, address, sz);
+		if (!ptep)
+			continue;
+		ptl = huge_pte_lock(h, mm, ptep);
+		huge_pmd_unshare(mm, vma, &address, ptep);
+		spin_unlock(ptl);
+	}
+	flush_hugetlb_tlb_range(vma, vma->vm_start, vma->vm_end);
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	/*
+	 * No need to call mmu_notifier_invalidate_range(), see
+	 * Documentation/vm/mmu_notifier.rst.
+	 */
+	mmu_notifier_invalidate_range_end(&range);
+#endif
+}
+
 static void __wake_userfault(struct userfaultfd_ctx *ctx,
 			     struct userfaultfd_wake_range *range)
 {
@@ -1449,6 +1494,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		vma->vm_flags = new_flags;
 		vma->vm_userfaultfd_ctx.ctx = ctx;
 
+		if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
+			hugetlb_unshare_all_pmds(vma);
+
 	skip:
 		prev = vma;
 		start = vma->vm_end;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..ff50c8528113 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -51,6 +51,7 @@ enum mmu_notifier_event {
 	MMU_NOTIFY_SOFT_DIRTY,
 	MMU_NOTIFY_RELEASE,
 	MMU_NOTIFY_MIGRATE,
+	MMU_NOTIFY_HUGETLB_UNSHARE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
-- 
2.30.0.478.g8a0d178c01-goog



* [PATCH v5 05/10] userfaultfd: add minor fault registration mode
From: Axel Rasmussen @ 2021-02-10 21:21 UTC

This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s). One of
the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have
already been allocated & filled with some contents. The UFFD mapping
has not yet been faulted in; when it is touched for the first time,
this results in what I'm calling a "minor" fault. As a concrete
example: when working with hugetlbfs, the fault path sees
huge_pte_none() (no PTE is installed yet), but find_lock_page()
nevertheless finds an existing page in the page cache.

This commit adds the new registration mode, and sets the relevant flag
on the VMAs being registered. In the hugetlb fault path, if we find
that we have huge_pte_none(), but find_lock_page() does indeed find an
existing page, then we have a "minor" fault, and if the VMA has the
userfaultfd registration flag, we call into userfaultfd to handle it.

Why add a new registration mode, as opposed to adding a feature to
MISSING registration, like UFFD_FEATURE_SIGBUS?

- The semantics are significantly different. UFFDIO_COPY or
  UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
  would instead just memset() or memcpy() or whatever via the non-UFFD
  mapping. Unlike MISSING registration, MINOR registration only makes
  sense for hugetlbfs (or, in the future, shmem), as this is the only
  way to get two VMAs to a single set of underlying pages.

- Doing so would make handle_userfault()'s "reason" argument confusing.
  We'd pass in "MISSING" even if the pages weren't really missing.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 fs/proc/task_mmu.c               |  1 +
 fs/userfaultfd.c                 | 71 ++++++++++++++++++--------------
 include/linux/mm.h               |  1 +
 include/linux/userfaultfd_k.h    | 15 ++++++-
 include/trace/events/mmflags.h   |  1 +
 include/uapi/linux/userfaultfd.h | 15 ++++++-
 mm/hugetlb.c                     | 32 ++++++++++++++
 7 files changed, 102 insertions(+), 34 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 602e3a52884d..94e951ea3e03 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MTE)]		= "mt",
 		[ilog2(VM_MTE_ALLOWED)]	= "",
 #endif
+		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #ifdef CONFIG_ARCH_HAS_PKEYS
 		/* These come out via ProtectionKey: */
 		[ilog2(VM_PKEY_BIT0)]	= "",
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1f4a34b1a1e7..b351a8552140 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -197,24 +197,21 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
 	msg_init(&msg);
 	msg.event = UFFD_EVENT_PAGEFAULT;
 	msg.arg.pagefault.address = address;
+	/*
+	 * These flags indicate why the userfault occurred:
+	 * - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
+	 * - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
+	 * - Neither of these flags being set indicates a MISSING fault.
+	 *
+	 * Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
+	 * fault. Otherwise, it was a read fault.
+	 */
 	if (flags & FAULT_FLAG_WRITE)
-		/*
-		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
-		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
-		 * was not set in a UFFD_EVENT_PAGEFAULT, it means it
-		 * was a read fault, otherwise if set it means it's
-		 * a write fault.
-		 */
 		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
 	if (reason & VM_UFFD_WP)
-		/*
-		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
-		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
-		 * not set in a UFFD_EVENT_PAGEFAULT, it means it was
-		 * a missing fault, otherwise if set it means it's a
-		 * write protect fault.
-		 */
 		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
+	if (reason & VM_UFFD_MINOR)
+		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
 	if (features & UFFD_FEATURE_THREAD_ID)
 		msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
 	return msg;
@@ -401,8 +398,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	BUG_ON(ctx->mm != mm);
 
-	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
-	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
+	/* Any unrecognized flag is a bug. */
+	VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
+	/* 0 or > 1 flags set is a bug; we expect exactly 1. */
+	VM_BUG_ON(!reason || !!(reason & (reason - 1)));
 
 	if (ctx->features & UFFD_FEATURE_SIGBUS)
 		goto out;
@@ -612,7 +611,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 		for (vma = mm->mmap; vma; vma = vma->vm_next)
 			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
 				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-				vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+				vma->vm_flags &= ~__VM_UFFD_FLAGS;
 			}
 		mmap_write_unlock(mm);
 
@@ -644,7 +643,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 	octx = vma->vm_userfaultfd_ctx.ctx;
 	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+		vma->vm_flags &= ~__VM_UFFD_FLAGS;
 		return 0;
 	}
 
@@ -726,7 +725,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
 	} else {
 		/* Drop uffd context if remap feature not enabled */
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+		vma->vm_flags &= ~__VM_UFFD_FLAGS;
 	}
 }
 
@@ -867,12 +866,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		cond_resched();
 		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
-		       !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+		       !!(vma->vm_flags & __VM_UFFD_FLAGS));
 		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
 			prev = vma;
 			continue;
 		}
-		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
 		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
 				 new_flags, vma->anon_vma,
 				 vma->vm_file, vma->vm_pgoff,
@@ -1306,9 +1305,19 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 				     unsigned long vm_flags)
 {
 	/* FIXME: add WP support to hugetlbfs and shmem */
-	return vma_is_anonymous(vma) ||
-		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
-		 !(vm_flags & VM_UFFD_WP));
+	if (vm_flags & VM_UFFD_WP) {
+		if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
+			return false;
+	}
+
+	if (vm_flags & VM_UFFD_MINOR) {
+		/* FIXME: Add minor fault interception for shmem. */
+		if (!is_vm_hugetlb_page(vma))
+			return false;
+	}
+
+	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+	       vma_is_shmem(vma);
 }
 
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1334,14 +1343,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (!uffdio_register.mode)
 		goto out;
-	if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
-				     UFFDIO_REGISTER_MODE_WP))
+	if (uffdio_register.mode & ~UFFD_API_REGISTER_MODES)
 		goto out;
 	vm_flags = 0;
 	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
 		vm_flags |= VM_UFFD_MISSING;
 	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
 		vm_flags |= VM_UFFD_WP;
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)
+		vm_flags |= VM_UFFD_MINOR;
 
 	ret = validate_range(mm, &uffdio_register.range.start,
 			     uffdio_register.range.len);
@@ -1385,7 +1395,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		cond_resched();
 
 		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
-		       !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+		       !!(cur->vm_flags & __VM_UFFD_FLAGS));
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
@@ -1465,8 +1475,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 			start = vma->vm_start;
 		vma_end = min(end, vma->vm_end);
 
-		new_flags = (vma->vm_flags &
-			     ~(VM_UFFD_MISSING|VM_UFFD_WP)) | vm_flags;
+		new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
@@ -1588,7 +1597,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		cond_resched();
 
 		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
-		       !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+		       !!(cur->vm_flags & __VM_UFFD_FLAGS));
 
 		/*
 		 * Check not compatible vmas, not strictly required
@@ -1639,7 +1648,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 			wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
 		}
 
-		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 89fca443e6f1..3f65a506c743 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -276,6 +276,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 #define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
+#define VM_UFFD_MINOR	BIT(37)		/* minor fault interception (64-bit only; 0x00002000 is taken by VM_LOCKED below) */
 
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c63ccdae3eab..0390e5ac63b3 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -17,6 +17,9 @@
 #include <linux/mm.h>
 #include <asm-generic/pgtable_uffd.h>
 
+/* The set of all possible UFFD-related VM flags. */
+#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
+
 /*
  * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
  * new flags, since they might collide with O_* ones. We want
@@ -71,6 +74,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_WP;
 }
 
+static inline bool userfaultfd_minor(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_MINOR;
+}
+
 static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
 				      pte_t pte)
 {
@@ -85,7 +93,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
 
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
-	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
@@ -132,6 +140,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_minor(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
 				      pte_t pte)
 {
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 67018d367b9f..2d583ffd4100 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -151,6 +151,7 @@ IF_HAVE_PG_ARCH_2(PG_arch_2,		"arch_2"	)
 	{VM_PFNMAP,			"pfnmap"	},		\
 	{VM_DENYWRITE,			"denywrite"	},		\
 	{VM_UFFD_WP,			"uffd_wp"	},		\
+	{VM_UFFD_MINOR,			"uffd_minor"	},		\
 	{VM_LOCKED,			"locked"	},		\
 	{VM_IO,				"io"		},		\
 	{VM_SEQ_READ,			"seqread"	},		\
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 5f2d88212f7c..f24dd4fcbad9 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,15 +19,19 @@
  * means the userland is reading).
  */
 #define UFFD_API ((__u64)0xAA)
+#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING |	\
+				 UFFDIO_REGISTER_MODE_WP |	\
+				 UFFDIO_REGISTER_MODE_MINOR)
 #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |	\
 			   UFFD_FEATURE_EVENT_FORK |		\
 			   UFFD_FEATURE_EVENT_REMAP |		\
-			   UFFD_FEATURE_EVENT_REMOVE |	\
+			   UFFD_FEATURE_EVENT_REMOVE |		\
 			   UFFD_FEATURE_EVENT_UNMAP |		\
 			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
 			   UFFD_FEATURE_MISSING_SHMEM |		\
 			   UFFD_FEATURE_SIGBUS |		\
-			   UFFD_FEATURE_THREAD_ID)
+			   UFFD_FEATURE_THREAD_ID |		\
+			   UFFD_FEATURE_MINOR_HUGETLBFS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -127,6 +131,7 @@ struct uffd_msg {
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
 #define UFFD_PAGEFAULT_FLAG_WP		(1<<1)	/* If reason is VM_UFFD_WP */
+#define UFFD_PAGEFAULT_FLAG_MINOR	(1<<2)	/* If reason is VM_UFFD_MINOR */
 
 struct uffdio_api {
 	/* userland asks for an API number and the features to enable */
@@ -171,6 +176,10 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_THREAD_ID pid of the page faulted task_struct will
 	 * be returned, if feature is not requested 0 will be returned.
+	 *
+	 * UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults
+	 * can be intercepted (via REGISTER_MODE_MINOR) for
+	 * hugetlbfs-backed pages.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -181,6 +190,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_EVENT_UNMAP		(1<<6)
 #define UFFD_FEATURE_SIGBUS			(1<<7)
 #define UFFD_FEATURE_THREAD_ID			(1<<8)
+#define UFFD_FEATURE_MINOR_HUGETLBFS		(1<<9)
 	__u64 features;
 
 	__u64 ioctls;
@@ -195,6 +205,7 @@ struct uffdio_register {
 	struct uffdio_range range;
 #define UFFDIO_REGISTER_MODE_MISSING	((__u64)1<<0)
 #define UFFDIO_REGISTER_MODE_WP		((__u64)1<<1)
+#define UFFDIO_REGISTER_MODE_MINOR	((__u64)1<<2)
 	__u64 mode;
 
 	/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e41b77cf6cc2..f150b10981a8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4366,6 +4366,38 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 				VM_FAULT_SET_HINDEX(hstate_index(h));
 			goto backout_unlocked;
 		}
+
+		/* Check for page in userfault range. */
+		if (userfaultfd_minor(vma)) {
+			u32 hash;
+			struct vm_fault vmf = {
+				.vma = vma,
+				.address = haddr,
+				.flags = flags,
+				/*
+				 * Hard to debug if it ends up being used by a
+				 * callee that assumes something about the
+				 * other uninitialized fields... same as in
+				 * memory.c
+				 */
+			};
+
+			unlock_page(page);
+
+			/*
+			 * hugetlb_fault_mutex and i_mmap_rwsem must be dropped
+			 * before handling userfault.  Reacquire after handling
+			 * fault to make calling code simpler.
+			 */
+
+			hash = hugetlb_fault_mutex_hash(mapping, idx);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
+			ret = handle_userfault(&vmf, VM_UFFD_MINOR);
+			i_mmap_lock_read(mapping);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			goto out;
+		}
 	}
 
 	/*
-- 
2.30.0.478.g8a0d178c01-goog



* [PATCH v5 06/10] userfaultfd: disable huge PMD sharing for MINOR registered VMAs
From: Axel Rasmussen @ 2021-02-10 21:21 UTC

As the comment says: for the MINOR fault use case, although the page
might be present and populated in the other (non-UFFD-registered) half
of the mapping, it may be out of date, and we explicitly want userspace
to get a minor fault so it can check and potentially update the page's
contents.

Huge PMD sharing would prevent these faults from occurring for
suitably aligned areas, so disable it upon UFFD registration.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/userfaultfd_k.h | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0390e5ac63b3..e060d5f77cc5 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -56,12 +56,19 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 }
 
 /*
- * Never enable huge pmd sharing on uffd-wp registered vmas, because uffd-wp
- * protect information is per pgtable entry.
+ * Never enable huge pmd sharing on some uffd registered vmas:
+ *
+ * - VM_UFFD_WP VMAs, because write protect information is per pgtable entry.
+ *
+ * - VM_UFFD_MINOR VMAs, because otherwise we would never get minor faults for
+ *   VMAs which share huge pmds. (If you have two mappings to the same
+ *   underlying pages, and fault in the non-UFFD-registered one with a write,
+ *   with huge pmd sharing this would *also* setup the second UFFD-registered
+ *   mapping, and we'd not get minor faults.)
  */
 static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
 {
-	return vma->vm_flags & VM_UFFD_WP;
+	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
 }
 
 static inline bool userfaultfd_missing(struct vm_area_struct *vma)
-- 
2.30.0.478.g8a0d178c01-goog



* [PATCH v5 07/10] userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled
From: Axel Rasmussen @ 2021-02-10 21:21 UTC

For background, mm/userfaultfd.c provides a general mcopy_atomic
implementation. But some types of memory (i.e., hugetlb and shmem) need
a slightly different implementation, so they provide their own helpers
for this purpose. In either case, userfaultfd is the only caller of
these helpers, so they are dead code if CONFIG_USERFAULTFD is not set.

This patch achieves two things:

1. Don't spend time compiling code which will never be referenced
anyway (a small build-time optimization).

2. In patches later in this series, we extend the signature of these
helpers with UFFD-specific state (a mode enumeration). Once this
happens, we *have to* either not compile the helpers, or unconditionally
define the UFFD-only state (which seems messier to me). The same applies
to the declarations in the headers, as otherwise they'd yield warnings
about the implicitly-defined type of those arguments.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/hugetlb.h | 4 ++++
 mm/hugetlb.c            | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d740c6fd19ae..aa9e1d6de831 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -134,11 +134,13 @@ void hugetlb_show_meminfo(void);
 unsigned long hugetlb_total_pages(void);
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
+#ifdef CONFIG_USERFAULTFD
 int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
 				unsigned long src_addr,
 				struct page **pagep);
+#endif /* CONFIG_USERFAULTFD */
 bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						vm_flags_t vm_flags);
@@ -309,6 +311,7 @@ static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	BUG();
 }
 
+#ifdef CONFIG_USERFAULTFD
 static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 						pte_t *dst_pte,
 						struct vm_area_struct *dst_vma,
@@ -319,6 +322,7 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	BUG();
 	return 0;
 }
+#endif /* CONFIG_USERFAULTFD */
 
 static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
 					unsigned long sz)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f150b10981a8..2331281cf133 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4638,6 +4638,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
 /*
  * Used by userfaultfd UFFDIO_COPY.  Based on mcopy_atomic_pte with
  * modifications for huge pages.
@@ -4768,6 +4769,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	put_page(page);
 	goto out;
 }
+#endif /* CONFIG_USERFAULTFD */
 
 static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
 				 int refs, struct page **pages,
-- 
2.30.0.478.g8a0d178c01-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 08/10] userfaultfd: add UFFDIO_CONTINUE ioctl
  2021-02-10 21:21 [PATCH v5 00/10] userfaultfd: add minor fault handling Axel Rasmussen
                   ` (6 preceding siblings ...)
  2021-02-10 21:21 ` [PATCH v5 07/10] userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled Axel Rasmussen
@ 2021-02-10 21:21 ` Axel Rasmussen
  2021-02-10 22:21   ` Axel Rasmussen
  2021-02-10 21:21 ` [PATCH v5 09/10] userfaultfd: update documentation to describe minor fault handling Axel Rasmussen
  2021-02-10 21:22 ` [PATCH v5 10/10] userfaultfd/selftests: add test exercising " Axel Rasmussen
  9 siblings, 1 reply; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-10 21:21 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Axel Rasmussen, Cannon Matthews, Dr . David Alan Gilbert,
	David Rientjes, Mina Almasry, Oliver Upton

This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It might
change the contents of the page using its second non-UFFD mapping, or
not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have ensured
the page contents are correct, carry on setting up the mapping".

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does something very
close to what we want, if an existing page is provided via `struct page
**pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
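
As a rough sketch of the intended usage (a hypothetical handler
fragment; `uffd`, `fault_addr`, `page_size`, and handle_error() are
assumed, and the page's contents have already been checked or updated
via the non-UFFD mapping):

    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    struct uffdio_continue req = {
        .range.start = fault_addr,  /* page-aligned address, as
                                     * delivered in the uffd_msg */
        .range.len   = page_size,
        .mode        = 0,           /* or UFFDIO_CONTINUE_MODE_DONTWAKE */
    };

    if (ioctl(uffd, UFFDIO_CONTINUE, &req))
        /* req.mapped reports how many bytes were mapped, or a
         * negative error code. */
        handle_error(req.mapped);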

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 fs/userfaultfd.c                 | 67 ++++++++++++++++++++++++++++++++
 include/linux/hugetlb.h          |  3 ++
 include/linux/userfaultfd_k.h    | 18 +++++++++
 include/uapi/linux/userfaultfd.h | 21 +++++++++-
 mm/hugetlb.c                     | 39 ++++++++++++-------
 mm/userfaultfd.c                 | 37 +++++++++++-------
 6 files changed, 156 insertions(+), 29 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b351a8552140..5e20f425efd7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1527,6 +1527,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
 			ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);
 
+		/* CONTINUE ioctl is only supported for MINOR ranges. */
+		if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR))
+			ioctls_out &= ~((__u64)1 << _UFFDIO_CONTINUE);
+
 		/*
 		 * Now that we scanned all vmas we can already tell
 		 * userland which ioctls methods are guaranteed to
@@ -1880,6 +1884,66 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_continue uffdio_continue;
+	struct uffdio_continue __user *user_uffdio_continue;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_continue = (struct uffdio_continue __user *)arg;
+
+	ret = -EAGAIN;
+	if (READ_ONCE(ctx->mmap_changing))
+		goto out;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_continue, user_uffdio_continue,
+			   /* don't copy the output fields */
+			   sizeof(uffdio_continue) - (sizeof(__s64))))
+		goto out;
+
+	ret = validate_range(ctx->mm, &uffdio_continue.range.start,
+			     uffdio_continue.range.len);
+	if (ret)
+		goto out;
+
+	ret = -EINVAL;
+	/* double check for wraparound just in case. */
+	if (uffdio_continue.range.start + uffdio_continue.range.len <=
+	    uffdio_continue.range.start) {
+		goto out;
+	}
+	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+		goto out;
+
+	if (mmget_not_zero(ctx->mm)) {
+		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
+				     uffdio_continue.range.len,
+				     &ctx->mmap_changing);
+		mmput(ctx->mm);
+	} else {
+		return -ESRCH;
+	}
+
+	if (unlikely(put_user(ret, &user_uffdio_continue->mapped)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_continue.mode & UFFDIO_CONTINUE_MODE_DONTWAKE)) {
+		range.start = uffdio_continue.range.start;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_continue.range.len ? 0 : -EAGAIN;
+
+out:
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -1964,6 +2028,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_WRITEPROTECT:
 		ret = userfaultfd_writeprotect(ctx, arg);
 		break;
+	case UFFDIO_CONTINUE:
+		ret = userfaultfd_continue(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index aa9e1d6de831..3d01d228fc78 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -11,6 +11,7 @@
 #include <linux/kref.h>
 #include <linux/pgtable.h>
 #include <linux/gfp.h>
+#include <linux/userfaultfd_k.h>
 
 struct ctl_table;
 struct user_struct;
@@ -139,6 +140,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
 				unsigned long src_addr,
+				enum mcopy_atomic_mode mode,
 				struct page **pagep);
 #endif /* CONFIG_USERFAULTFD */
 bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
@@ -317,6 +319,7 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 						struct vm_area_struct *dst_vma,
 						unsigned long dst_addr,
 						unsigned long src_addr,
+						enum mcopy_atomic_mode mode,
 						struct page **pagep)
 {
 	BUG();
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e060d5f77cc5..794d1538b8ba 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -37,6 +37,22 @@ extern int sysctl_unprivileged_userfaultfd;
 
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
+/*
+ * The mode of operation for __mcopy_atomic and its helpers.
+ *
+ * This is almost an implementation detail (mcopy_atomic below doesn't take this
+ * as a parameter), but it's exposed here because memory-kind-specific
+ * implementations (e.g. hugetlbfs) need to know the mode of operation.
+ */
+enum mcopy_atomic_mode {
+	/* A normal copy_from_user into the destination range. */
+	MCOPY_ATOMIC_NORMAL,
+	/* Don't copy; map the destination range to the zero page. */
+	MCOPY_ATOMIC_ZEROPAGE,
+	/* Just install pte(s) with the existing page(s) in the page cache. */
+	MCOPY_ATOMIC_CONTINUE,
+};
+
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
 			    bool *mmap_changing, __u64 mode);
@@ -44,6 +60,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len,
 			      bool *mmap_changing);
+extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
+			      unsigned long len, bool *mmap_changing);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, bool *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index f24dd4fcbad9..bafbeb1a2624 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -40,10 +40,12 @@
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
 	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
-	 (__u64)1 << _UFFDIO_WRITEPROTECT)
+	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
+	 (__u64)1 << _UFFDIO_CONTINUE)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
-	 (__u64)1 << _UFFDIO_COPY)
+	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_CONTINUE)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -59,6 +61,7 @@
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_WRITEPROTECT		(0x06)
+#define _UFFDIO_CONTINUE		(0x07)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -77,6 +80,8 @@
 				      struct uffdio_zeropage)
 #define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
 				      struct uffdio_writeprotect)
+#define UFFDIO_CONTINUE		_IOR(UFFDIO, _UFFDIO_CONTINUE,	\
+				     struct uffdio_continue)
 
 /* read() structure */
 struct uffd_msg {
@@ -268,6 +273,18 @@ struct uffdio_writeprotect {
 	__u64 mode;
 };
 
+struct uffdio_continue {
+	struct uffdio_range range;
+#define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * Fields below here are written by the ioctl and must be at the end:
+	 * the copy_from_user will not read past here.
+	 */
+	__s64 mapped;
+};
+
 /*
  * Flags for the userfaultfd(2) system call itself.
  */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2331281cf133..7ddf28730361 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -39,7 +39,6 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/node.h>
-#include <linux/userfaultfd_k.h>
 #include <linux/page_owner.h>
 #include "internal.h"
 
@@ -4648,8 +4647,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct vm_area_struct *dst_vma,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
+			    enum mcopy_atomic_mode mode,
 			    struct page **pagep)
 {
+	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
 	struct address_space *mapping;
 	pgoff_t idx;
 	unsigned long size;
@@ -4659,8 +4660,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	spinlock_t *ptl;
 	int ret;
 	struct page *page;
+	int writable;
 
-	if (!*pagep) {
+	mapping = dst_vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+
+	if (is_continue) {
+		ret = -EFAULT;
+		page = find_lock_page(mapping, idx);
+		*pagep = NULL;
+		if (!page)
+			goto out;
+	} else if (!*pagep) {
 		ret = -ENOMEM;
 		page = alloc_huge_page(dst_vma, dst_addr, 0);
 		if (IS_ERR(page))
@@ -4689,13 +4700,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 */
 	__SetPageUptodate(page);
 
-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
-
-	/*
-	 * If shared, add to page cache
-	 */
-	if (vm_shared) {
+	/* Add shared, newly allocated pages to the page cache. */
+	if (vm_shared && !is_continue) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		ret = -EFAULT;
 		if (idx >= size)
@@ -4740,8 +4746,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
 	}
 
-	_dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE);
-	if (dst_vma->vm_flags & VM_WRITE)
+	/* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
+	if (is_continue && !vm_shared)
+		writable = 0;
+	else
+		writable = dst_vma->vm_flags & VM_WRITE;
+
+	_dst_pte = make_huge_pte(dst_vma, page, writable);
+	if (writable)
 		_dst_pte = huge_pte_mkdirty(_dst_pte);
 	_dst_pte = pte_mkyoung(_dst_pte);
 
@@ -4755,8 +4767,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
 
 	spin_unlock(ptl);
-	SetHPageMigratable(page);
-	if (vm_shared)
+	if (!is_continue)
+		SetHPageMigratable(page);
+	if (vm_shared || is_continue)
 		unlock_page(page);
 	ret = 0;
 out:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b2ce61c1b50d..ce6cb4760d2c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -207,7 +207,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					      unsigned long dst_start,
 					      unsigned long src_start,
 					      unsigned long len,
-					      bool zeropage)
+					      enum mcopy_atomic_mode mode)
 {
 	int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
@@ -227,7 +227,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	 * by THP.  Since we can not reliably insert a zero page, this
 	 * feature is not supported.
 	 */
-	if (zeropage) {
+	if (mode == MCOPY_ATOMIC_ZEROPAGE) {
 		mmap_read_unlock(dst_mm);
 		return -EINVAL;
 	}
@@ -273,8 +273,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	}
 
 	while (src_addr < src_start + len) {
-		pte_t dst_pteval;
-
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
@@ -297,16 +295,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 			goto out_unlock;
 		}
 
-		err = -EEXIST;
-		dst_pteval = huge_ptep_get(dst_pte);
-		if (!huge_pte_none(dst_pteval)) {
+		if (mode != MCOPY_ATOMIC_CONTINUE &&
+		    !huge_pte_none(huge_ptep_get(dst_pte))) {
+			err = -EEXIST;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
 		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
-						dst_addr, src_addr, &page);
+					       dst_addr, src_addr, mode, &page);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		i_mmap_unlock_read(mapping);
@@ -408,7 +406,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 				      unsigned long dst_start,
 				      unsigned long src_start,
 				      unsigned long len,
-				      bool zeropage);
+				      enum mcopy_atomic_mode mode);
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -458,7 +456,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long dst_start,
 					      unsigned long src_start,
 					      unsigned long len,
-					      bool zeropage,
+					      enum mcopy_atomic_mode mcopy_mode,
 					      bool *mmap_changing,
 					      __u64 mode)
 {
@@ -469,6 +467,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	long copied;
 	struct page *page;
 	bool wp_copy;
+	bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);
 
 	/*
 	 * Sanitize the command parameters:
@@ -527,10 +526,12 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 */
 	if (is_vm_hugetlb_page(dst_vma))
 		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
-						src_start, len, zeropage);
+						src_start, len, mcopy_mode);
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
+	if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
+		goto out_unlock;
 
 	/*
 	 * Ensure the dst_vma has a anon_vma or this page
@@ -626,14 +627,22 @@ ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 		     unsigned long src_start, unsigned long len,
 		     bool *mmap_changing, __u64 mode)
 {
-	return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
-			      mmap_changing, mode);
+	return __mcopy_atomic(dst_mm, dst_start, src_start, len,
+			      MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
 }
 
 ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 		       unsigned long len, bool *mmap_changing)
 {
-	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
+	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
+			      mmap_changing, 0);
+}
+
+ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
+		       unsigned long len, bool *mmap_changing)
+{
+	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
+			      mmap_changing, 0);
 }
 
 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
-- 
2.30.0.478.g8a0d178c01-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 09/10] userfaultfd: update documentation to describe minor fault handling
  2021-02-10 21:21 [PATCH v5 00/10] userfaultfd: add minor fault handling Axel Rasmussen
                   ` (7 preceding siblings ...)
  2021-02-10 21:21 ` [PATCH v5 08/10] userfaultfd: add UFFDIO_CONTINUE ioctl Axel Rasmussen
@ 2021-02-10 21:21 ` Axel Rasmussen
  2021-02-10 21:22 ` [PATCH v5 10/10] userfaultfd/selftests: add test exercising " Axel Rasmussen
  9 siblings, 0 replies; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-10 21:21 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Axel Rasmussen, Cannon Matthews, Dr . David Alan Gilbert,
	David Rientjes, Mina Almasry, Oliver Upton

Reword / reorganize things a little bit into "lists", so new features /
modes / ioctls can simply be appended.

Describe how UFFDIO_REGISTER_MODE_MINOR and UFFDIO_CONTINUE can be used
to intercept and resolve minor faults. Make it clear that COPY and
ZEROPAGE are used for MISSING faults, whereas CONTINUE is used for MINOR
faults.
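
For instance, a handler might dispatch on the fault kind roughly like
this (sketch only; assumes a struct uffd_msg `msg` already read() from
the userfaultfd):

    if (msg.event == UFFD_EVENT_PAGEFAULT) {
        if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
            /* Page exists: fix up its contents if needed, then
             * UFFDIO_CONTINUE to map it. */
        } else {
            /* Page is missing: provide contents with UFFDIO_COPY,
             * or map zeros with UFFDIO_ZEROPAGE. */
        }
    }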

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 Documentation/admin-guide/mm/userfaultfd.rst | 107 ++++++++++++-------
 1 file changed, 66 insertions(+), 41 deletions(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 65eefa66c0ba..3aa38e8b8361 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -63,36 +63,36 @@ the generic ioctl available.
 
 The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
 defines what memory types are supported by the ``userfaultfd`` and what
-events, except page fault notifications, may be generated.
-
-If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs
-virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in
-``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be
-set if the kernel supports registering ``userfaultfd`` ranges on shared
-memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``,
-``MAP_SHARED``, ``memfd_create``, etc).
-
-The userland application that wants to use ``userfaultfd`` with hugetlbfs
-or shared memory need to set the corresponding flag in
-``uffdio_api.features`` to enable those features.
-
-If the userland desires to receive notifications for events other than
-page faults, it has to verify that ``uffdio_api.features`` has appropriate
-``UFFD_FEATURE_EVENT_*`` bits set. These events are described in more
-detail below in `Non-cooperative userfaultfd`_ section.
-
-Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should
-be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to
-register a memory range in the ``userfaultfd`` by setting the
+events, except page fault notifications, may be generated:
+
+- The ``UFFD_FEATURE_EVENT_*`` flags indicate that various events other
+  than page faults are supported. These events are described in more
+  detail below in the `Non-cooperative userfaultfd`_ section.
+
+- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
+  indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
+  registrations for hugetlbfs and shared memory (covering all shmem APIs,
+  i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``,
+  etc) virtual memory areas, respectively.
+
+- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
+  ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
+  areas.
+
+The userland application should set the feature flags it intends to use
+when invoking the ``UFFDIO_API`` ioctl, to request that those features be
+enabled if supported.
+
+Once the ``userfaultfd`` API has been enabled the ``UFFDIO_REGISTER``
+ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
+bitmask) to register a memory range in the ``userfaultfd`` by setting the
 uffdio_register structure accordingly. The ``uffdio_register.mode``
 bitmask will specify to the kernel which kind of faults to track for
-the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing
-pages). The ``UFFDIO_REGISTER`` ioctl will return the
+the range. The ``UFFDIO_REGISTER`` ioctl will return the
 ``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
 userfaults on the range registered. Not all ioctls will necessarily be
-supported for all memory types depending on the underlying virtual
-memory backend (anonymous memory vs tmpfs vs real filebacked
-mappings).
+supported for all memory types (e.g. anonymous memory vs. shmem vs.
+hugetlbfs), or all types of intercepted faults.
 
 Userland can use the ``uffdio_register.ioctls`` to manage the virtual
 address space in the background (to add or potentially also remove
@@ -100,21 +100,46 @@ memory from the ``userfaultfd`` registered range). This means a userfault
 could be triggering just before userland maps in the background the
 user-faulted page.
 
-The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That
-atomically copies a page into the userfault registered range and wakes
-up the blocked userfaults
-(unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set).
-Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in
-guaranteeing that nothing can see an half copied page since it'll
-keep userfaulting until the copy has finished.
+Resolving Userfaults
+--------------------
+
+There are three basic ways to resolve userfaults:
+
+- ``UFFDIO_COPY`` atomically copies some existing page contents from
+  userspace.
+
+- ``UFFDIO_ZEROPAGE`` atomically zeros the new page.
+
+- ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.
+
+These operations are atomic in the sense that they guarantee nothing can
+see a half-populated page, since readers will keep userfaulting until the
+operation has finished.
+
+By default, these wake up userfaults blocked on the range in question.
+They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
+that waking will be done separately at some later time.
+
+Which ioctl to choose depends on the kind of page fault, and what we'd
+like to do to resolve it:
+
+- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
+  resolved by either providing a new page (``UFFDIO_COPY``), or mapping
+  the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map
+  the zero page for a missing fault. With userfaultfd, userspace can
+  decide what content to provide before the faulting thread continues.
+
+- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
+  the page cache). Userspace has the option of modifying the page's
+  contents before resolving the fault. Once the contents are correct
+  (modified or not), userspace asks the kernel to map the page and let the
+  faulting thread continue with ``UFFDIO_CONTINUE``.
 
 Notes:
 
-- If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then
-  you must provide some kind of page in your thread after reading from
-  the uffd.  You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``.
-  The normal behavior of the OS automatically providing a zero page on
-  an anonymous mmaping is not in place.
+- You can tell which kind of fault occurred by examining
+  ``pagefault.flags`` within the ``uffd_msg``, checking for the
+  ``UFFD_PAGEFAULT_FLAG_*`` flags.
 
 - None of the page-delivering ioctls default to the range that you
   registered with.  You must fill in all fields for the appropriate
@@ -122,9 +147,9 @@ Notes:
 
 - You get the address of the access that triggered the missing page
   event out of a struct uffd_msg that you read in the thread from the
-  uffd.  You can supply as many pages as you want with ``UFFDIO_COPY`` or
-  ``UFFDIO_ZEROPAGE``.  Keep in mind that unless you used DONTWAKE then
-  the first of any of those IOCTLs wakes up the faulting thread.
+  uffd.  You can supply as many pages as you want with these IOCTLs.
+  Keep in mind that unless you used DONTWAKE, the first of any of
+  those IOCTLs wakes up the faulting thread.
 
 - Be sure to test for all errors including
   (``pollfd[0].revents & POLLERR``).  This can happen, e.g. when ranges
-- 
2.30.0.478.g8a0d178c01-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 10/10] userfaultfd/selftests: add test exercising minor fault handling
  2021-02-10 21:21 [PATCH v5 00/10] userfaultfd: add minor fault handling Axel Rasmussen
                   ` (8 preceding siblings ...)
  2021-02-10 21:21 ` [PATCH v5 09/10] userfaultfd: update documentation to describe minor fault handling Axel Rasmussen
@ 2021-02-10 21:22 ` Axel Rasmussen
  9 siblings, 0 replies; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-10 21:22 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Axel Rasmussen, Cannon Matthews, Dr . David Alan Gilbert,
	David Rientjes, Mina Almasry, Oliver Upton

Fix a dormant bug in userfaultfd_events_test(), where we did
`return faulting_process(0)` instead of `exit(faulting_process(0))`.
This caused the forked process to keep running, trying to execute any
further test cases after the events test in parallel with the "real"
process.
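
(Sketch of the fix: in the forked child, exit() must be used so the
child terminates with faulting_process()'s status instead of returning
into the caller and running the remaining tests:)

    pid = fork();
    if (!pid)
        exit(faulting_process(0));  /* child: don't fall back into the
                                     * parent's remaining test code */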

Add a simple test case which exercises minor faults. In short, it does
the following:

1. "Sets up" an area (area_dst) and a second shared mapping to the same
   underlying pages (area_dst_alias).

2. Register one of these areas with userfaultfd, in minor fault mode.

3. Start a second thread to handle any minor faults.

4. Populate the underlying pages via the non-UFFD-registered side of
   the mapping. Basically, memset() each page with some arbitrary
   contents.

5. Then, using the UFFD-registered mapping, read all of the page
   contents, asserting that the contents match expectations (we expect
   the minor fault handling thread can modify the page contents before
   resolving the fault).

The minor fault handling thread, upon receiving an event, flips all the
bits (~) in that page, just to prove that it can modify it in some
arbitrary way. Then it issues a UFFDIO_CONTINUE ioctl to set up the
mapping and resolve the fault. The reading thread should wake up and see
this modification.

Currently the minor fault test is only enabled in hugetlb_shared mode,
as this is the only configuration the kernel feature supports.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 147 ++++++++++++++++++++++-
 1 file changed, 143 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 92b8ec423201..73a72a3c4189 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -81,6 +81,8 @@ static volatile bool test_uffdio_copy_eexist = true;
 static volatile bool test_uffdio_zeropage_eexist = true;
 /* Whether to test uffd write-protection */
 static bool test_uffdio_wp = false;
+/* Whether to test uffd minor faults */
+static bool test_uffdio_minor = false;
 
 static bool map_shared;
 static int huge_fd;
@@ -96,6 +98,7 @@ struct uffd_stats {
 	int cpu;
 	unsigned long missing_faults;
 	unsigned long wp_faults;
+	unsigned long minor_faults;
 };
 
 /* pthread_mutex_t starts at page offset 0 */
@@ -153,17 +156,19 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
 		uffd_stats[i].cpu = i;
 		uffd_stats[i].missing_faults = 0;
 		uffd_stats[i].wp_faults = 0;
+		uffd_stats[i].minor_faults = 0;
 	}
 }
 
 static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
 {
 	int i;
-	unsigned long long miss_total = 0, wp_total = 0;
+	unsigned long long miss_total = 0, wp_total = 0, minor_total = 0;
 
 	for (i = 0; i < n_cpus; i++) {
 		miss_total += stats[i].missing_faults;
 		wp_total += stats[i].wp_faults;
+		minor_total += stats[i].minor_faults;
 	}
 
 	printf("userfaults: %llu missing (", miss_total);
@@ -172,6 +177,9 @@ static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
 	printf("\b), %llu wp (", wp_total);
 	for (i = 0; i < n_cpus; i++)
 		printf("%lu+", stats[i].wp_faults);
+	printf("\b), %llu minor (", minor_total);
+	for (i = 0; i < n_cpus; i++)
+		printf("%lu+", stats[i].minor_faults);
 	printf("\b)\n");
 }
 
@@ -328,7 +336,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = {
 };
 
 static struct uffd_test_ops hugetlb_uffd_test_ops = {
-	.expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC,
+	.expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC & ~(1 << _UFFDIO_CONTINUE),
 	.allocate_area	= hugetlb_allocate_area,
 	.release_pages	= hugetlb_release_pages,
 	.alias_mapping = hugetlb_alias_mapping,
@@ -362,6 +370,22 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
 	}
 }
 
+static void continue_range(int ufd, __u64 start, __u64 len)
+{
+	struct uffdio_continue req;
+
+	req.range.start = start;
+	req.range.len = len;
+	req.mode = 0;
+
+	if (ioctl(ufd, UFFDIO_CONTINUE, &req)) {
+		fprintf(stderr,
+			"UFFDIO_CONTINUE failed for address 0x%" PRIx64 "\n",
+			(uint64_t)start);
+		exit(1);
+	}
+}
+
 static void *locking_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
@@ -569,8 +593,32 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 	}
 
 	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
+		/* Write protect page faults */
 		wp_range(uffd, msg->arg.pagefault.address, page_size, false);
 		stats->wp_faults++;
+	} else if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
+		uint8_t *area;
+		int b;
+
+		/*
+		 * Minor page faults
+		 *
+		 * To prove we can modify the original range for testing
+		 * purposes, we're going to bit flip this range before
+		 * continuing.
+		 *
+		 * Note that this requires all minor page fault tests operate on
+		 * area_dst (non-UFFD-registered) and area_dst_alias
+		 * (UFFD-registered).
+		 */
+
+		area = (uint8_t *)(area_dst +
+				   ((char *)msg->arg.pagefault.address -
+				    area_dst_alias));
+		for (b = 0; b < page_size; ++b)
+			area[b] = ~area[b];
+		continue_range(uffd, msg->arg.pagefault.address, page_size);
+		stats->minor_faults++;
 	} else {
 		/* Missing page faults */
 		if (bounces & BOUNCE_VERIFY &&
@@ -1112,7 +1160,7 @@ static int userfaultfd_events_test(void)
 	}
 
 	if (!pid)
-		return faulting_process(0);
+		exit(faulting_process(0));
 
 	waitpid(pid, &err, 0);
 	if (err) {
@@ -1215,6 +1263,95 @@ static int userfaultfd_sig_test(void)
 	return userfaults != 0;
 }
 
+static int userfaultfd_minor_test(void)
+{
+	struct uffdio_register uffdio_register;
+	unsigned long expected_ioctls;
+	unsigned long p;
+	pthread_t uffd_mon;
+	uint8_t expected_byte;
+	void *expected_page;
+	char c;
+	struct uffd_stats stats = { 0 };
+
+	if (!test_uffdio_minor)
+		return 0;
+
+	printf("testing minor faults: ");
+	fflush(stdout);
+
+	if (uffd_test_ops->release_pages(area_dst))
+		return 1;
+
+	if (userfaultfd_open(0))
+		return 1;
+
+	uffdio_register.range.start = (unsigned long)area_dst_alias;
+	uffdio_register.range.len = nr_pages * page_size;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
+		fprintf(stderr, "register failure\n");
+		exit(1);
+	}
+
+	expected_ioctls = uffd_test_ops->expected_ioctls;
+	expected_ioctls |= 1 << _UFFDIO_CONTINUE;
+	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
+		fprintf(stderr, "unexpected missing ioctl(s)\n");
+		exit(1);
+	}
+
+	/*
+	 * After registering with UFFD, populate the non-UFFD-registered side of
+	 * the shared mapping. This should *not* trigger any UFFD minor faults.
+	 */
+	for (p = 0; p < nr_pages; ++p) {
+		memset(area_dst + (p * page_size), p % ((uint8_t)-1),
+		       page_size);
+	}
+
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) {
+		perror("uffd_poll_thread create");
+		exit(1);
+	}
+
+	/*
+	 * Read each of the pages back using the UFFD-registered mapping. We
+	 * expect that the first time we touch a page, it will result in a minor
+	 * fault. uffd_poll_thread will resolve the fault by bit-flipping the
+	 * page's contents, and then issuing a CONTINUE ioctl.
+	 */
+
+	if (posix_memalign(&expected_page, page_size, page_size)) {
+		fprintf(stderr, "out of memory\n");
+		return 1;
+	}
+
+	for (p = 0; p < nr_pages; ++p) {
+		expected_byte = ~((uint8_t)(p % ((uint8_t)-1)));
+		memset(expected_page, expected_byte, page_size);
+		if (my_bcmp(expected_page, area_dst_alias + (p * page_size),
+			    page_size)) {
+			fprintf(stderr,
+				"unexpected page contents after minor fault\n");
+			exit(1);
+		}
+	}
+
+	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) {
+		perror("pipe write");
+		exit(1);
+	}
+	if (pthread_join(uffd_mon, NULL))
+		return 1;
+
+	close(uffd);
+
+	uffd_stats_report(&stats, 1);
+
+	return stats.minor_faults != nr_pages;
+}
+
 static int userfaultfd_stress(void)
 {
 	void *area;
@@ -1413,7 +1550,7 @@ static int userfaultfd_stress(void)
 
 	close(uffd);
 	return userfaultfd_zeropage_test() || userfaultfd_sig_test()
-		|| userfaultfd_events_test();
+		|| userfaultfd_events_test() || userfaultfd_minor_test();
 }
 
 /*
@@ -1454,6 +1591,8 @@ static void set_test_type(const char *type)
 		map_shared = true;
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
+		/* Minor faults require shared hugetlb; only enable here. */
+		test_uffdio_minor = true;
 	} else if (!strcmp(type, "shmem")) {
 		map_shared = true;
 		test_type = TEST_SHMEM;
-- 
2.30.0.478.g8a0d178c01-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 08/10] userfaultfd: add UFFDIO_CONTINUE ioctl
  2021-02-10 21:21 ` [PATCH v5 08/10] userfaultfd: add UFFDIO_CONTINUE ioctl Axel Rasmussen
@ 2021-02-10 22:21   ` Axel Rasmussen
  0 siblings, 0 replies; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-10 22:21 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka
  Cc: LKML, linux-fsdevel, Linux MM, Adam Ruprecht, Cannon Matthews,
	Dr . David Alan Gilbert, David Rientjes, Mina Almasry,
	Oliver Upton

"

On Wed, Feb 10, 2021 at 1:22 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> [...]
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2331281cf133..7ddf28730361 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -39,7 +39,6 @@
>  #include <linux/hugetlb.h>
>  #include <linux/hugetlb_cgroup.h>
>  #include <linux/node.h>
> -#include <linux/userfaultfd_k.h>
>  #include <linux/page_owner.h>
>  #include "internal.h"
>
> @@ -4648,8 +4647,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                             struct vm_area_struct *dst_vma,
>                             unsigned long dst_addr,
>                             unsigned long src_addr,
> +                           enum mcopy_atomic_mode mode,
>                             struct page **pagep)
>  {
> +       bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
>         struct address_space *mapping;
>         pgoff_t idx;
>         unsigned long size;
> @@ -4659,8 +4660,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>         spinlock_t *ptl;
>         int ret;
>         struct page *page;
> +       int writable;
>
> -       if (!*pagep) {
> +       mapping = dst_vma->vm_file->f_mapping;
> +       idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> +
> +       if (is_continue) {
> +               ret = -EFAULT;
> +               page = find_lock_page(mapping, idx);
> +               *pagep = NULL;
> +               if (!page)
> +                       goto out;
> +       } else if (!*pagep) {
>                 ret = -ENOMEM;
>                 page = alloc_huge_page(dst_vma, dst_addr, 0);
>                 if (IS_ERR(page))
> @@ -4689,13 +4700,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>          */
>         __SetPageUptodate(page);
>
> -       mapping = dst_vma->vm_file->f_mapping;
> -       idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> -
> -       /*
> -        * If shared, add to page cache
> -        */
> -       if (vm_shared) {
> +       /* Add shared, newly allocated pages to the page cache. */
> +       if (vm_shared && !is_continue) {
>                 size = i_size_read(mapping->host) >> huge_page_shift(h);
>                 ret = -EFAULT;
>                 if (idx >= size)
> @@ -4740,8 +4746,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                 hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
>         }
>
> -       _dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE);
> -       if (dst_vma->vm_flags & VM_WRITE)
> +       /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
> +       if (is_continue && !vm_shared)
> +               writable = 0;
> +       else
> +               writable = dst_vma->vm_flags & VM_WRITE;
> +
> +       _dst_pte = make_huge_pte(dst_vma, page, writable);
> +       if (writable)
>                 _dst_pte = huge_pte_mkdirty(_dst_pte);
>         _dst_pte = pte_mkyoung(_dst_pte);
>
> @@ -4755,8 +4767,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>         update_mmu_cache(dst_vma, dst_addr, dst_pte);
>
>         spin_unlock(ptl);
> -       SetHPageMigratable(page);
> -       if (vm_shared)
> +       if (!is_continue)
> +               SetHPageMigratable(page);
> +       if (vm_shared || is_continue)
>                 unlock_page(page);
>         ret = 0;
>  out:

Apologies, I noticed a mistake here. Just like the
"unlock_page(page);" line above, I think we need to change to "if
(vm_shared || is_continue)" in the "out_release_unlock:" case as well.
I'll send this change in a v5.1 or a v6, depending on what other
comments this series gets in the next few days.
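
Concretely, the follow-up would look something like this (sketch only;
the surrounding error path is abbreviated from the actual function):

    out_release_unlock:
        spin_unlock(ptl);
        if (vm_shared || is_continue)  /* was: if (vm_shared) */
            unlock_page(page);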

> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index b2ce61c1b50d..ce6cb4760d2c 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -207,7 +207,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>                                               unsigned long dst_start,
>                                               unsigned long src_start,
>                                               unsigned long len,
> -                                             bool zeropage)
> +                                             enum mcopy_atomic_mode mode)
>  {
>         int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
>         int vm_shared = dst_vma->vm_flags & VM_SHARED;
> @@ -227,7 +227,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>          * by THP.  Since we can not reliably insert a zero page, this
>          * feature is not supported.
>          */
> -       if (zeropage) {
> +       if (mode == MCOPY_ATOMIC_ZEROPAGE) {
>                 mmap_read_unlock(dst_mm);
>                 return -EINVAL;
>         }
> @@ -273,8 +273,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>         }
>
>         while (src_addr < src_start + len) {
> -               pte_t dst_pteval;
> -
>                 BUG_ON(dst_addr >= dst_start + len);
>
>                 /*
> @@ -297,16 +295,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>                         goto out_unlock;
>                 }
>
> -               err = -EEXIST;
> -               dst_pteval = huge_ptep_get(dst_pte);
> -               if (!huge_pte_none(dst_pteval)) {
> +               if (mode != MCOPY_ATOMIC_CONTINUE &&
> +                   !huge_pte_none(huge_ptep_get(dst_pte))) {
> +                       err = -EEXIST;
>                         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>                         i_mmap_unlock_read(mapping);
>                         goto out_unlock;
>                 }
>
>                 err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
> -                                               dst_addr, src_addr, &page);
> +                                              dst_addr, src_addr, mode, &page);
>
>                 mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>                 i_mmap_unlock_read(mapping);
> @@ -408,7 +406,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>                                       unsigned long dst_start,
>                                       unsigned long src_start,
>                                       unsigned long len,
> -                                     bool zeropage);
> +                                     enum mcopy_atomic_mode mode);
>  #endif /* CONFIG_HUGETLB_PAGE */
>
>  static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> @@ -458,7 +456,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>                                               unsigned long dst_start,
>                                               unsigned long src_start,
>                                               unsigned long len,
> -                                             bool zeropage,
> +                                             enum mcopy_atomic_mode mcopy_mode,
>                                               bool *mmap_changing,
>                                               __u64 mode)
>  {
> @@ -469,6 +467,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>         long copied;
>         struct page *page;
>         bool wp_copy;
> +       bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);
>
>         /*
>          * Sanitize the command parameters:
> @@ -527,10 +526,12 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>          */
>         if (is_vm_hugetlb_page(dst_vma))
>                 return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
> -                                               src_start, len, zeropage);
> +                                               src_start, len, mcopy_mode);
>
>         if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
>                 goto out_unlock;
> +       if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
> +               goto out_unlock;
>
>         /*
>          * Ensure the dst_vma has an anon_vma or this page
> @@ -626,14 +627,22 @@ ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
>                      unsigned long src_start, unsigned long len,
>                      bool *mmap_changing, __u64 mode)
>  {
> -       return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
> -                             mmap_changing, mode);
> +       return __mcopy_atomic(dst_mm, dst_start, src_start, len,
> +                             MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
>  }
>
>  ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>                        unsigned long len, bool *mmap_changing)
>  {
> -       return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
> +       return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
> +                             mmap_changing, 0);
> +}
> +
> +ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
> +                      unsigned long len, bool *mmap_changing)
> +{
> +       return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
> +                             mmap_changing, 0);
>  }
>
>  int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> --
> 2.30.0.478.g8a0d178c01-goog
>
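
For context, here is a hedged sketch of how the new mcopy_continue() path
above is meant to be driven from userspace.  UFFDIO_CONTINUE and struct
uffdio_continue are introduced later in this series, so the layout assumed
here (a range plus a mode word) and the uffd/fault_addr/hpage_size
variables are illustrative only:

	/* Sketch only: resolve a minor fault after the page contents
	 * have already been prepared through the non-UFFD mapping.
	 */
	struct uffdio_continue cont = {
		.range = {
			.start = fault_addr & ~(hpage_size - 1),
			.len   = hpage_size,
		},
		.mode = 0,
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
		perror("UFFDIO_CONTINUE");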

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode
  2021-02-10 21:21 ` [PATCH v5 05/10] userfaultfd: add minor fault registration mode Axel Rasmussen
@ 2021-02-11 19:28   ` Axel Rasmussen
  2021-02-11 20:58     ` Axel Rasmussen
  2021-02-12 22:21     ` Matthew Wilcox
  2021-02-12 19:17   ` Mike Kravetz
  1 sibling, 2 replies; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-11 19:28 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka
  Cc: LKML, linux-fsdevel, Linux MM, Adam Ruprecht, Cannon Matthews,
	Dr . David Alan Gilbert, David Rientjes, Mina Almasry,
	Oliver Upton

On Wed, Feb 10, 2021 at 1:22 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> This feature allows userspace to intercept "minor" faults. By "minor"
> faults, I mean the following situation:
>
> Let there exist two mappings (i.e., VMAs) to the same page(s). One of
> the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have
> already been allocated & filled with some contents. The UFFD mapping
> has not yet been faulted in; when it is touched for the first time,
> this results in what I'm calling a "minor" fault. As a concrete
> example, when working with hugetlbfs, we have huge_pte_none(), but
> find_lock_page() finds an existing page.
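
To make the setup concrete, here is a minimal userspace sketch (not part of
the patch; it assumes a hugetlbfs mount at /dev/hugepages with free 2MB
pages, and omits error handling):

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HPAGE_SZ (2UL << 20)

	int main(void)
	{
		int fd = open("/dev/hugepages/demo", O_CREAT | O_RDWR, 0600);

		ftruncate(fd, HPAGE_SZ);

		/* Non-UFFD mapping: allocate and fill the page normally. */
		char *setup = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
				   MAP_SHARED, fd, 0);
		memset(setup, 0xab, HPAGE_SZ);

		/*
		 * Second mapping of the same page; this is the one that
		 * would be registered with UFFD in minor mode.  Its page
		 * table is still empty, so the first touch below is a
		 * "minor" fault: huge_pte_none() is true, but
		 * find_lock_page() finds the page filled above.
		 */
		char *alias = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
				   MAP_SHARED, fd, 0);
		(void)*(volatile char *)alias;	/* the minor fault */
		return 0;
	}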
>
> This commit adds the new registration mode, and sets the relevant flag
> on the VMAs being registered. In the hugetlb fault path, if we find
> that we have huge_pte_none(), but find_lock_page() does indeed find an
> existing page, then we have a "minor" fault, and if the VMA has the
> userfaultfd registration flag, we call into userfaultfd to handle it.
>
> Why add a new registration mode, as opposed to adding a feature to
> MISSING registration, like UFFD_FEATURE_SIGBUS?
>
> - The semantics are significantly different. UFFDIO_COPY or
>   UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
>   would instead just memset() or memcpy() or whatever via the non-UFFD
>   mapping. Unlike MISSING registration, MINOR registration only makes
>   sense for hugetlbfs (or, in the future, shmem), as this is the only
>   way to get two VMAs to a single set of underlying pages.
>
> - Doing so would make handle_userfault()'s "reason" argument confusing.
>   We'd pass in "MISSING" even if the pages weren't really missing.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  fs/proc/task_mmu.c               |  1 +
>  fs/userfaultfd.c                 | 71 ++++++++++++++++++--------------
>  include/linux/mm.h               |  1 +
>  include/linux/userfaultfd_k.h    | 15 ++++++-
>  include/trace/events/mmflags.h   |  1 +
>  include/uapi/linux/userfaultfd.h | 15 ++++++-
>  mm/hugetlb.c                     | 32 ++++++++++++++
>  7 files changed, 102 insertions(+), 34 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 602e3a52884d..94e951ea3e03 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>                 [ilog2(VM_MTE)]         = "mt",
>                 [ilog2(VM_MTE_ALLOWED)] = "",
>  #endif
> +               [ilog2(VM_UFFD_MINOR)]  = "ui",
>  #ifdef CONFIG_ARCH_HAS_PKEYS
>                 /* These come out via ProtectionKey: */
>                 [ilog2(VM_PKEY_BIT0)]   = "",
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 1f4a34b1a1e7..b351a8552140 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -197,24 +197,21 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
>         msg_init(&msg);
>         msg.event = UFFD_EVENT_PAGEFAULT;
>         msg.arg.pagefault.address = address;
> +       /*
> +        * These flags indicate why the userfault occurred:
> +        * - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
> +        * - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
> +        * - Neither of these flags being set indicates a MISSING fault.
> +        *
> +        * Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
> +        * fault. Otherwise, it was a read fault.
> +        */
>         if (flags & FAULT_FLAG_WRITE)
> -               /*
> -                * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> -                * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
> -                * was not set in a UFFD_EVENT_PAGEFAULT, it means it
> -                * was a read fault, otherwise if set it means it's
> -                * a write fault.
> -                */
>                 msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
>         if (reason & VM_UFFD_WP)
> -               /*
> -                * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> -                * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
> -                * not set in a UFFD_EVENT_PAGEFAULT, it means it was
> -                * a missing fault, otherwise if set it means it's a
> -                * write protect fault.
> -                */
>                 msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
> +       if (reason & VM_UFFD_MINOR)
> +               msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
>         if (features & UFFD_FEATURE_THREAD_ID)
>                 msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
>         return msg;
> @@ -401,8 +398,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>
>         BUG_ON(ctx->mm != mm);
>
> -       VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> -       VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +       /* Any unrecognized flag is a bug. */
> +       VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
> +       /* 0 or > 1 flags set is a bug; we expect exactly 1. */
> +       VM_BUG_ON(!reason || !!(reason & (reason - 1)));
>
>         if (ctx->features & UFFD_FEATURE_SIGBUS)
>                 goto out;
> @@ -612,7 +611,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
>                 for (vma = mm->mmap; vma; vma = vma->vm_next)
>                         if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
>                                 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -                               vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +                               vma->vm_flags &= ~__VM_UFFD_FLAGS;
>                         }
>                 mmap_write_unlock(mm);
>
> @@ -644,7 +643,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>         octx = vma->vm_userfaultfd_ctx.ctx;
>         if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
>                 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -               vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +               vma->vm_flags &= ~__VM_UFFD_FLAGS;
>                 return 0;
>         }
>
> @@ -726,7 +725,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>         } else {
>                 /* Drop uffd context if remap feature not enabled */
>                 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -               vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +               vma->vm_flags &= ~__VM_UFFD_FLAGS;
>         }
>  }
>
> @@ -867,12 +866,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
>         for (vma = mm->mmap; vma; vma = vma->vm_next) {
>                 cond_resched();
>                 BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> -                      !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +                      !!(vma->vm_flags & __VM_UFFD_FLAGS));
>                 if (vma->vm_userfaultfd_ctx.ctx != ctx) {
>                         prev = vma;
>                         continue;
>                 }
> -               new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> +               new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
>                 prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
>                                  new_flags, vma->anon_vma,
>                                  vma->vm_file, vma->vm_pgoff,
> @@ -1306,9 +1305,19 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
>                                      unsigned long vm_flags)
>  {
>         /* FIXME: add WP support to hugetlbfs and shmem */
> -       return vma_is_anonymous(vma) ||
> -               ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> -                !(vm_flags & VM_UFFD_WP));
> +       if (vm_flags & VM_UFFD_WP) {
> +               if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
> +                       return false;
> +       }
> +
> +       if (vm_flags & VM_UFFD_MINOR) {
> +               /* FIXME: Add minor fault interception for shmem. */
> +               if (!is_vm_hugetlb_page(vma))
> +                       return false;
> +       }
> +
> +       return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> +              vma_is_shmem(vma);
>  }
>
>  static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> @@ -1334,14 +1343,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>         ret = -EINVAL;
>         if (!uffdio_register.mode)
>                 goto out;
> -       if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
> -                                    UFFDIO_REGISTER_MODE_WP))
> +       if (uffdio_register.mode & ~UFFD_API_REGISTER_MODES)
>                 goto out;
>         vm_flags = 0;
>         if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
>                 vm_flags |= VM_UFFD_MISSING;
>         if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
>                 vm_flags |= VM_UFFD_WP;
> +       if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)
> +               vm_flags |= VM_UFFD_MINOR;
>
>         ret = validate_range(mm, &uffdio_register.range.start,
>                              uffdio_register.range.len);
> @@ -1385,7 +1395,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>                 cond_resched();
>
>                 BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> -                      !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +                      !!(cur->vm_flags & __VM_UFFD_FLAGS));
>
>                 /* check not compatible vmas */
>                 ret = -EINVAL;
> @@ -1465,8 +1475,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>                         start = vma->vm_start;
>                 vma_end = min(end, vma->vm_end);
>
> -               new_flags = (vma->vm_flags &
> -                            ~(VM_UFFD_MISSING|VM_UFFD_WP)) | vm_flags;
> +               new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
>                 prev = vma_merge(mm, prev, start, vma_end, new_flags,
>                                  vma->anon_vma, vma->vm_file, vma->vm_pgoff,
>                                  vma_policy(vma),
> @@ -1588,7 +1597,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>                 cond_resched();
>
>                 BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> -                      !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +                      !!(cur->vm_flags & __VM_UFFD_FLAGS));
>
>                 /*
>                  * Check not compatible vmas, not strictly required
> @@ -1639,7 +1648,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>                         wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
>                 }
>
> -               new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> +               new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
>                 prev = vma_merge(mm, prev, start, vma_end, new_flags,
>                                  vma->anon_vma, vma->vm_file, vma->vm_pgoff,
>                                  vma_policy(vma),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 89fca443e6f1..3f65a506c743 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -276,6 +276,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_PFNMAP      0x00000400      /* Page-ranges managed without "struct page", just pure PFN */
>  #define VM_DENYWRITE   0x00000800      /* ETXTBSY on write attempts.. */
>  #define VM_UFFD_WP     0x00001000      /* wrprotect pages tracking */
> +#define VM_UFFD_MINOR  0x00002000      /* minor fault interception */

Ah, I had added this just after VM_UFFD_WP, without noticing that this
would be sharing a bit with VM_LOCKED. That seems like not such a
great idea.

I don't see another unused bit, and I don't see some other obvious
candidate to share with. So, the solution that comes to mind is
something like:

- Since it isn't feasible to have one VM_ flag per UFFD trigger type,
handle_userfault()'s "reason" argument should be some enumeration of
possible UFFD reasons instead.
- Introduce a path where handle_userfault() can return 0, meaning "you
called into me, but I am not meant to be handling this fault per the
userfaultfd_ctx, handle it normally instead".
- Use VM_UFFD_MISSING to decide whether or not to call
handle_userfault(), whether it was a missing or minor fault.

Unless there are objections or some simpler idea, I'll send a v6 with
this change.
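
For reference, both defines end up on bit 13, so a build-time guard along
these lines would have caught the collision (purely illustrative, not part
of the patch):

	#define VM_UFFD_MINOR	0x00002000	/* from the hunk above */
	#define VM_LOCKED	0x00002000	/* pre-existing, same bit */

	/* This assertion fires for the values above -- which is the bug: */
	_Static_assert((VM_UFFD_MINOR & VM_LOCKED) == 0,
		       "VM_UFFD_MINOR must not alias another VM_ flag");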

>
>  #define VM_LOCKED      0x00002000
>  #define VM_IO           0x00004000     /* Memory mapped I/O or similar */
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index c63ccdae3eab..0390e5ac63b3 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -17,6 +17,9 @@
>  #include <linux/mm.h>
>  #include <asm-generic/pgtable_uffd.h>
>
> +/* The set of all possible UFFD-related VM flags. */
> +#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
> +
>  /*
>   * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
>   * new flags, since they might collide with O_* ones. We want
> @@ -71,6 +74,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
>         return vma->vm_flags & VM_UFFD_WP;
>  }
>
> +static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> +{
> +       return vma->vm_flags & VM_UFFD_MINOR;
> +}
> +
>  static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
>                                       pte_t pte)
>  {
> @@ -85,7 +93,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
>
>  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  {
> -       return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> +       return vma->vm_flags & __VM_UFFD_FLAGS;
>  }
>
>  extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
> @@ -132,6 +140,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
>         return false;
>  }
>
> +static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> +{
> +       return false;
> +}
> +
>  static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
>                                       pte_t pte)
>  {
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 67018d367b9f..2d583ffd4100 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -151,6 +151,7 @@ IF_HAVE_PG_ARCH_2(PG_arch_2,                "arch_2"        )
>         {VM_PFNMAP,                     "pfnmap"        },              \
>         {VM_DENYWRITE,                  "denywrite"     },              \
>         {VM_UFFD_WP,                    "uffd_wp"       },              \
> +       {VM_UFFD_MINOR,                 "uffd_minor"    },              \
>         {VM_LOCKED,                     "locked"        },              \
>         {VM_IO,                         "io"            },              \
>         {VM_SEQ_READ,                   "seqread"       },              \
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 5f2d88212f7c..f24dd4fcbad9 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -19,15 +19,19 @@
>   * means the userland is reading).
>   */
>  #define UFFD_API ((__u64)0xAA)
> +#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING |        \
> +                                UFFDIO_REGISTER_MODE_WP |      \
> +                                UFFDIO_REGISTER_MODE_MINOR)
>  #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |    \
>                            UFFD_FEATURE_EVENT_FORK |            \
>                            UFFD_FEATURE_EVENT_REMAP |           \
> -                          UFFD_FEATURE_EVENT_REMOVE |  \
> +                          UFFD_FEATURE_EVENT_REMOVE |          \
>                            UFFD_FEATURE_EVENT_UNMAP |           \
>                            UFFD_FEATURE_MISSING_HUGETLBFS |     \
>                            UFFD_FEATURE_MISSING_SHMEM |         \
>                            UFFD_FEATURE_SIGBUS |                \
> -                          UFFD_FEATURE_THREAD_ID)
> +                          UFFD_FEATURE_THREAD_ID |             \
> +                          UFFD_FEATURE_MINOR_HUGETLBFS)
>  #define UFFD_API_IOCTLS                                \
>         ((__u64)1 << _UFFDIO_REGISTER |         \
>          (__u64)1 << _UFFDIO_UNREGISTER |       \
> @@ -127,6 +131,7 @@ struct uffd_msg {
>  /* flags for UFFD_EVENT_PAGEFAULT */
>  #define UFFD_PAGEFAULT_FLAG_WRITE      (1<<0)  /* If this was a write fault */
>  #define UFFD_PAGEFAULT_FLAG_WP         (1<<1)  /* If reason is VM_UFFD_WP */
> +#define UFFD_PAGEFAULT_FLAG_MINOR      (1<<2)  /* If reason is VM_UFFD_MINOR */
>
>  struct uffdio_api {
>         /* userland asks for an API number and the features to enable */
> @@ -171,6 +176,10 @@ struct uffdio_api {
>          *
>          * UFFD_FEATURE_THREAD_ID pid of the page faulted task_struct will
>          * be returned, if feature is not requested 0 will be returned.
> +        *
> +        * UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults
> +        * can be intercepted (via REGISTER_MODE_MINOR) for
> +        * hugetlbfs-backed pages.
>          */
>  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP         (1<<0)
>  #define UFFD_FEATURE_EVENT_FORK                        (1<<1)
> @@ -181,6 +190,7 @@ struct uffdio_api {
>  #define UFFD_FEATURE_EVENT_UNMAP               (1<<6)
>  #define UFFD_FEATURE_SIGBUS                    (1<<7)
>  #define UFFD_FEATURE_THREAD_ID                 (1<<8)
> +#define UFFD_FEATURE_MINOR_HUGETLBFS           (1<<9)
>         __u64 features;
>
>         __u64 ioctls;
> @@ -195,6 +205,7 @@ struct uffdio_register {
>         struct uffdio_range range;
>  #define UFFDIO_REGISTER_MODE_MISSING   ((__u64)1<<0)
>  #define UFFDIO_REGISTER_MODE_WP                ((__u64)1<<1)
> +#define UFFDIO_REGISTER_MODE_MINOR     ((__u64)1<<2)
>         __u64 mode;
>
>         /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e41b77cf6cc2..f150b10981a8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4366,6 +4366,38 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>                                 VM_FAULT_SET_HINDEX(hstate_index(h));
>                         goto backout_unlocked;
>                 }
> +
> +               /* Check for page in userfault range. */
> +               if (userfaultfd_minor(vma)) {
> +                       u32 hash;
> +                       struct vm_fault vmf = {
> +                               .vma = vma,
> +                               .address = haddr,
> +                               .flags = flags,
> +                               /*
> +                                * Hard to debug if it ends up being used by a
> +                                * callee that assumes something about the
> +                                * other uninitialized fields... same as in
> +                                * memory.c
> +                                */
> +                       };
> +
> +                       unlock_page(page);
> +
> +                       /*
> +                        * hugetlb_fault_mutex and i_mmap_rwsem must be dropped
> +                        * before handling userfault.  Reacquire after handling
> +                        * fault to make calling code simpler.
> +                        */
> +
> +                       hash = hugetlb_fault_mutex_hash(mapping, idx);
> +                       mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +                       i_mmap_unlock_read(mapping);
> +                       ret = handle_userfault(&vmf, VM_UFFD_MINOR);
> +                       i_mmap_lock_read(mapping);
> +                       mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +                       goto out;
> +               }
>         }
>
>         /*
> --
> 2.30.0.478.g8a0d178c01-goog
>
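
As a small userspace illustration of the flag semantics documented in the
userfault_msg() hunk above (it uses only the uapi bits added by this
patch):

	/* Classify a UFFD_EVENT_PAGEFAULT message per the rules above. */
	static const char *uffd_fault_reason(const struct uffd_msg *msg)
	{
		__u64 flags = msg->arg.pagefault.flags;

		if (flags & UFFD_PAGEFAULT_FLAG_WP)
			return "write-protect";
		if (flags & UFFD_PAGEFAULT_FLAG_MINOR)
			return "minor";
		return "missing";	/* neither WP nor MINOR set */
	}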

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode
  2021-02-11 19:28   ` Axel Rasmussen
@ 2021-02-11 20:58     ` Axel Rasmussen
  2021-02-12 22:21     ` Matthew Wilcox
  1 sibling, 0 replies; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-11 20:58 UTC (permalink / raw)
  To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka
  Cc: LKML, linux-fsdevel, Linux MM, Adam Ruprecht, Cannon Matthews,
	Dr . David Alan Gilbert, David Rientjes, Mina Almasry,
	Oliver Upton

On Thu, Feb 11, 2021 at 11:28 AM Axel Rasmussen
<axelrasmussen@google.com> wrote:
>
> On Wed, Feb 10, 2021 at 1:22 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >
> > [snip]
> > +#define VM_UFFD_MINOR  0x00002000      /* minor fault interception */
>
> Ah, I had added this just after VM_UFFD_WP, without noticing that this
> would be sharing a bit with VM_LOCKED. That seems like not such a
> great idea.
>
> I don't see another unused bit, and I don't see some other obvious
> candidate to share with. So, the solution that comes to mind is
> something like:
>
> - Since it isn't feasible to have one VM_ flag per UFFD trigger type,
> handle_userfault()'s "reason" argument should be some enumeration of
> possible UFFD reasons instead.
> - Introduce a path where handle_userfault() can return 0, meaning "you
> called into me, but I am not meant to be handling this fault per the
> userfaultfd_ctx, handle it normally instead".

Ah, yeah, this doesn't work either. The context is per-fd, not
per-registration, and we don't write down the requested registration mode
anywhere other than the VM flags. So, in lieu of some larger redesign, I
think minor fault handling has to be a per-fd option, not a
per-registration option -- basically, a UFFD_FEATURE_* flag instead of a
separate mode.
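
As a rough sketch of that per-fd direction (hypothetical, ahead of any v6;
the helper name is made up), the fault path would consult the context's
features rather than a VM_ flag:

	/* Hypothetical: gate minor fault interception on the fd's features. */
	static inline bool uffd_minor_enabled(struct vm_area_struct *vma)
	{
		struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;

		return ctx && (ctx->features & UFFD_FEATURE_MINOR_HUGETLBFS);
	}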

> - Use VM_UFFD_MISSING to decide whether or not to call
> handle_userfault(), whether it was a missing or minor fault.
>
> Unless there are objections or some simpler idea, I'll send a v6 with
> this change.
> [snip]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 01/10] hugetlb: Pass vma into huge_pte_alloc() and huge_pmd_share()
  2021-02-10 21:21 ` [PATCH v5 01/10] hugetlb: Pass vma into huge_pte_alloc() and huge_pmd_share() Axel Rasmussen
@ 2021-02-11 23:59   ` Mike Kravetz
  0 siblings, 0 replies; 29+ messages in thread
From: Mike Kravetz @ 2021-02-11 23:59 UTC (permalink / raw)
  To: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Matthew Wilcox (Oracle),
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Peter Xu,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On 2/10/21 1:21 PM, Axel Rasmussen wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> It is a preparation work to be able to behave differently in the per
> architecture huge_pte_alloc() according to different VMA attributes.
> 
> Pass it deeper into huge_pmd_share() so that we can avoid the find_vma() call.
> 
> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  arch/arm64/mm/hugetlbpage.c   |  4 ++--
>  arch/ia64/mm/hugetlbpage.c    |  3 ++-
>  arch/mips/mm/hugetlbpage.c    |  4 ++--
>  arch/parisc/mm/hugetlbpage.c  |  2 +-
>  arch/powerpc/mm/hugetlbpage.c |  3 ++-
>  arch/s390/mm/hugetlbpage.c    |  2 +-
>  arch/sh/mm/hugetlbpage.c      |  2 +-
>  arch/sparc/mm/hugetlbpage.c   |  6 +-----
>  include/linux/hugetlb.h       |  5 +++--
>  mm/hugetlb.c                  | 15 ++++++++-------
>  mm/userfaultfd.c              |  2 +-
>  11 files changed, 24 insertions(+), 24 deletions(-)

Thanks, this will be needed for multiple features where pmd sharing must
be disabled.  And, the need to disable sharing is based on information in
the vma.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  2021-02-10 21:21 ` [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled Axel Rasmussen
@ 2021-02-12  0:19   ` Mike Kravetz
  2021-02-12 20:40     ` Peter Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2021-02-12  0:19 UTC (permalink / raw)
  To: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Matthew Wilcox (Oracle),
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Peter Xu,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On 2/10/21 1:21 PM, Axel Rasmussen wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Huge pmd sharing could bring problems to userfaultfd.  The thing is that
> userfaultfd runs its logic based on the special bits on page table
> entries; however, huge pmd sharing could potentially share page table
> entries for different address ranges.  That could cause issues in either
> of these cases:
> 
>   - When sharing huge pmd page tables for an uffd write protected range, the
>     newly mapped huge pmd range will also be write protected unexpectedly, or,
> 
>   - When we try to write protect a range of huge pmd shared range, we'll first
>     do huge_pmd_unshare() in hugetlb_change_protection(), however that also
>     means the UFFDIO_WRITEPROTECT could be silently skipped for the shared
>     region, which could lead to data loss.
> 
> While at it, a few other things are done as well:
> 
>   - Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because
>     that's definitely something that arch code would like to use too
> 
>   - ARM64 currently directly check against CONFIG_ARCH_WANT_HUGE_PMD_SHARE when
>     trying to share huge pmd.  Switch to the want_pmd_share() helper.
> 
> While at it, move vma_shareable() from huge_pmd_share() into want_pmd_share().
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  arch/arm64/mm/hugetlbpage.c   |  3 +--
>  include/linux/hugetlb.h       |  2 ++
>  include/linux/userfaultfd_k.h |  9 +++++++++
>  mm/hugetlb.c                  | 20 ++++++++++++++------
>  4 files changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 6e3bcffe2837..58987a98e179 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -284,8 +284,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 */
>  		ptep = pte_alloc_map(mm, pmdp, addr);
>  	} else if (sz == PMD_SIZE) {
> -		if (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
> -		    pud_none(READ_ONCE(*pudp)))
> +		if (want_pmd_share(vma, addr) && pud_none(READ_ONCE(*pudp)))
>  			ptep = huge_pmd_share(mm, vma, addr, pudp);
>  		else
>  			ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ca6e5ba56f73..d971e7efd17d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1030,4 +1030,6 @@ static inline __init void hugetlb_cma_check(void)
>  }
>  #endif
>  
> +bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
> +
>  #endif /* _LINUX_HUGETLB_H */
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index a8e5f3ea9bb2..c63ccdae3eab 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -52,6 +52,15 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
>  	return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
>  }
>  
> +/*
> + * Never enable huge pmd sharing on uffd-wp registered vmas, because uffd-wp
> + * protect information is per pgtable entry.
> + */
> +static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> +{
> +	return vma->vm_flags & VM_UFFD_WP;
> +}
> +
>  static inline bool userfaultfd_missing(struct vm_area_struct *vma)
>  {
>  	return vma->vm_flags & VM_UFFD_MISSING;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 32d4d2e277ad..5710286e1984 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5245,6 +5245,18 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  	return false;
>  }
>  
> +bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
> +{
> +#ifndef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> +	return false;
> +#endif
> +#ifdef CONFIG_USERFAULTFD
> +	if (uffd_disable_huge_pmd_share(vma))
> +		return false;
> +#endif
> +	return vma_shareable(vma, addr);
> +}
> +

This code certainly does the right thing; however, I wonder if it should
be structured a little differently.

want_pmd_share() is currently just a check for CONFIG_ARCH_WANT_HUGE_PMD_SHARE.
How about leaving that mostly as is, and adding the new vma checks to
vma_shareable()?  vma_shareable() would then be something like:

	unsigned long base = addr & PUD_MASK;
	unsigned long end = base + PUD_SIZE;

	if (!(vma->vm_flags & VM_MAYSHARE))
		return false;
#ifdef CONFIG_USERFAULTFD
	if (uffd_disable_huge_pmd_share(vma))
		return false;
#endif
#ifdef XXX
	/* add other checks for things like uffd wp and soft dirty here */
#endif /* XXX */

	if (range_in_vma(vma, base, end))
		return true;
	return false;

Of course, this would require that we leave the call to vma_shareable() at
the beginning of huge_pmd_share().  It also means that we are always making
a function call into huge_pmd_share() to determine if sharing is possible.
That is no different than today.  If we do not want to make that extra
function call, then I would suggest putting all that code in
want_pmd_share(), as in the sketch below.  It just seems that all the vma
checks for sharing should be in one place if possible.
-- 
Mike Kravetz
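
A sketch of that second option, with every vma check folded into
want_pmd_share() (illustrative only, assembled from the helpers in this
patch):

	bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
	{
		unsigned long base = addr & PUD_MASK;
		unsigned long end = base + PUD_SIZE;

		if (!IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE))
			return false;
		if (!(vma->vm_flags & VM_MAYSHARE))
			return false;
	#ifdef CONFIG_USERFAULTFD
		if (uffd_disable_huge_pmd_share(vma))
			return false;
	#endif
		/* Only worth sharing if the whole PUD range fits in the vma. */
		return range_in_vma(vma, base, end);
	}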

>  /*
>   * Determine if start,end range within vma could be mapped by shared pmd.
>   * If yes, adjust start and end to cover range associated with possible
> @@ -5301,9 +5313,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> -	if (!vma_shareable(vma, addr))
> -		return (pte_t *)pmd_alloc(mm, pud, addr);
> -
>  	i_mmap_assert_locked(mapping);
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
> @@ -5367,7 +5376,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  	*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
>  	return 1;
>  }
> -#define want_pmd_share()	(1)
> +
>  #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long addr, pud_t *pud)
> @@ -5385,7 +5394,6 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end)
>  {
>  }
> -#define want_pmd_share()	(0)
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
>  #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
> @@ -5407,7 +5415,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  			pte = (pte_t *)pud;
>  		} else {
>  			BUG_ON(sz != PMD_SIZE);
> -			if (want_pmd_share() && pud_none(*pud))
> +			if (want_pmd_share(vma, addr) && pud_none(*pud))
>  				pte = huge_pmd_share(mm, vma, addr, pud);
>  			else
>  				pte = (pte_t *)pmd_alloc(mm, pud, addr);
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
  2021-02-10 21:21 ` [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp Axel Rasmussen
@ 2021-02-12 18:11   ` Mike Kravetz
  2021-02-12 21:18     ` Peter Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2021-02-12 18:11 UTC (permalink / raw)
  To: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Matthew Wilcox (Oracle),
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Peter Xu,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On 2/10/21 1:21 PM, Axel Rasmussen wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because
> userfaultfd-wp is always based on pgtable entries, so they cannot be shared.
> 
> Walk the hugetlb range and unshare all such mappings if any exist, right
> before UFFDIO_REGISTER succeeds and returns to userspace.
> 
> This will pair with want_pmd_share() in hugetlb code so that huge pmd sharing
> is completely disabled for the userfaultfd-wp registered range.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  fs/userfaultfd.c             | 48 ++++++++++++++++++++++++++++++++++++
>  include/linux/mmu_notifier.h |  1 +
>  2 files changed, 49 insertions(+)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 0be8cdd4425a..1f4a34b1a1e7 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -15,6 +15,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/sched/mm.h>
>  #include <linux/mm.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/poll.h>
>  #include <linux/slab.h>
>  #include <linux/seq_file.h>
> @@ -1191,6 +1192,50 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
>  	}
>  }
>  
> +/*
> + * This function will unconditionally remove all the shared pmd pgtable entries
> + * within the specific vma for a hugetlbfs memory range.
> + */
> +static void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> +	struct hstate *h = hstate_vma(vma);
> +	unsigned long sz = huge_page_size(h);
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_notifier_range range;
> +	unsigned long address;
> +	spinlock_t *ptl;
> +	pte_t *ptep;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return;
> +
> +	/*
> +	 * No need to call adjust_range_if_pmd_sharing_possible(), because
> +	 * we're going to operate on the whole vma
> +	 */

This code will certainly work as intended.  However, I wonder if we should
try to optimize and only flush and call huge_pmd_unshare for addresses where
sharing is possible.  Consider this worst case example:

vm_start = 8G + 2M
vm_end   = 11G - 2M
The vma is 'almost' 3G in size, yet only the range 9G to 10G is possibly
shared.  This routine will potentially call lock/unlock ptl and call
huge_pmd_unshare for every huge page in the range.  Ideally, we should only
make one call to huge_pmd_unshare with address 9G.  Whether or not the
unshare is successful, we are done.  The subtle manipulation of &address in
huge_pmd_unshare will result in only one call if the unshare is successful,
but if unsuccessful we will unnecessarily call huge_pmd_unshare for each
address in the range.

Maybe we start by rounding vm_start up and vm_end down to PUD_SIZE
boundaries.

> +	mmu_notifier_range_init(&range, MMU_NOTIFY_HUGETLB_UNSHARE,
> +				0, vma, mm, vma->vm_start, vma->vm_end);
> +	mmu_notifier_invalidate_range_start(&range);
> +	i_mmap_lock_write(vma->vm_file->f_mapping);
> +	for (address = vma->vm_start; address < vma->vm_end; address += sz) {

Then, change the loop increment to PUD_SIZE.  And, also ignore the &address
manipulation done by huge_pmd_unshare.
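
Putting the two suggestions together, the loop might look roughly like this
(only a sketch, untested; ALIGN()/ALIGN_DOWN() are the existing kernel
helpers):

	unsigned long start = ALIGN(vma->vm_start, PUD_SIZE);
	unsigned long end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);

	/* If the vma does not span at least one full PUD, nothing can
	 * possibly be shared. */
	if (start >= end)
		return;

	for (address = start; address < end; address += PUD_SIZE) {
		unsigned long tmp = address;

		ptep = huge_pte_offset(mm, address, sz);
		if (!ptep)
			continue;
		ptl = huge_pte_lock(h, mm, ptep);
		/* Pass a copy so the &addr adjustment inside
		 * huge_pmd_unshare cannot perturb our stride. */
		huge_pmd_unshare(mm, vma, &tmp, ptep);
		spin_unlock(ptl);
	}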

> +		ptep = huge_pte_offset(mm, address, sz);
> +		if (!ptep)
> +			continue;
> +		ptl = huge_pte_lock(h, mm, ptep);
> +		huge_pmd_unshare(mm, vma, &address, ptep);
> +		spin_unlock(ptl);
> +	}
> +	flush_hugetlb_tlb_range(vma, vma->vm_start, vma->vm_end);
> +	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	/*
> +	 * No need to call mmu_notifier_invalidate_range(), see
> +	 * Documentation/vm/mmu_notifier.rst.
> +	 */
> +	mmu_notifier_invalidate_range_end(&range);
> +#endif
> +}
> +
>  static void __wake_userfault(struct userfaultfd_ctx *ctx,
>  			     struct userfaultfd_wake_range *range)
>  {
> @@ -1449,6 +1494,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  		vma->vm_flags = new_flags;
>  		vma->vm_userfaultfd_ctx.ctx = ctx;
>  
> +		if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
> +			hugetlb_unshare_all_pmds(vma);
> +
>  	skip:
>  		prev = vma;
>  		start = vma->vm_end;
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index b8200782dede..ff50c8528113 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -51,6 +51,7 @@ enum mmu_notifier_event {
>  	MMU_NOTIFY_SOFT_DIRTY,
>  	MMU_NOTIFY_RELEASE,
>  	MMU_NOTIFY_MIGRATE,
> +	MMU_NOTIFY_HUGETLB_UNSHARE,

I don't claim to know much about mmu notifiers.  Currently, we use other
event types such as MMU_NOTIFY_CLEAR.  I guess we do 'clear' page table
entries if we unshare.  More than happy to have a MMU_NOTIFY_HUGETLB_UNSHARE
event, but will consumers of the notifications know what this new event type
means?  And, if we introduce this, should we use it in other places where
huge_pmd_unshare is called?
--
Mike Kravetz

>  };
>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> 


* Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode
  2021-02-10 21:21 ` [PATCH v5 05/10] userfaultfd: add minor fault registration mode Axel Rasmussen
  2021-02-11 19:28   ` Axel Rasmussen
@ 2021-02-12 19:17   ` Mike Kravetz
  2021-02-12 19:33     ` Axel Rasmussen
  1 sibling, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2021-02-12 19:17 UTC (permalink / raw)
  To: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Matthew Wilcox (Oracle),
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Peter Xu,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On 2/10/21 1:21 PM, Axel Rasmussen wrote:
> This feature allows userspace to intercept "minor" faults. By "minor"
> faults, I mean the following situation:
> 
> Let there exist two mappings (i.e., VMAs) to the same page(s). One of
> the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have
> already been allocated & filled with some contents. The UFFD mapping
> has not yet been faulted in; when it is touched for the first time,
> this results in what I'm calling a "minor" fault. As a concrete
> example, when working with hugetlbfs, we have huge_pte_none(), but
> find_lock_page() finds an existing page.

Do we want to intercept the fault if it is for a private mapping that
will COW the page in the page cache?  I think 'yes' but just want to
confirm.  The code added to hugetlb_no_page will intercept these COW
accesses.

<snip>

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e41b77cf6cc2..f150b10981a8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4366,6 +4366,38 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  				VM_FAULT_SET_HINDEX(hstate_index(h));
>  			goto backout_unlocked;
>  		}
> +
> +		/* Check for page in userfault range. */
> +		if (userfaultfd_minor(vma)) {
> +			u32 hash;
> +			struct vm_fault vmf = {
> +				.vma = vma,
> +				.address = haddr,
> +				.flags = flags,
> +				/*
> +				 * Hard to debug if it ends up being used by a
> +				 * callee that assumes something about the
> +				 * other uninitialized fields... same as in
> +				 * memory.c
> +				 */
> +			};
> +
> +			unlock_page(page);
> +
> +			/*
> +			 * hugetlb_fault_mutex and i_mmap_rwsem must be dropped
> +			 * before handling userfault.  Reacquire after handling
> +			 * fault to make calling code simpler.
> +			 */
> +
> +			hash = hugetlb_fault_mutex_hash(mapping, idx);
> +			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +			i_mmap_unlock_read(mapping);

After dropping all the locks, we only hold a reference to the page in the
page cache.  I 'think' someone else could hole punch the page and remove it
from the cache.  IIUC, state changes while processing uffd faults are something
that users need to deal with?  Just need to make sure there are no assumptions
in the kernel code.

> +			ret = handle_userfault(&vmf, VM_UFFD_MINOR);
> +			i_mmap_lock_read(mapping);
> +			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +			goto out;
> +		}
>  	}
>  
>  	/*
> 

-- 
Mike Kravetz


* Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode
  2021-02-12 19:17   ` Mike Kravetz
@ 2021-02-12 19:33     ` Axel Rasmussen
  0 siblings, 0 replies; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-12 19:33 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Peter Xu,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka, LKML, linux-fsdevel, Linux MM, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On Fri, Feb 12, 2021 at 11:18 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 2/10/21 1:21 PM, Axel Rasmussen wrote:
> > This feature allows userspace to intercept "minor" faults. By "minor"
> > faults, I mean the following situation:
> >
> > Let there exist two mappings (i.e., VMAs) to the same page(s). One of
> > the mappings is registered with userfaultfd (in minor mode), and the
> > other is not. Via the non-UFFD mapping, the underlying pages have
> > already been allocated & filled with some contents. The UFFD mapping
> > has not yet been faulted in; when it is touched for the first time,
> > this results in what I'm calling a "minor" fault. As a concrete
> > example, when working with hugetlbfs, we have huge_pte_none(), but
> > find_lock_page() finds an existing page.
>
> Do we want to intercept the fault if it is for a private mapping that
> will COW the page in the page cache?  I think 'yes' but just want to
> confirm.  The code added to hugetlb_no_page will intercept these COW
> accesses.

I can at least say this is intentional, although I admit I don't have
a precise use case in mind for the UFFD mapping being private. I
suppose it's something like: the UFFD poll thread is supposed to
(maybe) update the page contents *before* I CoW it, and then once
it's been CoW-ed I don't want that poll thread to be able to see
whatever changes I've made?

Unless there's some different use case for this, I believe this is the
behavior we want.

>
> <snip>
>
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index e41b77cf6cc2..f150b10981a8 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4366,6 +4366,38 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >                               VM_FAULT_SET_HINDEX(hstate_index(h));
> >                       goto backout_unlocked;
> >               }
> > +
> > +             /* Check for page in userfault range. */
> > +             if (userfaultfd_minor(vma)) {
> > +                     u32 hash;
> > +                     struct vm_fault vmf = {
> > +                             .vma = vma,
> > +                             .address = haddr,
> > +                             .flags = flags,
> > +                             /*
> > +                              * Hard to debug if it ends up being used by a
> > +                              * callee that assumes something about the
> > +                              * other uninitialized fields... same as in
> > +                              * memory.c
> > +                              */
> > +                     };
> > +
> > +                     unlock_page(page);
> > +
> > +                     /*
> > +                      * hugetlb_fault_mutex and i_mmap_rwsem must be dropped
> > +                      * before handling userfault.  Reacquire after handling
> > +                      * fault to make calling code simpler.
> > +                      */
> > +
> > +                     hash = hugetlb_fault_mutex_hash(mapping, idx);
> > +                     mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > +                     i_mmap_unlock_read(mapping);
>
> After dropping all the locks, we only hold a reference to the page in the
> page cache.  I 'think' someone else could hole punch the page and remove it
> from the cache.  IIUC, state changes while processing uffd faults are something
> that users need to deal with?  Just need to make sure there are no assumptions
> in the kernel code.

Yeah, this seems possible. What I'd expect to happen in that case is
something like:

1. hugetlb_no_page() calls into handle_userfault().
2. Someone hole punches the page, removing it from the page cache.
3. The UFFD poll thread gets the fault event, and issues a
UFFDIO_CONTINUE. (Say we instead were going to write an update, and
*then* UFFDIO_CONTINUE: I think the hole punch by another thread could
also happen between those two events.)
4. This calls down into hugetlb_mcopy_atomic_pte, where we try to
find_lock_page(). This returns NULL, so we bail with -EFAULT.
5. Userspace detects and deals with this error - maybe by writing to
the non-UFFD mapping, thereby putting a page back in the page cache,
or by issuing a UFFDIO_COPY or such?

Which, as far as I can see, is fine? But I am by no means an expert
yet so please correct me if this seems problematic. :)
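
For completeness, a rough userspace sketch of step 5 (it assumes the
UFFDIO_CONTINUE API added later in this series; the helper name and its
arguments are made up for illustration):

#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Resolve one minor fault at 'fault_addr'.  'alias' is the non-UFFD
 * mapping of the same hugetlb pages, 'off' the matching offset. */
static int resolve_minor_fault(int uffd, unsigned long fault_addr,
			       char *alias, size_t off, size_t hpagesz)
{
	struct uffdio_continue cont = {
		.range = {
			.start = fault_addr & ~(hpagesz - 1),
			.len   = hpagesz,
		},
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == 0)
		return 0;
	if (errno != EFAULT)
		return -1;

	/* The page was hole punched while we handled the fault (step 2).
	 * Repopulate the page cache through the alias mapping (a
	 * UFFDIO_COPY would work too), then retry the CONTINUE. */
	memset(alias + off, 0, hpagesz);
	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}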

>
> > +                     ret = handle_userfault(&vmf, VM_UFFD_MINOR);
> > +                     i_mmap_lock_read(mapping);
> > +                     mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > +                     goto out;
> > +             }
> >       }
> >
> >       /*
> >
>
> --
> Mike Kravetz


* Re: [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  2021-02-12  0:19   ` Mike Kravetz
@ 2021-02-12 20:40     ` Peter Xu
  2021-02-12 20:47       ` Axel Rasmussen
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2021-02-12 20:40 UTC (permalink / raw)
  To: Mike Kravetz, Axel Rasmussen
  Cc: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Matthew Wilcox (Oracle),
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Shaohua Li,
	Shawn Anastasio, Steven Rostedt, Steven Price, Vlastimil Babka,
	linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On Thu, Feb 11, 2021 at 04:19:55PM -0800, Mike Kravetz wrote:
> want_pmd_share() is currently just a check for CONFIG_ARCH_WANT_HUGE_PMD_SHARE.
> How about leaving that mostly as is, and adding the new vma checks to
> vma_shareable().  vma_shareable() would then be something like:
> 
> 	if (!(vma->vm_flags & VM_MAYSHARE))
> 		return false;
> #ifdef CONFIG_USERFAULTFD
> 	if (uffd_disable_huge_pmd_share(vma))
> 		return false;
> #endif
> #if 0 /* XXX */
> 	/* add other checks for things like uffd wp and soft dirty here */
> #endif /* XXX */
> 
> 	if (range_in_vma(vma, base, end))
> 		return true;
> 	return false;
> 
> Of course, this would require we leave the call to vma_shareable() at the
> beginning of huge_pmd_share.  It also means that we are always making a
> function call into huge_pmd_share to determine if sharing is possible.
> That is not any different than today.  If we do not want to make that extra
> function call, then I would suggest putting all that code in want_pmd_share.
> It just seems that all the vma checks for sharing should be in one place
> if possible.

I don't worry a lot about that, since we've already got huge_pte_alloc() which
takes care of the huge pmd sharing case, so I don't expect even most hugetlb
developers to use want_pmd_share() at all; huge_pte_alloc() will be the one
that is frequently called.

But yeah, we can definitely put the check logic into huge_pmd_share() too.
Looking at the above code, it still looks worth having a helper like
want_pmd_share(), perhaps with some other name.  Then, instead of making this
complicated, how about I mostly keep this patch but move the want_pmd_share()
call into huge_pmd_share() instead?
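
Something like this at the top of huge_pmd_share() (a sketch; it mirrors
the vma_shareable() check that this series removes):

pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
		      unsigned long addr, pud_t *pud)
{
	/* All of the vma checks for sharing live in want_pmd_share(). */
	if (!want_pmd_share(vma, addr))
		return (pte_t *)pmd_alloc(mm, pud, addr);

	/* ... existing sharing walk over the i_mmap tree ... */
}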

Btw, Axel, it seems there will still be some respins on the pmd sharing
patches.  Since it turns out they'll be shared by multiple tasks now, do you
mind if I pick those out and send them separately?  Then we can consolidate
this part and move on with the rest of the tasks we've got on hand.

Thanks,

-- 
Peter Xu



* Re: [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  2021-02-12 20:40     ` Peter Xu
@ 2021-02-12 20:47       ` Axel Rasmussen
  2021-02-12 21:27         ` Mike Kravetz
  0 siblings, 1 reply; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-12 20:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Alexander Viro, Alexey Dobriyan, Andrea Arcangeli,
	Andrew Morton, Anshuman Khandual, Catalin Marinas, Chinwen Chang,
	Huang Ying, Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Shaohua Li,
	Shawn Anastasio, Steven Rostedt, Steven Price, Vlastimil Babka,
	LKML, linux-fsdevel, Linux MM, Adam Ruprecht, Cannon Matthews,
	Dr . David Alan Gilbert, David Rientjes, Mina Almasry,
	Oliver Upton

On Fri, Feb 12, 2021 at 12:40 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 11, 2021 at 04:19:55PM -0800, Mike Kravetz wrote:
> > want_pmd_share() is currently just a check for CONFIG_ARCH_WANT_HUGE_PMD_SHARE.
> > How about leaving that mostly as is, and adding the new vma checks to
> > vma_shareable().  vma_shareable() would then be something like:
> >
> >       if (!(vma->vm_flags & VM_MAYSHARE))
> >               return false;
> > #ifdef CONFIG_USERFAULTFD
> >       if (uffd_disable_huge_pmd_share(vma))
> >               return false;
> > #endif
> > #if 0 /* XXX */
> >       /* add other checks for things like uffd wp and soft dirty here */
> > #endif /* XXX */
> >
> >       if (range_in_vma(vma, base, end))
> >               return true;
> >       return false;
> >
> > Of course, this would require we leave the call to vma_shareable() at the
> > beginning of huge_pmd_share.  It also means that we are always making a
> > function call into huge_pmd_share to determine if sharing is possible.
> > That is not any different than today.  If we do not want to make that extra
> > function call, then I would suggest putting all that code in want_pmd_share.
> > It just seems that all the vma checks for sharing should be in one place
> > if possible.
>
> I don't worry a lot about that, since we've already got huge_pte_alloc() which
> takes care of the huge pmd sharing case, so I don't expect even most hugetlb
> developers to use want_pmd_share() at all; huge_pte_alloc() will be the one
> that is frequently called.
>
> But yeah, we can definitely put the check logic into huge_pmd_share() too.
> Looking at the above code, it still looks worth having a helper like
> want_pmd_share(), perhaps with some other name.  Then, instead of making this
> complicated, how about I mostly keep this patch but move the want_pmd_share()
> call into huge_pmd_share() instead?
>
> Btw, Axel, it seems there will still be some respins on the pmd sharing
> patches.  Since it turns out they'll be shared by multiple tasks now, do you
> mind if I pick those out and send them separately?  Then we can consolidate
> this part and move on with the rest of the tasks we've got on hand.

Sounds good to me. :) Thanks Peter + Mike for working on this!

>
> Thanks,
>
> --
> Peter Xu
>


* Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
  2021-02-12 18:11   ` Mike Kravetz
@ 2021-02-12 21:18     ` Peter Xu
  2021-02-12 21:34       ` Mike Kravetz
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2021-02-12 21:18 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Matthew Wilcox (Oracle),
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Shaohua Li,
	Shawn Anastasio, Steven Rostedt, Steven Price, Vlastimil Babka,
	linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On Fri, Feb 12, 2021 at 10:11:39AM -0800, Mike Kravetz wrote:
> On 2/10/21 1:21 PM, Axel Rasmussen wrote:
> > From: Peter Xu <peterx@redhat.com>
> > 
> > Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because
> > userfaultfd-wp is always based on pgtable entries, so they cannot be shared.
> > 
> > Walk the hugetlb range and unshare all such mappings if any exist, right
> > before UFFDIO_REGISTER succeeds and returns to userspace.
> > 
> > This will pair with want_pmd_share() in hugetlb code so that huge pmd sharing
> > is completely disabled for the userfaultfd-wp registered range.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> > ---
> >  fs/userfaultfd.c             | 48 ++++++++++++++++++++++++++++++++++++
> >  include/linux/mmu_notifier.h |  1 +
> >  2 files changed, 49 insertions(+)
> > 
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 0be8cdd4425a..1f4a34b1a1e7 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/sched/signal.h>
> >  #include <linux/sched/mm.h>
> >  #include <linux/mm.h>
> > +#include <linux/mmu_notifier.h>
> >  #include <linux/poll.h>
> >  #include <linux/slab.h>
> >  #include <linux/seq_file.h>
> > @@ -1191,6 +1192,50 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
> >  	}
> >  }
> >  
> > +/*
> > + * This function will unconditionally remove all the shared pmd pgtable entries
> > + * within the specific vma for a hugetlbfs memory range.
> > + */
> > +static void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
> > +{
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +	struct hstate *h = hstate_vma(vma);
> > +	unsigned long sz = huge_page_size(h);
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_notifier_range range;
> > +	unsigned long address;
> > +	spinlock_t *ptl;
> > +	pte_t *ptep;
> > +
> > +	if (!(vma->vm_flags & VM_MAYSHARE))
> > +		return;
> > +
> > +	/*
> > +	 * No need to call adjust_range_if_pmd_sharing_possible(), because
> > +	 * we're going to operate on the whole vma
> > +	 */
> 
> This code will certainly work as intended.  However, I wonder if we should
> try to optimize and only flush and call huge_pmd_unshare for addresses where
> sharing is possible.  Consider this worst case example:
> 
> vm_start = 8G + 2M
> vm_end   = 11G - 2M
> The vma is 'almost' 3G in size, yet only the range 9G to 10G is possibly
> shared.  This routine will potentially call lock/unlock ptl and call
> huge_pmd_unshare for every huge page in the range.  Ideally, we should only
> make one call to huge_pmd_unshare with address 9G.  Whether or not the
> unshare is successful, we are done.  The subtle manipulation of &address in
> huge_pmd_unshare will result in only one call if the unshare is successful,
> but if unsuccessful we will unnecessarily call huge_pmd_unshare for each
> address in the range.
> 
> Maybe we start by rounding vm_start up and vm_end down to PUD_SIZE
> boundaries.

I didn't think about that a lot since it's a slow path, but yeah, if that's
faster and doesn't need extra logic, then why not. :)

> 
> > +	mmu_notifier_range_init(&range, MMU_NOTIFY_HUGETLB_UNSHARE,
> > +				0, vma, mm, vma->vm_start, vma->vm_end);
> > +	mmu_notifier_invalidate_range_start(&range);
> > +	i_mmap_lock_write(vma->vm_file->f_mapping);
> > +	for (address = vma->vm_start; address < vma->vm_end; address += sz) {
> 
> Then, change the loop increment to PUD_SIZE.  And, also ignore the &address
> manipulation done by huge_pmd_unshare.

Will do!

> 
> > +		ptep = huge_pte_offset(mm, address, sz);
> > +		if (!ptep)
> > +			continue;
> > +		ptl = huge_pte_lock(h, mm, ptep);
> > +		huge_pmd_unshare(mm, vma, &address, ptep);
> > +		spin_unlock(ptl);
> > +	}
> > +	flush_hugetlb_tlb_range(vma, vma->vm_start, vma->vm_end);
> > +	i_mmap_unlock_write(vma->vm_file->f_mapping);
> > +	/*
> > +	 * No need to call mmu_notifier_invalidate_range(), see
> > +	 * Documentation/vm/mmu_notifier.rst.
> > +	 */
> > +	mmu_notifier_invalidate_range_end(&range);
> > +#endif
> > +}
> > +
> >  static void __wake_userfault(struct userfaultfd_ctx *ctx,
> >  			     struct userfaultfd_wake_range *range)
> >  {
> > @@ -1449,6 +1494,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >  		vma->vm_flags = new_flags;
> >  		vma->vm_userfaultfd_ctx.ctx = ctx;
> >  
> > +		if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
> > +			hugetlb_unshare_all_pmds(vma);
> > +
> >  	skip:
> >  		prev = vma;
> >  		start = vma->vm_end;
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index b8200782dede..ff50c8528113 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -51,6 +51,7 @@ enum mmu_notifier_event {
> >  	MMU_NOTIFY_SOFT_DIRTY,
> >  	MMU_NOTIFY_RELEASE,
> >  	MMU_NOTIFY_MIGRATE,
> > +	MMU_NOTIFY_HUGETLB_UNSHARE,
> 
> I don't claim to know much about mmu notifiers.  Currently, we use other
> event types such as MMU_NOTIFY_CLEAR.  I guess we do 'clear' page table
> entries if we unshare.  More than happy to have a MMU_NOTIFY_HUGETLB_UNSHARE
> event, but will consumers of the notifications know what this new event type
> means?  And, if we introduce this, should we use it in other places where
> huge_pmd_unshare is called?

Yeah, AFAICT that is a new feature of mmu notifiers and it's not really used
a lot by consumers yet.  Hmm... is there really any consumer at all?  I
simply grepped MMU_NOTIFY_UNMAP and saw no hook taking special care of that.
So it's some extra information that the upper layer would like to deliver to
the notifiers; it's just not widely used so far.

So far I didn't worry too much about that either.  MMU_NOTIFY_HUGETLB_UNSHARE
is introduced here simply to make the event explicit; it's easy to replace it
one day if we decide huge pmd unshare is not worth a standalone flag and we
should reuse some other common one.  But I think at least I owe some
documentation of that new enum. :)
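
Perhaps something along these lines in mmu_notifier.h (only a draft of the
wording):

 * @MMU_NOTIFY_HUGETLB_UNSHARE: a shared hugetlbfs pmd page table is being
 * unshared.  The pgtable entries are cleared in this mm, but the underlying
 * pages stay mapped via the other sharers.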

I'm not extremely willing to touch the rest of the callers of huge pmd
unshare yet, unless I have a solid reason.  E.g., one day maybe some mmu
notifier hook will start to monitor these events; that would be a better
place, imho, to change them.  Otherwise any such change I make now could be
vague.

For this patch - how about I simply go back to using MMU_NOTIFY_CLEAR instead?
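
I.e., only the event argument in the range init would change (sketch):

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
				vma->vm_start, vma->vm_end);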

Thanks,

-- 
Peter Xu



* Re: [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  2021-02-12 20:47       ` Axel Rasmussen
@ 2021-02-12 21:27         ` Mike Kravetz
  0 siblings, 0 replies; 29+ messages in thread
From: Mike Kravetz @ 2021-02-12 21:27 UTC (permalink / raw)
  To: Axel Rasmussen, Peter Xu
  Cc: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Shaohua Li,
	Shawn Anastasio, Steven Rostedt, Steven Price, Vlastimil Babka,
	LKML, linux-fsdevel, Linux MM, Adam Ruprecht, Cannon Matthews,
	Dr . David Alan Gilbert, David Rientjes, Mina Almasry,
	Oliver Upton

On 2/12/21 12:47 PM, Axel Rasmussen wrote:
> On Fri, Feb 12, 2021 at 12:40 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Thu, Feb 11, 2021 at 04:19:55PM -0800, Mike Kravetz wrote:
>>> want_pmd_share() is currently just a check for CONFIG_ARCH_WANT_HUGE_PMD_SHARE.
>>> How about leaving that mostly as is, and adding the new vma checks to
>>> vma_shareable().  vma_shareable() would then be something like:
>>>
>>>       if (!(vma->vm_flags & VM_MAYSHARE))
>>>               return false;
>>> #ifdef CONFIG_USERFAULTFD
>>>       if (uffd_disable_huge_pmd_share(vma))
>>>               return false;
>>> #endif
>>> #if 0 /* XXX */
>>>       /* add other checks for things like uffd wp and soft dirty here */
>>> #endif /* XXX */
>>>
>>>       if (range_in_vma(vma, base, end))
>>>               return true;
>>>       return false;
>>>
>>> Of course, this would require we leave the call to vma_shareable() at the
>>> beginning of huge_pmd_share.  It also means that we are always making a
>>> function call into huge_pmd_share to determine if sharing is possible.
>>> That is not any different than today.  If we do not want to make that extra
>>> function call, then I would suggest putting all that code in want_pmd_share.
>>> It just seems that all the vma checks for sharing should be in one place
>>> if possible.
>>
>> I don't worry a lot about that, since we've already got huge_pte_alloc() which
>> takes care of the huge pmd sharing case, so I don't expect even most hugetlb
>> developers to use want_pmd_share() at all; huge_pte_alloc() will be the one
>> that is frequently called.
>>
>> But yeah, we can definitely put the check logic into huge_pmd_share() too.
>> Looking at the above code, it still looks worth having a helper like
>> want_pmd_share(), perhaps with some other name.  Then, instead of making this
>> complicated, how about I mostly keep this patch but move the want_pmd_share()
>> call into huge_pmd_share() instead?

When looking at this again, all I was suggesting was a single routine to
check for the possibility of pmd sharing.  That is what the version of
want_pmd_share in this patch does.

I have some patches for future optimizations that only take i_mmap_rwsem
in the fault path if sharing is possible.  This is before huge_pte_alloc.
want_pmd_share as defined in this patch would work for that.
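
Very roughly, the idea is something like this in the fault path (just a
sketch of the unposted patches; the matching conditional unlock is omitted):

	mapping = vma->vm_file->f_mapping;
	/* Only serialize against unsharing if sharing is possible. */
	if (want_pmd_share(vma, haddr))
		i_mmap_lock_read(mapping);
	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));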

Sorry for the noise.
-- 
Mike Kravetz


* Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
  2021-02-12 21:18     ` Peter Xu
@ 2021-02-12 21:34       ` Mike Kravetz
  2021-02-12 22:14         ` Peter Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2021-02-12 21:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Matthew Wilcox (Oracle),
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Shaohua Li,
	Shawn Anastasio, Steven Rostedt, Steven Price, Vlastimil Babka,
	linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On 2/12/21 1:18 PM, Peter Xu wrote:
> On Fri, Feb 12, 2021 at 10:11:39AM -0800, Mike Kravetz wrote:
>> On 2/10/21 1:21 PM, Axel Rasmussen wrote:
>>> From: Peter Xu <peterx@redhat.com>
>>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>>> index b8200782dede..ff50c8528113 100644
>>> --- a/include/linux/mmu_notifier.h
>>> +++ b/include/linux/mmu_notifier.h
>>> @@ -51,6 +51,7 @@ enum mmu_notifier_event {
>>>  	MMU_NOTIFY_SOFT_DIRTY,
>>>  	MMU_NOTIFY_RELEASE,
>>>  	MMU_NOTIFY_MIGRATE,
>>> +	MMU_NOTIFY_HUGETLB_UNSHARE,
>>
>> I don't claim to know much about mmu notifiers.  Currently, we use other
>> event types such as MMU_NOTIFY_CLEAR.  I guess we do 'clear' page table
>> entries if we unshare.  More than happy to have a MMU_NOTIFY_HUGETLB_UNSHARE
>> event, but will consumers of the notifications know what this new event type
>> means?  And, if we introduce this, should we use it in other places where
>> huge_pmd_unshare is called?
> 
> Yeah, AFAICT that is a new feature of mmu notifiers and it's not really used
> a lot by consumers yet.  Hmm... is there really any consumer at all?  I
> simply grepped MMU_NOTIFY_UNMAP and saw no hook taking special care of that.
> So it's some extra information that the upper layer would like to deliver to
> the notifiers; it's just not widely used so far.
> 
> So far I didn't worry too much about that either.  MMU_NOTIFY_HUGETLB_UNSHARE
> is introduced here simply to make the event explicit; it's easy to replace it
> one day if we decide huge pmd unshare is not worth a standalone flag and we
> should reuse some other common one.  But I think at least I owe some
> documentation of that new enum. :)
> 
> I'm not extremely willing to touch the rest of the callers of huge pmd
> unshare yet, unless I have a solid reason.  E.g., one day maybe some mmu
> notifier hook will start to monitor these events; that would be a better
> place, imho, to change them.  Otherwise any such change I make now could be
> vague.
> 
> For this patch - how about I simply go back to using MMU_NOTIFY_CLEAR instead?

I'm good with the new MMU_NOTIFY_HUGETLB_UNSHARE and agree with your reasoning
for adding it.  I really did not know enough about the usage, which is what
caused me to question it.
-- 
Mike Kravetz


* Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
  2021-02-12 21:34       ` Mike Kravetz
@ 2021-02-12 22:14         ` Peter Xu
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Xu @ 2021-02-12 22:14 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Matthew Wilcox (Oracle),
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Rapoport, Nicholas Piggin, Shaohua Li,
	Shawn Anastasio, Steven Rostedt, Steven Price, Vlastimil Babka,
	linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On Fri, Feb 12, 2021 at 01:34:03PM -0800, Mike Kravetz wrote:
> I'm good with the new MMU_NOTIFY_HUGETLB_UNSHARE and agree with your reasoning
> for adding it.  I really did not know enough about the usage, which is what
> caused me to question it.

It actually is a good question.  After a second thought, I cannot think of
any mmu notifier usage that would explicitly listen for a huge pmd unshare
event - it could just be too internal an implementation detail of hugetlbfs.
I think any new flag should justify itself when introduced.  Until I can
justify it, I think MMU_NOTIFY_CLEAR is indeed more suitable.

Thanks,

-- 
Peter Xu



* Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode
  2021-02-11 19:28   ` Axel Rasmussen
  2021-02-11 20:58     ` Axel Rasmussen
@ 2021-02-12 22:21     ` Matthew Wilcox
  2021-02-12 22:44       ` Peter Xu
  1 sibling, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2021-02-12 22:21 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka, LKML, linux-fsdevel, Linux MM,
	Adam Ruprecht, Cannon Matthews, Dr . David Alan Gilbert,
	David Rientjes, Mina Almasry, Oliver Upton

On Thu, Feb 11, 2021 at 11:28:09AM -0800, Axel Rasmussen wrote:
> Ah, I had added this just after VM_UFFD_WP, without noticing that this
> would be sharing a bit with VM_LOCKED. That seems like not such a
> great idea.
> 
> I don't see another unused bit, and I don't see some other obvious
> candidate to share with. So, the solution that comes to mind is

it'd be even better if you didn't use the last unused bit for UFFD_WP.
not sure how feasible that is, but you can see we're really short on
bits here.


* Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode
  2021-02-12 22:21     ` Matthew Wilcox
@ 2021-02-12 22:44       ` Peter Xu
  2021-02-12 22:51         ` Axel Rasmussen
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2021-02-12 22:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Axel Rasmussen, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Michael Ellerman,
	Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka, LKML, linux-fsdevel, Linux MM, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On Fri, Feb 12, 2021 at 10:21:45PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 11, 2021 at 11:28:09AM -0800, Axel Rasmussen wrote:
> > Ah, I had added this just after VM_UFFD_WP, without noticing that this
> > would be sharing a bit with VM_LOCKED. That seems like not such a
> > great idea.
> > 
> > I don't see another unused bit, and I don't see some other obvious
> > candidate to share with. So, the solution that comes to mind is
> 
> it'd be even better if you didn't use the last unused bit for UFFD_WP.
> not sure how feasible that is, but you can see we're really short on
> bits here.

UFFD_WP is already used for anonymous memory, and the support for hugetlbfs
and shmem is in rfc stage on the list.

Is it possible to use CONFIG_ARCH_USES_HIGH_VMA_FLAGS here?  So far uffd-wp
only works on 64-bit x86 anyway, due to the enlarged pte space.  Maybe we can
also let minor mode support only 64-bit hosts.
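
Something like this, maybe (a sketch; the bit number is a guess at a free
high bit):

#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
# define VM_UFFD_MINOR_BIT	37
# define VM_UFFD_MINOR		BIT(VM_UFFD_MINOR_BIT)	/* UFFD minor faults */
#else
# define VM_UFFD_MINOR		VM_NONE
#endif

On 32-bit (or !CONFIG_ARCH_USES_HIGH_VMA_FLAGS) kernels the feature would
then simply be compiled out.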

Thanks,

-- 
Peter Xu



* Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode
  2021-02-12 22:44       ` Peter Xu
@ 2021-02-12 22:51         ` Axel Rasmussen
  2021-02-12 23:01           ` Peter Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Axel Rasmussen @ 2021-02-12 22:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Matthew Wilcox, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Michael Ellerman,
	Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka, LKML, linux-fsdevel, Linux MM, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On Fri, Feb 12, 2021 at 2:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Feb 12, 2021 at 10:21:45PM +0000, Matthew Wilcox wrote:
> > On Thu, Feb 11, 2021 at 11:28:09AM -0800, Axel Rasmussen wrote:
> > > Ah, I had added this just after VM_UFFD_WP, without noticing that this
> > > would be sharing a bit with VM_LOCKED. That seems like not such a
> > > great idea.
> > >
> > > I don't see another unused bit, and I don't see some other obvious
> > > candidate to share with. So, the solution that comes to mind is
> >
> > it'd be even better if you didn't use the last unused bit for UFFD_WP.
> > not sure how feasible that is, but you can see we're really short on
> > bits here.
>
> UFFD_WP is already used for anonymous memory, and the support for hugetlbfs
> and shmem is in rfc stage on the list.
>
> Is it possible to use CONFIG_ARCH_USES_HIGH_VMA_FLAGS here?  So far uffd-wp
> only works on 64-bit x86 anyway, due to the enlarged pte space.  Maybe we can
> also let minor mode support only 64-bit hosts.

At least for my / Google's purposes, I don't care about 32-bit support
for this feature. I do care about both x86_64 and arm64, though. So
it's a possibility.

Alternatively, the "it's an API feature not a registration mode"
approach I sent in my v6 also works for me, although it has some
drawbacks.

Another option: would it be terrible to add an extra u16 or u32 for
UFFD flags to vm_area_struct (say, within vm_userfaultfd_ctx)?
Historically we've already added a pointer there, so maybe an extra,
say, 16 bits isn't so bad? This would avoid using *any* VM_* flags for
UFFD; even VM_UFFD_MISSING could live in this new flag field.
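
Rough shape of what I mean (the field and flag names here are made up, just
to illustrate):

struct vm_userfaultfd_ctx {
	struct userfaultfd_ctx *ctx;
	u32 flags;		/* would replace the VM_UFFD_* vm_flags bits */
};

#define UFFD_VMA_MISSING	(1u << 0)
#define UFFD_VMA_WP		(1u << 1)
#define UFFD_VMA_MINOR		(1u << 2)

The downside is that every vm_flags check for the UFFD modes would have to
be converted to look at the new field instead.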

>
> Thanks,
>
> --
> Peter Xu
>


* Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode
  2021-02-12 22:51         ` Axel Rasmussen
@ 2021-02-12 23:01           ` Peter Xu
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Xu @ 2021-02-12 23:01 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Matthew Wilcox, Alexander Viro, Alexey Dobriyan,
	Andrea Arcangeli, Andrew Morton, Anshuman Khandual,
	Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
	Jann Horn, Jerome Glisse, Lokesh Gidra, Michael Ellerman,
	Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka, LKML, linux-fsdevel, Linux MM, Adam Ruprecht,
	Cannon Matthews, Dr . David Alan Gilbert, David Rientjes,
	Mina Almasry, Oliver Upton

On Fri, Feb 12, 2021 at 02:51:17PM -0800, Axel Rasmussen wrote:
> On Fri, Feb 12, 2021 at 2:44 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Feb 12, 2021 at 10:21:45PM +0000, Matthew Wilcox wrote:
> > > On Thu, Feb 11, 2021 at 11:28:09AM -0800, Axel Rasmussen wrote:
> > > > Ah, I had added this just after VM_UFFD_WP, without noticing that this
> > > > would be sharing a bit with VM_LOCKED. That seems like not such a
> > > > great idea.
> > > >
> > > > I don't see another unused bit, and I don't see some other obvious
> > > > candidate to share with. So, the solution that comes to mind is
> > >
> > > it'd be even better if you didn't use the last unused bit for UFFD_WP.
> > > not sure how feasible that is, but you can see we're really short on
> > > bits here.
> >
> > UFFD_WP is already used for anonymous memory, and the support for hugetlbfs
> > and shmem is in rfc stage on the list.
> >
> > Is it possible to use CONFIG_ARCH_USES_HIGH_VMA_FLAGS here?  So far uffd-wp
> > only works on 64-bit x86 anyway, due to the enlarged pte space.  Maybe we can
> > also let minor mode support only 64-bit hosts.
> 
> At least for my / Google's purposes, I don't care about 32-bit support
> for this feature. I do care about both x86_64 and arm64, though. So
> it's a possibility.
> 
> Alternatively, the "it's an API feature not a registration mode"
> approach I sent in my v6 also works for me, although it has some
> drawbacks.

Per-vma registration has finer granularity and is logically more flexible.  If
it's low-hanging fruit, let's think about it more before giving up so quickly.

Sorry I commented late on this - I got diverted a bit in the past few days,
while you worked on it so fast (which in many cases is still a good thing :).

> 
> Another option: would it be terrible to add an extra u16 or u32 for
> UFFD flags to vm_area_struct (say, within vm_userfaultfd_ctx)?
> Historically we've already added a pointer there, so maybe an extra,
> say, 16 bits isn't so bad? This would avoid using *any* VM_* flags for
> UFFD; even VM_UFFD_MISSING could live in this new flag field.

For 64-bit hosts there are still spare bits in vm_flags; it's just that 32-bit
hosts lack them, though there's the option to make the feature 64-bit-only.
Even if we added a new field, those bits would still sit unused on 64-bit
hosts.  IMHO we should try to use them before adding a new field, which would
actually impact all hosts.

Thanks,

-- 
Peter Xu



end of thread

Thread overview: 29+ messages
2021-02-10 21:21 [PATCH v5 00/10] userfaultfd: add minor fault handling Axel Rasmussen
2021-02-10 21:21 ` [PATCH v5 01/10] hugetlb: Pass vma into huge_pte_alloc() and huge_pmd_share() Axel Rasmussen
2021-02-11 23:59   ` Mike Kravetz
2021-02-10 21:21 ` [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled Axel Rasmussen
2021-02-12  0:19   ` Mike Kravetz
2021-02-12 20:40     ` Peter Xu
2021-02-12 20:47       ` Axel Rasmussen
2021-02-12 21:27         ` Mike Kravetz
2021-02-10 21:21 ` [PATCH v5 03/10] mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h Axel Rasmussen
2021-02-10 21:21 ` [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp Axel Rasmussen
2021-02-12 18:11   ` Mike Kravetz
2021-02-12 21:18     ` Peter Xu
2021-02-12 21:34       ` Mike Kravetz
2021-02-12 22:14         ` Peter Xu
2021-02-10 21:21 ` [PATCH v5 05/10] userfaultfd: add minor fault registration mode Axel Rasmussen
2021-02-11 19:28   ` Axel Rasmussen
2021-02-11 20:58     ` Axel Rasmussen
2021-02-12 22:21     ` Matthew Wilcox
2021-02-12 22:44       ` Peter Xu
2021-02-12 22:51         ` Axel Rasmussen
2021-02-12 23:01           ` Peter Xu
2021-02-12 19:17   ` Mike Kravetz
2021-02-12 19:33     ` Axel Rasmussen
2021-02-10 21:21 ` [PATCH v5 06/10] userfaultfd: disable huge PMD sharing for MINOR registered VMAs Axel Rasmussen
2021-02-10 21:21 ` [PATCH v5 07/10] userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled Axel Rasmussen
2021-02-10 21:21 ` [PATCH v5 08/10] userfaultfd: add UFFDIO_CONTINUE ioctl Axel Rasmussen
2021-02-10 22:21   ` Axel Rasmussen
2021-02-10 21:21 ` [PATCH v5 09/10] userfaultfd: update documentation to describe minor fault handling Axel Rasmussen
2021-02-10 21:22 ` [PATCH v5 10/10] userfaultfd/selftests: add test exercising " Axel Rasmussen
