* [PATCH 0/9] userfaultfd: add minor fault handling
From: Axel Rasmussen @ 2021-01-15 19:04 UTC
  To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	Matthew Wilcox (Oracle), Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Peter Xu, Shaohua Li, Shawn Anastasio, Steven Rostedt,
	Steven Price, Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Adam Ruprecht,
	Axel Rasmussen, Cannon Matthews, Dr . David Alan Gilbert,
	David Rientjes, Oliver Upton

Changelog
=========

RFC->v1:
- Rebased onto Peter Xu's patches for disabling huge PMD sharing for certain
  userfaultfd-registered areas.
- Added commits which update documentation, and add a self test which exercises
  the new feature.
- Fixed reporting CONTINUE as a supported ioctl even for non-MINOR ranges.

Overview
========

This series adds a new userfaultfd registration mode,
UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
By "minor" fault, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
One of the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what I'm
calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
have huge_pte_none(), but find_lock_page() finds an existing page.
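
As a rough illustration of the registration side, the sketch below registers a
range in minor mode. It assumes the UAPI names as they appear in mainline
<linux/userfaultfd.h> from Linux 5.13 onward (struct uffdio_register,
UFFDIO_REGISTER_MODE_MINOR); the names in this revision of the series may
differ. register_minor() is a hypothetical helper. Creating the userfaultfd,
negotiating features via UFFDIO_API, and setting up the two hugetlbfs mappings
are all omitted.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: register [addr, addr + len) on `uffd` so that only
 * minor faults are reported for it.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int register_minor(int uffd, void *addr, size_t len)
{
	struct uffdio_register reg;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)addr;
	reg.range.len = len;
	/* Assumption: flag name as found in mainline headers. */
	reg.mode = UFFDIO_REGISTER_MODE_MINOR;

	return ioctl(uffd, UFFDIO_REGISTER, &reg);
}
```

Only the UFFD-registered mapping is passed here; the second mapping of the same
pages stays unregistered so that userspace can populate it freely.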

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
userspace resolves the fault by either a) doing nothing if the contents are
already correct, or b) updating the underlying contents using the second,
non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
etc.). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
"I have ensured the page contents are correct, carry on setting up the mapping".
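
A minimal sketch of that resolution step, again assuming the mainline UAPI
(struct uffdio_continue with range and mode fields, and the UFFDIO_CONTINUE
ioctl, as found in headers from Linux 5.13 onward). resolve_minor_fault() and
its parameters are hypothetical names chosen for illustration.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: resolve one minor fault.
 *
 * `alias` is the address of the faulting page as seen through the second,
 * non-UFFD mapping. `fresh` holds up-to-date contents, or is NULL when the
 * page is already correct. `page_size` must be a power of two.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int resolve_minor_fault(int uffd, unsigned long fault_addr,
			       void *alias, const void *fresh,
			       size_t page_size)
{
	struct uffdio_continue cont;

	/* Case (b): update the contents via the non-UFFD mapping. */
	if (fresh)
		memcpy(alias, fresh, page_size);

	/* Cases (a) and (b): tell the kernel to install the mapping. */
	memset(&cont, 0, sizeof(cont));
	cont.range.start = fault_addr & ~((unsigned long)page_size - 1);
	cont.range.len = page_size;
	cont.mode = 0;

	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}
```

The fault address would normally come from the struct uffd_msg read from the
userfaultfd; it is rounded down here because the range has to be page aligned.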

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
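
The decision made in step 4 can be modelled in a few lines. This is a
standalone sketch, not code from the series: struct target_state,
on_minor_fault(), the small PAGE_SZ, and the in-memory `source` buffer (a
stand-in for fetching a page from the source machine) are all hypothetical,
and the point where a real manager would issue UFFDIO_CONTINUE is only counted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { PAGE_SZ = 4096 }; /* illustrative only; the series targets huge pages */

struct target_state {
	unsigned char *alias;        /* non-UFFD mapping of guest memory */
	const unsigned char *source; /* stand-in for the source machine's copy */
	bool *stale;                 /* per page: target copy is out of date */
	size_t npages;
	size_t continues;            /* how many faults have been resolved */
};

/*
 * Called when the guest first touches `page` through the UFFD mapping.
 * Returns 0 once the fault is resolved, -1 if `page` is out of range.
 */
static int on_minor_fault(struct target_state *t, size_t page)
{
	if (page >= t->npages)
		return -1;

	if (t->stale[page]) {
		/* Out of date: refresh the page through the alias mapping. */
		memcpy(t->alias + page * PAGE_SZ,
		       t->source + page * PAGE_SZ, PAGE_SZ);
		t->stale[page] = false;
	}

	/* Copy or no copy, a real manager issues UFFDIO_CONTINUE here. */
	t->continues++;
	return 0;
}
```

A background thread could walk the same `stale` array and refresh pages ahead
of time, which is the non-on-demand path described above.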

Interaction with Existing APIs
==============================

Because it's possible to combine registration modes (e.g. a single VMA can be
userfaultfd-registered MINOR | MISSING), and because it's up to userspace how to
resolve faults once they are received, I spent some time thinking through how
the existing API interacts with the new feature.

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
modifications, the existing codepath assumes a new page needs to be allocated.
This is okay: userspace must have a second, non-UFFD-registered mapping anyway,
so there is little reason to use these ioctls here (a plain memcpy or memset
through that mapping does the same job).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).
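
The combinations above can be summarised as a small lookup. This is only a
model of the behaviour described in this cover letter, not kernel code: the
enums and expected_result() are hypothetical, BACKING_HUGETLB is taken to mean
a shared hugetlbfs mapping, a return of 0 for the "works" cases is my reading
rather than something stated above, and UFFDIO_WRITEPROTECT is left out.

```c
#include <errno.h>
#include <stdbool.h>

enum uffd_op { OP_COPY, OP_ZEROPAGE, OP_CONTINUE };
enum backing { BACKING_PRIVATE, BACKING_SHMEM, BACKING_HUGETLB };

/*
 * Expected outcome of using `op` to resolve a fault on `mem`:
 * 0 if the ioctl can resolve it, otherwise a negative errno.
 */
static int expected_result(enum uffd_op op, enum backing mem, bool minor_fault)
{
	switch (op) {
	case OP_CONTINUE:
		/* Non-shared memory and shmem: not supported by this series. */
		if (mem != BACKING_HUGETLB)
			return -EINVAL;
		/* hugetlb: only a minor fault has an existing page to map. */
		return minor_fault ? 0 : -EFAULT;
	case OP_COPY:
		/* A page already exists, so there is nothing to allocate. */
		return minor_fault ? -EEXIST : 0;
	case OP_ZEROPAGE:
		/* Unsupported on hugetlb regardless of the kind of fault. */
		if (mem == BACKING_HUGETLB)
			return -EINVAL;
		return minor_fault ? -EEXIST : 0;
	}
	return -EINVAL; /* unknown operation */
}
```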

Dependencies
============

I've included 4 commits from Peter Xu's larger series
(https://lore.kernel.org/patchwork/cover/1366017/) in this series. My changes
depend on his work, to disable huge PMD sharing for MINOR registered userfaultfd
areas. I included the 4 commits directly because a) it lets this series just be
applied and work as-is, and b) they are fairly standalone, and could potentially
be merged even without the rest of the larger series Peter submitted. Thanks
Peter!

Also, although it doesn't affect minor fault handling, I did notice that the
userfaultfd self test sometimes experienced memory corruption
(https://lore.kernel.org/patchwork/cover/1356755/). For anyone testing this
series, it may be useful to apply that series first to fix the selftest
flakiness. That series doesn't have to be merged into mainline / maintainer
branches before mine, though.

Future Work
===========

Currently the patchset only supports hugetlbfs. There is no reason it can't work
with shmem, but I expect hugetlbfs to be much more commonly used since we're
talking about backing guest memory for VMs. I plan to implement shmem support in
a follow-up patch series.

Axel Rasmussen (5):
  userfaultfd: add minor fault registration mode
  userfaultfd: disable huge PMD sharing for MINOR registered VMAs
  userfaultfd: add UFFDIO_CONTINUE ioctl
  userfaultfd: update documentation to describe minor fault handling
  userfaultfd/selftests: add test exercising minor fault handling

Peter Xu (4):
  hugetlb: Pass vma into huge_pte_alloc()
  hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
  hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp

 Documentation/admin-guide/mm/userfaultfd.rst | 105 ++++++----
 arch/arm64/mm/hugetlbpage.c                  |   5 +-
 arch/ia64/mm/hugetlbpage.c                   |   3 +-
 arch/mips/mm/hugetlbpage.c                   |   4 +-
 arch/parisc/mm/hugetlbpage.c                 |   2 +-
 arch/powerpc/mm/hugetlbpage.c                |   3 +-
 arch/s390/mm/hugetlbpage.c                   |   2 +-
 arch/sh/mm/hugetlbpage.c                     |   2 +-
 arch/sparc/mm/hugetlbpage.c                  |   2 +-
 fs/proc/task_mmu.c                           |   1 +
 fs/userfaultfd.c                             | 190 ++++++++++++++++---
 include/linux/hugetlb.h                      |  22 ++-
 include/linux/mm.h                           |   1 +
 include/linux/mmu_notifier.h                 |   1 +
 include/linux/userfaultfd_k.h                |  29 ++-
 include/trace/events/mmflags.h               |   1 +
 include/uapi/linux/userfaultfd.h             |  36 +++-
 mm/hugetlb.c                                 |  61 ++++--
 mm/userfaultfd.c                             |  88 ++++++---
 tools/testing/selftests/vm/userfaultfd.c     | 147 +++++++++++++-
 20 files changed, 570 insertions(+), 135 deletions(-)

--
2.30.0.284.gd98b1dd5eaa7-goog


* Re: [PATCH 7/9] userfaultfd: add UFFDIO_CONTINUE ioctl
From: kernel test robot @ 2021-01-16  1:21 UTC
  To: kbuild

CC: kbuild-all(a)lists.01.org
In-Reply-To: <20210115190451.3135416-8-axelrasmussen@google.com>
References: <20210115190451.3135416-8-axelrasmussen@google.com>
TO: Axel Rasmussen <axelrasmussen@google.com>
TO: Alexander Viro <viro@zeniv.linux.org.uk>
TO: Alexey Dobriyan <adobriyan@gmail.com>
TO: Andrea Arcangeli <aarcange@redhat.com>
TO: Andrew Morton <akpm@linux-foundation.org>
CC: Linux Memory Management List <linux-mm@kvack.org>
TO: Anshuman Khandual <anshuman.khandual@arm.com>
TO: Catalin Marinas <catalin.marinas@arm.com>
TO: Chinwen Chang <chinwen.chang@mediatek.com>
TO: Huang Ying <ying.huang@intel.com>
TO: Ingo Molnar <mingo@redhat.com>

Hi Axel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on arm64/for-next/core]
[also build test WARNING on powerpc/next s390/features tip/perf/core linus/master v5.11-rc3 next-20210115]
[cannot apply to hp-parisc/for-next hnaz-linux-mm/master ia64/next sparc-next/master sparc/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Axel-Rasmussen/userfaultfd-add-minor-fault-handling/20210116-030900
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
:::::: branch date: 6 hours ago
:::::: commit date: 6 hours ago
config: parisc-randconfig-m031-20210115 (attached as .config)
compiler: hppa-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
mm/userfaultfd.c:488 mfill_atomic_pte() error: uninitialized symbol 'err'.

vim +/err +488 mm/userfaultfd.c

60d4d2d2b40e44cd Mike Kravetz     2017-02-22  431  
3217d3c79b5d7aab Mike Rapoport    2017-09-06  432  static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
3217d3c79b5d7aab Mike Rapoport    2017-09-06  433  						pmd_t *dst_pmd,
3217d3c79b5d7aab Mike Rapoport    2017-09-06  434  						struct vm_area_struct *dst_vma,
3217d3c79b5d7aab Mike Rapoport    2017-09-06  435  						unsigned long dst_addr,
3217d3c79b5d7aab Mike Rapoport    2017-09-06  436  						unsigned long src_addr,
3217d3c79b5d7aab Mike Rapoport    2017-09-06  437  						struct page **page,
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  438  						enum mcopy_atomic_mode mode,
72981e0e7b609c74 Andrea Arcangeli 2020-04-06  439  						bool wp_copy)
3217d3c79b5d7aab Mike Rapoport    2017-09-06  440  {
3217d3c79b5d7aab Mike Rapoport    2017-09-06  441  	ssize_t err;
3217d3c79b5d7aab Mike Rapoport    2017-09-06  442  
5b51072e97d58718 Andrea Arcangeli 2018-11-30  443  	/*
5b51072e97d58718 Andrea Arcangeli 2018-11-30  444  	 * The normal page fault path for a shmem will invoke the
5b51072e97d58718 Andrea Arcangeli 2018-11-30  445  	 * fault, fill the hole in the file and COW it right away. The
5b51072e97d58718 Andrea Arcangeli 2018-11-30  446  	 * result generates plain anonymous memory. So when we are
5b51072e97d58718 Andrea Arcangeli 2018-11-30  447  	 * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll
5b51072e97d58718 Andrea Arcangeli 2018-11-30  448  	 * generate anonymous memory directly without actually filling
5b51072e97d58718 Andrea Arcangeli 2018-11-30  449  	 * the hole. For the MAP_PRIVATE case the robustness check
5b51072e97d58718 Andrea Arcangeli 2018-11-30  450  	 * only happens in the pagetable (to verify it's still none)
5b51072e97d58718 Andrea Arcangeli 2018-11-30  451  	 * and not in the radix tree.
5b51072e97d58718 Andrea Arcangeli 2018-11-30  452  	 */
5b51072e97d58718 Andrea Arcangeli 2018-11-30  453  	if (!(dst_vma->vm_flags & VM_SHARED)) {
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  454  		switch (mode) {
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  455  		case MCOPY_ATOMIC_NORMAL:
3217d3c79b5d7aab Mike Rapoport    2017-09-06  456  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
72981e0e7b609c74 Andrea Arcangeli 2020-04-06  457  					       dst_addr, src_addr, page,
72981e0e7b609c74 Andrea Arcangeli 2020-04-06  458  					       wp_copy);
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  459  			break;
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  460  		case MCOPY_ATOMIC_ZEROPAGE:
3217d3c79b5d7aab Mike Rapoport    2017-09-06  461  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
3217d3c79b5d7aab Mike Rapoport    2017-09-06  462  						 dst_vma, dst_addr);
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  463  			break;
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  464  		/* It only makes sense to CONTINUE for shared memory. */
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  465  		case MCOPY_ATOMIC_CONTINUE:
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  466  			err = -EINVAL;
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  467  			break;
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  468  		}
3217d3c79b5d7aab Mike Rapoport    2017-09-06  469  	} else {
72981e0e7b609c74 Andrea Arcangeli 2020-04-06  470  		VM_WARN_ON_ONCE(wp_copy);
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  471  		switch (mode) {
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  472  		case MCOPY_ATOMIC_NORMAL:
3217d3c79b5d7aab Mike Rapoport    2017-09-06  473  			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
3217d3c79b5d7aab Mike Rapoport    2017-09-06  474  						     dst_vma, dst_addr,
3217d3c79b5d7aab Mike Rapoport    2017-09-06  475  						     src_addr, page);
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  476  			break;
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  477  		case MCOPY_ATOMIC_ZEROPAGE:
8fb44e5403ca86e3 Mike Rapoport    2017-09-06  478  			err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd,
8fb44e5403ca86e3 Mike Rapoport    2017-09-06  479  						       dst_vma, dst_addr);
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  480  			break;
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  481  		case MCOPY_ATOMIC_CONTINUE:
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  482  			/* FIXME: Add minor fault interception for shmem. */
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  483  			err = -EINVAL;
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  484  			break;
b521250c2fe8f9f0 Axel Rasmussen   2021-01-15  485  		}
3217d3c79b5d7aab Mike Rapoport    2017-09-06  486  	}
3217d3c79b5d7aab Mike Rapoport    2017-09-06  487  
3217d3c79b5d7aab Mike Rapoport    2017-09-06 @488  	return err;
3217d3c79b5d7aab Mike Rapoport    2017-09-06  489  }
3217d3c79b5d7aab Mike Rapoport    2017-09-06  490  
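
The warning appears to stem from the two switch statements having no default
case: if `mode` ever held a value outside the three enumerators, `err` would be
returned without having been assigned. A reduced, standalone model of the
pattern with one possible fix (giving `err` an initial value) might look like
the following. The enum and function names are stand-ins rather than the
kernel's, the 0 results merely stand in for the helper calls, and this is not
necessarily how later revisions of the patch addressed the report.

```c
#include <errno.h>

enum fill_mode { FILL_NORMAL, FILL_ZEROPAGE, FILL_CONTINUE };

/* Reduced model of a switch over `mode` that must always produce a result. */
static long fill_pte_model(enum fill_mode mode)
{
	/* Initialized so every path, including unexpected modes, is defined. */
	long err = -EINVAL;

	switch (mode) {
	case FILL_NORMAL:
		err = 0; /* stand-in for the copy helper's return value */
		break;
	case FILL_ZEROPAGE:
		err = 0; /* stand-in for the zeropage helper's return value */
		break;
	case FILL_CONTINUE:
		err = -EINVAL;
		break;
	}

	return err;
}
```

Adding a `default:` case that sets an error would be an alternative with the
same effect on the uninitialized path.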

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 21286 bytes --]

Thread overview: 35+ messages
2021-01-15 19:04 [PATCH 0/9] userfaultfd: add minor fault handling Axel Rasmussen
2021-01-15 19:04 ` Axel Rasmussen
2021-01-15 19:04 ` [PATCH 1/9] hugetlb: Pass vma into huge_pte_alloc() Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-16  0:42   ` kernel test robot
2021-01-16  0:42     ` kernel test robot
2021-01-15 19:04 ` [PATCH 2/9] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-15 23:36   ` kernel test robot
2021-01-15 23:36     ` kernel test robot
2021-01-21 18:52   ` Peter Xu
2021-01-15 19:04 ` [PATCH 3/9] mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-15 19:04 ` [PATCH 4/9] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-15 22:37   ` kernel test robot
2021-01-15 22:37     ` kernel test robot
2021-01-15 19:04 ` [PATCH 5/9] userfaultfd: add minor fault registration mode Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-21 18:49   ` Peter Xu
2021-01-15 19:04 ` [PATCH 6/9] userfaultfd: disable huge PMD sharing for MINOR registered VMAs Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-21 18:59   ` Peter Xu
2021-01-15 19:04 ` [PATCH 7/9] userfaultfd: add UFFDIO_CONTINUE ioctl Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-21 22:46   ` Peter Xu
2021-01-21 23:46     ` Axel Rasmussen
2021-01-15 19:04 ` [PATCH 8/9] userfaultfd: update documentation to describe minor fault handling Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-15 19:04 ` [PATCH 9/9] userfaultfd/selftests: add test exercising " Axel Rasmussen
2021-01-15 19:04   ` Axel Rasmussen
2021-01-21 19:12 ` [PATCH 0/9] userfaultfd: add " Peter Xu
2021-01-21 22:13   ` Axel Rasmussen
2021-01-21 22:37     ` Peter Xu
2021-01-16  1:21 [PATCH 7/9] userfaultfd: add UFFDIO_CONTINUE ioctl kernel test robot
