linux-mm.kvack.org archive mirror
* [PATCH 00/21] RFC: userfaultfd v3
@ 2015-03-05 17:17 Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
                   ` (21 more replies)
  0 siblings, 22 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Hello everyone,

This is an RFC for the userfaultfd syscall API v3 that addresses the
feedback received on the previous v2 submission.

The main change from v2 is that MADV_USERFAULT/NOUSERFAULT
disappeared (they're replaced by the UFFDIO_REGISTER/UNREGISTER
ioctls). In short, userfaults are now only possible through the
userfaultfd. The remap_anon_pages syscall also disappeared, replaced
by the UFFDIO_REMAP ioctl, which is in turn mostly obsoleted by the
newer UFFDIO_COPY and UFFDIO_ZEROPAGE ioctls that are more efficient
because they never have to flush the TLB. The suggestion to copy the
data instead of moving it, in order to resolve the userfault, was
immediately agreed upon.
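
For illustration, resolving one missing-page userfault with the new
method could look like the minimal sketch below (error handling
omitted; it assumes the uffdio_copy layout with dst/src/len/mode
fields from the UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI patch later in this
series):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* sketch: resolve one missing-page userfault with UFFDIO_COPY */
static int resolve_missing_fault(int uffd, unsigned long fault_addr,
				 void *src_page, unsigned long page_size)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = fault_addr & ~(page_size - 1); /* page-align the fault */
	copy.src = (unsigned long)src_page; /* staging page with the data */
	copy.len = page_size;
	copy.mode = 0;	/* no DONTWAKE: wake the blocked faulting thread */

	/* atomically maps the copied page, no TLB flush required */
	return ioctl(uffd, UFFDIO_COPY, &copy);
}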

The latest code can also be cloned here:

git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git


Userfaults allow implementing on demand paging from userland and,
more generally, they allow userland to take control of various types
of page faults more efficiently.

For example, userfaults allow a proper and more optimal
implementation of the PROT_NONE+SIGSEGV trick.
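
For reference, the trick being replaced looks roughly like the sketch
below (plain POSIX, error handling omitted; the hardcoded 4k page
size is an assumption): mark the region PROT_NONE, catch the SIGSEGV,
make the page accessible and fill it before returning.

#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE_ASSUMED 4096UL	/* assumption: 4k pages */

static void segv_handler(int sig, siginfo_t *si, void *uc)
{
	void *page = (void *)((uintptr_t)si->si_addr &
			      ~(PAGE_SIZE_ASSUMED - 1));

	/* make the page accessible again... */
	mprotect(page, PAGE_SIZE_ASSUMED, PROT_READ | PROT_WRITE);
	/* ...and fill it (network, disk, ...) before returning */
}

static void install_segv_handler(void)
{
	struct sigaction sa;

	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);
}

Each fault here costs a signal delivery plus an mprotect that splits
vmas and takes the mmap_sem for writing, which is exactly what the
userfaultfd approach avoids.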

There has been interest from multiple users for different use cases:

1) KVM postcopy live migration (one form of cloud memory
   externalization). KVM postcopy live migration is the primary driver
   of this work:
   http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
   http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) KVM postcopy live snapshotting (allowing the memory usage to be
   limited/throttled, unlike fork would, plus avoiding the fork
   overhead in the first place).

   The syscall API already contemplates the wrprotect fault tracking
   and is generic enough to allow its later implementation in a
   backwards compatible fashion.

3) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to also work on tmpfs; then the
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is
   tmpfs-backed.

4) alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access the
   virtual regions marked volatile with the CPU. This also requires
   point 3) to be fulfilled, as volatile pages happily apply to tmpfs.

5) postcopy live migration of binaries inside linux containers.

Even though no real use case has requested it yet, the new API also
allows implementing distributed shared memory in a way that lets
readonly shared mappings exist simultaneously on different hosts and
become exclusive at the first wrprotect fault.

The UFFDIO_REMAP method is still present in the patchset but it's
provided primarily to remove (not add) memory from the userfault
range. The addition of the UFFDIO_REMAP method is intentionally kept
at the end of the patchset. The postcopy live migration qemu code will
only use UFFDIO_COPY and UFFDIO_ZEROPAGE. UFFDIO_REMAP isn't intended
to be merged upstream in the short term, and it can be dropped later
if there's an agreement that it's a bad idea to keep it around in the
patchset.

David ran some KVM postcopy live migration benchmarks on an 8-way CPU
system and measured that using UFFDIO_COPY instead of UFFDIO_REMAP
resulted in roughly a 20% reduction in latency, which is good. The
standard deviation of the latency measurement decreased significantly
as well (because the number of CPUs that required IPI delivery was
variable, while the copy always takes roughly the same time). A bigger
improvement can be expected if measured on a larger host with more
CPUs.

All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
live migration, and the UFFD can be passed to a manager process
through unix domain sockets to satisfy point 5).
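
Nothing userfaultfd-specific is needed for that: it's plain
SCM_RIGHTS file descriptor passing over a unix domain socket, roughly
as in this sketch (error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* sketch: hand the userfaultfd to a manager process over a socket */
static int send_uffd_to_manager(int sock, int uffd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &uffd, sizeof(int));

	/* the manager receives its own fd for the same uffd context */
	return sendmsg(sock, &msg, 0);
}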

I look forward to discussing this further next week at the LSF/MM
summit; if you're attending the summit, see you soon!

Comments welcome, thanks,
Andrea

Credits: partially funded by the Orbit EU project.

PS. There is one TODO detail worth mentioning for completeness that
affects usage 2) and UFFDIO_REMAP if used to remove memory from the
userfault range: handle_userfault() is only effective if
FAULT_FLAG_ALLOW_RETRY is set... but that is only set on the first
attempted page fault. If by accident some thread was already faulting
in the range, the first page fault attempt returned VM_FAULT_RETRY,
and UFFDIO_REMAP or UFFDIO_WP jumps in to arm the userfault just
before the second attempt starts, a SIGBUS would be raised by the page
fault. Stopping all thread access to the userfault ranges during
UFFDIO_REMAP/WP, while possible, isn't optimal. Currently (excluding
real filebacked mappings and handle_userfault() itself, which is
clearly no problem) only tmpfs or a swapin can return
VM_FAULT_RETRY. To close this SIGBUS window for all usages, the
simplest solution would be to still allow VM_FAULT_RETRY to be
returned when FAULT_FLAG_TRIED is set (but only by handle_userfault(),
which has a legitimate reason for insisting a second time in a row
with VM_FAULT_RETRY). That would require some change to the FAULT_FLAG
semantics. Again, userland could cope with this detail, but it'd be
inefficient to solve it in userland. This would be a fully backwards
compatible change and it's only strictly required by the wrprotect
tracking mode, so it's no problem to solve this later. Because of its
inherent racy nature, nobody could possibly depend on a racy SIGBUS
being raised now, when it won't be raised anymore later.

Andrea Arcangeli (21):
  userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
  userfaultfd: linux/Documentation/vm/userfaultfd.txt
  userfaultfd: uAPI
  userfaultfd: linux/userfaultfd_k.h
  userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
  userfaultfd: call handle_userfault() for userfaultfd_missing() faults
  userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
  userfaultfd: prevent khugepaged to merge if userfaultfd is armed
  userfaultfd: add new syscall to provide memory externalization
  userfaultfd: buildsystem activation
  userfaultfd: activate syscall
  userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
  userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE
    preparation
  userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
  userfaultfd: remap_pages: rmap preparation
  userfaultfd: remap_pages: swp_entry_swapcount() preparation
  userfaultfd: UFFDIO_REMAP uABI
  userfaultfd: remap_pages: UFFDIO_REMAP preparation
  userfaultfd: UFFDIO_REMAP
  userfaultfd: add userfaultfd_wp mm helpers

 Documentation/ioctl/ioctl-number.txt   |    1 +
 Documentation/vm/userfaultfd.txt       |   97 +++
 arch/powerpc/include/asm/systbl.h      |    1 +
 arch/powerpc/include/asm/unistd.h      |    2 +-
 arch/powerpc/include/uapi/asm/unistd.h |    1 +
 arch/x86/syscalls/syscall_32.tbl       |    1 +
 arch/x86/syscalls/syscall_64.tbl       |    1 +
 fs/Makefile                            |    1 +
 fs/userfaultfd.c                       | 1128 ++++++++++++++++++++++++++++++++
 include/linux/mm.h                     |    4 +-
 include/linux/mm_types.h               |   11 +
 include/linux/swap.h                   |    6 +
 include/linux/syscalls.h               |    1 +
 include/linux/userfaultfd_k.h          |  112 ++++
 include/linux/wait.h                   |    5 +-
 include/uapi/linux/userfaultfd.h       |  150 +++++
 init/Kconfig                           |   11 +
 kernel/fork.c                          |    3 +-
 kernel/sched/wait.c                    |    7 +-
 kernel/sys_ni.c                        |    1 +
 mm/Makefile                            |    1 +
 mm/huge_memory.c                       |  217 +++++-
 mm/madvise.c                           |    3 +-
 mm/memory.c                            |   16 +
 mm/mempolicy.c                         |    4 +-
 mm/mlock.c                             |    3 +-
 mm/mmap.c                              |   39 +-
 mm/mprotect.c                          |    3 +-
 mm/rmap.c                              |    9 +
 mm/swapfile.c                          |   13 +
 mm/userfaultfd.c                       |  793 ++++++++++++++++++++++
 net/sunrpc/sched.c                     |    2 +-
 32 files changed, 2593 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/vm/userfaultfd.txt
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd_k.h
 create mode 100644 include/uapi/linux/userfaultfd.h
 create mode 100644 mm/userfaultfd.c


* [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

userfaultfd needs to wake all waiters in a waitqueue (by passing 0 as
the nr parameter), instead of the current hardcoded 1 (which would
wake just the first waiter in the head list).
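
After this change callers choose between the old and the new
behaviour through nr, e.g. (illustrative only; the waitqueue head and
key names are made up):

	/* previous behaviour: wake only the first waiter in the list */
	__wake_up_locked_key(&wqh, TASK_NORMAL, 1, &key);

	/* what userfaultfd needs: nr == 0 wakes all waiters matching key */
	__wake_up_locked_key(&wqh, TASK_NORMAL, 0, &key);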

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ++++---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2db8334..cf884cf 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,7 +147,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -179,7 +180,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)					\
-	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+	__wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 852143a..6da208dd2 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -106,9 +106,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	__wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -283,7 +284,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, key);
+		__wake_up_locked_key(q, mode, 1, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index b91fd9c..dead9e0 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
 	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
 	ret = atomic_dec_and_test(&task->tk_count);
 	if (waitqueue_active(wq))
-		__wake_up_locked_key(wq, TASK_NORMAL, &k);
+		__wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
 	spin_unlock_irqrestore(&wq->lock, flags);
 	return ret;
 }


* [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-06 15:39   ` [Qemu-devel] " Eric Blake
  2015-03-05 17:17 ` [PATCH 03/21] userfaultfd: uAPI Andrea Arcangeli
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Add documentation.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/vm/userfaultfd.txt

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 0000000..2ec296c
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,97 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow implementing on demand paging from userland and,
+more generally, they allow userland to take control of various memory
+page faults, something otherwise only the kernel code could do.
+
+For example, userfaults allow a proper and more optimal
+implementation of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides for two primary functionalities:
+
+1) read/POLLIN protocol to notify a userland thread of the faults
+   happening
+
+2) various UFFDIO_* ioctls that can manipulate the virtual memory
+   regions registered in the userfaultfd, allowing userland to
+   efficiently resolve the userfaults it receives via 1) or to modify
+   the virtual memory in the background
+
+The real advantage of userfaults compared to regular virtual memory
+management (mremap/mprotect) is that userfault operations never
+involve heavyweight structures like vmas (in fact the userfaultfd
+runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page- (or hugepage-) granular fault
+tracking when dealing with virtual address spaces that could span
+terabytes. Too many vmas would be needed for that.
+
+The userfaultfd, once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware of what is going on
+(well of course unless they later try to use the userfaultfd
+themselves on the same region the manager is already tracking, which
+is a corner case that would currently return -EBUSY).
+
+== API ==
+
+When first opened the userfaultfd must be enabled by invoking the
+UFFDIO_API ioctl with an uffdio_api.api value set to UFFD_API, which
+specifies the read/POLLIN protocol userland intends to speak on the
+UFFD. The UFFDIO_API ioctl, if successful (i.e. if the requested
+uffdio_api.api is spoken also by the running kernel), will return in
+uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks describing,
+respectively, the activated feature bits below PAGE_SHIFT in the
+userfault addresses returned by read(2) and the generic ioctls
+available.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to modify the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means a userfault
+could be triggering just before userland maps the user-faulted page
+in the background. To avoid POLLIN resulting in an unexpected blocking
+read (if the UFFD is not opened in nonblocking mode in the first
+place), we don't allow the background thread to wake userfaults that
+haven't been read by userland yet. If we did that, the UFFDIO_WAKE
+ioctl could likely be dropped. This may change in the future (with a
+UFFD_API protocol bump combined with the removal of the UFFDIO_WAKE
+ioctl) if it's demonstrated to be a valid optimization worth forcing
+userland to always use the UFFD in nonblocking mode when combined
+with POLLIN.
+
+userfaultfd is also a generic enough feature that it allows KVM to
+implement postcopy live migration (one form of memory externalization
+consisting of a virtual machine running with part or all of its memory
+residing on a different node in the cloud) without having to modify a
+single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
+and all other GUP features work just fine in combination with
+userfaults (userfaults trigger async page faults in the guest
+scheduler so those guest processes that aren't waiting for userfaults
+can keep running in the guest vcpus).
+
+The primary ioctl to resolve userfaults is UFFDIO_COPY. It
+atomically copies a page into the userfault registered range and wakes
+up the blocked userfaults (unless uffdio_copy.mode &
+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
+UFFDIO_COPY.
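
For illustration, a minimal userland sketch of the handshake and
registration described in the API section above could look as follows
(error handling omitted; it assumes __NR_userfaultfd is wired up as
in the syscall activation patches of this series):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* sketch: enable the protocol and register a range for missing faults */
static int uffd_setup(void *area, unsigned long len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	int uffd;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -1;

	/* mandatory handshake: declare the protocol userland speaks */
	if (ioctl(uffd, UFFDIO_API, &api))
		return -1;
	/* api.bits/api.ioctls now describe what the kernel provides */

	reg.range.start = (unsigned long)area;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;
	/* reg.ioctls lists the ioctls usable to resolve faults here */

	return uffd;
}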


* [PATCH 03/21] userfaultfd: uAPI
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/ioctl/ioctl-number.txt |  1 +
 include/uapi/linux/userfaultfd.h     | 81 ++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)
 create mode 100644 include/uapi/linux/userfaultfd.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 8136e1f..be2d4a2 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -301,6 +301,7 @@ Code  Seq#(hex)	Include File		Comments
 0xA3	80-8F	Port ACL		in development:
 					<mailto:tlewis@mindspring.com>
 0xA3	90-9F	linux/dtlk.h
+0xAA	00-3F	linux/uapi/linux/userfaultfd.h
 0xAB	00-1F	linux/nbd.h
 0xAC	00-1F	linux/raw.h
 0xAD	00	Netfilter device	in development:
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
new file mode 100644
index 0000000..9a8cd56
--- /dev/null
+++ b/include/uapi/linux/userfaultfd.h
@@ -0,0 +1,81 @@
+/*
+ *  include/uapi/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#define UFFD_API ((__u64)0xAA)
+/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_IOCTLS				\
+	((__u64)1 << _UFFDIO_REGISTER |		\
+	 (__u64)1 << _UFFDIO_UNREGISTER |	\
+	 (__u64)1 << _UFFDIO_API)
+#define UFFD_API_RANGE_IOCTLS			\
+	((__u64)1 << _UFFDIO_WAKE)
+
+/*
+ * Valid ioctl command number range with this API is from 0x00 to
+ * 0x3F.  UFFDIO_API is the fixed number, everything else can be
+ * changed by implementing a different UFFD_API. If sticking to the
+ * same UFFD_API more ioctls can be added and userland will be aware
+ * of which ioctls the running kernel implements through the ioctl
+ * command bitmask written by UFFDIO_API.
+ */
+#define _UFFDIO_REGISTER		(0x00)
+#define _UFFDIO_UNREGISTER		(0x01)
+#define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_API			(0x3F)
+
+/* userfaultfd ioctl ids */
+#define UFFDIO 0xAA
+#define UFFDIO_API		_IOWR(UFFDIO, _UFFDIO_API,	\
+				      struct uffdio_api)
+#define UFFDIO_REGISTER		_IOWR(UFFDIO, _UFFDIO_REGISTER, \
+				      struct uffdio_register)
+#define UFFDIO_UNREGISTER	_IOR(UFFDIO, _UFFDIO_UNREGISTER,	\
+				     struct uffdio_range)
+#define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
+				     struct uffdio_range)
+
+/*
+ * Valid bits below PAGE_SHIFT in the userfault address read through
+ * the read() syscall.
+ */
+#define UFFD_BIT_WRITE	(1<<0)	/* this was a write fault, MISSING or WP */
+#define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
+#define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
+
+struct uffdio_api {
+	/* userland asks for an API number */
+	__u64 api;
+
+	/* kernel answers below with the available features for the API */
+	__u64 bits;
+	__u64 ioctls;
+};
+
+struct uffdio_range {
+	__u64 start;
+	__u64 len;
+};
+
+struct uffdio_register {
+	struct uffdio_range range;
+#define UFFDIO_REGISTER_MODE_MISSING	((__u64)1<<0)
+#define UFFDIO_REGISTER_MODE_WP		((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * kernel answers which ioctl commands are available for the
+	 * range, keep at the end as the last 8 bytes aren't read.
+	 */
+	__u64 ioctls;
+};
+
+#endif /* _LINUX_USERFAULTFD_H */
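
Since the feature bits live below PAGE_SHIFT in the value returned by
read(2), userland would decode it along these lines (a sketch; the
hardcoded 4k page size is an assumption):

#include <stdint.h>
#include <linux/userfaultfd.h>

#define UFFD_PAGE_SIZE 4096UL	/* assumption: must match kernel PAGE_SIZE */

/* sketch: split a read(2) userfault value into address and flags */
static void decode_userfault(uint64_t val)
{
	uint64_t addr = val & ~(UFFD_PAGE_SIZE - 1);	/* faulting address */
	uint64_t bits = val & (UFFD_PAGE_SIZE - 1);	/* UFFD_BIT_* flags */

	if (bits & UFFD_BIT_WRITE) {
		/* write fault: resolving with a zeropage would be wrong */
	}
	(void)addr;	/* resolve the fault at addr */
}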


* [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 03/21] userfaultfd: uAPI Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h | 79 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 include/linux/userfaultfd_k.h

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
new file mode 100644
index 0000000..e1e4360
--- /dev/null
+++ b/include/linux/userfaultfd_k.h
@@ -0,0 +1,79 @@
+/*
+ *  include/linux/userfaultfd_k.h
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_K_H
+#define _LINUX_USERFAULTFD_K_H
+
+#ifdef CONFIG_USERFAULTFD
+
+#include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (UFFD_SHARED_FCNTL_FLAGS)
+
+extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+			    unsigned int flags, unsigned long reason);
+
+/* mm helpers */
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_MISSING;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+}
+
+#else /* CONFIG_USERFAULTFD */
+
+/* mm helpers */
+static inline int handle_userfault(struct vm_area_struct *vma,
+				   unsigned long address,
+				   unsigned int flags,
+				   unsigned long reason)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return true;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+#endif /* CONFIG_USERFAULTFD */
+
+#endif /* _LINUX_USERFAULTFD_K_H */


* [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:48   ` Pavel Emelyanov
  2015-03-05 17:17 ` [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This adds the vm_userfaultfd_ctx to the vm_area_struct.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm_types.h | 11 +++++++++++
 kernel/fork.c            |  1 +
 2 files changed, 12 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03a..fbf21f5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -247,6 +247,16 @@ struct vm_region {
 						* this region */
 };
 
+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+	struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -313,6 +323,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 };
 
 struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..cb215c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_next = tmp->vm_prev = NULL;
+		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);


* [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

These two flags get set in vma->vm_flags to tell the VM common code
whether the userfaultfd is armed and in which mode (only tracking
missing faults, only tracking wrprotect faults, or both). If neither
flag is set, the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h | 2 ++
 kernel/fork.c      | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..762ef9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -123,8 +123,10 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MAYSHARE	0x00000080
 
 #define VM_GROWSDOWN	0x00000100	/* general info on the segment */
+#define VM_UFFD_MISSING	0x00000200	/* missing pages tracking */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
+#define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
diff --git a/kernel/fork.c b/kernel/fork.c
index cb215c0..cfab6e9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -423,7 +423,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		tmp->vm_mm = mm;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~VM_LOCKED;
+		tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;


* [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This is where the page fault code must be modified to call
handle_userfault() if userfaultfd_missing() is true (i.e. if the
vma->vm_flags has VM_UFFD_MISSING set).

handle_userfault() then takes care of blocking the page fault and
delivering it to userland.

The fault flags must also be passed as a parameter so the "read|write"
kind of fault can be passed to userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 68 ++++++++++++++++++++++++++++++++++++++------------------
 mm/memory.c      | 16 +++++++++++++
 2 files changed, 62 insertions(+), 22 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f0207cf..5374132 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -708,7 +709,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
-					struct page *page)
+					struct page *page, unsigned int flags)
 {
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
@@ -716,12 +717,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
-		return VM_FAULT_OOM;
+	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg)) {
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK);
+		return VM_FAULT_FALLBACK;
+	}
 
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
 		mem_cgroup_cancel_charge(page, memcg);
+		put_page(page);
 		return VM_FAULT_OOM;
 	}
 
@@ -741,6 +746,21 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
+
+		/* Deliver the page fault to userland */
+		if (userfaultfd_missing(vma)) {
+			int ret;
+
+			spin_unlock(ptl);
+			mem_cgroup_cancel_charge(page, memcg);
+			put_page(page);
+			pte_free(mm, pgtable);
+			ret = handle_userfault(vma, haddr, flags,
+					       VM_UFFD_MISSING);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
+		}
+
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
@@ -751,6 +771,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		atomic_long_inc(&mm->nr_ptes);
 		spin_unlock(ptl);
+		count_vm_event(THP_FAULT_ALLOC);
 	}
 
 	return 0;
@@ -762,19 +783,16 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 }
 
 /* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
 {
 	pmd_t entry;
-	if (!pmd_none(*pmd))
-		return false;
 	entry = mk_pmd(zero_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	atomic_long_inc(&mm->nr_ptes);
-	return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -797,6 +815,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgtable_t pgtable;
 		struct page *zero_page;
 		bool set;
+		int ret;
 		pgtable = pte_alloc_one(mm, haddr);
 		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
@@ -807,14 +826,28 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_FALLBACK;
 		}
 		ptl = pmd_lock(mm, pmd);
-		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-				zero_page);
-		spin_unlock(ptl);
+		ret = 0;
+		set = false;
+		if (pmd_none(*pmd)) {
+			if (userfaultfd_missing(vma)) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr, flags,
+						       VM_UFFD_MISSING);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
+				set_huge_zero_page(pgtable, mm, vma,
+						   haddr, pmd,
+						   zero_page);
+				spin_unlock(ptl);
+				set = true;
+			}
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
 		}
-		return 0;
+		return ret;
 	}
 	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
@@ -822,14 +855,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
-
-	count_vm_event(THP_FAULT_ALLOC);
-	return 0;
+	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -864,16 +890,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 */
 	if (is_huge_zero_pmd(pmd)) {
 		struct page *zero_page;
-		bool set;
 		/*
 		 * get_huge_zero_page() will never allocate a new page here,
 		 * since we already have a zero page to copy. It just takes a
 		 * reference.
 		 */
 		zero_page = get_huge_zero_page();
-		set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
 				zero_page);
-		BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
 		ret = 0;
 		goto out_unlock;
 	}
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..0ae719c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2585,6 +2586,12 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto unlock;
+		/* Deliver the page fault to userland, check inside PT lock */
+		if (userfaultfd_missing(vma)) {
+			pte_unmap_unlock(page_table, ptl);
+			return handle_userfault(vma, address, flags,
+						VM_UFFD_MISSING);
+		}
 		goto setpte;
 	}
 
@@ -2612,6 +2619,15 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_none(*page_table))
 		goto release;
 
+	/* Deliver the page fault to userland, check inside PT lock */
+	if (userfaultfd_missing(vma)) {
+		pte_unmap_unlock(page_table, ptl);
+		mem_cgroup_cancel_charge(page, memcg);
+		page_cache_release(page);
+		return handle_userfault(vma, address, flags,
+					VM_UFFD_MISSING);
+	}
+
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 	mem_cgroup_commit_charge(page, memcg, false);


* [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware of, so that we can merge vmas back to the way they were
originally before arming the userfaultfd on some memory range.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h |  2 +-
 mm/madvise.c       |  3 ++-
 mm/mempolicy.c     |  4 ++--
 mm/mlock.c         |  3 ++-
 mm/mmap.c          | 39 +++++++++++++++++++++++++++------------
 mm/mprotect.c      |  3 ++-
 6 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 762ef9d..26cef61 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1879,7 +1879,7 @@ extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *);
+	struct mempolicy *, struct vm_userfaultfd_ctx);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..10f62b7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -102,7 +102,8 @@ static long madvise_behavior(struct vm_area_struct *vma,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
-				vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4721046..e1a2e9b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -722,8 +722,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 		pgoff = vma->vm_pgoff +
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
-				  vma->anon_vma, vma->vm_file, pgoff,
-				  new_pol);
+				 vma->anon_vma, vma->vm_file, pgoff,
+				 new_pol, vma->vm_userfaultfd_ctx);
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73cf098..9725abe 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -566,7 +566,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
-			  vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index da9990a..135c2fa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
 #include <linux/notifier.h>
 #include <linux/memory.h>
 #include <linux/printk.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -921,7 +922,8 @@ again:			remove_next = 1 + (end > next->vm_end);
  * per-vma resources, so we don't attempt to merge those.
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
-			struct file *file, unsigned long vm_flags)
+				struct file *file, unsigned long vm_flags,
+				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -937,6 +939,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
 		return 0;
 	if (vma->vm_ops && vma->vm_ops->close)
 		return 0;
+	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+		return 0;
 	return 1;
 }
 
@@ -967,9 +971,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		     struct anon_vma *anon_vma, struct file *file,
+		     pgoff_t vm_pgoff,
+		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
@@ -986,9 +992,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
  */
 static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		    struct anon_vma *anon_vma, struct file *file,
+		    pgoff_t vm_pgoff,
+		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -1031,7 +1039,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
 			struct anon_vma *anon_vma, struct file *file,
-			pgoff_t pgoff, struct mempolicy *policy)
+			pgoff_t pgoff, struct mempolicy *policy,
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1058,14 +1067,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (prev && prev->vm_end == addr &&
 			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags,
-						anon_vma, file, pgoff)) {
+					    anon_vma, file, pgoff,
+					    vm_userfaultfd_ctx)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
 		if (next && end == next->vm_start &&
 				mpol_equal(policy, vma_policy(next)) &&
 				can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen) &&
+						     anon_vma, file,
+						     pgoff+pglen,
+						     vm_userfaultfd_ctx) &&
 				is_mergeable_anon_vma(prev->anon_vma,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
@@ -1086,7 +1098,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (next && end == next->vm_start &&
 			mpol_equal(policy, vma_policy(next)) &&
 			can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen)) {
+					     anon_vma, file, pgoff+pglen,
+					     vm_userfaultfd_ctx)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = vma_adjust(prev, prev->vm_start,
 				addr, prev->vm_pgoff, NULL);
@@ -1573,7 +1586,8 @@ munmap_back:
 	/*
 	 * Can we just expand an old mapping?
 	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2760,7 +2774,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-					NULL, NULL, pgoff, NULL);
+			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2916,7 +2930,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			    vma->vm_userfaultfd_ctx);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4472781..c98a074 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -287,7 +287,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 */
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			   vma->vm_userfaultfd_ctx);
 	if (*pprev) {
 		vma = *pprev;
 		goto success;


* [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

If userfaultfd is armed on a certain vma we can't "fill" the holes
with zeroes or we'll break the userland on demand paging. The holes,
if the userfault is armed, are really missing information (not
zeroes) that userland has to load from the network or elsewhere.

The same issue happens for wrprotected ptes that we can't just
convert into a single writable pmd_trans_huge.

We could however in theory still merge across zeropages if only
VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be
slightly improved, but it'd be much more complex code for a tiny
corner case.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5374132..8f1b6a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2145,7 +2145,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			if (++none_or_zero <= khugepaged_max_ptes_none)
+			if (!userfaultfd_armed(vma) &&
+			    ++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out;
@@ -2593,7 +2594,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			if (++none_or_zero <= khugepaged_max_ptes_none)
+			if (!userfaultfd_armed(vma) &&
+			    ++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out_unmap;


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:57   ` Pavel Emelyanov
  2015-03-06 10:48   ` Michael Kerrisk (man-pages)
  2015-03-05 17:17 ` [PATCH 11/21] userfaultfd: buildsystem activation Andrea Arcangeli
                   ` (11 subsequent siblings)
  21 siblings, 2 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Once a userfaultfd has been created and certain regions of the
process virtual address space have been registered with it, the
thread responsible for doing the memory externalization can manage
the page faults in userland by talking to the kernel using the
userfaultfd protocol.

poll() can be used to learn when there are new pending userfaults to
be read (POLLIN).
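
To illustrate the protocol (this example is not part of the patch), a
minimal fault-handling loop against this v3 API could look roughly as
follows; it assumes the uapi header from this series is installed,
hardcodes the x86-64 syscall number 323 from the syscall activation
patch, and elides UFFDIO_REGISTER and most error handling:

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	int main(void)
	{
		int uffd = syscall(323 /* __NR_userfaultfd, x86-64 */,
				   O_CLOEXEC | O_NONBLOCK);
		struct uffdio_api api = { .api = UFFD_API };
		struct pollfd pfd = { .fd = uffd, .events = POLLIN };
		__u64 addr;

		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
			return 1;
		/* ... UFFDIO_REGISTER a range, start the faulting thread ... */
		while (poll(&pfd, 1, -1) > 0) {
			/* each read returns one faulting address; the
			   low bits carry the UFFD_BIT_* flags */
			if (read(uffd, &addr, sizeof(addr)) != sizeof(addr))
				continue;
			printf("fault at 0x%llx\n", (unsigned long long)addr);
			/* resolve with UFFDIO_COPY/UFFDIO_ZEROPAGE,
			   which also wakes the faulting thread */
		}
		return 0;
	}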

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 977 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 977 insertions(+)
 create mode 100644 fs/userfaultfd.c

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..6b31967
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,977 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/mempolicy.h>
+#include <linux/ioctl.h>
+#include <linux/security.h>
+
+enum userfaultfd_state {
+	UFFD_STATE_WAIT_API,
+	UFFD_STATE_RUNNING,
+};
+
+struct userfaultfd_ctx {
+	/* pseudo fd refcounting */
+	atomic_t refcount;
+	/* waitqueue head for the userfaultfd page faults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* state machine */
+	enum userfaultfd_state state;
+	/* released */
+	bool released;
+	/* mm with one or more vmas attached to this userfaultfd_ctx */
+	struct mm_struct *mm;
+};
+
+struct userfaultfd_wait_queue {
+	unsigned long address;
+	wait_queue_t wq;
+	bool pending;
+	struct userfaultfd_ctx *ctx;
+};
+
+struct userfaultfd_wake_range {
+	unsigned long start;
+	unsigned long len;
+};
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+				     int wake_flags, void *key)
+{
+	struct userfaultfd_wake_range *range = key;
+	int ret;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long start, len;
+
+	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+	ret = 0;
+	/* don't wake the pending ones, to avoid making reads block */
+	if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
+		goto out;
+	/* len == 0 means wake all */
+	start = range->start;
+	len = range->len;
+	if (len && (start > uwq->address || start + len <= uwq->address))
+		goto out;
+	ret = wake_up_state(wq->private, mode);
+	if (ret)
+		/* wake only once, autoremove behavior */
+		list_del_init(&wq->task_list);
+out:
+	return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: nothing; BUG()s if the refcount had already dropped to zero.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+	if (!atomic_inc_not_zero(&ctx->refcount))
+		BUG();
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+	if (atomic_dec_and_test(&ctx->refcount)) {
+		mmdrop(ctx->mm);
+		kfree(ctx);
+	}
+}
+
+static inline unsigned long userfault_address(unsigned long address,
+					      unsigned int flags,
+					      unsigned long reason)
+{
+	BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
+	address &= PAGE_MASK;
+	if (flags & FAULT_FLAG_WRITE)
+		/*
+		 * Encode "write" fault information in the LSB of the
+		 * address read by userland, without depending on
+		 * FAULT_FLAG_WRITE kernel internal value.
+		 */
+		address |= UFFD_BIT_WRITE;
+	if (reason & VM_UFFD_WP)
+		/*
+		 * Encode "reason" fault information as bit number 1
+		 * in the address read by userland. If bit number 1 is
+		 * clear it means the reason is a VM_FAULT_MISSING
+		 * fault.
+		 */
+		address |= UFFD_BIT_WP;
+	return address;
+}
+
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there are
+ * fatal_signal_pending()s, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags, unsigned long reason)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue uwq;
+
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (!ctx)
+		return VM_FAULT_SIGBUS;
+
+	BUG_ON(ctx->mm != mm);
+
+	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
+	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
+
+	/*
+	 * If it's already released don't get it. This avoids looping
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_sem.
+	 */
+	if (unlikely(ACCESS_ONCE(ctx->released)))
+		return VM_FAULT_SIGBUS;
+
+	/* check that we can return VM_FAULT_RETRY */
+	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+		/*
+		 * Validate the invariant that nowait must allow retry
+		 * to be sure not to return SIGBUS erroneously on
+		 * nowait invocations.
+		 */
+		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+#ifdef CONFIG_DEBUG_VM
+		if (printk_ratelimit()) {
+			printk(KERN_WARNING
+			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+			dump_stack();
+		}
+#endif
+		return VM_FAULT_SIGBUS;
+	}
+
+	/*
+	 * Handle nowait, not much to do other than tell it to retry
+	 * and wait.
+	 */
+	if (flags & FAULT_FLAG_RETRY_NOWAIT)
+		return VM_FAULT_RETRY;
+
+	/* take the reference before dropping the mmap_sem */
+	userfaultfd_ctx_get(ctx);
+
+	/* be gentle and immediately relinquish the mmap_sem */
+	up_read(&mm->mmap_sem);
+
+	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+	uwq.wq.private = current;
+	uwq.address = userfault_address(address, flags, reason);
+	uwq.pending = true;
+	uwq.ctx = ctx;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_KILLABLE);
+		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
+		    fatal_signal_pending(current))
+			break;
+		spin_unlock(&ctx->fault_wqh.lock);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		spin_lock(&ctx->fault_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * already released.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return VM_FAULT_RETRY;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
+	/* len == 0 means wake all */
+	struct userfaultfd_wake_range range = { .len = 0, };
+	unsigned long new_flags;
+
+	ACCESS_ONCE(ctx->released) = true;
+
+	/*
+	 * Flush page faults out of all CPUs. NOTE: all page faults
+	 * must be retried without returning VM_FAULT_SIGBUS if
+	 * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
+	 * changes while handle_userfault released the mmap_sem. So
+	 * it's critical that released is set to true (above), before
+	 * taking the mmap_sem for writing.
+	 */
+	down_write(&mm->mmap_sem);
+	prev = NULL;
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		cond_resched();
+		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
+		       !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
+			prev = vma;
+			continue;
+		}
+		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
+				 new_flags, vma->anon_vma,
+				 vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_UFFD_CTX);
+		if (prev)
+			vma = prev;
+		else
+			prev = vma;
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+	}
+	up_write(&mm->mmap_sem);
+
+	/*
+	 * After no new page faults can wait on this fault_wqh, flush
+	 * the last page faults that may have been already waiting on
+	 * the fault_wqh.
+	 */
+	spin_lock(&ctx->fault_wqh.lock);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	wake_up_poll(&ctx->fd_wqh, POLLHUP);
+	userfaultfd_ctx_put(ctx);
+	return 0;
+}
+
+static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+					  struct userfaultfd_wait_queue **uwq)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *_uwq;
+	unsigned int ret = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (_uwq->pending) {
+			ret = POLLIN;
+			if (uwq)
+				*uwq = _uwq;
+			break;
+		}
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return ret;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	poll_wait(file, &ctx->fd_wqh, wait);
+
+	switch (ctx->state) {
+	case UFFD_STATE_WAIT_API:
+		return POLLERR;
+	case UFFD_STATE_RUNNING:
+		return find_userfault(ctx, NULL);
+	default:
+		BUG();
+	}
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+				    __u64 *addr)
+{
+	ssize_t ret;
+	DECLARE_WAITQUEUE(wait, current);
+	struct userfaultfd_wait_queue *uwq = NULL;
+
+	/* always take the fd_wqh lock before the fault_wqh lock */
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (find_userfault(ctx, &uwq)) {
+			uwq->pending = false;
+			/* careful to always initialize addr if ret == 0 */
+			*addr = uwq->address;
+			ret = 0;
+			break;
+		}
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		if (no_wait) {
+			ret = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock_irq(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock_irq(&ctx->fd_wqh.lock);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t _ret, ret = 0;
+	/* careful to always initialize addr if ret == 0 */
+	__u64 uninitialized_var(addr);
+	int no_wait = file->f_flags & O_NONBLOCK;
+
+	if (ctx->state == UFFD_STATE_WAIT_API)
+		return -EINVAL;
+	BUG_ON(ctx->state != UFFD_STATE_RUNNING);
+
+	for (;;) {
+		if (count < sizeof(addr))
+			return ret ? ret : -EINVAL;
+		_ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
+		if (_ret < 0)
+			return ret ? ret : _ret;
+		if (put_user(addr, (__u64 __user *) buf))
+			return ret ? ret : -EFAULT;
+		ret += sizeof(addr);
+		buf += sizeof(addr);
+		count -= sizeof(addr);
+		/*
+		 * Allow reading more than one fault at a time, but only
+		 * block if waiting for the very first one.
+		 */
+		no_wait = O_NONBLOCK;
+	}
+}
+
+static int __wake_userfault(struct userfaultfd_ctx *ctx,
+			    struct userfaultfd_wake_range *range)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	int ret;
+	unsigned long start, end;
+
+	start = range->start;
+	end = range->start + range->len;
+
+	ret = -ENOENT;
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			continue;
+		if (uwq->address >= start && uwq->address < end) {
+			ret = 0;
+			/* wake all in the range and autoremove */
+			__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
+					     range);
+			break;
+		}
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return ret;
+}
+
+static __always_inline int wake_userfault(struct userfaultfd_ctx *ctx,
+					  struct userfaultfd_wake_range *range)
+{
+	if (!waitqueue_active(&ctx->fault_wqh))
+		return -ENOENT;
+
+	return __wake_userfault(ctx, range);
+}
+
+static __always_inline int validate_range(struct mm_struct *mm,
+					  __u64 start, __u64 len)
+{
+	__u64 task_size = mm->task_size;
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	if (len & ~PAGE_MASK)
+		return -EINVAL;
+	if (!len)
+		return -EINVAL;
+	if (start < mmap_min_addr)
+		return -EINVAL;
+	if (start >= task_size)
+		return -EINVAL;
+	if (len > task_size - start)
+		return -EINVAL;
+	return 0;
+}
+
+static int userfaultfd_register(struct userfaultfd_ctx *ctx,
+				unsigned long arg)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev, *cur;
+	int ret;
+	struct uffdio_register uffdio_register;
+	struct uffdio_register __user *user_uffdio_register;
+	unsigned long vm_flags, new_flags;
+	bool found;
+	unsigned long start, end, vma_end;
+
+	user_uffdio_register = (struct uffdio_register __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_register, user_uffdio_register,
+			   sizeof(uffdio_register)-sizeof(__u64)))
+		goto out;
+
+	ret = -EINVAL;
+	if (!uffdio_register.mode)
+		goto out;
+	if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
+				     UFFDIO_REGISTER_MODE_WP))
+		goto out;
+	vm_flags = 0;
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
+		vm_flags |= VM_UFFD_MISSING;
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+		vm_flags |= VM_UFFD_WP;
+		/*
+		 * FIXME: remove the below error constraint by
+		 * implementing the wprotect tracking mode.
+		 */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = validate_range(mm, uffdio_register.range.start,
+			     uffdio_register.range.len);
+	if (ret)
+		goto out;
+
+	start = uffdio_register.range.start;
+	end = start + uffdio_register.range.len;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma_prev(mm, start, &prev);
+
+	ret = -ENOMEM;
+	if (!vma)
+		goto out_unlock;
+
+	/* check that there's at least one vma in the range */
+	ret = -EINVAL;
+	if (vma->vm_start >= end)
+		goto out_unlock;
+
+	/*
+	 * Search for not compatible vmas.
+	 *
+	 * FIXME: this shall be relaxed later so that it doesn't fail
+	 * on tmpfs backed vmas (in addition to the current allowance
+	 * on anonymous vmas).
+	 */
+	found = false;
+	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+		cond_resched();
+
+		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
+		       !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+
+		/* check not compatible vmas */
+		ret = -EINVAL;
+		if (cur->vm_ops)
+			goto out_unlock;
+
+		/*
+		 * Check that this vma isn't already owned by a
+		 * different userfaultfd. We can't allow more than one
+		 * userfaultfd to own a single vma simultaneously or we
+		 * wouldn't know which one to deliver the userfaults to.
+		 */
+		ret = -EBUSY;
+		if (cur->vm_userfaultfd_ctx.ctx &&
+		    cur->vm_userfaultfd_ctx.ctx != ctx)
+			goto out_unlock;
+
+		found = true;
+	}
+	BUG_ON(!found);
+
+	/*
+	 * Now that we scanned all vmas we can already tell userland which
+	 * ioctl methods are guaranteed to succeed on this range.
+	 */
+	ret = -EFAULT;
+	if (put_user(UFFD_API_RANGE_IOCTLS, &user_uffdio_register->ioctls))
+		goto out_unlock;
+
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	do {
+		cond_resched();
+
+		BUG_ON(vma->vm_ops);
+		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
+		       vma->vm_userfaultfd_ctx.ctx != ctx);
+
+		/*
+		 * Nothing to do: this vma is already registered into this
+		 * userfaultfd and with the right tracking mode too.
+		 */
+		if (vma->vm_userfaultfd_ctx.ctx == ctx &&
+		    (vma->vm_flags & vm_flags) == vm_flags)
+			goto skip;
+
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+		vma_end = min(end, vma->vm_end);
+
+		new_flags = (vma->vm_flags & ~vm_flags) | vm_flags;
+		prev = vma_merge(mm, prev, start, vma_end, new_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 ((struct vm_userfaultfd_ctx){ ctx }));
+		if (prev) {
+			vma = prev;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret)
+				break;
+		}
+	next:
+		/*
+		 * In the vma_merge() successful mprotect-like case 8:
+		 * the next vma was merged into the current one and
+		 * the current one has not been updated yet.
+		 */
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx.ctx = ctx;
+
+	skip:
+		prev = vma;
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+out_unlock:
+	up_write(&mm->mmap_sem);
+out:
+	return ret;
+}
+
+static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
+				  unsigned long arg)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev, *cur;
+	int ret;
+	struct uffdio_range uffdio_unregister;
+	unsigned long new_flags;
+	bool found;
+	unsigned long start, end, vma_end;
+	const void __user *buf = (void __user *)arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
+		goto out;
+
+	ret = validate_range(mm, uffdio_unregister.start,
+			     uffdio_unregister.len);
+	if (ret)
+		goto out;
+
+	start = uffdio_unregister.start;
+	end = start + uffdio_unregister.len;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma_prev(mm, start, &prev);
+
+	ret = -ENOMEM;
+	if (!vma)
+		goto out_unlock;
+
+	/* check that there's at least one vma in the range */
+	ret = -EINVAL;
+	if (vma->vm_start >= end)
+		goto out_unlock;
+
+	/*
+	 * Search for not compatible vmas.
+	 *
+	 * FIXME: this shall be relaxed later so that it doesn't fail
+	 * on tmpfs backed vmas (in addition to the current allowance
+	 * on anonymous vmas).
+	 */
+	found = false;
+	ret = -EINVAL;
+	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+		cond_resched();
+
+		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
+		       !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+
+		/*
+		 * Check not compatible vmas, not strictly required
+		 * here as not compatible vmas cannot have an
+		 * userfaultfd_ctx registered on them, but this
+		 * provides for more strict behavior to notice
+		 * unregistration errors.
+		 */
+		if (cur->vm_ops)
+			goto out_unlock;
+
+		found = true;
+	}
+	BUG_ON(!found);
+
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	do {
+		cond_resched();
+
+		BUG_ON(vma->vm_ops);
+
+		/*
+		 * Nothing to do: this vma is not registered into any
+		 * userfaultfd, so nothing needs unregistering here.
+		 */
+		if (!vma->vm_userfaultfd_ctx.ctx)
+			goto skip;
+
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+		vma_end = min(end, vma->vm_end);
+
+		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+		prev = vma_merge(mm, prev, start, vma_end, new_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_UFFD_CTX);
+		if (prev) {
+			vma = prev;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret)
+				break;
+		}
+	next:
+		/*
+		 * In the vma_merge() successful mprotect-like case 8:
+		 * the next vma was merged into the current one and
+		 * the current one has not been updated yet.
+		 */
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+
+	skip:
+		prev = vma;
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+out_unlock:
+	up_write(&mm->mmap_sem);
+out:
+	return ret;
+}
+
+/*
+ * This is mostly needed to re-wakeup those userfaults that were still
+ * pending when userland woke them up the first time. We don't wake
+ * the pending ones, to avoid making blocking reads block (or
+ * non-blocking reads return -EAGAIN) and, when used with POLLIN, to
+ * avoid userland doubts about why POLLIN wasn't reliable.
+ */
+static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
+			    unsigned long arg)
+{
+	int ret;
+	struct uffdio_range uffdio_wake;
+	struct userfaultfd_wake_range range;
+	const void __user *buf = (void __user *)arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
+	if (ret)
+		goto out;
+
+	range.start = uffdio_wake.start;
+	range.len = uffdio_wake.len;
+
+	/*
+	 * len == 0 means wake all and we don't want to wake all here,
+	 * so check it again to be sure.
+	 */
+	VM_BUG_ON(!range.len);
+
+	ret = wake_userfault(ctx, &range);
+
+out:
+	return ret;
+}
+
+/*
+ * userland asks for a certain API version and we return which bits
+ * and ioctl commands are implemented in this kernel for such API
+ * version or -EINVAL if unknown.
+ */
+static int userfaultfd_api(struct userfaultfd_ctx *ctx,
+			   unsigned long arg)
+{
+	struct uffdio_api uffdio_api;
+	void __user *buf = (void __user *)arg;
+	int ret;
+
+	ret = -EINVAL;
+	if (ctx->state != UFFD_STATE_WAIT_API)
+		goto out;
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
+		goto out;
+	if (uffdio_api.api != UFFD_API) {
+		/* careful not to leak info, we only read the first 8 bytes */
+		memset(&uffdio_api, 0, sizeof(uffdio_api));
+		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
+			goto out;
+		ret = -EINVAL;
+		goto out;
+	}
+	/* careful not to leak info, we only read the first 8 bytes */
+	uffdio_api.bits = UFFD_API_BITS;
+	uffdio_api.ioctls = UFFD_API_IOCTLS;
+	ret = -EFAULT;
+	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
+		goto out;
+	ctx->state = UFFD_STATE_RUNNING;
+	ret = 0;
+out:
+	return ret;
+}
+
+static long userfaultfd_ioctl(struct file *file, unsigned cmd,
+			      unsigned long arg)
+{
+	int ret = -EINVAL;
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	switch(cmd) {
+	case UFFDIO_API:
+		ret = userfaultfd_api(ctx, arg);
+		break;
+	case UFFDIO_REGISTER:
+		ret = userfaultfd_register(ctx, arg);
+		break;
+	case UFFDIO_UNREGISTER:
+		ret = userfaultfd_unregister(ctx, arg);
+		break;
+	case UFFDIO_WAKE:
+		ret = userfaultfd_wake(ctx, arg);
+		break;
+	}
+	return ret;
+}
+
+#ifdef CONFIG_PROC_FS
+static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct userfaultfd_ctx *ctx = f->private_data;
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long pending = 0, total = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			pending++;
+		total++;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * If more protocols are added, they will all be shown
+	 * separated by a space. Like this:
+	 *	protocols: 0xaa 0xbb
+	 */
+	seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
+		   pending, total, UFFD_API, UFFD_API_BITS,
+		   UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= userfaultfd_show_fdinfo,
+#endif
+	.release	= userfaultfd_release,
+	.poll		= userfaultfd_poll,
+	.read		= userfaultfd_read,
+	.unlocked_ioctl = userfaultfd_ioctl,
+	.compat_ioctl	= userfaultfd_ioctl,
+	.llseek		= noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates a userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates a userfaultfd file pointer, without installing
+ * it into the fd table. This is useful when the userfaultfd file is
+ * used during the initialization of data structures that require
+ * extra setup after the userfaultfd creation. So the userfaultfd
+ * creation is split into the file pointer creation phase, and the
+ * file descriptor installation phase.  In this way races with
+ * userspace closing the newly installed file descriptor can be
+ * avoided.  Returns a userfaultfd file pointer, or a proper error
+ * pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+	struct file *file;
+	struct userfaultfd_ctx *ctx;
+
+	BUG_ON(!current->mm);
+
+	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+	file = ERR_PTR(-EINVAL);
+	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+		goto out;
+
+	file = ERR_PTR(-ENOMEM);
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto out;
+
+	atomic_set(&ctx->refcount, 1);
+	init_waitqueue_head(&ctx->fault_wqh);
+	init_waitqueue_head(&ctx->fd_wqh);
+	ctx->flags = flags;
+	ctx->state = UFFD_STATE_WAIT_API;
+	ctx->released = false;
+	ctx->mm = current->mm;
+	/* prevent the mm struct from being freed */
+	atomic_inc(&ctx->mm->mm_count);
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
+				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file))
+		kfree(ctx);
+out:
+	return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	int fd, error;
+	struct file *file;
+
+	error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	file = userfaultfd_file_create(flags);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+
+	return error;
+}


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 11/21] userfaultfd: buildsystem activation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 12/21] userfaultfd: activate syscall Andrea Arcangeli
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This allows userfaultfd to be selected at build configuration time.
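
With this applied the option defaults to y on kernels with an MMU, so
a configured tree is expected to end up carrying the following line
in its .config (the expected result, shown for illustration):

	CONFIG_USERFAULTFD=y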

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/Makefile  |  1 +
 init/Kconfig | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/fs/Makefile b/fs/Makefile
index a88ac48..ba8ab62 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
+obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..580dae7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1550,6 +1550,17 @@ config ADVISE_SYSCALLS
 	  applications use these syscalls, you can disable this option to save
 	  space.
 
+config USERFAULTFD
+	bool "Enable userfaultfd() system call"
+	select ANON_INODES
+	default y
+	depends on MMU
+	help
+	  Enable the userfaultfd() system call, which allows page faults
+	  to be intercepted and handled in userland.
+
+	  If unsure, say Y.
+
 config PCI_QUIRKS
 	default y
 	bool "Enable PCI quirk workarounds" if EXPERT


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 12/21] userfaultfd: activate syscall
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 11/21] userfaultfd: buildsystem activation Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This activates the userfaultfd syscall.
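
Since no libc wrapper exists at this point, userland is expected to
reach the new syscall via syscall(2); a hedged sketch, hardcoding the
x86-64 number 323 from the table below:

	#include <unistd.h>

	static int userfaultfd(int flags)
	{
		return (int) syscall(323 /* __NR_userfaultfd, x86-64 */, flags);
	}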

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/asm/unistd.h      | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 arch/x86/syscalls/syscall_32.tbl       | 1 +
 arch/x86/syscalls/syscall_64.tbl       | 1 +
 include/linux/syscalls.h               | 1 +
 kernel/sys_ni.c                        | 1 +
 7 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 91062ee..7f21cfd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -367,3 +367,4 @@ SYSCALL_SPU(getrandom)
 SYSCALL_SPU(memfd_create)
 SYSCALL_SPU(bpf)
 COMPAT_SYS(execveat)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 36b79c3..f4f8b66 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define __NR_syscalls		363
+#define __NR_syscalls		364
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index ef5b5b1..4b4f21e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -385,5 +385,6 @@
 #define __NR_memfd_create	360
 #define __NR_bpf		361
 #define __NR_execveat		362
+#define __NR_userfaultfd	363
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..a20f0b8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..f320b19 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..adf5901 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -810,6 +810,7 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..2a10e42 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -204,6 +204,7 @@ cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
 cond_syscall(sys_memfd_create);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 12/21] userfaultfd: activate syscall Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.
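
As a usage sketch (not part of the patch): after a faulting address
has been read from the userfaultfd, the manager thread would fill in
a struct uffdio_copy and check the written-back result field; uffd,
fault_addr, page_size and local_buf are placeholders here, and the
page-alignment mask also clears the UFFD_BIT_* flag bits:

	struct uffdio_copy copy = {
		.dst = fault_addr & ~(__u64)(page_size - 1), /* page aligned */
		.src = (__u64)(unsigned long)local_buf,
		.len = page_size,
		.mode = 0,	/* or UFFDIO_COPY_MODE_DONTWAKE */
	};

	if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
		/* copy.copy holds the bytes copied, or a negative error */
		fprintf(stderr, "UFFDIO_COPY: %lld\n", (long long)copy.copy);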

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 46 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9a8cd56..61251e6 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -17,7 +17,9 @@
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
 	 (__u64)1 << _UFFDIO_API)
 #define UFFD_API_RANGE_IOCTLS			\
-	((__u64)1 << _UFFDIO_WAKE)
+	((__u64)1 << _UFFDIO_WAKE |		\
+	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_ZEROPAGE)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -30,6 +32,8 @@
 #define _UFFDIO_REGISTER		(0x00)
 #define _UFFDIO_UNREGISTER		(0x01)
 #define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_COPY			(0x03)
+#define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -42,6 +46,10 @@
 				     struct uffdio_range)
 #define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
 				     struct uffdio_range)
+#define UFFDIO_COPY		_IOWR(UFFDIO, _UFFDIO_COPY,	\
+				      struct uffdio_copy)
+#define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
+				      struct uffdio_zeropage)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -78,4 +86,40 @@ struct uffdio_register {
 	__u64 ioctls;
 };
 
+struct uffdio_copy {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * There will be a wrprotection flag later that allows to map
+	 * pages wrprotected on the fly. And such a flag will be
+	 * available if the wrprotection ioctl are implemented for the
+	 * range according to the uffdio_register.ioctls.
+	 */
+#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * "copy" and "wake" are written by the ioctl and must be at
+	 * the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 copy;
+	__s64 wake;
+};
+
+struct uffdio_zeropage {
+	struct uffdio_range range;
+#define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * "zeropage" and "wake" are written by the ioctl and must be
+	 * at the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 zeropage;
+	__s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 18:07   ` Pavel Emelyanov
  2015-03-05 17:17 ` [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This implements mcopy_atomic and mfill_zeropage, the lowlevel VM
methods invoked respectively by the UFFDIO_COPY and UFFDIO_ZEROPAGE
userfaultfd commands.
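
Both return the number of bytes actually filled in, or a negative
error if nothing was filled at all; a caller sketch of that
convention (an assumption here, mirroring how the ioctl layer in a
later patch of this series consumes it):

	ssize_t done = mcopy_atomic(dst_mm, dst_start, src_start, len);
	if (done < 0)
		return done;	/* nothing filled: -ENOMEM, -EFAULT, -EEXIST... */
	/* 0 < done <= len: done bytes filled before an error or completion */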

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h |   6 +
 mm/Makefile                   |   1 +
 mm/userfaultfd.c              | 267 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 274 insertions(+)
 create mode 100644 mm/userfaultfd.c

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e1e4360..587480a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -30,6 +30,12 @@
 extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 			    unsigned int flags, unsigned long reason);
 
+extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+			    unsigned long src_start, unsigned long len);
+extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
+			      unsigned long dst_start,
+			      unsigned long len);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 					struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/Makefile b/mm/Makefile
index 3c1caa2..ea9828e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,3 +76,4 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
+obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
new file mode 100644
index 0000000..3f4c0ef
--- /dev/null
+++ b/mm/userfaultfd.c
@@ -0,0 +1,267 @@
+/*
+ *  mm/userfaultfd.c
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/mmu_notifier.h>
+#include <asm/tlbflush.h>
+#include "internal.h"
+
+static int mcopy_atomic_pte(struct mm_struct *dst_mm,
+			    pmd_t *dst_pmd,
+			    struct vm_area_struct *dst_vma,
+			    unsigned long dst_addr,
+			    unsigned long src_addr)
+{
+	struct mem_cgroup *memcg;
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+	struct page *page;
+	void *page_kaddr;
+	int ret;
+
+	ret = -ENOMEM;
+	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
+	if (!page)
+		goto out;
+
+	page_kaddr = kmap(page);
+	ret = -EFAULT;
+	if (copy_from_user(page_kaddr, (const void __user *) src_addr,
+			   PAGE_SIZE))
+		goto out_kunmap_release;
+	kunmap(page);
+
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * preceding stores to the page contents become visible before
+	 * the set_pte_at() write.
+	 */
+	__SetPageUptodate(page);
+
+	ret = -ENOMEM;
+	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+		goto out_release;
+
+	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	if (dst_vma->vm_flags & VM_WRITE)
+		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+
+	ret = -EEXIST;
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!pte_none(*dst_pte))
+		goto out_release_uncharge_unlock;
+
+	inc_mm_counter(dst_mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, dst_vma, dst_addr);
+	mem_cgroup_commit_charge(page, memcg, false);
+	lru_cache_add_active_or_unevictable(page, dst_vma);
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+
+	pte_unmap_unlock(dst_pte, ptl);
+	ret = 0;
+out:
+	return ret;
+out_release_uncharge_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	mem_cgroup_cancel_charge(page, memcg);
+out_release:
+	page_cache_release(page);
+	goto out;
+out_kunmap_release:
+	kunmap(page);
+	goto out_release;
+}
+
+static int mfill_zeropage_pte(struct mm_struct *dst_mm,
+			      pmd_t *dst_pmd,
+			      struct vm_area_struct *dst_vma,
+			      unsigned long dst_addr)
+{
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+	int ret;
+
+	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
+					 dst_vma->vm_page_prot));
+	ret = -EEXIST;
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!pte_none(*dst_pte))
+		goto out_unlock;
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd = NULL;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (pud)
+		/*
+		 * Note that pmd_alloc() isn't run just because the pmd
+		 * was missing: *pmd may already be established, and it
+		 * may even be a trans_huge_pmd.
+		 */
+		pmd = pmd_alloc(mm, pud, address);
+	return pmd;
+}
+
+static ssize_t __mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+			      unsigned long src_start, unsigned long len,
+			      bool zeropage)
+{
+	struct vm_area_struct *dst_vma;
+	ssize_t err;
+	pmd_t *dst_pmd;
+	unsigned long src_addr, dst_addr;
+	long copied = 0;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(dst_start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(src_start + len <= src_start);
+	BUG_ON(dst_start + len <= dst_start);
+
+	down_read(&dst_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared and that the dst range is
+	 * both valid and fully within a single existing vma.
+	 */
+	err = -EINVAL;
+	dst_vma = find_vma(dst_mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	/*
+	 * Be strict and only allow __mcopy_atomic on userfaultfd
+	 * registered ranges to prevent userland errors going
+	 * unnoticed. As far as the VM consistency is concerned, it
+	 * would be perfectly safe to remove this check, but there's
+	 * no useful usage for __mcopy_atomic outside of userfaultfd
+	 * registered ranges. This is after all why these are ioctls
+	 * belonging to the userfaultfd and not syscalls.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx)
+		goto out;
+
+	/*
+	 * FIXME: only allow copying on anonymous vmas, tmpfs should
+	 * be added.
+	 */
+	if (dst_vma->vm_ops)
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma or this page
+	 * would get a NULL anon_vma when moved into the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(dst_mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If a huge pmd materialized from under us, fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+
+		if (!zeropage)
+			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
+					       dst_addr, src_addr);
+		else
+			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
+						 dst_addr);
+
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			copied += PAGE_SIZE;
+
+			if (fatal_signal_pending(current))
+				err = -EINTR;
+		}
+		if (err)
+			break;
+	}
+
+out:
+	up_read(&dst_mm->mmap_sem);
+	BUG_ON(copied < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!copied && !err);
+	return copied ? copied : err;
+}
+
+ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+		     unsigned long src_start, unsigned long len)
+{
+	return __mcopy_atomic(dst_mm, dst_start, src_start, len, false);
+}
+
+ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
+		       unsigned long len)
+{
+	return __mcopy_atomic(dst_mm, start, 0, len, true);
+}


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 16/21] userfaultfd: remap_pages: rmap preparation Andrea Arcangeli
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

These two ioctls allow pages to be atomically copied, or zeropages to
be mapped, into the virtual address space. They are used by the
thread that opened the userfaultfd to resolve the userfaults.
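
For the zeropage side, a hedged userland sketch (uffd, fault_addr and
page_size are placeholders; on success zp.zeropage reports the bytes
mapped, on failure a negative error):

	struct uffdio_zeropage zp = {
		.range = {
			.start = fault_addr & ~(__u64)(page_size - 1),
			.len   = page_size,
		},
		.mode = 0,	/* wake the blocked faulting threads */
	};

	if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp) == -1)
		fprintf(stderr, "UFFDIO_ZEROPAGE: %lld\n", (long long)zp.zeropage);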

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6b31967..6230f22 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -798,6 +798,100 @@ out:
 	return ret;
 }
 
+static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
+			    unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_copy uffdio_copy;
+	struct uffdio_copy __user *user_uffdio_copy;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_copy = (struct uffdio_copy __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_copy, user_uffdio_copy,
+			   /* don't copy "copy" and "wake" last field */
+			   sizeof(uffdio_copy)-sizeof(__s64)*2))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
+	if (ret)
+		goto out;
+	/*
+	 * double check for wraparound just in case. copy_from_user()
+	 * will later check uffdio_copy.src + uffdio_copy.len to fit
+	 * in the userland range.
+	 */
+	ret = -EINVAL;
+	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
+		goto out;
+	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+		goto out;
+
+	ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
+			   uffdio_copy.len);
+	if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	BUG_ON(!ret);
+	/* len == 0 would wake all */
+	range.len = ret;
+	if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
+		range.start = uffdio_copy.dst;
+		ret = wake_userfault(ctx, &range);
+		if (unlikely(put_user(ret, &user_uffdio_copy->wake)))
+			return -EFAULT;
+	}
+	ret = range.len == uffdio_copy.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
+static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
+				unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_zeropage uffdio_zeropage;
+	struct uffdio_zeropage __user *user_uffdio_zeropage;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
+			   /* don't copy "zeropage" and "wake" last field */
+			   sizeof(uffdio_zeropage)-sizeof(__s64)*2))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
+			     uffdio_zeropage.range.len);
+	if (ret)
+		goto out;
+	ret = -EINVAL;
+	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+		goto out;
+
+	ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
+			     uffdio_zeropage.range.len);
+	if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
+		range.start = uffdio_zeropage.range.start;
+		ret = wake_userfault(ctx, &range);
+		if (unlikely(put_user(ret, &user_uffdio_zeropage->wake)))
+			return -EFAULT;
+	}
+	ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -855,6 +949,12 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_WAKE:
 		ret = userfaultfd_wake(ctx, arg);
 		break;
+	case UFFDIO_COPY:
+		ret = userfaultfd_copy(ctx, arg);
+		break;
+	case UFFDIO_ZEROPAGE:
+		ret = userfaultfd_zeropage(ctx, arg);
+		break;
 	}
 	return ret;
 }


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 16/21] userfaultfd: remap_pages: rmap preparation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:18 ` [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation Andrea Arcangeli
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

As far as the rmap code is concerned, remap_pages only alters
page->mapping and page->index, and it does so while holding the page
lock. However there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places that do rmap
walks without taking the page lock first must be updated to re-check
that page->mapping didn't change after they obtained the anon_vma
lock. remap_pages takes the anon_vma lock for writing before altering
page->mapping, so if page->mapping is still the same after obtaining
the anon_vma lock (without the page lock), the rmap walks can go ahead
safely (and remap_pages will wait for them to complete before
proceeding).

remap_pages serializes against itself with the page lock.

All other places that take the anon_vma lock while holding the
mmap_sem for writing don't need to check whether page->mapping has
changed after taking the anon_vma lock, regardless of the page lock,
because remap_pages holds the mmap_sem only for reading.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_pages must be mapped in only one vma, but
this is not a limitation when remap_pages is used to handle userland
page faults. The source addresses passed to remap_pages should be set
VM_DONTCOPY with MADV_DONTFORK, to avoid any risk of the mapcount of
the pages increasing if fork runs in another thread before or while
remap_pages runs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 23 +++++++++++++++++++----
 mm/rmap.c        |  9 +++++++++
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f1b6a5..1e25cb3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1902,6 +1902,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
+	struct address_space *mapping;
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
@@ -1913,10 +1914,24 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * page_lock_anon_vma_read except the write lock is taken to serialise
 	 * against parallel split or collapse operations.
 	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
+	for (;;) {
+		mapping = ACCESS_ONCE(page->mapping);
+		anon_vma = page_get_anon_vma(page);
+		if (!anon_vma)
+			goto out;
+		anon_vma_lock_write(anon_vma);
+		/*
+		 * We don't hold the page lock here so
+		 * remap_pages_huge_pmd can change the anon_vma from
+		 * under us until we obtain the anon_vma lock. Verify
+		 * that we obtained the anon_vma lock before
+		 * remap_pages did.
+		 */
+		if (likely(mapping == ACCESS_ONCE(page->mapping)))
+			break;
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
 
 	ret = 0;
 	if (!PageCompound(page))
diff --git a/mm/rmap.c b/mm/rmap.c
index 5e3e090..5ab2df1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -492,6 +492,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
+repeat:
 	rcu_read_lock();
 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -530,6 +531,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
+	/* check if remap_pages changed the anon_vma */
+	if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) {
+		anon_vma_unlock_read(anon_vma);
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto repeat;
+	}
+
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 16/21] userfaultfd: remap_pages: rmap preparation Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 17:18 ` [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI Andrea Arcangeli
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Provide a new swapfile method for remap_pages() to verify that a swap
entry is mapped in only one vma before relocating it to a different
virtual address. Otherwise, if the swap entry is mapped in multiple
vmas, the page could get mapped in a non-linear way in some anon_vma
when it is swapped back in.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swapfile.c        | 13 +++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4759491..9adda11 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -436,6 +436,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -527,6 +528,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+	return 0;
+}
+
 #define reuse_swap_page(page)	(page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63f55cc..04c7621 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -874,6 +874,19 @@ int page_swapcount(struct page *page)
 	return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+	int count = 0;
+	struct swap_info_struct *p;
+
+	p = swap_info_get(entry);
+	if (p) {
+		count = swap_count(p->swap_map[swp_offset(entry)]);
+		spin_unlock(&p->lock);
+	}
+	return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This implements the uABI of UFFDIO_REMAP.

Notably one mode bitflag is also forwarded to (and in turn known by)
the lowlevel remap_pages method.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 61251e6..db6e99a 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_REMAP)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -34,6 +35,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_REMAP			(0x05)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -50,6 +52,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_REMAP		_IOWR(UFFDIO, _UFFDIO_REMAP,	\
+				      struct uffdio_remap)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -122,4 +126,25 @@ struct uffdio_zeropage {
 	__s64 wake;
 };
 
+struct uffdio_remap {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * Especially when used to atomically remove memory from the
+	 * address space, the wake on the dst range is not needed.
+	 */
+#define UFFDIO_REMAP_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES	((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * "remap" and "wake" are written by the ioctl and must be at
+	 * the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 remap;
+	__s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */

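A hedged sketch of how userland would fill the new structure
(hypothetical fragment, not part of the patch; uffd, dst, src and len
are assumed to be set up elsewhere, with src typically pointing into a
MADV_DONTFORK'd staging area, and handle_remap_failure a placeholder):

	struct uffdio_remap remap;

	remap.dst = dst;	/* inside a userfaultfd registered range */
	remap.src = src;
	remap.len = len;
	/* DONTWAKE requires a later explicit UFFDIO_WAKE on the range */
	remap.mode = UFFDIO_REMAP_MODE_DONTWAKE |
		     UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES;

	if (ioctl(uffd, UFFDIO_REMAP, &remap))
		/* remap.remap holds bytes remapped or the negative error */
		handle_remap_failure(errno, remap.remap);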

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 17:39   ` Linus Torvalds
  2015-03-05 18:01   ` Pavel Emelyanov
  2015-03-05 17:18 ` [PATCH 20/21] userfaultfd: UFFDIO_REMAP Andrea Arcangeli
                   ` (2 subsequent siblings)
  21 siblings, 2 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

remap_pages is the lowlevel mm helper needed to implement
UFFDIO_REMAP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h |  17 ++
 mm/huge_memory.c              | 120 ++++++++++
 mm/userfaultfd.c              | 526 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 663 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480a..3c39a4f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,6 +36,23 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len);
 
+/* remap_pages */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern ssize_t remap_pages(struct mm_struct *dst_mm,
+			   struct mm_struct *src_mm,
+			   unsigned long dst_start,
+			   unsigned long src_start,
+			   unsigned long len, __u64 flags);
+extern int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+				struct mm_struct *src_mm,
+				pmd_t *dst_pmd, pmd_t *src_pmd,
+				pmd_t dst_pmdval,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 					struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e25cb3..08c8afc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1531,6 +1531,124 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+/*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but this function must release the src PT lock before
+ * returning. We're guaranteed the src_pmd is pmd_trans_huge until the
+ * PT lock of the src_pmd is released. Just move the page from src_pmd
+ * to dst_pmd if possible. Return zero if it succeeded in moving the
+ * page, -EAGAIN if it needs to be repeated by the caller, or other
+ * errors in case of failure.
+ */
+int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+			 struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd,
+			 pmd_t dst_pmdval,
+			 struct vm_area_struct *dst_vma,
+			 struct vm_area_struct *src_vma,
+			 unsigned long dst_addr,
+			 unsigned long src_addr)
+{
+	pmd_t _dst_pmd, src_pmdval;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	spinlock_t *src_ptl, *dst_ptl;
+	pgtable_t pgtable;
+
+	src_pmdval = *src_pmd;
+	src_ptl = pmd_lockptr(src_mm, src_pmd);
+
+	BUG_ON(!pmd_trans_huge(src_pmdval));
+	BUG_ON(pmd_trans_splitting(src_pmdval));
+	BUG_ON(!pmd_none(dst_pmdval));
+	BUG_ON(!spin_is_locked(src_ptl));
+	BUG_ON(!rwsem_is_locked(&src_mm->mmap_sem));
+	BUG_ON(!rwsem_is_locked(&dst_mm->mmap_sem));
+
+	src_page = pmd_page(src_pmdval);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(src_ptl);
+		return -EBUSY;
+	}
+
+	get_page(src_page);
+	spin_unlock(src_ptl);
+
+	mmu_notifier_invalidate_range_start(src_mm, src_addr,
+					    src_addr + HPAGE_PMD_SIZE);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(src_mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
+	double_pt_lock(src_ptl, dst_ptl);
+	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+		     !pmd_same(*dst_pmd, dst_pmdval) ||
+		     page_mapcount(src_page) != 1)) {
+		double_pt_unlock(src_ptl, dst_ptl);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(src_mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+	ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+	if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
+		      src_pmdval))
+		BUG();
+	_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+	_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+	set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
+	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (dst_mm != src_mm) {
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+	}
+	double_pt_unlock(src_ptl, dst_ptl);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(src_mm, src_addr,
+					  src_addr + HPAGE_PMD_SIZE);
+	return 0;
+}
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
  * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
@@ -2484,6 +2602,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later hanlded by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
+	 *
+	 * remap_pages is prevented from racing as well thanks to the mmap_sem.
 	 */
 	down_write(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3f4c0ef..49521af 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -265,3 +265,529 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, true);
 }
+
+void double_pt_lock(spinlock_t *ptl1,
+		    spinlock_t *ptl2)
+	__acquires(ptl1)
+	__acquires(ptl2)
+{
+	spinlock_t *ptl_tmp;
+
+	if (ptl1 > ptl2) {
+		/* exchange ptl1 and ptl2 */
+		ptl_tmp = ptl1;
+		ptl1 = ptl2;
+		ptl2 = ptl_tmp;
+	}
+	/* lock in virtual address order to avoid lock inversion */
+	spin_lock(ptl1);
+	if (ptl1 != ptl2)
+		spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+	else
+		__acquire(ptl2);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+		      spinlock_t *ptl2)
+	__releases(ptl1)
+	__releases(ptl2)
+{
+	spin_unlock(ptl1);
+	if (ptl1 != ptl2)
+		spin_unlock(ptl2);
+	else
+		__release(ptl2);
+}
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from src_pmd to dst_pmd if possible, and return zero if it
+ * succeeded in moving the page, or a negative error code otherwise.
+ */
+static int remap_pages_pte(struct mm_struct *dst_mm,
+			   struct mm_struct *src_mm,
+			   pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+			   struct vm_area_struct *dst_vma,
+			   struct vm_area_struct *src_vma,
+			   unsigned long dst_addr,
+			   unsigned long src_addr,
+			   spinlock_t *dst_ptl,
+			   spinlock_t *src_ptl,
+			   __u64 mode)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EEXIST;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte)) {
+		if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES))
+			return -ENOENT;
+		else
+			/* nothing to do to remap a hole */
+			return 0;
+	}
+
+	if (pte_present(orig_src_pte)) {
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EBUSY;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+						  dst_anon_vma);
+		ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+								 dst_addr);
+
+		if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+			      orig_src_pte))
+			BUG();
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(dst_mm, dst_addr, dst_pte, orig_dst_pte);
+
+		if (dst_mm != src_mm) {
+			inc_mm_counter(dst_mm, MM_ANONPAGES);
+			dec_mm_counter(src_mm, MM_ANONPAGES);
+		}
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+		mmu_notifier_invalidate_page(src_mm, src_addr);
+	} else {
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(src_mm, src_pmd,
+						     src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		if (swp_entry_swapcount(entry) != 1)
+			return -EBUSY;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_entry_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(src_mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(dst_mm, dst_addr, dst_pte, orig_src_pte);
+
+		if (dst_mm != src_mm) {
+			inc_mm_counter(dst_mm, MM_ANONPAGES);
+			dec_mm_counter(src_mm, MM_ANONPAGES);
+		}
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+/**
+ * remap_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * remap_pages() remaps arbitrary anonymous pages atomically with zero
+ * copies. It only works on non-shared anonymous pages because those
+ * can be relocated without generating non-linear anon_vmas in the
+ * rmap code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will be registered in the userfaultfd (with
+ * UFFDIO_REGISTER) while the source vma will have VM_DONTCOPY set
+ * with madvise(MADV_DONTFORK).
+ *
+ * The thread resolving the userland page fault will receive the
+ * faulting page in the source vma through the network, storage or
+ * any other I/O device (MADV_DONTFORK in the source vma prevents
+ * remap_pages() from failing with -EBUSY if the process forks before
+ * remap_pages() is called), then it will call remap_pages() to map
+ * the page at the faulting address in the destination vma.
+ *
+ * This userfaultfd command works purely via pagetables, so it's the
+ * most efficient way to move physical non-shared anonymous pages
+ * across different virtual addresses. Unlike mremap()/mmap()/munmap()
+ * it does not create any new vmas. The mapping at the destination
+ * address is established atomically.
+ *
+ * It only works if the vma protection bits are identical in the
+ * source and destination vmas.
+ *
+ * It can remap non-shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_pages() will fail respectively with -ENOENT or -EEXIST. This
+ * provides a very strict behavior to avoid any chance of memory
+ * corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads try to both call remap_pages() on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this command.
+ *
+ * The command retval will be "len" if successful. The command however
+ * can be interrupted by fatal signals or errors. If interrupted, it
+ * will return the number of bytes successfully remapped before the
+ * interruption if any, or the negative error if none. It will never
+ * return zero. Either it will return an error or an amount of bytes
+ * successfully moved. If the retval reports a "short" remap, the
+ * remap_pages() command should be repeated by userland with
+ * src+retval, dst+retval, len-retval if it wants to know about the
+ * error that interrupted it.
+ *
+ * The UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES flag can be specified to
+ * prevent -ENOENT errors from materializing if there are holes in the
+ * source virtual range being remapped. The holes will be accounted as
+ * successfully remapped in the retval of the command. This is mostly
+ * useful to remap hugepage-aligned virtual regions without knowing
+ * whether there are transparent hugepages in the regions or not,
+ * while avoiding the risk of having to split the hugepmd during the
+ * remap.
+ *
+ * Any rmap walks that take the anon_vma locks without first obtaining
+ * the page lock (for example split_huge_page and
+ * page_referenced_anon) have to verify whether page->mapping has
+ * changed after taking the anon_vma lock. If it changed they should
+ * release the lock and retry obtaining a new anon_vma, because it
+ * means the anon_vma was changed by remap_pages() before the lock
+ * could be obtained. This is the only additional complexity added to
+ * the rmap code to provide this anonymous page remapping
+ * functionality.
+ */
+ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		    unsigned long dst_start, unsigned long src_start,
+		    unsigned long len, __u64 mode)
+{
+	struct vm_area_struct *src_vma, *dst_vma;
+	long err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+	ssize_t moved = 0;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(src_start & ~PAGE_MASK);
+	BUG_ON(dst_start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(src_start + len <= src_start);
+	BUG_ON(dst_start + len <= dst_start);
+
+	/*
+	 * Because these are read semaphores there's no risk of lock
+	 * inversion.
+	 */
+	down_read(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		down_read(&src_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(src_mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(dst_mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	/*
+	 * Be strict and only allow remap_pages if either the src or
+	 * dst range is registered in the userfaultfd to prevent
+	 * userland errors going unnoticed. As far as the VM
+	 * consistency is concerned, it would be perfectly safe to
+	 * remove this check, but there's no useful usage for
+	 * remap_pages outside of userfaultfd registered ranges. This
+	 * is after all why it is an ioctl belonging to the
+	 * userfaultfd and not a syscall.
+	 *
+	 * Allow both vmas to be registered in the userfaultfd, just
+	 * in case somebody finds a way to make such a case useful.
+	 * Normally only one of the two vmas would be registered in
+	 * the userfaultfd.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx &&
+	    !src_vma->vm_userfaultfd_ctx.ctx)
+		goto out;
+
+	/*
+	 * FIXME: only allow remapping across anonymous vmas,
+	 * tmpfs should be added.
+	 */
+	if (src_vma->vm_ops || dst_vma->vm_ops)
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma or this page
+	 * would get a NULL anon_vma when moved in the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		spinlock_t *ptl;
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(src_mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				src_pmd = mm_alloc_pmd(src_mm, src_addr);
+				if (unlikely(!src_pmd)) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.  These
+			 * checks don't actually need the PT lock but
+			 * it's good to do it here to optimize this
+			 * block away at build time if
+			 * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(dst_pmdval) ||
+			    src_start + len - src_addr < HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				/* Fall through */
+				split_huge_page_pmd(src_vma, src_addr,
+						    src_pmd);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_pages_huge_pmd(dst_mm,
+							   src_mm,
+							   dst_pmd,
+							   src_pmd,
+							   dst_pmdval,
+							   dst_vma,
+							   src_vma,
+							   dst_addr,
+							   src_addr);
+				cond_resched();
+
+				if (!err) {
+					dst_addr += HPAGE_PMD_SIZE;
+					src_addr += HPAGE_PMD_SIZE;
+					moved += HPAGE_PMD_SIZE;
+				}
+
+				if ((!err || err == -EAGAIN) &&
+				    fatal_signal_pending(current))
+					err = -EINTR;
+
+				if (err && err != -EAGAIN)
+					break;
+
+				continue;
+			}
+		}
+
+		if (pmd_none(*src_pmd)) {
+			if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				if (unlikely(__pte_alloc(src_mm, src_vma,
+							 src_pmd, src_addr))) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * We hold the mmap_sem only for reading, so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(dst_mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If a huge pmd materialized from under us, fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(dst_mm, dst_pmd);
+		src_ptl = pte_lockptr(src_mm, src_pmd);
+
+		err = remap_pages_pte(dst_mm, src_mm,
+				      dst_pte, src_pte, src_pmd,
+				      dst_vma, src_vma,
+				      dst_addr, src_addr,
+				      dst_ptl, src_ptl, mode);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			moved += PAGE_SIZE;
+		}
+
+		if ((!err || err == -EAGAIN) &&
+		    fatal_signal_pending(current))
+			err = -EINTR;
+
+		if (err && err != -EAGAIN)
+			break;
+	}
+
+out:
+	up_read(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		up_read(&src_mm->mmap_sem);
+	BUG_ON(moved < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!moved && !err);
+	return moved ? moved : err;
+}

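To make the "mapped in only one vma" constraint above concrete, a
monitor would typically prepare its source staging area along these
lines (a hypothetical userland sketch, assuming len is page aligned;
not part of the patch):

	#include <sys/mman.h>
	#include <err.h>

	void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		err(1, "mmap");
	/*
	 * VM_DONTCOPY keeps these pages out of any child created by
	 * fork(): a second mapping would raise the mapcount above 1
	 * and make remap_pages fail with -EBUSY.
	 */
	if (madvise(src, len, MADV_DONTFORK))
		err(1, "madvise(MADV_DONTFORK)");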

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 20/21] userfaultfd: UFFDIO_REMAP
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 17:18 ` [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers Andrea Arcangeli
  2015-03-05 18:15 ` [PATCH 00/21] RFC: userfaultfd v3 Pavel Emelyanov
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This remap ioctl allows atomically moving pages in or out of a
userfaultfd address space. It's more expensive than "copy" (and of
course more expensive than "zerofill") as it requires a TLB flush on
the source range for each ioctl, which is an expensive operation on
SMP. Especially when copying only a few pages at a time, copying
without a TLB flush is faster.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6230f22..b4c7f25 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -892,6 +892,54 @@ out:
 	return ret;
 }
 
+static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
+			     unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_remap uffdio_remap;
+	struct uffdio_remap __user *user_uffdio_remap;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_remap = (struct uffdio_remap __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_remap, user_uffdio_remap,
+			   /* don't copy the last "remap" and "wake" fields */
+			   sizeof(uffdio_remap)-sizeof(__s64)*2))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
+	if (ret)
+		goto out;
+	ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
+	if (ret)
+		goto out;
+	ret = -EINVAL;
+	if (uffdio_remap.mode & ~(UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES|
+				  UFFDIO_REMAP_MODE_DONTWAKE))
+		goto out;
+
+	ret = remap_pages(ctx->mm, current->mm,
+			  uffdio_remap.dst, uffdio_remap.src,
+			  uffdio_remap.len, uffdio_remap.mode);
+	if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_remap.mode & UFFDIO_REMAP_MODE_DONTWAKE)) {
+		range.start = uffdio_remap.dst;
+		ret = wake_userfault(ctx, &range);
+		if (unlikely(put_user(ret, &user_uffdio_remap->wake)))
+			return -EFAULT;
+	}
+	ret = range.len == uffdio_remap.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -955,6 +1003,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_ZEROPAGE:
 		ret = userfaultfd_zeropage(ctx, arg);
 		break;
+	case UFFDIO_REMAP:
+		ret = userfaultfd_remap(ctx, arg);
+		break;
 	}
 	return ret;
 }

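Given the retval semantics documented in remap_pages, a caller that
wants to handle "short" remaps could loop roughly like this
(hypothetical userland fragment; uffd, dst, src and len are assumed to
be set up and page aligned):

	for (;;) {
		struct uffdio_remap remap;

		remap.dst = dst;
		remap.src = src;
		remap.len = len;
		remap.mode = 0;

		if (ioctl(uffd, UFFDIO_REMAP, &remap) == 0)
			break;		/* whole range remapped and woken */
		if (errno == EAGAIN && remap.remap > 0) {
			/* short remap: retry past the part that succeeded */
			dst += remap.remap;
			src += remap.remap;
			len -= remap.remap;
			continue;
		}
		err(1, "UFFDIO_REMAP");	/* remap.remap holds the error */
	}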

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 20/21] userfaultfd: UFFDIO_REMAP Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 18:15 ` [PATCH 00/21] RFC: userfaultfd v3 Pavel Emelyanov
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

These helpers will be used to decide whether to call handle_userfault()
during wrprotect faults, in order to deliver the wrprotect faults to
userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3c39a4f..81f0d11 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -65,6 +65,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -92,6 +97,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;

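For context, the intended call site would presumably look like the
following in the wrprotect fault path (an illustrative sketch only;
this series does not add that hunk yet):

	/* in the COW/wrprotect fault path, mmap_sem held for reading */
	if (userfaultfd_wp(vma))
		/* deliver the wrprotect fault to userland instead of COWing */
		return handle_userfault(vma, address, flags, VM_UFFD_WP);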

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
@ 2015-03-05 17:39   ` Linus Torvalds
  2015-03-05 18:51     ` Andrea Arcangeli
  2015-03-05 18:01   ` Pavel Emelyanov
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-03-05 17:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Android Kernel Team, Kirill A. Shutemov,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

On Thu, Mar 5, 2015 at 9:18 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> remap_pages is the lowlevel mm helper needed to implement
> UFFDIO_REMAP.

This function is nasty nasty nasty.

Is this really worth it? On real loads? That people are expected to use?

Considering how we just got rid of one special magic VM remapping
thing that nobody actually used, I'd really hate to add a new one.

The fact is, almost nobody ever uses anything that isn't standard
POSIX. There are no apps, and even for specialized things like
virtualization hypervisors this kind of thing is often simply not
worth it.

Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
argue for not merging the remap parts. I just don't see the point. It
doesn't seem to add anything that is semantically very important -
it's *potentially* a faster copy, but even that is

  (a) questionable in the first place

and

 (b) unclear why anybody would ever care about performance of
infrastructure that nobody actually uses today, and future use isn't
even clear or shown to be particularly performance-sensitive.

So basically I'd like to see better documentation, a few real use
cases (and by real I very much do *not* mean "you can use it for
this", but actual patches to actual projects that matter and that are
expected to care and merge them), and a simplified series that doesn't
do the remap thing.

Because *every* time we add a new clever interface, we end up with
approximately zero users and just pain down the line. Examples:
splice, mremap, yadda yadda.

                        Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  2015-03-05 17:17 ` [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
@ 2015-03-05 17:48   ` Pavel Emelyanov
  0 siblings, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 17:48 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

> diff --git a/kernel/fork.c b/kernel/fork.c
> index cf65139..cb215c0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  			goto fail_nomem_anon_vma_fork;
>  		tmp->vm_flags &= ~VM_LOCKED;
>  		tmp->vm_next = tmp->vm_prev = NULL;
> +		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

This creates an interesting effect when the userfaultfd is used outside of
the process which created and activated it. If I try to monitor the memory
of one task with another, once the first task fork()-s, its child
begins to see zero-pages in the places where the monitor task was supposed
to insert pages with data.

>  		file = tmp->vm_file;
>  		if (file) {
>  			struct inode *inode = file_inode(file);
> .
> 

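A minimal sketch of the effect (hypothetical; addr lies in a vma of
the parent that was registered through a uffd held by a separate
monitor task):

	pid_t pid = fork();
	if (pid == 0) {
		/*
		 * The child's copy of the vma lost its
		 * vm_userfaultfd_ctx in dup_mmap(), so this read is not
		 * delivered to the monitor: it faults in a zero page
		 * instead of the data the monitor was supposed to
		 * insert.
		 */
		char c = *(volatile char *) addr;
		_exit(c);
	}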

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization
  2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
@ 2015-03-05 17:57   ` Pavel Emelyanov
  2015-03-06 10:48   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 17:57 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela


> +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> +		     unsigned int flags, unsigned long reason)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct userfaultfd_ctx *ctx;
> +	struct userfaultfd_wait_queue uwq;
> +
> +	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +	ctx = vma->vm_userfaultfd_ctx.ctx;
> +	if (!ctx)
> +		return VM_FAULT_SIGBUS;
> +
> +	BUG_ON(ctx->mm != mm);
> +
> +	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> +	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +
> +	/*
> +	 * If it's already released don't get it. This avoids to loop
> +	 * in __get_user_pages if userfaultfd_release waits on the
> +	 * caller of handle_userfault to release the mmap_sem.
> +	 */
> +	if (unlikely(ACCESS_ONCE(ctx->released)))
> +		return VM_FAULT_SIGBUS;
> +
> +	/* check that we can return VM_FAULT_RETRY */
> +	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
> +		/*
> +		 * Validate the invariant that nowait must allow retry
> +		 * to be sure not to return SIGBUS erroneously on
> +		 * nowait invocations.
> +		 */
> +		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
> +#ifdef CONFIG_DEBUG_VM
> +		if (printk_ratelimit()) {
> +			printk(KERN_WARNING
> +			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
> +			dump_stack();
> +		}
> +#endif
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	/*
> +	 * Handle nowait, not much to do other than tell it to retry
> +	 * and wait.
> +	 */
> +	if (flags & FAULT_FLAG_RETRY_NOWAIT)
> +		return VM_FAULT_RETRY;
> +
> +	/* take the reference before dropping the mmap_sem */
> +	userfaultfd_ctx_get(ctx);
> +
> +	/* be gentle and immediately relinquish the mmap_sem */
> +	up_read(&mm->mmap_sem);
> +
> +	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> +	uwq.wq.private = current;
> +	uwq.address = userfault_address(address, flags, reason);

Since we report only the virtual address of the fault, this will make
things difficult for a task monitoring the address space of some other
task. Like this:

Let's assume a task creates a userfaultfd, activates it, registers
several VMAs in it and then sends the ufd descriptor to another task.
If the first task later remaps those VMAs and starts touching pages,
the monitor will start receiving fault addresses from which it will
not be able to tell the exact vma the requests come from.

Thanks,
Pavel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
  2015-03-05 17:39   ` Linus Torvalds
@ 2015-03-05 18:01   ` Pavel Emelyanov
  1 sibling, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 18:01 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

> +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		    unsigned long dst_start, unsigned long src_start,
> +		    unsigned long len, __u64 mode)
> +{
> +	struct vm_area_struct *src_vma, *dst_vma;
> +	long err = -EINVAL;
> +	pmd_t *src_pmd, *dst_pmd;
> +	pte_t *src_pte, *dst_pte;
> +	spinlock_t *dst_ptl, *src_ptl;
> +	unsigned long src_addr, dst_addr;
> +	int thp_aligned = -1;
> +	ssize_t moved = 0;
> +
> +	/*
> +	 * Sanitize the command parameters:
> +	 */
> +	BUG_ON(src_start & ~PAGE_MASK);
> +	BUG_ON(dst_start & ~PAGE_MASK);
> +	BUG_ON(len & ~PAGE_MASK);
> +
> +	/* Does the address range wrap, or is the span zero-sized? */
> +	BUG_ON(src_start + len <= src_start);
> +	BUG_ON(dst_start + len <= dst_start);
> +
> +	/*
> +	 * Because these are read sempahores there's no risk of lock
> +	 * inversion.
> +	 */
> +	down_read(&dst_mm->mmap_sem);
> +	if (dst_mm != src_mm)
> +		down_read(&src_mm->mmap_sem);
> +
> +	/*
> +	 * Make sure the vma is not shared, that the src and dst remap
> +	 * ranges are both valid and fully within a single existing
> +	 * vma.
> +	 */
> +	src_vma = find_vma(src_mm, src_start);
> +	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> +		goto out;
> +	if (src_start < src_vma->vm_start ||
> +	    src_start + len > src_vma->vm_end)
> +		goto out;
> +
> +	dst_vma = find_vma(dst_mm, dst_start);
> +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +		goto out;

I again have a concern about the case when one task monitors the VM of
another one. If the target task (owning the mm) unmaps a VMA, then the
monitor task (holding and operating on the ufd) will get a plain EINVAL
on the UFFDIO_REMAP request. This is not fatal, but still inconvenient,
as it will be hard to find out the reason for the failure -- either the
dst VMA was removed and the monitor should just drop the respective
pages with data, or some other error has occurred and some other action
should be taken.

Thanks,
Pavel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
  2015-03-05 17:17 ` [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
@ 2015-03-05 18:07   ` Pavel Emelyanov
  0 siblings, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 18:07 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

> +static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> +			    pmd_t *dst_pmd,
> +			    struct vm_area_struct *dst_vma,
> +			    unsigned long dst_addr,
> +			    unsigned long src_addr)
> +{
> +	struct mem_cgroup *memcg;
> +	pte_t _dst_pte, *dst_pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	void *page_kaddr;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
> +	if (!page)
> +		goto out;

Not a fatal thing, but still quite inconvenient. If there are two tasks that
have anonymous private VMAs that are still not COW-ed from each other, then
it will be impossible to keep the pages shared with userfault. Thus if we do
post-copy memory migration for tasks, then these guys will have their
memory COW-ed.


Thanks,
Pavel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/21] RFC: userfaultfd v3
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers Andrea Arcangeli
@ 2015-03-05 18:15 ` Pavel Emelyanov
  21 siblings, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 18:15 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela


> All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
> live migration and the UFFD can be passed to a manager process through
> unix domain sockets to satisfy point 5).

Yup :) That's the best (from my POV) point of ufd -- the ability to delegate
the descriptor to some other task. Though there are several limitations (I've
expressed them in other e-mails), I'm definitely supporting this!

The respective CRIU code is still quite sloppy; I will try to brush it
up and show it soon.

Thanks,
Pavel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 17:39   ` Linus Torvalds
@ 2015-03-05 18:51     ` Andrea Arcangeli
  2015-03-05 19:32       ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 18:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Android Kernel Team, Kirill A. Shutemov,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

On Thu, Mar 05, 2015 at 09:39:48AM -0800, Linus Torvalds wrote:
> Is this really worth it? On real loads? That people are expected to use?

I fully agree that it's not worth merging UFFDIO_REMAP upstream until
(and unless) a real-world usage for it shows up. To further clarify:
had this not been an RFC, the patchset would have stopped at patch
15/21 inclusive.

Merging UFFDIO_REMAP with no real-life users would just increase the
attack surface of the kernel for no good reason.

Thanks for pointing out that UFFDIO_COPY is faster: the userland code
we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE and never uses
UFFDIO_REMAP. I immediately agreed that UFFDIO_COPY is preferable
after you mentioned it during review of the previous RFC.

However, this being an RFC with a large audience, and UFFDIO_REMAP
allowing memory to be "removed" (think of externalizing memory into
ceph with deduplication or such), I still added it just in case there
are real-world use cases that may justify keeping it around (even
though I would definitely not have submitted it for merging in the
short term regardless).

In addition to dropping the parts that aren't suitable for merging in
the short term, like UFFDIO_REMAP, for any further submissions that
don't substantially alter the API (as happened between the v2 and v3
RFCs) I'll also shrink the To/Cc list considerably.

> Considering how we just got rid of one special magic VM remapping
> thing that nobody actually used, I'd really hate to add a new one.

Having to define an API somehow, I tried to think of all possible
future usages and to make sure the API would allow for those if needed.

> Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
> argue for not merging the remap parts. I just don't see the point. It
> doesn't seem to add anything that is semantically very important -
> it's *potentially* a faster copy, but even that is
> 
>   (a) questionable in the first place

Yes, we already measured that UFFDIO_COPY is faster than
UFFDIO_REMAP: the userfault latency decreases by about 20%.

> 
> and
> 
>  (b) unclear why anybody would ever care about performance of
> infrastructure that nobody actually uses today, and future use isn't
> even clear or shown to be particularly performance-sensitive.

The only potential _theoretical_ case that justifies the existence of
UFFDIO_REMAP is "removing" memory from the address space. For
"adding" memory, UFFDIO_COPY and UFFDIO_ZEROPAGE are always
preferable, as you suggested.
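
For reference, resolving a missing-page userfault with UFFDIO_COPY
from the fault-handling thread would look roughly like this (a sketch
against the 13/21 uAPI, assuming a uffdio_copy layout with
dst/src/len/mode fields; resolve_fault is an illustrative name and
error handling is omitted):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * uffd: the userfaultfd; fault_addr: the address returned by read(2)
 * (low bits carry the UFFD_BIT_* flags); src: a local page-sized
 * buffer already filled with the content, e.g. received from the
 * migration stream.
 */
static int resolve_fault(int uffd, unsigned long fault_addr,
			 void *src, unsigned long page_size)
{
	struct uffdio_copy copy;

	copy.dst = fault_addr & ~(page_size - 1); /* strip UFFD_BIT_* flags */
	copy.src = (unsigned long) src;
	copy.len = page_size;
	copy.mode = 0; /* no UFFDIO_COPY_MODE_DONTWAKE: wake the fault */

	return ioctl(uffd, UFFDIO_COPY, &copy);
}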

> So basically I'd like to see better documentation, a few real use
> cases (and by real I very much do *not* mean "you can use it for
> this", but actual patches to actual projects that matter and that are
> expected to care and merge them), and a simplified series that doesn't
> do the remap thing.

So far I wrote some documentation in 2/21 and in the cover letter, but
certainly more docs are necessary. Trinity coverage is also needed (I
got trinity running on the v2 API but I haven't adapted it to the new
API yet).

About the real world usages, this is the primary one:

http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

And it actually cannot be merged in qemu until userfaultfd is merged
in the kernel. There's simply no safe way to implement postcopy live
migration without something equivalent to the userfaultfd if all Linux
VM features are intended to be retained on the destination node.

> Because *every* time we add a new clever interface, we end up with
> approximately zero users and just pain down the line. Examples:
> splice, mremap, yadda yadda.

Aside from mremap, which I think is widely used, I totally agree in
principle.

For now I can quite comfortably guarantee the above real-life user of
userfaultfd (qemu), but there are potentially five of them. None of
them needs UFFDIO_REMAP, which is again why I totally agree about not
submitting it for merging: it was intended only for the initial RFC,
to share the idea of "removing" memory with a larger audience before
I shrink the Cc/To list for further updates.

Thanks,
Andrea


* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 18:51     ` Andrea Arcangeli
@ 2015-03-05 19:32       ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2015-03-05 19:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Android Kernel Team, Kirill A. Shutemov,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

On Thu, Mar 5, 2015 at 10:51 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Thanks for pointing out that UFFDIO_COPY is faster: the userland code
> we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE and never uses
> UFFDIO_REMAP.

Ok. So there's no actual expected use of the remap interface. Good.
That makes this series more palatable, since the rest didn't raise my
hackles much.

(But yeah, the documentation patch didn't really explain the uses very
much or at all, so I think something more is needed in that area).

                   Linus


* Re: [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization
  2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
  2015-03-05 17:57   ` Pavel Emelyanov
@ 2015-03-06 10:48   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 32+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-03-06 10:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, lkml, linux-mm, Linux API, Android Kernel Team,
	Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Hi Andrea,

On 5 March 2015 at 18:17, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Once a userfaultfd has been created and certain regions of the process
> virtual address space have been registered into it, the thread
> responsible for doing the memory externalization can manage the page
> faults in userland by talking to the kernel using the userfaultfd
> protocol.

Is there something like a man page for this new syscall?

Thanks,

Michael


> poll() can be used to know when there are new pending userfaults to be
> read (POLLIN).
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  fs/userfaultfd.c | 977 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 977 insertions(+)
>  create mode 100644 fs/userfaultfd.c
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> new file mode 100644
> index 0000000..6b31967
> --- /dev/null
> +++ b/fs/userfaultfd.c
> @@ -0,0 +1,977 @@
> +/*
> + *  fs/userfaultfd.c
> + *
> + *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> + *  Copyright (C) 2008-2009 Red Hat, Inc.
> + *  Copyright (C) 2015  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + *
> + *  Some part derived from fs/eventfd.c (anon inode setup) and
> + *  mm/ksm.c (mm hashing).
> + */
> +
> +#include <linux/hashtable.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/seq_file.h>
> +#include <linux/file.h>
> +#include <linux/bug.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/syscalls.h>
> +#include <linux/userfaultfd_k.h>
> +#include <linux/mempolicy.h>
> +#include <linux/ioctl.h>
> +#include <linux/security.h>
> +
> +enum userfaultfd_state {
> +       UFFD_STATE_WAIT_API,
> +       UFFD_STATE_RUNNING,
> +};
> +
> +struct userfaultfd_ctx {
> +       /* pseudo fd refcounting */
> +       atomic_t refcount;
> +       /* waitqueue head for the userfaultfd page faults */
> +       wait_queue_head_t fault_wqh;
> +       /* waitqueue head for the pseudo fd to wakeup poll/read */
> +       wait_queue_head_t fd_wqh;
> +       /* userfaultfd syscall flags */
> +       unsigned int flags;
> +       /* state machine */
> +       enum userfaultfd_state state;
> +       /* released */
> +       bool released;
> +       /* mm with one or more vmas attached to this userfaultfd_ctx */
> +       struct mm_struct *mm;
> +};
> +
> +struct userfaultfd_wait_queue {
> +       unsigned long address;
> +       wait_queue_t wq;
> +       bool pending;
> +       struct userfaultfd_ctx *ctx;
> +};
> +
> +struct userfaultfd_wake_range {
> +       unsigned long start;
> +       unsigned long len;
> +};
> +
> +static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
> +                                    int wake_flags, void *key)
> +{
> +       struct userfaultfd_wake_range *range = key;
> +       int ret;
> +       struct userfaultfd_wait_queue *uwq;
> +       unsigned long start, len;
> +
> +       uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +       ret = 0;
> +       /* don't wake the pending ones, to avoid making reads block */
> +       if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
> +               goto out;
> +       /* len == 0 means wake all */
> +       start = range->start;
> +       len = range->len;
> +       if (len && (start > uwq->address || start + len <= uwq->address))
> +               goto out;
> +       ret = wake_up_state(wq->private, mode);
> +       if (ret)
> +               /* wake only once, autoremove behavior */
> +               list_del_init(&wq->task_list);
> +out:
> +       return ret;
> +}
> +
> +/**
> + * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
> + * context.
> + * @ctx: [in] Pointer to the userfaultfd context.
> + *
> + * The context refcount must already be non-zero, or this BUG()s.
> + */
> +static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
> +{
> +       if (!atomic_inc_not_zero(&ctx->refcount))
> +               BUG();
> +}
> +
> +/**
> + * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
> + * context.
> + * @ctx: [in] Pointer to userfaultfd context.
> + *
> + * The userfaultfd context reference must have been previously acquired either
> + * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
> + */
> +static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
> +{
> +       if (atomic_dec_and_test(&ctx->refcount)) {
> +               mmdrop(ctx->mm);
> +               kfree(ctx);
> +       }
> +}
> +
> +static inline unsigned long userfault_address(unsigned long address,
> +                                             unsigned int flags,
> +                                             unsigned long reason)
> +{
> +       BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
> +       address &= PAGE_MASK;
> +       if (flags & FAULT_FLAG_WRITE)
> +               /*
> +                * Encode "write" fault information in the LSB of the
> +                * address read by userland, without depending on
> +                * FAULT_FLAG_WRITE kernel internal value.
> +                */
> +               address |= UFFD_BIT_WRITE;
> +       if (reason & VM_UFFD_WP)
> +               /*
> +                * Encode "reason" fault information as bit number 1
> +                * in the address read by userland. If bit number 1 is
> +                * clear it means the reason is a VM_FAULT_MISSING
> +                * fault.
> +                */
> +               address |= UFFD_BIT_WP;
> +       return address;
> +}
> +
> +/*
> + * The locking rules involved in returning VM_FAULT_RETRY depending on
> + * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
> + * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
> + * recommendation in __lock_page_or_retry is not an understatement.
> + *
> + * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
> + * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
> + * not set.
> + *
> + * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
> + * set, VM_FAULT_RETRY can still be returned if and only if there are
> + * fatal_signal_pending()s, and the mmap_sem must be released before
> + * returning it.
> + */
> +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> +                    unsigned int flags, unsigned long reason)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       struct userfaultfd_ctx *ctx;
> +       struct userfaultfd_wait_queue uwq;
> +
> +       BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +       ctx = vma->vm_userfaultfd_ctx.ctx;
> +       if (!ctx)
> +               return VM_FAULT_SIGBUS;
> +
> +       BUG_ON(ctx->mm != mm);
> +
> +       VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> +       VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +
> +       /*
> +        * If it's already released don't get it. This avoids looping
> +        * in __get_user_pages if userfaultfd_release waits on the
> +        * caller of handle_userfault to release the mmap_sem.
> +        */
> +       if (unlikely(ACCESS_ONCE(ctx->released)))
> +               return VM_FAULT_SIGBUS;
> +
> +       /* check that we can return VM_FAULT_RETRY */
> +       if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
> +               /*
> +                * Validate the invariant that nowait must allow retry
> +                * to be sure not to return SIGBUS erroneously on
> +                * nowait invocations.
> +                */
> +               BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
> +#ifdef CONFIG_DEBUG_VM
> +               if (printk_ratelimit()) {
> +                       printk(KERN_WARNING
> +                              "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
> +                       dump_stack();
> +               }
> +#endif
> +               return VM_FAULT_SIGBUS;
> +       }
> +
> +       /*
> +        * Handle nowait, not much to do other than tell it to retry
> +        * and wait.
> +        */
> +       if (flags & FAULT_FLAG_RETRY_NOWAIT)
> +               return VM_FAULT_RETRY;
> +
> +       /* take the reference before dropping the mmap_sem */
> +       userfaultfd_ctx_get(ctx);
> +
> +       /* be gentle and immediately relinquish the mmap_sem */
> +       up_read(&mm->mmap_sem);
> +
> +       init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> +       uwq.wq.private = current;
> +       uwq.address = userfault_address(address, flags, reason);
> +       uwq.pending = true;
> +       uwq.ctx = ctx;
> +
> +       spin_lock(&ctx->fault_wqh.lock);
> +       /*
> +        * After the __add_wait_queue the uwq is visible to userland
> +        * through poll/read().
> +        */
> +       __add_wait_queue(&ctx->fault_wqh, &uwq.wq);
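> +       /*
> +        * Sleep until the fault is resolved: a read() from userland
> +        * clears uwq.pending, after which a UFFDIO_WAKE (or the fd
> +        * being released, or a fatal signal) terminates the loop.
> +        */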
> +       for (;;) {
> +               set_current_state(TASK_KILLABLE);
> +               if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> +                   fatal_signal_pending(current))
> +                       break;
> +               spin_unlock(&ctx->fault_wqh.lock);
> +
> +               wake_up_poll(&ctx->fd_wqh, POLLIN);
> +               schedule();
> +
> +               spin_lock(&ctx->fault_wqh.lock);
> +       }
> +       __remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
> +       __set_current_state(TASK_RUNNING);
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       /*
> +        * ctx may go away after this if the userfault pseudo fd is
> +        * already released.
> +        */
> +       userfaultfd_ctx_put(ctx);
> +
> +       return VM_FAULT_RETRY;
> +}
> +
> +static int userfaultfd_release(struct inode *inode, struct file *file)
> +{
> +       struct userfaultfd_ctx *ctx = file->private_data;
> +       struct mm_struct *mm = ctx->mm;
> +       struct vm_area_struct *vma, *prev;
> +       /* len == 0 means wake all */
> +       struct userfaultfd_wake_range range = { .len = 0, };
> +       unsigned long new_flags;
> +
> +       ACCESS_ONCE(ctx->released) = true;
> +
> +       /*
> +        * Flush page faults out of all CPUs. NOTE: all page faults
> +        * must be retried without returning VM_FAULT_SIGBUS if
> +        * userfaultfd_ctx_get() succeeds but vma->vm_userfaultfd_ctx
> +        * changes while handle_userfault released the mmap_sem. So
> +        * it's critical that released is set to true (above), before
> +        * taking the mmap_sem for writing.
> +        */
> +       down_write(&mm->mmap_sem);
> +       prev = NULL;
> +       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +               cond_resched();
> +               BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> +                      !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +               if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> +                       prev = vma;
> +                       continue;
> +               }
> +               new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> +               prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
> +                                new_flags, vma->anon_vma,
> +                                vma->vm_file, vma->vm_pgoff,
> +                                vma_policy(vma),
> +                                NULL_VM_UFFD_CTX);
> +               if (prev)
> +                       vma = prev;
> +               else
> +                       prev = vma;
> +               vma->vm_flags = new_flags;
> +               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> +       }
> +       up_write(&mm->mmap_sem);
> +
> +       /*
> +        * After no new page faults can wait on this fault_wqh, flush
> +        * the last page faults that may have been already waiting on
> +        * the fault_wqh.
> +        */
> +       spin_lock(&ctx->fault_wqh.lock);
> +       __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       wake_up_poll(&ctx->fd_wqh, POLLHUP);
> +       userfaultfd_ctx_put(ctx);
> +       return 0;
> +}
> +
> +static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
> +                                         struct userfaultfd_wait_queue **uwq)
> +{
> +       wait_queue_t *wq;
> +       struct userfaultfd_wait_queue *_uwq;
> +       unsigned int ret = 0;
> +
> +       spin_lock(&ctx->fault_wqh.lock);
> +       list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +               _uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +               if (_uwq->pending) {
> +                       ret = POLLIN;
> +                       if (uwq)
> +                               *uwq = _uwq;
> +                       break;
> +               }
> +       }
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       return ret;
> +}
> +
> +static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
> +{
> +       struct userfaultfd_ctx *ctx = file->private_data;
> +
> +       poll_wait(file, &ctx->fd_wqh, wait);
> +
> +       switch (ctx->state) {
> +       case UFFD_STATE_WAIT_API:
> +               return POLLERR;
> +       case UFFD_STATE_RUNNING:
> +               return find_userfault(ctx, NULL);
> +       default:
> +               BUG();
> +       }
> +}
> +
> +static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
> +                                   __u64 *addr)
> +{
> +       ssize_t ret;
> +       DECLARE_WAITQUEUE(wait, current);
> +       struct userfaultfd_wait_queue *uwq = NULL;
> +
> +       /* always take the fd_wqh lock before the fault_wqh lock */
> +       spin_lock(&ctx->fd_wqh.lock);
> +       __add_wait_queue(&ctx->fd_wqh, &wait);
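> +       /*
> +        * Check-and-sleep under fd_wqh.lock: wake_up_poll() in
> +        * handle_userfault() takes the same lock, so a new pending
> +        * fault cannot slip in unnoticed between find_userfault()
> +        * and schedule().
> +        */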
> +       for (;;) {
> +               set_current_state(TASK_INTERRUPTIBLE);
> +               if (find_userfault(ctx, &uwq)) {
> +                       uwq->pending = false;
> +                       /* careful to always initialize addr if ret == 0 */
> +                       *addr = uwq->address;
> +                       ret = 0;
> +                       break;
> +               }
> +               if (signal_pending(current)) {
> +                       ret = -ERESTARTSYS;
> +                       break;
> +               }
> +               if (no_wait) {
> +                       ret = -EAGAIN;
> +                       break;
> +               }
> +               spin_unlock(&ctx->fd_wqh.lock);
> +               schedule();
> +               spin_lock(&ctx->fd_wqh.lock);
> +       }
> +       __remove_wait_queue(&ctx->fd_wqh, &wait);
> +       __set_current_state(TASK_RUNNING);
> +       spin_unlock(&ctx->fd_wqh.lock);
> +
> +       return ret;
> +}
> +
> +static ssize_t userfaultfd_read(struct file *file, char __user *buf,
> +                               size_t count, loff_t *ppos)
> +{
> +       struct userfaultfd_ctx *ctx = file->private_data;
> +       ssize_t _ret, ret = 0;
> +       /* careful to always initialize addr if ret == 0 */
> +       __u64 uninitialized_var(addr);
> +       int no_wait = file->f_flags & O_NONBLOCK;
> +
> +       if (ctx->state == UFFD_STATE_WAIT_API)
> +               return -EINVAL;
> +       BUG_ON(ctx->state != UFFD_STATE_RUNNING);
> +
> +       for (;;) {
> +               if (count < sizeof(addr))
> +                       return ret ? ret : -EINVAL;
> +               _ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
> +               if (_ret < 0)
> +                       return ret ? ret : _ret;
> +               if (put_user(addr, (__u64 __user *) buf))
> +                       return ret ? ret : -EFAULT;
> +               ret += sizeof(addr);
> +               buf += sizeof(addr);
> +               count -= sizeof(addr);
> +               /*
> +                * Allow reading more than one fault at a time, but
> +                * only block while waiting for the very first one.
> +                */
> +               no_wait = O_NONBLOCK;
> +       }
> +}
> +
> +static int __wake_userfault(struct userfaultfd_ctx *ctx,
> +                           struct userfaultfd_wake_range *range)
> +{
> +       wait_queue_t *wq;
> +       struct userfaultfd_wait_queue *uwq;
> +       int ret;
> +       unsigned long start, end;
> +
> +       start = range->start;
> +       end = range->start + range->len;
> +
> +       ret = -ENOENT;
> +       spin_lock(&ctx->fault_wqh.lock);
> +       list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +               uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +               if (uwq->pending)
> +                       continue;
> +               if (uwq->address >= start && uwq->address < end) {
> +                       ret = 0;
> +                       /* wake all in the range and autoremove */
> +                       __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
> +                                            range);
> +                       break;
> +               }
> +       }
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       return ret;
> +}
> +
> +static __always_inline int wake_userfault(struct userfaultfd_ctx *ctx,
> +                                         struct userfaultfd_wake_range *range)
> +{
> +       if (!waitqueue_active(&ctx->fault_wqh))
> +               return -ENOENT;
> +
> +       return __wake_userfault(ctx, range);
> +}
> +
> +static __always_inline int validate_range(struct mm_struct *mm,
> +                                         __u64 start, __u64 len)
> +{
> +       __u64 task_size = mm->task_size;
> +
> +       if (start & ~PAGE_MASK)
> +               return -EINVAL;
> +       if (len & ~PAGE_MASK)
> +               return -EINVAL;
> +       if (!len)
> +               return -EINVAL;
> +       if (start < mmap_min_addr)
> +               return -EINVAL;
> +       if (start >= task_size)
> +               return -EINVAL;
> +       if (len > task_size - start)
> +               return -EINVAL;
> +       return 0;
> +}
> +
> +static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> +                               unsigned long arg)
> +{
> +       struct mm_struct *mm = ctx->mm;
> +       struct vm_area_struct *vma, *prev, *cur;
> +       int ret;
> +       struct uffdio_register uffdio_register;
> +       struct uffdio_register __user *user_uffdio_register;
> +       unsigned long vm_flags, new_flags;
> +       bool found;
> +       unsigned long start, end, vma_end;
> +
> +       user_uffdio_register = (struct uffdio_register __user *) arg;
> +
> +       ret = -EFAULT;
> +       if (copy_from_user(&uffdio_register, user_uffdio_register,
> +                          sizeof(uffdio_register)-sizeof(__u64)))
> +               goto out;
> +
> +       ret = -EINVAL;
> +       if (!uffdio_register.mode)
> +               goto out;
> +       if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
> +                                    UFFDIO_REGISTER_MODE_WP))
> +               goto out;
> +       vm_flags = 0;
> +       if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
> +               vm_flags |= VM_UFFD_MISSING;
> +       if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
> +               vm_flags |= VM_UFFD_WP;
> +               /*
> +                * FIXME: remove the below error constraint by
> +                * implementing the wprotect tracking mode.
> +                */
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       ret = validate_range(mm, uffdio_register.range.start,
> +                            uffdio_register.range.len);
> +       if (ret)
> +               goto out;
> +
> +       start = uffdio_register.range.start;
> +       end = start + uffdio_register.range.len;
> +
> +       down_write(&mm->mmap_sem);
> +       vma = find_vma_prev(mm, start, &prev);
> +
> +       ret = -ENOMEM;
> +       if (!vma)
> +               goto out_unlock;
> +
> +       /* check that there's at least one vma in the range */
> +       ret = -EINVAL;
> +       if (vma->vm_start >= end)
> +               goto out_unlock;
> +
> +       /*
> +        * Search for incompatible vmas.
> +        *
> +        * FIXME: this shall be relaxed later so that it doesn't fail
> +        * on tmpfs backed vmas (in addition to the current allowance
> +        * on anonymous vmas).
> +        */
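> +       /*
> +        * First pass over the range: validate only, no vma is
> +        * modified until the whole range has been checked.
> +        */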
> +       found = false;
> +       for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
> +               cond_resched();
> +
> +               BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> +                      !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +
> +               /* check for incompatible vmas */
> +               ret = -EINVAL;
> +               if (cur->vm_ops)
> +                       goto out_unlock;
> +
> +               /*
> +                * Check that this vma isn't already owned by a
> +                * different userfaultfd. We can't allow more than one
> +                * userfaultfd to own a single vma simultaneously or we
> +                * wouldn't know which one to deliver the userfaults to.
> +                */
> +               ret = -EBUSY;
> +               if (cur->vm_userfaultfd_ctx.ctx &&
> +                   cur->vm_userfaultfd_ctx.ctx != ctx)
> +                       goto out_unlock;
> +
> +               found = true;
> +       }
> +       BUG_ON(!found);
> +
> +       /*
> +        * Now that we scanned all vmas, we can already tell userland
> +        * which ioctl methods are guaranteed to succeed on this range.
> +        */
> +       ret = -EFAULT;
> +       if (put_user(UFFD_API_RANGE_IOCTLS, &user_uffdio_register->ioctls))
> +               goto out_unlock;
> +
> +       if (vma->vm_start < start)
> +               prev = vma;
> +
> +       ret = 0;
> +       do {
> +               cond_resched();
> +
> +               BUG_ON(vma->vm_ops);
> +               BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> +                      vma->vm_userfaultfd_ctx.ctx != ctx);
> +
> +               /*
> +                * Nothing to do: this vma is already registered into this
> +                * userfaultfd and with the right tracking mode too.
> +                */
> +               if (vma->vm_userfaultfd_ctx.ctx == ctx &&
> +                   (vma->vm_flags & vm_flags) == vm_flags)
> +                       goto skip;
> +
> +               if (vma->vm_start > start)
> +                       start = vma->vm_start;
> +               vma_end = min(end, vma->vm_end);
> +
> +               new_flags = (vma->vm_flags & ~vm_flags) | vm_flags;
> +               prev = vma_merge(mm, prev, start, vma_end, new_flags,
> +                                vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> +                                vma_policy(vma),
> +                                ((struct vm_userfaultfd_ctx){ ctx }));
> +               if (prev) {
> +                       vma = prev;
> +                       goto next;
> +               }
> +               if (vma->vm_start < start) {
> +                       ret = split_vma(mm, vma, start, 1);
> +                       if (ret)
> +                               break;
> +               }
> +               if (vma->vm_end > end) {
> +                       ret = split_vma(mm, vma, end, 0);
> +                       if (ret)
> +                               break;
> +               }
> +       next:
> +               /*
> +                * In the vma_merge() successful mprotect-like case 8:
> +                * the next vma was merged into the current one and
> +                * the current one has not been updated yet.
> +                */
> +               vma->vm_flags = new_flags;
> +               vma->vm_userfaultfd_ctx.ctx = ctx;
> +
> +       skip:
> +               prev = vma;
> +               start = vma->vm_end;
> +               vma = vma->vm_next;
> +       } while (vma && vma->vm_start < end);
> +out_unlock:
> +       up_write(&mm->mmap_sem);
> +out:
> +       return ret;
> +}
> +
> +static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> +                                 unsigned long arg)
> +{
> +       struct mm_struct *mm = ctx->mm;
> +       struct vm_area_struct *vma, *prev, *cur;
> +       int ret;
> +       struct uffdio_range uffdio_unregister;
> +       unsigned long new_flags;
> +       bool found;
> +       unsigned long start, end, vma_end;
> +       const void __user *buf = (void __user *)arg;
> +
> +       ret = -EFAULT;
> +       if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
> +               goto out;
> +
> +       ret = validate_range(mm, uffdio_unregister.start,
> +                            uffdio_unregister.len);
> +       if (ret)
> +               goto out;
> +
> +       start = uffdio_unregister.start;
> +       end = start + uffdio_unregister.len;
> +
> +       down_write(&mm->mmap_sem);
> +       vma = find_vma_prev(mm, start, &prev);
> +
> +       ret = -ENOMEM;
> +       if (!vma)
> +               goto out_unlock;
> +
> +       /* check that there's at least one vma in the range */
> +       ret = -EINVAL;
> +       if (vma->vm_start >= end)
> +               goto out_unlock;
> +
> +       /*
> +        * Search for incompatible vmas.
> +        *
> +        * FIXME: this shall be relaxed later so that it doesn't fail
> +        * on tmpfs backed vmas (in addition to the current allowance
> +        * on anonymous vmas).
> +        */
> +       found = false;
> +       ret = -EINVAL;
> +       for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
> +               cond_resched();
> +
> +               BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> +                      !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +
> +               /*
> +                * Check for incompatible vmas. Not strictly required
> +                * here, as incompatible vmas cannot have a
> +                * userfaultfd_ctx registered on them, but this
> +                * provides stricter behavior to notice
> +                * unregistration errors.
> +                */
> +               if (cur->vm_ops)
> +                       goto out_unlock;
> +
> +               found = true;
> +       }
> +       BUG_ON(!found);
> +
> +       if (vma->vm_start < start)
> +               prev = vma;
> +
> +       ret = 0;
> +       do {
> +               cond_resched();
> +
> +               BUG_ON(vma->vm_ops);
> +
> +               /*
> +                * Nothing to do: this vma is not registered into any
> +                * userfaultfd, so there is nothing to unregister.
> +                */
> +               if (!vma->vm_userfaultfd_ctx.ctx)
> +                       goto skip;
> +
> +               if (vma->vm_start > start)
> +                       start = vma->vm_start;
> +               vma_end = min(end, vma->vm_end);
> +
> +               new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> +               prev = vma_merge(mm, prev, start, vma_end, new_flags,
> +                                vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> +                                vma_policy(vma),
> +                                NULL_VM_UFFD_CTX);
> +               if (prev) {
> +                       vma = prev;
> +                       goto next;
> +               }
> +               if (vma->vm_start < start) {
> +                       ret = split_vma(mm, vma, start, 1);
> +                       if (ret)
> +                               break;
> +               }
> +               if (vma->vm_end > end) {
> +                       ret = split_vma(mm, vma, end, 0);
> +                       if (ret)
> +                               break;
> +               }
> +       next:
> +               /*
> +                * In the vma_merge() successful mprotect-like case 8:
> +                * the next vma was merged into the current one and
> +                * the current one has not been updated yet.
> +                */
> +               vma->vm_flags = new_flags;
> +               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> +
> +       skip:
> +               prev = vma;
> +               start = vma->vm_end;
> +               vma = vma->vm_next;
> +       } while (vma && vma->vm_start < end);
> +out_unlock:
> +       up_write(&mm->mmap_sem);
> +out:
> +       return ret;
> +}
> +
> +/*
> + * This is mostly needed to re-wakeup those userfaults that were still
> + * pending when userland woke them up the first time. We don't wake
> + * the pending ones, to avoid making blocking reads block or
> + * non-blocking reads return -EAGAIN when used with POLLIN, which
> + * would leave userland in doubt about why POLLIN wasn't reliable.
> + */
> +static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
> +                           unsigned long arg)
> +{
> +       int ret;
> +       struct uffdio_range uffdio_wake;
> +       struct userfaultfd_wake_range range;
> +       const void __user *buf = (void __user *)arg;
> +
> +       ret = -EFAULT;
> +       if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
> +               goto out;
> +
> +       ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
> +       if (ret)
> +               goto out;
> +
> +       range.start = uffdio_wake.start;
> +       range.len = uffdio_wake.len;
> +
> +       /*
> +        * len == 0 means wake all and we don't want to wake all here,
> +        * so check it again to be sure.
> +        */
> +       VM_BUG_ON(!range.len);
> +
> +       ret = wake_userfault(ctx, &range);
> +
> +out:
> +       return ret;
> +}
> +
> +/*
> + * userland asks for a certain API version and we return which bits
> + * and ioctl commands are implemented in this kernel for such API
> + * version or -EINVAL if unknown.
> + */
> +static int userfaultfd_api(struct userfaultfd_ctx *ctx,
> +                          unsigned long arg)
> +{
> +       struct uffdio_api uffdio_api;
> +       void __user *buf = (void __user *)arg;
> +       int ret;
> +
> +       ret = -EINVAL;
> +       if (ctx->state != UFFD_STATE_WAIT_API)
> +               goto out;
> +       ret = -EFAULT;
> +       if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
> +               goto out;
> +       if (uffdio_api.api != UFFD_API) {
> +               /* careful not to leak info, we only read the first 8 bytes */
> +               memset(&uffdio_api, 0, sizeof(uffdio_api));
> +               if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
> +                       goto out;
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +       /* careful not to leak info, we only read the first 8 bytes */
> +       uffdio_api.bits = UFFD_API_BITS;
> +       uffdio_api.ioctls = UFFD_API_IOCTLS;
> +       ret = -EFAULT;
> +       if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
> +               goto out;
> +       ctx->state = UFFD_STATE_RUNNING;
> +       ret = 0;
> +out:
> +       return ret;
> +}
> +
> +static long userfaultfd_ioctl(struct file *file, unsigned cmd,
> +                             unsigned long arg)
> +{
> +       int ret = -EINVAL;
> +       struct userfaultfd_ctx *ctx = file->private_data;
> +
> +       switch(cmd) {
> +       case UFFDIO_API:
> +               ret = userfaultfd_api(ctx, arg);
> +               break;
> +       case UFFDIO_REGISTER:
> +               ret = userfaultfd_register(ctx, arg);
> +               break;
> +       case UFFDIO_UNREGISTER:
> +               ret = userfaultfd_unregister(ctx, arg);
> +               break;
> +       case UFFDIO_WAKE:
> +               ret = userfaultfd_wake(ctx, arg);
> +               break;
> +       }
> +       return ret;
> +}
> +
> +#ifdef CONFIG_PROC_FS
> +static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> +{
> +       struct userfaultfd_ctx *ctx = f->private_data;
> +       wait_queue_t *wq;
> +       struct userfaultfd_wait_queue *uwq;
> +       unsigned long pending = 0, total = 0;
> +
> +       spin_lock(&ctx->fault_wqh.lock);
> +       list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +               uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +               if (uwq->pending)
> +                       pending++;
> +               total++;
> +       }
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       /*
> +        * If more protocols are added, they will all be shown
> +        * separated by a space. Like this:
> +        *      protocols: 0xaa 0xbb
> +        */
> +       seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
> +                  pending, total, UFFD_API, UFFD_API_BITS,
> +                  UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
> +}
> +#endif
> +
> +static const struct file_operations userfaultfd_fops = {
> +#ifdef CONFIG_PROC_FS
> +       .show_fdinfo    = userfaultfd_show_fdinfo,
> +#endif
> +       .release        = userfaultfd_release,
> +       .poll           = userfaultfd_poll,
> +       .read           = userfaultfd_read,
> +       .unlocked_ioctl = userfaultfd_ioctl,
> +       .compat_ioctl   = userfaultfd_ioctl,
> +       .llseek         = noop_llseek,
> +};
> +
> +/**
> + * userfaultfd_file_create - Creates a userfaultfd file pointer.
> + * @flags: Flags for the userfaultfd file.
> + *
> + * This function creates a userfaultfd file pointer, w/out installing
> + * it into the fd table. This is useful when the userfaultfd file is
> + * used during the initialization of data structures that require
> + * extra setup after the userfaultfd creation. So the userfaultfd
> + * creation is split into the file pointer creation phase, and the
> + * file descriptor installation phase.  In this way races with
> + * userspace closing the newly installed file descriptor can be
> + * avoided.  Returns a userfaultfd file pointer, or a proper error
> + * pointer.
> + */
> +static struct file *userfaultfd_file_create(int flags)
> +{
> +       struct file *file;
> +       struct userfaultfd_ctx *ctx;
> +
> +       BUG_ON(!current->mm);
> +
> +       /* Check the UFFD_* constants for consistency.  */
> +       BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
> +       BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
> +
> +       file = ERR_PTR(-EINVAL);
> +       if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
> +               goto out;
> +
> +       file = ERR_PTR(-ENOMEM);
> +       ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +       if (!ctx)
> +               goto out;
> +
> +       atomic_set(&ctx->refcount, 1);
> +       init_waitqueue_head(&ctx->fault_wqh);
> +       init_waitqueue_head(&ctx->fd_wqh);
> +       ctx->flags = flags;
> +       ctx->state = UFFD_STATE_WAIT_API;
> +       ctx->released = false;
> +       ctx->mm = current->mm;
> +       /* prevent the mm struct from being freed */
> +       atomic_inc(&ctx->mm->mm_count);
> +
> +       file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
> +                                 O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
> +       if (IS_ERR(file))
> +               kfree(ctx);
> +out:
> +       return file;
> +}
> +
> +SYSCALL_DEFINE1(userfaultfd, int, flags)
> +{
> +       int fd, error;
> +       struct file *file;
> +
> +       error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
> +       if (error < 0)
> +               return error;
> +       fd = error;
> +
> +       file = userfaultfd_file_create(flags);
> +       if (IS_ERR(file)) {
> +               error = PTR_ERR(file);
> +               goto err_put_unused_fd;
> +       }
> +       fd_install(fd, file);
> +
> +       return fd;
> +
> +err_put_unused_fd:
> +       put_unused_fd(fd);
> +
> +       return error;
> +}



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [Qemu-devel] [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt
  2015-03-05 17:17 ` [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
@ 2015-03-06 15:39   ` Eric Blake
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Blake @ 2015-03-06 15:39 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Robert Love, Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi,
	Andrew Jones, Sanidhya Kashyap, KOSAKI Motohiro,
	Michel Lespinasse, Taras Glek, zhang.zhanghailiang,
	Pavel Emelyanov, Hugh Dickins, Mel Gorman, Sasha Levin,
	Dr. David Alan Gilbert, Huangpeng (Peter),
	Andres Lagar-Cavilla, Christopher Covington, Anthony Liguori,
	Paolo Bonzini, Kirill A. Shutemov, Keith Packard, Wenchao Xia,
	Juan Quintela, Andy Lutomirski, Minchan Kim, Dmitry Adamushko,
	Johannes Weiner, Mike Hommey, Andrew Morton, Linus Torvalds,
	Peter Feiner

On 03/05/2015 10:17 AM, Andrea Arcangeli wrote:
> Add documentation.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 97 insertions(+)
>  create mode 100644 Documentation/vm/userfaultfd.txt

Just a grammar review (no analysis of technical correctness)

> 
> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
> new file mode 100644
> index 0000000..2ec296c
> --- /dev/null
> +++ b/Documentation/vm/userfaultfd.txt
> @@ -0,0 +1,97 @@
> += Userfaultfd =
> +
> +== Objective ==
> +
> +Userfaults allow to implement on demand paging from userland and more

s/to implement/the implementation of/
and maybe: s/on demand/on-demand/

> +generally they allow userland to take control various memory page
> +faults, something otherwise only the kernel code could do.
> +
> +For example userfaults allows a proper and more optimal implementation
> +of the PROT_NONE+SIGSEGV trick.
> +
> +== Design ==
> +
> +Userfaults are delivered and resolved through the userfaultfd syscall.
> +
> +The userfaultfd (aside from registering and unregistering virtual
> +memory ranges) provides for two primary functionalities:

s/provides for/provides/

> +
> +1) read/POLLIN protocol to notify an userland thread of the faults

s/an userland/a userland/ (remember, 'a unicorn gets an umbrella' - if
the 'u' is pronounced 'you' the correct article is 'a')

> +   happening
> +
> +2) various UFFDIO_* ioctls that can mangle over the virtual memory
> +   regions registered in the userfaultfd that allows userland to
> +   efficiently resolve the userfaults it receives via 1) or to mangle
> +   the virtual memory in the background

maybe: s/mangle/manage/2

> +
> +The real advantage of userfaults if compared to regular virtual memory
> +management of mremap/mprotect is that the userfaults in all their
> +operations never involve heavyweight structures like vmas (in fact the
> +userfaultfd runtime load never takes the mmap_sem for writing).
> +
> +Vmas are not suitable for page(or hugepage)-granular fault tracking

s/page(or hugepage)-granular/page- (or hugepage-) granular/

> +when dealing with virtual address spaces that could span
> +Terabytes. Too many vmas would be needed for that.
> +
> +The userfaultfd once opened by invoking the syscall, can also be
> +passed using unix domain sockets to a manager process, so the same
> +manager process could handle the userfaults of a multitude of
> +different process without them being aware about what is going on

s/process/processes/

> +(well of course unless they later try to use the userfaultfd themself

s/themself/themselves/

> +on the same region the manager is already tracking, which is a corner
> +case that would currently return -EBUSY).
> +
> +== API ==
> +
> +When first opened the userfaultfd must be enabled invoking the
> +UFFDIO_API ioctl specifying an uffdio_api.api value set to UFFD_API

s/an uffdio/a uffdio/

> +which will specify the read/POLLIN protocol userland intends to speak
> +on the UFFD. The UFFDIO_API ioctl if successful (i.e. if the requested
> +uffdio_api.api is spoken also by the running kernel), will return into
> +uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks of
> +respectively the activated feature bits below PAGE_SHIFT in the
> +userfault addresses returned by read(2) and the generic ioctl
> +available.
> +
> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
> +register a memory range in the userfaultfd by setting the
> +uffdio_register structure accordingly. The uffdio_register.mode
> +bitmask will specify to the kernel which kind of faults to track for
> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
> +pages). The UFFDIO_REGISTER ioctl will return the
> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
> +userfaults on the range reigstered. Not all ioctls will necessarily be

s/reigstered/registered/

> +supported for all memory types depending on the underlying virtual
> +memory backend (anonymous memory vs tmpfs vs real filebacked
> +mappings).
> +
> +Userland can use the uffdio_register.ioctls to mangle the virtual

maybe s/mangle/manage/

> +address space in the background (to add or potentially also remove
> +memory from the userfaultfd registered range). This means an userfault

s/an/a/

> +could be triggering just before userland maps in the background the
> +user-faulted page. To avoid POLLIN resulting in an unexpected blocking
> +read (if the UFFD is not opened in nonblocking mode in the first
> +place), we don't allow the background thread to wake userfaults that
> +haven't been read by userland yet. If we would do that likely the
> +UFFDIO_WAKE ioctl could be dropped. This may change in the future
> +(with a UFFD_API protocol bumb combined with the removal of the

s/bumb/bump/

> +UFFDIO_WAKE ioctl) if it'll be demonstrated that it's a valid
> +optimization and worthy to force userland to use the UFFD always in
> +nonblocking mode if combined with POLLIN.
> +
> +userfaultfd is also a generic enough feature, that it allows KVM to
> +implement postcopy live migration (one form of memory externalization
> +consisting of a virtual machine running with part or all of its memory
> +residing on a different node in the cloud) without having to modify a
> +single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
> +and all other GUP features works just fine in combination with
> +userfaults (userfaults trigger async page faults in the guest
> +scheduler so those guest processes that aren't waiting for userfaults
> +can keep running in the guest vcpus).
> +
> +The primary ioctl to resolve userfaults is UFFDIO_COPY. That
> +atomically copies a page into the userfault registered range and wakes
> +up the blocked userfaults (unless uffdio_copy.mode &
> +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
> +UFFDIO_COPY.
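
Putting the quoted documentation together, the userland lifecycle
would look roughly like this (a sketch only: it assumes the v3 uAPI
names quoted above, a 1MB anonymous test area, and that patch 12/21
wires up __NR_userfaultfd; faults would be resolved with UFFDIO_COPY
as sketched elsewhere in the thread):

#include <fcntl.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	struct pollfd pfd;
	__u64 addr;
	void *area;
	long uffd;

	/* a 1MB anonymous test area to take userfaults on */
	area = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return 1;

	/* 1) create the uffd and negotiate the protocol */
	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return 1;

	/* 2) register the range, tracking missing pages */
	reg.range.start = (unsigned long) area;
	reg.range.len = 1UL << 20;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 1;

	/* 3) poll/read protocol: each read returns one 8-byte fault
	 * address, with the UFFD_BIT_* flags in the bits below
	 * PAGE_SHIFT */
	pfd.fd = uffd;
	pfd.events = POLLIN;
	for (;;) {
		poll(&pfd, 1, -1);
		while (read(uffd, &addr, sizeof(addr)) == sizeof(addr)) {
			/* resolve the fault, e.g. with UFFDIO_COPY,
			 * which also wakes the faulting thread */
		}
	}
}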

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




end of thread

Thread overview: 32+ messages
2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
2015-03-06 15:39   ` [Qemu-devel] " Eric Blake
2015-03-05 17:17 ` [PATCH 03/21] userfaultfd: uAPI Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
2015-03-05 17:48   ` Pavel Emelyanov
2015-03-05 17:17 ` [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2015-03-05 17:57   ` Pavel Emelyanov
2015-03-06 10:48   ` Michael Kerrisk (man-pages)
2015-03-05 17:17 ` [PATCH 11/21] userfaultfd: buildsystem activation Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 12/21] userfaultfd: activate syscall Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
2015-03-05 18:07   ` Pavel Emelyanov
2015-03-05 17:17 ` [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 16/21] userfaultfd: remap_pages: rmap preparation Andrea Arcangeli
2015-03-05 17:18 ` [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation Andrea Arcangeli
2015-03-05 17:18 ` [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI Andrea Arcangeli
2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
2015-03-05 17:39   ` Linus Torvalds
2015-03-05 18:51     ` Andrea Arcangeli
2015-03-05 19:32       ` Linus Torvalds
2015-03-05 18:01   ` Pavel Emelyanov
2015-03-05 17:18 ` [PATCH 20/21] userfaultfd: UFFDIO_REMAP Andrea Arcangeli
2015-03-05 17:18 ` [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers Andrea Arcangeli
2015-03-05 18:15 ` [PATCH 00/21] RFC: userfaultfd v3 Pavel Emelyanov
