linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/23] userfaultfd v4
@ 2015-05-14 17:30 Andrea Arcangeli
  2015-05-14 17:30 ` [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
                   ` (25 more replies)
  0 siblings, 26 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:30 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hello everyone,

This is the latest userfaultfd patchset against mm-v4.1-rc3
2015-05-14-10:04.

The postcopy live migration feature on the qemu side is mostly ready
to be merged and it depends entirely on the userfaultfd syscall also
being merged. So it'd be great if this patchset could be reviewed for
merging into -mm.

Userfaults allow on-demand paging to be implemented from userland and
more generally they allow userland to take control of the behavior of
page faults more efficiently than what was available before
(the PROT_NONE + SIGSEGV trap).

The use cases are:

1) KVM postcopy live migration (one form of cloud memory
   externalization).

   KVM postcopy live migration is the primary driver of this work:

	http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
	http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) postcopy live migration of binaries inside linux containers:

	http://thread.gmane.org/gmane.linux.kernel.mm/132662

3) KVM postcopy live snapshotting (allowing the memory usage to be
   limited/throttled, unlike fork would allow, plus avoiding the fork
   overhead in the first place).

   While the wrprotect tracking is not implemented yet, the syscall API
   already contemplates the wrprotect fault tracking and is generic
   enough to allow its later implementation in a backwards compatible
   fashion.

4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to also work on tmpfs; then the
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is tmpfs
   backed.

5) an alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access, with
   the CPU, the virtual regions marked volatile. This depends on point
   4) being fulfilled first, as volatile pages happily apply to tmpfs.

Even though no real use case has requested it yet, userfaultfd also
allows implementing distributed shared memory in a way that readonly
shared mappings can exist simultaneously on different hosts and can
become exclusive at the first wrprotect fault.

The development version can also be cloned here:

	git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Slides from the LSF/MM summit (but beware that they're not up to date):

	https://www.kernel.org/pub/linux/kernel/people/andrea/userfaultfd/userfaultfd-LSFMM-2015.pdf

Comments welcome.

Thanks,
Andrea

Changelog of the major changes since the last RFC v3:

o The API has been slightly modified to avoid having to introduce a
  second revision of the API, in order to support the non cooperative
  usage.

o Various mixed fixes thanks to the feedback from Dave Hansen and
  David Gilbert.

  The most notable one is the use of mm_users instead of mm_count to
  pin the mm, to avoid crashes in code that assumed the vma still
  existed (in the userfaultfd_release method and in the various
  ioctls). exit_mmap doesn't even set mm->mmap to NULL, so unless I
  introduce a userfaultfd_exit to call in mmput, I have to pin
  mm_users to be safe. This is a visible change mainly for the
  non-cooperative usage.

o userfaults are woken immediately even if they haven't been "read"
  yet; this can lead to POLLIN false positives (so I only allow poll
  if the fd is open in nonblocking mode, to be sure it won't hang).

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f222d9de0a5302dc8ac62d6fab53a84251098751

o optimize read to return entries in O(1), and poll, which was
  already O(1), becomes lockless. This required splitting the
  waitqueue in two, one for pending faults and one for non-pending
  faults; the faults are refiled across the two waitqueues when
  they're read. Both waitqueues are protected by a single lock (the
  fault_pending_wqh one) to be simpler and faster at runtime.

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=9aa033ed43a1134c2223dac8c5d9e02e0100fca1

o Allocate the ctx with kmem_cache_alloc.

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6

o Originally qemu had two bitflags for each page and kept 3 states (of
  the 4 possible with two bits) for each page in order to deal with
  the races that can happen if one thread is reading the userfaults
  and another thread is calling the UFFDIO_COPY ioctl in the
  background. This patch solves all races in the kernel so the two
  bits per page can be dropped from qemu codebase. I started
  documenting the races that can materialize by using 2 threads
  (instead of running the workload single threaded with a single poll
  event loop) and how userland had to solve them until I decided it
  was simpler to fix the race in the kernel by running an ad-hoc
  pagetable walk inside the wait_event()-kind-of-section. This
  simplified qemu significantly (hundreds of lines of code involving
  a mutex have been deleted, and that mutex disappeared as well) and
  it doesn't make the kernel much more complicated.

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=41efeae4e93f0296436f2a9fc6b28b6b0158512a

  After this patch the only reason to call UFFDIO_WAKE is to handle
  the userfaults in batches in combination with the DONT_WAKE flag of
  UFFDIO_COPY.

o I removed the read recursion from mcopy_atomic. This avoids depending
  on the write-starvation behavior of rwsem to be safe. After
  this change the rwsem is free to stop any further down_read if
  there's a down_write waiting on the lock.

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b1e3a08acc9e3f6c2614e89fc3b8e338daa58e18

o Extended the Documentation userfaultfd.txt file to explain how
  QEMU/KVM uses userfaultfd to implement postcopy live migration.

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=016f9523b7b2238851533736e84452cb00b2ddcd

Andrea Arcangeli (22):
  userfaultfd: linux/Documentation/vm/userfaultfd.txt
  userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
  userfaultfd: uAPI
  userfaultfd: linux/userfaultfd_k.h
  userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
  userfaultfd: call handle_userfault() for userfaultfd_missing() faults
  userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
  userfaultfd: prevent khugepaged to merge if userfaultfd is armed
  userfaultfd: add new syscall to provide memory externalization
  userfaultfd: Rename uffd_api.bits into .features fixup
  userfaultfd: change the read API to return a uffd_msg
  userfaultfd: wake pending userfaults
  userfaultfd: optimize read() and poll() to be O(1)
  userfaultfd: allocate the userfaultfd_ctx cacheline aligned
  userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
  userfaultfd: buildsystem activation
  userfaultfd: activate syscall
  userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
  userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE
    preparation
  userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
  userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE

Pavel Emelyanov (1):
  userfaultfd: Rename uffd_api.bits into .features

 Documentation/ioctl/ioctl-number.txt   |    1 +
 Documentation/vm/userfaultfd.txt       |  142 ++++
 arch/powerpc/include/asm/systbl.h      |    1 +
 arch/powerpc/include/uapi/asm/unistd.h |    1 +
 arch/x86/syscalls/syscall_32.tbl       |    1 +
 arch/x86/syscalls/syscall_64.tbl       |    1 +
 fs/Makefile                            |    1 +
 fs/proc/task_mmu.c                     |    2 +
 fs/userfaultfd.c                       | 1236 ++++++++++++++++++++++++++++++++
 include/linux/mm.h                     |    4 +-
 include/linux/mm_types.h               |   11 +
 include/linux/syscalls.h               |    1 +
 include/linux/userfaultfd_k.h          |   85 +++
 include/linux/wait.h                   |    5 +-
 include/uapi/linux/Kbuild              |    1 +
 include/uapi/linux/userfaultfd.h       |  161 +++++
 init/Kconfig                           |   11 +
 kernel/fork.c                          |    3 +-
 kernel/sched/wait.c                    |    7 +-
 kernel/sys_ni.c                        |    1 +
 mm/Makefile                            |    1 +
 mm/huge_memory.c                       |   75 +-
 mm/madvise.c                           |    3 +-
 mm/memory.c                            |   16 +
 mm/mempolicy.c                         |    4 +-
 mm/mlock.c                             |    3 +-
 mm/mmap.c                              |   40 +-
 mm/mprotect.c                          |    3 +-
 mm/userfaultfd.c                       |  309 ++++++++
 net/sunrpc/sched.c                     |    2 +-
 30 files changed, 2082 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/vm/userfaultfd.txt
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd_k.h
 create mode 100644 include/uapi/linux/userfaultfd.h
 create mode 100644 mm/userfaultfd.c

Credits: partially funded by the Orbit EU project.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
@ 2015-05-14 17:30 ` Andrea Arcangeli
  2015-09-11  8:47   ` Michael Kerrisk (man-pages)
  2015-05-14 17:30 ` [PATCH 02/23] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
                   ` (24 subsequent siblings)
  25 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:30 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Add documentation.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/userfaultfd.txt | 140 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 140 insertions(+)
 create mode 100644 Documentation/vm/userfaultfd.txt

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 0000000..c2f5145
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,140 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow the implementation of on-demand paging from userland
+and more generally they allow userland to take control of various
+memory page faults, something otherwise only kernel code could do.
+
+For example userfaults allow a proper and more optimal implementation
+of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides two primary functionalities:
+
+1) read/POLLIN protocol to notify a userland thread of the faults
+   happening
+
+2) various UFFDIO_* ioctls that can manage the virtual memory regions
+   registered in the userfaultfd that allows userland to efficiently
+   resolve the userfaults it receives via 1) or to manage the virtual
+   memory in the background
+
+The real advantage of userfaults compared to regular virtual memory
+management via mremap/mprotect is that the userfaults in all their
+operations never involve heavyweight structures like vmas (in fact the
+userfaultfd runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page- (or hugepage-) granular fault
+tracking when dealing with virtual address spaces that could span
+terabytes. Too many vmas would be needed for that.
+
+The userfaultfd, once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware of what is going on
+(well of course unless they later try to use the userfaultfd
+themselves on the same region the manager is already tracking, which
+is a corner case that would currently return -EBUSY).
+
+== API ==
+
+When first opened the userfaultfd must be enabled by invoking the
+UFFDIO_API ioctl with a uffdio_api.api value set to UFFD_API (or a
+later API version), which specifies the read/POLLIN protocol userland
+intends to speak on the UFFD. The UFFDIO_API ioctl, if successful
+(i.e. if the requested uffdio_api.api is also spoken by the running
+kernel), will return in uffdio_api.features and uffdio_api.ioctls two
+64bit bitmasks of, respectively, the activated features of the
+read(2) protocol and the generic ioctls available.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to manage the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means a
+userfault could trigger just before userland maps the user-faulted
+page in the background.
+
+The primary ioctl to resolve userfaults is UFFDIO_COPY. It
+atomically copies a page into the userfault registered range and
+wakes up the blocked userfaults (unless uffdio_copy.mode &
+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
+UFFDIO_COPY.
+
+== QEMU/KVM ==
+
+QEMU/KVM is using the userfaultfd syscall to implement postcopy live
+migration. Postcopy live migration is one form of memory
+externalization consisting of a virtual machine running with part or
+all of its memory residing on a different node in the cloud. The
+userfaultfd abstraction is generic enough that not a single line of
+KVM kernel code had to be modified in order to add postcopy live
+migration to QEMU.
+
+Guest async page faults, FOLL_NOWAIT and all other GUP features work
+just fine in combination with userfaults. Userfaults trigger async
+page faults in the guest scheduler so those guest processes that
+aren't waiting for userfaults (i.e. network bound) can keep running in
+the guest vcpus.
+
+It is generally beneficial to run one pass of precopy live migration
+just before starting postcopy live migration, in order to avoid
+generating userfaults for readonly guest regions.
+
+The implementation of postcopy live migration currently uses one
+single bidirectional socket but in the future two different sockets
+will be used (to reduce the latency of the userfaults to the minimum
+possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
+
+The QEMU in the source node writes all pages that it knows are missing
+in the destination node into the socket, and the migration thread of
+the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
+ioctls on the userfaultfd in order to map the received pages into the
+guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
+
+A different postcopy thread in the destination node listens with
+poll() on the userfaultfd in parallel. When a POLLIN event is
+generated after a userfault triggers, the postcopy thread reads from
+the userfaultfd and receives the fault address (or -EAGAIN in case
+the userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE
+run by the parallel QEMU migration thread).
+
+After the QEMU postcopy thread (running in the destination node) gets
+the userfault address it writes the information about the missing page
+into the socket. The QEMU source node receives the information and
+roughly "seeks" to that page address and continues sending all
+remaining missing pages from that new page offset. Soon after that
+(just the time to flush the tcp_wmem queue through the network) the
+migration thread in the QEMU running in the destination node will
+receive the page that triggered the userfault and it'll map it as
+usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
+was spontaneously sent by the source or if it was an urgent page
+requested through a userfault).
+
+By the time the userfaults start, the QEMU in the destination node
+doesn't need to keep any per-page state bitmap relative to the live
+migration around, and a single per-page bitmap has to be maintained in
+the QEMU running in the source node to know which pages are still
+missing in the destination node. The bitmap in the source node is
+checked to find which missing pages to send in round robin, and we
+seek over it when receiving incoming userfaults. After sending each
+page the bitmap is of course updated accordingly. It's also useful to
+avoid sending the same page twice (in case the userfault is read by
+the postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the
+migration thread).


* [PATCH 02/23] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
  2015-05-14 17:30 ` [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
@ 2015-05-14 17:30 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 03/23] userfaultfd: uAPI Andrea Arcangeli
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:30 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

userfaultfd needs to wake all waiters on a waitqueue (pass 0 as the
nr parameter), instead of the current hardcoded 1 (which would wake
just the first waiter in the head list).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ++++---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index d69ac4e..c48c0e1 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,7 +147,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -179,7 +180,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)					\
-	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+	__wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 2ccec98..d48513e 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -106,9 +106,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	__wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -283,7 +284,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, key);
+		__wake_up_locked_key(q, mode, 1, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 337ca85..b140c09 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
 	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
 	ret = atomic_dec_and_test(&task->tk_count);
 	if (waitqueue_active(wq))
-		__wake_up_locked_key(wq, TASK_NORMAL, &k);
+		__wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
 	spin_unlock_irqrestore(&wq->lock, flags);
 	return ret;
 }


* [PATCH 03/23] userfaultfd: uAPI
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
  2015-05-14 17:30 ` [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
  2015-05-14 17:30 ` [PATCH 02/23] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 04/23] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/ioctl/ioctl-number.txt |  1 +
 include/uapi/linux/Kbuild            |  1 +
 include/uapi/linux/userfaultfd.h     | 81 ++++++++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+)
 create mode 100644 include/uapi/linux/userfaultfd.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index ec7c81b..ed950ad 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -302,6 +302,7 @@ Code  Seq#(hex)	Include File		Comments
 0xA3	80-8F	Port ACL		in development:
 					<mailto:tlewis@mindspring.com>
 0xA3	90-9F	linux/dtlk.h
+0xAA	00-3F	linux/uapi/linux/userfaultfd.h
 0xAB	00-1F	linux/nbd.h
 0xAC	00-1F	linux/raw.h
 0xAD	00	Netfilter device	in development:
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 4842a98..47547ea 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -452,3 +452,4 @@ header-y += xfrm.h
 header-y += xilinx-v4l2-controls.h
 header-y += zorro.h
 header-y += zorro_ids.h
+header-y += userfaultfd.h
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
new file mode 100644
index 0000000..9a8cd56
--- /dev/null
+++ b/include/uapi/linux/userfaultfd.h
@@ -0,0 +1,81 @@
+/*
+ *  include/uapi/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#define UFFD_API ((__u64)0xAA)
+/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_IOCTLS				\
+	((__u64)1 << _UFFDIO_REGISTER |		\
+	 (__u64)1 << _UFFDIO_UNREGISTER |	\
+	 (__u64)1 << _UFFDIO_API)
+#define UFFD_API_RANGE_IOCTLS			\
+	((__u64)1 << _UFFDIO_WAKE)
+
+/*
+ * Valid ioctl command number range with this API is from 0x00 to
+ * 0x3F.  UFFDIO_API is the fixed number, everything else can be
+ * changed by implementing a different UFFD_API. If sticking to the
+ * same UFFD_API more ioctls can be added and userland will be aware
+ * of which ioctls the running kernel implements through the ioctl
+ * command bitmask written by the UFFDIO_API.
+ */
+#define _UFFDIO_REGISTER		(0x00)
+#define _UFFDIO_UNREGISTER		(0x01)
+#define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_API			(0x3F)
+
+/* userfaultfd ioctl ids */
+#define UFFDIO 0xAA
+#define UFFDIO_API		_IOWR(UFFDIO, _UFFDIO_API,	\
+				      struct uffdio_api)
+#define UFFDIO_REGISTER		_IOWR(UFFDIO, _UFFDIO_REGISTER, \
+				      struct uffdio_register)
+#define UFFDIO_UNREGISTER	_IOR(UFFDIO, _UFFDIO_UNREGISTER,	\
+				     struct uffdio_range)
+#define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
+				     struct uffdio_range)
+
+/*
+ * Valid bits below PAGE_SHIFT in the userfault address read through
+ * the read() syscall.
+ */
+#define UFFD_BIT_WRITE	(1<<0)	/* this was a write fault, MISSING or WP */
+#define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
+#define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
+
+struct uffdio_api {
+	/* userland asks for an API number */
+	__u64 api;
+
+	/* kernel answers below with the available features for the API */
+	__u64 bits;
+	__u64 ioctls;
+};
+
+struct uffdio_range {
+	__u64 start;
+	__u64 len;
+};
+
+struct uffdio_register {
+	struct uffdio_range range;
+#define UFFDIO_REGISTER_MODE_MISSING	((__u64)1<<0)
+#define UFFDIO_REGISTER_MODE_WP		((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * kernel answers which ioctl commands are available for the
+	 * range, keep at the end as the last 8 bytes aren't read.
+	 */
+	__u64 ioctls;
+};
+
+#endif /* _LINUX_USERFAULTFD_H */


* [PATCH 04/23] userfaultfd: linux/userfaultfd_k.h
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 03/23] userfaultfd: uAPI Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 05/23] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h | 79 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 include/linux/userfaultfd_k.h

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
new file mode 100644
index 0000000..e1e4360
--- /dev/null
+++ b/include/linux/userfaultfd_k.h
@@ -0,0 +1,79 @@
+/*
+ *  include/linux/userfaultfd_k.h
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_K_H
+#define _LINUX_USERFAULTFD_K_H
+
+#ifdef CONFIG_USERFAULTFD
+
+#include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (UFFD_SHARED_FCNTL_FLAGS)
+
+extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+			    unsigned int flags, unsigned long reason);
+
+/* mm helpers */
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_MISSING;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+}
+
+#else /* CONFIG_USERFAULTFD */
+
+/* mm helpers */
+static inline int handle_userfault(struct vm_area_struct *vma,
+				   unsigned long address,
+				   unsigned int flags,
+				   unsigned long reason)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return true;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+#endif /* CONFIG_USERFAULTFD */
+
+#endif /* _LINUX_USERFAULTFD_K_H */


* [PATCH 05/23] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 04/23] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 06/23] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This adds the vm_userfaultfd_ctx to the vm_area_struct.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm_types.h | 11 +++++++++++
 kernel/fork.c            |  1 +
 2 files changed, 12 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7..2836da7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -265,6 +265,16 @@ struct vm_region {
 						* this region */
 };
 
+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+	struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -331,6 +341,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 };
 
 struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index 1481cdd..430141b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -451,6 +451,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_next = tmp->vm_prev = NULL;
+		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);


* [PATCH 06/23] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 05/23] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 07/23] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

These two flags get set in vma->vm_flags to tell the VM common code
whether the userfaultfd is armed and in which mode (tracking only
missing faults, tracking only wrprotect faults, or both). If neither
flag is set, the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/proc/task_mmu.c | 2 ++
 include/linux/mm.h | 2 ++
 kernel/fork.c      | 2 +-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d..58be92e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -597,6 +597,8 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_HUGEPAGE)]	= "hg",
 		[ilog2(VM_NOHUGEPAGE)]	= "nh",
 		[ilog2(VM_MERGEABLE)]	= "mg",
+		[ilog2(VM_UFFD_MISSING)]= "um",
+		[ilog2(VM_UFFD_WP)]	= "uw",
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2923a51..7fccd22 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -123,8 +123,10 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MAYSHARE	0x00000080
 
 #define VM_GROWSDOWN	0x00000100	/* general info on the segment */
+#define VM_UFFD_MISSING	0x00000200	/* missing pages tracking */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
+#define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
diff --git a/kernel/fork.c b/kernel/fork.c
index 430141b..56d1ddf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -449,7 +449,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		tmp->vm_mm = mm;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~VM_LOCKED;
+		tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;


* [PATCH 07/23] userfaultfd: call handle_userfault() for userfaultfd_missing() faults
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 06/23] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 08/23] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This is where the page faults must be modified to call
handle_userfault() if userfaultfd_missing() is true (i.e. if
vma->vm_flags has VM_UFFD_MISSING set).

handle_userfault() then takes care of blocking the page fault and
delivering it to userland.

The fault flags must also be passed as a parameter so the "read|write"
kind of fault can be reported to userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 69 ++++++++++++++++++++++++++++++++++++++------------------
 mm/memory.c      | 16 +++++++++++++
 2 files changed, 63 insertions(+), 22 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cb8904c..c221be3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -717,7 +718,8 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
-					struct page *page, gfp_t gfp)
+					struct page *page, gfp_t gfp,
+					unsigned int flags)
 {
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
@@ -725,12 +727,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, mm, gfp, &memcg))
-		return VM_FAULT_OOM;
+	if (mem_cgroup_try_charge(page, mm, gfp, &memcg)) {
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK);
+		return VM_FAULT_FALLBACK;
+	}
 
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
 		mem_cgroup_cancel_charge(page, memcg);
+		put_page(page);
 		return VM_FAULT_OOM;
 	}
 
@@ -750,6 +756,21 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
+
+		/* Deliver the page fault to userland */
+		if (userfaultfd_missing(vma)) {
+			int ret;
+
+			spin_unlock(ptl);
+			mem_cgroup_cancel_charge(page, memcg);
+			put_page(page);
+			pte_free(mm, pgtable);
+			ret = handle_userfault(vma, haddr, flags,
+					       VM_UFFD_MISSING);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
+		}
+
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
@@ -760,6 +781,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		atomic_long_inc(&mm->nr_ptes);
 		spin_unlock(ptl);
+		count_vm_event(THP_FAULT_ALLOC);
 	}
 
 	return 0;
@@ -771,19 +793,16 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 }
 
 /* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
 {
 	pmd_t entry;
-	if (!pmd_none(*pmd))
-		return false;
 	entry = mk_pmd(zero_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	atomic_long_inc(&mm->nr_ptes);
-	return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -806,6 +825,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgtable_t pgtable;
 		struct page *zero_page;
 		bool set;
+		int ret;
 		pgtable = pte_alloc_one(mm, haddr);
 		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
@@ -816,14 +836,28 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_FALLBACK;
 		}
 		ptl = pmd_lock(mm, pmd);
-		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-				zero_page);
-		spin_unlock(ptl);
+		ret = 0;
+		set = false;
+		if (pmd_none(*pmd)) {
+			if (userfaultfd_missing(vma)) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr, flags,
+						       VM_UFFD_MISSING);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
+				set_huge_zero_page(pgtable, mm, vma,
+						   haddr, pmd,
+						   zero_page);
+				spin_unlock(ptl);
+				set = true;
+			}
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
 		}
-		return 0;
+		return ret;
 	}
 	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
@@ -831,14 +865,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp))) {
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
-
-	count_vm_event(THP_FAULT_ALLOC);
-	return 0;
+	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp, flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -873,16 +900,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 */
 	if (is_huge_zero_pmd(pmd)) {
 		struct page *zero_page;
-		bool set;
 		/*
 		 * get_huge_zero_page() will never allocate a new page here,
 		 * since we already have a zero page to copy. It just takes a
 		 * reference.
 		 */
 		zero_page = get_huge_zero_page();
-		set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
 				zero_page);
-		BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
 		ret = 0;
 		goto out_unlock;
 	}
diff --git a/mm/memory.c b/mm/memory.c
index be9f075..790e6f7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2680,6 +2681,12 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto unlock;
+		/* Deliver the page fault to userland, check inside PT lock */
+		if (userfaultfd_missing(vma)) {
+			pte_unmap_unlock(page_table, ptl);
+			return handle_userfault(vma, address, flags,
+						VM_UFFD_MISSING);
+		}
 		goto setpte;
 	}
 
@@ -2707,6 +2714,15 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_none(*page_table))
 		goto release;
 
+	/* Deliver the page fault to userland, check inside PT lock */
+	if (userfaultfd_missing(vma)) {
+		pte_unmap_unlock(page_table, ptl);
+		mem_cgroup_cancel_charge(page, memcg);
+		page_cache_release(page);
+		return handle_userfault(vma, address, flags,
+					VM_UFFD_MISSING);
+	}
+
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 	mem_cgroup_commit_charge(page, memcg, false);


* [PATCH 08/23] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 07/23] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 09/23] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware of, so that vmas can be merged back to what they
originally were before the userfaultfd was armed on some memory range.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h |  2 +-
 mm/madvise.c       |  3 ++-
 mm/mempolicy.c     |  4 ++--
 mm/mlock.c         |  3 ++-
 mm/mmap.c          | 40 +++++++++++++++++++++++++++-------------
 mm/mprotect.c      |  3 ++-
 6 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7fccd22..76376e0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1763,7 +1763,7 @@ extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *);
+	struct mempolicy *, struct vm_userfaultfd_ctx);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..d215ea9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -111,7 +111,8 @@ static long madvise_behavior(struct vm_area_struct *vma,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
-				vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7477432..eeec77a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -722,8 +722,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 		pgoff = vma->vm_pgoff +
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
-				  vma->anon_vma, vma->vm_file, pgoff,
-				  new_pol);
+				 vma->anon_vma, vma->vm_file, pgoff,
+				 new_pol, vma->vm_userfaultfd_ctx);
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 6fd2cf1..25936680 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -510,7 +510,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
-			  vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index 9c1c5b9..5f09a7e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
 #include <linux/notifier.h>
 #include <linux/memory.h>
 #include <linux/printk.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -919,7 +920,8 @@ again:			remove_next = 1 + (end > next->vm_end);
  * per-vma resources, so we don't attempt to merge those.
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
-			struct file *file, unsigned long vm_flags)
+				struct file *file, unsigned long vm_flags,
+				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -935,6 +937,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
 		return 0;
 	if (vma->vm_ops && vma->vm_ops->close)
 		return 0;
+	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+		return 0;
 	return 1;
 }
 
@@ -965,9 +969,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		     struct anon_vma *anon_vma, struct file *file,
+		     pgoff_t vm_pgoff,
+		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
@@ -984,9 +990,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
  */
 static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		    struct anon_vma *anon_vma, struct file *file,
+		    pgoff_t vm_pgoff,
+		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -1029,7 +1037,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
 			struct anon_vma *anon_vma, struct file *file,
-			pgoff_t pgoff, struct mempolicy *policy)
+			pgoff_t pgoff, struct mempolicy *policy,
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1056,14 +1065,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (prev && prev->vm_end == addr &&
 			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags,
-						anon_vma, file, pgoff)) {
+					    anon_vma, file, pgoff,
+					    vm_userfaultfd_ctx)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
 		if (next && end == next->vm_start &&
 				mpol_equal(policy, vma_policy(next)) &&
 				can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen) &&
+						     anon_vma, file,
+						     pgoff+pglen,
+						     vm_userfaultfd_ctx) &&
 				is_mergeable_anon_vma(prev->anon_vma,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
@@ -1084,7 +1096,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (next && end == next->vm_start &&
 			mpol_equal(policy, vma_policy(next)) &&
 			can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen)) {
+					     anon_vma, file, pgoff+pglen,
+					     vm_userfaultfd_ctx)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = vma_adjust(prev, prev->vm_start,
 				addr, prev->vm_pgoff, NULL);
@@ -1570,8 +1583,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/*
 	 * Can we just expand an old mapping?
 	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff,
-			NULL);
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2757,7 +2770,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-					NULL, NULL, pgoff, NULL);
+			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2913,7 +2926,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			    vma->vm_userfaultfd_ctx);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e7d6f11..ef5be8e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -292,7 +292,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 */
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			   vma->vm_userfaultfd_ctx);
 	if (*pprev) {
 		vma = *pprev;
 		goto success;


* [PATCH 09/23] userfaultfd: prevent khugepaged to merge if userfaultfd is armed
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 08/23] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

If userfaultfd is armed on a certain vma we can't "fill" the holes
with zeroes or we'd break on-demand paging in userland. While the
userfault is armed, the holes are genuinely missing information (not
zeroes) that userland has to load from the network or elsewhere.

The same issue happens for wrprotected ptes that we can't just convert
into a single writable pmd_trans_huge.

We could in theory still merge across zeropages if only
VM_UFFD_MISSING is set (i.e. if VM_UFFD_WP is not set), but that
slight improvement would require much more complex code for a tiny
corner case.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c221be3..9671f51 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2198,7 +2198,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			if (++none_or_zero <= khugepaged_max_ptes_none)
+			if (!userfaultfd_armed(vma) &&
+			    ++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out;
@@ -2651,7 +2652,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			if (++none_or_zero <= khugepaged_max_ptes_none)
+			if (!userfaultfd_armed(vma) &&
+			    ++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out_unmap;


* [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 09/23] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:49   ` Linus Torvalds
  2015-06-23 19:00   ` Dave Hansen
  2015-05-14 17:31 ` [PATCH 11/23] userfaultfd: Rename uffd_api.bits into .features Andrea Arcangeli
                   ` (15 subsequent siblings)
  25 siblings, 2 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Once a userfaultfd has been created and certain regions of the process
virtual address space have been registered with it, the thread
responsible for doing the memory externalization can manage the page
faults in userland by talking to the kernel using the userfaultfd
protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 1008 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1008 insertions(+)
 create mode 100644 fs/userfaultfd.c

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..1c9be61
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,1008 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/mempolicy.h>
+#include <linux/ioctl.h>
+#include <linux/security.h>
+
+enum userfaultfd_state {
+	UFFD_STATE_WAIT_API,
+	UFFD_STATE_RUNNING,
+};
+
+struct userfaultfd_ctx {
+	/* pseudo fd refcounting */
+	atomic_t refcount;
+	/* waitqueue head for the userfaultfd page faults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* state machine */
+	enum userfaultfd_state state;
+	/* released */
+	bool released;
+	/* mm with one or more vmas attached to this userfaultfd_ctx */
+	struct mm_struct *mm;
+};
+
+struct userfaultfd_wait_queue {
+	unsigned long address;
+	wait_queue_t wq;
+	bool pending;
+	struct userfaultfd_ctx *ctx;
+};
+
+struct userfaultfd_wake_range {
+	unsigned long start;
+	unsigned long len;
+};
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+				     int wake_flags, void *key)
+{
+	struct userfaultfd_wake_range *range = key;
+	int ret;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long start, len;
+
+	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+	ret = 0;
+	/* don't wake the pending ones, to avoid making reads block */
+	if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
+		goto out;
+	/* len == 0 means wake all */
+	start = range->start;
+	len = range->len;
+	if (len && (start > uwq->address || start + len <= uwq->address))
+		goto out;
+	ret = wake_up_state(wq->private, mode);
+	if (ret)
+		/* wake only once, autoremove behavior */
+		list_del_init(&wq->task_list);
+out:
+	return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * BUG()s if the context refcount had already dropped to zero.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+	if (!atomic_inc_not_zero(&ctx->refcount))
+		BUG();
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+	if (atomic_dec_and_test(&ctx->refcount)) {
+		VM_BUG_ON(spin_is_locked(&ctx->fault_pending_wqh.lock));
+		VM_BUG_ON(waitqueue_active(&ctx->fault_pending_wqh));
+		VM_BUG_ON(spin_is_locked(&ctx->fault_wqh.lock));
+		VM_BUG_ON(waitqueue_active(&ctx->fault_wqh));
+		VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
+		VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
+		mmput(ctx->mm);
+		kfree(ctx);
+	}
+}
+
+static inline unsigned long userfault_address(unsigned long address,
+					      unsigned int flags,
+					      unsigned long reason)
+{
+	BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
+	address &= PAGE_MASK;
+	if (flags & FAULT_FLAG_WRITE)
+		/*
+		 * Encode "write" fault information in the LSB of the
+		 * address read by userland, without depending on
+		 * FAULT_FLAG_WRITE kernel internal value.
+		 */
+		address |= UFFD_BIT_WRITE;
+	if (reason & VM_UFFD_WP)
+		/*
+		 * Encode "reason" fault information as bit number 1
+		 * in the address read by userland. If bit number 1 is
+		 * clear it means the reason is a VM_UFFD_MISSING
+		 * fault.
+		 */
+		address |= UFFD_BIT_WP;
+	return address;
+}
+
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there are
+ * fatal_signal_pending()s, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags, unsigned long reason)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue uwq;
+
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (!ctx)
+		return VM_FAULT_SIGBUS;
+
+	BUG_ON(ctx->mm != mm);
+
+	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
+	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
+
+	/*
+	 * If it's already released don't get it. This avoids looping
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_sem.
+	 */
+	if (unlikely(ACCESS_ONCE(ctx->released)))
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * Check that we can return VM_FAULT_RETRY.
+	 *
+	 * NOTE: it should become possible to return VM_FAULT_RETRY
+	 * even if FAULT_FLAG_TRIED is set without leading to gup()
+	 * -EBUSY failures, if the userfaultfd is to be extended for
+	 * VM_UFFD_WP tracking and we intend to arm the userfault
+	 * without first stopping userland access to the memory. For
+	 * VM_UFFD_MISSING userfaults this is enough for now.
+	 */
+	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+		/*
+		 * Validate the invariant that nowait must allow retry
+		 * to be sure not to return SIGBUS erroneously on
+		 * nowait invocations.
+		 */
+		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+#ifdef CONFIG_DEBUG_VM
+		if (printk_ratelimit()) {
+			printk(KERN_WARNING
+			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+			dump_stack();
+		}
+#endif
+		return VM_FAULT_SIGBUS;
+	}
+
+	/*
+	 * Handle nowait, not much to do other than tell it to retry
+	 * and wait.
+	 */
+	if (flags & FAULT_FLAG_RETRY_NOWAIT)
+		return VM_FAULT_RETRY;
+
+	/* take the reference before dropping the mmap_sem */
+	userfaultfd_ctx_get(ctx);
+
+	/* be gentle and immediately relinquish the mmap_sem */
+	up_read(&mm->mmap_sem);
+
+	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+	uwq.wq.private = current;
+	uwq.address = userfault_address(address, flags, reason);
+	uwq.pending = true;
+	uwq.ctx = ctx;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_KILLABLE);
+		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
+		    fatal_signal_pending(current))
+			break;
+		spin_unlock(&ctx->fault_wqh.lock);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		spin_lock(&ctx->fault_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * already released.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return VM_FAULT_RETRY;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
+	/* len == 0 means wake all */
+	struct userfaultfd_wake_range range = { .len = 0, };
+	unsigned long new_flags;
+
+	ACCESS_ONCE(ctx->released) = true;
+
+	/*
+	 * Flush page faults out of all CPUs. NOTE: all page faults
+	 * must be retried without returning VM_FAULT_SIGBUS if
+	 * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
+	 * changes while handle_userfault released the mmap_sem. So
+	 * it's critical that released is set to true (above), before
+	 * taking the mmap_sem for writing.
+	 */
+	down_write(&mm->mmap_sem);
+	prev = NULL;
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		cond_resched();
+		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
+		       !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
+			prev = vma;
+			continue;
+		}
+		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
+				 new_flags, vma->anon_vma,
+				 vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_UFFD_CTX);
+		if (prev)
+			vma = prev;
+		else
+			prev = vma;
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+	}
+	up_write(&mm->mmap_sem);
+
+	/*
+	 * After no new page faults can wait on this fault_wqh, flush
+	 * the last page faults that may have been already waiting on
+	 * the fault_wqh.
+	 */
+	spin_lock(&ctx->fault_wqh.lock);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	wake_up_poll(&ctx->fd_wqh, POLLHUP);
+	userfaultfd_ctx_put(ctx);
+	return 0;
+}
+
+/* fault_wqh.lock must be held by the caller */
+static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+					  struct userfaultfd_wait_queue **uwq)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *_uwq;
+	unsigned int ret = 0;
+
+	VM_BUG_ON(!spin_is_locked(&ctx->fault_wqh.lock));
+
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (_uwq->pending) {
+			ret = POLLIN;
+			if (!uwq)
+				/*
+				 * If there's at least one pending and
+				 * we don't care which one it is,
+				 * break immediately and leverage the
+				 * efficiency of the LIFO walk.
+				 */
+				break;
+			/*
+			 * If we need to find which one was pending we
+			 * keep walking until we find the first not
+			 * pending one, so we read() them in FIFO order.
+			 */
+			*uwq = _uwq;
+		} else
+			/*
+			 * break the loop at the first not pending
+			 * one, there cannot be pending userfaults
+			 * after the first not pending one, because
+			 * all new pending ones are inserted at the
+			 * head and we walk it in LIFO.
+			 */
+			break;
+	}
+
+	return ret;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	unsigned int ret;
+
+	poll_wait(file, &ctx->fd_wqh, wait);
+
+	switch (ctx->state) {
+	case UFFD_STATE_WAIT_API:
+		return POLLERR;
+	case UFFD_STATE_RUNNING:
+		spin_lock(&ctx->fault_wqh.lock);
+		ret = find_userfault(ctx, NULL);
+		spin_unlock(&ctx->fault_wqh.lock);
+		return ret;
+	default:
+		BUG();
+	}
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+				    __u64 *addr)
+{
+	ssize_t ret;
+	DECLARE_WAITQUEUE(wait, current);
+	struct userfaultfd_wait_queue *uwq = NULL;
+
+	/* always take the fd_wqh lock before the fault_wqh lock */
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		spin_lock(&ctx->fault_wqh.lock);
+		if (find_userfault(ctx, &uwq)) {
+			/*
+			 * The fault_wqh.lock prevents the uwq from
+			 * disappearing from under us.
+			 */
+			uwq->pending = false;
+			/* careful to always initialize addr if ret == 0 */
+			*addr = uwq->address;
+			spin_unlock(&ctx->fault_wqh.lock);
+			ret = 0;
+			break;
+		}
+		spin_unlock(&ctx->fault_wqh.lock);
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		if (no_wait) {
+			ret = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fd_wqh.lock);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t _ret, ret = 0;
+	/* careful to always initialize addr if ret == 0 */
+	__u64 uninitialized_var(addr);
+	int no_wait = file->f_flags & O_NONBLOCK;
+
+	if (ctx->state == UFFD_STATE_WAIT_API)
+		return -EINVAL;
+	BUG_ON(ctx->state != UFFD_STATE_RUNNING);
+
+	for (;;) {
+		if (count < sizeof(addr))
+			return ret ? ret : -EINVAL;
+		_ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
+		if (_ret < 0)
+			return ret ? ret : _ret;
+		if (put_user(addr, (__u64 __user *) buf))
+			return ret ? ret : -EFAULT;
+		ret += sizeof(addr);
+		buf += sizeof(addr);
+		count -= sizeof(addr);
+		/*
+		 * Allow to read more than one fault at time but only
+		 * block if waiting for the very first one.
+		 */
+		no_wait = O_NONBLOCK;
+	}
+}
+
+static void __wake_userfault(struct userfaultfd_ctx *ctx,
+			     struct userfaultfd_wake_range *range)
+{
+	unsigned long start, end;
+
+	start = range->start;
+	end = range->start + range->len;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	/* wake all in the range and autoremove */
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+	spin_unlock(&ctx->fault_wqh.lock);
+}
+
+static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
+					   struct userfaultfd_wake_range *range)
+{
+	if (waitqueue_active(&ctx->fault_wqh))
+		__wake_userfault(ctx, range);
+}
+
+static __always_inline int validate_range(struct mm_struct *mm,
+					  __u64 start, __u64 len)
+{
+	__u64 task_size = mm->task_size;
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	if (len & ~PAGE_MASK)
+		return -EINVAL;
+	if (!len)
+		return -EINVAL;
+	if (start < mmap_min_addr)
+		return -EINVAL;
+	if (start >= task_size)
+		return -EINVAL;
+	if (len > task_size - start)
+		return -EINVAL;
+	return 0;
+}
+
+static int userfaultfd_register(struct userfaultfd_ctx *ctx,
+				unsigned long arg)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev, *cur;
+	int ret;
+	struct uffdio_register uffdio_register;
+	struct uffdio_register __user *user_uffdio_register;
+	unsigned long vm_flags, new_flags;
+	bool found;
+	unsigned long start, end, vma_end;
+
+	user_uffdio_register = (struct uffdio_register __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_register, user_uffdio_register,
+			   sizeof(uffdio_register)-sizeof(__u64)))
+		goto out;
+
+	ret = -EINVAL;
+	if (!uffdio_register.mode)
+		goto out;
+	if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
+				     UFFDIO_REGISTER_MODE_WP))
+		goto out;
+	vm_flags = 0;
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
+		vm_flags |= VM_UFFD_MISSING;
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+		vm_flags |= VM_UFFD_WP;
+		/*
+		 * FIXME: remove the below error constraint by
+		 * implementing the wprotect tracking mode.
+		 */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = validate_range(mm, uffdio_register.range.start,
+			     uffdio_register.range.len);
+	if (ret)
+		goto out;
+
+	start = uffdio_register.range.start;
+	end = start + uffdio_register.range.len;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma_prev(mm, start, &prev);
+
+	ret = -ENOMEM;
+	if (!vma)
+		goto out_unlock;
+
+	/* check that there's at least one vma in the range */
+	ret = -EINVAL;
+	if (vma->vm_start >= end)
+		goto out_unlock;
+
+	/*
+	 * Search for not compatible vmas.
+	 *
+	 * FIXME: this shall be relaxed later so that it doesn't fail
+	 * on tmpfs backed vmas (in addition to the current allowance
+	 * on anonymous vmas).
+	 */
+	found = false;
+	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+		cond_resched();
+
+		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
+		       !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+
+		/* check not compatible vmas */
+		ret = -EINVAL;
+		if (cur->vm_ops)
+			goto out_unlock;
+
+		/*
+		 * Check that this vma isn't already owned by a
+		 * different userfaultfd. We can't allow more than one
+		 * userfaultfd to own a single vma simultaneously or we
+		 * wouldn't know which one to deliver the userfaults to.
+		 */
+		ret = -EBUSY;
+		if (cur->vm_userfaultfd_ctx.ctx &&
+		    cur->vm_userfaultfd_ctx.ctx != ctx)
+			goto out_unlock;
+
+		found = true;
+	}
+	BUG_ON(!found);
+
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	do {
+		cond_resched();
+
+		BUG_ON(vma->vm_ops);
+		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
+		       vma->vm_userfaultfd_ctx.ctx != ctx);
+
+		/*
+		 * Nothing to do: this vma is already registered into this
+		 * userfaultfd and with the right tracking mode too.
+		 */
+		if (vma->vm_userfaultfd_ctx.ctx == ctx &&
+		    (vma->vm_flags & vm_flags) == vm_flags)
+			goto skip;
+
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+		vma_end = min(end, vma->vm_end);
+
+		new_flags = (vma->vm_flags & ~vm_flags) | vm_flags;
+		prev = vma_merge(mm, prev, start, vma_end, new_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 ((struct vm_userfaultfd_ctx){ ctx }));
+		if (prev) {
+			vma = prev;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret)
+				break;
+		}
+	next:
+		/*
+		 * In the vma_merge() successful mprotect-like case 8:
+		 * the next vma was merged into the current one and
+		 * the current one has not been updated yet.
+		 */
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx.ctx = ctx;
+
+	skip:
+		prev = vma;
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+out_unlock:
+	up_write(&mm->mmap_sem);
+	if (!ret) {
+		/*
+		 * Now that we scanned all vmas we can already tell
+		 * userland which ioctl methods are guaranteed to
+		 * succeed on this range.
+		 */
+		if (put_user(UFFD_API_RANGE_IOCTLS,
+			     &user_uffdio_register->ioctls))
+			ret = -EFAULT;
+	}
+out:
+	return ret;
+}
+
+static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
+				  unsigned long arg)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev, *cur;
+	int ret;
+	struct uffdio_range uffdio_unregister;
+	unsigned long new_flags;
+	bool found;
+	unsigned long start, end, vma_end;
+	const void __user *buf = (void __user *)arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
+		goto out;
+
+	ret = validate_range(mm, uffdio_unregister.start,
+			     uffdio_unregister.len);
+	if (ret)
+		goto out;
+
+	start = uffdio_unregister.start;
+	end = start + uffdio_unregister.len;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma_prev(mm, start, &prev);
+
+	ret = -ENOMEM;
+	if (!vma)
+		goto out_unlock;
+
+	/* check that there's at least one vma in the range */
+	ret = -EINVAL;
+	if (vma->vm_start >= end)
+		goto out_unlock;
+
+	/*
+	 * Search for not compatible vmas.
+	 *
+	 * FIXME: this shall be relaxed later so that it doesn't fail
+	 * on tmpfs backed vmas (in addition to the current allowance
+	 * on anonymous vmas).
+	 */
+	found = false;
+	ret = -EINVAL;
+	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+		cond_resched();
+
+		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
+		       !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+
+		/*
+		 * Check not compatible vmas, not strictly required
+		 * here as not compatible vmas cannot have an
+		 * userfaultfd_ctx registered on them, but this
+		 * provides for more strict behavior to notice
+		 * unregistration errors.
+		 */
+		if (cur->vm_ops)
+			goto out_unlock;
+
+		found = true;
+	}
+	BUG_ON(!found);
+
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	do {
+		cond_resched();
+
+		BUG_ON(vma->vm_ops);
+
+		/*
+		 * Nothing to do: this vma is not registered with any
+		 * userfaultfd, so there is nothing to unregister.
+		 */
+		if (!vma->vm_userfaultfd_ctx.ctx)
+			goto skip;
+
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+		vma_end = min(end, vma->vm_end);
+
+		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+		prev = vma_merge(mm, prev, start, vma_end, new_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_UFFD_CTX);
+		if (prev) {
+			vma = prev;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret)
+				break;
+		}
+	next:
+		/*
+		 * In the vma_merge() successful mprotect-like case 8:
+		 * the next vma was merged into the current one and
+		 * the current one has not been updated yet.
+		 */
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+
+	skip:
+		prev = vma;
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+out_unlock:
+	up_write(&mm->mmap_sem);
+out:
+	return ret;
+}
+
+/*
+ * This is mostly needed to re-wake those userfaults that were still
+ * pending when userland woke them up the first time. We don't wake
+ * the pending ones, because that could cause a blocking read to
+ * block, or a non-blocking read to return -EAGAIN after POLLIN
+ * signalled readiness, leaving userland doubting why POLLIN wasn't
+ * reliable.
+ */
+static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
+			    unsigned long arg)
+{
+	int ret;
+	struct uffdio_range uffdio_wake;
+	struct userfaultfd_wake_range range;
+	const void __user *buf = (void __user *)arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
+	if (ret)
+		goto out;
+
+	range.start = uffdio_wake.start;
+	range.len = uffdio_wake.len;
+
+	/*
+	 * len == 0 means wake all and we don't want to wake all here,
+	 * so check it again to be sure.
+	 */
+	VM_BUG_ON(!range.len);
+
+	wake_userfault(ctx, &range);
+	ret = 0;
+
+out:
+	return ret;
+}
+
+/*
+ * userland asks for a certain API version and we return which bits
+ * and ioctl commands are implemented in this kernel for such API
+ * version or -EINVAL if unknown.
+ */
+static int userfaultfd_api(struct userfaultfd_ctx *ctx,
+			   unsigned long arg)
+{
+	struct uffdio_api uffdio_api;
+	void __user *buf = (void __user *)arg;
+	int ret;
+
+	ret = -EINVAL;
+	if (ctx->state != UFFD_STATE_WAIT_API)
+		goto out;
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
+		goto out;
+	if (uffdio_api.api != UFFD_API) {
+		/* careful not to leak info, we only read the first 8 bytes */
+		memset(&uffdio_api, 0, sizeof(uffdio_api));
+		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
+			goto out;
+		ret = -EINVAL;
+		goto out;
+	}
+	/* careful not to leak info, we only read the first 8 bytes */
+	uffdio_api.bits = UFFD_API_BITS;
+	uffdio_api.ioctls = UFFD_API_IOCTLS;
+	ret = -EFAULT;
+	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
+		goto out;
+	ctx->state = UFFD_STATE_RUNNING;
+	ret = 0;
+out:
+	return ret;
+}
+
+static long userfaultfd_ioctl(struct file *file, unsigned cmd,
+			      unsigned long arg)
+{
+	int ret = -EINVAL;
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	switch(cmd) {
+	case UFFDIO_API:
+		ret = userfaultfd_api(ctx, arg);
+		break;
+	case UFFDIO_REGISTER:
+		ret = userfaultfd_register(ctx, arg);
+		break;
+	case UFFDIO_UNREGISTER:
+		ret = userfaultfd_unregister(ctx, arg);
+		break;
+	case UFFDIO_WAKE:
+		ret = userfaultfd_wake(ctx, arg);
+		break;
+	}
+	return ret;
+}
+
+#ifdef CONFIG_PROC_FS
+static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct userfaultfd_ctx *ctx = f->private_data;
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long pending = 0, total = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			pending++;
+		total++;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * If more protocols are added in the future, they will all be
+	 * shown separated by a space, like this:
+	 *	protocols: aa:... bb:...
+	 */
+	seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
+		   pending, total, UFFD_API, UFFD_API_BITS,
+		   UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= userfaultfd_show_fdinfo,
+#endif
+	.release	= userfaultfd_release,
+	.poll		= userfaultfd_poll,
+	.read		= userfaultfd_read,
+	.unlocked_ioctl = userfaultfd_ioctl,
+	.compat_ioctl	= userfaultfd_ioctl,
+	.llseek		= noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates a userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates a userfaultfd file pointer, without
+ * installing it into the fd table. This is useful when the
+ * userfaultfd file is used during the initialization of data
+ * structures that require extra setup after the userfaultfd
+ * creation. So the userfaultfd creation is split into the file
+ * pointer creation phase, and the file descriptor installation
+ * phase.  In this way races with userspace closing the newly
+ * installed file descriptor can be avoided.  Returns a userfaultfd
+ * file pointer, or a proper error pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+	struct file *file;
+	struct userfaultfd_ctx *ctx;
+
+	BUG_ON(!current->mm);
+
+	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+	file = ERR_PTR(-EINVAL);
+	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+		goto out;
+
+	file = ERR_PTR(-ENOMEM);
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto out;
+
+	atomic_set(&ctx->refcount, 1);
+	init_waitqueue_head(&ctx->fault_wqh);
+	init_waitqueue_head(&ctx->fd_wqh);
+	ctx->flags = flags;
+	ctx->state = UFFD_STATE_WAIT_API;
+	ctx->released = false;
+	ctx->mm = current->mm;
+	/* prevent the mm struct from being freed */
+	atomic_inc(&ctx->mm->mm_users);
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
+				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file))
+		kfree(ctx);
+out:
+	return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	int fd, error;
+	struct file *file;
+
+	error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	file = userfaultfd_file_create(flags);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+
+	return error;
+}

* [PATCH 11/23] userfaultfd: Rename uffd_api.bits into .features
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 12/23] userfaultfd: Rename uffd_api.bits into .features fixup Andrea Arcangeli
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

From: Pavel Emelyanov <xemul@parallels.com>

This seems to be the minimal change required to decouple standard
uffd usage from the non-cooperative one. More bits can now be added
to the features field indicating e.g. UFFD_FEATURE_FORK and others
needed for the latter use-case.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c                 |  4 ++--
 include/uapi/linux/userfaultfd.h | 10 ++++++++--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c9be61..9085365 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -856,7 +856,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 		goto out;
 	}
 	/* careful not to leak info, we only read the first 8 bytes */
-	uffdio_api.bits = UFFD_API_BITS;
+	uffdio_api.features = UFFD_API_FEATURES;
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
 	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
@@ -913,7 +913,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 	 *	protocols: aa:... bb:...
 	 */
 	seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
-		   pending, total, UFFD_API, UFFD_API_BITS,
+		   pending, total, UFFD_API, UFFD_API_FEATURES,
 		   UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
 }
 #endif
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9a8cd56..5e1c2f7 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -11,7 +11,7 @@
 
 #define UFFD_API ((__u64)0xAA)
 /* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
-#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -51,12 +51,18 @@
 #define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
 #define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
 
+/*
+ * Features reported in uffdio_api.features field
+ */
+#define UFFD_FEATURE_WRITE_BIT	(1<<0) /* Corresponds to UFFD_BIT_WRITE */
+#define UFFD_FEATURE_WP_BIT	(1<<1) /* Corresponds to UFFD_BIT_WP */
+
 struct uffdio_api {
 	/* userland asks for an API number */
 	__u64 api;
 
 	/* kernel answers below with the available features for the API */
-	__u64 bits;
+	__u64 features;
 	__u64 ioctls;
 };
 

* [PATCH 12/23] userfaultfd: Rename uffd_api.bits into .features fixup
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 11/23] userfaultfd: Rename uffd_api.bits into .features Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 13/23] userfaultfd: change the read API to return a uffd_msg Andrea Arcangeli
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Update comment.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 5e1c2f7..03f21cb 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -10,7 +10,7 @@
 #define _LINUX_USERFAULTFD_H
 
 #define UFFD_API ((__u64)0xAA)
-/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+/* FIXME: add "|UFFD_FEATURE_WP" to UFFD_API_FEATURES after implementing it */
 #define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\

* [PATCH 13/23] userfaultfd: change the read API to return a uffd_msg
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 12/23] userfaultfd: Rename uffd_api.bits into .features fixup Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 14/23] userfaultfd: wake pending userfaults Andrea Arcangeli
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

I had requests to return the full address (not the page aligned one)
to userland.

It's not entirely clear how the page offset could be relevant,
because userfaults aren't like SIGBUS, which can sigjump to a
different place and actually skip resolving the fault depending on a
page offset. There's currently no real way to skip the fault,
especially because after a UFFDIO_COPY|ZEROPAGE the fault is
optimized to be retried within the kernel without having to return to
userland first (not even self-modifying code replacing the .text that
touched the faulting address would prevent the fault from being
repeated). Userland can skip repeating the fault even less if the
fault was triggered by a KVM secondary page fault, or any
get_user_pages, or any copy-user inside some syscall that will return
to kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be set,
leading to a SIGBUS being raised, because the userfault can't wait if
it cannot release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is
required for that).

Still, returning a proper structure to userland during the read() on
the uffd allows the current UFFD_API to be used for the future
non-cooperative extensions too, and it looks cleaner as well. Once we
get additional fields there's no point in returning the fault address
page aligned anymore, so the bits below PAGE_SHIFT can be reused.

The only downside is that each read() will transfer 32 bytes instead
of 8, but that's not going to be measurable overhead.

The total number of new events that can be extended or of new future
bits for already shipped events, is limited to 64 by the features
field of the uffdio_api structure. If more will be needed a bump of
UFFD_API will be required.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/userfaultfd.txt | 12 +++---
 fs/userfaultfd.c                 | 79 +++++++++++++++++++++++-----------------
 include/uapi/linux/userfaultfd.h | 64 ++++++++++++++++++++++++--------
 3 files changed, 102 insertions(+), 53 deletions(-)

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
index c2f5145..3557edd 100644
--- a/Documentation/vm/userfaultfd.txt
+++ b/Documentation/vm/userfaultfd.txt
@@ -46,11 +46,13 @@ is a corner case that would currently return -EBUSY).
 When first opened the userfaultfd must be enabled invoking the
 UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
 a later API version) which will specify the read/POLLIN protocol
-userland intends to speak on the UFFD. The UFFDIO_API ioctl if
-successful (i.e. if the requested uffdio_api.api is spoken also by the
-running kernel), will return into uffdio_api.features and
-uffdio_api.ioctls two 64bit bitmasks of respectively the activated
-feature of the read(2) protocol and the generic ioctl available.
+userland intends to speak on the UFFD and the uffdio_api.features
+userland needs to be enabled. The UFFDIO_API ioctl if successful
+(i.e. if the requested uffdio_api.api is spoken also by the running
+kernel and the requested features are going to be enabled) will return
+into uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
+respectively all the available features of the read(2) protocol and
+the generic ioctl available.
 
 Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
 be invoked (if present in the returned uffdio_api.ioctls bitmask) to
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 9085365..b45cefe 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -50,7 +50,7 @@ struct userfaultfd_ctx {
 };
 
 struct userfaultfd_wait_queue {
-	unsigned long address;
+	struct uffd_msg msg;
 	wait_queue_t wq;
 	bool pending;
 	struct userfaultfd_ctx *ctx;
@@ -77,7 +77,8 @@ static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 	/* len == 0 means wake all */
 	start = range->start;
 	len = range->len;
-	if (len && (start > uwq->address || start + len <= uwq->address))
+	if (len && (start > uwq->msg.arg.pagefault.address ||
+		    start + len <= uwq->msg.arg.pagefault.address))
 		goto out;
 	ret = wake_up_state(wq->private, mode);
 	if (ret)
@@ -122,28 +123,43 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 	}
 }
 
-static inline unsigned long userfault_address(unsigned long address,
-					      unsigned int flags,
-					      unsigned long reason)
+static inline void msg_init(struct uffd_msg *msg)
 {
-	BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
-	address &= PAGE_MASK;
+	BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);
+	/*
+	 * Must use memset to zero out the paddings or kernel data is
+	 * leaked to userland.
+	 */
+	memset(msg, 0, sizeof(struct uffd_msg));
+}
+
+static inline struct uffd_msg userfault_msg(unsigned long address,
+					    unsigned int flags,
+					    unsigned long reason)
+{
+	struct uffd_msg msg;
+	msg_init(&msg);
+	msg.event = UFFD_EVENT_PAGEFAULT;
+	msg.arg.pagefault.address = address;
 	if (flags & FAULT_FLAG_WRITE)
 		/*
-		 * Encode "write" fault information in the LSB of the
-		 * address read by userland, without depending on
-		 * FAULT_FLAG_WRITE kernel internal value.
+		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WRITE was set in the
+		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
+		 * was not set in a UFFD_EVENT_PAGEFAULT, it means it
+		 * was a read fault, otherwise if set it means it's
+		 * a write fault.
 		 */
-		address |= UFFD_BIT_WRITE;
+		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
 	if (reason & VM_UFFD_WP)
 		/*
-		 * Encode "reason" fault information as bit number 1
-		 * in the address read by userland. If bit number 1 is
-		 * clear it means the reason is a VM_FAULT_MISSING
-		 * fault.
+		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
+		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
+		 * not set in a UFFD_EVENT_PAGEFAULT, it means it was
+		 * a missing fault, otherwise if set it means it's a
+		 * write protect fault.
 		 */
-		address |= UFFD_BIT_WP;
-	return address;
+		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
+	return msg;
 }
 
 /*
@@ -229,7 +245,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 
 	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 	uwq.wq.private = current;
-	uwq.address = userfault_address(address, flags, reason);
+	uwq.msg = userfault_msg(address, flags, reason);
 	uwq.pending = true;
 	uwq.ctx = ctx;
 
@@ -385,7 +401,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 }
 
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-				    __u64 *addr)
+				    struct uffd_msg *msg)
 {
 	ssize_t ret;
 	DECLARE_WAITQUEUE(wait, current);
@@ -403,8 +419,8 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			 * disappear from under us.
 			 */
 			uwq->pending = false;
-			/* careful to always initialize addr if ret == 0 */
-			*addr = uwq->address;
+			/* careful to always initialize msg if ret == 0 */
+			*msg = uwq->msg;
 			spin_unlock(&ctx->fault_wqh.lock);
 			ret = 0;
 			break;
@@ -434,8 +450,7 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
 	ssize_t _ret, ret = 0;
-	/* careful to always initialize addr if ret == 0 */
-	__u64 uninitialized_var(addr);
+	struct uffd_msg msg;
 	int no_wait = file->f_flags & O_NONBLOCK;
 
 	if (ctx->state == UFFD_STATE_WAIT_API)
@@ -443,16 +458,16 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 	BUG_ON(ctx->state != UFFD_STATE_RUNNING);
 
 	for (;;) {
-		if (count < sizeof(addr))
+		if (count < sizeof(msg))
 			return ret ? ret : -EINVAL;
-		_ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
+		_ret = userfaultfd_ctx_read(ctx, no_wait, &msg);
 		if (_ret < 0)
 			return ret ? ret : _ret;
-		if (put_user(addr, (__u64 __user *) buf))
+		if (copy_to_user((__u64 __user *) buf, &msg, sizeof(msg)))
 			return ret ? ret : -EFAULT;
-		ret += sizeof(addr);
-		buf += sizeof(addr);
-		count -= sizeof(addr);
+		ret += sizeof(msg);
+		buf += sizeof(msg);
+		count -= sizeof(msg);
 		/*
 		 * Allow to read more than one fault at time but only
 		 * block if waiting for the very first one.
@@ -845,17 +860,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	if (ctx->state != UFFD_STATE_WAIT_API)
 		goto out;
 	ret = -EFAULT;
-	if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
+	if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api)))
 		goto out;
-	if (uffdio_api.api != UFFD_API) {
-		/* careful not to leak info, we only read the first 8 bytes */
+	if (uffdio_api.api != UFFD_API || uffdio_api.features) {
 		memset(&uffdio_api, 0, sizeof(uffdio_api));
 		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
 			goto out;
 		ret = -EINVAL;
 		goto out;
 	}
-	/* careful not to leak info, we only read the first 8 bytes */
 	uffdio_api.features = UFFD_API_FEATURES;
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 03f21cb..8e42bc3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -10,8 +10,12 @@
 #define _LINUX_USERFAULTFD_H
 
 #define UFFD_API ((__u64)0xAA)
-/* FIXME: add "|UFFD_FEATURE_WP" to UFFD_API_FEATURES after implementing it */
-#define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT)
+/*
+ * After implementing the respective features it will become:
+ * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
+ *			      UFFD_FEATURE_EVENT_FORK)
+ */
+#define UFFD_API_FEATURES (0)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -43,26 +47,56 @@
 #define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
 				     struct uffdio_range)
 
-/*
- * Valid bits below PAGE_SHIFT in the userfault address read through
- * the read() syscall.
- */
-#define UFFD_BIT_WRITE	(1<<0)	/* this was a write fault, MISSING or WP */
-#define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
-#define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
+/* read() structure */
+struct uffd_msg {
+	__u8	event;
+
+	union {
+		struct {
+			__u32	flags;
+			__u64	address;
+		} pagefault;
+
+		struct {
+			/* unused reserved fields */
+			__u64	reserved1;
+			__u64	reserved2;
+			__u64	reserved3;
+		} reserved;
+	} arg;
+};
 
 /*
- * Features reported in uffdio_api.features field
+ * Start at 0x12 and not at 0 to be more strict against bugs.
  */
-#define UFFD_FEATURE_WRITE_BIT	(1<<0) /* Corresponds to UFFD_BIT_WRITE */
-#define UFFD_FEATURE_WP_BIT	(1<<1) /* Corresponds to UFFD_BIT_WP */
+#define UFFD_EVENT_PAGEFAULT	0x12
+#if 0 /* not available yet */
+#define UFFD_EVENT_FORK		0x13
+#endif
+
+/* flags for UFFD_EVENT_PAGEFAULT */
+#define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
+#define UFFD_PAGEFAULT_FLAG_WP		(1<<1)	/* If reason is VM_UFFD_WP */
 
 struct uffdio_api {
-	/* userland asks for an API number */
+	/* userland asks for an API number and the features to enable */
 	__u64 api;
-
-	/* kernel answers below with the available features for the API */
+	/*
+	 * Kernel answers below with all the available features for
+	 * the API; this notifies userland of which events and/or
+	 * which flags for each event are enabled in the current
+	 * kernel.
+	 *
+	 * Note: UFFD_EVENT_PAGEFAULT and UFFD_PAGEFAULT_FLAG_WRITE
+	 * are to be considered implicitly always enabled in all kernels as
+	 * long as the uffdio_api.api requested matches UFFD_API.
+	 */
+#if 0 /* not available yet */
+#define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
+#define UFFD_FEATURE_EVENT_FORK			(1<<1)
+#endif
 	__u64 features;
+
 	__u64 ioctls;
 };
 

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 14/23] userfaultfd: wake pending userfaults
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 13/23] userfaultfd: change the read API to return a uffd_msg Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-10-22 12:10   ` Peter Zijlstra
  2015-05-14 17:31 ` [PATCH 15/23] userfaultfd: optimize read() and poll() to be O(1) Andrea Arcangeli
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This is an optimization, but it is a userland-visible one and it
affects the API.

The downside of this optimization is that if you call poll() and you
get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
may be woken by a different thread before read(ufd) comes around. In
short, this means that poll() isn't really usable if the userfaultfd
is opened in blocking mode.

userfaults won't wait in a "pending" state to be read anymore, and any
UFFDIO_WAKE or similar operation that has the objective of waking
userfaults after their resolution will wake all blocked userfaults
for the resolved range, including those that haven't been read() by
userland yet.

The behavior of poll() becomes non-standard, but this obviates the
need for "spurious" UFFDIO_WAKE calls and lets the userland threads
restart immediately without requiring a UFFDIO_WAKE. This is even
more significant in case of repeated faults on the same address from
multiple threads.

This optimization is justified by the measurement that the number of
spurious UFFDIO_WAKE calls accounts for between 5% and 10% of the
total userfaults for heavy workloads, so it's worth optimizing those away.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 65 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 22 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b45cefe..50edbd8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -52,6 +52,10 @@ struct userfaultfd_ctx {
 struct userfaultfd_wait_queue {
 	struct uffd_msg msg;
 	wait_queue_t wq;
+	/*
+	 * Only relevant when queued in fault_wqh and only used by the
+	 * read operation to avoid reading the same userfault twice.
+	 */
 	bool pending;
 	struct userfaultfd_ctx *ctx;
 };
@@ -71,9 +75,6 @@ static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 
 	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 	ret = 0;
-	/* don't wake the pending ones to avoid reads to block */
-	if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
-		goto out;
 	/* len == 0 means wake all */
 	start = range->start;
 	len = range->len;
@@ -183,12 +184,14 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	struct mm_struct *mm = vma->vm_mm;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
+	int ret;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
+	ret = VM_FAULT_SIGBUS;
 	ctx = vma->vm_userfaultfd_ctx.ctx;
 	if (!ctx)
-		return VM_FAULT_SIGBUS;
+		goto out;
 
 	BUG_ON(ctx->mm != mm);
 
@@ -201,7 +204,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * caller of handle_userfault to release the mmap_sem.
 	 */
 	if (unlikely(ACCESS_ONCE(ctx->released)))
-		return VM_FAULT_SIGBUS;
+		goto out;
 
 	/*
 	 * Check that we can return VM_FAULT_RETRY.
@@ -227,15 +230,16 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 			dump_stack();
 		}
 #endif
-		return VM_FAULT_SIGBUS;
+		goto out;
 	}
 
 	/*
 	 * Handle nowait, not much to do other than tell it to retry
 	 * and wait.
 	 */
+	ret = VM_FAULT_RETRY;
 	if (flags & FAULT_FLAG_RETRY_NOWAIT)
-		return VM_FAULT_RETRY;
+		goto out;
 
 	/* take the reference before dropping the mmap_sem */
 	userfaultfd_ctx_get(ctx);
@@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * through poll/read().
 	 */
 	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
-	for (;;) {
-		set_current_state(TASK_KILLABLE);
-		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
-		    fatal_signal_pending(current))
-			break;
-		spin_unlock(&ctx->fault_wqh.lock);
+	set_current_state(TASK_KILLABLE);
+	spin_unlock(&ctx->fault_wqh.lock);
 
+	if (likely(!ACCESS_ONCE(ctx->released) &&
+		   !fatal_signal_pending(current))) {
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
+		ret |= VM_FAULT_MAJOR;
+	}
 
+	__set_current_state(TASK_RUNNING);
+	/* see finish_wait() comment for why list_empty_careful() */
+	if (!list_empty_careful(&uwq.wq.task_list)) {
 		spin_lock(&ctx->fault_wqh.lock);
+		list_del_init(&uwq.wq.task_list);
+		spin_unlock(&ctx->fault_wqh.lock);
 	}
-	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
-	__set_current_state(TASK_RUNNING);
-	spin_unlock(&ctx->fault_wqh.lock);
 
 	/*
 	 * ctx may go away after this if the userfault pseudo fd is
@@ -277,7 +283,8 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 */
 	userfaultfd_ctx_put(ctx);
 
-	return VM_FAULT_RETRY;
+out:
+	return ret;
 }
 
 static int userfaultfd_release(struct inode *inode, struct file *file)
@@ -391,6 +398,12 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 	case UFFD_STATE_WAIT_API:
 		return POLLERR;
 	case UFFD_STATE_RUNNING:
+		/*
+		 * poll() never guarantees that read won't block.
+		 * userfaults can be woken before they're read().
+		 */
+		if (unlikely(!(file->f_flags & O_NONBLOCK)))
+			return POLLERR;
 		spin_lock(&ctx->fault_wqh.lock);
 		ret = find_userfault(ctx, NULL);
 		spin_unlock(&ctx->fault_wqh.lock);
@@ -806,11 +819,19 @@ out:
 }
 
 /*
- * This is mostly needed to re-wakeup those userfaults that were still
- * pending when userland wake them up the first time. We don't wake
- * the pending one to avoid blocking reads to block, or non blocking
- * read to return -EAGAIN, if used with POLLIN, to avoid userland
- * doubts on why POLLIN wasn't reliable.
+ * userfaultfd_wake is needed in case a userfault is in flight by the
+ * time a UFFDIO_COPY (or other ioctl variants) completes. The page
+ * may well get mapped and the page fault, if repeated, wouldn't lead
+ * to a userfault anymore, but before scheduling in TASK_KILLABLE mode
+ * handle_userfault() doesn't recheck the pagetables and it doesn't
+ * serialize against UFFDIO_COPY (or other ioctl variants). Ultimately
+ * the knowledge of which pages are mapped is left to userland, which
+ * is responsible for handling the race between read() userfaults and
+ * background UFFDIO_COPY (or other ioctl variants), if done by
+ * separate concurrent threads.
+ *
+ * userfaultfd_wake may be used in combination with the
+ * UFFDIO_*_MODE_DONTWAKE flags to wake up userfaults in batches.
  */
 static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
 			    unsigned long arg)

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 15/23] userfaultfd: optimize read() and poll() to be O(1)
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 14/23] userfaultfd: wake pending userfaults Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 16/23] userfaultfd: allocate the userfaultfd_ctx cacheline aligned Andrea Arcangeli
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This makes read() O(1); poll(), which was already O(1), becomes
lockless as well.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 172 +++++++++++++++++++++++++++++++------------------------
 1 file changed, 98 insertions(+), 74 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 50edbd8..3d26f41 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -35,7 +35,9 @@ enum userfaultfd_state {
 struct userfaultfd_ctx {
 	/* pseudo fd refcounting */
 	atomic_t refcount;
-	/* waitqueue head for the userfaultfd page faults */
+	/* waitqueue head for the pending (i.e. not read) userfaults */
+	wait_queue_head_t fault_pending_wqh;
+	/* waitqueue head for the userfaults */
 	wait_queue_head_t fault_wqh;
 	/* waitqueue head for the pseudo fd to wakeup poll/read */
 	wait_queue_head_t fd_wqh;
@@ -52,11 +54,6 @@ struct userfaultfd_ctx {
 struct userfaultfd_wait_queue {
 	struct uffd_msg msg;
 	wait_queue_t wq;
-	/*
-	 * Only relevant when queued in fault_wqh and only used by the
-	 * read operation to avoid reading the same userfault twice.
-	 */
-	bool pending;
 	struct userfaultfd_ctx *ctx;
 };
 
@@ -250,17 +247,21 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 	uwq.wq.private = current;
 	uwq.msg = userfault_msg(address, flags, reason);
-	uwq.pending = true;
 	uwq.ctx = ctx;
 
-	spin_lock(&ctx->fault_wqh.lock);
+	spin_lock(&ctx->fault_pending_wqh.lock);
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
 	 * through poll/read().
 	 */
-	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
+	/*
+	 * The smp_mb() after __set_current_state prevents the reads
+	 * following the spin_unlock from happening before the
+	 * list_add in __add_wait_queue.
+	 */
 	set_current_state(TASK_KILLABLE);
-	spin_unlock(&ctx->fault_wqh.lock);
+	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	if (likely(!ACCESS_ONCE(ctx->released) &&
 		   !fatal_signal_pending(current))) {
@@ -270,11 +271,28 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	}
 
 	__set_current_state(TASK_RUNNING);
-	/* see finish_wait() comment for why list_empty_careful() */
+
+	/*
+	 * Here we race with the list_del; list_add in
+	 * userfaultfd_ctx_read(). However, because we never run
+	 * list_del_init() to refile across the two lists, the prev
+	 * and next pointers will never point to self. list_add also
+	 * would never let either of the two pointers point to
+	 * self, so list_empty_careful() won't risk seeing both
+	 * pointers pointing to self at any time during the list
+	 * refile. The only case where list_del_init() is called is
+	 * the full removal in the wake function, and there we don't
+	 * re-list_add, so it's fine not to block on the spinlock. The
+	 * uwq on this kernel stack can be released after the list_del_init.
+	 */
 	if (!list_empty_careful(&uwq.wq.task_list)) {
-		spin_lock(&ctx->fault_wqh.lock);
-		list_del_init(&uwq.wq.task_list);
-		spin_unlock(&ctx->fault_wqh.lock);
+		spin_lock(&ctx->fault_pending_wqh.lock);
+		/*
+		 * No need of list_del_init(), the uwq on the stack
+		 * will be freed shortly anyway.
+		 */
+		list_del(&uwq.wq.task_list);
+		spin_unlock(&ctx->fault_pending_wqh.lock);
 	}
 
 	/*
@@ -332,59 +350,38 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	up_write(&mm->mmap_sem);
 
 	/*
-	 * After no new page faults can wait on this fault_wqh, flush
+	 * After no new page faults can wait on this fault_*wqh, flush
 	 * the last page faults that may have been already waiting on
-	 * the fault_wqh.
+	 * the fault_*wqh.
 	 */
-	spin_lock(&ctx->fault_wqh.lock);
+	spin_lock(&ctx->fault_pending_wqh.lock);
+	__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, 0, &range);
 	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
-	spin_unlock(&ctx->fault_wqh.lock);
+	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	wake_up_poll(&ctx->fd_wqh, POLLHUP);
 	userfaultfd_ctx_put(ctx);
 	return 0;
 }
 
-/* fault_wqh.lock must be hold by the caller */
-static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
-					  struct userfaultfd_wait_queue **uwq)
+/* fault_pending_wqh.lock must be held by the caller */
+static inline struct userfaultfd_wait_queue *find_userfault(
+	struct userfaultfd_ctx *ctx)
 {
 	wait_queue_t *wq;
-	struct userfaultfd_wait_queue *_uwq;
-	unsigned int ret = 0;
-
-	VM_BUG_ON(!spin_is_locked(&ctx->fault_wqh.lock));
+	struct userfaultfd_wait_queue *uwq;
 
-	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
-		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
-		if (_uwq->pending) {
-			ret = POLLIN;
-			if (!uwq)
-				/*
-				 * If there's at least a pending and
-				 * we don't care which one it is,
-				 * break immediately and leverage the
-				 * efficiency of the LIFO walk.
-				 */
-				break;
-			/*
-			 * If we need to find which one was pending we
-			 * keep walking until we find the first not
-			 * pending one, so we read() them in FIFO order.
-			 */
-			*uwq = _uwq;
-		} else
-			/*
-			 * break the loop at the first not pending
-			 * one, there cannot be pending userfaults
-			 * after the first not pending one, because
-			 * all new pending ones are inserted at the
-			 * head and we walk it in LIFO.
-			 */
-			break;
-	}
+	VM_BUG_ON(!spin_is_locked(&ctx->fault_pending_wqh.lock));
 
-	return ret;
+	uwq = NULL;
+	if (!waitqueue_active(&ctx->fault_pending_wqh))
+		goto out;
+	/* walk in reverse to provide FIFO behavior to read userfaults */
+	wq = list_last_entry(&ctx->fault_pending_wqh.task_list,
+			     typeof(*wq), task_list);
+	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+out:
+	return uwq;
 }
 
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
@@ -404,9 +401,20 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 		 */
 		if (unlikely(!(file->f_flags & O_NONBLOCK)))
 			return POLLERR;
-		spin_lock(&ctx->fault_wqh.lock);
-		ret = find_userfault(ctx, NULL);
-		spin_unlock(&ctx->fault_wqh.lock);
+		/*
+		 * lockless access to see if there are pending faults.
+		 * __pollwait's last action is the add_wait_queue but
+		 * the spin_unlock would allow the waitqueue_active to
+		 * pass above the actual list_add inside
+		 * add_wait_queue critical section. So use a full
+		 * memory barrier to serialize the list_add write of
+		 * add_wait_queue() with the waitqueue_active read
+		 * below.
+		 */
+		ret = 0;
+		smp_mb();
+		if (waitqueue_active(&ctx->fault_pending_wqh))
+			ret = POLLIN;
 		return ret;
 	default:
 		BUG();
@@ -418,27 +426,34 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 {
 	ssize_t ret;
 	DECLARE_WAITQUEUE(wait, current);
-	struct userfaultfd_wait_queue *uwq = NULL;
+	struct userfaultfd_wait_queue *uwq;
 
-	/* always take the fd_wqh lock before the fault_wqh lock */
+	/* always take the fd_wqh lock before the fault_pending_wqh lock */
 	spin_lock(&ctx->fd_wqh.lock);
 	__add_wait_queue(&ctx->fd_wqh, &wait);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		spin_lock(&ctx->fault_wqh.lock);
-		if (find_userfault(ctx, &uwq)) {
+		spin_lock(&ctx->fault_pending_wqh.lock);
+		uwq = find_userfault(ctx);
+		if (uwq) {
 			/*
-			 * The fault_wqh.lock prevents the uwq to
-			 * disappear from under us.
+			 * The fault_pending_wqh.lock prevents the uwq
+			 * from disappearing from under us.
+			 *
+			 * Refile this userfault from
+			 * fault_pending_wqh to fault_wqh, it's not
+			 * pending anymore after we read it.
 			 */
-			uwq->pending = false;
+			list_del(&uwq->wq.task_list);
+			__add_wait_queue(&ctx->fault_wqh, &uwq->wq);
+
 			/* careful to always initialize msg if ret == 0 */
 			*msg = uwq->msg;
-			spin_unlock(&ctx->fault_wqh.lock);
+			spin_unlock(&ctx->fault_pending_wqh.lock);
 			ret = 0;
 			break;
 		}
-		spin_unlock(&ctx->fault_wqh.lock);
+		spin_unlock(&ctx->fault_pending_wqh.lock);
 		if (signal_pending(current)) {
 			ret = -ERESTARTSYS;
 			break;
@@ -497,16 +512,21 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
 	start = range->start;
 	end = range->start + range->len;
 
-	spin_lock(&ctx->fault_wqh.lock);
+	spin_lock(&ctx->fault_pending_wqh.lock);
 	/* wake all in the range and autoremove */
-	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
-	spin_unlock(&ctx->fault_wqh.lock);
+	if (waitqueue_active(&ctx->fault_pending_wqh))
+		__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, 0,
+				     range);
+	if (waitqueue_active(&ctx->fault_wqh))
+		__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+	spin_unlock(&ctx->fault_pending_wqh.lock);
 }
 
 static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
 					   struct userfaultfd_wake_range *range)
 {
-	if (waitqueue_active(&ctx->fault_wqh))
+	if (waitqueue_active(&ctx->fault_pending_wqh) ||
+	    waitqueue_active(&ctx->fault_wqh))
 		__wake_userfault(ctx, range);
 }
 
@@ -932,14 +952,17 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 	struct userfaultfd_wait_queue *uwq;
 	unsigned long pending = 0, total = 0;
 
-	spin_lock(&ctx->fault_wqh.lock);
+	spin_lock(&ctx->fault_pending_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_pending_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		pending++;
+		total++;
+	}
 	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
 		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
-		if (uwq->pending)
-			pending++;
 		total++;
 	}
-	spin_unlock(&ctx->fault_wqh.lock);
+	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	/*
 	 * If more protocols will be added, there will be all shown
@@ -999,6 +1022,7 @@ static struct file *userfaultfd_file_create(int flags)
 		goto out;
 
 	atomic_set(&ctx->refcount, 1);
+	init_waitqueue_head(&ctx->fault_pending_wqh);
 	init_waitqueue_head(&ctx->fault_wqh);
 	init_waitqueue_head(&ctx->fd_wqh);
 	ctx->flags = flags;

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 16/23] userfaultfd: allocate the userfaultfd_ctx cacheline aligned
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 15/23] userfaultfd: optimize read() and poll() to be O(1) Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 17/23] userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read Andrea Arcangeli
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Use a dedicated slab cache to guarantee cacheline alignment.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 39 +++++++++++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3d26f41..5542fe7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -27,20 +27,26 @@
 #include <linux/ioctl.h>
 #include <linux/security.h>
 
+static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
+
 enum userfaultfd_state {
 	UFFD_STATE_WAIT_API,
 	UFFD_STATE_RUNNING,
 };
 
+/*
+ * Start with fault_pending_wqh and fault_wqh so they're more likely
+ * to be in the same cacheline.
+ */
 struct userfaultfd_ctx {
-	/* pseudo fd refcounting */
-	atomic_t refcount;
 	/* waitqueue head for the pending (i.e. not read) userfaults */
 	wait_queue_head_t fault_pending_wqh;
 	/* waitqueue head for the userfaults */
 	wait_queue_head_t fault_wqh;
 	/* waitqueue head for the pseudo fd to wakeup poll/read */
 	wait_queue_head_t fd_wqh;
+	/* pseudo fd refcounting */
+	atomic_t refcount;
 	/* userfaultfd syscall flags */
 	unsigned int flags;
 	/* state machine */
@@ -117,7 +123,7 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 		VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
 		VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
 		mmput(ctx->mm);
-		kfree(ctx);
+		kmem_cache_free(userfaultfd_ctx_cachep, ctx);
 	}
 }
 
@@ -987,6 +993,15 @@ static const struct file_operations userfaultfd_fops = {
 	.llseek		= noop_llseek,
 };
 
+static void init_once_userfaultfd_ctx(void *mem)
+{
+	struct userfaultfd_ctx *ctx = (struct userfaultfd_ctx *) mem;
+
+	init_waitqueue_head(&ctx->fault_pending_wqh);
+	init_waitqueue_head(&ctx->fault_wqh);
+	init_waitqueue_head(&ctx->fd_wqh);
+}
+
 /**
  * userfaultfd_file_create - Creates an userfaultfd file pointer.
  * @flags: Flags for the userfaultfd file.
@@ -1017,14 +1032,11 @@ static struct file *userfaultfd_file_create(int flags)
 		goto out;
 
 	file = ERR_PTR(-ENOMEM);
-	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
 	if (!ctx)
 		goto out;
 
 	atomic_set(&ctx->refcount, 1);
-	init_waitqueue_head(&ctx->fault_pending_wqh);
-	init_waitqueue_head(&ctx->fault_wqh);
-	init_waitqueue_head(&ctx->fd_wqh);
 	ctx->flags = flags;
 	ctx->state = UFFD_STATE_WAIT_API;
 	ctx->released = false;
@@ -1035,7 +1047,7 @@ static struct file *userfaultfd_file_create(int flags)
 	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
 				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
 	if (IS_ERR(file))
-		kfree(ctx);
+		kmem_cache_free(userfaultfd_ctx_cachep, ctx);
 out:
 	return file;
 }
@@ -1064,3 +1076,14 @@ err_put_unused_fd:
 
 	return error;
 }
+
+static int __init userfaultfd_init(void)
+{
+	userfaultfd_ctx_cachep = kmem_cache_create("userfaultfd_ctx_cache",
+						sizeof(struct userfaultfd_ctx),
+						0,
+						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+						init_once_userfaultfd_ctx);
+	return 0;
+}
+__initcall(userfaultfd_init);

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 17/23] userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 16/23] userfaultfd: allocate the userfaultfd_ctx cacheline aligned Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 18/23] userfaultfd: buildsystem activation Andrea Arcangeli
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However, we can also
solve it in the kernel.

Requiring all users to solve this race whenever they use two threads
(one for the background transfer and one for the userfault reads)
isn't very attractive from an API perspective; furthermore, this
allows removing a whole bunch of mutex and bitmap code from qemu,
making it faster. The cost of __get_user_pages_fast should be
insignificant considering that it scales perfectly and the pagetables
are already hot in the CPU cache, compared to the overhead in userland
of maintaining those structures.

Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.

Without this patch, qemu's background transfer thread has to read the
old state and do UFFDIO_WAKE if old_state was MISSING but became
REQUESTED by the time it tries to set it to RECEIVED (signaling that
the other side received a userfault).

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

			postcopy_place_page
			read old_state -> MISSING
 			UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

 					poll() -> POLLIN
 					read() -> 0x7fb76a139000
 					postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

 			tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
			/* check that no userfault raced with UFFDIO_COPY */
			if (old_state == MISSING && tmp_state == REQUESTED)
				UFFDIO_WAKE from background thread

A second case where UFFDIO_WAKE would be needed is in the userfault thread:

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

			postcopy_place_page
			read old_state -> MISSING
 			UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
 			tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

 					poll() -> POLLIN
 					read() -> 0x7fb76a139000

 					if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
						UFFDIO_WAKE from userfault thread

This patch removes the need for both UFFDIO_WAKE and the associated
per-page tristate.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 66 insertions(+), 15 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5542fe7..6772c22 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -167,6 +167,67 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
 }
 
 /*
+ * Verify the pagetables are still not ok after having registered into
+ * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
+ * userfault that has already been resolved, if userfaultfd_read and
+ * UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different
+ * threads.
+ */
+static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
+					 unsigned long address,
+					 unsigned long flags,
+					 unsigned long reason)
+{
+	struct mm_struct *mm = ctx->mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, _pmd;
+	pte_t *pte;
+	bool ret = true;
+
+	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+	pmd = pmd_offset(pud, address);
+	/*
+	 * READ_ONCE must function as a barrier with narrower scope
+	 * and it must be equivalent to:
+	 *	_pmd = *pmd; barrier();
+	 *
+	 * This is to deal with the instability (as in
+	 * pmd_trans_unstable) of the pmd.
+	 */
+	_pmd = READ_ONCE(*pmd);
+	if (!pmd_present(_pmd))
+		goto out;
+
+	ret = false;
+	if (pmd_trans_huge(_pmd))
+		goto out;
+
+	/*
+	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
+	 * and use the standard pte_offset_map() instead of parsing _pmd.
+	 */
+	pte = pte_offset_map(pmd, address);
+	/*
+	 * Lockless access: we're in a wait_event so it's ok if it
+	 * changes under us.
+	 */
+	if (pte_none(*pte))
+		ret = true;
+	pte_unmap(pte);
+
+out:
+	return ret;
+}
+
+/*
  * The locking rules involved in returning VM_FAULT_RETRY depending on
  * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
  * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
@@ -188,6 +249,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	int ret;
+	bool must_wait;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -247,9 +309,6 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	/* take the reference before dropping the mmap_sem */
 	userfaultfd_ctx_get(ctx);
 
-	/* be gentle and immediately relinquish the mmap_sem */
-	up_read(&mm->mmap_sem);
-
 	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 	uwq.wq.private = current;
 	uwq.msg = userfault_msg(address, flags, reason);
@@ -269,7 +328,10 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	set_current_state(TASK_KILLABLE);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
-	if (likely(!ACCESS_ONCE(ctx->released) &&
+	must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
+	up_read(&mm->mmap_sem);
+
+	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
 		   !fatal_signal_pending(current))) {
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
@@ -845,17 +907,6 @@ out:
 }
 
 /*
- * userfaultfd_wake is needed in case an userfault is in flight by the
- * time a UFFDIO_COPY (or other ioctl variants) completes. The page
- * may be well get mapped and the page fault if repeated wouldn't lead
- * to a userfault anymore, but before scheduling in TASK_KILLABLE mode
- * handle_userfault() doesn't recheck the pagetables and it doesn't
- * serialize against UFFDO_COPY (or other ioctl variants). Ultimately
- * the knowledge of which pages are mapped is left to userland who is
- * responsible for handling the race between read() userfaults and
- * background UFFDIO_COPY (or other ioctl variants), if done by
- * separate concurrent threads.
- *
  * userfaultfd_wake may be used in combination with the
  * UFFDIO_*_MODE_DONTWAKE to wakeup userfaults in batches.
  */

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 18/23] userfaultfd: buildsystem activation
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 17/23] userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 19/23] userfaultfd: activate syscall Andrea Arcangeli
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This allows userfaultfd to be selected at build configuration time.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/Makefile  |  1 +
 init/Kconfig | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/fs/Makefile b/fs/Makefile
index cb92fd4..53e59b2 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
+obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
diff --git a/init/Kconfig b/init/Kconfig
index 88ad043..e1c9d9e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1644,6 +1644,17 @@ config ADVISE_SYSCALLS
 	  applications use these syscalls, you can disable this option to save
 	  space.
 
+config USERFAULTFD
+	bool "Enable userfaultfd() system call"
+	select ANON_INODES
+	default y
+	depends on MMU
+	help
+	  Enable the userfaultfd() system call that allows userland to
+	  intercept and handle page faults.
+
+	  If unsure, say Y.
+
 config PCI_QUIRKS
 	default y
 	bool "Enable PCI quirk workarounds" if EXPERT


* [PATCH 19/23] userfaultfd: activate syscall
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 18/23] userfaultfd: buildsystem activation Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-08-11 10:07   ` [Qemu-devel] " Bharata B Rao
  2015-05-14 17:31 ` [PATCH 20/23] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This activates the userfaultfd syscall.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 arch/x86/syscalls/syscall_32.tbl       | 1 +
 arch/x86/syscalls/syscall_64.tbl       | 1 +
 include/linux/syscalls.h               | 1 +
 kernel/sys_ni.c                        | 1 +
 6 files changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index f1863a1..4741b15 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -368,3 +368,4 @@ SYSCALL_SPU(memfd_create)
 SYSCALL_SPU(bpf)
 COMPAT_SYS(execveat)
 PPC64ONLY(switch_endian)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index e4aa173..6ad58d4 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -386,5 +386,6 @@
 #define __NR_bpf		361
 #define __NR_execveat		362
 #define __NR_switch_endian	363
+#define __NR_userfaultfd	364
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index ef8187f..dcc18ea 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 9ef32d5..81c4906 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bb51bec..8ee1d0c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -810,6 +810,7 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..8b3e10ea 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -218,6 +218,7 @@ cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
 cond_syscall(sys_memfd_create);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);


* [PATCH 20/23] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 19/23] userfaultfd: activate syscall Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 21/23] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 42 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 8e42bc3..c8a543f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -21,7 +21,9 @@
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
 	 (__u64)1 << _UFFDIO_API)
 #define UFFD_API_RANGE_IOCTLS			\
-	((__u64)1 << _UFFDIO_WAKE)
+	((__u64)1 << _UFFDIO_WAKE |		\
+	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_ZEROPAGE)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -34,6 +36,8 @@
 #define _UFFDIO_REGISTER		(0x00)
 #define _UFFDIO_UNREGISTER		(0x01)
 #define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_COPY			(0x03)
+#define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -46,6 +50,10 @@
 				     struct uffdio_range)
 #define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
 				     struct uffdio_range)
+#define UFFDIO_COPY		_IOWR(UFFDIO, _UFFDIO_COPY,	\
+				      struct uffdio_copy)
+#define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
+				      struct uffdio_zeropage)
 
 /* read() structure */
 struct uffd_msg {
@@ -118,4 +126,36 @@ struct uffdio_register {
 	__u64 ioctls;
 };
 
+struct uffdio_copy {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * There will be a wrprotection flag later that allows to map
+	 * pages wrprotected on the fly. And such a flag will be
+	 * available if the wrprotection ioctl are implemented for the
+	 * range according to the uffdio_register.ioctls.
+	 */
+#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * "copy" is written by the ioctl and must be at the end: the
+	 * copy_from_user will not read the last 8 bytes.
+	 */
+	__s64 copy;
+};
+
+struct uffdio_zeropage {
+	struct uffdio_range range;
+#define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * "zeropage" is written by the ioctl and must be at the end:
+	 * the copy_from_user will not read the last 8 bytes.
+	 */
+	__s64 zeropage;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */


* [PATCH 21/23] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 20/23] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-14 17:31 ` [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic Andrea Arcangeli
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This implements mcopy_atomic and mfill_zeropage, the lowlevel VM
methods invoked respectively by the UFFDIO_COPY and UFFDIO_ZEROPAGE
userfaultfd commands.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h |   6 +
 mm/Makefile                   |   1 +
 mm/userfaultfd.c              | 269 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 276 insertions(+)
 create mode 100644 mm/userfaultfd.c

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e1e4360..587480a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -30,6 +30,12 @@
 extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 			    unsigned int flags, unsigned long reason);
 
+extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+			    unsigned long src_start, unsigned long len);
+extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
+			      unsigned long dst_start,
+			      unsigned long len);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 					struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/Makefile b/mm/Makefile
index 98c4eae..b424d5e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
new file mode 100644
index 0000000..c54c761
--- /dev/null
+++ b/mm/userfaultfd.c
@@ -0,0 +1,269 @@
+/*
+ *  mm/userfaultfd.c
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/mmu_notifier.h>
+#include <asm/tlbflush.h>
+#include "internal.h"
+
+static int mcopy_atomic_pte(struct mm_struct *dst_mm,
+			    pmd_t *dst_pmd,
+			    struct vm_area_struct *dst_vma,
+			    unsigned long dst_addr,
+			    unsigned long src_addr)
+{
+	struct mem_cgroup *memcg;
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+	struct page *page;
+	void *page_kaddr;
+	int ret;
+
+	ret = -ENOMEM;
+	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
+	if (!page)
+		goto out;
+
+	page_kaddr = kmap(page);
+	ret = -EFAULT;
+	if (copy_from_user(page_kaddr, (const void __user *) src_addr,
+			   PAGE_SIZE))
+		goto out_kunmap_release;
+	kunmap(page);
+
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * preceding stores to the page contents become visible before
+	 * the set_pte_at() write.
+	 */
+	__SetPageUptodate(page);
+
+	ret = -ENOMEM;
+	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+		goto out_release;
+
+	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	if (dst_vma->vm_flags & VM_WRITE)
+		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+
+	ret = -EEXIST;
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!pte_none(*dst_pte))
+		goto out_release_uncharge_unlock;
+
+	inc_mm_counter(dst_mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, dst_vma, dst_addr);
+	mem_cgroup_commit_charge(page, memcg, false);
+	lru_cache_add_active_or_unevictable(page, dst_vma);
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+
+	pte_unmap_unlock(dst_pte, ptl);
+	ret = 0;
+out:
+	return ret;
+out_release_uncharge_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	mem_cgroup_cancel_charge(page, memcg);
+out_release:
+	page_cache_release(page);
+	goto out;
+out_kunmap_release:
+	kunmap(page);
+	goto out_release;
+}
+
+static int mfill_zeropage_pte(struct mm_struct *dst_mm,
+			      pmd_t *dst_pmd,
+			      struct vm_area_struct *dst_vma,
+			      unsigned long dst_addr)
+{
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+	int ret;
+
+	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
+					 dst_vma->vm_page_prot));
+	ret = -EEXIST;
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!pte_none(*dst_pte))
+		goto out_unlock;
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd = NULL;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (pud)
+		/*
+		 * Note that we don't run this only when the pmd is
+		 * missing: *pmd may be already established and in
+		 * turn it may also be a trans_huge_pmd.
+		 */
+		pmd = pmd_alloc(mm, pud, address);
+	return pmd;
+}
+
+static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
+					      unsigned long dst_start,
+					      unsigned long src_start,
+					      unsigned long len,
+					      bool zeropage)
+{
+	struct vm_area_struct *dst_vma;
+	ssize_t err;
+	pmd_t *dst_pmd;
+	unsigned long src_addr, dst_addr;
+	long copied = 0;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(dst_start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(src_start + len <= src_start);
+	BUG_ON(dst_start + len <= dst_start);
+
+	down_read(&dst_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the dst range is
+	 * both valid and fully within a single existing vma.
+	 */
+	err = -EINVAL;
+	dst_vma = find_vma(dst_mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	/*
+	 * Be strict and only allow __mcopy_atomic on userfaultfd
+	 * registered ranges to prevent userland errors going
+	 * unnoticed. As far as the VM consistency is concerned, it
+	 * would be perfectly safe to remove this check, but there's
+	 * no useful usage for __mcopy_atomic outside of userfaultfd
+	 * registered ranges. This is after all why these are ioctls
+	 * belonging to the userfaultfd and not syscalls.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx)
+		goto out;
+
+	/*
+	 * FIXME: only allow copying on anonymous vmas, tmpfs should
+	 * be added.
+	 */
+	if (dst_vma->vm_ops)
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma or this page
+	 * would get a NULL anon_vma when moved in the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(dst_mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If a huge pmd materialized from under us, fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+
+		if (!zeropage)
+			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
+					       dst_addr, src_addr);
+		else
+			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
+						 dst_addr);
+
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			copied += PAGE_SIZE;
+
+			if (fatal_signal_pending(current))
+				err = -EINTR;
+		}
+		if (err)
+			break;
+	}
+
+out:
+	up_read(&dst_mm->mmap_sem);
+	BUG_ON(copied < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!copied && !err);
+	return copied ? copied : err;
+}
+
+ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+		     unsigned long src_start, unsigned long len)
+{
+	return __mcopy_atomic(dst_mm, dst_start, src_start, len, false);
+}
+
+ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
+		       unsigned long len)
+{
+	return __mcopy_atomic(dst_mm, start, 0, len, true);
+}


* [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 21/23] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-22 20:18   ` Andrew Morton
  2015-05-14 17:31 ` [PATCH 23/23] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
                   ` (3 subsequent siblings)
  25 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Recursively taking mmap_sem for reading wasn't strictly a bug as long
as the rwsem cannot starve writers, but lockdep doesn't like it, and
avoiding the recursion also avoids depending on lowlevel
implementation details of the lock.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/userfaultfd.c | 92 ++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 66 insertions(+), 26 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c54c761..f8fe494 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -21,26 +21,39 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    pmd_t *dst_pmd,
 			    struct vm_area_struct *dst_vma,
 			    unsigned long dst_addr,
-			    unsigned long src_addr)
+			    unsigned long src_addr,
+			    struct page **pagep)
 {
 	struct mem_cgroup *memcg;
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
-	struct page *page;
 	void *page_kaddr;
 	int ret;
+	struct page *page;
 
-	ret = -ENOMEM;
-	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
-	if (!page)
-		goto out;
-
-	page_kaddr = kmap(page);
-	ret = -EFAULT;
-	if (copy_from_user(page_kaddr, (const void __user *) src_addr,
-			   PAGE_SIZE))
-		goto out_kunmap_release;
-	kunmap(page);
+	if (!*pagep) {
+		ret = -ENOMEM;
+		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
+		if (!page)
+			goto out;
+
+		page_kaddr = kmap_atomic(page);
+		ret = copy_from_user(page_kaddr,
+				     (const void __user *) src_addr,
+				     PAGE_SIZE);
+		kunmap_atomic(page_kaddr);
+
+		/* fallback to copy_from_user outside mmap_sem */
+		if (unlikely(ret)) {
+			ret = -EFAULT;
+			*pagep = page;
+			/* don't free the page */
+			goto out;
+		}
+	} else {
+		page = *pagep;
+		*pagep = NULL;
+	}
 
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
@@ -82,9 +95,6 @@ out_release_uncharge_unlock:
 out_release:
 	page_cache_release(page);
 	goto out;
-out_kunmap_release:
-	kunmap(page);
-	goto out_release;
 }
 
 static int mfill_zeropage_pte(struct mm_struct *dst_mm,
@@ -139,7 +149,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	ssize_t err;
 	pmd_t *dst_pmd;
 	unsigned long src_addr, dst_addr;
-	long copied = 0;
+	long copied;
+	struct page *page;
 
 	/*
 	 * Sanitize the command parameters:
@@ -151,6 +162,11 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	BUG_ON(src_start + len <= src_start);
 	BUG_ON(dst_start + len <= dst_start);
 
+	src_addr = src_start;
+	dst_addr = dst_start;
+	copied = 0;
+	page = NULL;
+retry:
 	down_read(&dst_mm->mmap_sem);
 
 	/*
@@ -160,10 +176,10 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	err = -EINVAL;
 	dst_vma = find_vma(dst_mm, dst_start);
 	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
-		goto out;
+		goto out_unlock;
 	if (dst_start < dst_vma->vm_start ||
 	    dst_start + len > dst_vma->vm_end)
-		goto out;
+		goto out_unlock;
 
 	/*
 	 * Be strict and only allow __mcopy_atomic on userfaultfd
@@ -175,14 +191,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 * belonging to the userfaultfd and not syscalls.
 	 */
 	if (!dst_vma->vm_userfaultfd_ctx.ctx)
-		goto out;
+		goto out_unlock;
 
 	/*
 	 * FIXME: only allow copying on anonymous vmas, tmpfs should
 	 * be added.
 	 */
 	if (dst_vma->vm_ops)
-		goto out;
+		goto out_unlock;
 
 	/*
	 * Ensure the dst_vma has an anon_vma or this page
@@ -191,12 +207,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 */
 	err = -ENOMEM;
 	if (unlikely(anon_vma_prepare(dst_vma)))
-		goto out;
+		goto out_unlock;
 
-	for (src_addr = src_start, dst_addr = dst_start;
-	     src_addr < src_start + len; ) {
+	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
+
 		BUG_ON(dst_addr >= dst_start + len);
+
 		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
 		if (unlikely(!dst_pmd)) {
 			err = -ENOMEM;
@@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 		if (!zeropage)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					       dst_addr, src_addr);
+					       dst_addr, src_addr, &page);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
 						 dst_addr);
 
 		cond_resched();
 
+		if (unlikely(err == -EFAULT)) {
+			void *page_kaddr;
+
+			BUILD_BUG_ON(zeropage);
+			up_read(&dst_mm->mmap_sem);
+			BUG_ON(!page);
+
+			page_kaddr = kmap(page);
+			err = copy_from_user(page_kaddr,
+					     (const void __user *) src_addr,
+					     PAGE_SIZE);
+			kunmap(page);
+			if (unlikely(err)) {
+				err = -EFAULT;
+				goto out;
+			}
+			goto retry;
+		} else
+			BUG_ON(page);
+
 		if (!err) {
 			dst_addr += PAGE_SIZE;
 			src_addr += PAGE_SIZE;
@@ -248,8 +285,11 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			break;
 	}
 
-out:
+out_unlock:
 	up_read(&dst_mm->mmap_sem);
+out:
+	if (page)
+		page_cache_release(page);
 	BUG_ON(copied < 0);
 	BUG_ON(err > 0);
 	BUG_ON(!copied && !err);


* [PATCH 23/23] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (21 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic Andrea Arcangeli
@ 2015-05-14 17:31 ` Andrea Arcangeli
  2015-05-18 14:24 ` [PATCH 00/23] userfaultfd v4 Pavel Emelyanov
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-14 17:31 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

These two ioctls allow the thread that opened the userfaultfd to
resolve userfaults by either atomically copying pages or mapping
zeropages into the virtual address space.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6772c22..65704cb 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -942,6 +942,96 @@ out:
 	return ret;
 }
 
+static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
+			    unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_copy uffdio_copy;
+	struct uffdio_copy __user *user_uffdio_copy;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_copy = (struct uffdio_copy __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_copy, user_uffdio_copy,
+			   /* don't copy "copy" last field */
+			   sizeof(uffdio_copy)-sizeof(__s64)))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
+	if (ret)
+		goto out;
+	/*
+	 * double check for wraparound just in case. copy_from_user()
+	 * will later check uffdio_copy.src + uffdio_copy.len to fit
+	 * in the userland range.
+	 */
+	ret = -EINVAL;
+	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
+		goto out;
+	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+		goto out;
+
+	ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
+			   uffdio_copy.len);
+	if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	BUG_ON(!ret);
+	/* len == 0 would wake all */
+	range.len = ret;
+	if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
+		range.start = uffdio_copy.dst;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_copy.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
+static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
+				unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_zeropage uffdio_zeropage;
+	struct uffdio_zeropage __user *user_uffdio_zeropage;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
+			   /* don't copy "zeropage" last field */
+			   sizeof(uffdio_zeropage)-sizeof(__s64)))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
+			     uffdio_zeropage.range.len);
+	if (ret)
+		goto out;
+	ret = -EINVAL;
+	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+		goto out;
+
+	ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
+			     uffdio_zeropage.range.len);
+	if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
+		range.start = uffdio_zeropage.range.start;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -997,6 +1087,12 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_WAKE:
 		ret = userfaultfd_wake(ctx, arg);
 		break;
+	case UFFDIO_COPY:
+		ret = userfaultfd_copy(ctx, arg);
+		break;
+	case UFFDIO_ZEROPAGE:
+		ret = userfaultfd_zeropage(ctx, arg);
+		break;
 	}
 	return ret;
 }


* Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization
  2015-05-14 17:31 ` [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
@ 2015-05-14 17:49   ` Linus Torvalds
  2015-05-15 16:04     ` Andrea Arcangeli
  2015-06-23 19:00   ` Dave Hansen
  1 sibling, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2015-05-14 17:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, qemu-devel,
	KVM list, Linux API, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, May 14, 2015 at 10:31 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> +static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
> +                                          struct userfaultfd_wake_range *range)
> +{
> +       if (waitqueue_active(&ctx->fault_wqh))
> +               __wake_userfault(ctx, range);
> +}

Pretty much every single time people use this "if
(waitqueue_active())" model, it tends to be a bug, because it means
that there is zero serialization with people who are just about to go
to sleep. It's fundamentally racy against all the "wait_event()" loops
that carefully do memory barriers between testing conditions and going
to sleep, because the memory barriers now don't exist on the waking
side.
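The point can be sketched in kernel-style pseudocode (illustrative
only, not compilable code from the patchset):

```
/* waiter side (wait_event-style loop): */
prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE); /* list_add + barrier */
if (!condition)                                  /* re-check after queueing */
        schedule();
finish_wait(&wq, &wait);

/* waker side -- racy as written: */
condition = true;               /* store */
if (waitqueue_active(&wq))      /* without a barrier this load can be
                                 * hoisted before the store above, see
                                 * an empty queue while the waiter is
                                 * between its check and list_add, and
                                 * the wakeup is lost */
        wake_up(&wq);
```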

So I'm making a new rule: if you use waitqueue_active(), I want an
explanation for why it's not racy with the waiter. A big comment about
the memory ordering, or about higher-level locks that are held by the
caller, or something.

                     Linus


* Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization
  2015-05-14 17:49   ` Linus Torvalds
@ 2015-05-15 16:04     ` Andrea Arcangeli
  2015-05-15 18:22       ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-15 16:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, qemu-devel,
	KVM list, Linux API, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, May 14, 2015 at 10:49:06AM -0700, Linus Torvalds wrote:
> On Thu, May 14, 2015 at 10:31 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > +static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
> > +                                          struct userfaultfd_wake_range *range)
> > +{
> > +       if (waitqueue_active(&ctx->fault_wqh))
> > +               __wake_userfault(ctx, range);
> > +}
> 
> Pretty much every single time people use this "if
> (waitqueue_active())" model, it tends to be a bug, because it means
> that there is zero serialization with people who are just about to go
> to sleep. It's fundamentally racy against all the "wait_event()" loops
> that carefully do memory barriers between testing conditions and going
> to sleep, because the memory barriers now don't exist on the waking
> side.

As far as 10/23 is concerned, __wake_userfault, which takes the locks,
would also ignore any "incoming" userfault not yet read(2) (as in the
read syscall). So there was no race to worry about, as "incoming"
userfaults would be ignored by the wake anyway.

The only case that had to be reliable in not missing wakeup events was
the "release" file operation. But that doesn't use waitqueue_active;
it relies on ctx->released combined with mmap_sem taken for writing.

    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/tree/fs/userfaultfd.c?h=userfault&id=2f73ffa8267e41f04fe6b0d93d23feed45c0980a#n267

However, later (after I started documenting what userland should do) I
started to question this old model of leaving the race handling to
userland. It's way too complex, and it's also inefficient because
there would be lots of spurious userfaults that could have been woken
up within the kernel before userland had a chance to read them.

So then I handled the race in the kernel in patch 17/23. That allowed
dropping hundreds of lines of locking code from qemu (two bits per
page plus a mutex and lots of complexity), and then I didn't need to
document the complex rules described in this commit header:

     http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=5a2b3614e107482da9a1fcfcd120d30eb62d45dc

You can see from the patch that it complicated the kernel by adding a
pagetable walk, but it's worth it because it's simpler than solving it
in userland, plus it's solved at once for all users.

The only con is that the old model would allow userland to be way more
consistent in enforcing asserts, as it had the control.

With the current model POLLIN may be returned from poll() despite the
later read returning -EAGAIN, so it basically requires opening the fd
non-blocking if people use poll(). (Blocking reads without poll() are
still fine.)

Now back to your question: the waitqueue_active is still there in the
current model as well (since 17/23).

Before 17/23, losing a wakeup (as far as qemu is concerned) would mean
that qemu would read the address of the fault that didn't get woken up
because of the race, and then it would send a page request to the
source node (it had no state on the destination node, where the
userfault runs, to know if it was a "dup"); the source would discard
the request (not sending the page twice) after noticing in its simple
per-page bitmap that the page was already sent. So with the new model
after 17/23, losing a wakeup in qemu is a bug.

The wait_event is like this:

	handle_userfault (wait_event)
	-------
	spin_lock(&ctx->fault_pending_wqh.lock);
	/*
	 * After the __add_wait_queue the uwq is visible to userland
	 * through poll/read().
	 */
	__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
	/*
	 * The smp_mb() after __set_current_state prevents the reads
	 * following the spin_unlock to happen before the list_add in
	 * __add_wait_queue.
	 */
	set_current_state(TASK_KILLABLE);
	spin_unlock(&ctx->fault_pending_wqh.lock);

	must_wait = userfaultfd_must_wait(ctx, address, flags, reason);

	up_read(&mm->mmap_sem);

	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
		   !fatal_signal_pending(current))) {
		wake_up_poll(&ctx->fd_wqh, POLLIN);
		schedule();

The wakeup side is:

	userfaultfd_copy
	-----------
	ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
			   uffdio_copy.len);
	if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
		return -EFAULT;
	if (ret < 0)
		goto out;
	BUG_ON(!ret);
	/* len == 0 would wake all */
	range.len = ret;
	if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
		range.start = uffdio_copy.dst;
		wake_userfault(ctx, &range);


wake_userfault is the function that is using waitqueue_active, so the
problem materializes if waitqueue_active moves before
mcopy_atomic. Precisely, it would have to move before the set_pte_at
below:

	mcopy_atomic_pte
	------
	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(dst_vma, dst_addr, dst_pte);

	pte_unmap_unlock(dst_pte, ptl);

The unlock would allow the read to be reordered before it, even on
x86. There's an up_read in between as well, but it has the same
problem. So you're right that theoretically we can miss a wakeup.

In practice it sounds unlikely because of the sheer size of
mcopy_atomic, and we never experienced it, but it's still a bug.

>  So I'm making a new rule: if you use waitqueue_active(), I want an
> explanation for why it's not racy with the waiter. A big comment
> about the memory ordering, or about higher-level locks that are held
> by the caller, or something.  Linus

To fix it I added this along a comment:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c89e96f..6be316b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -593,6 +593,21 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
 static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
 					   struct userfaultfd_wake_range *range)
 {
+	/*
+	 * To be sure waitqueue_active() is not reordered by the CPU
+	 * before the pagetable update, use an explicit SMP memory
+	 * barrier here. PT lock release or up_read(mmap_sem) still
+	 * have release semantics that can allow the
+	 * waitqueue_active() to be reordered before the pte update.
+	 */
+	smp_mb();
+
+	/*
+	 * Use waitqueue_active because it's very frequent to
+	 * change the address space atomically even if there are no
+	 * userfaults yet. So we take the spinlock only when we're
+	 * sure we've userfaults to wake.
+	 */
 	if (waitqueue_active(&ctx->fault_pending_wqh) ||
 	    waitqueue_active(&ctx->fault_wqh))
 		__wake_userfault(ctx, range);

The wait_event/handle_userfault side already has an smp_mb() to
prevent the lockless pagetable walk from being reordered before the
list_add in __add_wait_queue (needed precisely because of the
waitqueue_active optimization that I'd like to keep).

Thanks,
Andrea


* Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization
  2015-05-15 16:04     ` Andrea Arcangeli
@ 2015-05-15 18:22       ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2015-05-15 18:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, qemu-devel,
	KVM list, Linux API, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Fri, May 15, 2015 at 9:04 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> To fix it I added this along a comment:

Ok, this looks good as an explanation/fix for the races (and also as
an example of my worry about waitqueue_active() use in general).

However, it now makes me suspect that the optimistic "let's check if
they are even active" may not be worth it any more. You're adding a
"smp_mb()" in order to avoid taking the real lock. Although I guess
there are two locks there (one for each wait-queue) so maybe it's
worth it.

                    Linus


* Re: [PATCH 00/23] userfaultfd v4
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (22 preceding siblings ...)
  2015-05-14 17:31 ` [PATCH 23/23] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
@ 2015-05-18 14:24 ` Pavel Emelyanov
  2015-05-19 21:38 ` Andrew Morton
  2015-05-21 13:11 ` Kirill Smelkov
  25 siblings, 0 replies; 63+ messages in thread
From: Pavel Emelyanov @ 2015-05-18 14:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, linux-kernel, linux-mm,
	qemu-devel, kvm, linux-api
  Cc: Sanidhya Kashyap, zhang.zhanghailiang, Linus Torvalds,
	Kirill A. Shutemov, Andres Lagar-Cavilla, Dave Hansen,
	Paolo Bonzini, Rik van Riel, Mel Gorman, Andy Lutomirski,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Johannes Weiner, Huangpeng (Peter)

On 05/14/2015 08:30 PM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
> 
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
> 
> Userfaults allow to implement on demand paging from userland and more
> generally they allow userland to more efficiently take control of the
> behavior of page faults than what was available before
> (PROT_NONE + SIGSEGV trap).

Not to spam with 23 e-mails, all patches are

Acked-by: Pavel Emelyanov <xemul@parallels.com>

Thanks!

-- Pavel



* Re: [PATCH 00/23] userfaultfd v4
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (23 preceding siblings ...)
  2015-05-18 14:24 ` [PATCH 00/23] userfaultfd v4 Pavel Emelyanov
@ 2015-05-19 21:38 ` Andrew Morton
  2015-05-19 21:59   ` Richard Weinberger
  2015-05-20 13:23   ` Andrea Arcangeli
  2015-05-21 13:11 ` Kirill Smelkov
  25 siblings, 2 replies; 63+ messages in thread
From: Andrew Morton @ 2015-05-19 21:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, qemu-devel, kvm, linux-api,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli <aarcange@redhat.com> wrote:

> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.

It would be useful to have some userfaultfd testcases in
tools/testing/selftests/.  Partly as an aid to arch maintainers when
enabling this.  And also as a standalone thing to give people a
practical way of exercising this interface.

What are your thoughts on enabling userfaultfd for other architectures,
btw?  Are there good use cases, are people working on it, etc?


Also, I assume a manpage is in the works?  Sooner rather than later
would be good - Michael's review of proposed kernel interfaces has
often been valuable.



* Re: [PATCH 00/23] userfaultfd v4
  2015-05-19 21:38 ` Andrew Morton
@ 2015-05-19 21:59   ` Richard Weinberger
  2015-05-20 14:17     ` Andrea Arcangeli
  2015-05-20 13:23   ` Andrea Arcangeli
  1 sibling, 1 reply; 63+ messages in thread
From: Richard Weinberger @ 2015-05-19 21:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, LKML, linux-mm, qemu-devel, kvm,
	open list:ABI/API, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Tue, May 19, 2015 at 11:38 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli <aarcange@redhat.com> wrote:
>
>> This is the latest userfaultfd patchset against mm-v4.1-rc3
>> 2015-05-14-10:04.
>
> It would be useful to have some userfaultfd testcases in
> tools/testing/selftests/.  Partly as an aid to arch maintainers when
> enabling this.  And also as a standalone thing to give people a
> practical way of exercising this interface.
>
> What are your thoughts on enabling userfaultfd for other architectures,
> btw?  Are there good use cases, are people working on it, etc?

UML is using SIGSEGV for page faults, i.e. the UML process receives a
SIGSEGV, learns the faulting address from the mcontext, and resolves
the fault by installing a new mapping.

If userfaultfd is faster than the SIGSEGV notification, it could speed
up UML a bit. For UML I'm only interested in the notification, not the
resolving part. The "missing" data is present; only a new mapping is
needed. No copy of data.

Andrea, what do you think?

-- 
Thanks,
//richard


* Re: [PATCH 00/23] userfaultfd v4
  2015-05-19 21:38 ` Andrew Morton
  2015-05-19 21:59   ` Richard Weinberger
@ 2015-05-20 13:23   ` Andrea Arcangeli
  1 sibling, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-20 13:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, qemu-devel, kvm, linux-api,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hi Andrew,

On Tue, May 19, 2015 at 02:38:01PM -0700, Andrew Morton wrote:
> On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > This is the latest userfaultfd patchset against mm-v4.1-rc3
> > 2015-05-14-10:04.
> 
> It would be useful to have some userfaultfd testcases in
> tools/testing/selftests/.  Partly as an aid to arch maintainers when
> enabling this.  And also as a standalone thing to give people a
> practical way of exercising this interface.

Agreed.

I was also thinking about writing a trinity module for it. I wrote one
for an older version, but it was much easier to do back then, before
we had ioctls; now it's more tricky because the ioctls require the fd
to be open first etc. It's not enough to just call a syscall with a
flood of supervised-random params anymore.

> What are your thoughts on enabling userfaultfd for other architectures,
> btw?  Are there good use cases, are people working on it, etc?

powerpc should be enabled and functional already. There's not much
arch-dependent code in it, so in theory if the postcopy live migration
patchset is applied to qemu, it should work on powerpc out of the
box. Nobody has tested it yet, but I don't expect trouble on the
kernel side.

Adding support for all other archs is just a few-line patch that
defines the syscall number. I didn't do that out of tree because every
time a new syscall materialized I would get more rejects during
rebase.

> Also, I assume a manpage is in the works?  Sooner rather than later
> would be good - Michael's review of proposed kernel interfaces has
> often been valuable.

Yes, the manpage was certainly planned. It would require updates as we
keep adding features (like the wrprotect tracking, the non-cooperative
usage, and extending the availability of the ioctls to tmpfs). We can
definitely write a manpage with the current features.

Ok, so I'll continue working on the testcase and on the manpage.

Thanks!!
Andrea


* Re: [PATCH 00/23] userfaultfd v4
  2015-05-19 21:59   ` Richard Weinberger
@ 2015-05-20 14:17     ` Andrea Arcangeli
  0 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-20 14:17 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Andrew Morton, LKML, linux-mm, qemu-devel, kvm,
	open list:ABI/API, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hello Richard,

On Tue, May 19, 2015 at 11:59:42PM +0200, Richard Weinberger wrote:
> On Tue, May 19, 2015 at 11:38 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> >> This is the latest userfaultfd patchset against mm-v4.1-rc3
> >> 2015-05-14-10:04.
> >
> > It would be useful to have some userfaultfd testcases in
> > tools/testing/selftests/.  Partly as an aid to arch maintainers when
> > enabling this.  And also as a standalone thing to give people a
> > practical way of exercising this interface.
> >
> > What are your thoughts on enabling userfaultfd for other architectures,
> > btw?  Are there good use cases, are people working on it, etc?
> 
> UML is using SIGSEGV for page faults.
> i.e. the UML processes receives a SIGSEGV, learns the faulting address
> from the mcontext
> and resolves the fault by installing a new mapping.
> 
> If userfaultfd is faster that the SIGSEGV notification it could speed
> up UML a bit.
> For UML I'm only interested in the notification, not the resolving
> part. The "missing"
> data is present, only a new mapping is needed. No copy of data.
> 
> Andrea, what do you think?

I think you need some kind of UFFDIO_MPROTECT ioctl, which is the same
ioctl that wrprotect tracking also needs. At the moment we focused the
future plans mostly on wrprotection tracking, but it could be extended
to protnone tracking, either with the same feature flag as
wrprotection (with a generic UFFDIO_MPROTECT) or with two separate
feature flags and two separate ioctls.

Your pages are not missing, like in the postcopy live snapshotting
case the pages are not missing. The userfaultfd memory protection
ioctl will not modify the VMA, but it'll just selectively mark
pte/trans_huge_pmd wrprotected/protnone in order to get the faults. In
the case of postcopy live snapshotting a single ioctl call will mark
the entire guest address space readonly.

For live snapshotting the fault resolution is a no-brainer: when you
get the fault, the page is still readable and just needs to be copied
off by the live snapshotting thread to a different location, and then
UFFDIO_MPROTECT will be called again to make the page writable and
wake the blocked fault.

For protnone, you need to modify the page before waking the blocked
userfault; you can't just remove the protnone, or other threads could
modify it (if there are other threads). You'd need a further ioctl to
copy the page off to a different place by using its kernel address
(the userland address is not mapped) and to copy it back to overwrite
the original page.

Alternatively once we extend the handle_userfault to tmpfs you could
map the page in two virtual mappings and track the faults in one
mapping (where the tracked app runs) and read/write the page contents
in the other mapping that isn't tracked by the userfault.

These are the first thoughts that comes to mind without knowing
exactly what you need to do after you get the fault address, and
without knowing exactly why you need to mark the region PROT_NONE.

There will be some complications in adding the wrprotection/protnone
feature: if faults can already happen when the wrprotect/protnone is
armed, handle_userfault() could be invoked in a retry-fault, which is
not ok without allowing the userfault to return VM_FAULT_RETRY even
during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY
not set). The invariants of vma->vm_page_prot and pte/trans_huge_pmd
permissions must also not break anywhere. These are the two main
reasons why these features that require flipping protection bits are
left for later implementation and will be made visible later with
uffdio_api.feature flags and/or through uffdio_register.ioctl during
UFFDIO_REGISTER.


* Re: [PATCH 00/23] userfaultfd v4
  2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
                   ` (24 preceding siblings ...)
  2015-05-19 21:38 ` Andrew Morton
@ 2015-05-21 13:11 ` Kirill Smelkov
  2015-05-21 15:52   ` Andrea Arcangeli
  25 siblings, 1 reply; 63+ messages in thread
From: Kirill Smelkov @ 2015-05-21 13:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hello up there,

On Thu, May 14, 2015 at 07:30:57PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
> 
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
> 
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
> 
> Userfaults allow to implement on demand paging from userland and more
> generally they allow userland to more efficiently take control of the
> behavior of page faults than what was available before
> (PROT_NONE + SIGSEGV trap).
> 
> The use cases are:

[...]

> Even though there wasn't a real use case requesting it yet, it also
> allows to implement distributed shared memory in a way that readonly
> shared mappings can exist simultaneously in different hosts and they
> can be become exclusive at the first wrprotect fault.

Sorry for maybe speaking up too late, but here is an additional real
potential use-case which in my view overlaps with the above:

Recently we needed to implement persistency for NumPy arrays, that is,
to track changes made to array memory and transactionally either
abandon the changes on transaction abort or store them back to storage
on transaction commit.

Since arrays can be large, it would be slow, and thus not practical,
to keep a copy of the original data and compare memory against it to
find which array parts have been changed.

So I've implemented a scheme where array data is initially PROT_READ
protected; then we catch SIGSEGV, and if it is a write and the area
belongs to array data, we mark that page as PROT_WRITE and
continue. At commit time we know which parts were modified.

Also, since arrays can be large (bigger than RAM) and only sparse
parts of them may be needed to get the required information, for
reading it also makes sense to lazily load data in the SIGSEGV
handler, with initial PROT_NONE protection.

This is very similar to how memory mapped files work, but adds
transactionality which, as far as I know, is not provided by any
currently in-kernel filesystem on Linux.

The system is implemented as files, and arrays are then built on top
of such memory-mapped files. So from now on we can forget about NumPy
arrays and talk only about files, their mapping, lazy loading, and
transactionally storing in-memory changes back to file storage.

To get this working, a custom user-space virtual memory manager is
used, which manages RAM memory "pages", maps files into the virtual
address space, tracks page protection, and handles SIGSEGV
appropriately.


The gist of virtual memory-manager is this:

    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  (vma_on_pagefault)


For operations it currently needs

    - establishing virtual memory areas and connecting them to tracking

    - changing pages protection

        PROT_NONE or absent                             - initially
        PROT_NONE       -> PROT_READ                    - after read
        PROT_READ       -> PROT_READWRITE               - after write
        PROT_READWRITE  -> PROT_READ                    - after commit
        PROT_READWRITE  -> PROT_NONE or absent (again)  - after abort
        PROT_READ       -> PROT_NONE or absent (again)  - on reclaim

    - working with aliasable memory (thus taken from tmpfs)

        there can be two mappings of a file (array), overlapping in the
        file and requested at different times, and changes made through
        one mapping should propagate to the other -> for common parts
        only one page should be memory-mapped into two places in the
        address space.

so what is currently lacking on userfaultfd side is:

    - ability to remove already-mapped pages or make them PROT_NONE
      (UFFDIO_REMAP was recently dropped)

    - ability to arbitrarily change pages protection (e.g. RW -> R)

    - inject aliasable memory from tmpfs (or better, hugetlbfs) into
      several places (UFFDIO_REMAP + some mapping-copy semantics).


The code is ugly because it is only a prototype. You can clone/read it
all from here:

    https://lab.nexedi.cn/kirr/wendelin.core

The virtual memory manager even has tests, and from them it can be
seen how the system is supposed to work (after each access, which
pages are mapped where and how):

    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/tests/test_virtmem.c

The performance currently is not great, partly because of page
clearing when getting RAM from tmpfs, and partly because of
mprotect/SIGSEGV/vma overhead and other dumb things on my side.

I still wanted to show the use case, as userfaultfd here has the
potential to remove the kernel-related overhead.

Thanks beforehand for feedback,

Kirill


P.S. some context

http://www.wendelin.io/NXD-Wendelin.Core.Non.Secret/asEntireHTML


* Re: [PATCH 00/23] userfaultfd v4
  2015-05-21 13:11 ` Kirill Smelkov
@ 2015-05-21 15:52   ` Andrea Arcangeli
  2015-05-22 16:35     ` Kirill Smelkov
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-21 15:52 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hi Kirill,

On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
> Sorry for maybe speaking up too late, but here is additional real

Not too late; in fact I don't think there's any change required for
this at this stage, but it'd be great if you could help me review.

> Since arrays can be large, it would be slow and thus not practical to
[..]
> So I've implemented a scheme where array data is initially PROT_READ
> protected, then we catch SIGSEGV, if it is write and area belongs to array

In the case of postcopy live migration (for qemu and/or containers) and
postcopy live snapshotting, splitting the vmas is not an option
because we may run out of them.

If your PROT_READ areas are limited perhaps this isn't an issue, but
with hundreds-of-GB guests (currently plenty in production) that need
to live migrate fully reliably and fast, the vmas could exceed the
limit if we were to use mprotect. If your arrays are very large and
the PROT_READ areas aren't limited, using userfaultfd isn't only an
optimization for you either; it's actually a must to avoid a potential
-ENOMEM.

> Also, since arrays could be large - bigger than RAM, and only sparse
> parts of it could be needed to get needed information, for reading it
> also makes sense to lazily load data in SIGSEGV handler with initial
> PROT_NONE protection.

Similarly, I heard somebody wrote a fastresume to load the suspended
(on disk) guest ram using userfaultfd. That is a slightly less
fundamental case than postcopy because you could do it also with
MAP_SHARED, but it's still interesting in that it allows compressing
or decompressing the suspended ram on the fly, with lz4 for example,
something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
additional benefit of not leaving an orphaned inode open even after
the file is deleted, which would prevent unmounting the filesystem for
the whole lifetime of the guest).

> This is very similar to how memory mapped files work, but adds
> transactionality which, as far as I know, is not provided by any
> currently in-kernel filesystem on Linux.

That's another benefit yes.

> The gist of virtual memory-manager is this:
> 
>     https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
>     https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  (vma_on_pagefault)

I'll check it more in detail ASAP, thanks for the pointers!

> For operations it currently needs
> 
>     - establishing virtual memory areas and connecting to tracking it

That's the UFFDIO_REGISTER/UNREGISTER.

>     - changing pages protection
> 
>         PROT_NONE or absent                             - initially

absent is what works with -mm already. The lazy loading already works.

>         PROT_NONE       -> PROT_READ                    - after read

Current UFFDIO_COPY will map it using vma->vm_page_prot.

We'll need a new flag for UFFDIO_COPY to map it readonly. This is
already contemplated:

	/*
	 * There will be a wrprotection flag later that allows to map
	 * pages wrprotected on the fly. And such a flag will be
	 * available if the wrprotection ioctl are implemented for the
	 * range according to the uffdio_register.ioctls.
	 */
#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
	__u64 mode;

If the memory protection framework exists (either through the
uffdio_register.ioctl out value, or through uffdio_api.features
out-only value) you can pass a new flag (MODE_WP) above to transition
from "absent" to "PROT_READ".

>         PROT_READ       -> PROT_READWRITE               - after write

This will need to add UFFDIO_MPROTECT.

>         PROT_READWRITE  -> PROT_READ                    - after commit

UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a
slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want
to run this UFFDIO_MPROTECT without stopping the threads that are
accessing the memory concurrently).

And this should only work if the uffdio_register.mode had MODE_WP set,
so we don't run into the races created by COWs (gup vs fork race).

>         PROT_READWRITE  -> PROT_NONE or absent (again)  - after abort

UFFDIO_MPROTECT again, but you won't be able to read the page contents
inside the memory manager thread (the one working with
userfaultfd).

The manager is at all times forbidden to touch the memory it is
tracking with userfaultfd (if it does, it'll deadlock, but kill -9
will get rid of it). gdb, ironically, wouldn't hang because it uses an
underoptimized access_process_vm: FAULT_FLAG_RETRY won't be set in
handle_userfault in the gdb context, and it'll just receive a SIGBUS
if by mistake the user tries to touch the memory. Even if gdb will
hang later as get_user_pages_locked|unlocked gets used there too,
kill -9 would solve gdb too.

Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE)
memory: to do that, a new ioctl would be required. I'd rather not go
back down the route of UFFDIO_REMAP, but it could copy the data using
the kernel address.

It could be simply a reverse UFFDIO_COPY. We could add a
UFFDIO_COPY_MODE_REVERSE flag to the "mode" of UFFDIO_COPY to mean
"read the source from the kernel address and write the destination to
the user address". By default it reads the source from the user
address and writes the destination to the kernel address (to be
atomic).

If you want to put data back before lifting the PROT_NONE, UFFDIO_COPY
could be used in the standard way but with a
UFFDIO_COPY_MODE_OVERWRITE flag that just overwrites the contents of
the old page if it's not mapped (protnone), or just gets rid of the
old page (currently it'd return -EEXIST if the pte is not none).

So the process would be:

UFFDIO_COPY(dst_tmpaddr, src_addr, mode=REVERSE)
UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE)

Then if you also set mode=READONLY in the last UFFDIO_COPY, it'll
create a wrprotected mapping atomically before giving visibility to
the new page contents:

UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE|WP)

>         PROT_READ       -> PROT_NONE or absent (again)  - on reclaim

Same as above.

>     - working with aliasable memory (thus taken from tmpfs)
> 
>         there could be two overlapping-in-file mapping for file (array)
>         requested at different time, and changes from one mapping should
>         propagate to another one -> for common parts only 1 page should
>         be memory-mapped into 2 places in address-space.

Why isn't the manager thread taking care of calling UFFDIO_MPROTECT in
two places?

And UFFDIO_COPY would fill the page and replace the old page, and the
effect would be visible as far as the "data" is concerned, but the
protection bits would more naturally be different for each mapping,
just as a double mmap call is also required to map such an area in two
places.

You could have a MAP_PRIVATE vma with PROT_READ; you can't create a
writable pte in it just because you called
UFFDIO_MPROTECT(PROT_READ|PROT_WRITE) on a different mapping of the
same tmpfs page.

NOTE: the availability of UFFDIO_MPROTECT|COPY on tmpfs areas would
still depend on UFFDIO_REGISTER returning the respective ioctl id in
the uffdio_register.ioctl (out value of the register ioctl).

> so what is currently lacking on userfaultfd side is:
> 
>     - ability to remove already-mapped pages or make them PROT_NONE
>       (UFFDIO_REMAP was recently dropped)
> 
>     - ability to arbitrarily change page protections (e.g. RW -> R)
> 
>     - injecting aliasable memory from tmpfs (or better hugetlbfs) into
>       several places (UFFDIO_REMAP + some mapping-copy semantic).

I think UFFDIO_COPY if added with OVERWRITE|REVERSE|WP flags is an ok
substitute for UFFDIO_REMAP.

If UFFDIO_COPY sees the page is protnone during the REVERSE copy (to
extract the memory atomically), it can also skip the tlb flush (and
obviously there's no tlb flush in the reverse direction). If the page
was not protnone, it can turn it into protnone, do a tlb flush, and
then copy it to the destination address using the userland mapping.

UFFDIO_MPROTECT is definitely necessary for postcopy live snapshotting
too (the reverse UFFDIO_COPY is not, it never deals with
PROT_NONE and it never cares about missing faults).

MPROTECT(PROT_NONE) so far seems needed only by this and perhaps UML
(and perhaps qemu linux-user).

I posted in another email why these features aren't implemented yet:

==
There will be some complications in adding the wrprotection/protnone
feature: if faults could already happen when the wrprotect/protnone is
armed, handle_userfault() could be invoked in a retry-fault, which is
not ok without allowing the userfault to return VM_FAULT_RETRY even
during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY
not set). The invariants of vma->vm_page_prot and pte/trans_huge_pmd
permissions must also not break anywhere. These are the two main
reasons why these features that require flipping protection bits are
left to be implemented later and made visible later with
uffdio_api.feature flags and/or through uffdio_register.ioctl during
UFFDIO_REGISTER.
==

> The performance currently is not great, partly because of page clearing
> when getting ram from tmpfs, and partly because of mprotect/SIGSEGV/vmas
> overhead and other dumb things on my side.

Also, page faults get slowed down when the rbtree grows a lot;
userfaultfd won't let the rbtree grow.

> I still wanted to show the use case, as userfaultfd here has the
> potential to remove the kernel-related overhead.

That's very useful and interesting feedback!

Could you review the API to be sure we don't have to modify it when we
extend it as described above?

1) tmpfs returning uffdio_register.ioctl |=
   UFFDIO_MPROTECT|UFFDIO_COPY when enabled

2) UFFDIO_MPROTECT(PROT_NONE|READ|WRITE|EXEC) and in turn
   UFFDIO_COPY_MODE_REVERSE|UFFDIO_COPY_MODE_OVERWRITE|UFFDIO_COPY_MODE_WP
   being available if uffdio_register.ioctl includes UFFDIO_MPROTECT
   (and uffdio_api.features will then include
   UFFD_FEATURE_PAGEFAULT_WP to signal the uffd_msg.pagefault.flag WP
   is available [bit 1], and UFFDIO_REGISTER_MODE_WP can be used in
   uffdio_register.mode)

All of it could just check for uffdio_api.features &
UFFD_FEATURE_PAGEFAULT_WP being set, but you'd still have to check for
UFFDIO_MPROTECT being set in uffdio_register.ioctl for tmpfs areas (or
to know it's not available yet on hugetlbfs), so I think it's more
robust to check UFFDIO_MPROTECT ioctl being set in
uffdio_register.ioctl to assume all mprotection and writeprotect
tracking features are available for that specific range. The feature
flag will just tell that UFFDIO_REGISTER_MODE_WP can be used in the
register ioctl, that is something you need to know before in order to
"arm" the VMA for wrprotect faults.

For your usage I think you probably want to set
UFFDIO_REGISTER_MODE_WP|UFFDIO_REGISTER_MODE_MISSING and you'll be
told through uffdio_msg.flags if it's a WP or MISSING fault. You won't
be told if it's missing because of PROT_NONE or absent.

On a side note: all of the above is completely orthogonal to the
non-cooperative usage: as far as memory protection features go, it
doesn't need any; it just needs to track more events like fork/mremap
to adjust its offsets, as the memory manager is not part of the app
and has no way to orchestrate by other means.

Doing it all at once (non-cooperative + full memory protection) looked
like too much. We should just try to get the API right in a way that
won't require an UFFD_API bump passed to uffdio_api.api. Even then, if
an api bump is required, that's not a big deal; until recently the
non-cooperative usage did bump the API, but we accommodated the
read(2) API to avoid it.

Thinking of the worst-case scenario: if the API gets bumped, the only
thing that has to remain fixed is the ioctl number of UFFDIO_API
and the uffdio_api structure. Everything else can be upgraded without
risk of ABI breakage; even the ioctl numbers can be reused (except
UFFDIO_API itself). When the non-cooperative usage bumped the API it
actually kept all ioctls the same except the read(2) format.

Comments welcome, thanks!
Andrea

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 00/23] userfaultfd v4
  2015-05-21 15:52   ` Andrea Arcangeli
@ 2015-05-22 16:35     ` Kirill Smelkov
  0 siblings, 0 replies; 63+ messages in thread
From: Kirill Smelkov @ 2015-05-22 16:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hi Andrea,

On Thu, May 21, 2015 at 05:52:51PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
> > Sorry for maybe speaking up too late, but here is additional real
> 
> Not too late, in fact I don't think there's any change required for
> this at this stage, but it'd be great if you could help me to review.

Thanks


> > Since arrays can be large, it would be slow and thus not practical to
> [..]
> > So I've implemented a scheme where array data is initially PROT_READ
> > protected, then we catch SIGSEGV, if it is write and area belongs to array
> 
> In the case of postcopy live migration (for qemu and/or containers) and
> postcopy live snapshotting, splitting the vmas is not an option
> because we may run out of them.
> 
> If your PROT_READ areas are limited perhaps this isn't an issue, but
> with hundreds-of-GB guests (currently plenty in production) that need
> to live migrate fully reliably and fast, the vmas could exceed the
> limit if we were to use mprotect. If your arrays are very large and
> the PROT_READ areas aren't limited, using userfaultfd isn't only an
> optimization for you either; it's actually a must to avoid a potential
> -ENOMEM.

I understand. To somewhat mitigate this issue, for every array/file I
try to allocate ram pages from a separate tmpfs file at the same
offset. This way, if we allocate a lot of pages and mmap them in with
PROT_READ, and they are adjacent to each other, the kernel will merge
the adjacent vmas into one vma:

    https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/include/wendelin/bigfile/ram.h#L100
    https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/ram_shmfs.c#L102
    https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/virtmem.c#L435

I agree this is only a half-measure - the file parts accessed could be
sparse, and also there is in-shmfs overhead for maintaining tables of
which real pages are allocated to which part of the file on shmfs.

So yes, if userfaultfd allows overcoming the vma layer and working
directly on page tables, this helps.


> > Also, since arrays could be large - bigger than RAM, and only sparse
> > parts of it could be needed to get needed information, for reading it
> > also makes sense to lazily load data in SIGSEGV handler with initial
> > PROT_NONE protection.
> 
> Similarly I heard somebody wrote a fastresume to load the suspended
> (on disk) guest ram using userfaultfd. That is a slightly less
> fundamental case than postcopy because you could do it also with
> MAP_SHARED, but it's still interesting in allowing to compress or
> decompress the suspended ram on the fly with lz4 for example,
> something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
> additional benefit of not having an orphaned inode left open even if
> the file is deleted, that prevents to unmount the filesystem for the
> whole lifetime of the guest).

I see. Just a note - transparent compression/decompression could be
achieved with MAP_SHARED if the compression is performed by the
underlying filesystem - e.g. one implemented with FUSE.

( I have not measured the performance though )


> > This is very similar to how memory mapped files work, but adds
> > transactionality which, as far as I know, is not provided by any
> > currently in-kernel filesystem on Linux.
> 
> That's another benefit yes.
> 
> > The gist of virtual memory-manager is this:
> > 
> >     https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
> >     https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  (vma_on_pagefault)
> 
> I'll check it more in detail ASAP, thanks for the pointers!
> 
> > For operations it currently needs
> > 
> >     - establishing virtual memory areas and connecting to tracking it
> 
> That's the UFFDIO_REGISTER/UNREGISTER.

Yes

> >     - changing pages protection
> > 
> >         PROT_NONE or absent                             - initially
> 
> absent is what works with -mm already. The lazy loading already works.

Yes

> >         PROT_NONE       -> PROT_READ                    - after read
> 
> Current UFFDIO_COPY will map it using vma->vm_page_prot.
> 
> We'll need a new flag for UFFDIO_COPY to map it readonly. This is
> already contemplated:
> 
> 	/*
> 	 * There will be a wrprotection flag later that allows to map
> 	 * pages wrprotected on the fly. And such a flag will be
> 	 * available if the wrprotection ioctl are implemented for the
> 	 * range according to the uffdio_register.ioctls.
> 	 */
> #define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
> 	__u64 mode;
> 
> If the memory protection framework exists (either through the
> uffdio_register.ioctl out value, or through uffdio_api.features
> out-only value) you can pass a new flag (MODE_WP) above to transition
> from "absent" to "PROT_READ".

Yes. The same probably applies to UFFDIO_ZEROPAGE (to mmap-in the
zeropage as RO on read, if that part of the file is currently a hole).

So do we settle on adding

    UFFDIO_COPY_MODE_WP         and
    UFFDIO_ZEROPAGE_MODE_WP
?

Or maybe: why do we have both COPY_DONTWAKE and ZEROPAGE_DONTWAKE (and
previously REMAP_MODE_DONTWAKE), and would now duplicate _MODE_WP into
all of them?

Maybe it makes sense to move those common flags - related to waking up
or not, mapping in as R or RW (and maybe others in the future) - to a
common place.


> >         PROT_READ       -> PROT_READWRITE               - after write
> 
> This will need to add UFFDIO_MPROTECT.

Yes

> >         PROT_READWRITE  -> PROT_READ                    - after commit
> 
> UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a
> slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want
> to run this UFFDIO_MPROTECT without stopping the threads that are
> accessing the memory concurrently).

Yes. I understand the race of a manager making pages RW -> R while
another thread in the user process simultaneously writes to the same
page.

At user level this race can be solved this way: before commit, changed
pages are first marked as R and only then written to storage.

- If a write races with the RW->R protection change - it's a client
  problem (of having one thread still modifying data while another
  thread triggers the commit) - we are ok with committing whatever had
  already been written at the moment the RW->R transition happened.

- If a write happened after the RW->R protection change (even if it
  was racy), we are ok as long as we get a proper notification of the
  write to the WP area delivered to the userfaultfd handler.

So if the "slight mess" can be solved on the kernel side, userspace is
ok to live with the potential race and solve it itself.

> And this should only work if the uffdio_register.mode had MODE_WP set,
> so we don't run into the races created by COWs (gup vs fork race).

It is ok to start by registering with

    (UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP)

for the whole area at the beginning. In other words we are telling
userfaultfd "we want to handle all kinds of faults - both reads and
writes".

> >         PROT_READWRITE  -> PROT_NONE or absent (again)  - after abort
> 
> UFFDIO_MPROTECT again,

My idea here is that on transaction abort we just don't need that
changed memory - we can both forget the changes and free the
appropriate pages - if in the next transaction someone needs the file
data of that same part again, it just reloads the usual way from the
file.

So it is maybe

    UFFDIO_MFREE            ( make the pte absent, and free(*) the
                              referenced page
                              (*) free maybe = decrement its refcount )

that is needed here.

But for generality, I agree it makes sense to have a way to just
MPROTECT with PROT_NONE without freeing the page.

> but you won't be able to read the page contents
> inside the memory manager thread (the one working with
> userfaultfd).

With PROT_NONE I see.

> The manager is at all times forbidden to touch the memory it is
> tracking with userfaultfd (if it does, it'll deadlock, but kill -9
> will get rid of it). gdb, ironically, wouldn't hang because it uses an
> underoptimized access_process_vm: FAULT_FLAG_RETRY won't be set in
> handle_userfault in the gdb context, and it'll just receive a SIGBUS
> if by mistake the user tries to touch the memory. Even if gdb will
> hang later as get_user_pages_locked|unlocked gets used there too,
> kill -9 would solve gdb too.

Wait. I only partly understand, because I don't have much experience
in mm. But are you saying the manager cannot access the memory it
tracks even if the pages have PROT_READ or PROT_READWRITE protection,
and we know they were already loaded, i.e. they are not missing?

If yes, then for sure there needs to be a way to get the data back
from memory to the manager, to implement storing changes back.

And maybe for some cases it would make sense to first set the
protection to PROT_NONE, so the manager knows the client cannot mess
with the data while it accesses it.


> Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE)
> memory: to do that a new ioctl should be required. I'd rather not go
> back to the route of UFFDIO_REMAP, but it could copy the data using
> the kernel address.
> 
> It could be simply a reverse UFFDIO_COPY. We could add a
> UFFDIO_COPY_MODE_REVERSE flag to the "mode" of UFFDIO_COPY to mean
> "read the source from the kernel address and write the destination to
> the user address". By default it reads the source from the user
> address and writes the destination to the kernel address (to be
> atomic).

Yes, with the small clarification that "write to kernel address" means
a write to the "kernel address associated with the address in the
destination mm".


> If you want to put data back before lifting the PROT_NONE, UFFDIO_COPY
> could be used in the standard way but with a
> UFFDIO_COPY_MODE_OVERWRITE flag that just overwrites the contents of
> the old page if it's not mapped (protnone), or just get rid of the old
> page (currently it'd return -EEXIST if the pte is not none).
> 
> So the process would be:
> 
> UFFDIO_COPY(dst_tmpaddr, src_addr, mode=REVERSE)
> UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE)
> 
> Then if you also set mode=READONLY in the last UFFDIO_COPY, it'll
> create a wrprotected mapping atomically before giving visibility to
> the new page contents:
> 
> UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE|WP)

I agree in general.

The only thing which confuses me a bit is the REVERSE flag and dst/src
being swapped. I would rather keep the src/dst ordering and have
COPY_TO and COPY_FROM in mode, or have both UFFDIO_COPY_TO and
UFFDIO_COPY_FROM, to clarify the API.

But this is only style and does not change the semantics.


But I wonder again - is it maybe possible to get the managed memory
content (with PROT_READ or PROT_READWRITE) without copying?


> >         PROT_READ       -> PROT_NONE or absent (again)  - on reclaim
> 
> Same as above.

The same as above about abort applies to reclaim - we just need to
forget and free the memory here - so UFFDIO_MFREE.


> >     - working with aliasable memory (thus taken from tmpfs)
> > 
> >         there could be two overlapping-in-file mapping for file (array)
> >         requested at different time, and changes from one mapping should
> >         propagate to another one -> for common parts only 1 page should
> >         be memory-mapped into 2 places in address-space.
> 
> Why isn't the manager thread taking care of calling UFFDIO_MPROTECT in
> two places?
>
> And UFFDIO_COPY would fill the page and replace the old page and the
> effect would be visible as far as the "data" is concerned, but the
> protection bits would be more naturally different for each
> mapping, like a double mmap call is also required to map such an area
> in two places.

Because if we have a write in one place, the manager will
write-unprotect it, and would not get notified of further writes to
the same area.

And I need further writes to be visible in the other mappings
instantly. Here is a simplified example:

    In [1]: from numpy import *
    In [2]: A = zeros(10)   # it simulates big array backed by manager

    In [3]: a = A[0:5]      # one part is mapped with start/stop = (0,5)
    
    # something unrelated is done in between

    In [4]: b = A[3:10]     # another part is mapped with start/stop = (3,10)

    # NOTE a and b overlap in the array/backing file.
    # but since we already allocated address space for a and did other
    # things, the address space beyond a could already be allocated, so
    # we cannot just extend it and make b reference addresses starting
    # at a's tail.
    
    In [5]: a
    Out[5]: array([ 0.,  0.,  0.,  0.,  0.])
    
    In [6]: b
    Out[6]: array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.])
    
    In [7]: a[4] = 1    # first change to a; the manager removes WP
    
    In [8]: a
    Out[8]: array([ 0.,  0.,  0.,  0.,  1.])
    
    In [9]: b
    Out[9]: array([ 0.,  1.,  0.,  0.,  0.,  0.,  0.]) # propagated to b

    # write again - the manager does not get a WP fault, but the change
    # still has to propagate
    In [10]: a[4] = 2

    In [11]: a
    Out[11]: array([ 0.,  0.,  0.,  0.,  2.])
    
    In [12]: b
    Out[12]: array([ 0.,  2.,  0.,  0.,  0.,  0.,  0.])

so when we resolve the fault for a[4], the exact page mapped in should
also eventually be mapped into b[1].

> You could have a MAP_PRIVATE vma with PROT_READ, you can't create a
> writable pte into it, just because you called
> UFFDIO_MPROTECT(PROT_READ|PROT_WRITE) in a different mapping of the
> same tmpfs page.

I don't fully understand here, but I use MAP_SHARED so changes
propagate back to the tmpfs file and are visible in the other
mappings.

https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/ram_shmfs.c#L89

> NOTE: the availability of UFFDIO_MPROTECT|COPY on tmpfs areas would
> still depend on UFFDIO_REGISTER returning the respective ioctl id in
> the uffdio_register.ioctl (out value of the register ioctl).

It is ok to check. But I'd like to note: here we mmap two overlapping
parts of a tmpfs file into two regions, and register both regions with
userfaultfd.


> > so what is currently lacking on userfaultfd side is:
> > 
> >     - ability to remove already-mapped pages or make them PROT_NONE
> >       (UFFDIO_REMAP was recently dropped)
> > 
> >     - ability to arbitrarily change page protections (e.g. RW -> R)
> > 
> >     - injecting aliasable memory from tmpfs (or better hugetlbfs) into
> >       several places (UFFDIO_REMAP + some mapping-copy semantic).
> 
> I think UFFDIO_COPY if added with OVERWRITE|REVERSE|WP flags is an ok
> substitute for UFFDIO_REMAP.
> 
> If UFFDIO_COPY sees the page is protnone during the REVERSE copy (to
> extract the memory atomically), it can also skip the tlb flush (and
> obviously there's no tlb flush in the reverse direction). If the page
> was not protnone, it can turn it in protnone, do a tlb flush, and then
> copy it to the destination address using the userland mapping.

We are also ok if the page is PROT_READ - in this case there is no
need to do a tlb flush - nothing can change the page while we are
copying it - we only have to take care not to allow changing the
protection to PROT_READWRITE in the process.

> UFFDIO_MPROTECT is definitely necessary for postcopy live snapshotting
> too (the reverse UFFDIO_COPY is not, it never deals with
> PROT_NONE and it never cares about missing faults).
> 
> MPROTECT(PROT_NONE) so far seems needed only by this and perhaps UML
> (and perhaps qemu linux-user).
> 
> I posted in another email why these features aren't implemented yet 
> 
> ==
> There will be some complications in adding the wrprotection/protnone
> feature: if faults could already happen when the wrprotect/protnone is
> armed, handle_userfault() could be invoked in a retry-fault, which is
> not ok without allowing the userfault to return VM_FAULT_RETRY even
> during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY
> not set). The invariants of vma->vm_page_prot and pte/trans_huge_pmd
> permissions must also not break anywhere. These are the two main
> reasons why these features that require flipping protection bits are
> left to be implemented later and made visible later with
> uffdio_api.feature flags and/or through uffdio_register.ioctl during
> UFFDIO_REGISTER.
> ==

I understand, maybe not in full detail though. The changes would
anyway be needed to make userfaultfd capable of generic memory
management, instead of only populating memory in one way.

And as I noted above, besides UFFDIO_COPY (both directions, overwrite)
and UFFDIO_MPROTECT, UFFDIO_MFREE is also needed.

> > The performance currently is not great, partly because of page clearing
> > when getting ram from tmpfs, and partly because of mprotect/SIGSEGV/vmas
> > overhead and other dumb things on my side.
> 
> Also the page faults get slowed down when the rbtree grows a lot,
> userfaultfd won't let the rbtree grow.

Yes. This is the same tradeoff as splitting vmas or not.

> > I still wanted to show the use case, as userfaultfd here has the
> > potential to remove the kernel-related overhead.
> 
> That's very useful and interesting feedback!
> 
> Could you review the API to be sure we don't have to modify it when we
> extend it like described above?
> 
> 1) tmpfs returning uffdio_register.ioctl |=
>    UFFDIO_MPROTECT|UFFDIO_COPY when enabled

and UFFDIO_MFREE, maybe with the ability to automatically punch a hole
for the freed memory (we need to return the memory to the system on
reclaim, and preferably on abort).


> 2) UFFDIO_MPROTECT(PROT_NONE|READ|WRITE|EXEC) and in turn
>    UFFDIO_COPY_MODE_REVERSE|UFFDIO_COPY_MODE_OVERWRITE|UFFDIO_COPY_MODE_WP
>    being available if uffdio_register.ioctl includes UFFDIO_MPROTECT

UFFDIO_MPROTECT -> UFFDIO_MPROTECT(PROT_NONE|READ|WRITE|EXEC) looks ok.

but imho the various copy modes could be allowed beyond mprotect -
e.g. for a manager to get back managed memory. Thus, maybe

    UFFDIO_COPY_MODE_REVERSE (or analogues) and UFFDIO_COPY_MODE_OVERWRITE

should be available when just UFFDIO_COPY is set.


From the userspace point of view, it would be simpler to operate if
all those operations were always allowed, though.

>    (and uffdio_api.features will then include
>    UFFD_FEATURE_PAGEFAULT_WP to signal the uffd_msg.pagefault.flag WP
>    is available [bit 1], and UFFDIO_REGISTER_MODE_WP can be used in
>    uffdio_register.mode)

Yes, though it does not fully represent the capability - e.g.
MPROTECT(PROT_NONE) should work too, so it should maybe be
UFFD_FEATURE_PAGEFAULT_PROTECT.

Again, from the client's point of view it would be simpler if those
features were in the base - i.e. always available.

> All of it could just check for uffdio_api.features &
> UFFD_FEATURE_PAGEFAULT_WP being set, but you'd still have to check for
> UFFDIO_MPROTECT being set in uffdio_register.ioctl for tmpfs areas (or
> to know it's not available yet on hugetlbfs), so I think it's more
> robust to check UFFDIO_MPROTECT ioctl being set in
> uffdio_register.ioctl to assume all mprotection and writeprotect
> tracking features are available for that specific range. The feature
> flag will just tell that UFFDIO_REGISTER_MODE_WP can be used in the
> register ioctl, that is something you need to know before in order to
> "arm" the VMA for wrprotect faults.

ok

> For your usage I think you probably want to set
> UFFDIO_REGISTER_MODE_WP|UFFDIO_REGISTER_MODE_MISSING and you'll be
> told through uffdio_msg.flags if it's a WP or MISSING fault.

ok

> You won't be told if it's missing because of PROT_NONE or absent.

That doesn't look good - it would be useful to know whether a page is
already mapped there or not. Is it possible to distinguish these cases
too?

> 
> On a side note: all of the above is completely orthogonal from the
> non-cooperative usage: as far as memory protection features it doesn't
> need any, it just needs to track more events like fork/mremap to
> adjust its offsets as the memory manager is not part of the app and it
> has no way to orchestrate by other means.
> 
> Doing it all at once (non-cooperative + full memory protection) looked
> too much. We should just try to get the API right in a way that won't
> require an UFFD_API bump passed to uffdio_api.api. Even then, if an
> api bump is required, that's not a big deal, until recently the
> non-cooperative usage already did the API bump but we accommodated the
> read(2) API to avoid it.
> 
> Thinking at the worst case scenario, if the API gets bumped the only
> thing that has to remain fixed is the ioctl number of the UFFDIO_API
> and the uffdio_api structure. Everything else can be upgraded without
> risk of ABI breakage, even the ioctl numbers can be reused (except the
> very UFFDIO_API). When the non-cooperative usage bumped the API it
> actually kept all ioctl the same except the read(2) format.

I agree we can change the API/ABI with versioning, but it is better to
try to get it right from the beginning and not fix it up with several
API versions on top.

Thanks for your comments and feedback,
Kirill

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
  2015-05-14 17:31 ` [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic Andrea Arcangeli
@ 2015-05-22 20:18   ` Andrew Morton
  2015-05-22 20:48     ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2015-05-22 20:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, qemu-devel, kvm, linux-api,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli <aarcange@redhat.com> wrote:

> If the rwsem starves writers it wasn't strictly a bug but lockdep
> doesn't like it and this avoids depending on lowlevel implementation
> details of the lock.
> 
> ...
>
> @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  
>  		if (!zeropage)
>  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> -					       dst_addr, src_addr);
> +					       dst_addr, src_addr, &page);
>  		else
>  			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
>  						 dst_addr);
>  
>  		cond_resched();
>  
> +		if (unlikely(err == -EFAULT)) {
> +			void *page_kaddr;
> +
> +			BUILD_BUG_ON(zeropage);

I'm not sure what this is trying to do.  BUILD_BUG_ON(local_variable)?

It goes bang in my build.  I'll just delete it.

> +			up_read(&dst_mm->mmap_sem);
> +			BUG_ON(!page);
> +
> +			page_kaddr = kmap(page);
> +			err = copy_from_user(page_kaddr,
> +					     (const void __user *) src_addr,
> +					     PAGE_SIZE);
> +			kunmap(page);
> +			if (unlikely(err)) {
> +				err = -EFAULT;
> +				goto out;
> +			}
> +			goto retry;
> +		} else
> +			BUG_ON(page);
> +


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
  2015-05-22 20:18   ` Andrew Morton
@ 2015-05-22 20:48     ` Andrea Arcangeli
  2015-05-22 21:18       ` Andrew Morton
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-22 20:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, qemu-devel, kvm, linux-api,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Fri, May 22, 2015 at 01:18:22PM -0700, Andrew Morton wrote:
> On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > If the rwsem starves writers it wasn't strictly a bug but lockdep
> > doesn't like it and this avoids depending on lowlevel implementation
> > details of the lock.
> > 
> > ...
> >
> > @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> >  
> >  		if (!zeropage)
> >  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> > -					       dst_addr, src_addr);
> > +					       dst_addr, src_addr, &page);
> >  		else
> >  			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
> >  						 dst_addr);
> >  
> >  		cond_resched();
> >  
> > +		if (unlikely(err == -EFAULT)) {
> > +			void *page_kaddr;
> > +
> > +			BUILD_BUG_ON(zeropage);
> 
> I'm not sure what this is trying to do.  BUILD_BUG_ON(local_variable)?
> 
> It goes bang in my build.  I'll just delete it.

Yes, it has to be a false positive failure, so it's fine to drop it.
My gcc 4.8.4 can look inside the static called function and see that
only mcopy_atomic_pte can return -EFAULT. The RHEL7 gcc (4.8.3) didn't
complain either. Perhaps making the BUILD_BUG_ON work with older gcc
requires a local variable set explicitly in the callee, but it's not
worth it.

It would be bad if we ended up in the -EFAULT path in the zeropage
case (if somebody later adds an apparently innocent -EFAULT retval and
unexpectedly ends up in the mcopy_atomic_pte retry logic), but it's
not important; the caller should be reviewed before improvising new
retvals anyway.

The retry loop addition and the BUILD_BUG_ON are all about the
copy_from_user run while we already hold the mmap_sem (potentially of
a different process in the non-cooperative case, but it's a problem
even if it's the current task's mmap_sem, in case the rwsem
implementation changes to avoid write starvation and becomes
non-reentrant). lockdep definitely complains (even if I think in
practice it'd be safe to read-lock recurse; we only ever got lockdep
complaints, never actual deadlocks). I didn't want to call gup_fast,
as copy_from_user is faster and I get a usable user mapping, likely
with the TLB entry hot too. The lockdep warnings we hit were, I think,
associated with NUMA hinting faults or something infrequent like that;
the fast path doesn't need to retry.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
  2015-05-22 20:48     ` Andrea Arcangeli
@ 2015-05-22 21:18       ` Andrew Morton
  2015-05-23  1:04         ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2015-05-22 21:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, qemu-devel, kvm, linux-api,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)


There's a more serious failure with i386 allmodconfig:

fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON'
  BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);

I'm surprised the feature is even reachable on i386 builds?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
  2015-05-22 21:18       ` Andrew Morton
@ 2015-05-23  1:04         ` Andrea Arcangeli
  0 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-05-23  1:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, qemu-devel, kvm, linux-api,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter),
	Fengguang Wu

On Fri, May 22, 2015 at 02:18:30PM -0700, Andrew Morton wrote:
> 
> There's a more serious failure with i386 allmodconfig:
> 
> fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON'
>   BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);
> 
> I'm surprised the feature is even reachable on i386 builds?

Unless we risk running out of vma->vm_flags there's no particular
reason not to enable it on 32bit (and even if we do run out, making
vm_flags an unsigned long long is a few-line patch). Certainly it's
less useful on 32bit as there's a 3G limit, but the max vmas per
process are still a small fraction of that. Especially if used for the
volatile pages on-demand notification of page reclaim, it could end up
useful on arm32 (the S6 is 64bit I think and the latest Snapdragon is
too, so perhaps it's too late anyway, but again it's no big deal).

Removing the BUILD_BUG_ON I think is not ok here, because while I'm ok
with supporting 32bit archs, I don't want translation: the 64bit
kernel should talk with the 32bit app directly without a layer in
between.

I tried to avoid using packed, as without packed I could not get the
alignment wrong (and a future union also couldn't get it wrong), and I
could avoid those reserved1/2/3 fields, but it's more robust to use it
in combination with the BUILD_BUG_ON to detect right away problems
like this with 32bit builds that align things differently.

I'm actually surprised the buildbot that sends me email about all
archs didn't send me anything about this for 32bit x86. Perhaps I'm
overlooking something, or x86 32bit (or any other 32bit arch for that
matter) isn't being checked? This is actually a fairly recent change;
perhaps the buildbot was shut down recently? That buildbot was very
useful for detecting problems like this.

===
From 2f0a48670dc515932dec8b983871ec35caeba553 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Sat, 23 May 2015 02:26:32 +0200
Subject: [PATCH] userfaultfd: update the uffd_msg structure to be the same on
 32/64bit

Avoiding packed allowed the code to be nicer and avoided the
reserved1/2/3 fields, but the structure must be the same on 32bit and
64bit archs so that x86 applications built with the 32bit ABI can run
on the 64bit kernel without requiring translation of the data read
through the read syscall.

$ gcc -m64 p.c && ./a.out
32
0
16
8
8
16
24
$ gcc -m32 p.c && ./a.out
32
0
16
8
8
16
24

#include <stdio.h>
#include <linux/userfaultfd.h>

int main()
{
	printf("%lu\n", sizeof(struct uffd_msg));
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->event);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.pagefault.address);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.pagefault.flags);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved1);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved2);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved3);
}

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index c8a543f..00d28e2 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -59,9 +59,13 @@
 struct uffd_msg {
 	__u8	event;
 
+	__u8	reserved1;
+	__u16	reserved2;
+	__u32	reserved3;
+
 	union {
 		struct {
-			__u32	flags;
+			__u64	flags;
 			__u64	address;
 		} pagefault;
 
@@ -72,7 +76,7 @@ struct uffd_msg {
 			__u64	reserved3;
 		} reserved;
 	} arg;
-};
+} __attribute__((packed));
 
 /*
  * Start at 0x12 and not at 0 to be more strict against bugs.



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization
  2015-05-14 17:31 ` [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
  2015-05-14 17:49   ` Linus Torvalds
@ 2015-06-23 19:00   ` Dave Hansen
  2015-06-23 21:41     ` Andrea Arcangeli
  1 sibling, 1 reply; 63+ messages in thread
From: Dave Hansen @ 2015-06-23 19:00 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, linux-kernel, linux-mm,
	qemu-devel, kvm, linux-api
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Paolo Bonzini, Rik van Riel, Mel Gorman, Andy Lutomirski,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Johannes Weiner, Huangpeng (Peter)

On 05/14/2015 10:31 AM, Andrea Arcangeli wrote:
> +static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
> +				     int wake_flags, void *key)
> +{
> +	struct userfaultfd_wake_range *range = key;
> +	int ret;
> +	struct userfaultfd_wait_queue *uwq;
> +	unsigned long start, len;
> +
> +	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +	ret = 0;
> +	/* don't wake the pending ones to avoid reads to block */
> +	if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
> +		goto out;
> +	/* len == 0 means wake all */
> +	start = range->start;
> +	len = range->len;
> +	if (len && (start > uwq->address || start + len <= uwq->address))
> +		goto out;
> +	ret = wake_up_state(wq->private, mode);
> +	if (ret)
> +		/* wake only once, autoremove behavior */
> +		list_del_init(&wq->task_list);
> +out:
> +	return ret;
> +}
...
> +static __always_inline int validate_range(struct mm_struct *mm,
> +					  __u64 start, __u64 len)
> +{
> +	__u64 task_size = mm->task_size;
> +
> +	if (start & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (len & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (!len)
> +		return -EINVAL;
> +	if (start < mmap_min_addr)
> +		return -EINVAL;
> +	if (start >= task_size)
> +		return -EINVAL;
> +	if (len > task_size - start)
> +		return -EINVAL;
> +	return 0;
> +}

Hey Andrea,

Down in userfaultfd_wake_function(), it looks like you intended for a
len=0 to mean "wake all".  But the validate_range() that we do from
userspace has a !len check in it, which keeps us from passing a len=0 in
from userspace.

Was that "wake all" for some internal use, or is the check too strict?

I was trying to use the wake ioctl after an madvise() (as opposed to
filling things in using a userfd copy).

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization
  2015-06-23 19:00   ` Dave Hansen
@ 2015-06-23 21:41     ` Andrea Arcangeli
  0 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-06-23 21:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hi Dave,

On Tue, Jun 23, 2015 at 12:00:19PM -0700, Dave Hansen wrote:
> Down in userfaultfd_wake_function(), it looks like you intended for a
> len=0 to mean "wake all".  But the validate_range() that we do from
> userspace has a !len check in it, which keeps us from passing a len=0 in
> from userspace.
> Was that "wake all" for some internal use, or is the check too strict?

It's for internal use, and for userfaultfd_release which has to wake
them all (after setting ctx->released) if the uffd is closed. It
avoids enlarging the structure, by depending on the invariant that
userland cannot pass len=0.

If we accepted len=0 from userland as valid, it'd be safer if it did
nothing, like in madvise; I doubt we want to expose this non-standard
kernel-internal behavior to userland.

> I was trying to use the wake ioctl after an madvise() (as opposed to
> filling things in using a userfd copy).

madvise will return 0 if len=0, mremap would return -EINVAL if new_len
is zero, and mmap also returns -EINVAL if len is 0; not all MM
syscalls are as permissive as madvise. Can't you pass the same len to
UFFDIO_WAKE that you pass to madvise (or just skip the call if the
madvise len is zero)?

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-05-14 17:31 ` [PATCH 19/23] userfaultfd: activate syscall Andrea Arcangeli
@ 2015-08-11 10:07   ` Bharata B Rao
  2015-08-11 13:48     ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Bharata B Rao @ 2015-08-11 10:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, zhang.zhanghailiang, Pavel Emelyanov, Johannes Weiner,
	Hugh Dickins, Dr. David Alan Gilbert, Sanidhya Kashyap,
	Dave Hansen, Andres Lagar-Cavilla, Mel Gorman, Paolo Bonzini,
	Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner

On Thu, May 14, 2015 at 07:31:16PM +0200, Andrea Arcangeli wrote:
> This activates the userfaultfd syscall.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/powerpc/include/asm/systbl.h      | 1 +
>  arch/powerpc/include/uapi/asm/unistd.h | 1 +
>  arch/x86/syscalls/syscall_32.tbl       | 1 +
>  arch/x86/syscalls/syscall_64.tbl       | 1 +
>  include/linux/syscalls.h               | 1 +
>  kernel/sys_ni.c                        | 1 +
>  6 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
> index f1863a1..4741b15 100644
> --- a/arch/powerpc/include/asm/systbl.h
> +++ b/arch/powerpc/include/asm/systbl.h
> @@ -368,3 +368,4 @@ SYSCALL_SPU(memfd_create)
>  SYSCALL_SPU(bpf)
>  COMPAT_SYS(execveat)
>  PPC64ONLY(switch_endian)
> +SYSCALL_SPU(userfaultfd)
> diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
> index e4aa173..6ad58d4 100644
> --- a/arch/powerpc/include/uapi/asm/unistd.h
> +++ b/arch/powerpc/include/uapi/asm/unistd.h
> @@ -386,5 +386,6 @@
>  #define __NR_bpf		361
>  #define __NR_execveat		362
>  #define __NR_switch_endian	363
> +#define __NR_userfaultfd	364

Maybe it is a bit late to bring this up, but I needed the following fix
to the userfault21 branch of your git tree to compile on powerpc.
----

powerpc: Bump up __NR_syscalls to account for __NR_userfaultfd

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

With userfaultfd syscall, the number of syscalls will be 365 on PowerPC.
Reflect the same in __NR_syscalls.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/unistd.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index f4f8b66..4a055b6 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define __NR_syscalls		364
+#define __NR_syscalls		365
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-08-11 10:07   ` [Qemu-devel] " Bharata B Rao
@ 2015-08-11 13:48     ` Andrea Arcangeli
  2015-08-12  5:23       ` Bharata B Rao
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-08-11 13:48 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, zhang.zhanghailiang, Pavel Emelyanov, Johannes Weiner,
	Hugh Dickins, Dr. David Alan Gilbert, Sanidhya Kashyap,
	Dave Hansen, Andres Lagar-Cavilla, Mel Gorman, Paolo Bonzini,
	Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner

Hello Bharata,

On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> May be it is a bit late to bring this up, but I needed the following fix
> to userfault21 branch of your git tree to compile on powerpc.

Not late, just in time. I increased the number of syscalls in earlier
versions; it must have gotten lost during a rebase with rejects, sorry.

I applied it to my tree and it can be applied to -mm and linux-next,
thanks!

The syscalls for arm32 are also ready and on their way to the arm
tree, and the testsuite worked fine there. ppc should also work fine;
it'd be interesting if you could confirm that, just beware that I have
a typo in the testcase:

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 76071b1..925c3c9 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -70,7 +70,7 @@
 #define __NR_userfaultfd 323
 #elif defined(__i386__)
 #define __NR_userfaultfd 374
-#elif defined(__powewrpc__)
+#elif defined(__powerpc__)
 #define __NR_userfaultfd 364
 #else
 #error "missing __NR_userfaultfd definition"



> ----
> 
> powerpc: Bump up __NR_syscalls to account for __NR_userfaultfd
> 
> From: Bharata B Rao <bharata@linux.vnet.ibm.com>
> 
> With userfaultfd syscall, the number of syscalls will be 365 on PowerPC.
> Reflect the same in __NR_syscalls.
> 
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/unistd.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
> index f4f8b66..4a055b6 100644
> --- a/arch/powerpc/include/asm/unistd.h
> +++ b/arch/powerpc/include/asm/unistd.h
> @@ -12,7 +12,7 @@
>  #include <uapi/asm/unistd.h>
>  
>  
> -#define __NR_syscalls		364
> +#define __NR_syscalls		365
>  
>  #define __NR__exit __NR_exit
>  #define NR_syscalls	__NR_syscalls

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-08-11 13:48     ` Andrea Arcangeli
@ 2015-08-12  5:23       ` Bharata B Rao
  2015-09-08  6:08         ` Michael Ellerman
  0 siblings, 1 reply; 63+ messages in thread
From: Bharata B Rao @ 2015-08-12  5:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, zhang.zhanghailiang, Pavel Emelyanov, Johannes Weiner,
	Hugh Dickins, Dr. David Alan Gilbert, Sanidhya Kashyap,
	Dave Hansen, Andres Lagar-Cavilla, Mel Gorman, Paolo Bonzini,
	Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner

On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> Hello Bharata,
> 
> On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > May be it is a bit late to bring this up, but I needed the following fix
> > to userfault21 branch of your git tree to compile on powerpc.
> 
> Not late, just in time. I increased the number of syscalls in earlier
> versions, it must have gotten lost during a rejecting rebase, sorry.
> 
> I applied it to my tree and it can be applied to -mm and linux-next,
> thanks!
> 
> The syscall for arm32 are also ready and on their way to the arm tree,
> the testsuite worked fine there. ppc also should work fine if you
> could confirm it'd be interesting, just beware that I got a typo in
> the testcase:

The testsuite passes on powerpc.

--------------------
running userfaultfd
--------------------
nr_pages: 2040, nr_pages_per_cpu: 170
bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 13 128
bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0
bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 1 0
bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20
bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0
bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 34 0
bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0
bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 180 75
bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30
bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 2
bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1
bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 37 1
bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10
bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 5
bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2
bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
[PASS]

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-08-12  5:23       ` Bharata B Rao
@ 2015-09-08  6:08         ` Michael Ellerman
  2015-09-08  6:39           ` Bharata B Rao
  0 siblings, 1 reply; 63+ messages in thread
From: Michael Ellerman @ 2015-09-08  6:08 UTC (permalink / raw)
  To: bharata
  Cc: Andrea Arcangeli, Andrew Morton, linux-kernel, linux-mm,
	qemu-devel, kvm, linux-api, zhang.zhanghailiang, Pavel Emelyanov,
	Johannes Weiner, Hugh Dickins, Dr. David Alan Gilbert,
	Sanidhya Kashyap, Dave Hansen, Andres Lagar-Cavilla, Mel Gorman,
	Paolo Bonzini, Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner, linuxppc-dev

On Wed, 2015-08-12 at 10:53 +0530, Bharata B Rao wrote:
> On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> > Hello Bharata,
> > 
> > On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > > May be it is a bit late to bring this up, but I needed the following fix
> > > to userfault21 branch of your git tree to compile on powerpc.
> > 
> > Not late, just in time. I increased the number of syscalls in earlier
> > versions, it must have gotten lost during a rejecting rebase, sorry.
> > 
> > I applied it to my tree and it can be applied to -mm and linux-next,
> > thanks!
> > 
> > The syscall for arm32 are also ready and on their way to the arm tree,
> > the testsuite worked fine there. ppc also should work fine if you
> > could confirm it'd be interesting, just beware that I got a typo in
> > the testcase:
> 
> The testsuite passes on powerpc.
> 
> --------------------
> running userfaultfd
> --------------------
> nr_pages: 2040, nr_pages_per_cpu: 170
> bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 13 128
> bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0
> bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 1 0
> bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
> bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20
> bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
> bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
> bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
> bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0
> bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 34 0
> bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
> bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
> bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0
> bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
> bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
> bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
> bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 180 75
> bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30
> bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 2
> bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1
> bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 37 1
> bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10
> bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
> bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
> bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 5
> bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2
> bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
> bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
> bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
> bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
> bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
> bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
> [PASS]

Hmm, not for me. See below.

What setup were you testing on Bharata?

Mine is:

$ uname -a
Linux lebuntu 4.2.0-09705-g3a166acc1432 #2 SMP Tue Sep 8 15:18:00 AEST 2015 ppc64le ppc64le ppc64le GNU/Linux

Which is 7d9071a09502 plus a couple of powerpc patches.

$ zgrep USERFAULTFD /proc/config.gz
CONFIG_USERFAULTFD=y

$ sudo ./userfaultfd 128 32
nr_pages: 2048, nr_pages_per_cpu: 128
bounces: 31, mode: rnd racing ver poll, error mutex 2 2
error mutex 2 10
error mutex 2 15
[... ~370 further "error mutex 2 <page>" lines omitted ...]
error mutex 2 2045
userfaults: 200 46 78 60 64 47 41 38 22 28 15 16 20 4 5 0
$ echo $?
0

So it claims to have passed, but all those errors make me think otherwise?

cheers



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08  6:08         ` Michael Ellerman
@ 2015-09-08  6:39           ` Bharata B Rao
  2015-09-08  7:14             ` Michael Ellerman
  2015-09-08  8:59             ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 63+ messages in thread
From: Bharata B Rao @ 2015-09-08  6:39 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andrea Arcangeli, Andrew Morton, linux-kernel, linux-mm,
	qemu-devel, kvm, linux-api, zhang.zhanghailiang, Pavel Emelyanov,
	Johannes Weiner, Hugh Dickins, Dr. David Alan Gilbert,
	Sanidhya Kashyap, Dave Hansen, Andres Lagar-Cavilla, Mel Gorman,
	Paolo Bonzini, Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner, linuxppc-dev

On Tue, Sep 08, 2015 at 04:08:06PM +1000, Michael Ellerman wrote:
> On Wed, 2015-08-12 at 10:53 +0530, Bharata B Rao wrote:
> > On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> > > Hello Bharata,
> > > 
> > > On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > > > Maybe it is a bit late to bring this up, but I needed the following fix
> > > > to the userfault21 branch of your git tree to compile on powerpc.
> > > 
> > > Not late, just in time. I had increased the number of syscalls in earlier
> > > versions; the change must have gotten lost during a rebase that hit rejects, sorry.
> > > 
> > > I applied it to my tree and it can be applied to -mm and linux-next,
> > > thanks!
> > > 
> > > The syscalls for arm32 are also ready and on their way to the arm tree;
> > > the testsuite worked fine there. ppc should also work fine, and it'd be
> > > interesting if you could confirm that; just beware that I made a typo in
> > > the testcase:
> > 
> > The testsuite passes on powerpc.
> > 
> > --------------------
> > running userfaultfd
> > --------------------
> > nr_pages: 2040, nr_pages_per_cpu: 170
> > bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 13 128
> > bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0
> > bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 1 0
> > bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
> > bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20
> > bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
> > bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
> > bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
> > bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0
> > bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 34 0
> > bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
> > bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
> > bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0
> > bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
> > bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
> > bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
> > bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 180 75
> > bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30
> > bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 2
> > bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1
> > bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 37 1
> > bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10
> > bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
> > bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
> > bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 5
> > bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2
> > bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
> > bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
> > bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
> > bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
> > bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
> > bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
> > [PASS]
> 
> Hmm, not for me. See below.
> 
> What setup were you testing on, Bharata?

I was on commit a94572f5799dd of userfault21 branch in Andrea's tree
git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

#uname -a
Linux 4.1.0-rc8+ #1 SMP Tue Aug 11 11:33:50 IST 2015 ppc64le ppc64le ppc64le GNU/Linux

In fact I had successfully done postcopy migration of sPAPR guest with
this setup.

> 
> Mine is:
> 
> $ uname -a
> Linux lebuntu 4.2.0-09705-g3a166acc1432 #2 SMP Tue Sep 8 15:18:00 AEST 2015 ppc64le ppc64le ppc64le GNU/Linux
> 
> Which is 7d9071a09502 plus a couple of powerpc patches.
> 
> $ zgrep USERFAULTFD /proc/config.gz
> CONFIG_USERFAULTFD=y
> 
> $ sudo ./userfaultfd 128 32
> nr_pages: 2048, nr_pages_per_cpu: 128
> bounces: 31, mode: rnd racing ver poll, error mutex 2 2
> error mutex 2 10


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08  6:39           ` Bharata B Rao
@ 2015-09-08  7:14             ` Michael Ellerman
  2015-09-08 10:40               ` Michael Ellerman
  2015-09-08  8:59             ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 63+ messages in thread
From: Michael Ellerman @ 2015-09-08  7:14 UTC (permalink / raw)
  To: bharata
  Cc: Andrea Arcangeli, Andrew Morton, linux-kernel, linux-mm,
	qemu-devel, kvm, linux-api, zhang.zhanghailiang, Pavel Emelyanov,
	Johannes Weiner, Hugh Dickins, Dr. David Alan Gilbert,
	Sanidhya Kashyap, Dave Hansen, Andres Lagar-Cavilla, Mel Gorman,
	Paolo Bonzini, Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner, linuxppc-dev

On Tue, 2015-09-08 at 12:09 +0530, Bharata B Rao wrote:
> On Tue, Sep 08, 2015 at 04:08:06PM +1000, Michael Ellerman wrote:
> > On Wed, 2015-08-12 at 10:53 +0530, Bharata B Rao wrote:
> > > On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> > > > Hello Bharata,
> > > > 
> > > > On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > > > > Maybe it is a bit late to bring this up, but I needed the following fix
> > > > > to the userfault21 branch of your git tree to compile on powerpc.
> > > > 
> > > > Not late, just in time. I had increased the number of syscalls in earlier
> > > > versions; the change must have gotten lost during a rebase that hit rejects, sorry.
> > > > 
> > > > I applied it to my tree and it can be applied to -mm and linux-next,
> > > > thanks!
> > > > 
> > > > The syscalls for arm32 are also ready and on their way to the arm tree;
> > > > the testsuite worked fine there. ppc should also work fine, and it'd be
> > > > interesting if you could confirm that; just beware that I made a typo in
> > > > the testcase:
> > > 
> > > The testsuite passes on powerpc.
> > > 
> > > --------------------
> > > running userfaultfd
> > > --------------------
> > > nr_pages: 2040, nr_pages_per_cpu: 170
> > > bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 13 128
> > > bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0
> > > bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 1 0
> > > bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
> > > bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20
> > > bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
> > > bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
> > > bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
> > > bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0
> > > bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 34 0
> > > bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
> > > bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
> > > bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0
> > > bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
> > > bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
> > > bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
> > > bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 180 75
> > > bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30
> > > bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 2
> > > bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1
> > > bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 37 1
> > > bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10
> > > bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
> > > bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
> > > bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 5
> > > bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2
> > > bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
> > > bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
> > > bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
> > > bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
> > > bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
> > > bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
> > > [PASS]
> > 
> > Hmm, not for me. See below.
> > 
> > What setup were you testing on, Bharata?
> 
> I was on commit a94572f5799dd of userfault21 branch in Andrea's tree
> git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> 
> #uname -a
> Linux 4.1.0-rc8+ #1 SMP Tue Aug 11 11:33:50 IST 2015 ppc64le ppc64le ppc64le GNU/Linux
> 
> In fact I had successfully done postcopy migration of sPAPR guest with
> this setup.

OK, do you mind testing mainline with the same setup to see if the selftest
passes?

cheers




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08  6:39           ` Bharata B Rao
  2015-09-08  7:14             ` Michael Ellerman
@ 2015-09-08  8:59             ` Dr. David Alan Gilbert
  2015-09-08 10:00               ` Bharata B Rao
  1 sibling, 1 reply; 63+ messages in thread
From: Dr. David Alan Gilbert @ 2015-09-08  8:59 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Michael Ellerman, Andrea Arcangeli, Andrew Morton, linux-kernel,
	linux-mm, qemu-devel, kvm, linux-api, zhang.zhanghailiang,
	Pavel Emelyanov, Johannes Weiner, Hugh Dickins, Sanidhya Kashyap,
	Dave Hansen, Andres Lagar-Cavilla, Mel Gorman, Paolo Bonzini,
	Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner, linuxppc-dev

* Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> On Tue, Sep 08, 2015 at 04:08:06PM +1000, Michael Ellerman wrote:
> > On Wed, 2015-08-12 at 10:53 +0530, Bharata B Rao wrote:
> > > On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> > > > Hello Bharata,
> > > > 
> > > > On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > > > > Maybe it is a bit late to bring this up, but I needed the following fix
> > > > > to the userfault21 branch of your git tree to compile on powerpc.
> > > > 
> > > > Not late, just in time. I had increased the number of syscalls in earlier
> > > > versions; the change must have gotten lost during a rebase that hit rejects, sorry.
> > > > 
> > > > I applied it to my tree and it can be applied to -mm and linux-next,
> > > > thanks!
> > > > 
> > > > The syscalls for arm32 are also ready and on their way to the arm tree;
> > > > the testsuite worked fine there. ppc should also work fine, and it'd be
> > > > interesting if you could confirm that; just beware that I made a typo in
> > > > the testcase:
> > > 
> > > The testsuite passes on powerpc.
> > > 
> > > --------------------
> > > running userfaultfd
> > > --------------------
> > > nr_pages: 2040, nr_pages_per_cpu: 170
> > > bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 13 128
> > > bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0
> > > bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 1 0
> > > bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
> > > bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20
> > > bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
> > > bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
> > > bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
> > > bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0
> > > bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 34 0
> > > bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
> > > bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
> > > bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0
> > > bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
> > > bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
> > > bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
> > > bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 180 75
> > > bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30
> > > bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 2
> > > bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1
> > > bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 37 1
> > > bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10
> > > bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
> > > bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
> > > bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 5
> > > bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2
> > > bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
> > > bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
> > > bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
> > > bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
> > > bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
> > > bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
> > > [PASS]
> > 
> > Hmm, not for me. See below.
> > 
> > What setup were you testing on, Bharata?
> 
> I was on commit a94572f5799dd of userfault21 branch in Andrea's tree
> git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> 
> #uname -a
> Linux 4.1.0-rc8+ #1 SMP Tue Aug 11 11:33:50 IST 2015 ppc64le ppc64le ppc64le GNU/Linux
> 
> In fact I had successfully done postcopy migration of sPAPR guest with
> this setup.

Interesting - I'd not got that far myself on power; I was hitting a problem
loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab stream (htab_shift=25) )

Did you have to make any changes to the qemu code to get that happy?

Dave

> > 
> > Mine is:
> > 
> > $ uname -a
> > Linux lebuntu 4.2.0-09705-g3a166acc1432 #2 SMP Tue Sep 8 15:18:00 AEST 2015 ppc64le ppc64le ppc64le GNU/Linux
> > 
> > Which is 7d9071a09502 plus a couple of powerpc patches.
> > 
> > $ zgrep USERFAULTFD /proc/config.gz
> > CONFIG_USERFAULTFD=y
> > 
> > $ sudo ./userfaultfd 128 32
> > nr_pages: 2048, nr_pages_per_cpu: 128
> > bounces: 31, mode: rnd racing ver poll, error mutex 2 2
> > error mutex 2 10
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08  8:59             ` Dr. David Alan Gilbert
@ 2015-09-08 10:00               ` Bharata B Rao
  2015-09-08 12:46                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 63+ messages in thread
From: Bharata B Rao @ 2015-09-08 10:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michael Ellerman, Andrea Arcangeli, Andrew Morton, linux-kernel,
	linux-mm, qemu-devel, kvm, linux-api, zhang.zhanghailiang,
	Pavel Emelyanov, Johannes Weiner, Hugh Dickins, Sanidhya Kashyap,
	Dave Hansen, Andres Lagar-Cavilla, Mel Gorman, Paolo Bonzini,
	Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner, linuxppc-dev

On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote:
> * Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> > In fact I had successfully done postcopy migration of sPAPR guest with
> > this setup.
> 
> Interesting - I'd not got that far myself on power; I was hitting a problem
> loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab stream (htab_shift=25) )
> 
> Did you have to make any changes to the qemu code to get that happy?

I should have mentioned that I tried only QEMU driven migration within
the same host using wp3-postcopy branch of your tree. I don't see the
above issue.

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off x-postcopy-ram: on 
Migration status: completed
total time: 39432 milliseconds
downtime: 162 milliseconds
setup: 14 milliseconds
transferred ram: 1297209 kbytes
throughput: 270.72 mbps
remaining ram: 0 kbytes
total ram: 4194560 kbytes
duplicate: 734015 pages
skipped: 0 pages
normal: 318469 pages
normal bytes: 1273876 kbytes
dirty sync count: 4

I will try migration between different hosts soon and check.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08  7:14             ` Michael Ellerman
@ 2015-09-08 10:40               ` Michael Ellerman
  2015-09-08 12:28                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 63+ messages in thread
From: Michael Ellerman @ 2015-09-08 10:40 UTC (permalink / raw)
  To: bharata
  Cc: kvm, qemu-devel, Sanidhya Kashyap, linux-mm, Andrea Arcangeli,
	zhang.zhanghailiang, Pavel Emelyanov, Hugh Dickins, Mel Gorman,
	Huangpeng (Peter),
	Dr. David Alan Gilbert, Andres Lagar-Cavilla, Kirill A. Shutemov,
	Dave Hansen, linux-api, linuxppc-dev, linux-kernel,
	Andy Lutomirski, Johannes Weiner, Paolo Bonzini, Andrew Morton,
	Linus Torvalds, Peter Feiner

On Tue, 2015-09-08 at 17:14 +1000, Michael Ellerman wrote:
> On Tue, 2015-09-08 at 12:09 +0530, Bharata B Rao wrote:
> > On Tue, Sep 08, 2015 at 04:08:06PM +1000, Michael Ellerman wrote:
> > > Hmm, not for me. See below.
> > > 
> > > What setup were you testing on, Bharata?
> > 
> > I was on commit a94572f5799dd of userfault21 branch in Andrea's tree
> > git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > 
> > #uname -a
> > Linux 4.1.0-rc8+ #1 SMP Tue Aug 11 11:33:50 IST 2015 ppc64le ppc64le ppc64le GNU/Linux
> > 
> > In fact I had successfully done postcopy migration of sPAPR guest with
> > this setup.
> 
> OK, do you mind testing mainline with the same setup to see if the selftest
> passes?

Ah, I just tried it on big endian and it works. So it seems not to work on
little endian for some reason, /probably/ a test case bug?

cheers



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08 10:40               ` Michael Ellerman
@ 2015-09-08 12:28                 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 63+ messages in thread
From: Dr. David Alan Gilbert @ 2015-09-08 12:28 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: bharata, kvm, qemu-devel, Sanidhya Kashyap, linux-mm,
	Andrea Arcangeli, zhang.zhanghailiang, Pavel Emelyanov,
	Hugh Dickins, Mel Gorman, Huangpeng (Peter),
	Andres Lagar-Cavilla, Kirill A. Shutemov, Dave Hansen, linux-api,
	linuxppc-dev, linux-kernel, Andy Lutomirski, Johannes Weiner,
	Paolo Bonzini, Andrew Morton, Linus Torvalds, Peter Feiner

* Michael Ellerman (mpe@ellerman.id.au) wrote:
> On Tue, 2015-09-08 at 17:14 +1000, Michael Ellerman wrote:
> > On Tue, 2015-09-08 at 12:09 +0530, Bharata B Rao wrote:
> > > On Tue, Sep 08, 2015 at 04:08:06PM +1000, Michael Ellerman wrote:
> > > > Hmm, not for me. See below.
> > > > 
> > > > What setup were you testing on, Bharata?
> > > 
> > > I was on commit a94572f5799dd of userfault21 branch in Andrea's tree
> > > git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> > > 
> > > #uname -a
> > > Linux 4.1.0-rc8+ #1 SMP Tue Aug 11 11:33:50 IST 2015 ppc64le ppc64le ppc64le GNU/Linux
> > > 
> > > In fact I had successfully done postcopy migration of sPAPR guest with
> > > this setup.
> > 
> > OK, do you mind testing mainline with the same setup to see if the selftest
> > passes.
> 
> Ah, I just tried it on big endian and it works. So it seems to not work on
> little endian for some reason, /probably/ a test case bug?

Hmm; I think we're missing a test-case fix that Andrea made for a bug I hit on Power
a couple of weeks back.  I think that would have been on LE.

Dave

> cheers
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08 10:00               ` Bharata B Rao
@ 2015-09-08 12:46                 ` Dr. David Alan Gilbert
  2015-09-08 13:37                   ` Bharata B Rao
  0 siblings, 1 reply; 63+ messages in thread
From: Dr. David Alan Gilbert @ 2015-09-08 12:46 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Michael Ellerman, Andrea Arcangeli, Andrew Morton, linux-kernel,
	linux-mm, qemu-devel, kvm, linux-api, zhang.zhanghailiang,
	Pavel Emelyanov, Johannes Weiner, Hugh Dickins, Sanidhya Kashyap,
	Dave Hansen, Andres Lagar-Cavilla, Mel Gorman, Paolo Bonzini,
	Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner, linuxppc-dev

* Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote:
> > * Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> > > In fact I had successfully done postcopy migration of sPAPR guest with
> > > this setup.
> > 
> > Interesting - I'd not got that far myself on power; I was hitting a problem
> > loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab stream (htab_shift=25) )
> > 
> > Did you have to make any changes to the qemu code to get that happy?
> 
> I should have mentioned that I tried only QEMU driven migration within
> the same host using wp3-postcopy branch of your tree. I don't see the
> above issue.
> 
> (qemu) info migrate
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off x-postcopy-ram: on 
> Migration status: completed
> total time: 39432 milliseconds
> downtime: 162 milliseconds
> setup: 14 milliseconds
> transferred ram: 1297209 kbytes
> throughput: 270.72 mbps
> remaining ram: 0 kbytes
> total ram: 4194560 kbytes
> duplicate: 734015 pages
> skipped: 0 pages
> normal: 318469 pages
> normal bytes: 1273876 kbytes
> dirty sync count: 4
> 
> I will try migration between different hosts soon and check.

I hit that on the same host; are you sure you've switched into postcopy mode,
i.e. issued a migrate_start_postcopy before the end of migration?

(My current world I have working on x86-64 and I've also tested reasonably well
on aarch64, and get to that htab problem on Power).

Dave

> 
> Regards,
> Bharata.
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08 12:46                 ` Dr. David Alan Gilbert
@ 2015-09-08 13:37                   ` Bharata B Rao
  2015-09-08 14:13                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 63+ messages in thread
From: Bharata B Rao @ 2015-09-08 13:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michael Ellerman, Andrea Arcangeli, Andrew Morton, linux-kernel,
	linux-mm, qemu-devel, kvm, linux-api, zhang.zhanghailiang,
	Pavel Emelyanov, Johannes Weiner, Hugh Dickins, Sanidhya Kashyap,
	Dave Hansen, Andres Lagar-Cavilla, Mel Gorman, Paolo Bonzini,
	Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner, linuxppc-dev

On Tue, Sep 08, 2015 at 01:46:52PM +0100, Dr. David Alan Gilbert wrote:
> * Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> > On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote:
> > > * Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> > > > In fact I had successfully done postcopy migration of sPAPR guest with
> > > > this setup.
> > > 
> > > Interesting - I'd not got that far myself on power; I was hitting a problem
> > > loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab stream (htab_shift=25) )
> > > 
> > > Did you have to make any changes to the qemu code to get that happy?
> > 
> > I should have mentioned that I tried only QEMU driven migration within
> > the same host using wp3-postcopy branch of your tree. I don't see the
> > above issue.
> > 
> > (qemu) info migrate
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off x-postcopy-ram: on 
> > Migration status: completed
> > total time: 39432 milliseconds
> > downtime: 162 milliseconds
> > setup: 14 milliseconds
> > transferred ram: 1297209 kbytes
> > throughput: 270.72 mbps
> > remaining ram: 0 kbytes
> > total ram: 4194560 kbytes
> > duplicate: 734015 pages
> > skipped: 0 pages
> > normal: 318469 pages
> > normal bytes: 1273876 kbytes
> > dirty sync count: 4
> > 
> > I will try migration between different hosts soon and check.
> 
> I hit that on the same host; are you sure you've switched into postcopy mode;
> i.e. issued a migrate_start_postcopy before the end of migration?

Sorry I was following your discussion with Li in this thread

https://www.marc.info/?l=qemu-devel&m=143035620026744&w=4

and it wasn't obvious to me that anything apart from turning on the
x-postcopy-ram capability was required :(

So I do see the problem now.

At the source
-------------
Error reading data from KVM HTAB fd: Bad file descriptor
Segmentation fault

At the target
-------------
htab_load() bad index 2113929216 (14336+0 entries) in htab stream (htab_shift=25)
qemu-system-ppc64: error while loading state section id 56(spapr/htab)
qemu-system-ppc64: postcopy_ram_listen_thread: loadvm failed: -22
qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0x1f: delta 0xffe1
qemu-system-ppc64: error while loading state for instance 0x0 of device 'pci@800000020000000:00.0/virtio-net'
*** Error in `./ppc64-softmmu/qemu-system-ppc64': corrupted double-linked list: 0x00000100241234a0 ***
======= Backtrace: =========
/lib64/power8/libc.so.6Segmentation fault


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
  2015-09-08 13:37                   ` Bharata B Rao
@ 2015-09-08 14:13                     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 63+ messages in thread
From: Dr. David Alan Gilbert @ 2015-09-08 14:13 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Michael Ellerman, Andrea Arcangeli, Andrew Morton, linux-kernel,
	linux-mm, qemu-devel, kvm, linux-api, zhang.zhanghailiang,
	Pavel Emelyanov, Johannes Weiner, Hugh Dickins, Sanidhya Kashyap,
	Dave Hansen, Andres Lagar-Cavilla, Mel Gorman, Paolo Bonzini,
	Kirill A. Shutemov, Huangpeng (Peter),
	Andy Lutomirski, Linus Torvalds, Peter Feiner, linuxppc-dev

* Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> On Tue, Sep 08, 2015 at 01:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > * Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> > > On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Bharata B Rao (bharata@linux.vnet.ibm.com) wrote:
> > > > > In fact I had successfully done postcopy migration of sPAPR guest with
> > > > > this setup.
> > > > 
> > > > Interesting - I'd not got that far myself on power; I was hitting a problem
> > > > loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab stream (htab_shift=25) )
> > > > 
> > > > Did you have to make any changes to the qemu code to get that happy?
> > > 
> > > I should have mentioned that I tried only QEMU driven migration within
> > > the same host using wp3-postcopy branch of your tree. I don't see the
> > > above issue.
> > > 
> > > (qemu) info migrate
> > > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off x-postcopy-ram: on 
> > > Migration status: completed
> > > total time: 39432 milliseconds
> > > downtime: 162 milliseconds
> > > setup: 14 milliseconds
> > > transferred ram: 1297209 kbytes
> > > throughput: 270.72 mbps
> > > remaining ram: 0 kbytes
> > > total ram: 4194560 kbytes
> > > duplicate: 734015 pages
> > > skipped: 0 pages
> > > normal: 318469 pages
> > > normal bytes: 1273876 kbytes
> > > dirty sync count: 4
> > > 
> > > I will try migration between different hosts soon and check.
> > 
> > I hit that on the same host; are you sure you've switched into postcopy mode;
> > i.e. issued a migrate_start_postcopy before the end of migration?
> 
> Sorry I was following your discussion with Li in this thread
> 
> https://www.marc.info/?l=qemu-devel&m=143035620026744&w=4
> 
> and it wasn't obvious to me that anything apart from turning on the
> x-postcopy-ram capability was required :(

OK.

> So I do see the problem now.
> 
> At the source
> -------------
> Error reading data from KVM HTAB fd: Bad file descriptor
> Segmentation fault
> 
> At the target
> -------------
> htab_load() bad index 2113929216 (14336+0 entries) in htab stream (htab_shift=25)
> qemu-system-ppc64: error while loading state section id 56(spapr/htab)
> qemu-system-ppc64: postcopy_ram_listen_thread: loadvm failed: -22
> qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0x1f: delta 0xffe1
> qemu-system-ppc64: error while loading state for instance 0x0 of device 'pci@800000020000000:00.0/virtio-net'
> *** Error in `./ppc64-softmmu/qemu-system-ppc64': corrupted double-linked list: 0x00000100241234a0 ***
> ======= Backtrace: =========
> /lib64/power8/libc.so.6Segmentation fault

Good - my current world has got rid of the segfaults/corruption in the cleanup on power - but those
only happen after it stumbles over the htab problem.

I don't know the innards of power/htab, so if you've got any idea what upset it
I'd be happy for some pointers.

(We should probably trim the cc - since I don't think this is userfault related).

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt
  2015-05-14 17:30 ` [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
@ 2015-09-11  8:47   ` Michael Kerrisk (man-pages)
  2015-12-04 15:50     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 63+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-09-11  8:47 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, linux-kernel, linux-mm,
	qemu-devel, kvm, linux-api
  Cc: mtk.manpages, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On 05/14/2015 07:30 PM, Andrea Arcangeli wrote:
> Add documentation.

Hi Andrea,

I do not recall... Did you write a man page also for this new system call?

Thanks,

Michael


> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  Documentation/vm/userfaultfd.txt | 140 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 140 insertions(+)
>  create mode 100644 Documentation/vm/userfaultfd.txt
> 
> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
> new file mode 100644
> index 0000000..c2f5145
> --- /dev/null
> +++ b/Documentation/vm/userfaultfd.txt
> @@ -0,0 +1,140 @@
> += Userfaultfd =
> +
> +== Objective ==
> +
> +Userfaults allow the implementation of on-demand paging from userland
> +and more generally they allow userland to take control of various memory
> +page faults, something otherwise only the kernel code could do.
> +
> +For example userfaults allow a proper and more optimal implementation
> +of the PROT_NONE+SIGSEGV trick.
> +
> +== Design ==
> +
> +Userfaults are delivered and resolved through the userfaultfd syscall.
> +
> +The userfaultfd (aside from registering and unregistering virtual
> +memory ranges) provides two primary functionalities:
> +
> +1) read/POLLIN protocol to notify a userland thread of the faults
> +   happening
> +
> +2) various UFFDIO_* ioctls that can manage the virtual memory regions
> +   registered in the userfaultfd that allows userland to efficiently
> +   resolve the userfaults it receives via 1) or to manage the virtual
> +   memory in the background
> +
> +The real advantage of userfaults compared to regular virtual memory
> +management of mremap/mprotect is that the userfaults in all their
> +operations never involve heavyweight structures like vmas (in fact the
> +userfaultfd runtime load never takes the mmap_sem for writing).
> +
> +Vmas are not suitable for page- (or hugepage) granular fault tracking
> +when dealing with virtual address spaces that could span
> +Terabytes. Too many vmas would be needed for that.
> +
> +The userfaultfd, once opened by invoking the syscall, can also be
> +passed using unix domain sockets to a manager process, so the same
> +manager process could handle the userfaults of a multitude of
> +different processes without them being aware of what is going on
> +(well of course unless they later try to use the userfaultfd
> +themselves on the same region the manager is already tracking, which
> +is a corner case that would currently return -EBUSY).
> +
> +== API ==
> +
> +When first opened the userfaultfd must be enabled invoking the
> +UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
> +a later API version) which will specify the read/POLLIN protocol
> +userland intends to speak on the UFFD. The UFFDIO_API ioctl, if
> +successful (i.e. if the requested uffdio_api.api is spoken also by the
> +running kernel), will return into uffdio_api.features and
> +uffdio_api.ioctls two 64bit bitmasks describing, respectively, the
> +activated features of the read(2) protocol and the generic ioctls
> +available.
> +
> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
> +register a memory range in the userfaultfd by setting the
> +uffdio_register structure accordingly. The uffdio_register.mode
> +bitmask will specify to the kernel which kind of faults to track for
> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
> +pages). The UFFDIO_REGISTER ioctl will return the
> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
> +userfaults on the range registered. Not all ioctls will necessarily be
> +supported for all memory types depending on the underlying virtual
> +memory backend (anonymous memory vs tmpfs vs real filebacked
> +mappings).
> +
> +Userland can use the uffdio_register.ioctls to manage the virtual
> +address space in the background (to add or potentially also remove
> +memory from the userfaultfd registered range). This means a userfault
> +could trigger just before userland maps the user-faulted page in the
> +background.
> +
> +The primary ioctl to resolve userfaults is UFFDIO_COPY. That
> +atomically copies a page into the userfault registered range and wakes
> +up the blocked userfaults (unless uffdio_copy.mode &
> +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
> +UFFDIO_COPY.
> +
> +== QEMU/KVM ==
> +
> +QEMU/KVM is using the userfaultfd syscall to implement postcopy live
> +migration. Postcopy live migration is one form of memory
> +externalization consisting of a virtual machine running with part or
> +all of its memory residing on a different node in the cloud. The
> +userfaultfd abstraction is generic enough that not a single line of
> +KVM kernel code had to be modified in order to add postcopy live
> +migration to QEMU.
> +
> +Guest async page faults, FOLL_NOWAIT and all other GUP features work
> +just fine in combination with userfaults. Userfaults trigger async
> +page faults in the guest scheduler so those guest processes that
> +aren't waiting for userfaults (i.e. network bound) can keep running in
> +the guest vcpus.
> +
> +It is generally beneficial to run one pass of precopy live migration
> +just before starting postcopy live migration, in order to avoid
> +generating userfaults for readonly guest regions.
> +
> +The implementation of postcopy live migration currently uses one
> +single bidirectional socket but in the future two different sockets
> +will be used (to reduce the latency of the userfaults to the minimum
> +possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
> +
> +The QEMU in the source node writes all pages that it knows are missing
> +in the destination node, into the socket, and the migration thread of
> +the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
> +ioctls on the userfaultfd in order to map the received pages into the
> +guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
> +
> +A different postcopy thread in the destination node listens with
> +poll() to the userfaultfd in parallel. When a POLLIN event is
> +generated after a userfault triggers, the postcopy thread read()s from
> +the userfaultfd and receives the fault address (or -EAGAIN in case the
> +userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
> +by the parallel QEMU migration thread).
> +
> +After the QEMU postcopy thread (running in the destination node) gets
> +the userfault address it writes the information about the missing page
> +into the socket. The QEMU source node receives the information and
> +roughly "seeks" to that page address and continues sending all
> +remaining missing pages from that new page offset. Soon after that
> +(just the time to flush the tcp_wmem queue through the network) the
> +migration thread in the QEMU running in the destination node will
> +receive the page that triggered the userfault and it'll map it as
> +usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
> +was spontaneously sent by the source or if it was an urgent page
> +requested through a userfault).
> +
> +By the time the userfaults start, the QEMU in the destination node
> +doesn't need to keep any per-page state bitmap relative to the live
> +migration around and a single per-page bitmap has to be maintained in
> +the QEMU running in the source node to know which pages are still
> +missing in the destination node. The bitmap in the source node is
> +checked to find which missing pages to send in round robin and we seek
> +over it when receiving incoming userfaults. After sending each page of
> +course the bitmap is updated accordingly. It's also useful to avoid
> +sending the same page twice (in case the userfault is read by the
> +postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
> +thread).
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 14/23] userfaultfd: wake pending userfaults
  2015-05-14 17:31 ` [PATCH 14/23] userfaultfd: wake pending userfaults Andrea Arcangeli
@ 2015-10-22 12:10   ` Peter Zijlstra
  2015-10-22 13:20     ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2015-10-22 12:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, May 14, 2015 at 07:31:11PM +0200, Andrea Arcangeli wrote:
> @@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
>  	 * through poll/read().
>  	 */
>  	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
> -	for (;;) {
> -		set_current_state(TASK_KILLABLE);
> -		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> -		    fatal_signal_pending(current))
> -			break;
> -		spin_unlock(&ctx->fault_wqh.lock);
> +	set_current_state(TASK_KILLABLE);
> +	spin_unlock(&ctx->fault_wqh.lock);
>  
> +	if (likely(!ACCESS_ONCE(ctx->released) &&
> +		   !fatal_signal_pending(current))) {
>  		wake_up_poll(&ctx->fd_wqh, POLLIN);
>  		schedule();
> +		ret |= VM_FAULT_MAJOR;
> +	}

So what happens here if schedule() spontaneously wakes for no reason?

I'm not sure enough of userfaultfd semantics to say if that would be
bad, but the code looks suspiciously like it relies on schedule() not to
do that; which is wrong.

> +	__set_current_state(TASK_RUNNING);
> +	/* see finish_wait() comment for why list_empty_careful() */
> +	if (!list_empty_careful(&uwq.wq.task_list)) {
>  		spin_lock(&ctx->fault_wqh.lock);
> +		list_del_init(&uwq.wq.task_list);
> +		spin_unlock(&ctx->fault_wqh.lock);
>  	}
> -	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
> -	__set_current_state(TASK_RUNNING);
> -	spin_unlock(&ctx->fault_wqh.lock);
>  
>  	/*
>  	 * ctx may go away after this if the userfault pseudo fd is

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 14/23] userfaultfd: wake pending userfaults
  2015-10-22 12:10   ` Peter Zijlstra
@ 2015-10-22 13:20     ` Andrea Arcangeli
  2015-10-22 13:38       ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-10-22 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, Oct 22, 2015 at 02:10:56PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2015 at 07:31:11PM +0200, Andrea Arcangeli wrote:
> > @@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> >  	 * through poll/read().
> >  	 */
> >  	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
> > -	for (;;) {
> > -		set_current_state(TASK_KILLABLE);
> > -		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> > -		    fatal_signal_pending(current))
> > -			break;
> > -		spin_unlock(&ctx->fault_wqh.lock);
> > +	set_current_state(TASK_KILLABLE);
> > +	spin_unlock(&ctx->fault_wqh.lock);
> >  
> > +	if (likely(!ACCESS_ONCE(ctx->released) &&
> > +		   !fatal_signal_pending(current))) {
> >  		wake_up_poll(&ctx->fd_wqh, POLLIN);
> >  		schedule();
> > +		ret |= VM_FAULT_MAJOR;
> > +	}
> 
> So what happens here if schedule() spontaneously wakes for no reason?
> 
> I'm not sure enough of userfaultfd semantics to say if that would be
> bad, but the code looks suspiciously like it relies on schedule() not to
> do that; which is wrong.

That would repeat the fault and trigger the DEBUG_VM printk above,
complaining that FAULT_FLAG_ALLOW_RETRY is not set. It is only a
problem for kernel faults (copy_user/get_user_pages*). Userland won't
error out in such a way because userland would return 0 and not
VM_FAULT_RETRY. So it's only required when the task schedules in
TASK_KILLABLE state.

If schedule spontaneously wakes up a task in TASK_KILLABLE state that
would be a bug in the scheduler in my view. Luckily there doesn't seem
to be such a bug, or at least we never experienced it.

Overall this dependency on the scheduler will be lifted soon, as it
must be lifted in order to track the write protect faults, so longer
term this is not a concern and this is not a design issue, but this
remains an implementation detail that avoided changing the arch code
and gup.

If you send a reproducer to show how the current scheduler can wake up
the task in TASK_KILLABLE despite not receiving a wakeup, that would
help too as we never experienced that.

The reason this dependency will be lifted soon is that the userfaultfd
write protect tracking may be armed at any time while the app is
running so we may already be in the middle of a page fault that
returned VM_FAULT_RETRY by the time we arm the write protect
tracking. So longer term the arch page fault and
__get_user_pages_locked must allow handle_userfault() to return
VM_FAULT_RETRY even if FAULT_FLAG_TRIED is set. Then we don't care if
the task is woken.

Waking a task in TASK_KILLABLE state will still waste CPU, so the
scheduler still shouldn't do that. All load balancing works better if
the task isn't running anyway so I can't imagine a good reason for
wanting to run a task in TASK_KILLABLE state before it gets the
wakeup.

Trying to predict that a wakeup will always happen in less time than
it takes to schedule the task off the CPU sounds like a very
CPU-intensive thing to measure, and it's probably better to leave such
heuristics to the caller, e.g. by spinning on a lock for a while before
blocking.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 14/23] userfaultfd: wake pending userfaults
  2015-10-22 13:20     ` Andrea Arcangeli
@ 2015-10-22 13:38       ` Peter Zijlstra
  2015-10-22 14:18         ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2015-10-22 13:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, Oct 22, 2015 at 03:20:15PM +0200, Andrea Arcangeli wrote:

> If schedule spontaneously wakes up a task in TASK_KILLABLE state that
> would be a bug in the scheduler in my view. Luckily there doesn't seem
> to be such a bug, or at least we never experienced it.

Well, there will be a wakeup, just not the one you were hoping for.

We have code that does:

	@cond = true;
	get_task_struct(p);
	queue(p)

				/* random wait somewhere */
				for (;;) {
					prepare_to_wait();
					if (@cond)
					  break;

				...

				handle_userfault()
				  ...
				  schedule();
	...

	dequeue(p)
	wake_up_process(p) ---> wakeup without userfault wakeup


These races are (extremely) rare, but they do exist. Therefore one must
never assume schedule() will not spuriously wake because of these
things.

Also, see:

lkml.kernel.org/r/CA+55aFwHkOo+YGWKYROmce1-H_uG3KfEUmCkJUerTj=ojY2H6Q@mail.gmail.com


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 14/23] userfaultfd: wake pending userfaults
  2015-10-22 13:38       ` Peter Zijlstra
@ 2015-10-22 14:18         ` Andrea Arcangeli
  2015-10-22 15:15           ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2015-10-22 14:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, Oct 22, 2015 at 03:38:24PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2015 at 03:20:15PM +0200, Andrea Arcangeli wrote:
> 
> > If schedule spontaneously wakes up a task in TASK_KILLABLE state that
> > would be a bug in the scheduler in my view. Luckily there doesn't seem
> > to be such a bug, or at least we never experienced it.
> 
> Well, there will be a wakeup, just not the one you were hoping for.
> 
> We have code that does:
> 
> 	@cond = true;
> 	get_task_struct(p);
> 	queue(p)
> 
> 				/* random wait somewhere */
> 				for (;;) {
> 					prepare_to_wait();
> 					if (@cond)
> 					  break;
> 
> 				...
> 
> 				handle_userfault()
> 				  ...
> 				  schedule();
> 	...
> 
> 	dequeue(p)
> 	wake_up_process(p) ---> wakeup without userfault wakeup
> 
> 
> These races are (extremely) rare, but they do exist. Therefore one must
> never assume schedule() will not spuriously wake because of these
> things.
> 
> Also, see:
> 
> lkml.kernel.org/r/CA+55aFwHkOo+YGWKYROmce1-H_uG3KfEUmCkJUerTj=ojY2H6Q@mail.gmail.com

With one more spinlock taken in the fast path we could recheck whether
the waitqueue entry is still queued (i.e. whether this is an extremely
rare spurious wakeup), and in that case set the state back to
TASK_KILLABLE and schedule again.

However in the long term such a spinlock should be removed because
it's faster to stick with the current lockless list_empty_careful and
not to recheck the auto-remove waitqueue, but then we must be able to
re-enter handle_userfault() even if FAULT_FLAG_TRIED was set
(currently we can't return VM_FAULT_RETRY if FAULT_FLAG_TRIED is set
and that's the problem). This change has been planned for a long time, as we
need it to arm the vma-less write protection while the app is running,
so I'm not sure if it's worth going for the short term fix if this is
extremely rare.

The risk of memory corruption is still zero no matter what happens
here, in the extremely rare case the app will get a SIGBUS or a
syscall will return -EFAULT. The kernel also cannot crash. So it's not
a very severe concern if it happens extremely rarely (we never
reproduced it and stress testing has run for months). Of course in the
longer term this would have been fixed regardless as said in previous
email.

I think going for the longer term fix that was already planned, is
better than doing a short term fix and the real question is how I
should proceed to change the arch code and gup to cope with
handle_userfault() being re-entered.

The simplest thing is to drop FAULT_FLAG_TRIED as a whole. Or I could
add a new VM_FAULT_USERFAULT flag specific to handle_userfault that
would be returned even if FAULT_FLAG_TRIED is set, so that only
userfaults will be allowed to be repeated indefinitely (and then
VM_FAULT_USERFAULT shouldn't trigger a transition to FAULT_FLAG_TRIED,
unlike VM_FAULT_RETRY does).

This is all about being allowed to drop the mmap_sem.

If we'd check the waitqueue with the spinlock (to be sure the wakeup
isn't happening from under us while we check if we got a userfault
wakeup or if this is a spurious schedule), we could also limit the
VM_FAULT_RETRY to 2 max events if I add a FAULT_FLAG_TRIED2 and I
still use VM_FAULT_RETRY (instead of VM_FAULT_USERFAULT).

Being able to return VM_FAULT_RETRY indefinitely is only needed if we
don't handle the extremely wakeup race condition in handle_userfault
by taking the spinlock once more time in the fast path (i.e. after the
schedule).

I'm not exactly sure why we allow VM_FAULT_RETRY only once currently
so I'm tempted to drop FAULT_FLAG_TRIED entirely.

I've no real preference on how to tweak the page fault code to be able
to return VM_FAULT_RETRY indefinitely and I would aim for the smallest
change possible, so if you've suggestions now it's good time.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 14/23] userfaultfd: wake pending userfaults
  2015-10-22 14:18         ` Andrea Arcangeli
@ 2015-10-22 15:15           ` Peter Zijlstra
  2015-10-22 15:30             ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2015-10-22 15:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, Oct 22, 2015 at 04:18:31PM +0200, Andrea Arcangeli wrote:

> The risk of memory corruption is still zero no matter what happens
> here, in the extremely rare case the app will get a SIGBUS or a

That might still upset people, SIGBUS isn't something an app can really
recover from.

> I'm not exactly sure why we allow VM_FAULT_RETRY only once currently
> so I'm tempted to drop FAULT_FLAG_TRIED entirely.

I think to ensure we make forward progress.

> I've no real preference on how to tweak the page fault code to be able
> to return VM_FAULT_RETRY indefinitely and I would aim for the smallest
> change possible, so if you've suggestions now it's good time.

Indefinitely is such a long time, we should try and finish
computation before the computer dies etc. :-)

Yes, yes.. I know, extremely unlikely etc. Still guarantees are good.


In any case, I'm not really too bothered how you fix it, just figured
I'd let you know.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 14/23] userfaultfd: wake pending userfaults
  2015-10-22 15:15           ` Peter Zijlstra
@ 2015-10-22 15:30             ` Andrea Arcangeli
  0 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-10-22 15:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Thu, Oct 22, 2015 at 05:15:09PM +0200, Peter Zijlstra wrote:
> Indefinitely is such a long time, we should try and finish
> computation before the computer dies etc. :-)

Indefinitely as in read_seqcount_retry: eventually it makes progress.

Even returning 0 from the page fault can trigger it again
indefinitely, so VM_FAULT_RETRY isn't fundamentally different from
returning 0 and retrying the page fault again later. So it's not clear
why VM_FAULT_RETRY can only try once more.

FAULT_FLAG_TRIED as a message to the VM so it starts to do heavy
locking and block more aggressively is actually useful as such, but it
shouldn't be a replacement for FAULT_FLAG_ALLOW_RETRY. What I meant
by removing FAULT_FLAG_TRIED is really converting it into a hint, not
something that controls whether the page fault can keep retrying
in-kernel.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt
  2015-09-11  8:47   ` Michael Kerrisk (man-pages)
@ 2015-12-04 15:50     ` Michael Kerrisk (man-pages)
  2015-12-04 17:55       ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-12-04 15:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, linux-kernel, linux-mm,
	qemu-devel, kvm, linux-api
  Cc: mtk.manpages, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hi Andrea,

On 09/11/2015 10:47 AM, Michael Kerrisk (man-pages) wrote:
> On 05/14/2015 07:30 PM, Andrea Arcangeli wrote:
>> Add documentation.
> 
> Hi Andrea,
> 
> I do not recall... Did you write a man page also for this new system call?

No response to my last mail, so I'll try again... Did you 
write any man page for this interface?

Thanks,

Michael


>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> ---
>>  Documentation/vm/userfaultfd.txt | 140 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 140 insertions(+)
>>  create mode 100644 Documentation/vm/userfaultfd.txt
>>
>> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
>> new file mode 100644
>> index 0000000..c2f5145
>> --- /dev/null
>> +++ b/Documentation/vm/userfaultfd.txt
>> @@ -0,0 +1,140 @@
>> += Userfaultfd =
>> +
>> +== Objective ==
>> +
>> +Userfaults allow the implementation of on-demand paging from userland
>> +and more generally they allow userland to take control of various
>> +memory page faults, something otherwise only the kernel code could do.
>> +
>> +For example, userfaults allow a cleaner and more efficient
>> +implementation of the PROT_NONE+SIGSEGV trick.
>> +
>> +== Design ==
>> +
>> +Userfaults are delivered and resolved through the userfaultfd syscall.
>> +
>> +The userfaultfd (aside from registering and unregistering virtual
>> +memory ranges) provides two primary functionalities:
>> +
>> +1) read/POLLIN protocol to notify a userland thread of the faults
>> +   happening
>> +
>> +2) various UFFDIO_* ioctls that can manage the virtual memory regions
>> +   registered in the userfaultfd, allowing userland to efficiently
>> +   resolve the userfaults it receives via 1) or to manage the virtual
>> +   memory in the background
>> +
>> +The real advantage of userfaults compared to regular virtual memory
>> +management with mremap/mprotect is that userfault operations never
>> +involve heavyweight structures like vmas (in fact the userfaultfd
>> +runtime never takes the mmap_sem for writing).
>> +
>> +Vmas are not suitable for page- (or hugepage-) granular fault
>> +tracking when dealing with virtual address spaces that could span
>> +terabytes: too many vmas would be needed for that.
>> +
>> +The userfaultfd, once opened by invoking the syscall, can also be
>> +passed over unix domain sockets to a manager process, so the same
>> +manager process could handle the userfaults of a multitude of
>> +different processes without them being aware of what is going on
>> +(unless of course they later try to use the userfaultfd themselves on
>> +the same region the manager is already tracking, a corner case that
>> +currently returns -EBUSY).
>> +
>> +== API ==
>> +
>> +When first opened, the userfaultfd must be enabled by invoking the
>> +UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later API
>> +version), which specifies the read/POLLIN protocol userland intends
>> +to speak on the UFFD. If successful (i.e. if the requested
>> +uffdio_api.api is also spoken by the running kernel), the UFFDIO_API
>> +ioctl returns in uffdio_api.features and uffdio_api.ioctls two 64bit
>> +bitmasks: respectively the activated features of the read(2)
>> +protocol and the generic ioctls available.
>> +
>> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
>> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
>> +register a memory range in the userfaultfd by setting the
>> +uffdio_register structure accordingly. The uffdio_register.mode
>> +bitmask will specify to the kernel which kinds of faults to track for
>> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
>> +pages). The UFFDIO_REGISTER ioctl will return the
>> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
>> +userfaults on the range registered. Not all ioctls will necessarily be
>> +supported for all memory types depending on the underlying virtual
>> +memory backend (anonymous memory vs tmpfs vs real filebacked
>> +mappings).
>> +
>> +Userland can use the uffdio_register.ioctls to manage the virtual
>> +address space in the background (to add or potentially also remove
>> +memory from the userfaultfd registered range). This means a userfault
>> +could trigger just before userland maps the user-faulted page in the
>> +background.
>> +
>> +The primary ioctl to resolve userfaults is UFFDIO_COPY. It
>> +atomically copies a page into the userfault-registered range and
>> +wakes up the blocked userfaults (unless uffdio_copy.mode &
>> +UFFDIO_COPY_MODE_DONTWAKE is set). The other ioctls work similarly
>> +to UFFDIO_COPY.
>> +
>> +== QEMU/KVM ==
>> +
>> +QEMU/KVM is using the userfaultfd syscall to implement postcopy live
>> +migration. Postcopy live migration is one form of memory
>> +externalization consisting of a virtual machine running with part or
>> +all of its memory residing on a different node in the cloud. The
>> +userfaultfd abstraction is generic enough that not a single line of
>> +KVM kernel code had to be modified in order to add postcopy live
>> +migration to QEMU.
>> +
>> +Guest async page faults, FOLL_NOWAIT and all other GUP features work
>> +just fine in combination with userfaults. Userfaults trigger async
>> +page faults in the guest scheduler so those guest processes that
>> +aren't waiting for userfaults (i.e. network bound) can keep running in
>> +the guest vcpus.
>> +
>> +It is generally beneficial to run one pass of precopy live migration
>> +just before starting postcopy live migration, in order to avoid
>> +generating userfaults for readonly guest regions.
>> +
>> +The implementation of postcopy live migration currently uses one
>> +single bidirectional socket but in the future two different sockets
>> +will be used (to reduce the latency of the userfaults to the minimum
>> +possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
>> +
>> +The QEMU in the source node writes into the socket all pages that it
>> +knows are missing in the destination node, and the migration thread
>> +of the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
>> +ioctls on the userfaultfd in order to map the received pages into the
>> +guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
>> +
>> +A different postcopy thread in the destination node listens with
>> +poll() to the userfaultfd in parallel. When a POLLIN event is
>> +generated after a userfault triggers, the postcopy thread read()s
>> +from the userfaultfd and receives the fault address (or -EAGAIN in
>> +case the userfault was already resolved and woken by a
>> +UFFDIO_COPY|ZEROPAGE run by the parallel QEMU migration thread).
>> +
>> +After the QEMU postcopy thread (running in the destination node) gets
>> +the userfault address it writes the information about the missing page
>> +into the socket. The QEMU source node receives the information and
>> +roughly "seeks" to that page address and continues sending all
>> +remaining missing pages from that new page offset. Soon after that
>> +(just the time to flush the tcp_wmem queue through the network) the
>> +migration thread in the QEMU running in the destination node will
>> +receive the page that triggered the userfault and it'll map it as
>> +usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
>> +was spontaneously sent by the source or if it was an urgent page
>> +requested through a userfault).
>> +
>> +By the time the userfaults start, the QEMU in the destination node
>> +doesn't need to keep any per-page state bitmap relative to the live
>> +migration around, and a single per-page bitmap has to be maintained
>> +in the QEMU running in the source node to know which pages are still
>> +missing in the destination node. The bitmap in the source node is
>> +checked to find which missing pages to send in round-robin order,
>> +and we seek over it when incoming userfaults are received. After
>> +sending each page the bitmap is of course updated accordingly. This
>> +is also useful to avoid sending the same page twice (in case the
>> +userfault is read by the postcopy thread just before
>> +UFFDIO_COPY|ZEROPAGE runs in the migration thread).
> 
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt
  2015-12-04 15:50     ` Michael Kerrisk (man-pages)
@ 2015-12-04 17:55       ` Andrea Arcangeli
  0 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2015-12-04 17:55 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm,
	linux-api, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Kirill A. Shutemov,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hello Michael,

On Fri, Dec 04, 2015 at 04:50:03PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Andrea,
> 
> On 09/11/2015 10:47 AM, Michael Kerrisk (man-pages) wrote:
> > On 05/14/2015 07:30 PM, Andrea Arcangeli wrote:
> >> Add documentation.
> > 
> > Hi Andrea,
> > 
> > I do not recall... Did you write a man page also for this new system call?
> 
> No response to my last mail, so I'll try again... Did you 
> write any man page for this interface?

I wish I could answer with the manpage itself to give a more
satisfactory answer, but the answer is still no at this time. Right
now the write protection tracking feature has been posted to linux-mm
and I'm currently reviewing it. That part is worth documenting in the
manpage too, as it's going to happen sooner rather than later.

The lack of a manpage so far didn't prevent userland from using it
(qemu postcopy is already in upstream qemu and it depends on
userfaultfd), nor review of the code, nor other kernel contributors
from extending the syscall API. Other users have started testing the
syscall too. This is just to explain why unfortunately the manpage
didn't get top priority yet; nevertheless the manpage should happen
too and it's important. Advice on how to proceed is welcome.

Thanks,
Andrea


end of thread, other threads:[~2015-12-04 17:55 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-14 17:30 [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
2015-05-14 17:30 ` [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
2015-09-11  8:47   ` Michael Kerrisk (man-pages)
2015-12-04 15:50     ` Michael Kerrisk (man-pages)
2015-12-04 17:55       ` Andrea Arcangeli
2015-05-14 17:30 ` [PATCH 02/23] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 03/23] userfaultfd: uAPI Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 04/23] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 05/23] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 06/23] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 07/23] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 08/23] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 09/23] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2015-05-14 17:49   ` Linus Torvalds
2015-05-15 16:04     ` Andrea Arcangeli
2015-05-15 18:22       ` Linus Torvalds
2015-06-23 19:00   ` Dave Hansen
2015-06-23 21:41     ` Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 11/23] userfaultfd: Rename uffd_api.bits into .features Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 12/23] userfaultfd: Rename uffd_api.bits into .features fixup Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 13/23] userfaultfd: change the read API to return a uffd_msg Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 14/23] userfaultfd: wake pending userfaults Andrea Arcangeli
2015-10-22 12:10   ` Peter Zijlstra
2015-10-22 13:20     ` Andrea Arcangeli
2015-10-22 13:38       ` Peter Zijlstra
2015-10-22 14:18         ` Andrea Arcangeli
2015-10-22 15:15           ` Peter Zijlstra
2015-10-22 15:30             ` Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 15/23] userfaultfd: optimize read() and poll() to be O(1) Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 16/23] userfaultfd: allocate the userfaultfd_ctx cacheline aligned Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 17/23] userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 18/23] userfaultfd: buildsystem activation Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 19/23] userfaultfd: activate syscall Andrea Arcangeli
2015-08-11 10:07   ` [Qemu-devel] " Bharata B Rao
2015-08-11 13:48     ` Andrea Arcangeli
2015-08-12  5:23       ` Bharata B Rao
2015-09-08  6:08         ` Michael Ellerman
2015-09-08  6:39           ` Bharata B Rao
2015-09-08  7:14             ` Michael Ellerman
2015-09-08 10:40               ` Michael Ellerman
2015-09-08 12:28                 ` Dr. David Alan Gilbert
2015-09-08  8:59             ` Dr. David Alan Gilbert
2015-09-08 10:00               ` Bharata B Rao
2015-09-08 12:46                 ` Dr. David Alan Gilbert
2015-09-08 13:37                   ` Bharata B Rao
2015-09-08 14:13                     ` Dr. David Alan Gilbert
2015-05-14 17:31 ` [PATCH 20/23] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 21/23] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic Andrea Arcangeli
2015-05-22 20:18   ` Andrew Morton
2015-05-22 20:48     ` Andrea Arcangeli
2015-05-22 21:18       ` Andrew Morton
2015-05-23  1:04         ` Andrea Arcangeli
2015-05-14 17:31 ` [PATCH 23/23] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
2015-05-18 14:24 ` [PATCH 00/23] userfaultfd v4 Pavel Emelyanov
2015-05-19 21:38 ` Andrew Morton
2015-05-19 21:59   ` Richard Weinberger
2015-05-20 14:17     ` Andrea Arcangeli
2015-05-20 13:23   ` Andrea Arcangeli
2015-05-21 13:11 ` Kirill Smelkov
2015-05-21 15:52   ` Andrea Arcangeli
2015-05-22 16:35     ` Kirill Smelkov
