linux-mm.kvack.org archive mirror
* [PATCH 00/21] RFC: userfaultfd v3
@ 2015-03-05 17:17 Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
                   ` (21 more replies)
  0 siblings, 22 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Hello everyone,

This is an RFC for the userfaultfd syscall API v3 that addresses the
feedback received on the previous v2 submission.

The main change from v2 is that MADV_USERFAULT/NOUSERFAULT
disappeared (they're replaced by the UFFDIO_REGISTER/UNREGISTER
ioctls). In short, userfaults are now only possible through the
userfaultfd. The remap_anon_pages syscall also disappeared, replaced
by the UFFDIO_REMAP ioctl, which is in turn mostly obsoleted by the
newer UFFDIO_COPY and UFFDIO_ZEROPAGE ioctls that are more efficient
because they never have to flush the TLB. The suggestion to copy the
data instead of moving it, in order to resolve the userfault, was
immediately agreed upon.
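
For illustration, resolving one missing-page userfault with the new
method could look like the minimal sketch below (error handling
omitted; it assumes the uffdio_copy layout with dst/src/len/mode
fields from the UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI patch later in this
series):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* sketch: resolve one missing-page userfault with UFFDIO_COPY */
static int resolve_missing_fault(int uffd, unsigned long fault_addr,
				 void *src_page, unsigned long page_size)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = fault_addr & ~(page_size - 1); /* page-align the fault */
	copy.src = (unsigned long)src_page; /* staging page with the data */
	copy.len = page_size;
	copy.mode = 0;	/* no DONTWAKE: wake the blocked faulting thread */

	/* atomically maps the copied page, no TLB flush required */
	return ioctl(uffd, UFFDIO_COPY, &copy);
}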

The latest code can also be cloned here:

git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git


Userfaults allow implementing on demand paging from userland and,
more generally, they allow userland to take control of various types
of page faults more efficiently.

For example, userfaults allow a proper and more optimal
implementation of the PROT_NONE+SIGSEGV trick.
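
For reference, the trick being replaced looks roughly like the sketch
below (plain POSIX, error handling omitted; the hardcoded 4k page
size is an assumption): mark the region PROT_NONE, catch the SIGSEGV,
make the page accessible and fill it before returning.

#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE_ASSUMED 4096UL	/* assumption: 4k pages */

static void segv_handler(int sig, siginfo_t *si, void *uc)
{
	void *page = (void *)((uintptr_t)si->si_addr &
			      ~(PAGE_SIZE_ASSUMED - 1));

	/* make the page accessible again... */
	mprotect(page, PAGE_SIZE_ASSUMED, PROT_READ | PROT_WRITE);
	/* ...and fill it (network, disk, ...) before returning */
}

static void install_segv_handler(void)
{
	struct sigaction sa;

	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);
}

Each fault here costs a signal delivery plus an mprotect that splits
vmas and takes the mmap_sem for writing, which is exactly what the
userfaultfd approach avoids.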

There has been interest from multiple users for different use cases:

1) KVM postcopy live migration (one form of cloud memory
   externalization). KVM postcopy live migration is the primary driver
   of this work:
   http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
   http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) KVM postcopy live snapshotting (allowing the memory usage to be
   limited/throttled, unlike fork would, plus avoiding the fork
   overhead in the first place).

   The syscall API already contemplates the wrprotect fault tracking
   and is generic enough to allow its later implementation in a
   backwards compatible fashion.

3) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to also work on tmpfs; then the
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is
   tmpfs-backed.

4) alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access the
   virtual regions marked volatile with the CPU. This also requires
   point 3) to be fulfilled, as volatile pages happily apply to tmpfs.

5) postcopy live migration of binaries inside linux containers.

Even though no real use case has requested it yet, the new API also
allows implementing distributed shared memory in a way that lets
readonly shared mappings exist simultaneously on different hosts and
become exclusive at the first wrprotect fault.

The UFFDIO_REMAP method is still present in the patchset but it's
provided primarily to remove (not add) memory from the userfault
range. The addition of the UFFDIO_REMAP method is intentionally kept
at the end of the patchset. The postcopy live migration qemu code will
only use UFFDIO_COPY and UFFDIO_ZEROPAGE. UFFDIO_REMAP isn't intended
to be merged upstream in the short term, and it can be dropped later
if there's an agreement that it's a bad idea to keep it around in the
patchset.

David ran some KVM postcopy live migration benchmarks on an 8-way CPU
system and measured that using UFFDIO_COPY instead of UFFDIO_REMAP
resulted in roughly a 20% reduction in latency, which is good. The
standard deviation of the latency measurement decreased significantly
as well (because the number of CPUs that required IPI delivery was
variable, while the copy always takes roughly the same time). A bigger
improvement can be expected if measured on a larger host with more
CPUs.

All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
live migration, and the UFFD can be passed to a manager process
through unix domain sockets to satisfy point 5).
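
Nothing userfaultfd-specific is needed for that: it's plain
SCM_RIGHTS file descriptor passing over a unix domain socket, roughly
as in this sketch (error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* sketch: hand the userfaultfd to a manager process over a socket */
static int send_uffd_to_manager(int sock, int uffd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &uffd, sizeof(int));

	/* the manager receives its own fd for the same uffd context */
	return sendmsg(sock, &msg, 0);
}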

I look forward to discussing this further next week at the LSF/MM
summit; if you're attending the summit, see you soon!

Comments welcome, thanks,
Andrea

Credits: partially funded by the Orbit EU project.

PS. There is one TODO detail worth mentioning for completeness that
affects usage 2) and UFFDIO_REMAP if used to remove memory from the
userfault range: handle_userfault() is only effective if
FAULT_FLAG_ALLOW_RETRY is set... but that is only set on the first
attempted page fault. If by accident some thread was already faulting
in the range, the first page fault attempt returned VM_FAULT_RETRY,
and UFFDIO_REMAP or UFFDIO_WP jumps in to arm the userfault just
before the second attempt starts, a SIGBUS would be raised by the page
fault. Stopping all thread access to the userfault ranges during
UFFDIO_REMAP/WP, while possible, isn't optimal. Currently (excluding
real filebacked mappings and handle_userfault() itself, which is
clearly no problem) only tmpfs or a swapin can return
VM_FAULT_RETRY. To close this SIGBUS window for all usages, the
simplest solution would be to still allow VM_FAULT_RETRY to be
returned when FAULT_FLAG_TRIED is set (but only by handle_userfault(),
which has a legitimate reason for insisting a second time in a row
with VM_FAULT_RETRY). That would require some change to the FAULT_FLAG
semantics. Again, userland could cope with this detail, but it'd be
inefficient to solve it in userland. This would be a fully backwards
compatible change and it's only strictly required by the wrprotect
tracking mode, so it's no problem to solve this later. Because of its
inherent racy nature, nobody could possibly depend on a racy SIGBUS
being raised now, when it won't be raised anymore later.

Andrea Arcangeli (21):
  userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
  userfaultfd: linux/Documentation/vm/userfaultfd.txt
  userfaultfd: uAPI
  userfaultfd: linux/userfaultfd_k.h
  userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
  userfaultfd: call handle_userfault() for userfaultfd_missing() faults
  userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
  userfaultfd: prevent khugepaged to merge if userfaultfd is armed
  userfaultfd: add new syscall to provide memory externalization
  userfaultfd: buildsystem activation
  userfaultfd: activate syscall
  userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
  userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE
    preparation
  userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
  userfaultfd: remap_pages: rmap preparation
  userfaultfd: remap_pages: swp_entry_swapcount() preparation
  userfaultfd: UFFDIO_REMAP uABI
  userfaultfd: remap_pages: UFFDIO_REMAP preparation
  userfaultfd: UFFDIO_REMAP
  userfaultfd: add userfaultfd_wp mm helpers

 Documentation/ioctl/ioctl-number.txt   |    1 +
 Documentation/vm/userfaultfd.txt       |   97 +++
 arch/powerpc/include/asm/systbl.h      |    1 +
 arch/powerpc/include/asm/unistd.h      |    2 +-
 arch/powerpc/include/uapi/asm/unistd.h |    1 +
 arch/x86/syscalls/syscall_32.tbl       |    1 +
 arch/x86/syscalls/syscall_64.tbl       |    1 +
 fs/Makefile                            |    1 +
 fs/userfaultfd.c                       | 1128 ++++++++++++++++++++++++++++++++
 include/linux/mm.h                     |    4 +-
 include/linux/mm_types.h               |   11 +
 include/linux/swap.h                   |    6 +
 include/linux/syscalls.h               |    1 +
 include/linux/userfaultfd_k.h          |  112 ++++
 include/linux/wait.h                   |    5 +-
 include/uapi/linux/userfaultfd.h       |  150 +++++
 init/Kconfig                           |   11 +
 kernel/fork.c                          |    3 +-
 kernel/sched/wait.c                    |    7 +-
 kernel/sys_ni.c                        |    1 +
 mm/Makefile                            |    1 +
 mm/huge_memory.c                       |  217 +++++-
 mm/madvise.c                           |    3 +-
 mm/memory.c                            |   16 +
 mm/mempolicy.c                         |    4 +-
 mm/mlock.c                             |    3 +-
 mm/mmap.c                              |   39 +-
 mm/mprotect.c                          |    3 +-
 mm/rmap.c                              |    9 +
 mm/swapfile.c                          |   13 +
 mm/userfaultfd.c                       |  793 ++++++++++++++++++++++
 net/sunrpc/sched.c                     |    2 +-
 32 files changed, 2593 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/vm/userfaultfd.txt
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd_k.h
 create mode 100644 include/uapi/linux/userfaultfd.h
 create mode 100644 mm/userfaultfd.c


* [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

userfaultfd needs to wake all waiters in a waitqueue (by passing 0 as
the nr parameter), instead of the current hardcoded 1 (which would
wake just the first waiter in the head list).
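
After this change callers choose between the old and the new
behaviour through nr, e.g. (illustrative only; the waitqueue head and
key names are made up):

	/* previous behaviour: wake only the first waiter in the list */
	__wake_up_locked_key(&wqh, TASK_NORMAL, 1, &key);

	/* what userfaultfd needs: nr == 0 wakes all waiters matching key */
	__wake_up_locked_key(&wqh, TASK_NORMAL, 0, &key);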

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ++++---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2db8334..cf884cf 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,7 +147,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -179,7 +180,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)					\
-	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+	__wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 852143a..6da208dd2 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -106,9 +106,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	__wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -283,7 +284,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, key);
+		__wake_up_locked_key(q, mode, 1, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index b91fd9c..dead9e0 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
 	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
 	ret = atomic_dec_and_test(&task->tk_count);
 	if (waitqueue_active(wq))
-		__wake_up_locked_key(wq, TASK_NORMAL, &k);
+		__wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
 	spin_unlock_irqrestore(&wq->lock, flags);
 	return ret;
 }


* [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-06 15:39   ` [Qemu-devel] " Eric Blake
  2015-03-05 17:17 ` [PATCH 03/21] userfaultfd: uAPI Andrea Arcangeli
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Add documentation.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/vm/userfaultfd.txt

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 0000000..2ec296c
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,97 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow implementing on demand paging from userland and,
+more generally, they allow userland to take control of various memory
+page faults, something otherwise only the kernel code could do.
+
+For example, userfaults allow a proper and more optimal
+implementation of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides for two primary functionalities:
+
+1) read/POLLIN protocol to notify a userland thread of the faults
+   happening
+
+2) various UFFDIO_* ioctls that can manipulate the virtual memory
+   regions registered in the userfaultfd, allowing userland to
+   efficiently resolve the userfaults it receives via 1) or to modify
+   the virtual memory in the background
+
+The real advantage of userfaults compared to regular virtual memory
+management (mremap/mprotect) is that userfault operations never
+involve heavyweight structures like vmas (in fact the userfaultfd
+runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page- (or hugepage-) granular fault
+tracking when dealing with virtual address spaces that could span
+terabytes. Too many vmas would be needed for that.
+
+The userfaultfd, once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware of what is going on
+(well of course unless they later try to use the userfaultfd
+themselves on the same region the manager is already tracking, which
+is a corner case that would currently return -EBUSY).
+
+== API ==
+
+When first opened the userfaultfd must be enabled by invoking the
+UFFDIO_API ioctl with an uffdio_api.api value set to UFFD_API, which
+specifies the read/POLLIN protocol userland intends to speak on the
+UFFD. The UFFDIO_API ioctl, if successful (i.e. if the requested
+uffdio_api.api is spoken also by the running kernel), will return in
+uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks describing,
+respectively, the activated feature bits below PAGE_SHIFT in the
+userfault addresses returned by read(2) and the generic ioctls
+available.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to modify the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means a userfault
+could be triggering just before userland maps the user-faulted page
+in the background. To avoid POLLIN resulting in an unexpected blocking
+read (if the UFFD is not opened in nonblocking mode in the first
+place), we don't allow the background thread to wake userfaults that
+haven't been read by userland yet. If we did that, the UFFDIO_WAKE
+ioctl could likely be dropped. This may change in the future (with a
+UFFD_API protocol bump combined with the removal of the UFFDIO_WAKE
+ioctl) if it's demonstrated to be a valid optimization worth forcing
+userland to always use the UFFD in nonblocking mode when combined
+with POLLIN.
+
+userfaultfd is also a generic enough feature that it allows KVM to
+implement postcopy live migration (one form of memory externalization
+consisting of a virtual machine running with part or all of its memory
+residing on a different node in the cloud) without having to modify a
+single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
+and all other GUP features work just fine in combination with
+userfaults (userfaults trigger async page faults in the guest
+scheduler so those guest processes that aren't waiting for userfaults
+can keep running in the guest vcpus).
+
+The primary ioctl to resolve userfaults is UFFDIO_COPY. It
+atomically copies a page into the userfault registered range and wakes
+up the blocked userfaults (unless uffdio_copy.mode &
+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
+UFFDIO_COPY.
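
For illustration, a minimal userland sketch of the handshake and
registration described in the API section above could look as follows
(error handling omitted; it assumes __NR_userfaultfd is wired up as
in the syscall activation patches of this series):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* sketch: enable the protocol and register a range for missing faults */
static int uffd_setup(void *area, unsigned long len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	int uffd;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -1;

	/* mandatory handshake: declare the protocol userland speaks */
	if (ioctl(uffd, UFFDIO_API, &api))
		return -1;
	/* api.bits/api.ioctls now describe what the kernel provides */

	reg.range.start = (unsigned long)area;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;
	/* reg.ioctls lists the ioctls usable to resolve faults here */

	return uffd;
}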


* [PATCH 03/21] userfaultfd: uAPI
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/ioctl/ioctl-number.txt |  1 +
 include/uapi/linux/userfaultfd.h     | 81 ++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)
 create mode 100644 include/uapi/linux/userfaultfd.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 8136e1f..be2d4a2 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -301,6 +301,7 @@ Code  Seq#(hex)	Include File		Comments
 0xA3	80-8F	Port ACL		in development:
 					<mailto:tlewis@mindspring.com>
 0xA3	90-9F	linux/dtlk.h
+0xAA	00-3F	linux/uapi/linux/userfaultfd.h
 0xAB	00-1F	linux/nbd.h
 0xAC	00-1F	linux/raw.h
 0xAD	00	Netfilter device	in development:
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
new file mode 100644
index 0000000..9a8cd56
--- /dev/null
+++ b/include/uapi/linux/userfaultfd.h
@@ -0,0 +1,81 @@
+/*
+ *  include/uapi/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#define UFFD_API ((__u64)0xAA)
+/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_IOCTLS				\
+	((__u64)1 << _UFFDIO_REGISTER |		\
+	 (__u64)1 << _UFFDIO_UNREGISTER |	\
+	 (__u64)1 << _UFFDIO_API)
+#define UFFD_API_RANGE_IOCTLS			\
+	((__u64)1 << _UFFDIO_WAKE)
+
+/*
+ * Valid ioctl command number range with this API is from 0x00 to
+ * 0x3F.  UFFDIO_API is the fixed number, everything else can be
+ * changed by implementing a different UFFD_API. If sticking to the
+ * same UFFD_API more ioctls can be added and userland will be aware
+ * of which ioctls the running kernel implements through the ioctl
+ * command bitmask written by UFFDIO_API.
+ */
+#define _UFFDIO_REGISTER		(0x00)
+#define _UFFDIO_UNREGISTER		(0x01)
+#define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_API			(0x3F)
+
+/* userfaultfd ioctl ids */
+#define UFFDIO 0xAA
+#define UFFDIO_API		_IOWR(UFFDIO, _UFFDIO_API,	\
+				      struct uffdio_api)
+#define UFFDIO_REGISTER		_IOWR(UFFDIO, _UFFDIO_REGISTER, \
+				      struct uffdio_register)
+#define UFFDIO_UNREGISTER	_IOR(UFFDIO, _UFFDIO_UNREGISTER,	\
+				     struct uffdio_range)
+#define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
+				     struct uffdio_range)
+
+/*
+ * Valid bits below PAGE_SHIFT in the userfault address read through
+ * the read() syscall.
+ */
+#define UFFD_BIT_WRITE	(1<<0)	/* this was a write fault, MISSING or WP */
+#define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
+#define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
+
+struct uffdio_api {
+	/* userland asks for an API number */
+	__u64 api;
+
+	/* kernel answers below with the available features for the API */
+	__u64 bits;
+	__u64 ioctls;
+};
+
+struct uffdio_range {
+	__u64 start;
+	__u64 len;
+};
+
+struct uffdio_register {
+	struct uffdio_range range;
+#define UFFDIO_REGISTER_MODE_MISSING	((__u64)1<<0)
+#define UFFDIO_REGISTER_MODE_WP		((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * kernel answers which ioctl commands are available for the
+	 * range, keep at the end as the last 8 bytes aren't read.
+	 */
+	__u64 ioctls;
+};
+
+#endif /* _LINUX_USERFAULTFD_H */
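
Since the feature bits live below PAGE_SHIFT in the value returned by
read(2), userland would decode it along these lines (a sketch; the
hardcoded 4k page size is an assumption):

#include <stdint.h>
#include <linux/userfaultfd.h>

#define UFFD_PAGE_SIZE 4096UL	/* assumption: must match kernel PAGE_SIZE */

/* sketch: split a read(2) userfault value into address and flags */
static void decode_userfault(uint64_t val)
{
	uint64_t addr = val & ~(UFFD_PAGE_SIZE - 1);	/* faulting address */
	uint64_t bits = val & (UFFD_PAGE_SIZE - 1);	/* UFFD_BIT_* flags */

	if (bits & UFFD_BIT_WRITE) {
		/* write fault: resolving with a zeropage would be wrong */
	}
	(void)addr;	/* resolve the fault at addr */
}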


* [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 03/21] userfaultfd: uAPI Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h | 79 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 include/linux/userfaultfd_k.h

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
new file mode 100644
index 0000000..e1e4360
--- /dev/null
+++ b/include/linux/userfaultfd_k.h
@@ -0,0 +1,79 @@
+/*
+ *  include/linux/userfaultfd_k.h
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_K_H
+#define _LINUX_USERFAULTFD_K_H
+
+#ifdef CONFIG_USERFAULTFD
+
+#include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (UFFD_SHARED_FCNTL_FLAGS)
+
+extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+			    unsigned int flags, unsigned long reason);
+
+/* mm helpers */
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_MISSING;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+}
+
+#else /* CONFIG_USERFAULTFD */
+
+/* mm helpers */
+static inline int handle_userfault(struct vm_area_struct *vma,
+				   unsigned long address,
+				   unsigned int flags,
+				   unsigned long reason)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return true;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+#endif /* CONFIG_USERFAULTFD */
+
+#endif /* _LINUX_USERFAULTFD_K_H */


* [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:48   ` Pavel Emelyanov
  2015-03-05 17:17 ` [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This adds the vm_userfaultfd_ctx to the vm_area_struct.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm_types.h | 11 +++++++++++
 kernel/fork.c            |  1 +
 2 files changed, 12 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03a..fbf21f5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -247,6 +247,16 @@ struct vm_region {
 						* this region */
 };
 
+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+	struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -313,6 +323,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 };
 
 struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..cb215c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_next = tmp->vm_prev = NULL;
+		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);


* [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

These two flags get set in vma->vm_flags to tell the VM common code
whether the userfaultfd is armed and in which mode (only tracking
missing faults, only tracking wrprotect faults, or both). If neither
flag is set, the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h | 2 ++
 kernel/fork.c      | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..762ef9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -123,8 +123,10 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MAYSHARE	0x00000080
 
 #define VM_GROWSDOWN	0x00000100	/* general info on the segment */
+#define VM_UFFD_MISSING	0x00000200	/* missing pages tracking */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
+#define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
diff --git a/kernel/fork.c b/kernel/fork.c
index cb215c0..cfab6e9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -423,7 +423,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		tmp->vm_mm = mm;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~VM_LOCKED;
+		tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;


* [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This is where the page fault code must be modified to call
handle_userfault() if userfaultfd_missing() is true (i.e. if the
vma->vm_flags has VM_UFFD_MISSING set).

handle_userfault() then takes care of blocking the page fault and
delivering it to userland.

The fault flags must also be passed as a parameter so the "read|write"
kind of fault can be passed to userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 68 ++++++++++++++++++++++++++++++++++++++------------------
 mm/memory.c      | 16 +++++++++++++
 2 files changed, 62 insertions(+), 22 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f0207cf..5374132 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -708,7 +709,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
-					struct page *page)
+					struct page *page, unsigned int flags)
 {
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
@@ -716,12 +717,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
-		return VM_FAULT_OOM;
+	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg)) {
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK);
+		return VM_FAULT_FALLBACK;
+	}
 
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
 		mem_cgroup_cancel_charge(page, memcg);
+		put_page(page);
 		return VM_FAULT_OOM;
 	}
 
@@ -741,6 +746,21 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
+
+		/* Deliver the page fault to userland */
+		if (userfaultfd_missing(vma)) {
+			int ret;
+
+			spin_unlock(ptl);
+			mem_cgroup_cancel_charge(page, memcg);
+			put_page(page);
+			pte_free(mm, pgtable);
+			ret = handle_userfault(vma, haddr, flags,
+					       VM_UFFD_MISSING);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
+		}
+
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
@@ -751,6 +771,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		atomic_long_inc(&mm->nr_ptes);
 		spin_unlock(ptl);
+		count_vm_event(THP_FAULT_ALLOC);
 	}
 
 	return 0;
@@ -762,19 +783,16 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 }
 
 /* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
 {
 	pmd_t entry;
-	if (!pmd_none(*pmd))
-		return false;
 	entry = mk_pmd(zero_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	atomic_long_inc(&mm->nr_ptes);
-	return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -797,6 +815,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgtable_t pgtable;
 		struct page *zero_page;
 		bool set;
+		int ret;
 		pgtable = pte_alloc_one(mm, haddr);
 		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
@@ -807,14 +826,28 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_FALLBACK;
 		}
 		ptl = pmd_lock(mm, pmd);
-		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-				zero_page);
-		spin_unlock(ptl);
+		ret = 0;
+		set = false;
+		if (pmd_none(*pmd)) {
+			if (userfaultfd_missing(vma)) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr, flags,
+						       VM_UFFD_MISSING);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
+				set_huge_zero_page(pgtable, mm, vma,
+						   haddr, pmd,
+						   zero_page);
+				spin_unlock(ptl);
+				set = true;
+			}
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
 		}
-		return 0;
+		return ret;
 	}
 	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
@@ -822,14 +855,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
-
-	count_vm_event(THP_FAULT_ALLOC);
-	return 0;
+	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -864,16 +890,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 */
 	if (is_huge_zero_pmd(pmd)) {
 		struct page *zero_page;
-		bool set;
 		/*
 		 * get_huge_zero_page() will never allocate a new page here,
 		 * since we already have a zero page to copy. It just takes a
 		 * reference.
 		 */
 		zero_page = get_huge_zero_page();
-		set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
 				zero_page);
-		BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
 		ret = 0;
 		goto out_unlock;
 	}
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..0ae719c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2585,6 +2586,12 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto unlock;
+		/* Deliver the page fault to userland, check inside PT lock */
+		if (userfaultfd_missing(vma)) {
+			pte_unmap_unlock(page_table, ptl);
+			return handle_userfault(vma, address, flags,
+						VM_UFFD_MISSING);
+		}
 		goto setpte;
 	}
 
@@ -2612,6 +2619,15 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_none(*page_table))
 		goto release;
 
+	/* Deliver the page fault to userland, check inside PT lock */
+	if (userfaultfd_missing(vma)) {
+		pte_unmap_unlock(page_table, ptl);
+		mem_cgroup_cancel_charge(page, memcg);
+		page_cache_release(page);
+		return handle_userfault(vma, address, flags,
+					VM_UFFD_MISSING);
+	}
+
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 	mem_cgroup_commit_charge(page, memcg, false);


* [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware of, so that we can merge vmas back to the way they were
originally before arming the userfaultfd on some memory range.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h |  2 +-
 mm/madvise.c       |  3 ++-
 mm/mempolicy.c     |  4 ++--
 mm/mlock.c         |  3 ++-
 mm/mmap.c          | 39 +++++++++++++++++++++++++++------------
 mm/mprotect.c      |  3 ++-
 6 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 762ef9d..26cef61 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1879,7 +1879,7 @@ extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *);
+	struct mempolicy *, struct vm_userfaultfd_ctx);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..10f62b7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -102,7 +102,8 @@ static long madvise_behavior(struct vm_area_struct *vma,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
-				vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4721046..e1a2e9b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -722,8 +722,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 		pgoff = vma->vm_pgoff +
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
-				  vma->anon_vma, vma->vm_file, pgoff,
-				  new_pol);
+				 vma->anon_vma, vma->vm_file, pgoff,
+				 new_pol, vma->vm_userfaultfd_ctx);
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73cf098..9725abe 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -566,7 +566,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
-			  vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index da9990a..135c2fa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
 #include <linux/notifier.h>
 #include <linux/memory.h>
 #include <linux/printk.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -921,7 +922,8 @@ again:			remove_next = 1 + (end > next->vm_end);
  * per-vma resources, so we don't attempt to merge those.
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
-			struct file *file, unsigned long vm_flags)
+				struct file *file, unsigned long vm_flags,
+				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -937,6 +939,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
 		return 0;
 	if (vma->vm_ops && vma->vm_ops->close)
 		return 0;
+	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+		return 0;
 	return 1;
 }
 
@@ -967,9 +971,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		     struct anon_vma *anon_vma, struct file *file,
+		     pgoff_t vm_pgoff,
+		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
@@ -986,9 +992,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
  */
 static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		    struct anon_vma *anon_vma, struct file *file,
+		    pgoff_t vm_pgoff,
+		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -1031,7 +1039,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
 			struct anon_vma *anon_vma, struct file *file,
-			pgoff_t pgoff, struct mempolicy *policy)
+			pgoff_t pgoff, struct mempolicy *policy,
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1058,14 +1067,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (prev && prev->vm_end == addr &&
 			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags,
-						anon_vma, file, pgoff)) {
+					    anon_vma, file, pgoff,
+					    vm_userfaultfd_ctx)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
 		if (next && end == next->vm_start &&
 				mpol_equal(policy, vma_policy(next)) &&
 				can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen) &&
+						     anon_vma, file,
+						     pgoff+pglen,
+						     vm_userfaultfd_ctx) &&
 				is_mergeable_anon_vma(prev->anon_vma,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
@@ -1086,7 +1098,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (next && end == next->vm_start &&
 			mpol_equal(policy, vma_policy(next)) &&
 			can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen)) {
+					     anon_vma, file, pgoff+pglen,
+					     vm_userfaultfd_ctx)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = vma_adjust(prev, prev->vm_start,
 				addr, prev->vm_pgoff, NULL);
@@ -1573,7 +1586,8 @@ munmap_back:
 	/*
 	 * Can we just expand an old mapping?
 	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2760,7 +2774,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-					NULL, NULL, pgoff, NULL);
+			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2916,7 +2930,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			    vma->vm_userfaultfd_ctx);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4472781..c98a074 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -287,7 +287,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 */
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			   vma->vm_userfaultfd_ctx);
 	if (*pprev) {
 		vma = *pprev;
 		goto success;


* [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

If userfaultfd is armed on a certain vma we can't "fill" the holes
with zeroes or we'll break the userland on demand paging. The holes,
if the userfault is armed, are really missing information (not
zeroes) that userland has to load from the network or elsewhere.

The same issue happens for wrprotected ptes that we can't just
convert into a single writable pmd_trans_huge.

We could however in theory still merge across zeropages if only
VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be
slightly improved, but it'd be much more complex code for a tiny
corner case.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5374132..8f1b6a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2145,7 +2145,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			if (++none_or_zero <= khugepaged_max_ptes_none)
+			if (!userfaultfd_armed(vma) &&
+			    ++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out;
@@ -2593,7 +2594,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			if (++none_or_zero <= khugepaged_max_ptes_none)
+			if (!userfaultfd_armed(vma) &&
+			    ++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out_unmap;


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:57   ` Pavel Emelyanov
  2015-03-06 10:48   ` Michael Kerrisk (man-pages)
  2015-03-05 17:17 ` [PATCH 11/21] userfaultfd: buildsystem activation Andrea Arcangeli
                   ` (11 subsequent siblings)
  21 siblings, 2 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Once a userfaultfd has been created and certain regions of the
process virtual address space have been registered with it, the
thread responsible for doing the memory externalization can manage
the page faults in userland by talking to the kernel using the
userfaultfd protocol.

poll() can be used to learn when there are new pending userfaults to
be read (POLLIN).
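
To illustrate the protocol (this example is not part of the patch), a
minimal fault-handling loop against this v3 API could look roughly as
follows; it assumes the uapi header from this series is installed,
hardcodes the x86-64 syscall number 323 from the syscall activation
patch, and elides UFFDIO_REGISTER and most error handling:

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	int main(void)
	{
		int uffd = syscall(323 /* __NR_userfaultfd, x86-64 */,
				   O_CLOEXEC | O_NONBLOCK);
		struct uffdio_api api = { .api = UFFD_API };
		struct pollfd pfd = { .fd = uffd, .events = POLLIN };
		__u64 addr;

		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
			return 1;
		/* ... UFFDIO_REGISTER a range, start the faulting thread ... */
		while (poll(&pfd, 1, -1) > 0) {
			/* each read returns one faulting address; the
			   low bits carry the UFFD_BIT_* flags */
			if (read(uffd, &addr, sizeof(addr)) != sizeof(addr))
				continue;
			printf("fault at 0x%llx\n", (unsigned long long)addr);
			/* resolve with UFFDIO_COPY/UFFDIO_ZEROPAGE,
			   which also wakes the faulting thread */
		}
		return 0;
	}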

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 977 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 977 insertions(+)
 create mode 100644 fs/userfaultfd.c

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..6b31967
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,977 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/mempolicy.h>
+#include <linux/ioctl.h>
+#include <linux/security.h>
+
+enum userfaultfd_state {
+	UFFD_STATE_WAIT_API,
+	UFFD_STATE_RUNNING,
+};
+
+struct userfaultfd_ctx {
+	/* pseudo fd refcounting */
+	atomic_t refcount;
+	/* waitqueue head for the userfaultfd page faults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* state machine */
+	enum userfaultfd_state state;
+	/* released */
+	bool released;
+	/* mm with one or more vmas attached to this userfaultfd_ctx */
+	struct mm_struct *mm;
+};
+
+struct userfaultfd_wait_queue {
+	unsigned long address;
+	wait_queue_t wq;
+	bool pending;
+	struct userfaultfd_ctx *ctx;
+};
+
+struct userfaultfd_wake_range {
+	unsigned long start;
+	unsigned long len;
+};
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+				     int wake_flags, void *key)
+{
+	struct userfaultfd_wake_range *range = key;
+	int ret;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long start, len;
+
+	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+	ret = 0;
+	/* don't wake the pending ones, to avoid making reads block */
+	if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
+		goto out;
+	/* len == 0 means wake all */
+	start = range->start;
+	len = range->len;
+	if (len && (start > uwq->address || start + len <= uwq->address))
+		goto out;
+	ret = wake_up_state(wq->private, mode);
+	if (ret)
+		/* wake only once, autoremove behavior */
+		list_del_init(&wq->task_list);
+out:
+	return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: nothing; BUG()s if the refcount had already dropped to zero.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+	if (!atomic_inc_not_zero(&ctx->refcount))
+		BUG();
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+	if (atomic_dec_and_test(&ctx->refcount)) {
+		mmdrop(ctx->mm);
+		kfree(ctx);
+	}
+}
+
+static inline unsigned long userfault_address(unsigned long address,
+					      unsigned int flags,
+					      unsigned long reason)
+{
+	BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
+	address &= PAGE_MASK;
+	if (flags & FAULT_FLAG_WRITE)
+		/*
+		 * Encode "write" fault information in the LSB of the
+		 * address read by userland, without depending on
+		 * FAULT_FLAG_WRITE kernel internal value.
+		 */
+		address |= UFFD_BIT_WRITE;
+	if (reason & VM_UFFD_WP)
+		/*
+		 * Encode "reason" fault information as bit number 1
+		 * in the address read by userland. If bit number 1 is
+		 * clear it means the reason is a VM_FAULT_MISSING
+		 * fault.
+		 */
+		address |= UFFD_BIT_WP;
+	return address;
+}
+
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there are
+ * fatal_signal_pending()s, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags, unsigned long reason)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue uwq;
+
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (!ctx)
+		return VM_FAULT_SIGBUS;
+
+	BUG_ON(ctx->mm != mm);
+
+	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
+	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
+
+	/*
+	 * If it's already released don't get it. This avoids looping
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_sem.
+	 */
+	if (unlikely(ACCESS_ONCE(ctx->released)))
+		return VM_FAULT_SIGBUS;
+
+	/* check that we can return VM_FAULT_RETRY */
+	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+		/*
+		 * Validate the invariant that nowait must allow retry
+		 * to be sure not to return SIGBUS erroneously on
+		 * nowait invocations.
+		 */
+		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+#ifdef CONFIG_DEBUG_VM
+		if (printk_ratelimit()) {
+			printk(KERN_WARNING
+			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+			dump_stack();
+		}
+#endif
+		return VM_FAULT_SIGBUS;
+	}
+
+	/*
+	 * Handle nowait, not much to do other than tell it to retry
+	 * and wait.
+	 */
+	if (flags & FAULT_FLAG_RETRY_NOWAIT)
+		return VM_FAULT_RETRY;
+
+	/* take the reference before dropping the mmap_sem */
+	userfaultfd_ctx_get(ctx);
+
+	/* be gentle and immediately relinquish the mmap_sem */
+	up_read(&mm->mmap_sem);
+
+	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+	uwq.wq.private = current;
+	uwq.address = userfault_address(address, flags, reason);
+	uwq.pending = true;
+	uwq.ctx = ctx;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_KILLABLE);
+		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
+		    fatal_signal_pending(current))
+			break;
+		spin_unlock(&ctx->fault_wqh.lock);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		spin_lock(&ctx->fault_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * already released.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return VM_FAULT_RETRY;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
+	/* len == 0 means wake all */
+	struct userfaultfd_wake_range range = { .len = 0, };
+	unsigned long new_flags;
+
+	ACCESS_ONCE(ctx->released) = true;
+
+	/*
+	 * Flush page faults out of all CPUs. NOTE: all page faults
+	 * must be retried without returning VM_FAULT_SIGBUS if
+	 * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
+	 * changes while handle_userfault released the mmap_sem. So
+	 * it's critical that released is set to true (above), before
+	 * taking the mmap_sem for writing.
+	 */
+	down_write(&mm->mmap_sem);
+	prev = NULL;
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		cond_resched();
+		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
+		       !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
+			prev = vma;
+			continue;
+		}
+		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
+				 new_flags, vma->anon_vma,
+				 vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_UFFD_CTX);
+		if (prev)
+			vma = prev;
+		else
+			prev = vma;
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+	}
+	up_write(&mm->mmap_sem);
+
+	/*
+	 * After no new page faults can wait on this fault_wqh, flush
+	 * the last page faults that may have been already waiting on
+	 * the fault_wqh.
+	 */
+	spin_lock(&ctx->fault_wqh.lock);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	wake_up_poll(&ctx->fd_wqh, POLLHUP);
+	userfaultfd_ctx_put(ctx);
+	return 0;
+}
+
+static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+					  struct userfaultfd_wait_queue **uwq)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *_uwq;
+	unsigned int ret = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (_uwq->pending) {
+			ret = POLLIN;
+			if (uwq)
+				*uwq = _uwq;
+			break;
+		}
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return ret;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	poll_wait(file, &ctx->fd_wqh, wait);
+
+	switch (ctx->state) {
+	case UFFD_STATE_WAIT_API:
+		return POLLERR;
+	case UFFD_STATE_RUNNING:
+		return find_userfault(ctx, NULL);
+	default:
+		BUG();
+	}
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+				    __u64 *addr)
+{
+	ssize_t ret;
+	DECLARE_WAITQUEUE(wait, current);
+	struct userfaultfd_wait_queue *uwq = NULL;
+
+	/* always take the fd_wqh lock before the fault_wqh lock */
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (find_userfault(ctx, &uwq)) {
+			uwq->pending = false;
+			/* careful to always initialize addr if ret == 0 */
+			*addr = uwq->address;
+			ret = 0;
+			break;
+		}
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		if (no_wait) {
+			ret = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock_irq(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock_irq(&ctx->fd_wqh.lock);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t _ret, ret = 0;
+	/* careful to always initialize addr if ret == 0 */
+	__u64 uninitialized_var(addr);
+	int no_wait = file->f_flags & O_NONBLOCK;
+
+	if (ctx->state == UFFD_STATE_WAIT_API)
+		return -EINVAL;
+	BUG_ON(ctx->state != UFFD_STATE_RUNNING);
+
+	for (;;) {
+		if (count < sizeof(addr))
+			return ret ? ret : -EINVAL;
+		_ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
+		if (_ret < 0)
+			return ret ? ret : _ret;
+		if (put_user(addr, (__u64 __user *) buf))
+			return ret ? ret : -EFAULT;
+		ret += sizeof(addr);
+		buf += sizeof(addr);
+		count -= sizeof(addr);
+		/*
+		 * Allow reading more than one fault at a time, but only
+		 * block if waiting for the very first one.
+		 */
+		no_wait = O_NONBLOCK;
+	}
+}
+
+static int __wake_userfault(struct userfaultfd_ctx *ctx,
+			    struct userfaultfd_wake_range *range)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	int ret;
+	unsigned long start, end;
+
+	start = range->start;
+	end = range->start + range->len;
+
+	ret = -ENOENT;
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			continue;
+		if (uwq->address >= start && uwq->address < end) {
+			ret = 0;
+			/* wake all in the range and autoremove */
+			__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
+					     range);
+			break;
+		}
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return ret;
+}
+
+static __always_inline int wake_userfault(struct userfaultfd_ctx *ctx,
+					  struct userfaultfd_wake_range *range)
+{
+	if (!waitqueue_active(&ctx->fault_wqh))
+		return -ENOENT;
+
+	return __wake_userfault(ctx, range);
+}
+
+static __always_inline int validate_range(struct mm_struct *mm,
+					  __u64 start, __u64 len)
+{
+	__u64 task_size = mm->task_size;
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	if (len & ~PAGE_MASK)
+		return -EINVAL;
+	if (!len)
+		return -EINVAL;
+	if (start < mmap_min_addr)
+		return -EINVAL;
+	if (start >= task_size)
+		return -EINVAL;
+	if (len > task_size - start)
+		return -EINVAL;
+	return 0;
+}
+
+static int userfaultfd_register(struct userfaultfd_ctx *ctx,
+				unsigned long arg)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev, *cur;
+	int ret;
+	struct uffdio_register uffdio_register;
+	struct uffdio_register __user *user_uffdio_register;
+	unsigned long vm_flags, new_flags;
+	bool found;
+	unsigned long start, end, vma_end;
+
+	user_uffdio_register = (struct uffdio_register __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_register, user_uffdio_register,
+			   sizeof(uffdio_register)-sizeof(__u64)))
+		goto out;
+
+	ret = -EINVAL;
+	if (!uffdio_register.mode)
+		goto out;
+	if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
+				     UFFDIO_REGISTER_MODE_WP))
+		goto out;
+	vm_flags = 0;
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
+		vm_flags |= VM_UFFD_MISSING;
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+		vm_flags |= VM_UFFD_WP;
+		/*
+		 * FIXME: remove the below error constraint by
+		 * implementing the wprotect tracking mode.
+		 */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = validate_range(mm, uffdio_register.range.start,
+			     uffdio_register.range.len);
+	if (ret)
+		goto out;
+
+	start = uffdio_register.range.start;
+	end = start + uffdio_register.range.len;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma_prev(mm, start, &prev);
+
+	ret = -ENOMEM;
+	if (!vma)
+		goto out_unlock;
+
+	/* check that there's at least one vma in the range */
+	ret = -EINVAL;
+	if (vma->vm_start >= end)
+		goto out_unlock;
+
+	/*
+	 * Search for not compatible vmas.
+	 *
+	 * FIXME: this shall be relaxed later so that it doesn't fail
+	 * on tmpfs backed vmas (in addition to the current allowance
+	 * on anonymous vmas).
+	 */
+	found = false;
+	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+		cond_resched();
+
+		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
+		       !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+
+		/* check not compatible vmas */
+		ret = -EINVAL;
+		if (cur->vm_ops)
+			goto out_unlock;
+
+		/*
+		 * Check that this vma isn't already owned by a
+		 * different userfaultfd. We can't allow more than one
+		 * userfaultfd to own a single vma simultaneously or we
+		 * wouldn't know which one to deliver the userfaults to.
+		 */
+		ret = -EBUSY;
+		if (cur->vm_userfaultfd_ctx.ctx &&
+		    cur->vm_userfaultfd_ctx.ctx != ctx)
+			goto out_unlock;
+
+		found = true;
+	}
+	BUG_ON(!found);
+
+	/*
+	 * Now that we scanned all vmas we can already tell userland which
+	 * ioctl methods are guaranteed to succeed on this range.
+	 */
+	ret = -EFAULT;
+	if (put_user(UFFD_API_RANGE_IOCTLS, &user_uffdio_register->ioctls))
+		goto out_unlock;
+
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	do {
+		cond_resched();
+
+		BUG_ON(vma->vm_ops);
+		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
+		       vma->vm_userfaultfd_ctx.ctx != ctx);
+
+		/*
+		 * Nothing to do: this vma is already registered into this
+		 * userfaultfd and with the right tracking mode too.
+		 */
+		if (vma->vm_userfaultfd_ctx.ctx == ctx &&
+		    (vma->vm_flags & vm_flags) == vm_flags)
+			goto skip;
+
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+		vma_end = min(end, vma->vm_end);
+
+		new_flags = (vma->vm_flags & ~vm_flags) | vm_flags;
+		prev = vma_merge(mm, prev, start, vma_end, new_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 ((struct vm_userfaultfd_ctx){ ctx }));
+		if (prev) {
+			vma = prev;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret)
+				break;
+		}
+	next:
+		/*
+		 * In the vma_merge() successful mprotect-like case 8:
+		 * the next vma was merged into the current one and
+		 * the current one has not been updated yet.
+		 */
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx.ctx = ctx;
+
+	skip:
+		prev = vma;
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+out_unlock:
+	up_write(&mm->mmap_sem);
+out:
+	return ret;
+}
+
+static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
+				  unsigned long arg)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev, *cur;
+	int ret;
+	struct uffdio_range uffdio_unregister;
+	unsigned long new_flags;
+	bool found;
+	unsigned long start, end, vma_end;
+	const void __user *buf = (void __user *)arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
+		goto out;
+
+	ret = validate_range(mm, uffdio_unregister.start,
+			     uffdio_unregister.len);
+	if (ret)
+		goto out;
+
+	start = uffdio_unregister.start;
+	end = start + uffdio_unregister.len;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma_prev(mm, start, &prev);
+
+	ret = -ENOMEM;
+	if (!vma)
+		goto out_unlock;
+
+	/* check that there's at least one vma in the range */
+	ret = -EINVAL;
+	if (vma->vm_start >= end)
+		goto out_unlock;
+
+	/*
+	 * Search for not compatible vmas.
+	 *
+	 * FIXME: this shall be relaxed later so that it doesn't fail
+	 * on tmpfs backed vmas (in addition to the current allowance
+	 * on anonymous vmas).
+	 */
+	found = false;
+	ret = -EINVAL;
+	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+		cond_resched();
+
+		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
+		       !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+
+		/*
+		 * Check not compatible vmas, not strictly required
+		 * here as not compatible vmas cannot have an
+		 * userfaultfd_ctx registered on them, but this
+		 * provides for more strict behavior to notice
+		 * unregistration errors.
+		 */
+		if (cur->vm_ops)
+			goto out_unlock;
+
+		found = true;
+	}
+	BUG_ON(!found);
+
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	do {
+		cond_resched();
+
+		BUG_ON(vma->vm_ops);
+
+		/*
+		 * Nothing to do: this vma is not registered into any
+		 * userfaultfd, so nothing needs unregistering here.
+		 */
+		if (!vma->vm_userfaultfd_ctx.ctx)
+			goto skip;
+
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+		vma_end = min(end, vma->vm_end);
+
+		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+		prev = vma_merge(mm, prev, start, vma_end, new_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_UFFD_CTX);
+		if (prev) {
+			vma = prev;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret)
+				break;
+		}
+	next:
+		/*
+		 * In the vma_merge() successful mprotect-like case 8:
+		 * the next vma was merged into the current one and
+		 * the current one has not been updated yet.
+		 */
+		vma->vm_flags = new_flags;
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+
+	skip:
+		prev = vma;
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+out_unlock:
+	up_write(&mm->mmap_sem);
+out:
+	return ret;
+}
+
+/*
+ * This is mostly needed to re-wakeup those userfaults that were still
+ * pending when userland woke them up the first time. We don't wake
+ * the pending ones, to avoid making blocking reads block (or
+ * non-blocking reads return -EAGAIN) and, when used with POLLIN, to
+ * avoid userland doubts about why POLLIN wasn't reliable.
+ */
+static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
+			    unsigned long arg)
+{
+	int ret;
+	struct uffdio_range uffdio_wake;
+	struct userfaultfd_wake_range range;
+	const void __user *buf = (void __user *)arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
+	if (ret)
+		goto out;
+
+	range.start = uffdio_wake.start;
+	range.len = uffdio_wake.len;
+
+	/*
+	 * len == 0 means wake all and we don't want to wake all here,
+	 * so check it again to be sure.
+	 */
+	VM_BUG_ON(!range.len);
+
+	ret = wake_userfault(ctx, &range);
+
+out:
+	return ret;
+}
+
+/*
+ * userland asks for a certain API version and we return which bits
+ * and ioctl commands are implemented in this kernel for such API
+ * version or -EINVAL if unknown.
+ */
+static int userfaultfd_api(struct userfaultfd_ctx *ctx,
+			   unsigned long arg)
+{
+	struct uffdio_api uffdio_api;
+	void __user *buf = (void __user *)arg;
+	int ret;
+
+	ret = -EINVAL;
+	if (ctx->state != UFFD_STATE_WAIT_API)
+		goto out;
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
+		goto out;
+	if (uffdio_api.api != UFFD_API) {
+		/* careful not to leak info, we only read the first 8 bytes */
+		memset(&uffdio_api, 0, sizeof(uffdio_api));
+		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
+			goto out;
+		ret = -EINVAL;
+		goto out;
+	}
+	/* careful not to leak info, we only read the first 8 bytes */
+	uffdio_api.bits = UFFD_API_BITS;
+	uffdio_api.ioctls = UFFD_API_IOCTLS;
+	ret = -EFAULT;
+	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
+		goto out;
+	ctx->state = UFFD_STATE_RUNNING;
+	ret = 0;
+out:
+	return ret;
+}
+
+static long userfaultfd_ioctl(struct file *file, unsigned cmd,
+			      unsigned long arg)
+{
+	int ret = -EINVAL;
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	switch(cmd) {
+	case UFFDIO_API:
+		ret = userfaultfd_api(ctx, arg);
+		break;
+	case UFFDIO_REGISTER:
+		ret = userfaultfd_register(ctx, arg);
+		break;
+	case UFFDIO_UNREGISTER:
+		ret = userfaultfd_unregister(ctx, arg);
+		break;
+	case UFFDIO_WAKE:
+		ret = userfaultfd_wake(ctx, arg);
+		break;
+	}
+	return ret;
+}
+
+#ifdef CONFIG_PROC_FS
+static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct userfaultfd_ctx *ctx = f->private_data;
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long pending = 0, total = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			pending++;
+		total++;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * If more protocols are added, they will all be shown
+	 * separated by a space. Like this:
+	 *	protocols: 0xaa 0xbb
+	 */
+	seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
+		   pending, total, UFFD_API, UFFD_API_BITS,
+		   UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= userfaultfd_show_fdinfo,
+#endif
+	.release	= userfaultfd_release,
+	.poll		= userfaultfd_poll,
+	.read		= userfaultfd_read,
+	.unlocked_ioctl = userfaultfd_ioctl,
+	.compat_ioctl	= userfaultfd_ioctl,
+	.llseek		= noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates a userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates a userfaultfd file pointer, without installing
+ * it into the fd table. This is useful when the userfaultfd file is
+ * used during the initialization of data structures that require
+ * extra setup after the userfaultfd creation. So the userfaultfd
+ * creation is split into the file pointer creation phase, and the
+ * file descriptor installation phase.  In this way races with
+ * userspace closing the newly installed file descriptor can be
+ * avoided.  Returns a userfaultfd file pointer, or a proper error
+ * pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+	struct file *file;
+	struct userfaultfd_ctx *ctx;
+
+	BUG_ON(!current->mm);
+
+	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+	file = ERR_PTR(-EINVAL);
+	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+		goto out;
+
+	file = ERR_PTR(-ENOMEM);
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto out;
+
+	atomic_set(&ctx->refcount, 1);
+	init_waitqueue_head(&ctx->fault_wqh);
+	init_waitqueue_head(&ctx->fd_wqh);
+	ctx->flags = flags;
+	ctx->state = UFFD_STATE_WAIT_API;
+	ctx->released = false;
+	ctx->mm = current->mm;
+	/* prevent the mm struct from being freed */
+	atomic_inc(&ctx->mm->mm_count);
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
+				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file))
+		kfree(ctx);
+out:
+	return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	int fd, error;
+	struct file *file;
+
+	error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	file = userfaultfd_file_create(flags);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+
+	return error;
+}


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 11/21] userfaultfd: buildsystem activation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 12/21] userfaultfd: activate syscall Andrea Arcangeli
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This allows userfaultfd to be selected at build configuration time.
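
With this applied the option defaults to y on kernels with an MMU, so
a configured tree is expected to end up carrying the following line
in its .config (the expected result, shown for illustration):

	CONFIG_USERFAULTFD=y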

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/Makefile  |  1 +
 init/Kconfig | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/fs/Makefile b/fs/Makefile
index a88ac48..ba8ab62 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
+obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..580dae7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1550,6 +1550,17 @@ config ADVISE_SYSCALLS
 	  applications use these syscalls, you can disable this option to save
 	  space.
 
+config USERFAULTFD
+	bool "Enable userfaultfd() system call"
+	select ANON_INODES
+	default y
+	depends on MMU
+	help
+	  Enable the userfaultfd() system call, which allows page faults
+	  to be intercepted and handled in userland.
+
+	  If unsure, say Y.
+
 config PCI_QUIRKS
 	default y
 	bool "Enable PCI quirk workarounds" if EXPERT


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 12/21] userfaultfd: activate syscall
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 11/21] userfaultfd: buildsystem activation Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This activates the userfaultfd syscall.
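
Since no libc wrapper exists at this point, userland is expected to
reach the new syscall via syscall(2); a hedged sketch, hardcoding the
x86-64 number 323 from the table below:

	#include <unistd.h>

	static int userfaultfd(int flags)
	{
		return (int) syscall(323 /* __NR_userfaultfd, x86-64 */, flags);
	}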

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/asm/unistd.h      | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 arch/x86/syscalls/syscall_32.tbl       | 1 +
 arch/x86/syscalls/syscall_64.tbl       | 1 +
 include/linux/syscalls.h               | 1 +
 kernel/sys_ni.c                        | 1 +
 7 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 91062ee..7f21cfd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -367,3 +367,4 @@ SYSCALL_SPU(getrandom)
 SYSCALL_SPU(memfd_create)
 SYSCALL_SPU(bpf)
 COMPAT_SYS(execveat)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 36b79c3..f4f8b66 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define __NR_syscalls		363
+#define __NR_syscalls		364
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index ef5b5b1..4b4f21e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -385,5 +385,6 @@
 #define __NR_memfd_create	360
 #define __NR_bpf		361
 #define __NR_execveat		362
+#define __NR_userfaultfd	363
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..a20f0b8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..f320b19 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..adf5901 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -810,6 +810,7 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..2a10e42 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -204,6 +204,7 @@ cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
 cond_syscall(sys_memfd_create);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 12/21] userfaultfd: activate syscall Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.
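
As a usage sketch (not part of the patch): after a faulting address
has been read from the userfaultfd, the manager thread would fill in
a struct uffdio_copy and check the written-back result field; uffd,
fault_addr, page_size and local_buf are placeholders here, and the
page-alignment mask also clears the UFFD_BIT_* flag bits:

	struct uffdio_copy copy = {
		.dst = fault_addr & ~(__u64)(page_size - 1), /* page aligned */
		.src = (__u64)(unsigned long)local_buf,
		.len = page_size,
		.mode = 0,	/* or UFFDIO_COPY_MODE_DONTWAKE */
	};

	if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
		/* copy.copy holds the bytes copied, or a negative error */
		fprintf(stderr, "UFFDIO_COPY: %lld\n", (long long)copy.copy);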

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 46 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9a8cd56..61251e6 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -17,7 +17,9 @@
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
 	 (__u64)1 << _UFFDIO_API)
 #define UFFD_API_RANGE_IOCTLS			\
-	((__u64)1 << _UFFDIO_WAKE)
+	((__u64)1 << _UFFDIO_WAKE |		\
+	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_ZEROPAGE)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -30,6 +32,8 @@
 #define _UFFDIO_REGISTER		(0x00)
 #define _UFFDIO_UNREGISTER		(0x01)
 #define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_COPY			(0x03)
+#define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -42,6 +46,10 @@
 				     struct uffdio_range)
 #define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
 				     struct uffdio_range)
+#define UFFDIO_COPY		_IOWR(UFFDIO, _UFFDIO_COPY,	\
+				      struct uffdio_copy)
+#define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
+				      struct uffdio_zeropage)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -78,4 +86,40 @@ struct uffdio_register {
 	__u64 ioctls;
 };
 
+struct uffdio_copy {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * There will be a wrprotection flag later that allows to map
+	 * pages wrprotected on the fly. And such a flag will be
+	 * available if the wrprotection ioctl are implemented for the
+	 * range according to the uffdio_register.ioctls.
+	 */
+#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * "copy" and "wake" are written by the ioctl and must be at
+	 * the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 copy;
+	__s64 wake;
+};
+
+struct uffdio_zeropage {
+	struct uffdio_range range;
+#define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * "zeropage" and "wake" are written by the ioctl and must be
+	 * at the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 zeropage;
+	__s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 18:07   ` Pavel Emelyanov
  2015-03-05 17:17 ` [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This implements mcopy_atomic and mfill_zeropage, the lowlevel VM
methods invoked respectively by the UFFDIO_COPY and UFFDIO_ZEROPAGE
userfaultfd commands.
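
Both return the number of bytes actually filled in, or a negative
error if nothing was filled at all; a caller sketch of that
convention (an assumption here, mirroring how the ioctl layer in a
later patch of this series consumes it):

	ssize_t done = mcopy_atomic(dst_mm, dst_start, src_start, len);
	if (done < 0)
		return done;	/* nothing filled: -ENOMEM, -EFAULT, -EEXIST... */
	/* 0 < done <= len: done bytes filled before an error or completion */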

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h |   6 +
 mm/Makefile                   |   1 +
 mm/userfaultfd.c              | 267 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 274 insertions(+)
 create mode 100644 mm/userfaultfd.c

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e1e4360..587480a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -30,6 +30,12 @@
 extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 			    unsigned int flags, unsigned long reason);
 
+extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+			    unsigned long src_start, unsigned long len);
+extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
+			      unsigned long dst_start,
+			      unsigned long len);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 					struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/Makefile b/mm/Makefile
index 3c1caa2..ea9828e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,3 +76,4 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
+obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
new file mode 100644
index 0000000..3f4c0ef
--- /dev/null
+++ b/mm/userfaultfd.c
@@ -0,0 +1,267 @@
+/*
+ *  mm/userfaultfd.c
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/mmu_notifier.h>
+#include <asm/tlbflush.h>
+#include "internal.h"
+
+static int mcopy_atomic_pte(struct mm_struct *dst_mm,
+			    pmd_t *dst_pmd,
+			    struct vm_area_struct *dst_vma,
+			    unsigned long dst_addr,
+			    unsigned long src_addr)
+{
+	struct mem_cgroup *memcg;
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+	struct page *page;
+	void *page_kaddr;
+	int ret;
+
+	ret = -ENOMEM;
+	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
+	if (!page)
+		goto out;
+
+	page_kaddr = kmap(page);
+	ret = -EFAULT;
+	if (copy_from_user(page_kaddr, (const void __user *) src_addr,
+			   PAGE_SIZE))
+		goto out_kunmap_release;
+	kunmap(page);
+
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * preceding stores to the page contents become visible before
+	 * the set_pte_at() write.
+	 */
+	__SetPageUptodate(page);
+
+	ret = -ENOMEM;
+	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+		goto out_release;
+
+	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	if (dst_vma->vm_flags & VM_WRITE)
+		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+
+	ret = -EEXIST;
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!pte_none(*dst_pte))
+		goto out_release_uncharge_unlock;
+
+	inc_mm_counter(dst_mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, dst_vma, dst_addr);
+	mem_cgroup_commit_charge(page, memcg, false);
+	lru_cache_add_active_or_unevictable(page, dst_vma);
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+
+	pte_unmap_unlock(dst_pte, ptl);
+	ret = 0;
+out:
+	return ret;
+out_release_uncharge_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	mem_cgroup_cancel_charge(page, memcg);
+out_release:
+	page_cache_release(page);
+	goto out;
+out_kunmap_release:
+	kunmap(page);
+	goto out_release;
+}
+
+static int mfill_zeropage_pte(struct mm_struct *dst_mm,
+			      pmd_t *dst_pmd,
+			      struct vm_area_struct *dst_vma,
+			      unsigned long dst_addr)
+{
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+	int ret;
+
+	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
+					 dst_vma->vm_page_prot));
+	ret = -EEXIST;
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!pte_none(*dst_pte))
+		goto out_unlock;
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd = NULL;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (pud)
+		/*
+		 * Note that pmd_alloc() isn't run just because the pmd
+		 * was missing: *pmd may already be established, and it
+		 * may even be a trans_huge_pmd.
+		 */
+		pmd = pmd_alloc(mm, pud, address);
+	return pmd;
+}
+
+static ssize_t __mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+			      unsigned long src_start, unsigned long len,
+			      bool zeropage)
+{
+	struct vm_area_struct *dst_vma;
+	ssize_t err;
+	pmd_t *dst_pmd;
+	unsigned long src_addr, dst_addr;
+	long copied = 0;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(dst_start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(src_start + len <= src_start);
+	BUG_ON(dst_start + len <= dst_start);
+
+	down_read(&dst_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared and that the dst range is
+	 * both valid and fully within a single existing vma.
+	 */
+	err = -EINVAL;
+	dst_vma = find_vma(dst_mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	/*
+	 * Be strict and only allow __mcopy_atomic on userfaultfd
+	 * registered ranges to prevent userland errors going
+	 * unnoticed. As far as the VM consistency is concerned, it
+	 * would be perfectly safe to remove this check, but there's
+	 * no useful usage for __mcopy_atomic outside of userfaultfd
+	 * registered ranges. This is after all why these are ioctls
+	 * belonging to the userfaultfd and not syscalls.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx)
+		goto out;
+
+	/*
+	 * FIXME: only allow copying on anonymous vmas, tmpfs should
+	 * be added.
+	 */
+	if (dst_vma->vm_ops)
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma or this page
+	 * would get a NULL anon_vma when moved into the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(dst_mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If a huge pmd materialized from under us, fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+
+		if (!zeropage)
+			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
+					       dst_addr, src_addr);
+		else
+			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
+						 dst_addr);
+
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			copied += PAGE_SIZE;
+
+			if (fatal_signal_pending(current))
+				err = -EINTR;
+		}
+		if (err)
+			break;
+	}
+
+out:
+	up_read(&dst_mm->mmap_sem);
+	BUG_ON(copied < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!copied && !err);
+	return copied ? copied : err;
+}
+
+ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+		     unsigned long src_start, unsigned long len)
+{
+	return __mcopy_atomic(dst_mm, dst_start, src_start, len, false);
+}
+
+ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
+		       unsigned long len)
+{
+	return __mcopy_atomic(dst_mm, start, 0, len, true);
+}


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:17 ` [PATCH 16/21] userfaultfd: remap_pages: rmap preparation Andrea Arcangeli
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

These two ioctls allow pages to be atomically copied, or zeropages to
be mapped, into the virtual address space. They are used by the
thread that opened the userfaultfd to resolve the userfaults.
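
For the zeropage side, a hedged userland sketch (uffd, fault_addr and
page_size are placeholders; on success zp.zeropage reports the bytes
mapped, on failure a negative error):

	struct uffdio_zeropage zp = {
		.range = {
			.start = fault_addr & ~(__u64)(page_size - 1),
			.len   = page_size,
		},
		.mode = 0,	/* wake the blocked faulting threads */
	};

	if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp) == -1)
		fprintf(stderr, "UFFDIO_ZEROPAGE: %lld\n", (long long)zp.zeropage);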

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6b31967..6230f22 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -798,6 +798,100 @@ out:
 	return ret;
 }
 
+static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
+			    unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_copy uffdio_copy;
+	struct uffdio_copy __user *user_uffdio_copy;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_copy = (struct uffdio_copy __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_copy, user_uffdio_copy,
+			   /* don't copy "copy" and "wake" last field */
+			   sizeof(uffdio_copy)-sizeof(__s64)*2))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
+	if (ret)
+		goto out;
+	/*
+	 * double check for wraparound just in case. copy_from_user()
+	 * will later check uffdio_copy.src + uffdio_copy.len to fit
+	 * in the userland range.
+	 */
+	ret = -EINVAL;
+	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
+		goto out;
+	if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+		goto out;
+
+	ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
+			   uffdio_copy.len);
+	if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	BUG_ON(!ret);
+	/* len == 0 would wake all */
+	range.len = ret;
+	if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
+		range.start = uffdio_copy.dst;
+		ret = wake_userfault(ctx, &range);
+		if (unlikely(put_user(ret, &user_uffdio_copy->wake)))
+			return -EFAULT;
+	}
+	ret = range.len == uffdio_copy.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
+static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
+				unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_zeropage uffdio_zeropage;
+	struct uffdio_zeropage __user *user_uffdio_zeropage;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
+			   /* don't copy "zeropage" and "wake" last field */
+			   sizeof(uffdio_zeropage)-sizeof(__s64)*2))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
+			     uffdio_zeropage.range.len);
+	if (ret)
+		goto out;
+	ret = -EINVAL;
+	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+		goto out;
+
+	ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
+			     uffdio_zeropage.range.len);
+	if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
+		range.start = uffdio_zeropage.range.start;
+		ret = wake_userfault(ctx, &range);
+		if (unlikely(put_user(ret, &user_uffdio_zeropage->wake)))
+			return -EFAULT;
+	}
+	ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -855,6 +949,12 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_WAKE:
 		ret = userfaultfd_wake(ctx, arg);
 		break;
+	case UFFDIO_COPY:
+		ret = userfaultfd_copy(ctx, arg);
+		break;
+	case UFFDIO_ZEROPAGE:
+		ret = userfaultfd_zeropage(ctx, arg);
+		break;
 	}
 	return ret;
 }


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 16/21] userfaultfd: remap_pages: rmap preparation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
@ 2015-03-05 17:17 ` Andrea Arcangeli
  2015-03-05 17:18 ` [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation Andrea Arcangeli
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:17 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

As far as the rmap code is concerned, remap_pages only alters
page->mapping and page->index, and it does so while holding the page
lock. However there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places that do rmap
walks without taking the page lock first must be updated to re-check
that page->mapping didn't change after they obtained the anon_vma
lock. remap_pages takes the anon_vma lock for writing before altering
page->mapping, so if page->mapping is still the same after obtaining
the anon_vma lock (without the page lock), the rmap walks can go ahead
safely (and remap_pages will wait for them to complete before
proceeding).

remap_pages serializes against itself with the page lock.

All other places that take the anon_vma lock while holding the
mmap_sem for writing don't need to check whether page->mapping has
changed after taking the anon_vma lock, regardless of the page lock,
because remap_pages holds the mmap_sem only for reading.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_pages must be mapped in only one vma, but
this is not a limitation when remap_pages is used to handle userland
page faults. The source addresses passed to remap_pages should be set
VM_DONTCOPY with MADV_DONTFORK, to avoid any risk of the mapcount of
the pages increasing if fork runs in another thread before or while
remap_pages runs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 23 +++++++++++++++++++----
 mm/rmap.c        |  9 +++++++++
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f1b6a5..1e25cb3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1902,6 +1902,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
+	struct address_space *mapping;
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
@@ -1913,10 +1914,24 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * page_lock_anon_vma_read except the write lock is taken to serialise
 	 * against parallel split or collapse operations.
 	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
+	for (;;) {
+		mapping = ACCESS_ONCE(page->mapping);
+		anon_vma = page_get_anon_vma(page);
+		if (!anon_vma)
+			goto out;
+		anon_vma_lock_write(anon_vma);
+		/*
+		 * We don't hold the page lock here so
+		 * remap_pages_huge_pmd can change the anon_vma from
+		 * under us until we obtain the anon_vma lock. Verify
+		 * that we obtained the anon_vma lock before
+		 * remap_pages did.
+		 */
+		if (likely(mapping == ACCESS_ONCE(page->mapping)))
+			break;
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
 
 	ret = 0;
 	if (!PageCompound(page))
diff --git a/mm/rmap.c b/mm/rmap.c
index 5e3e090..5ab2df1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -492,6 +492,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
+repeat:
 	rcu_read_lock();
 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -530,6 +531,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
+	/* check if remap_pages changed the anon_vma */
+	if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) {
+		anon_vma_unlock_read(anon_vma);
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto repeat;
+	}
+
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2015-03-05 17:17 ` [PATCH 16/21] userfaultfd: remap_pages: rmap preparation Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 17:18 ` [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI Andrea Arcangeli
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Provide a new swapfile method for remap_pages() to verify that a swap
entry is mapped in only one vma before relocating it to a different
virtual address. Otherwise, if the swap entry is mapped in multiple
vmas, the page could get mapped in a non-linear way in some anon_vma
when it is swapped back in.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swapfile.c        | 13 +++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4759491..9adda11 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -436,6 +436,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -527,6 +528,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+	return 0;
+}
+
 #define reuse_swap_page(page)	(page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63f55cc..04c7621 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -874,6 +874,19 @@ int page_swapcount(struct page *page)
 	return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+	int count = 0;
+	struct swap_info_struct *p;
+
+	p = swap_info_get(entry);
+	if (p) {
+		count = swap_count(p->swap_map[swp_offset(entry)]);
+		spin_unlock(&p->lock);
+	}
+	return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This implements the uABI of UFFDIO_REMAP.

Notably one mode bitflag is also forwarded to (and in turn known by)
the lowlevel remap_pages method.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 61251e6..db6e99a 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_REMAP)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -34,6 +35,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_REMAP			(0x05)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -50,6 +52,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_REMAP		_IOWR(UFFDIO, _UFFDIO_REMAP,	\
+				      struct uffdio_remap)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -122,4 +126,25 @@ struct uffdio_zeropage {
 	__s64 wake;
 };
 
+struct uffdio_remap {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * Especially when used to atomically remove memory from the
+	 * address space, the wake on the dst range is not needed.
+	 */
+#define UFFDIO_REMAP_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES	((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * "remap" and "wake" are written by the ioctl and must be at
+	 * the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 remap;
+	__s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */

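A hedged sketch of how userland would fill the new structure
(hypothetical fragment, not part of the patch; uffd, dst, src and len
are assumed to be set up elsewhere, with src typically pointing into a
MADV_DONTFORK'd staging area, and handle_remap_failure a placeholder):

	struct uffdio_remap remap;

	remap.dst = dst;	/* inside a userfaultfd registered range */
	remap.src = src;
	remap.len = len;
	/* DONTWAKE requires a later explicit UFFDIO_WAKE on the range */
	remap.mode = UFFDIO_REMAP_MODE_DONTWAKE |
		     UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES;

	if (ioctl(uffd, UFFDIO_REMAP, &remap))
		/* remap.remap holds bytes remapped or the negative error */
		handle_remap_failure(errno, remap.remap);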

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 17:39   ` Linus Torvalds
  2015-03-05 18:01   ` Pavel Emelyanov
  2015-03-05 17:18 ` [PATCH 20/21] userfaultfd: UFFDIO_REMAP Andrea Arcangeli
                   ` (2 subsequent siblings)
  21 siblings, 2 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

remap_pages is the lowlevel mm helper needed to implement
UFFDIO_REMAP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h |  17 ++
 mm/huge_memory.c              | 120 ++++++++++
 mm/userfaultfd.c              | 526 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 663 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480a..3c39a4f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,6 +36,23 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len);
 
+/* remap_pages */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern ssize_t remap_pages(struct mm_struct *dst_mm,
+			   struct mm_struct *src_mm,
+			   unsigned long dst_start,
+			   unsigned long src_start,
+			   unsigned long len, __u64 flags);
+extern int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+				struct mm_struct *src_mm,
+				pmd_t *dst_pmd, pmd_t *src_pmd,
+				pmd_t dst_pmdval,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 					struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e25cb3..08c8afc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1531,6 +1531,124 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+/*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but this function must release the src PT lock before
+ * returning. We're guaranteed the src_pmd is pmd_trans_huge until the
+ * PT lock of the src_pmd is released. Just move the page from src_pmd
+ * to dst_pmd if possible. Return zero if it succeeded in moving the
+ * page, -EAGAIN if it needs to be repeated by the caller, or other
+ * errors in case of failure.
+ */
+int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+			 struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd,
+			 pmd_t dst_pmdval,
+			 struct vm_area_struct *dst_vma,
+			 struct vm_area_struct *src_vma,
+			 unsigned long dst_addr,
+			 unsigned long src_addr)
+{
+	pmd_t _dst_pmd, src_pmdval;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	spinlock_t *src_ptl, *dst_ptl;
+	pgtable_t pgtable;
+
+	src_pmdval = *src_pmd;
+	src_ptl = pmd_lockptr(src_mm, src_pmd);
+
+	BUG_ON(!pmd_trans_huge(src_pmdval));
+	BUG_ON(pmd_trans_splitting(src_pmdval));
+	BUG_ON(!pmd_none(dst_pmdval));
+	BUG_ON(!spin_is_locked(src_ptl));
+	BUG_ON(!rwsem_is_locked(&src_mm->mmap_sem));
+	BUG_ON(!rwsem_is_locked(&dst_mm->mmap_sem));
+
+	src_page = pmd_page(src_pmdval);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(src_ptl);
+		return -EBUSY;
+	}
+
+	get_page(src_page);
+	spin_unlock(src_ptl);
+
+	mmu_notifier_invalidate_range_start(src_mm, src_addr,
+					    src_addr + HPAGE_PMD_SIZE);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(src_mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
+	double_pt_lock(src_ptl, dst_ptl);
+	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+		     !pmd_same(*dst_pmd, dst_pmdval) ||
+		     page_mapcount(src_page) != 1)) {
+		double_pt_unlock(src_ptl, dst_ptl);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(src_mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+	ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+	if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
+		      src_pmdval))
+		BUG();
+	_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+	_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+	set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
+	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (dst_mm != src_mm) {
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+	}
+	double_pt_unlock(src_ptl, dst_ptl);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(src_mm, src_addr,
+					  src_addr + HPAGE_PMD_SIZE);
+	return 0;
+}
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
  * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
@@ -2484,6 +2602,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later hanlded by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
+	 *
+	 * remap_pages is prevented from racing as well thanks to the mmap_sem.
 	 */
 	down_write(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3f4c0ef..49521af 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -265,3 +265,529 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, true);
 }
+
+void double_pt_lock(spinlock_t *ptl1,
+		    spinlock_t *ptl2)
+	__acquires(ptl1)
+	__acquires(ptl2)
+{
+	spinlock_t *ptl_tmp;
+
+	if (ptl1 > ptl2) {
+		/* exchange ptl1 and ptl2 */
+		ptl_tmp = ptl1;
+		ptl1 = ptl2;
+		ptl2 = ptl_tmp;
+	}
+	/* lock in virtual address order to avoid lock inversion */
+	spin_lock(ptl1);
+	if (ptl1 != ptl2)
+		spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+	else
+		__acquire(ptl2);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+		      spinlock_t *ptl2)
+	__releases(ptl1)
+	__releases(ptl2)
+{
+	spin_unlock(ptl1);
+	if (ptl1 != ptl2)
+		spin_unlock(ptl2);
+	else
+		__release(ptl2);
+}
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from src_pmd to dst_pmd if possible, and return zero if it
+ * succeeded in moving the page, or a negative error code otherwise.
+ */
+static int remap_pages_pte(struct mm_struct *dst_mm,
+			   struct mm_struct *src_mm,
+			   pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+			   struct vm_area_struct *dst_vma,
+			   struct vm_area_struct *src_vma,
+			   unsigned long dst_addr,
+			   unsigned long src_addr,
+			   spinlock_t *dst_ptl,
+			   spinlock_t *src_ptl,
+			   __u64 mode)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EEXIST;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte)) {
+		if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES))
+			return -ENOENT;
+		else
+			/* nothing to do to remap a hole */
+			return 0;
+	}
+
+	if (pte_present(orig_src_pte)) {
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EBUSY;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+						  dst_anon_vma);
+		ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+								 dst_addr);
+
+		if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+			      orig_src_pte))
+			BUG();
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(dst_mm, dst_addr, dst_pte, orig_dst_pte);
+
+		if (dst_mm != src_mm) {
+			inc_mm_counter(dst_mm, MM_ANONPAGES);
+			dec_mm_counter(src_mm, MM_ANONPAGES);
+		}
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+		mmu_notifier_invalidate_page(src_mm, src_addr);
+	} else {
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(src_mm, src_pmd,
+						     src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		if (swp_entry_swapcount(entry) != 1)
+			return -EBUSY;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_entry_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(src_mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(dst_mm, dst_addr, dst_pte, orig_src_pte);
+
+		if (dst_mm != src_mm) {
+			inc_mm_counter(dst_mm, MM_ANONPAGES);
+			dec_mm_counter(src_mm, MM_ANONPAGES);
+		}
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+/**
+ * remap_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * remap_pages() remaps arbitrary anonymous pages atomically with zero
+ * copies. It only works on non-shared anonymous pages because those
+ * can be relocated without generating non-linear anon_vmas in the
+ * rmap code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will be registered in the userfaultfd (with
+ * UFFDIO_REGISTER) while the source vma will have VM_DONTCOPY set
+ * with madvise(MADV_DONTFORK).
+ *
+ * The thread resolving the userland page fault will receive the
+ * faulting page in the source vma through the network, storage or
+ * any other I/O device (MADV_DONTFORK in the source vma prevents
+ * remap_pages() from failing with -EBUSY if the process forks before
+ * remap_pages() is called), then it will call remap_pages() to map
+ * the page at the faulting address in the destination vma.
+ *
+ * This userfaultfd command works purely via pagetables, so it's the
+ * most efficient way to move physical non-shared anonymous pages
+ * across different virtual addresses. Unlike mremap()/mmap()/munmap()
+ * it does not create any new vmas. The mapping at the destination
+ * address is established atomically.
+ *
+ * It only works if the vma protection bits are identical in the
+ * source and destination vmas.
+ *
+ * It can remap non-shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_pages() will fail respectively with -ENOENT or -EEXIST. This
+ * provides a very strict behavior to avoid any chance of memory
+ * corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads try to both call remap_pages() on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this command.
+ *
+ * The command retval will be "len" if successful. The command however
+ * can be interrupted by fatal signals or errors. If interrupted, it
+ * will return the number of bytes successfully remapped before the
+ * interruption if any, or the negative error if none. It will never
+ * return zero. Either it will return an error or an amount of bytes
+ * successfully moved. If the retval reports a "short" remap, the
+ * remap_pages() command should be repeated by userland with
+ * src+retval, dst+retval, len-retval if it wants to know about the
+ * error that interrupted it.
+ *
+ * The UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES flag can be specified to
+ * prevent -ENOENT errors from materializing if there are holes in the
+ * source virtual range being remapped. The holes will be accounted as
+ * successfully remapped in the retval of the command. This is mostly
+ * useful to remap hugepage-aligned virtual regions without knowing
+ * whether there are transparent hugepages in the regions or not,
+ * while avoiding the risk of having to split the hugepmd during the
+ * remap.
+ *
+ * Any rmap walks that take the anon_vma locks without first obtaining
+ * the page lock (for example split_huge_page and
+ * page_referenced_anon) have to verify whether page->mapping has
+ * changed after taking the anon_vma lock. If it changed they should
+ * release the lock and retry obtaining a new anon_vma, because it
+ * means the anon_vma was changed by remap_pages() before the lock
+ * could be obtained. This is the only additional complexity added to
+ * the rmap code to provide this anonymous page remapping
+ * functionality.
+ */
+ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		    unsigned long dst_start, unsigned long src_start,
+		    unsigned long len, __u64 mode)
+{
+	struct vm_area_struct *src_vma, *dst_vma;
+	long err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+	ssize_t moved = 0;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(src_start & ~PAGE_MASK);
+	BUG_ON(dst_start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(src_start + len <= src_start);
+	BUG_ON(dst_start + len <= dst_start);
+
+	/*
+	 * Because these are read semaphores there's no risk of lock
+	 * inversion.
+	 */
+	down_read(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		down_read(&src_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(src_mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(dst_mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	/*
+	 * Be strict and only allow remap_pages if either the src or
+	 * dst range is registered in the userfaultfd to prevent
+	 * userland errors going unnoticed. As far as the VM
+	 * consistency is concerned, it would be perfectly safe to
+	 * remove this check, but there's no useful usage for
+	 * remap_pages outside of userfaultfd registered ranges. This
+	 * is after all why it is an ioctl belonging to the
+	 * userfaultfd and not a syscall.
+	 *
+	 * Allow both vmas to be registered in the userfaultfd, just
+	 * in case somebody finds a way to make such a case useful.
+	 * Normally only one of the two vmas would be registered in
+	 * the userfaultfd.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx &&
+	    !src_vma->vm_userfaultfd_ctx.ctx)
+		goto out;
+
+	/*
+	 * FIXME: only allow remapping across anonymous vmas,
+	 * tmpfs should be added.
+	 */
+	if (src_vma->vm_ops || dst_vma->vm_ops)
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma or this page
+	 * would get a NULL anon_vma when moved in the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		spinlock_t *ptl;
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(src_mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				src_pmd = mm_alloc_pmd(src_mm, src_addr);
+				if (unlikely(!src_pmd)) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.  These
+			 * checks don't actually need the PT lock but
+			 * it's good to do it here to optimize this
+			 * block away at build time if
+			 * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(dst_pmdval) ||
+			    src_start + len - src_addr < HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				/* Fall through */
+				split_huge_page_pmd(src_vma, src_addr,
+						    src_pmd);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_pages_huge_pmd(dst_mm,
+							   src_mm,
+							   dst_pmd,
+							   src_pmd,
+							   dst_pmdval,
+							   dst_vma,
+							   src_vma,
+							   dst_addr,
+							   src_addr);
+				cond_resched();
+
+				if (!err) {
+					dst_addr += HPAGE_PMD_SIZE;
+					src_addr += HPAGE_PMD_SIZE;
+					moved += HPAGE_PMD_SIZE;
+				}
+
+				if ((!err || err == -EAGAIN) &&
+				    fatal_signal_pending(current))
+					err = -EINTR;
+
+				if (err && err != -EAGAIN)
+					break;
+
+				continue;
+			}
+		}
+
+		if (pmd_none(*src_pmd)) {
+			if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				if (unlikely(__pte_alloc(src_mm, src_vma,
+							 src_pmd, src_addr))) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * We hold the mmap_sem only for reading, so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(dst_mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If a huge pmd materialized from under us, fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(dst_mm, dst_pmd);
+		src_ptl = pte_lockptr(src_mm, src_pmd);
+
+		err = remap_pages_pte(dst_mm, src_mm,
+				      dst_pte, src_pte, src_pmd,
+				      dst_vma, src_vma,
+				      dst_addr, src_addr,
+				      dst_ptl, src_ptl, mode);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			moved += PAGE_SIZE;
+		}
+
+		if ((!err || err == -EAGAIN) &&
+		    fatal_signal_pending(current))
+			err = -EINTR;
+
+		if (err && err != -EAGAIN)
+			break;
+	}
+
+out:
+	up_read(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		up_read(&src_mm->mmap_sem);
+	BUG_ON(moved < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!moved && !err);
+	return moved ? moved : err;
+}

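To make the "mapped in only one vma" constraint above concrete, a
monitor would typically prepare its source staging area along these
lines (a hypothetical userland sketch, assuming len is page aligned;
not part of the patch):

	#include <sys/mman.h>
	#include <err.h>

	void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		err(1, "mmap");
	/*
	 * VM_DONTCOPY keeps these pages out of any child created by
	 * fork(): a second mapping would raise the mapcount above 1
	 * and make remap_pages fail with -EBUSY.
	 */
	if (madvise(src, len, MADV_DONTFORK))
		err(1, "madvise(MADV_DONTFORK)");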

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 20/21] userfaultfd: UFFDIO_REMAP
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 17:18 ` [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers Andrea Arcangeli
  2015-03-05 18:15 ` [PATCH 00/21] RFC: userfaultfd v3 Pavel Emelyanov
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

This remap ioctl allows atomically moving pages in or out of a
userfaultfd address space. It's more expensive than "copy" (and of
course more expensive than "zerofill") as it requires a TLB flush on
the source range for each ioctl, which is an expensive operation on
SMP. Especially when copying only a few pages at a time, copying
without a TLB flush is faster.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6230f22..b4c7f25 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -892,6 +892,54 @@ out:
 	return ret;
 }
 
+static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
+			     unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_remap uffdio_remap;
+	struct uffdio_remap __user *user_uffdio_remap;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_remap = (struct uffdio_remap __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_remap, user_uffdio_remap,
+			   /* don't copy the last "remap" and "wake" fields */
+			   sizeof(uffdio_remap)-sizeof(__s64)*2))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
+	if (ret)
+		goto out;
+	ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
+	if (ret)
+		goto out;
+	ret = -EINVAL;
+	if (uffdio_remap.mode & ~(UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES|
+				  UFFDIO_REMAP_MODE_DONTWAKE))
+		goto out;
+
+	ret = remap_pages(ctx->mm, current->mm,
+			  uffdio_remap.dst, uffdio_remap.src,
+			  uffdio_remap.len, uffdio_remap.mode);
+	if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_remap.mode & UFFDIO_REMAP_MODE_DONTWAKE)) {
+		range.start = uffdio_remap.dst;
+		ret = wake_userfault(ctx, &range);
+		if (unlikely(put_user(ret, &user_uffdio_remap->wake)))
+			return -EFAULT;
+	}
+	ret = range.len == uffdio_remap.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -955,6 +1003,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_ZEROPAGE:
 		ret = userfaultfd_zeropage(ctx, arg);
 		break;
+	case UFFDIO_REMAP:
+		ret = userfaultfd_remap(ctx, arg);
+		break;
 	}
 	return ret;
 }

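Given the retval semantics documented in remap_pages, a caller that
wants to handle "short" remaps could loop roughly like this
(hypothetical userland fragment; uffd, dst, src and len are assumed to
be set up and page aligned):

	for (;;) {
		struct uffdio_remap remap;

		remap.dst = dst;
		remap.src = src;
		remap.len = len;
		remap.mode = 0;

		if (ioctl(uffd, UFFDIO_REMAP, &remap) == 0)
			break;		/* whole range remapped and woken */
		if (errno == EAGAIN && remap.remap > 0) {
			/* short remap: retry past the part that succeeded */
			dst += remap.remap;
			src += remap.remap;
			len -= remap.remap;
			continue;
		}
		err(1, "UFFDIO_REMAP");	/* remap.remap holds the error */
	}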

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 20/21] userfaultfd: UFFDIO_REMAP Andrea Arcangeli
@ 2015-03-05 17:18 ` Andrea Arcangeli
  2015-03-05 18:15 ` [PATCH 00/21] RFC: userfaultfd v3 Pavel Emelyanov
  21 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 17:18 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

These helpers will be used to decide whether to call handle_userfault()
during wrprotect faults, in order to deliver the wrprotect faults to
userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3c39a4f..81f0d11 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -65,6 +65,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -92,6 +97,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;

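For context, the intended call site would presumably look like the
following in the wrprotect fault path (an illustrative sketch only;
this series does not add that hunk yet):

	/* in the COW/wrprotect fault path, mmap_sem held for reading */
	if (userfaultfd_wp(vma))
		/* deliver the wrprotect fault to userland instead of COWing */
		return handle_userfault(vma, address, flags, VM_UFFD_WP);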

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
@ 2015-03-05 17:39   ` Linus Torvalds
  2015-03-05 18:51     ` Andrea Arcangeli
  2015-03-05 18:01   ` Pavel Emelyanov
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2015-03-05 17:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Android Kernel Team, Kirill A. Shutemov,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

On Thu, Mar 5, 2015 at 9:18 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> remap_pages is the lowlevel mm helper needed to implement
> UFFDIO_REMAP.

This function is nasty nasty nasty.

Is this really worth it? On real loads? That people are expected to use?

Considering how we just got rid of one special magic VM remapping
thing that nobody actually used, I'd really hate to add a new one.

The fact is, almost nobody ever uses anything that isn't standard
POSIX. There are no apps, and even for specialized things like
virtualization hypervisors this kind of thing is often simply not
worth it.

Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
argue for not merging the remap parts. I just don't see the point. It
doesn't seem to add anything that is semantically very important -
it's *potentially* a faster copy, but even that is

  (a) questionable in the first place

and

 (b) unclear why anybody would ever care about performance of
infrastructure that nobody actually uses today, and future use isn't
even clear or shown to be particularly performance-sensitive.

So basically I'd like to see better documentation, a few real use
cases (and by real I very much do *not* mean "you can use it for
this", but actual patches to actual projects that matter and that are
expected to care and merge them), and a simplified series that doesn't
do the remap thing.

Because *every* time we add a new clever interface, we end up with
approximately zero users and just pain down the line. Examples:
splice, mremap, yadda yadda.

                        Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  2015-03-05 17:17 ` [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
@ 2015-03-05 17:48   ` Pavel Emelyanov
  0 siblings, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 17:48 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

> diff --git a/kernel/fork.c b/kernel/fork.c
> index cf65139..cb215c0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  			goto fail_nomem_anon_vma_fork;
>  		tmp->vm_flags &= ~VM_LOCKED;
>  		tmp->vm_next = tmp->vm_prev = NULL;
> +		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

This creates an interesting effect when the userfaultfd is used outside of
the process which created and activated it. If I try to monitor the memory
of one task with another, once the first task fork()-s, its child
begins to see zero-pages in the places where the monitor task was supposed
to insert pages with data.

>  		file = tmp->vm_file;
>  		if (file) {
>  			struct inode *inode = file_inode(file);
> .
> 

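A minimal sketch of the effect (hypothetical; addr lies in a vma of
the parent that was registered through a uffd held by a separate
monitor task):

	pid_t pid = fork();
	if (pid == 0) {
		/*
		 * The child's copy of the vma lost its
		 * vm_userfaultfd_ctx in dup_mmap(), so this read is not
		 * delivered to the monitor: it faults in a zero page
		 * instead of the data the monitor was supposed to
		 * insert.
		 */
		char c = *(volatile char *) addr;
		_exit(c);
	}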

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization
  2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
@ 2015-03-05 17:57   ` Pavel Emelyanov
  2015-03-06 10:48   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 17:57 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela


> +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> +		     unsigned int flags, unsigned long reason)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct userfaultfd_ctx *ctx;
> +	struct userfaultfd_wait_queue uwq;
> +
> +	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +	ctx = vma->vm_userfaultfd_ctx.ctx;
> +	if (!ctx)
> +		return VM_FAULT_SIGBUS;
> +
> +	BUG_ON(ctx->mm != mm);
> +
> +	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> +	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +
> +	/*
> +	 * If it's already released don't get it. This avoids to loop
> +	 * in __get_user_pages if userfaultfd_release waits on the
> +	 * caller of handle_userfault to release the mmap_sem.
> +	 */
> +	if (unlikely(ACCESS_ONCE(ctx->released)))
> +		return VM_FAULT_SIGBUS;
> +
> +	/* check that we can return VM_FAULT_RETRY */
> +	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
> +		/*
> +		 * Validate the invariant that nowait must allow retry
> +		 * to be sure not to return SIGBUS erroneously on
> +		 * nowait invocations.
> +		 */
> +		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
> +#ifdef CONFIG_DEBUG_VM
> +		if (printk_ratelimit()) {
> +			printk(KERN_WARNING
> +			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
> +			dump_stack();
> +		}
> +#endif
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	/*
> +	 * Handle nowait, not much to do other than tell it to retry
> +	 * and wait.
> +	 */
> +	if (flags & FAULT_FLAG_RETRY_NOWAIT)
> +		return VM_FAULT_RETRY;
> +
> +	/* take the reference before dropping the mmap_sem */
> +	userfaultfd_ctx_get(ctx);
> +
> +	/* be gentle and immediately relinquish the mmap_sem */
> +	up_read(&mm->mmap_sem);
> +
> +	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> +	uwq.wq.private = current;
> +	uwq.address = userfault_address(address, flags, reason);

Since we report only the virtual address of the fault, this will make
things difficult for a task monitoring the address space of some other
task. Like this:

Let's assume a task creates a userfaultfd, activates it, registers
several VMAs in it and then sends the ufd descriptor to another task.
If the first task later remaps those VMAs and starts touching pages,
the monitor will start receiving fault addresses from which it will
not be able to tell the exact vma the requests come from.

Thanks,
Pavel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
  2015-03-05 17:39   ` Linus Torvalds
@ 2015-03-05 18:01   ` Pavel Emelyanov
  1 sibling, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 18:01 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

> +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		    unsigned long dst_start, unsigned long src_start,
> +		    unsigned long len, __u64 mode)
> +{
> +	struct vm_area_struct *src_vma, *dst_vma;
> +	long err = -EINVAL;
> +	pmd_t *src_pmd, *dst_pmd;
> +	pte_t *src_pte, *dst_pte;
> +	spinlock_t *dst_ptl, *src_ptl;
> +	unsigned long src_addr, dst_addr;
> +	int thp_aligned = -1;
> +	ssize_t moved = 0;
> +
> +	/*
> +	 * Sanitize the command parameters:
> +	 */
> +	BUG_ON(src_start & ~PAGE_MASK);
> +	BUG_ON(dst_start & ~PAGE_MASK);
> +	BUG_ON(len & ~PAGE_MASK);
> +
> +	/* Does the address range wrap, or is the span zero-sized? */
> +	BUG_ON(src_start + len <= src_start);
> +	BUG_ON(dst_start + len <= dst_start);
> +
> +	/*
> +	 * Because these are read sempahores there's no risk of lock
> +	 * inversion.
> +	 */
> +	down_read(&dst_mm->mmap_sem);
> +	if (dst_mm != src_mm)
> +		down_read(&src_mm->mmap_sem);
> +
> +	/*
> +	 * Make sure the vma is not shared, that the src and dst remap
> +	 * ranges are both valid and fully within a single existing
> +	 * vma.
> +	 */
> +	src_vma = find_vma(src_mm, src_start);
> +	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> +		goto out;
> +	if (src_start < src_vma->vm_start ||
> +	    src_start + len > src_vma->vm_end)
> +		goto out;
> +
> +	dst_vma = find_vma(dst_mm, dst_start);
> +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +		goto out;

I again have a concern about the case when one task monitors the VM of
another one. If the target task (owning the mm) unmaps a VMA, then the
monitor task (holding and operating on the ufd) will get a plain EINVAL
on the UFFDIO_REMAP request. This is not fatal, but still inconvenient,
as it will be hard to find out the reason for the failure -- either the
dst VMA was removed and the monitor should just drop the respective
pages with data, or some other error has occurred and some other action
should be taken.

Thanks,
Pavel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
  2015-03-05 17:17 ` [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
@ 2015-03-05 18:07   ` Pavel Emelyanov
  0 siblings, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 18:07 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

> +static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> +			    pmd_t *dst_pmd,
> +			    struct vm_area_struct *dst_vma,
> +			    unsigned long dst_addr,
> +			    unsigned long src_addr)
> +{
> +	struct mem_cgroup *memcg;
> +	pte_t _dst_pte, *dst_pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	void *page_kaddr;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
> +	if (!page)
> +		goto out;

Not a fatal thing, but still quite inconvenient. If there are two tasks that
have anonymous private VMAs that are still not COW-ed from each other, then
it will be impossible to keep the pages shared with userfault. Thus if we do
post-copy memory migration for tasks, then these guys will have their
memory COW-ed.


Thanks,
Pavel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/21] RFC: userfaultfd v3
  2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2015-03-05 17:18 ` [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers Andrea Arcangeli
@ 2015-03-05 18:15 ` Pavel Emelyanov
  21 siblings, 0 replies; 32+ messages in thread
From: Pavel Emelyanov @ 2015-03-05 18:15 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Kirill A. Shutemov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela


> All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
> live migration and the UFFD can be passed to a manager process through
> unix domain sockets to satisfy point 5).

Yup :) That's the best (from my POV) point of ufd -- the ability to delegate
the descriptor to some other task. Though there are several limitations (I've
expressed them in other e-mails), I'm definitely supporting this!

The respective CRIU code is still quite sloppy; I will try to brush it
up and show it soon.

Thanks,
Pavel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 17:39   ` Linus Torvalds
@ 2015-03-05 18:51     ` Andrea Arcangeli
  2015-03-05 19:32       ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2015-03-05 18:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Android Kernel Team, Kirill A. Shutemov,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

On Thu, Mar 05, 2015 at 09:39:48AM -0800, Linus Torvalds wrote:
> Is this really worth it? On real loads? That people are expected to use?

I fully agree that it's not worth merging UFFDIO_REMAP upstream until
(and unless) a real-world usage for it shows up. To further clarify:
had this not been an RFC, the patchset would have stopped at patch
15/21 inclusive.

Merging UFFDIO_REMAP with no real-life users would just increase the
attack surface of the kernel for no good reason.

Thanks for pointing out that UFFDIO_COPY is faster: the userland code
we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE and never uses
UFFDIO_REMAP. I immediately agreed that UFFDIO_COPY is preferable
after you mentioned it during review of the previous RFC.

However, this being an RFC with a large audience, and UFFDIO_REMAP
allowing memory to be "removed" (think of externalizing memory into
ceph with deduplication or such), I still added it just in case there
are real-world use cases that may justify keeping it around (even
though I would definitely not have submitted it for merging in the
short term regardless).

In addition to dropping the parts that aren't suitable for merging in
the short term, like UFFDIO_REMAP, for any further submissions that
don't substantially alter the API (as happened between the v2 and v3
RFCs) I'll also shrink the To/Cc list considerably.

> Considering how we just got rid of one special magic VM remapping
> thing that nobody actually used, I'd really hate to add a new one.

Having to define an API somehow, I tried to think of all possible
future usages and to make sure the API would allow for those if needed.

> Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
> argue for not merging the remap parts. I just don't see the point. It
> doesn't seem to add anything that is semantically very important -
> it's *potentially* a faster copy, but even that is
> 
>   (a) questionable in the first place

Yes, we already measured that UFFDIO_COPY is faster than
UFFDIO_REMAP: the userfault latency decreases by about 20%.

> 
> and
> 
>  (b) unclear why anybody would ever care about performance of
> infrastructure that nobody actually uses today, and future use isn't
> even clear or shown to be particularly performance-sensitive.

The only potential _theoretical_ case that justifies the existence of
UFFDIO_REMAP is "removing" memory from the address space. For
"adding" memory, UFFDIO_COPY and UFFDIO_ZEROPAGE are always
preferable, as you suggested.
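
For reference, resolving a missing-page userfault with UFFDIO_COPY
from the fault-handling thread would look roughly like this (a sketch
against the 13/21 uAPI, assuming a uffdio_copy layout with
dst/src/len/mode fields; resolve_fault is an illustrative name and
error handling is omitted):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * uffd: the userfaultfd; fault_addr: the address returned by read(2)
 * (low bits carry the UFFD_BIT_* flags); src: a local page-sized
 * buffer already filled with the content, e.g. received from the
 * migration stream.
 */
static int resolve_fault(int uffd, unsigned long fault_addr,
			 void *src, unsigned long page_size)
{
	struct uffdio_copy copy;

	copy.dst = fault_addr & ~(page_size - 1); /* strip UFFD_BIT_* flags */
	copy.src = (unsigned long) src;
	copy.len = page_size;
	copy.mode = 0; /* no UFFDIO_COPY_MODE_DONTWAKE: wake the fault */

	return ioctl(uffd, UFFDIO_COPY, &copy);
}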

> So basically I'd like to see better documentation, a few real use
> cases (and by real I very much do *not* mean "you can use it for
> this", but actual patches to actual projects that matter and that are
> expected to care and merge them), and a simplified series that doesn't
> do the remap thing.

So far I wrote some documentation in 2/21 and in the cover letter, but
certainly more docs are necessary. Trinity coverage is also needed (I
got trinity running on the v2 API but I haven't adapted it to the new
API yet).

About the real world usages, this is the primary one:

http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

And it actually cannot be merged in qemu until userfaultfd is merged
in the kernel. There's simply no safe way to implement postcopy live
migration without something equivalent to the userfaultfd if all Linux
VM features are intended to be retained on the destination node.

> Because *every* time we add a new clever interface, we end up with
> approximately zero users and just pain down the line. Examples:
> splice, mremap, yadda yadda.

Aside from mremap, which I think is widely used, I totally agree in
principle.

For now I can quite comfortably guarantee the above real-life user of
userfaultfd (qemu), but there are potentially five of them. None of
them needs UFFDIO_REMAP, which is again why I totally agree about not
submitting it for merging: it was intended only for the initial RFC,
to share the idea of "removing" memory with a larger audience before
I shrink the Cc/To list for further updates.

Thanks,
Andrea


* Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
  2015-03-05 18:51     ` Andrea Arcangeli
@ 2015-03-05 19:32       ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2015-03-05 19:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Android Kernel Team, Kirill A. Shutemov,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

On Thu, Mar 5, 2015 at 10:51 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Thanks for pointing out that UFFDIO_COPY is faster: the userland code
> we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE and never uses
> UFFDIO_REMAP.

Ok. So there's no actual expected use of the remap interface. Good.
That makes this series more palatable, since the rest didn't raise my
hackles much.

(But yeah, the documentation patch didn't really explain the uses very
much or at all, so I think something more is needed in that area).

                   Linus


* Re: [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization
  2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
  2015-03-05 17:57   ` Pavel Emelyanov
@ 2015-03-06 10:48   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 32+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-03-06 10:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, lkml, linux-mm, Linux API, Android Kernel Team,
	Kirill A. Shutemov, Pavel Emelyanov, Sanidhya Kashyap,
	zhang.zhanghailiang, Linus Torvalds, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Dr. David Alan Gilbert, Christopher Covington,
	Johannes Weiner, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Anthony Liguori, Stefan Hajnoczi, Wenchao Xia, Andrew Jones,
	Juan Quintela

Hi Andrea,

On 5 March 2015 at 18:17, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Once a userfaultfd has been created and certain regions of the process
> virtual address space have been registered into it, the thread
> responsible for doing the memory externalization can manage the page
> faults in userland by talking to the kernel using the userfaultfd
> protocol.

Is there something like a man page for this new syscall?

Thanks,

Michael


> poll() can be used to know when there are new pending userfaults to be
> read (POLLIN).
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  fs/userfaultfd.c | 977 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 977 insertions(+)
>  create mode 100644 fs/userfaultfd.c
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> new file mode 100644
> index 0000000..6b31967
> --- /dev/null
> +++ b/fs/userfaultfd.c
> @@ -0,0 +1,977 @@
> +/*
> + *  fs/userfaultfd.c
> + *
> + *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> + *  Copyright (C) 2008-2009 Red Hat, Inc.
> + *  Copyright (C) 2015  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + *
> + *  Some part derived from fs/eventfd.c (anon inode setup) and
> + *  mm/ksm.c (mm hashing).
> + */
> +
> +#include <linux/hashtable.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/seq_file.h>
> +#include <linux/file.h>
> +#include <linux/bug.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/syscalls.h>
> +#include <linux/userfaultfd_k.h>
> +#include <linux/mempolicy.h>
> +#include <linux/ioctl.h>
> +#include <linux/security.h>
> +
> +enum userfaultfd_state {
> +       UFFD_STATE_WAIT_API,
> +       UFFD_STATE_RUNNING,
> +};
> +
> +struct userfaultfd_ctx {
> +       /* pseudo fd refcounting */
> +       atomic_t refcount;
> +       /* waitqueue head for the userfaultfd page faults */
> +       wait_queue_head_t fault_wqh;
> +       /* waitqueue head for the pseudo fd to wakeup poll/read */
> +       wait_queue_head_t fd_wqh;
> +       /* userfaultfd syscall flags */
> +       unsigned int flags;
> +       /* state machine */
> +       enum userfaultfd_state state;
> +       /* released */
> +       bool released;
> +       /* mm with one or more vmas attached to this userfaultfd_ctx */
> +       struct mm_struct *mm;
> +};
> +
> +struct userfaultfd_wait_queue {
> +       unsigned long address;
> +       wait_queue_t wq;
> +       bool pending;
> +       struct userfaultfd_ctx *ctx;
> +};
> +
> +struct userfaultfd_wake_range {
> +       unsigned long start;
> +       unsigned long len;
> +};
> +
> +static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
> +                                    int wake_flags, void *key)
> +{
> +       struct userfaultfd_wake_range *range = key;
> +       int ret;
> +       struct userfaultfd_wait_queue *uwq;
> +       unsigned long start, len;
> +
> +       uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +       ret = 0;
> +       /* don't wake the pending ones, to avoid making reads block */
> +       if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
> +               goto out;
> +       /* len == 0 means wake all */
> +       start = range->start;
> +       len = range->len;
> +       if (len && (start > uwq->address || start + len <= uwq->address))
> +               goto out;
> +       ret = wake_up_state(wq->private, mode);
> +       if (ret)
> +               /* wake only once, autoremove behavior */
> +               list_del_init(&wq->task_list);
> +out:
> +       return ret;
> +}
> +
> +/**
> + * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
> + * context.
> + * @ctx: [in] Pointer to the userfaultfd context.
> + *
> + * The context refcount must already be non-zero, or this BUG()s.
> + */
> +static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
> +{
> +       if (!atomic_inc_not_zero(&ctx->refcount))
> +               BUG();
> +}
> +
> +/**
> + * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
> + * context.
> + * @ctx: [in] Pointer to userfaultfd context.
> + *
> + * The userfaultfd context reference must have been previously acquired either
> + * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
> + */
> +static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
> +{
> +       if (atomic_dec_and_test(&ctx->refcount)) {
> +               mmdrop(ctx->mm);
> +               kfree(ctx);
> +       }
> +}
> +
> +static inline unsigned long userfault_address(unsigned long address,
> +                                             unsigned int flags,
> +                                             unsigned long reason)
> +{
> +       BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
> +       address &= PAGE_MASK;
> +       if (flags & FAULT_FLAG_WRITE)
> +               /*
> +                * Encode "write" fault information in the LSB of the
> +                * address read by userland, without depending on
> +                * FAULT_FLAG_WRITE kernel internal value.
> +                */
> +               address |= UFFD_BIT_WRITE;
> +       if (reason & VM_UFFD_WP)
> +               /*
> +                * Encode "reason" fault information as bit number 1
> +                * in the address read by userland. If bit number 1 is
> +                * clear it means the reason is a VM_FAULT_MISSING
> +                * fault.
> +                */
> +               address |= UFFD_BIT_WP;
> +       return address;
> +}
> +
> +/*
> + * The locking rules involved in returning VM_FAULT_RETRY depending on
> + * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
> + * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
> + * recommendation in __lock_page_or_retry is not an understatement.
> + *
> + * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
> + * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
> + * not set.
> + *
> + * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
> + * set, VM_FAULT_RETRY can still be returned if and only if there are
> + * fatal_signal_pending()s, and the mmap_sem must be released before
> + * returning it.
> + */
> +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> +                    unsigned int flags, unsigned long reason)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       struct userfaultfd_ctx *ctx;
> +       struct userfaultfd_wait_queue uwq;
> +
> +       BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +       ctx = vma->vm_userfaultfd_ctx.ctx;
> +       if (!ctx)
> +               return VM_FAULT_SIGBUS;
> +
> +       BUG_ON(ctx->mm != mm);
> +
> +       VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> +       VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +
> +       /*
> +        * If it's already released don't get it. This avoids looping
> +        * in __get_user_pages if userfaultfd_release waits on the
> +        * caller of handle_userfault to release the mmap_sem.
> +        */
> +       if (unlikely(ACCESS_ONCE(ctx->released)))
> +               return VM_FAULT_SIGBUS;
> +
> +       /* check that we can return VM_FAULT_RETRY */
> +       if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
> +               /*
> +                * Validate the invariant that nowait must allow retry
> +                * to be sure not to return SIGBUS erroneously on
> +                * nowait invocations.
> +                */
> +               BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
> +#ifdef CONFIG_DEBUG_VM
> +               if (printk_ratelimit()) {
> +                       printk(KERN_WARNING
> +                              "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
> +                       dump_stack();
> +               }
> +#endif
> +               return VM_FAULT_SIGBUS;
> +       }
> +
> +       /*
> +        * Handle nowait, not much to do other than tell it to retry
> +        * and wait.
> +        */
> +       if (flags & FAULT_FLAG_RETRY_NOWAIT)
> +               return VM_FAULT_RETRY;
> +
> +       /* take the reference before dropping the mmap_sem */
> +       userfaultfd_ctx_get(ctx);
> +
> +       /* be gentle and immediately relinquish the mmap_sem */
> +       up_read(&mm->mmap_sem);
> +
> +       init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> +       uwq.wq.private = current;
> +       uwq.address = userfault_address(address, flags, reason);
> +       uwq.pending = true;
> +       uwq.ctx = ctx;
> +
> +       spin_lock(&ctx->fault_wqh.lock);
> +       /*
> +        * After the __add_wait_queue the uwq is visible to userland
> +        * through poll/read().
> +        */
> +       __add_wait_queue(&ctx->fault_wqh, &uwq.wq);
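> +       /*
> +        * Sleep until the fault is resolved: a read() from userland
> +        * clears uwq.pending, after which a UFFDIO_WAKE (or the fd
> +        * being released, or a fatal signal) terminates the loop.
> +        */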
> +       for (;;) {
> +               set_current_state(TASK_KILLABLE);
> +               if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> +                   fatal_signal_pending(current))
> +                       break;
> +               spin_unlock(&ctx->fault_wqh.lock);
> +
> +               wake_up_poll(&ctx->fd_wqh, POLLIN);
> +               schedule();
> +
> +               spin_lock(&ctx->fault_wqh.lock);
> +       }
> +       __remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
> +       __set_current_state(TASK_RUNNING);
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       /*
> +        * ctx may go away after this if the userfault pseudo fd is
> +        * already released.
> +        */
> +       userfaultfd_ctx_put(ctx);
> +
> +       return VM_FAULT_RETRY;
> +}
> +
> +static int userfaultfd_release(struct inode *inode, struct file *file)
> +{
> +       struct userfaultfd_ctx *ctx = file->private_data;
> +       struct mm_struct *mm = ctx->mm;
> +       struct vm_area_struct *vma, *prev;
> +       /* len == 0 means wake all */
> +       struct userfaultfd_wake_range range = { .len = 0, };
> +       unsigned long new_flags;
> +
> +       ACCESS_ONCE(ctx->released) = true;
> +
> +       /*
> +        * Flush page faults out of all CPUs. NOTE: all page faults
> +        * must be retried without returning VM_FAULT_SIGBUS if
> +        * userfaultfd_ctx_get() succeeds but vma->vm_userfaultfd_ctx
> +        * changes while handle_userfault released the mmap_sem. So
> +        * it's critical that released is set to true (above), before
> +        * taking the mmap_sem for writing.
> +        */
> +       down_write(&mm->mmap_sem);
> +       prev = NULL;
> +       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +               cond_resched();
> +               BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> +                      !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +               if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> +                       prev = vma;
> +                       continue;
> +               }
> +               new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> +               prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
> +                                new_flags, vma->anon_vma,
> +                                vma->vm_file, vma->vm_pgoff,
> +                                vma_policy(vma),
> +                                NULL_VM_UFFD_CTX);
> +               if (prev)
> +                       vma = prev;
> +               else
> +                       prev = vma;
> +               vma->vm_flags = new_flags;
> +               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> +       }
> +       up_write(&mm->mmap_sem);
> +
> +       /*
> +        * After no new page faults can wait on this fault_wqh, flush
> +        * the last page faults that may have been already waiting on
> +        * the fault_wqh.
> +        */
> +       spin_lock(&ctx->fault_wqh.lock);
> +       __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       wake_up_poll(&ctx->fd_wqh, POLLHUP);
> +       userfaultfd_ctx_put(ctx);
> +       return 0;
> +}
> +
> +static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
> +                                         struct userfaultfd_wait_queue **uwq)
> +{
> +       wait_queue_t *wq;
> +       struct userfaultfd_wait_queue *_uwq;
> +       unsigned int ret = 0;
> +
> +       spin_lock(&ctx->fault_wqh.lock);
> +       list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +               _uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +               if (_uwq->pending) {
> +                       ret = POLLIN;
> +                       if (uwq)
> +                               *uwq = _uwq;
> +                       break;
> +               }
> +       }
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       return ret;
> +}
> +
> +static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
> +{
> +       struct userfaultfd_ctx *ctx = file->private_data;
> +
> +       poll_wait(file, &ctx->fd_wqh, wait);
> +
> +       switch (ctx->state) {
> +       case UFFD_STATE_WAIT_API:
> +               return POLLERR;
> +       case UFFD_STATE_RUNNING:
> +               return find_userfault(ctx, NULL);
> +       default:
> +               BUG();
> +       }
> +}
> +
> +static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
> +                                   __u64 *addr)
> +{
> +       ssize_t ret;
> +       DECLARE_WAITQUEUE(wait, current);
> +       struct userfaultfd_wait_queue *uwq = NULL;
> +
> +       /* always take the fd_wqh lock before the fault_wqh lock */
> +       spin_lock(&ctx->fd_wqh.lock);
> +       __add_wait_queue(&ctx->fd_wqh, &wait);
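> +       /*
> +        * Check-and-sleep under fd_wqh.lock: wake_up_poll() in
> +        * handle_userfault() takes the same lock, so a new pending
> +        * fault cannot slip in unnoticed between find_userfault()
> +        * and schedule().
> +        */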
> +       for (;;) {
> +               set_current_state(TASK_INTERRUPTIBLE);
> +               if (find_userfault(ctx, &uwq)) {
> +                       uwq->pending = false;
> +                       /* careful to always initialize addr if ret == 0 */
> +                       *addr = uwq->address;
> +                       ret = 0;
> +                       break;
> +               }
> +               if (signal_pending(current)) {
> +                       ret = -ERESTARTSYS;
> +                       break;
> +               }
> +               if (no_wait) {
> +                       ret = -EAGAIN;
> +                       break;
> +               }
> +               spin_unlock(&ctx->fd_wqh.lock);
> +               schedule();
> +               spin_lock(&ctx->fd_wqh.lock);
> +       }
> +       __remove_wait_queue(&ctx->fd_wqh, &wait);
> +       __set_current_state(TASK_RUNNING);
> +       spin_unlock(&ctx->fd_wqh.lock);
> +
> +       return ret;
> +}
> +
> +static ssize_t userfaultfd_read(struct file *file, char __user *buf,
> +                               size_t count, loff_t *ppos)
> +{
> +       struct userfaultfd_ctx *ctx = file->private_data;
> +       ssize_t _ret, ret = 0;
> +       /* careful to always initialize addr if ret == 0 */
> +       __u64 uninitialized_var(addr);
> +       int no_wait = file->f_flags & O_NONBLOCK;
> +
> +       if (ctx->state == UFFD_STATE_WAIT_API)
> +               return -EINVAL;
> +       BUG_ON(ctx->state != UFFD_STATE_RUNNING);
> +
> +       for (;;) {
> +               if (count < sizeof(addr))
> +                       return ret ? ret : -EINVAL;
> +               _ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
> +               if (_ret < 0)
> +                       return ret ? ret : _ret;
> +               if (put_user(addr, (__u64 __user *) buf))
> +                       return ret ? ret : -EFAULT;
> +               ret += sizeof(addr);
> +               buf += sizeof(addr);
> +               count -= sizeof(addr);
> +               /*
> +                * Allow reading more than one fault at a time, but
> +                * only block while waiting for the very first one.
> +                */
> +               no_wait = O_NONBLOCK;
> +       }
> +}
> +
> +static int __wake_userfault(struct userfaultfd_ctx *ctx,
> +                           struct userfaultfd_wake_range *range)
> +{
> +       wait_queue_t *wq;
> +       struct userfaultfd_wait_queue *uwq;
> +       int ret;
> +       unsigned long start, end;
> +
> +       start = range->start;
> +       end = range->start + range->len;
> +
> +       ret = -ENOENT;
> +       spin_lock(&ctx->fault_wqh.lock);
> +       list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +               uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +               if (uwq->pending)
> +                       continue;
> +               if (uwq->address >= start && uwq->address < end) {
> +                       ret = 0;
> +                       /* wake all in the range and autoremove */
> +                       __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
> +                                            range);
> +                       break;
> +               }
> +       }
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       return ret;
> +}
> +
> +static __always_inline int wake_userfault(struct userfaultfd_ctx *ctx,
> +                                         struct userfaultfd_wake_range *range)
> +{
> +       if (!waitqueue_active(&ctx->fault_wqh))
> +               return -ENOENT;
> +
> +       return __wake_userfault(ctx, range);
> +}
> +
> +static __always_inline int validate_range(struct mm_struct *mm,
> +                                         __u64 start, __u64 len)
> +{
> +       __u64 task_size = mm->task_size;
> +
> +       if (start & ~PAGE_MASK)
> +               return -EINVAL;
> +       if (len & ~PAGE_MASK)
> +               return -EINVAL;
> +       if (!len)
> +               return -EINVAL;
> +       if (start < mmap_min_addr)
> +               return -EINVAL;
> +       if (start >= task_size)
> +               return -EINVAL;
> +       if (len > task_size - start)
> +               return -EINVAL;
> +       return 0;
> +}
> +
> +static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> +                               unsigned long arg)
> +{
> +       struct mm_struct *mm = ctx->mm;
> +       struct vm_area_struct *vma, *prev, *cur;
> +       int ret;
> +       struct uffdio_register uffdio_register;
> +       struct uffdio_register __user *user_uffdio_register;
> +       unsigned long vm_flags, new_flags;
> +       bool found;
> +       unsigned long start, end, vma_end;
> +
> +       user_uffdio_register = (struct uffdio_register __user *) arg;
> +
> +       ret = -EFAULT;
> +       if (copy_from_user(&uffdio_register, user_uffdio_register,
> +                          sizeof(uffdio_register)-sizeof(__u64)))
> +               goto out;
> +
> +       ret = -EINVAL;
> +       if (!uffdio_register.mode)
> +               goto out;
> +       if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
> +                                    UFFDIO_REGISTER_MODE_WP))
> +               goto out;
> +       vm_flags = 0;
> +       if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
> +               vm_flags |= VM_UFFD_MISSING;
> +       if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
> +               vm_flags |= VM_UFFD_WP;
> +               /*
> +                * FIXME: remove the below error constraint by
> +                * implementing the wprotect tracking mode.
> +                */
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       ret = validate_range(mm, uffdio_register.range.start,
> +                            uffdio_register.range.len);
> +       if (ret)
> +               goto out;
> +
> +       start = uffdio_register.range.start;
> +       end = start + uffdio_register.range.len;
> +
> +       down_write(&mm->mmap_sem);
> +       vma = find_vma_prev(mm, start, &prev);
> +
> +       ret = -ENOMEM;
> +       if (!vma)
> +               goto out_unlock;
> +
> +       /* check that there's at least one vma in the range */
> +       ret = -EINVAL;
> +       if (vma->vm_start >= end)
> +               goto out_unlock;
> +
> +       /*
> +        * Search for incompatible vmas.
> +        *
> +        * FIXME: this shall be relaxed later so that it doesn't fail
> +        * on tmpfs backed vmas (in addition to the current allowance
> +        * on anonymous vmas).
> +        */
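> +       /*
> +        * First pass over the range: validate only, no vma is
> +        * modified until the whole range has been checked.
> +        */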
> +       found = false;
> +       for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
> +               cond_resched();
> +
> +               BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> +                      !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +
> +               /* check for incompatible vmas */
> +               ret = -EINVAL;
> +               if (cur->vm_ops)
> +                       goto out_unlock;
> +
> +               /*
> +                * Check that this vma isn't already owned by a
> +                * different userfaultfd. We can't allow more than one
> +                * userfaultfd to own a single vma simultaneously or we
> +                * wouldn't know which one to deliver the userfaults to.
> +                */
> +               ret = -EBUSY;
> +               if (cur->vm_userfaultfd_ctx.ctx &&
> +                   cur->vm_userfaultfd_ctx.ctx != ctx)
> +                       goto out_unlock;
> +
> +               found = true;
> +       }
> +       BUG_ON(!found);
> +
> +       /*
> +        * Now that we scanned all vmas, we can already tell userland
> +        * which ioctl methods are guaranteed to succeed on this range.
> +        */
> +       ret = -EFAULT;
> +       if (put_user(UFFD_API_RANGE_IOCTLS, &user_uffdio_register->ioctls))
> +               goto out_unlock;
> +
> +       if (vma->vm_start < start)
> +               prev = vma;
> +
> +       ret = 0;
> +       do {
> +               cond_resched();
> +
> +               BUG_ON(vma->vm_ops);
> +               BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> +                      vma->vm_userfaultfd_ctx.ctx != ctx);
> +
> +               /*
> +                * Nothing to do: this vma is already registered into this
> +                * userfaultfd and with the right tracking mode too.
> +                */
> +               if (vma->vm_userfaultfd_ctx.ctx == ctx &&
> +                   (vma->vm_flags & vm_flags) == vm_flags)
> +                       goto skip;
> +
> +               if (vma->vm_start > start)
> +                       start = vma->vm_start;
> +               vma_end = min(end, vma->vm_end);
> +
> +               new_flags = (vma->vm_flags & ~vm_flags) | vm_flags;
> +               prev = vma_merge(mm, prev, start, vma_end, new_flags,
> +                                vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> +                                vma_policy(vma),
> +                                ((struct vm_userfaultfd_ctx){ ctx }));
> +               if (prev) {
> +                       vma = prev;
> +                       goto next;
> +               }
> +               if (vma->vm_start < start) {
> +                       ret = split_vma(mm, vma, start, 1);
> +                       if (ret)
> +                               break;
> +               }
> +               if (vma->vm_end > end) {
> +                       ret = split_vma(mm, vma, end, 0);
> +                       if (ret)
> +                               break;
> +               }
> +       next:
> +               /*
> +                * In the vma_merge() successful mprotect-like case 8:
> +                * the next vma was merged into the current one and
> +                * the current one has not been updated yet.
> +                */
> +               vma->vm_flags = new_flags;
> +               vma->vm_userfaultfd_ctx.ctx = ctx;
> +
> +       skip:
> +               prev = vma;
> +               start = vma->vm_end;
> +               vma = vma->vm_next;
> +       } while (vma && vma->vm_start < end);
> +out_unlock:
> +       up_write(&mm->mmap_sem);
> +out:
> +       return ret;
> +}
> +
> +static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> +                                 unsigned long arg)
> +{
> +       struct mm_struct *mm = ctx->mm;
> +       struct vm_area_struct *vma, *prev, *cur;
> +       int ret;
> +       struct uffdio_range uffdio_unregister;
> +       unsigned long new_flags;
> +       bool found;
> +       unsigned long start, end, vma_end;
> +       const void __user *buf = (void __user *)arg;
> +
> +       ret = -EFAULT;
> +       if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
> +               goto out;
> +
> +       ret = validate_range(mm, uffdio_unregister.start,
> +                            uffdio_unregister.len);
> +       if (ret)
> +               goto out;
> +
> +       start = uffdio_unregister.start;
> +       end = start + uffdio_unregister.len;
> +
> +       down_write(&mm->mmap_sem);
> +       vma = find_vma_prev(mm, start, &prev);
> +
> +       ret = -ENOMEM;
> +       if (!vma)
> +               goto out_unlock;
> +
> +       /* check that there's at least one vma in the range */
> +       ret = -EINVAL;
> +       if (vma->vm_start >= end)
> +               goto out_unlock;
> +
> +       /*
> +        * Search for incompatible vmas.
> +        *
> +        * FIXME: this shall be relaxed later so that it doesn't fail
> +        * on tmpfs backed vmas (in addition to the current allowance
> +        * on anonymous vmas).
> +        */
> +       found = false;
> +       ret = -EINVAL;
> +       for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
> +               cond_resched();
> +
> +               BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> +                      !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +
> +               /*
> +                * Check for incompatible vmas. Not strictly required
> +                * here, as incompatible vmas cannot have a
> +                * userfaultfd_ctx registered on them, but this
> +                * provides stricter behavior to notice
> +                * unregistration errors.
> +                */
> +               if (cur->vm_ops)
> +                       goto out_unlock;
> +
> +               found = true;
> +       }
> +       BUG_ON(!found);
> +
> +       if (vma->vm_start < start)
> +               prev = vma;
> +
> +       ret = 0;
> +       do {
> +               cond_resched();
> +
> +               BUG_ON(vma->vm_ops);
> +
> +               /*
> +                * Nothing to do: this vma is not registered into any
> +                * userfaultfd, so there is nothing to unregister.
> +                */
> +               if (!vma->vm_userfaultfd_ctx.ctx)
> +                       goto skip;
> +
> +               if (vma->vm_start > start)
> +                       start = vma->vm_start;
> +               vma_end = min(end, vma->vm_end);
> +
> +               new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> +               prev = vma_merge(mm, prev, start, vma_end, new_flags,
> +                                vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> +                                vma_policy(vma),
> +                                NULL_VM_UFFD_CTX);
> +               if (prev) {
> +                       vma = prev;
> +                       goto next;
> +               }
> +               if (vma->vm_start < start) {
> +                       ret = split_vma(mm, vma, start, 1);
> +                       if (ret)
> +                               break;
> +               }
> +               if (vma->vm_end > end) {
> +                       ret = split_vma(mm, vma, end, 0);
> +                       if (ret)
> +                               break;
> +               }
> +       next:
> +               /*
> +                * In the vma_merge() successful mprotect-like case 8:
> +                * the next vma was merged into the current one and
> +                * the current one has not been updated yet.
> +                */
> +               vma->vm_flags = new_flags;
> +               vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> +
> +       skip:
> +               prev = vma;
> +               start = vma->vm_end;
> +               vma = vma->vm_next;
> +       } while (vma && vma->vm_start < end);
> +out_unlock:
> +       up_write(&mm->mmap_sem);
> +out:
> +       return ret;
> +}
> +
> +/*
> + * This is mostly needed to re-wakeup those userfaults that were still
> + * pending when userland woke them up the first time. We don't wake
> + * the pending ones, to avoid making blocking reads block or
> + * non-blocking reads return -EAGAIN when used with POLLIN, which
> + * would leave userland in doubt about why POLLIN wasn't reliable.
> + */
> +static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
> +                           unsigned long arg)
> +{
> +       int ret;
> +       struct uffdio_range uffdio_wake;
> +       struct userfaultfd_wake_range range;
> +       const void __user *buf = (void __user *)arg;
> +
> +       ret = -EFAULT;
> +       if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
> +               goto out;
> +
> +       ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
> +       if (ret)
> +               goto out;
> +
> +       range.start = uffdio_wake.start;
> +       range.len = uffdio_wake.len;
> +
> +       /*
> +        * len == 0 means wake all and we don't want to wake all here,
> +        * so check it again to be sure.
> +        */
> +       VM_BUG_ON(!range.len);
> +
> +       ret = wake_userfault(ctx, &range);
> +
> +out:
> +       return ret;
> +}
> +
> +/*
> + * userland asks for a certain API version and we return which bits
> + * and ioctl commands are implemented in this kernel for such API
> + * version or -EINVAL if unknown.
> + */
> +static int userfaultfd_api(struct userfaultfd_ctx *ctx,
> +                          unsigned long arg)
> +{
> +       struct uffdio_api uffdio_api;
> +       void __user *buf = (void __user *)arg;
> +       int ret;
> +
> +       ret = -EINVAL;
> +       if (ctx->state != UFFD_STATE_WAIT_API)
> +               goto out;
> +       ret = -EFAULT;
> +       if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
> +               goto out;
> +       if (uffdio_api.api != UFFD_API) {
> +               /* careful not to leak info, we only read the first 8 bytes */
> +               memset(&uffdio_api, 0, sizeof(uffdio_api));
> +               if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
> +                       goto out;
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +       /* careful not to leak info, we only read the first 8 bytes */
> +       uffdio_api.bits = UFFD_API_BITS;
> +       uffdio_api.ioctls = UFFD_API_IOCTLS;
> +       ret = -EFAULT;
> +       if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
> +               goto out;
> +       ctx->state = UFFD_STATE_RUNNING;
> +       ret = 0;
> +out:
> +       return ret;
> +}
> +
> +static long userfaultfd_ioctl(struct file *file, unsigned cmd,
> +                             unsigned long arg)
> +{
> +       int ret = -EINVAL;
> +       struct userfaultfd_ctx *ctx = file->private_data;
> +
> +       switch(cmd) {
> +       case UFFDIO_API:
> +               ret = userfaultfd_api(ctx, arg);
> +               break;
> +       case UFFDIO_REGISTER:
> +               ret = userfaultfd_register(ctx, arg);
> +               break;
> +       case UFFDIO_UNREGISTER:
> +               ret = userfaultfd_unregister(ctx, arg);
> +               break;
> +       case UFFDIO_WAKE:
> +               ret = userfaultfd_wake(ctx, arg);
> +               break;
> +       }
> +       return ret;
> +}
> +
> +#ifdef CONFIG_PROC_FS
> +static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> +{
> +       struct userfaultfd_ctx *ctx = f->private_data;
> +       wait_queue_t *wq;
> +       struct userfaultfd_wait_queue *uwq;
> +       unsigned long pending = 0, total = 0;
> +
> +       spin_lock(&ctx->fault_wqh.lock);
> +       list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +               uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +               if (uwq->pending)
> +                       pending++;
> +               total++;
> +       }
> +       spin_unlock(&ctx->fault_wqh.lock);
> +
> +       /*
> +        * If more protocols are added, they will all be shown
> +        * separated by a space. Like this:
> +        *      protocols: 0xaa 0xbb
> +        */
> +       seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
> +                  pending, total, UFFD_API, UFFD_API_BITS,
> +                  UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
> +}
> +#endif
> +
> +static const struct file_operations userfaultfd_fops = {
> +#ifdef CONFIG_PROC_FS
> +       .show_fdinfo    = userfaultfd_show_fdinfo,
> +#endif
> +       .release        = userfaultfd_release,
> +       .poll           = userfaultfd_poll,
> +       .read           = userfaultfd_read,
> +       .unlocked_ioctl = userfaultfd_ioctl,
> +       .compat_ioctl   = userfaultfd_ioctl,
> +       .llseek         = noop_llseek,
> +};
> +
> +/**
> + * userfaultfd_file_create - Creates a userfaultfd file pointer.
> + * @flags: Flags for the userfaultfd file.
> + *
> + * This function creates a userfaultfd file pointer, w/out installing
> + * it into the fd table. This is useful when the userfaultfd file is
> + * used during the initialization of data structures that require
> + * extra setup after the userfaultfd creation. So the userfaultfd
> + * creation is split into the file pointer creation phase, and the
> + * file descriptor installation phase.  In this way races with
> + * userspace closing the newly installed file descriptor can be
> + * avoided.  Returns a userfaultfd file pointer, or a proper error
> + * pointer.
> + */
> +static struct file *userfaultfd_file_create(int flags)
> +{
> +       struct file *file;
> +       struct userfaultfd_ctx *ctx;
> +
> +       BUG_ON(!current->mm);
> +
> +       /* Check the UFFD_* constants for consistency.  */
> +       BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
> +       BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
> +
> +       file = ERR_PTR(-EINVAL);
> +       if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
> +               goto out;
> +
> +       file = ERR_PTR(-ENOMEM);
> +       ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +       if (!ctx)
> +               goto out;
> +
> +       atomic_set(&ctx->refcount, 1);
> +       init_waitqueue_head(&ctx->fault_wqh);
> +       init_waitqueue_head(&ctx->fd_wqh);
> +       ctx->flags = flags;
> +       ctx->state = UFFD_STATE_WAIT_API;
> +       ctx->released = false;
> +       ctx->mm = current->mm;
> +       /* prevent the mm struct from being freed */
> +       atomic_inc(&ctx->mm->mm_count);
> +
> +       file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
> +                                 O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
> +       if (IS_ERR(file))
> +               kfree(ctx);
> +out:
> +       return file;
> +}
> +
> +SYSCALL_DEFINE1(userfaultfd, int, flags)
> +{
> +       int fd, error;
> +       struct file *file;
> +
> +       error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
> +       if (error < 0)
> +               return error;
> +       fd = error;
> +
> +       file = userfaultfd_file_create(flags);
> +       if (IS_ERR(file)) {
> +               error = PTR_ERR(file);
> +               goto err_put_unused_fd;
> +       }
> +       fd_install(fd, file);
> +
> +       return fd;
> +
> +err_put_unused_fd:
> +       put_unused_fd(fd);
> +
> +       return error;
> +}



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [Qemu-devel] [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt
  2015-03-05 17:17 ` [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
@ 2015-03-06 15:39   ` Eric Blake
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Blake @ 2015-03-06 15:39 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Android Kernel Team
  Cc: Robert Love, Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi,
	Andrew Jones, Sanidhya Kashyap, KOSAKI Motohiro,
	Michel Lespinasse, Taras Glek, zhang.zhanghailiang,
	Pavel Emelyanov, Hugh Dickins, Mel Gorman, Sasha Levin,
	Dr. David Alan Gilbert, Huangpeng (Peter),
	Andres Lagar-Cavilla, Christopher Covington, Anthony Liguori,
	Paolo Bonzini, Kirill A. Shutemov, Keith Packard, Wenchao Xia,
	Juan Quintela, Andy Lutomirski, Minchan Kim, Dmitry Adamushko,
	Johannes Weiner, Mike Hommey, Andrew Morton, Linus Torvalds,
	Peter Feiner

On 03/05/2015 10:17 AM, Andrea Arcangeli wrote:
> Add documentation.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 97 insertions(+)
>  create mode 100644 Documentation/vm/userfaultfd.txt

Just a grammar review (no analysis of technical correctness)

> 
> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
> new file mode 100644
> index 0000000..2ec296c
> --- /dev/null
> +++ b/Documentation/vm/userfaultfd.txt
> @@ -0,0 +1,97 @@
> += Userfaultfd =
> +
> +== Objective ==
> +
> +Userfaults allow to implement on demand paging from userland and more

s/to implement/the implementation of/
and maybe: s/on demand/on-demand/

> +generally they allow userland to take control various memory page
> +faults, something otherwise only the kernel code could do.
> +
> +For example userfaults allows a proper and more optimal implementation
> +of the PROT_NONE+SIGSEGV trick.
> +
> +== Design ==
> +
> +Userfaults are delivered and resolved through the userfaultfd syscall.
> +
> +The userfaultfd (aside from registering and unregistering virtual
> +memory ranges) provides for two primary functionalities:

s/provides for/provides/

> +
> +1) read/POLLIN protocol to notify an userland thread of the faults

s/an userland/a userland/ (remember, 'a unicorn gets an umbrella' - if
the 'u' is pronounced 'you' the correct article is 'a')

> +   happening
> +
> +2) various UFFDIO_* ioctls that can mangle over the virtual memory
> +   regions registered in the userfaultfd that allows userland to
> +   efficiently resolve the userfaults it receives via 1) or to mangle
> +   the virtual memory in the background

maybe: s/mangle/manage/2

> +
> +The real advantage of userfaults if compared to regular virtual memory
> +management of mremap/mprotect is that the userfaults in all their
> +operations never involve heavyweight structures like vmas (in fact the
> +userfaultfd runtime load never takes the mmap_sem for writing).
> +
> +Vmas are not suitable for page(or hugepage)-granular fault tracking

s/page(or hugepage)-granular/page- (or hugepage-) granular/

> +when dealing with virtual address spaces that could span
> +Terabytes. Too many vmas would be needed for that.
> +
> +The userfaultfd once opened by invoking the syscall, can also be
> +passed using unix domain sockets to a manager process, so the same
> +manager process could handle the userfaults of a multitude of
> +different process without them being aware about what is going on

s/process/processes/

> +(well of course unless they later try to use the userfaultfd themself

s/themself/themselves/

> +on the same region the manager is already tracking, which is a corner
> +case that would currently return -EBUSY).
> +
> +== API ==
> +
> +When first opened the userfaultfd must be enabled invoking the
> +UFFDIO_API ioctl specifying an uffdio_api.api value set to UFFD_API

s/an uffdio/a uffdio/

> +which will specify the read/POLLIN protocol userland intends to speak
> +on the UFFD. The UFFDIO_API ioctl if successful (i.e. if the requested
> +uffdio_api.api is spoken also by the running kernel), will return into
> +uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks of
> +respectively the activated feature bits below PAGE_SHIFT in the
> +userfault addresses returned by read(2) and the generic ioctl
> +available.
> +
> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
> +register a memory range in the userfaultfd by setting the
> +uffdio_register structure accordingly. The uffdio_register.mode
> +bitmask will specify to the kernel which kind of faults to track for
> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
> +pages). The UFFDIO_REGISTER ioctl will return the
> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
> +userfaults on the range reigstered. Not all ioctls will necessarily be

s/reigstered/registered/

> +supported for all memory types depending on the underlying virtual
> +memory backend (anonymous memory vs tmpfs vs real filebacked
> +mappings).
> +
> +Userland can use the uffdio_register.ioctls to mangle the virtual

maybe s/mangle/manage/

> +address space in the background (to add or potentially also remove
> +memory from the userfaultfd registered range). This means an userfault

s/an/a/

> +could be triggering just before userland maps in the background the
> +user-faulted page. To avoid POLLIN resulting in an unexpected blocking
> +read (if the UFFD is not opened in nonblocking mode in the first
> +place), we don't allow the background thread to wake userfaults that
> +haven't been read by userland yet. If we would do that likely the
> +UFFDIO_WAKE ioctl could be dropped. This may change in the future
> +(with a UFFD_API protocol bumb combined with the removal of the

s/bumb/bump/

> +UFFDIO_WAKE ioctl) if it'll be demonstrated that it's a valid
> +optimization and worthy to force userland to use the UFFD always in
> +nonblocking mode if combined with POLLIN.
> +
> +userfaultfd is also a generic enough feature, that it allows KVM to
> +implement postcopy live migration (one form of memory externalization
> +consisting of a virtual machine running with part or all of its memory
> +residing on a different node in the cloud) without having to modify a
> +single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
> +and all other GUP features works just fine in combination with
> +userfaults (userfaults trigger async page faults in the guest
> +scheduler so those guest processes that aren't waiting for userfaults
> +can keep running in the guest vcpus).
> +
> +The primary ioctl to resolve userfaults is UFFDIO_COPY. That
> +atomically copies a page into the userfault registered range and wakes
> +up the blocked userfaults (unless uffdio_copy.mode &
> +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
> +UFFDIO_COPY.
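
Putting the quoted documentation together, the userland lifecycle
would look roughly like this (a sketch only: it assumes the v3 uAPI
names quoted above, a 1MB anonymous test area, and that patch 12/21
wires up __NR_userfaultfd; faults would be resolved with UFFDIO_COPY
as sketched elsewhere in the thread):

#include <fcntl.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	struct pollfd pfd;
	__u64 addr;
	void *area;
	long uffd;

	/* a 1MB anonymous test area to take userfaults on */
	area = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return 1;

	/* 1) create the uffd and negotiate the protocol */
	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return 1;

	/* 2) register the range, tracking missing pages */
	reg.range.start = (unsigned long) area;
	reg.range.len = 1UL << 20;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 1;

	/* 3) poll/read protocol: each read returns one 8-byte fault
	 * address, with the UFFD_BIT_* flags in the bits below
	 * PAGE_SHIFT */
	pfd.fd = uffd;
	pfd.events = POLLIN;
	for (;;) {
		poll(&pfd, 1, -1);
		while (read(uffd, &addr, sizeof(addr)) == sizeof(addr)) {
			/* resolve the fault, e.g. with UFFDIO_COPY,
			 * which also wakes the faulting thread */
		}
	}
}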

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




end of thread

Thread overview: 32+ messages
2015-03-05 17:17 [PATCH 00/21] RFC: userfaultfd v3 Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
2015-03-06 15:39   ` [Qemu-devel] " Eric Blake
2015-03-05 17:17 ` [PATCH 03/21] userfaultfd: uAPI Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
2015-03-05 17:48   ` Pavel Emelyanov
2015-03-05 17:17 ` [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2015-03-05 17:57   ` Pavel Emelyanov
2015-03-06 10:48   ` Michael Kerrisk (man-pages)
2015-03-05 17:17 ` [PATCH 11/21] userfaultfd: buildsystem activation Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 12/21] userfaultfd: activate syscall Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
2015-03-05 18:07   ` Pavel Emelyanov
2015-03-05 17:17 ` [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
2015-03-05 17:17 ` [PATCH 16/21] userfaultfd: remap_pages: rmap preparation Andrea Arcangeli
2015-03-05 17:18 ` [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation Andrea Arcangeli
2015-03-05 17:18 ` [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI Andrea Arcangeli
2015-03-05 17:18 ` [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation Andrea Arcangeli
2015-03-05 17:39   ` Linus Torvalds
2015-03-05 18:51     ` Andrea Arcangeli
2015-03-05 19:32       ` Linus Torvalds
2015-03-05 18:01   ` Pavel Emelyanov
2015-03-05 17:18 ` [PATCH 20/21] userfaultfd: UFFDIO_REMAP Andrea Arcangeli
2015-03-05 17:18 ` [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers Andrea Arcangeli
2015-03-05 18:15 ` [PATCH 00/21] RFC: userfaultfd v3 Pavel Emelyanov
