* [PATCH 00/10] RFC: userfault
@ 2014-07-02 16:50 ` Andrea Arcangeli
  0 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

Hello everyone,

There's a large CC list for this RFC because it adds two new
syscalls (userfaultfd and remap_anon_pages) and
MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API,
or on a completely different API if somebody has better ideas, are
welcome now.

The combination of these features is what I would propose for
implementing postcopy live migration in qemu, and in general for
demand paging of remote memory hosted on different cloud nodes.

The MADV_USERFAULT feature should be generic enough that it can
provide the userfaults to the Android volatile range feature too, on
access of reclaimed volatile pages.

If the access could ever happen in kernel context through syscalls
(not just from userland context), then userfaultfd has to be used to
make the userfault unnoticeable to the syscall (no error will be
returned). This latter feature is more advanced than what volatile
ranges alone could do with SIGBUS so far, but it's optional: if the
process doesn't call userfaultfd, the regular SIGBUS will fire, and if
the fd is closed SIGBUS will also fire for any blocked userfault that
was waiting for a userfaultfd_write ack.
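
To make the flow concrete, here is a rough sketch of the monitor
thread sitting on the userfaultfd. It's only a sketch: the payload
formats (a faulting address returned by read(), a start/end pair
written back as the userfaultfd_write ack) and the syscall number are
my assumptions for illustration, they aren't spelled out in this
cover letter.

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_remap_anon_pages
#define __NR_remap_anon_pages -1        /* placeholder, arch dependent */
#endif

/* Resolve userfaults reported through the userfaultfd "ufd". */
static void handle_userfaults(int ufd, unsigned char *recv_buf,
                              unsigned long page_size)
{
        uint64_t addr;

        while (read(ufd, &addr, sizeof(addr)) == sizeof(addr)) {
                uint64_t range[2];

                addr &= ~(uint64_t)(page_size - 1);

                /* ... receive the missing page into recv_buf here
                   (transport specific, e.g. the migration socket) ... */

                /* Move the page in place, zerocopy. */
                syscall(__NR_remap_anon_pages, addr,
                        (unsigned long)recv_buf, page_size, 0UL);

                /* userfaultfd_write ack: wakes the blocked faulters. */
                range[0] = addr;
                range[1] = addr + page_size;
                write(ufd, range, sizeof(range));
        }
}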

userfaultfd is also a generic enough feature that it allows KVM to
implement postcopy live migration without having to modify a single
line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
other GUP features work just fine in combination with userfaults
(userfaults trigger async page faults in the guest scheduler, so the
guest processes that aren't waiting for userfaults can keep running
in the guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults
(it's not mandatory: vmsplice will likely still be used in the case
of local postcopy live migration just to upgrade the qemu binary, but
remap_anon_pages is faster and ideal for transferring memory across
the network; it's zerocopy and doesn't touch the vma, it only holds
the mmap_sem for reading).

The current behavior of remap_anon_pages is very strict, to avoid
any chance of memory corruption going unnoticed. mremap is not strict
like that: if there's a synchronization bug it would drop the
destination range silently, resulting in subtle memory corruption for
example. remap_anon_pages would return -EEXIST in that case. If there
are holes in the source range remap_anon_pages will return -ENOENT.

If remap_anon_pages is always used with 2M naturally aligned
addresses, transparent hugepages will not be split. If there could be
4k (or any size) holes in the 2M (or any size) source range,
remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
relax some of its strict checks (-ENOENT won't be returned if
RAP_ALLOW_SRC_HOLES is set; remap_anon_pages will then just behave as
a noop on any hole in the source range). This flag is generally
useful when implementing userfaults with THP granularity, but it
shouldn't be set when doing userfaults with PAGE_SIZE granularity if
the developer wants to benefit from the strict -ENOENT behavior.
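
As an illustration of the intended usage and error handling, a THP
granularity resolver could look roughly like the sketch below. The
syscall number is a placeholder and RAP_ALLOW_SRC_HOLES is taken from
the trinity patch appended at the end; the MADV_DONTFORK detail is
explained further down.

#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_remap_anon_pages
#define __NR_remap_anon_pages -1        /* placeholder, arch dependent */
#endif
#define RAP_ALLOW_SRC_HOLES (1UL << 0)  /* as in the trinity patch below */

#define THP_SIZE (2UL << 20)

/*
 * Move one 2M naturally aligned chunk from src to dst.  With flags == 0
 * the call is strict: -EEXIST means dst was already mapped (i.e. a
 * synchronization bug), -ENOENT means src had holes.  Pass
 * RAP_ALLOW_SRC_HOLES when holes in the source are expected.
 */
static long remap_thp(unsigned long dst, unsigned long src,
                      unsigned long flags)
{
        if (syscall(__NR_remap_anon_pages, dst, src, THP_SIZE, flags) < 0)
                return -errno;
        return 0;
}

/* Allocate the source/staging area; MADV_DONTFORK keeps its mapcount
   at 1 even if the process forks (see the rmap note below). */
static void *alloc_src_area(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p != MAP_FAILED)
                madvise(p, len, MADV_DONTFORK);
        return p;
}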

The remap_anon_pages syscall API is not vectored, as I expect it to
be used mainly for demand paging (where there can be just one
faulting range per userfault) or for large ranges (with the THP model
as an alternative to zapping re-dirtied pages with MADV_DONTNEED at
4k granularity before starting the guest in the destination node),
where vectoring isn't going to provide much of a performance
advantage (thanks to the coarser THP granularity).

On the rmap side remap_anon_pages doesn't add much complexity:
there's no need for nonlinear anon vmas to support it because I added
the constraint that it will fail if the mapcount is more than 1. So
in general the source range of remap_anon_pages should be marked
MADV_DONTFORK to prevent any risk of failure if the process ever
forks (like qemu can in some cases).

One part that hasn't been tested is the poll() syscall on the
userfaultfd, because the postcopy migration thread is currently more
efficient waiting on blocking read()s (I'll write some code to test
poll() too). I also appended below a patch to trinity to exercise
remap_anon_pages and userfaultfd, and trinity completes successfully
with it.
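
For the untested poll() path, the idea would be roughly the following
(again only a sketch, assuming pending userfaults are reported as
POLLIN and that read() returns the faulting address as above):

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Wait up to timeout_ms for a userfault, then read the faulting address. */
static int wait_for_userfault(int ufd, uint64_t *addr, int timeout_ms)
{
        struct pollfd pfd = { .fd = ufd, .events = POLLIN };

        if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLIN))
                return -1;      /* timeout or error */
        return read(ufd, addr, sizeof(*addr)) == sizeof(*addr) ? 0 : -1;
}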

The code can be found here:

git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 

The branch is rebased so you can get updates for example with:

git fetch && git checkout -f origin/userfault

Comments welcome, thanks!
Andrea

From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Wed, 2 Jul 2014 18:32:35 +0200
Subject: [PATCH] add remap_anon_pages and userfaultfd

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/syscalls-x86_64.h   |   2 +
 syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
 syscalls/syscalls.h         |   2 +
 syscalls/userfaultfd.c      |  12 ++++++
 4 files changed, 116 insertions(+)
 create mode 100644 syscalls/remap_anon_pages.c
 create mode 100644 syscalls/userfaultfd.c

diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
index e09df43..a5b3a88 100644
--- a/include/syscalls-x86_64.h
+++ b/include/syscalls-x86_64.h
@@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
 	{ .entry = &syscall_sched_setattr },
 	{ .entry = &syscall_sched_getattr },
 	{ .entry = &syscall_renameat2 },
+	{ .entry = &syscall_remap_anon_pages },
+	{ .entry = &syscall_userfaultfd },
 };
diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
new file mode 100644
index 0000000..b1e9d3c
--- /dev/null
+++ b/syscalls/remap_anon_pages.c
@@ -0,0 +1,100 @@
+/*
+ * SYSCALL_DEFINE3(remap_anon_pages,
+		unsigned long, dst_start, unsigned long, src_start,
+		unsigned long, len)
+ */
+#include <stdlib.h>
+#include <asm/mman.h>
+#include <assert.h>
+#include "arch.h"
+#include "maps.h"
+#include "random.h"
+#include "sanitise.h"
+#include "shm.h"
+#include "syscall.h"
+#include "tables.h"
+#include "trinity.h"
+#include "utils.h"
+
+static const unsigned long alignments[] = {
+	1 * MB, 2 * MB, 4 * MB, 8 * MB,
+	10 * MB, 100 * MB,
+};
+
+static unsigned char *g_src, *g_dst;
+static unsigned long g_size;
+static int g_check;
+
+#define RAP_ALLOW_SRC_HOLES (1UL<<0)
+
+static void sanitise_remap_anon_pages(struct syscallrecord *rec)
+{
+	unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
+	unsigned long max_rand;
+	if (rand_bool()) {
+		g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
+			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+	} else
+		g_src = MAP_FAILED;
+	if (rand_bool()) {
+		g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
+			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+	} else
+		g_dst = MAP_FAILED;
+	g_size = size;
+	g_check = 1;
+
+	rec->a1 = (unsigned long) g_dst;
+	rec->a2 = (unsigned long) g_src;
+	rec->a3 = g_size;
+	rec->a4 = 0;
+
+	if (rand_bool())
+		max_rand = -1UL;
+	else
+		max_rand = g_size << 1;
+	if (rand_bool()) {
+		rec->a3 += (rand() % max_rand) - g_size;
+		g_check = 0;
+	}
+	if (rand_bool()) {
+		rec->a1 += (rand() % max_rand) - g_size;
+		g_check = 0;
+	}
+	if (rand_bool()) {
+		rec->a2 += (rand() % max_rand) - g_size;
+		g_check = 0;
+	}
+	if (rand_bool()) {
+		if (rand_bool()) {
+			rec->a4 = rand();
+		} else
+			rec->a4 = RAP_ALLOW_SRC_HOLES;
+	}
+	if (g_src != MAP_FAILED)
+		memset(g_src, 0xaa, size);
+}
+
+static void post_remap_anon_pages(struct syscallrecord *rec)
+{
+	if (g_check && !rec->retval) {
+		unsigned long size = g_size;
+		unsigned char *dst = g_dst;
+		while (size--)
+			assert(dst[size] == 0xaaU);
+	}
+	munmap(g_src, g_size);
+	munmap(g_dst, g_size);
+}
+
+struct syscallentry syscall_remap_anon_pages = {
+	.name = "remap_anon_pages",
+	.num_args = 4,
+	.arg1name = "dst_start",
+	.arg2name = "src_start",
+	.arg3name = "len",
+	.arg4name = "flags",
+	.group = GROUP_VM,
+	.sanitise = sanitise_remap_anon_pages,
+	.post = post_remap_anon_pages,
+};
diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
index 114500c..b8eaa63 100644
--- a/syscalls/syscalls.h
+++ b/syscalls/syscalls.h
@@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
 extern struct syscallentry syscall_sched_getattr;
 extern struct syscallentry syscall_renameat2;
 extern struct syscallentry syscall_kern_features;
+extern struct syscallentry syscall_remap_anon_pages;
+extern struct syscallentry syscall_userfaultfd;
diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
new file mode 100644
index 0000000..769fe78
--- /dev/null
+++ b/syscalls/userfaultfd.c
@@ -0,0 +1,12 @@
+/*
+ * SYSCALL_DEFINE1(userfaultfd, int, flags)
+ */
+#include "sanitise.h"
+
+struct syscallentry syscall_userfaultfd = {
+	.name = "userfaultfd",
+	.num_args = 1,
+	.arg1name = "flags",
+	.arg1type = ARG_LEN,
+	.rettype = RET_FD,
+};


Andrea Arcangeli (10):
  mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
  mm: madvise MADV_USERFAULT
  mm: PT lock: export double_pt_lock/unlock
  mm: rmap preparation for remap_anon_pages
  mm: swp_entry_swapcount
  mm: sys_remap_anon_pages
  waitqueue: add nr wake parameter to __wake_up_locked_key
  userfaultfd: add new syscall to provide memory externalization
  userfaultfd: make userfaultfd_write non blocking
  userfaultfd: use VM_FAULT_RETRY in handle_userfault()

 arch/alpha/include/uapi/asm/mman.h     |   3 +
 arch/mips/include/uapi/asm/mman.h      |   3 +
 arch/parisc/include/uapi/asm/mman.h    |   3 +
 arch/x86/syscalls/syscall_32.tbl       |   2 +
 arch/x86/syscalls/syscall_64.tbl       |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   3 +
 fs/Makefile                            |   1 +
 fs/proc/task_mmu.c                     |   5 +-
 fs/userfaultfd.c                       | 593 +++++++++++++++++++++++++++++++++
 include/linux/huge_mm.h                |  11 +-
 include/linux/ksm.h                    |   4 +-
 include/linux/mm.h                     |   5 +
 include/linux/mm_types.h               |   2 +-
 include/linux/swap.h                   |   6 +
 include/linux/syscalls.h               |   5 +
 include/linux/userfaultfd.h            |  42 +++
 include/linux/wait.h                   |   5 +-
 include/uapi/asm-generic/mman-common.h |   3 +
 init/Kconfig                           |  10 +
 kernel/sched/wait.c                    |   7 +-
 kernel/sys_ni.c                        |   2 +
 mm/fremap.c                            | 506 ++++++++++++++++++++++++++++
 mm/huge_memory.c                       | 209 ++++++++++--
 mm/ksm.c                               |   2 +-
 mm/madvise.c                           |  19 +-
 mm/memory.c                            |  14 +
 mm/mremap.c                            |   2 +-
 mm/rmap.c                              |   9 +
 mm/swapfile.c                          |  13 +
 net/sunrpc/sched.c                     |   2 +-
 30 files changed, 1447 insertions(+), 46 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 01/10] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
  2014-07-02 16:50 ` Andrea Arcangeli
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

We have run out of the 32 bits available in vm_flags; this is a noop
change for 64bit archs.
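
For illustration only (not part of the patch): VM_USERFAULT, added
later in this series, is bit 32, which a 32bit unsigned long silently
truncates away, hence the wider vm_flags_t:

#include <stdio.h>

#define VM_USERFAULT 0x100000000ULL     /* bit 32, added in a later patch */

int main(void)
{
        /* On a 32bit arch the cast below yields 0: the flag would be
           silently lost if vm_flags stayed an unsigned long. */
        printf("as unsigned long:      %#lx\n", (unsigned long)VM_USERFAULT);
        printf("as unsigned long long: %#llx\n", VM_USERFAULT);
        return 0;
}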

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/proc/task_mmu.c       | 4 ++--
 include/linux/huge_mm.h  | 4 ++--
 include/linux/ksm.h      | 4 ++--
 include/linux/mm_types.h | 2 +-
 mm/huge_memory.c         | 2 +-
 mm/ksm.c                 | 2 +-
 mm/madvise.c             | 2 +-
 mm/mremap.c              | 2 +-
 8 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index cfa63ee..fb91692 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 	/*
 	 * Don't forget to update Documentation/ on changes.
 	 */
-	static const char mnemonics[BITS_PER_LONG][2] = {
+	static const char mnemonics[BITS_PER_LONG+1][2] = {
 		/*
 		 * In case if we meet a flag we don't know about.
 		 */
-		[0 ... (BITS_PER_LONG-1)] = "??",
+		[0 ... (BITS_PER_LONG)] = "??",
 
 		[ilog2(VM_READ)]	= "rd",
 		[ilog2(VM_WRITE)]	= "wr",
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b826239..3a2c57e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -125,7 +125,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
 extern int hugepage_madvise(struct vm_area_struct *vma,
-			    unsigned long *vm_flags, int advice);
+			    vm_flags_t *vm_flags, int advice);
 extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
@@ -187,7 +187,7 @@ static inline int split_huge_page(struct page *page)
 #define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
-				   unsigned long *vm_flags, int advice)
+				   vm_flags_t *vm_flags, int advice)
 {
 	BUG();
 	return 0;
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 3be6bb1..8b35253 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -18,7 +18,7 @@ struct mem_cgroup;
 
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, int advice, unsigned long *vm_flags);
+		unsigned long end, int advice, vm_flags_t *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
 
@@ -94,7 +94,7 @@ static inline int PageKsm(struct page *page)
 
 #ifdef CONFIG_MMU
 static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, int advice, unsigned long *vm_flags)
+		unsigned long end, int advice, vm_flags_t *vm_flags)
 {
 	return 0;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 96c5750..cd42c8c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -217,7 +217,7 @@ struct page_frag {
 #endif
 };
 
-typedef unsigned long __nocast vm_flags_t;
+typedef unsigned long long __nocast vm_flags_t;
 
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33514d8..7e0776a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1929,7 +1929,7 @@ out:
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-		     unsigned long *vm_flags, int advice)
+		     vm_flags_t *vm_flags, int advice)
 {
 	switch (advice) {
 	case MADV_HUGEPAGE:
diff --git a/mm/ksm.c b/mm/ksm.c
index 346ddc9..6052cf2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1736,7 +1736,7 @@ static int ksm_scan_thread(void *nothing)
 }
 
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, int advice, unsigned long *vm_flags)
+		unsigned long end, int advice, vm_flags_t *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
diff --git a/mm/madvise.c b/mm/madvise.c
index a402f8f..b31aad1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -49,7 +49,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	int error = 0;
 	pgoff_t pgoff;
-	unsigned long new_flags = vma->vm_flags;
+	vm_flags_t new_flags = vma->vm_flags;
 
 	switch (behavior) {
 	case MADV_NORMAL:
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..fa7db87 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -239,7 +239,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
-	unsigned long vm_flags = vma->vm_flags;
+	vm_flags_t vm_flags = vma->vm_flags;
 	unsigned long new_pgoff;
 	unsigned long moved_len;
 	unsigned long excess = 0;


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 02/10] mm: madvise MADV_USERFAULT
  2014-07-02 16:50 ` Andrea Arcangeli
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
userland touches a still unmapped virtual address, a SIGBUS signal is
sent instead of allocating a new page. The SIGBUS signal handler will
then resolve the page fault in userland by calling the
remap_anon_pages syscall.
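
As a rough illustration of that flow (sketch only: the MADV_USERFAULT
value matches the asm-generic hunk below, while the remap_anon_pages
syscall number and the way the missing page contents are obtained are
placeholders):

#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MADV_USERFAULT
#define MADV_USERFAULT 18               /* value from the asm-generic hunk below */
#endif
#ifndef __NR_remap_anon_pages
#define __NR_remap_anon_pages -1        /* placeholder, arch dependent */
#endif

static long page_size;
static unsigned char *staging;          /* page sized bounce buffer */

/* SIGBUS fires when a still unmapped address in a VM_USERFAULT vma is
   touched; resolve it by materializing the page and moving it in place. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
        unsigned long addr = (unsigned long)si->si_addr & ~(page_size - 1);

        /* Writing the staging page both provides the contents (zeroes
           here, the real data in a postcopy setup) and guarantees the
           source page exists for remap_anon_pages. */
        memset(staging, 0, page_size);
        syscall(__NR_remap_anon_pages, addr, (unsigned long)staging,
                (unsigned long)page_size, 0UL);
}

static int arm_userfault(void *area, size_t len)
{
        struct sigaction sa = { .sa_sigaction = sigbus_handler,
                                .sa_flags = SA_SIGINFO };

        page_size = sysconf(_SC_PAGESIZE);
        staging = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (staging == MAP_FAILED || sigaction(SIGBUS, &sa, NULL))
                return -1;
        return madvise(area, len, MADV_USERFAULT);
}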

This functionality is needed to reliably implement postcopy live
migration in KVM (without having to use a special chardevice that
would disable all advanced Linux VM features, like swapping, KSM, THP,
automatic NUMA balancing, etc...).

MADV_USERFAULT could also be used to offload parts of anonymous memory
regions to remote nodes or to implement network distributed shared
memory.

Here I enlarged the vm_flags to 64bit as we run out of bits (noop on
64bit kernels). An alternative is to find some combination of flags
that are mutually exclusive if set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 ++
 arch/mips/include/uapi/asm/mman.h      |  3 ++
 arch/parisc/include/uapi/asm/mman.h    |  3 ++
 arch/xtensa/include/uapi/asm/mman.h    |  3 ++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  1 +
 include/uapi/asm-generic/mman-common.h |  3 ++
 mm/huge_memory.c                       | 61 +++++++++++++++++++++-------------
 mm/madvise.c                           | 17 ++++++++++
 mm/memory.c                            | 13 ++++++++
 10 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..a10313c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..d9d11a4 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -84,6 +84,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..7bc7b7b 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -66,6 +66,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	71		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 72		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 00eed67..5448d88 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -90,6 +90,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fb91692..8636cda 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -568,6 +568,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_HUGEPAGE)]	= "hg",
 		[ilog2(VM_NOHUGEPAGE)]	= "nh",
 		[ilog2(VM_MERGEABLE)]	= "mg",
+		[ilog2(VM_USERFAULT)]	= "uf",
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..00faeda 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -139,6 +139,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_USERFAULT	0x100000000ULL	/* Trigger user faults if not mapped */
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36..dbf1e70 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -52,6 +52,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7e0776a..1928463 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -720,8 +720,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 	pgtable = pte_alloc_one(mm, haddr);
-	if (unlikely(!pgtable))
-		return VM_FAULT_OOM;
+	if (unlikely(!pgtable)) {
+		mem_cgroup_uncharge_page(page);
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK);
+		return VM_FAULT_FALLBACK;
+	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
 	/*
@@ -739,6 +743,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
+
+		/* Deliver the page fault to userland */
+		if (vma->vm_flags & VM_USERFAULT) {
+			spin_unlock(ptl);
+			mem_cgroup_uncharge_page(page);
+			put_page(page);
+			pte_free(mm, pgtable);
+			return VM_FAULT_SIGBUS;
+		}
+
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
@@ -747,6 +761,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		atomic_long_inc(&mm->nr_ptes);
 		spin_unlock(ptl);
+		count_vm_event(THP_FAULT_ALLOC);
 	}
 
 	return 0;
@@ -767,20 +782,17 @@ static inline struct page *alloc_hugepage_vma(int defrag,
 }
 
 /* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
 {
 	pmd_t entry;
-	if (!pmd_none(*pmd))
-		return false;
 	entry = mk_pmd(zero_page, vma->vm_page_prot);
 	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	atomic_long_inc(&mm->nr_ptes);
-	return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -802,6 +814,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgtable_t pgtable;
 		struct page *zero_page;
 		bool set;
+		int ret;
 		pgtable = pte_alloc_one(mm, haddr);
 		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
@@ -812,14 +825,24 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_FALLBACK;
 		}
 		ptl = pmd_lock(mm, pmd);
-		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-				zero_page);
+		ret = 0;
+		set = false;
+		if (pmd_none(*pmd)) {
+			if (vma->vm_flags & VM_USERFAULT)
+				ret = VM_FAULT_SIGBUS;
+			else {
+				set_huge_zero_page(pgtable, mm, vma,
+						   haddr, pmd,
+						   zero_page);
+				set = true;
+			}
+		}
 		spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
 		}
-		return 0;
+		return ret;
 	}
 	page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 			vma, haddr, numa_node_id(), 0);
@@ -832,15 +855,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
-		mem_cgroup_uncharge_page(page);
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
-
-	count_vm_event(THP_FAULT_ALLOC);
-	return 0;
+	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -875,16 +890,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 */
 	if (is_huge_zero_pmd(pmd)) {
 		struct page *zero_page;
-		bool set;
 		/*
 		 * get_huge_zero_page() will never allocate a new page here,
 		 * since we already have a zero page to copy. It just takes a
 		 * reference.
 		 */
 		zero_page = get_huge_zero_page();
-		set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
 				zero_page);
-		BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
 		ret = 0;
 		goto out_unlock;
 	}
@@ -2135,7 +2148,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval)) {
-			if (++none <= khugepaged_max_ptes_none)
+			if (!(vma->vm_flags & VM_USERFAULT) &&
+			    ++none <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out;
@@ -2528,7 +2542,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval)) {
-			if (++none <= khugepaged_max_ptes_none)
+			if (!(vma->vm_flags & VM_USERFAULT) &&
+			    ++none <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out_unmap;
diff --git a/mm/madvise.c b/mm/madvise.c
index b31aad1..6e5e872 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -93,6 +93,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_USERFAULT:
+		if (vma->vm_ops) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_USERFAULT;
+		break;
+	case MADV_NOUSERFAULT:
+		if (vma->vm_ops) {
+			WARN_ON(new_flags & VM_USERFAULT);
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags &= ~VM_USERFAULT;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -411,6 +426,8 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 #endif
+	case MADV_USERFAULT:
+	case MADV_NOUSERFAULT:
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
 		return 1;
diff --git a/mm/memory.c b/mm/memory.c
index d67fd9f..545c417 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2641,6 +2641,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto unlock;
+		/* Deliver the page fault to userland, check inside PT lock */
+		if (vma->vm_flags & VM_USERFAULT) {
+			pte_unmap_unlock(page_table, ptl);
+			return VM_FAULT_SIGBUS;
+		}
 		goto setpte;
 	}
 
@@ -2668,6 +2673,14 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_none(*page_table))
 		goto release;
 
+	/* Deliver the page fault to userland, check inside PT lock */
+	if (vma->vm_flags & VM_USERFAULT) {
+		pte_unmap_unlock(page_table, ptl);
+		mem_cgroup_uncharge_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 setpte:
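
For illustration only (not part of the patch): a minimal userland sketch
of how the new behaviour is meant to be consumed, assuming a kernel with
this series applied and the MADV_USERFAULT value this patch adds to the
asm-generic mman header. Error handling is omitted.

#include <signal.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_USERFAULT
#define MADV_USERFAULT 18	/* asm-generic value added by this patch */
#endif

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	/*
	 * si->si_addr is the faulting address inside the MADV_USERFAULT
	 * region. A real handler must map something at that address
	 * (e.g. with remap_anon_pages, added later in this series)
	 * before returning, otherwise the access faults again.
	 */
}

int main(void)
{
	size_t len = 16 * 4096;
	struct sigaction sa;
	char *p;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	madvise(p, len, MADV_USERFAULT);

	/* touching a still unmapped page now raises SIGBUS instead of
	   allocating a new page or mapping the zero page */
	p[0] = 1;
	return 0;
}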

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 03/10] mm: PT lock: export double_pt_lock/unlock
  2014-07-02 16:50 ` Andrea Arcangeli
  (?)
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

Those two helpers are needed by remap_anon_pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h |  4 ++++
 mm/fremap.c        | 29 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00faeda..0a7f0e1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1401,6 +1401,10 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
 
+/* mm/fremap.c */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
diff --git a/mm/fremap.c b/mm/fremap.c
index 72b8fa3..1e509f7 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -281,3 +281,32 @@ out_freed:
 
 	return err;
 }
+
+void double_pt_lock(spinlock_t *ptl1,
+		    spinlock_t *ptl2)
+	__acquires(ptl1)
+	__acquires(ptl2)
+{
+	spinlock_t *ptl_tmp;
+
+	if (ptl1 > ptl2) {
+		/* exchange ptl1 and ptl2 */
+		ptl_tmp = ptl1;
+		ptl1 = ptl2;
+		ptl2 = ptl_tmp;
+	}
+	/* lock in virtual address order to avoid lock inversion */
+	spin_lock(ptl1);
+	if (ptl1 != ptl2)
+		spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+		      spinlock_t *ptl2)
+	__releases(ptl1)
+	__releases(ptl2)
+{
+	spin_unlock(ptl1);
+	if (ptl1 != ptl2)
+		spin_unlock(ptl2);
+}
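
A hypothetical caller (a sketch, not code from a later patch) would
simply wrap the critical section that manipulates both page tables:

/*
 * Sketch: take the source and destination page table locks together,
 * move the pte while both are held, then drop both. The helpers order
 * the two locks consistently and handle the ptl1 == ptl2 case, so the
 * caller doesn't need to worry about ABBA deadlocks.
 */
static void move_pte_under_both_locks(spinlock_t *src_ptl, spinlock_t *dst_ptl)
{
	double_pt_lock(src_ptl, dst_ptl);
	/* ... clear the source pte and install it at the destination ... */
	double_pt_unlock(src_ptl, dst_ptl);
}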

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 04/10] mm: rmap preparation for remap_anon_pages
  2014-07-02 16:50 ` Andrea Arcangeli
  (?)
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

remap_anon_pages (unlike remap_file_pages) tries to be non-intrusive
in the rmap code.

As far as the rmap code is concerned, remap_anon_pages only alters
page->mapping and page->index, and it does so while holding the page
lock. However, there are a few places that are allowed to do rmap
walks on anon pages without the page lock (split_huge_page and
page_referenced_anon). Those places that do rmap walks without taking
the page lock first must be updated to re-check that page->mapping
didn't change after they obtained the anon_vma lock. remap_anon_pages
takes the anon_vma lock for writing before altering page->mapping, so
if page->mapping is still the same after obtaining the anon_vma lock
(without the page lock), the rmap walks can go ahead safely (and
remap_anon_pages will wait for them to complete before proceeding).

remap_anon_pages serializes against itself with the page lock.

All other places that take the anon_vma lock while holding the
mmap_sem for writing don't need to check whether page->mapping has
changed after taking the anon_vma lock, regardless of the page lock,
because remap_anon_pages holds the mmap_sem for reading.

Overall this is a fairly small change to the rmap code, notably less
intrusive than the nonlinear vmas created by remap_file_pages.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_anon_pages must be mapped in only one
vma, but this is not a limitation when remap_anon_pages is used to
handle userland page faults with MADV_USERFAULT. The source addresses
passed to remap_anon_pages should be set VM_DONTCOPY with
MADV_DONTFORK to avoid any risk of the mapcount of the pages
increasing if fork runs in parallel in another thread, before or
while remap_anon_pages runs.
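
In userland terms the source region would be prepared roughly as below
(a sketch with a hypothetical helper name, not part of the patch):

#include <sys/mman.h>

/*
 * Sketch: a private anonymous staging area whose pages a concurrent
 * fork() can never duplicate, so their mapcount stays at 1 while
 * remap_anon_pages moves them into the MADV_USERFAULT destination.
 */
static void *alloc_rap_source(size_t len)
{
	void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (src != MAP_FAILED)
		madvise(src, len, MADV_DONTFORK);	/* sets VM_DONTCOPY */
	return src;
}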

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 24 ++++++++++++++++++++----
 mm/rmap.c        |  9 +++++++++
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1928463..94c37ca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1907,6 +1907,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
+	struct address_space *mapping;
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
@@ -1918,10 +1919,24 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * page_lock_anon_vma_read except the write lock is taken to serialise
 	 * against parallel split or collapse operations.
 	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
+	for (;;) {
+		mapping = ACCESS_ONCE(page->mapping);
+		anon_vma = page_get_anon_vma(page);
+		if (!anon_vma)
+			goto out;
+		anon_vma_lock_write(anon_vma);
+		/*
+		 * We don't hold the page lock here so
+		 * remap_anon_pages_huge_pmd can change the anon_vma
+		 * from under us until we obtain the anon_vma
+		 * lock. Verify that we obtained the anon_vma lock
+		 * before remap_anon_pages did.
+		 */
+		if (likely(mapping == ACCESS_ONCE(page->mapping)))
+			break;
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
 
 	ret = 0;
 	if (!PageCompound(page))
@@ -2420,6 +2435,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later hanlded by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
+	 * remap_anon_pages is prevented to race as well by the mmap_sem.
 	 */
 	down_write(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/rmap.c b/mm/rmap.c
index b7e94eb..59a7e7d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -450,6 +450,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
+repeat:
 	rcu_read_lock();
 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -488,6 +489,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
+	/* check if remap_anon_pages changed the anon_vma */
+	if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) {
+		anon_vma_unlock_read(anon_vma);
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto repeat;
+	}
+
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 05/10] mm: swp_entry_swapcount
  2014-07-02 16:50 ` Andrea Arcangeli
  (?)
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

Provide a new swapfile method for remap_anon_pages to verify that a
swap entry is mapped in only one vma before relocating the swap entry
to a different virtual address. Otherwise, if the swap entry is mapped
in multiple vmas, the page could get mapped in a non-linear way in
some anon_vma when it is swapped back in.
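
The expected use in remap_anon_pages is simply to refuse swap entries
with more than one user. The real caller comes in a later patch; this
is only a sketch with a hypothetical helper name and an error value
chosen for illustration:

#include <linux/swap.h>

/* Sketch: refuse to relocate a swap entry that is referenced by more
 * than one pte, i.e. potentially mapped in more than one vma. */
static int check_rap_swap_entry(swp_entry_t entry)
{
	if (swp_entry_swapcount(entry) > 1)
		return -EBUSY;
	return 0;
}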

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swapfile.c        | 13 +++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3d86a9a..3d7cae5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,6 +452,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -553,6 +554,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+	return 0;
+}
+
 #define reuse_swap_page(page)	(page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4c524f7..f516555 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -877,6 +877,19 @@ int page_swapcount(struct page *page)
 	return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+	int count = 0;
+	struct swap_info_struct *p;
+
+	p = swap_info_get(entry);
+	if (p) {
+		count = swap_count(p->swap_map[swp_offset(entry)]);
+		spin_unlock(&p->lock);
+	}
+	return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 06/10] mm: sys_remap_anon_pages
  2014-07-02 16:50 ` Andrea Arcangeli
  (?)
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

This new syscall will move anon pages across vmas, atomically and
without touching the vmas.

It only works on non shared anonymous pages because those can be
relocated without generating non linear anon_vmas in the rmap code.

It is the ideal mechanism to handle userspace page faults. Normally
the destination vma will have VM_USERFAULT set with
madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
set with madvise(MADV_DONTFORK).

MADV_DONTFORK set in the source vma prevents remap_anon_pages from
failing if the process forks during the userland page fault.
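
A minimal setup sketch is shown below (setup_areas is only an
illustrative helper; the MADV_USERFAULT value matches the one used in
the test program later in this message, MADV_DONTFORK is the existing
madvise flag):

===
 #define _GNU_SOURCE
 #include <sys/mman.h>

 #define MADV_USERFAULT	18	/* value used by this patchset */

 /* Map a userfaulting destination area and a private staging area. */
 static int setup_areas(void **dst, void **src, size_t size)
 {
 	*dst = mmap(0, size, PROT_READ|PROT_WRITE,
 		    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	*src = mmap(0, size, PROT_READ|PROT_WRITE,
 		    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (*dst == MAP_FAILED || *src == MAP_FAILED)
 		return -1;
 	/* report faults in dst to userland instead of filling them */
 	if (madvise(*dst, size, MADV_USERFAULT))
 		return -1;
 	/* keep the staging area out of any forked child */
 	if (madvise(*src, size, MADV_DONTFORK))
 		return -1;
 	return 0;
 }
===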

The thread that triggers the SIGBUS signal handler by touching an
unmapped hole in the MADV_USERFAULT region should take care to receive
the data belonging to the faulting virtual address into the source
vma. The data can come from the network, storage or any other I/O
device. After the data has been safely received in the private area of
the source vma, the thread calls remap_anon_pages to map the page at
the faulting address in the destination vma atomically, and finally it
returns from the signal handler.
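
A sketch of that handler path follows (dst, src, page_size and data_fd
are placeholders for the areas set up earlier and for whatever
transport delivers the page contents; the complete self-contained test
program later in this message uses a pre-filled staging area instead
of doing I/O):

===
 	/* addr has already been aligned down to the page boundary */
 	unsigned long offset = addr - dst;
 	ssize_t got = 0;

 	/* receive the faulting page into the source (staging) vma */
 	while (got < page_size) {
 		ssize_t n = read(data_fd, src + offset + got,
 				 page_size - got);
 		if (n <= 0)
 			perror("read"), exit(1);
 		got += n;
 	}

 	/* atomically move it to the faulting address in the dst vma */
 	if (syscall(SYS_remap_anon_pages, dst + offset, src + offset,
 		    page_size, 0) != page_size)
 		perror("remap_anon_pages"), exit(1);
===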

It is an alternative to mremap.

It only works if the vma protection bits are identical between the
source and destination vmas.

It can remap non shared anonymous pages within the same vma too.

If the source virtual memory range has any unmapped holes, or if the
destination virtual memory range is not a whole unmapped hole,
remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
provides a very strict behavior to avoid any chance of memory
corruption going unnoticed if there are userland race conditions. Only
one thread should resolve the userland page fault at any given time
for any given faulting address. This means that if two threads both
try to call remap_anon_pages on the same destination address at the
same time, the second thread will get an explicit error from this
syscall.
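
Seen from userland through the syscall(2) wrapper the losing thread
gets -1 with errno set to EEXIST, so it can simply drop its copy of
the page, as in this sketch (same placeholders as above):

===
 	long ret = syscall(SYS_remap_anon_pages, dst + offset,
 			   src + offset, page_size, 0);
 	if (ret < 0 && errno == EEXIST)
 		return;	/* another thread resolved this fault first */
 	if (ret != page_size)
 		perror("remap_anon_pages"), exit(1);
===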

The syscall will return "len" if successful. The syscall however can
be interrupted by fatal signals or errors. If interrupted it will
return the number of bytes successfully remapped before the
interruption if any, or the negative error if none. It will never
return zero. Either it will return an error or an amount of bytes
successfully moved. If the retval reports a "short" remap, the
remap_anon_pages syscall should be repeated by userland with
src+retval, dst+retval, len-retval if it wants to know about the error
that interrupted it.
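
A short remap can therefore be retried on the remaining range to
discover the error that stopped it, for example with a wrapper like
this sketch (the syscall number is the x86_64 one added by this patch,
as also used in the test program below):

===
 #define _GNU_SOURCE
 #include <unistd.h>
 #include <errno.h>
 #include <sys/syscall.h>

 #define SYS_remap_anon_pages 317	/* x86_64 entry added by this patch */

 /* Remap the whole range, retrying after "short" remaps. */
 static long remap_anon_pages_full(unsigned long dst, unsigned long src,
 				  unsigned long len, unsigned long flags)
 {
 	while (len) {
 		long ret = syscall(SYS_remap_anon_pages, dst, src, len,
 				   flags);
 		if (ret < 0)
 			return -errno;
 		/* short remap: advance past the bytes already moved */
 		dst += ret;
 		src += ret;
 		len -= ret;
 	}
 	return 0;
 }
===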

The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
errors from materializing if there are holes in the source virtual
range that is being remapped. The holes will be accounted as
successfully remapped in the retval of the syscall. This is mostly
useful to remap hugepage naturally aligned virtual regions without
knowing if there are transparent hugepages in the regions or not,
while avoiding the risk of having to split the hugepmd during the
remap.
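
For instance, a handler resolving faults at THP granularity can move
a whole 2M naturally aligned chunk while tolerating 4k holes in the
staging area, as in this sketch (reusing the placeholders and syscall
number from the sketches above; the flag value is the one defined by
this patch):

===
 #define RAP_ALLOW_SRC_HOLES	(1UL<<0)	/* from this patch */

 	/* inside the fault handler, offset aligned to 2M */
 	if (syscall(SYS_remap_anon_pages, dst + offset, src + offset,
 		    2*1024*1024, RAP_ALLOW_SRC_HOLES) != 2*1024*1024)
 		perror("remap_anon_pages"), exit(1);
===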

The main difference from mremap is that, when used to fill holes in
unmapped anonymous memory vmas (in combination with MADV_USERFAULT),
remap_anon_pages won't create lots of unmergeable vmas. mremap instead
would create lots of vmas (because of the non linear vma->vm_pgoff),
leading to -ENOMEM failures (the number of vmas per process is
limited).

MADV_USERFAULT and remap_anon_pages() can be tested with a program
like below:

===
 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <pthread.h>
 #include <strings.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <errno.h>
 #include <string.h>
 #include <signal.h>
 #include <sys/syscall.h>
 #include <sys/types.h>

 #define USE_USERFAULT
 #define THP

 #define MADV_USERFAULT	18

 #define SIZE (1024*1024*1024)

 #define SYS_remap_anon_pages 317

 static volatile unsigned char *c, *tmp;

 void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
 {
 	unsigned char *addr = info->si_addr;
 	int len = 4096;
 	int ret;

 	addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
 #ifdef THP
 	addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
 	len = 2*1024*1024;
 #endif
 	if (addr >= c && addr < c + SIZE) {
 		unsigned long offset = addr - c;
 		ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
 		if (ret != len)
 			perror("sigbus remap_anon_pages"), exit(1);
 		//printf("sigbus offset %lu\n", offset);
 		return;
 	}

 	printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
 }

 int main()
 {
 	struct sigaction sa;
 	int ret;
 	unsigned long i;
 #ifndef THP
 	/*
 	 * Fails with THP due to lack of alignment because of memset
 	 * pre-filling the destination
 	 */
 	c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
 		 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (c == MAP_FAILED)
 		perror("mmap"), exit(1);
 	tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
 		   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (tmp == MAP_FAILED)
 		perror("mmap"), exit(1);
 #else
 	ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
 	if (ret)
 		perror("posix_memalign"), exit(1);
 	ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
 	if (ret)
 		perror("posix_memalign"), exit(1);
 #endif
 	/*
 	 * MADV_USERFAULT must run before memset, to avoid THP 2m
 	 * faults mapping memory into "tmp", if "tmp" isn't allocated
 	 * with hugepage alignment.
 	 */
 	if (madvise((void *)c, SIZE, MADV_USERFAULT))
 		perror("madvise"), exit(1);
 	memset((void *)tmp, 0xaa, SIZE);

 	sa.sa_sigaction = userfault_sighandler;
 	sigemptyset(&sa.sa_mask);
 	sa.sa_flags = SA_SIGINFO;
 	sigaction(SIGBUS, &sa, NULL);

 #ifndef USE_USERFAULT
 	ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
 	if (ret != SIZE)
 		perror("remap_anon_pages"), exit(1);
 #endif

 	for (i = 0; i < SIZE; i += 4096) {
 		if ((i/4096) % 2) {
 			/* exercise read and write MADV_USERFAULT */
 			c[i+1] = 0xbb;
 		}
 		if (c[i] != 0xaa)
 			printf("error %x offset %lu\n", c[i], i), exit(1);
 	}
 	printf("remap_anon_pages functions correctly\n");

 	return 0;
 }
===

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/huge_mm.h          |   7 +
 include/linux/syscalls.h         |   4 +
 kernel/sys_ni.c                  |   1 +
 mm/fremap.c                      | 477 +++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                 | 110 +++++++++
 7 files changed, 601 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..08bc856 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
+354	i386	remap_anon_pages	sys_remap_anon_pages
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..37bd179 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	remap_anon_pages	sys_remap_anon_pages
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3a2c57e..9a37dd5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
+extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+				     pmd_t *dst_pmd, pmd_t *src_pmd,
+				     pmd_t dst_pmdval,
+				     struct vm_area_struct *dst_vma,
+				     struct vm_area_struct *src_vma,
+				     unsigned long dst_addr,
+				     unsigned long src_addr);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0..19edb00 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -447,6 +447,10 @@ asmlinkage long sys_mremap(unsigned long addr,
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
+asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
+				     unsigned long src_start,
+				     unsigned long len,
+				     unsigned long flags);
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
 asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
 asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b5..6fc1aca 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -177,6 +177,7 @@ cond_syscall(sys_mincore);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
+cond_syscall(sys_remap_anon_pages);
 cond_syscall(compat_sys_move_pages);
 cond_syscall(compat_sys_migrate_pages);
 
diff --git a/mm/fremap.c b/mm/fremap.c
index 1e509f7..9337637 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
 	if (ptl1 != ptl2)
 		spin_unlock(ptl2);
 }
+
+#define RAP_ALLOW_SRC_HOLES (1UL<<0)
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from the source pte to the destination pte if possible, and return
+ * zero if it succeeded in moving the page (negative error otherwise).
+ */
+static int remap_anon_pages_pte(struct mm_struct *mm,
+				pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr,
+				spinlock_t *dst_ptl,
+				spinlock_t *src_ptl,
+				unsigned long flags)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EEXIST;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte)) {
+		if (!(flags & RAP_ALLOW_SRC_HOLES))
+			return -ENOENT;
+		else
+			/* nothing to do to remap a hole */
+			return 0;
+	}
+
+	if (pte_present(orig_src_pte)) {
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EBUSY;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+						  dst_anon_vma);
+		ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+								 dst_addr);
+
+		if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+			      orig_src_pte))
+			BUG();
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+		mmu_notifier_invalidate_page(mm, src_addr);
+	} else {
+		if (pte_file(orig_src_pte))
+			return -EFAULT;
+
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(mm, src_pmd, src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		if (swp_entry_swapcount(entry) != 1)
+			return -EBUSY;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_entry_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd = NULL;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (pud)
+		/*
+		 * Note that this does not necessarily allocate anything:
+		 * *pmd may already be established, and it may even be
+		 * a trans_huge_pmd.
+		 */
+		pmd = pmd_alloc(mm, pud, address);
+	return pmd;
+}
+
+/**
+ * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
+ * zero copy. It only works on non shared anonymous pages because
+ * those can be relocated without generating non linear anon_vmas in
+ * the rmap code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will have VM_USERFAULT set with
+ * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
+ * set with madvise(MADV_DONTFORK).
+ *
+ * The thread receiving the page during the userland page fault
+ * (MADV_USERFAULT) will receive the faulting page in the source vma
+ * through the network, storage or any other I/O device (MADV_DONTFORK
+ * in the source vma prevents remap_anon_pages from failing with -EBUSY if
+ * the process forks before remap_anon_pages is called), then it will
+ * call remap_anon_pages to map the page in the faulting address in
+ * the destination vma.
+ *
+ * This syscall works purely via pagetables, so it's the most
+ * efficient way to move physical non shared anonymous pages across
+ * different virtual addresses. Unlike mremap()/mmap()/munmap() it
+ * does not create any new vmas. The mapping in the destination
+ * address is atomic.
+ *
+ * It only works if the vma protection bits are identical between the
+ * source and destination vmas.
+ *
+ * It can remap non shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_anon_pages will fail respectively with -ENOENT or
+ * -EEXIST. This provides a very strict behavior to avoid any chance
+ * of memory corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads both try to call remap_anon_pages on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this syscall.
+ *
+ * The syscall will return "len" if successful. The syscall
+ * however can be interrupted by fatal signals or errors. If
+ * interrupted it will return the number of bytes successfully
+ * remapped before the interruption if any, or the negative error if
+ * none. It will never return zero. Either it will return an error or
+ * an amount of bytes successfully moved. If the retval reports a
+ * "short" remap, the remap_anon_pages syscall should be repeated by
+ * userland with src+retval, dst+retval, len-retval if it wants to know
+ * about the error that interrupted it.
+ *
+ * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
+ * errors from materializing if there are holes in the source virtual
+ * range that is being remapped. The holes will be accounted as
+ * successfully remapped in the retval of the syscall. This is mostly
+ * useful to remap hugepage naturally aligned virtual regions without
+ * knowing if there are transparent hugepages in the regions or not,
+ * while avoiding the risk of having to split the hugepmd during the
+ * remap.
+ *
+ * If there's any rmap walk that is taking the anon_vma locks without
+ * first obtaining the page lock (for example split_huge_page and
+ * page_referenced_anon), they will have to verify if the
+ * page->mapping has changed after taking the anon_vma lock. If it
+ * changed they should release the lock and retry obtaining a new
+ * anon_vma, because it means the anon_vma was changed by
+ * remap_anon_pages before the lock could be obtained. This is the
+ * only additional complexity added to the rmap code to provide this
+ * anonymous page remapping functionality.
+ */
+SYSCALL_DEFINE4(remap_anon_pages,
+		unsigned long, dst_start, unsigned long, src_start,
+		unsigned long, len, unsigned long, flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *src_vma, *dst_vma;
+	long err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+	long moved = 0;
+
+	/*
+	 * Sanitize the syscall parameters:
+	 */
+	if (src_start & ~PAGE_MASK)
+		return err;
+	if (dst_start & ~PAGE_MASK)
+		return err;
+	if (len & ~PAGE_MASK)
+		return err;
+	if (flags & ~RAP_ALLOW_SRC_HOLES)
+		return err;
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	if (unlikely(src_start + len <= src_start))
+		return err;
+	if (unlikely(dst_start + len <= dst_start))
+		return err;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma or this page
+	 * would get a NULL anon_vma when moved into the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		spinlock_t *ptl;
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				src_pmd = mm_alloc_pmd(mm, src_addr);
+				if (unlikely(!src_pmd)) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+		dst_pmd = mm_alloc_pmd(mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.  These
+			 * checks don't actually need the PT lock but
+			 * it's good to do it here to optimize this
+			 * block away at build time if
+			 * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(dst_pmdval) ||
+			    src_start + len - src_addr < HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				/* Fall through */
+				split_huge_page_pmd(src_vma, src_addr,
+						    src_pmd);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_anon_pages_huge_pmd(mm,
+								dst_pmd,
+								src_pmd,
+								dst_pmdval,
+								dst_vma,
+								src_vma,
+								dst_addr,
+								src_addr);
+				cond_resched();
+
+				if (!err) {
+					dst_addr += HPAGE_PMD_SIZE;
+					src_addr += HPAGE_PMD_SIZE;
+					moved += HPAGE_PMD_SIZE;
+				}
+
+				if ((!err || err == -EAGAIN) &&
+				    fatal_signal_pending(current))
+					err = -EINTR;
+
+				if (err && err != -EAGAIN)
+					break;
+
+				continue;
+			}
+		}
+
+		if (pmd_none(*src_pmd)) {
+			if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
+							 src_addr))) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * We held the mmap_sem for reading so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If a huge pmd materialized from under us, fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(mm, dst_pmd);
+		src_ptl = pte_lockptr(mm, src_pmd);
+
+		err = remap_anon_pages_pte(mm,
+					   dst_pte, src_pte, src_pmd,
+					   dst_vma, src_vma,
+					   dst_addr, src_addr,
+					   dst_ptl, src_ptl, flags);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			moved += PAGE_SIZE;
+		}
+
+		if ((!err || err == -EAGAIN) &&
+		    fatal_signal_pending(current))
+			err = -EINTR;
+
+		if (err && err != -EAGAIN)
+			break;
+	}
+
+out:
+	up_read(&mm->mmap_sem);
+	BUG_ON(moved < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!moved && !err);
+	return moved ? moved : err;
+}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c37ca..e24cd7c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1541,6 +1541,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 /*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but this function must release the src_pmd PT lock
+ * before returning. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+			      pmd_t *dst_pmd, pmd_t *src_pmd,
+			      pmd_t dst_pmdval,
+			      struct vm_area_struct *dst_vma,
+			      struct vm_area_struct *src_vma,
+			      unsigned long dst_addr,
+			      unsigned long src_addr)
+{
+	pmd_t _dst_pmd, src_pmdval;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	spinlock_t *src_ptl, *dst_ptl;
+	pgtable_t pgtable;
+
+	src_pmdval = *src_pmd;
+	src_ptl = pmd_lockptr(mm, src_pmd);
+
+	BUG_ON(!pmd_trans_huge(src_pmdval));
+	BUG_ON(pmd_trans_splitting(src_pmdval));
+	BUG_ON(!pmd_none(dst_pmdval));
+	BUG_ON(!spin_is_locked(src_ptl));
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	src_page = pmd_page(src_pmdval);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(src_ptl);
+		return -EBUSY;
+	}
+
+	get_page(src_page);
+	spin_unlock(src_ptl);
+
+	mmu_notifier_invalidate_range_start(mm, src_addr,
+					    src_addr + HPAGE_PMD_SIZE);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	dst_ptl = pmd_lockptr(mm, dst_pmd);
+	double_pt_lock(src_ptl, dst_ptl);
+	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+		     !pmd_same(*dst_pmd, dst_pmdval) ||
+		     page_mapcount(src_page) != 1)) {
+		double_pt_unlock(src_ptl, dst_ptl);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+	ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+	if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
+		      src_pmdval))
+		BUG();
+	_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+	_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+	pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
+	double_pt_unlock(src_ptl, dst_ptl);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(mm, src_addr,
+					  src_addr + HPAGE_PMD_SIZE);
+	return 0;
+}
+
+/*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
  * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
  *


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 06/10] mm: sys_remap_anon_pages
@ 2014-07-02 16:50   ` Andrea Arcangeli
  0 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata, Paolo Bonzini, Anthony Liguori, Stefan Hajnoczi,
	Wenchao Xia, Andrew Jones, Juan Quintela, Mel Gorman

This new syscall will move anon pages across vmas, atomically and
without touching the vmas.

It only works on non shared anonymous pages because those can be
relocated without generating non linear anon_vmas in the rmap code.

It is the ideal mechanism to handle userspace page faults. Normally
the destination vma will have VM_USERFAULT set with
madvise(MADV_USERFAULT) while the source vma will normally have
VM_DONTCOPY set with madvise(MADV_DONTFORK).

MADV_DONTFORK set in the source vma avoids remap_anon_pages to fail if
the process forks during the userland page fault.

The thread triggering the sigbus signal handler by touching an
unmapped hole in the MADV_USERFAULT region, should take care to
receive the data belonging in the faulting virtual address in the
source vma. The data can come from the network, storage or any other
I/O device. After the data has been safely received in the private
area in the source vma, it will call remap_anon_pages to map the page
in the faulting address in the destination vma atomically. And finally
it will return from the signal handler.

It is an alternative to mremap.

It only works if the vma protection bits are identical from the source
and destination vma.

It can remap non shared anonymous pages within the same vma too.

If the source virtual memory range has any unmapped holes, or if the
destination virtual memory range is not a whole unmapped hole,
remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
provides a very strict behavior to avoid any chance of memory
corruption going unnoticed if there are userland race conditions. Only
one thread should resolve the userland page fault at any given time
for any given faulting address. This means that if two threads try to
both call remap_anon_pages on the same destination address at the same
time, the second thread will get an explicit error from this syscall.

The syscall retval will return "len" is succesful. The syscall however
can be interrupted by fatal signals or errors. If interrupted it will
return the number of bytes successfully remapped before the
interruption if any, or the negative error if none. It will never
return zero. Either it will return an error or an amount of bytes
successfully moved. If the retval reports a "short" remap, the
remap_anon_pages syscall should be repeated by userland with
src+retval, dst+reval, len-retval if it wants to know about the error
that interrupted it.

The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
errors to materialize if there are holes in the source virtual range
that is being remapped. The holes will be accounted as successfully
remapped in the retval of the syscall. This is mostly useful to remap
hugepage naturally aligned virtual regions without knowing if there
are transparent hugepage in the regions or not, but preventing the
risk of having to split the hugepmd during the remap.

The main difference with mremap is that if used to fill holes in
unmapped anonymous memory vmas (if used in combination with
MADV_USERFAULT) remap_anon_pages won't create lots of unmergeable
vmas. mremap instead would create lots of vmas (because of non linear
vma->vm_pgoff) leading to -ENOMEM failures (the number of vmas is
limited).

MADV_USERFAULT and remap_anon_pages() can be tested with a program
like below:

===
 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <pthread.h>
 #include <strings.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <errno.h>
 #include <string.h>
 #include <signal.h>
 #include <sys/syscall.h>
 #include <sys/types.h>

 #define USE_USERFAULT
 #define THP

 #define MADV_USERFAULT	18

 #define SIZE (1024*1024*1024)

 #define SYS_remap_anon_pages 317

 static volatile unsigned char *c, *tmp;

 void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
 {
 	unsigned char *addr = info->si_addr;
 	int len = 4096;
 	int ret;

 	addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
 #ifdef THP
 	addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
 	len = 2*1024*1024;
 #endif
 	if (addr >= c && addr < c + SIZE) {
 		unsigned long offset = addr - c;
 		ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
 		if (ret != len)
 			perror("sigbus remap_anon_pages"), exit(1);
 		//printf("sigbus offset %lu\n", offset);
 		return;
 	}

 	printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
 }

 int main()
 {
 	struct sigaction sa;
 	int ret;
 	unsigned long i;
 #ifndef THP
 	/*
 	 * Fails with THP due lack of alignment because of memset
 	 * pre-filling the destination
 	 */
 	c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
 		 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (c == MAP_FAILED)
 		perror("mmap"), exit(1);
 	tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
 		   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (tmp == MAP_FAILED)
 		perror("mmap"), exit(1);
 #else
 	ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
 	if (ret)
 		perror("posix_memalign"), exit(1);
 	ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
 	if (ret)
 		perror("posix_memalign"), exit(1);
 #endif
 	/*
 	 * MADV_USERFAULT must run before memset, to avoid THP 2m
 	 * faults to map memory into "tmp", if "tmp" isn't allocated
 	 * with hugepage alignment.
 	 */
 	if (madvise((void *)c, SIZE, MADV_USERFAULT))
 		perror("madvise"), exit(1);
 	memset((void *)tmp, 0xaa, SIZE);

 	sa.sa_sigaction = userfault_sighandler;
 	sigemptyset(&sa.sa_mask);
 	sa.sa_flags = SA_SIGINFO;
 	sigaction(SIGBUS, &sa, NULL);

 #ifndef USE_USERFAULT
 	ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
 	if (ret != SIZE)
 		perror("remap_anon_pages"), exit(1);
 #endif

 	for (i = 0; i < SIZE; i += 4096) {
 		if ((i/4096) % 2) {
 			/* exercise read and write MADV_USERFAULT */
 			c[i+1] = 0xbb;
 		}
 		if (c[i] != 0xaa)
 			printf("error %x offset %lu\n", c[i], i), exit(1);
 	}
 	printf("remap_anon_pages functions correctly\n");

 	return 0;
 }
===

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/huge_mm.h          |   7 +
 include/linux/syscalls.h         |   4 +
 kernel/sys_ni.c                  |   1 +
 mm/fremap.c                      | 477 +++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                 | 110 +++++++++
 7 files changed, 601 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..08bc856 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
+354	i386	remap_anon_pages	sys_remap_anon_pages
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..37bd179 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	remap_anon_pages	sys_remap_anon_pages
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3a2c57e..9a37dd5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
+extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+				     pmd_t *dst_pmd, pmd_t *src_pmd,
+				     pmd_t dst_pmdval,
+				     struct vm_area_struct *dst_vma,
+				     struct vm_area_struct *src_vma,
+				     unsigned long dst_addr,
+				     unsigned long src_addr);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0..19edb00 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -447,6 +447,10 @@ asmlinkage long sys_mremap(unsigned long addr,
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
+asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
+				     unsigned long src_start,
+				     unsigned long len,
+				     unsigned long flags);
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
 asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
 asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b5..6fc1aca 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -177,6 +177,7 @@ cond_syscall(sys_mincore);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
+cond_syscall(sys_remap_anon_pages);
 cond_syscall(compat_sys_move_pages);
 cond_syscall(compat_sys_migrate_pages);
 
diff --git a/mm/fremap.c b/mm/fremap.c
index 1e509f7..9337637 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
 	if (ptl1 != ptl2)
 		spin_unlock(ptl2);
 }
+
+#define RAP_ALLOW_SRC_HOLES (1UL<<0)
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from src_pmd to dst_pmd if possible, and return true if succeeded
+ * in moving the page.
+ */
+static int remap_anon_pages_pte(struct mm_struct *mm,
+				pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr,
+				spinlock_t *dst_ptl,
+				spinlock_t *src_ptl,
+				unsigned long flags)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EEXIST;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte)) {
+		if (!(flags & RAP_ALLOW_SRC_HOLES))
+			return -ENOENT;
+		else
+			/* nothing to do to remap an hole */
+			return 0;
+	}
+
+	if (pte_present(orig_src_pte)) {
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EBUSY;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+						  dst_anon_vma);
+		ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+								 dst_addr);
+
+		if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+			      orig_src_pte))
+			BUG();
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+		mmu_notifier_invalidate_page(mm, src_addr);
+	} else {
+		if (pte_file(orig_src_pte))
+			return -EFAULT;
+
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(mm, src_pmd, src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		if (swp_entry_swapcount(entry) != 1)
+			return -EBUSY;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_entry_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd = NULL;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (pud)
+		/*
+		 * Note that we didn't run this because the pmd was
+		 * missing, the *pmd may be already established and in
+		 * turn it may also be a trans_huge_pmd.
+		 */
+		pmd = pmd_alloc(mm, pud, address);
+	return pmd;
+}
+
+/**
+ * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
+ * zero copy. It only works on non shared anonymous pages because
+ * those can be relocated without generating non linear anon_vmas in
+ * the rmap code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will have VM_USERFAULT set with
+ * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
+ * set with madvise(MADV_DONTFORK).
+ *
+ * The thread receiving the page during the userland page fault
+ * (MADV_USERFAULT) will receive the faulting page in the source vma
+ * through the network, storage or any other I/O device (MADV_DONTFORK
+ * in the source vma avoids remap_anon_pages to fail with -EBUSY if
+ * the process forks before remap_anon_pages is called), then it will
+ * call remap_anon_pages to map the page in the faulting address in
+ * the destination vma.
+ *
+ * This syscall works purely via pagetables, so it's the most
+ * efficient way to move physical non shared anonymous pages across
+ * different virtual addresses. Unlike mremap()/mmap()/munmap() it
+ * does not create any new vmas. The mapping in the destination
+ * address is atomic.
+ *
+ * It only works if the vma protection bits are identical from the
+ * source and destination vma.
+ *
+ * It can remap non shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_anon_pages will fail respectively with -ENOENT or
+ * -EEXIST. This provides a very strict behavior to avoid any chance
+ * of memory corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads try to both call remap_anon_pages on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this syscall.
+ *
+ * The syscall retval will return "len" is succesful. The syscall
+ * however can be interrupted by fatal signals or errors. If
+ * interrupted it will return the number of bytes successfully
+ * remapped before the interruption if any, or the negative error if
+ * none. It will never return zero. Either it will return an error or
+ * an amount of bytes successfully moved. If the retval reports a
+ * "short" remap, the remap_anon_pages syscall should be repeated by
+ * userland with src+retval, dst+reval, len-retval if it wants to know
+ * about the error that interrupted it.
+ *
+ * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
+ * errors to materialize if there are holes in the source virtual
+ * range that is being remapped. The holes will be accounted as
+ * successfully remapped in the retval of the syscall. This is mostly
+ * useful to remap hugepage naturally aligned virtual regions without
+ * knowing if there are transparent hugepage in the regions or not,
+ * but preventing the risk of having to split the hugepmd during the
+ * remap.
+ *
+ * If there's any rmap walk that is taking the anon_vma locks without
+ * first obtaining the page lock (for example split_huge_page and
+ * page_referenced_anon), they will have to verify if the
+ * page->mapping has changed after taking the anon_vma lock. If it
+ * changed they should release the lock and retry obtaining a new
+ * anon_vma, because it means the anon_vma was changed by
+ * remap_anon_pages before the lock could be obtained. This is the
+ * only additional complexity added to the rmap code to provide this
+ * anonymous page remapping functionality.
+ */
+SYSCALL_DEFINE4(remap_anon_pages,
+		unsigned long, dst_start, unsigned long, src_start,
+		unsigned long, len, unsigned long, flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *src_vma, *dst_vma;
+	long err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+	long moved = 0;
+
+	/*
+	 * Sanitize the syscall parameters:
+	 */
+	if (src_start & ~PAGE_MASK)
+		return err;
+	if (dst_start & ~PAGE_MASK)
+		return err;
+	if (len & ~PAGE_MASK)
+		return err;
+	if (flags & ~RAP_ALLOW_SRC_HOLES)
+		return err;
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	if (unlikely(src_start + len <= src_start))
+		return err;
+	if (unlikely(dst_start + len <= dst_start))
+		return err;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has a anon_vma or this page
+	 * would get a NULL anon_vma when moved in the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		spinlock_t *ptl;
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				src_pmd = mm_alloc_pmd(mm, src_addr);
+				if (unlikely(!src_pmd)) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+		dst_pmd = mm_alloc_pmd(mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.  These
+			 * checks don't actually need the PT lock but
+			 * it's good to do it here to optimize this
+			 * block away at build time if
+			 * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(dst_pmdval) ||
+			    src_start + len - src_addr < HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				/* Fall through */
+				split_huge_page_pmd(src_vma, src_addr,
+						    src_pmd);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_anon_pages_huge_pmd(mm,
+								dst_pmd,
+								src_pmd,
+								dst_pmdval,
+								dst_vma,
+								src_vma,
+								dst_addr,
+								src_addr);
+				cond_resched();
+
+				if (!err) {
+					dst_addr += HPAGE_PMD_SIZE;
+					src_addr += HPAGE_PMD_SIZE;
+					moved += HPAGE_PMD_SIZE;
+				}
+
+				if ((!err || err == -EAGAIN) &&
+				    fatal_signal_pending(current))
+					err = -EINTR;
+
+				if (err && err != -EAGAIN)
+					break;
+
+				continue;
+			}
+		}
+
+		if (pmd_none(*src_pmd)) {
+			if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
+							 src_addr))) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * We held the mmap_sem for reading so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If an huge pmd materialized from under us fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(mm, dst_pmd);
+		src_ptl = pte_lockptr(mm, src_pmd);
+
+		err = remap_anon_pages_pte(mm,
+					   dst_pte, src_pte, src_pmd,
+					   dst_vma, src_vma,
+					   dst_addr, src_addr,
+					   dst_ptl, src_ptl, flags);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			moved += PAGE_SIZE;
+		}
+
+		if ((!err || err == -EAGAIN) &&
+		    fatal_signal_pending(current))
+			err = -EINTR;
+
+		if (err && err != -EAGAIN)
+			break;
+	}
+
+out:
+	up_read(&mm->mmap_sem);
+	BUG_ON(moved < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!moved && !err);
+	return moved ? moved : err;
+}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c37ca..e24cd7c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1541,6 +1541,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 /*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but it must return after releasing the
+ * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+			      pmd_t *dst_pmd, pmd_t *src_pmd,
+			      pmd_t dst_pmdval,
+			      struct vm_area_struct *dst_vma,
+			      struct vm_area_struct *src_vma,
+			      unsigned long dst_addr,
+			      unsigned long src_addr)
+{
+	pmd_t _dst_pmd, src_pmdval;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	spinlock_t *src_ptl, *dst_ptl;
+	pgtable_t pgtable;
+
+	src_pmdval = *src_pmd;
+	src_ptl = pmd_lockptr(mm, src_pmd);
+
+	BUG_ON(!pmd_trans_huge(src_pmdval));
+	BUG_ON(pmd_trans_splitting(src_pmdval));
+	BUG_ON(!pmd_none(dst_pmdval));
+	BUG_ON(!spin_is_locked(src_ptl));
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	src_page = pmd_page(src_pmdval);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(src_ptl);
+		return -EBUSY;
+	}
+
+	get_page(src_page);
+	spin_unlock(src_ptl);
+
+	mmu_notifier_invalidate_range_start(mm, src_addr,
+					    src_addr + HPAGE_PMD_SIZE);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	dst_ptl = pmd_lockptr(mm, dst_pmd);
+	double_pt_lock(src_ptl, dst_ptl);
+	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+		     !pmd_same(*dst_pmd, dst_pmdval) ||
+		     page_mapcount(src_page) != 1)) {
+		double_pt_unlock(src_ptl, dst_ptl);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+	ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+	if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
+		      src_pmdval))
+		BUG();
+	_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+	_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+	pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
+	double_pt_unlock(src_ptl, dst_ptl);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(mm, src_addr,
+					  src_addr + HPAGE_PMD_SIZE);
+	return 0;
+}
+
+/*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
  * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
  *


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 06/10] mm: sys_remap_anon_pages
@ 2014-07-02 16:50   ` Andrea Arcangeli
  0 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: Robert Love, Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi,
	Andrew Jones, KOSAKI Motohiro, Michel Lespinasse,
	Andrea Arcangeli, Taras Glek, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Android Kernel Team, Mel Gorman,
	\"Dr. David Alan Gilbert\", Huangpeng (Peter),
	Anthony Liguori, Mike Hommey, Keith Packard, Wenchao Xia,
	Minchan Kim, Dmitry Adamushko, Johannes Weiner, Paolo Bonzini,
	Andrew Morton

This new syscall will move anon pages across vmas, atomically and
without touching the vmas.

It only works on non shared anonymous pages because those can be
relocated without generating non linear anon_vmas in the rmap code.

It is the ideal mechanism to handle userspace page faults. Normally
the destination vma will have VM_USERFAULT set with
madvise(MADV_USERFAULT) while the source vma will normally have
VM_DONTCOPY set with madvise(MADV_DONTFORK).

MADV_DONTFORK set on the source vma prevents remap_anon_pages from
failing if the process forks during the userland page fault.

The thread that triggers the SIGBUS signal handler by touching an
unmapped hole in the MADV_USERFAULT region should take care to receive
the data belonging to the faulting virtual address into the source
vma. The data can come from the network, storage or any other I/O
device. After the data has been safely received in the private area of
the source vma, it calls remap_anon_pages to map the page at the
faulting address in the destination vma atomically, and finally it
returns from the signal handler.

It is an alternative to mremap.

It only works if the vma protection bits are identical between the
source and destination vmas.

It can remap non shared anonymous pages within the same vma too.

If the source virtual memory range has any unmapped holes, or if the
destination virtual memory range is not a whole unmapped hole,
remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
provides a very strict behavior to avoid any chance of memory
corruption going unnoticed if there are userland race conditions. Only
one thread should resolve the userland page fault at any given time
for any given faulting address. This means that if two threads try to
both call remap_anon_pages on the same destination address at the same
time, the second thread will get an explicit error from this syscall.

The syscall returns "len" if successful. The syscall however can be
interrupted by fatal signals or errors. If interrupted it returns the
number of bytes successfully remapped before the interruption if any,
or the negative error if none. It never returns zero: either it
returns an error or an amount of bytes successfully moved. If the
retval reports a "short" remap, userland should repeat the
remap_anon_pages syscall with src+retval, dst+retval, len-retval if it
wants to know about the error that interrupted it.
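
A minimal retry loop for such a "short" remap could look like the
sketch below (not part of the patch; "dst", "src" and "len" are
hypothetical variables and 317 is the x86_64 syscall number added
below):

	unsigned long done = 0;
	while (done < len) {
		long ret = syscall(317, dst + done, src + done,
				   len - done, 0);
		if (ret < 0)	/* glibc syscall() returns -1, sets errno */
			perror("remap_anon_pages"), exit(1);
		done += ret;	/* short remap: retry the remaining bytes */
	}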

The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
errors from materializing if there are holes in the source virtual
range that is being remapped. The holes are accounted as successfully
remapped in the retval of the syscall. This is mostly useful to remap
hugepage naturally aligned virtual regions without knowing if there
are transparent hugepages in the regions or not, while avoiding the
risk of having to split the hugepmd during the remap.
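
For example (sketch only, reusing the flag value and the x86_64
syscall number defined below), a 2MB naturally aligned chunk can be
moved in one shot regardless of whether it is fully populated:

	#define RAP_ALLOW_SRC_HOLES (1UL<<0)

	ret = syscall(317, dst, src, 2*1024*1024, RAP_ALLOW_SRC_HOLES);
	/* holes in src count as remapped, so ret is 2MB on success */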

The main difference from mremap is that, when used to fill holes in
unmapped anonymous memory vmas (in combination with MADV_USERFAULT),
remap_anon_pages won't create lots of unmergeable vmas. mremap instead
would create lots of vmas (because of the non linear vma->vm_pgoff)
leading to -ENOMEM failures (the number of vmas is limited).

MADV_USERFAULT and remap_anon_pages() can be tested with a program
like below:

===
 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <pthread.h>
 #include <strings.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <errno.h>
 #include <string.h>
 #include <signal.h>
 #include <sys/syscall.h>
 #include <sys/types.h>

 #define USE_USERFAULT
 #define THP

 #define MADV_USERFAULT	18

 #define SIZE (1024*1024*1024)

 #define SYS_remap_anon_pages 317

 static volatile unsigned char *c, *tmp;

 void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
 {
 	unsigned char *addr = info->si_addr;
 	int len = 4096;
 	int ret;

 	addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
 #ifdef THP
 	addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
 	len = 2*1024*1024;
 #endif
 	if (addr >= c && addr < c + SIZE) {
 		unsigned long offset = addr - c;
 		ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
 		if (ret != len)
 			perror("sigbus remap_anon_pages"), exit(1);
 		//printf("sigbus offset %lu\n", offset);
 		return;
 	}

 	printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
 }

 int main()
 {
 	struct sigaction sa;
 	int ret;
 	unsigned long i;
 #ifndef THP
 	/*
 	 * Fails with THP due to lack of alignment because of memset
 	 * pre-filling the destination
 	 */
 	c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
 		 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (c == MAP_FAILED)
 		perror("mmap"), exit(1);
 	tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
 		   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (tmp == MAP_FAILED)
 		perror("mmap"), exit(1);
 #else
 	ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
 	if (ret)
 		perror("posix_memalign"), exit(1);
 	ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
 	if (ret)
 		perror("posix_memalign"), exit(1);
 #endif
 	/*
 	 * MADV_USERFAULT must run before memset, to avoid THP 2m
 	 * faults to map memory into "tmp", if "tmp" isn't allocated
 	 * with hugepage alignment.
 	 */
 	if (madvise((void *)c, SIZE, MADV_USERFAULT))
 		perror("madvise"), exit(1);
 	memset((void *)tmp, 0xaa, SIZE);

 	sa.sa_sigaction = userfault_sighandler;
 	sigemptyset(&sa.sa_mask);
 	sa.sa_flags = SA_SIGINFO;
 	sigaction(SIGBUS, &sa, NULL);

 #ifndef USE_USERFAULT
 	ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
 	if (ret != SIZE)
 		perror("remap_anon_pages"), exit(1);
 #endif

 	for (i = 0; i < SIZE; i += 4096) {
 		if ((i/4096) % 2) {
 			/* exercise read and write MADV_USERFAULT */
 			c[i+1] = 0xbb;
 		}
 		if (c[i] != 0xaa)
 			printf("error %x offset %lu\n", c[i], i), exit(1);
 	}
 	printf("remap_anon_pages functions correctly\n");

 	return 0;
 }
===

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/huge_mm.h          |   7 +
 include/linux/syscalls.h         |   4 +
 kernel/sys_ni.c                  |   1 +
 mm/fremap.c                      | 477 +++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                 | 110 +++++++++
 7 files changed, 601 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..08bc856 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
+354	i386	remap_anon_pages	sys_remap_anon_pages
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..37bd179 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	remap_anon_pages	sys_remap_anon_pages
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3a2c57e..9a37dd5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
+extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+				     pmd_t *dst_pmd, pmd_t *src_pmd,
+				     pmd_t dst_pmdval,
+				     struct vm_area_struct *dst_vma,
+				     struct vm_area_struct *src_vma,
+				     unsigned long dst_addr,
+				     unsigned long src_addr);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0..19edb00 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -447,6 +447,10 @@ asmlinkage long sys_mremap(unsigned long addr,
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
+asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
+				     unsigned long src_start,
+				     unsigned long len,
+				     unsigned long flags);
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
 asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
 asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b5..6fc1aca 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -177,6 +177,7 @@ cond_syscall(sys_mincore);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
+cond_syscall(sys_remap_anon_pages);
 cond_syscall(compat_sys_move_pages);
 cond_syscall(compat_sys_migrate_pages);
 
diff --git a/mm/fremap.c b/mm/fremap.c
index 1e509f7..9337637 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
 	if (ptl1 != ptl2)
 		spin_unlock(ptl2);
 }
+
+#define RAP_ALLOW_SRC_HOLES (1UL<<0)
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from src_pmd to dst_pmd if possible, and return true if succeeded
+ * in moving the page.
+ */
+static int remap_anon_pages_pte(struct mm_struct *mm,
+				pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr,
+				spinlock_t *dst_ptl,
+				spinlock_t *src_ptl,
+				unsigned long flags)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EEXIST;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte)) {
+		if (!(flags & RAP_ALLOW_SRC_HOLES))
+			return -ENOENT;
+		else
+			/* nothing to do to remap a hole */
+			return 0;
+	}
+
+	if (pte_present(orig_src_pte)) {
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EBUSY;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+						  dst_anon_vma);
+		ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+								 dst_addr);
+
+		if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+			      orig_src_pte))
+			BUG();
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+		mmu_notifier_invalidate_page(mm, src_addr);
+	} else {
+		if (pte_file(orig_src_pte))
+			return -EFAULT;
+
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(mm, src_pmd, src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		if (swp_entry_swapcount(entry) != 1)
+			return -EBUSY;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_entry_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd = NULL;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (pud)
+		/*
+		 * Note that we didn't run this because the pmd was
+		 * missing, the *pmd may be already established and in
+		 * turn it may also be a trans_huge_pmd.
+		 */
+		pmd = pmd_alloc(mm, pud, address);
+	return pmd;
+}
+
+/**
+ * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
+ * zero copy. It only works on non shared anonymous pages because
+ * those can be relocated without generating non linear anon_vmas in
+ * the rmap code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will have VM_USERFAULT set with
+ * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
+ * set with madvise(MADV_DONTFORK).
+ *
+ * The thread receiving the page during the userland page fault
+ * (MADV_USERFAULT) will receive the faulting page in the source vma
+ * through the network, storage or any other I/O device (MADV_DONTFORK
+ * in the source vma prevents remap_anon_pages from failing with -EBUSY
+ * if the process forks before remap_anon_pages is called), then it will
+ * call remap_anon_pages to map the page in the faulting address in
+ * the destination vma.
+ *
+ * This syscall works purely via pagetables, so it's the most
+ * efficient way to move physical non shared anonymous pages across
+ * different virtual addresses. Unlike mremap()/mmap()/munmap() it
+ * does not create any new vmas. The mapping in the destination
+ * address is atomic.
+ *
+ * It only works if the vma protection bits are identical from the
+ * source and destination vma.
+ *
+ * It can remap non shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_anon_pages will fail respectively with -ENOENT or
+ * -EEXIST. This provides a very strict behavior to avoid any chance
+ * of memory corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads try to both call remap_anon_pages on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this syscall.
+ *
+ * The syscall returns "len" if successful. The syscall
+ * however can be interrupted by fatal signals or errors. If
+ * interrupted it will return the number of bytes successfully
+ * remapped before the interruption if any, or the negative error if
+ * none. It will never return zero. Either it will return an error or
+ * an amount of bytes successfully moved. If the retval reports a
+ * "short" remap, the remap_anon_pages syscall should be repeated by
+ * userland with src+retval, dst+retval, len-retval if it wants to know
+ * about the error that interrupted it.
+ *
+ * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
+ * errors from materializing if there are holes in the source virtual
+ * range that is being remapped. The holes will be accounted as
+ * successfully remapped in the retval of the syscall. This is mostly
+ * useful to remap hugepage naturally aligned virtual regions without
+ * knowing if there are transparent hugepage in the regions or not,
+ * but preventing the risk of having to split the hugepmd during the
+ * remap.
+ *
+ * If there's any rmap walk that is taking the anon_vma locks without
+ * first obtaining the page lock (for example split_huge_page and
+ * page_referenced_anon), they will have to verify if the
+ * page->mapping has changed after taking the anon_vma lock. If it
+ * changed they should release the lock and retry obtaining a new
+ * anon_vma, because it means the anon_vma was changed by
+ * remap_anon_pages before the lock could be obtained. This is the
+ * only additional complexity added to the rmap code to provide this
+ * anonymous page remapping functionality.
+ */
+SYSCALL_DEFINE4(remap_anon_pages,
+		unsigned long, dst_start, unsigned long, src_start,
+		unsigned long, len, unsigned long, flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *src_vma, *dst_vma;
+	long err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+	long moved = 0;
+
+	/*
+	 * Sanitize the syscall parameters:
+	 */
+	if (src_start & ~PAGE_MASK)
+		return err;
+	if (dst_start & ~PAGE_MASK)
+		return err;
+	if (len & ~PAGE_MASK)
+		return err;
+	if (flags & ~RAP_ALLOW_SRC_HOLES)
+		return err;
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	if (unlikely(src_start + len <= src_start))
+		return err;
+	if (unlikely(dst_start + len <= dst_start))
+		return err;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma or this page
+	 * would get a NULL anon_vma when moved in the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		spinlock_t *ptl;
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				src_pmd = mm_alloc_pmd(mm, src_addr);
+				if (unlikely(!src_pmd)) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+		dst_pmd = mm_alloc_pmd(mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.  These
+			 * checks don't actually need the PT lock but
+			 * it's good to do it here to optimize this
+			 * block away at build time if
+			 * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(dst_pmdval) ||
+			    src_start + len - src_addr < HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				/* Fall through */
+				split_huge_page_pmd(src_vma, src_addr,
+						    src_pmd);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_anon_pages_huge_pmd(mm,
+								dst_pmd,
+								src_pmd,
+								dst_pmdval,
+								dst_vma,
+								src_vma,
+								dst_addr,
+								src_addr);
+				cond_resched();
+
+				if (!err) {
+					dst_addr += HPAGE_PMD_SIZE;
+					src_addr += HPAGE_PMD_SIZE;
+					moved += HPAGE_PMD_SIZE;
+				}
+
+				if ((!err || err == -EAGAIN) &&
+				    fatal_signal_pending(current))
+					err = -EINTR;
+
+				if (err && err != -EAGAIN)
+					break;
+
+				continue;
+			}
+		}
+
+		if (pmd_none(*src_pmd)) {
+			if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
+							 src_addr))) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * We held the mmap_sem for reading so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If a huge pmd materialized under us, fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(mm, dst_pmd);
+		src_ptl = pte_lockptr(mm, src_pmd);
+
+		err = remap_anon_pages_pte(mm,
+					   dst_pte, src_pte, src_pmd,
+					   dst_vma, src_vma,
+					   dst_addr, src_addr,
+					   dst_ptl, src_ptl, flags);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			moved += PAGE_SIZE;
+		}
+
+		if ((!err || err == -EAGAIN) &&
+		    fatal_signal_pending(current))
+			err = -EINTR;
+
+		if (err && err != -EAGAIN)
+			break;
+	}
+
+out:
+	up_read(&mm->mmap_sem);
+	BUG_ON(moved < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!moved && !err);
+	return moved ? moved : err;
+}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c37ca..e24cd7c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1541,6 +1541,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 /*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but it must return after releasing the
+ * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+			      pmd_t *dst_pmd, pmd_t *src_pmd,
+			      pmd_t dst_pmdval,
+			      struct vm_area_struct *dst_vma,
+			      struct vm_area_struct *src_vma,
+			      unsigned long dst_addr,
+			      unsigned long src_addr)
+{
+	pmd_t _dst_pmd, src_pmdval;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	spinlock_t *src_ptl, *dst_ptl;
+	pgtable_t pgtable;
+
+	src_pmdval = *src_pmd;
+	src_ptl = pmd_lockptr(mm, src_pmd);
+
+	BUG_ON(!pmd_trans_huge(src_pmdval));
+	BUG_ON(pmd_trans_splitting(src_pmdval));
+	BUG_ON(!pmd_none(dst_pmdval));
+	BUG_ON(!spin_is_locked(src_ptl));
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	src_page = pmd_page(src_pmdval);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(src_ptl);
+		return -EBUSY;
+	}
+
+	get_page(src_page);
+	spin_unlock(src_ptl);
+
+	mmu_notifier_invalidate_range_start(mm, src_addr,
+					    src_addr + HPAGE_PMD_SIZE);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	dst_ptl = pmd_lockptr(mm, dst_pmd);
+	double_pt_lock(src_ptl, dst_ptl);
+	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+		     !pmd_same(*dst_pmd, dst_pmdval) ||
+		     page_mapcount(src_page) != 1)) {
+		double_pt_unlock(src_ptl, dst_ptl);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+	ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+	if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
+		      src_pmdval))
+		BUG();
+	_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+	_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+	pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
+	double_pt_unlock(src_ptl, dst_ptl);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(mm, src_addr,
+					  src_addr + HPAGE_PMD_SIZE);
+	return 0;
+}
+
+/*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
  * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
  *

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 07/10] waitqueue: add nr wake parameter to __wake_up_locked_key
  2014-07-02 16:50 ` Andrea Arcangeli
  (?)
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

Userfaultfd needs to wake all waiters on a waitqueue (pass 0 as the nr
parameter), instead of the current hardcoded 1 (which would wake just
the first waiter in the head list).
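
For example, the userfaultfd code added later in this series relies on
the new nr parameter to wake every waiter whose address falls in a
given range:

	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);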

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ++++---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..b28be5a 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -142,7 +142,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 }
 
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -173,7 +174,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)					\
-	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+	__wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..551007f 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -105,9 +105,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	__wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -282,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, key);
+		__wake_up_locked_key(q, mode, 1, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c0365c1..d4ffd68 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
 	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
 	ret = atomic_dec_and_test(&task->tk_count);
 	if (waitqueue_active(wq))
-		__wake_up_locked_key(wq, TASK_NORMAL, &k);
+		__wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
 	spin_unlock_irqrestore(&wq->lock, flags);
 	return ret;
 }


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 07/10] waitqueue: add nr wake parameter to __wake_up_locked_key
@ 2014-07-02 16:50   ` Andrea Arcangeli
  0 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata, Paolo Bonzini, Anthony Liguori, Stefan Hajnoczi,
	Wenchao Xia, Andrew Jones, Juan Quintela, Mel Gorman

Userfaultfd needs to wake all waiters on a waitqueue (pass 0 as the nr
parameter), instead of the current hardcoded 1 (which would wake just
the first waiter in the head list).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ++++---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..b28be5a 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -142,7 +142,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 }
 
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -173,7 +174,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)					\
-	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+	__wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..551007f 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -105,9 +105,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	__wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -282,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, key);
+		__wake_up_locked_key(q, mode, 1, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c0365c1..d4ffd68 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
 	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
 	ret = atomic_dec_and_test(&task->tk_count);
 	if (waitqueue_active(wq))
-		__wake_up_locked_key(wq, TASK_NORMAL, &k);
+		__wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
 	spin_unlock_irqrestore(&wq->lock, flags);
 	return ret;
 }


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [Qemu-devel] [PATCH 07/10] waitqueue: add nr wake parameter to __wake_up_locked_key
@ 2014-07-02 16:50   ` Andrea Arcangeli
  0 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: Robert Love, Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi,
	Andrew Jones, KOSAKI Motohiro, Michel Lespinasse,
	Andrea Arcangeli, Taras Glek, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Android Kernel Team, Mel Gorman,
	\"Dr. David Alan Gilbert\", Huangpeng (Peter),
	Anthony Liguori, Mike Hommey, Keith Packard, Wenchao Xia,
	Minchan Kim, Dmitry Adamushko, Johannes Weiner, Paolo Bonzini,
	Andrew Morton

Userfaultfd needs to wake all waiters on a waitqueue (pass 0 as the nr
parameter), instead of the current hardcoded 1 (which would wake just
the first waiter in the head list).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ++++---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..b28be5a 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -142,7 +142,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 }
 
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -173,7 +174,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)					\
-	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+	__wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..551007f 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -105,9 +105,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	__wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -282,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, key);
+		__wake_up_locked_key(q, mode, 1, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c0365c1..d4ffd68 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
 	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
 	ret = atomic_dec_and_test(&task->tk_count);
 	if (waitqueue_active(wq))
-		__wake_up_locked_key(wq, TASK_NORMAL, &k);
+		__wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
 	spin_unlock_irqrestore(&wq->lock, flags);
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
  2014-07-02 16:50 ` Andrea Arcangeli
  (?)
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

Once a userfaultfd is created, MADV_USERFAULT regions talk through the
userfaultfd protocol with the thread responsible for doing the memory
externalization of the process.

The protocol starts with userland writing the requested/preferred
USERFAULTFD_PROTOCOL version into the userfault fd (64bit write). If
the kernel knows that version, it acks it by letting userland read a
64bit value from the userfault fd containing the same
USERFAULTFD_PROTOCOL version that userland asked for. Otherwise
userland will read the __u64 value -1ULL (aka
USERFAULTFD_UNKNOWN_PROTOCOL) and it will have to try again by writing
an older protocol version, if also suitable for its usage, and read it
back again until it stops reading -1ULL. After that the userfaultfd
protocol starts.

The protocol then consists of 64bit reads from the userfault fd,
providing userland with the fault addresses. After a userfault address
has been read and the fault has been resolved by userland, the
application must write back 128bits in the form of a [ start, end ]
range (64bit each) that tells the kernel such a range has been
mapped. Multiple read userfaults can be resolved in a single range
write. poll() can be used to know when there are new userfaults to
read (POLLIN) and when there are threads waiting for a wakeup through
a range write (POLLOUT).
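
For reference, a rough userland sketch of the handshake and the fault
service loop described above (not part of the patch; 318 is the x86_64
syscall number added below, 0xaa matches USERFAULTFD_PROTOCOL in
fs/userfaultfd.c, and "page_size" is a hypothetical variable):

	__u64 proto = 0xaa, addr, range[2];
	int ufd = syscall(318, 0);		/* userfaultfd() */

	write(ufd, &proto, sizeof(proto));	/* request the protocol */
	read(ufd, &proto, sizeof(proto));	/* ack, or -1ULL if unknown */
	if (proto == (__u64) -1ULL)
		exit(1);	/* could retry here with an older protocol */

	while (read(ufd, &addr, sizeof(addr)) == sizeof(addr)) {
		/* receive the data and remap_anon_pages() it at addr */
		range[0] = addr & ~(page_size - 1);
		range[1] = range[0] + page_size;
		write(ufd, range, sizeof(range));	/* wake the faulters */
	}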

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/Makefile                      |   1 +
 fs/userfaultfd.c                 | 557 +++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h         |   1 +
 include/linux/userfaultfd.h      |  40 +++
 init/Kconfig                     |  10 +
 kernel/sys_ni.c                  |   1 +
 mm/huge_memory.c                 |  20 +-
 mm/memory.c                      |   5 +-
 10 files changed, 629 insertions(+), 8 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 08bc856..5aa2da4 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -361,3 +361,4 @@
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
 354	i386	remap_anon_pages	sys_remap_anon_pages
+355	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 37bd179..7dca902 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -324,6 +324,7 @@
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
 317	common	remap_anon_pages	sys_remap_anon_pages
+318	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 4030cbf..e00e243 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
+obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..4902fa3
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,557 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include <linux/kref.h>
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd.h>
+
+struct userfaultfd_ctx {
+	/* pseudo fd refcounting */
+	struct kref kref;
+	/* waitqueue head for the userfaultfd page faults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* state machine */
+	unsigned int state;
+	/* released */
+	bool released;
+};
+
+struct userfaultfd_wait_queue {
+	unsigned long address;
+	wait_queue_t wq;
+	bool pending;
+	struct userfaultfd_ctx *ctx;
+};
+
+#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
+#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
+
+enum {
+	USERFAULTFD_STATE_ASK_PROTOCOL,
+	USERFAULTFD_STATE_ACK_PROTOCOL,
+	USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL,
+	USERFAULTFD_STATE_RUNNING,
+};
+
+/**
+ * struct mm_slot - userlandfd information per mm that is being scanned
+ * @link: link to the mm_slots hash list
+ * @mm: the mm that this information is valid for
+ * @ctx: userfaultfd context for this mm
+ */
+struct mm_slot {
+	struct hlist_node link;
+	struct mm_struct *mm;
+	struct userfaultfd_ctx ctx;
+	struct rcu_head rcu_head;
+};
+
+#define MM_USERLANDFD_HASH_BITS 10
+static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
+
+static DEFINE_MUTEX(mm_userlandfd_mutex);
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *slot;
+
+	hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
+				   (unsigned long)mm)
+		if (slot->mm == mm)
+			return slot;
+
+	return NULL;
+}
+
+static void insert_to_mm_userlandfd_hash(struct mm_struct *mm,
+					 struct mm_slot *mm_slot)
+{
+	mm_slot->mm = mm;
+	hash_add_rcu(mm_userlandfd_hash, &mm_slot->link, (unsigned long)mm);
+}
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+				     int wake_flags, void *key)
+{
+	unsigned long *range = key;
+	int ret;
+	struct userfaultfd_wait_queue *uwq;
+
+	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+	ret = 0;
+	/* don't wake the pending ones to avoid reads to block */
+	if (uwq->pending && !uwq->ctx->released)
+		goto out;
+	if (range[0] > uwq->address || range[1] <= uwq->address)
+		goto out;
+	ret = wake_up_state(wq->private, mode);
+	if (ret)
+		/* wake only once, autoremove behavior */
+		list_del_init(&wq->task_list);
+out:
+	return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: In case of success, returns a pointer to the userfaultfd context.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+	kref_get(&ctx->kref);
+}
+
+static void userfaultfd_free(struct kref *kref)
+{
+	struct userfaultfd_ctx *ctx = container_of(kref,
+						   struct userfaultfd_ctx,
+						   kref);
+	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+
+	mutex_lock(&mm_userlandfd_mutex);
+	hash_del_rcu(&mm_slot->link);
+	mutex_unlock(&mm_userlandfd_mutex);
+
+	kfree_rcu(mm_slot, rcu_head);
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+	kref_put(&ctx->kref, userfaultfd_free);
+}
+
+int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mm_slot *slot;
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue uwq;
+
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	rcu_read_lock();
+	slot = get_mm_slot(mm);
+	if (!slot) {
+		rcu_read_unlock();
+		return VM_FAULT_SIGBUS;
+	}
+	ctx = &slot->ctx;
+	userfaultfd_ctx_get(ctx);
+	rcu_read_unlock();
+
+	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+	uwq.wq.private = current;
+	uwq.address = address;
+	uwq.pending = true;
+	uwq.ctx = ctx;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (fatal_signal_pending(current))
+			break;
+		if (!uwq.pending)
+			break;
+		spin_unlock(&ctx->fault_wqh.lock);
+		up_read(&mm->mmap_sem);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		down_read(&mm->mmap_sem);
+		spin_lock(&ctx->fault_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * released by another CPU.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return 0;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	__u64 range[2] = { 0ULL, -1ULL };
+
+	ctx->released = true;
+	smp_wmb();
+	spin_lock(&ctx->fault_wqh.lock);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	wake_up_poll(&ctx->fd_wqh, POLLHUP);
+	userfaultfd_ctx_put(ctx);
+	return 0;
+}
+
+static inline unsigned long find_userfault(struct userfaultfd_ctx *ctx,
+					   struct userfaultfd_wait_queue **uwq,
+					   unsigned int events_filter)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *_uwq;
+	unsigned int events = 0;
+
+	BUG_ON(!events_filter);
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (_uwq->pending) {
+			if (!(events & POLLIN) && (events_filter & POLLIN)) {
+				events |= POLLIN;
+				if (uwq)
+					*uwq = _uwq;
+			}
+		} else if (events_filter & POLLOUT)
+			events |= POLLOUT;
+		if (events == events_filter)
+			break;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return events;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	poll_wait(file, &ctx->fd_wqh, wait);
+
+	switch (ctx->state) {
+	case USERFAULTFD_STATE_ASK_PROTOCOL:
+		return POLLOUT;
+	case USERFAULTFD_STATE_ACK_PROTOCOL:
+		return POLLIN;
+	case USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL:
+		return POLLIN;
+	case USERFAULTFD_STATE_RUNNING:
+		return find_userfault(ctx, NULL, POLLIN|POLLOUT);
+	default:
+		BUG();
+	}
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+				    __u64 *addr)
+{
+	ssize_t res;
+	DECLARE_WAITQUEUE(wait, current);
+	struct userfaultfd_wait_queue *uwq = NULL;
+
+	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+		return -EINVAL;
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL) {
+		*addr = USERFAULTFD_PROTOCOL;
+		ctx->state = USERFAULTFD_STATE_RUNNING;
+		return 0;
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL) {
+		*addr = USERFAULTFD_UNKNOWN_PROTOCOL;
+		ctx->state = USERFAULTFD_STATE_ASK_PROTOCOL;
+		return 0;
+	}
+	BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* always take the fd_wqh lock before the fault_wqh lock */
+		if (find_userfault(ctx, &uwq, POLLIN)) {
+			uwq->pending = false;
+			*addr = uwq->address;
+			res = 0;
+			break;
+		}
+		if (signal_pending(current)) {
+			res = -ERESTARTSYS;
+			break;
+		}
+		if (no_wait) {
+			res = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock_irq(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	if (res == 0) {
+		if (waitqueue_active(&ctx->fd_wqh))
+			wake_up_locked_poll(&ctx->fd_wqh, POLLOUT);
+	}
+	spin_unlock_irq(&ctx->fd_wqh.lock);
+
+	return res;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 addr;
+
+	if (count < sizeof(addr))
+		return -EINVAL;
+	res = userfaultfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &addr);
+	if (res < 0)
+		return res;
+
+	return put_user(addr, (__u64 __user *) buf) ? -EFAULT : sizeof(addr);
+}
+
+static int wake_userfault(struct userfaultfd_ctx *ctx, __u64 *range)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	int ret = -ENOENT;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			continue;
+		if (uwq->address >= range[0] &&
+		    uwq->address < range[1]) {
+			ret = 0;
+			/* wake all in the range and autoremove */
+			__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
+					     range);
+			break;
+		}
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 range[2];
+	DECLARE_WAITQUEUE(wait, current);
+
+	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+		__u64 protocol;
+		if (count < sizeof(__u64))
+			return -EINVAL;
+		if (copy_from_user(&protocol, buf, sizeof(protocol)))
+			return -EFAULT;
+		if (protocol != USERFAULTFD_PROTOCOL) {
+			/* we'll offer the supported protocol in the ack */
+			printk_once(KERN_INFO
+				    "userfaultfd protocol not available\n");
+			ctx->state = USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL;
+		} else
+			ctx->state = USERFAULTFD_STATE_ACK_PROTOCOL;
+		return sizeof(protocol);
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL)
+		return -EINVAL;
+
+	BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+	if (count < sizeof(range))
+		return -EINVAL;
+	if (copy_from_user(&range, buf, sizeof(range)))
+		return -EFAULT;
+	if (range[0] >= range[1])
+		return -ERANGE;
+
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* always take the fd_wqh lock before the fault_wqh lock */
+		if (find_userfault(ctx, NULL, POLLOUT)) {
+			if (!wake_userfault(ctx, range)) {
+				res = sizeof(range);
+				break;
+			}
+		}
+		if (signal_pending(current)) {
+			res = -ERESTARTSYS;
+			break;
+		}
+		if (file->f_flags & O_NONBLOCK) {
+			res = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fd_wqh.lock);
+
+	return res;
+}
+
+#ifdef CONFIG_PROC_FS
+static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct userfaultfd_ctx *ctx = f->private_data;
+	int ret;
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long pending = 0, total = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			pending++;
+		total++;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
+
+	return ret;
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= userfaultfd_show_fdinfo,
+#endif
+	.release	= userfaultfd_release,
+	.poll		= userfaultfd_poll,
+	.read		= userfaultfd_read,
+	.write		= userfaultfd_write,
+	.llseek		= noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates an userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates an userfaultfd file pointer, w/out installing
+ * it into the fd table. This is useful when the userfaultfd file is
+ * used during the initialization of data structures that require
+ * extra setup after the userfaultfd creation. So the userfaultfd
+ * creation is split into the file pointer creation phase, and the
+ * file descriptor installation phase.  In this way races with
+ * userspace closing the newly installed file descriptor can be
+ * avoided.  Returns an userfaultfd file pointer, or a proper error
+ * pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+	struct file *file;
+	struct mm_slot *mm_slot;
+
+	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+	file = ERR_PTR(-EINVAL);
+	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+		goto out;
+
+	mm_slot = kmalloc(sizeof(*mm_slot), GFP_KERNEL);
+	file = ERR_PTR(-ENOMEM);
+	if (!mm_slot)
+		goto out;
+
+	mutex_lock(&mm_userlandfd_mutex);
+	file = ERR_PTR(-EBUSY);
+	if (get_mm_slot(current->mm))
+		goto out_free_unlock;
+
+	kref_init(&mm_slot->ctx.kref);
+	init_waitqueue_head(&mm_slot->ctx.fault_wqh);
+	init_waitqueue_head(&mm_slot->ctx.fd_wqh);
+	mm_slot->ctx.flags = flags;
+	mm_slot->ctx.state = USERFAULTFD_STATE_ASK_PROTOCOL;
+	mm_slot->ctx.released = false;
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops,
+				  &mm_slot->ctx,
+				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file))
+	out_free_unlock:
+		kfree(mm_slot);
+	else
+		insert_to_mm_userlandfd_hash(current->mm,
+					     mm_slot);
+	mutex_unlock(&mm_userlandfd_mutex);
+out:
+	return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	int fd, error;
+	struct file *file;
+
+	error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	file = userfaultfd_file_create(flags);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+
+	return error;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 19edb00..dcbcb7d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -806,6 +806,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
new file mode 100644
index 0000000..8200a71
--- /dev/null
+++ b/include/linux/userfaultfd.h
@@ -0,0 +1,40 @@
+/*
+ *  include/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (UFFD_SHARED_FCNTL_FLAGS)
+
+#ifdef CONFIG_USERFAULTFD
+
+int handle_userfault(struct vm_area_struct *vma, unsigned long address);
+
+#else /* CONFIG_USERFAULTFD */
+
+static int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+#endif
+
+#endif /* _LINUX_USERFAULTFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99..dc8d722 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1475,6 +1475,16 @@ config EVENTFD
 
 	  If unsure, say Y.
 
+config USERFAULTFD
+	bool "Enable userfaultfd() system call"
+	select ANON_INODES
+	default y
+	help
+	  Enable the userfaultfd() system call that allows page faults
+	  to be trapped and handled in userland.
+
+	  If unsure, say Y.
+
 config SHMEM
 	bool "Use full shmem filesystem" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6fc1aca..d7a83b1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -198,6 +198,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e24cd7c..d6efd80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -746,11 +747,15 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 		/* Deliver the page fault to userland */
 		if (vma->vm_flags & VM_USERFAULT) {
+			int ret;
+
 			spin_unlock(ptl);
 			mem_cgroup_uncharge_page(page);
 			put_page(page);
 			pte_free(mm, pgtable);
-			return VM_FAULT_SIGBUS;
+			ret = handle_userfault(vma, haddr);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
 		}
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
@@ -828,16 +833,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = 0;
 		set = false;
 		if (pmd_none(*pmd)) {
-			if (vma->vm_flags & VM_USERFAULT)
-				ret = VM_FAULT_SIGBUS;
-			else {
+			if (vma->vm_flags & VM_USERFAULT) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
 				set_huge_zero_page(pgtable, mm, vma,
 						   haddr, pmd,
 						   zero_page);
+				spin_unlock(ptl);
 				set = true;
 			}
-		}
-		spin_unlock(ptl);
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
diff --git a/mm/memory.c b/mm/memory.c
index 545c417..a6a04ed 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2644,7 +2645,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (vma->vm_flags & VM_USERFAULT) {
 			pte_unmap_unlock(page_table, ptl);
-			return VM_FAULT_SIGBUS;
+			return handle_userfault(vma, address);
 		}
 		goto setpte;
 	}
@@ -2678,7 +2679,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(page_table, ptl);
 		mem_cgroup_uncharge_page(page);
 		page_cache_release(page);
-		return VM_FAULT_SIGBUS;
+		return handle_userfault(vma, address);
 	}
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);


* [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
@ 2014-07-02 16:50   ` Andrea Arcangeli
  0 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata, Paolo Bonzini, Anthony Liguori, Stefan Hajnoczi,
	Wenchao Xia, Andrew Jones, Juan Quintela, Mel Gorman

Once a userfaultfd is created, MADV_USERFAULT regions talk through the
userfaultfd protocol with the thread responsible for doing the memory
externalization of the process.

The protocol starts with userland writing the requested/preferred
USERFAULT_PROTOCOL version into the userfault fd (a 64bit write). If
the kernel knows that version, it acks it by letting userland read a
64bit value from the userfault fd containing the same 64bit
USERFAULT_PROTOCOL version that userland asked for. Otherwise userland
will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and
will have to try again by writing an older protocol version, if one is
also suitable for its usage, and read it back until it stops reading
-1ULL. After that the userfaultfd protocol starts.

The protocol consists of 64bit-sized reads from the userfault fd that
provide userland with the fault addresses. After a userfault address
has been read and the fault has been resolved by userland, the
application must write back 128bits in the form of a [ start, end ]
range (64bit each) to tell the kernel that such a range has been
mapped. Multiple read userfaults can be resolved with a single range
write. poll() can be used to know when there are new userfaults to
read (POLLIN) and when there are threads waiting for a wakeup through
a range write (POLLOUT).
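
For illustration only (this is not part of the patch), a minimal
sketch of a userland monitor speaking the protocol described above
could look like the following. The syscall number, the use of
MADV_USERFAULT and the way the fault is resolved are assumptions
taken from this patch series, and error handling is reduced to the
bare minimum:

	#include <stdint.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define __NR_userfaultfd	318	/* x86_64 number added above */
	#define USERFAULTFD_PROTOCOL	0xaaULL	/* matches fs/userfaultfd.c */

	int main(void)
	{
		uint64_t proto = USERFAULTFD_PROTOCOL, addr, range[2];
		long page = sysconf(_SC_PAGESIZE);
		int uffd = syscall(__NR_userfaultfd, 0);

		/* handshake: write the wanted protocol, read back the ack */
		if (write(uffd, &proto, sizeof(proto)) != sizeof(proto))
			return 1;
		if (read(uffd, &proto, sizeof(proto)) != sizeof(proto) ||
		    proto == (uint64_t)-1)
			return 1;	/* kernel doesn't speak this protocol */

		/*
		 * ... mark a region with madvise(area, len, MADV_USERFAULT)
		 * and let the other threads of the process touch it ...
		 */

		for (;;) {
			/* blocks until a thread faults in a marked region */
			if (read(uffd, &addr, sizeof(addr)) != sizeof(addr))
				continue;
			/* resolve the fault, e.g. with remap_anon_pages() */
			range[0] = addr & ~((uint64_t)page - 1);
			range[1] = range[0] + page;
			/* tell the kernel the range is mapped: this wakes
			   the blocked faulting thread(s) */
			write(uffd, range, sizeof(range));
		}
	}

A monitor that multiplexes other work can use poll() on the fd
(POLLIN/POLLOUT as described above) instead of blocking in read().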

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/Makefile                      |   1 +
 fs/userfaultfd.c                 | 557 +++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h         |   1 +
 include/linux/userfaultfd.h      |  40 +++
 init/Kconfig                     |  10 +
 kernel/sys_ni.c                  |   1 +
 mm/huge_memory.c                 |  20 +-
 mm/memory.c                      |   5 +-
 10 files changed, 629 insertions(+), 8 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 08bc856..5aa2da4 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -361,3 +361,4 @@
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
 354	i386	remap_anon_pages	sys_remap_anon_pages
+355	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 37bd179..7dca902 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -324,6 +324,7 @@
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
 317	common	remap_anon_pages	sys_remap_anon_pages
+318	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 4030cbf..e00e243 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
+obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..4902fa3
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,557 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include <linux/kref.h>
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd.h>
+
+struct userfaultfd_ctx {
+	/* pseudo fd refcounting */
+	struct kref kref;
+	/* waitqueue head for the userfaultfd page faults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* state machine */
+	unsigned int state;
+	/* released */
+	bool released;
+};
+
+struct userfaultfd_wait_queue {
+	unsigned long address;
+	wait_queue_t wq;
+	bool pending;
+	struct userfaultfd_ctx *ctx;
+};
+
+#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
+#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
+
+enum {
+	USERFAULTFD_STATE_ASK_PROTOCOL,
+	USERFAULTFD_STATE_ACK_PROTOCOL,
+	USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL,
+	USERFAULTFD_STATE_RUNNING,
+};
+
+/**
+ * struct mm_slot - userlandfd information per mm that is being scanned
+ * @link: link to the mm_slots hash list
+ * @mm: the mm that this information is valid for
+ * @ctx: userfaultfd context for this mm
+ */
+struct mm_slot {
+	struct hlist_node link;
+	struct mm_struct *mm;
+	struct userfaultfd_ctx ctx;
+	struct rcu_head rcu_head;
+};
+
+#define MM_USERLANDFD_HASH_BITS 10
+static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
+
+static DEFINE_MUTEX(mm_userlandfd_mutex);
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *slot;
+
+	hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
+				   (unsigned long)mm)
+		if (slot->mm == mm)
+			return slot;
+
+	return NULL;
+}
+
+static void insert_to_mm_userlandfd_hash(struct mm_struct *mm,
+					 struct mm_slot *mm_slot)
+{
+	mm_slot->mm = mm;
+	hash_add_rcu(mm_userlandfd_hash, &mm_slot->link, (unsigned long)mm);
+}
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+				     int wake_flags, void *key)
+{
+	unsigned long *range = key;
+	int ret;
+	struct userfaultfd_wait_queue *uwq;
+
+	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+	ret = 0;
+	/* don't wake the pending ones to avoid reads to block */
+	if (uwq->pending && !uwq->ctx->released)
+		goto out;
+	if (range[0] > uwq->address || range[1] <= uwq->address)
+		goto out;
+	ret = wake_up_state(wq->private, mode);
+	if (ret)
+		/* wake only once, autoremove behavior */
+		list_del_init(&wq->task_list);
+out:
+	return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: In case of success, returns a pointer to the userfaultfd context.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+	kref_get(&ctx->kref);
+}
+
+static void userfaultfd_free(struct kref *kref)
+{
+	struct userfaultfd_ctx *ctx = container_of(kref,
+						   struct userfaultfd_ctx,
+						   kref);
+	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+
+	mutex_lock(&mm_userlandfd_mutex);
+	hash_del_rcu(&mm_slot->link);
+	mutex_unlock(&mm_userlandfd_mutex);
+
+	kfree_rcu(mm_slot, rcu_head);
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+	kref_put(&ctx->kref, userfaultfd_free);
+}
+
+int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mm_slot *slot;
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue uwq;
+
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	rcu_read_lock();
+	slot = get_mm_slot(mm);
+	if (!slot) {
+		rcu_read_unlock();
+		return VM_FAULT_SIGBUS;
+	}
+	ctx = &slot->ctx;
+	userfaultfd_ctx_get(ctx);
+	rcu_read_unlock();
+
+	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+	uwq.wq.private = current;
+	uwq.address = address;
+	uwq.pending = true;
+	uwq.ctx = ctx;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (fatal_signal_pending(current))
+			break;
+		if (!uwq.pending)
+			break;
+		spin_unlock(&ctx->fault_wqh.lock);
+		up_read(&mm->mmap_sem);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		down_read(&mm->mmap_sem);
+		spin_lock(&ctx->fault_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * released by another CPU.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return 0;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	__u64 range[2] = { 0ULL, -1ULL };
+
+	ctx->released = true;
+	smp_wmb();
+	spin_lock(&ctx->fault_wqh.lock);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	wake_up_poll(&ctx->fd_wqh, POLLHUP);
+	userfaultfd_ctx_put(ctx);
+	return 0;
+}
+
+static inline unsigned long find_userfault(struct userfaultfd_ctx *ctx,
+					   struct userfaultfd_wait_queue **uwq,
+					   unsigned int events_filter)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *_uwq;
+	unsigned int events = 0;
+
+	BUG_ON(!events_filter);
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (_uwq->pending) {
+			if (!(events & POLLIN) && (events_filter & POLLIN)) {
+				events |= POLLIN;
+				if (uwq)
+					*uwq = _uwq;
+			}
+		} else if (events_filter & POLLOUT)
+			events |= POLLOUT;
+		if (events == events_filter)
+			break;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return events;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	poll_wait(file, &ctx->fd_wqh, wait);
+
+	switch (ctx->state) {
+	case USERFAULTFD_STATE_ASK_PROTOCOL:
+		return POLLOUT;
+	case USERFAULTFD_STATE_ACK_PROTOCOL:
+		return POLLIN;
+	case USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL:
+		return POLLIN;
+	case USERFAULTFD_STATE_RUNNING:
+		return find_userfault(ctx, NULL, POLLIN|POLLOUT);
+	default:
+		BUG();
+	}
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+				    __u64 *addr)
+{
+	ssize_t res;
+	DECLARE_WAITQUEUE(wait, current);
+	struct userfaultfd_wait_queue *uwq = NULL;
+
+	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+		return -EINVAL;
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL) {
+		*addr = USERFAULTFD_PROTOCOL;
+		ctx->state = USERFAULTFD_STATE_RUNNING;
+		return 0;
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL) {
+		*addr = USERFAULTFD_UNKNOWN_PROTOCOL;
+		ctx->state = USERFAULTFD_STATE_ASK_PROTOCOL;
+		return 0;
+	}
+	BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* always take the fd_wqh lock before the fault_wqh lock */
+		if (find_userfault(ctx, &uwq, POLLIN)) {
+			uwq->pending = false;
+			*addr = uwq->address;
+			res = 0;
+			break;
+		}
+		if (signal_pending(current)) {
+			res = -ERESTARTSYS;
+			break;
+		}
+		if (no_wait) {
+			res = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock_irq(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	if (res == 0) {
+		if (waitqueue_active(&ctx->fd_wqh))
+			wake_up_locked_poll(&ctx->fd_wqh, POLLOUT);
+	}
+	spin_unlock_irq(&ctx->fd_wqh.lock);
+
+	return res;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 addr;
+
+	if (count < sizeof(addr))
+		return -EINVAL;
+	res = userfaultfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &addr);
+	if (res < 0)
+		return res;
+
+	return put_user(addr, (__u64 __user *) buf) ? -EFAULT : sizeof(addr);
+}
+
+static int wake_userfault(struct userfaultfd_ctx *ctx, __u64 *range)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	int ret = -ENOENT;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			continue;
+		if (uwq->address >= range[0] &&
+		    uwq->address < range[1]) {
+			ret = 0;
+			/* wake all in the range and autoremove */
+			__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
+					     range);
+			break;
+		}
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 range[2];
+	DECLARE_WAITQUEUE(wait, current);
+
+	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+		__u64 protocol;
+		if (count < sizeof(__u64))
+			return -EINVAL;
+		if (copy_from_user(&protocol, buf, sizeof(protocol)))
+			return -EFAULT;
+		if (protocol != USERFAULTFD_PROTOCOL) {
+			/* we'll offer the supported protocol in the ack */
+			printk_once(KERN_INFO
+				    "userfaultfd protocol not available\n");
+			ctx->state = USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL;
+		} else
+			ctx->state = USERFAULTFD_STATE_ACK_PROTOCOL;
+		return sizeof(protocol);
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL)
+		return -EINVAL;
+
+	BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+	if (count < sizeof(range))
+		return -EINVAL;
+	if (copy_from_user(&range, buf, sizeof(range)))
+		return -EFAULT;
+	if (range[0] >= range[1])
+		return -ERANGE;
+
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* always take the fd_wqh lock before the fault_wqh lock */
+		if (find_userfault(ctx, NULL, POLLOUT)) {
+			if (!wake_userfault(ctx, range)) {
+				res = sizeof(range);
+				break;
+			}
+		}
+		if (signal_pending(current)) {
+			res = -ERESTARTSYS;
+			break;
+		}
+		if (file->f_flags & O_NONBLOCK) {
+			res = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fd_wqh.lock);
+
+	return res;
+}
+
+#ifdef CONFIG_PROC_FS
+static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct userfaultfd_ctx *ctx = f->private_data;
+	int ret;
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long pending = 0, total = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			pending++;
+		total++;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
+
+	return ret;
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= userfaultfd_show_fdinfo,
+#endif
+	.release	= userfaultfd_release,
+	.poll		= userfaultfd_poll,
+	.read		= userfaultfd_read,
+	.write		= userfaultfd_write,
+	.llseek		= noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates an userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates an userfaultfd file pointer, w/out installing
+ * it into the fd table. This is useful when the userfaultfd file is
+ * used during the initialization of data structures that require
+ * extra setup after the userfaultfd creation. So the userfaultfd
+ * creation is split into the file pointer creation phase, and the
+ * file descriptor installation phase.  In this way races with
+ * userspace closing the newly installed file descriptor can be
+ * avoided.  Returns an userfaultfd file pointer, or a proper error
+ * pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+	struct file *file;
+	struct mm_slot *mm_slot;
+
+	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+	file = ERR_PTR(-EINVAL);
+	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+		goto out;
+
+	mm_slot = kmalloc(sizeof(*mm_slot), GFP_KERNEL);
+	file = ERR_PTR(-ENOMEM);
+	if (!mm_slot)
+		goto out;
+
+	mutex_lock(&mm_userlandfd_mutex);
+	file = ERR_PTR(-EBUSY);
+	if (get_mm_slot(current->mm))
+		goto out_free_unlock;
+
+	kref_init(&mm_slot->ctx.kref);
+	init_waitqueue_head(&mm_slot->ctx.fault_wqh);
+	init_waitqueue_head(&mm_slot->ctx.fd_wqh);
+	mm_slot->ctx.flags = flags;
+	mm_slot->ctx.state = USERFAULTFD_STATE_ASK_PROTOCOL;
+	mm_slot->ctx.released = false;
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops,
+				  &mm_slot->ctx,
+				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file))
+	out_free_unlock:
+		kfree(mm_slot);
+	else
+		insert_to_mm_userlandfd_hash(current->mm,
+					     mm_slot);
+	mutex_unlock(&mm_userlandfd_mutex);
+out:
+	return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	int fd, error;
+	struct file *file;
+
+	error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	file = userfaultfd_file_create(flags);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+
+	return error;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 19edb00..dcbcb7d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -806,6 +806,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
new file mode 100644
index 0000000..8200a71
--- /dev/null
+++ b/include/linux/userfaultfd.h
@@ -0,0 +1,40 @@
+/*
+ *  include/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (UFFD_SHARED_FCNTL_FLAGS)
+
+#ifdef CONFIG_USERFAULTFD
+
+int handle_userfault(struct vm_area_struct *vma, unsigned long address);
+
+#else /* CONFIG_USERFAULTFD */
+
+static int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+#endif
+
+#endif /* _LINUX_USERFAULTFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99..dc8d722 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1475,6 +1475,16 @@ config EVENTFD
 
 	  If unsure, say Y.
 
+config USERFAULTFD
+	bool "Enable userfaultfd() system call"
+	select ANON_INODES
+	default y
+	help
+	  Enable the userfaultfd() system call that allows page faults
+	  to be trapped and handled in userland.
+
+	  If unsure, say Y.
+
 config SHMEM
 	bool "Use full shmem filesystem" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6fc1aca..d7a83b1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -198,6 +198,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e24cd7c..d6efd80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -746,11 +747,15 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 		/* Deliver the page fault to userland */
 		if (vma->vm_flags & VM_USERFAULT) {
+			int ret;
+
 			spin_unlock(ptl);
 			mem_cgroup_uncharge_page(page);
 			put_page(page);
 			pte_free(mm, pgtable);
-			return VM_FAULT_SIGBUS;
+			ret = handle_userfault(vma, haddr);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
 		}
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
@@ -828,16 +833,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = 0;
 		set = false;
 		if (pmd_none(*pmd)) {
-			if (vma->vm_flags & VM_USERFAULT)
-				ret = VM_FAULT_SIGBUS;
-			else {
+			if (vma->vm_flags & VM_USERFAULT) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
 				set_huge_zero_page(pgtable, mm, vma,
 						   haddr, pmd,
 						   zero_page);
+				spin_unlock(ptl);
 				set = true;
 			}
-		}
-		spin_unlock(ptl);
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
diff --git a/mm/memory.c b/mm/memory.c
index 545c417..a6a04ed 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2644,7 +2645,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (vma->vm_flags & VM_USERFAULT) {
 			pte_unmap_unlock(page_table, ptl);
-			return VM_FAULT_SIGBUS;
+			return handle_userfault(vma, address);
 		}
 		goto setpte;
 	}
@@ -2678,7 +2679,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(page_table, ptl);
 		mem_cgroup_uncharge_page(page);
 		page_cache_release(page);
-		return VM_FAULT_SIGBUS;
+		return handle_userfault(vma, address);
 	}
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);


* [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
@ 2014-07-02 16:50   ` Andrea Arcangeli
  0 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: Robert Love, Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi,
	Andrew Jones, KOSAKI Motohiro, Michel Lespinasse,
	Andrea Arcangeli, Taras Glek, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Android Kernel Team, Mel Gorman,
	\"Dr. David Alan Gilbert\", Huangpeng (Peter),
	Anthony Liguori, Mike Hommey, Keith Packard, Wenchao Xia,
	Minchan Kim, Dmitry Adamushko, Johannes Weiner, Paolo Bonzini,
	Andrew Morton

Once a userfaultfd is created, MADV_USERFAULT regions talk through the
userfaultfd protocol with the thread responsible for doing the memory
externalization of the process.

The protocol starts with userland writing the requested/preferred
USERFAULT_PROTOCOL version into the userfault fd (a 64bit write). If
the kernel knows that version, it acks it by letting userland read a
64bit value from the userfault fd containing the same 64bit
USERFAULT_PROTOCOL version that userland asked for. Otherwise userland
will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and
will have to try again by writing an older protocol version, if one is
also suitable for its usage, and read it back until it stops reading
-1ULL. After that the userfaultfd protocol starts.

The protocol consists of 64bit-sized reads from the userfault fd that
provide userland with the fault addresses. After a userfault address
has been read and the fault has been resolved by userland, the
application must write back 128bits in the form of a [ start, end ]
range (64bit each) to tell the kernel that such a range has been
mapped. Multiple read userfaults can be resolved with a single range
write. poll() can be used to know when there are new userfaults to
read (POLLIN) and when there are threads waiting for a wakeup through
a range write (POLLOUT).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/Makefile                      |   1 +
 fs/userfaultfd.c                 | 557 +++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h         |   1 +
 include/linux/userfaultfd.h      |  40 +++
 init/Kconfig                     |  10 +
 kernel/sys_ni.c                  |   1 +
 mm/huge_memory.c                 |  20 +-
 mm/memory.c                      |   5 +-
 10 files changed, 629 insertions(+), 8 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 08bc856..5aa2da4 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -361,3 +361,4 @@
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
 354	i386	remap_anon_pages	sys_remap_anon_pages
+355	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 37bd179..7dca902 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -324,6 +324,7 @@
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
 317	common	remap_anon_pages	sys_remap_anon_pages
+318	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 4030cbf..e00e243 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
+obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..4902fa3
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,557 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include <linux/kref.h>
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd.h>
+
+struct userfaultfd_ctx {
+	/* pseudo fd refcounting */
+	struct kref kref;
+	/* waitqueue head for the userfaultfd page faults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* state machine */
+	unsigned int state;
+	/* released */
+	bool released;
+};
+
+struct userfaultfd_wait_queue {
+	unsigned long address;
+	wait_queue_t wq;
+	bool pending;
+	struct userfaultfd_ctx *ctx;
+};
+
+#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
+#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
+
+enum {
+	USERFAULTFD_STATE_ASK_PROTOCOL,
+	USERFAULTFD_STATE_ACK_PROTOCOL,
+	USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL,
+	USERFAULTFD_STATE_RUNNING,
+};
+
+/**
+ * struct mm_slot - userlandfd information per mm that is being scanned
+ * @link: link to the mm_slots hash list
+ * @mm: the mm that this information is valid for
+ * @ctx: userfaultfd context for this mm
+ */
+struct mm_slot {
+	struct hlist_node link;
+	struct mm_struct *mm;
+	struct userfaultfd_ctx ctx;
+	struct rcu_head rcu_head;
+};
+
+#define MM_USERLANDFD_HASH_BITS 10
+static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
+
+static DEFINE_MUTEX(mm_userlandfd_mutex);
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *slot;
+
+	hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
+				   (unsigned long)mm)
+		if (slot->mm == mm)
+			return slot;
+
+	return NULL;
+}
+
+static void insert_to_mm_userlandfd_hash(struct mm_struct *mm,
+					 struct mm_slot *mm_slot)
+{
+	mm_slot->mm = mm;
+	hash_add_rcu(mm_userlandfd_hash, &mm_slot->link, (unsigned long)mm);
+}
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+				     int wake_flags, void *key)
+{
+	unsigned long *range = key;
+	int ret;
+	struct userfaultfd_wait_queue *uwq;
+
+	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+	ret = 0;
+	/* don't wake the pending ones to avoid reads to block */
+	if (uwq->pending && !uwq->ctx->released)
+		goto out;
+	if (range[0] > uwq->address || range[1] <= uwq->address)
+		goto out;
+	ret = wake_up_state(wq->private, mode);
+	if (ret)
+		/* wake only once, autoremove behavior */
+		list_del_init(&wq->task_list);
+out:
+	return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: In case of success, returns a pointer to the userfaultfd context.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+	kref_get(&ctx->kref);
+}
+
+static void userfaultfd_free(struct kref *kref)
+{
+	struct userfaultfd_ctx *ctx = container_of(kref,
+						   struct userfaultfd_ctx,
+						   kref);
+	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+
+	mutex_lock(&mm_userlandfd_mutex);
+	hash_del_rcu(&mm_slot->link);
+	mutex_unlock(&mm_userlandfd_mutex);
+
+	kfree_rcu(mm_slot, rcu_head);
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+	kref_put(&ctx->kref, userfaultfd_free);
+}
+
+int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mm_slot *slot;
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue uwq;
+
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	rcu_read_lock();
+	slot = get_mm_slot(mm);
+	if (!slot) {
+		rcu_read_unlock();
+		return VM_FAULT_SIGBUS;
+	}
+	ctx = &slot->ctx;
+	userfaultfd_ctx_get(ctx);
+	rcu_read_unlock();
+
+	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+	uwq.wq.private = current;
+	uwq.address = address;
+	uwq.pending = true;
+	uwq.ctx = ctx;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (fatal_signal_pending(current))
+			break;
+		if (!uwq.pending)
+			break;
+		spin_unlock(&ctx->fault_wqh.lock);
+		up_read(&mm->mmap_sem);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		down_read(&mm->mmap_sem);
+		spin_lock(&ctx->fault_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * released by another CPU.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return 0;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	__u64 range[2] = { 0ULL, -1ULL };
+
+	ctx->released = true;
+	smp_wmb();
+	spin_lock(&ctx->fault_wqh.lock);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	wake_up_poll(&ctx->fd_wqh, POLLHUP);
+	userfaultfd_ctx_put(ctx);
+	return 0;
+}
+
+static inline unsigned long find_userfault(struct userfaultfd_ctx *ctx,
+					   struct userfaultfd_wait_queue **uwq,
+					   unsigned int events_filter)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *_uwq;
+	unsigned int events = 0;
+
+	BUG_ON(!events_filter);
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (_uwq->pending) {
+			if (!(events & POLLIN) && (events_filter & POLLIN)) {
+				events |= POLLIN;
+				if (uwq)
+					*uwq = _uwq;
+			}
+		} else if (events_filter & POLLOUT)
+			events |= POLLOUT;
+		if (events == events_filter)
+			break;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return events;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	poll_wait(file, &ctx->fd_wqh, wait);
+
+	switch (ctx->state) {
+	case USERFAULTFD_STATE_ASK_PROTOCOL:
+		return POLLOUT;
+	case USERFAULTFD_STATE_ACK_PROTOCOL:
+		return POLLIN;
+	case USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL:
+		return POLLIN;
+	case USERFAULTFD_STATE_RUNNING:
+		return find_userfault(ctx, NULL, POLLIN|POLLOUT);
+	default:
+		BUG();
+	}
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+				    __u64 *addr)
+{
+	ssize_t res;
+	DECLARE_WAITQUEUE(wait, current);
+	struct userfaultfd_wait_queue *uwq = NULL;
+
+	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+		return -EINVAL;
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL) {
+		*addr = USERFAULTFD_PROTOCOL;
+		ctx->state = USERFAULTFD_STATE_RUNNING;
+		return 0;
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL) {
+		*addr = USERFAULTFD_UNKNOWN_PROTOCOL;
+		ctx->state = USERFAULTFD_STATE_ASK_PROTOCOL;
+		return 0;
+	}
+	BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* always take the fd_wqh lock before the fault_wqh lock */
+		if (find_userfault(ctx, &uwq, POLLIN)) {
+			uwq->pending = false;
+			*addr = uwq->address;
+			res = 0;
+			break;
+		}
+		if (signal_pending(current)) {
+			res = -ERESTARTSYS;
+			break;
+		}
+		if (no_wait) {
+			res = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock_irq(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	if (res == 0) {
+		if (waitqueue_active(&ctx->fd_wqh))
+			wake_up_locked_poll(&ctx->fd_wqh, POLLOUT);
+	}
+	spin_unlock_irq(&ctx->fd_wqh.lock);
+
+	return res;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 addr;
+
+	if (count < sizeof(addr))
+		return -EINVAL;
+	res = userfaultfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &addr);
+	if (res < 0)
+		return res;
+
+	return put_user(addr, (__u64 __user *) buf) ? -EFAULT : sizeof(addr);
+}
+
+static int wake_userfault(struct userfaultfd_ctx *ctx, __u64 *range)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	int ret = -ENOENT;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			continue;
+		if (uwq->address >= range[0] &&
+		    uwq->address < range[1]) {
+			ret = 0;
+			/* wake all in the range and autoremove */
+			__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
+					     range);
+			break;
+		}
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 range[2];
+	DECLARE_WAITQUEUE(wait, current);
+
+	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+		__u64 protocol;
+		if (count < sizeof(__u64))
+			return -EINVAL;
+		if (copy_from_user(&protocol, buf, sizeof(protocol)))
+			return -EFAULT;
+		if (protocol != USERFAULTFD_PROTOCOL) {
+			/* we'll offer the supported protocol in the ack */
+			printk_once(KERN_INFO
+				    "userfaultfd protocol not available\n");
+			ctx->state = USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL;
+		} else
+			ctx->state = USERFAULTFD_STATE_ACK_PROTOCOL;
+		return sizeof(protocol);
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL)
+		return -EINVAL;
+
+	BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+	if (count < sizeof(range))
+		return -EINVAL;
+	if (copy_from_user(&range, buf, sizeof(range)))
+		return -EFAULT;
+	if (range[0] >= range[1])
+		return -ERANGE;
+
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* always take the fd_wqh lock before the fault_wqh lock */
+		if (find_userfault(ctx, NULL, POLLOUT)) {
+			if (!wake_userfault(ctx, range)) {
+				res = sizeof(range);
+				break;
+			}
+		}
+		if (signal_pending(current)) {
+			res = -ERESTARTSYS;
+			break;
+		}
+		if (file->f_flags & O_NONBLOCK) {
+			res = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fd_wqh.lock);
+
+	return res;
+}
+
+#ifdef CONFIG_PROC_FS
+static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct userfaultfd_ctx *ctx = f->private_data;
+	int ret;
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long pending = 0, total = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			pending++;
+		total++;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
+
+	return ret;
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= userfaultfd_show_fdinfo,
+#endif
+	.release	= userfaultfd_release,
+	.poll		= userfaultfd_poll,
+	.read		= userfaultfd_read,
+	.write		= userfaultfd_write,
+	.llseek		= noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates an userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates an userfaultfd file pointer, w/out installing
+ * it into the fd table. This is useful when the userfaultfd file is
+ * used during the initialization of data structures that require
+ * extra setup after the userfaultfd creation. So the userfaultfd
+ * creation is split into the file pointer creation phase, and the
+ * file descriptor installation phase.  In this way races with
+ * userspace closing the newly installed file descriptor can be
+ * avoided.  Returns an userfaultfd file pointer, or a proper error
+ * pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+	struct file *file;
+	struct mm_slot *mm_slot;
+
+	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+	file = ERR_PTR(-EINVAL);
+	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+		goto out;
+
+	mm_slot = kmalloc(sizeof(*mm_slot), GFP_KERNEL);
+	file = ERR_PTR(-ENOMEM);
+	if (!mm_slot)
+		goto out;
+
+	mutex_lock(&mm_userlandfd_mutex);
+	file = ERR_PTR(-EBUSY);
+	if (get_mm_slot(current->mm))
+		goto out_free_unlock;
+
+	kref_init(&mm_slot->ctx.kref);
+	init_waitqueue_head(&mm_slot->ctx.fault_wqh);
+	init_waitqueue_head(&mm_slot->ctx.fd_wqh);
+	mm_slot->ctx.flags = flags;
+	mm_slot->ctx.state = USERFAULTFD_STATE_ASK_PROTOCOL;
+	mm_slot->ctx.released = false;
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops,
+				  &mm_slot->ctx,
+				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file))
+	out_free_unlock:
+		kfree(mm_slot);
+	else
+		insert_to_mm_userlandfd_hash(current->mm,
+					     mm_slot);
+	mutex_unlock(&mm_userlandfd_mutex);
+out:
+	return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	int fd, error;
+	struct file *file;
+
+	error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	file = userfaultfd_file_create(flags);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+
+	return error;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 19edb00..dcbcb7d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -806,6 +806,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
new file mode 100644
index 0000000..8200a71
--- /dev/null
+++ b/include/linux/userfaultfd.h
@@ -0,0 +1,40 @@
+/*
+ *  include/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (UFFD_SHARED_FCNTL_FLAGS)
+
+#ifdef CONFIG_USERFAULTFD
+
+int handle_userfault(struct vm_area_struct *vma, unsigned long address);
+
+#else /* CONFIG_USERFAULTFD */
+
+static int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+#endif
+
+#endif /* _LINUX_USERFAULTFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99..dc8d722 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1475,6 +1475,16 @@ config EVENTFD
 
 	  If unsure, say Y.
 
+config USERFAULTFD
+	bool "Enable userfaultfd() system call"
+	select ANON_INODES
+	default y
+	help
+	  Enable the userfaultfd() system call that allows page faults
+	  to be trapped and handled in userland.
+
+	  If unsure, say Y.
+
 config SHMEM
 	bool "Use full shmem filesystem" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6fc1aca..d7a83b1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -198,6 +198,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e24cd7c..d6efd80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -746,11 +747,15 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 		/* Deliver the page fault to userland */
 		if (vma->vm_flags & VM_USERFAULT) {
+			int ret;
+
 			spin_unlock(ptl);
 			mem_cgroup_uncharge_page(page);
 			put_page(page);
 			pte_free(mm, pgtable);
-			return VM_FAULT_SIGBUS;
+			ret = handle_userfault(vma, haddr);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
 		}
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
@@ -828,16 +833,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = 0;
 		set = false;
 		if (pmd_none(*pmd)) {
-			if (vma->vm_flags & VM_USERFAULT)
-				ret = VM_FAULT_SIGBUS;
-			else {
+			if (vma->vm_flags & VM_USERFAULT) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
 				set_huge_zero_page(pgtable, mm, vma,
 						   haddr, pmd,
 						   zero_page);
+				spin_unlock(ptl);
 				set = true;
 			}
-		}
-		spin_unlock(ptl);
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
diff --git a/mm/memory.c b/mm/memory.c
index 545c417..a6a04ed 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2644,7 +2645,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (vma->vm_flags & VM_USERFAULT) {
 			pte_unmap_unlock(page_table, ptl);
-			return VM_FAULT_SIGBUS;
+			return handle_userfault(vma, address);
 		}
 		goto setpte;
 	}
@@ -2678,7 +2679,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(page_table, ptl);
 		mem_cgroup_uncharge_page(page);
 		page_cache_release(page);
-		return VM_FAULT_SIGBUS;
+		return handle_userfault(vma, address);
 	}
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);


* [PATCH 09/10] userfaultfd: make userfaultfd_write non blocking
  2014-07-02 16:50 ` Andrea Arcangeli
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

It is generally inefficient to ask for the wakeup of userfault ranges
in which not a single userfault address has been read through
userfaultfd_read earlier and is in turn waiting for a wakeup. However
it may come in handy to wake up the same userfault range twice in case
multiple threads faulted on the same address. But we should still
return an error, so that an application that believes this can never
happen will know it hit a bug. So just return -ENOENT instead of
blocking.
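
As an illustration only (not from the patch), a caller-side sketch of
the new semantics; uffd is assumed to be a userfaultfd already past
the protocol handshake, and wake_range() is a hypothetical helper of
the monitoring thread:

	#include <errno.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	/* returns 0 when at least one blocked faulter has been woken */
	static int wake_range(int uffd, uint64_t start, uint64_t end)
	{
		uint64_t range[2] = { start, end };
		ssize_t n = write(uffd, range, sizeof(range));

		if (n == (ssize_t) sizeof(range))
			return 0;
		if (n < 0 && errno == ENOENT)
			/*
			 * nobody was waiting in [start, end): either a
			 * second wakeup for the same address or a bug in
			 * the monitor's own bookkeeping
			 */
			fprintf(stderr, "spurious wakeup %llx-%llx\n",
				(unsigned long long) start,
				(unsigned long long) end);
		return -1;
	}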

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 34 +++++-----------------------------
 1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 4902fa3..deed8cb 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -378,9 +378,7 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
 				 size_t count, loff_t *ppos)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
-	ssize_t res;
 	__u64 range[2];
-	DECLARE_WAITQUEUE(wait, current);
 
 	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
 		__u64 protocol;
@@ -408,34 +406,12 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
 	if (range[0] >= range[1])
 		return -ERANGE;
 
-	spin_lock(&ctx->fd_wqh.lock);
-	__add_wait_queue(&ctx->fd_wqh, &wait);
-	for (;;) {
-		set_current_state(TASK_INTERRUPTIBLE);
-		/* always take the fd_wqh lock before the fault_wqh lock */
-		if (find_userfault(ctx, NULL, POLLOUT)) {
-			if (!wake_userfault(ctx, range)) {
-				res = sizeof(range);
-				break;
-			}
-		}
-		if (signal_pending(current)) {
-			res = -ERESTARTSYS;
-			break;
-		}
-		if (file->f_flags & O_NONBLOCK) {
-			res = -EAGAIN;
-			break;
-		}
-		spin_unlock(&ctx->fd_wqh.lock);
-		schedule();
-		spin_lock(&ctx->fd_wqh.lock);
-	}
-	__remove_wait_queue(&ctx->fd_wqh, &wait);
-	__set_current_state(TASK_RUNNING);
-	spin_unlock(&ctx->fd_wqh.lock);
+	/* always take the fd_wqh lock before the fault_wqh lock */
+	if (find_userfault(ctx, NULL, POLLOUT))
+		if (!wake_userfault(ctx, range))
+			return sizeof(range);
 
-	return res;
+	return -ENOENT;
 }
 
 #ifdef CONFIG_PROC_FS


* [PATCH 10/10] userfaultfd: use VM_FAULT_RETRY in handle_userfault()
  2014-07-02 16:50 ` Andrea Arcangeli
@ 2014-07-02 16:50   ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: "Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata

This optimizes the userfault handler to repeat the fault without
returning to userland if it's a page fault, and it teaches it to
handle FOLL_NOWAIT if it's a nonblocking gup invocation from KVM. The
FOLL_NOWAIT part is actually more than an optimization, because if
FOLL_NOWAIT is set the gup caller assumes the mmap_sem cannot be
released (and it may assume that the structures protected by it and
potentially read earlier cannot have become stale).

The locking rules to comply with the FAULT_FLAG_KILLABLE,
FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT flags look quite
convoluted (and are not well documented, aside from a "Caution"
comment in __lock_page_or_retry), so this is not a trivial change and
in turn it's kept incremental at the end of the patchset.
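
To make the new return convention easier to follow, the contract a
caller effectively sees after handle_userfault() returns can be
written out as a kernel-style fragment. This is illustrative only, it
is not meant to compile on its own, and it ignores the fatal-signal
corner case documented in the code below:

	int ret = handle_userfault(vma, address, flags);

	if (ret & VM_FAULT_RETRY) {
		if (flags & FAULT_FLAG_RETRY_NOWAIT) {
			/* FOLL_NOWAIT gup (e.g. KVM async page faults):
			 * the mmap_sem is still held by the caller */
		} else {
			/* handle_userfault() already released the
			 * mmap_sem, the fault must be retried */
		}
	} else if (ret & VM_FAULT_SIGBUS) {
		/* the userfaultfd was closed or a fatal signal arrived
		 * without FAULT_FLAG_KILLABLE: fail the fault with the
		 * mmap_sem still held as usual */
	} else {
		/* ret == 0: the userfault was resolved while waiting,
		 * return with the mmap_sem held and let the faulting
		 * access be re-executed */
	}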

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c            | 68 ++++++++++++++++++++++++++++++++++++++++++---
 include/linux/userfaultfd.h |  6 ++--
 mm/huge_memory.c            |  8 +++---
 mm/memory.c                 |  4 +--
 4 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index deed8cb..b8b0fb7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -155,12 +155,29 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 	kref_put(&ctx->kref, userfaultfd_free);
 }
 
-int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_KILLABLE is set but FAULT_FLAG_ALLOW_RETRY is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there's
+ * a fatal signal pending, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mm_slot *slot;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
+	int ret;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -188,10 +205,53 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address)
 	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (fatal_signal_pending(current))
+		if (fatal_signal_pending(current) || ctx->released) {
+			/*
+			 * We have to fail because the task was
+			 * killed or the file was released, so
+			 * simulate VM_FAULT_SIGBUS or just return to
+			 * userland through VM_FAULT_RETRY if we come
+			 * from a page fault.
+			 */
+			ret = VM_FAULT_SIGBUS;
+			if (fatal_signal_pending(current) &&
+			    (flags & FAULT_FLAG_KILLABLE)) {
+				/*
+				 * If FAULT_FLAG_KILLABLE is set and
+				 * there's a fatal signal pending, we
+				 * can return VM_FAULT_RETRY
+				 * regardless of whether
+				 * FAULT_FLAG_ALLOW_RETRY is set, as
+				 * long as we release the mmap_sem.
+				 * The page fault will then return
+				 * straight to userland to handle the
+				 * fatal signal.
+				 */
+				up_read(&mm->mmap_sem);
+				ret = VM_FAULT_RETRY;
+			}
+			break;
+		}
+		if (!uwq.pending) {
+			ret = 0;
+			if (flags & FAULT_FLAG_ALLOW_RETRY) {
+				ret = VM_FAULT_RETRY;
+				if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
+					up_read(&mm->mmap_sem);
+			}
 			break;
-		if (!uwq.pending)
+		}
+		if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
+		     flags) ==
+		    (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
+			ret = VM_FAULT_RETRY;
+			/*
+			 * The mmap_sem must not be released if
+			 * FAULT_FLAG_RETRY_NOWAIT is set, even though
+			 * we return VM_FAULT_RETRY (FOLL_NOWAIT case).
+			 */
 			break;
+		}
 		spin_unlock(&ctx->fault_wqh.lock);
 		up_read(&mm->mmap_sem);
 
@@ -211,7 +271,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address)
 	 */
 	userfaultfd_ctx_put(ctx);
 
-	return 0;
+	return ret;
 }
 
 static int userfaultfd_release(struct inode *inode, struct file *file)
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
index 8200a71..b7caef5 100644
--- a/include/linux/userfaultfd.h
+++ b/include/linux/userfaultfd.h
@@ -26,11 +26,13 @@
 
 #ifdef CONFIG_USERFAULTFD
 
-int handle_userfault(struct vm_area_struct *vma, unsigned long address);
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags);
 
 #else /* CONFIG_USERFAULTFD */
 
-static int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+static int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+			    unsigned int flags)
 {
 	return VM_FAULT_SIGBUS;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d6efd80..e1a74a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -714,7 +714,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
-					struct page *page)
+					struct page *page, unsigned int flags)
 {
 	pgtable_t pgtable;
 	spinlock_t *ptl;
@@ -753,7 +753,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 			mem_cgroup_uncharge_page(page);
 			put_page(page);
 			pte_free(mm, pgtable);
-			ret = handle_userfault(vma, haddr);
+			ret = handle_userfault(vma, haddr, flags);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
 		}
@@ -835,7 +835,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pmd_none(*pmd)) {
 			if (vma->vm_flags & VM_USERFAULT) {
 				spin_unlock(ptl);
-				ret = handle_userfault(vma, haddr);
+				ret = handle_userfault(vma, haddr, flags);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
 				set_huge_zero_page(pgtable, mm, vma,
@@ -863,7 +863,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
+	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index a6a04ed..44506e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2645,7 +2645,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (vma->vm_flags & VM_USERFAULT) {
 			pte_unmap_unlock(page_table, ptl);
-			return handle_userfault(vma, address);
+			return handle_userfault(vma, address, flags);
 		}
 		goto setpte;
 	}
@@ -2679,7 +2679,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(page_table, ptl);
 		mem_cgroup_uncharge_page(page);
 		page_cache_release(page);
-		return handle_userfault(vma, address);
+		return handle_userfault(vma, address, flags);
 	}
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);


* Re: [PATCH 00/10] RFC: userfault
@ 2014-07-03  1:51   ` Andy Lutomirski
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2014-07-03  1:51 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-mm, linux-kernel
  Cc: "Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Isaku Yamahata, Linux API

On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.

cc:linux-api -- this is certainly worthy of linux-api discussion.

> 
> The combination of these features are what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
> 
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
> 
> If the access could ever happen in kernel context through syscalls
> (not not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting a userfaultfd_write ack).
> 
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
> 
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
> 
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be splitted. In there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
> 
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
> 
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
> 
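As an aside, the recommended flow above can be sketched in a few lines
of userland code. __NR_remap_anon_pages is assumed to be provided by
the patched kernel headers, and the memset stands in for actually
receiving the remote page data:

	#define _GNU_SOURCE
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Move a freshly filled, same-sized anonymous src range onto dst. */
	static long resolve_with_remap(void *dst, void *src, unsigned long len)
	{
		/* keep the source mapcount at 1 even if the process forks */
		if (madvise(src, len, MADV_DONTFORK))
			return -1;
		memset(src, 0xaa, len);	/* stand-in for the network receive */
		return syscall(__NR_remap_anon_pages,
			       (unsigned long) dst, (unsigned long) src,
			       len, 0UL);
	}
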
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
> 
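For completeness, a poll()-driven variant of the blocking read loop
might look like the sketch below, assuming one __u64 fault address is
read at a time as in the current protocol:

	#include <poll.h>
	#include <unistd.h>
	#include <linux/types.h>

	/* Wait for the next userfault and return its address in *addr. */
	static int wait_for_userfault(int ufd, __u64 *addr)
	{
		struct pollfd pfd = { .fd = ufd, .events = POLLIN };

		if (poll(&pfd, 1, -1) <= 0 || !(pfd.revents & POLLIN))
			return -1;
		return read(ufd, addr, sizeof(*addr)) == sizeof(*addr) ? 0 : -1;
	}
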
> The code can be found here:
> 
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
> 
> The branch is rebased so you can get updates for example with:
> 
> git fetch && git checkout -f origin/userfault
> 
> Comments welcome, thanks!
> Andrea
> 
> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/syscalls-x86_64.h   |   2 +
>  syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
>  syscalls/syscalls.h         |   2 +
>  syscalls/userfaultfd.c      |  12 ++++++
>  4 files changed, 116 insertions(+)
>  create mode 100644 syscalls/remap_anon_pages.c
>  create mode 100644 syscalls/userfaultfd.c
> 
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
>  	{ .entry = &syscall_sched_setattr },
>  	{ .entry = &syscall_sched_getattr },
>  	{ .entry = &syscall_renameat2 },
> +	{ .entry = &syscall_remap_anon_pages },
> +	{ .entry = &syscall_userfaultfd },
>  };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE3(remap_anon_pages,
> +		unsigned long, dst_start, unsigned long, src_start,
> +		unsigned long, len)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> +	1 * MB, 2 * MB, 4 * MB, 8 * MB,
> +	10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> +	unsigned long max_rand;
> +	if (rand_bool()) {
> +		g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_src = MAP_FAILED;
> +	if (rand_bool()) {
> +		g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_dst = MAP_FAILED;
> +	g_size = size;
> +	g_check = 1;
> +
> +	rec->a1 = (unsigned long) g_dst;
> +	rec->a2 = (unsigned long) g_src;
> +	rec->a3 = g_size;
> +	rec->a4 = 0;
> +
> +	if (rand_bool())
> +		max_rand = -1UL;
> +	else
> +		max_rand = g_size << 1;
> +	if (rand_bool()) {
> +		rec->a3 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a1 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a2 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		if (rand_bool()) {
> +			rec->a4 = rand();
> +		} else
> +			rec->a4 = RAP_ALLOW_SRC_HOLES;
> +	}
> +	if (g_src != MAP_FAILED)
> +		memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	if (g_check && !rec->retval) {
> +		unsigned long size = g_size;
> +		unsigned char *dst = g_dst;
> +		while (size--)
> +			assert(dst[size] == 0xaaU);
> +	}
> +	munmap(g_src, g_size);
> +	munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> +	.name = "remap_anon_pages",
> +	.num_args = 4,
> +	.arg1name = "dst_start",
> +	.arg2name = "src_start",
> +	.arg3name = "len",
> +	.arg4name = "flags",
> +	.group = GROUP_VM,
> +	.sanitise = sanitise_remap_anon_pages,
> +	.post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
>  extern struct syscallentry syscall_sched_getattr;
>  extern struct syscallentry syscall_renameat2;
>  extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> +	.name = "userfaultfd",
> +	.num_args = 1,
> +	.arg1name = "flags",
> +	.arg1type = ARG_LEN,
> +	.rettype = RET_FD,
> +};
> 
> 
> Andrea Arcangeli (10):
>   mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
>   mm: madvise MADV_USERFAULT
>   mm: PT lock: export double_pt_lock/unlock
>   mm: rmap preparation for remap_anon_pages
>   mm: swp_entry_swapcount
>   mm: sys_remap_anon_pages
>   waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: make userfaultfd_write non blocking
>   userfaultfd: use VM_FAULT_RETRY in handle_userfault()
> 
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/x86/syscalls/syscall_32.tbl       |   2 +
>  arch/x86/syscalls/syscall_64.tbl       |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  fs/Makefile                            |   1 +
>  fs/proc/task_mmu.c                     |   5 +-
>  fs/userfaultfd.c                       | 593 +++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h                |  11 +-
>  include/linux/ksm.h                    |   4 +-
>  include/linux/mm.h                     |   5 +
>  include/linux/mm_types.h               |   2 +-
>  include/linux/swap.h                   |   6 +
>  include/linux/syscalls.h               |   5 +
>  include/linux/userfaultfd.h            |  42 +++
>  include/linux/wait.h                   |   5 +-
>  include/uapi/asm-generic/mman-common.h |   3 +
>  init/Kconfig                           |  10 +
>  kernel/sched/wait.c                    |   7 +-
>  kernel/sys_ni.c                        |   2 +
>  mm/fremap.c                            | 506 ++++++++++++++++++++++++++++
>  mm/huge_memory.c                       | 209 ++++++++++--
>  mm/ksm.c                               |   2 +-
>  mm/madvise.c                           |  19 +-
>  mm/memory.c                            |  14 +
>  mm/mremap.c                            |   2 +-
>  mm/rmap.c                              |   9 +
>  mm/swapfile.c                          |  13 +
>  net/sunrpc/sched.c                     |   2 +-
>  30 files changed, 1447 insertions(+), 46 deletions(-)
>  create mode 100644 fs/userfaultfd.c
>  create mode 100644 include/linux/userfaultfd.h
> 

* Re: [PATCH 00/10] RFC: userfault
@ 2014-07-03  1:51   ` Andy Lutomirski
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2014-07-03  1:51 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	kvm-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: "Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Isaku Yamahata, Linux API

On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.

cc:linux-api -- this is certainly worthy of linux-api discussion.

> 
> The combination of these features are what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
> 
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
> 
> If the access could ever happen in kernel context through syscalls
> (not not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting a userfaultfd_write ack).
> 
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
> 
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
> 
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be splitted. In there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
> 
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
> 
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
> 
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
> 
> The code can be found here:
> 
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
> 
> The branch is rebased so you can get updates for example with:
> 
> git fetch && git checkout -f origin/userfault
> 
> Comments welcome, thanks!
> Andrea
> 
> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
> 
> Signed-off-by: Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  include/syscalls-x86_64.h   |   2 +
>  syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
>  syscalls/syscalls.h         |   2 +
>  syscalls/userfaultfd.c      |  12 ++++++
>  4 files changed, 116 insertions(+)
>  create mode 100644 syscalls/remap_anon_pages.c
>  create mode 100644 syscalls/userfaultfd.c
> 
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
>  	{ .entry = &syscall_sched_setattr },
>  	{ .entry = &syscall_sched_getattr },
>  	{ .entry = &syscall_renameat2 },
> +	{ .entry = &syscall_remap_anon_pages },
> +	{ .entry = &syscall_userfaultfd },
>  };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE3(remap_anon_pages,
> +		unsigned long, dst_start, unsigned long, src_start,
> +		unsigned long, len)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> +	1 * MB, 2 * MB, 4 * MB, 8 * MB,
> +	10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> +	unsigned long max_rand;
> +	if (rand_bool()) {
> +		g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_src = MAP_FAILED;
> +	if (rand_bool()) {
> +		g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_dst = MAP_FAILED;
> +	g_size = size;
> +	g_check = 1;
> +
> +	rec->a1 = (unsigned long) g_dst;
> +	rec->a2 = (unsigned long) g_src;
> +	rec->a3 = g_size;
> +	rec->a4 = 0;
> +
> +	if (rand_bool())
> +		max_rand = -1UL;
> +	else
> +		max_rand = g_size << 1;
> +	if (rand_bool()) {
> +		rec->a3 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a1 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a2 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		if (rand_bool()) {
> +			rec->a4 = rand();
> +		} else
> +			rec->a4 = RAP_ALLOW_SRC_HOLES;
> +	}
> +	if (g_src != MAP_FAILED)
> +		memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	if (g_check && !rec->retval) {
> +		unsigned long size = g_size;
> +		unsigned char *dst = g_dst;
> +		while (size--)
> +			assert(dst[size] == 0xaaU);
> +	}
> +	munmap(g_src, g_size);
> +	munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> +	.name = "remap_anon_pages",
> +	.num_args = 4,
> +	.arg1name = "dst_start",
> +	.arg2name = "src_start",
> +	.arg3name = "len",
> +	.arg4name = "flags",
> +	.group = GROUP_VM,
> +	.sanitise = sanitise_remap_anon_pages,
> +	.post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
>  extern struct syscallentry syscall_sched_getattr;
>  extern struct syscallentry syscall_renameat2;
>  extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> +	.name = "userfaultfd",
> +	.num_args = 1,
> +	.arg1name = "flags",
> +	.arg1type = ARG_LEN,
> +	.rettype = RET_FD,
> +};
> 
> 
> Andrea Arcangeli (10):
>   mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
>   mm: madvise MADV_USERFAULT
>   mm: PT lock: export double_pt_lock/unlock
>   mm: rmap preparation for remap_anon_pages
>   mm: swp_entry_swapcount
>   mm: sys_remap_anon_pages
>   waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: make userfaultfd_write non blocking
>   userfaultfd: use VM_FAULT_RETRY in handle_userfault()
> 
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/x86/syscalls/syscall_32.tbl       |   2 +
>  arch/x86/syscalls/syscall_64.tbl       |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  fs/Makefile                            |   1 +
>  fs/proc/task_mmu.c                     |   5 +-
>  fs/userfaultfd.c                       | 593 +++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h                |  11 +-
>  include/linux/ksm.h                    |   4 +-
>  include/linux/mm.h                     |   5 +
>  include/linux/mm_types.h               |   2 +-
>  include/linux/swap.h                   |   6 +
>  include/linux/syscalls.h               |   5 +
>  include/linux/userfaultfd.h            |  42 +++
>  include/linux/wait.h                   |   5 +-
>  include/uapi/asm-generic/mman-common.h |   3 +
>  init/Kconfig                           |  10 +
>  kernel/sched/wait.c                    |   7 +-
>  kernel/sys_ni.c                        |   2 +
>  mm/fremap.c                            | 506 ++++++++++++++++++++++++++++
>  mm/huge_memory.c                       | 209 ++++++++++--
>  mm/ksm.c                               |   2 +-
>  mm/madvise.c                           |  19 +-
>  mm/memory.c                            |  14 +
>  mm/mremap.c                            |   2 +-
>  mm/rmap.c                              |   9 +
>  mm/swapfile.c                          |  13 +
>  net/sunrpc/sched.c                     |   2 +-
>  30 files changed, 1447 insertions(+), 46 deletions(-)
>  create mode 100644 fs/userfaultfd.c
>  create mode 100644 include/linux/userfaultfd.h
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/10] RFC: userfault
@ 2014-07-03  1:51   ` Andy Lutomirski
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2014-07-03  1:51 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-mm, linux-kernel
  Cc: "Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Isaku Yamahata, Linux API

On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.

cc:linux-api -- this is certainly worthy of linux-api discussion.

> 
> The combination of these features are what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
> 
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
> 
> If the access could ever happen in kernel context through syscalls
> (not not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting a userfaultfd_write ack).
> 
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
> 
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
> 
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be splitted. In there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
> 
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
> 
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
> 
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
> 
> The code can be found here:
> 
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
> 
> The branch is rebased so you can get updates for example with:
> 
> git fetch && git checkout -f origin/userfault
> 
> Comments welcome, thanks!
> Andrea
> 
> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/syscalls-x86_64.h   |   2 +
>  syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
>  syscalls/syscalls.h         |   2 +
>  syscalls/userfaultfd.c      |  12 ++++++
>  4 files changed, 116 insertions(+)
>  create mode 100644 syscalls/remap_anon_pages.c
>  create mode 100644 syscalls/userfaultfd.c
> 
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
>  	{ .entry = &syscall_sched_setattr },
>  	{ .entry = &syscall_sched_getattr },
>  	{ .entry = &syscall_renameat2 },
> +	{ .entry = &syscall_remap_anon_pages },
> +	{ .entry = &syscall_userfaultfd },
>  };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE3(remap_anon_pages,
> +		unsigned long, dst_start, unsigned long, src_start,
> +		unsigned long, len)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> +	1 * MB, 2 * MB, 4 * MB, 8 * MB,
> +	10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> +	unsigned long max_rand;
> +	if (rand_bool()) {
> +		g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_src = MAP_FAILED;
> +	if (rand_bool()) {
> +		g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_dst = MAP_FAILED;
> +	g_size = size;
> +	g_check = 1;
> +
> +	rec->a1 = (unsigned long) g_dst;
> +	rec->a2 = (unsigned long) g_src;
> +	rec->a3 = g_size;
> +	rec->a4 = 0;
> +
> +	if (rand_bool())
> +		max_rand = -1UL;
> +	else
> +		max_rand = g_size << 1;
> +	if (rand_bool()) {
> +		rec->a3 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a1 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a2 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		if (rand_bool()) {
> +			rec->a4 = rand();
> +		} else
> +			rec->a4 = RAP_ALLOW_SRC_HOLES;
> +	}
> +	if (g_src != MAP_FAILED)
> +		memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	if (g_check && !rec->retval) {
> +		unsigned long size = g_size;
> +		unsigned char *dst = g_dst;
> +		while (size--)
> +			assert(dst[size] == 0xaaU);
> +	}
> +	munmap(g_src, g_size);
> +	munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> +	.name = "remap_anon_pages",
> +	.num_args = 4,
> +	.arg1name = "dst_start",
> +	.arg2name = "src_start",
> +	.arg3name = "len",
> +	.arg4name = "flags",
> +	.group = GROUP_VM,
> +	.sanitise = sanitise_remap_anon_pages,
> +	.post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
>  extern struct syscallentry syscall_sched_getattr;
>  extern struct syscallentry syscall_renameat2;
>  extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> +	.name = "userfaultfd",
> +	.num_args = 1,
> +	.arg1name = "flags",
> +	.arg1type = ARG_LEN,
> +	.rettype = RET_FD,
> +};
> 
> 
> Andrea Arcangeli (10):
>   mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
>   mm: madvise MADV_USERFAULT
>   mm: PT lock: export double_pt_lock/unlock
>   mm: rmap preparation for remap_anon_pages
>   mm: swp_entry_swapcount
>   mm: sys_remap_anon_pages
>   waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: make userfaultfd_write non blocking
>   userfaultfd: use VM_FAULT_RETRY in handle_userfault()
> 
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/x86/syscalls/syscall_32.tbl       |   2 +
>  arch/x86/syscalls/syscall_64.tbl       |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  fs/Makefile                            |   1 +
>  fs/proc/task_mmu.c                     |   5 +-
>  fs/userfaultfd.c                       | 593 +++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h                |  11 +-
>  include/linux/ksm.h                    |   4 +-
>  include/linux/mm.h                     |   5 +
>  include/linux/mm_types.h               |   2 +-
>  include/linux/swap.h                   |   6 +
>  include/linux/syscalls.h               |   5 +
>  include/linux/userfaultfd.h            |  42 +++
>  include/linux/wait.h                   |   5 +-
>  include/uapi/asm-generic/mman-common.h |   3 +
>  init/Kconfig                           |  10 +
>  kernel/sched/wait.c                    |   7 +-
>  kernel/sys_ni.c                        |   2 +
>  mm/fremap.c                            | 506 ++++++++++++++++++++++++++++
>  mm/huge_memory.c                       | 209 ++++++++++--
>  mm/ksm.c                               |   2 +-
>  mm/madvise.c                           |  19 +-
>  mm/memory.c                            |  14 +
>  mm/mremap.c                            |   2 +-
>  mm/rmap.c                              |   9 +
>  mm/swapfile.c                          |  13 +
>  net/sunrpc/sched.c                     |   2 +-
>  30 files changed, 1447 insertions(+), 46 deletions(-)
>  create mode 100644 fs/userfaultfd.c
>  create mode 100644 include/linux/userfaultfd.h
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
  2014-07-02 16:50   ` Andrea Arcangeli
@ 2014-07-03  1:56     ` Andy Lutomirski
  -1 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2014-07-03  1:56 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-mm, linux-kernel
  Cc: "Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Isaku Yamahata, Linux API

On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Once an userfaultfd is created MADV_USERFAULT regions talks through
> the userfaultfd protocol with the thread responsible for doing the
> memory externalization of the process.
> 
> The protocol starts by userland writing the requested/preferred
> USERFAULT_PROTOCOL version into the userfault fd (64bit write), if
> kernel knows it, it will ack it by allowing userland to read 64bit
> from the userfault fd that will contain the same 64bit
> USERFAULT_PROTOCOL version that userland asked. Otherwise userfault
> will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
> will have to try again by writing an older protocol version if
> suitable for its usage too, and read it back again until it stops
> reading -1ULL. After that the userfaultfd protocol starts.
> 
> The protocol consists in the userfault fd reads 64bit in size
> providing userland the fault addresses. After a userfault address has
> been read and the fault is resolved by userland, the application must
> write back 128bits in the form of [ start, end ] range (64bit each)
> that will tell the kernel such a range has been mapped. Multiple read
> userfaults can be resolved in a single range write. poll() can be used
> to know when there are new userfaults to read (POLLIN) and when there
> are threads waiting a wakeup through a range write (POLLOUT).
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

> +#ifdef CONFIG_PROC_FS
> +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> +{
> +	struct userfaultfd_ctx *ctx = f->private_data;
> +	int ret;
> +	wait_queue_t *wq;
> +	struct userfaultfd_wait_queue *uwq;
> +	unsigned long pending = 0, total = 0;
> +
> +	spin_lock(&ctx->fault_wqh.lock);
> +	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +		if (uwq->pending)
> +			pending++;
> +		total++;
> +	}
> +	spin_unlock(&ctx->fault_wqh.lock);
> +
> +	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.

> +
> +SYSCALL_DEFINE1(userfaultfd, int, flags)
> +{
> +	int fd, error;
> +	struct file *file;

This looks like it can't be used more than once in a process.  That will
be unfortunate for libraries.  Would it be feasible to either have
userfaultfd claim a range of addresses or for a vma to be explicitly
associated with a userfaultfd?  (In the latter case, giant PROT_NONE
MAP_NORESERVE mappings could be used.)


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
  2014-07-03  1:56     ` Andy Lutomirski
@ 2014-07-03 13:19       ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-03 13:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: qemu-devel, kvm, linux-mm, linux-kernel,
	"Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Isaku Yamahata, Linux API, Linus Torvalds

Hi Andy,

thanks for CC'ing linux-api.

On Wed, Jul 02, 2014 at 06:56:03PM -0700, Andy Lutomirski wrote:
> On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> > Once an userfaultfd is created MADV_USERFAULT regions talks through
> > the userfaultfd protocol with the thread responsible for doing the
> > memory externalization of the process.
> > 
> > The protocol starts by userland writing the requested/preferred
> > USERFAULT_PROTOCOL version into the userfault fd (64bit write), if
> > kernel knows it, it will ack it by allowing userland to read 64bit
> > from the userfault fd that will contain the same 64bit
> > USERFAULT_PROTOCOL version that userland asked. Otherwise userfault
> > will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
> > will have to try again by writing an older protocol version if
> > suitable for its usage too, and read it back again until it stops
> > reading -1ULL. After that the userfaultfd protocol starts.
> > 
> > The protocol consists in the userfault fd reads 64bit in size
> > providing userland the fault addresses. After a userfault address has
> > been read and the fault is resolved by userland, the application must
> > write back 128bits in the form of [ start, end ] range (64bit each)
> > that will tell the kernel such a range has been mapped. Multiple read
> > userfaults can be resolved in a single range write. poll() can be used
> > to know when there are new userfaults to read (POLLIN) and when there
> > are threads waiting a wakeup through a range write (POLLOUT).
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> > +#ifdef CONFIG_PROC_FS
> > +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> > +{
> > +	struct userfaultfd_ctx *ctx = f->private_data;
> > +	int ret;
> > +	wait_queue_t *wq;
> > +	struct userfaultfd_wait_queue *uwq;
> > +	unsigned long pending = 0, total = 0;
> > +
> > +	spin_lock(&ctx->fault_wqh.lock);
> > +	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> > +		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> > +		if (uwq->pending)
> > +			pending++;
> > +		total++;
> > +	}
> > +	spin_unlock(&ctx->fault_wqh.lock);
> > +
> > +	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
> 
> This should show the protocol version, too.

Ok, does the below look ok?

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 388553e..f9d3e9f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -493,7 +493,13 @@ static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 	}
 	spin_unlock(&ctx->fault_wqh.lock);
 
-	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
+	/*
+	 * If more protocols are added, they will all be shown
+	 * separated by a space. Like this:
+	 *	protocols: 0xaa 0xbb
+	 */
+	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nprotocols:\t%Lx\n",
+			 pending, total, USERFAULTFD_PROTOCOL);
 
 	return ret;
 }


> > +
> > +SYSCALL_DEFINE1(userfaultfd, int, flags)
> > +{
> > +	int fd, error;
> > +	struct file *file;
> 
> This looks like it can't be used more than once in a process.  That will

It can't be used more than once, correct.

	file = ERR_PTR(-EBUSY);
	if (get_mm_slot(current->mm))
		goto out_free_unlock;

If a userfaultfd is already registered for the current mm the second
one gets -EBUSY.

> be unfortunate for libraries.  Would it be feasible to either have

So you envision two userfaultfd memory managers for the same process?
I assume each one would claim separate ranges of memory?

For that case the demultiplexing of userfaults can be entirely managed
by userland.

One libuserfault library can actually register the userfaultfd, and
then the two libs can register into libuserfault and claim their own
ranges. It could run the code of the two libs in the thread context
that waits on the userfaultfd with zero overhead, or message passing
across threads can be used to run both libs in parallel in their own
thread. The demultiplexing code wouldn't be CPU intensive. The
downside is the two scheduling events required if they want to run
their lib code in a separate thread. If we claimed the two different
ranges in the kernel for two different userfaultfds, the kernel would
speak directly with each library thread. That would be the only
advantage, and only if they don't want to run in the context of the
thread that waits on the userfaultfd.
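
Just to illustrate that model with a minimal sketch (not meant to be
authoritative: the registration structure and names like uf_range are
made up for the example, only the 64bit-read/128bit-write protocol is
the one described in patch 8):

	#include <stdint.h>
	#include <unistd.h>

	struct uf_range {
		uint64_t start, end;
		void (*resolve)(uint64_t addr); /* lib maps something at addr */
		struct uf_range *next;
	};

	static struct uf_range *uf_ranges; /* registered by the libs at init */

	static void uf_dispatch_loop(int ufd)
	{
		uint64_t addr, range[2];
		struct uf_range *r;
		long page_size = sysconf(_SC_PAGESIZE);

		while (read(ufd, &addr, sizeof(addr)) == sizeof(addr)) {
			/* find which lib claimed the faulting address */
			for (r = uf_ranges; r; r = r->next)
				if (addr >= r->start && addr < r->end)
					break;
			if (!r)
				continue; /* unclaimed fault, up to policy */
			/* lib-specific code maps something at addr ... */
			r->resolve(addr);
			/* ... then ack the range so the faulting thread wakes */
			range[0] = addr & ~((uint64_t)page_size - 1);
			range[1] = range[0] + page_size;
			write(ufd, range, sizeof(range));
		}
	}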

To increase SMP scalability in the future we could also add a
UFFD_LOAD_BALANCE to distribute userfaults to different userfaultfd,
that if used could relax the -EBUSY (but it wouldn't be two different
claimed ranges for two different libs).

If UFFD_LOAD_BALANCE were passed to the current code, sys_userfaultfd
would return -EINVAL. I haven't implemented it because I'm not sure
such a thing would ever be needed: compared to distributing the
userfaults to different threads in userland, it would only save two
context switches per event. I don't see a problem in adding this
later if a need emerges though.
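
If it gets added later, userland could simply probe for it at
startup, something like the fragment below (just a sketch:
UFFD_LOAD_BALANCE is only a proposed flag and __NR_userfaultfd is
whatever number the arch table ends up assigning; assumes
<sys/syscall.h>, <unistd.h> and <errno.h>):

	int ufd = syscall(__NR_userfaultfd, UFFD_LOAD_BALANCE);
	if (ufd < 0 && errno == EINVAL)
		/* kernel without UFFD_LOAD_BALANCE: use one userfaultfd
		   and distribute the userfaults in userland instead */
		ufd = syscall(__NR_userfaultfd, 0);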

I think the best model for two libs claiming different userfault
ranges, is to run the userfault code of each lib in the context of the
thread that waits on the userfaultfd, and if there's a need to scale
in SMP we'd add UFFD_LOAD_BALANCE so multiple threads can wait on
different userfaultfd and scale optimally in SMP without any need of
spurious context switches.

With the current volatile pages code, the SIGBUS event would also be
mm-wide and require demultiplexing inside the sigbus handler if two
different libs want to claim different ranges. Furthermore the sigbus
would run in the context of the faulting thread, so it would still
need a context switch to scale (with userfaultfd we let the current
thread continue by triggering a schedule in the guest: if FOLL_NOWAIT
fails we spawn a kworker thread doing the async page fault, the
kworker kernel thread then stops in the userfault, and the migration
thread waiting on the userfaultfd is woken up to resolve the
userfault).

Programs like qemu are unlikely to ever need more than one
userfaultfd, so it wouldn't need to use the demultiplexing
library. Currently we don't feel a need for UFFD_LOAD_BALANCE either.

However I'd rather implement UFFD_LOAD_BALANCE now than claim
different ranges in the kernel: that would require building a lookup
structure living on top of the vmas, and it wouldn't be any less
efficient to write such a thing in userland inside a libuserfaultfd
(which qemu would likely never need).

In short I see no benefit in claiming different ranges for the same
mm in the kernel API (that can be done equally efficiently in
userland), but I could see a benefit in a load balancing feature that
scales the load to multiple userfaultfds when UFFD_LOAD_BALANCE is
passed (it could also be done by default by just removing the -EBUSY
failure and without new flags, but it sounds safer to keep -EBUSY if
UFFD_LOAD_BALANCE is not passed to userfaultfd through the flags).

> userfaultfd claim a range of addresses or for a vma to be explicitly
> associated with a userfaultfd?  (In the latter case, giant PROT_NONE
> MAP_NORESERVE mappings could be used.)

To claim ranges MADV_USERFAULT is used and the vmas are
mm-wide. Instead of creating another range lookup on top of the vmas,
I used the vma itself to tell which ranges are claimed for
userfaultfd (or for SIGBUS behavior if no userfaultfd is open).

This API model, with MADV_USERFAULT claiming the userfaultfd ranges,
is what fundamentally prevents you from claiming different ranges for
two different userfaultfds.

About PROT_NONE, note that if you set the mapping as MADV_USERFAULT
there's no point in setting PROT_NONE too; MAP_NORESERVE instead
should already work just fine and is orthogonal.

PROT_NONE is kind of an alternative to userfaults without
MADV_USERFAULT. The problem is that PROT_NONE requires splitting VMAs
(and taking the mmap_sem for writing and mangling and allocating
vmas), until you actually run out of vmas, get -ENOMEM and the app
crashes. The whole point of postcopy live migration is that it has to
work with massively large guests, so we cannot risk running out of
vmas.

With MADV_USERFAULT you never mangle the vma and you work with a
single gigantic vma that never gets split. You don't need to mremap
over the PROT_NONE to handle the fault.
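
To make the comparison concrete, the whole setup on the
MADV_USERFAULT side is roughly the below (only a sketch, assuming
<sys/mman.h> and <err.h>; guest_ram_size is a placeholder and the
MADV_USERFAULT constant only exists with the patched headers):

	void *area = mmap(NULL, guest_ram_size, PROT_READ|PROT_WRITE,
			  MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
	if (area == MAP_FAILED)
		err(1, "mmap");
	if (madvise(area, guest_ram_size, MADV_USERFAULT))
		err(1, "madvise(MADV_USERFAULT)");
	/* one giant vma: faults in it are reported to the userfaultfd
	 * (or raise SIGBUS if none is open), and resolving them never
	 * needs to split or mangle the vma */

With PROT_NONE you'd instead have to mprotect/mremap over every
faulting range as you resolve it, splitting the vma each time.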

Even without userfaultfd and just with MADV_USERFAULT+remap_anon_pages
using SIGBUS is a much more efficient and more reliable alternative
than PROT_NONE.

With userfaultfd + remap_anon_pages things are even more efficient
because there are no signals involved at all, and syscalls won't
return weird errors to userland like they would with SIGBUS and
without userfaultfd. With userfaultfd the thread stopping in the
userfault won't even exit the kernel; it'll wait for a wakeup within
the kernel. And remap_anon_pages, which resolves the fault, just
updates the ptes or hugepmds without touching the vma (plus
remap_anon_pages is very strict, so the chance of memory corruption
going unnoticed is next to nil, unlike mremap).
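
Put together, the fault-resolution path in the migration thread is
more or less the following (again just a sketch: it assumes ufd is
the userfaultfd and page_size the page size, __NR_remap_anon_pages is
whatever number the patched kernel assigns, recv_page_from_source()
is a placeholder for the network transfer into a local anonymous
buffer, and a 0 retval on success is assumed as in the trinity test
above):

	uint64_t addr, range[2];

	/* blocks until some thread hits a missing page */
	read(ufd, &addr, sizeof(addr));
	addr &= ~((uint64_t)page_size - 1);

	/* placeholder: returns an anonymous private buffer that already
	 * holds the page content received from the source node */
	void *src = recv_page_from_source(addr);

	/* zerocopy move into place: only ptes/hugepmds are updated */
	if (syscall(__NR_remap_anon_pages, addr, (unsigned long)src,
		    page_size, 0))
		err(1, "remap_anon_pages");

	/* write back the [ start, end ] range to wake the blocked thread(s) */
	range[0] = addr;
	range[1] = addr + page_size;
	write(ufd, range, sizeof(range));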

Another thing people should comment on is that currently you cannot
set MADV_USERFAULT on file-backed vmas; that sounds like an issue for
volatile pages that should be fixed. Currently -EINVAL is returned if
MADV_USERFAULT is run on non-anonymous vmas. I don't think it's a
problem to add it: do_linear_fault would just trigger a userfault
too. Then it's up to userland how it resolves it before sending the
wakeup to the stopped thread with userfaultfd_write. As long as it
doesn't fault in do_linear_fault again it'll just work.

How you solve the fault before acking it with userfaultfd_write
(remap_file_pages, remap_anon_pages, mremap, mmap, anything) is
entirely up to userland. The only faults the userfault tracks are the
pte_none/pmd_none kind (same as PROT_NONE). As long as you put
something in the pte/pmd so it's not none anymore, you can
userfaultfd_write and it'll just work. Clearing VM_USERFAULT would
also work to resolve the fault of course (and then it'd page in from
disk if it was file-backed, or map a zero page if it was an anonymous
vma), but clearing VM_USERFAULT (with MADV_NOUSERFAULT) would split
the vma so it's not recommended...

Thanks,
Andrea

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
@ 2014-07-03 13:19       ` Andrea Arcangeli
  0 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-03 13:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: qemu-devel, kvm, linux-mm, linux-kernel,
	"Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter)

Hi Andy,

thanks for CC'ing linux-api.

On Wed, Jul 02, 2014 at 06:56:03PM -0700, Andy Lutomirski wrote:
> On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> > Once an userfaultfd is created MADV_USERFAULT regions talks through
> > the userfaultfd protocol with the thread responsible for doing the
> > memory externalization of the process.
> > 
> > The protocol starts by userland writing the requested/preferred
> > USERFAULT_PROTOCOL version into the userfault fd (64bit write), if
> > kernel knows it, it will ack it by allowing userland to read 64bit
> > from the userfault fd that will contain the same 64bit
> > USERFAULT_PROTOCOL version that userland asked. Otherwise userfault
> > will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
> > will have to try again by writing an older protocol version if
> > suitable for its usage too, and read it back again until it stops
> > reading -1ULL. After that the userfaultfd protocol starts.
> > 
> > The protocol consists in the userfault fd reads 64bit in size
> > providing userland the fault addresses. After a userfault address has
> > been read and the fault is resolved by userland, the application must
> > write back 128bits in the form of [ start, end ] range (64bit each)
> > that will tell the kernel such a range has been mapped. Multiple read
> > userfaults can be resolved in a single range write. poll() can be used
> > to know when there are new userfaults to read (POLLIN) and when there
> > are threads waiting a wakeup through a range write (POLLOUT).
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> > +#ifdef CONFIG_PROC_FS
> > +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> > +{
> > +	struct userfaultfd_ctx *ctx = f->private_data;
> > +	int ret;
> > +	wait_queue_t *wq;
> > +	struct userfaultfd_wait_queue *uwq;
> > +	unsigned long pending = 0, total = 0;
> > +
> > +	spin_lock(&ctx->fault_wqh.lock);
> > +	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> > +		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> > +		if (uwq->pending)
> > +			pending++;
> > +		total++;
> > +	}
> > +	spin_unlock(&ctx->fault_wqh.lock);
> > +
> > +	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
> 
> This should show the protocol version, too.

Ok, does the below look ok?

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 388553e..f9d3e9f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -493,7 +493,13 @@ static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 	}
 	spin_unlock(&ctx->fault_wqh.lock);
 
-	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
+	/*
+	 * If more protocols will be added, there will be all shown
+	 * separated by a space. Like this:
+	 *	protocols: 0xaa 0xbb
+	 */
+	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nprotocols:\t%Lx\n",
+			 pending, total, USERFAULTFD_PROTOCOL);
 
 	return ret;
 }


> > +
> > +SYSCALL_DEFINE1(userfaultfd, int, flags)
> > +{
> > +	int fd, error;
> > +	struct file *file;
> 
> This looks like it can't be used more than once in a process.  That will

It can't be used more than once, correct.

	file = ERR_PTR(-EBUSY);
	if (get_mm_slot(current->mm))
		goto out_free_unlock;

If a userfaultfd is already registered for the current mm the second
one gets -EBUSY.

> be unfortunate for libraries.  Would it be feasible to either have

So you envision two userfaultfd memory managers for the same process?
I assume each one would claim separate ranges of memory?

For that case the demultiplexing of userfaults can be entirely managed
by userland.

One libuserfault library can actually register the userfaultfd, and
then the two libs can register into libuserfault and claim their own
ranges. It could run the code of the two libs in the thread context
that waits on the userfaultfd with zero overhead, or message passing
across threads can be used to run both libs in parallel in their own
thread. The demultiplexing code wouldn't be CPU intensive. The
downside are two schedule event required if they want to run their lib
code in a separate thread. If we'd claim the two different ranges in
the kernel for two different userfaultfd, the kernel would be speaking
directly with each library thread. That'd be the only advantage if
they don't want to run in the context of the thread that waits on the
userfaultfd.

To increase SMP scalability in the future we could also add a
UFFD_LOAD_BALANCE to distribute userfaults to different userfaultfd,
that if used could relax the -EBUSY (but it wouldn't be two different
claimed ranges for two different libs).

If passing UFFD_LOAD_BALANCE to the current code sys_userfaultfd would
return -EINVAL. I haven't implemented it because I'm not sure if such
thing would ever be needed. Compared to distributing the userfaults in
userland to different threads that would only save two context
switches per event. I don't see a problem in adding this later if a
need emerges though.

I think the best model for two libs claiming different userfault
ranges, is to run the userfault code of each lib in the context of the
thread that waits on the userfaultfd, and if there's a need to scale
in SMP we'd add UFFD_LOAD_BALANCE so multiple threads can wait on
different userfaultfd and scale optimally in SMP without any need of
spurious context switches.

With the volatile pages current code, the SIGBUS event would also be
mm-wide and require demultiplexing inside the sigbus handler if two
different libs wants to claim different ranges. Furthermore the sigbus
would run in the context of the faulting thread so it would still need
to context switch to scale (with userfaultfd we let the current thread
continue by triggering a schedule in the guest, if FOLL_NOWAIT fails
and we spawn a kworker thread doing the async page fault, then the
kworker kernel thread stops in the userfault and the migration thread
waiting on the userfaultfd is woken up to resolve the userfault).

Programs like qemu are unlikely to ever need more than one
userfaultfd, so it wouldn't need to use the demultiplexing
library. Currently we don't feel a need for UFFD_LOAD_BALANCE either.

However I'd rather implement UFFD_LOAD_BALANCE now, than claiming
different ranges in kernel that would require to build a lookup
structure that lives on top of the vmas and it wouldn't be less
efficient to write such a thing in userland inside a libuserfaultfd
(that qemu would likely never need).

In short I see no benefit in claiming different ranges for the same mm
in the kernel API (that can be done equally efficient in userland),
but I could see a benefit in a load balancing feature to scale the
load to multiple userfaultfd if passing UFFD_LOAD_BALANCE (it could
also be done by default by just removing the -EBUSY failure and
without new flags, but it sounds safer to keep -EBUSY if
UFFD_LOAD_BALANCE is not passed to userfaultfd through the flags).

> userfaultfd claim a range of addresses or for a vma to be explicitly
> associated with a userfaultfd?  (In the latter case, giant PROT_NONE
> MAP_NORESERVE mappings could be used.)

To claim ranges MADV_USERFAULT is used and the vmas are
mm-wide. Instead of creating another range lookup on top the vmas, I
used the vma itself to tell which ranges are claimed for userfaultfd
(or SIGBUS behavior if userfaultfd isn't open).

This API model with MADV_USERFAULT to claim the userfaultfd ranges is
want fundamentally prevents you to claim different ranges for two
different userfaultfd.

About PROT_NONE, note that if you set the mapping as MADV_USERFAULT
there's no point in setting PROT_NONE too, MAP_NORESERVE instead
should already work just fine and it's orthogonal.

PROT_NONE is kind of an alternative to userfaults without
MADV_USERFAULT. Problem is that PROT_NONE requires to split VMAs (and
take the mmap_sem for writing and mangle and allocate vmas), until you
actually run out of vmas and you get -ENOMEM and the app crashes. The
whole point of postcopy live migration is that it will work with
massively large guests so we cannot risk running out of vmas.

With MADV_USERFAULT you never mangle the vma and you work with a
single gigantic vma that never gets split. You don't need to mremap
over the PROT_NONE to handle the fault.

Even without userfaultfd and just with MADV_USERFAULT+remap_anon_pages
using SIGBUS is a much more efficient and more reliable alternative
than PROT_NONE.

With userfaultfd + remap_anon_pages things are even more efficient
because there are no signals involved at all, and syscalls won't
return weird errors to userland like it would happen with SIGBUS and
without userfaultfd. With userfaultfd the thread stopping in the
userfault won't even exit the kernel, it'll wait a wakeup within the
kernel. And remap_anon_pages that resolves the fault, just updates the
ptes or hugepmds without touching the vma (plus remap_anon_pages is
very strict so the chance of memory corruption going unnoticed is next
to nil, unlike mremap).

Another thing people should comment on is that currently you cannot
set MADV_USERFAULT on file-backed vmas; that sounds like an issue for
volatile pages that should be fixed. Currently -EINVAL is returned if
MADV_USERFAULT is run on non-anonymous vmas. I don't think it's a
problem to add it: do_linear_fault would just trigger a userfault
too. Then it's up to userland how it resolves the fault before sending
the wakeup to the stopped thread with userfaultfd_write. As long as it
doesn't fault in do_linear_fault again it'll just work.

How you resolve the fault before acking it with userfaultfd_write
(remap_file_pages, remap_anon_pages, mremap, mmap, anything) is
entirely up to userland. The only faults the userfault tracks are the
pte_none/pmd_none kind (same as PROT_NONE). As long as you put
something in the pte/pmd so it's not none anymore, you can
userfaultfd_write and it'll just work. Clearing VM_USERFAULT would of
course also resolve the fault (and then it'd page in from disk if the
vma was file-backed, or map a zero page if it was anonymous), but
clearing VM_USERFAULT (with MADV_NOUSERFAULT) would split the vma, so
it's not recommended...

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
  2014-07-02 16:50 ` Andrea Arcangeli
@ 2014-07-03 13:45   ` Christopher Covington
  -1 siblings, 0 replies; 59+ messages in thread
From: Christopher Covington @ 2014-07-03 13:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-mm, linux-kernel, Robert Love,
	Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi, Andrew Jones,
	KOSAKI Motohiro, Michel Lespinasse, Taras Glek, Juan Quintela,
	Hugh Dickins, Isaku Yamahata, Mel Gorman, Android Kernel Team,
	Mel Gorman, Dr. David Alan Gilbert, Huangpeng (Peter),
	Anthony Liguori, Mike Hommey, Keith Packard, Wenchao Xia,
	Minchan Kim

Hi Andrea,

On 07/02/2014 12:50 PM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.
> 
> The combination of these features are what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
> 
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
> 
> If the access could ever happen in kernel context through syscalls
> (not not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting a userfaultfd_write ack).
> 
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
> 
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
> 
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be splitted. In there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
> 
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
> 
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
> 
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
> 
> The code can be found here:
> 
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
> 
> The branch is rebased so you can get updates for example with:
> 
> git fetch && git checkout -f origin/userfault
> 
> Comments welcome, thanks!

CRIU uses the soft dirty bit in /proc/pid/clear_refs and /proc/pid/pagemap to
implement its pre-copy memory migration.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/soft-dirty.txt

Would it make sense to use a similar interaction model of peeking and poking
at /proc/pid/ files for post-copy memory migration facilities?
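
For comparison, the soft-dirty pre-copy interaction is a plain
peek/poke cycle on procfs; a minimal sketch (soft-dirty is bit 55 of
each 64-bit /proc/pid/pagemap entry, and writing "4" to
/proc/pid/clear_refs resets it, per soft-dirty.txt; the helper names
are made up for the example):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define SOFT_DIRTY_BIT	(1ULL << 55)	/* per Documentation/vm/soft-dirty.txt */

/* Peek: has the page at vaddr been written since the last clear?
 * Returns 1/0, or -1 on error. */
static int page_soft_dirty(pid_t pid, unsigned long vaddr)
{
	char path[64];
	uint64_t entry;
	long page = sysconf(_SC_PAGESIZE);
	int fd, ret = -1;

	snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (pread(fd, &entry, sizeof(entry),
		  (off_t)(vaddr / page) * sizeof(entry)) == sizeof(entry))
		ret = !!(entry & SOFT_DIRTY_BIT);
	close(fd);
	return ret;
}

/* Poke: start a new tracking interval by clearing all soft-dirty bits. */
static int clear_soft_dirty(pid_t pid)
{
	char path[64];
	int fd, ret = -1;

	snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, "4", 1) == 1)
		ret = 0;
	close(fd);
	return ret;
}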

Christopher

> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/syscalls-x86_64.h   |   2 +
>  syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
>  syscalls/syscalls.h         |   2 +
>  syscalls/userfaultfd.c      |  12 ++++++
>  4 files changed, 116 insertions(+)
>  create mode 100644 syscalls/remap_anon_pages.c
>  create mode 100644 syscalls/userfaultfd.c
> 
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
>  	{ .entry = &syscall_sched_setattr },
>  	{ .entry = &syscall_sched_getattr },
>  	{ .entry = &syscall_renameat2 },
> +	{ .entry = &syscall_remap_anon_pages },
> +	{ .entry = &syscall_userfaultfd },
>  };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE3(remap_anon_pages,
> +		unsigned long, dst_start, unsigned long, src_start,
> +		unsigned long, len)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> +	1 * MB, 2 * MB, 4 * MB, 8 * MB,
> +	10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> +	unsigned long max_rand;
> +	if (rand_bool()) {
> +		g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_src = MAP_FAILED;
> +	if (rand_bool()) {
> +		g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_dst = MAP_FAILED;
> +	g_size = size;
> +	g_check = 1;
> +
> +	rec->a1 = (unsigned long) g_dst;
> +	rec->a2 = (unsigned long) g_src;
> +	rec->a3 = g_size;
> +	rec->a4 = 0;
> +
> +	if (rand_bool())
> +		max_rand = -1UL;
> +	else
> +		max_rand = g_size << 1;
> +	if (rand_bool()) {
> +		rec->a3 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a1 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a2 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		if (rand_bool()) {
> +			rec->a4 = rand();
> +		} else
> +			rec->a4 = RAP_ALLOW_SRC_HOLES;
> +	}
> +	if (g_src != MAP_FAILED)
> +		memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	if (g_check && !rec->retval) {
> +		unsigned long size = g_size;
> +		unsigned char *dst = g_dst;
> +		while (size--)
> +			assert(dst[size] == 0xaaU);
> +	}
> +	munmap(g_src, g_size);
> +	munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> +	.name = "remap_anon_pages",
> +	.num_args = 4,
> +	.arg1name = "dst_start",
> +	.arg2name = "src_start",
> +	.arg3name = "len",
> +	.arg4name = "flags",
> +	.group = GROUP_VM,
> +	.sanitise = sanitise_remap_anon_pages,
> +	.post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
>  extern struct syscallentry syscall_sched_getattr;
>  extern struct syscallentry syscall_renameat2;
>  extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> +	.name = "userfaultfd",
> +	.num_args = 1,
> +	.arg1name = "flags",
> +	.arg1type = ARG_LEN,
> +	.rettype = RET_FD,
> +};
> 
> 
> Andrea Arcangeli (10):
>   mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
>   mm: madvise MADV_USERFAULT
>   mm: PT lock: export double_pt_lock/unlock
>   mm: rmap preparation for remap_anon_pages
>   mm: swp_entry_swapcount
>   mm: sys_remap_anon_pages
>   waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: make userfaultfd_write non blocking
>   userfaultfd: use VM_FAULT_RETRY in handle_userfault()
> 
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/x86/syscalls/syscall_32.tbl       |   2 +
>  arch/x86/syscalls/syscall_64.tbl       |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  fs/Makefile                            |   1 +
>  fs/proc/task_mmu.c                     |   5 +-
>  fs/userfaultfd.c                       | 593 +++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h                |  11 +-
>  include/linux/ksm.h                    |   4 +-
>  include/linux/mm.h                     |   5 +
>  include/linux/mm_types.h               |   2 +-
>  include/linux/swap.h                   |   6 +
>  include/linux/syscalls.h               |   5 +
>  include/linux/userfaultfd.h            |  42 +++
>  include/linux/wait.h                   |   5 +-
>  include/uapi/asm-generic/mman-common.h |   3 +
>  init/Kconfig                           |  10 +
>  kernel/sched/wait.c                    |   7 +-
>  kernel/sys_ni.c                        |   2 +
>  mm/fremap.c                            | 506 ++++++++++++++++++++++++++++
>  mm/huge_memory.c                       | 209 ++++++++++--
>  mm/ksm.c                               |   2 +-
>  mm/madvise.c                           |  19 +-
>  mm/memory.c                            |  14 +
>  mm/mremap.c                            |   2 +-
>  mm/rmap.c                              |   9 +
>  mm/swapfile.c                          |  13 +
>  net/sunrpc/sched.c                     |   2 +-
>  30 files changed, 1447 insertions(+), 46 deletions(-)
>  create mode 100644 fs/userfaultfd.c
>  create mode 100644 include/linux/userfaultfd.h
> 
> 


-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
@ 2014-07-03 13:45   ` Christopher Covington
  0 siblings, 0 replies; 59+ messages in thread
From: Christopher Covington @ 2014-07-03 13:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-mm, linux-kernel, Robert Love,
	Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi, Andrew Jones,
	KOSAKI Motohiro, Michel Lespinasse, Taras Glek, Juan Quintela,
	Hugh Dickins, Isaku Yamahata, Mel Gorman, Android Kernel Team,
	Mel Gorman, Dr. David Alan Gilbert, Huangpeng (Peter),
	Anthony Liguori, Mike Hommey, Keith Packard, Wenchao Xia,
	Minchan Kim, Dmitry Adamushko, Johannes Weiner, Paolo Bonzini,
	Andrew Morton, crml

Hi Andrea,

On 07/02/2014 12:50 PM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.
> 
> The combination of these features are what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
> 
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
> 
> If the access could ever happen in kernel context through syscalls
> (not not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting a userfaultfd_write ack).
> 
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
> 
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
> 
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be splitted. In there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
> 
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
> 
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
> 
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
> 
> The code can be found here:
> 
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
> 
> The branch is rebased so you can get updates for example with:
> 
> git fetch && git checkout -f origin/userfault
> 
> Comments welcome, thanks!

CRIU uses the soft dirty bit in /proc/pid/clear_refs and /proc/pid/pagemap to
implement its pre-copy memory migration.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/soft-dirty.txt

Would it make sense to use a similar interaction model of peeking and poking
at /proc/pid/ files for post-copy memory migration facilities?

Christopher

> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/syscalls-x86_64.h   |   2 +
>  syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
>  syscalls/syscalls.h         |   2 +
>  syscalls/userfaultfd.c      |  12 ++++++
>  4 files changed, 116 insertions(+)
>  create mode 100644 syscalls/remap_anon_pages.c
>  create mode 100644 syscalls/userfaultfd.c
> 
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
>  	{ .entry = &syscall_sched_setattr },
>  	{ .entry = &syscall_sched_getattr },
>  	{ .entry = &syscall_renameat2 },
> +	{ .entry = &syscall_remap_anon_pages },
> +	{ .entry = &syscall_userfaultfd },
>  };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE3(remap_anon_pages,
> +		unsigned long, dst_start, unsigned long, src_start,
> +		unsigned long, len)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> +	1 * MB, 2 * MB, 4 * MB, 8 * MB,
> +	10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> +	unsigned long max_rand;
> +	if (rand_bool()) {
> +		g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_src = MAP_FAILED;
> +	if (rand_bool()) {
> +		g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_dst = MAP_FAILED;
> +	g_size = size;
> +	g_check = 1;
> +
> +	rec->a1 = (unsigned long) g_dst;
> +	rec->a2 = (unsigned long) g_src;
> +	rec->a3 = g_size;
> +	rec->a4 = 0;
> +
> +	if (rand_bool())
> +		max_rand = -1UL;
> +	else
> +		max_rand = g_size << 1;
> +	if (rand_bool()) {
> +		rec->a3 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a1 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a2 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		if (rand_bool()) {
> +			rec->a4 = rand();
> +		} else
> +			rec->a4 = RAP_ALLOW_SRC_HOLES;
> +	}
> +	if (g_src != MAP_FAILED)
> +		memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	if (g_check && !rec->retval) {
> +		unsigned long size = g_size;
> +		unsigned char *dst = g_dst;
> +		while (size--)
> +			assert(dst[size] == 0xaaU);
> +	}
> +	munmap(g_src, g_size);
> +	munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> +	.name = "remap_anon_pages",
> +	.num_args = 4,
> +	.arg1name = "dst_start",
> +	.arg2name = "src_start",
> +	.arg3name = "len",
> +	.arg4name = "flags",
> +	.group = GROUP_VM,
> +	.sanitise = sanitise_remap_anon_pages,
> +	.post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
>  extern struct syscallentry syscall_sched_getattr;
>  extern struct syscallentry syscall_renameat2;
>  extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> +	.name = "userfaultfd",
> +	.num_args = 1,
> +	.arg1name = "flags",
> +	.arg1type = ARG_LEN,
> +	.rettype = RET_FD,
> +};
> 
> 
> Andrea Arcangeli (10):
>   mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
>   mm: madvise MADV_USERFAULT
>   mm: PT lock: export double_pt_lock/unlock
>   mm: rmap preparation for remap_anon_pages
>   mm: swp_entry_swapcount
>   mm: sys_remap_anon_pages
>   waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: make userfaultfd_write non blocking
>   userfaultfd: use VM_FAULT_RETRY in handle_userfault()
> 
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/x86/syscalls/syscall_32.tbl       |   2 +
>  arch/x86/syscalls/syscall_64.tbl       |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  fs/Makefile                            |   1 +
>  fs/proc/task_mmu.c                     |   5 +-
>  fs/userfaultfd.c                       | 593 +++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h                |  11 +-
>  include/linux/ksm.h                    |   4 +-
>  include/linux/mm.h                     |   5 +
>  include/linux/mm_types.h               |   2 +-
>  include/linux/swap.h                   |   6 +
>  include/linux/syscalls.h               |   5 +
>  include/linux/userfaultfd.h            |  42 +++
>  include/linux/wait.h                   |   5 +-
>  include/uapi/asm-generic/mman-common.h |   3 +
>  init/Kconfig                           |  10 +
>  kernel/sched/wait.c                    |   7 +-
>  kernel/sys_ni.c                        |   2 +
>  mm/fremap.c                            | 506 ++++++++++++++++++++++++++++
>  mm/huge_memory.c                       | 209 ++++++++++--
>  mm/ksm.c                               |   2 +-
>  mm/madvise.c                           |  19 +-
>  mm/memory.c                            |  14 +
>  mm/mremap.c                            |   2 +-
>  mm/rmap.c                              |   9 +
>  mm/swapfile.c                          |  13 +
>  net/sunrpc/sched.c                     |   2 +-
>  30 files changed, 1447 insertions(+), 46 deletions(-)
>  create mode 100644 fs/userfaultfd.c
>  create mode 100644 include/linux/userfaultfd.h
> 
> 


-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
@ 2014-07-03 13:45   ` Christopher Covington
  0 siblings, 0 replies; 59+ messages in thread
From: Christopher Covington @ 2014-07-03 13:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robert Love, Dave Hansen, Jan Kara, kvm, Neil Brown,
	Stefan Hajnoczi, qemu-devel, crml, linux-mm, KOSAKI Motohiro,
	Michel Lespinasse, Taras Glek, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Android Kernel Team, Andrew Jones,
	Mel Gorman, Huangpeng (Peter),
	Dr. David Alan Gilbert, Anthony Liguori, Paolo Bonzini,
	Keith Packard, Wenchao Xia, linux-kernel, Minchan Kim,
	Dmitry Adamushko, Johannes Weiner, Mike Hommey, Andrew Morton

Hi Andrea,

On 07/02/2014 12:50 PM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.
> 
> The combination of these features are what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
> 
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
> 
> If the access could ever happen in kernel context through syscalls
> (not not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting a userfaultfd_write ack).
> 
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
> 
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
> 
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be splitted. In there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
> 
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
> 
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
> 
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
> 
> The code can be found here:
> 
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
> 
> The branch is rebased so you can get updates for example with:
> 
> git fetch && git checkout -f origin/userfault
> 
> Comments welcome, thanks!

CRIU uses the soft dirty bit in /proc/pid/clear_refs and /proc/pid/pagemap to
implement its pre-copy memory migration.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/soft-dirty.txt

Would it make sense to use a similar interaction model of peeking and poking
at /proc/pid/ files for post-copy memory migration facilities?
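
For concreteness, a minimal sketch of that peek/poke cycle (error handling
omitted; the helper names are illustrative, and the PM_SOFT_DIRTY bit
position follows the soft-dirty.txt document linked above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

#define PM_SOFT_DIRTY	(1ULL << 55)	/* per Documentation/vm/soft-dirty.txt */

/* poke: reset the soft-dirty bits so the next pass sees only new writes */
static void clear_soft_dirty(pid_t pid)
{
	char path[64];
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int) pid);
	fd = open(path, O_WRONLY);
	write(fd, "4", 1);
	close(fd);
}

/* peek: was this page written to since the last clear_refs poke? */
static int page_soft_dirty(pid_t pid, unsigned long vaddr)
{
	char path[64];
	uint64_t entry = 0;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/pagemap", (int) pid);
	fd = open(path, O_RDONLY);
	pread(fd, &entry, sizeof(entry),
	      (vaddr / getpagesize()) * sizeof(entry));
	close(fd);
	return !!(entry & PM_SOFT_DIRTY);
}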

Christopher

> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/syscalls-x86_64.h   |   2 +
>  syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
>  syscalls/syscalls.h         |   2 +
>  syscalls/userfaultfd.c      |  12 ++++++
>  4 files changed, 116 insertions(+)
>  create mode 100644 syscalls/remap_anon_pages.c
>  create mode 100644 syscalls/userfaultfd.c
> 
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
>  	{ .entry = &syscall_sched_setattr },
>  	{ .entry = &syscall_sched_getattr },
>  	{ .entry = &syscall_renameat2 },
> +	{ .entry = &syscall_remap_anon_pages },
> +	{ .entry = &syscall_userfaultfd },
>  };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE3(remap_anon_pages,
> +		unsigned long, dst_start, unsigned long, src_start,
> +		unsigned long, len)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> +	1 * MB, 2 * MB, 4 * MB, 8 * MB,
> +	10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> +	unsigned long max_rand;
> +	if (rand_bool()) {
> +		g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_src = MAP_FAILED;
> +	if (rand_bool()) {
> +		g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_dst = MAP_FAILED;
> +	g_size = size;
> +	g_check = 1;
> +
> +	rec->a1 = (unsigned long) g_dst;
> +	rec->a2 = (unsigned long) g_src;
> +	rec->a3 = g_size;
> +	rec->a4 = 0;
> +
> +	if (rand_bool())
> +		max_rand = -1UL;
> +	else
> +		max_rand = g_size << 1;
> +	if (rand_bool()) {
> +		rec->a3 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a1 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a2 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		if (rand_bool()) {
> +			rec->a4 = rand();
> +		} else
> +			rec->a4 = RAP_ALLOW_SRC_HOLES;
> +	}
> +	if (g_src != MAP_FAILED)
> +		memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	if (g_check && !rec->retval) {
> +		unsigned long size = g_size;
> +		unsigned char *dst = g_dst;
> +		while (size--)
> +			assert(dst[size] == 0xaaU);
> +	}
> +	munmap(g_src, g_size);
> +	munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> +	.name = "remap_anon_pages",
> +	.num_args = 4,
> +	.arg1name = "dst_start",
> +	.arg2name = "src_start",
> +	.arg3name = "len",
> +	.arg4name = "flags",
> +	.group = GROUP_VM,
> +	.sanitise = sanitise_remap_anon_pages,
> +	.post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
>  extern struct syscallentry syscall_sched_getattr;
>  extern struct syscallentry syscall_renameat2;
>  extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> +	.name = "userfaultfd",
> +	.num_args = 1,
> +	.arg1name = "flags",
> +	.arg1type = ARG_LEN,
> +	.rettype = RET_FD,
> +};
> 
> 
> Andrea Arcangeli (10):
>   mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
>   mm: madvise MADV_USERFAULT
>   mm: PT lock: export double_pt_lock/unlock
>   mm: rmap preparation for remap_anon_pages
>   mm: swp_entry_swapcount
>   mm: sys_remap_anon_pages
>   waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: make userfaultfd_write non blocking
>   userfaultfd: use VM_FAULT_RETRY in handle_userfault()
> 
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/x86/syscalls/syscall_32.tbl       |   2 +
>  arch/x86/syscalls/syscall_64.tbl       |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  fs/Makefile                            |   1 +
>  fs/proc/task_mmu.c                     |   5 +-
>  fs/userfaultfd.c                       | 593 +++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h                |  11 +-
>  include/linux/ksm.h                    |   4 +-
>  include/linux/mm.h                     |   5 +
>  include/linux/mm_types.h               |   2 +-
>  include/linux/swap.h                   |   6 +
>  include/linux/syscalls.h               |   5 +
>  include/linux/userfaultfd.h            |  42 +++
>  include/linux/wait.h                   |   5 +-
>  include/uapi/asm-generic/mman-common.h |   3 +
>  init/Kconfig                           |  10 +
>  kernel/sched/wait.c                    |   7 +-
>  kernel/sys_ni.c                        |   2 +
>  mm/fremap.c                            | 506 ++++++++++++++++++++++++++++
>  mm/huge_memory.c                       | 209 ++++++++++--
>  mm/ksm.c                               |   2 +-
>  mm/madvise.c                           |  19 +-
>  mm/memory.c                            |  14 +
>  mm/mremap.c                            |   2 +-
>  mm/rmap.c                              |   9 +
>  mm/swapfile.c                          |  13 +
>  net/sunrpc/sched.c                     |   2 +-
>  30 files changed, 1447 insertions(+), 46 deletions(-)
>  create mode 100644 fs/userfaultfd.c
>  create mode 100644 include/linux/userfaultfd.h
> 
> 


-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
  2014-07-03 13:45   ` Christopher Covington
  (?)
@ 2014-07-03 14:08     ` Andrea Arcangeli
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrea Arcangeli @ 2014-07-03 14:08 UTC (permalink / raw)
  To: Christopher Covington
  Cc: qemu-devel, kvm, linux-mm, linux-kernel, Robert Love,
	Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi, Andrew Jones,
	KOSAKI Motohiro, Michel Lespinasse, Taras Glek, Juan Quintela,
	Hugh Dickins, Isaku Yamahata, Mel Gorman, Android Kernel Team,
	Mel Gorman, Dr. David Alan Gilbert, Huangpeng (Peter),
	Anthony Liguori, Mike Hommey, K

Hi Christopher,

On Thu, Jul 03, 2014 at 09:45:07AM -0400, Christopher Covington wrote:
> CRIU uses the soft dirty bit in /proc/pid/clear_refs and /proc/pid/pagemap to
> implement its pre-copy memory migration.
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/soft-dirty.txt
> 
> Would it make sense to use a similar interaction model of peeking and poking
> at /proc/pid/ files for post-copy memory migration facilities?

We plan to use the pagemap information to optimize precopy live
migration, but that's orthogonal to postcopy live migration.

We already combine precopy and postcopy live migration.

In addition to the dirty bit tracking with the soft-dirty clear_refs
feature, the pagemap bits can also tell, for example, which pages are
missing in the source node, instead of the current memcmp(0) check
that avoids transferring zero pages. With pagemap we can skip a
superfluous zero page fault (David suggested this).

Postcopy live migration poses a different problem. Without postcopy
there's no way to migrate 100GByte guests with a heavy load inside
them; in fact even the first "optimistic" precopy pass should only
migrate those pages that already have the dirty bit set by the time
we attempt to send them.

With postcopy we can also guarantee that the maximum amount of data
transferred during precopy+postcopy is twice the size of the guest. So
you know exactly the maximum time live migration will take depending
on your network bandwidth and it cannot fail no matter the load or the
size of the guest. Slowing down the guest with autoconverge isn't
needed anymore.

The userfault only happens in the destination node. The problem we
face is that we must start the guest in the destination node even
though a significant amount of its memory is still in the source node.

With postcopy migration the pages are neither dirty nor present in
the destination node, they're just holes, and in fact we already know
exactly which ones are missing without having to check pagemap.

It's up to the guest OS which pages it decides to touch; we cannot
know. We already know where the holes are, but we don't know if the
guest will touch them during its runtime while the memory is still
externalized.

If the guest touches any hole we need to stop the guest somehow and
be notified immediately, so we can transfer the page, fill the hole,
and let it continue ASAP.
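
To sketch how that looks on the destination node with this RFC's
blocking-read()/write()-ack userfaultfd protocol: the loop below is
hypothetical, the record formats read from and written to the fd,
recv_page_from_source(), guest_base and staging_base are invented for
illustration, guest_base is assumed 2M aligned, and the staging area
stands in for the MADV_DONTFORK source vma described in the patches.

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SYS_remap_anon_pages 317		/* x86_64 number from this series */
#define HPAGE_SIZE (2UL * 1024 * 1024)

/* hypothetical helpers/globals, not part of the proposed kernel API */
extern void recv_page_from_source(unsigned long offset, void *dst,
				  unsigned long len);
extern unsigned char *guest_base;	/* MADV_USERFAULT guest memory */
extern unsigned char *staging_base;	/* MADV_DONTFORK staging area */

static void serve_userfaults(int ufd)
{
	uint64_t addr;

	/* one blocking read() per userfault, as in the qemu prototype */
	while (read(ufd, &addr, sizeof(addr)) == sizeof(addr)) {
		unsigned long offset =
			(addr & ~(HPAGE_SIZE - 1)) - (unsigned long) guest_base;
		uint64_t range[2];

		/* pull the missing 2M region from the source node */
		recv_page_from_source(offset, staging_base + offset, HPAGE_SIZE);

		/* fill the hole atomically, zero-copy, at THP granularity */
		syscall(SYS_remap_anon_pages, guest_base + offset,
			staging_base + offset, HPAGE_SIZE, 0);

		/* ack the resolved range so the blocked userfault can continue */
		range[0] = addr & ~(HPAGE_SIZE - 1);
		range[1] = range[0] + HPAGE_SIZE;
		write(ufd, range, sizeof(range));
	}
}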

pagemap/clear_refs can't stop the guest and notify us immediately
that the guest touched a hole.

It's not just about guest shadow MMU accesses; it could also be
O_DIRECT from qemu that triggers the fault, in which case GUP stops,
we fill the hole, and then GUP and O_DIRECT succeed without even
noticing they were stopped by a userfault.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 00/10] RFC: userfault
  2014-07-02 16:50 ` Andrea Arcangeli
  (?)
  (?)
@ 2014-07-03 15:41   ` Dave Hansen
  -1 siblings, 0 replies; 59+ messages in thread
From: Dave Hansen @ 2014-07-03 15:41 UTC (permalink / raw)
  To: Andrea Arcangeli, qemu-devel, kvm, linux-mm, linux-kernel
  Cc: "Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Rik van Riel, Dmitry Adamushko,
	Neil Brown, Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, Keith Packard, Huangpeng (Peter),
	Isaku Yamahata, Paolo Bonzini, Anthony Liguori, Stefan Hajnoczi,
	Wenchao Xia, Andrew Jones, Juan Quintela, Mel Gorman

On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.

Maybe.

I certainly can't keep track of all the versions of the variations of
the volatile ranges patches.  But, I don't think it's a given that this
can be reused.  First of all, volatile ranges is trying to replace
ashmem and is going to require _some_ form of sharing.  This mechanism,
being tightly coupled to anonymous memory at the moment, is not a close
fit for that.

It's also important to call out that this is a VMA-based mechanism.  I
certainly can't predict what we'll merge for volatile ranges, but not
all of them are VMA-based.  We'd also need a mechanism on top of this to
differentiate plain not-present pages from not-present-because-purged pages.

That said, I _think_ this might fit well into what the Mozilla guys
wanted out of volatile ranges.  I'm not confident about it, though.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 06/10] mm: sys_remap_anon_pages
  2014-07-02 16:50   ` Andrea Arcangeli
  (?)
  (?)
@ 2014-07-04 11:30     ` Michael Kerrisk
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael Kerrisk @ 2014-07-04 11:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-mm, Linux Kernel, "Dr. David Alan Gilbert",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Isaku Yamahata, Paolo

[CC+=linux-api@]

Hi Andrea,

On Wed, Jul 2, 2014 at 6:50 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> This new syscall will move anon pages across vmas, atomically and
> without touching the vmas.

Please CC linux-api on patches that change the API/ABI. (See
https://www.kernel.org/doc/man-pages/linux-api-ml.html)

Cheers,

Michael


> It only works on non shared anonymous pages because those can be
> relocated without generating non linear anon_vmas in the rmap code.
>
> It is the ideal mechanism to handle userspace page faults. Normally
> the destination vma will have VM_USERFAULT set with
> madvise(MADV_USERFAULT) while the source vma will normally have
> VM_DONTCOPY set with madvise(MADV_DONTFORK).
>
> MADV_DONTFORK set in the source vma prevents remap_anon_pages from
> failing if the process forks during the userland page fault.
>
> The thread triggering the sigbus signal handler by touching an
> unmapped hole in the MADV_USERFAULT region should take care to
> receive the data belonging to the faulting virtual address in the
> source vma. The data can come from the network, storage or any other
> I/O device. After the data has been safely received in the private
> area in the source vma, it will call remap_anon_pages to map the page
> in the faulting address in the destination vma atomically. And finally
> it will return from the signal handler.
>
> It is an alternative to mremap.
>
> It only works if the vma protection bits are identical from the source
> and destination vma.
>
> It can remap non shared anonymous pages within the same vma too.
>
> If the source virtual memory range has any unmapped holes, or if the
> destination virtual memory range is not a whole unmapped hole,
> remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
> provides a very strict behavior to avoid any chance of memory
> corruption going unnoticed if there are userland race conditions. Only
> one thread should resolve the userland page fault at any given time
> for any given faulting address. This means that if two threads try to
> both call remap_anon_pages on the same destination address at the same
> time, the second thread will get an explicit error from this syscall.
>
> The syscall retval will be "len" if successful. The syscall however
> can be interrupted by fatal signals or errors. If interrupted it will
> return the number of bytes successfully remapped before the
> interruption if any, or the negative error if none. It will never
> return zero. Either it will return an error or an amount of bytes
> successfully moved. If the retval reports a "short" remap, the
> remap_anon_pages syscall should be repeated by userland with
> src+retval, dst+retval, len-retval if it wants to know about the error
> that interrupted it.
>
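
(Editorial illustration: the "short remap" convention above amounts to a
retry loop along these lines, using the x86_64 syscall number this series
assigns; a sketch only, with the usual syscall()/errno convention.)

#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SYS_remap_anon_pages 317

/* keep remapping until the whole range has moved or a real error comes back */
static long remap_anon_pages_full(unsigned long dst, unsigned long src,
				  unsigned long len, unsigned long flags)
{
	unsigned long done = 0;

	while (done < len) {
		long ret = syscall(SYS_remap_anon_pages, dst + done,
				   src + done, len - done, flags);
		if (ret == -1)
			return -errno;	/* e.g. -EEXIST, -ENOENT, -EBUSY */
		done += ret;		/* never zero: a short remap made progress */
	}
	return done;
}
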
> The RAP_ALLOW_SRC_HOLES flag can be specified to keep -ENOENT
> errors from materializing if there are holes in the source virtual range
> that is being remapped. The holes will be accounted as successfully
> remapped in the retval of the syscall. This is mostly useful to remap
> hugepage naturally aligned virtual regions without knowing if there
> are transparent hugepages in the regions or not, but preventing the
> risk of having to split the hugepmd during the remap.
>
> The main difference from mremap is that, when used to fill holes in
> unmapped anonymous memory vmas (in combination with
> MADV_USERFAULT), remap_anon_pages won't create lots of unmergeable
> vmas. mremap instead would create lots of vmas (because of non linear
> vma->vm_pgoff) leading to -ENOMEM failures (the number of vmas is
> limited).
>
> MADV_USERFAULT and remap_anon_pages() can be tested with a program
> like below:
>
> ===
>  #define _GNU_SOURCE
>  #include <sys/mman.h>
>  #include <pthread.h>
>  #include <strings.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <stdio.h>
>  #include <errno.h>
>  #include <string.h>
>  #include <signal.h>
>  #include <sys/syscall.h>
>  #include <sys/types.h>
>
>  #define USE_USERFAULT
>  #define THP
>
>  #define MADV_USERFAULT 18
>
>  #define SIZE (1024*1024*1024)
>
>  #define SYS_remap_anon_pages 317
>
>  static volatile unsigned char *c, *tmp;
>
>  void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
>  {
>         unsigned char *addr = info->si_addr;
>         int len = 4096;
>         int ret;
>
>         addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
>  #ifdef THP
>         addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
>         len = 2*1024*1024;
>  #endif
>         if (addr >= c && addr < c + SIZE) {
>                 unsigned long offset = addr - c;
>                 ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
>                 if (ret != len)
>                         perror("sigbus remap_anon_pages"), exit(1);
>                 //printf("sigbus offset %lu\n", offset);
>                 return;
>         }
>
>         printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
>  }
>
>  int main()
>  {
>         struct sigaction sa;
>         int ret;
>         unsigned long i;
>  #ifndef THP
>         /*
>          * Fails with THP due lack of alignment because of memset
>          * pre-filling the destination
>          */
>         c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (c == MAP_FAILED)
>                 perror("mmap"), exit(1);
>         tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (tmp == MAP_FAILED)
>                 perror("mmap"), exit(1);
>  #else
>         ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>         ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>  #endif
>         /*
>          * MADV_USERFAULT must run before memset, to avoid THP 2m
>          * faults to map memory into "tmp", if "tmp" isn't allocated
>          * with hugepage alignment.
>          */
>         if (madvise((void *)c, SIZE, MADV_USERFAULT))
>                 perror("madvise"), exit(1);
>         memset((void *)tmp, 0xaa, SIZE);
>
>         sa.sa_sigaction = userfault_sighandler;
>         sigemptyset(&sa.sa_mask);
>         sa.sa_flags = SA_SIGINFO;
>         sigaction(SIGBUS, &sa, NULL);
>
>  #ifndef USE_USERFAULT
>         ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
>         if (ret != SIZE)
>                 perror("remap_anon_pages"), exit(1);
>  #endif
>
>         for (i = 0; i < SIZE; i += 4096) {
>                 if ((i/4096) % 2) {
>                         /* exercise read and write MADV_USERFAULT */
>                         c[i+1] = 0xbb;
>                 }
>                 if (c[i] != 0xaa)
>                         printf("error %x offset %lu\n", c[i], i), exit(1);
>         }
>         printf("remap_anon_pages functions correctly\n");
>
>         return 0;
>  }
> ===
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/syscalls/syscall_32.tbl |   1 +
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/huge_mm.h          |   7 +
>  include/linux/syscalls.h         |   4 +
>  kernel/sys_ni.c                  |   1 +
>  mm/fremap.c                      | 477 +++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c                 | 110 +++++++++
>  7 files changed, 601 insertions(+)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..08bc856 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
>  351    i386    sched_setattr           sys_sched_setattr
>  352    i386    sched_getattr           sys_sched_getattr
>  353    i386    renameat2               sys_renameat2
> +354    i386    remap_anon_pages        sys_remap_anon_pages
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..37bd179 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314    common  sched_setattr           sys_sched_setattr
>  315    common  sched_getattr           sys_sched_getattr
>  316    common  renameat2               sys_renameat2
> +317    common  remap_anon_pages        sys_remap_anon_pages
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 3a2c57e..9a37dd5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
>  extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>                         unsigned long addr, pgprot_t newprot,
>                         int prot_numa);
> +extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> +                                    pmd_t *dst_pmd, pmd_t *src_pmd,
> +                                    pmd_t dst_pmdval,
> +                                    struct vm_area_struct *dst_vma,
> +                                    struct vm_area_struct *src_vma,
> +                                    unsigned long dst_addr,
> +                                    unsigned long src_addr);
>
>  enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_FLAG,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..19edb00 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -447,6 +447,10 @@ asmlinkage long sys_mremap(unsigned long addr,
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>                         unsigned long prot, unsigned long pgoff,
>                         unsigned long flags);
> +asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
> +                                    unsigned long src_start,
> +                                    unsigned long len,
> +                                    unsigned long flags);
>  asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
>  asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
>  asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 36441b5..6fc1aca 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -177,6 +177,7 @@ cond_syscall(sys_mincore);
>  cond_syscall(sys_madvise);
>  cond_syscall(sys_mremap);
>  cond_syscall(sys_remap_file_pages);
> +cond_syscall(sys_remap_anon_pages);
>  cond_syscall(compat_sys_move_pages);
>  cond_syscall(compat_sys_migrate_pages);
>
> diff --git a/mm/fremap.c b/mm/fremap.c
> index 1e509f7..9337637 100644
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
>         if (ptl1 != ptl2)
>                 spin_unlock(ptl2);
>  }
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +/*
> + * The mmap_sem for reading is held by the caller. Just move the page
> + * from src_pmd to dst_pmd if possible, and return true if succeeded
> + * in moving the page.
> + */
> +static int remap_anon_pages_pte(struct mm_struct *mm,
> +                               pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
> +                               struct vm_area_struct *dst_vma,
> +                               struct vm_area_struct *src_vma,
> +                               unsigned long dst_addr,
> +                               unsigned long src_addr,
> +                               spinlock_t *dst_ptl,
> +                               spinlock_t *src_ptl,
> +                               unsigned long flags)
> +{
> +       struct page *src_page;
> +       swp_entry_t entry;
> +       pte_t orig_src_pte, orig_dst_pte;
> +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> +
> +       spin_lock(dst_ptl);
> +       orig_dst_pte = *dst_pte;
> +       spin_unlock(dst_ptl);
> +       if (!pte_none(orig_dst_pte))
> +               return -EEXIST;
> +
> +       spin_lock(src_ptl);
> +       orig_src_pte = *src_pte;
> +       spin_unlock(src_ptl);
> +       if (pte_none(orig_src_pte)) {
> +               if (!(flags & RAP_ALLOW_SRC_HOLES))
> +                       return -ENOENT;
> +               else
> +                       /* nothing to do to remap an hole */
> +                       return 0;
> +       }
> +
> +       if (pte_present(orig_src_pte)) {
> +               /*
> +                * Pin the page while holding the lock to be sure the
> +                * page isn't freed under us
> +                */
> +               spin_lock(src_ptl);
> +               if (!pte_same(orig_src_pte, *src_pte)) {
> +                       spin_unlock(src_ptl);
> +                       return -EAGAIN;
> +               }
> +               src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
> +               if (!src_page || !PageAnon(src_page) ||
> +                   page_mapcount(src_page) != 1) {
> +                       spin_unlock(src_ptl);
> +                       return -EBUSY;
> +               }
> +
> +               get_page(src_page);
> +               spin_unlock(src_ptl);
> +
> +               /* block all concurrent rmap walks */
> +               lock_page(src_page);
> +
> +               /*
> +                * page_referenced_anon walks the anon_vma chain
> +                * without the page lock. Serialize against it with
> +                * the anon_vma lock, the page lock is not enough.
> +                */
> +               src_anon_vma = page_get_anon_vma(src_page);
> +               if (!src_anon_vma) {
> +                       /* page was unmapped from under us */
> +                       unlock_page(src_page);
> +                       put_page(src_page);
> +                       return -EAGAIN;
> +               }
> +               anon_vma_lock_write(src_anon_vma);
> +
> +               double_pt_lock(dst_ptl, src_ptl);
> +
> +               if (!pte_same(*src_pte, orig_src_pte) ||
> +                   !pte_same(*dst_pte, orig_dst_pte) ||
> +                   page_mapcount(src_page) != 1) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       anon_vma_unlock_write(src_anon_vma);
> +                       put_anon_vma(src_anon_vma);
> +                       unlock_page(src_page);
> +                       put_page(src_page);
> +                       return -EAGAIN;
> +               }
> +
> +               BUG_ON(!PageAnon(src_page));
> +               /* the PT lock is enough to keep the page pinned now */
> +               put_page(src_page);
> +
> +               dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +               ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
> +                                                 dst_anon_vma);
> +               ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
> +                                                                dst_addr);
> +
> +               if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
> +                             orig_src_pte))
> +                       BUG();
> +
> +               orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
> +               orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
> +                                            dst_vma);
> +
> +               set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> +
> +               double_pt_unlock(dst_ptl, src_ptl);
> +
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +
> +               /* unblock rmap walks */
> +               unlock_page(src_page);
> +
> +               mmu_notifier_invalidate_page(mm, src_addr);
> +       } else {
> +               if (pte_file(orig_src_pte))
> +                       return -EFAULT;
> +
> +               entry = pte_to_swp_entry(orig_src_pte);
> +               if (non_swap_entry(entry)) {
> +                       if (is_migration_entry(entry)) {
> +                               migration_entry_wait(mm, src_pmd, src_addr);
> +                               return -EAGAIN;
> +                       }
> +                       return -EFAULT;
> +               }
> +
> +               if (swp_entry_swapcount(entry) != 1)
> +                       return -EBUSY;
> +
> +               double_pt_lock(dst_ptl, src_ptl);
> +
> +               if (!pte_same(*src_pte, orig_src_pte) ||
> +                   !pte_same(*dst_pte, orig_dst_pte) ||
> +                   swp_entry_swapcount(entry) != 1) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       return -EAGAIN;
> +               }
> +
> +               if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
> +                   pte_val(orig_src_pte))
> +                       BUG();
> +               set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> +
> +               double_pt_unlock(dst_ptl, src_ptl);
> +       }
> +
> +       return 0;
> +}
> +
> +static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> +{
> +       pgd_t *pgd;
> +       pud_t *pud;
> +       pmd_t *pmd = NULL;
> +
> +       pgd = pgd_offset(mm, address);
> +       pud = pud_alloc(mm, pgd, address);
> +       if (pud)
> +               /*
> +                * Note that we didn't run this because the pmd was
> +                * missing, the *pmd may be already established and in
> +                * turn it may also be a trans_huge_pmd.
> +                */
> +               pmd = pmd_alloc(mm, pud, address);
> +       return pmd;
> +}
> +
> +/**
> + * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
> + * @dst_start: start of the destination virtual memory range
> + * @src_start: start of the source virtual memory range
> + * @len: length of the virtual memory range
> + *
> + * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
> + * zero copy. It only works on non shared anonymous pages because
> + * those can be relocated without generating non linear anon_vmas in
> + * the rmap code.
> + *
> + * It is the ideal mechanism to handle userspace page faults. Normally
> + * the destination vma will have VM_USERFAULT set with
> + * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
> + * set with madvise(MADV_DONTFORK).
> + *
> + * The thread receiving the page during the userland page fault
> + * (MADV_USERFAULT) will receive the faulting page in the source vma
> + * through the network, storage or any other I/O device (MADV_DONTFORK
> + * in the source vma avoids remap_anon_pages to fail with -EBUSY if
> + * the process forks before remap_anon_pages is called), then it will
> + * call remap_anon_pages to map the page in the faulting address in
> + * the destination vma.
> + *
> + * This syscall works purely via pagetables, so it's the most
> + * efficient way to move physical non shared anonymous pages across
> + * different virtual addresses. Unlike mremap()/mmap()/munmap() it
> + * does not create any new vmas. The mapping in the destination
> + * address is atomic.
> + *
> + * It only works if the vma protection bits are identical from the
> + * source and destination vma.
> + *
> + * It can remap non shared anonymous pages within the same vma too.
> + *
> + * If the source virtual memory range has any unmapped holes, or if
> + * the destination virtual memory range is not a whole unmapped hole,
> + * remap_anon_pages will fail respectively with -ENOENT or
> + * -EEXIST. This provides a very strict behavior to avoid any chance
> + * of memory corruption going unnoticed if there are userland race
> + * conditions. Only one thread should resolve the userland page fault
> + * at any given time for any given faulting address. This means that
> + * if two threads try to both call remap_anon_pages on the same
> + * destination address at the same time, the second thread will get an
> + * explicit error from this syscall.
> + *
> + * The syscall retval will be "len" if successful. The syscall
> + * however can be interrupted by fatal signals or errors. If
> + * interrupted it will return the number of bytes successfully
> + * remapped before the interruption if any, or the negative error if
> + * none. It will never return zero. Either it will return an error or
> + * an amount of bytes successfully moved. If the retval reports a
> + * "short" remap, the remap_anon_pages syscall should be repeated by
> + * userland with src+retval, dst+retval, len-retval if it wants to know
> + * about the error that interrupted it.
> + *
> + * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> + * errors to materialize if there are holes in the source virtual
> + * range that is being remapped. The holes will be accounted as
> + * successfully remapped in the retval of the syscall. This is mostly
> + * useful to remap hugepage naturally aligned virtual regions without
> + * knowing if there are transparent hugepage in the regions or not,
> + * but preventing the risk of having to split the hugepmd during the
> + * remap.
> + *
> + * Any rmap walk that takes the anon_vma lock without first
> + * obtaining the page lock (for example split_huge_page and
> + * page_referenced_anon) will have to verify whether the
> + * page->mapping has changed after taking the anon_vma lock. If it
> + * changed, the walker should release the lock and retry obtaining a
> + * new anon_vma, because the anon_vma was changed by remap_anon_pages
> + * before the lock could be obtained. This is the
> + * only additional complexity added to the rmap code to provide this
> + * anonymous page remapping functionality.
> + */
> +SYSCALL_DEFINE4(remap_anon_pages,
> +               unsigned long, dst_start, unsigned long, src_start,
> +               unsigned long, len, unsigned long, flags)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *src_vma, *dst_vma;
> +       long err = -EINVAL;
> +       pmd_t *src_pmd, *dst_pmd;
> +       pte_t *src_pte, *dst_pte;
> +       spinlock_t *dst_ptl, *src_ptl;
> +       unsigned long src_addr, dst_addr;
> +       int thp_aligned = -1;
> +       long moved = 0;
> +
> +       /*
> +        * Sanitize the syscall parameters:
> +        */
> +       if (src_start & ~PAGE_MASK)
> +               return err;
> +       if (dst_start & ~PAGE_MASK)
> +               return err;
> +       if (len & ~PAGE_MASK)
> +               return err;
> +       if (flags & ~RAP_ALLOW_SRC_HOLES)
> +               return err;
> +
> +       /* Does the address range wrap, or is the span zero-sized? */
> +       if (unlikely(src_start + len <= src_start))
> +               return err;
> +       if (unlikely(dst_start + len <= dst_start))
> +               return err;
> +
> +       down_read(&mm->mmap_sem);
> +
> +       /*
> +        * Make sure the vma is not shared, that the src and dst remap
> +        * ranges are both valid and fully within a single existing
> +        * vma.
> +        */
> +       src_vma = find_vma(mm, src_start);
> +       if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> +               goto out;
> +       if (src_start < src_vma->vm_start ||
> +           src_start + len > src_vma->vm_end)
> +               goto out;
> +
> +       dst_vma = find_vma(mm, dst_start);
> +       if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +               goto out;
> +       if (dst_start < dst_vma->vm_start ||
> +           dst_start + len > dst_vma->vm_end)
> +               goto out;
> +
> +       if (pgprot_val(src_vma->vm_page_prot) !=
> +           pgprot_val(dst_vma->vm_page_prot))
> +               goto out;
> +
> +       /* only allow remapping if both are mlocked or both aren't */
> +       if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
> +               goto out;
> +
> +       /*
> +        * Ensure the dst_vma has an anon_vma or this page
> +        * would get a NULL anon_vma when moved in the
> +        * dst_vma.
> +        */
> +       err = -ENOMEM;
> +       if (unlikely(anon_vma_prepare(dst_vma)))
> +               goto out;
> +
> +       for (src_addr = src_start, dst_addr = dst_start;
> +            src_addr < src_start + len; ) {
> +               spinlock_t *ptl;
> +               pmd_t dst_pmdval;
> +               BUG_ON(dst_addr >= dst_start + len);
> +               src_pmd = mm_find_pmd(mm, src_addr);
> +               if (unlikely(!src_pmd)) {
> +                       if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> +                               err = -ENOENT;
> +                               break;
> +                       } else {
> +                               src_pmd = mm_alloc_pmd(mm, src_addr);
> +                               if (unlikely(!src_pmd)) {
> +                                       err = -ENOMEM;
> +                                       break;
> +                               }
> +                       }
> +               }
> +               dst_pmd = mm_alloc_pmd(mm, dst_addr);
> +               if (unlikely(!dst_pmd)) {
> +                       err = -ENOMEM;
> +                       break;
> +               }
> +
> +               dst_pmdval = pmd_read_atomic(dst_pmd);
> +               /*
> +                * If the dst_pmd is mapped as THP don't
> +                * override it and just be strict.
> +                */
> +               if (unlikely(pmd_trans_huge(dst_pmdval))) {
> +                       err = -EEXIST;
> +                       break;
> +               }
> +               if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
> +                       /*
> +                        * Check if we can move the pmd without
> +                        * splitting it. First check the address
> +                        * alignment to be the same in src/dst.  These
> +                        * checks don't actually need the PT lock but
> +                        * it's good to do it here to optimize this
> +                        * block away at build time if
> +                        * CONFIG_TRANSPARENT_HUGEPAGE is not set.
> +                        */
> +                       if (thp_aligned == -1)
> +                               thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
> +                                              (dst_addr & ~HPAGE_PMD_MASK));
> +                       if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
> +                           !pmd_none(dst_pmdval) ||
> +                           src_start + len - src_addr < HPAGE_PMD_SIZE) {
> +                               spin_unlock(ptl);
> +                               /* Fall through */
> +                               split_huge_page_pmd(src_vma, src_addr,
> +                                                   src_pmd);
> +                       } else {
> +                               BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
> +                               err = remap_anon_pages_huge_pmd(mm,
> +                                                               dst_pmd,
> +                                                               src_pmd,
> +                                                               dst_pmdval,
> +                                                               dst_vma,
> +                                                               src_vma,
> +                                                               dst_addr,
> +                                                               src_addr);
> +                               cond_resched();
> +
> +                               if (!err) {
> +                                       dst_addr += HPAGE_PMD_SIZE;
> +                                       src_addr += HPAGE_PMD_SIZE;
> +                                       moved += HPAGE_PMD_SIZE;
> +                               }
> +
> +                               if ((!err || err == -EAGAIN) &&
> +                                   fatal_signal_pending(current))
> +                                       err = -EINTR;
> +
> +                               if (err && err != -EAGAIN)
> +                                       break;
> +
> +                               continue;
> +                       }
> +               }
> +
> +               if (pmd_none(*src_pmd)) {
> +                       if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> +                               err = -ENOENT;
> +                               break;
> +                       } else {
> +                               if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
> +                                                        src_addr))) {
> +                                       err = -ENOMEM;
> +                                       break;
> +                               }
> +                       }
> +               }
> +
> +               /*
> +                * We held the mmap_sem for reading so MADV_DONTNEED
> +                * can zap transparent huge pages under us, or the
> +                * transparent huge page fault can establish new
> +                * transparent huge pages under us.
> +                */
> +               if (unlikely(pmd_trans_unstable(src_pmd))) {
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               if (unlikely(pmd_none(dst_pmdval)) &&
> +                   unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
> +                                        dst_addr))) {
> +                       err = -ENOMEM;
> +                       break;
> +               }
> +               /* If a huge pmd materialized from under us, fail */
> +               if (unlikely(pmd_trans_huge(*dst_pmd))) {
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               BUG_ON(pmd_none(*dst_pmd));
> +               BUG_ON(pmd_none(*src_pmd));
> +               BUG_ON(pmd_trans_huge(*dst_pmd));
> +               BUG_ON(pmd_trans_huge(*src_pmd));
> +
> +               dst_pte = pte_offset_map(dst_pmd, dst_addr);
> +               src_pte = pte_offset_map(src_pmd, src_addr);
> +               dst_ptl = pte_lockptr(mm, dst_pmd);
> +               src_ptl = pte_lockptr(mm, src_pmd);
> +
> +               err = remap_anon_pages_pte(mm,
> +                                          dst_pte, src_pte, src_pmd,
> +                                          dst_vma, src_vma,
> +                                          dst_addr, src_addr,
> +                                          dst_ptl, src_ptl, flags);
> +
> +               pte_unmap(dst_pte);
> +               pte_unmap(src_pte);
> +               cond_resched();
> +
> +               if (!err) {
> +                       dst_addr += PAGE_SIZE;
> +                       src_addr += PAGE_SIZE;
> +                       moved += PAGE_SIZE;
> +               }
> +
> +               if ((!err || err == -EAGAIN) &&
> +                   fatal_signal_pending(current))
> +                       err = -EINTR;
> +
> +               if (err && err != -EAGAIN)
> +                       break;
> +       }
> +
> +out:
> +       up_read(&mm->mmap_sem);
> +       BUG_ON(moved < 0);
> +       BUG_ON(err > 0);
> +       BUG_ON(!moved && !err);
> +       return moved ? moved : err;
> +}
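
To make the "short remap" return convention documented above concrete: a
caller loops, advancing src, dst and len by the returned byte count until
everything has been moved or a real error comes back. A minimal userland
sketch, not part of the patch (the wrapper name is invented here; the
syscall number is the x86_64 one added by this patch):

===
 #define _GNU_SOURCE
 #include <errno.h>
 #include <sys/syscall.h>
 #include <unistd.h>

 #define SYS_remap_anon_pages 317       /* x86_64 number added by this patch */

 /* Hypothetical helper: retry until "len" bytes are remapped or an error hits. */
 static long remap_anon_pages_full(unsigned long dst, unsigned long src,
                                   unsigned long len, unsigned long flags)
 {
         unsigned long moved = 0;

         while (moved < len) {
                 long ret = syscall(SYS_remap_anon_pages, dst + moved,
                                    src + moved, len - moved, flags);
                 if (ret == -1)
                         return -errno;  /* the error that interrupted the remap */
                 moved += ret;           /* short remap: retry the remainder */
         }
         return (long) moved;
 }
===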
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c37ca..e24cd7c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1541,6 +1541,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  }
>
>  /*
> + * The PT lock for src_pmd and the mmap_sem for reading are held by
> + * the caller, but this function must release that PT lock before
> + * returning. We're guaranteed the src_pmd is a pmd_trans_huge
> + * until the PT lock of the src_pmd is released. Just move the page
> + * from src_pmd to dst_pmd if possible. Return zero if succeeded in
> + * moving the page, -EAGAIN if it needs to be repeated by the caller,
> + * or other errors in case of failure.
> + */
> +int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> +                             pmd_t *dst_pmd, pmd_t *src_pmd,
> +                             pmd_t dst_pmdval,
> +                             struct vm_area_struct *dst_vma,
> +                             struct vm_area_struct *src_vma,
> +                             unsigned long dst_addr,
> +                             unsigned long src_addr)
> +{
> +       pmd_t _dst_pmd, src_pmdval;
> +       struct page *src_page;
> +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> +       spinlock_t *src_ptl, *dst_ptl;
> +       pgtable_t pgtable;
> +
> +       src_pmdval = *src_pmd;
> +       src_ptl = pmd_lockptr(mm, src_pmd);
> +
> +       BUG_ON(!pmd_trans_huge(src_pmdval));
> +       BUG_ON(pmd_trans_splitting(src_pmdval));
> +       BUG_ON(!pmd_none(dst_pmdval));
> +       BUG_ON(!spin_is_locked(src_ptl));
> +       BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +       src_page = pmd_page(src_pmdval);
> +       BUG_ON(!PageHead(src_page));
> +       BUG_ON(!PageAnon(src_page));
> +       if (unlikely(page_mapcount(src_page) != 1)) {
> +               spin_unlock(src_ptl);
> +               return -EBUSY;
> +       }
> +
> +       get_page(src_page);
> +       spin_unlock(src_ptl);
> +
> +       mmu_notifier_invalidate_range_start(mm, src_addr,
> +                                           src_addr + HPAGE_PMD_SIZE);
> +
> +       /* block all concurrent rmap walks */
> +       lock_page(src_page);
> +
> +       /*
> +        * split_huge_page walks the anon_vma chain without the page
> +        * lock. Serialize against it with the anon_vma lock; the page
> +        * lock is not enough.
> +        */
> +       src_anon_vma = page_get_anon_vma(src_page);
> +       if (!src_anon_vma) {
> +               unlock_page(src_page);
> +               put_page(src_page);
> +               mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                                 src_addr + HPAGE_PMD_SIZE);
> +               return -EAGAIN;
> +       }
> +       anon_vma_lock_write(src_anon_vma);
> +
> +       dst_ptl = pmd_lockptr(mm, dst_pmd);
> +       double_pt_lock(src_ptl, dst_ptl);
> +       if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> +                    !pmd_same(*dst_pmd, dst_pmdval) ||
> +                    page_mapcount(src_page) != 1)) {
> +               double_pt_unlock(src_ptl, dst_ptl);
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +               unlock_page(src_page);
> +               put_page(src_page);
> +               mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                                 src_addr + HPAGE_PMD_SIZE);
> +               return -EAGAIN;
> +       }
> +
> +       BUG_ON(!PageHead(src_page));
> +       BUG_ON(!PageAnon(src_page));
> +       /* the PT lock is enough to keep the page pinned now */
> +       put_page(src_page);
> +
> +       dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +       ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
> +       ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
> +
> +       if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
> +                     src_pmdval))
> +               BUG();
> +       _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
> +       _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> +       set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
> +
> +       pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> +       pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
> +       double_pt_unlock(src_ptl, dst_ptl);
> +
> +       anon_vma_unlock_write(src_anon_vma);
> +       put_anon_vma(src_anon_vma);
> +
> +       /* unblock rmap walks */
> +       unlock_page(src_page);
> +
> +       mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                         src_addr + HPAGE_PMD_SIZE);
> +       return 0;
> +}
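
The anon_vma and page->mapping switch above is exactly what the
sys_remap_anon_pages comment warns rmap walkers about. A sketch of the
recheck such a walker would need, purely illustrative and not part of the
patch (the helper name is invented; the calls are the existing rmap API):

===
 /*
  * Sketch only: the recheck pattern that the sys_remap_anon_pages
  * comment asks of rmap walkers taking the anon_vma lock without
  * holding the page lock.
  */
 static struct anon_vma *rmap_get_stable_anon_vma(struct page *page)
 {
         struct anon_vma *anon_vma;

 again:
         anon_vma = page_get_anon_vma(page);
         if (!anon_vma)
                 return NULL;
         anon_vma_lock_read(anon_vma);
         if (page_anon_vma(page) != anon_vma) {
                 /* page->mapping changed by remap_anon_pages: retry */
                 anon_vma_unlock_read(anon_vma);
                 put_anon_vma(anon_vma);
                 goto again;
         }
         return anon_vma;        /* caller unlocks and puts when done */
 }
===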
> +
> +/*
>   * Returns 1 if a given pmd maps a stable (not under splitting) thp.
>   * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
>   *
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 06/10] mm: sys_remap_anon_pages
@ 2014-07-04 11:30     ` Michael Kerrisk
  0 siblings, 0 replies; 59+ messages in thread
From: Michael Kerrisk @ 2014-07-04 11:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-mm, Linux Kernel, \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Isaku Yamahata, Paolo

[CC+=linux-api@]

Hi Andrea,

On Wed, Jul 2, 2014 at 6:50 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> This new syscall will move anon pages across vmas, atomically and
> without touching the vmas.

Please CC linux-api on patches that change the API/ABI. (See
https://www.kernel.org/doc/man-pages/linux-api-ml.html)

Cheers,

Michael


> It only works on non shared anonymous pages because those can be
> relocated without generating non linear anon_vmas in the rmap code.
>
> It is the ideal mechanism to handle userspace page faults. Normally
> the destination vma will have VM_USERFAULT set with
> madvise(MADV_USERFAULT) while the source vma will normally have
> VM_DONTCOPY set with madvise(MADV_DONTFORK).
>
> MADV_DONTFORK set in the source vma prevents remap_anon_pages from failing if
> the process forks during the userland page fault.
>
> The thread that triggers the SIGBUS signal handler by touching an
> unmapped hole in the MADV_USERFAULT region should take care to
> receive the data belonging to the faulting virtual address into the
> source vma. The data can come from the network, storage or any other
> I/O device. After the data has been safely received in the private
> area in the source vma, it will call remap_anon_pages to map the page
> in the faulting address in the destination vma atomically. And finally
> it will return from the signal handler.
>
> It is an alternative to mremap.
>
> It only works if the vma protection bits are identical in the source
> and destination vmas.
>
> It can remap non shared anonymous pages within the same vma too.
>
> If the source virtual memory range has any unmapped holes, or if the
> destination virtual memory range is not a whole unmapped hole,
> remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
> provides a very strict behavior to avoid any chance of memory
> corruption going unnoticed if there are userland race conditions. Only
> one thread should resolve the userland page fault at any given time
> for any given faulting address. This means that if two threads try to
> both call remap_anon_pages on the same destination address at the same
> time, the second thread will get an explicit error from this syscall.
>
> The syscall will return "len" if successful. The syscall however
> can be interrupted by fatal signals or errors. If interrupted it will
> return the number of bytes successfully remapped before the
> interruption if any, or the negative error if none. It will never
> return zero. Either it will return an error or an amount of bytes
> successfully moved. If the retval reports a "short" remap, the
> remap_anon_pages syscall should be repeated by userland with
> src+retval, dst+retval, len-retval if it wants to know about the error
> that interrupted it.
>
> The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> errors from materializing if there are holes in the source virtual range
> that is being remapped. The holes will be accounted as successfully
> remapped in the retval of the syscall. This is mostly useful to remap
> hugepage naturally aligned virtual regions without knowing if there
> are transparent hugepages in the regions or not, but preventing the
> risk of having to split the hugepmd during the remap.
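
Concretely, a caller resolving userfaults at THP granularity would pass
2M-aligned addresses and set this flag, along these lines (a sketch, not
part of the patch; the flag and syscall number values are the ones this
patch introduces):

===
 #define _GNU_SOURCE
 #include <sys/syscall.h>
 #include <unistd.h>

 #define SYS_remap_anon_pages 317        /* x86_64 number added by this patch */
 #define RAP_ALLOW_SRC_HOLES (1UL<<0)    /* flag value from mm/fremap.c below */
 #define HPAGE_SIZE (2UL*1024*1024)

 /* Sketch: move one 2M-aligned region whose source may contain holes. */
 static long remap_one_hugepage(unsigned long dst, unsigned long src)
 {
         return syscall(SYS_remap_anon_pages, dst, src, HPAGE_SIZE,
                        RAP_ALLOW_SRC_HOLES);
 }
===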
>
> The main difference with mremap is that if used to fill holes in
> unmapped anonymous memory vmas (if used in combination with
> MADV_USERFAULT) remap_anon_pages won't create lots of unmergeable
> vmas. mremap instead would create lots of vmas (because of non linear
> vma->vm_pgoff) leading to -ENOMEM failures (the number of vmas is
> limited).
>
> MADV_USERFAULT and remap_anon_pages() can be tested with a program
> like below:
>
> ===
>  #define _GNU_SOURCE
>  #include <sys/mman.h>
>  #include <pthread.h>
>  #include <strings.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <stdio.h>
>  #include <errno.h>
>  #include <string.h>
>  #include <signal.h>
>  #include <sys/syscall.h>
>  #include <sys/types.h>
>
>  #define USE_USERFAULT
>  #define THP
>
>  #define MADV_USERFAULT 18
>
>  #define SIZE (1024*1024*1024)
>
>  #define SYS_remap_anon_pages 317
>
>  static volatile unsigned char *c, *tmp;
>
>  void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
>  {
>         unsigned char *addr = info->si_addr;
>         int len = 4096;
>         int ret;
>
>         addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
>  #ifdef THP
>         addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
>         len = 2*1024*1024;
>  #endif
>         if (addr >= c && addr < c + SIZE) {
>                 unsigned long offset = addr - c;
>                 ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
>                 if (ret != len)
>                         perror("sigbus remap_anon_pages"), exit(1);
>                 //printf("sigbus offset %lu\n", offset);
>                 return;
>         }
>
>         printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
>  }
>
>  int main()
>  {
>         struct sigaction sa;
>         int ret;
>         unsigned long i;
>  #ifndef THP
>         /*
>          * Fails with THP due to lack of alignment because of memset
>          * pre-filling the destination
>          */
>         c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (c == MAP_FAILED)
>                 perror("mmap"), exit(1);
>         tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (tmp == MAP_FAILED)
>                 perror("mmap"), exit(1);
>  #else
>         ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>         ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>  #endif
>         /*
>          * MADV_USERFAULT must run before memset, to prevent THP 2m
>          * faults from mapping memory into "tmp", if "tmp" isn't allocated
>          * with hugepage alignment.
>          */
>         if (madvise((void *)c, SIZE, MADV_USERFAULT))
>                 perror("madvise"), exit(1);
>         memset((void *)tmp, 0xaa, SIZE);
>
>         sa.sa_sigaction = userfault_sighandler;
>         sigemptyset(&sa.sa_mask);
>         sa.sa_flags = SA_SIGINFO;
>         sigaction(SIGBUS, &sa, NULL);
>
>  #ifndef USE_USERFAULT
>         ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
>         if (ret != SIZE)
>                 perror("remap_anon_pages"), exit(1);
>  #endif
>
>         for (i = 0; i < SIZE; i += 4096) {
>                 if ((i/4096) % 2) {
>                         /* exercise read and write MADV_USERFAULT */
>                         c[i+1] = 0xbb;
>                 }
>                 if (c[i] != 0xaa)
>                         printf("error %x offset %lu\n", c[i], i), exit(1);
>         }
>         printf("remap_anon_pages functions correctly\n");
>
>         return 0;
>  }
> ===
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/syscalls/syscall_32.tbl |   1 +
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/huge_mm.h          |   7 +
>  include/linux/syscalls.h         |   4 +
>  kernel/sys_ni.c                  |   1 +
>  mm/fremap.c                      | 477 +++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c                 | 110 +++++++++
>  7 files changed, 601 insertions(+)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..08bc856 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
>  351    i386    sched_setattr           sys_sched_setattr
>  352    i386    sched_getattr           sys_sched_getattr
>  353    i386    renameat2               sys_renameat2
> +354    i386    remap_anon_pages        sys_remap_anon_pages
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..37bd179 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314    common  sched_setattr           sys_sched_setattr
>  315    common  sched_getattr           sys_sched_getattr
>  316    common  renameat2               sys_renameat2
> +317    common  remap_anon_pages        sys_remap_anon_pages
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 3a2c57e..9a37dd5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
>  extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>                         unsigned long addr, pgprot_t newprot,
>                         int prot_numa);
> +extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> +                                    pmd_t *dst_pmd, pmd_t *src_pmd,
> +                                    pmd_t dst_pmdval,
> +                                    struct vm_area_struct *dst_vma,
> +                                    struct vm_area_struct *src_vma,
> +                                    unsigned long dst_addr,
> +                                    unsigned long src_addr);
>
>  enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_FLAG,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..19edb00 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -447,6 +447,10 @@ asmlinkage long sys_mremap(unsigned long addr,
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>                         unsigned long prot, unsigned long pgoff,
>                         unsigned long flags);
> +asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
> +                                    unsigned long src_start,
> +                                    unsigned long len,
> +                                    unsigned long flags);
>  asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
>  asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
>  asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 36441b5..6fc1aca 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -177,6 +177,7 @@ cond_syscall(sys_mincore);
>  cond_syscall(sys_madvise);
>  cond_syscall(sys_mremap);
>  cond_syscall(sys_remap_file_pages);
> +cond_syscall(sys_remap_anon_pages);
>  cond_syscall(compat_sys_move_pages);
>  cond_syscall(compat_sys_migrate_pages);
>
> diff --git a/mm/fremap.c b/mm/fremap.c
> index 1e509f7..9337637 100644
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
>         if (ptl1 != ptl2)
>                 spin_unlock(ptl2);
>  }
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +/*
> + * The mmap_sem for reading is held by the caller. Just move the page
> + * from src_pmd to dst_pmd if possible, and return true if succeeded
> + * in moving the page.
> + */
> +static int remap_anon_pages_pte(struct mm_struct *mm,
> +                               pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
> +                               struct vm_area_struct *dst_vma,
> +                               struct vm_area_struct *src_vma,
> +                               unsigned long dst_addr,
> +                               unsigned long src_addr,
> +                               spinlock_t *dst_ptl,
> +                               spinlock_t *src_ptl,
> +                               unsigned long flags)
> +{
> +       struct page *src_page;
> +       swp_entry_t entry;
> +       pte_t orig_src_pte, orig_dst_pte;
> +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> +
> +       spin_lock(dst_ptl);
> +       orig_dst_pte = *dst_pte;
> +       spin_unlock(dst_ptl);
> +       if (!pte_none(orig_dst_pte))
> +               return -EEXIST;
> +
> +       spin_lock(src_ptl);
> +       orig_src_pte = *src_pte;
> +       spin_unlock(src_ptl);
> +       if (pte_none(orig_src_pte)) {
> +               if (!(flags & RAP_ALLOW_SRC_HOLES))
> +                       return -ENOENT;
> +               else
> +                       /* nothing to do to remap a hole */
> +                       return 0;
> +       }
> +
> +       if (pte_present(orig_src_pte)) {
> +               /*
> +                * Pin the page while holding the lock to be sure the
> +                * page isn't freed under us
> +                */
> +               spin_lock(src_ptl);
> +               if (!pte_same(orig_src_pte, *src_pte)) {
> +                       spin_unlock(src_ptl);
> +                       return -EAGAIN;
> +               }
> +               src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
> +               if (!src_page || !PageAnon(src_page) ||
> +                   page_mapcount(src_page) != 1) {
> +                       spin_unlock(src_ptl);
> +                       return -EBUSY;
> +               }
> +
> +               get_page(src_page);
> +               spin_unlock(src_ptl);
> +
> +               /* block all concurrent rmap walks */
> +               lock_page(src_page);
> +
> +               /*
> +                * page_referenced_anon walks the anon_vma chain
> +                * without the page lock. Serialize against it with
> +                * the anon_vma lock; the page lock is not enough.
> +                */
> +               src_anon_vma = page_get_anon_vma(src_page);
> +               if (!src_anon_vma) {
> +                       /* page was unmapped from under us */
> +                       unlock_page(src_page);
> +                       put_page(src_page);
> +                       return -EAGAIN;
> +               }
> +               anon_vma_lock_write(src_anon_vma);
> +
> +               double_pt_lock(dst_ptl, src_ptl);
> +
> +               if (!pte_same(*src_pte, orig_src_pte) ||
> +                   !pte_same(*dst_pte, orig_dst_pte) ||
> +                   page_mapcount(src_page) != 1) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       anon_vma_unlock_write(src_anon_vma);
> +                       put_anon_vma(src_anon_vma);
> +                       unlock_page(src_page);
> +                       put_page(src_page);
> +                       return -EAGAIN;
> +               }
> +
> +               BUG_ON(!PageAnon(src_page));
> +               /* the PT lock is enough to keep the page pinned now */
> +               put_page(src_page);
> +
> +               dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +               ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
> +                                                 dst_anon_vma);
> +               ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
> +                                                                dst_addr);
> +
> +               if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
> +                             orig_src_pte))
> +                       BUG();
> +
> +               orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
> +               orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
> +                                            dst_vma);
> +
> +               set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> +
> +               double_pt_unlock(dst_ptl, src_ptl);
> +
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +
> +               /* unblock rmap walks */
> +               unlock_page(src_page);
> +
> +               mmu_notifier_invalidate_page(mm, src_addr);
> +       } else {
> +               if (pte_file(orig_src_pte))
> +                       return -EFAULT;
> +
> +               entry = pte_to_swp_entry(orig_src_pte);
> +               if (non_swap_entry(entry)) {
> +                       if (is_migration_entry(entry)) {
> +                               migration_entry_wait(mm, src_pmd, src_addr);
> +                               return -EAGAIN;
> +                       }
> +                       return -EFAULT;
> +               }
> +
> +               if (swp_entry_swapcount(entry) != 1)
> +                       return -EBUSY;
> +
> +               double_pt_lock(dst_ptl, src_ptl);
> +
> +               if (!pte_same(*src_pte, orig_src_pte) ||
> +                   !pte_same(*dst_pte, orig_dst_pte) ||
> +                   swp_entry_swapcount(entry) != 1) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       return -EAGAIN;
> +               }
> +
> +               if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
> +                   pte_val(orig_src_pte))
> +                       BUG();
> +               set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> +
> +               double_pt_unlock(dst_ptl, src_ptl);
> +       }
> +
> +       return 0;
> +}
> +
> +static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> +{
> +       pgd_t *pgd;
> +       pud_t *pud;
> +       pmd_t *pmd = NULL;
> +
> +       pgd = pgd_offset(mm, address);
> +       pud = pud_alloc(mm, pgd, address);
> +       if (pud)
> +               /*
> +                * Note that we aren't necessarily here because the pmd
> +                * was missing: the *pmd may be already established and in
> +                * turn it may also be a trans_huge_pmd.
> +                */
> +               pmd = pmd_alloc(mm, pud, address);
> +       return pmd;
> +}
> +
> +/**
> + * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
> + * @dst_start: start of the destination virtual memory range
> + * @src_start: start of the source virtual memory range
> + * @len: length of the virtual memory range
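> + * @flags: zero or RAP_ALLOW_SRC_HOLES (the only flag currently defined)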
> + *
> + * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
> + * zero copy. It only works on non shared anonymous pages because
> + * those can be relocated without generating non linear anon_vmas in
> + * the rmap code.
> + *
> + * It is the ideal mechanism to handle userspace page faults. Normally
> + * the destination vma will have VM_USERFAULT set with
> + * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
> + * set with madvise(MADV_DONTFORK).
> + *
> + * The thread receiving the page during the userland page fault
> + * (MADV_USERFAULT) will receive the faulting page in the source vma
> + * through the network, storage or any other I/O device (MADV_DONTFORK
> + * in the source vma prevents remap_anon_pages from failing with -EBUSY if
> + * the process forks before remap_anon_pages is called), then it will
> + * call remap_anon_pages to map the page in the faulting address in
> + * the destination vma.
> + *
> + * This syscall works purely via pagetables, so it's the most
> + * efficient way to move physical non shared anonymous pages across
> + * different virtual addresses. Unlike mremap()/mmap()/munmap() it
> + * does not create any new vmas. The mapping in the destination
> + * address is atomic.
> + *
> + * It only works if the vma protection bits are identical in the
> + * source and destination vmas.
> + *
> + * It can remap non shared anonymous pages within the same vma too.
> + *
> + * If the source virtual memory range has any unmapped holes, or if
> + * the destination virtual memory range is not a whole unmapped hole,
> + * remap_anon_pages will fail respectively with -ENOENT or
> + * -EEXIST. This provides a very strict behavior to avoid any chance
> + * of memory corruption going unnoticed if there are userland race
> + * conditions. Only one thread should resolve the userland page fault
> + * at any given time for any given faulting address. This means that
> + * if two threads try to both call remap_anon_pages on the same
> + * destination address at the same time, the second thread will get an
> + * explicit error from this syscall.
> + *
> + * The syscall will return "len" if successful. The syscall
> + * however can be interrupted by fatal signals or errors. If
> + * interrupted it will return the number of bytes successfully
> + * remapped before the interruption if any, or the negative error if
> + * none. It will never return zero. Either it will return an error or
> + * an amount of bytes successfully moved. If the retval reports a
> + * "short" remap, the remap_anon_pages syscall should be repeated by
> + * userland with src+retval, dst+retval, len-retval if it wants to know
> + * about the error that interrupted it.
> + *
> + * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> + * errors from materializing if there are holes in the source virtual
> + * range that is being remapped. The holes will be accounted as
> + * successfully remapped in the retval of the syscall. This is mostly
> + * useful to remap hugepage naturally aligned virtual regions without
> + * knowing if there are transparent hugepages in the regions or not,
> + * but preventing the risk of having to split the hugepmd during the
> + * remap.
> + *
> + * Any rmap walk that takes the anon_vma lock without first
> + * obtaining the page lock (for example split_huge_page and
> + * page_referenced_anon) will have to verify whether the
> + * page->mapping has changed after taking the anon_vma lock. If it
> + * changed, the walker should release the lock and retry obtaining a
> + * new anon_vma, because the anon_vma was changed by remap_anon_pages
> + * before the lock could be obtained. This is the
> + * only additional complexity added to the rmap code to provide this
> + * anonymous page remapping functionality.
> + */
> +SYSCALL_DEFINE4(remap_anon_pages,
> +               unsigned long, dst_start, unsigned long, src_start,
> +               unsigned long, len, unsigned long, flags)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *src_vma, *dst_vma;
> +       long err = -EINVAL;
> +       pmd_t *src_pmd, *dst_pmd;
> +       pte_t *src_pte, *dst_pte;
> +       spinlock_t *dst_ptl, *src_ptl;
> +       unsigned long src_addr, dst_addr;
> +       int thp_aligned = -1;
> +       long moved = 0;
> +
> +       /*
> +        * Sanitize the syscall parameters:
> +        */
> +       if (src_start & ~PAGE_MASK)
> +               return err;
> +       if (dst_start & ~PAGE_MASK)
> +               return err;
> +       if (len & ~PAGE_MASK)
> +               return err;
> +       if (flags & ~RAP_ALLOW_SRC_HOLES)
> +               return err;
> +
> +       /* Does the address range wrap, or is the span zero-sized? */
> +       if (unlikely(src_start + len <= src_start))
> +               return err;
> +       if (unlikely(dst_start + len <= dst_start))
> +               return err;
> +
> +       down_read(&mm->mmap_sem);
> +
> +       /*
> +        * Make sure the vma is not shared, that the src and dst remap
> +        * ranges are both valid and fully within a single existing
> +        * vma.
> +        */
> +       src_vma = find_vma(mm, src_start);
> +       if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> +               goto out;
> +       if (src_start < src_vma->vm_start ||
> +           src_start + len > src_vma->vm_end)
> +               goto out;
> +
> +       dst_vma = find_vma(mm, dst_start);
> +       if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +               goto out;
> +       if (dst_start < dst_vma->vm_start ||
> +           dst_start + len > dst_vma->vm_end)
> +               goto out;
> +
> +       if (pgprot_val(src_vma->vm_page_prot) !=
> +           pgprot_val(dst_vma->vm_page_prot))
> +               goto out;
> +
> +       /* only allow remapping if both are mlocked or both aren't */
> +       if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
> +               goto out;
> +
> +       /*
> +        * Ensure the dst_vma has an anon_vma or this page
> +        * would get a NULL anon_vma when moved in the
> +        * dst_vma.
> +        */
> +       err = -ENOMEM;
> +       if (unlikely(anon_vma_prepare(dst_vma)))
> +               goto out;
> +
> +       for (src_addr = src_start, dst_addr = dst_start;
> +            src_addr < src_start + len; ) {
> +               spinlock_t *ptl;
> +               pmd_t dst_pmdval;
> +               BUG_ON(dst_addr >= dst_start + len);
> +               src_pmd = mm_find_pmd(mm, src_addr);
> +               if (unlikely(!src_pmd)) {
> +                       if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> +                               err = -ENOENT;
> +                               break;
> +                       } else {
> +                               src_pmd = mm_alloc_pmd(mm, src_addr);
> +                               if (unlikely(!src_pmd)) {
> +                                       err = -ENOMEM;
> +                                       break;
> +                               }
> +                       }
> +               }
> +               dst_pmd = mm_alloc_pmd(mm, dst_addr);
> +               if (unlikely(!dst_pmd)) {
> +                       err = -ENOMEM;
> +                       break;
> +               }
> +
> +               dst_pmdval = pmd_read_atomic(dst_pmd);
> +               /*
> +                * If the dst_pmd is mapped as THP don't
> +                * override it and just be strict.
> +                */
> +               if (unlikely(pmd_trans_huge(dst_pmdval))) {
> +                       err = -EEXIST;
> +                       break;
> +               }
> +               if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
> +                       /*
> +                        * Check if we can move the pmd without
> +                        * splitting it. First check the address
> +                        * alignment to be the same in src/dst.  These
> +                        * checks don't actually need the PT lock but
> +                        * it's good to do it here to optimize this
> +                        * block away at build time if
> +                        * CONFIG_TRANSPARENT_HUGEPAGE is not set.
> +                        */
> +                       if (thp_aligned == -1)
> +                               thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
> +                                              (dst_addr & ~HPAGE_PMD_MASK));
> +                       if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
> +                           !pmd_none(dst_pmdval) ||
> +                           src_start + len - src_addr < HPAGE_PMD_SIZE) {
> +                               spin_unlock(ptl);
> +                               /* Fall through */
> +                               split_huge_page_pmd(src_vma, src_addr,
> +                                                   src_pmd);
> +                       } else {
> +                               BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
> +                               err = remap_anon_pages_huge_pmd(mm,
> +                                                               dst_pmd,
> +                                                               src_pmd,
> +                                                               dst_pmdval,
> +                                                               dst_vma,
> +                                                               src_vma,
> +                                                               dst_addr,
> +                                                               src_addr);
> +                               cond_resched();
> +
> +                               if (!err) {
> +                                       dst_addr += HPAGE_PMD_SIZE;
> +                                       src_addr += HPAGE_PMD_SIZE;
> +                                       moved += HPAGE_PMD_SIZE;
> +                               }
> +
> +                               if ((!err || err == -EAGAIN) &&
> +                                   fatal_signal_pending(current))
> +                                       err = -EINTR;
> +
> +                               if (err && err != -EAGAIN)
> +                                       break;
> +
> +                               continue;
> +                       }
> +               }
> +
> +               if (pmd_none(*src_pmd)) {
> +                       if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> +                               err = -ENOENT;
> +                               break;
> +                       } else {
> +                               if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
> +                                                        src_addr))) {
> +                                       err = -ENOMEM;
> +                                       break;
> +                               }
> +                       }
> +               }
> +
> +               /*
> +                * We held the mmap_sem for reading so MADV_DONTNEED
> +                * can zap transparent huge pages under us, or the
> +                * transparent huge page fault can establish new
> +                * transparent huge pages under us.
> +                */
> +               if (unlikely(pmd_trans_unstable(src_pmd))) {
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               if (unlikely(pmd_none(dst_pmdval)) &&
> +                   unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
> +                                        dst_addr))) {
> +                       err = -ENOMEM;
> +                       break;
> +               }
> +               /* If a huge pmd materialized from under us, fail */
> +               if (unlikely(pmd_trans_huge(*dst_pmd))) {
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               BUG_ON(pmd_none(*dst_pmd));
> +               BUG_ON(pmd_none(*src_pmd));
> +               BUG_ON(pmd_trans_huge(*dst_pmd));
> +               BUG_ON(pmd_trans_huge(*src_pmd));
> +
> +               dst_pte = pte_offset_map(dst_pmd, dst_addr);
> +               src_pte = pte_offset_map(src_pmd, src_addr);
> +               dst_ptl = pte_lockptr(mm, dst_pmd);
> +               src_ptl = pte_lockptr(mm, src_pmd);
> +
> +               err = remap_anon_pages_pte(mm,
> +                                          dst_pte, src_pte, src_pmd,
> +                                          dst_vma, src_vma,
> +                                          dst_addr, src_addr,
> +                                          dst_ptl, src_ptl, flags);
> +
> +               pte_unmap(dst_pte);
> +               pte_unmap(src_pte);
> +               cond_resched();
> +
> +               if (!err) {
> +                       dst_addr += PAGE_SIZE;
> +                       src_addr += PAGE_SIZE;
> +                       moved += PAGE_SIZE;
> +               }
> +
> +               if ((!err || err == -EAGAIN) &&
> +                   fatal_signal_pending(current))
> +                       err = -EINTR;
> +
> +               if (err && err != -EAGAIN)
> +                       break;
> +       }
> +
> +out:
> +       up_read(&mm->mmap_sem);
> +       BUG_ON(moved < 0);
> +       BUG_ON(err > 0);
> +       BUG_ON(!moved && !err);
> +       return moved ? moved : err;
> +}
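
As a side note, the MADV_USERFAULT/MADV_DONTFORK convention referred to
in the comment above amounts to a one-time setup like this before any
fault is resolved (illustrative sketch only; MADV_USERFAULT's value is
taken from the test program in this patch, and the helper name is
invented):

===
 #define _GNU_SOURCE
 #include <sys/mman.h>

 #define MADV_USERFAULT 18       /* value used by this patch series */

 /* Sketch: arm dst for userland faults, keep src out of any fork()ed child. */
 static int setup_userfault_pair(void *dst, void *src, size_t size)
 {
         if (madvise(dst, size, MADV_USERFAULT))  /* faults in dst raise SIGBUS */
                 return -1;
         if (madvise(src, size, MADV_DONTFORK))   /* avoids -EBUSY after fork() */
                 return -1;
         return 0;
 }
===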
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c37ca..e24cd7c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1541,6 +1541,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  }
>
>  /*
> + * The PT lock for src_pmd and the mmap_sem for reading are held by
> + * the caller, but this function must release that PT lock before
> + * returning. We're guaranteed the src_pmd is a pmd_trans_huge
> + * until the PT lock of the src_pmd is released. Just move the page
> + * from src_pmd to dst_pmd if possible. Return zero if succeeded in
> + * moving the page, -EAGAIN if it needs to be repeated by the caller,
> + * or other errors in case of failure.
> + */
> +int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> +                             pmd_t *dst_pmd, pmd_t *src_pmd,
> +                             pmd_t dst_pmdval,
> +                             struct vm_area_struct *dst_vma,
> +                             struct vm_area_struct *src_vma,
> +                             unsigned long dst_addr,
> +                             unsigned long src_addr)
> +{
> +       pmd_t _dst_pmd, src_pmdval;
> +       struct page *src_page;
> +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> +       spinlock_t *src_ptl, *dst_ptl;
> +       pgtable_t pgtable;
> +
> +       src_pmdval = *src_pmd;
> +       src_ptl = pmd_lockptr(mm, src_pmd);
> +
> +       BUG_ON(!pmd_trans_huge(src_pmdval));
> +       BUG_ON(pmd_trans_splitting(src_pmdval));
> +       BUG_ON(!pmd_none(dst_pmdval));
> +       BUG_ON(!spin_is_locked(src_ptl));
> +       BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +       src_page = pmd_page(src_pmdval);
> +       BUG_ON(!PageHead(src_page));
> +       BUG_ON(!PageAnon(src_page));
> +       if (unlikely(page_mapcount(src_page) != 1)) {
> +               spin_unlock(src_ptl);
> +               return -EBUSY;
> +       }
> +
> +       get_page(src_page);
> +       spin_unlock(src_ptl);
> +
> +       mmu_notifier_invalidate_range_start(mm, src_addr,
> +                                           src_addr + HPAGE_PMD_SIZE);
> +
> +       /* block all concurrent rmap walks */
> +       lock_page(src_page);
> +
> +       /*
> +        * split_huge_page walks the anon_vma chain without the page
> +        * lock. Serialize against it with the anon_vma lock; the page
> +        * lock is not enough.
> +        */
> +       src_anon_vma = page_get_anon_vma(src_page);
> +       if (!src_anon_vma) {
> +               unlock_page(src_page);
> +               put_page(src_page);
> +               mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                                 src_addr + HPAGE_PMD_SIZE);
> +               return -EAGAIN;
> +       }
> +       anon_vma_lock_write(src_anon_vma);
> +
> +       dst_ptl = pmd_lockptr(mm, dst_pmd);
> +       double_pt_lock(src_ptl, dst_ptl);
> +       if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> +                    !pmd_same(*dst_pmd, dst_pmdval) ||
> +                    page_mapcount(src_page) != 1)) {
> +               double_pt_unlock(src_ptl, dst_ptl);
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +               unlock_page(src_page);
> +               put_page(src_page);
> +               mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                                 src_addr + HPAGE_PMD_SIZE);
> +               return -EAGAIN;
> +       }
> +
> +       BUG_ON(!PageHead(src_page));
> +       BUG_ON(!PageAnon(src_page));
> +       /* the PT lock is enough to keep the page pinned now */
> +       put_page(src_page);
> +
> +       dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +       ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
> +       ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
> +
> +       if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
> +                     src_pmdval))
> +               BUG();
> +       _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
> +       _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> +       set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
> +
> +       pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> +       pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
> +       double_pt_unlock(src_ptl, dst_ptl);
> +
> +       anon_vma_unlock_write(src_anon_vma);
> +       put_anon_vma(src_anon_vma);
> +
> +       /* unblock rmap walks */
> +       unlock_page(src_page);
> +
> +       mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                         src_addr + HPAGE_PMD_SIZE);
> +       return 0;
> +}
> +
> +/*
>   * Returns 1 if a given pmd maps a stable (not under splitting) thp.
>   * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
>   *
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 06/10] mm: sys_remap_anon_pages
@ 2014-07-04 11:30     ` Michael Kerrisk
  0 siblings, 0 replies; 59+ messages in thread
From: Michael Kerrisk @ 2014-07-04 11:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-mm, Linux Kernel, \"Dr. David Alan Gilbert\",
	Johannes Weiner, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter),
	Isaku Yamahata, Paolo Bonzini, Anthony Liguori, Stefan Hajnoczi,
	Wenchao Xia, Andrew Jones, Juan Quintela, Mel Gorman, Linux API

[CC+=linux-api@]

Hi Andrea,

On Wed, Jul 2, 2014 at 6:50 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> This new syscall will move anon pages across vmas, atomically and
> without touching the vmas.

Please CC linux-api on patches that change the API/ABI. (See
https://www.kernel.org/doc/man-pages/linux-api-ml.html)

Cheers,

Michael


> It only works on non shared anonymous pages because those can be
> relocated without generating non linear anon_vmas in the rmap code.
>
> It is the ideal mechanism to handle userspace page faults. Normally
> the destination vma will have VM_USERFAULT set with
> madvise(MADV_USERFAULT) while the source vma will normally have
> VM_DONTCOPY set with madvise(MADV_DONTFORK).
>
> MADV_DONTFORK set in the source vma prevents remap_anon_pages from failing if
> the process forks during the userland page fault.
>
> The thread that triggers the SIGBUS signal handler by touching an
> unmapped hole in the MADV_USERFAULT region should take care to
> receive the data belonging to the faulting virtual address into the
> source vma. The data can come from the network, storage or any other
> I/O device. After the data has been safely received in the private
> area in the source vma, it will call remap_anon_pages to map the page
> in the faulting address in the destination vma atomically. And finally
> it will return from the signal handler.
>
> It is an alternative to mremap.
>
> It only works if the vma protection bits are identical in the source
> and destination vmas.
>
> It can remap non shared anonymous pages within the same vma too.
>
> If the source virtual memory range has any unmapped holes, or if the
> destination virtual memory range is not a whole unmapped hole,
> remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
> provides a very strict behavior to avoid any chance of memory
> corruption going unnoticed if there are userland race conditions. Only
> one thread should resolve the userland page fault at any given time
> for any given faulting address. This means that if two threads try to
> both call remap_anon_pages on the same destination address at the same
> time, the second thread will get an explicit error from this syscall.
>
> The syscall will return "len" if successful. The syscall however
> can be interrupted by fatal signals or errors. If interrupted it will
> return the number of bytes successfully remapped before the
> interruption if any, or the negative error if none. It will never
> return zero. Either it will return an error or an amount of bytes
> successfully moved. If the retval reports a "short" remap, the
> remap_anon_pages syscall should be repeated by userland with
> src+retval, dst+retval, len-retval if it wants to know about the error
> that interrupted it.
>
> The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> errors from materializing if there are holes in the source virtual
> range that is being remapped. The holes will be accounted as
> successfully remapped in the retval of the syscall. This is mostly
> useful to remap hugepage naturally aligned virtual regions without
> knowing whether there are transparent hugepages in the regions or
> not, while avoiding the risk of having to split the hugepmd during
> the remap.
>
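
For example, a SIGBUS handler resolving faults at THP granularity might
do something like the following sketch, which reuses the remap_all()
wrapper above; the RAP_ALLOW_SRC_HOLES value is taken from the patch and
the 2MB constant assumes x86 THP sizing:

===
 #define RAP_ALLOW_SRC_HOLES (1UL<<0)	/* flag value from this patch */
 #define HPAGE_SIZE (2UL*1024*1024)

 /*
  * Resolve one userfault by moving the whole 2MB-aligned block that
  * contains the faulting address, tolerating 4k holes in the staging
  * area so any transparent hugepage there is moved without splitting.
  */
 static void resolve_fault_2m(unsigned long dst_base, unsigned long src_base,
                              unsigned long fault_addr)
 {
        unsigned long off = (fault_addr - dst_base) & ~(HPAGE_SIZE - 1);

        if (remap_all(dst_base + off, src_base + off,
                      HPAGE_SIZE, RAP_ALLOW_SRC_HOLES))
                perror("remap_anon_pages"), exit(1);
 }
===
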
> The main difference from mremap is that, if used to fill holes in
> unmapped anonymous memory vmas (in combination with MADV_USERFAULT),
> remap_anon_pages won't create lots of unmergeable vmas. mremap
> instead would create lots of vmas (because of the non linear
> vma->vm_pgoff), leading to -ENOMEM failures (the number of vmas per
> process is limited).
>
> MADV_USERFAULT and remap_anon_pages() can be tested with a program
> like below:
>
> ===
>  #define _GNU_SOURCE
>  #include <sys/mman.h>
>  #include <pthread.h>
>  #include <strings.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <stdio.h>
>  #include <errno.h>
>  #include <string.h>
>  #include <signal.h>
>  #include <sys/syscall.h>
>  #include <sys/types.h>
>
>  #define USE_USERFAULT
>  #define THP
>
>  #define MADV_USERFAULT 18
>
>  #define SIZE (1024*1024*1024)
>
>  #define SYS_remap_anon_pages 317
>
>  static volatile unsigned char *c, *tmp;
>
>  void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
>  {
>         unsigned char *addr = info->si_addr;
>         int len = 4096;
>         int ret;
>
>         addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
>  #ifdef THP
>         addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
>         len = 2*1024*1024;
>  #endif
>         if (addr >= c && addr < c + SIZE) {
>                 unsigned long offset = addr - c;
>                 ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
>                 if (ret != len)
>                         perror("sigbus remap_anon_pages"), exit(1);
>                 //printf("sigbus offset %lu\n", offset);
>                 return;
>         }
>
>         printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
>  }
>
>  int main()
>  {
>         struct sigaction sa;
>         int ret;
>         unsigned long i;
>  #ifndef THP
>         /*
>          * Fails with THP due lack of alignment because of memset
>          * pre-filling the destination
>          */
>         c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (c == MAP_FAILED)
>                 perror("mmap"), exit(1);
>         tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (tmp == MAP_FAILED)
>                 perror("mmap"), exit(1);
>  #else
>         ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>         ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>  #endif
>         /*
>          * MADV_USERFAULT must run before memset, to avoid THP 2m
>          * faults to map memory into "tmp", if "tmp" isn't allocated
>          * with hugepage alignment.
>          */
>         if (madvise((void *)c, SIZE, MADV_USERFAULT))
>                 perror("madvise"), exit(1);
>         memset((void *)tmp, 0xaa, SIZE);
>
>         sa.sa_sigaction = userfault_sighandler;
>         sigemptyset(&sa.sa_mask);
>         sa.sa_flags = SA_SIGINFO;
>         sigaction(SIGBUS, &sa, NULL);
>
>  #ifndef USE_USERFAULT
>         ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
>         if (ret != SIZE)
>                 perror("remap_anon_pages"), exit(1);
>  #endif
>
>         for (i = 0; i < SIZE; i += 4096) {
>                 if ((i/4096) % 2) {
>                         /* exercise read and write MADV_USERFAULT */
>                         c[i+1] = 0xbb;
>                 }
>                 if (c[i] != 0xaa)
>                         printf("error %x offset %lu\n", c[i], i), exit(1);
>         }
>         printf("remap_anon_pages functions correctly\n");
>
>         return 0;
>  }
> ===
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/syscalls/syscall_32.tbl |   1 +
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/huge_mm.h          |   7 +
>  include/linux/syscalls.h         |   4 +
>  kernel/sys_ni.c                  |   1 +
>  mm/fremap.c                      | 477 +++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c                 | 110 +++++++++
>  7 files changed, 601 insertions(+)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..08bc856 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
>  351    i386    sched_setattr           sys_sched_setattr
>  352    i386    sched_getattr           sys_sched_getattr
>  353    i386    renameat2               sys_renameat2
> +354    i386    remap_anon_pages        sys_remap_anon_pages
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..37bd179 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314    common  sched_setattr           sys_sched_setattr
>  315    common  sched_getattr           sys_sched_getattr
>  316    common  renameat2               sys_renameat2
> +317    common  remap_anon_pages        sys_remap_anon_pages
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 3a2c57e..9a37dd5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
>  extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>                         unsigned long addr, pgprot_t newprot,
>                         int prot_numa);
> +extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> +                                    pmd_t *dst_pmd, pmd_t *src_pmd,
> +                                    pmd_t dst_pmdval,
> +                                    struct vm_area_struct *dst_vma,
> +                                    struct vm_area_struct *src_vma,
> +                                    unsigned long dst_addr,
> +                                    unsigned long src_addr);
>
>  enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_FLAG,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..19edb00 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -447,6 +447,10 @@ asmlinkage long sys_mremap(unsigned long addr,
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>                         unsigned long prot, unsigned long pgoff,
>                         unsigned long flags);
> +asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
> +                                    unsigned long src_start,
> +                                    unsigned long len,
> +                                    unsigned long flags);
>  asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
>  asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
>  asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 36441b5..6fc1aca 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -177,6 +177,7 @@ cond_syscall(sys_mincore);
>  cond_syscall(sys_madvise);
>  cond_syscall(sys_mremap);
>  cond_syscall(sys_remap_file_pages);
> +cond_syscall(sys_remap_anon_pages);
>  cond_syscall(compat_sys_move_pages);
>  cond_syscall(compat_sys_migrate_pages);
>
> diff --git a/mm/fremap.c b/mm/fremap.c
> index 1e509f7..9337637 100644
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
>         if (ptl1 != ptl2)
>                 spin_unlock(ptl2);
>  }
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +/*
> + * The mmap_sem for reading is held by the caller. Just move the page
> + * from src_pmd to dst_pmd if possible, and return true if succeeded
> + * in moving the page.
> + */
> +static int remap_anon_pages_pte(struct mm_struct *mm,
> +                               pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
> +                               struct vm_area_struct *dst_vma,
> +                               struct vm_area_struct *src_vma,
> +                               unsigned long dst_addr,
> +                               unsigned long src_addr,
> +                               spinlock_t *dst_ptl,
> +                               spinlock_t *src_ptl,
> +                               unsigned long flags)
> +{
> +       struct page *src_page;
> +       swp_entry_t entry;
> +       pte_t orig_src_pte, orig_dst_pte;
> +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> +
> +       spin_lock(dst_ptl);
> +       orig_dst_pte = *dst_pte;
> +       spin_unlock(dst_ptl);
> +       if (!pte_none(orig_dst_pte))
> +               return -EEXIST;
> +
> +       spin_lock(src_ptl);
> +       orig_src_pte = *src_pte;
> +       spin_unlock(src_ptl);
> +       if (pte_none(orig_src_pte)) {
> +               if (!(flags & RAP_ALLOW_SRC_HOLES))
> +                       return -ENOENT;
> +               else
> +                       /* nothing to do to remap an hole */
> +                       return 0;
> +       }
> +
> +       if (pte_present(orig_src_pte)) {
> +               /*
> +                * Pin the page while holding the lock to be sure the
> +                * page isn't freed under us
> +                */
> +               spin_lock(src_ptl);
> +               if (!pte_same(orig_src_pte, *src_pte)) {
> +                       spin_unlock(src_ptl);
> +                       return -EAGAIN;
> +               }
> +               src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
> +               if (!src_page || !PageAnon(src_page) ||
> +                   page_mapcount(src_page) != 1) {
> +                       spin_unlock(src_ptl);
> +                       return -EBUSY;
> +               }
> +
> +               get_page(src_page);
> +               spin_unlock(src_ptl);
> +
> +               /* block all concurrent rmap walks */
> +               lock_page(src_page);
> +
> +               /*
> +                * page_referenced_anon walks the anon_vma chain
> +                * without the page lock. Serialize against it with
> +                * the anon_vma lock, the page lock is not enough.
> +                */
> +               src_anon_vma = page_get_anon_vma(src_page);
> +               if (!src_anon_vma) {
> +                       /* page was unmapped from under us */
> +                       unlock_page(src_page);
> +                       put_page(src_page);
> +                       return -EAGAIN;
> +               }
> +               anon_vma_lock_write(src_anon_vma);
> +
> +               double_pt_lock(dst_ptl, src_ptl);
> +
> +               if (!pte_same(*src_pte, orig_src_pte) ||
> +                   !pte_same(*dst_pte, orig_dst_pte) ||
> +                   page_mapcount(src_page) != 1) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       anon_vma_unlock_write(src_anon_vma);
> +                       put_anon_vma(src_anon_vma);
> +                       unlock_page(src_page);
> +                       put_page(src_page);
> +                       return -EAGAIN;
> +               }
> +
> +               BUG_ON(!PageAnon(src_page));
> +               /* the PT lock is enough to keep the page pinned now */
> +               put_page(src_page);
> +
> +               dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +               ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
> +                                                 dst_anon_vma);
> +               ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
> +                                                                dst_addr);
> +
> +               if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
> +                             orig_src_pte))
> +                       BUG();
> +
> +               orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
> +               orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
> +                                            dst_vma);
> +
> +               set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> +
> +               double_pt_unlock(dst_ptl, src_ptl);
> +
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +
> +               /* unblock rmap walks */
> +               unlock_page(src_page);
> +
> +               mmu_notifier_invalidate_page(mm, src_addr);
> +       } else {
> +               if (pte_file(orig_src_pte))
> +                       return -EFAULT;
> +
> +               entry = pte_to_swp_entry(orig_src_pte);
> +               if (non_swap_entry(entry)) {
> +                       if (is_migration_entry(entry)) {
> +                               migration_entry_wait(mm, src_pmd, src_addr);
> +                               return -EAGAIN;
> +                       }
> +                       return -EFAULT;
> +               }
> +
> +               if (swp_entry_swapcount(entry) != 1)
> +                       return -EBUSY;
> +
> +               double_pt_lock(dst_ptl, src_ptl);
> +
> +               if (!pte_same(*src_pte, orig_src_pte) ||
> +                   !pte_same(*dst_pte, orig_dst_pte) ||
> +                   swp_entry_swapcount(entry) != 1) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       return -EAGAIN;
> +               }
> +
> +               if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
> +                   pte_val(orig_src_pte))
> +                       BUG();
> +               set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> +
> +               double_pt_unlock(dst_ptl, src_ptl);
> +       }
> +
> +       return 0;
> +}
> +
> +static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> +{
> +       pgd_t *pgd;
> +       pud_t *pud;
> +       pmd_t *pmd = NULL;
> +
> +       pgd = pgd_offset(mm, address);
> +       pud = pud_alloc(mm, pgd, address);
> +       if (pud)
> +               /*
> +                * Note that we didn't run this because the pmd was
> +                * missing, the *pmd may be already established and in
> +                * turn it may also be a trans_huge_pmd.
> +                */
> +               pmd = pmd_alloc(mm, pud, address);
> +       return pmd;
> +}
> +
> +/**
> + * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
> + * @dst_start: start of the destination virtual memory range
> + * @src_start: start of the source virtual memory range
> + * @len: length of the virtual memory range
> + *
> + * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
> + * zero copy. It only works on non shared anonymous pages because
> + * those can be relocated without generating non linear anon_vmas in
> + * the rmap code.
> + *
> + * It is the ideal mechanism to handle userspace page faults. Normally
> + * the destination vma will have VM_USERFAULT set with
> + * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
> + * set with madvise(MADV_DONTFORK).
> + *
> + * The thread receiving the page during the userland page fault
> + * (MADV_USERFAULT) will receive the faulting page in the source vma
> + * through the network, storage or any other I/O device (MADV_DONTFORK
> + * in the source vma prevents remap_anon_pages from failing with -EBUSY if
> + * the process forks before remap_anon_pages is called), then it will
> + * call remap_anon_pages to map the page in the faulting address in
> + * the destination vma.
> + *
> + * This syscall works purely via pagetables, so it's the most
> + * efficient way to move physical non shared anonymous pages across
> + * different virtual addresses. Unlike mremap()/mmap()/munmap() it
> + * does not create any new vmas. The mapping in the destination
> + * address is atomic.
> + *
> + * It only works if the vma protection bits are identical from the
> + * source and destination vma.
> + *
> + * It can remap non shared anonymous pages within the same vma too.
> + *
> + * If the source virtual memory range has any unmapped holes, or if
> + * the destination virtual memory range is not a whole unmapped hole,
> + * remap_anon_pages will fail respectively with -ENOENT or
> + * -EEXIST. This provides a very strict behavior to avoid any chance
> + * of memory corruption going unnoticed if there are userland race
> + * conditions. Only one thread should resolve the userland page fault
> + * at any given time for any given faulting address. This means that
> + * if two threads try to both call remap_anon_pages on the same
> + * destination address at the same time, the second thread will get an
> + * explicit error from this syscall.
> + *
> + * The syscall will return "len" if successful. The syscall
> + * however can be interrupted by fatal signals or errors. If
> + * interrupted it will return the number of bytes successfully
> + * remapped before the interruption if any, or the negative error if
> + * none. It will never return zero. Either it will return an error or
> + * an amount of bytes successfully moved. If the retval reports a
> + * "short" remap, the remap_anon_pages syscall should be repeated by
> + * userland with src+retval, dst+retval, len-retval if it wants to know
> + * about the error that interrupted it.
> + *
> + * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> + * errors from materializing if there are holes in the source virtual
> + * range that is being remapped. The holes will be accounted as
> + * successfully remapped in the retval of the syscall. This is mostly
> + * useful to remap hugepage naturally aligned virtual regions without
> + * knowing if there are transparent hugepage in the regions or not,
> + * but preventing the risk of having to split the hugepmd during the
> + * remap.
> + *
> + * If there's any rmap walk that is taking the anon_vma locks without
> + * first obtaining the page lock (for example split_huge_page and
> + * page_referenced_anon), they will have to verify if the
> + * page->mapping has changed after taking the anon_vma lock. If it
> + * changed they should release the lock and retry obtaining a new
> + * anon_vma, because it means the anon_vma was changed by
> + * remap_anon_pages before the lock could be obtained. This is the
> + * only additional complexity added to the rmap code to provide this
> + * anonymous page remapping functionality.
> + */
> +SYSCALL_DEFINE4(remap_anon_pages,
> +               unsigned long, dst_start, unsigned long, src_start,
> +               unsigned long, len, unsigned long, flags)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *src_vma, *dst_vma;
> +       long err = -EINVAL;
> +       pmd_t *src_pmd, *dst_pmd;
> +       pte_t *src_pte, *dst_pte;
> +       spinlock_t *dst_ptl, *src_ptl;
> +       unsigned long src_addr, dst_addr;
> +       int thp_aligned = -1;
> +       long moved = 0;
> +
> +       /*
> +        * Sanitize the syscall parameters:
> +        */
> +       if (src_start & ~PAGE_MASK)
> +               return err;
> +       if (dst_start & ~PAGE_MASK)
> +               return err;
> +       if (len & ~PAGE_MASK)
> +               return err;
> +       if (flags & ~RAP_ALLOW_SRC_HOLES)
> +               return err;
> +
> +       /* Does the address range wrap, or is the span zero-sized? */
> +       if (unlikely(src_start + len <= src_start))
> +               return err;
> +       if (unlikely(dst_start + len <= dst_start))
> +               return err;
> +
> +       down_read(&mm->mmap_sem);
> +
> +       /*
> +        * Make sure the vma is not shared, that the src and dst remap
> +        * ranges are both valid and fully within a single existing
> +        * vma.
> +        */
> +       src_vma = find_vma(mm, src_start);
> +       if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> +               goto out;
> +       if (src_start < src_vma->vm_start ||
> +           src_start + len > src_vma->vm_end)
> +               goto out;
> +
> +       dst_vma = find_vma(mm, dst_start);
> +       if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +               goto out;
> +       if (dst_start < dst_vma->vm_start ||
> +           dst_start + len > dst_vma->vm_end)
> +               goto out;
> +
> +       if (pgprot_val(src_vma->vm_page_prot) !=
> +           pgprot_val(dst_vma->vm_page_prot))
> +               goto out;
> +
> +       /* only allow remapping if both are mlocked or both aren't */
> +       if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
> +               goto out;
> +
> +       /*
> +        * Ensure the dst_vma has a anon_vma or this page
> +        * would get a NULL anon_vma when moved in the
> +        * dst_vma.
> +        */
> +       err = -ENOMEM;
> +       if (unlikely(anon_vma_prepare(dst_vma)))
> +               goto out;
> +
> +       for (src_addr = src_start, dst_addr = dst_start;
> +            src_addr < src_start + len; ) {
> +               spinlock_t *ptl;
> +               pmd_t dst_pmdval;
> +               BUG_ON(dst_addr >= dst_start + len);
> +               src_pmd = mm_find_pmd(mm, src_addr);
> +               if (unlikely(!src_pmd)) {
> +                       if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> +                               err = -ENOENT;
> +                               break;
> +                       } else {
> +                               src_pmd = mm_alloc_pmd(mm, src_addr);
> +                               if (unlikely(!src_pmd)) {
> +                                       err = -ENOMEM;
> +                                       break;
> +                               }
> +                       }
> +               }
> +               dst_pmd = mm_alloc_pmd(mm, dst_addr);
> +               if (unlikely(!dst_pmd)) {
> +                       err = -ENOMEM;
> +                       break;
> +               }
> +
> +               dst_pmdval = pmd_read_atomic(dst_pmd);
> +               /*
> +                * If the dst_pmd is mapped as THP don't
> +                * override it and just be strict.
> +                */
> +               if (unlikely(pmd_trans_huge(dst_pmdval))) {
> +                       err = -EEXIST;
> +                       break;
> +               }
> +               if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
> +                       /*
> +                        * Check if we can move the pmd without
> +                        * splitting it. First check the address
> +                        * alignment to be the same in src/dst.  These
> +                        * checks don't actually need the PT lock but
> +                        * it's good to do it here to optimize this
> +                        * block away at build time if
> +                        * CONFIG_TRANSPARENT_HUGEPAGE is not set.
> +                        */
> +                       if (thp_aligned == -1)
> +                               thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
> +                                              (dst_addr & ~HPAGE_PMD_MASK));
> +                       if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
> +                           !pmd_none(dst_pmdval) ||
> +                           src_start + len - src_addr < HPAGE_PMD_SIZE) {
> +                               spin_unlock(ptl);
> +                               /* Fall through */
> +                               split_huge_page_pmd(src_vma, src_addr,
> +                                                   src_pmd);
> +                       } else {
> +                               BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
> +                               err = remap_anon_pages_huge_pmd(mm,
> +                                                               dst_pmd,
> +                                                               src_pmd,
> +                                                               dst_pmdval,
> +                                                               dst_vma,
> +                                                               src_vma,
> +                                                               dst_addr,
> +                                                               src_addr);
> +                               cond_resched();
> +
> +                               if (!err) {
> +                                       dst_addr += HPAGE_PMD_SIZE;
> +                                       src_addr += HPAGE_PMD_SIZE;
> +                                       moved += HPAGE_PMD_SIZE;
> +                               }
> +
> +                               if ((!err || err == -EAGAIN) &&
> +                                   fatal_signal_pending(current))
> +                                       err = -EINTR;
> +
> +                               if (err && err != -EAGAIN)
> +                                       break;
> +
> +                               continue;
> +                       }
> +               }
> +
> +               if (pmd_none(*src_pmd)) {
> +                       if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> +                               err = -ENOENT;
> +                               break;
> +                       } else {
> +                               if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
> +                                                        src_addr))) {
> +                                       err = -ENOMEM;
> +                                       break;
> +                               }
> +                       }
> +               }
> +
> +               /*
> +                * We held the mmap_sem for reading so MADV_DONTNEED
> +                * can zap transparent huge pages under us, or the
> +                * transparent huge page fault can establish new
> +                * transparent huge pages under us.
> +                */
> +               if (unlikely(pmd_trans_unstable(src_pmd))) {
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               if (unlikely(pmd_none(dst_pmdval)) &&
> +                   unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
> +                                        dst_addr))) {
> +                       err = -ENOMEM;
> +                       break;
> +               }
> +               /* If an huge pmd materialized from under us fail */
> +               if (unlikely(pmd_trans_huge(*dst_pmd))) {
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               BUG_ON(pmd_none(*dst_pmd));
> +               BUG_ON(pmd_none(*src_pmd));
> +               BUG_ON(pmd_trans_huge(*dst_pmd));
> +               BUG_ON(pmd_trans_huge(*src_pmd));
> +
> +               dst_pte = pte_offset_map(dst_pmd, dst_addr);
> +               src_pte = pte_offset_map(src_pmd, src_addr);
> +               dst_ptl = pte_lockptr(mm, dst_pmd);
> +               src_ptl = pte_lockptr(mm, src_pmd);
> +
> +               err = remap_anon_pages_pte(mm,
> +                                          dst_pte, src_pte, src_pmd,
> +                                          dst_vma, src_vma,
> +                                          dst_addr, src_addr,
> +                                          dst_ptl, src_ptl, flags);
> +
> +               pte_unmap(dst_pte);
> +               pte_unmap(src_pte);
> +               cond_resched();
> +
> +               if (!err) {
> +                       dst_addr += PAGE_SIZE;
> +                       src_addr += PAGE_SIZE;
> +                       moved += PAGE_SIZE;
> +               }
> +
> +               if ((!err || err == -EAGAIN) &&
> +                   fatal_signal_pending(current))
> +                       err = -EINTR;
> +
> +               if (err && err != -EAGAIN)
> +                       break;
> +       }
> +
> +out:
> +       up_read(&mm->mmap_sem);
> +       BUG_ON(moved < 0);
> +       BUG_ON(err > 0);
> +       BUG_ON(!moved && !err);
> +       return moved ? moved : err;
> +}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c37ca..e24cd7c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1541,6 +1541,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  }
>
>  /*
> + * The PT lock for src_pmd and the mmap_sem for reading are held by
> + * the caller, but it must return after releasing the
> + * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
> + * until the PT lock of the src_pmd is released. Just move the page
> + * from src_pmd to dst_pmd if possible. Return zero if succeeded in
> + * moving the page, -EAGAIN if it needs to be repeated by the caller,
> + * or other errors in case of failure.
> + */
> +int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> +                             pmd_t *dst_pmd, pmd_t *src_pmd,
> +                             pmd_t dst_pmdval,
> +                             struct vm_area_struct *dst_vma,
> +                             struct vm_area_struct *src_vma,
> +                             unsigned long dst_addr,
> +                             unsigned long src_addr)
> +{
> +       pmd_t _dst_pmd, src_pmdval;
> +       struct page *src_page;
> +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> +       spinlock_t *src_ptl, *dst_ptl;
> +       pgtable_t pgtable;
> +
> +       src_pmdval = *src_pmd;
> +       src_ptl = pmd_lockptr(mm, src_pmd);
> +
> +       BUG_ON(!pmd_trans_huge(src_pmdval));
> +       BUG_ON(pmd_trans_splitting(src_pmdval));
> +       BUG_ON(!pmd_none(dst_pmdval));
> +       BUG_ON(!spin_is_locked(src_ptl));
> +       BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +       src_page = pmd_page(src_pmdval);
> +       BUG_ON(!PageHead(src_page));
> +       BUG_ON(!PageAnon(src_page));
> +       if (unlikely(page_mapcount(src_page) != 1)) {
> +               spin_unlock(src_ptl);
> +               return -EBUSY;
> +       }
> +
> +       get_page(src_page);
> +       spin_unlock(src_ptl);
> +
> +       mmu_notifier_invalidate_range_start(mm, src_addr,
> +                                           src_addr + HPAGE_PMD_SIZE);
> +
> +       /* block all concurrent rmap walks */
> +       lock_page(src_page);
> +
> +       /*
> +        * split_huge_page walks the anon_vma chain without the page
> +        * lock. Serialize against it with the anon_vma lock, the page
> +        * lock is not enough.
> +        */
> +       src_anon_vma = page_get_anon_vma(src_page);
> +       if (!src_anon_vma) {
> +               unlock_page(src_page);
> +               put_page(src_page);
> +               mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                                 src_addr + HPAGE_PMD_SIZE);
> +               return -EAGAIN;
> +       }
> +       anon_vma_lock_write(src_anon_vma);
> +
> +       dst_ptl = pmd_lockptr(mm, dst_pmd);
> +       double_pt_lock(src_ptl, dst_ptl);
> +       if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> +                    !pmd_same(*dst_pmd, dst_pmdval) ||
> +                    page_mapcount(src_page) != 1)) {
> +               double_pt_unlock(src_ptl, dst_ptl);
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +               unlock_page(src_page);
> +               put_page(src_page);
> +               mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                                 src_addr + HPAGE_PMD_SIZE);
> +               return -EAGAIN;
> +       }
> +
> +       BUG_ON(!PageHead(src_page));
> +       BUG_ON(!PageAnon(src_page));
> +       /* the PT lock is enough to keep the page pinned now */
> +       put_page(src_page);
> +
> +       dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +       ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
> +       ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
> +
> +       if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
> +                     src_pmdval))
> +               BUG();
> +       _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
> +       _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> +       set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
> +
> +       pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> +       pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
> +       double_pt_unlock(src_ptl, dst_ptl);
> +
> +       anon_vma_unlock_write(src_anon_vma);
> +       put_anon_vma(src_anon_vma);
> +
> +       /* unblock rmap walks */
> +       unlock_page(src_page);
> +
> +       mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                         src_addr + HPAGE_PMD_SIZE);
> +       return 0;
> +}
> +
> +/*
>   * Returns 1 if a given pmd maps a stable (not under splitting) thp.
>   * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
>   *
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] [PATCH 06/10] mm: sys_remap_anon_pages
@ 2014-07-04 11:30     ` Michael Kerrisk
  0 siblings, 0 replies; 59+ messages in thread
From: Michael Kerrisk @ 2014-07-04 11:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robert Love, Dave Hansen, Jan Kara, kvm, Neil Brown,
	Stefan Hajnoczi, qemu-devel, linux-mm, KOSAKI Motohiro,
	Michel Lespinasse, Taras Glek, Andrew Jones, Juan Quintela,
	Hugh Dickins, Isaku Yamahata, Mel Gorman, Android Kernel Team,
	Mel Gorman, "Dr. David Alan Gilbert", Huangpeng (Peter),
	Anthony Liguori, Paolo Bonzini, Keith Packard, Wenchao Xia,
	Linux API, Linux Kernel, Minchan Kim, Dmitry Adamushko,
	Johannes Weiner, Mike Hommey, Andrew Morton

[CC+=linux-api@]

Hi Andrea,

On Wed, Jul 2, 2014 at 6:50 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> This new syscall will move anon pages across vmas, atomically and
> without touching the vmas.

Please CC linux-api on patches that change the API/ABI. (See
https://www.kernel.org/doc/man-pages/linux-api-ml.html)

Cheers,

Michael


> It only works on non shared anonymous pages because those can be
> relocated without generating non linear anon_vmas in the rmap code.
>
> It is the ideal mechanism to handle userspace page faults. Normally
> the destination vma will have VM_USERFAULT set with
> madvise(MADV_USERFAULT) while the source vma will normally have
> VM_DONTCOPY set with madvise(MADV_DONTFORK).
>
> MADV_DONTFORK set in the source vma prevents remap_anon_pages from
> failing if the process forks during the userland page fault.
>
> The thread that triggers the SIGBUS signal handler by touching an
> unmapped hole in the MADV_USERFAULT region should take care to
> receive the data belonging to the faulting virtual address into the
> source vma. The data can come from the network, storage or any other
> I/O device. After the data has been safely received in the private
> area of the source vma, it calls remap_anon_pages to map the page
> at the faulting address in the destination vma atomically, and
> finally it returns from the signal handler.
>
> It is an alternative to mremap.
>
> It only works if the vma protection bits are identical between the
> source and destination vmas.
>
> It can remap non shared anonymous pages within the same vma too.
>
> If the source virtual memory range has any unmapped holes, or if the
> destination virtual memory range is not a whole unmapped hole,
> remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
> provides a very strict behavior to avoid any chance of memory
> corruption going unnoticed if there are userland race conditions. Only
> one thread should resolve the userland page fault at any given time
> for any given faulting address. This means that if two threads both
> try to call remap_anon_pages on the same destination address at the
> same time, the second thread will get an explicit error from this
> syscall.
>
> The syscall will return "len" if successful. The syscall however
> can be interrupted by fatal signals or errors. If interrupted it will
> return the number of bytes successfully remapped before the
> interruption if any, or the negative error if none. It will never
> return zero: either it will return an error or the number of bytes
> successfully moved. If the retval reports a "short" remap, the
> remap_anon_pages syscall should be repeated by userland with
> src+retval, dst+retval, len-retval if it wants to know about the
> error that interrupted it.
>
> The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> errors from materializing if there are holes in the source virtual
> range that is being remapped. The holes will be accounted as
> successfully remapped in the retval of the syscall. This is mostly
> useful to remap hugepage naturally aligned virtual regions without
> knowing whether there are transparent hugepages in the regions or
> not, while avoiding the risk of having to split the hugepmd during
> the remap.
>
> The main difference from mremap is that, if used to fill holes in
> unmapped anonymous memory vmas (in combination with MADV_USERFAULT),
> remap_anon_pages won't create lots of unmergeable vmas. mremap
> instead would create lots of vmas (because of the non linear
> vma->vm_pgoff), leading to -ENOMEM failures (the number of vmas per
> process is limited).
>
> MADV_USERFAULT and remap_anon_pages() can be tested with a program
> like below:
>
> ===
>  #define _GNU_SOURCE
>  #include <sys/mman.h>
>  #include <pthread.h>
>  #include <strings.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <stdio.h>
>  #include <errno.h>
>  #include <string.h>
>  #include <signal.h>
>  #include <sys/syscall.h>
>  #include <sys/types.h>
>
>  #define USE_USERFAULT
>  #define THP
>
>  #define MADV_USERFAULT 18
>
>  #define SIZE (1024*1024*1024)
>
>  #define SYS_remap_anon_pages 317
>
>  static volatile unsigned char *c, *tmp;
>
>  void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
>  {
>         unsigned char *addr = info->si_addr;
>         int len = 4096;
>         int ret;
>
>         addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
>  #ifdef THP
>         addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
>         len = 2*1024*1024;
>  #endif
>         if (addr >= c && addr < c + SIZE) {
>                 unsigned long offset = addr - c;
>                 ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
>                 if (ret != len)
>                         perror("sigbus remap_anon_pages"), exit(1);
>                 //printf("sigbus offset %lu\n", offset);
>                 return;
>         }
>
>         printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
>  }
>
>  int main()
>  {
>         struct sigaction sa;
>         int ret;
>         unsigned long i;
>  #ifndef THP
>         /*
>          * Fails with THP due lack of alignment because of memset
>          * pre-filling the destination
>          */
>         c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (c == MAP_FAILED)
>                 perror("mmap"), exit(1);
>         tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (tmp == MAP_FAILED)
>                 perror("mmap"), exit(1);
>  #else
>         ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>         ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>  #endif
>         /*
>          * MADV_USERFAULT must run before memset, to avoid THP 2m
>          * faults to map memory into "tmp", if "tmp" isn't allocated
>          * with hugepage alignment.
>          */
>         if (madvise((void *)c, SIZE, MADV_USERFAULT))
>                 perror("madvise"), exit(1);
>         memset((void *)tmp, 0xaa, SIZE);
>
>         sa.sa_sigaction = userfault_sighandler;
>         sigemptyset(&sa.sa_mask);
>         sa.sa_flags = SA_SIGINFO;
>         sigaction(SIGBUS, &sa, NULL);
>
>  #ifndef USE_USERFAULT
>         ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
>         if (ret != SIZE)
>                 perror("remap_anon_pages"), exit(1);
>  #endif
>
>         for (i = 0; i < SIZE; i += 4096) {
>                 if ((i/4096) % 2) {
>                         /* exercise read and write MADV_USERFAULT */
>                         c[i+1] = 0xbb;
>                 }
>                 if (c[i] != 0xaa)
>                         printf("error %x offset %lu\n", c[i], i), exit(1);
>         }
>         printf("remap_anon_pages functions correctly\n");
>
>         return 0;
>  }
> ===
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/syscalls/syscall_32.tbl |   1 +
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/huge_mm.h          |   7 +
>  include/linux/syscalls.h         |   4 +
>  kernel/sys_ni.c                  |   1 +
>  mm/fremap.c                      | 477 +++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c                 | 110 +++++++++
>  7 files changed, 601 insertions(+)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..08bc856 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
>  351    i386    sched_setattr           sys_sched_setattr
>  352    i386    sched_getattr           sys_sched_getattr
>  353    i386    renameat2               sys_renameat2
> +354    i386    remap_anon_pages        sys_remap_anon_pages
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..37bd179 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314    common  sched_setattr           sys_sched_setattr
>  315    common  sched_getattr           sys_sched_getattr
>  316    common  renameat2               sys_renameat2
> +317    common  remap_anon_pages        sys_remap_anon_pages
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 3a2c57e..9a37dd5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
>  extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>                         unsigned long addr, pgprot_t newprot,
>                         int prot_numa);
> +extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> +                                    pmd_t *dst_pmd, pmd_t *src_pmd,
> +                                    pmd_t dst_pmdval,
> +                                    struct vm_area_struct *dst_vma,
> +                                    struct vm_area_struct *src_vma,
> +                                    unsigned long dst_addr,
> +                                    unsigned long src_addr);
>
>  enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_FLAG,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..19edb00 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -447,6 +447,10 @@ asmlinkage long sys_mremap(unsigned long addr,
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>                         unsigned long prot, unsigned long pgoff,
>                         unsigned long flags);
> +asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
> +                                    unsigned long src_start,
> +                                    unsigned long len,
> +                                    unsigned long flags);
>  asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
>  asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
>  asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 36441b5..6fc1aca 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -177,6 +177,7 @@ cond_syscall(sys_mincore);
>  cond_syscall(sys_madvise);
>  cond_syscall(sys_mremap);
>  cond_syscall(sys_remap_file_pages);
> +cond_syscall(sys_remap_anon_pages);
>  cond_syscall(compat_sys_move_pages);
>  cond_syscall(compat_sys_migrate_pages);
>
> diff --git a/mm/fremap.c b/mm/fremap.c
> index 1e509f7..9337637 100644
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
>         if (ptl1 != ptl2)
>                 spin_unlock(ptl2);
>  }
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +/*
> + * The mmap_sem for reading is held by the caller. Just move the page
> + * from src_pmd to dst_pmd if possible, and return true if succeeded
> + * in moving the page.
> + */
> +static int remap_anon_pages_pte(struct mm_struct *mm,
> +                               pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
> +                               struct vm_area_struct *dst_vma,
> +                               struct vm_area_struct *src_vma,
> +                               unsigned long dst_addr,
> +                               unsigned long src_addr,
> +                               spinlock_t *dst_ptl,
> +                               spinlock_t *src_ptl,
> +                               unsigned long flags)
> +{
> +       struct page *src_page;
> +       swp_entry_t entry;
> +       pte_t orig_src_pte, orig_dst_pte;
> +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> +
> +       spin_lock(dst_ptl);
> +       orig_dst_pte = *dst_pte;
> +       spin_unlock(dst_ptl);
> +       if (!pte_none(orig_dst_pte))
> +               return -EEXIST;
> +
> +       spin_lock(src_ptl);
> +       orig_src_pte = *src_pte;
> +       spin_unlock(src_ptl);
> +       if (pte_none(orig_src_pte)) {
> +               if (!(flags & RAP_ALLOW_SRC_HOLES))
> +                       return -ENOENT;
> +               else
> +                       /* nothing to do to remap an hole */
> +                       return 0;
> +       }
> +
> +       if (pte_present(orig_src_pte)) {
> +               /*
> +                * Pin the page while holding the lock to be sure the
> +                * page isn't freed under us
> +                */
> +               spin_lock(src_ptl);
> +               if (!pte_same(orig_src_pte, *src_pte)) {
> +                       spin_unlock(src_ptl);
> +                       return -EAGAIN;
> +               }
> +               src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
> +               if (!src_page || !PageAnon(src_page) ||
> +                   page_mapcount(src_page) != 1) {
> +                       spin_unlock(src_ptl);
> +                       return -EBUSY;
> +               }
> +
> +               get_page(src_page);
> +               spin_unlock(src_ptl);
> +
> +               /* block all concurrent rmap walks */
> +               lock_page(src_page);
> +
> +               /*
> +                * page_referenced_anon walks the anon_vma chain
> +                * without the page lock. Serialize against it with
> +                * the anon_vma lock, the page lock is not enough.
> +                */
> +               src_anon_vma = page_get_anon_vma(src_page);
> +               if (!src_anon_vma) {
> +                       /* page was unmapped from under us */
> +                       unlock_page(src_page);
> +                       put_page(src_page);
> +                       return -EAGAIN;
> +               }
> +               anon_vma_lock_write(src_anon_vma);
> +
> +               double_pt_lock(dst_ptl, src_ptl);
> +
> +               if (!pte_same(*src_pte, orig_src_pte) ||
> +                   !pte_same(*dst_pte, orig_dst_pte) ||
> +                   page_mapcount(src_page) != 1) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       anon_vma_unlock_write(src_anon_vma);
> +                       put_anon_vma(src_anon_vma);
> +                       unlock_page(src_page);
> +                       put_page(src_page);
> +                       return -EAGAIN;
> +               }
> +
> +               BUG_ON(!PageAnon(src_page));
> +               /* the PT lock is enough to keep the page pinned now */
> +               put_page(src_page);
> +
> +               dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +               ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
> +                                                 dst_anon_vma);
> +               ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
> +                                                                dst_addr);
> +
> +               if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
> +                             orig_src_pte))
> +                       BUG();
> +
> +               orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
> +               orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
> +                                            dst_vma);
> +
> +               set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> +
> +               double_pt_unlock(dst_ptl, src_ptl);
> +
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +
> +               /* unblock rmap walks */
> +               unlock_page(src_page);
> +
> +               mmu_notifier_invalidate_page(mm, src_addr);
> +       } else {
> +               if (pte_file(orig_src_pte))
> +                       return -EFAULT;
> +
> +               entry = pte_to_swp_entry(orig_src_pte);
> +               if (non_swap_entry(entry)) {
> +                       if (is_migration_entry(entry)) {
> +                               migration_entry_wait(mm, src_pmd, src_addr);
> +                               return -EAGAIN;
> +                       }
> +                       return -EFAULT;
> +               }
> +
> +               if (swp_entry_swapcount(entry) != 1)
> +                       return -EBUSY;
> +
> +               double_pt_lock(dst_ptl, src_ptl);
> +
> +               if (!pte_same(*src_pte, orig_src_pte) ||
> +                   !pte_same(*dst_pte, orig_dst_pte) ||
> +                   swp_entry_swapcount(entry) != 1) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       return -EAGAIN;
> +               }
> +
> +               if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
> +                   pte_val(orig_src_pte))
> +                       BUG();
> +               set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> +
> +               double_pt_unlock(dst_ptl, src_ptl);
> +       }
> +
> +       return 0;
> +}
> +
> +static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> +{
> +       pgd_t *pgd;
> +       pud_t *pud;
> +       pmd_t *pmd = NULL;
> +
> +       pgd = pgd_offset(mm, address);
> +       pud = pud_alloc(mm, pgd, address);
> +       if (pud)
> +               /*
> +                * Note that we didn't run this because the pmd was
> +                * missing, the *pmd may be already established and in
> +                * turn it may also be a trans_huge_pmd.
> +                */
> +               pmd = pmd_alloc(mm, pud, address);
> +       return pmd;
> +}
> +
> +/**
> + * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
> + * @dst_start: start of the destination virtual memory range
> + * @src_start: start of the source virtual memory range
> + * @len: length of the virtual memory range
> + *
> + * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
> + * zero copy. It only works on non shared anonymous pages because
> + * those can be relocated without generating non linear anon_vmas in
> + * the rmap code.
> + *
> + * It is the ideal mechanism to handle userspace page faults. Normally
> + * the destination vma will have VM_USERFAULT set with
> + * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
> + * set with madvise(MADV_DONTFORK).
> + *
> + * The thread receiving the page during the userland page fault
> + * (MADV_USERFAULT) will receive the faulting page in the source vma
> + * through the network, storage or any other I/O device (MADV_DONTFORK
> + * in the source vma prevents remap_anon_pages from failing with -EBUSY if
> + * the process forks before remap_anon_pages is called), then it will
> + * call remap_anon_pages to map the page in the faulting address in
> + * the destination vma.
> + *
> + * This syscall works purely via pagetables, so it's the most
> + * efficient way to move physical non shared anonymous pages across
> + * different virtual addresses. Unlike mremap()/mmap()/munmap() it
> + * does not create any new vmas. The mapping in the destination
> + * address is atomic.
> + *
> + * It only works if the vma protection bits are identical from the
> + * source and destination vma.
> + *
> + * It can remap non shared anonymous pages within the same vma too.
> + *
> + * If the source virtual memory range has any unmapped holes, or if
> + * the destination virtual memory range is not a whole unmapped hole,
> + * remap_anon_pages will fail respectively with -ENOENT or
> + * -EEXIST. This provides a very strict behavior to avoid any chance
> + * of memory corruption going unnoticed if there are userland race
> + * conditions. Only one thread should resolve the userland page fault
> + * at any given time for any given faulting address. This means that
> + * if two threads try to both call remap_anon_pages on the same
> + * destination address at the same time, the second thread will get an
> + * explicit error from this syscall.
> + *
> + * The syscall will return "len" if successful. The syscall
> + * however can be interrupted by fatal signals or errors. If
> + * interrupted it will return the number of bytes successfully
> + * remapped before the interruption if any, or the negative error if
> + * none. It will never return zero. Either it will return an error or
> + * an amount of bytes successfully moved. If the retval reports a
> + * "short" remap, the remap_anon_pages syscall should be repeated by
> + * userland with src+retval, dst+retval, len-retval if it wants to know
> + * about the error that interrupted it.
> + *
> + * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> + * errors from materializing if there are holes in the source virtual
> + * range that is being remapped. The holes will be accounted as
> + * successfully remapped in the retval of the syscall. This is mostly
> + * useful to remap hugepage naturally aligned virtual regions without
> + * knowing if there are transparent hugepage in the regions or not,
> + * but preventing the risk of having to split the hugepmd during the
> + * remap.
> + *
> + * If there's any rmap walk that is taking the anon_vma locks without
> + * first obtaining the page lock (for example split_huge_page and
> + * page_referenced_anon), they will have to verify if the
> + * page->mapping has changed after taking the anon_vma lock. If it
> + * changed they should release the lock and retry obtaining a new
> + * anon_vma, because it means the anon_vma was changed by
> + * remap_anon_pages before the lock could be obtained. This is the
> + * only additional complexity added to the rmap code to provide this
> + * anonymous page remapping functionality.
> + */
> +SYSCALL_DEFINE4(remap_anon_pages,
> +               unsigned long, dst_start, unsigned long, src_start,
> +               unsigned long, len, unsigned long, flags)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *src_vma, *dst_vma;
> +       long err = -EINVAL;
> +       pmd_t *src_pmd, *dst_pmd;
> +       pte_t *src_pte, *dst_pte;
> +       spinlock_t *dst_ptl, *src_ptl;
> +       unsigned long src_addr, dst_addr;
> +       int thp_aligned = -1;
> +       long moved = 0;
> +
> +       /*
> +        * Sanitize the syscall parameters:
> +        */
> +       if (src_start & ~PAGE_MASK)
> +               return err;
> +       if (dst_start & ~PAGE_MASK)
> +               return err;
> +       if (len & ~PAGE_MASK)
> +               return err;
> +       if (flags & ~RAP_ALLOW_SRC_HOLES)
> +               return err;
> +
> +       /* Does the address range wrap, or is the span zero-sized? */
> +       if (unlikely(src_start + len <= src_start))
> +               return err;
> +       if (unlikely(dst_start + len <= dst_start))
> +               return err;
> +
> +       down_read(&mm->mmap_sem);
> +
> +       /*
> +        * Make sure the vma is not shared, that the src and dst remap
> +        * ranges are both valid and fully within a single existing
> +        * vma.
> +        */
> +       src_vma = find_vma(mm, src_start);
> +       if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> +               goto out;
> +       if (src_start < src_vma->vm_start ||
> +           src_start + len > src_vma->vm_end)
> +               goto out;
> +
> +       dst_vma = find_vma(mm, dst_start);
> +       if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +               goto out;
> +       if (dst_start < dst_vma->vm_start ||
> +           dst_start + len > dst_vma->vm_end)
> +               goto out;
> +
> +       if (pgprot_val(src_vma->vm_page_prot) !=
> +           pgprot_val(dst_vma->vm_page_prot))
> +               goto out;
> +
> +       /* only allow remapping if both are mlocked or both aren't */
> +       if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
> +               goto out;
> +
> +       /*
> +        * Ensure the dst_vma has an anon_vma or this page
> +        * would get a NULL anon_vma when moved into the
> +        * dst_vma.
> +        */
> +       err = -ENOMEM;
> +       if (unlikely(anon_vma_prepare(dst_vma)))
> +               goto out;
> +
> +       for (src_addr = src_start, dst_addr = dst_start;
> +            src_addr < src_start + len; ) {
> +               spinlock_t *ptl;
> +               pmd_t dst_pmdval;
> +               BUG_ON(dst_addr >= dst_start + len);
> +               src_pmd = mm_find_pmd(mm, src_addr);
> +               if (unlikely(!src_pmd)) {
> +                       if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> +                               err = -ENOENT;
> +                               break;
> +                       } else {
> +                               src_pmd = mm_alloc_pmd(mm, src_addr);
> +                               if (unlikely(!src_pmd)) {
> +                                       err = -ENOMEM;
> +                                       break;
> +                               }
> +                       }
> +               }
> +               dst_pmd = mm_alloc_pmd(mm, dst_addr);
> +               if (unlikely(!dst_pmd)) {
> +                       err = -ENOMEM;
> +                       break;
> +               }
> +
> +               dst_pmdval = pmd_read_atomic(dst_pmd);
> +               /*
> +                * If the dst_pmd is mapped as THP don't
> +                * override it and just be strict.
> +                */
> +               if (unlikely(pmd_trans_huge(dst_pmdval))) {
> +                       err = -EEXIST;
> +                       break;
> +               }
> +               if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
> +                       /*
> +                        * Check if we can move the pmd without
> +                        * splitting it. First check the address
> +                        * alignment to be the same in src/dst.  These
> +                        * checks don't actually need the PT lock but
> +                        * it's good to do it here to optimize this
> +                        * block away at build time if
> +                        * CONFIG_TRANSPARENT_HUGEPAGE is not set.
> +                        */
> +                       if (thp_aligned == -1)
> +                               thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
> +                                              (dst_addr & ~HPAGE_PMD_MASK));
> +                       if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
> +                           !pmd_none(dst_pmdval) ||
> +                           src_start + len - src_addr < HPAGE_PMD_SIZE) {
> +                               spin_unlock(ptl);
> +                               /* Fall through */
> +                               split_huge_page_pmd(src_vma, src_addr,
> +                                                   src_pmd);
> +                       } else {
> +                               BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
> +                               err = remap_anon_pages_huge_pmd(mm,
> +                                                               dst_pmd,
> +                                                               src_pmd,
> +                                                               dst_pmdval,
> +                                                               dst_vma,
> +                                                               src_vma,
> +                                                               dst_addr,
> +                                                               src_addr);
> +                               cond_resched();
> +
> +                               if (!err) {
> +                                       dst_addr += HPAGE_PMD_SIZE;
> +                                       src_addr += HPAGE_PMD_SIZE;
> +                                       moved += HPAGE_PMD_SIZE;
> +                               }
> +
> +                               if ((!err || err == -EAGAIN) &&
> +                                   fatal_signal_pending(current))
> +                                       err = -EINTR;
> +
> +                               if (err && err != -EAGAIN)
> +                                       break;
> +
> +                               continue;
> +                       }
> +               }
> +
> +               if (pmd_none(*src_pmd)) {
> +                       if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> +                               err = -ENOENT;
> +                               break;
> +                       } else {
> +                               if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
> +                                                        src_addr))) {
> +                                       err = -ENOMEM;
> +                                       break;
> +                               }
> +                       }
> +               }
> +
> +               /*
> +                * We held the mmap_sem for reading so MADV_DONTNEED
> +                * can zap transparent huge pages under us, or the
> +                * transparent huge page fault can establish new
> +                * transparent huge pages under us.
> +                */
> +               if (unlikely(pmd_trans_unstable(src_pmd))) {
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               if (unlikely(pmd_none(dst_pmdval)) &&
> +                   unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
> +                                        dst_addr))) {
> +                       err = -ENOMEM;
> +                       break;
> +               }
> +               /* If a huge pmd materialized from under us, fail */
> +               if (unlikely(pmd_trans_huge(*dst_pmd))) {
> +                       err = -EFAULT;
> +                       break;
> +               }
> +
> +               BUG_ON(pmd_none(*dst_pmd));
> +               BUG_ON(pmd_none(*src_pmd));
> +               BUG_ON(pmd_trans_huge(*dst_pmd));
> +               BUG_ON(pmd_trans_huge(*src_pmd));
> +
> +               dst_pte = pte_offset_map(dst_pmd, dst_addr);
> +               src_pte = pte_offset_map(src_pmd, src_addr);
> +               dst_ptl = pte_lockptr(mm, dst_pmd);
> +               src_ptl = pte_lockptr(mm, src_pmd);
> +
> +               err = remap_anon_pages_pte(mm,
> +                                          dst_pte, src_pte, src_pmd,
> +                                          dst_vma, src_vma,
> +                                          dst_addr, src_addr,
> +                                          dst_ptl, src_ptl, flags);
> +
> +               pte_unmap(dst_pte);
> +               pte_unmap(src_pte);
> +               cond_resched();
> +
> +               if (!err) {
> +                       dst_addr += PAGE_SIZE;
> +                       src_addr += PAGE_SIZE;
> +                       moved += PAGE_SIZE;
> +               }
> +
> +               if ((!err || err == -EAGAIN) &&
> +                   fatal_signal_pending(current))
> +                       err = -EINTR;
> +
> +               if (err && err != -EAGAIN)
> +                       break;
> +       }
> +
> +out:
> +       up_read(&mm->mmap_sem);
> +       BUG_ON(moved < 0);
> +       BUG_ON(err > 0);
> +       BUG_ON(!moved && !err);
> +       return moved ? moved : err;
> +}
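
To make the retval and RAP_ALLOW_SRC_HOLES semantics documented above
concrete, here is a minimal userland sketch (not part of the patch). It
assumes the RFC's patched headers provide __NR_remap_anon_pages for the
target architecture, and the RAP_ALLOW_SRC_HOLES value below is only an
assumed placeholder for illustration. On a "short" remap it simply
repeats the call with src+retval, dst+retval, len-retval, as the comment
above prescribes.

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* assumed flag value; the real one comes from the RFC's uapi header */
#ifndef RAP_ALLOW_SRC_HOLES
#define RAP_ALLOW_SRC_HOLES	0x1UL
#endif

static int remap_all(unsigned long dst, unsigned long src,
		     unsigned long len, unsigned long flags)
{
	while (len) {
		/* returns bytes moved (never 0), or -1 with errno set */
		long ret = syscall(__NR_remap_anon_pages,
				   dst, src, len, flags);

		if (ret < 0)
			return -errno;	/* e.g. EEXIST, ENOENT, EBUSY */

		/* short remap: retry to finish or to collect the error */
		dst += ret;
		src += ret;
		len -= ret;
	}
	return 0;
}

A caller resolving userfaults at THP granularity would pass
RAP_ALLOW_SRC_HOLES here so that smaller holes in the hugepage-aligned
source range are skipped instead of failing with -ENOENT.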
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c37ca..e24cd7c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1541,6 +1541,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  }
>
>  /*
> + * The PT lock for src_pmd and the mmap_sem for reading are held by
> + * the caller, but this function must return only after releasing the
> + * src_pmd PT lock. We're guaranteed the src_pmd is a pmd_trans_huge
> + * until the PT lock of the src_pmd is released. Just move the page
> + * from src_pmd to dst_pmd if possible. Return zero if it succeeded
> + * in moving the page, -EAGAIN if it needs to be repeated by the
> + * caller, or other errors in case of failure.
> + */
> +int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> +                             pmd_t *dst_pmd, pmd_t *src_pmd,
> +                             pmd_t dst_pmdval,
> +                             struct vm_area_struct *dst_vma,
> +                             struct vm_area_struct *src_vma,
> +                             unsigned long dst_addr,
> +                             unsigned long src_addr)
> +{
> +       pmd_t _dst_pmd, src_pmdval;
> +       struct page *src_page;
> +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> +       spinlock_t *src_ptl, *dst_ptl;
> +       pgtable_t pgtable;
> +
> +       src_pmdval = *src_pmd;
> +       src_ptl = pmd_lockptr(mm, src_pmd);
> +
> +       BUG_ON(!pmd_trans_huge(src_pmdval));
> +       BUG_ON(pmd_trans_splitting(src_pmdval));
> +       BUG_ON(!pmd_none(dst_pmdval));
> +       BUG_ON(!spin_is_locked(src_ptl));
> +       BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +       src_page = pmd_page(src_pmdval);
> +       BUG_ON(!PageHead(src_page));
> +       BUG_ON(!PageAnon(src_page));
> +       if (unlikely(page_mapcount(src_page) != 1)) {
> +               spin_unlock(src_ptl);
> +               return -EBUSY;
> +       }
> +
> +       get_page(src_page);
> +       spin_unlock(src_ptl);
> +
> +       mmu_notifier_invalidate_range_start(mm, src_addr,
> +                                           src_addr + HPAGE_PMD_SIZE);
> +
> +       /* block all concurrent rmap walks */
> +       lock_page(src_page);
> +
> +       /*
> +        * split_huge_page walks the anon_vma chain without the page
> +        * lock. Serialize against it with the anon_vma lock, the page
> +        * lock is not enough.
> +        */
> +       src_anon_vma = page_get_anon_vma(src_page);
> +       if (!src_anon_vma) {
> +               unlock_page(src_page);
> +               put_page(src_page);
> +               mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                                 src_addr + HPAGE_PMD_SIZE);
> +               return -EAGAIN;
> +       }
> +       anon_vma_lock_write(src_anon_vma);
> +
> +       dst_ptl = pmd_lockptr(mm, dst_pmd);
> +       double_pt_lock(src_ptl, dst_ptl);
> +       if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> +                    !pmd_same(*dst_pmd, dst_pmdval) ||
> +                    page_mapcount(src_page) != 1)) {
> +               double_pt_unlock(src_ptl, dst_ptl);
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +               unlock_page(src_page);
> +               put_page(src_page);
> +               mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                                 src_addr + HPAGE_PMD_SIZE);
> +               return -EAGAIN;
> +       }
> +
> +       BUG_ON(!PageHead(src_page));
> +       BUG_ON(!PageAnon(src_page));
> +       /* the PT lock is enough to keep the page pinned now */
> +       put_page(src_page);
> +
> +       dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> +       ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
> +       ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
> +
> +       if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
> +                     src_pmdval))
> +               BUG();
> +       _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
> +       _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> +       set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
> +
> +       pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> +       pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
> +       double_pt_unlock(src_ptl, dst_ptl);
> +
> +       anon_vma_unlock_write(src_anon_vma);
> +       put_anon_vma(src_anon_vma);
> +
> +       /* unblock rmap walks */
> +       unlock_page(src_page);
> +
> +       mmu_notifier_invalidate_range_end(mm, src_addr,
> +                                         src_addr + HPAGE_PMD_SIZE);
> +       return 0;
> +}
> +
> +/*
>   * Returns 1 if a given pmd maps a stable (not under splitting) thp.
>   * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
>   *
>
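
The rmap-side rule spelled out in the remap_anon_pages comment (recheck
page->mapping after taking the anon_vma lock and retry if the page was
moved) can be sketched as below. This is only an illustrative pattern,
not code from the series: rap_lock_stable_anon_vma() is a made-up name,
and the actual patches fold the recheck into the existing walkers such
as split_huge_page and page_referenced_anon.

#include <linux/mm.h>
#include <linux/rmap.h>

static struct anon_vma *rap_lock_stable_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;

again:
	anon_vma = page_get_anon_vma(page);	/* takes a reference */
	if (!anon_vma)
		return NULL;			/* not anon or fully unmapped */
	anon_vma_lock_read(anon_vma);
	/*
	 * remap_anon_pages() may have switched page->mapping to the
	 * destination vma's anon_vma before we got the lock: recheck
	 * and, if so, retry against the new anon_vma.
	 */
	if ((void *)ACCESS_ONCE(page->mapping) !=
	    (void *)anon_vma + PAGE_MAPPING_ANON) {
		anon_vma_unlock_read(anon_vma);
		put_anon_vma(anon_vma);
		goto again;
	}
	/* caller must anon_vma_unlock_read() and put_anon_vma() */
	return anon_vma;
}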



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2014-07-04 11:31 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-02 16:50 [PATCH 00/10] RFC: userfault Andrea Arcangeli
2014-07-02 16:50 ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50 ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 01/10] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 02/10] mm: madvise MADV_USERFAULT Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 03/10] mm: PT lock: export double_pt_lock/unlock Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 04/10] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 05/10] mm: swp_entry_swapcount Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 06/10] mm: sys_remap_anon_pages Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-04 11:30   ` Michael Kerrisk
2014-07-04 11:30     ` [Qemu-devel] " Michael Kerrisk
2014-07-04 11:30     ` Michael Kerrisk
2014-07-04 11:30     ` Michael Kerrisk
2014-07-02 16:50 ` [PATCH 07/10] waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-03  1:56   ` Andy Lutomirski
2014-07-03  1:56     ` [Qemu-devel] " Andy Lutomirski
2014-07-03  1:56     ` Andy Lutomirski
2014-07-03 13:19     ` Andrea Arcangeli
2014-07-03 13:19       ` [Qemu-devel] " Andrea Arcangeli
2014-07-03 13:19       ` Andrea Arcangeli
2014-07-03 13:19       ` Andrea Arcangeli
2014-07-03 13:19       ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 09/10] userfaultfd: make userfaultfd_write non blocking Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 10/10] userfaultfd: use VM_FAULT_RETRY in handle_userfault() Andrea Arcangeli
2014-07-02 16:50   ` [Qemu-devel] " Andrea Arcangeli
2014-07-02 16:50   ` Andrea Arcangeli
2014-07-03  1:51 ` [PATCH 00/10] RFC: userfault Andy Lutomirski
2014-07-03  1:51   ` [Qemu-devel] " Andy Lutomirski
2014-07-03  1:51   ` Andy Lutomirski
2014-07-03  1:51   ` Andy Lutomirski
2014-07-03 13:45 ` [Qemu-devel] " Christopher Covington
2014-07-03 13:45   ` Christopher Covington
2014-07-03 13:45   ` Christopher Covington
2014-07-03 14:08   ` Andrea Arcangeli
2014-07-03 14:08     ` Andrea Arcangeli
2014-07-03 14:08     ` Andrea Arcangeli
2014-07-03 15:41 ` Dave Hansen
2014-07-03 15:41   ` [Qemu-devel] " Dave Hansen
2014-07-03 15:41   ` Dave Hansen
2014-07-03 15:41   ` Dave Hansen
